Comparative analysis of Ukrainian speech recognition models Whisper and Vosk

https://doi.org/10.15407/jai2025.03.094

Luts V.¹, Bezverkhyi O.¹

¹ National Transport University

tibet.septim@gmail.com; o_bezver@ukr.net

https://orcid.org/0009-0001-2948-6935 https://orcid.org/0000-0002-0834-6335

UDC: 004.8
Publication Language: Ukrainian
Stuc. intelekt. 2025; 30(3):94-102

Abstract: This paper presents a comparative analysis of two approaches to automatic speech recognition (ASR), the transformer-based Whisper architecture and the Kaldi-based Vosk toolkit, with a focus on Ukrainian. The study is based on experiments with two benchmark setups: faster-whisper on GPU and local Vosk models on CPU. Accuracy was measured with Word Error Rate (WER), and performance with Real-Time Factor (RTF) and average processing time. Before comparison, the reference and recognized texts were normalized (punctuation removed, text converted to lower case, extra spaces removed) so that WER is computed correctly. The results show that, with hardware acceleration, Whisper achieves lower WER and significantly better throughput than Vosk on CPU, especially for the medium and large-v3 models. Vosk, in contrast, remains competitive in resource-constrained scenarios: it consumes less memory and is stable and deterministic across repeated runs, which makes it suitable for embedded and offline solutions. The paper also discusses the limitations of the study, including the small test dataset and the dependence of the results on hardware parameters and decoding settings. Based on these findings, practical recommendations for choosing an architecture are formulated: for services where GPUs are available and accuracy is critical, a medium or large Whisper model is recommended; for autonomous systems with limited resources, Vosk is recommended. Further research should include larger datasets, adaptation of the models to local vocabulary, and a full analysis of hybrid deployment options.
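The normalization step and the two metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (normalize, word_error_rate, real_time_factor) are hypothetical, WER is computed with a plain word-level edit distance rather than a library, and keeping the apostrophe inside words is an assumption that the abstract does not specify.

```python
import re
import unicodedata

# Assumption (not stated in the abstract): apostrophes inside words are kept,
# because they are part of Ukrainian spelling.
KEPT_MARKS = {"'", "\u2019", "\u02bc"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse extra spaces."""
    text = text.lower()
    chars = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        chars.append(" " if is_punct and ch not in KEPT_MARKS else ch)
    return re.sub(r"\s+", " ", "".join(chars)).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative strings and timings only, not data from the study.
    print(word_error_rate("Добрий день, Україно!", "добрий день україно"))
    print(real_time_factor(processing_seconds=2.0, audio_seconds=10.0))
```

Applying the same normalization to the reference and the hypothesis keeps differences in punctuation and letter case from being counted as word errors, which is the reason the abstract gives for normalizing before computing WER.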

Keywords: artificial intelligence, information systems, speech recognition, language models, Ukrainian language, information technologies, Vosk, Whisper, WER, RTF, CUDA


Artificial intelligence

Scientific journal
