Comparative analysis of the Ukrainian speech recognition models Whisper and Vosk
UDC: 004.8
Publication Language: Ukrainian
Stuc. intelekt. 2025; 30(3):94-102
Abstract: This paper presents a comparative analysis of two approaches to automatic speech recognition (ASR), the transformer-based Whisper architecture and the Kaldi-based Vosk toolkit, with a focus on Ukrainian. The study is based on experiments with two benchmark setups: faster-whisper on GPU and local Vosk models on CPU. Evaluation used standard metrics: Word Error Rate (WER) for accuracy, and Real-Time Factor (RTF) and average processing time for performance. Before comparison, the texts were normalized (punctuation removal, conversion to lower case, removal of extra spaces) to ensure correct WER calculation. The results show that, with hardware acceleration, Whisper achieves lower WER and significantly better throughput than Vosk on CPU, especially for the medium and large-v3 models. Vosk, in turn, remains competitive in resource-constrained scenarios: it consumes less memory and is stable and deterministic across repeated runs, which makes it suitable for embedded and offline solutions. The paper also discusses the limitations of the study, including the small test dataset and the dependence of the results on hardware parameters and decoding settings. Based on these findings, practical recommendations for choosing an architecture are formulated: for services where GPUs are available and accuracy is critical, a medium or large Whisper model is recommended; for autonomous systems with limited resources, Vosk. Further research requires expanded datasets, adaptation of models to local vocabulary, and a full analysis of hybrid deployment options.
Keywords: artificial intelligence, information systems, speech recognition, language models, Ukrainian language, information technologies, Vosk, Whisper, WER, RTF, CUDA
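The evaluation steps named in the abstract (text normalization, WER, RTF) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `normalize`, `wer`, and `rtf` are hypothetical, only the standard library is used, and treating every Unicode punctuation character (including the apostrophe inside Ukrainian words) as removable is an assumption about how the normalization was done.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lower-case, drop punctuation, and collapse extra whitespace."""
    text = text.lower()
    # Assumption: any character in a Unicode punctuation category (P*) is
    # removed, which also strips the apostrophe inside Ukrainian words.
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

Under these assumptions, a hypothesis that differs from the reference only in case and punctuation scores a WER of 0, and an RTF below 1 means the recognizer runs faster than real time.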