In this thesis, I compare three widely used speech-to-text (STT) systems for
Slovene: Google Cloud Speech-to-Text, Microsoft Azure Speech Service, and the
open-source OpenAI Whisper. For an objective assessment, I constructed a balanced test set of speech recordings from the Artur 1.0 corpus, with an emphasis
on spontaneous monologic speech (Artur-N) as well as read and studio recordings (Artur-B). The selection comprises approximately one hour of speech from
15–20 speakers with diverse demographic characteristics. I evaluated the systems
in terms of accuracy (word error rate, WER), time efficiency (transcription
time), and practical aspects (ease of use). I ran Whisper locally on a laptop
CPU with a preloaded model, while the Google and Azure services were accessed
via their official application programming interfaces (APIs). The entire pipeline,
from preparing the speech data to measuring and exporting the evaluation
results, was automated with scripts, which makes the evaluation reproducible
and easy to scale.
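
The accuracy metric named above, WER, is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used in this thesis. The function name `word_error_rate` and the text normalisation (lowercasing and whitespace tokenisation) are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalisation: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped by the hypothetical system.
    print(word_error_rate("danes je lep dan", "danes lep dan"))
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference. The choice of normalisation (case, punctuation, numerals) affects the score, so it should be applied identically to all compared systems.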