Fillers are words, phrases, or sounds that do not contribute to the meaning of speech and are often distracting.
This thesis addresses the problem of automatic filler removal in Slovenian speech.
The approach builds on modern automatic speech recognition (ASR) models based on neural network architectures. These end-to-end models learn to recognize speech through deep learning on large amounts of audio recordings paired with transcriptions. Depending on the training data and model architecture, some models also provide word-level timestamps. This capability is already used for other languages to detect and then remove fillers from speech. However, no such tool exists yet for Slovenian.
In this thesis, we developed software tools that make it easy to use and compare different ASR models for detecting and removing fillers in Slovenian audio recordings.
We focused primarily on the accuracy of filler detection, while also evaluating the correctness of the timestamps and the subjective quality of the cleaned audio.
The results show that ASR models, with appropriate adaptation, offer a promising solution for automatic filler removal in Slovenian speech, while the developed tools open possibilities for further research and improvements in the field of speech technology.
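
To illustrate how word-level timestamps can drive filler removal, the following minimal sketch computes which spans of a recording would be kept once filler words are cut out. It is not the implementation developed in the thesis: the `Word` structure, the `keep_intervals` function, the filler list, and the sample transcript are all hypothetical and chosen only for illustration. It assumes an ASR model has already produced words with start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its time span (hypothetical ASR output format)."""
    text: str
    start: float  # seconds
    end: float    # seconds


# Illustrative filler list; an assumption, not the inventory used in the thesis.
FILLERS = {"eee", "emm", "hmm"}


def keep_intervals(words, duration, fillers=FILLERS):
    """Return the (start, end) spans of audio that remain after cutting fillers."""
    kept = []
    cursor = 0.0  # start of the current span that will be kept
    for w in sorted(words, key=lambda w: w.start):
        if w.text.lower() in fillers:
            # Close the kept span just before the filler begins.
            if w.start > cursor:
                kept.append((cursor, w.start))
            # Resume keeping audio once the filler ends.
            cursor = max(cursor, w.end)
    # Keep whatever follows the last filler.
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


# Made-up example transcript of a 3-second recording.
words = [
    Word("danes", 0.0, 0.5),
    Word("eee", 0.5, 1.0),
    Word("govorimo", 1.0, 1.75),
    Word("hmm", 2.0, 2.5),
]
print(keep_intervals(words, duration=3.0))
# → [(0.0, 0.5), (1.0, 2.0), (2.5, 3.0)]
```

The resulting intervals could then be passed to any audio library to concatenate the kept segments. Because the cut points come directly from the model's timestamps, the quality of the cleaned audio depends on how accurate those timestamps are.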