In this thesis, we build a general-purpose solution for aligning a voice recording with its associated transcription. The solution consists of three components: sound segmentation, speech recognition, and text alignment. The thesis focuses on the use of different acoustic models for speech recognition and on different methods of decoding the model outputs. We also propose an extension of an existing text alignment algorithm that provides an alignment for each word of the original text.
The system is evaluated on non-dialectal speech, dialectal speech, and unaccompanied dialectal singing, using three metrics based on absolute alignment error. The speech alignment proves to be of good quality, comparable to that of similar systems for other languages.