With the growing number of smart devices, monitoring users' interpersonal interactions for research purposes and for adapting smart apps is becoming increasingly easy. Recording and analysing conversations can yield a lot of useful information. When analysing voice recordings, the details of interest include the presence of speech, the number of speakers, and when and how much each participant spoke.
Voice activity detection and speaker diarization are used for this purpose. In this thesis, we take existing tools for both tasks and adapt them to our needs. The voice activity detector is based on logistic regression. Voice activity detection is the first component of a general speaker diarization model; it is followed by the detection of speaker segment boundaries and the merging of segments that belong to the same speaker. For diarization, we use widely used methods based on the Bayesian information criterion. We used existing freely available datasets for development and testing, and also prepared a small collection of our own recordings.
The voice activity detector that we developed achieves an average accuracy of almost 90 % and can operate in real time. The speaker diarization results on freely available datasets are comparable to those of similar procedures from the literature. On our own dataset, which best represents the type of recordings for which we developed the procedure, our method was the only one of those we tested that returned useful results.
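To illustrate the kind of boundary detection based on the Bayesian information criterion mentioned above, the following is a minimal sketch and not the procedure developed in the thesis. It assumes a one-dimensional feature per frame modelled by a single Gaussian, whereas practical systems typically use multi-dimensional features such as MFCCs. The function names, the `margin` and `penalty_weight` parameters, and the toy data are all hypothetical.

```python
import math
import random


def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(xs, split, penalty_weight=1.0):
    """Delta-BIC for splitting a 1-D feature sequence at index `split`.

    Positive values favour two separate Gaussians, i.e. a speaker change.
    """
    n = len(xs)
    left, right = xs[:split], xs[split:]
    gain = 0.5 * (n * math.log(variance(xs))
                  - len(left) * math.log(variance(left))
                  - len(right) * math.log(variance(right)))
    # For dimension d = 1 the complexity term
    # 0.5 * (d + d * (d + 1) / 2) * log(n) reduces to log(n).
    return gain - penalty_weight * math.log(n)


def best_change_point(xs, margin=10, penalty_weight=1.0):
    """Return (index, delta_bic) of the best split, or None if none is positive."""
    candidates = [(delta_bic(xs, i, penalty_weight), i)
                  for i in range(margin, len(xs) - margin)]
    if not candidates:
        return None
    score, index = max(candidates)
    return (index, score) if score > 0 else None


# Toy data, fabricated for illustration only: two "speakers" whose
# frame-level feature has a different mean (hypothetical values).
rng = random.Random(0)
frames = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(4.0, 1.0) for _ in range(200)])
change = best_change_point(frames)
```

In this sketch, `penalty_weight` trades missed boundaries against false alarms: larger values require stronger evidence before a boundary is accepted. The same criterion can in principle be reused to decide whether two segments should be merged, by treating a non-positive value as evidence that one Gaussian explains both segments.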