Slovenia lacks sufficient training data for speech models, which hinders further development of the field. One solution is a tool that retrieves audio material automatically. This thesis describes why we chose Scrapy, a web crawling framework written in Python, and how we developed a crawler with it. A web crawler is a program or script that automatically browses the web and stores its content. The goal of our crawler is to find audio and video resources in the Slovenian language. During development, we also paid attention to compliance with legal and ethical guidelines. In addition, we developed a script that searches for transcriptions of the recordings.
We used the descriptive method to present existing web crawling tools. We reviewed the relevant legislation, focusing on the Copyright and Related Rights Act and the General Data Protection Regulation. We also examined the literature on ethical guidelines for web crawling, methods of embedding audio and video content in web pages, and methods of measuring similarity between texts.
The final tests showed that the web crawler can successfully retrieve large quantities of audio and video resources. Of the retrieved recordings, 63.6% contained clear Slovenian speech. The script for finding transcriptions also worked: it found transcriptions with a similarity score greater than 0.9 for 16 recordings. We have thus demonstrated that web crawlers can automate the retrieval of audio recordings. This thesis contributes to the advancement of Slovenian speech technologies by showing how to collect large quantities of audio material more quickly and easily. It also outlines the conditions for doing so legally and in a server-friendly manner.
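To illustrate what a similarity threshold of 0.9 can mean in practice, the following is a minimal sketch of scoring a candidate transcription against a reference text. It is an assumption for illustration only: the abstract does not state which similarity measure the thesis uses, so Python's standard `difflib.SequenceMatcher` stands in here, and the names `normalize`, `similarity`, and `is_match` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not affect the score."""
    return " ".join(text.lower().split())


def similarity(candidate: str, reference: str) -> float:
    """Return a score in [0, 1]; 1.0 means the normalized texts are identical.

    Assumption: SequenceMatcher is a stand-in measure, not necessarily
    the one used in the thesis.
    """
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()


def is_match(candidate: str, reference: str, threshold: float = 0.9) -> bool:
    """Accept a candidate transcription only if its score exceeds the threshold."""
    return similarity(candidate, reference) > threshold
```

With this sketch, a candidate text found on a web page would count as a transcription of a recording only when `is_match` returns `True` for the pair of texts being compared.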