With the rise of hate speech on social networks, the need for moderation has grown, but manual moderation is practically impossible given the sheer volume of content, so neural networks are now the dominant approach to detecting hate speech. Training neural networks requires large amounts of labeled data, yet publicly available datasets are rarely labeled in detail, especially for languages with relatively few speakers. For Slovenian, only a handful of publicly available datasets contain fine-grained labels, so we investigate how well a dataset composed of several different sets performs. We fine-tune a BERT model on the composite sets with generalized common labels and compare our results with those reported by the authors of the original datasets. We conclude that the results we achieve are satisfactory, and we propose improvements that would allow composite sets to match the performance of sets built for a specific task.
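To illustrate the described setup, the following is a minimal sketch of merging several labeled sets under a generalized common label scheme and fine-tuning a BERT classifier with the Hugging Face transformers library. The file names, column names, label mapping, and the multilingual BERT checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: merge several hate-speech datasets onto a shared binary label scheme
# and fine-tune a BERT classifier. All concrete names below are assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed checkpoint

# Each source set keeps its own label scheme; project everything onto a
# generalized common scheme before merging (this mapping is hypothetical).
COMMON_LABELS = {"acceptable": 0, "offensive": 1, "violent": 1, "hate": 1}


def load_composite(paths):
    """Concatenate several labeled sets after mapping labels to the common scheme."""
    frames = []
    for path in paths:
        df = pd.read_csv(path)  # expects columns: text, label
        df["label"] = df["label"].map(COMMON_LABELS)
        frames.append(df[["text", "label"]].dropna().astype({"label": int}))
    return pd.concat(frames, ignore_index=True)


df = load_composite(["dataset_a.csv", "dataset_b.csv"])  # hypothetical file names
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)


tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="bert-hate-speech",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```

In practice, the generalized label mapping is the critical design choice: collapsing heterogeneous annotation schemes into common classes is what makes the composite training set usable, at the cost of the finer distinctions the original sets provide.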