Uvrščanje sovražnega govora v slovenskem in angleškem jeziku

PIRNAT, NIK

PIRNAT, NIK (Author), Žitnik, Slavko (Mentor)

PDF - Presentation file (474,40 KB)
MD5: D769D4D6FAF02B2509D7DA54FDB85D8A

Abstract

As hate speech on social networks has increased, so has the need for moderation, but the sheer volume of content makes manual moderation practically impossible, so hate speech is now mostly detected with neural networks. Training neural networks requires a large amount of labeled data, yet publicly available datasets are rarely labeled in detail, particularly for languages with relatively few speakers. Few publicly available datasets with more detailed labels exist for Slovenian, so we test how a dataset assembled from several different datasets performs. We train the BERT neural network on the composite datasets with generalized common labels and compare our results with those achieved by the authors of the original datasets. We find that our results are satisfactory and propose improvements that would make it possible to achieve results on composite datasets as good as those on datasets built for a specific task.

Language:	Slovenian
Keywords:	multiclass classification, text processing, hate speech
Work type:	Bachelor thesis/paper
Typology:	2.11 - Undergraduate Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2022
PID:	20.500.12556/RUL-140537
COBISS.SI-ID:	123820291
Publication date in RUL:	15.09.2022
Views:	493
Downloads:	68

Secondary language

Language:	English
Title:	Hate speech classification for Slovene and English language
Abstract:	With the rise of hate speech on social networks, the need for moderation has grown as well, but the large volume of content makes manual moderation practically impossible, so neural networks are now mostly used to detect hate speech. Training neural networks requires a large amount of labeled data, but publicly available datasets are rarely labeled in detail, especially for languages with relatively few speakers. For Slovenian, few publicly available datasets contain more detailed labels, so we test how a dataset composed of several different datasets performs. We train the BERT neural network on the composite datasets with generalized common labels and compare our results with those obtained by the authors of the original datasets. We conclude that the results we achieve are satisfactory, and we suggest improvements that would allow composite datasets to yield results as good as those of datasets built for a specific task.
Keywords:	multiclass classification, text processing, hate speech
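
The abstract describes merging several datasets under generalized common labels before training. The following is a minimal sketch of what such a label-harmonization step could look like. It is not taken from the thesis: the dataset names, the original label names, the three common labels, and the helper functions `harmonize` and `build_composite` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical mapping from each source dataset's own labels to a shared,
# coarser label set. None of these names come from the thesis.
LABEL_MAPS = {
    "dataset_a": {"acceptable": "acceptable", "offensive": "offensive", "violent": "hateful"},
    "dataset_b": {"none": "acceptable", "insult": "offensive", "hate": "hateful"},
}


def harmonize(dataset_name, examples):
    """Convert (text, original_label) pairs to (text, common_label) pairs."""
    mapping = LABEL_MAPS[dataset_name]
    harmonized = []
    for text, label in examples:
        if label not in mapping:
            # Fail loudly instead of silently dropping or mislabeling an example.
            raise KeyError(f"No common label for {label!r} in {dataset_name}")
        harmonized.append((text, mapping[label]))
    return harmonized


def build_composite(datasets):
    """Concatenate all source datasets after mapping them to the common labels."""
    composite = []
    for name, examples in datasets.items():
        composite.extend(harmonize(name, examples))
    return composite


# Toy placeholder data standing in for real annotated comments.
datasets = {
    "dataset_a": [("example text 1", "acceptable"), ("example text 2", "violent")],
    "dataset_b": [("example text 3", "insult"), ("example text 4", "hate")],
}

composite = build_composite(datasets)
label_counts = Counter(label for _, label in composite)
print(label_counts)
```

Inspecting the label counts of the composite set is a cheap check for class imbalance introduced by the merge, before the data is passed to a classifier such as BERT.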
