Iskanje novih pomenov besed v slovenščini z velikimi jezikovnimi modeli (Word sense induction in Slovene using large language models)

BULIĆ, BLAŽ

BULIĆ, BLAŽ (Author), Robnik Šikonja, Marko (Mentor)

PDF - Presentation file (1,06 MB)
MD5: 64196BF65941847F538AAFFA1225DAE7

Abstract

In this thesis, we developed a procedure for discovering new word senses. The list of target words was extracted from a word-sense disambiguation dataset. Sentences containing each target word were obtained from the news database of the Event Registry service. We represented the words as vectors using the multilingual BERT-Base (cased) and SloBERTa models and clustered the vectors in several ways. We compared the results with the data from the disambiguation dataset and manually checked several words with known semantic shifts. The results are not promising. We believe the main reason is that the text database is unsuitable.

Language:	Slovenian
Keywords:	word meanings, word vector embeddings, clustering, BERT model, natural language processing, word sense induction
Work type:	Bachelor thesis/paper
Typology:	2.11 - Undergraduate Thesis
Organization:	FRI - Faculty of Computer and Information Science; FMF - Faculty of Mathematics and Physics
Year:	2023
PID:	20.500.12556/RUL-150264
COBISS.SI-ID:	168959747
Publication date in RUL:	15.09.2023

Secondary language

Language:	English
Title:	Word sense induction in Slovene using large language models
Abstract:	In the thesis, we developed a procedure for discovering new word senses. We extracted the list of target words from a word-sense disambiguation dataset. Sentences containing each target word were obtained from the news database of the Event Registry service. We represented the words as vectors using the multilingual BERT-Base (cased) and SloBERTa models and clustered them in various ways. We compared the results with the data from the disambiguation dataset and manually checked some words with known semantic shifts. The results are not promising. We believe the main reason is an unsuitable text database.
Keywords:	word meanings, word vector embeddings, clustering, BERT, natural language processing, word sense induction
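
The abstract describes clustering vector representations of a target word's occurrences and comparing the clusters with sense labels from a disambiguation dataset. The following is a minimal sketch of that idea only, not the thesis implementation. It rests on several assumptions: synthetic random vectors stand in for the BERT or SloBERTa embeddings, plain k-means stands in for the unspecified clustering methods, and cluster purity is a hypothetical choice of comparison measure. All function and variable names are illustrative.

```python
import numpy as np


def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of `vectors`."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every center.
        dists = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def purity(cluster_labels, sense_labels):
    """Fraction of occurrences whose sense matches their cluster's majority sense."""
    total = 0
    for c in np.unique(cluster_labels):
        senses = sense_labels[cluster_labels == c]
        total += np.bincount(senses).max()
    return total / len(sense_labels)


# Assumption: synthetic stand-in for contextual embeddings of one target word,
# with two well-separated groups playing the role of two senses.
rng = np.random.default_rng(42)
sense_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
sense_b = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
embeddings = np.vstack([sense_a, sense_b])
gold = np.array([0] * 20 + [1] * 20)  # hypothetical sense labels

clusters = kmeans(embeddings, k=2)
score = purity(clusters, gold)
print(f"purity = {score:.2f}")
```

With real data, `embeddings` would hold one contextual vector per sentence containing the target word, and a cluster with no counterpart among the known senses would be a candidate for a new meaning.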
