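Automatic corpus construction and relation extraction for Slovene: master's thesis (original title: Avtomatska gradnja korpusa in ekstrakcija relacij v slovenščini : magistrsko delo)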

Štravs, Miha

Repository of the University of Ljubljana

Details

Štravs, Miha (Author), Žitnik, Slavko (Mentor)

PDF - Presentation file (1.13 MB)
MD5: 2ACA8F4F81D54292B94C155BBA2B785B

Abstract

Finding relations between entities in text is a subfield of natural language processing. In relation extraction, given the sentence "Ljubljana is the capital of Slovenia", we want to discover that the relation capital holds between the entities Ljubljana and Slovenia. In this thesis, we first reviewed methods for training models that predict relations. We then selected three methods that take different approaches to training such models: a method based on a recurrent neural network with long short-term memory, a method based on BERT embeddings, and the RECON method, which uses a graph attention neural network. To train the models, we used a Slovenian corpus that we generated semi-automatically from the text of the Slovenian Wikipedia. We then tested the trained models on a test corpus of Slovenian Wikipedia text and on a test corpus of articles from 24ur.com. On the Slovenian Wikipedia test corpus, all three methods achieved high recall and precision, with RECON performing best. The methods achieved much worse results on the test set of 24ur.com articles, where the BERT-embedding method performed best when it used CroSloEngual embeddings.

Language:	Slovenian
Keywords:	relation extraction, information extraction, deep learning, graph attention networks, BERT, LSTM
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FMF - Faculty of Mathematics and Physics; FRI - Faculty of Computer and Information Science
Year:	2022
PID:	20.500.12556/RUL-138295
UDC:	004
COBISS.SI-ID:	115087619
Publication date in RUL:	14.07.2022
Views:	1790
Downloads:	223

Secondary language

Language:	English
Title:	Automatic corpus construction and relation extraction for Slovene
Abstract:
Finding relations between entities in text is an area of natural language processing. Given the sentence "Ljubljana is the capital of Slovenia", we want to find the relation capital between the entities Ljubljana and Slovenia. We first review the methods used to train models that predict relations. We then choose three methods with different approaches: a long short-term memory neural network, a method that uses BERT encoder representations, and RECON, which uses graph attention networks. To train the models, we use a Slovenian corpus generated semi-automatically from the text of the Slovenian Wikipedia. We test the models on a test corpus from the Slovenian Wikipedia and on a test corpus of articles from 24ur.com. All three methods achieve high recall and precision on the Slovenian Wikipedia test corpus, with RECON performing best. Results are worse on the test set of 24ur.com articles, where the method using CroSloEngual BERT encoder representations achieves the best results.
Keywords:	relation extraction, information extraction, deep learning, graph attention networks, BERT, LSTM
