Generiranje slovenskih besednih oblik s pomočjo strojnega učenja

REJC, ROK

REJC, ROK (Author), Robnik Šikonja, Marko (Mentor), Krek, Simon (Co-mentor)

MD5: 08FD3B00CFC8A198F76C9B4D26531CB7
PID: 20.500.12556/rul/59e5b66d-19fb-4762-bbd9-c5df9f7d569a

Abstract

Sloleks is a lexicon of word forms: a structured database of Slovene words and their inflected forms, together with each word's part of speech and morphosyntactic properties. Because language changes constantly and the need for machine processing keeps growing, Sloleks has to be updated continually. The aim of the thesis was to build a tool that uses machine learning to extend the Sloleks lexicon of word forms automatically. We focused on nouns, but the tool can also be applied to other parts of speech, such as verbs or adjectives. We approached the problem by grouping nouns with similar morphosyntactic properties, using Gower distance and clustering around medoids. The resulting groups represent morphosyntactic paradigms, and on their basis we built a naive Bayes classifier that predicts these paradigms for new words. For nouns from the ccGigafida corpus that have missing word forms, we used the classifier to assign a group and then filled in the missing forms according to the typical representatives of that group. We evaluated the approach both qualitatively and quantitatively.
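
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the three features per noun (gender, nominative singular ending, genitive singular ending) are hypothetical, the function names are made up for this example, and the clustering loop is a simple alternating k-medoids variant, not the full PAM swap procedure.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two equal-length records.

    numeric_ranges maps the index of each numeric feature to its range
    (max - min over the data set); every other feature is treated as
    categorical. Features that are None in either record are skipped.
    """
    total, count = 0.0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            continue
        if i in numeric_ranges:
            rng = numeric_ranges[i]
            total += abs(x - y) / rng if rng else 0.0
        else:
            total += 0.0 if x == y else 1.0
        count += 1
    return total / count if count else 0.0


def k_medoids(points, k, dist, max_iter=100):
    """Alternating k-medoids: assign points to the nearest medoid, then
    move each medoid to the cluster member with the lowest total distance."""

    def assign(medoids):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        return clusters

    medoids = list(range(k))  # deterministic start: the first k points
    for _ in range(max_iter):
        clusters = assign(medoids)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid that attracted no points
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy records (hypothetical features): gender, nominative sg. ending,
# genitive sg. ending.
toy_nouns = [
    ("masculine", "", "a"),   # like korak, koraka
    ("feminine", "a", "e"),   # like lipa, lipe
    ("masculine", "", "a"),   # like stol, stola
    ("feminine", "a", "e"),   # like miza, mize
    ("neuter", "o", "a"),     # like mesto, mesta
]

medoids, clusters = k_medoids(
    toy_nouns, k=2, dist=lambda a, b: gower_distance(a, b, {})
)
```

Each medoid is an actual noun from the data, which matches the idea of a "typical representative" of a paradigm group.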

Language:	Slovenian
Keywords:	lexicon of word forms, Sloleks, morphosyntactic paradigms, machine learning, naive Bayes classifier, clustering around medoids
Work type:	Bachelor thesis/paper
Organization:	FRI - Faculty of Computer and Information Science
Year:	2017
PID:	20.500.12556/RUL-91151
Publication date in RUL:	22.03.2017

Secondary language

Language:	English
Title:	Generating Slovene word forms using machine learning
Abstract:	Sloleks is a lexicon of Slovene word forms. In a structured database, it stores Slovene words and all their word forms, together with their word class and morphosyntactic properties. Because language changes constantly and the need for machine processing keeps growing, Sloleks must be updated continually. The aim of the thesis was to create a machine learning tool that allows automated extension of the Sloleks lexicon of Slovene word forms. We focused mainly on nouns, but the tool can also be used for other word classes, such as verbs or adjectives. We tackled the problem by clustering nouns into groups with similar morphosyntactic properties, using clustering around medoids. Based on the obtained groups, which represent morphosyntactic paradigms, we built a model with a naive Bayes classifier that predicts these paradigms for new words. For nouns from the ccGigafida corpus that have missing word forms, we predicted groups with the built classifier and filled in the missing word forms using typical representatives of the groups. The approach was evaluated qualitatively and quantitatively.
Keywords:	lexicon of word forms, Sloleks, morphosyntactic paradigms, machine learning, naive Bayes classifier, Partitioning Around Medoids
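
The paradigm prediction step mentioned in the abstract can be sketched as a small naive Bayes classifier. This is an illustration under stated assumptions, not the thesis model: the abstract does not say which features were used, so the suffix features here (the last one to three characters of the lemma) are hypothetical, as are the class name, the paradigm labels, and the tiny training set.

```python
import math
from collections import Counter, defaultdict


def suffix_features(lemma, max_len=3):
    """Hypothetical features: the last 1..max_len characters of the lemma."""
    return [(n, lemma[-n:]) for n in range(1, max_len + 1) if len(lemma) >= n]


class NaiveBayesParadigms:
    """Multinomial naive Bayes over suffix features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()

    def fit(self, lemmas, labels):
        for lemma, label in zip(lemmas, labels):
            self.class_counts[label] += 1
            for feat in suffix_features(lemma):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)
        return self

    def predict(self, lemma):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = (
                sum(self.feature_counts[label].values())
                + self.alpha * len(self.vocabulary)
            )
            for feat in suffix_features(lemma):
                if feat not in self.vocabulary:
                    continue  # ignore features never seen in training
                num = self.feature_counts[label][feat] + self.alpha
                score += math.log(num / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data; each label is named after a hypothetical medoid noun.
train_lemmas = ["lipa", "miza", "hiša", "korak", "stol", "most"]
train_labels = ["paradigm_lipa"] * 3 + ["paradigm_korak"] * 3

model = NaiveBayesParadigms().fit(train_lemmas, train_labels)
```

With a predicted paradigm in hand, missing forms of a new noun could then be filled in by copying the endings of that paradigm's representative, which is the step the abstract describes last.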
