
Generating Slovene word forms using machine learning (original title: Generiranje slovenskih besednih oblik s pomočjo strojnega učenja)
Rejc, Rok (Author), Robnik Šikonja, Marko (Mentor), Krek, Simon (Co-mentor)


Abstract
Sloleks is a lexicon of word forms: a structured database of Slovene words and their inflected forms, together with each word's part of speech and morphosyntactic properties. Because the language changes constantly and the need for machine processing keeps growing, Sloleks has to be updated continually. The aim of the thesis was to build a tool that uses machine learning to extend the Sloleks lexicon automatically. The work focuses on nouns, but the tool can also be applied to other parts of speech, such as verbs or adjectives. Nouns were first grouped into clusters with similar morphosyntactic properties, using Gower distance and clustering around medoids. The resulting clusters represent morphosyntactic paradigms, and a naive Bayes classifier was trained on them to predict the paradigm of new words. Nouns from the ccGigafida corpus that lack some word forms were assigned a cluster by this classifier, and their missing forms were filled in from the typical representatives of that cluster. The approach was evaluated both qualitatively and quantitatively.
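The clustering step named in the abstract (Gower distance combined with clustering around medoids) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the feature tuple (gender, genitive ending, lemma length), the helper names `gower_distance`, `feature_ranges` and `k_medoids`, and the deterministic initialisation are not taken from the thesis. The loop is a simple alternating k-medoids variant, not the full PAM swap search.

```python
from typing import List, Sequence, Tuple, Union

Value = Union[str, float]


def feature_ranges(rows: Sequence[Sequence[Value]]) -> List[float]:
    """Range (max - min) of each numeric column; 0.0 for categorical columns."""
    ranges = []
    for column in zip(*rows):
        if isinstance(column[0], str):
            ranges.append(0.0)
        else:
            ranges.append(max(column) - min(column))
    return ranges


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[float]) -> float:
    """Gower distance: mean of per-feature dissimilarities in [0, 1]."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if isinstance(x, str):
            total += 0.0 if x == y else 1.0       # categorical: simple mismatch
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric: range-scaled
    return total / len(a)


def k_medoids(rows: Sequence[Sequence[Value]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternating k-medoids (assumption: simplified, not the PAM swap phase)."""
    ranges = feature_ranges(rows)
    n = len(rows)
    dist = [[gower_distance(rows[i], rows[j], ranges) for j in range(n)]
            for i in range(n)]

    def assign(medoids: List[int]) -> List[int]:
        return [min(range(k), key=lambda c: dist[i][medoids[c]])
                for i in range(n)]

    medoids = list(range(k))  # assumption: first k rows as initial medoids
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster went empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


if __name__ == "__main__":
    # Hypothetical toy features: (gender, genitive ending, lemma length).
    nouns = [("f", "e", 4.0), ("m", "a", 4.0), ("f", "e", 4.0), ("m", "a", 5.0)]
    print(k_medoids(nouns, k=2))
```

Each medoid is an actual row of the data, which is what lets a cluster's medoid serve as its typical representative.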

Language: Slovenian
Keywords: lexicon of word forms, Sloleks, morphosyntactic paradigms, machine learning, naive Bayes classifier, clustering around medoids
Work type: Bachelor thesis/paper (mb11)
Organization: FRI - Faculty of Computer and Information Science
Year: 2017

Secondary language

Language: English
Title: Generating Slovene word forms using machine learning
Abstract:
Sloleks is a lexicon of Slovene word forms. It is a structured database containing Slovene words and all their word forms, together with their word class and morphosyntactic properties. Because the language changes constantly and the need for machine processing keeps growing, Sloleks must be updated continually. The aim of the thesis was to create a tool that uses machine learning to extend the Sloleks lexicon of Slovene word forms automatically. We focused mainly on nouns, but the tool can also be used for other word classes, such as verbs or adjectives. We tackled the problem by clustering nouns into groups with similar morphosyntactic properties, using clustering around medoids. Based on the obtained groups, which represent morphosyntactic paradigms, we built a naive Bayes classifier that predicts these paradigms for new words. For nouns from the ccGigafida corpus that have missing word forms, we predicted groups with the trained classifier and filled in the missing word forms using typical representatives of the groups. The approach was evaluated qualitatively and quantitatively.

Keywords: lexicon of word forms, Sloleks, morphosyntactic paradigms, machine learning, naive Bayes classifier, Partitioning Around Medoids
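
The classification step described in the abstract, predicting a paradigm group for a new word with a naive Bayes classifier, might look roughly like the sketch below. The abstract does not say which features the classifier used, so the lemma-suffix features, the class `SuffixNaiveBayes`, the group labels `group_a` and `group_b`, and the toy training lemmas are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def suffix_features(lemma: str, max_len: int = 3) -> List[str]:
    """Hypothetical features: the last 1..max_len characters of the lemma."""
    return [f"suf{n}={lemma[-n:]}"
            for n in range(1, max_len + 1) if len(lemma) >= n]


class SuffixNaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing over suffix features."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: set = set()

    def fit(self, lemmas: Iterable[str],
            labels: Iterable[str]) -> "SuffixNaiveBayes":
        for lemma, label in zip(lemmas, labels):
            self.class_counts[label] += 1
            for feature in suffix_features(lemma):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)
        return self

    def predict(self, lemma: str) -> str:
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of every suffix feature.
            score = math.log(count / total)
            denom = (sum(self.feature_counts[label].values())
                     + self.alpha * vocab_size)
            for feature in suffix_features(lemma):
                seen = self.feature_counts[label][feature]
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy training set: lemmas labelled with hypothetical cluster ids.
    lemmas = ["miza", "hiša", "knjiga", "žena", "stol", "korak", "most", "grad"]
    groups = ["group_a"] * 4 + ["group_b"] * 4
    model = SuffixNaiveBayes().fit(lemmas, groups)
    print(model.predict("roka"), model.predict("gost"))
```

Once a group is predicted, the missing forms of the new noun would be derived from that group's typical representative, as the abstract describes; that step is omitted here because the abstract gives no detail on how the forms are transferred.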
