Labeling document clusters using word embeddings (Označevanje skupin dokumentov z uporabo vložitev besed)

Đukić, Nikola

Đukić, Nikola (Author), Zupan, Blaž (Mentor)

Abstract

Documents can be represented as vectors in various ways and visualized in a two-dimensional space. In that space, we can find clusters of similar documents and then find words that describe each cluster well. The visualization of the documents can be enriched by displaying the words found. Methods for labeling document clusters are used for this purpose; they rely on importance measures that consider only the frequencies of words in the given corpus. In this thesis, we propose a new method for labeling document clusters that uses pre-trained word embedding models to embed both documents and words, and that rests on the assumption that similar words are represented by similar vectors. We compare word embedding models in terms of their mutual similarity and their performance on classification tasks in order to choose the one to use in combination with the labeling method. We evaluate the method empirically and compare it with an existing approach, showing that, because it uses pre-trained models, it also works well on very small datasets, which the existing approach cannot handle.

Language:	Slovenian
Keywords:	word embeddings, visualization, clustering
Work type:	Bachelor thesis/paper
Typology:	2.11 - Undergraduate Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2020
PID:	20.500.12556/RUL-119839
COBISS.SI-ID:	31040003
Publication date in RUL:	11.09.2020

Secondary language

Language:	English
Title:	Labeling document clusters using word embeddings
Abstract:	Documents can be represented as vectors in various ways and visualized in a two-dimensional space. In that space, we can find clusters of similar documents and the words that best describe each cluster. Those words can then be added to the visualization to enrich it. This is done with methods for labeling document clusters, which measure the importance of each word using only word frequencies in the given corpus. In this thesis, we propose a novel method for labeling clusters of documents. The method uses pre-trained word embedding models to embed both words and documents, and relies on the assumption that similar words are represented by similar vectors. We compare word embedding models by their mutual similarity and by their scores on classification tasks in order to choose the one to use in combination with our method. The method is evaluated empirically and compared with the traditional approach. We show that, unlike the traditional approach, our method works on very small datasets because it obtains the embeddings from pre-trained models.
Keywords:	word embeddings, visualization, clustering
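
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation from the thesis: the toy three-dimensional vectors, the averaging of word vectors into a document vector, the cosine-similarity ranking, and the names `WORD_VECTORS`, `embed_document` and `label_cluster` are all assumptions made for illustration. A real setup would load vectors from a pre-trained embedding model and would take the clusters from the two-dimensional visualization.

```python
import math

# Hypothetical stand-in for a pre-trained word embedding model.
# The values are invented for illustration only.
WORD_VECTORS = {
    "goal":     [0.9, 0.1, 0.0],
    "match":    [0.8, 0.2, 0.1],
    "football": [1.0, 0.0, 0.0],
    "election": [0.0, 0.9, 0.1],
    "vote":     [0.1, 1.0, 0.0],
    "party":    [0.1, 0.8, 0.2],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed_document(tokens):
    """Assumed document embedding: mean of the vectors of known words."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("document contains no words with known vectors")
    return mean_vector(known)


def label_cluster(documents, top_k=2):
    """Return the top_k vocabulary words closest to the cluster centroid."""
    centroid = mean_vector([embed_document(doc) for doc in documents])
    ranked = sorted(
        WORD_VECTORS,
        key=lambda word: cosine(WORD_VECTORS[word], centroid),
        reverse=True,
    )
    return ranked[:top_k]


# Two tiny, already tokenized clusters (hypothetical data).
sports_cluster = [["goal", "match"], ["football", "goal"]]
politics_cluster = [["vote", "election"], ["party", "vote"]]

print(label_cluster(sports_cluster))
print(label_cluster(politics_cluster))
```

Because the word vectors come from outside the corpus, the ranking in this sketch does not depend on how often a word occurs in the documents, which is the property the abstract credits for the method working on very small datasets.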
