The influence of feature representation of text on the performance of document classification
Martinčić-Ipšić, Sanda (Author)
Miličić, Tanja (Author)
Todorovski, Ljupčo (Author)
PDF - Presentation file, download (412.20 KB)
MD5: 7BF9F00A7DB513C37A18A711E031FD82
URL - Source URL, to access visit https://www.mdpi.com/2076-3417/9/4/743
Abstract
In this paper we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most commonly used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been used extensively for the document classification task, the performance of the other two models on the same task has not been well understood. This is especially true for the network-based models, which have rarely been considered for representing text documents for classification. In this study, we measure the performance of document classifiers trained using the method of random forests on features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. The results also indicate that doc2vec shows superior performance in the task of classifying large documents.
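As a rough illustration of the simplest of the three representations named in the abstract, the bag-of-words model, the following minimal sketch turns documents into term-count vectors. It is not the authors' pipeline: the tokenizer, the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`), and the toy corpus are all assumptions made for this example, and the paper's actual preprocessing and weighting variants may differ.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace, stripping surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def build_vocabulary(documents):
    """Map each distinct token in the corpus to a fixed column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def bag_of_words(text, vocabulary):
    """Return a term-count vector for one document; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector

# Toy corpus, made up for illustration only (not data from the paper).
docs = [
    "Networks of words form a graph.",
    "A bag of words ignores word order.",
]
vocabulary = build_vocabulary(docs)
features = [bag_of_words(doc, vocabulary) for doc in docs]
```

Each row of `features` has one column per vocabulary term, so the dimensionality grows with the vocabulary. This contrasts with the low-dimensional doc2vec variants (up to 75 features) that the abstract highlights. Vectors of this kind could then be passed to any classifier, such as a random forest.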
Language:
English
Keywords:
document segmentation, bag-of-words, word2vec, doc2vec, graph-of-words, complex networks, document classification
Type of material:
Journal article
Typology:
1.01 - Original scientific article
Organisation:
FU - Fakulteta za upravo (Faculty of Public Administration)
Publication status:
Published
Publication version:
Published version
Year of publication:
2019
Number of pages:
27 pp.
Numbering:
Vol. 9, iss. 4, art. 743
PID:
20.500.12556/RUL-131966
UDC:
004:78
Article ISSN:
2076-3417
DOI:
10.3390/app9040743
COBISS.SI-ID:
5274286
Date deposited in RUL:
07.10.2021
Number of views:
726
Number of downloads:
183
The material is part of a journal
Title:
Applied sciences
Abbreviated title:
Appl. sci.
Publisher:
MDPI
ISSN:
2076-3417
COBISS.SI-ID:
522979353
Licences
Licence:
CC BY 4.0, Creative Commons Attribution 4.0 International
Link:
http://creativecommons.org/licenses/by/4.0/deed.sl
Description:
This is the standard Creative Commons licence that gives users the most freedom for further use of the work, provided they credit the author.
Licensing start date:
20.02.2019
Secondary language
Language:
Slovenian
Keywords:
strojno učenje (machine learning), razvrščanje besedil (text classification), vreča besed (bag of words), word2vec, doc2vec, graf besed (graph of words), kompleksna omrežja (complex networks)
Projects
Funder:
Other - Other funder or multiple funders
Funding programme:
University of Rijeka
Project number:
13.13.2.2.07
Acronym:
LangNet
Funder:
ARRS - Agencija za raziskovalno dejavnost Republike Slovenije (Slovenian Research Agency)
Project number:
P5-0093
Title:
Razvoj sistema učinkovite in uspešne javne uprave (Development of a system of efficient and effective public administration)