The influence of feature representation of text on the performance of document classification
Martinčić-Ipšić, Sanda (Author)
Miličić, Tanja (Author)
Todorovski, Ljupčo (Author)
PDF - Presentation file (412.20 KB)
MD5: 7BF9F00A7DB513C37A18A711E031FD82
Source URL: https://www.mdpi.com/2076-3417/9/4/743
Abstract
In this paper, we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based models, which have rarely been considered for the representation of text documents for classification. In this study, we measure the performance of document classifiers trained using the random forest method on features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to that obtained by the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. Finally, the results point out that doc2vec shows superior performance in the task of classifying large documents.
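To make the simplest of the three representations concrete, the following is a minimal sketch of a bag-of-words feature vector built with the Python standard library only. The tokenizer, the function names, and the two toy documents are assumptions made for illustration and are not taken from the paper; the study itself evaluates several bag-of-words variants and feeds the resulting features to random forest classifiers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (illustrative choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return the term-count vector of a document; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = count
    return vector


# Hypothetical toy corpus, used only to show the shape of the features.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each document becomes a vector with one column per vocabulary term, so the feature dimension grows with the vocabulary size. This is the contrast the abstract draws with the low-dimensional doc2vec variants that generate up to 75 features.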
Language: English
Keywords: document segmentation, bag-of-words, word2vec, doc2vec, graph-of-words, complex networks, document classification
Work type: Article
Typology: 1.01 - Original Scientific Article
Organization: FU - Faculty of Administration
Publication status: Published
Publication version: Version of Record
Year: 2019
Number of pages: 27 pp.
Numbering: Vol. 9, iss. 4, art. 743
PID: 20.500.12556/RUL-131966
UDC: 004:78
ISSN on article: 2076-3417
DOI: 10.3390/app9040743
COBISS.SI-ID: 5274286
Publication date in RUL: 07.10.2021
Record is a part of a journal
Title: Applied sciences
Shortened title: Appl. sci.
Publisher: MDPI
ISSN: 2076-3417
COBISS.SI-ID: 522979353
Licences
License: CC BY 4.0, Creative Commons Attribution 4.0 International
Link: http://creativecommons.org/licenses/by/4.0/
Description: This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.
Licensing start date: 20.02.2019
Secondary language
Language: Slovenian
Keywords: strojno učenje (machine learning), razvrščanje besedil (text classification), vreča besed (bag-of-words), word2vec, doc2vec, graf besed (graph-of-words), kompleksna omrežja (complex networks)
Projects
Funder: Other - Other funder or multiple funders
Funding programme: University of Rijeka
Project number: 13.13.2.2.07
Acronym: LangNet

Funder: ARRS - Slovenian Research Agency
Project number: P5-0093
Name: Razvoj sistema učinkovite in uspešne javne uprave (Development of a system of efficient and effective public administration)