Empirična evalvacija procesa avtomatske klasifikacije sentimenta na finančni domeni

RUTAR, SAŠO

RUTAR, SAŠO (Author); Robnik Šikonja, Marko (Mentor); Mozetič, Igor (Co-mentor)

PDF - Presentation file (1.44 MB)
MD5: 01259D8F0AC45878C040A4CA600A0FCC
PID: 20.500.12556/rul/dac2638b-dc31-4d7d-a2d0-a2af7ba5f911

Abstract

In this thesis, we address specific aspects of a system for automatic sentiment analysis of tweets. Our sentiment analysis system is based on machine learning and text mining techniques, such as the bag-of-words representation of texts and the support vector machine method. We use the system to process a data stream of short messages (tweets) about financial markets, specifically about stock trading, spanning two years. Each message is automatically classified into the positive, negative, or neutral class, which represents the sentiment, or stance, towards the stock mentioned in the message. In our case, sentiment therefore reflects the stance of the speaker and, for the positive or negative class, represents a leaning towards buying or selling the stock. To build the classification model, we use a relatively large set of labeled data consisting of approximately half a million tweets hand-labeled by experts. For the analysis, we developed an evaluation platform and an accompanying methodology that allow us, through a sequence of experiments, to answer many of the questions that arise when sentiment analysis is applied in industrial settings. The analyses take the temporal order of messages in the data streams into account and thus enable continuous measurement of system performance in production environments as well. Among other things, the results reveal (i) the most suitable classification algorithm, (ii) the optimal size and sampling (thinning) of the data for manual labeling, (iii) the dependence of classification performance on the temporal distance from the labeled examples, (iv) the effect of duplicates in the data, and (v) the behavior of the chosen classification method in the uncertainty region near the hyperplane of the support vector machine classifier.

Language:	Slovenian
Keywords:	sentiment analysis, machine learning, opinion mining, Twitter, natural language processing, classification, support vector machine, empirical evaluation, financial trading, stocks
Work type:	Undergraduate thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2016
PID:	20.500.12556/RUL-91200
Publication date in RUL:	24.03.2017
Views:	1448
Downloads:	369

Secondary language

Language:	English
Title:	Empirical evaluation of automatic sentiment classification process in financial domain
Abstract:	In this thesis, we explore several specific aspects of Twitter sentiment analysis. Our sentiment analysis system is based on machine learning and text mining techniques, such as the bag-of-words representation of texts and the support vector machine classifier. We use the system to analyze a stream of short messages (tweets) about financial markets, specifically about stock trading, over a time span of two years. We classify each message into the positive, negative, or neutral class, which represents the sentiment or stance towards the stock mentioned in the message. The term sentiment in our case thus denotes the stance of the author (speaker) and, for the positive or negative class, represents the author’s leaning towards buying or selling the stock. To build the classification model, we employ a relatively large gold standard consisting of approximately half a million tweets hand-labeled by domain experts. For the purpose of this analysis, we developed an evaluation platform and a methodology that allow us, by conducting a series of experiments, to answer various questions that arise when applying sentiment analysis in industrial settings. In the evaluation, we take the temporal nature of the data into account and thus enable continuous monitoring of the performance of live systems. The results of the analysis reveal (i) the most appropriate classification algorithm, (ii) the optimal size of the labeled data and subsampling method, (iii) the relationship between classifier performance and the time lag from the training data, (iv) the effect of duplicated tweets (e.g., retweets), and (v) the behavior of the employed classification method in the uncertainty area near the hyperplane of the support vector machine classifier.
Keywords:	sentiment analysis, machine learning, opinion mining, Twitter, natural language processing, classification with support vector machine, empirical evaluation, financial trading, stocks
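
The abstract mentions evaluation that respects the temporal order of tweets and measures how performance changes with the time lag from the training data. The following is only a minimal sketch of that idea, not the thesis's evaluation platform. The function names (`chronological_split`, `accuracy_by_lag`, `keyword_predict`), the integer `day` field, the 7-day buckets, and the toy tweets are all assumptions made for illustration. A simple keyword rule stands in for the support vector machine classifier.

```python
from collections import defaultdict


def chronological_split(tweets, train_size):
    """Sort by timestamp; the oldest train_size tweets train, the rest test."""
    ordered = sorted(tweets, key=lambda t: t["day"])
    return ordered[:train_size], ordered[train_size:]


def accuracy_by_lag(train, test, predict, bucket_days=7):
    """Accuracy per time-lag bucket, measured from the last training tweet."""
    train_end = max(t["day"] for t in train)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tweet in test:
        bucket = (tweet["day"] - train_end) // bucket_days
        totals[bucket] += 1
        hits[bucket] += int(predict(tweet["text"]) == tweet["label"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def keyword_predict(text):
    """Hypothetical stand-in classifier (not an SVM): plain keyword lookup."""
    words = text.lower().split()
    if "buy" in words:
        return "positive"
    if "sell" in words:
        return "negative"
    return "neutral"


# Toy data (invented): "day" is an integer day index, deliberately unsorted.
tweets = [
    {"day": 3, "text": "buy $ACME now", "label": "positive"},
    {"day": 1, "text": "sell $ACME", "label": "negative"},
    {"day": 2, "text": "$ACME reports earnings", "label": "neutral"},
    {"day": 5, "text": "buy more $ACME", "label": "positive"},
    {"day": 9, "text": "time to sell $ACME", "label": "negative"},
    {"day": 12, "text": "$ACME rally, long", "label": "positive"},
]

train, test = chronological_split(tweets, train_size=3)
lag_accuracy = accuracy_by_lag(train, test, keyword_predict)
print(lag_accuracy)
```

Splitting by time rather than at random keeps every test tweet newer than every training tweet, which mirrors how a deployed classifier only ever sees messages from after its training period.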
