Cross-lingual classification of tweet sentiment (Medjezikovna klasifikacija sentimenta tvitov)

Reba, Kristjan

Repository of the University of Ljubljana

Details

Reba, Kristjan (Author), Robnik Šikonja, Marko (Mentor), Mozetič, Igor (Co-mentor)

PDF - Presentation file, Download (946,25 KB)
MD5: E317063244789747F42C005746F53081

Abstract

Word embeddings are representations of words as vectors of real numbers. They underpin many natural language processing applications and are required for processing with deep neural networks. Cross-lingual word embeddings map words from several languages into a common vector space in which words with the same meaning are aligned. They are used to transfer trained models between languages and to expand data sets. Building high-quality classification models for language problems requires large sets of labeled training examples, which are not always available for every language and every problem, so we would like to exploit training sets from other, more data-rich languages. In this thesis we aim to use cross-lingual embeddings to transfer knowledge between languages. We use data sets of tweets in 15 different languages, each with an associated sentiment label. Sentiment classification is a text classification task whose goal is to categorize a text according to the sentiment polarity of the opinions it contains. On labeled data sets of tweets in different languages, we test cross-lingual transfer with the BERT model and the LASER library. The experiments show that transferring information between data sets of different languages typically does not improve classification accuracy.

Language:	Slovenian
Keywords:	text sentiment, word embeddings, language model, cross-lingual embeddings, tweets
Work type:	Bachelor thesis/paper
Organization:	FRI - Faculty of Computer and Information Science
Year:	2019
PID:	20.500.12556/RUL-109295
COBISS.SI-ID:	1538310339
Publication date in RUL:	29.08.2019
Views:	2078
Downloads:	301

Secondary language

Language:	English
Title:	Cross-lingual classification of tweet sentiment
Abstract:	Word embeddings are representations of words in the form of numeric vectors. They are the basic representation for many natural language processing applications and are required for deep neural network processing. Cross-lingual word embeddings map words from multiple languages to the same vector space, where similar words are aligned. They are used to transfer machine learning models between languages and to expand data sets. Building good classification models for language problems requires large sets of labeled training examples, which are not always available for all languages and all problems, so we aim to take advantage of data sets from data-rich languages. In this work, we use cross-lingual word embeddings to transfer knowledge between languages. We use data sets of tweets in 15 different languages with assigned sentiment labels. The sentiment analysis task aims to classify a text according to the sentiment polarity of the opinions it contains. On labeled data sets of tweets in different languages, we test cross-lingual transfer using the BERT model and the LASER library. Experiments show that the transfer of information between data sets of different languages does not necessarily lead to improvements in classification accuracy.
Keywords:	text sentiment, word embeddings, language model, cross-lingual embeddings, tweets
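
Illustration (not part of the thesis): the abstract describes cross-lingual embeddings as a shared vector space in which a classifier trained on one language can be applied to another. The sketch below shows that idea in miniature, under strong assumptions. The `EMBEDDINGS` table is a hypothetical, hand-made set of 3-dimensional vectors, and the nearest-centroid classifier is a stand-in chosen for brevity. The thesis itself works with the BERT model and the LASER library, neither of which is used here.

```python
import math

DIM = 3

# Hypothetical toy embeddings, written by hand so that translations lie close
# together in one shared space. These are NOT outputs of BERT or LASER.
EMBEDDINGS = {
    "en": {
        "good": [0.9, 0.1, 0.0],
        "great": [0.8, 0.2, 0.1],
        "bad": [-0.9, 0.1, 0.0],
        "awful": [-0.8, 0.0, 0.2],
        "day": [0.0, 0.9, 0.1],
        "film": [0.0, 0.1, 0.9],
    },
    "sl": {
        "dober": [0.85, 0.15, 0.05],  # "good"
        "slab": [-0.85, 0.1, 0.05],   # "bad"
        "dan": [0.05, 0.9, 0.1],      # "day"
        "film": [0.0, 0.15, 0.9],     # "film"
    },
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def embed(text, lang):
    """Average the vectors of the known words; zero vector if none are known."""
    table = EMBEDDINGS[lang]
    vectors = [table[word] for word in text.lower().split() if word in table]
    return mean_vector(vectors) if vectors else [0.0] * DIM


def fit_centroids(texts, labels, lang):
    """Compute one centroid per sentiment label from labeled source-language texts."""
    grouped = {}
    for text, label in zip(texts, labels):
        grouped.setdefault(label, []).append(embed(text, lang))
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


def predict(text, lang, centroids):
    """Assign the label whose centroid is nearest in the shared space."""
    vector = embed(text, lang)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Train on English only, then classify Slovenian text with the same centroids.
train_texts = ["good day", "great film", "bad day", "awful film"]
train_labels = ["positive", "positive", "negative", "negative"]
centroids = fit_centroids(train_texts, train_labels, "en")

for tweet in ["dober film", "slab dan"]:
    print(tweet, "->", predict(tweet, "sl", centroids))
```

With these toy vectors the Slovenian phrases are labeled positive and negative respectively, but only because the vectors were aligned by construction. The abstract reports that on real tweet data such transfer does not necessarily improve classification accuracy.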
