Cross-lingual word embeddings for knowledge transfer in less-represented languages

Škvorc, Tadej

Škvorc, Tadej (Author), Robnik Šikonja, Marko (Mentor)

PDF - Presentation file, Download (1,21 MB)
MD5: FB8C07BE75741D1AE74B6C99259215D1

Abstract

Neural networks and deep learning have led to major advances in natural language processing. However, many such techniques rely on large, manually annotated datasets, which are not always available, particularly for less popular tasks and less-resourced languages. In this thesis, we show how text embeddings and knowledge transfer can improve upon existing state-of-the-art approaches and make them viable for less-resourced languages. We demonstrate the developed methodological novelties on two advanced use cases.

In the first part of the thesis, we focus on idiom detection. We use contextual and multilingual text embeddings to develop a novel method that outperforms existing approaches. Our method can detect idioms that do not appear in the training set, which is a major advantage over existing methods. We evaluate our approach on a novel Slovene dataset and a multilingual dataset of 20 languages. We show that our approach generalizes between closely related languages (i.e., Slovene and Croatian), that it works with only a small amount of training data, and that, through knowledge transfer, it can be applied to the related task of metaphor detection.

In the second part of the thesis, we present a method for automatic conference scheduling. The method arranges paper presentations into the schedule of a scientific conference, minimizing overlaps between presentations with similar topics. We use text- and graph-based features to find similar papers and arrange them into a schedule using a novel algorithm based on constrained clustering and optimization. We evaluate our approach on both synthetic data and multiple real-world conferences in English and Slovene. Our approach does not require a labelled dataset and makes use of multilingual embeddings, making it suitable for less-represented languages.
Our work shows how text embeddings and knowledge transfer can improve current NLP approaches for less-resourced languages, sometimes removing the need for large annotated datasets, which has traditionally been an obstacle for deep learning approaches.

Language:	English
Keywords:	natural language processing, deep learning, neural networks, machine learning, text embeddings, knowledge transfer, idiom detection, conference scheduling, multilingual embeddings, contextual embeddings
Work type:	Doctoral dissertation
Typology:	2.08 - Doctoral Dissertation
Organization:	FRI - Faculty of Computer and Information Science
Year:	2022
PID:	20.500.12556/RUL-141868
COBISS.SI-ID:	127985923
Publication date in RUL:	10.10.2022

Secondary language

Abstract:
Language:	Slovenian
Title:	Medjezikovne vložitve besed za prenos znanja v manj zastopanih jezikih
The use of neural networks and deep learning has substantially improved natural language processing. Most of these methods require large, manually annotated datasets for training, which are not always available, especially for less popular tasks and less-resourced languages. In this doctoral thesis, we show how word embeddings and transfer learning can improve existing approaches for less-resourced languages. We demonstrate our methodological contributions on two demanding tasks. In the first part of the dissertation, we focus on idiom detection. Using contextual and multilingual embeddings, we build a new method that surpasses the results of existing approaches. Our method can detect idioms that are not present in the training set, which is a major improvement over existing models. We evaluate our approach on a new dataset of Slovene idioms and on a multilingual dataset covering twenty languages. We show that our approach generalizes between closely related languages (i.e., Slovene and Croatian), that it also works with small training sets, and that, with the help of knowledge transfer, it can be applied to the related domain of metaphor detection. In the second part of the dissertation, we present a method for automatically arranging papers into a conference schedule. Our approach arranges the papers so that overlaps between presentations of papers with similar topics are minimized. Using text- and graph-based features, we find similar papers and arrange them into the conference schedule with a new algorithm based on constrained clustering and optimization. We evaluate our approach on synthetic data and on two conferences in English and Slovene. The proposed method does not require labelled datasets and uses multilingual word embeddings, which makes it suitable for less-resourced languages.
In this work, we show how word embeddings and knowledge transfer can improve current natural language processing approaches for less-resourced languages, removing the need for large datasets that is otherwise characteristic of deep learning approaches.
Keywords:	natural language processing, deep learning, neural networks, machine learning, text embeddings, knowledge transfer, idiom detection, conference schedule, multilingual embeddings, contextual embeddings
