Repository of the University of Ljubljana
Mono- and cross-lingual evaluation of representation language models on less-resourced languages
Authors:
Ulčar, Matej; Žagar, Aleš; Armendariz, Carlos S.; Repar, Andraž; Pollak, Senja; Purver, Matthew; Robnik Šikonja, Marko
PDF - Presentation file (2.51 MB)
MD5: 959E090AF6BB03C8064D9ED465BAB5B6
Source URL:
https://www.sciencedirect.com/science/article/pii/S0885230825000774
Abstract
The current dominance of large language models in natural language processing is based on their contextual awareness. For text classification, text representation models, such as ELMo, BERT, and BERT derivatives, are typically fine-tuned for a specific problem. Most existing work focuses on English; in contrast, we present a large-scale multilingual empirical comparison of several monolingual and multilingual ELMo and BERT models using 14 classification tasks in nine languages. The results show that the choice of the best model largely depends on the task and language used, especially in a cross-lingual setting. In monolingual settings, monolingual BERT models tend to perform the best among BERT models. Among ELMo models, the ones trained on large corpora dominate. On most tasks, cross-lingual knowledge transfer is feasible even in a zero-shot setting, without much loss in performance.
Language:
English
Keywords:
monolingual models, multilingual models, ELMo, BERT, corpus, cross-lingual datasets, language models, contextual embeddings, less-resourced languages
Work type:
Article
Typology:
1.01 - Original Scientific Article
Organization:
FRI - Faculty of Computer and Information Science
Publication status:
Published
Publication version:
Version of Record
Year:
2026
Number of pages:
29 pp.
Numbering:
Vol. 95, art. 101852
PID:
20.500.12556/RUL-182550
UDC:
004.8
ISSN on article:
1095-8363
DOI:
10.1016/j.csl.2025.101852
COBISS.SI-ID:
241622275
Publication date in RUL:
15.05.2026
Record is a part of a journal
Title:
Computer speech & language
Shortened title:
Comput. speech lang.
Publisher:
Elsevier
ISSN:
1095-8363
COBISS.SI-ID:
203927043
Licences
License:
CC BY 4.0, Creative Commons Attribution 4.0 International
Link:
http://creativecommons.org/licenses/by/4.0/
Description:
This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.
Secondary language
Language:
Slovenian
Keywords:
korpusi (corpora), večjezični veliki modeli (multilingual large models)
Projects
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
P6-0411
Name:
Jezikovni viri in tehnologije za slovenski jezik
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
P2-0103
Name:
Tehnologije znanja
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
L2-50070
Name:
Tehnike vektorskih vložitev za medijske aplikacije
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
J7-3159
Name:
Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
GC-0002
Name:
Veliki jezikovni modeli za digitalno humanistiko
Funder:
ARIS - Slovenian Research and Innovation Agency
Name:
Adaptive Natural Language Processing with Large Language Models
Acronym:
PoVeJMo
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
BI-FR/23-24-PROTEUS-006
Name:
Čezjezikovne in čezdomenske metode za luščenje in poravnavo terminologije
Funder:
UKRI - UK Research and Innovation
Funding programme:
EPSRC
Project number:
EP/S033564/1
Name:
Streamlining Social Decision Making for Improved Internet Standards
Funder:
UKRI - UK Research and Innovation
Funding programme:
EPSRC
Project number:
EP/L01632X/1
Name:
EPSRC and AHRC Centre for Doctoral Training in Media and Arts Technology
Funder:
EC - European Commission
Funding programme:
H2020
Project number:
825153
Name:
Cross-Lingual Embeddings for Less-Represented Languages in European News Media
Acronym:
EMBEDDIA
Funder:
EC - European Commission
Funding programme:
HE
Project number:
101186647
Name:
Centre of Excellence in Artificial Intelligence for Digital Humanities
Acronym:
AI4DH