Mono- and cross-lingual evaluation of representation language models on less-resourced languages

Ulčar, Matej; Žagar, Aleš; Armendariz, Carlos S.; Repar, Andraž; Pollak, Senja; Purver, Matthew; Robnik Šikonja, Marko

Repository of the University of Ljubljana

Details

PDF - Presentation file (2.51 MB), MD5: 959E090AF6BB03C8064D9ED465BAB5B6
URL - Source URL: https://www.sciencedirect.com/science/article/pii/S0885230825000774

Abstract

The current dominance of large language models in natural language processing is based on their contextual awareness. For text classification, text representation models, such as ELMo, BERT, and BERT derivatives, are typically fine-tuned for a specific problem. Most existing work focuses on English; in contrast, we present a large-scale multilingual empirical comparison of several monolingual and multilingual ELMo and BERT models using 14 classification tasks in nine languages. The results show that the choice of the best model largely depends on the task and language used, especially in a cross-lingual setting. In monolingual settings, monolingual BERT models tend to perform the best among BERT models. Among ELMo models, the ones trained on large corpora dominate. Cross-lingual knowledge transfer is feasible on most tasks even in a zero-shot setting, without losing much performance.

Language:	English
Keywords:	monolingual models, multilingual models, ELMo, BERT, corpus, cross-lingual datasets, language models, contextual embeddings, less-resourced languages
Work type:	Article
Typology:	1.01 - Original Scientific Article
Organization:	FRI - Faculty of Computer and Information Science
Publication status:	Published
Publication version:	Version of Record
Year:	2026
Number of pages:	29 pp.
Numbering:	Vol. 95, art. 101852
PID:	20.500.12556/RUL-182550
UDC:	004.8
ISSN on article:	1095-8363
DOI:	10.1016/j.csl.2025.101852
COBISS.SI-ID:	241622275
Publication date in RUL:	15.05.2026
Views:	25
Downloads:	12

Record is a part of a journal

Title:	Computer speech & language
Shortened title:	Comput. speech lang.
Publisher:	Elsevier
ISSN:	1095-8363
COBISS.SI-ID:	203927043

Licences

License:	CC BY 4.0, Creative Commons Attribution 4.0 International

Link:	http://creativecommons.org/licenses/by/4.0/
Description:	This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.

Secondary language

Language:	Slovenian
Keywords:	korpusi (corpora), večjezični veliki modeli (multilingual large models)

Projects

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	P6-0411
Name:	Jezikovni viri in tehnologije za slovenski jezik

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	P2-0103
Name:	Tehnologije znanja

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	L2-50070
Name:	Tehnike vektorskih vložitev za medijske aplikacije

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	J7-3159
Name:	Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	GC-0002
Name:	Veliki jezikovni modeli za digitalno humanistiko

Funder:	ARIS - Slovenian Research and Innovation Agency
Name:	Adaptive Natural Language Processing with Large Language Models
Acronym:	PoVeJMo

Funder:	ARIS - Slovenian Research and Innovation Agency
Project number:	BI-FR/23-24-PROTEUS-006
Name:	Čezjezikovne in čezdomenske metode za luščenje in poravnavo terminologije

Funder:	UKRI - UK Research and Innovation
Funding programme:	EPSRC
Project number:	EP/S033564/1
Name:	Streamlining Social Decision Making for Improved Internet Standards

Funder:	UKRI - UK Research and Innovation
Funding programme:	EPSRC
Project number:	EP/L01632X/1
Name:	EPSRC and AHRC Centre for Doctoral Training in Media and Arts Technology

Funder:	EC - European Commission
Funding programme:	H2020
Project number:	825153
Name:	Cross-Lingual Embeddings for Less-Represented Languages in European News Media
Acronym:	EMBEDDIA

Funder:	EC - European Commission
Funding programme:	HE
Project number:	101186647
Name:	Centre of Excellence in Artificial Intelligence for Digital Humanities
Acronym:	AI4DH
