Repository of the University of Ljubljana
Details
Towards better language representation in natural language processing : a multilingual dataset for text-level grammatical error correction
Masciolini, Arianna (Author)
Caines, Andrew (Author)
Arhar Holdt, Špela (Author)
Žagar, Aleš (Author)
PDF - Presentation file (314,03 KB)
MD5: 778590BDBE1CB3567B806D001AD72624
Source URL: https://www.jbe-platform.com/content/journals/10.1075/ijlcr.24033.mas
Abstract
This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.
Language:
English
Keywords:
learner corpora, grammatical error correction, multilingual corpora, Matthew effect, MultiGEC shared task
Work type:
Article
Typology:
1.01 - Original Scientific Article
Organization:
FRI - Faculty of Computer and Information Science
Publication status:
Published
Publication version:
Version of Record
Year:
2025
Number of pages:
pp. 309-335
Numbering:
Vol. 11, iss. 2
PID:
20.500.12556/RUL-172814
UDC:
81'322.2:81'36
ISSN on article:
2215-1478
DOI:
10.1075/ijlcr.24033.mas
COBISS.SI-ID:
234594051
Publication date in RUL:
11.09.2025
Views:
163
Downloads:
68
Record is a part of a journal
Title:
International journal of learner corpus research
Publisher:
J. Benjamins
ISSN:
2215-1478
COBISS.SI-ID:
522804761
Licences
License:
CC BY 4.0, Creative Commons Attribution 4.0 International
Link:
http://creativecommons.org/licenses/by/4.0/
Description:
This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.
Secondary language
Language:
Slovenian
Keywords:
učni korpusi, popravljanje slovničnih napak, večjezični korpusi, Matejev učinek, skupna naloga MultiGEC
Projects
Funder:
Other - Other funder or multiple funders
Project number:
LM2023044
Name:
Large Research, Development and Innovation Infrastructures
Funder:
Other - Other funder or multiple funders
Project number:
518989-LLP-1-2011-1-DE-KA2-KA2MP
Funder:
Other - Other funder or multiple funders
Name:
The error corpora project
Funder:
Other - Other funder or multiple funders
Project number:
3161
Name:
Latent Aspects in L2 Acquisition (LAL2A)
Funder:
Other - Other funder or multiple funders
Project number:
VPP-LETONIKA-2021/1-0006
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
J7-3159-2021
Name:
Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
P6-0411-2019
Name:
Jezikovni viri in tehnologije za slovenski jezik
Funder:
ARIS - Slovenian Research and Innovation Agency
Project number:
GC-0002
Name:
Veliki jezikovni modeli za digitalno humanistiko