Details

Towards better language representation in natural language processing : a multilingual dataset for text-level grammatical error correction
Masciolini, Arianna (Author), Caines, Andrew (Author), Arhar Holdt, Špela (Author), Žagar, Aleš (Author)

PDF - Presentation file (314.03 KB)
MD5: 778590BDBE1CB3567B806D001AD72624
Source URL:https://www.jbe-platform.com/content/journals/10.1075/ijlcr.24033.mas

Abstract
This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks which, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after its competitive phase, serving as a resource to train new error correction systems and perform cross-lingual GEC studies.

Language:English
Keywords:learner corpora, grammatical error correction, multilingual corpora, Matthew effect, MultiGEC shared task
Work type:Article
Typology:1.01 - Original Scientific Article
Organization:FRI - Faculty of Computer and Information Science
Publication status:Published
Publication version:Version of Record
Year:2025
Number of pages:pp. 309-335
Numbering:Vol. 11, iss. 2
PID:20.500.12556/RUL-172814
UDC:81'322.2:81'36
ISSN on article:2215-1478
DOI:10.1075/ijlcr.24033.mas
COBISS.SI-ID:234594051
Publication date in RUL:11.09.2025
Views:163
Downloads:68

Record is a part of a journal

Title:International journal of learner corpus research
Publisher:J. Benjamins
ISSN:2215-1478
COBISS.SI-ID:522804761

Licences

License:CC BY 4.0, Creative Commons Attribution 4.0 International
Link:http://creativecommons.org/licenses/by/4.0/
Description:This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.

Secondary language

Language:Slovenian
Keywords:učni korpusi, popravljanje slovničnih napak, večjezični korpusi, Matejev učinek, skupna naloga MultiGEC

Projects

Funder:Other - Other funder or multiple funders
Project number:LM2023044
Name:Large Research, Development and Innovation Infrastructures

Funder:Other - Other funder or multiple funders
Project number:518989-LLP-1-2011-1-DE-KA2-KA2MP

Funder:Other - Other funder or multiple funders
Name:The error corpora project

Funder:Other - Other funder or multiple funders
Project number:3161
Name:Latent Aspects in L2 Acquisition (LAL2A)

Funder:Other - Other funder or multiple funders
Project number:VPP-LETONIKA-2021/1-0006

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:J7-3159-2021
Name:Empirical foundations for digitally supported development of writing skills (Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti)

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:P6-0411-2019
Name:Language resources and technologies for the Slovene language (Jezikovni viri in tehnologije za slovenski jezik)

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:GC-0002
Name:Large language models for digital humanities (Veliki jezikovni modeli za digitalno humanistiko)
