Details

Pomembnost realistične evalvacije : primer popravkov sklona in števila v slovenščini z velikim jezikovnim modelom
Petrič, Timotej (Author), Arhar Holdt, Špela (Author), Robnik Šikonja, Marko (Author)

PDF - Presentation file (348,90 KB)
MD5: CEED5734C9B1BE69C0D101C89A1A4917
Source URL: https://journals.uni-lj.si/slovenscina2/article/view/14902

Abstract
Errors in writing standard Slovene include the use of an inappropriate grammatical case or number. Using the large language model SloBERTa, we developed a new methodology for the machine detection of such problems and tested it on the incorrect use of the accusative instead of the genitive and of the plural instead of the dual. To evaluate and modify word forms in the input sentences, we used standard natural language processing tools, such as the morphosyntactic tagger CLASSLA-Stanza and the word form lexicon Sloleks. The proposed corrections are based on word form statistics obtained from masked word prediction with a large language model. Because sufficient training data was lacking, we trained the prediction models on artificially generated errors. We first evaluated the performance of machine correction on the artificial datasets and the Lektor corpus, and later on the newly created evaluation dataset Šolar-Eval. Evaluation on the first two datasets showed high performance of the developed methodology (more than 90% of incorrectly set words detected), whereas Šolar-Eval revealed much worse performance on realistic data (only 29.5% of genitive-accusative problems and 11.4% of dual-plural problems detected). Overall, the results point to the danger of overfitting to datasets and to the importance of evaluation on purpose-built authentic data, which is still lacking for Slovene.

Language:Slovenian
Keywords:Slovene, standard Slovene, automatic grammar checking, error correction, grammatical case, grammatical number, large language models, evaluation, SloBERTa (large language model)
Work type:Article
Typology:1.01 - Original Scientific Article
Organization:FRI - Faculty of Computer and Information Science
FF - Faculty of Arts
Publication status:Published
Publication version:Version of Record
Publication date:01.01.2024
Year:2024
Number of pages:pp. 106-130
Numbering:Vol. 12, No. 1
PID:20.500.12556/RUL-168891
UDC:811.163.6'271.1:81'322.2
ISSN on article:2335-2736
DOI:10.4312/slo2.0.2024.1.106-130
COBISS.SI-ID:227633411
Publication date in RUL:06.05.2025

Record is a part of a journal

Title:Slovenščina 2.0 : empirične, aplikativne in interdisciplinarne raziskave
Publisher:Trojina, zavod za uporabno slovenistiko, Znanstvena založba Filozofske fakultete, Založba Univerze v Ljubljani
ISSN:2335-2736
COBISS.SI-ID:264547328

Licences

License:CC BY-SA 4.0, Creative Commons Attribution-ShareAlike 4.0 International
Link:http://creativecommons.org/licenses/by-sa/4.0/
Description:This Creative Commons license is very similar to the regular Attribution license, but requires the release of all derivative works under this same license.

Secondary language

Language:English
Title:The importance of realistic evaluation : an example of correcting Slovene grammatical case and number with large language models
Abstract:
Frequent grammar errors in standard Slovene include using an incorrect grammatical case or number. Using the large language model SloBERTa, we have developed a new methodology for the machine detection of such problems and tested it on incorrect use of the accusative instead of the genitive case and the plural instead of the dual. We applied standard natural language processing tools for Slovenian to evaluate and modify word forms in the input sentences, such as the morphosyntactic tagger CLASSLA-Stanza and the Slovenian word form lexicon Sloleks. The proposed corrections are based on word form statistics obtained from masked word prediction with a large language model. Due to the lack of sufficient training data, we trained the prediction models on synthetically generated errors. We first evaluated the performance of machine correction on synthetic data and the Lektor corpus, and later on a newly developed evaluation dataset, Šolar-Eval. The evaluation on the first two datasets showed excellent performance of the developed methodology (more than 90% of synthetically introduced errors detected), while on Šolar-Eval the performance was far worse (only 29.5% of the problems with the genitive-accusative grammatical case were detected, and just 11.4% of those with the dual-plural grammatical number). Overall, the results show the danger of overfitting to datasets and the importance of evaluating on purposefully designed authentic datasets, which are still rare for Slovene.
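
The idea of proposing a correction by comparing competing word forms in a masked position can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon stands in for Sloleks, the function `toy_score` stands in for masked word prediction with SloBERTa, and all names, scores, and the `margin` parameter are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy lexicon (stand-in for Sloleks): lemma -> competing word forms.
TOY_LEXICON: Dict[str, List[str]] = {
    "kruh": ["kruh", "kruha"],  # accusative vs. genitive singular
}

# A scorer receives the left context, a candidate form, and the right context,
# and returns a plausibility score for the candidate in the masked slot.
Scorer = Callable[[List[str], str, List[str]], float]


def toy_score(left: List[str], candidate: str, right: List[str]) -> float:
    """Hypothetical stand-in for a masked language model.

    It prefers the genitive form after the negated verb "nimam" and the
    accusative form otherwise. A real system would query a model here.
    """
    negated = "nimam" in [tok.lower() for tok in left]
    genitive = candidate == "kruha"
    return 0.9 if negated == genitive else 0.1


def suggest_form(
    tokens: List[str],
    index: int,
    lemma: str,
    score: Scorer,
    lexicon: Dict[str, List[str]] = TOY_LEXICON,
    margin: float = 0.0,
) -> Optional[str]:
    """Return a better-scoring form for tokens[index], or None if none exists."""
    original = tokens[index]
    left, right = tokens[:index], tokens[index + 1:]
    original_score = score(left, original, right)

    best_form, best_score = original, original_score
    for form in lexicon.get(lemma, []):
        if form == original:
            continue
        candidate_score = score(left, form, right)
        if candidate_score > best_score:
            best_form, best_score = form, candidate_score

    # Suggest a change only if the alternative beats the original by the margin.
    if best_form != original and best_score - original_score > margin:
        return best_form
    return None


print(suggest_form(["Nimam", "kruh"], 1, "kruh", toy_score))
print(suggest_form(["Imam", "kruh"], 1, "kruh", toy_score))
```

In this sketch the negated sentence triggers a suggestion of the genitive form, while the affirmative sentence is left unchanged. Raising `margin` would make the hypothetical detector more conservative.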

Keywords:Slovene, standard Slovene, grammatical error correction, grammatical case, grammatical number, large language models, evaluation

Projects

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:P6-0411
Name:Jezikovni viri in tehnologije za slovenski jezik

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:J7-3159
Name:Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:GC-0002
Name:Veliki jezikovni modeli za digitalno humanistiko
