Evaluating robustness of LLMs in question answering on multilingual noisy OCR data

Piryani, Bhawna; Mozafari, Jamshid; Abdallah, Abdelrahman; Doucet, Antoine; Jatowt, Adam

	URL - Source URL, to access visit https://doi.org/10.1145/3746252.3761295
	PDF - Presentation file, download (1.38 MB) MD5: A9B7F26E6F89C804437345EC8B7760D1

Abstract

Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect text extraction, including character insertions, deletions, and substitutions) can significantly degrade downstream tasks such as question answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of multilingual QA systems. To support this analysis, we introduce MultiOCR-QA, a multilingual QA dataset comprising 50K question-answer pairs across three languages: English, French, and German. The dataset is curated from OCR-ed historical documents that contain different levels and types of OCR noise. We then evaluate how state-of-the-art Large Language Models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.
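
The three character-level error types named in the abstract (insertion, deletion, substitution) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `edit_ops` and `char_error_rate` are hypothetical, and the approach shown is a plain Levenshtein alignment between a clean reference string and its noisy OCR counterpart, assuming both are available.

```python
from collections import Counter


def edit_ops(clean: str, noisy: str) -> Counter:
    """Count the insertions, deletions, and substitutions that turn `clean` into `noisy`.

    Hypothetical helper for illustration; uses a standard Levenshtein table.
    """
    m, n = len(clean), len(noisy)
    # dist[i][j] = edit distance between clean[:i] and noisy[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # character missing in the OCR output
                dist[i][j - 1] + 1,         # extra character in the OCR output
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back through the table to label each edit on one optimal path.
    ops = Counter()
    i, j = m, n
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and clean[i - 1] != noisy[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(mismatch):
            if mismatch:
                ops["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deletion"] += 1
            i -= 1
        else:
            ops["insertion"] += 1
            j -= 1
    return ops


def char_error_rate(clean: str, noisy: str) -> float:
    """Total edits divided by the length of the clean reference."""
    return sum(edit_ops(clean, noisy).values()) / max(len(clean), 1)
```

For example, `edit_ops("history", "histoiy")` reports one substitution, and `char_error_rate("house", "hous")` is 0.2 (one deletion over five reference characters). When several optimal alignments exist, the counts reflect only the one path the backtrace happens to follow.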

Language:	English
Keywords:	multilingual QA, OCR text, large language models
Type of material:	Other
Typology:	1.08 - Published scientific conference contribution
Organization:	FRI - Faculty of Computer and Information Science
Publication status:	Published
Publication version:	Published version
Year of publication:	2025
Pages:	pp. 2366-2376
PID:	20.500.12556/RUL-181096
UDC:	004.85:004.352.242:81'322
DOI:	10.1145/3746252.3761295
COBISS.SI-ID:	272786691
Date of publication in RUL:	25.03.2026
Views:	125
Downloads:	24

Record is part of a monograph

Title:	CIKM ’25 : proceedings of the 34th ACM International Conference on Information and Knowledge Management
Place of publication:	New York (NY)
Publisher:	The Association for Computing Machinery
ISBN:	979-8-4007-2040-6
COBISS.SI-ID:	272764675

Licences

License:	CC BY 4.0, Creative Commons Attribution 4.0 International

Link:	http://creativecommons.org/licenses/by/4.0/deed.sl
Description:	This is the standard Creative Commons license that gives users the most freedom to reuse the work, provided they credit the author.

Secondary language

Language:	Slovenian
Keywords:	multilingual question answering, optical text recognition, large language models

Projects

Funder:	EC - European Commission
Project number:	101186647
Title:	Centre of Excellence in Artificial Intelligence for Digital Humanities
Acronym:	AI4DH
