Carniolan Provincial Assembly : corpus improvements and enhancements
Pretnar Žagar, Ajda (Author), Pahor de Maiti, Kristina (Author)

PDF - Presentation file (1.25 MB)
MD5: 51E40C830BDCD61D747CB03EB2DF8E00
URL - Source URL: https://journals.uio.no/dhnbpub/article/view/13202

Abstract
Historical parliamentary corpora offer crucial evidence for studying political discourse over time, yet their usability is often limited by poor OCR quality and incomplete metadata. This paper presents the enhancement of the Kranjska 1.0 corpus, a collection of Carniolan Provincial Assembly proceedings (1861–1913) in Slovenian and German, through a two-phase process aimed at improving textual accuracy and enriching speaker metadata. First, we conducted a manual correction campaign on a representative sample of transcripts, involving trained historians proficient in Gothic script and 19th-century politics. The corrections addressed both structural and textual errors in TEI-encoded XML files, providing a gold-standard dataset for future model training. Error analysis revealed recurring OCR issues, including segmentation problems, misattributed speakers, and systematic character-level noise. Second, we harmonised and expanded speaker metadata using multiple historical sources to unify name variants, resolve ambiguities, and document parliamentary terms, factions, and attendance. The resulting metadata enhance corpus usability and interpretability. This work lays the foundation for the next project phase, which explores the automatic correction of transcripts and metadata using Multimodal Large Language Models (MLLMs). By combining historical expertise with computational methods, we contribute to more accurate processing of historical texts and promote transparency and reusability in digital humanities research.

Language: English
Keywords: historical parliamentary proceedings, OCR correction, error analysis, metadata enrichment
Material type: Other
Typology: 1.08 - Published scientific conference contribution
Organization: FRI - Faculty of Computer and Information Science
FF - Faculty of Arts
Publication status: Published
Publication version: Published version
Year of publication: 2026
Pages: 1-10
PID: 20.500.12556/RUL-180816
UDC: 004.89:328(497.12)"1861/1913"
Article ISSN: 2704-1441
DOI: 10.5617/dhnbpub.13202
COBISS.SI-ID: 271706883
Date published in RUL: 17.03.2026

Part of proceedings

Title: Lost in abundance
COBISS.SI-ID: 271613699

Part of journal

Title: Digital humanities in the Nordic and Baltic countries publications : DHNB
Publisher: University of Oslo library
ISSN: 2704-1441
COBISS.SI-ID: 228389891

Licences

Licence: CC BY 4.0, Creative Commons Attribution 4.0 International
Link: http://creativecommons.org/licenses/by/4.0/deed.sl
Description: This is the standard Creative Commons licence that gives users the most freedom to reuse the work, provided they credit the author.

Secondary language

Language: Slovenian
Keywords: zgodovinski parlamentarni zapisniki, popravki napak OCR, analiza napak, obogatitev metapodatkov

Projects

Funder: ARIS - Slovenian Research and Innovation Agency
Project number: P6-0436-2022
Title: Digital humanities: resources, tools and methods

Funder: ARIS - Slovenian Research and Innovation Agency
Project number: GC-0002-2024
Title: Large language models for digital humanities

Funder: EC - European Commission
Funding programme: HE
Project number: 101186647
Title: Centre of Excellence in Artificial Intelligence for Digital Humanities
Acronym: AI4DH
