Avtomatska ekstrakcija podatkov iz računov (Automatic invoice data extraction)

Ažbe, Gregor

Repository of the University of Ljubljana

Ažbe, Gregor (Author), Šubelj, Lovro (Mentor), Žitnik, Slavko (Co-mentor)

PDF - Presentation file, Download (5,10 MB)
MD5: CA6D07F8F4388207E5528A99AF9D1A61

Abstract

This master's thesis addresses the problem of recognizing data on invoices, which are key administrative documents in business operations. Companies need invoice data in digital form so that it can be processed by computer. Despite the growing use of electronic invoices, most are in PDF format and contain no structured metadata, which makes automated data extraction difficult. Manually transcribing the data is time-consuming and error-prone, so automating this process is extremely important. In the thesis we implemented, described, and compared the performance of three different approaches to automatic data extraction from invoices. The first approach is based on classical machine learning methods, where we tested several models, including decision trees, random forests, support vector machines, and others. The second approach is based on graph neural networks (GNN), and the third is a template-based approach that does not use machine learning. The machine learning features included positional data, such as the position, the size of the bounding box, and the page number, as well as textual features, such as the presence of certain words in the vicinity and the number of certain characters in the word. Our classical machine learning approach achieved the best results: using extremely randomized trees, we reached F1 = 0.89. The GNN approach achieved F1 = 0.87, while the template-based approach achieved F1 = 0.70. Extremely randomized trees proved to be the most suitable approach because, in addition to the highest performance, they have lower computational complexity and need fewer training examples than GNNs. If new fields had to be added, the machine learning approaches would require collecting many invoices containing the new field for training and updating the models accordingly. With the template-based approach, a single invoice with the new field for each invoice type would suffice to update the corresponding template.
Future work could investigate additional approaches that enable fast learning from only a few invoices, or different ANN-based approaches, since these usually provide higher performance.
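The positional and textual features mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the `Token` structure, the cue words in `CUE_WORDS`, the separator set, and all feature names are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One word on an invoice page with its bounding box (hypothetical layout)."""
    text: str
    page: int
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    width: float
    height: float


# Hypothetical cue words; the abstract does not list the words that were used.
CUE_WORDS = ("total", "date", "iban")


def token_features(token, neighbours):
    """Build a flat feature dict from positional and textual properties."""
    nearby = {n.text.lower().strip(":") for n in neighbours}
    features = {
        # Positional features: position, bounding-box size, page number.
        "x": token.x,
        "y": token.y,
        "width": token.width,
        "height": token.height,
        "page": token.page,
        # Textual features: counts of specific characters in the word.
        "n_digits": sum(c.isdigit() for c in token.text),
        "n_separators": sum(c in ".,/-" for c in token.text),
    }
    # Textual features: presence of certain words among the neighbours.
    for cue in CUE_WORDS:
        features[f"near_{cue}"] = int(cue in nearby)
    return features


label = Token("Total:", 1, 400.0, 700.0, 40.0, 12.0)
amount = Token("1.234,56", 1, 450.0, 700.0, 60.0, 12.0)
feats = token_features(amount, [label])
print(feats)
```

A feature dict of this kind could then be fed to any classifier that assigns a field label to each word, such as the extremely randomized trees named in the abstract.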

Language:	Slovenian
Keywords:	data extraction, invoices, machine learning, graph neural networks, templates
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2024
PID:	20.500.12556/RUL-161048
COBISS.SI-ID:	210284291
Publication date in RUL:	06.09.2024
Views:	753
Downloads:	157

Secondary language

Language:	English
Title:	Automatic invoice data extraction
Abstract:	In this master's thesis, we focus on the problem of identifying data from invoices, which are key administrative documents in business operations. Companies need data from invoices in digital format so that it can be processed by computer. Despite the growing use of electronic invoices, these are mostly in PDF format and do not contain structured metadata, which makes automated data extraction difficult. Manual data entry is time-consuming and prone to errors, making the automation of this process extremely important. In the thesis, we implemented, described, and compared the performance of three different approaches to automatic data extraction from invoices. The first approach is based on classical machine learning methods, where we tested several models, including decision trees, random forests, support vector machines, and others. The second approach is based on graph neural networks (GNN), and the third is a template-based approach that does not use machine learning. The features for machine learning included positional data, such as the position, the size of the bounding box, and the page number, as well as textual features, such as the presence of certain words in the surrounding text and the number of specific characters in the word. Our approach with classical machine learning achieved the best results, with extremely randomized trees achieving F1 = 0.89. The GNN approach achieved F1 = 0.87, while the template-based approach achieved F1 = 0.70. Extremely randomized trees proved to be the most suitable approach: in addition to the highest performance, they have lower computational complexity and require fewer training examples than GNNs. When new fields need to be added, the machine learning approaches would require acquiring a large number of invoices with the new field for training and adjusting the models accordingly. In contrast, with the template-based approach, a single invoice with the new field for each invoice type would suffice to adjust the relevant template.
In future work, we could explore additional approaches that would allow rapid learning based on only a few invoices, or various approaches with ANNs, as these typically provide higher performance.
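The point that a template can be extended from a single example invoice can be sketched as follows. This is an illustration under stated assumptions, not the approach as implemented in the thesis: the region-based template format, the field names, and all coordinates are hypothetical.

```python
# Hypothetical template for one invoice type:
# field name -> (page, left, top, right, bottom) of the region holding the value.
TEMPLATE = {
    "invoice_number": (1, 400.0, 40.0, 560.0, 60.0),
    "total": (1, 400.0, 690.0, 560.0, 720.0),
}


def extract_fields(words, template):
    """Collect, per field, the words whose box centre lies inside the field region.

    `words` is a list of (text, page, x, y, width, height) tuples in reading order.
    """
    result = {}
    for field, (page, left, top, right, bottom) in template.items():
        hits = []
        for text, w_page, x, y, width, height in words:
            cx, cy = x + width / 2, y + height / 2
            if w_page == page and left <= cx <= right and top <= cy <= bottom:
                hits.append(text)
        result[field] = " ".join(hits)
    return result


words = [
    ("2024-0042", 1, 420.0, 45.0, 70.0, 10.0),
    ("15.09.2024", 1, 420.0, 80.0, 70.0, 10.0),
    ("1.234,56", 1, 450.0, 700.0, 60.0, 12.0),
    ("EUR", 1, 515.0, 700.0, 25.0, 12.0),
]
fields = extract_fields(words, TEMPLATE)

# Supporting a new field needs only one more region, taken from one example invoice.
TEMPLATE_WITH_DUE_DATE = {**TEMPLATE, "due_date": (1, 400.0, 75.0, 560.0, 95.0)}
fields_extended = extract_fields(words, TEMPLATE_WITH_DUE_DATE)
print(fields_extended)
```

No retraining step appears anywhere in this sketch, which is the trade-off the abstract describes: the template is cheap to extend, but it only applies to the one invoice type it was written for.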
Keywords:	data extraction, invoices, machine learning, graph neural network, templates
