In this master's thesis, we focus on the problem of extracting data from invoices, which are key administrative documents in business operations. Companies need invoice data in digital form so that it can be processed by computer. Despite the growing use of electronic invoices, most of them are PDF files without structured metadata, which makes automated data extraction difficult. Manual data entry is time-consuming and error-prone, so automating this process is extremely important.
In the thesis, we implemented, described, and compared the performance of three different approaches to automatic data extraction from invoices. The first approach is based on classical machine learning methods, where we tested several models, including decision trees, random forests, support vector machines, and others. The second approach is based on graph neural networks (GNNs), and the third is a template-based approach that does not use machine learning.
The features for machine learning included positional features, such as the position and size of the bounding box and the page number, as well as textual features, such as the presence of certain words in the surrounding text and the number of specific characters in the word.
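The feature types described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the `Token` structure, the `KEYWORDS` list, the feature names, and the choice of counted characters (digits and dots) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    """One word with its bounding box (hypothetical structure)."""
    text: str
    x: float       # left edge, normalized to page width
    y: float       # top edge, normalized to page height
    width: float
    height: float
    page: int


# Hypothetical keyword list; the words actually used are not given here.
KEYWORDS = ["total", "date", "invoice"]


def extract_features(token: Token, neighbors: List[Token]) -> Dict[str, float]:
    """Build positional and textual features for a single token."""
    # Lower-cased text of the neighboring tokens, used for keyword lookups.
    surrounding = " ".join(n.text.lower() for n in neighbors)
    features = {
        # Positional features: bounding box and page number.
        "x": token.x,
        "y": token.y,
        "width": token.width,
        "height": token.height,
        "page": float(token.page),
        # Textual features: counts of specific characters in the word.
        "n_digits": float(sum(c.isdigit() for c in token.text)),
        "n_dots": float(token.text.count(".")),
    }
    # Textual features: presence of certain words in the surrounding text.
    for kw in KEYWORDS:
        features[f"near_{kw}"] = 1.0 if kw in surrounding else 0.0
    return features


token = Token("12.05.2023", 0.70, 0.10, 0.12, 0.02, 1)
neighbors = [
    Token("Invoice", 0.50, 0.10, 0.08, 0.02, 1),
    Token("date:", 0.60, 0.10, 0.06, 0.02, 1),
]
features = extract_features(token, neighbors)
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any of the classical models mentioned above; how neighbors are selected (for example by distance on the page) is left open in this sketch.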
The classical machine learning approach achieved the best results, with Extremely Randomized Trees reaching F1 = 0.89. The GNN approach achieved F1 = 0.87, while the template-based approach achieved F1 = 0.70.
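For reference, the F1 score used in this comparison is the harmonic mean of precision and recall. The standard definition is given below, where TP, FP, and FN denote true positives, false positives, and false negatives; how the score was averaged over the individual invoice fields is not stated here and is left open.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
\]
```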
Extremely Randomized Trees proved to be the most suitable approach: in addition to achieving the highest performance, they have lower computational complexity and require fewer training examples than GNNs.
When new fields need to be added, the machine learning approaches would require acquiring a large number of invoices containing the new field, retraining, and adjusting the models accordingly. In contrast, with the template-based approach, a single invoice containing the new field for each invoice type would suffice to adjust the relevant template.
In future work, we could explore additional approaches that allow rapid learning from only a few invoices, as well as other approaches based on artificial neural networks (ANNs), as these typically provide higher performance.