The thesis focuses on improving text recognition in the yu1Parl collection, which contains transcripts of parliamentary sessions of the National Assembly of the Kingdom of Serbs, Croats and Slovenes (SHS), later Yugoslavia, from the period 1919–1939.
These documents are written in Serbo-Croatian and Slovenian, with Serbo-Croatian using two scripts, Latin and Cyrillic, which presents an additional challenge for optical character recognition.
We attempted to improve the recognition of Cyrillic characters using two vision-language models, GOT and SmolDocling.
To fine-tune the models, we created a training set of approximately 20,000 synthetic images, with the aim of improving model performance on historical documents.
The results show that further training of vision-language models on synthetic data improves the recognition of Cyrillic characters. However, on real documents from the yu1Parl corpus, even the adapted models are not yet reliable enough for practical use.
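Recognition quality of this kind is commonly quantified with the character error rate (CER), the edit distance between the recognized text and a reference transcription divided by the reference length. The abstract does not state which metric was used, so the following is only a minimal sketch under that assumption; the function names `levenshtein` and `cer` and the example strings are illustrative and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed metric): edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted Cyrillic letter in a nine-letter word.
print(cer("скупштина", "скупштипа"))
```

A lower value means better recognition; comparing the CER of a base model and a fine-tuned model on the same pages is one way such an improvement could be measured.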