This thesis addresses the problem of automatically generating semantic data schemas for new sources on open data portals, where data are often published as CSV files without standardized types or links to ontologies. We developed the CSVSI pipeline, which uses large language models (LLMs) to generate concise column descriptions compliant with the CSVW standard and then performs ontology matching against existing ontologies. This matching enables the transfer of additional properties such as URIs, data types, and constraints, yielding semantically richer schemas. We evaluate the approach on the OAEI Anatomy dataset and on datasets from Slovenia’s OPSI portal. The results show performance comparable to that of the ontology matching systems AML and LogMap, with greater robustness to incomplete schemas. We conclude that automatic generation of semantic schemas using LLMs and ontology matching is an important step toward greater interoperability and reuse of open data.