This thesis presents a system for detecting anomalies in graphs automatically
extracted from the web, built on the Neo4j graph database and Cypher-based
rules. The system identifies structural, attribute and temporal irregularities
(such as unusual ownership structures, illogical investments and
inconsistencies in event dates) and semantically validates the results with a
Large Language Model (LLM). Evaluation on a graph with approximately
40 million nodes shows high precision for selected rules, a significant impact
of materialized relationships on query runtimes, and a reduction of manual
quality assurance (QA) workload. The approach combines the interpretability
of rules with LLM-based semantic analysis and represents a step towards
a modular, self-learning data quality assurance system.