The exponential growth of scientific publications complicates informed decision-making for researchers and presents a bottleneck for automating the research process, because decision support systems require structured, machine-readable data.
This work addresses the challenge by developing a comprehensive system for automated contextualization. We present a robust pipeline that systematically acquires and processes tens of thousands of scientific articles from the Papers with Code repository. At the core of the system is a comparative analysis of advanced retrieval mechanisms based on sparse and dense embeddings. From the most relevant retrieved documents, a large language model generatively extracts key information into structured quadruples of the form (task, metric, value, dataset).
The evaluation shows that the retrieval mechanism based on dense semantic embeddings outperforms the other approaches by a statistically significant margin. The system achieves an F1 score of 0.969 on the final information extraction task. The result of this work is a functional prototype, accessible via an API, that returns structured context for a natural language query, making it directly usable in automated decision support systems.