The thesis was conceived with the goal of implementing a fast and systematic way of analysing media coverage, specifically web articles. It addresses three main tasks: implementing a program for obtaining web content, connecting that program to a server for data storage, and tracking and analysing the obtained data.
Our initial program was implemented in Java. At the outset we intended to focus on data in Slovene, but because we could not produce a working analyser for it, we shifted our focus to English. We applied various NLP (Natural Language Processing) techniques through Stanford CoreNLP. The data was reduced to its base forms using lemmatization and then stored in a SQL (Structured Query Language) server. We then experimented on the stored data, applying specific functions to specific subgroups.
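As a rough illustration only, the following is a minimal Java sketch of lemmatization with Stanford CoreNLP; the annotator list and the example sentence are assumptions for demonstration, not the thesis's actual pipeline configuration.

import java.util.Properties;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class LemmaExample {
    public static void main(String[] args) {
        // Annotators required for lemmatization: tokenization, sentence
        // splitting, part-of-speech tagging, and the lemma annotator itself.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Illustrative input sentence (not from the thesis corpus).
        CoreDocument doc = new CoreDocument("The journalists were covering the elections.");
        pipeline.annotate(doc);

        // Print each token together with its base (lemmatized) form.
        for (CoreLabel token : doc.tokens()) {
            System.out.println(token.word() + " -> " + token.lemma());
        }
    }
}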
The main focus of the analysis was measuring query speed under different conditions. The first step used a plain, unoptimized query. The second step introduced views, and the final optimization added indexing. As predicted, runtime decreased significantly with each additional step.
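The Java/JDBC sketch below illustrates the three steps in principle. The connection URL (which assumes the Microsoft SQL Server JDBC driver), the table article_lemmas, the view lemma_freq, and the index name are hypothetical placeholders, not the schema used in the thesis.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryTimingSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=articles", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Step 1: a plain aggregation query over the lemma table.
            time(stmt, "SELECT lemma, COUNT(*) AS freq FROM article_lemmas "
                     + "GROUP BY lemma ORDER BY freq DESC");

            // Step 2: the same aggregation wrapped in a view.
            stmt.execute("CREATE VIEW lemma_freq AS "
                       + "SELECT lemma, COUNT(*) AS freq FROM article_lemmas GROUP BY lemma");
            time(stmt, "SELECT lemma, freq FROM lemma_freq ORDER BY freq DESC");

            // Step 3: add an index on the lemma column and repeat the query.
            stmt.execute("CREATE INDEX ix_article_lemmas_lemma ON article_lemmas (lemma)");
            time(stmt, "SELECT lemma, freq FROM lemma_freq ORDER BY freq DESC");
        }
    }

    // Runs the query, consumes the rows, and prints the wall-clock time.
    private static void time(Statement stmt, String sql) throws Exception {
        long start = System.nanoTime();
        try (ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) { /* discard rows; only the timing matters here */ }
        }
        System.out.printf("%d ms  %s%n", (System.nanoTime() - start) / 1_000_000, sql);
    }
}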