In this work we tackle the problem of extracting lists of people from corporate websites. To this end, we implement a web crawler that identifies subpages likely to list people, and a data extractor designed to work on any website.
We show that basic methods, such as matching names from a list, do not reach acceptable accuracy. We show that analysing the structure of a list and transferring the knowledge discovered from it is crucial to reaching the required level of accuracy. Using this approach, we improved the score of our final results by 50 % on the development set and by 35 % on the hidden test set.