Information today is easily accessible and, increasingly, valuable. With
this in mind, we set out to create a solution that extracts the content of articles published on Slovenian news portals. The main problem with
such solutions is separating the article content from extraneous material,
such as ads, comments, and other layout elements of web pages. To solve this
problem, we implemented an approach based on shallow text features. On this
basis, we designed a language model, built from a Slovenian news corpus that contains 10,000 articles from 5 different news portals.
The final product is an extractor that retrieves the content of Slovenian
news articles and presents it in a structured form.
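
To illustrate what "shallow text features" can mean in practice, the following is a minimal sketch and not the implementation described here. It assumes a page has already been split into text blocks, and it uses two commonly cited shallow features, word count and link density. The names Block, shallow_features, and is_content, as well as both thresholds, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block taken from a web page (hypothetical representation)."""
    text: str          # visible text of the block
    linked_words: int  # number of words that sit inside anchor tags


def shallow_features(block):
    """Compute two shallow features: word count and link density."""
    num_words = len(block.text.split())
    # Link density is the share of words that are part of a hyperlink.
    link_density = block.linked_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def is_content(block, min_words=15, max_link_density=0.33):
    """Label a block as article content using illustrative thresholds.

    Both thresholds are assumptions, not values taken from the corpus.
    """
    features = shallow_features(block)
    return (features["num_words"] >= min_words
            and features["link_density"] <= max_link_density)
```

Under these assumed thresholds, a long paragraph with few links would be kept as content, while a short navigation bar made up entirely of links would be discarded. In a trained system, such thresholds would be learned from labeled articles rather than fixed by hand.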