Comparison of methods for automatic Web data extraction (Primerjava metod za avtomatsko ekstrakcijo podatkov iz spleta)

MARTIČ, GAŠPER

Repository of the University of Ljubljana

MARTIČ, GAŠPER (Author), Žitnik, Slavko (Mentor)

PDF - Presentation file (18.31 MB)
MD5: 137DE7C4759639FDE4928BB0B8FA0BCE

Abstract

The purpose of this thesis is to review and evaluate existing methods for automatic extraction of data from web pages. By analysing a larger number of similar web pages, such methods automatically generate a wrapper that can extract data from a web page even if the page structure changes slightly over time. The results of the thesis offer a simple overview of different methods for obtaining data from web pages. This can be useful to the user because it removes distracting advertisements and navigation menus, which divert attention from the content. The quality of each method is measured by its speed and by its ability to remove irrelevant data while preserving the data that matter for understanding the content. The execution of the methods is automated by a Python program that can be run from the command line. Existing implementations of the RoadRunner and Webstemmer methods are used, and the results of running them on five Slovenian online media sites are presented. In addition, a semi-automatic data extraction method is implemented with the Scrapy framework, so that its results and complexity can be compared with those of a fully automatic method.
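
The abstract says that each method is judged by its speed and by how well it discards irrelevant data while keeping the relevant content. As a rough illustration only, the following Python sketch scores an extractor's output against a hand-prepared reference text using token-level precision and recall, and times one run. The function names (`tokenize`, `precision_recall`, `timed`) and the choice of token overlap as the metric are assumptions made for this example; the record does not state which measures the thesis actually uses.

```python
import time
from collections import Counter


def tokenize(text):
    """Split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()


def precision_recall(extracted, reference):
    """Token-level precision and recall of extracted text against a reference.

    Precision drops when irrelevant tokens (menus, ads) survive extraction;
    recall drops when relevant content is lost. Hypothetical metric choice.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    overlap = sum((ext & ref).values())  # multiset intersection
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def timed(extractor, html):
    """Run an extractor callable on one page and return (result, seconds)."""
    start = time.perf_counter()
    result = extractor(html)
    return result, time.perf_counter() - start
```

Here `extractor` stands for any callable that takes HTML and returns text, for example a thin wrapper around one of the evaluated tools.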

Language:	Slovenian
Keywords:	extraction, Web crawler, wrapper, news
Work type:	Bachelor thesis/paper
Typology:	2.11 - Undergraduate Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2023
PID:	20.500.12556/RUL-144592
COBISS.SI-ID:	144116739
Publication date in RUL:	02.03.2023
Views:	1861
Downloads:	158

Secondary language

Language:	English
Title:	Comparison of methods for automatic Web data extraction
Abstract:	The purpose of this thesis is to review and evaluate existing methods for automatic extraction of data from websites. Such methods analyse several similar web pages in order to generate a wrapper that can extract data from a web page even if the page layout changes slightly. The result of the thesis is a simple overview of various tools for extracting the essential data from news articles, which may prove useful to readers because it excludes the advertisements and links that distract from the content. The quality of each method is measured by its speed and its ability to discard irrelevant data. The execution of the methods is automated by a Python program that can be run from the command line. Existing implementations of RoadRunner and Webstemmer are used, and they are evaluated on their ability to extract data from five Slovenian media websites.
Keywords:	extraction, Web crawler, wrapper, news
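
The idea of generating a wrapper from several similar pages can be illustrated with a deliberately simplified Python sketch. It treats every line that appears in all sample pages as template (navigation, footer) and everything else as data. This is not the RoadRunner or Webstemmer algorithm, both of which are considerably more sophisticated; the functions `induce_template` and `extract` are hypothetical names introduced only for this example.

```python
def induce_template(pages):
    """Return the set of lines shared by every sample page.

    Toy assumption: anything repeated verbatim on all pages is template.
    """
    if not pages:
        return set()
    line_sets = [set(page.splitlines()) for page in pages]
    return set.intersection(*line_sets)


def extract(page, template):
    """Keep the non-empty lines of a page that are not part of the template."""
    return [
        line
        for line in page.splitlines()
        if line.strip() and line not in template
    ]


if __name__ == "__main__":
    samples = [
        "<nav>Home</nav>\n<h1>First story</h1>\n<footer>Contact</footer>",
        "<nav>Home</nav>\n<h1>Second story</h1>\n<footer>Contact</footer>",
    ]
    template = induce_template(samples)
    for page in samples:
        print(extract(page, template))
```

A line-based comparison like this breaks as soon as the template contains per-page values, which is one reason real wrapper induction works on the page structure instead of raw lines.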
