Primerjava metod za avtomatsko ekstrakcijo podatkov iz spleta
Author: MARTIČ, GAŠPER
Mentor: Žitnik, Slavko
PDF - Presentation file (18,31 MB)
MD5: 137DE7C4759639FDE4928BB0B8FA0BCE
Abstract
The purpose of this thesis is to review and evaluate existing methods for automatic extraction of data from web pages. Such methods analyse a larger number of similar web pages and automatically generate a wrapper that can extract data from a web page even if the structure of the page changes slightly over time. The results of the thesis offer a simple overview of different methods for obtaining data from web pages. This can be useful to the user because it removes from a web page the distracting advertisements and navigation menus that draw attention away from the content. The quality of each method is measured by its speed and its ability to remove irrelevant data while retaining the data that matter for understanding the content. Execution of the methods is automated by a Python program that can be run from the command line. Existing implementations of the RoadRunner and Webstemmer methods are used, and the results of running them on five Slovenian online media sites are shown. In addition, a semi-automatic method of data extraction is implemented with the Scrapy framework, so that its results and complexity can be compared with a fully automatic method.
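The wrapper idea described in the abstract can be illustrated with a deliberately simplified sketch. It is not the RoadRunner or Webstemmer algorithm: it assumes each page has already been reduced to a list of text lines, treats any line that occurs on every sample page as template (menus, advertisements), and keeps the rest as content. The function names and the sample pages are hypothetical.

```python
from collections import Counter


def infer_template(pages: list[list[str]]) -> set[str]:
    """Return the text lines that occur on every sample page (assumed boilerplate)."""
    counts = Counter()
    for lines in pages:
        counts.update(set(lines))  # count each distinct line once per page
    return {line for line, n in counts.items() if n == len(pages)}


def extract(page: list[str], template: set[str]) -> list[str]:
    """Keep only the lines of a page that are not part of the shared template."""
    return [line for line in page if line not in template]


# Hypothetical sample data: three pages from one site, already split into text lines.
pages = [
    ["Home | News | Sport", "Storm hits the coast", "Advertisement", "Heavy rain fell overnight."],
    ["Home | News | Sport", "Election results are in", "Advertisement", "Turnout was 61 percent."],
    ["Home | News | Sport", "New bridge opens", "Advertisement", "The bridge spans 300 metres."],
]

template = infer_template(pages)
print(extract(pages[0], template))
```

Because the template is learned from what the pages share rather than from fixed positions, a sketch like this tolerates small changes in where the shared lines appear, which is the property the abstract attributes to generated wrappers.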
Language: Slovenian
Keywords: extraction (ekstrakcija), Web crawler (spletni pajek), wrapper (ovojnica), news (novice)
Work type: Bachelor thesis/paper
Typology: 2.11 - Undergraduate Thesis
Organization: FRI - Faculty of Computer and Information Science
Year: 2023
PID: 20.500.12556/RUL-144592
COBISS.SI-ID: 144116739
Publication date in RUL: 02.03.2023
Views: 1205
Downloads: 109
Secondary language
Language: English
Title: Comparison of methods for automatic Web data extraction
Abstract:
The purpose of this thesis is to review and evaluate existing methods for automatic extraction of data from websites. Such methods analyse several similar web pages in order to generate a wrapper that can extract data from a web page even if the page layout changes slightly. The result of the thesis is a simple overview of various tools for extracting the essential data from news articles, which may prove useful to readers because it excludes the advertisements and links that distract from the content. The quality of each method is measured by its speed and its ability to discard irrelevant data. Execution of the methods is automated by a Python program that can be run from the command line. Existing implementations of RoadRunner and Webstemmer are used, and they are evaluated on their ability to extract data from five Slovenian media websites.
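One way to turn the two quality criteria named in the abstract (speed and the ability to discard irrelevant data) into numbers is sketched below. The abstract does not say which metrics the thesis uses, so token-level precision and recall against a hand-made reference text are an assumption here, and the helper names are hypothetical.

```python
import time
from collections import Counter


def token_scores(extracted: str, reference: str) -> tuple[float, float]:
    """Token-level precision and recall of extracted text against a reference text."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # multiset intersection of tokens
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def timed(extractor, page: str) -> tuple[str, float]:
    """Run an extractor on a page and return its output with the elapsed seconds."""
    start = time.perf_counter()
    result = extractor(page)
    return result, time.perf_counter() - start


# Hypothetical example: the extractor left one advertisement token in and lost one word.
reference = "Heavy rain fell overnight"
extracted = "Advertisement Heavy rain fell"
precision, recall = token_scores(extracted, reference)
print(precision, recall)
```

Low precision indicates that irrelevant data survived extraction, while low recall indicates that relevant content was discarded along with it.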
Keywords: extraction, Web crawler, wrapper, news