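Calculation of price indices with machine learning for automatic product classification: master's thesis (Izračun cenovnih indeksov s strojnim učenjem za avtomatsko razvrščanje produktov : magistrsko delo)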

Pezdir, Martin

Repository of the University of Ljubljana

Pezdir, Martin (Author), Todorovski, Ljupčo (Mentor)

PDF - Presentation file, Download (2,35 MB)
MD5: B5B92A8491695E8F9F1EADCC5969929A

Abstract

This master's thesis presents a framework and models for web scraping of product data from online stores, for automatic classification of these products into ECOICOP (European Classification of Individual Consumption according to Purpose) categories using machine learning, and for calculating HICP (Harmonised Index of Consumer Prices) price indices. In the web scraping part, we describe the problems and challenges encountered in the automated transfer of data from the web. We also touch upon the legislation on web scraping. We implement a web scraper in Python that downloads data daily on approximately 30,000 products on sale in the online stores of the two largest Slovenian retailers. In the second part, we introduce the field of machine learning, with an emphasis on converting text and categorical variables into numerical ones. We present and implement two methods for processing text data: the bag-of-words model and the word2vec algorithm. We describe the problems that arise from the specifics of our dataset and present solutions for dealing with them. Using machine learning, we build a hierarchical model that predicts which division, group, class, or subclass each product belongs to. In the last part, we use the official methodology to calculate price indices at the individual levels. Because of data availability, we focus only on division 01 - Food and non-alcoholic beverages. The resulting price indices are comparable to the official ones, although they sometimes deviate from the official index because the official data sample in each aggregate is unknown.

Language:	Slovenian
Keywords:	web scraping, natural language processing, machine learning, classification, inflation
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FMF - Faculty of Mathematics and Physics
Year:	2020
PID:	20.500.12556/RUL-121512
UDC:	519.8
COBISS.SI-ID:	32570115
Publication date in RUL:	13.10.2020
Views:	1836
Downloads:	298

Secondary language

Abstract:
Language:	English
Title:	Calculation of price indices with machine learning for automatic product classification
The thesis presents a framework and models for web scraping of product data from online stores and for automatic classification of these products into ECOICOP (European Classification of Individual Consumption according to Purpose) categories using machine learning. From the classified products we are able to calculate an estimate of the official HICP (Harmonised Index of Consumer Prices). In the web scraping part, we describe the problems and challenges we face when using web crawlers for automated transfer of data from the web. We touch upon the legislation on web scraping. We also implement a web scraper in Python, which transfers data daily on approximately 30,000 products sold by the two largest Slovenian retailers. In the second part, we give a basic introduction to the field of machine learning, with an emphasis on the conversion of text and categorical variables into numerical ones. We present and implement two methods for processing text data: the bag-of-words model and the word2vec algorithm. We describe the problems that arise from the specifics of our dataset and present solutions to deal with them. We use machine learning to build a hierarchical model that predicts the ECOICOP categories an individual product belongs to. In the last part, we use the official methodology to calculate an estimate of price indices at different levels. Because of data availability, we focus only on division 01 - Food and non-alcoholic beverages. We obtain price indices comparable to the official ones, with deviations due to the unknown official data sample in each group of products.
Keywords:	Web scraping, natural language processing, machine learning, classification, inflation
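
The bag-of-words conversion named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tokenization rule (lowercasing and splitting on whitespace), the function names, and the product names are all assumptions made up for this example. Each product name becomes a vector of token counts over a shared vocabulary, which a classifier could then take as numerical input.

```python
from collections import Counter


def tokenize(name):
    """Lowercase a product name and split it on whitespace (assumed rule)."""
    return name.lower().split()


def build_vocabulary(names):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for name in names for tok in tokenize(name)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(name, vocabulary):
    """Return the token-count vector of one product name.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokenize(name))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical product names, invented for illustration only.
product_names = ["Mleko 1 l", "Jogurt sadni 150 g", "Mleko trajno 1 l"]
vocabulary = build_vocabulary(product_names)
vectors = [bag_of_words(name, vocabulary) for name in product_names]
```

Every vector has one entry per vocabulary token, so product names of different lengths map to inputs of the same size. Word order is discarded, which is the defining simplification of the bag-of-words model.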
