Impact of missing values on the modeling of air pollution (Vpliv manjkajočih vrednosti na modeliranje onesnaženosti zraka)

ŽNIDARŠIČ, LUKA

Repository of the University of Ljubljana

Details

ŽNIDARŠIČ, LUKA (Author), Kononenko, Igor (Mentor), Faganeli Pucer, Jana (Co-mentor)

PDF - Presentation file, Download (8,18 MB)
MD5: 46BA7CED6D83296C5F3AC0BF7A923821

Abstract

When modeling air pollution, the input variables are past levels of various pollutants, meteorological measurements, and forecasts from a meteorological model. For various reasons (e.g. a sensor failure), some measurements are not always available. This causes problems when running the models, so the missing values have to be replaced. The goal of this thesis was to examine different methods for imputing missing values and to identify the method that least "degrades" the PM10 and ozone forecasts of two different machine learning models (random forests and the LASSO model). We examined imputation methods that are commonly used in similar problems. Among the simple imputation methods, we selected imputation with the mean value of the training set, with the mean of the values in a 7-day period around the missing date (from the training set), and with the value of the previous day. Among the more advanced methods, we examined linear regression, multiple regression, the kNN (k-nearest neighbors) method, multiple imputation, and LSTM neural networks. The thesis also set out to check whether it is worthwhile to build a new model: the data that are missing on the given day are excluded from the training set, a model is built on the remaining data, and that model is then used for forecasting. We tested the methods by removing different sets of data from the test set (grouped by their source) and replacing them with the methods listed above. The forecasts of the models with partly imputed data were compared with the results obtained on the complete data set (reference results). As performance measures we used the mean absolute error (MAE) and the relative absolute error (RAE). The kNN method and multiple imputation performed worst. The simple methods also substantially degraded the model forecasts in the great majority of cases. The best results were obtained by imputing data with LSTM neural networks and with the "reduced" model. With these methods we obtained results comparable to the reference results.
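The three simple imputation methods named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code. The function names, the dictionary-based data layout, and the example values are hypothetical. The reading of the "7-day period around the missing date" as the same calendar days (plus or minus 3 days) in the training years is an assumption.

```python
from datetime import date, timedelta
from statistics import mean


def impute_train_mean(train):
    """Mean of all training-set values of one variable."""
    return mean(train.values())


def impute_window_mean(train, missing_day, half_width=3):
    """Mean of training values whose day of year lies within
    +/- half_width days of the missing date (assumed interpretation)."""
    target = missing_day.timetuple().tm_yday
    values = []
    for day, value in train.items():
        diff = abs(day.timetuple().tm_yday - target)
        diff = min(diff, 365 - diff)  # approximate handling of the year boundary
        if diff <= half_width:
            values.append(value)
    # Fall back to the overall training mean if the window is empty.
    return mean(values) if values else impute_train_mean(train)


def impute_persistence(observed, missing_day):
    """Value of the previous day, or None if that value is missing too."""
    return observed.get(missing_day - timedelta(days=1))


# Hypothetical daily values of one input variable (date -> value).
train = {
    date(2018, 6, 1): 10.0,
    date(2018, 6, 3): 14.0,
    date(2018, 12, 1): 40.0,
    date(2019, 6, 2): 12.0,
}
observed = {date(2020, 6, 1): 11.5}
missing_day = date(2020, 6, 2)

by_train_mean = impute_train_mean(train)
by_window_mean = impute_window_mean(train, missing_day)
by_persistence = impute_persistence(observed, missing_day)
```

In this toy data the window mean ignores the December value, while the overall training mean is pulled up by it. That is one reason a seasonal window can behave differently from a global mean for strongly seasonal variables.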

Language:	Slovenian
Keywords:	missing values, machine learning, air pollution
Work type:	Bachelor thesis/paper
Typology:	2.11 - Undergraduate Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2020
PID:	20.500.12556/RUL-120074
COBISS.SI-ID:	31070979
Publication date in RUL:	15.09.2020
Views:	1147
Downloads:	169

Secondary language

Language:	English
Title:	Impact of missing values on the modeling of air pollution
Abstract:	Forecasts of air pollutant levels usually take pollutant measurements, measurements of meteorological parameters, and results of meteorological models as input parameters. Because of various technical issues, not all data are always available when forecasting is performed. Missing values are an issue for machine learning models. The goal of this thesis was to investigate the effect of data imputation on the performance of random forest and LASSO models for forecasting daily ozone and PM10 levels. We investigated the most popular methods for data imputation in air quality and meteorological studies and selected some simple data imputation methods and some machine learning methods. The simple imputation methods comprised imputation with the mean value of the imputed parameter, imputation with the mean value of a 7-day period around the missing date (from the training set), and the persistence method (the last available value). We also tested the performance of imputation with the kNN method, linear regression, multiple regression, multiple imputation, and LSTM neural networks. In addition, we tested the feasibility of retraining the models with a reduced training set (only data available on the day of forecasting are used) and predicting with such models. We first selected dates with all data available and tested our models on them; this was the baseline. Then we excluded different sets of data from the test set and imputed them with different methods. The results achieved when predicting with the random forest and LASSO models with imputed values were compared to the baseline results in terms of mean absolute error (MAE) and relative absolute error (RAE). The worst performing methods were kNN and multiple imputation. None of the simple imputation methods performed well. Multiple regression showed an improved performance over the simple methods. The best results were achieved with the LSTM method and with the "reduced" model. With these two methods the forecasted results were similar to the baseline results.
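The two evaluation measures can be sketched as follows. The abstract does not give the formulas, so this assumes the commonly used definitions: MAE as the mean of absolute errors, and RAE as the total absolute error divided by the total absolute deviation of the true values from their mean. The function names and all numbers are hypothetical and are not results from the thesis.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rae(y_true, y_pred):
    """Relative absolute error: total absolute error relative to that of
    always predicting the mean of the true values (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    total_deviation = sum(abs(t - mean_true) for t in y_true)
    return total_error / total_deviation


# Hypothetical daily PM10 values and two sets of forecasts.
y_true = [20.0, 30.0, 40.0, 50.0]
pred_complete = [22.0, 28.0, 41.0, 47.0]  # model run on complete inputs (baseline)
pred_imputed = [26.0, 25.0, 44.0, 43.0]   # same model run on partly imputed inputs

baseline_mae = mae(y_true, pred_complete)
baseline_rae = rae(y_true, pred_complete)
imputed_mae = mae(y_true, pred_imputed)
imputed_rae = rae(y_true, pred_imputed)
```

Comparing the imputed scores with the baseline scores, as in the last four lines, mirrors the comparison described in the abstract. Under this definition an RAE below 1 means the forecasts beat a constant prediction of the mean.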
Keywords:	missing values, machine learning, air pollution
