Forecasts of air pollutant levels usually take pollutant measurements, measurements of meteorological parameters, and the results of meteorological models as input parameters. Because of various technical issues, not all data are always available when a forecast is made. Missing values are a problem for machine learning models.
The goal of this thesis was to investigate the effect of data imputation on the performance of random forest and LASSO models for forecasting daily ozone and PM10 levels. We reviewed the most popular data imputation methods in air quality and meteorological studies and selected several simple methods and several machine learning methods. The simple methods comprised imputation with the mean value of the imputed parameter, imputation with the mean value of a 7-day period around the missing date (taken from the training set), and the persistence method (the last available value). We also tested imputation with the kNN method, linear regression, multiple regression, multiple imputation, and LSTM neural networks. In addition, we tested the feasibility of retraining the models on a reduced training set (using only the data available on the day of forecasting) and predicting with such models. We first selected the dates for which all data were available and tested our models on them; this was the baseline. We then excluded different sets of data from the test set and imputed them with the different methods. The results of the random forest and LASSO models with imputed values were compared to the baseline results in terms of mean absolute error (MAE) and relative absolute error (RAE).
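As a minimal sketch of the three simple imputation methods and the two error measures named above, the following Python code may help. It is illustrative only and not the implementation used in the thesis. All function names are hypothetical, missing values are assumed to be marked with `None`, the 7-day window is assumed to span three days on either side of the missing date within a single series, and RAE is assumed to follow the common definition (total absolute error divided by the total absolute deviation of the observations from their mean).

```python
from statistics import mean


def observed(values):
    """Keep only the non-missing entries (missing values are None)."""
    return [v for v in values if v is not None]


def impute_mean(series):
    """Mean of all available values of the parameter."""
    return mean(observed(series))


def impute_window_mean(series, idx, half_width=3):
    """Mean of the available values in a window of 2 * half_width + 1 days
    centred on position idx (7 days for the assumed half_width of 3)."""
    lo = max(0, idx - half_width)
    window = observed(series[lo: idx + half_width + 1])
    return mean(window) if window else None


def impute_persistence(series, idx):
    """Last available value before position idx."""
    previous = observed(series[:idx])
    return previous[-1] if previous else None


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def rae(y_true, y_pred):
    """Relative absolute error (assumed definition): total absolute error
    relative to that of always predicting the mean of the observations."""
    y_bar = mean(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - y_bar) for t in y_true)
    return num / den
```

For a series such as `[10.0, 12.0, None, 14.0, 16.0]`, each function would be called with the series and, where needed, the index of the missing entry, for example `impute_persistence(series, 2)`. The imputed value would then be passed to the forecasting model in place of the missing input.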
The worst performing imputation methods were kNN and multiple imputation. None of the simple imputation methods performed well. Multiple regression performed better than the simple methods. The best results were achieved with the LSTM method and with the ``reduced'' model. With these two methods, the forecasts were similar to the baseline results.