Air pollution is one of the major problems in the world today. In recent years, elevated levels of two pollutants, particulate matter ($PM_{10}$) and ozone ($O_3$), have been detected in Slovenia. Prolonged exposure to such levels can harm human health, so it is important to forecast the levels of these two pollutants accurately. Accurate forecasts allow preventive measures to be taken, such as limiting outdoor activity at times of elevated $O_3$ levels or restricting urban traffic to prevent further $PM_{10}$ pollution.
In this paper, the measured values of $PM_{10}$ and $O_3$ are treated as time series. We searched for the best model for forecasting the daily values of $PM_{10}$ and $O_3$ one day ahead. We tested the extreme gradient boosting model (XGBoost) and the long short-term memory neural network (LSTM) and compared their performance with that of the autoregressive integrated moving average model (ARIMA). XGBoost is an ensemble machine learning algorithm based on decision trees, and LSTM is a type of recurrent neural network that can learn long-term dependencies. For each model we searched for optimal parameters and time series preprocessing techniques, and for the LSTM model we tested three different architectures.
First, we built the models as if the forecasts were made at the end of the day, at midnight. In practice, forecasts are made in the late morning, because by then the meteorological data for the current morning are available and the forecasts arrive at a time when more people are starting to move outdoors. We subsequently adopted this approach as well, and it gave better results.
We improved the forecasts with features derived from meteorological forecasts for the current day, meteorological data for the previous day, and data for the current day up to the time of the forecast. Meteorological forecasts, especially of solar radiation, cloud cover, and precipitation, contributed most to the $O_3$ forecasts, whereas for the $PM_{10}$ forecasts the most important features were meteorological measurements for the current day up to the time of the forecast, especially temperature inversion and atmospheric temperature. The experiments showed that feature importance changes when the time series are not preprocessed and that it varies from season to season.
The experiments showed that the bidirectional LSTM neural network gave the best results. The results improved when the annual and weekly seasonal components were removed from the time series and the series was normalized. By comparing two different approaches, we showed that using a longer training set gives better forecasts.
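
The preprocessing described above can be illustrated with a minimal sketch. The exact method used to estimate the seasonal components is not given here, so the sketch assumes that each component is a simple group mean (per day of year, then per day of week) and that normalization is a z-score; the function names and the synthetic series are hypothetical and serve only as illustration.

```python
import numpy as np


def remove_seasonal_means(values, keys):
    """Subtract from each value the mean of its seasonal group."""
    values = np.asarray(values, dtype=float)
    keys = np.asarray(keys)
    out = values.copy()
    for key in np.unique(keys):
        mask = keys == key
        out[mask] -= values[mask].mean()
    return out


def preprocess(values, day_of_year, day_of_week):
    # Assumption: seasonal components are estimated as plain group means.
    residual = remove_seasonal_means(values, day_of_year)    # annual component
    residual = remove_seasonal_means(residual, day_of_week)  # weekly component
    mean, std = residual.mean(), residual.std()
    # Return mean and std so that forecasts can be mapped back later.
    return (residual - mean) / std, mean, std


# Synthetic daily series (not real measurements): three years of data with
# an annual cycle, a weekend effect, and random noise.
rng = np.random.default_rng(0)
t = np.arange(3 * 365)
day_of_year = t % 365
day_of_week = t % 7
series = (
    20.0
    + 10.0 * np.sin(2 * np.pi * t / 365)
    + 3.0 * (day_of_week >= 5)
    + rng.normal(0.0, 2.0, size=t.size)
)

normalized, mean, std = preprocess(series, day_of_year, day_of_week)
```

In a forecasting setting, the group means and the normalization statistics would normally be estimated on the training set only and then applied unchanged to later data, so that no information from the test period leaks into the model inputs.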