In machine learning literature, the problem of asymmetrically distributed target variable is well researched for classification tasks. There, only a very small fraction of training examples belong to one of the rarely observed classes. We can also deal with similar, but more complex problem in regression tasks, where small amount of the training cases has outstanding, i.e., extreme values of the numerical target variable. Resampling method SMOTER is one of the few approaches addressing the problem of asymmetrically distributed target variable for regression tasks. Similar to the classification methods, SMOTER employs biased resampling of training examples, so we can get higher share of rare extreme values of target variable in the sample. The main purpose of the master thesis is to provide an overview of resampling approaches to the problem of asymmetrically distributed target variable, present SMOTER and propose two new methods, where one of them is a modification, while the other is a simplification of SMOTER. The thesis also reports upon empirical evaluation of the existing and proposed methods in combination with the learning algorithm of random forest on selected regression data sets with asymmetrically distributed target variable. The evaluation results show that all of the proposed methods have significantly better predictive performance when compared to the alternative of simply applying the learning algorithm to the original data sets. Comparing proposed methods with each other, modification and simplification of SMOTER have better predictive performance than SMOTE for regression, but not significantly.
|