In my master's thesis I investigated approaches to variable selection and prediction for high-dimensional, unbalanced data, that is, data with many variables (dimensions) and unequal class frequencies (in classification). Such a scenario arises, for example, with gene expression data, where the sample size is also typically very small. The goal is twofold: to identify the important variables, i.e. those truly associated with the outcome, and to assign class membership to new samples whose class is unknown.
In my work I focused on random forests, more specifically on decision trees combined with bootstrapping or boosting. Because random forests do not perform variable selection by default, I proposed several ways to achieve it within the existing models. I also examined whether refitting a model on only the selected variables improves the predictive ability of the final model.
I also studied two main approaches to predicting new cases with unbalanced data. Most methods perform poorly on such data because they rely on improper scoring rules and unadjusted loss functions. I presented several proper scoring rules and showed their advantages. I then chose one scoring rule that is used with such data and compared the results with those of methods that use improper scoring rules.
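To illustrate why the choice of scoring rule matters for unbalanced data, here is a minimal Python sketch. It is not the code used in the thesis: the Brier score stands in as one well-known proper scoring rule (the rule chosen in the thesis is not named here), and the labels and predicted probabilities are made up for the example.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Share of correct class labels after thresholding (an improper score)."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return correct / len(y_true)


def brier_score(y_true, p_hat):
    """Mean squared error of predicted probabilities (a proper score, lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


# Hypothetical unbalanced data: 10 minority-class samples, 90 majority-class samples.
y_true = [1] * 10 + [0] * 90

# Model A ignores the data and always predicts probability 0 for the minority class.
p_constant = [0.0] * 100

# Model B gives minority samples a clearly higher probability, but still below 0.5.
p_informative = [0.4] * 10 + [0.05] * 90

acc_constant = accuracy(y_true, p_constant)
acc_informative = accuracy(y_true, p_informative)
brier_constant = brier_score(y_true, p_constant)
brier_informative = brier_score(y_true, p_informative)

print(acc_constant, acc_informative)      # accuracy cannot tell the models apart
print(brier_constant, brier_informative)  # the Brier score prefers model B
```

Both models reach the same accuracy because neither ever predicts the minority class at the 0.5 threshold, whereas the Brier score rewards the model whose probabilities carry information about the minority class.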
To evaluate all approaches and models, I ran a simulation in which I generated high-dimensional, unbalanced data with a structure similar to that of gene expression data. Because of the very large computational burden, I used only one combination of simulation parameters, chosen to make class prediction sufficiently difficult. The data were generated with high between-variable correlations arranged in a block structure. All models were also evaluated on real DNA microarray data.
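The kind of simulated data described above can be sketched as follows. This is only an illustrative generator under my own assumptions, not the simulation design of the thesis: the function name `simulate_block_data`, every parameter value, and the use of a shared latent factor to induce within-block correlation are hypothetical choices.

```python
import math
import random


def simulate_block_data(n_samples=40, n_blocks=50, block_size=10, rho=0.8,
                        minority_frac=0.2, n_informative=5, shift=1.0, seed=0):
    """Generate unbalanced two-class data with block-correlated Gaussian variables.

    Variables in the same block share a latent factor, so their pairwise
    correlation is rho; variables in different blocks are independent.
    All defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)

    # Fixed class sizes so that the imbalance is exactly as requested.
    n_minority = max(1, round(n_samples * minority_frac))
    y = [1] * n_minority + [0] * (n_samples - n_minority)
    rng.shuffle(y)

    X = []
    for label in y:
        row = []
        for _ in range(n_blocks):
            z = rng.gauss(0, 1)  # latent factor shared by the whole block
            row.extend(
                math.sqrt(rho) * z + math.sqrt(1 - rho) * rng.gauss(0, 1)
                for _ in range(block_size)
            )
        # Only the first n_informative variables differ between the classes.
        for j in range(n_informative):
            row[j] += shift * label
        X.append(row)
    return X, y


X, y = simulate_block_data()
print(len(X), len(X[0]), sum(y))  # samples, variables, minority-class count
```

With these defaults there are far more variables than samples, and only a handful of variables are truly associated with the class, which mirrors the setting where variable selection is hardest.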
The results show that my proposed variable selection methods select far fewer uninformative variables than the other methods and also far fewer variables in total. Using proper scoring rules and an adjusted loss function, I improved on the results of most other methods.