Because of the variety of practical tasks and a lack of experience or knowledge, we often run into problems without noticing them at all. Depending on the specific problem, such mistakes can have a negligible effect on the validity of the results or a rather harmful one. In any case, since improper handling of practical problems inevitably affects statistical conclusions, attention must always be paid to the correctness of the procedures used. To this end, four pitfalls in model construction and in the evaluation of model accuracy are presented, covering the tasks where mistakes most commonly occur in practice. The presentation reports the results of three statistical methods on simulated and real data.
In the first pitfall, which concerns variable selection, we showed that models in both simulated and real-world cases were often assessed over-optimistically when variable selection was performed before cross-validation was used for performance assessment. Analysis of different data spaces mostly showed a negligible impact of this bias on models built in low-dimensional space and a substantial impact in high-dimensional space. The results showed that selecting variables before applying cross-validation can yield a model with a seemingly perfect performance score even when there is no actual difference between the groups. As for the second pitfall, which concerns the optimization of model parameters, the results showed a similar bias in the assessment of model performance, but to a lesser extent than with variable selection.
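The contrast between the incorrect and the correct procedure can be illustrated with a minimal sketch; the sketch below is not the study's code but an assumed scikit-learn setup (the classifier, the number of selected variables, and the data dimensions are illustrative choices). On pure noise, selecting variables on the full data set before cross-validation reports a spuriously high accuracy, whereas embedding the selection in the cross-validated pipeline keeps the estimate near chance level.

```python
# Minimal sketch (illustrative, not the study's code): feature selection
# before cross-validation (biased) vs. inside the pipeline (unbiased).
# The data are random noise, so the true accuracy is about 0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))      # high-dimensional noise
y = rng.integers(0, 2, size=50)      # labels unrelated to X

# Incorrect: select the 20 "best" variables on the full data set first.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5)

# Correct: selection is refit inside every training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
unbiased = cross_val_score(pipe, X, y, cv=5)

print(f"selection before CV: {biased.mean():.2f}")    # well above 0.5
print(f"selection inside CV: {unbiased.mean():.2f}")  # close to 0.5
```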
The results of the third pitfall, related to the incorrect assessment of selected models, underscore, in comparison with the previous two, not only the negative impact of incorrect implementation but also the importance of data informativeness: as data informativeness increases, the bias decreases. In the simulations, nested cross-validation is presented as a procedure that combines model selection with unbiased performance estimation. Despite its time-consuming execution, nested cross-validation on average ensures a correct performance estimate. The simulation results also reveal, however, that a single run of nested cross-validation does not guarantee a correct estimate.
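A minimal sketch of nested cross-validation is given below, again assuming a scikit-learn setup; the SVC classifier and the small parameter grid are illustrative choices, not the models examined in the study. The inner loop selects hyperparameters, while the outer loop estimates the performance of the entire selection procedure.

```python
# Minimal sketch (illustrative) of nested cross-validation: the inner
# GridSearchCV performs model selection, the outer loop estimates the
# performance of the whole procedure on held-out folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner = GridSearchCV(SVC(), param_grid, cv=5)       # model selection
outer_scores = cross_val_score(inner, X, y, cv=5)   # performance estimate

print(f"nested CV accuracy: {outer_scores.mean():.2f} "
      f"(+/- {outer_scores.std():.2f})")
```

Because a single run depends on the particular split into folds, repeating the outer loop with different random splits and averaging the results gives a more stable estimate.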
The results of the fourth pitfall, related to data balancing, similarly show the bias introduced by balancing the data before applying cross-validation, as well as the effectiveness of various imbalance corrections in improving model performance. Taking all performance measures into account, undersampling achieves the best performance compared to SMOTE and oversampling. In the simulations and in most real cases, however, these corrections did not have a significant impact on the performance of the methods compared to simply ignoring the imbalance problem and correcting the classification threshold instead.
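The correct placement of a balancing step can be sketched as follows, assuming the imbalanced-learn package; the undersampling correction, the classifier, and the class proportions are illustrative assumptions. Resampling is applied only to the training folds, so the test folds keep the original class distribution and the performance estimate is not biased.

```python
# Minimal sketch (illustrative, assumes imbalanced-learn): the resampling
# step is placed inside an imblearn pipeline, so it is refit on each
# training fold instead of being applied to the whole data set before CV.
from imblearn.pipeline import make_pipeline as make_imb_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

pipe = make_imb_pipeline(RandomUnderSampler(random_state=0),
                         LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.2f}")
```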