We address the problem of predicting the outcomes of sports matches with machine learning, using data from past matches. The task is to transform this data into quantities that describe team strengths, which then serve as features in the training and test data sets. Standard approaches use averages of teams' past performance data as features. These approaches do not account for the uncertainty in sports data, which makes the calculated averages unreliable; when the data set is small, this leads to overfitting and, consequently, poor predictions. We propose a two-level approach to modeling team attributes. The first level models the connection between teams' past performance data and their strengths. The second level consists of prediction models, which model the connection between team strengths and match outcomes. The first level allows us to train multiple second-level prediction models, and we obtain the final predictions by averaging the predictions of all of them. The goal of two-level modeling is to reduce the influence of noise in sports data and thereby improve the predictions of machine learning algorithms. As part of our work, we provide a package for the R programming language that contains a modular framework for two-level modeling. In an empirical evaluation, we show that two-level modeling clearly improves predictions compared to standard approaches and offers a promising methodology for further research.
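The two-level idea can be illustrated with a minimal sketch. It is not the interface of the R package mentioned above, and every name in it (`sample_strengths`, `fit_logistic`, `predict_home_win`) is hypothetical. It assumes that the first level describes each team's strength as a normal distribution around the mean of its past scores, with the standard error as spread, and that each draw from these distributions is used to train one second-level model, here a one-feature logistic regression on the strength difference. The predictions of all second-level models are then averaged.

```python
import math
import random
from statistics import mean, stdev


def sample_strengths(past_scores, n_draws, rng):
    """Level 1 (assumed form): draw plausible strengths for every team.

    Each strength is sampled from a normal distribution centred on the
    team's mean past score, with the standard error of that mean as spread.
    """
    draws = []
    for _ in range(n_draws):
        draw = {}
        for team, scores in past_scores.items():
            std_err = stdev(scores) / math.sqrt(len(scores))
            draw[team] = rng.gauss(mean(scores), std_err)
        draws.append(draw)
    return draws


def fit_logistic(xs, ys, steps=500, lr=0.1):
    """Level 2: one-feature logistic regression fitted by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_home_win(past_scores, matches, fixture, n_draws=20, seed=0):
    """Train one level-2 model per strength draw and average the predictions."""
    rng = random.Random(seed)
    home, away = fixture
    probs = []
    for strengths in sample_strengths(past_scores, n_draws, rng):
        # Feature: home strength minus away strength; label: 1 if home won.
        xs = [strengths[h] - strengths[a] for h, a, _ in matches]
        ys = [home_won for _, _, home_won in matches]
        w, b = fit_logistic(xs, ys)
        z = w * (strengths[home] - strengths[away]) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return mean(probs)
```

With `past_scores` mapping each team to a list of past scores and `matches` given as `(home, away, home_won)` tuples, `predict_home_win(past_scores, matches, ("A", "B"))` returns an averaged probability that team A wins at home. Averaging over draws is what keeps a single noisy average from dominating the prediction.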