In machine learning, problems involving high-dimensional data sets are common and difficult to solve due to the “curse of dimensionality”. To address them, we usually apply dimensionality-reduction methods. A popular choice is the autoencoder, which is usually built with neural networks. However, neural networks have downsides: training them is computationally expensive, and their complexity obscures insight into how they work. To address these issues, we aim to develop an autoencoder based on random forests that avoids these problems. To construct an autoencoder from a random forest, we select a set of forest leaves that describes the data set well and store them in an encoding vector, which we then use to encode data samples. Two types of information are available for decoding: the decision-tree paths leading to the leaves in the encoding vector and the predictions saved in the random forest. We combine the two to obtain the best possible reconstruction of the encoded data. We test the constructed autoencoder to tune its parameter settings and evaluate its performance against neural network autoencoders. We find that, at this point, our autoencoder is significantly less accurate than common autoencoders, and we consider possibilities for improving it in the future.
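The core idea of encoding samples as forest leaves and decoding from the stored leaf predictions can be sketched as follows. This is an illustrative simplification, not the method described above: it uses all leaves (one per tree, via scikit-learn's `apply`) rather than a selected subset, and decodes from stored predictions only, ignoring the tree-path information. The forest is fit to predict the input from itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Fit a random forest to reconstruct the input (multi-output regression X -> X).
forest = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0)
forest.fit(X, X)

def encode(forest, X):
    # Encoding vector: one leaf index per tree for each sample,
    # shape (n_samples, n_trees).
    return forest.apply(X)

def decode(forest, codes):
    # Reconstruct by averaging the predictions stored in the encoded
    # leaves across all trees of the forest.
    recon = np.zeros((codes.shape[0], forest.n_features_in_))
    for t, tree in enumerate(forest.estimators_):
        # tree_.value has shape (n_nodes, n_outputs, 1) for regression trees;
        # indexing by the leaf ids yields each sample's stored leaf mean.
        recon += tree.tree_.value[codes[:, t], :, 0]
    return recon / len(forest.estimators_)

codes = encode(forest, X)
X_hat = decode(forest, codes)
```

Because each tree's prediction is exactly its leaf mean, this decoder recovers the same reconstruction as `forest.predict(X)`; the method in the paper goes further by selecting a compact leaf subset and exploiting path constraints.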