This work focuses on anomalous datasets used for training and evaluating neural networks and other machine learning algorithms. Data in an anomalous dataset can be divided into normal and abnormal data. The first category makes up the majority of the dataset and includes all data that is well defined, can be modeled well, and is easy to acquire compared to abnormal data. In contrast, we have limited knowledge about the second category, which contains the anomalous data, and many types of anomalies are not known in advance. For these reasons, we use generative neural networks for such tasks and train them using only normal data. Because anomalous data is difficult to define and acquire, relatively few such datasets exist in the literature compared to datasets in which all categories are well defined and equally represented.
In this thesis we created an extremely imbalanced dataset that contains far less abnormal data than normal data. The dataset consists of small patches of satellite images, with patches that contain airplanes labeled as anomalies. The labeling process was semi-supervised: we used ADS-B data to obtain airplane positions in the satellite images.
Finally, we used the new dataset to evaluate GANomaly, a generative neural network proposed for anomaly detection, and examined how different ratios of normal to abnormal examples affect the performance of the network.