This thesis explores the use of transfer learning for the detection and classification of dice throws. The main objective was to determine how the number of training images and variation in the background affect model performance. I collected a dataset of 1,200 images, with 200 images per possible dice value, and derived six training sets from it. Five were subsets of 1,200 (the full set), 600, 150, 60, and 30 images; the sixth was created by augmenting the smallest subset of 30 images with affine transformations, yielding 600 training images. Each of the six YOLOv8 models was trained using transfer learning within the PyTorch framework.
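The construction of the class-balanced training subsets can be sketched as follows. The file names, the per-class sampling scheme, and the fixed seed are illustrative assumptions, not the exact procedure used in the thesis:

```python
import random

# Hypothetical dataset: 200 image paths per dice value (1-6), 1,200 in total.
dataset = {value: [f"dice_{value}_{i:03d}.jpg" for i in range(200)]
           for value in range(1, 7)}

def stratified_subset(data, per_class, seed=42):
    """Sample the same number of images from every dice-value class."""
    rng = random.Random(seed)
    subset = []
    for value, paths in data.items():
        subset.extend(rng.sample(paths, per_class))
    return subset

# Training sets of 1,200, 600, 150, 60, and 30 images
# (i.e. 200, 100, 25, 10, and 5 images per dice value).
sizes = {1200: 200, 600: 100, 150: 25, 60: 10, 30: 5}
subsets = {total: stratified_subset(dataset, per_class)
           for total, per_class in sizes.items()}
```

Sampling the same number of images per class keeps every subset balanced across the six dice values, so a shrinking subset degrades all classes evenly rather than starving one of them.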
For testing, I recorded 50 video sequences of dice throws: half on the same background as the training images (the reference background) and half on a black background. I ran all models on these videos and measured detection rates and classification accuracy, calculated macro precision, recall, and F1 scores, and generated confusion matrices.
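The evaluation metrics named above follow the standard macro-averaged definitions, which can be computed directly from parallel lists of true and predicted dice values. The example labels below are made up for illustration:

```python
from collections import Counter

def macro_scores(y_true, y_pred, classes):
    """Macro precision, recall and F1: per-class scores averaged equally."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec); recalls.append(rec); f1s.append(f1)
    n = len(list(classes))
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

def confusion_matrix(y_true, y_pred):
    """counts[(true_value, predicted_value)] -> number of occurrences."""
    return Counter(zip(y_true, y_pred))

# Hypothetical ground-truth and predicted dice values for a few throws.
y_true = [1, 2, 2, 3, 4, 5, 6, 6]
y_pred = [1, 1, 2, 3, 4, 5, 6, 6]  # one die showing 2 misread as 1

p, r, f = macro_scores(y_true, y_pred, classes=list(range(1, 7)))
cm = confusion_matrix(y_true, y_pred)
```

Macro averaging weights all six dice values equally regardless of how often each value appears in the test videos, which is why the F1 score can drop well below the raw classification accuracy when errors concentrate on one value.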
I found that just 150 training images were sufficient for flawless performance on the reference background (100% detection and classification accuracy). On the black background, the detection rate of this model remained high (98%), but the classification accuracy and F1 score dropped to 78% and 77%, respectively, demonstrating the significant impact of background changes. The largest models (trained on 1,200 and 600 images) each made only two identical classification errors on the black background, indicating that larger amounts of training data improve robustness.
As the number of training images decreased, the performance of the models worsened. The model trained on 60 images achieved 73% detection, 83% classification accuracy, and a 63% F1 score on the reference background; on the black background, it achieved 75% detection, 75% classification accuracy, and only a 55% F1 score. The model trained on only 30 images failed to detect a single die on either background, confirming that this amount is clearly insufficient for the task at hand.
However, I found that data augmentation can effectively compensate for the lack of training data. The model trained on the artificially expanded set of 600 images (from the original 30) achieved 100% detection on both backgrounds, 97% classification accuracy and F1 score on the reference background, and 88% accuracy and 87% F1 score on the black background.
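The affine-transformation augmentation that expanded the 30-image set to 600 can be illustrated with a self-contained sketch. The transform ranges and the nearest-neighbour warp below are assumptions for illustration; in practice a library such as torchvision or Albumentations would be used, and the bounding-box labels would need to be transformed together with the pixels:

```python
import math
import random

def random_affine(img, rng):
    """Apply a small random rotation, scale and shift to a 2-D grid
    (nearest-neighbour inverse mapping; out-of-range pixels become 0)."""
    h, w = len(img), len(img[0])
    angle = math.radians(rng.uniform(-15, 15))   # rotation in degrees
    scale = rng.uniform(0.9, 1.1)                # mild zoom in/out
    tx, ty = rng.uniform(-2, 2), rng.uniform(-2, 2)  # pixel shift
    cx, cy = w / 2, h / 2
    cos_a, sin_a = math.cos(angle) / scale, math.sin(angle) / scale
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel to a source pixel.
            dx, dy = x - cx - tx, y - cy - ty
            sx = round(cos_a * dx + sin_a * dy + cx)
            sy = round(-sin_a * dx + cos_a * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

rng = random.Random(0)
# 30 hypothetical tiny grayscale images stand in for the smallest training set.
originals = [[[rng.randrange(256) for _ in range(16)] for _ in range(16)]
             for _ in range(30)]

# 20 random affine variants per original: 30 * 20 = 600 training images.
augmented = [random_affine(img, rng) for img in originals for _ in range(20)]
```

Because every augmented image is a geometric variant of one of only 30 originals, the expanded set adds pose and position diversity but no new dice appearances, which is consistent with the augmented model trailing the genuinely larger sets slightly on the black background.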
With models trained on smaller sets, and on the black background generally, the most frequent classification error involved the value 2, which was often misclassified as 1. Values 3 and 5 also caused problems, particularly in the detection phase.