In this thesis we address the problems of image classification and image captioning with three neural-network methods: food classification, food captioning, and food captioning with region proposal. The methods were trained and tested on a 21-category food image dataset with 1470 images and a 2-category food caption dataset with 750 caption sentences. The first method, food classification, uses the architecture of the GoogLeNet-Inception-v3 model trained on our food dataset and achieves a top-1 prediction accuracy of 82.4% and a top-5 prediction accuracy of 98%. The second method, food captioning, uses the Show and Tell architecture trained on our food caption dataset and achieves a perplexity of 23.3. Our food classification model was used to classify the input images, but the overall results did not meet expectations because the model fails to correctly caption images containing multiple foods. The third method, food captioning with region proposal, uses our food classification method to classify images and outperforms the food classification method alone, achieving a prediction accuracy of 86.5%. In addition, this third method summarizes the contents of images containing different types of food with an accuracy of 64%.
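To make the first method concrete, the following is a minimal transfer-learning sketch of the classification setup, assuming a TensorFlow/Keras environment; the dataset path, batch size, and epoch count are illustrative placeholders rather than the configuration used in the thesis.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_CLASSES = 21  # the 21 food categories in our dataset

    # Inception-v3 backbone pretrained on ImageNet, without its classifier head.
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(299, 299, 3))
    base.trainable = False  # freeze the convolutional layers; train only the new head

    model = models.Sequential([
        layers.Rescaling(1.0 / 127.5, offset=-1.0),  # map pixels to [-1, 1], as Inception-v3 expects
        base,
        layers.Dense(NUM_CLASSES, activation="softmax"),  # new food-category head
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy",  # top-1 accuracy
                 tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="top5")])

    # Hypothetical directory layout: one subfolder per food category.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "food_images/train", image_size=(299, 299), batch_size=32)
    model.fit(train_ds, epochs=10)

Freezing the backbone and training only the classification head is one common fine-tuning strategy; unfreezing the upper Inception blocks for a second training pass is another.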
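For reference, the perplexity reported for the captioning model follows the standard definition: the exponentiated average negative log-likelihood that the model assigns to the words of a caption given the image,

\[
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\!\left(w_i \mid w_1, \dots, w_{i-1}, I\right) \right),
\]

where $N$ is the caption length, $w_i$ the $i$-th word, and $I$ the input image; lower values indicate a better fit of the model to the reference captions.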