In recent years, the fields of computer vision and artificial intelligence have made great strides in image generation using deep-learning methods. Behind these results are generative deep neural network models capable of producing photorealistic and visually convincing images of different objects and even complex scenes. Despite advances in image generation, the understanding of generative models and their application to image editing is still limited. Here, we use the term understanding to denote the ability to robustly learn generative models and the link between the latent and target (image) probability distributions of the data.
There is not yet a general, automated mechanism for image editing that would allow only specific image properties to be modified. Systems that allow image editing with generative models based on linguistic descriptions would contribute significantly to applications in various fields, such as autonomous driving, robotics, manufacturing, design, entertainment, and animation. In such systems, the user could influence the appearance and semantic content of an image by means of a textual or spoken description of the visual scene.
The main topic of this PhD thesis is the construction of a generative neural network system combined with linguistic descriptions, where the goal is to extract information about desired image features or changes from a linguistic description and then use this information for image editing. The starting point for our research is a generative neural network built in a way that enables creating or editing a desired image given linguistic or more structured information. We present several original contributions as part of this PhD thesis. The first original contribution is a new method for editing facial attributes called MaskFaceGAN. Given a generative image model, the presented method
allows the manipulation of different facial features (e.g. hair colour, eyebrow type, nose size). The target linguistic information required for face editing is given in the form of a selected facial feature and its intensity. By designing a dedicated generative-network inversion process, the proposed solution enables high-resolution face editing and allows simultaneous editing of multiple features as well as resizing of individual facial parts. Experiments and a user study, performed on several datasets, show the advantages of the proposed MaskFaceGAN method over competing techniques.
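To make the idea of a generative-network inversion process concrete, the sketch below shows latent optimisation, the generic principle behind generator-inversion editing: a latent code is optimised so that the generator reproduces a target image. The "generator" here is a toy linear stand-in, not the actual model from the thesis, and all names are illustrative.

```python
import numpy as np

# Toy stand-in for a generator: maps a 16-dim latent code to a 64-dim "image".
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))

def generate(z):
    return W @ z

# The image we want to invert (produced here from a hidden latent code).
target = generate(rng.normal(size=16))

# Optimise the latent code by gradient descent on the reconstruction error.
z = np.zeros(16)
lr = 0.005
for _ in range(500):
    residual = generate(z) - target     # reconstruction error
    z -= lr * (W.T @ residual)          # gradient of 0.5 * ||W z - target||^2

loss = 0.5 * np.sum((generate(z) - target) ** 2)
```

Once such a latent code is recovered, edits can be applied in latent space before regenerating the image; real inversion methods add perceptual and regularisation terms on top of this basic reconstruction objective.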
The next original contribution is ChildNet, a model that predicts the appearance of children given images of their parents. ChildNet synthesizes an image of a child from an input image of the parents, where additional linguistic information can be supplied to the model in the form of requirements on the child's appearance (age and gender). We also present a new high-resolution dataset designed for learning image-synthesis models given sibling relationships. We evaluate ChildNet against competing methods and show that it estimates the appearance of the child more accurately, producing images of high quality and resolution.
The last original contribution presents the FICE method, which addresses text-based fashion image editing. Here, the linguistic information is given in its rawest form, i.e. as free-form text, and the method is capable of processing textual descriptions drawing on a wide vocabulary. The editing concept is again based on the inversion of a generative network, where the model itself is specialised for editing fashion images. To evaluate the quality of the method, we propose several metrics focusing on image quality, pose preservation, semantic relevance and identity preservation. We compare FICE with other text-based image editing techniques and show that it outperforms them on all tested metrics.
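As an illustration of how an identity-preservation score of this kind can be computed, a common generic choice is the cosine similarity between face embeddings of the original and the edited image. The sketch below uses hand-written stand-in embeddings; in practice they would come from a pretrained face-recognition network, and this is not claimed to be the exact metric used by FICE.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings: a small edit should leave identity largely intact.
emb_original = np.array([0.2, 0.5, 0.1, 0.7])
emb_edited = np.array([0.21, 0.48, 0.12, 0.69])

identity_score = cosine_similarity(emb_original, emb_edited)
```

A score near 1 indicates that the edit preserved the person's identity, while a low score signals that the edit changed the face beyond the intended attribute.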
In summary, all the original contributions focus on understanding and building generative models, or on developing systems where the target linguistic information is supplied to the model in order to generate the desired image. The results of the research demonstrate the potential of generative models for image editing and the importance of understanding the link between latent and target probability distributions. The proposed methods and systems have the potential to contribute significantly to a wide range of applications in various fields.