Going deeper into Generative AI models
You may have heard about Deep Learning models like Deep Fakes or those able to estimate how your face would look in 25 years. Indeed, deep generative models are behind these impressive use cases, but these models go far beyond purely trivial purposes and offer great possibilities for industries such as game design, cinematography, and content generation, among others.
These models are based on unsupervised learning algorithms capable of approximating complex, high-dimensional probability distributions from data and generating new samples from these underlying distributions. These algorithms may be applied to many types of data, including audio, image, and video data.
In the last five years, there has been huge progress in the field of generative models from both academia and industry. There are two specific projects worth noting: StyleGAN from NVIDIA, presenting a model capable of generating human faces, and the GPT-2 language model from OpenAI, which can generate original text based on an introductory piece of text. However, evaluating the performance of these models has been difficult given the subjective aspect of measuring the quality of the output.
It is quite common to classify generative models into two main groups:
1. Likelihood-based models
Models in this group try to explicitly learn the data likelihood P(X) by optimising a likelihood-based loss function. They are trained to infer a probability distribution that is as similar as possible to the original input data distribution. We could classify VAE models into this group.
2. Implicit models
Unlike likelihood-based models, implicit models are not explicitly trained to learn the likelihood distribution, but rather to generate output similar to the input data. In the case of GAN, the generator learns to produce images that look real enough to fool the discriminator into failing to detect them as fake.
As we mentioned previously, evaluating generative models’ performance is challenging. In the case of likelihood-based models, we may use the likelihood values to measure how good a model is, but this does not take into account the output of the models. While outputs can be observed visually, at least for image outputs, we need an empirical metric to objectively measure the model quality and compare it against other models.
Here are some of the most commonly used metrics to evaluate generative models:
1. Kullback-Leibler Divergence (KL Divergence)
The KL Divergence measures how different one probability distribution is from another. This is closely related to standard maximum likelihood optimisation: instead of maximising the likelihood, we minimise the divergence between the model distribution and the data distribution.
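As a rough illustration, here is a minimal sketch of the discrete KL divergence in Python; the two distributions `p` and `q` below are purely made-up values for demonstration:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Illustrative distributions over three outcomes
p = [0.7, 0.2, 0.1]   # "true" data distribution
q = [0.5, 0.3, 0.2]   # model distribution
print(kl_divergence(p, q))  # equals 0 only when P and Q are identical
```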
2. Inception Score (IS)
This metric evaluates the quality of the generative model output using the InceptionV3 network pre-trained on ImageNet. The IS combines two aspects: (1) whether the InceptionV3 network can clearly identify the object in each generated image; and (2) the variety of outputs, i.e. whether the generative model can generate a large set of different objects.
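As a rough sketch of how the IS is usually computed, assuming `preds` already contains the InceptionV3 softmax probabilities for a batch of generated images (the exact preprocessing and data splits vary between implementations):

```python
import numpy as np

def inception_score(preds, eps=1e-12):
    """preds: (N, num_classes) softmax outputs from InceptionV3 for N generated images."""
    p_y = np.mean(preds, axis=0, keepdims=True)                # marginal label distribution p(y)
    kl = preds * (np.log(preds + eps) - np.log(p_y + eps))     # KL(p(y|x) || p(y)) per image
    return float(np.exp(np.mean(np.sum(kl, axis=1))))          # exp of the average KL

# Sharp, diverse predictions -> higher score
preds = np.array([[0.9, 0.05, 0.05],
                  [0.05, 0.9, 0.05],
                  [0.05, 0.05, 0.9]])
print(inception_score(preds))
```

The score is the exponential of the average KL divergence between each image's predicted label distribution and the marginal label distribution, so it rewards both confident per-image predictions and variety across images.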
3. Fréchet Inception Distance (FID)
This metric uses the InceptionV3 network slightly differently from the IS metric. It measures the difference between real and generated images by comparing the activations of the penultimate layer of the Inception network when real and generated images are used as input. A lower FID means the generated images are similar to the real ones, so the model is performing well.
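A minimal sketch of the FID computation, assuming `real_feats` and `gen_feats` are arrays of penultimate-layer InceptionV3 activations (one row per image); the variable names are placeholders, and practical implementations differ in preprocessing details:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Fréchet distance between two Gaussians fitted to Inception activations."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```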
It is important to highlight that these metrics have some limitations to keep in mind when using them to evaluate the models:
Their values can only be compared in the same context (dataset, image size, etc.).
There may be different implementations for the same metric.
While generative models have been a milestone in the Computer Vision field, these technical advances need to be adapted to real business problems. In many cases, we may want to control the output in some way, whether to explore output variations or to generate content in a specific direction. This is where Conditional Image Synthesis models come in.
This kind of model is given an additional input that influences the output, allowing us to control the content generation. This input may be very diverse, from text to segmentation masks. A popular example of these models is DALL-E, an impressive model capable of generating realistic images from text descriptions. In this case, the model output is conditioned through text.
Recent advances in the Conditional Image Synthesis field have been based on the original generative model architectures, such as GAN and VAE, and applying some modifications to allow the model to be given additional input to control the output. Nevertheless, there are new architectures, like Diffusion Models, with a different approach to generating synthetic images.
Conditional GAN
Before we start explaining the Conditional GAN model, it is worth giving a brief introduction to the GAN architecture. This model architecture was first introduced in 2014, and is composed of two sub-models: a generator and a discriminator. The former tries to generate realistic images similar to the ones present in the training data and the latter tries to discriminate between real images and the images generated by the generator. The generator learns from the output of the discriminator and is trained to create images that look ‘real’ to the discriminator.
This is great, but we cannot control the generator output, as it is randomly generated based on the ‘knowledge’ gathered during the training phase. This characteristic differentiates GAN from Conditional GAN, as the latter allows control of the generator output. The Conditional GAN architecture includes an additional control vector that feeds both the generator and the discriminator, steering the model’s behaviour in the desired direction.
This control vector can be in multiple formats such as text labels, images, and segmentation masks, among others. An example of this type of model is GauGAN, by NVIDIA, which takes a segmentation mask as a conditional input.
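A minimal PyTorch-style sketch of how such a conditioning input is commonly injected: here a class label is embedded and concatenated with the generator’s noise vector and with the discriminator’s input. The layer sizes and dimensions are illustrative, not the architecture of any particular published model:

```python
import torch
import torch.nn as nn

NOISE_DIM, NUM_CLASSES, EMB_DIM, IMG_DIM = 100, 10, 16, 28 * 28

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, EMB_DIM)
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + EMB_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh(),
        )

    def forward(self, noise, labels):
        # The control vector (label embedding) is concatenated with the noise
        x = torch.cat([noise, self.label_emb(labels)], dim=1)
        return self.net(x)

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, EMB_DIM)
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + EMB_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, images, labels):
        # The same conditioning information is given to the discriminator
        x = torch.cat([images, self.label_emb(labels)], dim=1)
        return self.net(x)

# Generate a batch of images conditioned on class label 3
gen = ConditionalGenerator()
fake = gen(torch.randn(8, NOISE_DIM), torch.full((8,), 3, dtype=torch.long))
```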
The GAN architecture has many potential applications, but it is important to consider which use cases it may perform well on. GAN can generate high-resolution images but is unable to capture the entire data distribution, so it suffers from a lack of diversity. Generally speaking, GAN models are suitable for tasks where the required images have sparse spatial details (like human faces) or where good textural details are required, such as landscapes.
Variational Autoencoders (VAE)
To understand Variational Autoencoders, it helps to first understand what autoencoders are, as they are the foundation upon which VAE rests. The autoencoder architecture is composed of an encoder that compresses the input image into a numerical vector, where each dimension represents a feature using discrete values, and a decoder that takes that vector and tries to reconstruct the original image from the information encoded in that numerical vector.
The encoder is trained to gather the most representative features of every image and compress that information into a numerical vector. Meanwhile, the decoder is trained to reproduce the original image from that vector. For solving real problems, this is fairly limited, as the decoder will always generate the same output given an encoded vector; there is no margin for diversity.
Variational Autoencoders solve this issue in a very simple way. Instead of compressing the input image into a numerical vector with discrete values, the encoder describes the image attributes in probabilistic terms, using probability distributions instead of discrete numbers. This representation is commonly called the latent space, and the decoder can randomly sample the probability distribution of each attribute to infer a new image, providing more diverse outputs.
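A minimal PyTorch-style sketch of this idea, with small illustrative dimensions: the encoder outputs a mean and log-variance for each latent attribute, and the decoder receives a random sample drawn from that distribution (the reparameterisation trick):

```python
import torch
import torch.nn as nn

IMG_DIM, HIDDEN, LATENT = 28 * 28, 256, 32

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IMG_DIM, HIDDEN), nn.ReLU())
        self.to_mu = nn.Linear(HIDDEN, LATENT)       # mean of each latent attribute
        self.to_logvar = nn.Linear(HIDDEN, LATENT)   # log-variance of each latent attribute
        self.decoder = nn.Sequential(
            nn.Linear(LATENT, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, IMG_DIM), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation: sample z ~ N(mu, sigma^2) in a differentiable way
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = VAE()
recon, mu, logvar = vae(torch.rand(8, IMG_DIM))  # sampling z gives diverse reconstructions
```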
In terms of limitations, VAE models are good at estimating the latent distribution and providing more diverse outputs than GAN, but often the results are blurry compared to the high-quality images GAN may provide. So, VAE is usually recommended when image resolution requirements are not so high, and diversity is a plus.
Diffusion Models
The Palette model is an Image-to-Image diffusion model that can perform various tasks, like Inpainting, Image Refinement, Colorization, and more. The Inpainting task is exciting, as the examples in the paper show excellent results when the model is given an image with a blanked-out area: the outputs are images with no blank areas, and the filled area looks natural.
The diffusion model is inspired by nonequilibrium thermodynamics, and the process is explained in the paper that explores this area. The model is based on a Markov chain of images derived from an original image: noise is added to the image, or to a specific area of it, in small steps until the last image in the chain (or the selected area of it) is just random noise. The model then learns how to reconstruct the image by denoising it step by step in reverse. The result is an image without noise and with a natural fill.
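A minimal sketch of the forward (noising) half of that Markov chain, assuming a simple linear noise schedule rather than the exact setup used in the Palette paper; the closed-form expression lets us sample any step of the chain directly:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (an assumption)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product alpha_bar_t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of the image."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

x0 = torch.rand(1, 3, 64, 64)     # a toy "image"
x_last = q_sample(x0, T - 1)      # by the final step, this is almost pure noise
# A denoising network is then trained to reverse this chain step by step.
```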
The Palette diffusion model does have a limitation when it comes to inpainting: we cannot condition the model to generate predefined content. If the blanked-out area of the input image originally contained a dog, the dog would most likely not appear in the resulting image. The diffusion process creates natural-looking photos but offers no control over the outcome.
Diffusion models are fascinating when it comes to potential use cases. The Palette model explored four tasks: colourization, inpainting, uncropping, and JPEG restoration. These functions make it an exceptional media editing tool. Taking pictures at tourist attractions, for example, can be a pain when people are in the way of the beautiful photos you want to take. With the inpainting task, the model can remove the people if they are blanked out in the input: the information about the people is removed and will likely not be regenerated in the output image.
Another use case is restoring image quality through the restoration task. Storing or sending high-resolution media can be data-heavy. Lowering an image’s resolution could solve this problem if we can restore it on the local machine, which is possible with the diffusion model.