Current image generation techniques
Billions of images and their descriptions (text captions) feed the magic of current image generation AI techniques.
The diffusion model concept
I’m going to describe something to you and I want you to imagine it in your mind as I describe it:
Imagine a very calm lake with a beautiful sunset and mountains on the horizon. And on the peaceful lake is a small paddle boat. Now imagine that sitting on the boat… is a bear.
I’m guessing that you have probably never seen a bear on a boat before, but your mind was still able to imagine it. You were able to imagine it because you know what a bear looks like, you know what a boat and a sunset look like, and you also know what those objects look like when they are combined.
You have an internal dataset of a vast amount of memories, experiences, and gained knowledge — an understanding of how objects and things relate to each other.
The AI model starts with a dataset of image/caption pairs. For example, LAION-5B is a dataset of 5.85 billion image-text pairs.
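To make that concrete, here is a minimal sketch in Python of the kind of record such a dataset holds. The field names are illustrative, not LAION-5B’s actual schema, and the URLs and captions are made up for the example.

```python
# A minimal sketch of the kind of record an image-text dataset holds.
# The field names here are illustrative, not LAION-5B's actual schema.
from dataclasses import dataclass

@dataclass
class ImageTextPair:
    image_url: str   # where the image lives
    caption: str     # the text describing it
    width: int
    height: int

dataset = [
    ImageTextPair("https://example.com/bear.jpg",
                  "a brown bear standing in a river", 640, 480),
    ImageTextPair("https://example.com/boat.jpg",
                  "a small paddle boat on a calm lake at sunset", 800, 600),
]

for pair in dataset:
    print(pair.caption, "->", pair.image_url)
```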
The deep learning algorithms train on these datasets (the training data) of image-text pairs and build associations: a multi-dimensional latent space in which related objects form clusters. This is all mathematical; similarity is measured as distance in that space.
The link between textual semantics and their visual representations is learned by a model called CLIP (Contrastive Language-Image Pre-training).
All images and their associated captions are passed through their respective encoders, mapping each one to a point in a shared latent space.
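As a rough illustration, here is how OpenAI’s open-source `clip` package can embed an image and some candidate captions into that shared space and score their similarity. This is a sketch, not the training procedure itself: it assumes the `clip` package is installed, and the file name "bear_on_boat.jpg" and the captions are placeholders.

```python
# Sketch: embedding an image and captions into CLIP's shared latent space
# and measuring how close they are. Assumes OpenAI's open-source `clip`
# package is installed; "bear_on_boat.jpg" is an illustrative placeholder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("bear_on_boat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a bear sitting on a paddle boat at sunset",
                      "a plate of spaghetti"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # a point in latent space
    text_features = model.encode_text(text)      # points in the same space

    # Cosine similarity: the matching caption should score higher.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # higher value = caption closer to the image in latent space
```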
Now that you have imagined a visual image in your mind, the next step is to create it. Supposing you are an artist, you could draw it, paint it, compose a photo of it, or create it digitally using a tool like Photoshop.
The idea is to get what you imagined in your mind into some rendered art.
That’s the second part of what the AI model does: once the latent space is in place, the next step is to translate points from that mathematical space into pixel space, using a generative process called diffusion.
Diffusion models work by destroying training data through the gradual addition of Gaussian noise, and then learning to recover the data by reversing this noising process. In other words, diffusion models can generate coherent images from noise.
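Here is a small sketch of the forward (noising) half of that process. The variance schedule below is a commonly used linear one and is purely illustrative; in a real model, a neural network is trained to predict the added noise so the process can be run in reverse.

```python
# Sketch of the forward (noising) step of a diffusion model: blending a
# clean image x0 with Gaussian noise according to a variance schedule.
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # a common linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)             # eps ~ N(0, I)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.rand(3, 64, 64)      # stand-in for a training image
x_half = add_noise(x0, T // 2)  # partially destroyed
x_final = add_noise(x0, T - 1)  # nearly pure noise; this is what the
                                # learned reverse process starts from
```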
So the final process from prompt to output looks like this: the prompt goes through the encoding process, maps to a point in the latent space, and the decoding process uses diffusion to generate the image.
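Putting it all together, libraries such as Hugging Face’s `diffusers` wrap this whole pipeline behind a few lines of Python. The sketch below assumes the library is installed and uses one public Stable Diffusion checkpoint as an example; any compatible checkpoint would work.

```python
# Sketch of the full prompt-to-image pipeline using Hugging Face's
# `diffusers` library. The model ID is one public Stable Diffusion
# checkpoint, used here purely as an example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = ("a bear sitting on a small paddle boat on a calm lake, "
          "beautiful sunset, mountains on the horizon")
image = pipe(prompt).images[0]  # text encoder -> latent -> diffusion -> pixels
image.save("bear_on_boat.png")
```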
And this is what the sample description of the bear on a boat looks like when generated by AI. The following are two variations of the same prompt.