# Current image generation techniques

**The diffusion model concept**

### Step 1: Imagine <a href="#id-0a7b" id="id-0a7b"></a>

I’m going to describe something to you and I want you to imagine it in your mind as I describe it:

*Imagine a very calm lake with a beautiful sunset and mountains on the horizon. And on the peaceful lake is a small paddle boat.*\
*Now imagine that sitting on the boat… is a bear.*

I’m guessing you have never seen a bear on a boat before, but your mind was still able to imagine it. You could do that because you know what a bear looks like, you know what a boat and a sunset look like, and you also know *what those objects look like when they are combined.*

You have an internal **dataset** of a vast amount of memories, experiences, and gained knowledge — an understanding of how objects and things relate to each other.

The AI model starts with a dataset of image/caption pairs. For example, LAION-5B is a dataset of 5.85 billion image-text pairs.

Deep learning algorithms train on these image-text pairs (the training data) and learn associations between them. The result is a ***multi-dimensional latent space***: a purely mathematical space in which related objects form clusters and the distance between two points measures how similar they are.

<figure><img src="https://miro.medium.com/max/1400/1*NVzFR6MCKV9JE8waEdPwGA.png" alt=""><figcaption><p>Vox: The text-to-image revolution, explained (screenshot from YouTube)</p></figcaption></figure>

The link between textual semantics and their visual representations is learned by a model called CLIP (Contrastive Language-Image Pre-training).

All images and their associated captions are passed through their respective encoders, mapping all objects to points in a latent space.

<figure><img src="https://miro.medium.com/max/1400/1*3MkibulrU-AxlJLVlJWFqQ.png" alt=""><figcaption><p>Vox: The text-to-image revolution, explained (screenshot from YouTube)</p></figcaption></figure>
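To make the idea of "points in a latent space" a bit more concrete, here is a minimal sketch of how matching an image to a caption could work once both are embedded. The three-dimensional vectors below are hand-picked and purely hypothetical; a real CLIP encoder learns vectors with hundreds of dimensions from data. The helper names `cosine_similarity` and `best_caption` are also just illustrative.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand for illustration only.
image_embeddings = {
    "bear photo": [0.9, 0.1, 0.0],
    "boat photo": [0.1, 0.9, 0.1],
    "sunset photo": [0.0, 0.2, 0.9],
}
caption_embeddings = {
    "a bear": [0.8, 0.2, 0.1],
    "a boat": [0.2, 0.8, 0.0],
    "a sunset": [0.1, 0.1, 0.9],
}

def best_caption(image_vector):
    """Pick the caption whose embedding points in the most similar direction."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(image_vector, caption_embeddings[caption]),
    )

matches = {name: best_caption(vector) for name, vector in image_embeddings.items()}
for name, caption in matches.items():
    print(f"{name} -> {caption}")
```

In this toy setup each photo lands closest to its own caption. Contrastive training aims for that same outcome at scale: matching image-text pairs are pulled together in the latent space, and mismatched pairs are pushed apart.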

### Step 2: Create <a href="#id-1475" id="id-1475"></a>

Now that you have imagined a visual image in your mind, the next step is to create it. Supposing you are an artist, you can draw it, paint it, compose a photo of it, or digitally create it using a tool like Photoshop.

The idea is to get what you imagined in your mind into some rendered art.

That’s the second part of what the AI model does. Once the metrics are in place, the next step is to translate points from the mathematical latent space into pixel space, which involves a generative process called **diffusion**.

Diffusion models work by gradually destroying training data with added Gaussian noise, and then learning how to get the data back by reversing this noising process. In other words, a trained diffusion model can generate coherent images starting from noise.

<figure><img src="https://miro.medium.com/max/1400/1*iqjIGKyihuJ5NxseutUtYw.png" alt=""><figcaption><p>image from Nvidia Diffusion <a href="https://developer.nvidia.com/blog/improving-diffusion-models-as-an-alternative-to-gans-part-1/">models explained</a></p></figcaption></figure>
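The noising half of that process is simple enough to sketch in a few lines. The example below is an illustration, not how any particular product implements it: it assumes the common formulation in which a noisy sample is `sqrt(alpha_bar) * image + sqrt(1 - alpha_bar) * noise`, with a linear noise schedule whose start and end values (`1e-4` and `0.02`) are an assumption not taken from this article. The "image" is just four made-up pixel values, and the function names are hypothetical.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean pixels with fresh Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    xt = [signal_scale * p + noise_scale * n for p, n in zip(x0, noise)]
    return xt, noise

def remove_noise(xt, noise, alpha_bar):
    """Undo the forward step when the noise is known.

    During generation the true noise is unknown, so a trained network
    has to predict it; here we reuse the real noise to show the algebra.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(p - noise_scale * n) / signal_scale for p, n in zip(xt, noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.2, -0.5, 0.8, 0.1]  # toy "image": four pixel values in [-1, 1]

slightly_noisy, _ = add_noise(image, alpha_bars[10], rng)
mostly_noise, noise = add_noise(image, alpha_bars[-1], rng)
recovered = remove_noise(mostly_noise, noise, alpha_bars[-1])

print("early step :", [round(p, 3) for p in slightly_noisy])
print("final step :", [round(p, 3) for p in mostly_noise])
print("recovered  :", [round(p, 3) for p in recovered])
```

Early steps leave the image mostly intact, while by the final step almost none of the original signal remains. The learning problem is to predict the noise well enough that the reversal works without being handed the answer, and practical samplers typically do this over many small steps rather than in one jump.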

So the final process from prompt to output looks like this:\
The prompt goes through the encoding process and is mapped to a point in the latent space, and the decoding process then uses diffusion to generate the image.

<figure><img src="https://miro.medium.com/max/1400/1*hohNf7BDW0Ln5UPb_GwTzA.png" alt=""><figcaption><p>High-level overview of the DALL-E 2 image-generation process, <a href="https://www.assemblyai.com/blog/how-dall-e-2-actually-works/">modified by AssemblyAI</a></p></figcaption></figure>

And this is what the sample description of the bear on a boat looks like when generated by AI.\
The following are two variations generated from the same prompt:

```
A photo of a bear on a paddle boat, on a calm lake, 
beautiful sunset and mountains, realistic, 
```

<figure><img src="https://miro.medium.com/max/1000/1*gkqnc0FhjO5cYkjyRNVayg.png" alt=""><figcaption><p>Made by Author on Midjourney</p></figcaption></figure>

<figure><img src="https://miro.medium.com/max/1000/1*qLtfNbhMy1_7UGHX9czkLw.png" alt=""><figcaption><p>Made by Author on Midjourney</p></figcaption></figure>

References:

* <https://medium.com/geekculture/a-simple-explanation-of-how-ai-creates-artwork-433272babcdb>
