
Basic foundations of SD



1. Introduction to Stable Diffusion

Diffusion models are machine learning models that are trained to denoise random Gaussian noise step by step in order to obtain a sample of interest, such as an image.

Diffusion models have a major downside: the denoising process is very expensive in both time and memory, which makes generation slow and memory-hungry. The main reason is that these models operate in pixel space, which becomes unreasonably expensive, especially when generating high-resolution images.

Stable Diffusion was introduced to solve this problem, as it builds on latent diffusion. Latent diffusion reduces the memory and computational cost by applying the diffusion process over a lower-dimensional latent space instead of the actual pixel space.

1.1. Latent Diffusion Main Components

There are three main components in latent diffusion:

The Autoencoder (VAE)

The autoencoder (VAE) consists of two main parts: an encoder and a decoder. The encoder converts an image into a low-dimensional latent representation, which becomes the input to the next component, the U-Net. The decoder does the opposite: it transforms the latent representation back into an image.

During latent diffusion training, the encoder produces the latent representations (latents) of the input images for the forward diffusion process. During inference, the VAE decoder converts the latents back into images.
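As a rough, minimal sketch of this round trip (assuming the HuggingFace diffusers library and the publicly released Stable Diffusion v1.5 weights; the 0.18215 scaling factor is the one used by the v1 checkpoints):

```python
# Minimal sketch of the VAE round trip, assuming diffusers and the SD v1.5 weights.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # -> (1, 4, 64, 64)
    latents = latents * 0.18215                       # scaling used during SD v1 training
    decoded = vae.decode(latents / 0.18215).sample    # back to (1, 3, 512, 512)
```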

The U-Net

Figure 2. The U-Net

The U-Net also consists of an encoder and a decoder, both composed of ResNet blocks. The encoder compresses the image representation into a lower-resolution representation, and the decoder decodes this lower-resolution representation back into a higher-resolution one.

To prevent the U-Net from losing important information while downsampling, short-cut (skip) connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder.

Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. These cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
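A single conditioned forward pass through this U-Net can be sketched as follows (again assuming diffusers and the SD v1.5 weights; the tensors are random stand-ins for real latents and text embeddings):

```python
# Minimal sketch of one conditioned U-Net forward pass (diffusers, SD v1 shapes assumed).
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

noisy_latents = torch.randn(1, 4, 64, 64)   # a noisy latent from the VAE encoder
timestep = torch.tensor([999])              # current diffusion timestep
text_embeddings = torch.randn(1, 77, 768)   # stand-in for the text encoder output

with torch.no_grad():
    # The text embeddings enter the U-Net through its cross-attention layers.
    noise_pred = unet(noisy_latents, timestep,
                      encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64]) -- the predicted noise residual
```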

The Text-Encoder

Figure 3. The text encoder

The text encoder transforms the input prompt, for example “A Pikachu fine dining with a view of the Eiffel Tower,” into an embedding space that can be understood by the U-Net. It is a simple transformer-based encoder that maps the sequence of input tokens to a sequence of latent text embeddings.
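As a minimal sketch, the tokenization and encoding step looks like this (assuming the transformers library and the CLIP checkpoint used by Stable Diffusion v1):

```python
# Minimal sketch: prompt -> sequence of text embeddings with CLIP (checkpoint assumed).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "A Pikachu fine dining with a view of the Eiffel Tower"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```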

It is important to use a good prompt to get the expected output. That is why prompt engineering has become such a popular topic. Prompt engineering is the practice of finding the words that steer the model towards producing output with the desired properties.

1.2. Why is Latent Diffusion Fast & Efficient

Latent diffusion is fast and efficient because its U-Net operates on a low-dimensional latent space, which reduces the memory and computational complexity compared with pixel-space diffusion. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8: an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space, so each spatial dimension is 8 times smaller and there are 8 × 8 = 64 times fewer spatial positions to process.
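A quick back-of-the-envelope check of that reduction (plain Python, using the shapes quoted above):

```python
# Spatial compression with a reduction factor of 8 (SD v1 shapes).
pixel_shape = (3, 512, 512)               # RGB image in pixel space
latent_shape = (4, 512 // 8, 512 // 8)    # (4, 64, 64) in latent space

spatial_ratio = (512 * 512) / (64 * 64)
print(latent_shape, spatial_ratio)        # (4, 64, 64) 64.0 -> 8 x 8 = 64x fewer positions
```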

1.3. Stable Diffusion During Inference

Figure 4. Stable Diffusion workflow during inference

First, the Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate a random latent image representation of size 64×64, while the text prompt is transformed into text embeddings of size 77×768 via CLIP’s text encoder.

Next, the U-Net iteratively denoises the random latent image representation while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Scheduler algorithms compute the predicted denoised image representation from the previous noise representation and the predicted noise residual.

Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, it is recommended to use one of the following:

PNDM scheduler (used by default)
DDIM scheduler
K-LMS scheduler

The denoising process is repeated around 50 times to progressively retrieve better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational autoencoder.
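Putting the pieces together, the inference loop can be sketched roughly as follows. This is a simplified version of what the diffusers StableDiffusionPipeline does internally; the model id is an assumption and classifier-free guidance is omitted for brevity:

```python
# Simplified denoising loop (no classifier-free guidance), assuming diffusers and SD v1.5.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

prompt = "A Pikachu fine dining with a view of the Eiffel Tower"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]        # (1, 77, 768)

# Latent seed: a random 64x64 latent image representation.
generator = torch.manual_seed(0)
latents = torch.randn((1, 4, 64, 64), generator=generator)
scheduler.set_timesteps(50)                                    # ~50 denoising steps
latents = latents * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t,
                          encoder_hidden_states=text_embeddings).sample
    # The scheduler computes the next (less noisy) latent from the noise residual.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the final latent into an image with the VAE decoder.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample               # (1, 3, 512, 512)
```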

What is Stable Diffusion?


It is similar to DALL-E 2 as it is a diffusion model that can be used to generate images from text prompts. As opposed to DALL-E 2, though, it is open source, with a PyTorch implementation and a pre-trained version available on HuggingFace. It is trained using the LAION-5B dataset. Stable Diffusion is composed of the following sub-models:

We have an autoencoder, trained with a combination of a perceptual loss and a patch-based adversarial objective. With it, we can encode an image into a latent representation and decode an image back from that representation.

Random noise is progressively applied to the embedding. A latent representation of the text prompt is obtained from CLIP, which aligns text representations with image representations.

We then use a U-Net, a convolutional network with ResNet blocks, to learn to denoise the diffused embedding. The textual information is injected through cross-attention layers throughout the network. The resulting denoised latent is then decoded into an image by the autoencoder’s decoder.
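In practice all of these sub-models come bundled together; a minimal end-to-end example, assuming the diffusers StableDiffusionPipeline and the v1.5 checkpoint, would be:

```python
# End-to-end text-to-image with the bundled pipeline (diffusers, model id assumed).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # requires a GPU

image = pipe("A Pikachu fine dining with a view of the Eiffel Tower").images[0]
image.save("pikachu.png")
```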

References:

https://pub.towardsai.net/getting-started-with-stable-diffusion-f343639e4931
The Illustrated Stable Diffusion