Basic foundations of SD
Diffusion models are machine learning models that are trained to denoise random Gaussian noise step by step in order to obtain a sample of interest, such as an image.
Diffusion models have a major downside: the denoising process is very expensive in both time and memory, which makes generation slow and memory-hungry. The main reason is that they operate in pixel space, which becomes unreasonably expensive, especially when generating high-resolution images.
Stable Diffusion was introduced to solve this problem by building on latent diffusion. Latent diffusion reduces the memory and computational cost by applying the diffusion process over a lower-dimensional latent space instead of the actual pixel space.
There are three main components in latent diffusion: an autoencoder (VAE), a U-Net, and a text encoder.
The autoencoder (VAE) consists of two main parts: an encoder and a decoder. The encoder converts images into a low-dimensional latent representation, which becomes the input to the next component, the U-Net. The decoder does the opposite and transforms the latent representation back into an image.
The encoder is used to get the latent representations (latents) of the input images for the forward diffusion process during latent diffusion training, while during inference the VAE decoder converts the latents back into images.
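As a hedged illustration of this round trip, the sketch below encodes an image to latents and decodes it back with the diffusers `AutoencoderKL`. The repository name `runwayml/stable-diffusion-v1-5` and the file `input.png` are assumptions for the example, not something prescribed above.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# Load the VAE that ships with Stable Diffusion v1.x checkpoints (repo name is an assumption).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Preprocess: PIL image -> float tensor in [-1, 1] with shape (1, 3, 512, 512).
image = Image.open("input.png").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
pixels = pixels.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    # Encode to the latent distribution and sample from it;
    # 0.18215 is the latent scaling factor used by Stable Diffusion's VAE.
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215
    print(latents.shape)  # torch.Size([1, 4, 64, 64])

    # Decode back to pixel space.
    decoded = vae.decode(latents / 0.18215).sample  # (1, 3, 512, 512), values roughly in [-1, 1]
```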
The U-Net also consists of both encoder and decoder parts, and both are comprised of ResNet blocks. The encoder compresses an image representation into a lower-resolution image, and the decoder decodes the lower resolution back into a higher-resolution image.
To prevent the U-Net from losing important information while downsampling, shortcut (skip) connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder.
Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
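To make the conditioning concrete, here is a minimal sketch of one U-Net forward pass with dummy inputs. The repository name is an assumption, and the random tensor stands in for real CLIP text embeddings; the point is only to show that the text enters through `encoder_hidden_states`, i.e. the cross-attention layers.

```python
import torch
from diffusers import UNet2DConditionModel

# Load the Stable Diffusion U-Net (repo name is an assumption; any SD v1.x checkpoint works).
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

# Dummy inputs: a noisy latent, a timestep, and text embeddings of shape (1, 77, 768).
noisy_latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([999])
text_embeddings = torch.randn(1, 77, 768)  # stand-in for real CLIP embeddings

with torch.no_grad():
    # The text embeddings condition the U-Net via its cross-attention layers.
    noise_pred = unet(noisy_latents, timestep, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64]) -- the predicted noise residual
```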
The text encoder transforms the input prompt, for example, “A Pikachu fine dining with a view of the Eiffel Tower,” into an embedding space that can be understood by the U-Net. It is a simple transformer-based encoder that maps the sequence of tokens to a sequence of latent text embeddings.
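For illustration, the sketch below runs this step with the CLIP tokenizer and text encoder from the transformers library (the repository name is an assumption for the sketch; the prompt is the example from the text).

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Tokenizer and text encoder used by Stable Diffusion v1.x (repo name is an assumption).
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")

prompt = "A Pikachu fine dining with a view of the Eiffel Tower"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```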
It is important to use a good prompt to get the expected output, which is why prompt engineering has become a trending topic. Prompt engineering is the practice of finding certain words that can trigger the model to produce output with certain properties.
The reason latent diffusion is fast and efficient is that its U-Net operates on a low-dimensional space, which reduces the memory and computational complexity compared to pixel-space diffusion. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8, so an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space: each spatial dimension shrinks by a factor of 8, giving 8 × 8 = 64 times fewer spatial positions to process.
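As a quick sanity check of these numbers, the arithmetic below reproduces the shape reduction (pure Python, no model needed). Because the channel count grows from 3 to 4, the total number of values shrinks by roughly 48×, while the spatial resolution shrinks by 64×.

```python
# Quick arithmetic check of the shapes quoted above.
pixel_shape = (3, 512, 512)   # RGB image in pixel space
latent_shape = (4, 64, 64)    # corresponding Stable Diffusion latent

spatial_reduction = (512 // 64) * (512 // 64)        # 8 * 8 = 64 fewer spatial positions
element_reduction = (3 * 512 * 512) / (4 * 64 * 64)  # ~48x fewer values overall

print(spatial_reduction, element_reduction)  # 64 48.0
```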
First, the Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size 64×64, while the text prompt is transformed into text embeddings of size 77×768 via CLIP’s text encoder.
Next, the U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Scheduler algorithms compute the predicted denoised image representation from the previous noise representation and the predicted noise residual.
Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, it is recommended to use one of the following:
PNDM scheduler (used by default)
DDIM scheduler
K-LMS scheduler
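As a small, hedged example of choosing a scheduler, the sketch below swaps the default PNDM scheduler of a loaded pipeline for DDIM using the diffusers API (the model repository name is an assumption).

```python
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a Stable Diffusion pipeline (repo name is an assumption; any SD checkpoint works)
# and replace its default PNDM scheduler with DDIM, reusing the existing scheduler config.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
```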
The denoising process is repeated around 50 times to step-by-step retrieve better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational autoencoder (VAE).
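Putting the pieces together, here is a condensed sketch of this denoising loop built from the diffusers components. The model repository name and seed are assumptions, and classifier-free guidance (which the standard pipeline also applies) is omitted for brevity, so treat this as a sketch of the procedure described above rather than the full pipeline.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumption: any SD v1.x checkpoint should work
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# 1. Text prompt -> (1, 77, 768) CLIP text embeddings.
prompt = "A Pikachu fine dining with a view of the Eiffel Tower"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

# 2. Latent seed -> random (1, 4, 64, 64) latents.
generator = torch.manual_seed(0)
latents = torch.randn((1, 4, 64, 64), generator=generator) * scheduler.init_noise_sigma

# 3. ~50 denoising steps: the U-Net predicts the noise residual,
#    the scheduler uses it to compute the next (less noisy) latents.
scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Decode the final latents with the VAE decoder (0.18215 is SD's latent scaling factor).
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample  # (1, 3, 512, 512), values roughly in [-1, 1]
```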
References: https://pub.towardsai.net/getting-started-with-stable-diffusion-f343639e4931
It is similar to DALL-E 2 in that it is a diffusion model that can generate images from text prompts. As opposed to DALL-E 2, though, it is open source, with a PyTorch implementation [1] and a pre-trained version on Hugging Face [2]. It is trained on the LAION-5B dataset [3]. Stable Diffusion is composed of the sub-models described above: the VAE, the U-Net together with a noise scheduler, and the CLIP text encoder.
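For completeness, here is a minimal sketch of running the pre-trained Hugging Face version end to end with the diffusers library. The repository name, dtype, and device are assumptions for the example, not something prescribed by the references.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pre-trained pipeline from the Hugging Face Hub and move it to a GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the text encoder, the denoising loop, and the VAE decoder.
image = pipe("A Pikachu fine dining with a view of the Eiffel Tower").images[0]
image.save("pikachu.png")
```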