Part A: Diffusion Playground

Here we explore the power of diffusion models by implementing several sampling procedures, which are basically different ways of running inference.
The diffusion model we use is DeepFloyd IF, with text embeddings from T5.
Let's get started!

Part 0: Setup and initials

Steps:
- Created a Hugging Face account and logged in.
- Accepted the license for the DeepFloyd model and generated a Hugging Face Hub token.
- Downloaded precomputed text embeddings to avoid GPU memory issues.
- Instantiated the stage_1 and stage_2 objects for text-to-image generation.
- Used provided prompts and experimented with different num_inference_steps values.

Below are results for different num_inference_steps values on the same prompt.
In these results, num_inference_steps on stage 1 does not noticeably affect quality, but it changes where the generated images land on the image manifold.
In contrast, num_inference_steps on stage 2 does affect the quality of the generated images:
the larger the value, the more detail. Compare images 1 and 2, or images 3 and 4, to get a feel for this.

Stage 1: 20, Stage 2: 20

Stage 1: 20, Stage 2: 100

Stage 1: 100, Stage 2: 20

Stage 1: 100, Stage 2: 100

Part 1: Sampling Loops for inference

The key part of this section is generating images with a pre-trained diffusion model.
The model is trained to denoise images following a noise schedule.
Its hallucinating nature means it reconstructs an image from pure noise only imperfectly, which is exactly what lets us generate new images that still lie on the image manifold.
The key to making generation work is a good sampling procedure that guides the model to denoise toward the right image at the right timestep.

1.1: Implementing the Forward Process

Steps:
- The forward process in diffusion models adds noise to a clean image to produce progressively noisier versions.
- The noisy image at timestep t is distributed as: \( q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) \mathbf{I}) \).
- We sample it using: \( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \, \epsilon \sim \mathcal{N}(0, \mathbf{I}) \).
- Downloaded the Berkeley Campanile image and resized it to 64x64 pixels as the test input.
- Used the alphas_cumprod variable to compute ᾱ_t values at timesteps t = 250, 500, 750.
- Implemented the forward(im, t) function to simulate the addition of noise at the specified timesteps.
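A minimal NumPy sketch of the forward(im, t) step listed above. The linear beta schedule and the random stand-in image are assumptions for illustration only; in the write-up, alphas_cumprod comes from the model's scheduler and the input is the Campanile photo.

```python
import numpy as np

# Assumption: a 1000-step linear beta schedule stands in for the
# real alphas_cumprod that ships with the pre-trained model.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward(im, t, rng=None):
    """Sample x_t ~ q(x_t | x_0) for a clean image `im` at timestep `t`."""
    rng = np.random.default_rng() if rng is None else rng
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)  # eps ~ N(0, I)
    return np.sqrt(abar_t) * im + np.sqrt(1.0 - abar_t) * eps

# Hypothetical stand-in for the 64x64 RGB test image, scaled to [-1, 1].
im = np.random.default_rng(0).uniform(-1.0, 1.0, size=(3, 64, 64))
noisy = {t: forward(im, t, np.random.default_rng(t)) for t in (250, 500, 750)}
```

As t grows, \( \bar{\alpha}_t \) shrinks, so less of the original image and more of the noise survives in x_t.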

Original

Noisy Image at t=250

Noisy Image at t=500

Noisy Image at t=750

1.2: Classic Denoising

The most classic way of denoising is Gaussian blurring!
So that's what we did here: just apply a Gaussian blur with kernel_size=9.
Of course, we can expect the result to be a blurry mess!
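For reference, here is a small NumPy sketch of a separable Gaussian blur with kernel_size=9. The value sigma=2.0 and the reflect padding are assumptions; the write-up only states the kernel size, and the original may well have used a library routine instead.

```python
import numpy as np

def gaussian_kernel1d(kernel_size=9, sigma=2.0):
    """Normalized 1D Gaussian kernel (sigma is an assumed value)."""
    ax = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(im, kernel_size=9, sigma=2.0):
    """Blur a single-channel (H, W) image with reflect padding."""
    k = gaussian_kernel1d(kernel_size, sigma)
    pad = kernel_size // 2
    padded = np.pad(im, pad, mode="reflect")
    # Blur along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
```

Blurring averages the noise away, but it averages the image details away just as happily, which is why the results look so soft.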

Noisy Image at t=250

Noisy Image at t=500

Noisy Image at t=750

Gaussian Denoised Image at t=250

Gaussian Denoised Image at t=500

Gaussian Denoised Image at t=750

1.3: One-Step Denoising

Here we use a pre-trained diffusion model to predict the noise in a noisy image.
Then we use the estimated noise to denoise the image in one step.
Given a default text embedding and the timestep of the noisy image, the diffusion model can predict the noise in the image.
We can expect the result to get worse as t grows, since it is very hard to tell the exact pattern of the noise in the t=750 image.
In fact, I can barely see what's in that t=750 image, which means this is a task even humans find hard.
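One-step denoising amounts to solving the forward equation from 1.1 for x_0. The sketch below assumes the noise estimate is already available (here the true noise, optionally perturbed, stands in for the UNet output), and abar_t = 0.3 is a made-up value.

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

def error_gain(abar_t):
    """Factor by which an error in eps_hat is scaled in the x_0 estimate."""
    return np.sqrt((1.0 - abar_t) / abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in clean image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3                                   # hypothetical schedule value
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

x0_exact = one_step_denoise(x_t, eps, abar_t)  # perfect noise estimate
x0_rough = one_step_denoise(x_t, eps + 0.1 * rng.standard_normal(eps.shape), abar_t)
```

Any mistake in the noise estimate is multiplied by \( \sqrt{(1 - \bar{\alpha}_t) / \bar{\alpha}_t} \), which grows as \( \bar{\alpha}_t \) shrinks. That is one way to see why the t=750 case is so much harder than t=250.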

Noisy Image at t=250

Noisy Image at t=500

Noisy Image at t=750

One-Step Denoised Image at t=250

One-Step Denoised Image at t=500

One-Step Denoised Image at t=750

I think it is already impressive that our model can recover the t=750 image to this degree!

1.4: Iterative Denoising

When your LLM fails to do a good reasoning job, how would you change your prompt?
Exactly! You would add a line saying "Please do this step by step".
Here, following the intuition and observation from task 1.3,
we can also say that the denoiser does a better job when the amount of noise we ask it to estimate is smaller!
So, what if we only make the noisy image a little less noisy at each step? Would the model do a better job?
The answer is YES! So here we are: we iteratively feed the model its own less noisy prediction as the input for the next step.
Essentially, we are interpolating between the current noisy image and the estimated clean image.
So it is also reasonable that letting the model take small but slightly more twisted paths, by adding a bit of randomness at each step, gives a better result.
The key formula we use for this strided "interpolation" is:
\( x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma \)
where \( t' < t \) is the next, less noisy timestep, \( \alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current estimate of the clean image, and \( v_\sigma \) is random noise.
We take a stride of 30, denoising from t=990 to t=0, to save some compute.
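A sketch of one strided update, written directly from the formula above. The function name denoise_step and the optional noise argument (standing in for \( v_\sigma \)) are illustrative choices, not the original implementation.

```python
import numpy as np

# Timesteps 990, 960, ..., 30, 0 (stride 30, as in the text).
strided_timesteps = list(range(990, -1, -30))

def denoise_step(x_t, x0_hat, abar_t, abar_prev, noise=None):
    """Move from timestep t to the less noisy t' (abar_prev = alpha-bar at t')."""
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    x_prev = coef_x0 * x0_hat + coef_xt * x_t
    # `noise` plays the role of v_sigma; omit it for a deterministic step.
    return x_prev if noise is None else x_prev + noise
```

Two sanity checks follow from the algebra: stepping to a noise-free timestep (abar_prev = 1) returns x0_hat itself, and a noise-free x_t = sqrt(abar_t) * x_0 is mapped to sqrt(abar_prev) * x_0.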

Noisy Image at t=690

Noisy Image at t=540

Noisy Image at t=390

Noisy Image at t=240

Noisy Image at t=90

Here is a comparison of the three approaches we've tried so far.
It's clear that iterative denoising recovers (hallucinates) more details.

Original Image

Iteratively Denoised Image

One-Step Denoised Image

Gaussian Denoised Image

1.5: Diffusion Model Sampling

Remember how I repeatedly used the word "hallucination"?
That's because whenever the pre-trained diffusion model is given a noisy image,
it is basically imagining what pattern of noise covers that image and makes it look like a noisy image at timestep t.
Imagining the pattern of noise is essentially imagining the details back into the image, based on experience, just like you and I do!
This gives our model the power of imagination, which is very useful when you force it to denoise an image of pure noise.
That's when the magic happens and the model literally generates a completely new image from nothing!
(I mean, we humans can also do this, typically with our eyes closed, and we call it day-dreaming!)
Now you should get a sense of how romantic it is for a diffusion model to be able to do this!

Interpolation from "nothingness" 1

Interpolation from "nothingness" 2

Interpolation from "nothingness" 3

Interpolation from "nothingness" 4

Interpolation from "nothingness" 5

1.6: Classifier-Free Guidance (CFG)

We can all see that the results don't look good enough. They are grayish, dull and foggy. Not real enough. How can we improve them?
Hmm, if only there were a way to tell our model: "Hey! Your current generation is not real enough. Make it MORE real."
Luckily, where there is a will, there is a way.
Recall that in the previous parts we always used "a high quality photo" as a "null prompt".
Well, "" is an even stronger "null prompt": it is more "null" than the high-quality-photo one.
So, with \( \epsilon_u \) the noise estimate for "" and \( \epsilon_c \) the noise estimate for "a high quality photo", we can use something similar to an extrapolation: \( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \),
which adds more weight to the "high quality photo" prompt,
effectively pushing the model toward images with more "high-quality-photo-ness".
Here we choose \( \gamma = 7\).
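In code, the CFG combination is a one-liner. The sketch below assumes the two noise estimates have already been produced by two passes through the model; cfg_noise is a hypothetical helper name.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Extrapolate from the unconditional estimate toward the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

With gamma = 0 this falls back to the unconditional estimate, with gamma = 1 to the conditional one, and any gamma > 1 overshoots past the conditional estimate, which is where the extra "realness" comes from.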

CFG Enhanced 1

CFG Enhanced 2

CFG Enhanced 3

CFG Enhanced 4

CFG Enhanced 5

1.7: Image-to-Image Translation

Knowing the power of injecting "realism" into the model with CFG, we can now ask it to "denoise + make more real" starting from a given noisy image.
This is called image-to-image translation (SDEdit, to be specific about our technique here). You can imagine that if there is too much noise, close to pure noise, the generated result should be close to those in 1.6, which are random "real" images.
But for images with less noise, the generated result can be pretty "real" and interesting.
Here we choose noise levels from [1, 3, 5, 7, 10, 20], which correspond to starting timesteps from 960 down to 390. The smaller the noise_level, the more noise is added. Sorry, but this naming is for implementation's sake.
And we can expect the generated results to become more reasonable and closer to the original image as the noise level goes up (that is, as the amount of noise goes down).
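The mapping from noise_level to starting timestep can be read off the strided schedule from 1.4, assuming noise_level is simply an index into that list (which matches the 960 to 390 range quoted above):

```python
# Strided schedule from 1.4: 990, 960, ..., 30, 0.
strided_timesteps = list(range(990, -1, -30))

noise_levels = [1, 3, 5, 7, 10, 20]
# noise_level is treated as an index into the schedule (assumption).
start_timesteps = {i: strided_timesteps[i] for i in noise_levels}
print(start_timesteps)
# → {1: 960, 3: 900, 5: 840, 7: 780, 10: 690, 20: 390}
```

A small index starts the denoising loop early in the schedule, where the image is almost pure noise; a large index starts late, where most of the original image is still visible.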

SDEdit with noise_level=1

SDEdit with noise_level=3

SDEdit with noise_level=5

SDEdit with noise_level=7

SDEdit with noise_level=10

SDEdit with noise_level=20

Original Image

1.7.1: SDEdit -- Editing Hand-Drawn and Web Images

Because generation is pulled toward "real" images, we expect the model to do some magic on hand-drawn images and random web images.

(Web) Silhouette of an elephant

SDEdit with noise_level=1

SDEdit with noise_level=3

SDEdit with noise_level=5

SDEdit with noise_level=7

SDEdit with noise_level=10

SDEdit with noise_level=20

Original Image

(Drawn) Strange Birddie

SDEdit with noise_level=1

SDEdit with noise_level=3

SDEdit with noise_level=6

SDEdit with noise_level=8

SDEdit with noise_level=9

SDEdit with noise_level=20

Original Image

(Drawn) Table with a Vase brimming with Poppies

SDEdit with noise_level=1

SDEdit with noise_level=3

SDEdit with noise_level=6

SDEdit with noise_level=9

SDEdit with noise_level=10

SDEdit with noise_level=20

Original Image

1.7.2: Inpainting

This can also be easily adapted to fill in holes by generating new content inside a masked region.
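The write-up does not spell out the inpainting rule, so the sketch below assumes the usual approach: after every denoising step, everything outside the mask is reset to a freshly noised copy of the original image, so only the masked region is actually generated. The convention mask = 1 for "generate here" and the helper name inpaint_constrain are assumptions.

```python
import numpy as np

def inpaint_constrain(x_t, x_orig, mask, abar_t, rng):
    """Keep generated content where mask == 1, noised original elsewhere."""
    eps = rng.standard_normal(x_orig.shape)
    # Noise the original image to the current timestep (forward process).
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t
```

Calling this after each iterative denoising update keeps the unmasked pixels consistent with the original image at every noise level, while the model is free to invent content inside the hole.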

Inpainted Campanile

Original Image

Mask

To Replace

Inpainted Image

Inpainted YannCake

Original Image

Mask

To Replace

Inpainted Image

Inpainted Coccinellidae

Original Image

Mask

To Replace

Inpainted Image

1.7.3: Text-Conditional Image-to-Image Translation

Aside from projecting the noisy image onto the "real" image manifold,
we can use other text embeddings for more interesting guidance.

Rocketizing Campanile with "a rocket ship" prompt

Text-Conditional noise_level=1

Text-Conditional noise_level=3

Text-Conditional noise_level=5

Text-Conditional noise_level=7

Text-Conditional noise_level=10

Text-Conditional noise_level=20

Original Image

Frogificating Poppy Vase Table with "a realistic photo of a frog" prompt

Text-Conditional noise_level=1

Text-Conditional noise_level=3

Text-Conditional noise_level=6

Text-Conditional noise_level=9

Text-Conditional noise_level=10

Text-Conditional noise_level=20

Original Image

Horseifying Strange Birddie with "a realistic photo of a horse" prompt

Text-Conditional noise_level=1

Text-Conditional noise_level=3

Text-Conditional noise_level=6

Text-Conditional noise_level=9

Text-Conditional noise_level=10

Text-Conditional noise_level=20

Original Image

1.8: Visual Anagrams

We are finally here and ready to make more magic happen.
How do we generate upside-down visual anagrams?
Essentially, we want an image that looks like A upright and like B when flipped.
This demand fits the capability of a diffusion model perfectly!
A model that only knows how to denoise by hallucinating can generate images that "look like" something.
The trick is easy and elegant: use two different text conditionings to get two noise estimates at each step,
and then average them (with CFG enhancement)! This gives you an image that looks like both texts.
For visual anagrams, the second estimate is computed on the flipped image and flipped back before averaging, which is essentially the same thing. Flipping is just for more fun! :D
The magic spell is:
\( \epsilon_1 = \text{UNet}(x_t, t, p_1) \)
\( \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \)
\( \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \)
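The three lines above translate almost literally into code. In this sketch, unet is any callable with the signature unet(x, t, prompt), and the toy stand-in at the bottom is purely hypothetical; images are assumed to be laid out as (C, H, W), so the vertical flip is along axis -2.

```python
import numpy as np

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright estimate for p1 with the flipped-back estimate for p2."""
    eps1 = unet(x_t, t, p1)
    # Flip the image, estimate noise for the second prompt, flip the estimate back.
    eps2 = np.flip(unet(np.flip(x_t, axis=-2), t, p2), axis=-2)
    return (eps1 + eps2) / 2.0

# Hypothetical stand-in "model": scales its input by a number used as the prompt.
toy_unet = lambda x, t, p: p * x
x_t = np.random.default_rng(0).standard_normal((3, 64, 64))
eps = anagram_noise(toy_unet, x_t, 500, 1.0, 3.0)
```

In the real pipeline each of eps1 and eps2 would itself be a CFG-combined estimate, as described in 1.6.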

Here are three sets of visual anagrams. These are the original images. The next three rows are upsampled versions, done as part of 1.9* Bells & Whistles.

"old man"

"campfire"

"bird"

"mountain"

"tower"

"giraffe"

Upsampled!

"an oil painting of an old man"

"an oil painting of people around a campfire"

"a Chinese ink painting of a bird"

"a Chinese ink painting of a mountain"

"a pencil sketch of a tower"

"a pencil sketch of a giraffe"

1.9: Hybrid Images

Similar to how we made hybrid images before, except that this time we guide generation with the low frequencies of the noise estimate from one text prompt and the high frequencies of the noise estimate from another.
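A sketch of how the two noise estimates might be combined: low-pass one, high-pass the other, and add them. The Gaussian low-pass with kernel_size=33 and sigma=2.0 is an assumption (the write-up does not give the filter), and the code handles a single (H, W) channel for brevity.

```python
import numpy as np

def low_pass(x, kernel_size=33, sigma=2.0):
    """Gaussian low-pass of an (H, W) array; filter parameters are assumed."""
    ax = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(x, kernel_size // 2, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def hybrid_noise(eps_low_prompt, eps_high_prompt):
    """Low frequencies from one prompt's estimate, high frequencies from the other's."""
    high = eps_high_prompt - low_pass(eps_high_prompt)
    return low_pass(eps_low_prompt) + high
```

Feeding the same estimate in for both prompts gives that estimate back unchanged, so the split only matters where the two prompts disagree.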

Original scale; these are all 64x64 images. Please look from afar.

"low_freq=skull; high_freq=waterfalls"

"low_freq=frog; high_freq=village"

"low_freq=chameleon; high_freq=bridge"

Upsampled to 256x256

"low_freq=skull; high_freq=waterfalls"

"low_freq=frog; high_freq=village"

"low_freq=chameleon; high_freq=bridge"

Part B: Diffusion Forge

Part 1: Training a Single-Step Denoising UNet

1.1: Implementing the UNet

The key is to implement the architecture shown below.
The first image is the macro architecture of the UNet.
The second shows the atomic operations, for ease of implementation.

Unconditional UNet Architecture

Atomic Operations

1.2: Using the UNet to Train a Denoiser

Since we want to train a UNet as a denoiser, we first need noisy data.
Here we use the MNIST dataset and add noise manually.
Here we show the noising process with \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\).
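A sketch of the noising step, assuming the usual additive form z = x + sigma * eps with eps ~ N(0, I); the write-up does not state the formula, and the random 28x28 array below is only a stand-in for an MNIST digit.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Return z = x + sigma * eps with eps ~ N(0, I) (assumed noising rule)."""
    return x + sigma * rng.standard_normal(x.shape)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
x = np.random.default_rng(1).uniform(0.0, 1.0, size=(28, 28))  # stand-in digit
# Re-seeding per call keeps eps fixed, so only sigma changes between images.
noisy = [add_noise(x, s, np.random.default_rng(0)) for s in sigmas]
```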

Adding Noise

1.2.1: Using the UNet to Train a Denoiser

Omitting the training code, here are the parameters we choose:
Objective: train denoising with \( \sigma = 0.5\)
Loss: Mean Squared Error
Epochs: 5
Batch Size: 256
Hidden Dimensions (D): 128
Optimizer: Adam with learning rate 0.0001

Training Loss of Unconditional UNet

In-Distribution-Test: Test the model on some images with \( \sigma = 0.5\).

In-Distribution-Test of Unconditional UNet after Epoch=1

In-Distribution-Test of Unconditional UNet after Epoch=5

1.2.2: Out-of-Distribution Testing

Test the model on some images with unseen noises: \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\)
We expect the results to get worse as \( \sigma \) moves away from 0.5.

Out-of-Distribution-Test of Unconditional UNet after Epoch=5

Part 2: Training a Diffusion Model

2.1: Time Conditional UNet (TCUNet)

The key is to implement the following architecture, which injects t as a condition!
We normalize t as t/num_ts before embedding it, so the model sees a value in [0, 1] rather than one of 300+ integers.
The new atomic operation we need here is an FCBlock, which adds some non-linearity to the time conditioning t.
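To make the conditioning path concrete, here is a NumPy sketch of normalizing t and pushing it through a small fully connected block. The Linear, GELU, Linear layout, the random weights, num_ts = 300 and the output width of 64 are all assumptions; the actual block is defined by the architecture figure.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class FCBlock:
    """Assumed layout: Linear -> GELU -> Linear, with untrained random weights."""
    def __init__(self, in_dim, out_dim, rng):
        self.w1 = rng.standard_normal((in_dim, out_dim)) * 0.1
        self.b1 = np.zeros(out_dim)
        self.w2 = rng.standard_normal((out_dim, out_dim)) * 0.1
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

num_ts = 300                            # assumed number of timesteps
t = np.array([1, 150, 300])
t_norm = (t / num_ts).reshape(-1, 1)    # shape (batch, 1), values in (0, 1]
t_emb = FCBlock(1, 64, np.random.default_rng(0))(t_norm)  # shape (batch, 64)
```

The resulting vector can then be broadcast over the spatial dimensions and combined with a feature map inside the UNet.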

Time-Conditional UNet Architecture (TCUNet)

New Atomic Operation: FCBlock

2.2: Training TCUNet

Omitting the training code, here are the parameters we choose:
Objective: train TCUNet to predict the noise given x_t and the timestep t.
Loss: Mean Squared Error between the true noise and noise_pred
Epochs: 20 (since the task is way harder than denoising)
Batch Size: 128
Hidden Dimensions (D): 64
Optimizer: Adam with learning rate 0.001 with an exponential learning rate decay scheduler
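Since the training code is omitted, here is a hedged sketch of how the loss for one batch could be computed. The linear beta schedule, num_ts = 300 and the model(x_t, t_norm) call signature are assumptions, and the optimizer step is left out.

```python
import numpy as np

num_ts = 300                                   # assumed number of timesteps
betas = np.linspace(1e-4, 0.02, num_ts)        # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def training_loss(model, x0, rng):
    """MSE between the true noise and the model's prediction for one batch."""
    n = x0.shape[0]
    t = rng.integers(1, num_ts + 1, size=n)    # one timestep in 1..num_ts per image
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t - 1].reshape(-1, 1, 1, 1)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    noise_pred = model(x_t, t / num_ts)        # t is normalized before conditioning
    return np.mean((noise_pred - eps) ** 2)
```

A model that always predicts zero noise scores a loss close to 1, the variance of the noise, so anything well below that means the network has learned something.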

TCUNet Training Algorithm

Training Loss of TCUNet

2.3: Sampling from TCUNet

The key is to implement the algorithm, which is similar to what we did in Part A.
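A sketch of such a sampling loop, assuming the same update rule as in Part A (section 1.4) applied with stride 1. The schedule, num_ts = 300, the noise scale sqrt(beta_t) and the model(x, t_norm) signature are assumptions rather than the original code.

```python
import numpy as np

num_ts = 300                                   # assumed number of timesteps
betas = np.linspace(1e-4, 0.02, num_ts)        # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def sample(model, shape, rng):
    """Start from pure noise and denoise step by step, from t = num_ts down to 1."""
    x = rng.standard_normal(shape)
    for t in range(num_ts, 0, -1):
        abar_t = alphas_cumprod[t - 1]
        abar_prev = alphas_cumprod[t - 2] if t > 1 else 1.0
        eps_hat = model(x, np.full(shape[0], t / num_ts))
        # Current estimate of the clean image.
        x0_hat = (x - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
        # No fresh noise on the final step.
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = (np.sqrt(abar_prev) * betas[t - 1] / (1.0 - abar_t)) * x0_hat \
            + (np.sqrt(alphas[t - 1]) * (1.0 - abar_prev) / (1.0 - abar_t)) * x \
            + np.sqrt(betas[t - 1]) * z
    return x
```

With a perfect noise predictor the loop lands exactly on the clean image at the last step, which makes for a handy sanity check before plugging in a trained network.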

TCUNet Sampling Algorithm

TCUNet Sampling After Epoch=1

TCUNet Sampling After Epoch=5

TCUNet Sampling After Epoch=10

TCUNet Sampling After Epoch=15

TCUNet Sampling After Epoch=20

2.4: Class Conditional UNet (CCUNet) + Training CCUNet

We see that the digits generated above look pretty much like human handwritten digits.
But there are still clearly some ghostly results that are not digits. Also, which digit gets generated is random!
We can't ask the model to generate a 9, for example!
So we need to add a class condition to the model to guide the generation.
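One common way to feed a class into the model is a one-hot vector, occasionally zeroed out during training so the network also learns an unconditional mode for use with CFG. The sketch below shows that encoding; the helper names and the dropout probability p_uncond are assumptions, not details taken from this write-up.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Encode integer digit labels as one-hot rows."""
    labels = np.asarray(labels)
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def drop_condition(c, p_uncond, rng):
    """Zero out each row of c with probability p_uncond (assumed training trick)."""
    keep = (rng.random(c.shape[0]) >= p_uncond).astype(c.dtype)
    return c * keep[:, None]

c = one_hot([9, 0, 3])                       # ask for a 9, a 0 and a 3
c_train = drop_condition(c, 0.1, np.random.default_rng(0))
```

At sampling time, passing one_hot([9]) requests a 9, while an all-zero vector corresponds to the unconditional estimate.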