Basics of Diffusion Models

I encourage my students to be familiar with this post, so that a lot of time would be saved when reading diffusion model-related literature. “[Pre]” stands for preliminary.

Forward (Diffusion) Process

Continuous-time View

Given a (forward) Stochastic Differential Equation (SDE)

\[\mathrm{d}x = f(x,t)\mathrm{d}t + g(t)\mathrm{d}w_t ,\]

where $f$ and $g$ are called the drift term and the diffusion term, respectively, and $w_t$ denotes Brownian motion. From the view of a specific data point (think of it as a particle), this SDE characterizes how it moves from $t=0$ to $t=T$: the drift term characterizes the deterministic part of the movement, while the diffusion term characterizes the stochastic part. Thus, given $X(0)\sim p_0$, this SDE determines $p_t, t\in[0, T]$, describing a stochastic process $\{X(t)\}, t\in[0,T]$ (the continuous-time horizon is normalized to $1$ later in this post).

$f$ and $g$ are design choices rather than learned components, unlike the encoder in a Variational Auto-Encoder (VAE).
[Pre] Brownian motion.
It is conventional to use $X(0)$ for the observed data, i.e., $p_0$ is the underlying data density. In contrast, the flow matching literature prefers denoting the data density by $p_1$.

Discrete-time View

From the Markov process view, the differential $\mathrm{d}x$ can be equivalently described by how to transform $X(t_{i-1})$ into $X(t_i)$, $i\in\{1,\ldots, T\}$. For simplicity, we denote $X(t_i)$ by $X_i$ from now on, when there is no ambiguity.

A popular transition kernel, as proposed in DDPM, is

\[q(x_t \vert x_{t-1}) = \mathcal{N}(x_t ; \sqrt{1-\beta_t} x_{t-1}, \beta_t I).\]

$\beta_t \in (0, 1)$ is a design choice called the noise schedule.
[Pre] Why the name “kernel”: in $p_{i}(x_i) = \int q(x_i \vert x_{i-1}) p_{i-1}(x_{i-1}) \mathrm{d}x_{i-1}$, $q$ is used as a “weight” or “operator” inside an integral to transform one function into another.

Generally, the transition kernel and the adopted noise schedule need to achieve two goals:

The ultimate noise distribution $p_T$ approaches an easy-to-sample distribution such as a normal distribution.
The transition $q_{i \vert 0}$ is easy-to-sample without simulating the trajectory from $0$ to $i$.

The above transition kernel leads to

\[\begin{aligned} q(x_i \vert x_0) &= \mathcal{N}\!\left(x_i ; \sqrt{\bar{\alpha}_i}x_0, (1-\bar{\alpha}_i)I\right), \\ \text{or equivalently}\quad x_i &= \sqrt{\bar{\alpha}_i}x_0 + \sqrt{1-\bar{\alpha}_i}\epsilon_i, \quad \epsilon_i \sim \mathcal{N}(\cdot ; 0, I). \end{aligned}\]

where $\bar{\alpha}_i = \prod_{j=1}^{i}(1-\beta_j)$. As $\beta_j \in (0, 1)$, $\bar{\alpha}_i$ decreases monotonically in $i$; with a suitable schedule and a large enough $T$, $\bar{\alpha}_T \approx 0$, and thus $p_{T}=\int q_{T \vert 0}p_0$ approaches the standard normal distribution.

[Pre] The weighted sum of two independent Gaussian random variables is also a Gaussian random variable. Specifically, given $W=aX + bY$ with $X$ and $Y$ independent Gaussian, $W\sim\mathcal{N}(\cdot; a\mu_X + b\mu_Y, a^2\sigma_{X}^2 + b^2\sigma_{Y}^2)$.
Deriving $\bar{\alpha}$ from $\beta$ by induction.
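
The closed form of $q(x_i \vert x_0)$ can be checked numerically. Below is a minimal sketch, assuming a scalar toy data point, the linear schedule with $T=1000$, and NumPy; the values of `x0`, `n`, and `i` are illustrative choices, not part of the derivation. It simulates the chain step by step and compares the empirical mean and variance with $\sqrt{\bar{\alpha}_i}x_0$ and $1-\bar{\alpha}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # beta_1, ..., beta_T (linear schedule)
alpha_bar = np.cumprod(1.0 - betas)       # alpha_bar_i = prod_{j<=i} (1 - beta_j)

x0 = 1.5        # a fixed scalar "data point" (toy value)
n = 50_000      # number of independent trajectories
i = 300         # step at which the two routes are compared

# Route 1: simulate the Markov chain x_{j-1} -> x_j for j = 1..i.
x = np.full(n, x0)
for j in range(i):
    x = np.sqrt(1.0 - betas[j]) * x + np.sqrt(betas[j]) * rng.standard_normal(n)

# Route 2: closed-form mean and variance of q(x_i | x_0).
mean_closed = np.sqrt(alpha_bar[i - 1]) * x0
var_closed = 1.0 - alpha_bar[i - 1]

print("mean:", x.mean(), "vs", mean_closed)
print("var: ", x.var(), "vs", var_closed)
```

The two routes should agree up to Monte Carlo error, which is why training never needs to simulate the trajectory from $0$ to $i$.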

Two Popular Instances

There are mainly two kinds of forward processes: variance preserving (VP) and variance exploding (VE).

VP is defined by the above $q(x_i \vert x_{i-1})$. When $X_{i-1}$ and $\epsilon_i$ are independent, as they are in practice, $\mathrm{Var}[X_i] = (1-\beta_i)\mathrm{Var}[X_{i-1}] + \beta_i \mathrm{Var}[\epsilon_i]$. If the variance of the data $X_0$ is one, the variance of $X_i, i\in\{0,\ldots,T\}$ is always one. That’s why this transition kernel is said to be VP.

VE instead considers the transition kernel

\[q(x_i \vert x_{i-1}) = \mathcal{N}(x_i ; x_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)I) ,\]

which, with the convention $\sigma_0 = 0$, leads to the transition

\[\begin{aligned} q(x_i \vert x_0) &= \mathcal{N}\!\left(x_i ; x_0, \sum_{j=1}^{i}(\sigma_j^2 - \sigma_{j-1}^2)I\right) = \mathcal{N}(x_i ; x_0, \sigma_{i}^2 I), \\ \text{or equivalently}\quad x_{i} &= x_{0} + \sigma_i \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(\cdot ; 0, I). \end{aligned}\]

If we let $\sigma_i$ increase with $i$, the variance of $X_i$ explodes. That’s why this transition kernel is said to be VE.

Here $\sigma_i$ is the design choice (like the above $\beta_t$), which increases with $i$ in practice.

It is worth noticing that people usually write

\[\begin{aligned} x_i &= \alpha_i x_{0} + \sigma_i \epsilon_i, \\ q(x_i \vert x_0) &= \mathcal{N}(x_i ; \alpha_i x_0, \sigma_{i}^2 I). \end{aligned}\]

as a general form. In summary, VP and VE instantiate this general form as the following table shows:

Case	Signal	Noise
General	$\alpha_i$	$\sigma_i$
VP	$\sqrt{\bar{\alpha}_i}=\sqrt{\prod_{j=1}^{i}(1-\beta_j)}$	$\sqrt{1 - \bar{\alpha}_i}=\sqrt{1-\prod_{j=1}^{i}(1-\beta_j)}$
VE	1	$\sigma_i$

Many kinds of noise schedules have been proposed. In this post, we only consider the linear schedule for VP:

\[\beta_i = \beta_{\text{min}} + \frac{i-1}{T-1}(\beta_{\text{max}} - \beta_{\text{min}}) ,\]

and the geometric/exponential schedule for VE:

\[\sigma_i = \sigma_{\text{min}}(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^{\frac{i-1}{T-1}} .\]

For the linear schedule, people often consider $\beta_{\text{min}} = 0.0001$ and $\beta_{\text{max}} = 0.02$, where $\sqrt{\bar{\alpha}_i}$ ranges from $\approx 1$ to $\approx 0.0064$ as $i$ increases from $1$ to $T=1000$. For the geometric/exponential schedule, people often consider $\sigma_{\text{min}}=0.01$ and $\sigma_{\text{max}}\in[50, 100]$.

[trick] Suppose a 32-by-32 RGB image has been normalized to $[0, 1]$; then $\sigma_{\text{max}}$ can be chosen to be close to the largest possible Euclidean distance between two such images, $\sqrt{32\times 32\times 3\times (1-0)^2}=\sqrt{3072}\approx 55.4$.
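
The two schedules and the quoted numbers can be reproduced with a few lines. This is a minimal sketch, assuming NumPy, $T=1000$, and $\sigma_{\text{max}}=50$ (one value from the range mentioned above); the variable names are illustrative.

```python
import numpy as np

T = 1000
i = np.arange(1, T + 1)

# Linear VP schedule: beta_i = beta_min + (i-1)/(T-1) * (beta_max - beta_min).
beta_min, beta_max = 1e-4, 0.02
betas = beta_min + (i - 1) / (T - 1) * (beta_max - beta_min)
sqrt_alpha_bar = np.sqrt(np.cumprod(1.0 - betas))

# Geometric VE schedule: sigma_i = sigma_min * (sigma_max/sigma_min)^((i-1)/(T-1)).
sigma_min, sigma_max = 0.01, 50.0
sigmas = sigma_min * (sigma_max / sigma_min) ** ((i - 1) / (T - 1))

print(sqrt_alpha_bar[0], sqrt_alpha_bar[-1])   # signal coefficient at i = 1 and i = T
print(sigmas[0], sigmas[-1])                   # sigma_min and sigma_max
print(np.sqrt(32 * 32 * 3))                    # the sqrt(3072) from the [trick] note
```

The signal coefficient starts near $1$ and ends near $0.0064$, while the VE noise level grows geometrically from $\sigma_{\text{min}}$ to $\sigma_{\text{max}}$.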

Connecting Continuous and Discrete Views

The goal is to express the drift and diffusion terms based on the specified (discrete) noise schedule. Specifically, for the VP case with the linear noise schedule, we treat the $\beta_i$ as $T$ values of a function $\beta(t)=\beta_{\text{min}} + t(\beta_{\text{max}} - \beta_{\text{min}})$ at the points $t_i \in [0, 1)$ with $\Delta t = t_i - t_{i-1} = \frac{1}{T}$. For the VE case with the geometric noise schedule, we treat the $\sigma_i$ as $T$ values of a function $\sigma(t) = \sigma_{\text{min}}(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^{t}$ at the points $t_i \in [0, 1)$ with $\Delta t = t_i - t_{i-1} = \frac{1}{T}$. Then, we express $f(x,t)$ and $g(t)$ in terms of $\beta(t)$ and $\sigma(t)$ for the VP case and the VE case, respectively.

To this end, we should notice that, infinitesimally, the drift term determines the mean value of the particle’s position, and the diffusion term determines the variance of the particle’s position.

For the VP case, $\mathbb{E}[x_{t+\Delta t} - x_t \vert x_t] = (\sqrt{1-\beta(t)} - 1)x_t$ and $\mathrm{Var}[x_{t+\Delta t} - x_t \vert x_t] = \beta(t) I$. To align the variance with the corresponding SDE, we get $g(t)^2 \Delta t = \beta(t)$, i.e., $g(t)^2 = T\beta(t)$ since $\Delta t = \frac{1}{T}$. Substituting into the conditional mean, we get $f(x,t)=\lim_{\Delta t\rightarrow 0}\frac{(\sqrt{1-g(t)^2 \Delta t}-1)x}{\Delta t}=-\frac{1}{2}g(t)^2 x$. Thus, the corresponding VP SDE can be expressed as follows

\[\mathrm{d}x = -\frac{1}{2}T\beta(t)x\mathrm{d}t + \sqrt{T\beta(t)}\mathrm{d}w ,\]

[Pre] The quadratic variation of Brownian motion.
To get the transition $q(x_t \vert x_0)$ from the SDE, as in the discrete case, one might at first glance derive $\bar{\alpha}(t)$ from $1 - \bar{\alpha}(t) = \int_{0}^{t}T\beta(\tau)\mathrm{d}\tau$. However, this is wrong: although the Brownian noise injected at time $\tau$ contributes variance $T\beta(\tau)\mathrm{d}\tau$, that noise is subsequently damped by the VP drift. Instead, $1 - \bar{\alpha}(t) = 1 - e^{-\int_{0}^{t}T\beta(\tau)\mathrm{d}\tau}$ (which can be derived from the SDE’s solution or as the limit of the discrete product).
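
A quick numerical comparison illustrates the point. The sketch below assumes the linear schedule with $T=1000$ and evaluates everything at $t=1$: the discrete product $\prod_i (1-\beta_i)$, the exponential form $e^{-\int_0^1 T\beta(\tau)\mathrm{d}\tau}$, and the naive integral. The variable names are illustrative.

```python
import numpy as np

T = 1000
beta_min, beta_max = 1e-4, 0.02
betas = np.linspace(beta_min, beta_max, T)

# Discrete view: alpha_bar_T = prod_i (1 - beta_i).
alpha_bar_discrete = np.prod(1.0 - betas)

# Continuous view: integral_0^1 T * beta(tau) dtau for the linear beta(t),
# evaluated in closed form.
integral = T * (beta_min + 0.5 * (beta_max - beta_min))
alpha_bar_continuous = np.exp(-integral)

# The naive guess 1 - alpha_bar = integral would give a noise variance far above 1.
naive_noise_var = integral

print(alpha_bar_discrete, alpha_bar_continuous)
print(1.0 - alpha_bar_continuous, naive_noise_var)
```

The discrete and exponential values agree closely (they differ only through the second-order terms of $\log(1-\beta_i)$), whereas the naive integral is about $10$ and cannot be the variance of a VP process.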

For the VE case, $\mathbb{E}[x_{t+\Delta t} - x_t \vert x_t] = 0$ and $\mathrm{Var}[x_{t+\Delta t} - x_t \vert x_t] = (\sigma(t+\Delta t)^2 - \sigma(t)^2) I$. Thus, $f(x,t)=0$ and $g(t)^2 \Delta t= \sigma(t+\Delta t)^2 - \sigma(t)^2$, which implies $g(t)^2=\lim_{\Delta t \rightarrow 0}\frac{\sigma(t+\Delta t)^2 - \sigma(t)^2}{\Delta t}=\frac{\mathrm{d} \sigma^2(t)}{\mathrm{d} t}$. Thus, the corresponding VE SDE can be expressed as follows

\[\mathrm{d}x = \sqrt{\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}}\mathrm{d}w .\]
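
As a worked instance, assuming the geometric schedule $\sigma(t) = \sigma_{\text{min}}(\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}})^{t}$ introduced above, the diffusion coefficient has a closed form:

\[\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t} = 2\sigma(t)\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t} = 2\sigma(t)^2 \log\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}, \quad\text{so}\quad g(t) = \sigma(t)\sqrt{2\log\frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}} .\]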

Reverse (Denoising) Process

By the diffusion time-reversal theorem, the time reversal of a diffusion process is also an SDE:

\[\mathrm{d}x = [f(x,t) - g(t)^2 \nabla_x \log{p_t (x)}]\mathrm{d}t + g(t)\mathrm{d}\bar{w}_t ,\]

where time runs from $T$ to $0$, namely $\mathrm{d}t < 0$. Once we know the score function $\nabla_x \log{p_t (x)}$, we can go from $X(T)$ to $X(0)$ by simulating this time-reversal SDE.

The reverse SDE shares the same marginal distributions as the forward SDE, yet indexed in reversed order.
The difference in the drift term is the appearance of the score function. This term is necessary because reversing a diffusion is not achieved by simply negating the drift. Intuitively, the forward process spreads probability mass, so the reverse process must know where the density is high.
[Pre] Time-reversed Brownian motion (just the same as vanilla Brownian motion, with $\text{Std}[\mathrm{d}\bar{w}] = \sqrt{-\Delta t}$ since $\Delta t < 0$).

Denote the transition by $q(x_t \vert x_0) = \mathcal{N}(x_t ; \alpha_t x_0, \sigma_t^2 I)$, which is general enough to include both VP and VE. By Tweedie’s formula:

\[\nabla_{x_t}\log{p_{t}(x_t)} = \frac{\alpha_{t}\mathbb{E}[x_0 \vert x_t] - x_t}{\sigma_{t}^2} .\]

The derivation of Tweedie’s formula is mainly based on Fisher’s identity:

\[\nabla_{x}\log{p(x)} = \mathbb{E}[\nabla_{x}\log{p(x \vert z)} \vert x] = \int p(z \vert x) \nabla_{x}\log{p(x \vert z)} \mathrm{d}z ,\]

where $x$ and $z$ denote the observable and latent variables, respectively.

Intuitively, marginal score = posterior average of conditional scores.
[Pre] The proof sketch is to first differentiate the marginal density $\nabla_{x}p(x) = \nabla_x \int p(x \vert z) p(z) \mathrm{d} z$ and then divide both sides by $p(x)$.
[Pre] To use it for deriving Tweedie’s formula, we should notice that here the conditional score is $\nabla_{x_t}\log{p(x_t \vert x_0)} = -\frac{x_t - \alpha_t x_0}{\sigma_{t}^2}$.
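
Tweedie’s formula can be sanity-checked in a case where everything is available in closed form. The sketch below assumes one-dimensional Gaussian data $p_0 = \mathcal{N}(m, s^2)$, for which both the marginal $p_t$ and the posterior mean $\mathbb{E}[x_0 \vert x_t]$ are known; the numbers `m`, `s`, `alpha_t`, and `sigma_t` are arbitrary illustrative values.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions).
m, s = 2.0, 0.5                 # data distribution p_0 = N(m, s^2)
alpha_t, sigma_t = 0.8, 0.6     # x_t = alpha_t * x_0 + sigma_t * eps

x_t = np.linspace(-3.0, 5.0, 9)

# Marginal p_t = N(alpha_t * m, alpha_t^2 s^2 + sigma_t^2), so its score is known.
var_t = alpha_t**2 * s**2 + sigma_t**2
score_exact = -(x_t - alpha_t * m) / var_t

# Posterior mean E[x_0 | x_t] for jointly Gaussian (x_0, x_t).
post_mean = m + (alpha_t * s**2 / var_t) * (x_t - alpha_t * m)

# Tweedie's formula: score = (alpha_t * E[x_0 | x_t] - x_t) / sigma_t^2.
score_tweedie = (alpha_t * post_mean - x_t) / sigma_t**2

print(np.max(np.abs(score_exact - score_tweedie)))
```

The two scores coincide up to floating-point error.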

For VP case, $\alpha_t$ equals the above $\sqrt{\bar{\alpha}_{t}}$ and $\sigma_{t}^{2}$ equals the above $1-\bar{\alpha}_{t}$. For VE case, $\alpha_t$ equals $1$ and $\sigma_{t}^{2}$ equals the above $\sigma_{t}^{2}$. Thus, we summarize their scores as follows:

Case	Signal	Noise	Score
General	$\alpha_i$	$\sigma_i$	$\frac{\alpha_{t}\mathbb{E}[x_0 \vert x_t] - x_t}{\sigma_{t}^2}$
VP	$\sqrt{\bar{\alpha}_i}=\sqrt{\prod_{j=1}^{i}(1-\beta_j)}$	$\sqrt{1 - \bar{\alpha}_i}=\sqrt{1-\prod_{j=1}^{i}(1-\beta_j)}$	$\frac{\sqrt{\bar{\alpha}_{t}}\mathbb{E}[x_0 \vert x_t] - x_t}{1-\bar{\alpha}_t}$
VE	1	$\sigma_i$	$\frac{\mathbb{E}[x_0 \vert x_t] - x_t}{\sigma_{t}^2}$

Training

Prediction

As Tweedie’s formula connects the score function to the conditional expectation $\mathbb{E}[x_0 \vert x_t]$, we can learn the latter instead.

Intuitively, this conditional expectation is the minimum mean-squared-error estimate of $X(0)$ given $X(t)$.

There are three common parameterizations: $x$-prediction, $\epsilon$-prediction, and $v$-prediction. Generally, the neural network is fed with $x_t$ and $t$ and asked to predict $x_0$, $\epsilon$, or $v$, respectively. Let’s denote the prediction by $\hat{x}_0$, $\hat{\epsilon}$, or $\hat{v}$, respectively. Informally, the velocity describes the movement along the trajectory $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $x_0$ and $\epsilon$ held fixed (caveat: SDE sample paths are not differentiable; Brownian motion has no ordinary velocity). Thus, the most general form is $v = \dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon$. A popular case is $x_t = (1-t)x_0 + t\epsilon$, where $\alpha_t = (1-t), \dot{\alpha}_t = -1$ and $\sigma_t = t, \dot{\sigma}_t = 1$ (like the flow matching model). In this case, $v$ would be $\frac{\partial x_t}{\partial t} = -x_0 + \epsilon$.

Then, the relationship between the score function and the prediction is as follows:

Case	Score from $\hat{x}_0$	Score from $\hat{\epsilon}$	Score from $\hat{v}$
General	$\frac{\alpha_{t}\hat{x}_0 - x_t}{\sigma_{t}^2}$	$\frac{-\hat{\epsilon}}{\sigma_{t}}$	$\frac{\hat{v}-\frac{\dot{\alpha}_t}{\alpha_t}x_t}{\frac{\dot{\alpha}_t}{\alpha_t}\sigma_{t}^2 -\sigma_t \dot{\sigma}_t}$
VP	$\frac{\sqrt{\bar{\alpha}_{t}}\hat{x}_0 - x_t}{1-\bar{\alpha}_t}$	$\frac{-\hat{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}$	$-x_t - \frac{\alpha_t}{\sigma_t}\hat{v}$
VE	$\frac{\hat{x}_0 - x_t}{\sigma_{t}^2}$	$\frac{-\hat{\epsilon}}{\sigma_{t}}$	$\frac{-\hat{v}}{\sigma_t \dot{\sigma}_t}$

For the VP case, notice that $\alpha_t^2 + \sigma_t^2 = 1$, so the pair can be re-parameterized as $\alpha_t = \cos{t}$ and $\sigma_t = \sin{t}$ (with $t$ acting as an angle in $[0, \pi/2]$). As a result, $\dot{\alpha}_t = -\sin{t} = -\sigma_t$ and $\dot{\sigma}_t = \cos{t} = \alpha_t$. Thus, $v_t = -\sigma_t x_0 + \alpha_t \epsilon$.
An important concept is the signal-to-noise ratio (SNR), where $\alpha_t x_0$ and $\sigma_t \epsilon$ stand for the signal and noise components, respectively, and $\text{SNR}(t) = \frac{\text{power of signal component}}{\text{power of noise component}}$. Average power is usually defined as a signal’s mean squared amplitude, e.g., if a voltage signal $v(t)$ is applied to a resistor $R$, the instantaneous power is $p(t) = v(t)i(t) = \frac{v(t)^2}{R}$, and thus the average power is $P = \frac{1}{R}\mathbb{E}[v(t)^2] \propto \mathbb{E}[v(t)^2]$. When the signal has zero mean, its mean squared amplitude equals its variance, which is why people in signal processing treat variance as power. With the general form of the diffusion process, we say $\text{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}$.
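
The three columns of the table can be cross-checked numerically. The sketch below assumes the trigonometric VP parameterization and plugs in the ground-truth $x_0$, $\epsilon$, and $v$ in place of network outputs, so all three routes should return the same value, namely the conditional score $-\epsilon/\sigma_t$. The function names are hypothetical helpers, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_from_x0(x_t, x0_hat, alpha, sigma):
    return (alpha * x0_hat - x_t) / sigma**2

def score_from_eps(eps_hat, sigma):
    return -eps_hat / sigma

def score_from_v(x_t, v_hat, alpha, sigma, d_alpha, d_sigma):
    r = d_alpha / alpha
    return (v_hat - r * x_t) / (r * sigma**2 - sigma * d_sigma)

# VP with alpha_t = cos(t), sigma_t = sin(t).
t = 0.7
alpha, sigma = np.cos(t), np.sin(t)
d_alpha, d_sigma = -np.sin(t), np.cos(t)

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = alpha * x0 + sigma * eps
v = d_alpha * x0 + d_sigma * eps

s_x = score_from_x0(x_t, x0, alpha, sigma)
s_e = score_from_eps(eps, sigma)
s_v = score_from_v(x_t, v, alpha, sigma, d_alpha, d_sigma)
s_v_vp = -x_t - (alpha / sigma) * v      # VP row of the table
snr = alpha**2 / sigma**2                # SNR(t) for this t

print(s_x, s_e, s_v, s_v_vp, snr)
```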

Objective

No matter which kind of prediction is adopted, we are allowed to choose any kind of loss, including $x$-loss, $\epsilon$-loss, and $v$-loss. Obviously, the ground-truth $x$, $\epsilon$, and $v$ are available via:

(1) Uniformly sampling $x_0$ from the given dataset;
(2) Sampling $t$ from the Uniform distribution over $[0,1]$ or sampling $i$ from the Uniform distribution over $\{1,\ldots,T\}$;
(3) Sampling $\epsilon\sim\mathcal{N}(\cdot ; 0, I)$;
(4) Computing $x_t \sim q(x_t \vert x_0)$ with designed noise schedule based on the above outcomes.
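
Steps (1)-(4) can be sketched as follows for the discrete VP case. This is a minimal illustration, assuming NumPy, the linear schedule with $T=1000$, and a random array standing in for a real dataset; `sample_training_batch` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def sample_training_batch(dataset, batch_size):
    """Return (x_t, i, x0, eps) following steps (1)-(4) for the VP case."""
    idx = rng.integers(0, len(dataset), size=batch_size)   # (1) uniform x_0
    x0 = dataset[idx]
    i = rng.integers(1, T + 1, size=batch_size)            # (2) uniform i in {1..T}
    eps = rng.standard_normal(x0.shape)                    # (3) Gaussian noise
    a = np.sqrt(alpha_bar[i - 1])[:, None]                 # signal coefficient
    s = np.sqrt(1.0 - alpha_bar[i - 1])[:, None]           # noise coefficient
    x_t = a * x0 + s * eps                                 # (4) x_t ~ q(x_t | x_0)
    return x_t, i, x0, eps

dataset = rng.standard_normal((256, 8))    # stand-in "dataset" of 8-dim vectors
x_t, i, x0, eps = sample_training_batch(dataset, batch_size=32)
print(x_t.shape, i.min(), i.max())
```

The tuple `(x_t, i)` is the network input, and `x0` and `eps` provide the regression targets.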

Each prediction can be converted into the others as follows (presented with the general form $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and defining $D_t = \alpha_t \dot{\sigma}_t - \dot{\alpha}_t \sigma_t$).

Target \ Prediction	$\hat{x}$	$\hat{\epsilon}$	$\hat{v}$
$x = x_0$	$\lVert x-\hat{x}\rVert$	$\lVert x-\frac{x_t - \sigma_t \hat{\epsilon}}{\alpha_t}\rVert$	$\lVert x - \frac{\dot{\sigma}_t x_t-\sigma_t\hat{v}}{D_t} \rVert$
$\epsilon$	$\lVert \epsilon - \frac{x_t - \alpha_t \hat{x}}{\sigma_t} \rVert$	$\lVert\epsilon-\hat{\epsilon}\rVert$	$\lVert \epsilon - \frac{-\dot{\alpha}_t x_t + \alpha_t\hat{v}}{ D_t } \rVert$
$v=\dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon$	$\lVert v - (\dot{\alpha}_t \hat{x} + \dot{\sigma}_t \frac{x_t - \alpha_t \hat{x}}{\sigma_t} ) \rVert$	$\lVert v - (\dot{\alpha}_t \frac{x_t - \sigma_t \hat{\epsilon}}{\alpha_t} + \dot{\sigma}_t \hat{\epsilon}) \rVert$	$\lVert v-\hat{v}\rVert$
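
The conversions between parameterizations amount to solving the two linear equations $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and $v = \dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon$. The sketch below assumes the flow-matching-like path $\alpha_t = 1-t$, $\sigma_t = t$ and uses ground-truth values in place of predictions to check that each conversion recovers the right quantity; the function names are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

def x0_from_eps(x_t, eps_hat, alpha, sigma):
    return (x_t - sigma * eps_hat) / alpha

def eps_from_x0(x_t, x0_hat, alpha, sigma):
    return (x_t - alpha * x0_hat) / sigma

def x0_from_v(x_t, v_hat, alpha, sigma, d_alpha, d_sigma):
    D = alpha * d_sigma - d_alpha * sigma
    return (d_sigma * x_t - sigma * v_hat) / D

def eps_from_v(x_t, v_hat, alpha, sigma, d_alpha, d_sigma):
    D = alpha * d_sigma - d_alpha * sigma
    return (-d_alpha * x_t + alpha * v_hat) / D

# Flow-matching-like path: alpha_t = 1 - t, sigma_t = t.
t = 0.3
alpha, sigma, d_alpha, d_sigma = 1.0 - t, t, -1.0, 1.0

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = alpha * x0 + sigma * eps
v = d_alpha * x0 + d_sigma * eps

x0_a = x0_from_eps(x_t, eps, alpha, sigma)
eps_a = eps_from_x0(x_t, x0, alpha, sigma)
x0_b = x0_from_v(x_t, v, alpha, sigma, d_alpha, d_sigma)
eps_b = eps_from_v(x_t, v, alpha, sigma, d_alpha, d_sigma)

print(np.allclose(x0_a, x0), np.allclose(eps_a, eps))
print(np.allclose(x0_b, x0), np.allclose(eps_b, eps))
```

For this particular path $D_t = 1$, so the $v$-based conversions reduce to $x_0 = x_t - t v$ and $\epsilon = x_t + (1-t) v$.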

Essentially, the objective to be minimized is

\[\mathcal{L}(\theta) = \mathbb{E}[\lVert\text{target} - \text{prediction}\rVert_{2}^2].\]

In all, the generative diffusion model can be trained via regression.

Another view is the evidence lower bound (ELBO). $X(0)$ corresponds to the observable variable and $X(i), i=1,\ldots,T$, are the latent variables. As directly maximizing the log-likelihood $\log{p_{\theta}(X(0)=x_0)}=\log{\int p_{\theta}(X(0)=x_0, X(1:T)=x_{1:T})\mathrm{d}x_1\cdots\mathrm{d}x_T }$ over $\theta$ is intractable, we maximize its ELBO instead:

\[\mathcal{L}_{\text{ELBO}}(\theta) = \mathbb{E}_{x_{1:T}\sim q(X(1:T) \vert X(0)=x_0)}\log{\frac{p_{\theta}(X(0),X(1:T))}{q(X(1:T) \vert X(0))}} ,\]

where the variational distribution $q$ has been defined as the forward (diffusion) process, namely, it has no learnable parameters. Given this design, diffusion models can be treated as a special kind of VAE that has a fixed hierarchical encoder to encode $X(0)$ into $X(1:T)$. As a result, there is no need for an “E-step” like the one used when fitting a Gaussian mixture model (GMM). Expanding the terms leads to

\[\mathcal{L}_{\text{ELBO}}(\theta) = \mathbb{E}_{x_{1:T}\sim q(X(1:T) \vert X(0)=x_0)} \{ \log{p_{\theta}(X(T))} + \log{ \frac{p_{\theta}(X(0) \vert X(1))}{q(X(1)\vert X(0))} } + \sum_{i=2}^{T}\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{q(X(i) \vert X(i-1), X(0))}} \},\]

where, in the expectation, the first term can be dropped from the optimization because $p_{\theta}(X(T))$ is chosen to be the corresponding easy-to-sample noise distribution rather than a learnable one. The objective then simplifies to:

\[\mathcal{L}_{\text{ELBO}}(\theta) = \mathbb{E}_{x_1 \sim q}[\log{ p_{\theta}(X(0) \vert X(1)) } - \log{q(X(1)\vert X(0))}] + \sum_{i=2}^{T}\mathbb{E}_{x_{i-1},x_i \sim q}[\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{q(X(i) \vert X(i-1), X(0))}}] ,\]

where the second term needs to sample variables at adjacent timesteps, e.g., by sampling $x_{i-1}\sim q_{(i-1) \vert 0}$ and $x_i \sim q_{i \vert (i-1)}$. To improve efficiency and reduce variance, it could be substituted by:

\[\begin{aligned} &\sum_{i=2}^{T}\mathbb{E}_{x_{i-1},x_i \sim q}\!\left[\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{q(X(i) \vert X(i-1), X(0))}}\right] \\ &= \sum_{i=2}^{T}\mathbb{E}_{x_{i-1},x_i \sim q}\!\left[\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{\frac{q(X(i-1) \vert X(i), X(0))q(X(i) \vert X(0))}{q(X(i-1)\vert X(0))}}}\right] \\ &= \sum_{i=2}^{T}\mathbb{E}_{x_{i-1}, x_i \sim q}\!\left[\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{q(X(i-1) \vert X(i), X(0))}} + \log{\frac{q(X(i-1) \vert X(0))}{q(X(i) \vert X(0))}}\right] \\ &= \sum_{i=2}^{T}\mathbb{E}_{x_i \sim q_{i \vert 0}}\!\left[\mathbb{E}_{x_{i-1}\sim q(X(i-1) \vert X(i), X(0))}\!\left[\log{\frac{p_{\theta}(X(i-1)\vert X(i))}{q(X(i-1) \vert X(i), X(0))}}\right]\right] + C \\ &= -\sum_{i=2}^{T}\mathbb{E}_{x_i \sim q_{i \vert 0}}\!\left[\text{KL}( q(X(i-1) \vert X(i), X(0)) \Vert p_{\theta}(X(i-1)\vert X(i)) )\right] + C, \end{aligned}\]

where $C$ collects the expectations of $\log{\frac{q(X(i-1) \vert X(0))}{q(X(i) \vert X(0))}}$, which do not involve $\theta$; we only need to care about the log-terms that involve $\theta$.

The remaining step is to define $p_{\theta}(X(t-1) \vert X(t))$ in a form that makes the above terms tractable. By default, it is parameterized as $p_{\theta}(X(t-1) \vert X(t)) = q(X(t-1) \vert X(t), \hat{X}_{\theta}(0))$ since

\[q(X(t-1) \vert X(t), X(0)) = \mathcal{N}(\cdot ; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t I) ,\]

where $\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{1-\beta_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$ and $\tilde{\beta}_t = \frac{\beta_t (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$. Now the KL is between two Gaussians and thus has an analytic form.

[Pre] The rationale behind the above variance-reduction trick is the law of total variance: $\text{Var}(Z)=\mathbb{E}[\text{Var}(Z\vert S)] + \text{Var}(\mathbb{E}[Z\vert S])$. Here $(X(i), X(0))$ serve as the condition $S$. Understanding this decomposition involves the concepts of conditional expectation and the law of total expectation. Besides, this decomposition is analogous to the famous Rao-Blackwell theorem.
With $p_{\theta}(X(t-1) \vert X(t))$ defined in this way, the $t=1$ case becomes a Dirac delta at the predicted $\hat{X}_{\theta}(0)$ because $\tilde{\beta}_1 = 0$. Thus, it is often handled separately with a small fixed variance.
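
The posterior coefficients can be tabulated directly from the schedule. The sketch below assumes the linear schedule with $T=1000$ and the convention $\bar{\alpha}_0 = 1$; `posterior_mean` is a hypothetical helper. It also checks two facts: $\tilde{\beta}_1 = 0$, and that averaging $\tilde{\mu}$ over $x_t \sim q(x_t \vert x_0)$ gives $\sqrt{\bar{\alpha}_{t-1}}x_0$, the mean of $q(x_{t-1} \vert x_0)$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])   # alpha_bar_0 = 1

# Coefficients of x_0 and x_t in the posterior mean, and the posterior variance.
coef_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
coef_xt = np.sqrt(1.0 - betas) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
beta_tilde = betas * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)

def posterior_mean(x_t, x0, t):
    """Mean of q(x_{t-1} | x_t, x_0) for a 1-based step index t."""
    return coef_x0[t - 1] * x0 + coef_xt[t - 1] * x_t

# Since E[x_t | x_0] = sqrt(alpha_bar_t) x_0, the averaged posterior mean is
# (coef_x0 + coef_xt * sqrt(alpha_bar_t)) x_0.
lhs = coef_x0 + coef_xt * np.sqrt(alpha_bar)
rhs = np.sqrt(alpha_bar_prev)

print(beta_tilde[0], np.max(np.abs(lhs - rhs)))
```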

Sampling

Stochastic Path

For the discrete view, generation means gradually sampling from $q(X(i-1) \vert X(i), \hat{X}_{\theta}(0)), i=T,\ldots, 1$ as the above ELBO part presents.

For the continuous view, once we have learned the score function, sampling from $p_0$ can be implemented by sampling from $p_T$ and simulating the reverse SDE.

[Pre] Euler-Maruyama method, where the drift part is like the Euler method for ODE while the diffusion part takes $\sqrt{\Delta t}$ instead of $\Delta t$ as the stepsize.
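
Here is a minimal Euler-Maruyama sketch of the reverse VP SDE. To keep it self-contained, it assumes one-dimensional Gaussian data $p_0 = \mathcal{N}(m, s^2)$, for which the score of $p_t$ is known exactly and stands in for a trained network; $T\beta_{\text{min}} = 0.1$ and $T\beta_{\text{max}} = 20$ correspond to the linear schedule above with $T=1000$. All names and toy values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

m, s = 2.0, 0.5                  # toy data distribution p_0 = N(m, s^2)
B_min, B_max = 0.1, 20.0         # T * beta_min and T * beta_max for T = 1000

def B(t):                        # B(t) = T * beta(t) = g(t)^2
    return B_min + t * (B_max - B_min)

def alpha_bar(t):                # exp(-integral_0^t B(tau) dtau)
    return np.exp(-(B_min * t + 0.5 * (B_max - B_min) * t**2))

def score(x, t):                 # exact score of p_t = N(sqrt(ab) m, ab s^2 + 1 - ab)
    ab = alpha_bar(t)
    return -(x - np.sqrt(ab) * m) / (ab * s**2 + 1.0 - ab)

n_steps, n_samples = 1000, 20_000
dt = 1.0 / n_steps
x = rng.standard_normal(n_samples)          # x(1) ~ N(0, 1), approximately p_1

for k in range(n_steps, 0, -1):
    t = k * dt
    drift = -0.5 * B(t) * x - B(t) * score(x, t)      # f - g^2 * score
    noise = np.sqrt(B(t) * dt) * rng.standard_normal(n_samples)
    x = x - drift * dt + noise                        # step from t to t - dt

print(x.mean(), x.std())
```

The drift contribution scales with `dt` while the noise scales with `sqrt(dt)`, and the resulting samples should have mean close to `m` and standard deviation close to `s`.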

Deterministic Path

Another important aspect of sampling is the probability-flow (PF) ODE. The bridge that connects the SDE to the PF ODE is the Fokker-Planck equation:

\[\partial_t p_t = -\nabla\cdot(fp_t) + \frac{1}{2}g(t)^2 \Delta p_t ,\]

which is correct for any given $t\in(0, T)$ and for any given $x$ of the domain such as $x\in\Re^d$. For a specific time, $f$ is a vector field and $p_t$ is a scalar field. Therefore, $(fp_t)$ denotes a vector field weighting each $f(x,t)$ by $p_t(x)$, which is again a vector field. Applying the divergence operator to this field transforms it into a scalar field, reflecting the vector field’s net outflow of probability mass at each point. Applying the Laplacian operator to the scalar field $p_t$ gives a scalar field, reflecting whether $p_t (x)$ is larger than its local average.

[Pre] For $f:\Re^d\rightarrow \Re^d$, $\nabla\cdot f = \sum_{i=1}^{d}\frac{\partial f(x)_i}{\partial x_i}$, reflecting the local net outflow.
[Pre] For $f: \Re^d\rightarrow \Re$, $\Delta f = \sum_{i=1}^{d}\frac{\partial^2 f}{\partial x_i \partial x_i}$, which is also denoted as $\nabla^2$ because it is the divergence of the gradient of $f$. In single-variable calculus, the second derivative tells you the concavity (or “curvature”) of a curve. The Laplacian serves as the multivariable equivalent. $\Delta f$ measures whether $f(x)$ is above or below the local average of nearby values. If $\Delta f(x) > 0$, then $f(x)$ is lower than its infinitesimal local average, so diffusion tends to increase it. Take $\mathcal{N}(\cdot; 0, s^2)$ as an example to see the difference between $\vert x\vert < s$ and $\vert x\vert > s$.
The FP equation describes how an SDE moves probability densities, not just individual particles. Intuitively, at which $x$ does $p_t$ increase? The first term on the RHS is positive when the net outflow is negative, implying the drift term $f(x,t)$ pushes particles in the neighborhood toward this $x$. The second term on the RHS is positive when $p_t (x)$ is locally lower than its surroundings, in which case the diffusion behavior brings more probability mass to $x$.
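
To make the Gaussian example concrete, take the one-dimensional density $p(x) = \mathcal{N}(x; 0, s^2)$ and differentiate twice:

\[p'(x) = -\frac{x}{s^2}p(x), \qquad \Delta p(x) = p''(x) = \frac{x^2 - s^2}{s^4}p(x) .\]

Hence $\Delta p < 0$ for $\vert x\vert < s$, where the density sits above its local average and diffusion lowers it, and $\Delta p > 0$ for $\vert x\vert > s$, where diffusion raises it.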

Substituting $\Delta p_t = \nabla\cdot(\nabla p_t) = \nabla\cdot(p_t \nabla\log{p_t})$ into the Fokker-Planck equation, we get

\[\partial_t p_t = -\nabla\cdot[( f(x, t) - \frac{1}{2}g(t)^2\nabla\log{p_t})p_t] ,\]

which happens to be the continuity equation of the ODE:

\[\frac{\mathrm{d}x}{\mathrm{d}t} = v_t(x) = f(x, t) - \frac{1}{2}g(t)^2\nabla\log{p_t} ,\]

where the data points (think of them as particles) move according to this velocity field $v$.

[Pre] If particles move by $\frac{\mathrm{d}x}{\mathrm{d}t} = v_t(x)$, the density evolution is governed by the continuity equation $\partial_t p_t = -\nabla\cdot(p_t v_t)$.
Intuitively, $p_t v_t$ is the probability flux, so its divergence is the local net outflux. If this is negative, the density $p_t$ at that place increases. Otherwise, $p_t$ decreases.
If we know the velocity field $v_t$ that satisfies the continuity equation for a given density $p_t$, we can move the particles according to the ODE to achieve the given marginal densities $p_t$. That’s why this ODE is called the probability flow ODE.
Why can an ODE produce the same effect (i.e., the same marginal densities $p_t$) as the forward SDE? The Brownian motion term creates the diffusion effect, that is, it smooths the density, making local maxima lower and local minima higher. The score function points in the direction that increases the density, so $-\frac{1}{2}g(t)^2\nabla\log{p_t}$ pushes particles toward regions of locally lower density. It plays a role similar to that of the Brownian motion.

Taking the VP case as an example, the update is

\[x_{t-\Delta t} = x_{t} - (-\frac{1}{2}T\beta(t)x_{t} - \frac{1}{2}T\beta(t)\nabla_{x}\log{p_{t}(x_t)})\Delta t ,\]

which equals $x_{t} + \frac{1}{2}\beta(t)x_{t} + \frac{1}{2}\beta(t)\nabla_{x}\log{p_{t}(x_t)}$ when $\Delta t = \frac{1}{T}$, taking $T$ steps in $[0, 1]$.

Taking the VE case as an example, the update is

\[\begin{aligned} x_{t-\Delta t} &= x_{t} - \left(-\frac{1}{2}\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}\nabla_{x}\log{p_{t}(x_t)}\right)\Delta t \\ &= x_{t} + \frac{\mathrm{d}\sigma^2(t)}{2\mathrm{d}t}\nabla_{x}\log{p_{t}(x_t)}\Delta t \\ &\approx x_t + \frac{2\sigma(t)(\sigma(t) - \sigma(t-\Delta t))}{2\Delta t}\frac{-\hat{\epsilon}(t,x_t)}{\sigma(t)}\Delta t \\ &= x_t + (\sigma(t-\Delta t) - \sigma(t))\hat{\epsilon}(t, x_t). \end{aligned}\]

which becomes a DDIM-like sampler (keep in mind that the above equation considers time changes from $1$ to $0$).
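
The VE update above can be run end to end on a toy problem. The sketch below assumes one-dimensional Gaussian data $p_0 = \mathcal{N}(m, s^2)$, for which $\mathbb{E}[\epsilon \vert x_t] = \frac{\sigma(x_t - m)}{s^2 + \sigma^2}$ is known exactly and stands in for a trained $\hat{\epsilon}$ network, together with the geometric schedule; `eps_hat` and the toy values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

m, s = 2.0, 0.5                         # toy data distribution p_0 = N(m, s^2)
sigma_min, sigma_max, T = 0.01, 50.0, 1000
sigmas = sigma_min * (sigma_max / sigma_min) ** (np.arange(T) / (T - 1))

def eps_hat(x, sigma):
    # Exact E[eps | x_t] for p_t = N(m, s^2 + sigma^2); stands in for a network.
    return sigma * (x - m) / (s**2 + sigma**2)

x = sigma_max * rng.standard_normal(20_000)     # x_T ~ N(0, sigma_max^2)

# Deterministic update: x <- x + (sigma_prev - sigma_cur) * eps_hat(x, sigma_cur).
for k in range(T - 1, 0, -1):
    x = x + (sigmas[k - 1] - sigmas[k]) * eps_hat(x, sigmas[k])

print(x.mean(), x.std())
```

No noise is injected after the initial draw, so the final samples are a deterministic function of $x_T$; their mean and standard deviation should be close to `m` and `s`.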
