2. Why is this paper interesting?
Methodological aspects:
• Unified framework for score matching and diffusion models
• The framework suggests a way to improve both models
• New applications for neural ODEs/SDEs
Experimental aspects:
• SOTA FID score on CIFAR-10 = 2.20 (StyleGAN-ADA = 3.26)
• Scales to the 1024×1024 CelebA-HQ dataset
• Conditional generation with a post-hoc classifier (relatively small cost)
4. Score Matching
• Score matching matches the score $s(x) := \nabla_x \log p(x)$ of data and model
• Instead of computing the score of data, we use an alternative loss
• Theorem 1. The score matching objective has an equivalent form:
$$\frac{1}{2}\,\mathbb{E}_{p_\text{data}}\!\left[\lVert s_\theta(x) - s_\text{data}(x) \rVert_2^2\right] = \mathbb{E}_{p_\text{data}}\!\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \frac{1}{2}\lVert s_\theta(x) \rVert_2^2\right] + \text{const.}$$
• Choice of $s_\theta(x)$: One can define $s_\theta(x)$ as the gradient of an unnormalized log-density (i.e., an energy), or directly model it with a neural network
• Computation of the trace: One may use Hutchinson's estimator, $\operatorname{tr}(A) = \mathbb{E}_{v}\!\left[v^\top A v\right]$ for any $v$ with $\mathbb{E}[v v^\top] = I$ (e.g., standard Gaussian or Rademacher)
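To make this concrete, here is a minimal PyTorch sketch of the score matching objective with the trace term estimated by Hutchinson's estimator (the `score_net` interface and all names are my assumptions for illustration, not from the slides):

```python
import torch

def hutchinson_trace(sx, x, n_probes=1):
    """Estimate tr(∂sx/∂x) via Hutchinson: E_v[v^T J v], with E[v v^T] = I."""
    trace = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(x)
        # vector-Jacobian product: v^T (∂sx/∂x), without forming the Jacobian
        vjp, = torch.autograd.grad(sx, x, grad_outputs=v,
                                   create_graph=True, retain_graph=True)
        trace = trace + (vjp * v).flatten(1).sum(dim=1)
    return trace / n_probes

def score_matching_loss(score_net, x):
    """Implicit score matching: E[ tr ∇_x s_θ(x) + 0.5 ||s_θ(x)||² ]."""
    x = x.requires_grad_(True)
    sx = score_net(x)
    trace = hutchinson_trace(sx, x)
    sq_norm = 0.5 * sx.flatten(1).pow(2).sum(dim=1)
    return (trace + sq_norm).mean()
```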
4
5. Score Matching
• Proof of Theorem 1. It is sufficient to show that:
$$\mathbb{E}_{p_\text{data}}\!\left[-\,s_\text{data}(x)^\top s_\theta(x)\right]
= \sum_i \int -\,p_\text{data}(x)\,\frac{\partial \log p_\text{data}(x)}{\partial x_i}\, s_{\theta,i}(x)\, dx
= \sum_i \int -\,\frac{\partial p_\text{data}(x)}{\partial x_i}\, s_{\theta,i}(x)\, dx
= \sum_i \int p_\text{data}(x)\,\frac{\partial s_{\theta,i}(x)}{\partial x_i}\, dx
= \mathbb{E}_{p_\text{data}}\!\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big)\right]$$
• Expanding $\frac{1}{2}\lVert s_\theta - s_\text{data} \rVert_2^2$, this cross term yields the trace term, and $\frac{1}{2}\lVert s_\text{data} \rVert_2^2$ is the constant
• The last equality follows from integration by parts:
$$\int p'(x)\, f(x)\, dx = \big[p(x)\, f(x)\big]_{-\infty}^{\infty} - \int p(x)\, f'(x)\, dx$$
• and the assumption $p_\text{data}(x)\, s_\theta(x) \to 0$ as $x \to \pm\infty$
6. Noise Conditional Score Networks (NCSN)
• Limitations of the score matching:
1. Scores are not well-defined outside the data manifold
2. Score estimation is inaccurate in low-density regions
• Idea of NCSN: “Perturb the data with the noise of various magnitudes”
• Large noise facilitates learning of the scores in low-density regions
• At inference time, we anneal through the noise levels during sampling
• Concretely, let $\sigma_1 > \cdots > \sigma_L \approx 0$ and $q_\sigma(\tilde{x}) := \int p_\text{data}(x)\, q_\sigma(\tilde{x} \mid x)\, dx$
• Then, NCSN models the score functions of all noise levels as $s_\theta(x, \sigma)$
7. Noise Conditional Score Networks (NCSN)
• Training. (Denoising) score matching of $s_\theta(x, \sigma)$ and $q_\sigma(x)$ is given by:
$$\frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_\text{data}(x)}\!\left[\big\lVert s_\theta(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\rVert_2^2\right]$$
• where the score of the perturbation kernel is easily computed, e.g., $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2$ for $q_\sigma := \mathcal{N}(\tilde{x} \mid x, \sigma^2 I)$
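A minimal sketch of this denoising score matching loss for the Gaussian kernel, under an assumed `score_net(x, sigma)` interface and the $\lambda(\sigma) = \sigma^2$ weighting from the NCSN paper:

```python
import torch

def dsm_loss(score_net, x, sigmas):
    """Denoising score matching for q_σ(x̃|x) = N(x̃; x, σ²I);
    the target score is ∇_x̃ log q_σ(x̃|x) = -(x̃ - x)/σ² = -ε/σ."""
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx]                                # (batch,) noise levels
    s = sigma.view(-1, *([1] * (x.dim() - 1)))         # broadcastable shape
    eps = torch.randn_like(x)
    x_tilde = x + s * eps                              # perturbed data
    target = -eps / s
    pred = score_net(x_tilde, sigma)
    # per-level weighting λ(σ) = σ² keeps all noise levels on a similar scale
    return (0.5 * sigma**2 * ((pred - target) ** 2).flatten(1).sum(1)).mean()
```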
• Sampling. Run SGLD, starting from $\sigma_1$ and annealing to $\sigma_L \approx 0$
• Remark that $s_\theta(x, \sigma_1)$ is now well-estimated, hence SGLD gives a good initial point
Stochastic gradient Langevin dynamics (SGLD) = just gradient descent + some noise
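A sketch of the annealed Langevin (SGLD) sampler, along the lines of NCSN's Algorithm 1; `score_net`, the uniform initialization, and the default `T`, `eps` values are assumptions:

```python
import math
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, T=100, eps=2e-5):
    """Annealed SGLD: for each noise level σ_1 > ... > σ_L, run T Langevin
    steps x ← x + (α/2)·s_θ(x, σ) + √α·z with step size α = eps·σ²/σ_L²."""
    x = torch.rand(shape)                          # uniform init, as in NCSN
    sigma_L = float(sigmas[-1])
    for sigma in sigmas.tolist():                  # anneal large σ → small σ
        alpha = eps * (sigma / sigma_L) ** 2       # step size ∝ σ²
        sigma_batch = torch.full((shape[0],), sigma)
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma_batch) + math.sqrt(alpha) * z
    return x
```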
8. Noise Conditional Score Networks (NCSN)
• Choice of the hyperparameters. The theory suggests that
1. Initial noise level. Set $\sigma_1$ large, e.g., as large as the maximum pairwise distance $\max_{i,j} \lVert x^{(i)} - x^{(j)} \rVert_2$ between training data points
2. Other noise levels. Use a geometric schedule with ratio $\gamma := \sigma_{i-1}/\sigma_i$ satisfying $\Phi(\sqrt{2D}(\gamma - 1) + 3\gamma) - \Phi(\sqrt{2D}(\gamma - 1) - 3\gamma) \approx 0.5$ ($\Phi$: standard Gaussian CDF, $D$: data dimension); see the snippet after this list
3. Noise conditioning. Parametrize the score function as $s_\theta(x, \sigma) = s_\theta(x)/\sigma$
4. Selecting $T$ and $\epsilon$. Choose $T$ as large as affordable, then set $\epsilon$ so that the variance formula derived in the paper is $\approx 1$
5. Step size. Set $\alpha_i \propto \sigma_i^2$
• See the paper for details
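For instance, a geometric schedule between $\sigma_1$ and $\sigma_L$ can be built in one line (the default values are illustrative assumptions, not prescriptions from the slides):

```python
import math
import torch

def geometric_sigmas(sigma_1=50.0, sigma_L=0.01, L=232):
    """σ_1 > ... > σ_L in geometric progression (constant ratio γ = σ_{i-1}/σ_i)."""
    return torch.logspace(math.log10(sigma_1), math.log10(sigma_L), L)
```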
10. Denoising Diffusion Probabilistic Models (DDPM)
• Diffusion probabilistic models (DPM)
• DPM is a parametrized Markov chain whose forward (diffusion) and reverse processes $q$ and $p_\theta$ are defined as:
$$\text{Forward: } q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big)$$
$$\text{Reverse: } p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \quad p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$$
11. Denoising Diffusion Probabilistic Models (DDPM)
• Diffusion probabilistic models (DPM)
• Here, $\beta_t$ are usually pre-defined hyperparameters (they can also be learned)
• Then, $q$ admits a closed form: with $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$,
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t)\, I\big)$$
• Training. Due to the property above, the ELBO objective can be computed easily (KL divergences between Gaussians)
• Sampling. Apply the reverse process $p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$
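The closed form allows sampling $x_t$ directly from $x_0$ in one shot; a sketch using the DDPM paper's linear $\beta$ schedule (treat the exact values as assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear β schedule (DDPM paper)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # ᾱ_t = Π_{s≤t} α_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) = N(√ᾱ_t x_0, (1-ᾱ_t) I) in one shot."""
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps
```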
12. Denoising Diffusion Probabilistic Models (DDPM)
• Idea of DDPM: smart parametrization of $\mu_\theta$
• By rearranging the ELBO, $L_{t-1}$ reduces to a weighted $\ell_2$ distance between $\mu_\theta(x_t, t)$ and the forward posterior mean, where $x_t(x_0, \epsilon) = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ for $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
• Hence, we parametrize $\mu_\theta$ as
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right)$$
• where $\epsilon_\theta$ estimates $\epsilon$ (of the forward process) from $x_t$; the variance is simply set as $\Sigma_\theta = \sigma_t^2 \mathbf{I}$ for a constant $\sigma_t$ (use $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$)
• Training. Use the simplified objective
$$L_\text{simple} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\big) \big\rVert^2\right]$$
• Sampling. Resembles SGLD
• ⇒ DDPM resembles denoising score matching!
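A sketch of the simplified training objective and one reverse (sampling) step, reusing `betas`/`alpha_bars`/`q_sample` from the block above; the `eps_net(x, t)` interface is my assumption:

```python
import torch

def ddpm_simple_loss(eps_net, x0):
    """L_simple: predict the forward-process noise ε from x_t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return ((eps - eps_net(x_t, t)) ** 2).mean()

@torch.no_grad()
def ddpm_sample_step(eps_net, x_t, t):
    """One reverse step: x_{t-1} = (x_t - β_t/√(1-ᾱ_t)·ε_θ)/√α_t + σ_t·z."""
    beta, alpha, ab = betas[t], alphas[t], alpha_bars[t]
    eps_hat = eps_net(x_t, torch.full((x_t.shape[0],), t))
    mean = (x_t - beta / (1 - ab).sqrt() * eps_hat) / alpha.sqrt()
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta.sqrt() * z                  # the σ_t² = β_t choice
```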
13. Denoising Diffusion Probabilistic Models (DDPM)
• Experiments
• DDPM achieved the then-SOTA FID score on CIFAR-10
• It also provides an upper bound on the negative log-likelihood (NLL)
14. Generative Modeling via SDE
• Motivation
• NCSN and DDPM are discretizations of corresponding SDEs
• NCSN → the variance exploding (VE) SDE: $dx = \sqrt{\tfrac{d[\sigma^2(t)]}{dt}}\, dw$
• DDPM → the variance preserving (VP) SDE: $dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw$
• The key to success is perturbing the data with multiple noise scales
• Generalize $\{\sigma_i\}$ to an infinite number of noise scales $\sigma(t)$
• We consider the general (forward) form of the SDE:
$$dx = f(x, t)\, dt + g(t)\, dw$$
• Then, the reverse-time process is also an SDE (with $\bar{w}$ a reverse-time Brownian motion):
$$dx = \big[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\big]\, dt + g(t)\, d\bar{w}$$
• ⇒ We can generate samples via the reverse SDE, if the score $\nabla_x \log p_t(x)$ is given
$dw$: the increment of a Brownian motion (Wiener process), a stochastic-process generalization of the Gaussian distribution
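A sketch of sampling by Euler-Maruyama discretization of the reverse SDE in the VE case ($f \equiv 0$, $g(t)^2 = d[\sigma^2(t)]/dt$); `score_net` and `sigma_fn` are assumed interfaces:

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_net, shape, sigma_fn, N=1000, t_max=1.0):
    """Euler–Maruyama on dx = -g(t)² ∇_x log p_t(x) dt + g(t) dw̄ (VE SDE),
    integrated backward from t = t_max to t ≈ 0."""
    dt = t_max / N
    x = sigma_fn(torch.tensor(t_max)) * torch.randn(shape)  # prior ~ N(0, σ(T)² I)
    for i in range(N, 0, -1):
        t, t_prev = i * dt, (i - 1) * dt
        # g(t)² dt = d[σ²(t)] ≈ σ²(t) - σ²(t - dt)
        g2dt = sigma_fn(torch.tensor(t)) ** 2 - sigma_fn(torch.tensor(t_prev)) ** 2
        score = score_net(x, torch.full((shape[0],), t))
        x = x + g2dt * score + g2dt.sqrt() * torch.randn_like(x)
    return x
```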
15. Generative Modeling via SDE
• Training
• Extending NCSN, the (time-dependent) score is modeled as $s_\theta(x, t)$
• Then, the denoising score matching loss is:
$$\mathbb{E}_{t}\,\lambda(t)\,\mathbb{E}_{x(0)}\,\mathbb{E}_{x(t) \mid x(0)}\!\left[\big\lVert s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}\big(x(t) \mid x(0)\big) \big\rVert_2^2\right]$$
• For the (continuous versions of) NCSN and DDPM, the forward transition kernel
$$p_{0t}\big(x(t) \mid x(0)\big) = \mathcal{N}\big(x(t);\, \mathbf{m}_{x(0)}(t),\, \mathbf{\Sigma}(t)\big)$$
• is given in closed form (hence, no simulation is needed)
• Sampling. Interestingly, the reverse SDE permits several sampling methods
• (1) General-purpose solver, (2) MCMC, (3) convert to deterministic ODE
16. Generative Modeling via SDE
1. General-purpose SDE solver (a.k.a. predictor)
• Ancestral sampling (of DDPM) is one specific discretization of the reverse SDE
• Instead, one can use the same discretization as the forward process (the "reverse diffusion" sampler)
2. Score-based MCMC (a.k.a. corrector)
• As annealed SGLD (of NCSN), directly run MCMC using the score
• Combining both, the predictor-corrector sampler gets the best of both worlds
(Figure: results for DDPM and NCSN samplers) It slightly improves the performance of DDPM
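One predictor-corrector iteration might look as follows; `predictor_step` is any reverse-SDE discretization (e.g., the Euler-Maruyama step above), and the SNR-based corrector step size follows the Score SDE paper (the default `snr` value is an assumption):

```python
import torch

@torch.no_grad()
def pc_step(score_net, x, t, predictor_step, n_corrector=1, snr=0.16):
    """One predictor-corrector iteration: reverse-SDE step, then Langevin steps."""
    x = predictor_step(score_net, x, t)            # predictor: go from t to t - Δt
    t_batch = torch.full((x.shape[0],), t)
    for _ in range(n_corrector):                   # corrector: score-based MCMC
        grad = score_net(x, t_batch)
        z = torch.randn_like(x)
        # step size chosen from a target signal-to-noise ratio
        g_norm = grad.flatten(1).norm(dim=1).mean()
        z_norm = z.flatten(1).norm(dim=1).mean()
        alpha = 2 * (snr * z_norm / g_norm) ** 2
        x = x + alpha * grad + (2 * alpha).sqrt() * z
    return x
```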
17. Generative Modeling via SDE
3. Convert to deterministic ODE (a.k.a. probability flow)
• Every Itô process (the class of SDEs we consider) has a corresponding deterministic ODE whose trajectories induce the same evolution of densities
• Remark that neural ODEs can be used for continuous normalizing flows (CNF)
• Recall. A normalizing flow is an invertible generative model
• A CNF computes a trace instead of a determinant!
• ⇒ Can (1) compute exact likelihood and (2) manipulate the latent via the encoder!
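For reference, the probability flow ODE corresponding to the forward SDE $dx = f(x, t)\, dt + g(t)\, dw$ is:

```latex
% Probability flow ODE: same marginals p_t(x) as the SDE, but deterministic
\frac{dx}{dt} = f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)
```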
18. Generative Modeling via SDE
• Conditional generation
• With a pre-trained score function, one can perform conditional generation
using a post-hoc classifier (at relatively small cost)
• Let $y$ be a (time-invariant) condition on the data $x$; then, collect pairs $\{x(t), y\}$
• After training a (time-dependent) classifier $p_t(y \mid x(t))$, solve the reverse SDE with the conditional score (see the identity after this list)
• Applications: (1) class-conditional, (2) imputation, (3) colorization
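The reason a post-hoc classifier suffices is Bayes' rule applied to the scores:

```latex
% Conditional score = unconditional score + classifier gradient
\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)
```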
19. Generative Modeling via SDE
• Experiments. The practical advantages of SDE-based generative models are:
1. High-quality image generation via predictor-corrector sampler
2. Invertible model via ODE → exact likelihood and controllable latent
20. Generative Modeling via SDE
• Samples scale to 1024×1024 CelebA-HQ
21. Generative Modeling via SDE
• Experiments. The practical advantages of SDE-based generative models are:
3. Conditional generation with post-hoc classifier
• The score $s_\theta(x, t)$ is trained only once
22. Future Direction
• Towards faster generation
• Score-based models show promising generation results
• However, sampling often requires many (e.g., 1,000) iterations
• Denoising diffusion implicit models (DDIM), ICLR 2021 under review,
reduce sampling to 10-20 iterations
• DDIM combines the ideas of score-based models and GANs
• Combining the ideas of SDE and GAN would be an interesting direction!
23. References
• Hyvärinen. “Estimation of Non-Normalized Statistical Models by Score Matching”, JMLR 2005.
• Song & Ermon. “Generative Modeling by Estimating Gradients of the Data Distribution”, NeurIPS 2019.
• Song & Ermon. “Improved Techniques for Training Score-Based Generative Models”, NeurIPS 2020.
• Sohl-Dickstein et al. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, ICML 2015.
• Ho et al. “Denoising Diffusion Probabilistic Models”, NeurIPS 2020.
• Anonymous. "Score-Based Generative Modeling through Stochastic Differential Equations", ICLR 2021 under review.
• Chen et al. “Neural Ordinary Differential Equations”, NeurIPS 2018.
• Grathwohl et al. "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models", ICLR 2019.
• Song et al. “Denoising Diffusion Implicit Models”, ICLR 2021 under review.