https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
1. Santiago Pascual de la Puente
santi.pascual@upc.edu
PhD Candidate
Universitat Politecnica de Catalunya
Technical University of Catalonia
Deep Generative Models III:
PixelCNN/Wavenet
Normalizing Flows
#DLUPC
http://bit.ly/dlai2018
2. Outline
● What is in here?
● Introduction
● Taxonomy
● Variational Auto-Encoders (VAEs) (DGMs I)
● Generative Adversarial Networks (GANs) (DGMs II)
● PixelCNN/Wavenet (DGMs III)
● Normalizing Flows (and flow-based gen. models) (DGMs III)
○ Real NVP
○ GLOW
● Model comparison
● Conclusions
4. What is a generative model?
We have datapoints that emerge from some generating process (landscapes in nature, speech from people, etc.).
Given our dataset X = {x1, x2, …, xN} of example datapoints (images, waveforms, written text, etc.), each point xn lies in some M-dimensional space and comes from an M-dimensional probability distribution P(X) → Model it!
5. What is a generative model?
We can generate any type of data, like speech waveform samples or image pixels.
[Figure: a 32×32 RGB image (3 channels) is scanned and unrolled into a single vector: M = 32×32×3 = 3072 dimensions, i.e. M samples = M dimensions.]
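As a minimal illustration of this unrolling (not from the original slides), flattening a 32×32 RGB image with NumPy yields exactly this M = 3072-dimensional point:

```python
import numpy as np

# A random 32x32 RGB image standing in for one datapoint x_n.
image = np.random.rand(32, 32, 3)

# Scan and unroll: the image becomes a single point in an M-dim space.
x = image.reshape(-1)
print(x.shape)  # (3072,) -> M = 32 * 32 * 3 = 3072 dimensions
```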
6. What is a generative model?
Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples!
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
7. Taxonomy
We will see models that learn the probability density function:
● Explicitly (we impose a known loss function)
○ (1) With approximate density → Variational Auto-Encoders (VAEs)
○ (2) With tractable density & likelihood-based:
■ PixelCNN, Wavenet.
■ Flow-based models: Real-NVP, GLOW.
● Implicitly (we “do not know” the loss function)
○ (3) Generative Adversarial Networks (GANs).
8. Variational Auto-Encoder
We finally reach the Variational Auto-Encoder objective function:

log p(x) = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z)) + D_KL(q(z|x) ‖ p(z|x))

● log p(x): the log-likelihood of our data.
● D_KL(q(z|x) ‖ p(z|x)): not computable and non-negative → the (KL) approximation error.
● E_{q(z|x)}[log p(x|z)]: reconstruction loss of our data given the latent space → NEURAL DECODER reconstruction loss!
● D_KL(q(z|x) ‖ p(z)): regularization of our latent representation → the NEURAL ENCODER projects over the prior.
Credit: Kristiadi
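A minimal sketch of this objective as a training loss, assuming a diagonal-Gaussian encoder and a Bernoulli (sigmoid-output) decoder; the tensor names (recon_x, mu, logvar) are illustrative, not from the slides:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: decoder reconstruction loss plus the KL that
    projects the encoder posterior q(z|x) = N(mu, sigma^2) onto N(0, I)."""
    # Reconstruction term: -E_q[log p(x|z)] for a Bernoulli decoder.
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```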
10. Variational Auto-Encoder
Generative behavior
Q: How can we now generate new samples once the underlying generating distribution is learned?
A: We can sample from our prior (draws z1, z2, z3, …) and decode them, discarding the encoder path.
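A toy sketch of this generative behavior, with a stand-in decoder (in practice you would reuse your trained VAE decoder):

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained VAE decoder.
latent_dim, data_dim = 32, 3072
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, data_dim), nn.Sigmoid())

z = torch.randn(16, latent_dim)  # sample z from the prior N(0, I)
with torch.no_grad():
    x_new = decoder(z)           # generate: decoder only, encoder discarded
```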
11. GAN schematic
First things first: what is an encoder neural network? Some deep network that extracts high-level knowledge from our (potentially raw) data; we can use that knowledge to discriminate among classes, to score probabilistic outcomes, to extract features, etc.
[Figure: an encoder network maps an input x to a class score c; a generator network maps a latent z to a generated sample ẍ.]
Adversarial Learning = let’s use the encoder’s abstraction to learn whether the generator did a good job generating something realistic ẍ = G(z) or not.
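As a rough sketch of this adversarial game (assuming a discriminator with sigmoid outputs; names are illustrative):

```python
import torch
import torch.nn.functional as F

def d_loss(d_real, d_fake):
    # D learns to score real data as 1 and generated data ẍ = G(z) as 0.
    return F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

def g_loss(d_fake):
    # G tries to make D score its samples as real (non-saturating form).
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```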
13. Conditional GANs
GANs can be conditioned on extra information besides z: text, labels, speech, etc.
z might capture the random characteristics of the data (variabilities of plausible futures), whilst c would condition the deterministic parts!
For details on ways to condition GANs: Ways of Conditioning Generative Adversarial Networks (Wack et al.)
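One common conditioning scheme among those surveyed, sketched here with hypothetical shapes, is simply to concatenate z and c at the generator input:

```python
import torch

# Hypothetical shapes: z carries the random variability, c the condition.
z = torch.randn(8, 100)          # random latent draws
c = torch.randn(8, 128)          # e.g. a text/label/speaker embedding
g_input = torch.cat([z, c], 1)   # feed the concatenation to the generator
```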
14. Evolving GANs
Different loss functions in the Discriminator can fit your purpose:
● Binary cross-entropy (BCE): Vanilla GAN
● Least-Squares: LSGAN
● Wasserstein GAN + Gradient Penalty: WGAN-GP
● BEGAN: Energy based (D as an autoencoder) with boundary equilibrium.
● Maximum margin: Improved binary classification
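As a rough comparison of how the first three options change the discriminator objective (scores are raw logits/critic outputs; the gradient-penalty term of WGAN-GP is omitted):

```python
import torch
import torch.nn.functional as F

# Discriminator losses over raw scores s_real, s_fake.
def bce_d(s_real, s_fake):    # vanilla GAN
    return F.binary_cross_entropy_with_logits(s_real, torch.ones_like(s_real)) + \
           F.binary_cross_entropy_with_logits(s_fake, torch.zeros_like(s_fake))

def lsgan_d(s_real, s_fake):  # least-squares GAN
    return ((s_real - 1) ** 2).mean() + (s_fake ** 2).mean()

def wgan_d(s_real, s_fake):   # Wasserstein critic (gradient penalty omitted)
    return s_fake.mean() - s_real.mean()
```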
17. Factorizing the joint distribution
● Explicitly model the joint probability distribution of a data stream x as a product of element-wise conditional distributions, one per element xi of the stream (see the code sketch after this list).
○ Example: an image x of size (n, n) is decomposed by scanning pixels in raster mode (row by row, and pixel by pixel within every row).
○ Apply the probability chain rule, where xi is the i-th pixel in the image:
p(x) = p(x1) · p(x2 | x1) · … = ∏i p(xi | x1, …, xi−1)
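In code, this factorization means the log-likelihood is just a sum of per-pixel conditional log-probabilities; here the network outputs are faked with random logits for illustration:

```python
import torch

# log p(x) = sum_i log p(x_i | x_<i): one 256-way categorical per pixel,
# each conditioned only on previously scanned pixels (raster order).
pixels = torch.randint(0, 256, (3072,))   # an unrolled 32x32x3 image
cond_logits = torch.randn(3072, 256)      # stand-in for network outputs
log_probs = torch.log_softmax(cond_logits, dim=-1)
log_px = log_probs[torch.arange(3072), pixels].sum()
```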
18. Factorizing the joint distribution
● To model highly non-linear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model.
○ Q: How can we deal with this (which model)?
■ A: A Recurrent Neural Network.
20. PixelRNN
An RNN predicts the probability of each sample xi with a categorical output distribution: a Softmax.
[Figure: an RNN/LSTM at pixel position (x, y) consumes the previous pixels xi−4, …, xi−1 and outputs a distribution over xi.]
At every pixel in a greyscale image: 256 classes (8 bits/pixel).
Recall the RNN formulation: ht = f(ht−1, xt), e.g. ht = tanh(W·xt + U·ht−1 + b).
(van den Oord et al. 2016)
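A toy autoregressive sampling loop in the spirit of PixelRNN (all sizes and modules are illustrative; the real PixelRNN architecture is considerably more elaborate):

```python
import torch
import torch.nn as nn

# Toy sampler: the LSTM state summarizes x_<i; a softmax over 256
# classes (8 bits/pixel) emits each greyscale pixel x_i in turn.
rnn = nn.LSTMCell(input_size=256, hidden_size=128)
to_logits = nn.Linear(128, 256)

h = torch.zeros(1, 128)
c = torch.zeros(1, 128)
x_prev = torch.zeros(1, 256)              # all-zeros start token
pixels = []
for i in range(32 * 32):                  # a 32x32 greyscale image
    h, c = rnn(x_prev, (h, c))
    probs = torch.softmax(to_logits(h), dim=-1)
    xi = torch.multinomial(probs, 1)      # sample a pixel value in [0, 255]
    pixels.append(xi.item())
    x_prev = torch.zeros(1, 256)
    x_prev[0, xi.item()] = 1.0            # one-hot feedback of the sample
```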
21. Factorizing the joint distribution
● To model highly non-linear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model.
○ Q: How can we deal with this (which model)?
■ A: A Recurrent Neural Network.
■ B: A Convolutional Neural Network.
A CNN? Whut??
(van den Oord et al. 2016)²
23. My CNN is going causal...
Let’s say we have a sequence of 100-dimensional vectors describing words: “hi there how are you doing ?” → sequence length = 8.
We arrange it in 2D as a (sequence length = 8, 100 dim) matrix, and focus on 1D to exemplify causality.
24. My CNN is going causal...
We can apply a 1D convolutional activation over the 2D matrix, for an arbitrary kernel of width = 3. Each 1D convolutional kernel is then a 2D matrix of size (3, 100).
[Figure: kernel K1 slides along the (8, 100) sequence matrix.]
25. My CNN is going causal...
Keep in mind we are working with 100 dimensions, although here we depict just one for simplicity.
[Figure: a width-3 kernel (w1 w2 w3) slides over time-steps 1…8, producing outputs 1…6.]
The output length of the convolution is well known to be: seqlength − kwidth + 1 = 8 − 3 + 1 = 6. So the output matrix will be (6, 100), because there was no padding.
26. My CNN is going causal...
When we add zero padding, we normally do so on both sides of the sequence (as in image padding).
[Figure: the input 1…8 is padded with one zero on each side; the width-3 kernel now produces outputs 1…8.]
The output length of the convolution is: paddedlength − kwidth + 1 = 10 − 3 + 1 = 8. So the output matrix will be (8, 100), because we had padding.
27. My CNN went causal
Now add the zero padding just on the left side of the sequence, not symmetrically.
[Figure: two zeros are prepended to the input 1…8; the width-3 kernel produces outputs 1…8.]
The output length of the convolution is again: paddedlength − kwidth + 1 = 10 − 3 + 1 = 8, so the output matrix will be (8, 100) because we had padding.
HOWEVER: now every time-step t depends on the two previous inputs as well as the current time-step → every output is causal.
Roughly: we make a causal convolution by left-padding the sequence with (kwidth − 1) zeros.
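A minimal sketch of this recipe with PyTorch's Conv1d, left-padding with (kwidth − 1) zeros:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Causal 1D convolution: left-pad with (kwidth - 1) zeros so that output
# t only depends on inputs <= t. Input shape: (batch, 100 dims, length 8).
kwidth = 3
conv = nn.Conv1d(in_channels=100, out_channels=100, kernel_size=kwidth)

x = torch.randn(1, 100, 8)
x_padded = F.pad(x, (kwidth - 1, 0))  # pad left only, not symmetrically
y = conv(x_padded)
print(y.shape)                        # (1, 100, 8): same length, causal
```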
28. PixelCNN/Wavenet
We can simply exemplify how causal convolutions help us with the SoTA model for audio generation, Wavenet. We have a one-dimensional stream of samples of length T (e.g. audio): the receptive field has to be T.
(van den Oord et al. 2016)³
30. PixelCNN/Wavenet
The receptive field has to be T, and dilated convolutions help us reach a receptive field sufficiently large to emulate an RNN! (A sketch of a dilated stack follows below.)
Samples available online
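A rough sketch of such a dilated stack (channel sizes are illustrative; the per-layer causal left-padding is noted in the comments but omitted from the code):

```python
import torch.nn as nn

# Wavenet-style stack: exponentially growing dilation; each layer must be
# left-padded by dilation * (kwidth - 1) zeros to remain causal.
dilations = [1, 2, 4, 8, 16, 32, 64, 128]
layers = [nn.Conv1d(64, 64, kernel_size=2, dilation=d) for d in dilations]

# Receptive field = 1 + sum(d * (kwidth - 1)) over the stack:
receptive_field = 1 + sum(d * (2 - 1) for d in dilations)
print(receptive_field)  # 256 samples covered with only 8 layers
```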
31. Conditional PixelCNN/Wavenet
● Q: Can we make the generative model learn the natural distribution of X and perform a specific task conditioned on it?
○ A: Yes! We can condition each sample on an embedding/feature vector h, which can represent encoded text, speaker identity, generation speed, etc. (see the sketch below).
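A hedged sketch of one common way this conditioning is applied, following the Wavenet-style gated activation z = tanh(Wf∗x + Vf·h) ⊙ σ(Wg∗x + Vg·h); all sizes and module names here are assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes: C conv channels, conditioning vector h of dimension H.
C, H, T = 64, 128, 100
conv_f = nn.Conv1d(C, C, kernel_size=2)   # "filter" branch
conv_g = nn.Conv1d(C, C, kernel_size=2)   # "gate" branch
proj_f, proj_g = nn.Linear(H, C), nn.Linear(H, C)

x = torch.randn(1, C, T)   # input stream (causal padding omitted for brevity)
h = torch.randn(1, H)      # e.g. a speaker-identity embedding

# Gated activation: the projected h is broadcast over every time-step.
z = torch.tanh(conv_f(x) + proj_f(h).unsqueeze(-1)) * \
    torch.sigmoid(conv_g(x) + proj_g(h).unsqueeze(-1))
```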
33. Conditional PixelCNN/Wavenet
● PixelCNN produces very sharp and realistic samples, but...
● Q: Is this a computationally cheap generative model?
○ A: Nope. Take the example of Wavenet: it must run the forward pass 16,000 times, auto-regressively, to predict 1 second of audio at the standard 16 kHz sampling rate.
47. Planar Flow
We can apply this simple flow, f(z) = z + u·tanh(wᵀz + b), to an already-seen model to enhance its representational power though...
[Figure: a planar flow stacked on a latent z.]
We still sample z from N(0, I)!!
But if we had a somewhat more sophisticated flow formulation, we could directly build an end-to-end generative model!
62. GLOW Multi-scale structure
This is a very powerful structure: we can infer z features from an image, and we can sample an image from z features, EITHER WAY!
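This two-way property comes from coupling layers whose inverse is exact and cheap. A minimal affine-coupling sketch, where the small network `net` merely stands in for GLOW's actual convolutional blocks:

```python
import torch
import torch.nn as nn

def coupling_forward(x, net):
    x1, x2 = x.chunk(2, dim=-1)           # split dims in half
    log_s, t = net(x1).chunk(2, dim=-1)   # x1 parameterizes the transform
    return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1)

def coupling_inverse(y, net):
    y1, y2 = y.chunk(2, dim=-1)
    log_s, t = net(y1).chunk(2, dim=-1)   # y1 == x1, so s, t are recoverable
    return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=-1)

net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 4))
x = torch.randn(5, 4)
y = coupling_forward(x, net)
x_back = coupling_inverse(y, net)   # recovers x exactly (up to float eps)
```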
69. WaveGlow (Prenger et al. 2018)
A modified GLOW structure to deal with 1D audio data, giving SoTA results without the auto-regressive restriction.
Audio samples available online
80. Conclusions
● (Deep) generative models are built to learn the underlying structures hidden in our (high-dimensional) data.
● There are currently four main successful deep generative models: VAEs, GANs, PixelCNN and NF-based models.
● PixelCNN factorizes an explicitly known discrete distribution with the probability chain rule.
○ These are the slowest generative models at sampling time, owing to their recursive (auto-regressive) nature.
● VAEs, GANs, and NFs are capable of factorizing our highly complex PDFs into a simpler prior over z of our choice, which we can then explore and control to generate specific features.
● Once we have trained a VAE with the variational lower bound, we can generate new samples from our learned z manifold, assuming the most significant data lies over the prior → hence we like powerful priors (NFs can help in this matter).
● GANs use an implicit learning method to mimic our data distribution Pdata.
● GANs and NFs are the sharpest generators at the moment → very active research fields.
● GANs are the hardest generative models to train, owing to the intrinsic instabilities of the G-D synergies.
● NFs (Real-NVP based) require many steps to be effective in their task, consuming many resources at the moment.