https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
1. Santiago Pascual de la Puente
santi.pascual@upc.edu
PhD Candidate
Universitat Politecnica de Catalunya
Technical University of Catalonia
Deep Generative Models III:
PixelCNN/Wavenet
Normalizing Flows
#DLUPC
http://bit.ly/dlai2018
2. Outline
● What is in here?
● Introduction
● Taxonomy
● Variational Auto-Encoders (VAEs) (DGMs I)
● Generative Adversarial Networks (GANs) (DGMs II)
● PixelCNN/Wavenet (DGMs III)
● Normalizing Flows (and flow-based gen. models) (DGMs III)
○ Real NVP
○ GLOW
● Model comparison
● Conclusions
4. What is a generative model?
We have datapoints that emerge from some generating process (landscapes in nature, speech from people, etc.).
Given our dataset X = {x1, x2, …, xN} of example datapoints (images, waveforms, written text, etc.), each point xn lies in some M-dimensional space and comes from an M-dimensional probability distribution P(X) → Model it!
5. What is a generative model?
We can generate any type of data, like speech waveform samples or image pixels.
[Figure: a 32×32 RGB image (3 channels) is scanned and unrolled into a single vector: M = 32×32×3 = 3072 dimensions, i.e. M samples = M dimensions.]
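As a minimal illustration of this unrolling (not from the original slides), flattening a 32×32 RGB image with NumPy yields exactly this M = 3072-dimensional point:

```python
import numpy as np

# A random 32x32 RGB image standing in for one datapoint x_n.
image = np.random.rand(32, 32, 3)

# Scan and unroll: the image becomes a single point in an M-dim space.
x = image.reshape(-1)
print(x.shape)  # (3072,) -> M = 32 * 32 * 3 = 3072 dimensions
```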
6. What is a generative model?
Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples!
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
7. Taxonomy
We will see models that learn the probability density function:
● Explicitly (we impose a known loss function)
○ (1) With approximate density → Variational Auto-Encoders (VAEs)
○ (2) With tractable density & likelihood-based:
■ PixelCNN, Wavenet.
■ Flow-based models: Real-NVP, GLOW.
● Implicitly (we “do not know” the loss function)
○ (3) Generative Adversarial Networks (GANs).
8. Variational Auto-Encoder
We finally reach the Variational Auto-Encoder objective function:

log p(x) = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z)) + D_KL(q(z|x) ‖ p(z|x))

● log p(x): the log-likelihood of our data.
● D_KL(q(z|x) ‖ p(z|x)): not computable and non-negative → the (KL) approximation error.
● E_{q(z|x)}[log p(x|z)]: reconstruction loss of our data given the latent space → NEURAL DECODER reconstruction loss!
● D_KL(q(z|x) ‖ p(z)): regularization of our latent representation → the NEURAL ENCODER projects over the prior.
Credit: Kristiadi
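A minimal sketch of this objective as a training loss, assuming a diagonal-Gaussian encoder and a Bernoulli (sigmoid-output) decoder; the tensor names (recon_x, mu, logvar) are illustrative, not from the slides:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: decoder reconstruction loss plus the KL that
    projects the encoder posterior q(z|x) = N(mu, sigma^2) onto N(0, I)."""
    # Reconstruction term: -E_q[log p(x|z)] for a Bernoulli decoder.
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```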
10. Variational Auto-Encoder
Generative behavior
Q: How can we now generate new samples once the underlying generating distribution is learned?
A: We can sample from our prior (draws z1, z2, z3, …) and decode them, discarding the encoder path.
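A toy sketch of this generative behavior, with a stand-in decoder (in practice you would reuse your trained VAE decoder):

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained VAE decoder.
latent_dim, data_dim = 32, 3072
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, data_dim), nn.Sigmoid())

z = torch.randn(16, latent_dim)  # sample z from the prior N(0, I)
with torch.no_grad():
    x_new = decoder(z)           # generate: decoder only, encoder discarded
```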
11. GAN schematic
First things first: what is an encoder neural network? Some deep network that extracts high-level knowledge from our (potentially raw) data; we can use that knowledge to discriminate among classes, to score probabilistic outcomes, to extract features, etc.
[Figure: an encoder network maps an input x to a class score c; a generator network maps a latent z to a generated sample ẍ.]
Adversarial Learning = let’s use the encoder’s abstraction to learn whether the generator did a good job generating something realistic ẍ = G(z) or not.
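As a rough sketch of this adversarial game (assuming a discriminator with sigmoid outputs; names are illustrative):

```python
import torch
import torch.nn.functional as F

def d_loss(d_real, d_fake):
    # D learns to score real data as 1 and generated data ẍ = G(z) as 0.
    return F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

def g_loss(d_fake):
    # G tries to make D score its samples as real (non-saturating form).
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```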
13. Conditional GANs
GANs can be conditioned on extra information besides z: text, labels, speech, etc.
z might capture the random characteristics of the data (variabilities of plausible futures), whilst c would condition the deterministic parts!
For details on ways to condition GANs: Ways of Conditioning Generative Adversarial Networks (Wack et al.)
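One common conditioning scheme among those surveyed, sketched here with hypothetical shapes, is simply to concatenate z and c at the generator input:

```python
import torch

# Hypothetical shapes: z carries the random variability, c the condition.
z = torch.randn(8, 100)          # random latent draws
c = torch.randn(8, 128)          # e.g. a text/label/speaker embedding
g_input = torch.cat([z, c], 1)   # feed the concatenation to the generator
```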
14. Evolving GANs
Different loss functions in the Discriminator can fit your purpose:
● Binary cross-entropy (BCE): Vanilla GAN
● Least-Squares: LSGAN
● Wasserstein GAN + Gradient Penalty: WGAN-GP
● BEGAN: Energy based (D as an autoencoder) with boundary equilibrium.
● Maximum margin: Improved binary classification
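As a rough comparison of how the first three options change the discriminator objective (scores are raw logits/critic outputs; the gradient-penalty term of WGAN-GP is omitted):

```python
import torch
import torch.nn.functional as F

# Discriminator losses over raw scores s_real, s_fake.
def bce_d(s_real, s_fake):    # vanilla GAN
    return F.binary_cross_entropy_with_logits(s_real, torch.ones_like(s_real)) + \
           F.binary_cross_entropy_with_logits(s_fake, torch.zeros_like(s_fake))

def lsgan_d(s_real, s_fake):  # least-squares GAN
    return ((s_real - 1) ** 2).mean() + (s_fake ** 2).mean()

def wgan_d(s_real, s_fake):   # Wasserstein critic (gradient penalty omitted)
    return s_fake.mean() - s_real.mean()
```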
17. Factorizing the joint distribution
● Explicitly model the joint probability distribution of a data stream x as a product of element-wise conditional distributions, one per element xi of the stream (see the code sketch after this list).
○ Example: an image x of size (n, n) is decomposed by scanning pixels in raster mode (row by row, and pixel by pixel within every row).
○ Apply the probability chain rule, where xi is the i-th pixel in the image:
p(x) = p(x1) · p(x2 | x1) · … = ∏i p(xi | x1, …, xi−1)
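In code, this factorization means the log-likelihood is just a sum of per-pixel conditional log-probabilities; here the network outputs are faked with random logits for illustration:

```python
import torch

# log p(x) = sum_i log p(x_i | x_<i): one 256-way categorical per pixel,
# each conditioned only on previously scanned pixels (raster order).
pixels = torch.randint(0, 256, (3072,))   # an unrolled 32x32x3 image
cond_logits = torch.randn(3072, 256)      # stand-in for network outputs
log_probs = torch.log_softmax(cond_logits, dim=-1)
log_px = log_probs[torch.arange(3072), pixels].sum()
```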
18. Factorizing the joint distribution
● To model highly non-linear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model.
○ Q: How can we deal with this (which model)?
■ A: A Recurrent Neural Network.
20. PixelRNN
An RNN predicts the probability of each sample xi with a categorical output distribution: a Softmax.
[Figure: an RNN/LSTM at pixel position (x, y) consumes the previous pixels xi−4, …, xi−1 and outputs a distribution over xi.]
At every pixel in a greyscale image: 256 classes (8 bits/pixel).
Recall the RNN formulation: ht = f(ht−1, xt), e.g. ht = tanh(W·xt + U·ht−1 + b).
(van den Oord et al. 2016)
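A toy autoregressive sampling loop in the spirit of PixelRNN (all sizes and modules are illustrative; the real PixelRNN architecture is considerably more elaborate):

```python
import torch
import torch.nn as nn

# Toy sampler: the LSTM state summarizes x_<i; a softmax over 256
# classes (8 bits/pixel) emits each greyscale pixel x_i in turn.
rnn = nn.LSTMCell(input_size=256, hidden_size=128)
to_logits = nn.Linear(128, 256)

h = torch.zeros(1, 128)
c = torch.zeros(1, 128)
x_prev = torch.zeros(1, 256)              # all-zeros start token
pixels = []
for i in range(32 * 32):                  # a 32x32 greyscale image
    h, c = rnn(x_prev, (h, c))
    probs = torch.softmax(to_logits(h), dim=-1)
    xi = torch.multinomial(probs, 1)      # sample a pixel value in [0, 255]
    pixels.append(xi.item())
    x_prev = torch.zeros(1, 256)
    x_prev[0, xi.item()] = 1.0            # one-hot feedback of the sample
```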
21. Factorizing the joint distribution
● To model highly non-linear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model.
○ Q: How can we deal with this (which model)?
■ A: A Recurrent Neural Network.
■ B: A Convolutional Neural Network.
A CNN? Whut??
(van den Oord et al. 2016)²
23. My CNN is going causal...
Let’s say we have a sequence of 100-dimensional vectors describing words: “hi there how are you doing ?” → sequence length = 8.
We arrange it in 2D as a (sequence length = 8, 100 dim) matrix, and focus on 1D to exemplify causality.
24. My CNN is going causal...
We can apply a 1D convolutional activation over the 2D matrix, for an arbitrary kernel of width = 3. Each 1D convolutional kernel is then a 2D matrix of size (3, 100).
[Figure: kernel K1 slides along the (8, 100) sequence matrix.]
25. My CNN is going causal...
Keep in mind we are working with 100 dimensions, although here we depict just one for simplicity.
[Figure: a width-3 kernel (w1 w2 w3) slides over time-steps 1…8, producing outputs 1…6.]
The output length of the convolution is well known to be: seqlength − kwidth + 1 = 8 − 3 + 1 = 6. So the output matrix will be (6, 100), because there was no padding.
26. My CNN is going causal...
When we add zero padding, we normally do so on both sides of the sequence (as in image padding).
[Figure: the input 1…8 is padded with one zero on each side; the width-3 kernel now produces outputs 1…8.]
The output length of the convolution is: paddedlength − kwidth + 1 = 10 − 3 + 1 = 8. So the output matrix will be (8, 100), because we had padding.
27. My CNN went causal
Now add the zero padding just on the left side of the sequence, not symmetrically.
[Figure: two zeros are prepended to the input 1…8; the width-3 kernel produces outputs 1…8.]
The output length of the convolution is again: paddedlength − kwidth + 1 = 10 − 3 + 1 = 8, so the output matrix will be (8, 100) because we had padding.
HOWEVER: now every time-step t depends on the two previous inputs as well as the current time-step → every output is causal.
Roughly: we make a causal convolution by left-padding the sequence with (kwidth − 1) zeros.
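A minimal sketch of this recipe with PyTorch's Conv1d, left-padding with (kwidth − 1) zeros:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Causal 1D convolution: left-pad with (kwidth - 1) zeros so that output
# t only depends on inputs <= t. Input shape: (batch, 100 dims, length 8).
kwidth = 3
conv = nn.Conv1d(in_channels=100, out_channels=100, kernel_size=kwidth)

x = torch.randn(1, 100, 8)
x_padded = F.pad(x, (kwidth - 1, 0))  # pad left only, not symmetrically
y = conv(x_padded)
print(y.shape)                        # (1, 100, 8): same length, causal
```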
28. PixelCNN/Wavenet
We can simply exemplify how causal convolutions help us with the SoTA model for audio generation, Wavenet. We have a one-dimensional stream of samples of length T (e.g. audio): the receptive field has to be T.
(van den Oord et al. 2016)³
30. PixelCNN/Wavenet
The receptive field has to be T, and dilated convolutions help us reach a receptive field sufficiently large to emulate an RNN! (A sketch of a dilated stack follows below.)
Samples available online
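A rough sketch of such a dilated stack (channel sizes are illustrative; the per-layer causal left-padding is noted in the comments but omitted from the code):

```python
import torch.nn as nn

# Wavenet-style stack: exponentially growing dilation; each layer must be
# left-padded by dilation * (kwidth - 1) zeros to remain causal.
dilations = [1, 2, 4, 8, 16, 32, 64, 128]
layers = [nn.Conv1d(64, 64, kernel_size=2, dilation=d) for d in dilations]

# Receptive field = 1 + sum(d * (kwidth - 1)) over the stack:
receptive_field = 1 + sum(d * (2 - 1) for d in dilations)
print(receptive_field)  # 256 samples covered with only 8 layers
```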
31. Conditional PixelCNN/Wavenet
● Q: Can we make the generative model learn the natural distribution of X and perform a specific task conditioned on it?
○ A: Yes! We can condition each sample on an embedding/feature vector h, which can represent encoded text, speaker identity, generation speed, etc. (see the sketch below).
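A hedged sketch of one common way this conditioning is applied, following the Wavenet-style gated activation z = tanh(Wf∗x + Vf·h) ⊙ σ(Wg∗x + Vg·h); all sizes and module names here are assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes: C conv channels, conditioning vector h of dimension H.
C, H, T = 64, 128, 100
conv_f = nn.Conv1d(C, C, kernel_size=2)   # "filter" branch
conv_g = nn.Conv1d(C, C, kernel_size=2)   # "gate" branch
proj_f, proj_g = nn.Linear(H, C), nn.Linear(H, C)

x = torch.randn(1, C, T)   # input stream (causal padding omitted for brevity)
h = torch.randn(1, H)      # e.g. a speaker-identity embedding

# Gated activation: the projected h is broadcast over every time-step.
z = torch.tanh(conv_f(x) + proj_f(h).unsqueeze(-1)) * \
    torch.sigmoid(conv_g(x) + proj_g(h).unsqueeze(-1))
```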
33. Conditional PixelCNN/Wavenet
● PixelCNN produces very sharp and realistic samples, but...
● Q: Is this a computationally cheap generative model?
○ A: Nope. Take the example of Wavenet: it must run the forward pass 16,000 times, auto-regressively, to predict 1 second of audio at the standard 16 kHz sampling rate.
47. Planar Flow
We can apply this simple flow, f(z) = z + u·tanh(wᵀz + b), to an already-seen model to enhance its representational power though...
[Figure: a planar flow stacked on a latent z.]
We still sample z from N(0, I)!!
But if we had a somewhat more sophisticated flow formulation, we could directly build an end-to-end generative model!
62. GLOW Multi-scale structure
This is a very powerful structure: we can infer z features from an image, and we can sample an image from z features, EITHER WAY!
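This two-way property comes from coupling layers whose inverse is exact and cheap. A minimal affine-coupling sketch, where the small network `net` merely stands in for GLOW's actual convolutional blocks:

```python
import torch
import torch.nn as nn

def coupling_forward(x, net):
    x1, x2 = x.chunk(2, dim=-1)           # split dims in half
    log_s, t = net(x1).chunk(2, dim=-1)   # x1 parameterizes the transform
    return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1)

def coupling_inverse(y, net):
    y1, y2 = y.chunk(2, dim=-1)
    log_s, t = net(y1).chunk(2, dim=-1)   # y1 == x1, so s, t are recoverable
    return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=-1)

net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 4))
x = torch.randn(5, 4)
y = coupling_forward(x, net)
x_back = coupling_inverse(y, net)   # recovers x exactly (up to float eps)
```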
69. WaveGlow (Prenger et al. 2018)
A modified GLOW structure to deal with 1D audio data, giving SoTA results without the auto-regressive restriction.
Audio samples available online
80. Conclusions
● (Deep) generative models are built to learn the underlying structures hidden in our (high-dimensional) data.
● There are currently four main successful deep generative models: VAEs, GANs, PixelCNN and NF-based models.
● PixelCNN factorizes an explicitly known discrete distribution with the probability chain rule.
○ These are the slowest generative models at sampling time, owing to their recursive (auto-regressive) nature.
● VAEs, GANs, and NFs are capable of factorizing our highly complex PDFs into a simpler prior over z of our choice, which we can then explore and control to generate specific features.
● Once we have trained a VAE with the variational lower bound, we can generate new samples from our learned z manifold, assuming the most significant data lies over the prior → hence we like powerful priors (NFs can help in this matter).
● GANs use an implicit learning method to mimic our data distribution Pdata.
● GANs and NFs are the sharpest generators at the moment → very active research fields.
● GANs are the hardest generative models to train, owing to the intrinsic instabilities of the G-D synergies.
● NFs (Real-NVP based) require many steps to be effective in their task, consuming many resources at the moment.