Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Hung-yi Lee and Yu Tsao
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding
Generative Adversarial Networks”, arXiv, 2017
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
ICASSP: number of papers whose titles include the keyword (keyword search
on the session index page, so session names are included).
[Bar chart, 2012–2018: yearly paper counts for the keywords "generative",
"adversarial", and "reinforcement".]
It is a wise choice to
attend this tutorial.
Outline
Part I: General Introduction of Generative
Adversarial Network (GAN)
Part II: Applications to Signal Processing
Part III: Applications to Natural Language
Processing
Part I: General Introduction
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Outline of Part 1
Generation by GAN
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Anime Face Generation
The generator draws anime faces. Examples:

Basic Idea of GAN
Generator: a neural network (NN), i.e., a function.
Input: a high-dimensional vector → Output: an image.
[Figure: several example input vectors and the anime faces they generate.]
Powered by: http://mattya.github.io/chainer-DCGAN/
Each dimension of the input vector represents some characteristic,
e.g., longer hair, blue hair, open mouth.
Basic Idea of GAN
Discriminator: also a neural network (NN), i.e., a function.
Input: an image → Output: a scalar.
Larger value means real, smaller value means fake.
[Figure: real anime faces score 1.0; generated faces score 0.1.]
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D.
Feed randomly sampled vectors into G to produce generated objects, and
sample real objects from the database (labels: real = 1, generated = 0).
The discriminator learns to assign high scores to real objects
and low scores to generated objects.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 2: Fix discriminator D, and update generator G.
Treat "NN generator + discriminator" as one large network with a hidden
layer in between; update only the generator (gradient ascent, with the
discriminator fixed) so that the scalar output becomes large.
The generator learns to "fool" the discriminator.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Learning D: sample some real objects (images) from the database and
generate some fake objects with G (fed with sampled vectors); update D
so real images get 1 and generated images get 0, while G is fixed.
Learning G: sample input vectors; update G so that D assigns its outputs
high scores, while D is fixed.
Anime Face Generation
Results after 100, 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000
updates: the faces generated by the machine gradually improve.
Source of training data: https://zhuanlan.zhihu.com/p/24767059
The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie,
Tsung-Han Wu.
[Figure: varying one dimension of the input vector from 0.0 to 0.9 and
feeding it to G — the generated faces change gradually along the way.]
(Variational) Auto-encoder
An NN encoder maps an image to a code, and an NN decoder reconstructs the
image from the code, trained so that input and output are as close as
possible. The decoder alone can serve as a generator: randomly generate a
vector as the code, and the decoder outputs an image. (NN Decoder =
Generator)
Auto-encoder v.s. GAN
As close as possibleNN
Decoder
code
Genera-
tor
code
= Generator
Discri-
minator
If discriminator does not simply memorize the images,
Generator learns the patterns of faces.
Just copy an image
1
Auto-encoder
GAN
Fuzzy …
FID[Martin Heusel, et al., NIPS, 2017]: Smaller is better
[Mario Lucic, et al. arXiv, 2017]
Outline of Part 1
Generation
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Generator
• A generator G is a network. The network defines a probability
distribution $P_G$:
$z \sim$ Normal Distribution, $x = G(z)$
where $x$ is an image (a high-dimensional vector).
We want $P_G(x)$ to be as close as possible to $P_{data}(x)$:
$$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$$
where $\mathrm{Div}(P_G, P_{data})$ is the divergence between the
distributions $P_G$ and $P_{data}$. How to compute the divergence?
Discriminator
$$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$$
Although we do not know the distributions $P_G$ and $P_{data}$, we can
sample from them: sampling from $P_{data}$ means drawing images from the
database; sampling from $P_G$ means feeding vectors sampled from the
normal distribution into G.
Discriminator
$$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$$
Train D on data sampled from $P_{data}$ and data sampled from $P_G$.
Example objective function for D (G is fixed):
$$V(G, D) = E_{x \sim P_{data}}[\log D(x)] + E_{x \sim P_G}[\log(1 - D(x))]$$
Training: $D^* = \arg\max_D V(D, G)$
Using this example objective function is exactly the same as training a
binary classifier. The maximum objective value is related to the JS
divergence. [Goodfellow, et al., NIPS, 2014]
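A tiny sketch (PyTorch) making the classifier correspondence concrete:
maximizing V(G, D) over D is the same as minimizing binary cross-entropy
with labels 1 for data and 0 for samples from $P_G$.

```python
import torch

def V(D, x_data, x_gen, eps=1e-8):
    # V(G, D) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]
    return (torch.log(D(x_data) + eps).mean()
            + torch.log(1 - D(x_gen) + eps).mean())

# Gradient ascent on V w.r.t. D's parameters is gradient descent on the
# real-vs-fake binary cross-entropy:
#   loss_D = BCE(D(x_data), 1) + BCE(D(x_gen), 0) = -V(G, D)
```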
Discriminator
Intuition: when $P_G$ and $P_{data}$ have small divergence, the sampled
data are hard to discriminate and the trained discriminator cannot make
the objective large; when the divergence is large, the data are easy to
discriminate and the objective becomes large.
Training: $D^* = \arg\max_D V(D, G)$
Since the maximum objective value is related to the JS divergence,
$$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data}) = \arg\min_G \max_D V(G, D)$$
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D:
$D^* = \arg\max_D V(D, G)$
Step 2: Fix discriminator D, and update generator G
[Goodfellow, et al., NIPS, 2014]
Can we use other divergences? Yes — f-GAN lets you use the divergence
you like  [Sebastian Nowozin, et al., NIPS, 2016]
Outline of Part 1
Generation
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
More tips and tricks:
https://github.com/soumith/ganhacks
JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• 1. The nature of data: both $P_{data}$ and $P_G$ are low-dimensional
manifolds in a high-dimensional space, so their overlap can be ignored.
• 2. Sampling: even if $P_{data}$ and $P_G$ do overlap, with limited
samples the overlap can still be missed.
What is the problem of JS divergence? Consider generators
$P_{G_0}, P_{G_1}, \ldots$ moving progressively closer to $P_{data}$:
$$JS(P_{G_0}, P_{data}) = \log 2, \quad JS(P_{G_1}, P_{data}) = \log 2,
\quad \ldots, \quad JS(P_{G_{100}}, P_{data}) = 0$$
JS divergence is $\log 2$ whenever two distributions do not overlap.
Intuition: if two distributions do not overlap, a binary classifier
achieves 100% accuracy. $P_{G_0}$ and $P_{G_1}$ are thus "equally bad":
the same objective value (same divergence) is obtained, so the generator
gets no signal to move closer.
Wasserstein distance
• Considering one distribution P as a pile of earth and another
distribution Q as the target, the Wasserstein distance is the average
distance the earth mover has to move the earth: if the whole pile at P is
moved a distance d to Q, then $W(P, Q) = d$.
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
There are many possible "moving plans", some with smaller and some with
larger average distance; the Wasserstein distance is defined using the
moving plan with the smallest average distance.
What is the problem of JS divergence? (Wasserstein fixes it)
$$JS(P_{G_0}, P_{data}) = \log 2, \quad \ldots, \quad JS(P_{G_{100}}, P_{data}) = 0$$
$$W(P_{G_0}, P_{data}) = d_0, \quad W(P_{G_1}, P_{data}) = d_1, \quad
\ldots, \quad W(P_{G_{100}}, P_{data}) = 0$$
The Wasserstein distance decreases ($d_0 > d_1 > \cdots$) as $P_G$ moves
toward $P_{data}$ — Better!
WGAN
$$V(G, D) = \max_{D \in \text{1-Lipschitz}} \left\{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \right\}$$
This evaluates the Wasserstein distance between $P_{data}$ and $P_G$.
[Martin Arjovsky, et al., arXiv, 2017]
D has to be smooth enough — how to fulfill this constraint? Without the
constraint, the training of D will not converge: D is pushed toward
$+\infty$ on real data and $-\infty$ on generated data. Keeping D smooth
prevents D(x) from blowing up to $\infty$ and $-\infty$.
• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]:
force the parameters w between -c and c; after each parameter update,
if w > c, set w = c; if w < -c, set w = -c.
• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]: keep
the gradient norm close to 1 at points between real and generated
samples. [Kodali, et al., arXiv, 2017] [Wei, et al., ICLR, 2018]
• Spectral Normalization → keep the gradient norm smaller than 1
everywhere [Miyato, et al., ICLR, 2018]
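A minimal sketch of the gradient-penalty idea in PyTorch: penalize the
critic when its gradient norm at points interpolated between real and
generated samples deviates from 1. The coefficient lambda_gp = 10 and
the flattened-feature shapes are common-practice assumptions, not from
the tutorial.

```python
import torch

def wgan_gp_d_loss(D, x_real, x_fake, lambda_gp=10.0):
    # WGAN critic loss (minimized): E_{P_G}[D(x)] - E_{P_data}[D(x)]
    loss = D(x_fake).mean() - D(x_real).mean()

    # Random points between real and generated samples (2-D tensors assumed)
    alpha = torch.rand(x_real.size(0), 1)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grad.norm(2, dim=1) - 1) ** 2).mean()  # keep ||grad|| near 1
    return loss + lambda_gp * penalty
```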
Energy-based GAN (EBGAN)
• Using an autoencoder as the discriminator D: the negative
reconstruction error of the autoencoder (EN + DE) determines the goodness
of an image — 0 for the best images, more negative (e.g., -0.1, -1) for
worse ones. The generator is the same as usual.
[Junbo Zhao, et al., arXiv, 2016]
Benefit: the autoencoder can be pre-trained on real images without a
generator.
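A minimal sketch of the EBGAN scoring idea, assuming illustrative EN/DE
sizes (not specified in the tutorial):

```python
import torch
import torch.nn as nn

x_dim, code_dim = 784, 64  # assumed sizes
EN = nn.Sequential(nn.Linear(x_dim, code_dim), nn.ReLU())
DE = nn.Sequential(nn.Linear(code_dim, x_dim), nn.Tanh())

def ebgan_score(x):
    # negative reconstruction error: 0 is best, more negative is worse
    return -((DE(EN(x)) - x) ** 2).mean(dim=1)

# Practical benefit: EN/DE can first be trained as a plain autoencoder on
# real images alone, before any generator exists.
```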
Mode Collapse
: real data
: generated data
Training with too many iterations ……
Mode Dropping
BEGAN on CelebA
Generator
at iteration t
Generator
at iteration t+1
Generator
at iteration t+2
Generator switches mode during training
Ensemble
To generate an image: train a set of generators $G_1, G_2, \cdots, G_N$;
randomly pick a generator $G_i$; use $G_i$ to generate the image.
Objective Evaluation
Feed each generated image x into an off-the-shelf image classifier (e.g.,
Inception net, VGG, etc.) to obtain P(y|x), where y is the class.
A concentrated distribution P(y|x) for a single image means higher visual
quality; a uniform marginal distribution
$$P(y) = \frac{1}{N}\sum_n P(y^n | x^n)$$
over many images $x^1, x^2, x^3, \ldots$ means higher variety.
[Tim Salimans, et al., NIPS, 2016]
Objective Evaluation — Inception Score
$$\text{IS} = \sum_x \sum_y P(y|x)\log P(y|x) - \sum_y P(y)\log P(y)$$
The first term is the negative entropy of P(y|x), large when each image
gets a sharp class prediction; the second term is the entropy of
$P(y) = \frac{1}{N}\sum_n P(y^n|x^n)$, large when the classes are
diverse.
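A numpy sketch of the computation above. `probs` (N x num_classes) would
come from an off-the-shelf classifier such as an Inception net (an
assumption); the exponential of the mean KL term is the commonly reported
form of the score.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0)  # P(y) = (1/N) sum_n P(y|x^n)
    # per-image KL(P(y|x) || P(y)); averaging it equals the slide's
    # "negative entropy of P(y|x) + entropy of P(y)" expression
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

probs = np.random.dirichlet(np.ones(10) * 0.1, size=1000)  # toy P(y|x)
print(inception_score(probs))
```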
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
• Original generator: $z \sim$ Normal Distribution, $x = G(z)$, trained
so that $P_G(x) \to P_{data}(x)$.
• Conditional generator: given a condition c, $x = G(c, z)$, trained so
that $P_G(x|c) \to P_{data}(x|c)$. [Mehdi Mirza, et al., arXiv, 2014]
e.g. Text-to-Image: the NN generator takes a text condition such as
"Girl with red hair and red eyes" or "Girl with yellow ribbon" and
produces the target NN output image.
Text-to-Image
• Traditional supervised approach: train an NN to map text (e.g., c1:
"a dog is running", "a bird is flying") to an image, making the output as
close as possible to the target. Because one text matches many different
images, the network averages them — a blurry image!
Conditional GAN
G: x = G(c, z), with condition c (e.g., "train") and z sampled from a
normal distribution.
D (original): takes only the image x and outputs a scalar — whether x is
a real image or not (real images: 1, generated images: 0).
Problem: the generator will learn to generate realistic images but
completely ignore the input conditions. [Scott Reed, et al., ICML, 2016]
Conditional GAN
D (better): takes both c and x and outputs a scalar — x is realistic or
not AND c and x are matched or not. True text-image pairs (train,
train-image) get 1; pairs with generated images (train, generated) and
mismatched real pairs (cat, train-image) get 0.
[Scott Reed, et al., ICML, 2016]
Conditional GAN - Discriminator
Two common network designs (almost every paper uses the first):
(1) a single network takes condition c and object x together and outputs
the score;
(2) one network scores whether x is realistic, and a second network takes
c together with the first network's embedding of x and scores whether c
and x are matched.
[Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017]
[Augustus Odena et al., ICML, 2017]
Conditional GAN — paired data
Collecting anime faces together with descriptions of their
characteristics (e.g., blue eyes, red hair, short hair) provides the
paired training data. Generated examples: "red hair, green eyes"; "blue
hair, red eyes". The images are generated by Yen-Hao Chen, Po-Chun Chien,
Jun-Chen Xie, Tsung-Han Wu.
Conditional GAN - Image-to-image
Image translation, or pix2pix: x = G(c, z), where the condition c is an
input image and the output should be as close as possible to the target.
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Image-to-image
• Traditional supervised approach: train an NN with, e.g., an L1 loss so
the output image is as close as possible to the target. At testing time
the result is blurry. [Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Image-to-image
With GAN: G maps the input image (and z) to an output image, and D
outputs a scalar score. Testing comparison: input vs. L1 vs. GAN vs.
GAN + L1. [Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Video Generation
The generator predicts the next frame of a video; the discriminator
checks whether the last frame is real or generated, and the generator is
trained until the discriminator thinks the generated frame is real.
[Michael Mathieu, et al., arXiv, 2015]
https://github.com/dyelax/Adversarial_Video_Generation
Domain Adversarial Training
• Training and testing data are in different domains (take digit
classification as example). A feature extractor (generator) should map
both domains to the same feature distribution, so that training-data
features (blue points) and testing-data features (red points) mix.
Domain Adversarial Training
A discriminator (domain classifier) takes the feature of an image and
predicts which domain it came from; the feature extractor (generator)
learns to make the domain classifier fail. But it could cheat by always
outputting zero vectors.
Domain Adversarial Training
Therefore a label predictor is added: the feature extractor must not only
cheat the domain classifier (which domain?) but also satisfy the label
predictor (which digits?) at the same time.
Successfully applied to image classification
[Ganin et al., ICML, 2015] [Ajakan et al., JMLR, 2016]
More speech-related applications in Part II.
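The standard way to implement this is a gradient reversal layer (GRL):
identity in the forward pass, gradient multiplied by -alpha in the
backward pass. A minimal PyTorch sketch (the alpha value and module names
are illustrative assumptions):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reversed gradient: the feature extractor is updated to *hurt*
        # the domain classifier
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

# Usage sketch:
#   feature = extractor(image)
#   label_logits  = label_predictor(feature)                  # which digit?
#   domain_logits = domain_classifier(grad_reverse(feature))  # which domain?
# Sum the two cross-entropy losses and minimize with one optimizer.
```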
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Domain X Domain Y
male female
It is good.
It’s a good day.
I love you.
It is bad.
It’s a bad day.
I don’t love you.
Unsupervised Conditional Generation
Transform an object from one domain to another
without paired data (e.g. style transfer)
Part II
Part III
Unsupervised Conditional Generation
G transforms objects from Domain X to Domain Y.
• Approach 1: Direct Transformation $G_{X \to Y}$ — suitable for texture
or color change.
• Approach 2: Projection to Common Space — an encoder of domain X
($EN_X$) and a decoder of domain Y ($DE_Y$); suitable for larger changes,
where only the semantics (e.g., face attributes) are kept.
Direct Transformation
$G_{X \to Y}$ transforms an image from Domain X; $D_Y$, trained on Domain
Y, outputs a scalar — whether the input image belongs to domain Y or not.
The generator's output therefore becomes similar to domain Y.
Direct Transformation
Becoming similar to domain Y is not enough: the generator can simply
ignore the input and output any domain-Y image. Not what we want!
Direct Transformation
[Tomer Galanti, et al., ICLR, 2018]
The issue can be avoided by network design: a simpler generator makes the
input and output more closely related.
Direct Transformation
Alternatively, embed the generator's input and output with a pre-trained
encoder network and force the two embeddings to be as close as possible.
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
Direct Transformation — Cycle consistency
Apply $G_{X \to Y}$ and then $G_{Y \to X}$, and require the
reconstruction to be as close as possible to the original input;
otherwise the transformed image may lack the information needed for
reconstruction. [Jun-Yan Zhu, et al., ICCV, 2017]
Direct Transformation
Train both directions jointly: $G_{X \to Y}$ followed by $G_{Y \to X}$
(reconstruction as close as possible to the original X image) and
$G_{Y \to X}$ followed by $G_{X \to Y}$ (as close as possible to the
original Y image), with discriminators $D_X$ and $D_Y$ scoring whether
images belong to domain X and domain Y, respectively.
Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017]; Dual GAN [Zili Yi, et al.,
ICCV, 2017]; Disco GAN [Taeksoo Kim, et al., ICML, 2017]
For multiple domains, consider StarGAN [Yunjey Choi, arXiv, 2017].
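A minimal sketch of the combined generator objective in PyTorch:
adversarial terms push translations into the target domains, and L1 cycle
terms enforce the two reconstructions. Binary cross-entropy and
lambda = 10 are simplifying assumptions here (the papers use
least-squares or other GAN losses).

```python
import torch
import torch.nn.functional as F

def cyclegan_g_loss(G_XY, G_YX, D_X, D_Y, x, y, lam=10.0):
    fake_y, fake_x = G_XY(x), G_YX(y)
    ones = torch.ones(x.size(0), 1)
    # fool both discriminators (D_* assumed to output probabilities)
    adv = (F.binary_cross_entropy(D_Y(fake_y), ones)
           + F.binary_cross_entropy(D_X(fake_x), ones))
    # cycle consistency: X -> Y -> X and Y -> X -> Y reconstructions
    cyc = (G_YX(fake_y) - x).abs().mean() + (G_XY(fake_x) - y).abs().mean()
    return adv + lam * cyc
```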
Issue of Cycle Consistency
• CycleGAN: a Master of Steganography
[Casey Chu, et al., NIPS workshop, 2017]
𝐺Y→X𝐺 𝑋→𝑌
The information is hidden.
Unsupervised Conditional Generation
• Approach 1: Direct Transformation $G_{X \to Y}$ — for texture or color
change.
• Approach 2: Projection to Common Space — encoder $EN_X$ of domain X,
decoder $DE_Y$ of domain Y; larger change, only keep the semantics (face
attributes).
Projection to Common Space — Target
Encoders $EN_X$ and $EN_Y$ map images from domains X and Y into a common
latent space (face attributes); decoders $DE_X$ and $DE_Y$ map latent
codes back to images in each domain.
Projection to Common Space — Training
Train an auto-encoder for each domain ($EN_X$/$DE_X$ and $EN_Y$/$DE_Y$)
by minimizing reconstruction error; discriminators $D_X$ and $D_Y$
additionally push the decoded images to look like real images of each
domain.
Problem: because we train the two auto-encoders separately, images with
the same attributes may not project to the same position in the latent
space.
Projection to Common Space — Training
Fix 1: share the parameters of the encoders ($EN_X$, $EN_Y$) and of the
decoders ($DE_X$, $DE_Y$).
Couple GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al.,
NIPS, 2017]
Fix 2: add a domain discriminator that guesses whether a latent code came
from $EN_X$ or $EN_Y$; $EN_X$ and $EN_Y$ learn to fool it, which forces
their outputs to have the same distribution (while still minimizing
reconstruction error).
[Guillaume Lample, et al., NIPS, 2017]
Projection to Common Space — Training
Fix 3: cycle consistency — minimize the reconstruction error along the
full path X → $EN_X$ → $DE_Y$ → Y → $EN_Y$ → $DE_X$ → X.
Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Projection to Common Space — Training
Fix 4: semantic consistency — transform X to Y, encode both into the same
latent space, and require the two latent codes to be close (still
minimizing reconstruction error).
Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer,
et al., arXiv, 2017]
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Basic Components
Actor, Env, and Reward Function. For a video game, the reward function is
given by the rules (e.g., get 20 points when killing a monster); for Go,
by the rules of Go. You cannot control the Env or the Reward Function.
Neural network as Actor
• Input of the neural network: the observation of the machine,
represented as a vector or a matrix (e.g., pixels).
• Output of the neural network: each action corresponds to a neuron in
the output layer, giving a score per action (e.g., left 0.7, right 0.2,
fire 0.1). Take the action based on the probability.
Example: Playing Video Game
Start with observation $s_1$ → take action $a_1$ ("right"), obtain reward
$r_1 = 0$ → observation $s_2$ → take action $a_2$ ("fire", kill an
alien), obtain reward $r_2 = 5$ → observation $s_3$ → …
After many turns: action $a_T$, obtain reward $r_T$, game over (spaceship
destroyed). This is an episode. We want the total reward
$R = \sum_{t=1}^{T} r_t$ to be maximized.
Actor, Environment, Reward
Trajectory $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$: the Env
emits $s_1$, the Actor outputs $a_1$, the Env emits $s_2$, the Actor
outputs $a_2$, and so on.
Reinforcement Learning v.s. GAN
Actor → Generator (updated); Reward Function → Discriminator (fixed).
The episode reward is $R(\tau) = \sum_{t=1}^{T} r_t$, with per-step
rewards $r_1, r_2, \ldots$. The environment and the reward function are a
"black box": you cannot use backpropagation through them.
Imitation Learning
The reward function is not available; we only have demonstrations of the
expert: $\{\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N\}$, where
each $\hat{\tau}$ is a trajectory of the expert.
Examples: self-driving — record human drivers; robot — grab the arm of
the robot to demonstrate.
Inverse Reinforcement Learning
Reinforcement Learning: Reward Function + Environment → Optimal Actor.
Inverse Reinforcement Learning: expert demonstrations
$\{\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N\}$ + Environment →
Reward Function; then use the reward function to find the optimal actor.
Modeling reward can be easier: a simple reward function can lead to a
complex policy.
Framework of IRL
The expert $\hat{\pi}$ provides $\{\hat{\tau}_1, \cdots, \hat{\tau}_N\}$;
the actor $\pi$ generates $\{\tau_1, \cdots, \tau_N\}$.
Obtain a reward function R under which the expert is always the best:
$$\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$$
Then find an actor based on reward function R (by reinforcement learning)
and iterate.
Reward function → Discriminator; Actor → Generator.
GAN v.s. IRL
GAN: D gives high scores to real examples and low scores to generated
ones; find a G whose output obtains a large score from D.
IRL: the reward function gives larger rewards to the expert's
$\hat{\tau}_n$ and lower rewards to the actor's $\tau$; find an actor
that obtains a large reward.
Concluding Remarks
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Reference
• Generation
• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial
Networks, NIPS, 2014
• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training
Generative Neural Samplers using Variational Divergence Minimization”,
NIPS, 2016
• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv,
2017
• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron
Courville, Improved Training of Wasserstein GANs, NIPS, 2017
• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative
Adversarial Network, arXiv, 2016
• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet,
“Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017
• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016
• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler,
Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge
to a Local Nash Equilibrium, NIPS, 2017
Reference
• Generation
• Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On Convergence
and Stability of GANs”, arXiv, 2017
• Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang, Improving the
Improved Training of Wasserstein GANs: A Consistency Term and Its Dual
Effect, ICLR, 2018
• Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral
Normalization for Generative Adversarial Networks, ICLR, 2018
Reference
• Conditional Generation
• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML,
2016
• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image
Translation with Conditional Adversarial Networks, CVPR, 2017
• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video
prediction beyond mean square error, arXiv, 2015
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets,
arXiv, 2014
• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator,
ICLR, 2018
• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei
Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with
Stacked Generative Adversarial Networks, arXiv, 2017
• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image
Synthesis With Auxiliary Classifier GANs, ICML, 2017
Reference
• Conditional Generation
• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by
Backpropagation, ICML, 2015
• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
Reference
• Unsupervised Conditional Generation
• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-
Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017
• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual
Learning for Image-to-Image Translation, ICCV, 2017
• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity
Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018
• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image
Generation, ICLR, 2017
• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN:
Unrestrained Scalability for Image Domain Translation, arXiv, 2017
• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar
Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-
Image Translation for Many-to-Many Mappings, arXiv, 2017
Reference
• Unsupervised Conditional Generation
• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic
Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by
Sliding Attributes, NIPS, 2017
• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim,
Learning to Discover Cross-Domain Relations with Generative Adversarial
Networks, ICML, 2017
• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS,
2016
• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image
Translation Networks, NIPS, 2017
• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim,
Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-
Domain Image-to-Image Translation, arXiv, 2017
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part II: Speech Signal
Processing
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Speech Signal Generation (Regression Task)
G maps a paired input to an output; an objective function compares the
output with the paired target.
Speech, Speaker, Emotion Recognition and Lip-reading
(Classification Task)
An encoder E maps clean data x to an embedding (Emb.) and a classifier G
outputs the label y = h(g(x)). Acoustic mismatch: at test time the input
x̃ may be noisy data, accented speech, or carry channel distortion, so
its embedding z̃ = g(x̃) deviates from the clean-data embedding.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Speech Enhancement
• Typical objective functions: mean square error (MSE) [Xu et al., TASLP
2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al.,
MLSP 2017], STOI [Fu et al., TASLP 2018].
• Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL
2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al.,
Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al.,
Interspeech 2016].
• GAN is used as a new objective function to estimate the parameters in
the enhancing model G.
Speech Enhancement
• Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017]:
G takes the noisy input (and noise z) and outputs the enhanced signal.
Speech Enhancement (SEGAN)
• Experimental results
Table 1: Objective evaluation results. Table 2: Subjective evaluation
results. Fig. 1: Preference test results.
SEGAN yields better speech enhancement results than Noisy and Wiener.
Speech Enhancement
• Pix2Pix [Michelsanti et al., Interspeech 2017]
G maps the noisy spectrogram to an enhanced output; D takes (noisy,
output) vs. (noisy, clean) pairs and outputs a scalar (fake/real).
Speech Enhancement (Pix2Pix)
• Spectrogram analysis
Fig. 2: Spectrogram comparison of Pix2Pix with baseline methods (Noisy,
Clean, NG-DNN, STAT-MMSE, NG-Pix2Pix).
Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE.
Table 3: Objective evaluation results.
Speech Enhancement (Pix2Pix)
• Objective evaluation and speaker verification test
Table 4: Speaker verification results.
1. From the objective evaluations, Pix2Pix outperforms Noisy and
MMSE and is competitive to DNN SE.
2. From the speaker verification results, Pix2Pix outperforms the
baseline models when the clean training data is used.
Speech Enhancement
• Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]
G maps the noisy spectrogram to an enhanced output; D takes (noisy,
output) vs. (noisy, clean) pairs and outputs a scalar (fake/real).
Fig. 2: Spectrogram comparison of FSEGAN with L1-trained method.
Speech Enhancement (FSEGAN)
• Spectrogram analysis
FSEGAN reduces both additive noise and reverberant smearing.
Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain.
Speech Enhancement (FSEGAN)
• ASR results
1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean.
(2) FSEGAN outperforms SEGAN as front-ends.
2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline;
(2) FSEGAN retraining slightly underperforms L1–based retraining.
Speech Enhancement
• Adversarial training based mask estimation (ATME)
[Higuchi et al., ASRU 2017]
A mask generator $G_{Mask}$ separates noisy speech into estimated speech
and estimated noise; discriminator $D_S$ judges estimated speech against
true clean-speech data, and $D_N$ judges estimated noise against true
noise data (true or fake):
$$V_{Mask} = E_{s_{fake}}[\log(1 - D_S(s_{fake}; \theta))] + E_{n_{fake}}[\log(1 - D_N(n_{fake}; \theta))]$$
Fig. 3: Spectrogram comparison of (a) noisy; (b) MMSE with supervision;
(c) ATME without supervision.
Speech Enhancement (ATME)
• Spectrogram analysis
The proposed adversarial training mask estimation can
capture speech/noise signals without supervised data.
The speech mask and the noise mask $M_{f,t}^n$ are estimated by
adversarial training.
Speech Enhancement (ATME)
• Mask-based beamformer for robust ASR
 The estimated mask parameters are used to compute the spatial
covariance matrices for an MVDR beamformer.
 $\hat{s}_{f,t} = \mathbf{w}_f^H \mathbf{y}_{f,t}$, where
$\hat{s}_{f,t}$ is the enhanced signal, $\mathbf{y}_{f,t}$ denotes the
observation of the M microphones, $f$ and $t$ are frequency and time
indices, and $\mathbf{w}_f$ denotes the beamformer coefficients.
 The MVDR solves for $\mathbf{w}_f$ by:
$$\mathbf{w}_f = \frac{(R_f^{s+n})^{-1}\mathbf{h}_f}{\mathbf{h}_f^H (R_f^{s+n})^{-1}\mathbf{h}_f}$$
 To estimate $\mathbf{h}_f$, the spatial covariance matrix of the target
signal, $R_f^s$, is computed by $R_f^s = R_f^{s+n} - R_f^n$, where
$$R_f^n = \frac{\sum_t M_{f,t}^n \,\mathbf{y}_{f,t}\mathbf{y}_{f,t}^H}{\sum_t M_{f,t}^n}$$
and $M_{f,t}^n$ is the noise mask computed by adversarial training.
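A numpy sketch of this mask-based MVDR beamformer for one frequency bin:
Y (M x T) are multichannel STFT observations, mask_n (T,) the estimated
noise mask. Taking the steering vector as the principal eigenvector of
$R_f^s$ is one common choice and an assumption here (the slide does not
specify it).

```python
import numpy as np

def mvdr_enhance(Y, mask_n):
    # noise spatial covariance: mask-weighted average of y y^H
    R_n = (mask_n * Y) @ Y.conj().T / mask_n.sum()
    R_xn = Y @ Y.conj().T / Y.shape[1]   # covariance of noisy speech, R^{s+n}
    R_s = R_xn - R_n                     # target covariance: R^s = R^{s+n} - R^n
    eigval, eigvec = np.linalg.eigh(R_s)
    h = eigvec[:, -1]                    # steering vector estimate (assumption)
    w = np.linalg.solve(R_xn, h)         # (R^{s+n})^{-1} h
    w /= h.conj() @ w                    # normalize by h^H (R^{s+n})^{-1} h
    return w.conj() @ Y                  # s_hat_{f,t} = w_f^H y_{f,t}

Y = np.random.randn(4, 100) + 1j * np.random.randn(4, 100)  # toy data
print(mvdr_enhance(Y, np.random.rand(100)).shape)
```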
Table 7: WERs (%) for the development and evaluation sets.
• ASR results
Speech Enhancement (ATME)
1. ATME provides significant improvements over Unprocessed.
2. Unsupervised ATME slightly underperforms supervised MMSE.
Speech Enhancement (AFT)
• Cycle-GAN-based acoustic feature transformation (AFT)
[Mimura et al., ASRU 2017]
$G_{S \to T}$ and $G_{T \to S}$ map between the clean (S) and noisy (T)
feature domains with cycle consistency (Noisy → Enhanced → Noisy; Clean →
Syn. Noisy → Clean); discriminators $D_S$ and $D_T$ judge whether
features belong to each domain:
$$V_{Full} = V_{GAN}(G_{S \to T}, D_T) + V_{GAN}(G_{T \to S}, D_S) + \lambda V_{Cyc}(G_{S \to T}, G_{T \to S})$$
Speech Enhancement (AFT)
• ASR results on noise robustness and style adaptation
Table 8: Noise-robust ASR (S: Clean; T: Noisy). Table 9: Speaker style
adaptation (JNAS: read; CSJ-SPS: spontaneous, relaxed; CSJ-APS:
spontaneous, formal).
1. $G_{T \to S}$ can transform acoustic features and effectively improve
ASR results for both noisy and accented speech.
2. $G_{S \to T}$ can be used for model adaptation and effectively improve
ASR results for noisy speech.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Postfilter
• Postfilter for synthesized or transformed speech
 Conventional postfilter approaches for G estimation include global
variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil'en et
al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al.,
ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014;
Chen et al., TASLP 2015].
 GAN is used as a new objective function to estimate the parameters in
G.
G takes the synthesized spectral texture (from a speech synthesizer,
voice conversion, or speech enhancement system) and outputs a refined
texture; the objective function compares it with the natural spectral
texture.
Postfilter
• GAN postfilter [Kaneko et al., ICASSP 2017]
 The traditional MMSE criterion results in statistical averaging.
 GAN is used as a new objective function to estimate the parameters in
G: G maps synthesized Mel-cepstral coefficients to generated ones, and D
judges natural vs. generated.
 The goal is to further improve the naturalness of synthesized speech or
parameters from a synthesizer.
Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance
scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters.
Postfilter (GAN-based Postfilter)
• Spectrogram analysis
GAN postfilter reconstructs spectral texture similar to the natural one.
Fig. 5: Mel-cepstral trajectories (GANv:
GAN was applied in voiced part).
Fig. 6: Averaging difference in
modulation spectrum per Mel-
cepstral coefficient.
Postfilter (GAN-based Postfilter)
• Objective evaluations
GAN postfilter reconstructs spectral texture similar to the natural one.
Table 10: Preference score (%). Bold font indicates the numbers over 30%.
Postfilter (GAN-based Postfilter)
• Subjective evaluations
1. GAN postfilter significantly improves the synthesized speech.
2. GAN postfilter is effective particularly in voiced segments.
3. GANv outperforms GAN and is comparable to NAT.
Postfilter (GAN-postfilter-STFT)
• GAN post-filter for STFT spectrograms [Kaneko et al.,
Interspeech 2017]
 GAN postfilter was applied on high-dimensional STFT spectrograms.
 The spectrogram was partitioned into N bands (each band overlaps its
neighboring bands).
 The GAN-based postfilter was trained for each band.
 The reconstructed spectrogram from each band was smoothly connected.
• Spectrogram analysis
Fig. 7: Spectrograms of: (1) SYN, (2) GAN, (3) Original (NAT)
Postfilter (GAN-postfilter-STFT)
GAN postfilter reconstructs spectral texture similar to the natural one.
Speech Synthesis
• Input: linguistic features; Output: speech parameters.
$G_{SS}$ maps linguistic features to generated speech parameters
$\hat{c}$ (e.g., spectral parameters, sp); the objective function
compares them with the natural speech parameters $c$ — minimum generation
error (MGE), MSE.
• Speech synthesis with anti-spoofing verification (ASV)
[Saito et al., ICASSP 2017]
$G_{SS}$ maps linguistic features to generated speech parameters
$\hat{c}$ (sp); an anti-spoofing discriminator $D_{ASV}$ (with feature
function $\phi(\cdot)$) judges generated (SYN) vs. natural (NAT)
parameters $c$. Minimum generation error (MGE) training is combined with
an adversarial loss:
$$L_D(c, \hat{c}) = L_{D,1}(c) + L_{D,0}(\hat{c})$$
$$L_{D,1}(c) = -\frac{1}{T}\sum_{t=1}^{T}\log D(c_t) \;\;\text{(NAT)}, \qquad
L_{D,0}(\hat{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log(1 - D(\hat{c}_t)) \;\;\text{(SYN)}$$
$$L(c, \hat{c}) = L_G(c, \hat{c}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{c})$$
Fig. 8: Averaged GVs of MCCs.
Speech Synthesis (ASV)
• Objective and subjective evaluations
1. The proposed algorithm generates MCCs similar to the natural ones.
Fig. 9: Scores of speech quality.
2. The proposed algorithm outperforms conventional MGE training.
Speech Synthesis
• Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018]
The same MGE-plus-adversarial formulation, now applied to spectral
parameters, F0, and duration, with discriminator D (feature function
$\phi(\cdot)$) distinguishing natural from generated parameters:
$$L_D(c, \hat{c}) = L_{D,1}(c) + L_{D,0}(\hat{c})$$
$$L_{D,1}(c) = -\frac{1}{T}\sum_{t=1}^{T}\log D(c_t) \;\;\text{(NAT)}, \qquad
L_{D,0}(\hat{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log(1 - D(\hat{c}_t)) \;\;\text{(SYN)}$$
$$L(c, \hat{c}) = L_G(c, \hat{c}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{c})$$
Speech Synthesis (SS-GAN)
• Subjective evaluations
Fig. 10: Scores of speech quality (sp). Fig. 11: Scores of speech quality
(sp and F0).
The proposed algorithm works for both spectral parameters and F0.
Speech Synthesis
• Speech synthesis with GAN glottal waveform model (GlottGAN)
[Bollepalli et al., Interspeech 2017]
G maps acoustic features to a generated glottal waveform; D distinguishes
natural glottal waveforms from generated ones.
• Objective evaluations
Speech Synthesis (GlottGAN)
The proposed GAN-based approach can generate glottal
waveforms similar to the natural ones.
Fig. 12: Glottal pulses generated by GANs (G, D: DNN; conditional DNN;
deep CNN; deep CNN + LS loss).
Speech Synthesis
• Speech synthesis with GAN & multi-task learning
(SS-GAN-MTL) [Yang et al., ASRU 2017]
$G_{SS}$ maps linguistic features y (plus noise z) to generated speech
parameters; D judges generated vs. natural parameters x given y, and an
MSE term keeps the output close to the target:
$$V_{GAN}(G, D) = E_{x \sim p_{data}(x)}[\log D(x|y)] + E_{z \sim p_z}[\log(1 - D(G(z|y)|y))]$$
$$V_{L2}(G) = E_{z \sim p_z}[(G(z|y) - x)^2]$$
Speech Synthesis (SS-GAN-MTL)
• Variant with a phoneme classifier (GAN-PC): the discriminator is
replaced by $D_{CE}$, which is additionally trained with a cross-entropy
(CE) loss against the true label:
$$V_{GAN}(G, D) = E_{x \sim p_{data}(x)}[\log D_{CE}(x|y, label)] + E_{z \sim p_z}[\log(1 - D_{CE}(G(z|y)|y, label))]$$
$$V_{L2}(G) = E_{z \sim p_z}[(G(z|y) - x)^2]$$
• Objective and subjective evaluations
Table 11: Objective evaluation results. Fig. 13: The preference score (%).
Speech Synthesis (SS-GAN-MTL)
1. From objective evaluations, no remarkable difference is observed.
2. From subjective evaluations, GAN outperforms BLSTM and ASV,
while GAN-PC underperforms GAN.
• Convert (transform) speech from source to target
 Conventional VC approaches include Gaussian mixture model (GMM) [Toda
et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP
2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al.,
Interspeech 2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP
2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN)
[Nakashika et al., Interspeech 2014].
Voice Conversion
G Output
Objective function
Target
speaker
Source
speaker
Voice Conversion
• VAW-GAN [Hsu et al., Interspeech 2017]
 Conventional MMSE approaches often encounter the "over-smoothing"
issue.
 GAN is used as a new objective function to estimate G.
 The goal is to increase the naturalness, clarity, and similarity of the
converted speech: G converts source-speaker speech toward the target
speaker, and D judges real or fake:
$$V(G, D) = V_{GAN}(G, D) + \lambda V_{VAE}(x|y)$$
Voice Conversion (VAW-GAN)
• Objective and subjective evaluations
Fig. 14: The spectral envelopes. Fig. 15: MOS on naturalness.
VAW-GAN outperforms VAE in both objective and subjective evaluations,
generating more structured speech.
Voice Conversion
• Sequence-to-sequence VC with learned similarity metric (LSM)
[Kaneko et al., Interspeech 2017]
A converter C maps source-speaker speech x toward target-speaker speech
y; G (with noise z) and D form a GAN, and the similarity metric is
computed on discriminator-layer features $D_l$:
$$V(C, G, D) = V_{SVC}^{D_l}(C, D) + V_{GAN}(C, G, D)$$
$$V_{SVC}^{D_l}(C, D) = \frac{1}{M_l}\left\| D_l(y) - D_l(C(x)) \right\|^2$$
Voice Conversion (LSM)
• Spectrogram analysis
Fig. 16: Comparison of MCCs (upper) and STFT spectrograms (lower):
Source, Target, FVC, MSE (S2S), LSM (S2S).
The spectral textures of LSM are more similar to the target ones.
Voice Conversion (LSM)
• Subjective evaluations
Table 12: Preference scores for naturalness. Table 13: Preference scores
for clarity. Fig. 17: Similarity of TGT and SRC with VCs.
LSM outperforms FVC and MSE in terms of subjective evaluations.
Voice Conversion
• CycleGAN-VC [Kaneko et al., arXiv 2017]
• GAN is used as a new objective function to estimate G with non-parallel
data: $G_{S \to T}$ and $G_{T \to S}$ map between the source and target
speakers with cycle consistency (Source → Syn. Target → Source; Target →
Syn. Source → Target), and discriminators $D_S$, $D_T$ judge whether
speech belongs to each speaker:
$$V_{Full} = V_{GAN}(G_{S \to T}, D_T) + V_{GAN}(G_{T \to S}, D_S) + \lambda V_{Cyc}(G_{S \to T}, G_{T \to S})$$
Voice Conversion (CycleGAN-VC)
• Subjective evaluations
Fig. 18: MOS for naturalness. Fig. 19: Similarity to source and to target
speakers (S: Source; T: Target; P: Proposed; B: Baseline).
1. The proposed method uses non-parallel data.
2. For naturalness, the proposed method outperforms the baseline.
3. For similarity, the proposed method is comparable to the baseline.
Voice Conversion
• Multi-target VC [Chou et al., arXiv 2018]
 Stage 1: an encoder Enc extracts a content code $enc(x)$ from speech x;
a classifier C (adversarial) tries to recognize the speaker from
$enc(x)$, while a decoder Dec reconstructs $dec(enc(x), y)$ from the code
and a speaker representation y (conversion: switch to y′).
 Stage 2: a generator G refines the converted output, and a combined D+C
judges real/fake (F/R) against real data and classifies the speaker ID.
Voice Conversion (Multi-target VC)
• Subjective evaluations
Fig. 20: Preference test results.
1. The proposed method uses non-parallel data.
2. The multi-target VC approach outperforms the one-stage-only variant.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in terms of
naturalness and similarity.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Speech, Speaker, Emotion Recognition and Lip-reading
(Classification Task)
Recap: an encoder E maps the input to an embedding z = g(x) and a
classifier outputs the label y = h(z); noisy data, accented speech, and
channel distortion (x̃) cause acoustic mismatch.
Speech Recognition
• Adversarial multi-task learning (AMT)
[Shinohara, Interspeech 2016]
Encoder E maps the input acoustic feature x to an intermediate
representation; output 1 (G) predicts the senone y, and output 2 (D)
predicts the domain z through a gradient reversal layer (GRL).
Objective functions:
$$V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G), \qquad
V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$$
Model update:
$$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G} \quad \text{(max classification accuracy)}$$
$$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D} \quad \text{(max domain accuracy)}$$
$$\theta_E \leftarrow \theta_E - \epsilon \left(\frac{\partial V_y}{\partial \theta_E} - \alpha \frac{\partial V_z}{\partial \theta_E}\right) \quad \text{(max classification accuracy, min domain accuracy; the sign flip comes from the GRL)}$$
• ASR results in known (k) and unknown (unk)
noisy conditions
Speech Recognition (AMT)
Table 13: WER of DNNs with single-task learning (ST) and AMT.
The AMT-DNN outperforms ST-DNN with yielding lower WERs.
Speech Recognition
• Domain adversarial training for accented ASR (DAT)
[Sun et al., ICASSP 2018]
Same architecture as AMT: encoder E, senone classifier G (output 1), and
domain classifier D (output 2) behind a gradient reversal layer (GRL),
with objective functions
$$V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G), \qquad
V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$$
and the same updates: $\theta_G$ and $\theta_D$ by gradient descent on
$V_y$ and $V_z$, respectively; $\theta_E$ descends $V_y$ while ascending
$V_z$ (max classification accuracy, min domain accuracy).
• ASR results on accented speech
Speech Recognition (DAT)
1. With labeled transcriptions, ASR performance notably improves.
Table 14: WER of the baseline and adapted model.
2. DAT is effective in learning features invariant to domain differences
with and without labeled transcriptions.
STD: standard speech
Speech Recognition
• Robust ASR using a GAN enhancer (GAN-Enhancer)
[Sriram et al., arXiv 2017]
Clean data x and noisy data x̃ are embedded by the same encoder E:
z = g(x) and z̃ = g(x̃); h(·) predicts the output label y from the
embedding.
Cross entropy with L1 enhancer:
$$H(h(\tilde{z}), y) + \lambda \frac{\|z - \tilde{z}\|_1}{\|z\|_1 + \|\tilde{z}\|_1 + \epsilon}$$
Cross entropy with GAN enhancer, where a discriminator D distinguishes
clean embeddings from noisy embeddings:
$$H(h(\tilde{z}), y) + \lambda V_{adv}(g(x), g(\tilde{x}))$$
• ASR results on far-field speech:
Speech Recognition (GAN-Enhancer)
GAN Enhancer outperforms the Augmentation and L1-
Enhancer approaches on far-field speech.
Table 15: WER of the GAN enhancer and the baseline methods.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Speaker Recognition
• Domain adversarial neural network (DANN)
[Wang et al., ICASSP 2018]
Enroll and test i-vectors are pre-processed by the DANN before scoring.
Inside the DANN, encoder E maps the input acoustic feature x to an
embedding; output 1 (G) predicts the speaker ID y, and output 2 (D)
predicts the domain z through a gradient reversal layer (GRL).
• Recognition results of domain mismatched conditions
Table 16: Performance of DAT and the state-of-the-art methods.
Speaker Recognition (DANN)
The DAT approach outperforms other methods with
achieving lowest EER and DCF scores.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Emotion Recognition
• Adversarial AE for emotion recognition (AAE-ER)
[Sahu et al., Interspeech 2017]
An autoencoder (encoder $g(\cdot)$, decoder $h(\cdot)$) whose code-vector
distribution is matched to a prior $q$ by a GAN discriminator; a
generator G can then synthesize features from the code distribution.
AE with GAN:
$$H(h(z), x) + \lambda V_{GAN}(q, g(x)), \qquad z = g(x)$$
• Recognition results of domain mismatched conditions:
Table 18: Classification results on real and synthesized features.
Emotion Recognition (AAE-ER)
Table 17: Classification results on different systems.
1. AAE alone could not yield performance improvements.
2. Using synthetic data from AAE can yield higher UAR.
Original
Training
data
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Lip-reading
• Domain adversarial training for lip-reading (DAT-LR)
[Wand et al., arXiv 2017]
Encoder E maps the input x to a representation; output 1 (G) predicts the
words y (word accuracy ~80%), and output 2 (D) predicts the speaker z
through a gradient reversal layer (GRL), with
$$V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G), \qquad
V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$$
and the same updates as in AMT/DAT: $\theta_G$ and $\theta_D$ by gradient
descent on $V_y$ and $V_z$; $\theta_E$ maximizes word classification
accuracy while minimizing speaker classification accuracy.
• Recognition results of speaker mismatched conditions
Lip-reading (DAT-LR)
Table 19: Performance of DAT and the baseline.
The DAT approach notably enhances the recognition
accuracies in different conditions.
Outline of Part II
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Speech Signal Generation (Regression Task)
G maps a paired input to an output under an objective function.
Speech, Speaker, Emotion Recognition and Lip-reading
(Classification Task)
An encoder E and classifier G map the input x to an embedding z = g(x)
and a label y = h(z); GAN-based methods address the acoustic mismatch
between clean training data and noisy, accented, or channel-distorted
test data x̃.
More GANs in Speech
Diagnosis of autism spectrum
Jun Deng, Nicholas Cummins, Maximilian Schmitt, Kun Qian, Fabien Ringeval, and Björn Schuller, Speech-based
Diagnosis of Autism Spectrum Condition by Generative Adversarial Network Representations, ACM DH, 2017.
Emotion recognition
Jonathan Chang, and Stefan Scherer, Learning Representations of Emotional Speech with Deep Convolutional
Generative Adversarial Networks, ICASSP, 2017.
Robust ASR
Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, and Yoshua Bengio,
Invariant Representations for Noisy Speech Recognition, arXiv, 2016.
Speaker verification
Hong Yu, Zheng-Hua Tan, Zhanyu Ma, and Jun Guo, Adversarial Network Bottleneck Features for Noise Robust
Speaker Verification, arXiv, 2017.
References
Speech enhancement (conventional methods)
• Yuxuan Wang and Deliang Wang, Cocktail Party Processing via Structured Prediction, NIPS 2012.
• Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, An Experimental Study on Speech Enhancement Based on Deep
Neural Networks," IEEE SPL, 2014.
• Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, A Regression Approach to Speech Enhancement Based on Deep
Neural Networks, IEEE/ACM TASLP, 2015.
• Xugang Lu, Yu Tsao, Shigeki Matsuda, Chiori Hori, Speech Enhancement Based on Deep Denoising Autoencoder,
Interspeech 2013.
• Zhuo Chen, Shinji Watanabe, Hakan Erdogan, John R. Hershey, Integration of Speech Enhancement and
Recognition Using Long-short term Memory Recurrent Neural Network, Interspeech 2015.
• Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and
Bjorn Schuller, Speech Enhancement with LSTM Recurrent Neural Networks and Its Application to Noise-robust
ASR, LVA/ICA, 2015.
• Szu-Wei Fu, Yu Tsao, and Xugang Lu, SNR-aware Convolutional Neural Network Modeling for Speech
Enhancement, Interspeech, 2016.
• Szu-Wei Fu, Yu Tsao, Xugang Lu, and Hisashi Kawai, End-to-end Waveform Utterance Enhancement for Direct
Evaluation Metrics Optimization by Fully Convolutional Neural Networks, arXiv, IEEE/ACM TASLP, 2018.
Speech enhancement (GAN-based methods)
• Pascual Santiago, Bonafonte Antonio, and Serra Joan, SEGAN: Speech Enhancement Generative Adversarial
Network, Interspeech, 2017.
• Michelsanti Daniel, and Zheng-Hua Tan, Conditional Generative Adversarial Networks for Speech Enhancement
and Noise-robust Speaker Verification, Interspeech, 2017.
• Donahue Chris, Li Bo, and Prabhavalkar Rohit, Exploring Speech Enhancement with Generative Adversarial
Networks for Robust Speech Recognition, ICASPP, 2018.
• Higuchi Takuya, Kinoshita Keisuke, Delcroix Marc, and Nakatani Tomohiro, Adversarial Training for Data-driven
Speech Enhancement without Parallel Corpus, ASRU, 2017.
Postfilter (conventional methods)
• Toda Tomoki, and Tokuda Keiichi, A Speech Parameter Generation Algorithm Considering Global Variance for
HMM-based Speech Synthesis, IEICE Trans. Inf. Syst., 2007.
• Sil’en Hanna, Helander Elina, Nurminen Jani, and Gabbouj Moncef, Ways to Implement Global Variance in
Statistical Speech Synthesis, Interspeech, 2012.
• Takamichi Shinnosuke, Toda Tomoki, Neubig Graham, Sakti Sakriani, and Nakamura Satoshi, A Postfilter to
Modify the Modulation Spectrum in HMM-based Speech Synthesis, ICASSP, 2014.
• Ling-Hui Chen, Tuomo Raitio, Cassia Valentini-Botinhao, Junichi Yamagishi, and Zhen-Hua Ling, DNN-based
Stochastic Postfilter for HMM-based Speech Synthesis, Interspeech, 2014.
• Ling-Hui Chen, Tuomo Raitio, Cassia Valentini-Botinhao, Zhen-Hua Ling, and Junichi Yamagishi, A Deep
Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis, IEEE/ACM TASLP, 2015.
Postfilter (GAN-based methods)
• Kaneko Takuhiro, Kameoka Hirokazu, Hojo Nobukatsu, Ijima Yusuke, Hiramatsu Kaoru, and Kashino Kunio,
Generative Adversarial Network-based Postfilter for Statistical Parametric Speech Synthesis, ICASSP, 2017.
• Kaneko Takuhiro, Takaki Shinji, Kameoka Hirokazu, and Yamagishi Junichi, Generative Adversarial Network-
Based Postfilter for STFT Spectrograms, Interspeech, 2017.
• Saito Yuki, Takamichi Shinnosuke, and Saruwatari Hiroshi, Training Algorithm to Deceive Anti-spoofing
Verification for DNN-based Speech Synthesis, ICASSP, 2017.
• Saito Yuki, Takamichi Shinnosuke, Saruwatari Hiroshi, Saito Yuki, Takamichi Shinnosuke, and Saruwatari Hiroshi,
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks, IEEE/ACM TASLP, 2018.
• Bajibabu Bollepalli, Lauri Juvela, and Alku Paavo, Generative Adversarial Network-based Glottal Waveform
Model for Statistical Parametric Speech Synthesis, Interspeech, 2017.
• Yang Shan, Xie Lei, Chen Xiao, Lou Xiaoyan, Zhu Xuan, Huang Dongyan, and Li Haizhou, Statistical Parametric
Speech Synthesis Using Generative Adversarial Networks Under a Multi-task Learning Framework, ASRU, 2017.
References
VC (conventional methods)
• Toda Tomoki, Black Alan W, and Tokuda Keiichi, Voice Conversion Based on Maximum Likelihood Estimation of
Spectral Parameter Trajectory, IEEE/ACM TASLP, 2007.
• Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, Voice Conversion Using Deep Neural Networks with
Layer-wise Generative Training, IEEE/ACM TASLP, 2014.
• Srinivas Desai, Alan W Black, B. Yegnanarayana, and Kishore Prahallad, Spectral mapping Using artificial Neural
Networks for Voice Conversion, IEEE/ACM TASLP, 2010.
• Nakashika Toru, Takiguchi Tetsuya, Ariki Yasuo, High-order Sequence Modeling Using Speaker-dependent
Recurrent Temporal Restricted Boltzmann Machines for Voice Conversion, Interspeech, 2014.
• Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, Sequence-to-sequence Voice
Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech, 2017.
• Zhizheng Wu, Tuomas Virtanen, Eng-Siong Chng, and Haizhou Li, Exemplar-based Sparse Representation with
Residual Compensation for Voice Conversion, IEEE/ACM TASLP, 2014.
• Szu-Wei Fu, Pei-Chun Li, Ying-Hui Lai, Cheng-Chien Yang, Li-Chun Hsieh, and Yu Tsao, Joint Dictionary Learning-
based Non-negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral
Surgery, IEEE TBME, 2017.
• Yi-Chiao Wu, Hsin-Te Hwang, Chin-Cheng Hsu, Yu Tsao, and Hsin-Min Wang, Locally Linear Embedding for
Exemplar-based Spectral Conversion, Interspeech, 2016.
VC (GAN-based methods)
• Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang Voice Conversion from Unaligned
Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks, arXiv, 2017.
• Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, Sequence-to-sequence Voice
Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech, 2017.
• Takuhiro Kaneko, and Hirokazu Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial
Networks, arXiv, 2017.
References
ASR
• Yusuke Shinohara, Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition.
Interspeech, 2016.
• Sining Sun, Ching-Feng Yeh, Mei-Yuh Hwang, Mari Ostendorf, and Lei Xie, Domain Adversarial Training for
Accented Speech Recognition, ICASSP, 2018
• Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain Speech Recognition Using Nonparallel
Corpora with Cycle-consistent Adversarial Networks, ASRU, 2017.
• Anuroop Sriram, Heewoo Jun, Yashesh Gaur, and Sanjeev Satheesh, Robust Speech Recognition Using
Generative Adversarial Networks, arXiv, 2017.
Speaker recognition
• Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li, Unsupervised Domain Adaptation via
Domain Adversarial Training for Speaker Recognition, ICASSP, 2018.
Emotion recognition
• Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Wael AbdAlmageed, and Carol Espy-Wilson, Adversarial Auto-
encoders for Speech Based Emotion Recognition. Interspeech, 2017.
Lipreading
• Michael Wand, and Jürgen Schmidhuber, Improving Speaker-Independent Lipreading with Domain-Adversarial
Training, arXiv, 2017.
References
• Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Carol Espy-Wilson, Smoothing Model Predictions using
Adversarial Training Procedures for Speech Based Emotion Recognition
• Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba, High-quality Nonparallel Voice Conversion
Based on Cycle-consistent Adversarial Network
• Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku,
Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks
• Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang (Fred) Juang, Adversarial Teacher-Student Learning for
Unsupervised Domain Adaptation
• Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang (Fred) Juang, Speaker-
Invariant Training via Adversarial Learning
• Sen Li, Stephane Villette, Pravin Ramadas, Daniel Sinder, Speech Bandwidth Extension using Generative
Adversarial Networks
• Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, Haizhou Li, Unsupervised Domain Adaptation via
Domain Adversarial Training for Speaker Recognition
• Hu Hu, Tian Tan, Yanmin Qian, Generative Adversarial Networks Based Data Augmentation for Noise Robust
Speech Recognition
• Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari, Text-to-speech Synthesis using STFT Spectra Based on Low-
/multi-resolution Generative Adversarial Networks
• Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ye Bai, Adversarial Multilingual Training for Low-resource Speech
Recognition
• Meet H. Soni, Neil Shah, Hemant A. Patil, Time-frequency Masking-based Speech Enhancement using Generative
Adversarial Network
• Taira Tsuchiya, Naohiro Tawara, Tetsuji Ogawa, Tetsunori Kobayashi, Speaker Invariant Feature Extraction for Zero-
resource Languages with Adversarial Learning
GANs in ICASSP 2018
• Jing Han, Zixing Zhang, Zhao Ren, Fabien Ringeval, Bjoern Schuller, Towards Conditional Adversarial Training
for Predicting Emotions from Speech
• Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, Bo Xu, CBLDNN-based Speaker-independent Speech Separation
via Generative Adversarial Training
• Anuroop Sriram, Heewoo Jun, Yashesh Gaur, Sanjeev Satheesh, Robust Speech Recognition using Generative
Adversarial Networks
• Cem Subakan, Paris Smaragdis, Generative Adversarial Source Separation,
• Ashutosh Pandey, Deliang Wang, On Adversarial Training and Loss Functions for Speech Enhancement
• Bin Liu, Shuai Nie, Yaping Zhang, Dengfeng Ke, Shan Liang, Wenju Liu, Boosting Noise Robustness of Acoustic
Model via Deep Adversarial Training
• Yang Gao, Rita Singh, Bhiksha Raj, Voice Impersonation using Generative Adversarial Networks
• Aditay Tripathi, Aanchan Mohan, Saket Anand, Maneesh Singh, Adversarial Learning of Raw Speech Features
for Domain Invariant Speech Recognition
• Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing Jang, SVSGAN: Singing Voice Separation via Generative Adversarial
Network
• Santiago Pascual, Maruchan Park, Joan Serra, Antonio Bonafonte, Kang-Hun Ahn, Language and Noise
Transfer in Speech Enhancement Generative Adversarial Network
GANs in ICASSP 2018
A promising research direction and still has room for further
improvements in the speech signal processing domain
Thank You Very Much
Tsao, Yu Ph.D.
yu.tsao@citi.sinica.edu.tw
https://www.citi.sinica.edu.tw/pages/yu.tsao/contact_zh.html
Part III: Natural
Language Processing
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Outline of Part III
Conditional Sequence Generation
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Conditional Sequence Generation
ASR: speech → 機器學習 ("machine learning"); Translation: 機器學習 →
"Machine Learning"; Chatbot: "How are you?" → "I am fine."
Each generator is a typical seq2seq model. With GAN, you can train the
seq2seq model in another way.
Review: Sequence-to-sequence
• Chat-bot as example: an Encoder reads the input sentence c and a
Decoder produces the output sentence x.
Training data: A: How are you? B: I'm good. …
Training criterion: maximize the likelihood of the reference reply
("I'm good."). But compare the outputs "Not bad" and "I'm John.": a human
judges "Not bad" better, while maximum likelihood alone cannot tell.
Outline of Part III
Improving Supervised Seq-to-seq Model
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Seq-to-seq Model
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Introduction
• Machine obtains feedback from the user: e.g., "How are you?" →
"Bye bye " earns reward -10; "Hello" → "Hi " earns reward 3.
• The chat-bot learns to maximize the expected reward.
Maximizing Expected Reward
The chatbot (En, De) reads input sentence c and produces response
sentence x; a human gives reward R(c, x). Learn to maximize the expected
reward via policy gradient. [Li, et al., EMNLP, 2016]
Policy Gradient
Policy Gradient - Implementation
Given current parameters $\theta^t$, sample
$(c^1, x^1), (c^2, x^2), \ldots, (c^N, x^N)$ and obtain rewards
$R(c^1, x^1), R(c^2, x^2), \ldots, R(c^N, x^N)$.
$$\theta^{t+1} \leftarrow \theta^t + \eta \nabla \bar{R}_{\theta^t}, \qquad
\nabla \bar{R}_{\theta^t} \approx \frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\, \nabla \log P_{\theta^t}(x^i|c^i)$$
If $R(c^i, x^i)$ is positive, update $\theta$ to increase
$P_\theta(x^i|c^i)$; if $R(c^i, x^i)$ is negative, update $\theta$ to
decrease $P_\theta(x^i|c^i)$.
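A minimal PyTorch sketch of this update: weight each sampled response's
log-probability by its reward. The `chatbot.log_prob(c, x)` interface is
an assumption for illustration, not an actual library API.

```python
import torch

def policy_gradient_step(chatbot, optimizer, batch):
    # batch: list of (c, x, R) -- input, sampled response, scalar reward
    loss = 0.0
    for c, x, R in batch:
        # negative sign: optimizers minimize, and we want to maximize
        # (1/N) sum_i R(c^i, x^i) log P_theta(x^i | c^i)
        loss = loss - R * chatbot.log_prob(c, x)
    loss = loss / len(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Positive R(c^i, x^i) increases P_theta(x^i | c^i); negative R decreases it.
```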
Comparison
Maximum Likelihood:
  Objective function: $\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(\hat{x}^i|c^i)$
  Gradient: $\frac{1}{N}\sum_{i=1}^{N} \nabla \log P_\theta(\hat{x}^i|c^i)$
  Training data: $\{(c^1, \hat{x}^1), \ldots, (c^N, \hat{x}^N)\}$ (a fixed training set)
Reinforcement Learning:
  Objective function: $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\log P_\theta(x^i|c^i)$
  Gradient: $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\,\nabla \log P_\theta(x^i|c^i)$
  Training data: $\{(c^1, x^1), \ldots, (c^N, x^N)\}$ obtained from interaction
Maximum likelihood is the special case with $R(c^i, x^i) = 1$;
reinforcement learning weights each example by $R(c^i, x^i)$.
Outline of Part III
Improving Supervised Seq-to-seq Model
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Seq-to-seq Model
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Conditional GAN
The chatbot (encoder-decoder) takes an input sentence c and produces a response sentence x (e.g., "I am busy."). A discriminator takes the pair (input sentence c, response sentence x) and outputs real or fake, using human dialogues as the real examples. The discriminator output serves as the "reward" for the chatbot. [Li, et al., EMNLP, 2017]
http://www.nipic.com/show/3/83/3936650kd7476069.html
Algorithm
• Initialize generator G (chatbot) and discriminator D
• In each iteration:
• Sample input c and response x from the training set (pairs of conditional input c and response x)
• Sample input c′ from the training set, and generate a response $\tilde{x} = G(c')$
• Update D to increase $D(c, x)$ and decrease $D(c', \tilde{x})$
• Update generator G (chatbot) to increase the scalar output of the discriminator

[Figure: the encoder-decoder chatbot decodes a response token by token, sampling a word (A or B) at each step after <BOS>; the discriminator maps the input/response pair to a scalar, which is used to update the chatbot.]

Can we use gradient ascent to update the generator's parameters? NO! Due to the sampling process at each decoding step, "discriminator + generator" is not differentiable.
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al, arXiv, 2016]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
“Reinforcement Learning”
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
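The Gumbel-softmax option above is not detailed in the tutorial, but for orientation, here is a minimal sketch of straight-through Gumbel-softmax sampling, which lets gradients flow through a discrete word choice; tensor shapes and the temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    """Straight-through Gumbel-softmax: a differentiable 'sampled word'.

    logits: (batch, vocab) unnormalized scores from the decoder.
    Returns a one-hot-like tensor whose gradient flows through the softmax.
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)
    # Forward pass uses the hard one-hot sample; backward uses y_soft.
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard - y_soft.detach() + y_soft
```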
Continuous Input for Discriminator
Instead of sampling a word (A or B) at each decoding step after <BOS>, use the word distribution output by the decoder as the input of the discriminator. This avoids the sampling process, so "discriminator + generator" becomes differentiable and we can do backpropagation now to update the chatbot's parameters.
What is the problem?
• A real sentence is a sequence of one-hot vectors, e.g. (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), …
• A generated "sentence" is a sequence of distributions, e.g. (0.9, 0.1, 0, 0, 0), (0.1, 0.9, 0, 0, 0), (0.1, 0.1, 0.7, 0.1, 0), (0, 0, 0.1, 0.8, 0.1), (0, 0, 0, 0.1, 0.9) — it can never be exactly 1-of-N.
The discriminator can immediately find the difference from this low-level cue alone. WGAN is helpful here.
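To give a sense of what "WGAN is helpful" means in practice, below is a minimal sketch of a WGAN-GP critic loss over such continuous word-distribution sequences; the `critic` network and the (batch, length, vocab) tensor layout are assumptions, not the tutorial's implementation:

```python
import torch

def wgan_gp_critic_loss(critic, real_onehot, fake_dist, gp_weight=10.0):
    """Sketch of a WGAN-GP critic loss over sentences.

    real_onehot: (batch, length, vocab) one-hot word sequences.
    fake_dist:   (batch, length, vocab) word distributions from the decoder.
    critic:      hypothetical network mapping such a tensor to (batch,) scores.
    """
    # Wasserstein objective: score real sentences high, fake ones low.
    loss = critic(fake_dist).mean() - critic(real_onehot).mean()

    # Gradient penalty on random interpolates between real and fake.
    eps = torch.rand(real_onehot.size(0), 1, 1)
    interp = (eps * real_onehot + (1 - eps) * fake_dist).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(dim=1)
    loss = loss + gp_weight * ((grad_norm - 1) ** 2).mean()
    return loss  # minimized w.r.t. the critic's parameters
```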
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al, arXiv, 2016]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
“Reinforcement Learning”
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
Reinforcement Learning?
• Consider the output of the discriminator as the reward
• Updating the generator to increase the discriminator output = maximizing the reward
• Using the formulation of policy gradient, replace the reward $R(c, x)$ with the discriminator output $D(c, x)$
• Different from typical RL: the discriminator (which plays the role of the reward function) is itself updated during training

Training alternates between two steps.

g-step: with generator parameters $\theta^t$, sample pairs $(c^1, x^1), (c^2, x^2), \ldots, (c^N, x^N)$ from the chatbot, score each with the discriminator, and update

$\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} \approx \frac{1}{N}\sum_{i=1}^{N} D(c^i, x^i)\, \nabla \log P_{\theta^t}(x^i \mid c^i)$

• If $D(c^i, x^i)$ is positive: update $\theta$ to increase $P_\theta(x^i \mid c^i)$
• If $D(c^i, x^i)$ is negative: update $\theta$ to decrease $P_\theta(x^i \mid c^i)$

d-step: feed real pairs $(c, x)$ from the data and generated pairs $(c, \tilde{x})$ to the discriminator D, and update D to separate real from fake.
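For concreteness, a sketch of the g-step that reuses the hypothetical `policy_gradient_step` from the earlier sketch; the `chatbot.sample` and `discriminator` interfaces are likewise assumptions:

```python
def gan_g_step(chatbot, discriminator, optimizer, inputs):
    """g-step sketch: the discriminator output plays the role of the reward."""
    batch = []
    for c in inputs:
        x = chatbot.sample(c)                # generate a response
        reward = discriminator(c, x).item()  # D(c, x) as the reward
        batch.append((c, x, reward))
    policy_gradient_step(chatbot, optimizer, batch)  # reuse earlier sketch
```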
Reward for Every Generation Step

$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} D(c^i, x^i)\, \nabla \log P_\theta(x^i \mid c^i)$

Example 1: $c^i$ = "What is your name?", $x^i$ = "I don't know". $D(c^i, x^i)$ is negative, so we update $\theta$ to decrease $\log P_\theta(x^i \mid c^i)$, where

$\log P_\theta(x^i \mid c^i) = \log P(x_1^i \mid c^i) + \log P(x_2^i \mid c^i, x_1^i) + \log P(x_3^i \mid c^i, x_{1:2}^i)$

This also decreases $P(\text{"I"} \mid c^i)$, even though "I" is a reasonable first word.

Example 2: $c^i$ = "What is your name?", $x^i$ = "I am John". $D(c^i, x^i)$ is positive, so we update $\theta$ to increase $\log P_\theta(x^i \mid c^i)$, which also increases $P(\text{"I"} \mid c^i)$.

With a sentence-level reward, every generation step receives the same credit.
Reward for Every Generation Step

Assign a reward to each generation step instead of to the whole sentence:

$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \left( Q(c^i, x_{1:t}^i) - b \right) \nabla \log P_\theta(x_t^i \mid c^i, x_{1:t-1}^i)$ ($b$ is a baseline)

compared with the sentence-level gradient $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^{N} D(c^i, x^i)\, \nabla \log P_\theta(x^i \mid c^i)$.

For example, with $h^i$ = "What is your name?" and $x^i$ = "I don't know":
$\log P_\theta(x^i \mid h^i) = \log P(\text{"I"} \mid c^i) + \log P(\text{"don't"} \mid c^i, \text{"I"}) + \log P(\text{"know"} \mid c^i, \text{"I don't"})$

Method 1: Monte Carlo (MC) Search [Yu, et al., AAAI, 2017]
Method 2: Discriminator For Partially Decoded Sequences [Li, et al., EMNLP, 2017]

Find more comparison in the survey papers.
[Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018]
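As a rough sketch of Method 1, $Q(c, x_{1:t})$ can be estimated by completing the partial sequence several times and averaging the discriminator scores; the `chatbot.rollout` and `discriminator` interfaces here are hypothetical:

```python
def mc_search_q(chatbot, discriminator, c, prefix, n_rollouts=5):
    """Estimate Q(c, x_{1:t}) by Monte Carlo search (sketch).

    Completes the partial response `prefix` n_rollouts times with the
    current chatbot and averages the discriminator scores of the
    resulting full sentences.
    """
    scores = []
    for _ in range(n_rollouts):
        full_response = chatbot.rollout(c, prefix)  # sample a completion
        scores.append(discriminator(c, full_response))
    return sum(scores) / len(scores)
```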
Experimental Results

Input: We've got to look for another route.
MLE: I'm sorry.
GAN: You're not going to be here for a while.

Input: You can save him by talking.
MLE: I don't know.
GAN: You know what's going on in there, you know what I mean?

• MLE frequently generates "I'm sorry", "I don't know", etc. (corresponding to fuzzy images?)
• GAN generates longer and more complex responses (however, no strong evidence shows that they are better)
More Applications
• Supervised machine translation [Wu, et al., arXiv
2017][Yang, et al., arXiv 2017]
• Supervised abstractive summarization [Liu, et al., AAAI
2018]
• Image/video caption generation [Rakshith Shetty, et al., ICCV
2017][Liang, et al., arXiv 2017]
If you are using seq2seq models,
consider improving them with GAN.
Outline of Part III
Conditional Sequence Generation
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Text Style Transfer
Domain X: positive sentences — "It is good." / "It's a good day." / "I love you."
Domain Y: negative sentences — "It is bad." / "It's a bad day." / "I don't love you."
Direct Transformation
Apply the Cycle GAN architecture to text: $G_{X \to Y}$ transforms a sentence to the other domain, $D_Y$ judges whether the result belongs to domain Y (a positive sentence or not), and $G_{Y \to X}$ transforms it back so that the reconstruction is as close as possible to the input — and symmetrically in the other direction, with $D_X$ judging whether a sentence belongs to domain X (a negative sentence or not).

Example: "It is bad." → "It is good." → "It is bad."; "I love you." → "I hate you." → "I love you."

Discrete? Sentences are discrete, so the transformation is applied on word embeddings. [Lee, et al., ICASSP, 2018]
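A minimal sketch of the combined objective on embedding sequences; all module names and the loss weighting are assumptions for illustration:

```python
import torch.nn.functional as F

def cycle_gan_text_losses(G_xy, G_yx, D_x, D_y, emb_x, emb_y, lam=10.0):
    """Cycle GAN losses on word-embedding sequences (sketch).

    emb_x, emb_y: (batch, length, dim) embedding sequences from the two
    styles; G_* map sequences between domains; D_* output realness scores.
    Returns the generators' loss; the discriminators are trained
    separately to score real sequences high and generated ones low.
    """
    fake_y, fake_x = G_xy(emb_x), G_yx(emb_y)
    # Adversarial terms: generated sequences should look like the target domain.
    adv = -D_y(fake_y).mean() - D_x(fake_x).mean()
    # Cycle-consistency terms: translating back should reconstruct the input.
    cyc = F.l1_loss(G_yx(fake_y), emb_x) + F.l1_loss(G_xy(fake_x), emb_y)
    return adv + lam * cyc
```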
• Negative sentence to positive sentence:
it's a crappy day → it's a great day
i wish you could be here → you could be here
it's not a good idea → it's good idea
i miss you → i love you
i don't love you → i love you
i can't do that → i can do that
i feel so sad → i happy
it's a bad day → it's a good day
it's a dummy day → it's a great day
sorry for doing such a horrible thing → thanks for doing a
great thing
my doggy is sick → my doggy is my doggy
my little doggy is sick → my little doggy is my little doggy
Title: SCALABLE SENTIMENT FOR SEQUENCE-TO-SEQUENCE CHATBOT RESPONSE WITH PERFORMANCE ANALYSIS
Session: Dialog Systems and Applications
Time: Wednesday, April 18, 08:30 - 10:30
Authors: Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-Shan Lee
Projection to Common Space
Encoders $EN_X$ and $EN_Y$ project sentences from the two domains (positive and negative sentences) into a common latent space; decoders $DE_X$ and $DE_Y$ generate sentences of each domain from the latent code. $D_X$ and $D_Y$ are the discriminators of the X and Y domains.
• The decoder hidden layer is used as the discriminator input. [Shen, et al., NIPS, 2017]
• A domain discriminator predicts whether a latent code comes from $EN_X$ or $EN_Y$; $EN_X$ and $EN_Y$ learn to fool the domain discriminator, so sentences from both domains are projected to the same distribution in the common space. [Zhao, et al., arXiv, 2017][Fu, et al., AAAI, 2018]
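A sketch of the domain-discriminator game on the shared latent space; `EN_x`, `EN_y`, and `domain_clf` (which outputs the probability that a code came from domain X) are hypothetical modules:

```python
import torch

def domain_adversarial_losses(EN_x, EN_y, domain_clf, sent_x, sent_y):
    """Sketch of the latent-space domain-discriminator objectives."""
    z_x, z_y = EN_x(sent_x), EN_y(sent_y)
    bce = torch.nn.functional.binary_cross_entropy
    ones, zeros = torch.ones(len(z_x)), torch.zeros(len(z_y))
    # Domain classifier: tell the two encoders' outputs apart.
    d_loss = bce(domain_clf(z_x.detach()), ones) + bce(domain_clf(z_y.detach()), zeros)
    # Encoders: fool the classifier so both domains share one distribution.
    enc_loss = bce(domain_clf(z_x), zeros) + bce(domain_clf(z_y), ones)
    return d_loss, enc_loss
```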
Outline of Part III
Conditional Sequence Generation
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Abstractive Summarization
• Machines can now do abstractive summarization by seq2seq (write summaries in their own words)
• [Figure: a seq2seq model maps a document to a summary; the training data consists of documents paired with reference summaries (summary 1, summary 2, summary 3, …)]
• Supervised: we need lots of labelled training data.
Unsupervised Abstractive Summarization
Seq2seq model (1) maps a long document to a short document — the summary? Seq2seq model (2) maps the short document back to a long document. The two seq2seq models are jointly learned to minimize the reconstruction error. We only need a lot of documents (no labelled summaries) to train the model.
Unsupervised Abstractive Summarization
This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation. Without further constraints, the "summary" in the middle is not readable. Because the latent word sequence is discrete, policy gradient is used for training.
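Schematically, the training can be sketched as below; the `summarizer.sample` and `reconstructor` interfaces are hypothetical, chosen only to show where policy gradient enters:

```python
import torch.nn.functional as F

def seq2seq2seq_losses(summarizer, reconstructor, long_doc_tokens):
    """Unsupervised summarization as a seq2seq2seq auto-encoder (sketch).

    summarizer.sample(doc) -> (summary_tokens, log_prob): a sampled
    discrete summary and its log-probability (hypothetical interface).
    reconstructor(summary, target) -> per-position vocabulary logits.
    """
    summary, log_prob = summarizer.sample(long_doc_tokens)
    logits = reconstructor(summary, long_doc_tokens)
    recon_err = F.cross_entropy(
        logits.view(-1, logits.size(-1)), long_doc_tokens.view(-1))
    reconstructor_loss = recon_err                    # trained by backprop
    # The summary is discrete, so the summarizer is trained by REINFORCE
    # with reward = negative reconstruction error.
    summarizer_loss = recon_err.detach() * log_prob
    return reconstructor_loss, summarizer_loss
```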
Unsupervised Abstractive Summarization
Add a discriminator (3): it takes human-written summaries as real examples and judges whether a generated summary is real or not. The summarizer (1) learns to make the discriminator consider its output real, which pushes the latent word sequence toward readable summaries.
Semi-supervised Learning (unpublished)
[Figure: results comparing unsupervised and semi-supervised training.]
Outline of Part III
Conditional Sequence Generation
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
Unsupervised Machine Translation
Treat the two languages as domain X and domain Y, with only unpaired monolingual corpora. [Alexis Conneau, et al., ICLR, 2018][Guillaume Lample, et al., ICLR, 2018]
Unsupervised learning with 10M sentences achieves performance comparable to supervised learning with 100K sentence pairs.
Unsupervised Speech Recognition
Can we achieve unsupervised speech recognition? Acoustic pattern discovery segments the audio into token sequences (e.g., p1 p3 p2 / p1 p4 p3 p5 p5 / p1 p5 p4 p3 / p1 p2 p3 p4), and a Cycle-GAN-style model maps these acoustic tokens to text sentences ("The dog is ……", "The cats are ……", "The woman is ……", "The man is ……", "The cat is ……"), e.g., learning that pattern p1 corresponds to "The". [Liu, et al., arXiv, 2018][Chen, et al., arXiv, 2018]
Unsupervised Speech Recognition
• Phoneme recognition; audio: TIMIT, text: WMT
[Figure: phoneme recognition accuracy, comparing supervised training with the WGAN-GP and Gumbel-softmax unsupervised approaches.]
Concluding Remarks
Conditional Sequence Generation
• RL (human feedback)
• GAN (discriminator feedback)
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
To Learn More …
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
You can learn more from the YouTube Channel
(in Mandarin)
Reference
• Conditional Sequence Generation
• Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan
Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP,
2016
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky,
Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017
• Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of
Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016
• Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu
Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative
Adversarial Networks, arXiv 2017
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence
Generative Adversarial Nets with Policy Gradient, AAAI 2017
Reference
• Conditional Sequence Generation
• Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron
Courville, Adversarial Generation of Natural Language, arXiv, 2017
• Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language
Generation with Recurrent Generative Adversarial Networks without Pre-
training, ICML workshop, 2017
• Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang,
Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an
Approximate Embedding Layer, EMNLP, 2017
• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron
Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training
Recurrent Networks, NIPS, 2016
• Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan
Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation,
ICML, 2017
• Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text
Generation via Adversarial Training with Leaked Information, AAAI, 2018
• Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun,
Adversarial Ranking for Language Generation, NIPS, 2017
• William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text
Generation via Filling in the______, ICLR, 2018
Reference
• Conditional Sequence Generation
• Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text
Generation: Past, Present and Beyond, arXiv, 2018
• Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun
Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation
Models, arXiv, 2018
• Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine
Translation with Conditional Sequence Generative Adversarial Nets, NAACL,
2018
• Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu,
Adversarial Neural Machine Translation, arXiv 2017
• Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative
Adversarial Network for Abstractive Text Summarization, AAAI 2018
• Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt
Schiele, Speaking the Same Language: Matching Machine to Human
Captions by Adversarial Training, ICCV 2017
• Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent
Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017
Reference
• Text Style Transfer
• Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style
Transfer in Text: Exploration and Evaluation, AAAI, 2018
• Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer
from Non-Parallel Text by Cross-Alignment, NIPS 2017
• Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee,
Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot
Response with Performance Analysis, ICASSP, 2018
• Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun,
Adversarially Regularized Autoencoders, arxiv, 2017
Reference
• Unsupervised Machine Translation
• Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic
Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICLR 2018
• Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised
Machine Translation Using Monolingual Corpora Only, ICLR 2018
• Unsupervised Speech Recognition
• Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely
Unsupervised Phoneme Recognition by Adversarially Learning Mapping
Relationships from Audio Embeddings, arXiv, 2018
• Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, Hung-yi Lee, Towards
Unsupervised Automatic Speech Recognition Trained by Unaligned Speech
and Text only, arXiv, 2018

Weitere ähnliche Inhalte

Was ist angesagt?

Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Vitaly Bondar
 
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Simplilearn
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsShunta Saito
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Edureka!
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...Edge AI and Vision Alliance
 
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Simplilearn
 
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ..."Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...Edge AI and Vision Alliance
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Simplilearn
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Preferred Networks
 
Autoencoders in Deep Learning
Autoencoders in Deep LearningAutoencoders in Deep Learning
Autoencoders in Deep Learningmilad abbasi
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...Edureka!
 
Convolutional neural network
Convolutional neural networkConvolutional neural network
Convolutional neural networkFerdous ahmed
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Computer Vision image classification
Computer Vision image classificationComputer Vision image classification
Computer Vision image classificationWael Badawy
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 

Was ist angesagt? (20)

Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)
 
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methods
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...
“An Introduction to Data Augmentation Techniques in ML Frameworks,” a Present...
 
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
 
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ..."Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 
Autoencoders in Deep Learning
Autoencoders in Deep LearningAutoencoders in Deep Learning
Autoencoders in Deep Learning
 
cnn ppt.pptx
cnn ppt.pptxcnn ppt.pptx
cnn ppt.pptx
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
 
Convolutional neural network
Convolutional neural networkConvolutional neural network
Convolutional neural network
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Computer Vision image classification
Computer Vision image classificationComputer Vision image classification
Computer Vision image classification
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 

Ähnlich wie ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing

Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...宏毅 李
 
Generative Adversarial Network and its Applications to Speech Processing an...
Generative Adversarial Network and its Applications to Speech Processing an...Generative Adversarial Network and its Applications to Speech Processing an...
Generative Adversarial Network and its Applications to Speech Processing an...宏毅 李
 
brief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANsbrief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANsParham Zilouchian
 
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processingNAVER Engineering
 
Reading group gan - 20170417
Reading group   gan - 20170417Reading group   gan - 20170417
Reading group gan - 20170417Shuai Zhang
 
[GAN by Hung-yi Lee]Part 1: General introduction of GAN
[GAN by Hung-yi Lee]Part 1: General introduction of GAN[GAN by Hung-yi Lee]Part 1: General introduction of GAN
[GAN by Hung-yi Lee]Part 1: General introduction of GANNAVER Engineering
 
Generative Adversarial Networks and Their Applications in Medical Imaging
Generative Adversarial Networks  and Their Applications in Medical ImagingGenerative Adversarial Networks  and Their Applications in Medical Imaging
Generative Adversarial Networks and Their Applications in Medical ImagingSanghoon Hong
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Sudeep Das, Ph.D.
 
Project_Final_Review.pdf
Project_Final_Review.pdfProject_Final_Review.pdf
Project_Final_Review.pdfDivyaGugulothu
 
GANs and Applications
GANs and ApplicationsGANs and Applications
GANs and ApplicationsHoang Nguyen
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
 
Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalEdwin Efraín Jiménez Lepe
 
Generative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsGenerative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsKyuhwan Jung
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentElectronic Arts / DICE
 
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...Catalina Arango
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationJason Anderson
 
GAN Deep Learning Approaches to Image Processing Applications (1).pptx
GAN Deep Learning Approaches to Image Processing Applications (1).pptxGAN Deep Learning Approaches to Image Processing Applications (1).pptx
GAN Deep Learning Approaches to Image Processing Applications (1).pptxRMDAcademicCoordinat
 

Ähnlich wie ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing (20)

Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...
 
Generative Adversarial Network and its Applications to Speech Processing an...
Generative Adversarial Network and its Applications to Speech Processing an...Generative Adversarial Network and its Applications to Speech Processing an...
Generative Adversarial Network and its Applications to Speech Processing an...
 
brief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANsbrief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANs
 
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
[GAN by Hung-yi Lee]Part 2: The application of GAN to speech and text processing
 
Reading group gan - 20170417
Reading group   gan - 20170417Reading group   gan - 20170417
Reading group gan - 20170417
 
[GAN by Hung-yi Lee]Part 1: General introduction of GAN
[GAN by Hung-yi Lee]Part 1: General introduction of GAN[GAN by Hung-yi Lee]Part 1: General introduction of GAN
[GAN by Hung-yi Lee]Part 1: General introduction of GAN
 
Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018
Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018
Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018
 
Generative Adversarial Networks and Their Applications in Medical Imaging
Generative Adversarial Networks  and Their Applications in Medical ImagingGenerative Adversarial Networks  and Their Applications in Medical Imaging
Generative Adversarial Networks and Their Applications in Medical Imaging
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
gan.pdf
gan.pdfgan.pdf
gan.pdf
 
Project_Final_Review.pdf
Project_Final_Review.pdfProject_Final_Review.pdf
Project_Final_Review.pdf
 
GANs and Applications
GANs and ApplicationsGANs and Applications
GANs and Applications
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image Retrieval
 
Generative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsGenerative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging Applications
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
 
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image Generation
 
GAN Deep Learning Approaches to Image Processing Applications (1).pptx
GAN Deep Learning Approaches to Image Processing Applications (1).pptxGAN Deep Learning Approaches to Image Processing Applications (1).pptx
GAN Deep Learning Approaches to Image Processing Applications (1).pptx
 

Kürzlich hochgeladen

Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptxNikhil Raut
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESNarmatha D
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsDILIPKUMARMONDAL6
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 

Kürzlich hochgeladen (20)

Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIES
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teams
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 

ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing

  • 1. Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Hung-yi Lee and Yu Tsao
  • 2. Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017 All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo GAN ACGAN BGAN DCGAN EBGAN fGAN GoGAN CGAN ……
  • 3. ICASSP Keyword search on session index page, so session names are included. Number of papers whose titles include the keyword 0 5 10 15 20 25 30 35 40 45 2012 2013 2014 2015 2016 2017 2018 generative adversarial reinforcement It is a wise choice to attend this tutorial.
  • 4. Outline Part I: General Introduction of Generative Adversarial Network (GAN) Part II: Applications to Signal Processing Part III: Applications to Natural Language Processing
  • 5. Part I: General Introduction Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing
  • 6. Outline of Part 1 Generation by GAN Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 7. Outline of Part 1 Generation by GAN • Image Generation as Example • Theory behind GAN • Issues and Possible Solutions Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 9. Basic Idea of GAN Generator It is a neural network (NN), or a function. Generator 0.1 −3 ⋮ 2.4 0.9 imagevector Generator 3 −3 ⋮ 2.4 0.9 Generator 0.1 2.1 ⋮ 5.4 0.9 Generator 0.1 −3 ⋮ 2.4 3.5 high dimensional vector Powered by: http://mattya.github.io/chainer-DCGAN/ Each dimension of input vector represents some characteristics. Longer hair blue hair Open mouth
  • 10. Discri- minator scalar image Basic Idea of GAN It is a neural network (NN), or a function. Larger value means real, smaller value means fake. Discri- minator Discri- minator Discri- minator1.0 1.0 0.1 Discri- minator 0.1
  • 11. • Initialize generator and discriminator • In each training iteration: DG sample generated objects G Algorithm D Update vector vector vector vector 0000 1111 randomly sampled Database Step 1: Fix generator G, and update discriminator D Discriminator learns to assign high scores to real objects and low scores to generated objects. Fix
  • 12. • Initialize generator and discriminator • In each training iteration: DG Algorithm Step 2: Fix discriminator D, and update generator G Discri- minator NN Generator vector 0.13 hidden layer update fix Gradient Ascent large network Generator learns to “fool” the discriminator
  • 13. • Initialize generator and discriminator • In each training iteration: DG Learning D Sample some real objects: Generate some fake objects: G Algorithm D Update Learning G G D image 1111 image image image 1 update fix 0000vector vector vector vector vector vector vector vector fix
  • 14. Anime Face Generation 100 updates Source of training data: https://zhuanlan.zhihu.com/p/24767059
  • 21. The faces generated by machine. The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung- Han Wu.
  • 23. (Variational) Auto-encoder As close as possible NN Encoder NN Decoder code NN Decoder code Randomly generate a vector as code Image ? = Generator = Generator
  • 24. Auto-encoder v.s. GAN As close as possibleNN Decoder code Genera- tor code = Generator Discri- minator If discriminator does not simply memorize the images, Generator learns the patterns of faces. Just copy an image 1 Auto-encoder GAN Fuzzy …
  • 25. FID[Martin Heusel, et al., NIPS, 2017]: Smaller is better [Mario Lucic, et al. arXiv, 2017]
  • 26. Outline of Part 1 Generation • Image Generation as Example • Theory behind GAN • Issues and Possible Solutions Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 27. Generator • A generator G is a network. The network defines a probability distribution 𝑃𝐺 generator G𝑧 𝑥 = 𝐺 𝑧 Normal Distribution 𝑃𝐺(‫)ݔ‬ 𝑃𝑑𝑎𝑡𝑎 𝑥 as close as possible How to compute the divergence? 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Divergence between distributions 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 𝑥: an image (a high- dimensional vector)
  • 28. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Although we do not know the distributions of 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎, we can sample from them. sample G vector vector vector vector sample from normal Database Sampling from 𝑷 𝑮 Sampling from 𝑷 𝒅𝒂𝒕𝒂
  • 29. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train 𝑉 𝐺, 𝐷 = 𝐸 𝑥∼𝑃 𝑑𝑎𝑡𝑎 𝑙𝑜𝑔𝐷 𝑥 + 𝐸 𝑥∼𝑃 𝐺 𝑙𝑜𝑔 1 − 𝐷 𝑥 Example Objective Function for D (G is fixed) 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺Training: Using the example objective function is exactly the same as training a binary classifier. [Goodfellow, et al., NIPS, 2014] The maximum objective value is related to JS divergence.
  • 30. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train hard to discriminatesmall divergence Discriminator train easy to discriminatelarge divergence 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 Training: (cannot make objective large)
  • 31. 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎max 𝐷 𝑉 𝐺, 𝐷 The maximum objective value is related to JS divergence. • Initialize generator and discriminator • In each training iteration: Step 1: Fix generator G, and update discriminator D Step 2: Fix discriminator D, and update generator G 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 [Goodfellow, et al., NIPS, 2014]
  • 32. Using the divergence you like  [Sebastian Nowozin, et al., NIPS, 2016] Can we use other divergence?
  • 33. Outline of Part 1 Generation • Image Generation as Example • Theory behind GAN • Issues and Possible Solutions Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning More tips and tricks: https://github.com/soumith/ganhacks
  • 34. JS divergence is not suitable • In most cases, 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 are not overlapped. • 1. The nature of data • 2. Sampling Both 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 are low-dim manifold in high-dim space. 𝑃𝑑𝑎𝑡𝑎 𝑃𝐺 The overlap can be ignored. Even though 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 have overlap. If you do not have enough sampling ……
  • 35. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? …… JS divergence is log2 if two distributions do not overlap. Intuition: If two distributions do not overlap, binary classifier achieves 100% accuracy Equally bad Same objective value is obtained. Same divergence
  • 36. Wasserstein distance • Considering one distribution P as a pile of earth, and another distribution Q as the target • The average distance the earth mover has to move the earth. 𝑃 𝑄 d 𝑊 𝑃, 𝑄 = 𝑑
  • 37. Wasserstein distance Source of image: https://vincentherrmann.github.io/blog/wasserstein/ 𝑃 𝑄 Using the “moving plan” with the smallest average distance to define the Wasserstein distance. There are many possible “moving plans”. Smaller distance? Larger distance?
  • 38. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? 𝑊 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑0 𝑊 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑1 𝑊 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 𝑑0 𝑑1 …… …… Better!
  • 39. WGAN 𝑉 𝐺, 𝐷 = max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 Evaluate wasserstein distance between 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 [Martin Arjovsky, et al., arXiv, 2017] How to fulfill this constraint?D has to be smooth enough. real −∞ generated D ∞ Without the constraint, the training of D will not converge. Keeping the D smooth forces D(x) become ∞ and −∞
  • 40. • Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017] • Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017] • Spectral Normalization → Keep gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018] Force the parameters w between c and -c After parameter update, if w > c, w = c; if w < -c, w = -c Keep the gradient close to 1 𝑉 𝐺, 𝐷 = max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 real samples Keep the gradient close to 1 [Kodali, et al., arXiv, 2017] [Wei, et al., ICLR, 2018]
  • 41. Energy-based GAN (EBGAN) • Using an autoencoder as discriminator D Discriminator 0 for the best images Generator is the same. - 0.1 EN DE Autoencoder X -1 -0.1 [Junbo Zhao, et al., arXiv, 2016] Using the negative reconstruction error of auto-encoder to determine the goodness Benefit: The auto-encoder can be pre-train by real images without generator.
  • 42. Mode Collapse : real data : generated data Training with too many iterations ……
  • 43. Mode Dropping BEGAN on CelebA Generator at iteration t Generator at iteration t+1 Generator at iteration t+2 Generator switches mode during training
  • 44. Ensemble Generator 1 Generator 2 …… …… Train a set of generators: 𝐺1, 𝐺2, ⋯ , 𝐺 𝑁 Random pick a generator 𝐺𝑖 Use 𝐺𝑖 to generate the image To generate an image
  • 45. Objective Evaluation Off-the-shelf Image Classifier 𝑥 𝑃 𝑦|𝑥 Concentrated distribution means higher visual quality CNN𝑥1 𝑃 𝑦1|𝑥1 Uniform distribution means higher variety CNN𝑥2 𝑃 𝑦2|𝑥2 CNN𝑥3 𝑃 𝑦3|𝑥3 … 𝑃 𝑦 = 1 𝑁 𝑛 𝑃 𝑦 𝑛|𝑥 𝑛 [Tim Salimans, et al., NIPS, 2016] 𝑥: image 𝑦: class (output of CNN) e.g. Inception net, VGG, etc. class 1 class 2 class 3
  • 46. Objective Evaluation = 𝑥 𝑦 𝑃 𝑦|𝑥 𝑙𝑜𝑔𝑃 𝑦|𝑥 − 𝑦 𝑃 𝑦 𝑙𝑜𝑔𝑃 𝑦 Negative entropy of P(y|x) Entropy of P(y) Inception Score 𝑃 𝑦 = 1 𝑁 𝑛 𝑃 𝑦 𝑛 |𝑥 𝑛 𝑃 𝑦|𝑥 class 1 class 2 class 3
  • 47. Outline of Part 1 Generation Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 48. • Original Generator • Conditional Generator G condition c 𝑧 𝑥 = 𝐺 𝑧 Normal Distribution G 𝑃𝐺 𝑥 → 𝑃𝑑𝑎𝑡𝑎 𝑥 𝑧 Normal Distribution 𝑥 = 𝐺 𝑐, 𝑧 𝑃𝐺 𝑥|𝑐 → 𝑃𝑑𝑎𝑡𝑎 𝑥|𝑐 [Mehdi Mirza, et al., arXiv, 2014] NN Generator “Girl with red hair and red eyes” “Girl with yellow ribbon” e.g. Text-to-Image
  • 49. Target of NN output Text-to-Image • Traditional supervised approach NN Image Text: “train” a dog is running a bird is flying A blurry image! c1: a dog is running as close as possible
  • 50. Conditional GAN D (original) scalar𝑥 G 𝑧Normal distribution x = G(c,z) c: train x is real image or not Image Real images: Generated images: 1 0 Generator will learn to generate realistic images …. But completely ignore the input conditions. [Scott Reed, et al, ICML, 2016]
  • 51. Conditional GAN D (better) scalar 𝑐 𝑥 True text-image pairs: G 𝑧Normal distribution x = G(c,z) c: train Image x is realistic or not + c and x are matched or not (train , ) (train , )(cat , ) [Scott Reed, et al, ICML, 2016] 1 00
  • 52. x is realistic or not + c and x are matched or not Conditional GAN - Discriminator [Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017] condition c object x Network Network Network score Network Network (almost every paper) condition c object x c and x are matched or not x is realistic or not
  • 53. Conditional GAN paired data blue eyes red hair short hair Collecting anime faces and the description of its characteristics red hair, green eyes blue hair, red eyes The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung- Han Wu.
  • 54. Conditional GAN - Image-to-image G 𝑧 x = G(c,z) 𝑐 [Phillip Isola, et al., CVPR, 2017 Image translation, or pix2pix
  • 55. as close as possible Conditional GAN - Image-to-image • Traditional supervised approach NN Image It is blurry. Testing: input L1 e.g. L1 [Phillip Isola, et al., CVPR, 2017
  • 56. Conditional GAN - Image-to-image Testing: input L1 GAN G 𝑧 Image D scalar GAN + L1 L1 [Phillip Isola, et al., CVPR, 2017
  • 57. Conditional GAN - Video Generation Generator Discrimi nator Last frame is real or generated Discriminator thinks it is real [Michael Mathieu, et al., arXiv, 2015]
  • 59. Domain Adversarial Training • Training and testing data are in different domains Training data: Testing data: Generator Generator The same distribution feature feature Take digit classification as example
  • 60. blue points red points Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Which domain? Always output zero vectors Domain Classifier Fails
  • 61. Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Label predictor Which digits? Not only cheat the domain classifier, but satisfying label predictor at the same time More speech-related applications in Part II. Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ] Which domain?
  • 62. Outline of Part 1 Generation Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 63. Domain X Domain Y male female It is good. It’s a good day. I love you. It is bad. It’s a bad day. I don’t love you. Unsupervised Conditional Generation Transform an object from one domain to another without paired data (e.g. style transfer) Part II Part III GDomain X Domain Y
  • 64. Unsupervised Conditional Generation • Approach 1: Direct Transformation • Approach 2: Projection to Common Space ?𝐺 𝑋→𝑌 Domain X Domain Y For texture or color change 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Larger change, only keep the semantics Domain YDomain X Face Attribute
  • 65. ? Direct Transformation 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y
  • 66. Direct Transformation 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input
  • 67. Direct Transformation 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input [Tomer Galanti, et al. ICLR, 2018] The issue can be avoided by network design. Simpler generator makes the input and output more closely related.
  • 68. Direct Transformation 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Encoder Network Encoder Network pre-trained as close as possible Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
  • 69. Direct Transformation 𝐺 𝑋→𝑌 𝐷 𝑌 Domain Y scalar Input image belongs to domain Y or not 𝐺Y→X as close as possible Lack of information for reconstruction [Jun-Yan Zhu, et al., ICCV, 2017] Cycle consistency
  • 70. Direct Transformation 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 71. Cycle GAN Dual GAN Disco GAN [Jun-Yan Zhu, et al., ICCV, 2017] [Zili Yi, et al., ICCV, 2017] [Taeksoo Kim, et al., ICML, 2017] For multiple domains, considering starGAN [Yunjey Choi, arXiv, 2017]
  • 72. Issue of Cycle Consistency • CycleGAN: a Master of Steganography [Casey Chu, et al., NIPS workshop, 2017] 𝐺Y→X𝐺 𝑋→𝑌 The information is hidden.
  • 73. Unsupervised Conditional Generation • Approach 1: Direct Transformation • Approach 2: Projection to Common Space ?𝐺 𝑋→𝑌 Domain X Domain Y For texture or color change 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Larger change, only keep the semantics Domain YDomain X Face Attribute
  • 74. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Face Attribute Projection to Common Space Target
  • 75. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Projection to Common Space Training
  • 76. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Because we train two auto-encoders separately … The images with the same attribute may not project to the same position in the latent space. 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Minimizing reconstruction error Projection to Common Space Training
  • 77. Sharing the parameters of encoders and decoders Projection to Common Space Training 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑋 𝐷𝐸 𝑌 Couple GAN[Ming-Yu Liu, et al., NIPS, 2016] UNIT[Ming-Yu Liu, et al., NIPS, 2017]
  • 78. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error The domain discriminator forces the outputs of 𝐸𝑁𝑋 and 𝐸𝑁𝑌 to have the same distribution. From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Projection to Common Space Training Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Guillaume Lample, et al., NIPS, 2017]
  • 79. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Projection to Common Space Training Cycle Consistency: Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017] Minimizing reconstruction error
  • 80. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Projection to Common Space Training Semantic Consistency: Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017] To the same latent space
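A minimal sketch of the training signal for this "projection to common space" approach, combining the per-domain reconstruction losses with the latent-space domain discriminator of slide 78 (all module names are our placeholders; the cited papers differ in details):

```python
import torch
import torch.nn.functional as F

def encoder_decoder_loss(EN_x, EN_y, DE_x, DE_y, D_lat, x, y, lam=1.0):
    zx, zy = EN_x(x), EN_y(y)
    # Reconstruction within each domain.
    rec = F.l1_loss(DE_x(zx), x) + F.l1_loss(DE_y(zy), y)
    # Encoders try to make the latent domain unpredictable: labels are
    # flipped relative to the discriminator's own training (X codes
    # labeled as Y and vice versa).
    dx, dy = D_lat(zx), D_lat(zy)
    fool = F.binary_cross_entropy_with_logits(dx, torch.zeros_like(dx)) \
         + F.binary_cross_entropy_with_logits(dy, torch.ones_like(dy))
    # D_lat itself is trained separately on zx.detach() / zy.detach()
    # with the true domain labels (X -> 1, Y -> 0).
    return rec + lam * fool
```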
  • 81. Outline of Part 1 Generation Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 82. Basic Components EnvActor Reward Function Video Game Go Get 20 scores when killing a monster The rule of GO You cannot control
  • 83. • Input of neural network: the observation of machine represented as a vector or a matrix • Output neural network : each action corresponds to a neuron in output layer … … NN as actor pixels fire right left Score of an action 0.7 0.2 0.1 Take the action based on the probability. Neural network as Actor
  • 84. Start with observation 𝑠1 Observation 𝑠2 Observation 𝑠3 Example: Playing Video Game Action 𝑎1: “right” Obtain reward 𝑟1 = 0 Action 𝑎2 : “fire” (kill an alien) Obtain reward 𝑟2 = 5
  • 85. Start with observation 𝑠1 Observation 𝑠2 Observation 𝑠3 Example: Playing Video Game After many turns Action 𝑎 𝑇 Obtain reward 𝑟𝑇 Game Over (spaceship destroyed) This is an episode. We want the total reward to be maximized. Total reward: R = Σ_{t=1}^{T} r_t
  • 86. Actor, Environment, Reward 𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠 𝑇, 𝑎 𝑇 Trajectory Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 ……
  • 87. Reward Function → Discriminator Reinforcement Learning v.s. GAN Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… R(τ) = Σ_{t=1}^{T} r_t Reward 𝑟1 Reward 𝑟2 The environment is a “black box”: you cannot use backpropagation through it. Actor → Generator Fixed updated updated
  • 88. Imitation Learning We have demonstrations of the expert. Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… reward function is not available 𝜏1, 𝜏2, ⋯ , 𝜏 𝑁 Each 𝜏 is a trajectory of the expert. Self-driving: record human drivers Robot: grab the arm of the robot
  • 89. Inverse Reinforcement Learning Reward Function Environment Optimal Actor Inverse Reinforcement Learning Using the reward function to find the optimal actor. Modeling reward can be easier. Simple reward function can lead to complex policy. Reinforcement Learning Expert demonstration of the expert 𝜏1, 𝜏2, ⋯ , 𝜏 𝑁
  • 90. Framework of IRL Expert 𝜋 Actor 𝜋 Obtain Reward Function R Expert trajectories 𝜏̂1, 𝜏̂2, ⋯ , 𝜏̂ 𝑁 Actor trajectories 𝜏1, 𝜏2, ⋯ , 𝜏 𝑁 Find an actor based on reward function R by reinforcement learning Find R such that Σ_{n=1}^{N} R(𝜏̂ 𝑛) > Σ_{n=1}^{N} R(𝜏 𝑛) Reward function → Discriminator Actor → Generator Reward Function R The expert is always the best.
  • 91. GAN IRL G D High score for real, low score for generated Find a G whose output obtains a large score from D Expert trajectories 𝜏̂1, ⋯ , 𝜏̂ 𝑁 Actor trajectories 𝜏1, ⋯ , 𝜏 𝑁 Reward Function Larger reward for expert 𝜏̂ 𝑛, lower reward for actor 𝜏 Find an actor that obtains a large reward
  • 92. Concluding Remarks Generation Conditional Generation Unsupervised Conditional Generation Relation to Reinforcement Learning
  • 93. Reference • Generation • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014 • Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016 • Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017 • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017 • Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016 • Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017 • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016 • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017
  • 94. Reference • Generation • Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On Convergence and Stability of GANs”, arXiv, 2017 • Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang, Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect, ICLR, 2018 • Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral Normalization for Generative Adversarial Networks, ICLR, 2018
  • 95. Reference • Conditional Generation • Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016 • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017 • Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015 • Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014 • Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018 • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017 • Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017
  • 96. Reference • Conditional Generation • Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation, ICML, 2015 • Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
  • 97. Reference • Unsupervised Conditional Generation • Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to- Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017 • Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017 • Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018 • Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017 • Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017 • Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to- Image Translation for Many-to-Many Mappings, arXiv, 2017
  • 98. Reference • Unsupervised Conditional Generation • Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017 • Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017 • Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016 • Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017 • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi- Domain Image-to-Image Translation, arXiv, 2017
  • 99. Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Part II: Speech Signal Processing
  • 100. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 101. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 102. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Output label Clean data E G 𝒚 𝒙 Emb. Noisy data 𝒙 𝒛 = 𝑔( 𝒙) 𝑔(∙) ℎ(∙) Accented speech 𝒙 Channel distortion 𝒙 Acoustic Mismatch
  • 103. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 104. Speech Enhancement • Typical objective function:  Mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018]. Enhancing G Output Objective function  GAN is used as a new objective function to estimate the parameters in G. Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016].
  • 105. Speech Enhancement • Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017] z
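As a concrete illustration of "GAN as the objective function", the SEGAN generator combines a least-squares adversarial term with an L1 term pulling the output toward the clean reference (λ = 100 in the paper). A hedged sketch under those assumptions; module and tensor names are ours, and the real SEGAN operates on raw waveforms with a discriminator conditioned on the noisy input:

```python
import torch

def segan_generator_loss(D, enhanced, clean, noisy, lam=100.0):
    # D is conditioned on the noisy input, mirroring the paper's setup.
    d_fake = D(enhanced, noisy)
    adv = 0.5 * torch.mean((d_fake - 1.0) ** 2)    # least-squares GAN term
    l1 = torch.mean(torch.abs(enhanced - clean))   # regression toward clean speech
    return adv + lam * l1
```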
  • 106. Table 1: Objective evaluation results. Table 2: Subjective evaluation results. Fig. 1: Preference test results. Speech Enhancement (SEGAN) SEGAN yields better speech enhancement results than Noisy and Wiener. • Experimental results
  • 107. • Pix2Pix [Michelsanti et al., Interspeech 2017] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 108. Fig. 2: Spectrogram comparison of Pix2Pix with baseline methods. Speech Enhancement (Pix2Pix) • Spectrogram analysis Pix2Pix outperforms STAT-MMSE and is competitive with DNN SE. NG-DNN STAT-MMSE Noisy Clean NG-Pix2Pix
  • 109. Table 3: Objective evaluation results. Speech Enhancement (Pix2Pix) • Objective evaluation and speaker verification test Table 4: Speaker verification results. 1. From the objective evaluations, Pix2Pix outperforms Noisy and MMSE and is competitive with DNN SE. 2. From the speaker verification results, Pix2Pix outperforms the baseline models when the clean training data is used.
  • 110. • Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 111. Fig. 2: Spectrogram comparison of FSEGAN with L1-trained method. Speech Enhancement (FSEGAN) • Spectrogram analysis FSEGAN reduces both additive noise and reverberant smearing.
  • 112. Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain. Speech Enhancement (FSEGAN) • ASR results 1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean. (2) FSEGAN outperforms SEGAN as front-ends. 2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline; (2) FSEGAN retraining slightly underperforms L1–based retraining.
  • 113. Speech Enhancement • Adversarial training based mask estimation (ATME) [Higuchi et al., ASRU 2017] V_Mask = E_{s_fake}[log(1 − D_S(s_fake; θ))] + E_{n_fake}[log(1 − D_N(n_fake; θ))] Estimated speech 𝑫 𝑵 𝑫 𝑺 𝑮 𝑴𝒂𝒔𝒌 Clean speech data Noise data Noisy data Estimated noise True noise True speech Noisy speech True or Fake True or Fake
  • 114. Fig. 3: Spectrogram comparison of (a) noisy; (b) MMSE with supervision; (c) ATME without supervision. Speech Enhancement (ATME) • Spectrogram analysis The proposed adversarial training mask estimation can capture speech/noise signals without supervised data. Speech mask, noise mask 𝑀𝑓,𝑡 𝑛
  • 115. Speech Enhancement (ATME) • Mask-based beamformer for robust ASR  The estimated mask parameters are used to compute the spatial covariance matrices for an MVDR beamformer.  ŝ_{f,t} = w_f^H y_{f,t}, where ŝ_{f,t} is the enhanced signal, y_{f,t} is the observation of M microphones, f and t are frequency and time indices, and w_f is the beamformer coefficient vector.  The MVDR solution is w_f = (R_f^{s+n})^{−1} h_f / (h_f^H (R_f^{s+n})^{−1} h_f).  To estimate the steering vector h_f, the spatial covariance matrix of the target signal is computed as R_f^s = R_f^{s+n} − R_f^n, where R_f^n = (Σ_t M_{f,t}^n y_{f,t} y_{f,t}^H) / (Σ_t M_{f,t}^n) and the noise mask M_{f,t}^n is computed by adversarial training.
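A numerical sketch of this mask-based MVDR pipeline in NumPy, our own illustration under the assumptions that `Y` holds the multi-channel STFT with shape (F, T, M), `mask_n` is the noise mask (F, T) from adversarial training, and the steering vector h_f is taken as the principal eigenvector of R_f^s:

```python
import numpy as np

def mvdr_weights(Y, mask_n, eps=1e-8):
    F_, T, M = Y.shape
    W = np.zeros((F_, M), dtype=complex)
    for f in range(F_):
        Yf = Y[f]                                   # (T, M)
        R_sn = Yf.T @ Yf.conj() / T                 # R_f^{s+n}
        w = mask_n[f] / (mask_n[f].sum() + eps)     # normalized noise mask
        R_n = (Yf * w[:, None]).T @ Yf.conj()       # mask-weighted R_f^n
        R_s = R_sn - R_n                            # target covariance R_f^s
        # h_f: principal eigenvector of R_s (R_s may be indefinite after
        # the subtraction; we decompose its Hermitian part).
        _, vecs = np.linalg.eigh(0.5 * (R_s + R_s.conj().T))
        h = vecs[:, -1]
        num = np.linalg.solve(R_sn, h)              # (R^{s+n})^{-1} h
        W[f] = num / (h.conj() @ num)               # MVDR coefficients
    return W   # enhanced signal: s_hat[f, t] = W[f].conj() @ Y[f, t]
```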
  • 116. Table 7: WERs (%) for the development and evaluation sets. • ASR results Speech Enhancement (ATME) 1. ATME provides significant improvements over Unprocessed. 2. Unsupervised ATME slightly underperforms supervised MMSE.
  • 117. 𝐺𝑆→𝑇 𝐺 𝑇→𝑆 as close as possible 𝐷 𝑇 Scalar: belongs to domain T or not 𝐺 𝑇→𝑆 𝐺𝑆→𝑇 as close as possible 𝐷𝑆 Scalar: belongs to domain S or not Speech Enhancement (AFT) • Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017] V_Full = V_GAN(G_{S→T}, D_T) + V_GAN(G_{T→S}, D_S) + λ V_Cyc(G_{S→T}, G_{T→S}) Noisy Enhanced Noisy Clean Syn. Noisy Clean
  • 118. • ASR results on noise robustness and style adaptation Table 8: Noise robust ASR. Table 9: Speaker style adaptation. 1. 𝐺 𝑇→𝑆 can transform acoustic features and effectively improve ASR results for both noisy and accented speech. 2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve ASR results for noisy speech. S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax); CSJ-APS: Spontaneous (formal); Speech Enhancement (AFT)
  • 119. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 120. • Postfilter for synthesized or transformed speech  Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil’en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015].  GAN is used as a new objective function to estimate the parameters in G. Postfilter Synthesized spectral texture Natural spectral texture G Output Objective function Speech synthesizer Voice conversion Speech enhancement
  • 121. • GAN postfilter [Kaneko et al., ICASSP 2017]  The traditional MMSE criterion results in statistical averaging.  GAN is used as a new objective function to estimate the parameters in G.  The proposed work intends to further improve the naturalness of synthesized speech or parameters from a synthesizer. Postfilter Synthesized Mel cepst. coef. Natural Mel cepst. coef. D Nature or Generated Generated Mel cepst. coef. G
  • 122. Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters. Postfilter (GAN-based Postfilter) • Spectrogram analysis GAN postfilter reconstructs spectral texture similar to the natural one.
  • 123. Fig. 5: Mel-cepstral trajectories (GANv: GAN was applied in voiced part). Fig. 6: Averaging difference in modulation spectrum per Mel- cepstral coefficient. Postfilter (GAN-based Postfilter) • Objective evaluations GAN postfilter reconstructs spectral texture similar to the natural one.
  • 124. Table 10: Preference score (%). Bold font indicates the numbers over 30%. Postfilter (GAN-based Postfilter) • Subjective evaluations 1. GAN postfilter significantly improves the synthesized speech. 2. GAN postfilter is effective particularly in voiced segments. 3. GANv outperforms GAN and is comparable to NAT.
  • 125. Postfilter (GAN-postfilter-STFT) • GAN postfilter for STFT spectrograms [Kaneko et al., Interspeech 2017]  GAN postfilter was applied on high-dimensional STFT spectrograms.  The spectrogram was partitioned into N bands (each band overlaps its neighboring bands).  The GAN-based postfilter was trained for each band.  The reconstructed spectrogram from each band was smoothly connected.
  • 126. • Spectrogram analysis Fig. 7: Spectrograms of: (1) SYN, (2) GAN, (3) Original (NAT) Postfilter (GAN-postfilter-STFT) GAN postfilter reconstructs spectral texture similar to the natural one.
  • 127. Speech Synthesis • Input: linguistic features; Output: speech parameters 𝑮 𝑺𝑺 Natural speech parameters 𝒄, generated speech parameters 𝒄̂ Linguistic features sp sp Objective function: minimum generation error (MGE), MSE • Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017] Discriminator loss: L_D(𝒄, 𝒄̂) = L_{D,1}(𝒄) + L_{D,0}(𝒄̂), with L_{D,1}(𝒄) = −(1/T) Σ_{t=1}^{T} log D(𝒄_t) …NAT and L_{D,0}(𝒄̂) = −(1/T) Σ_{t=1}^{T} log(1 − D(𝒄̂_t)) …SYN. Generator loss: L(𝒄, 𝒄̂) = L_G(𝒄, 𝒄̂) + ω_D (E_{L_G}/E_{L_D}) L_{D,1}(𝒄̂), i.e., minimum generation error (MGE) with adversarial loss. Gen. Nature 𝑫 𝑨𝑺𝑽 𝝓(∙) MGE
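The combined generator objective can be written compactly in code. A hedged PyTorch sketch, ours rather than the paper's implementation; we assume `D` outputs the probability that a parameter frame is natural, and `omega_D`, `EL_G`, `EL_D` are the scalar weight and expectation estimates named on the slide:

```python
import torch

def mge_with_adversarial_loss(c_hat, c, D, omega_D, EL_G, EL_D, eps=1e-8):
    L_G = torch.mean((c_hat - c) ** 2)       # minimum generation error (MSE) term
    p_nat = D(c_hat).clamp(min=eps)          # D(c_hat_t) in (0, 1]
    L_D1 = -torch.mean(torch.log(p_nat))     # push c_hat to be classified as natural
    return L_G + omega_D * (EL_G / EL_D) * L_D1
```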
  • 128. Fig. 8: Averaged GVs of MCCs. Speech Synthesis (ASV) • Objective and subjective evaluations 1. The proposed algorithm generates MCCs similar to the natural ones. Fig. 9: Scores of speech quality. 2. The proposed algorithm outperforms conventional MGE training.
  • 129. • Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018] Speech Synthesis 𝑮 𝑺𝑺 Natural speech parameters 𝒄, generated speech parameters 𝒄̂ Gen. Nature 𝑫 𝝓(∙) Linguistic features sp, f0, duration. Same criterion as the ASV approach: L_D(𝒄, 𝒄̂) = L_{D,1}(𝒄) + L_{D,0}(𝒄̂), with L_{D,1}(𝒄) = −(1/T) Σ_{t=1}^{T} log D(𝒄_t) …NAT and L_{D,0}(𝒄̂) = −(1/T) Σ_{t=1}^{T} log(1 − D(𝒄̂_t)) …SYN; L(𝒄, 𝒄̂) = L_G(𝒄, 𝒄̂) + ω_D (E_{L_G}/E_{L_D}) L_{D,1}(𝒄̂): minimum generation error (MGE) with adversarial loss. MGE
  • 130. Fig. 11: Scores of speech quality (sp and F0). . Speech Synthesis (SS-GAN) • Subjective evaluations Fig. 10: Scores of speech quality (sp). The proposed algorithm works for both spectral parameters and F0.
  • 131. Speech Synthesis G Natural speech parameters Generated speech parameters Gen. Nature 𝑫 Acoustic features • Speech synthesis with GAN glottal waveform model (GlottGAN) [Bollepalli et al., Interspeech 2017] Glottal waveform Glottal waveform
  • 132. • Objective evaluations Speech Synthesis (GlottGAN) The proposed GAN-based approach can generate glottal waveforms similar to the natural ones. Fig. 12: Glottal pulses generated by GANs. G, D: DNN G, D: conditional DNN G, D: Deep CNN G, D: Deep CNN + LS loss
  • 133. • Speech synthesis with GAN & multi-task learning (SS-GAN-MTL) [Yang et al., ASRU 2017] Speech Synthesis 𝑮 𝑺𝑺 Natural speech parameters 𝒙, generated speech parameters 𝒙̂ Gen. Nature 𝑫 Noise 𝒛 Linguistic features 𝒚 MSE V_GAN(G, D) = E_{𝒙∼p_data(𝒙)}[log D(𝒙|𝒚)] + E_{𝒛∼p_z}[log(1 − D(G(𝒛|𝒚)|𝒚))]; V_L2(G) = E_{𝒛∼p_z}[(G(𝒛|𝒚) − 𝒙)²]
  • 134. • Speech synthesis with GAN & multi-task learning (SS-GAN-MTL) [Yang et al., ASRU 2017] Speech Synthesis (SS-GAN-MTL) 𝑮 𝑺𝑺 Natural speech parameters 𝒙, generated speech parameters 𝒙̂ Gen. Nature 𝑫 / 𝑫 𝑪𝑬 Noise 𝒛 Linguistic features 𝒚 MSE CE True label Without the label: V_GAN(G, D) = E_{𝒙∼p_data(𝒙)}[log D(𝒙|𝒚)] + E_{𝒛∼p_z}[log(1 − D(G(𝒛|𝒚)|𝒚))]. With the auxiliary classifier: V_GAN(G, D_CE) = E_{𝒙∼p_data(𝒙)}[log D_CE(𝒙|𝒚, label)] + E_{𝒛∼p_z}[log(1 − D_CE(G(𝒛|𝒚)|𝒚, label))]. In both cases V_L2(G) = E_{𝒛∼p_z}[(G(𝒛|𝒚) − 𝒙)²].
  • 135. • Objective and subjective evaluations Table 11: Objective evaluation results. Fig. 13: The preference score (%). Speech Synthesis (SS-GAN-MTL) 1. From objective evaluations, no remarkable difference is observed. 2. From subjective evaluations, GAN outperforms BLSTM and ASV, while GAN-PC underperforms GAN.
  • 136. • Convert (transform) speech from source to target  Conventional VC approaches include Gaussian mixture model (GMM) [Toda et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech 2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP 2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN) [Nakashika et al., Interspeech 2014]. Voice Conversion G Output Objective function Target speaker Source speaker
  • 137. • VAW-GAN [Hsu et al., Interspeech 2017] Conventional MMSE approaches often encounter the “over-smoothing” issue.  GAN is used as a new objective function to estimate G.  The goal is to increase the naturalness, clarity, and similarity of converted speech. Voice Conversion D Real or Fake G Target speaker Source speaker V(G, D) = V_GAN(G, D) + λ V_VAE(𝒙|𝒚)
  • 138. • Objective and subjective evaluations Fig. 15: MOS on naturalness.Fig. 14: The spectral envelopes. Voice Conversion (VAW-GAN) VAW-GAN outperforms VAE in terms of objective and subjective evaluations with generating more structured speech.
  • 139. Voice Conversion • Sequence-to-sequence VC with learned similarity metric (LSM) [Kaneko et al., Interspeech 2017] D Real or Fake 𝑪 Target speaker Source speaker 𝑮 Noise 𝒛 V(C, G, D) = V_SVC^{D_l}(C, D) + V_GAN(C, G, D), where the similarity metric V_SVC^{D_l}(C, D) = (1/M_l) ‖D_l(𝒚) − D_l(C(𝒙))‖² is measured in the l-th hidden layer of the discriminator. 𝒚 Similarity metric
  • 140. • Spectrogram analysis Fig. 16: Comparison of MCCs (upper) and STFT spectrograms (lower). Voice Conversion (LSM) Source Target FVC MSE(S2S) LSM(S2S) The spectral textures of LSM are more similar to the target ones.
  • 141. • Subjective evaluations Table 12: Preference scores for naturalness. Table 12: Preference scores for clarity. Fig. 17: Similarity of TGT and SRC with VCs. Voice Conversion (LSM) LSM outperforms FVC and MSE in terms of subjective evaluations. Target speaker Source speaker
  • 142. • CycleGAN-VC [Kaneko et al., arXiv 2017] • GAN is used as a new objective function to estimate G: V_Full = V_GAN(G_{S→T}, D_T) + V_GAN(G_{T→S}, D_S) + λ V_Cyc(G_{S→T}, G_{T→S}) Voice Conversion 𝑮 𝑺→𝑻 𝐺 𝑇→𝑆 as close as possible 𝑫 𝑻 Scalar: belongs to domain T or not Scalar: belongs to domain S or not 𝐺 𝑇→𝑆 𝑮 𝑺→𝑻 as close as possible 𝑫 𝑺 Target Syn. Source Target Source Syn. Target Source
  • 143. • Subjective evaluations Fig. 18: MOS for naturalness. Fig. 19: Similarity of to source and to target speakers. S: Source; T:Target; P: Proposed; B:Baseline Voice Conversion (CycleGAN-VC) 1. The proposed method uses non-parallel data. 2. For naturalness, the proposed method outperforms baseline. 3. For similarity, the proposed method is comparable to the baseline. Target speaker Source speaker
  • 144. • Multi-target VC [Chou et al., arXiv 2018] Voice Conversion  Stage 1: encoder 𝑬nc with a speaker classifier C on 𝑒𝑛𝑐(𝒙); the decoder Dec reconstructs 𝑑𝑒𝑐(𝑒𝑛𝑐(𝒙), 𝒚) from the code and a speaker label 𝒚, or converts with another label 𝒚′.  Stage 2: a generator 𝑮 refines 𝑑𝑒𝑐(𝑒𝑛𝑐(𝒙), 𝒚″), and D+C judges real/fake (F/R) and speaker identity (ID) against real data 𝒙.
  • 145. • Subjective evaluations Voice Conversion (Multi-target VC) Fig. 20: Preference test results 1. The proposed method uses non-parallel data. 2. The multi-target VC approach outperforms the one-stage-only variant. 3. The multi-target VC approach is comparable to Cycle-GAN-VC in terms of naturalness and similarity.
  • 146. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 147. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Output label Clean data E G 𝒚 𝒙 Emb. Noisy data 𝒙 𝒛 = 𝑔( 𝒙) 𝑔(∙) ℎ(∙) Accented speech 𝒙 Channel distortion 𝒙 Acoustic Mismatch
  • 148. Speech Recognition • Adversarial multi-task learning (AMT) [Shinohara, Interspeech 2016] Output 1: Senone; Output 2: Domain; Input: Acoustic feature; E, G, D with GRL. Objective functions: V_y = −Σ_i log P(y_i|x_i; θ_E, θ_G); V_z = −Σ_i log P(z_i|x_i; θ_E, θ_D). Model update: θ_G ← θ_G − ε ∂V_y/∂θ_G (max senone classification accuracy); θ_D ← θ_D − ε ∂V_z/∂θ_D (max domain accuracy); θ_E ← θ_E − ε (∂V_y/∂θ_E − α ∂V_z/∂θ_E) (max senone classification accuracy and min domain accuracy); see the sketch below.
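Written out as one training step, the three updates above look as follows. A sketch with hand-rolled SGD and our own placeholder modules; the minus sign on the α term realizes "min domain accuracy" for the encoder, i.e., the same effect as the GRL sketch in Part I:

```python
import torch
import torch.nn.functional as F

def amt_step(E, G, D, x, y, z, eps=1e-3, alpha=1.0):
    pe, pg, pd = list(E.parameters()), list(G.parameters()), list(D.parameters())
    feat = E(x)
    V_y = F.cross_entropy(G(feat), y)   # senone loss (G returns logits)
    V_z = F.cross_entropy(D(feat), z)   # domain loss (D returns logits)
    gE_y = torch.autograd.grad(V_y, pe, retain_graph=True)
    gE_z = torch.autograd.grad(V_z, pe, retain_graph=True)
    gG = torch.autograd.grad(V_y, pg, retain_graph=True)
    gD = torch.autograd.grad(V_z, pd)
    with torch.no_grad():
        for p, gy, gz in zip(pe, gE_y, gE_z):
            p -= eps * (gy - alpha * gz)   # adversarial sign flip for the encoder
        for p, g in zip(pg, gG):
            p -= eps * g                   # senone classifier: descend V_y
        for p, g in zip(pd, gD):
            p -= eps * g                   # domain classifier: descend V_z
```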
  • 149. • ASR results in known (k) and unknown (unk) noisy conditions Speech Recognition (AMT) Table 13: WER of DNNs with single-task learning (ST) and AMT. The AMT-DNN outperforms ST-DNN with yielding lower WERs.
  • 150. Speech Recognition • Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP 2018] Output 1: Senone; Output 2: Domain; Input: Acoustic feature; E, G, D with GRL. Objective functions: V_y = −Σ_i log P(y_i|x_i; θ_E, θ_G); V_z = −Σ_i log P(z_i|x_i; θ_E, θ_D). Model update: θ_G ← θ_G − ε ∂V_y/∂θ_G (max senone classification accuracy); θ_D ← θ_D − ε ∂V_z/∂θ_D (max domain accuracy); θ_E ← θ_E − ε (∂V_y/∂θ_E − α ∂V_z/∂θ_E) (max senone classification accuracy and min domain accuracy).
  • 151. • ASR results on accented speech Speech Recognition (DAT) 1. With labeled transcriptions, ASR performance notably improves. Table 14: WER of the baseline and adapted model. 2. DAT is effective in learning features invariant to domain differences with and without labeled transcriptions. STD: standard speech
  • 152. Speech Recognition • Robust ASR using GAN enhancer (GAN-Enhancer) [Sriram et al., arXiv 2017] Cross entropy with L1 enhancer: H(h(𝒛̂), 𝐲) + λ ‖𝒛 − 𝒛̂‖₁ / (‖𝒛‖₁ + ‖𝒛̂‖₁ + ε). Cross entropy with GAN enhancer: H(h(𝒛̂), 𝐲) + λ V_adv(g(𝒙), g(𝒙̃)). Output label 𝒚; clean data 𝒙 with embedding 𝒛 = g(𝒙); noisy data 𝒙̃ with embedding 𝒛̂ = g(𝒙̃); E, G, D.
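The two enhancer criteria can be sketched side by side. Our own PyTorch sketch: `z` / `z_hat` are the clean/noisy embeddings, `logits = h(z_hat)` are the recognizer outputs, and we write the GAN enhancer with a vanilla generator-side adversarial loss while the paper's exact adversarial criterion may differ:

```python
import torch
import torch.nn.functional as F

def l1_enhancer_loss(logits, y, z, z_hat, lam, eps=1e-6):
    ce = F.cross_entropy(logits, y)
    # Normalized L1 distance between clean and noisy embeddings.
    l1 = (z - z_hat).abs().sum() / (z.abs().sum() + z_hat.abs().sum() + eps)
    return ce + lam * l1

def gan_enhancer_loss(logits, y, D, z_hat, lam):
    ce = F.cross_entropy(logits, y)
    # Generator side: make noisy embeddings indistinguishable from clean ones.
    d_fake = D(z_hat)
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return ce + lam * adv
```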
  • 153. • ASR results on far-field speech: Speech Recognition (GAN-Enhancer) GAN Enhancer outperforms the Augmentation and L1- Enhancer approaches on far-field speech. Fig. 15: WER of GAN enhancer and the baseline methods.
  • 154. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 155. Speaker Recognition • Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018] DANN DANN Pre- processing Pre- processing Scoring Enroll i-vector Test i-vector Output 2 Domain Output 1 Speaker ID Input Acoustic feature E G D GRL 𝑉𝑧𝑉𝑦 𝒛 𝒚 𝒙
  • 156. • Recognition results of domain mismatched conditions Table 16: Performance of DAT and the state-of-the-art methods. Speaker Recognition (DANN) The DAT approach outperforms the other methods, achieving the lowest EER and DCF scores.
  • 157. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 158. Emotion Recognition • Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017] AE with GAN: H(h(𝒛), 𝒙) + λ V_GAN(𝒒, g(𝒙)), where 𝒛 = g(𝒙) is the code and 𝒒 is the target distribution of code vectors. E D 𝒙 Emb. Syn. 𝒙̂ G
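A sketch of this adversarial auto-encoder objective (ours, not the paper's code; `enc`, `dec`, and `D` are placeholders, and the prior q is taken as a standard Gaussian purely for illustration):

```python
import torch
import torch.nn.functional as F

def aae_losses(enc, dec, D, x, lam=1.0):
    z = enc(x)
    rec = F.mse_loss(dec(z), x)          # H(h(z), x) reconstruction term
    prior = torch.randn_like(z)          # samples from the prior q
    # Discriminator: prior codes are "real", encoder codes are "fake".
    dp, dz = D(prior), D(z.detach())
    d_loss = F.binary_cross_entropy_with_logits(dp, torch.ones_like(dp)) \
           + F.binary_cross_entropy_with_logits(dz, torch.zeros_like(dz))
    # Encoder: reconstruction plus the GAN term matching g(x) to q.
    dz_g = D(z)
    g_loss = rec + lam * F.binary_cross_entropy_with_logits(dz_g, torch.ones_like(dz_g))
    return g_loss, d_loss
```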
  • 159. • Recognition results of domain mismatched conditions: Table 18: Classification results on real and synthesized features. Emotion Recognition (AAE-ER) Table 17: Classification results on different systems. 1. AAE alone could not yield performance improvements. 2. Using synthetic data from AAE can yield higher UAR. Original Training data
  • 160. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 161. Lip-reading • Domain adversarial training for lip-reading (DAT-LR) [Wand et al., arXiv 2017] Output 1: Words; Output 2: Speaker; E, G, D with GRL. Objective functions: V_y = −Σ_i log P(y_i|x_i; θ_E, θ_G); V_z = −Σ_i log P(z_i|x_i; θ_E, θ_D). Model update: θ_G ← θ_G − ε ∂V_y/∂θ_G (max word classification accuracy); θ_D ← θ_D − ε ∂V_z/∂θ_D (max speaker accuracy); θ_E ← θ_E − ε (∂V_y/∂θ_E − α ∂V_z/∂θ_E) (max word classification accuracy and min speaker accuracy). ~80% WAC
  • 162. • Recognition results of speaker mismatched conditions Lip-reading (DAT-LR) Table 19: Performance of DAT and the baseline. The DAT approach notably enhances the recognition accuracies in different conditions.
  • 163. Outline of Part II Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion
  • 164. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 165. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Output label Clean data E G 𝒚 𝒙 Emb. Noisy data 𝒙 𝒛 = 𝑔( 𝒙) 𝑔(∙) ℎ(∙) Accented speech 𝒙 Channel distortion 𝒙 Acoustic Mismatch
  • 166. More GANs in Speech Diagnosis of autism spectrum Jun Deng, Nicholas Cummins, Maximilian Schmitt, Kun Qian, Fabien Ringeval, and Björn Schuller, Speech-based Diagnosis of Autism Spectrum Condition by Generative Adversarial Network Representations, ACM DH, 2017. Emotion recognition Jonathan Chang, and Stefan Scherer, Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks, ICASSP, 2017. Robust ASR Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, and Yoshua Bengio, Invariant Representations for Noisy Speech Recognition, arXiv, 2016. Speaker verification Hong Yu, Zheng-Hua Tan, Zhanyu Ma, and Jun Guo, Adversarial Network Bottleneck Features for Noise Robust Speaker Verification, arXiv, 2017.
  • 167. References Speech enhancement (conventional methods) • Yuxuan Wang and Deliang Wang, Cocktail Party Processing via Structured Prediction, NIPS 2012. • Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, An Experimental Study on Speech Enhancement Based on Deep Neural Networks, IEEE SPL, 2014. • Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, A Regression Approach to Speech Enhancement Based on Deep Neural Networks, IEEE/ACM TASLP, 2015. • Xugang Lu, Yu Tsao, Shigeki Matsuda, Chiori Hori, Speech Enhancement Based on Deep Denoising Autoencoder, Interspeech 2013. • Zhuo Chen, Shinji Watanabe, Hakan Erdogan, John R. Hershey, Integration of Speech Enhancement and Recognition Using Long-short term Memory Recurrent Neural Network, Interspeech 2015. • Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Bjorn Schuller, Speech Enhancement with LSTM Recurrent Neural Networks and Its Application to Noise-robust ASR, LVA/ICA, 2015. • Szu-Wei Fu, Yu Tsao, and Xugang Lu, SNR-aware Convolutional Neural Network Modeling for Speech Enhancement, Interspeech, 2016. • Szu-Wei Fu, Yu Tsao, Xugang Lu, and Hisashi Kawai, End-to-end Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks, IEEE/ACM TASLP, 2018. Speech enhancement (GAN-based methods) • Pascual Santiago, Bonafonte Antonio, and Serra Joan, SEGAN: Speech Enhancement Generative Adversarial Network, Interspeech, 2017. • Michelsanti Daniel, and Zheng-Hua Tan, Conditional Generative Adversarial Networks for Speech Enhancement and Noise-robust Speaker Verification, Interspeech, 2017. • Donahue Chris, Li Bo, and Prabhavalkar Rohit, Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition, ICASSP, 2018. • Higuchi Takuya, Kinoshita Keisuke, Delcroix Marc, and Nakatani Tomohiro, Adversarial Training for Data-driven Speech Enhancement without Parallel Corpus, ASRU, 2017.
  • 168. Postfilter (conventional methods) • Toda Tomoki, and Tokuda Keiichi, A Speech Parameter Generation Algorithm Considering Global Variance for HMM-based Speech Synthesis, IEICE Trans. Inf. Syst., 2007. • Sil’en Hanna, Helander Elina, Nurminen Jani, and Gabbouj Moncef, Ways to Implement Global Variance in Statistical Speech Synthesis, Interspeech, 2012. • Takamichi Shinnosuke, Toda Tomoki, Neubig Graham, Sakti Sakriani, and Nakamura Satoshi, A Postfilter to Modify the Modulation Spectrum in HMM-based Speech Synthesis, ICASSP, 2014. • Ling-Hui Chen, Tuomo Raitio, Cassia Valentini-Botinhao, Junichi Yamagishi, and Zhen-Hua Ling, DNN-based Stochastic Postfilter for HMM-based Speech Synthesis, Interspeech, 2014. • Ling-Hui Chen, Tuomo Raitio, Cassia Valentini-Botinhao, Zhen-Hua Ling, and Junichi Yamagishi, A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis, IEEE/ACM TASLP, 2015. Postfilter (GAN-based methods) • Kaneko Takuhiro, Kameoka Hirokazu, Hojo Nobukatsu, Ijima Yusuke, Hiramatsu Kaoru, and Kashino Kunio, Generative Adversarial Network-based Postfilter for Statistical Parametric Speech Synthesis, ICASSP, 2017. • Kaneko Takuhiro, Takaki Shinji, Kameoka Hirokazu, and Yamagishi Junichi, Generative Adversarial Network- Based Postfilter for STFT Spectrograms, Interspeech, 2017. • Saito Yuki, Takamichi Shinnosuke, and Saruwatari Hiroshi, Training Algorithm to Deceive Anti-spoofing Verification for DNN-based Speech Synthesis, ICASSP, 2017. • Saito Yuki, Takamichi Shinnosuke, Saruwatari Hiroshi, Saito Yuki, Takamichi Shinnosuke, and Saruwatari Hiroshi, Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks, IEEE/ACM TASLP, 2018. • Bajibabu Bollepalli, Lauri Juvela, and Alku Paavo, Generative Adversarial Network-based Glottal Waveform Model for Statistical Parametric Speech Synthesis, Interspeech, 2017. • Yang Shan, Xie Lei, Chen Xiao, Lou Xiaoyan, Zhu Xuan, Huang Dongyan, and Li Haizhou, Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under a Multi-task Learning Framework, ASRU, 2017. References
  • 169. VC (conventional methods) • Toda Tomoki, Black Alan W, and Tokuda Keiichi, Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory, IEEE/ACM TASLP, 2007. • Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, Voice Conversion Using Deep Neural Networks with Layer-wise Generative Training, IEEE/ACM TASLP, 2014. • Srinivas Desai, Alan W Black, B. Yegnanarayana, and Kishore Prahallad, Spectral mapping Using artificial Neural Networks for Voice Conversion, IEEE/ACM TASLP, 2010. • Nakashika Toru, Takiguchi Tetsuya, Ariki Yasuo, High-order Sequence Modeling Using Speaker-dependent Recurrent Temporal Restricted Boltzmann Machines for Voice Conversion, Interspeech, 2014. • Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, Sequence-to-sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech, 2017. • Zhizheng Wu, Tuomas Virtanen, Eng-Siong Chng, and Haizhou Li, Exemplar-based Sparse Representation with Residual Compensation for Voice Conversion, IEEE/ACM TASLP, 2014. • Szu-Wei Fu, Pei-Chun Li, Ying-Hui Lai, Cheng-Chien Yang, Li-Chun Hsieh, and Yu Tsao, Joint Dictionary Learning- based Non-negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery, IEEE TBME, 2017. • Yi-Chiao Wu, Hsin-Te Hwang, Chin-Cheng Hsu, Yu Tsao, and Hsin-Min Wang, Locally Linear Embedding for Exemplar-based Spectral Conversion, Interspeech, 2016. VC (GAN-based methods) • Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks, arXiv, 2017. • Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, Sequence-to-sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, Interspeech, 2017. • Takuhiro Kaneko, and Hirokazu Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks, arXiv, 2017. References
  • 170. ASR • Yusuke Shinohara, Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition. Interspeech, 2016. • Sining Sun, Ching-Feng Yeh, Mei-Yuh Hwang, Mari Ostendorf, and Lei Xie, Domain Adversarial Training for Accented Speech Recognition, ICASSP, 2018 • Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain Speech Recognition Using Nonparallel Corpora with Cycle-consistent Adversarial Networks, ASRU, 2017. • Anuroop Sriram, Heewoo Jun, Yashesh Gaur, and Sanjeev Satheesh, Robust Speech Recognition Using Generative Adversarial Networks, arXiv, 2017. Speaker recognition • Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li, Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition, ICASSP, 2018. Emotion recognition • Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Wael AbdAlmageed, and Carol Espy-Wilson, Adversarial Auto- encoders for Speech Based Emotion Recognition. Interspeech, 2017. Lipreading • Michael Wand, and Jürgen Schmidhuber, Improving Speaker-Independent Lipreading with Domain-Adversarial Training, arXiv, 2017. References
  • 171. • Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Carol Espy-Wilson, Smoothing Model Predictions using Adversarial Training Procedures for Speech Based Emotion Recognition • Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba, High-quality Nonparallel Voice Conversion Based on Cycle-consistent Adversarial Network • Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku, Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang (Fred) Juang, Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation • Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang (Fred) Juang, Speaker- Invariant Training via Adversarial Learning • Sen Li, Stephane Villette, Pravin Ramadas, Daniel Sinder, Speech Bandwidth Extension using Generative Adversarial Networks • Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, Haizhou Li, Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition • Hu Hu, Tian Tan, Yanmin Qian, Generative Adversarial Networks Based Data Augmentation for Noise Robust Speech Recognition • Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari, Text-to-speech Synthesis using STFT Spectra Based on Low- /multi-resolution Generative Adversarial Networks • Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ye Bai, Adversarial Multilingual Training for Low-resource Speech Recognition • Meet H. Soni, Neil Shah, Hemant A. Patil, Time-frequency Masking-based Speech Enhancement using Generative Adversarial Network • Taira Tsuchiya, Naohiro Tawara, Tetsuji Ogawa, Tetsunori Kobayashi, Speaker Invariant Feature Extraction for Zero- resource Languages with Adversarial Learning GANs in ICASSP 2018
  • 172. • Jing Han, Zixing Zhang, Zhao Ren, Fabien Ringeval, Bjoern Schuller, Towards Conditional Adversarial Training for Predicting Emotions from Speech • Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, Bo Xu, CBLDNN-based Speaker-independent Speech Separation via Generative Adversarial Training • Anuroop Sriram, Heewoo Jun, Yashesh Gaur, Sanjeev Satheesh, Robust Speech Recognition using Generative Adversarial Networks • Cem Subakan, Paris Smaragdis, Generative Adversarial Source Separation, • Ashutosh Pandey, Deliang Wang, On Adversarial Training and Loss Functions for Speech Enhancement • Bin Liu, Shuai Nie, Yaping Zhang, Dengfeng Ke, Shan Liang, Wenju Liu, Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training • Yang Gao, Rita Singh, Bhiksha Raj, Voice Impersonation using Generative Adversarial Networks • Aditay Tripathi, Aanchan Mohan, Saket Anand, Maneesh Singh, Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition • Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing Jang, SVSGAN: Singing Voice Separation via Generative Adversarial Network • Santiago Pascual, Maruchan Park, Joan Serra, Antonio Bonafonte, Kang-Hun Ahn, Language and Noise Transfer in Speech Enhancement Generative Adversarial Network GANs in ICASSP 2018
  • 173. A promising research direction that still has room for further improvement in the speech signal processing domain Thank You Very Much Tsao, Yu, Ph.D. yu.tsao@citi.sinica.edu.tw https://www.citi.sinica.edu.tw/pages/yu.tsao/contact_zh.html
  • 174. Part III: Natural Language Processing Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing
  • 175. Outline of Part III Conditional Sequence Generation • RL (human feedback) • GAN (discriminator feedback) Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 176. Conditional Sequence Generation Generator 機 器 學 習 Generator Machine Learning Generator How are you? How are you I am fine. ASR Translation Chatbot The generator is a typical seq2seq model. With GAN, you can train seq2seq model in another way.
  • 177. Review: Sequence-to-sequence • Chat-bot as example Encoder Decoder Input sentence c output sentence x Training data: A: How are you ? B: I’m good. ………… How are you ? I’m good. Generator Output: Not bad I’m John. Maximize likelihood Training Criterion Human better better
  • 178. Outline of Part III Improving Supervised Seq-to-seq Model • RL (human feedback) • GAN (discriminator feedback) Unsupervised Seq-to-seq Model • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 179. Introduction • Machine obtains feedback from user • Chat-bot learns to maximize the expected reward https://image.freepik.com/free-vector/variety- of-human-avatars_23-2147506285.jpg How are you? Bye bye  Hello Hi  -10 3 http://www.freepik.com/free-vector/variety- of-human-avatars_766615.htm
  • 180. Maximizing Expected Reward Human Input sentence c response sentence x Chatbot En De response sentence x Input sentence c [Li, et al., EMNLP, 2016] reward 𝑅 𝑐, 𝑥 Learn to maximize expected reward Policy Gradient
  • 181. Policy Gradient - Implementation At iteration t with parameters θ^t, sample pairs (c^1, x^1), (c^2, x^2), …, (c^N, x^N) and obtain rewards R(c^1, x^1), R(c^2, x^2), …, R(c^N, x^N). Gradient: ∇R̄_{θ^t} ≈ (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇log P_{θ^t}(x^i|c^i). Update: θ^{t+1} ← θ^t + η ∇R̄_{θ^t}. If R(c^i, x^i) is positive, update θ to increase P_θ(x^i|c^i); if R(c^i, x^i) is negative, update θ to decrease P_θ(x^i|c^i).
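This update is plain REINFORCE; with automatic differentiation it reduces to a few lines. A minimal sketch (ours; we assume `log_probs` holds one summed log P_θ(x^i|c^i) per sampled pair, with gradients attached):

```python
import torch

def policy_gradient_step(log_probs, rewards, optimizer):
    # Maximizing (1/N) sum_i R(c^i, x^i) log P(x^i | c^i) equals minimizing
    # its negative. Subtracting a baseline from the rewards is a common
    # variance-reduction refinement.
    loss = -(rewards.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```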
  • 182. Comparison
Objective function: Maximum Likelihood: (1/N) Σ_{i=1}^{N} log P_θ(x̂^i|c^i). Reinforcement Learning: (1/N) Σ_{i=1}^{N} R(c^i, x^i) log P_θ(x^i|c^i).
Gradient: Maximum Likelihood: (1/N) Σ_{i=1}^{N} ∇log P_θ(x̂^i|c^i). Reinforcement Learning: (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇log P_θ(x^i|c^i).
Training data: Maximum Likelihood: {(c^1, x̂^1), …, (c^N, x̂^N)}, equivalent to setting R(c^i, x̂^i) = 1. Reinforcement Learning: {(c^1, x^1), …, (c^N, x^N)} obtained from interaction, weighted by R(c^i, x^i).
  • 183. Outline of Part III Improving Supervised Seq-to-seq Model • RL (human feedback) • GAN (discriminator feedback) Unsupervised Seq-to-seq Model • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation I am busy.
  • 184. Conditional GAN Discriminator Input sentence c response sentence x Real or fake http://www.nipic.com/show/3/83/3936650kd7476069.html human dialogues Chatbot En De response sentence x Input sentence c [Li, et al., EMNLP, 2017] “reward”
  • 185. Algorithm • Initialize generator G (chatbot) and discriminator D • In each iteration: • Sample input c and response 𝑥 from the training set • Sample input 𝑐′ from the training set, and generate response 𝑥̃ by G(𝑐′) • Update D to increase D(c, 𝑥) and decrease D(𝑐′, 𝑥̃) • Update generator G (chatbot) such that the discriminator assigns a large score to the chatbot's output Discriminator scalar update Training data: 𝑐 Chatbot En De Pairs of conditional input c and response x
  • 186. Can we use gradient ascent? NO! Due to the sampling process, “discriminator + generator” is not differentiable: starting from <BOS>, the decoder samples a discrete token (e.g. A or B) at every step. Discriminator Discriminator scalar update scalar Chatbot En De Update Parameters
  • 187. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al, arXiv, 2016] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] “Reinforcement Learning” • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 188. Use the distribution as the input of the discriminator to avoid the sampling process: starting from <BOS>, the decoder's softmax distribution over tokens (e.g. A/B) at each step is fed to the discriminator directly. Discriminator Discriminator scalar update scalar Chatbot En De Update Parameters We can do backpropagation now.
  • 189. What is the problem? • A real sentence is a sequence of one-hot vectors, e.g. (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), … • A generated sentence is a sequence of distributions, e.g. (0.9, 0.1, 0, 0, 0), (0.1, 0.9, 0, 0, 0), (0.1, 0.1, 0.7, 0.1, 0), … which can never be exactly 1-of-N. The discriminator can immediately find the difference. WGAN is helpful.
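Besides WGAN, the Gumbel-softmax entry in the taxonomy above tackles the same mismatch by drawing near-one-hot yet differentiable samples. A common formulation, sketched by us rather than taken from the cited paper:

```python
import torch

def gumbel_softmax_sample(logits, tau=1.0, hard=False):
    # Add Gumbel noise and take a temperature-controlled softmax.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y = torch.softmax((logits + gumbel) / tau, dim=-1)
    if hard:
        # Straight-through: one-hot in the forward pass, soft gradients backward.
        idx = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, idx, 1.0)
        y = (y_hard - y).detach() + y
    return y
```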
  • 190. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al, arXiv, 2016] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] “Reinforcement Learning” • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 191. Reinforcement Learning? • Consider the output of the discriminator as the reward • Update the generator to increase the discriminator output = to get maximum reward • Using the formulation of policy gradient, replace reward R(c, x) with discriminator output D(c, x) • Different from typical RL: the discriminator is also updated during training Discriminator scalar update Chatbot En De
  • 192. Alternate between a g-step and a d-step. d-step: train the discriminator D to score real pairs (c, x) high and generated pairs low. g-step: at iteration t, sample (c^1, x^1), …, (c^N, x^N) and update θ^{t+1} ← θ^t + η ∇R̄_{θ^t}, with ∇R̄_{θ^t} ≈ (1/N) Σ_{i=1}^{N} D(c^i, x^i) ∇log P_{θ^t}(x^i|c^i); that is, the reward R(c^i, x^i) of the previous slides is replaced by D(c^i, x^i). If D(c^i, x^i) is large, update θ to increase P_θ(x^i|c^i); if it is small, update θ to decrease P_θ(x^i|c^i).
  • 193. Reward for Every Generation Step ∇R̄_θ ≈ (1/N) Σ_{i=1}^{N} D(c^i, x^i) ∇log P_θ(x^i|c^i), where log P_θ(x^i|c^i) = log P(x_1^i|c^i) + log P(x_2^i|c^i, x_1^i) + log P(x_3^i|c^i, x_{1:2}^i). Example: for c^i = “What is your name?” and x^i = “I don’t know”, D(c^i, x^i) is negative, so θ is updated to decrease log P_θ(x^i|c^i), including the first-step term P(“I”|c^i). But for c^i = “What is your name?” and x^i = “I am John”, D(c^i, x^i) is positive, so θ is updated to increase log P_θ(x^i|c^i), including the very same P(“I”|c^i): a sentence-level reward sends conflicting signals to individual generation steps.
  • 194. Reward for Every Generation Step: replace the sentence-level gradient ∇R̄_θ ≈ (1/N) Σ_{i=1}^{N} D(c^i, x^i) ∇log P_θ(x^i|c^i) with a step-level one, ∇R̄_θ ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} (Q(c^i, x_{1:t}^i) − b) ∇log P_θ(x_t^i|c^i, x_{1:t−1}^i). Method 1: Monte Carlo (MC) search [Yu, et al., AAAI, 2017]. Method 2: a discriminator for partially decoded sequences [Li, et al., EMNLP, 2017], e.g. scoring the prefixes “I”, “I don’t”, “I don’t know” of x^i = “I don’t know” given c^i = “What is your name?”.
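Method 1 can be sketched as a simple rollout loop (our illustration; `G.rollout` is a hypothetical interface that completes a prefix by sampling from the generator, and `D` scores a full input/response pair):

```python
import torch

def mc_q_estimate(G, D, c, prefix, K=16):
    # Q(c, x_{1:t}): average discriminator score over K sampled completions
    # of the partial sequence, using the current generator as rollout policy.
    scores = [D(c, G.rollout(c, prefix)) for _ in range(K)]
    return torch.stack(scores).mean()
```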
  • 195. Find more comparison in the survey papers. [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018] Input We've got to look for another route. MLE I'm sorry. GAN You're not going to be here for a while. Input You can save him by talking. MLE I don't know. GAN You know what's going on in there, you know what I mean? • MLE frequently generates “I’m sorry”, “I don’t know”, etc. (corresponding to fuzzy images?) • GAN generates longer and more complex responses (however, no strong evidence shows that they are better) Experimental Results
  • 196. More Applications • Supervised machine translation [Wu, et al., arXiv 2017][Yang, et al., arXiv 2017] • Supervised abstractive summarization [Liu, et al., AAAI 2018] • Image/video caption generation [Rakshith Shetty, et al., ICCV 2017][Liang, et al., arXiv 2017] If you are using seq2seq models, consider to improve them by GAN.
  • 197. Outline of Part III Conditional Sequence Generation • RL (human feedback) • GAN (discriminator feedback) Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 198. Text Style Transfer Domain X Domain Y male female It is good. It’s a good day. I love you. It is bad. It’s a bad day. I don’t love you. positive sentences negative sentences
  • 199. Direct Transformation 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 200. Direct Transformation 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 negative sentence? positive sentence? It is bad. It is good. It is bad. I love you. I hate you. I love you. positive positive positivenegative negative negative
  • 201. Direct Transformation 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 negative sentence? positive sentence? It is bad. It is good. It is bad. I love you. I hate you. I love you. positive positive positivenegative negative negative Word embedding [Lee, et al., ICASSP, 2018] Discrete?
  • 202. • Negative sentence to positive sentence: it's a crappy day → it's a great day i wish you could be here → you could be here it's not a good idea → it's good idea i miss you → i love you i don't love you → i love you i can't do that → i can do that i feel so sad → i happy it's a bad day → it's a good day it's a dummy day → it's a great day sorry for doing such a horrible thing → thanks for doing a great thing my doggy is sick → my doggy is my doggy my little doggy is sick → my little doggy is my little doggy Title: SCALABLE SENTIMENT FOR SEQUENCE-TO-SEQUENCE CHATBOT RESPONSE WITH PERFORMANCE ANALYSIS Session: Dialog Systems and Applications Time: Wednesday, April 18, 08:30 - 10:30 Authors: Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-Shan Lee
  • 203. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Projection to Common Space
  • 204. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Projection to Common Space Positive Sentence Positive Sentence Negative Sentence Negative Sentence Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017] From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Zhao, et al., arXiv, 2017] [Fu, et al., AAAI, 2018]
  • 205. Outline of Part III Conditional Sequence Generation • RL (human feedback) • GAN (discriminator feedback) Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 206. Abstractive Summarization • Now machines can do abstractive summarization by seq2seq (writing summaries in their own words) summary 1 summary 2 summary 3 Training Data summary seq2seq (in its own words) Supervised: we need lots of labelled training data.
  • 207. Unsupervised Abstractive Summarization 1 2 Summary? Seq2seq Seq2seq long document long document short document The two seq2seq models are jointly trained to minimize the reconstruction error. Only a large collection of documents is needed to train the model.
  • 208. Unsupervised Abstractive Summarization 1 2 Summary? Seq2seq Seq2seq long document long document short document This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation. The latent “summary” may not be readable … Policy gradient is used, since the discrete latent sequence blocks backpropagation.
  • 209. Unsupervised Abstractive Summarization 1 2 Summary? Seq2seq Seq2seq long document long document short document 3 Human written summaries Real or not Discriminator Let Discriminator considers my output as real
  • 211. Outline of Part III Conditional Sequence Generation • RL (human feedback) • GAN (discriminator feedback) Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 212. Unsupervised Machine Translation Domain X Domain Y male female positive sentences [Alexis Conneau, et al., ICLR, 2018] [Guillaume Lample, et al., ICLR, 2018]
  • 213. Unsupervised learning with 10M sentences ≈ supervised learning with 100K sentence pairs
  • 214. Unsupervised Speech Recognition Can we achieve unsupervised speech recognition? Acoustic pattern discovery turns audio into token sequences (e.g., p1 p1 p3 p2, p1 p4 p3 p5 p5, p1 p5 p4, p3 p1 p2 p3 p4); a Cycle GAN then maps the discovered patterns to text (“The dog is ……”, “The cats are ……”, “The woman is ……”, “The man is ……”, “The cat is ……”). [Liu, et al., arXiv, 2018] [Chen, et al., arXiv, 2018]
  • 215. Unsupervised Speech Recognition • Phoneme recognition (audio: TIMIT; text: WMT). (Figure: accuracy of supervised training compared with unsupervised GAN training using WGAN-GP and Gumbel-softmax.)
  • 216. Concluding Remarks Conditional Sequence Generation • RL (human feedback) • GAN (discriminator feedback) Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation
  • 217. To Learn More … https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf You can learn more from the YouTube Channel (in Mandarin)
  • 218. Reference • Conditional Sequence Generation • Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP, 2016 • Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017 • Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016 • Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative Adversarial Networks, arXiv 2017 • Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, AAAI 2017
  • 219. Reference • Conditional Sequence Generation • Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville, Adversarial Generation of Natural Language, arXiv, 2017 • Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language Generation with Recurrent Generative Adversarial Networks without Pre- training, ICML workshop, 2017 • Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an Approximate Embedding Layer, EMNLP, 2017 • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training Recurrent Networks, NIPS, 2016 • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation, ICML, 2017 • Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text Generation via Adversarial Training with Leaked Information, AAAI, 2018 • Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun, Adversarial Ranking for Language Generation, NIPS, 2017 • William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text Generation via Filling in the______, ICLR, 2018
  • 220. Reference • Conditional Sequence Generation • Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text Generation: Past, Present and Beyond, arXiv, 2018 • Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation Models, arXiv, 2018 • Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, NAACL, 2018 • Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu, Adversarial Neural Machine Translation, arXiv 2017 • Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative Adversarial Network for Abstractive Text Summarization, AAAI 2018 • Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele, Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training, ICCV 2017 • Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017
  • 221. Reference • Text Style Transfer • Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style Transfer in Text: Exploration and Evaluation, AAAI, 2018 • Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer from Non-Parallel Text by Cross-Alignment, NIPS, 2017 • Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis, ICASSP, 2018 • Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun, Adversarially Regularized Autoencoders, arXiv, 2017
  • 222. Reference • Unsupervised Machine Translation • Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICLR, 2018 • Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only, ICLR, 2018 • Unsupervised Speech Recognition • Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings, arXiv, 2018 • Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, Hung-yi Lee, Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only, arXiv, 2018