InfoGAIL
1. Info-Wasserstein-GAIL
Yunzhu Li, Jiaming Song, and Stefano Ermon, "Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs", arXiv, 2017
Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)
4. • The goal of imitation learning is to match expert behavior.
• However, demonstrations often show significant variability due to latent factors.
• This paper presents the InfoGAIL algorithm, which can infer the latent structure of human decision-making.
• The method can not only imitate expert behavior but also learn interpretable representations.
Imitation Learning
5. • The goal of this paper is to develop an imitation learning framework that can autonomously discover and disentangle the latent factors of variation underlying human decision-making.
• Basically, the paper combines generative adversarial imitation learning (GAIL), InfoGAN, and Wasserstein GAN, together with some reward heuristics.
Introduction
6. • We will NOT go into the details of GAIL.
• However, we will cover some basics of policy gradient methods.
GAIL
12. Step-based PG (REINFORCE)
Now we have the REINFORCE algorithm!
This method has been used in many deep learning settings where the objective function is NOT differentiable (see the sketch below).
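A minimal sketch of one REINFORCE update (the gradient helper, learning rate, and data layout below are hypothetical placeholders, not from the paper). Note that only grad log pi is ever differentiated, so the reward itself never needs to be differentiable:

import numpy as np

def reinforce_update(theta, trajectories, grad_log_pi, lr=1e-2):
    # One REINFORCE step: theta <- theta + lr * E[ sum_t grad log pi(a_t|s_t) * R ].
    # trajectories: list of (states, actions, rewards) tuples.
    # grad_log_pi(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta
    # (a hypothetical helper supplied by the caller).
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        ret = float(np.sum(rewards))  # total return R of the trajectory
        for s, a in zip(states, actions):
            grad += grad_log_pi(theta, s, a) * ret
    grad /= len(trajectories)  # Monte Carlo average over sampled trajectories
    return theta + lr * grad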
13. Step-based PG (PG)
For all trajectories, and for all time steps within a trajectory, the policy gradient is simply a weighted MLE, where each weight is the sum of future rewards, i.e., the Q-value.
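Written out in standard policy-gradient notation (mine, not the slides'):

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \Big]

The inner sum of discounted future rewards is a sample estimate of Q^{\pi}(s_t, a_t), so each log-likelihood term is weighted exactly as the slide describes.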
14. • Now we know where Eq. (18) came from, right?
GAIL
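For reference, the standard GAIL objective from Ho & Ermon (2016) is shown below; whether this is exactly what is numbered (18) in the paper is not visible from these slides:

\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] - \lambda H(\pi)

where \pi_E is the expert policy and H(\pi) is the causal entropy of the policy, and the policy-gradient machinery above is what optimizes the inner expectation over \pi.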
15. • Interpretable Imitation Learning
• Utilized information-theoretic regularization.
• Simply added InfoGAN to GAIL.
• Utilizing Raw Visual Inputs via Transfer Learning
• Used a deep residual network (ResNet).
Visual InfoGAIL
16. • Rather than using a single unstructured noise vector, InfoGAN decomposes the input noise into two parts: (1) z, incompressible noise, and (2) c, the latent code that targets the salient, structured semantic features of the data distribution.
• InfoGAN proposes an information-theoretic regularization: there should be high mutual information between the latent codes c and the generator distribution G(z, c); that is, I(c; G(z, c)) should be high.
InfoGAN
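Concretely, since I(c; G(z, c)) is intractable, InfoGAN maximizes a variational lower bound L_I through an auxiliary posterior Q(c | x) (this is the standard InfoGAN formulation, restated here for reference):

\min_{G, Q} \max_{D} \; V_{\text{GAN}}(D, G) - \lambda L_I(G, Q), \qquad L_I(G, Q) = \mathbb{E}_{c \sim p(c),\, x \sim G(z, c)}[\log Q(c \mid x)] + H(c) \le I(c; G(z, c))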
17. • Reward Augmentation
• A general framework for incorporating prior knowledge into imitation learning by providing additional incentives to the agent without interfering with the imitation learning process.
• Added a surrogate state-based reward that reflects our biases over the desired behaviors.
• Can be seen as
• a hybrid between imitation and reinforcement learning, or
• side information provided to the generator.
• Wasserstein GAN (WGAN)
• The discriminator network in WGAN solves a regression problem instead of a classification problem (see the sketch below).
• Suffers less from the vanishing-gradient and mode-collapse problems.
Improved Optimization
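A minimal PyTorch-style sketch of the WGAN critic update (the critic, optimizer, and batch names are hypothetical placeholders): the critic outputs an unbounded score, so the loss is a difference of means rather than a cross-entropy, and weight clipping crudely enforces the Lipschitz constraint.

import torch

def wgan_critic_step(critic, optimizer, real_batch, fake_batch, clip=0.01):
    # The critic maximizes E[critic(real)] - E[critic(fake)], an unbounded
    # score difference (regression-like, not a classification loss),
    # so we minimize its negative.
    optimizer.zero_grad()
    loss = -(critic(real_batch).mean() - critic(fake_batch).mean())
    loss.backward()
    optimizer.step()
    # Weight clipping keeps the critic (approximately) 1-Lipschitz.
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return loss.item()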
49. • Variance Reduction
• Reduces variance in the policy gradient method.
• A replay-buffer method with prioritized replay.
• Good for cases where rewards are sparse.
• Baseline variance-reduction methods (see the sketch below).
Improved Optimization
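As a concrete instance of the baseline trick (names here are illustrative): subtracting an action-independent baseline b from the returns leaves the policy gradient unbiased, since E[grad log pi * b] = 0, while reducing its variance.

import numpy as np

def advantages_with_baseline(returns):
    # Subtract a constant baseline (the mean return) from each return.
    # For any action-independent b, E[grad log pi * b] = 0, so the policy
    # gradient stays unbiased while its variance shrinks.
    returns = np.asarray(returns, dtype=np.float64)
    return returns - returns.mean()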
50. Finally, InfoGAIL
• Initialize the policy from behavior cloning.
• Sample data, similar to InfoGAN.
• Update D, similar to WGAN.
• Update Q, similar to GAN/GAIL.
• Update the policy with TRPO.
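Putting the steps together, a hedged pseudocode sketch of this loop (every helper below is a hypothetical placeholder, not the authors' code):

def train_infogail(policy, critic, posterior, expert_data, n_iters=1000):
    # Step 0: initialize the policy from behavior cloning on expert data.
    behavior_cloning_init(policy, expert_data)
    for _ in range(n_iters):
        # Step 1 (InfoGAN-style): sample latent codes and roll out the policy.
        codes = sample_latent_codes()
        trajectories = collect_trajectories(policy, codes)
        # Step 2 (WGAN-style): update the discriminator/critic D.
        update_critic_wgan(critic, expert_data, trajectories)
        # Step 3 (GAN/GAIL-style): update the posterior Q to predict the codes
        # from (state, action) pairs, i.e., maximize the variational lower
        # bound on the mutual information.
        update_posterior(posterior, trajectories, codes)
        # Step 4: update the policy with TRPO, using -critic(s, a) plus any
        # reward-augmentation term as the surrogate reward.
        update_policy_trpo(policy, trajectories, critic)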
51. Network Architectures
• Latent codes are added to G.
• Latent codes are also added to D.
• Actions are added to D.
• The posterior network Q adopts the same architecture as D, except that the output is a softmax over the discrete latent variables, or a factored Gaussian over the continuous latent variables.
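A sketch of how such a posterior head might look in PyTorch (layer sizes and names are assumptions, not the paper's exact architecture): the discrete branch ends in a softmax, while the continuous branch outputs the mean and log-variance of a factored Gaussian.

import torch
import torch.nn as nn

class PosteriorHead(nn.Module):
    # Maps shared D-style features to distributions over the latent codes.
    def __init__(self, feat_dim=128, n_discrete=3, n_continuous=2):
        super().__init__()
        self.logits = nn.Linear(feat_dim, n_discrete)     # discrete code head
        self.mu = nn.Linear(feat_dim, n_continuous)       # Gaussian mean head
        self.log_var = nn.Linear(feat_dim, n_continuous)  # Gaussian log-variance head

    def forward(self, features):
        probs = torch.softmax(self.logits(features), dim=-1)  # softmax over discrete codes
        return probs, self.mu(features), self.log_var(features)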
52. [Architecture diagram] Inputs and outputs of the three networks:
• G (policy): input image + discrete latent code + continuous latent code → action.
• D (cost): input image + action + discrete latent code → score.
• Q (regularizer): input image + action → discrete latent code + continuous latent code.
Train the policy function G with TRPO, and iterate.