OpenAI Retro Contest
My solutions for a fast learner of Sonic
Kiyonari Harigae
Agenda
• Introduction
• Problem Overview
• Domain Adaptation
• Reinforcement learning
• Evaluation
• Results
• Discussion
• Implementation Detail
• Hyper-Parameters
• Appendix
• References
Introduction
• OpenAI Retro Contest
This contest focuses on the transfer performance of reinforcement
learning. I aimed to build the "Fast Learner" described in the contest
details.
My approach is few-shot learning: an agent is trained on one level of the
training set, with the aim of achieving high performance on the test levels.
Problem Overview
• Problem Formulation and Assumptions
We formalize our transfer problem in a general way by considering a
source domain and a target domain, denoted D_S and D_T, each of which
corresponds to a Markov decision process (MDP).
The state spaces (raw pixels) of the source and target domains are
completely different, but the action spaces are shared, and the state
transition and reward functions share structural similarity:
D_S = (S_S, A_S, T_S, R_S) and D_T = (S_T, A_T, T_T, R_T)
S_S ≠ S_T, A_S = A_T, T_S ≈ T_T, R_S ≈ R_T
D_S ∈ M, where M is the set of all natural-world MDPs.
Domain Adaptation
• Representation learning with Stacked AE
The behavior of an RL agent is defined by a policy π : S -> A, which
specifies the action to take in each state S. The agent receives the state
(raw pixels) and decides what to do, so it needs to learn a generalized
policy π on the source domain.
I judged that an approach like DARLA [1] is appropriate for this problem:
it encodes the observations received from the environment into a general
representation, and then uses that representation to learn a robust
policy that is capable of domain adaptation.
As in the original, I implemented two steps, where the 1st step is a
DAE (Denoising AutoEncoder) and the 2nd step is a VAE (Variational
AutoEncoder) [2]. The differences from the original are that fine-tuning
is allowed in the 2nd step and β = 1 (so it is no longer a Beta-VAE) [3].
More implementation details are given below.
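As a rough illustration of how these two steps might be organized in code (a minimal sketch, not the author's implementation; the DAE/VAE modules, the `dae.encode` method, and the `stacked_vae_loss` function are placeholders, the latter sketched under Implementation Detail below):

```python
import torch
import torch.nn.functional as F

def train_stacked_ae(frame_batches, dae, vae, noise_factor=1.5):
    """Two-step Stacked AE training: (1) a denoising AE on noise-corrupted frames,
    (2) a VAE whose reconstruction error is measured in the frozen DAE's feature space."""
    # 1st step: Denoising AutoEncoder
    opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
    for x in frame_batches:                      # x: (B, 3, 78, 78) collected frames
        noisy = x + noise_factor * torch.randn_like(x)
        loss = F.mse_loss(dae(noisy), x)         # reconstruct the clean frame
        opt.zero_grad(); loss.backward(); opt.step()

    # 2nd step: VAE (beta = 1), with the DAE frozen and used only to score reconstructions
    for p in dae.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
    for x in frame_batches:
        x_hat, mu, logvar = vae(x)               # reconstruction and posterior parameters
        loss = stacked_vae_loss(x, x_hat, mu, logvar, dae.encode)  # see Implementation Detail
        opt.zero_grad(); loss.backward(); opt.step()
    return dae, vae
```

The learning rates (1e-3 and 1e-4) follow the hyper-parameter tables later in the deck.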
Reinforcement learning
• PPO (Proximal Policy Optimization)
For PPO [4], the architecture is the same as the baseline, except that the
convolutional part of the network, which is shared between the policy net
and the value net, is replaced with the pre-trained encoder of the Stacked
AE's 2nd step (the VAE).
Observations
Observations are resized to 78x78x3 (HxWxC).
Actions
The seven button combinations:
{{LEFT}, {RIGHT}, {LEFT, DOWN}, {RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}}
Rewards
The horizontal offset from the player's initial position; moving backwards
is allowed but yields no negative reward.
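To make the action and reward setup concrete, here is a sketch of the corresponding environment wrappers (modeled on the idea in the retro-baselines wrappers [5]; the exact class names and details are illustrative, not necessarily the author's):

```python
import gym
import numpy as np

# Genesis button layout used by gym-retro (order matters for the action array).
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
COMBOS = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"],
          ["DOWN"], ["DOWN", "B"], ["B"]]

class SonicDiscretizer(gym.ActionWrapper):
    """Map a discrete action index (0..6) to a 12-button action array."""
    def __init__(self, env):
        super().__init__(env)
        self._actions = []
        for combo in COMBOS:
            arr = np.zeros(len(BUTTONS), dtype=bool)
            for button in combo:
                arr[BUTTONS.index(button)] = True
            self._actions.append(arr)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

class AllowBacktracking(gym.Wrapper):
    """Reward only new horizontal progress, so moving backwards is allowed
    but never produces a negative reward."""
    def __init__(self, env):
        super().__init__(env)
        self._cur_x = 0.0
        self._max_x = 0.0

    def reset(self, **kwargs):
        self._cur_x = self._max_x = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self._cur_x += rew                            # per-step reward is the x-offset delta
        new_rew = max(0.0, self._cur_x - self._max_x)
        self._max_x = max(self._max_x, self._cur_x)
        return obs, new_rew, done, info
```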
Evaluation
• Training and evaluation procedure
Gathering images
Images (frames) were collected from the final rollouts of agents trained on
each level of the training set and used to train the Stacked VAE.
Pre-training
The 1st and 2nd steps of the Stacked VAE were trained using the images
collected above.
Reinforcement learning on one level of the training set
The agent was trained for 1e6 time steps on GreenHillZone.Act1, a level
that is simple yet captures the essence of the game. The convolutional part
of the policy, initialized from the pre-trained encoder of the Stacked VAE,
was allowed to be fine-tuned.
Evaluating transfer performance
At evaluation time, each test level was played for 1e6 time steps using the
agent trained above. For comparison, the (Sonic) Baseline PPO [5] was also
added to the evaluation.
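Putting these four steps together, the overall procedure could be outlined roughly as follows (Python-shaped pseudocode; every function here is a placeholder standing in for the step it names, not an actual API):

```python
def run_experiment(training_levels, test_levels):
    """Outline of the procedure above; each call is a placeholder for the corresponding step."""
    # 1. Gathering images: collect frames from agents trained on each training level.
    frames = [f for level in training_levels for f in collect_frames(level)]

    # 2. Pre-training: fit the Stacked VAE (1st-step DAE, then 2nd-step VAE) on those frames.
    dae, vae = train_stacked_ae(frames, DAE(), VAE())

    # 3. RL on one training level: PPO for 1e6 steps on GreenHillZone.Act1,
    #    with the policy's conv part initialized from (and fine-tuned with) the VAE encoder.
    agent = train_ppo("GreenHillZone.Act1", encoder=vae.encoder, timesteps=int(1e6))

    # 4. Transfer evaluation: play each test level for 1e6 time steps with that agent.
    return {level: evaluate(agent, level, timesteps=int(1e6)) for level in test_levels}
```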
Results
State                    Baseline PPO   Baseline PPO   Fast Learner PPO
                         (Scratch)      (Trained)      (Stacked VAE)
AngelIslandZone Act2     1246.0         1377.8         1993.2
CasinoNightZone Act2     3302.9         3193.6         3684.5
FlyingBatteryZone Act2    996.0         1047.4          871.1
GreenHillZone Act2       2434.1         4846.3         4780.0
HillTopZone Act2         2227.7         2876.7         2664.4
HydrocityZone Act1       1966.7          713.8         1538.3
LavaReefZone Act1        1666.7          739.5         2891.0
MetropolisZone Act3      1395.6         1438.8         1566.2
ScrapBrainZone Act1      1080.8         1272.6         1281.3
SpringYardZone Act1      1595.7         1261.7         1272.2
StarLightZone Act3       2604.4         2627.0         2684.8
Average                  1865.1         1945.0         2293.3
Baseline PPO (Scratch) is zero-shot learning; Baseline PPO (Trained) was trained for
1e6 time steps on one level of the training set (GreenHillZone.Act1); Fast Learner PPO
is the evaluation target.
Baseline PPO (Trained) appears to overfit somewhat, while Fast Learner PPO (Stacked
VAE) appears to generalize better.
Results
[Figure: "Learning curve" — score vs. time steps for Baseline PPO (Scratch), Baseline PPO (Trained), and Fast Learner PPO (Stacked VAE).]
Submission Results
TASK      Baseline PPO   Baseline PPO   Fast Learner PPO
          (Scratch)      (Trained)      (Stacked VAE)
#1          907.16        6492.72        8071.51
#2         3417.66        2477.81        2629.08
#3         2690.13        2266.38        3166.67
#4         1642.06        1915.59        2012.49
#5         1786.79        2048.23        2979.00
Average    2088.76        3040.15        3771.75
For the submission, the training procedure was the same as in the evaluation,
except that frames collected from the test set were also added to the training
data of the Stacked VAE. The submitted agent was trained for 1e6 time steps on
the same level as in the evaluation.
The results show approximately an 80% improvement over Scratch PPO and
approximately a 24% improvement over Trained PPO.
Discussion
In this work, by learning a good representation from the training set, the agent quickly
learned the essence of the game, and we confirmed the feasibility of transfer to the
test levels.
However, I would like to investigate the following issues as future work.
・Overfitting to the source domain
As described in the DARLA paper, allowing fine-tuning of the convolutional layers while
learning the source policy speeds up learning on the source domain, but it raises the
problem of overfitting.
A validation strategy for obtaining a robust policy needs to be considered.
・Learning a good feature representation
This time it was difficult to collect enough images, so the learned representation was
not rich enough (it did not capture, for example, obstacles or character motion).
It is necessary to collect images covering more situations and to improve generalization
through image augmentation, etc.
Implementation Detail
[Figure: Stacked AE architecture. 1st step: a DAE (encoder + decoder) maps the frame x through a latent Z to a reconstruction x̂. 2nd step: a VAE encoder produces μ, Σ and the latent S_z, the VAE decoder reconstructs x̂, and the reconstruction is scored through the DAE, denoted J. For RL, the policy π_S(a | S_z; θ) takes S_z as input.]
The 2nd-step objective (with β = 1 here) is:
𝓛(θ, Φ; x, z, β) = E_{q_Φ(z|x)} [ ‖J(x̂) − J(x)‖²₂ ] − β · D_KL( q_Φ(z|x) ‖ p(z) )
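Rendered as code, the loss actually minimized in practice (reconstruction error in DAE space plus the KL term) could look roughly as follows (a sketch assuming PyTorch; `dae_encode` stands in for the pre-trained DAE feature map J and is a placeholder name):

```python
import torch
import torch.nn.functional as F

def stacked_vae_loss(x, x_hat, mu, logvar, dae_encode, beta=1.0):
    """2nd-step objective: reconstruction error measured in the DAE's feature space J,
    plus the KL term (beta = 1 here, i.e. a plain VAE rather than a Beta-VAE)."""
    with torch.no_grad():
        target = dae_encode(x)                        # J(x), treated as a fixed target
    recon = F.mse_loss(dae_encode(x_hat), target)     # ||J(x_hat) - J(x)||^2
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```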
Hyper-parameters
Denoising AutoEncoder
  Architecture: kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 100l, decoder layers {64r-64r-32r-32r}, Adam optimizer
  Noise factor: 1.5
  Learning rate: 1e-3
Variational AutoEncoder
  Architecture: kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 200 (100 independent Gaussian distributions), decoder layers {64r-64r-32r-32r}, Adam optimizer
  Learning rate: 1e-4
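Read literally, the architecture rows above might translate into something like the following (my own sketch in PyTorch, interpreting "r" as a ReLU after each conv layer and "l" as a linear layer; the author's actual implementation may differ):

```python
import torch.nn as nn

def conv_encoder(out_dim):
    """Encoder per the table: kernel 4, stride 2, channels 32-32-64-64 with ReLU,
    then a linear layer to out_dim (100 for the DAE latent, 200 for the VAE's mu/logvar)."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 3 * 3, out_dim),   # a 78x78 input shrinks to a 3x3 feature map
    )

def conv_decoder(in_dim):
    """Mirror decoder: linear back to a 3x3x64 map, then transposed convs 64-64-32-32."""
    return nn.Sequential(
        nn.Linear(in_dim, 64 * 3 * 3), nn.ReLU(),
        nn.Unflatten(1, (64, 3, 3)),
        nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2),   # back to 78x78x3
    )
```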
Hyper-parameters
PPO
  Architecture: the convolutional part of the policy net is the encoder of the VAE; S_z is then fed to the policy layers {200l-512l-7l}
  Epochs: 4
  Minibatch size: 8
  Discount (γ): 0.99
  GAE parameter (λ): 0.95
  Clipping parameter (ε): 0.2
  Entropy coeff: 0.001
  Reward scale: 0.005
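Similarly, the PPO architecture row could correspond to a network along these lines (an illustrative sketch; the shape of the value head and the activations between linear layers are my assumptions):

```python
import torch.nn as nn

class StackedVAEPolicy(nn.Module):
    """Policy/value net sketch: the conv part is the pre-trained (fine-tunable) VAE
    encoder producing the 200-d S_z, followed by a 200 -> 512 -> 7 policy head."""
    def __init__(self, vae_encoder):
        super().__init__()
        self.encoder = vae_encoder                         # e.g. conv_encoder(200) above
        self.policy = nn.Sequential(nn.Linear(200, 512), nn.ReLU(), nn.Linear(512, 7))
        self.value = nn.Sequential(nn.Linear(200, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, obs):                                # obs: (B, 3, 78, 78)
        s_z = self.encoder(obs)                            # S_z: (B, 200)
        return self.policy(s_z), self.value(s_z)
```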
Appendix 1
• Latent space linear interpolation
The interpolation suggests that a representation shared across levels may have been acquired.
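Such an interpolation can be produced with only a few lines of code (a sketch; `vae.encode` and `vae.decode` are hypothetical method names for the Stacked VAE's 2nd step):

```python
import torch

def latent_interpolation(vae, frame_a, frame_b, steps=8):
    """Linearly interpolate between the latent codes of two frames (e.g. from two
    different levels) and decode each point to see what the representation captures."""
    with torch.no_grad():
        mu_a, _ = vae.encode(frame_a.unsqueeze(0))     # posterior mean for frame A
        mu_b, _ = vae.encode(frame_b.unsqueeze(0))     # posterior mean for frame B
        alphas = torch.linspace(0.0, 1.0, steps)
        z = torch.stack([(1 - a) * mu_a[0] + a * mu_b[0] for a in alphas])
        return vae.decode(z)                           # (steps, 3, 78, 78) decoded frames
```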
References
[1] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, "DARLA: Improving zero-shot transfer in reinforcement learning," 2017. eprint: arXiv:1707.08475.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," ICLR, 2014.
[3] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," ICLR, 2017.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. eprint: arXiv:1707.06347.
[5] https://github.com/openai/retro-baselines