OpenAI Retro Contest
My solutions for a fast learner of Sonic
Kiyonari Harigae
Agenda
• Introduction
• Problem Overview
• Domain Adaptation
• Reinforcement learning
• Evaluation
• Results
• Discussion
• Implementation Detail
• Hyper-Parameters
• Appendix
• References
Introduction
• OpenAI Retro Contest
This contest focuses on the transfer performance of reinforcement
learning. I aimed to build the "Fast Learner" described in the contest
details.
My approach is few-shot learning: an agent is trained on one level of the
training set, with the aim of achieving high performance on the test levels.
Problem Overview
• Problem Formulation and Assumptions
We formalize our transfer problem in a general way by considering a
source domain and a target domain, denoted D_S and D_T, each of which
corresponds to a Markov decision process (MDP).
The state spaces (raw pixels) of the source and target domains are
completely different, but the action spaces are shared, and the state
transition and reward functions share structural similarity:
D_S = (S_S, A_S, T_S, R_S) and D_T = (S_T, A_T, T_T, R_T)
S_S ≠ S_T, A_S = A_T, T_S ≈ T_T, R_S ≈ R_T
D_S ∈ M, where M is the set of all natural-world MDPs.
Domain Adaptation
• Representation learning with Stacked AE
The behavior of an RL agent is defined by a policy π : S -> A, which
specifies the action to take in each state S. The agent receives the state
(raw pixels) and decides what to do, so it needs to learn a generalized
policy π on the source domain.
I judged that an approach like DARLA [1] is appropriate for this problem:
it encodes the observations received from the environment into a general
representation, and then uses that representation to learn a robust
policy that is capable of domain adaptation.
As in the original, I implemented two steps, where the 1st step is a
DAE (Denoising AutoEncoder) and the 2nd step is a VAE (Variational
AutoEncoder) [2]. The differences from the original are that fine-tuning
is allowed in the 2nd step and β = 1 (so it is no longer a Beta-VAE) [3].
More implementation details are given below.
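As a rough illustration of how these two steps might be organized in code (a minimal sketch, not the author's implementation; the DAE/VAE modules, the `dae.encode` method, and the `stacked_vae_loss` function are placeholders, the latter sketched under Implementation Detail below):

```python
import torch
import torch.nn.functional as F

def train_stacked_ae(frame_batches, dae, vae, noise_factor=1.5):
    """Two-step Stacked AE training: (1) a denoising AE on noise-corrupted frames,
    (2) a VAE whose reconstruction error is measured in the frozen DAE's feature space."""
    # 1st step: Denoising AutoEncoder
    opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
    for x in frame_batches:                      # x: (B, 3, 78, 78) collected frames
        noisy = x + noise_factor * torch.randn_like(x)
        loss = F.mse_loss(dae(noisy), x)         # reconstruct the clean frame
        opt.zero_grad(); loss.backward(); opt.step()

    # 2nd step: VAE (beta = 1), with the DAE frozen and used only to score reconstructions
    for p in dae.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
    for x in frame_batches:
        x_hat, mu, logvar = vae(x)               # reconstruction and posterior parameters
        loss = stacked_vae_loss(x, x_hat, mu, logvar, dae.encode)  # see Implementation Detail
        opt.zero_grad(); loss.backward(); opt.step()
    return dae, vae
```

The learning rates (1e-3 and 1e-4) follow the hyper-parameter tables later in the deck.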
Reinforcement learning
• PPO (Proximal Policy Optimization)
For PPO [4], the architecture is the same as the baseline, except that the
convolutional part of the network, which is shared between the policy net
and the value net, is replaced with the pre-trained encoder of the Stacked
AE's 2nd step (the VAE).
Observations
Observations are resized to 78x78x3 (HxWxC).
Actions
The seven button combinations:
{{LEFT}, {RIGHT}, {LEFT, DOWN}, {RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}}
Rewards
The horizontal offset from the player's initial position; moving backwards
is allowed but yields no negative reward.
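To make the action and reward setup concrete, here is a sketch of the corresponding environment wrappers (modeled on the idea in the retro-baselines wrappers [5]; the exact class names and details are illustrative, not necessarily the author's):

```python
import gym
import numpy as np

# Genesis button layout used by gym-retro (order matters for the action array).
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
COMBOS = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"],
          ["DOWN"], ["DOWN", "B"], ["B"]]

class SonicDiscretizer(gym.ActionWrapper):
    """Map a discrete action index (0..6) to a 12-button action array."""
    def __init__(self, env):
        super().__init__(env)
        self._actions = []
        for combo in COMBOS:
            arr = np.zeros(len(BUTTONS), dtype=bool)
            for button in combo:
                arr[BUTTONS.index(button)] = True
            self._actions.append(arr)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

class AllowBacktracking(gym.Wrapper):
    """Reward only new horizontal progress, so moving backwards is allowed
    but never produces a negative reward."""
    def __init__(self, env):
        super().__init__(env)
        self._cur_x = 0.0
        self._max_x = 0.0

    def reset(self, **kwargs):
        self._cur_x = self._max_x = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self._cur_x += rew                            # per-step reward is the x-offset delta
        new_rew = max(0.0, self._cur_x - self._max_x)
        self._max_x = max(self._max_x, self._cur_x)
        return obs, new_rew, done, info
```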
Evaluation
• Training and evaluation procedure
Gathering images
Images (frames) were collected from the final rollouts of agents trained on
each level of the training set and used to train the Stacked VAE.
Pre-training
The 1st and 2nd steps of the Stacked VAE were trained using the images
collected above.
Reinforcement learning on one level of the training set
The agent was trained for 1e6 time steps on GreenHillZone.Act1, a level
that is simple yet captures the essence of the game. The convolutional part
of the policy, initialized from the pre-trained encoder of the Stacked VAE,
was allowed to be fine-tuned.
Evaluating transfer performance
At evaluation time, each test level was played for 1e6 time steps using the
agent trained above. For comparison, the (Sonic) Baseline PPO [5] was also
added to the evaluation.
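Putting these four steps together, the overall procedure could be outlined roughly as follows (Python-shaped pseudocode; every function here is a placeholder standing in for the step it names, not an actual API):

```python
def run_experiment(training_levels, test_levels):
    """Outline of the procedure above; each call is a placeholder for the corresponding step."""
    # 1. Gathering images: collect frames from agents trained on each training level.
    frames = [f for level in training_levels for f in collect_frames(level)]

    # 2. Pre-training: fit the Stacked VAE (1st-step DAE, then 2nd-step VAE) on those frames.
    dae, vae = train_stacked_ae(frames, DAE(), VAE())

    # 3. RL on one training level: PPO for 1e6 steps on GreenHillZone.Act1,
    #    with the policy's conv part initialized from (and fine-tuned with) the VAE encoder.
    agent = train_ppo("GreenHillZone.Act1", encoder=vae.encoder, timesteps=int(1e6))

    # 4. Transfer evaluation: play each test level for 1e6 time steps with that agent.
    return {level: evaluate(agent, level, timesteps=int(1e6)) for level in test_levels}
```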
Results
State                    Baseline PPO   Baseline PPO   Fast Learner PPO
                         (Scratch)      (Trained)      (Stacked VAE)
AngelIslandZone Act2     1246.0         1377.8         1993.2
CasinoNightZone Act2     3302.9         3193.6         3684.5
FlyingBatteryZone Act2    996.0         1047.4          871.1
GreenHillZone Act2       2434.1         4846.3         4780.0
HillTopZone Act2         2227.7         2876.7         2664.4
HydrocityZone Act1       1966.7          713.8         1538.3
LavaReefZone Act1        1666.7          739.5         2891.0
MetropolisZone Act3      1395.6         1438.8         1566.2
ScrapBrainZone Act1      1080.8         1272.6         1281.3
SpringYardZone Act1      1595.7         1261.7         1272.2
StarLightZone Act3       2604.4         2627.0         2684.8
Average                  1865.1         1945.0         2293.3
Baseline PPO (Scratch) is zero-shot learning; Baseline PPO (Trained) was trained for
1e6 time steps on one level of the training set (GreenHillZone.Act1); Fast Learner PPO
is the evaluation target.
Baseline PPO (Trained) appears to overfit somewhat, while Fast Learner PPO (Stacked
VAE) appears to generalize better.
Results
[Figure: "Learning curve" — score vs. time steps for Baseline PPO (Scratch), Baseline PPO (Trained), and Fast Learner PPO (Stacked VAE).]
Submission Results
TASK      Baseline PPO   Baseline PPO   Fast Learner PPO
          (Scratch)      (Trained)      (Stacked VAE)
#1          907.16        6492.72        8071.51
#2         3417.66        2477.81        2629.08
#3         2690.13        2266.38        3166.67
#4         1642.06        1915.59        2012.49
#5         1786.79        2048.23        2979.00
Average    2088.76        3040.15        3771.75
For the submission, the training procedure was the same as in the evaluation,
except that frames collected from the test set were also added to the training
data of the Stacked VAE. The submitted agent was trained for 1e6 time steps on
the same level as in the evaluation.
The results show approximately an 80% improvement over Scratch PPO and
approximately a 24% improvement over Trained PPO.
Discussion
In this work, by learning a good representation from the training set, the agent quickly
learned the essence of the game, and we confirmed the feasibility of transfer to the
test levels.
However, I would like to investigate the following issues as future work.
・Overfitting to the source domain
As described in the DARLA paper, allowing fine-tuning of the convolutional layers while
learning the source policy speeds up learning on the source domain, but it raises the
problem of overfitting.
A validation strategy for obtaining a robust policy needs to be considered.
・Learning a good feature representation
This time it was difficult to collect enough images, so the learned representation was
not rich enough (it did not capture, for example, obstacles or character motion).
It is necessary to collect images covering more situations and to improve generalization
through image augmentation, etc.
Implementation Detail
[Figure: Stacked AE architecture. 1st step: a DAE (encoder + decoder) maps the frame x through a latent Z to a reconstruction x̂. 2nd step: a VAE encoder produces μ, Σ and the latent S_z, the VAE decoder reconstructs x̂, and the reconstruction is scored through the DAE, denoted J. For RL, the policy π_S(a | S_z; θ) takes S_z as input.]
The 2nd-step objective (with β = 1 here) is:
𝓛(θ, Φ; x, z, β) = E_{q_Φ(z|x)} [ ‖J(x̂) − J(x)‖²₂ ] − β · D_KL( q_Φ(z|x) ‖ p(z) )
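Rendered as code, the loss actually minimized in practice (reconstruction error in DAE space plus the KL term) could look roughly as follows (a sketch assuming PyTorch; `dae_encode` stands in for the pre-trained DAE feature map J and is a placeholder name):

```python
import torch
import torch.nn.functional as F

def stacked_vae_loss(x, x_hat, mu, logvar, dae_encode, beta=1.0):
    """2nd-step objective: reconstruction error measured in the DAE's feature space J,
    plus the KL term (beta = 1 here, i.e. a plain VAE rather than a Beta-VAE)."""
    with torch.no_grad():
        target = dae_encode(x)                        # J(x), treated as a fixed target
    recon = F.mse_loss(dae_encode(x_hat), target)     # ||J(x_hat) - J(x)||^2
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```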
Hyper-parameters
Denoising AutoEncoder
  Architecture: kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 100l, decoder layers {64r-64r-32r-32r}, Adam optimizer
  Noise factor: 1.5
  Learning rate: 1e-3
Variational AutoEncoder
  Architecture: kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 200 (100 independent Gaussian distributions), decoder layers {64r-64r-32r-32r}, Adam optimizer
  Learning rate: 1e-4
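Read literally, the architecture rows above might translate into something like the following (my own sketch in PyTorch, interpreting "r" as a ReLU after each conv layer and "l" as a linear layer; the author's actual implementation may differ):

```python
import torch.nn as nn

def conv_encoder(out_dim):
    """Encoder per the table: kernel 4, stride 2, channels 32-32-64-64 with ReLU,
    then a linear layer to out_dim (100 for the DAE latent, 200 for the VAE's mu/logvar)."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 3 * 3, out_dim),   # a 78x78 input shrinks to a 3x3 feature map
    )

def conv_decoder(in_dim):
    """Mirror decoder: linear back to a 3x3x64 map, then transposed convs 64-64-32-32."""
    return nn.Sequential(
        nn.Linear(in_dim, 64 * 3 * 3), nn.ReLU(),
        nn.Unflatten(1, (64, 3, 3)),
        nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2),   # back to 78x78x3
    )
```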
Hyper-parameters
PPO
  Architecture: the convolutional part of the policy net is the encoder of the VAE; S_z is then fed to the policy layers {200l-512l-7l}
  Epochs: 4
  Minibatch size: 8
  Discount (γ): 0.99
  GAE parameter (λ): 0.95
  Clipping parameter (ε): 0.2
  Entropy coeff: 0.001
  Reward scale: 0.005
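Similarly, the PPO architecture row could correspond to a network along these lines (an illustrative sketch; the shape of the value head and the activations between linear layers are my assumptions):

```python
import torch.nn as nn

class StackedVAEPolicy(nn.Module):
    """Policy/value net sketch: the conv part is the pre-trained (fine-tunable) VAE
    encoder producing the 200-d S_z, followed by a 200 -> 512 -> 7 policy head."""
    def __init__(self, vae_encoder):
        super().__init__()
        self.encoder = vae_encoder                         # e.g. conv_encoder(200) above
        self.policy = nn.Sequential(nn.Linear(200, 512), nn.ReLU(), nn.Linear(512, 7))
        self.value = nn.Sequential(nn.Linear(200, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, obs):                                # obs: (B, 3, 78, 78)
        s_z = self.encoder(obs)                            # S_z: (B, 200)
        return self.policy(s_z), self.value(s_z)
```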
Appendix 1
• Latent space linear interpolation
The interpolation suggests that a representation shared across levels may have been acquired.
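Such an interpolation can be produced with only a few lines of code (a sketch; `vae.encode` and `vae.decode` are hypothetical method names for the Stacked VAE's 2nd step):

```python
import torch

def latent_interpolation(vae, frame_a, frame_b, steps=8):
    """Linearly interpolate between the latent codes of two frames (e.g. from two
    different levels) and decode each point to see what the representation captures."""
    with torch.no_grad():
        mu_a, _ = vae.encode(frame_a.unsqueeze(0))     # posterior mean for frame A
        mu_b, _ = vae.encode(frame_b.unsqueeze(0))     # posterior mean for frame B
        alphas = torch.linspace(0.0, 1.0, steps)
        z = torch.stack([(1 - a) * mu_a[0] + a * mu_b[0] for a in alphas])
        return vae.decode(z)                           # (steps, 3, 78, 78) decoded frames
```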
References
[1] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, "DARLA: Improving zero-shot transfer in reinforcement learning," 2017. eprint: arXiv:1707.08475.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," ICLR, 2014.
[3] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," ICLR, 2017.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. eprint: arXiv:1707.06347.
[5] https://github.com/openai/retro-baselines