Model-Based Reinforcement Learning
@NIPS2017
Yasuhiro Fujita
Engineer
Preferred Networks, Inc.
Model-Based Reinforcement Learning (MBRL)
• Model = simulator = dynamics = T(s,a,s')
  – may or may not include the reward function
• Model-free RL uses data from the environment only
• Model-based RL uses data from a model, which is given or estimated (see the sketch after this list)
  ◦ to use less data from the environment
  ◦ to look ahead and plan
  ◦ to explore
  ◦ to guarantee safety
  ◦ to generalize to different goals
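A minimal sketch of the interface this distinction assumes; `DynamicsModel`, `predict`, and `simulated_rollout` are illustrative names, not from any of the papers below.

```python
# A "model" here is a stand-in for the environment: it maps (s, a) to a
# next state s' (and, optionally, a reward).
class DynamicsModel:
    def predict(self, s, a):
        """Return (s_next, reward). In practice this is a given simulator
        or a learned approximation (NN, Gaussian process, ...)."""
        raise NotImplementedError

def simulated_rollout(model, policy, s0, horizon):
    """Generate experience from the model instead of the real environment,
    e.g. to plan ahead or to reduce the amount of real interaction needed."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy(s)
        s_next, r = model.predict(s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```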
Why MBRL now?
• Despite deep RL's recent success, it is still hard to see real-world applications
  – Requires a huge number of interactions (1M~1000M 😨)
  – No safety guarantees
  – Difficult to transfer to other tasks
• MBRL can be a solution to these problems
• This talk introduces some MBRL papers from the NIPS 2017 conference and the deep RL symposium
Imagination-Augmented Agents (I2As)
• T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra, "Imagination-Augmented Agents for Deep Reinforcement Learning," 2017.
• I2As utilize predictions from a model for planning
• Robust to model errors
The I2A architecture (1)
• Model-free path: a feed-forward net
• Model-based path (see the sketch below):
  – Make multi-step predictions (= rollouts) from the current observation, one per action
  – Encode each rollout with an LSTM
  – Aggregate the rollout codes by concatenation
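A schematic sketch of this model-based path; the shapes, the single-layer LSTM, and the `ModelBasedPath` name are my assumptions, and the imagined rollouts are taken as given input (the environment model that produces them is not shown).

```python
import torch
import torch.nn as nn

class ModelBasedPath(nn.Module):
    """One imagined rollout per action; each rollout is a sequence of
    (predicted observation, predicted reward) pairs from the model."""
    def __init__(self, obs_dim, code_dim):
        super().__init__()
        # Input per step: predicted observation features + 1 reward scalar.
        self.encoder = nn.LSTM(obs_dim + 1, code_dim, batch_first=True)

    def forward(self, rollouts):
        # rollouts: tensor of shape (n_actions, horizon, obs_dim + 1)
        codes = []
        for r in rollouts:
            _, (h, _) = self.encoder(r.unsqueeze(0))  # encode one rollout
            codes.append(h[-1, 0])                    # final hidden state
        return torch.cat(codes)  # aggregate all codes by concatenation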
The I2A architecture (2)
• The imagination core
  – consists of a rollout policy and a pretrained environment model
  – predicts the next observation and reward
• The rollout policy is distilled online from the I2A policy (one plausible loss is sketched below)
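A minimal sketch of such an online distillation loss, assuming both policies output action logits; this is the standard cross-entropy form, not necessarily the paper's exact objective.

```python
import torch.nn.functional as F

def distillation_loss(rollout_logits, i2a_logits):
    """Cross-entropy between the rollout policy and the (fixed) I2A policy.
    Detaching the target ensures only the small rollout policy is updated."""
    target = F.softmax(i2a_logits, dim=-1).detach()
    return -(target * F.log_softmax(rollout_logits, dim=-1)).sum(-1).mean()
```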
Value Prediction Networks
• J. Oh, S. Singh, and H. Lee, "Value Prediction Network," in NIPS, 2017.
• Directly predicting observations in pixels might not be a good idea
  – They contain details irrelevant to the agent
  – They are unnecessarily high-dimensional and difficult to predict
• VPNs learn abstract states and their model by minimizing value prediction errors
The VPN architecture
• x: observation, o: option (≈ action here)
• Decompose Q(x,o) = r(s') + γV(s'), where s' is the predicted next abstract state (see the sketch below)
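A schematic sketch of the modules this decomposition implies; the linear layers and names stand in for the paper's actual (convolutional) modules.

```python
import torch
import torch.nn as nn

class VPNCore(nn.Module):
    """Encode the observation into an abstract state, then predict reward,
    next abstract state, and value entirely in the abstract space."""
    def __init__(self, obs_dim, s_dim, n_options, gamma=0.99):
        super().__init__()
        self.gamma = gamma
        self.encode = nn.Linear(obs_dim, s_dim)                # x -> s
        self.transition = nn.Linear(s_dim + n_options, s_dim)  # (s, o) -> s'
        self.reward = nn.Linear(s_dim + n_options, 1)          # (s, o) -> r
        self.value = nn.Linear(s_dim, 1)                       # s -> V(s)

    def q_value(self, x, o_onehot):
        s = self.encode(x)
        so = torch.cat([s, o_onehot], dim=-1)
        # The slide's decomposition: Q(x, o) = r + gamma * V(s')
        return self.reward(so) + self.gamma * self.value(self.transition(so))
```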
Planning by VPNs
• The search depth and width are fixed
• Values are averaged over prediction steps (see the sketch below)
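A rough sketch of depth-d planning with the `VPNCore` above, expanding every option (a fixed width would keep only the best few); the weighted average of the direct value estimate with the best backup is my reading of "values are averaged over prediction steps", and the paper's exact backup may differ in details.

```python
import torch

def plan_value(core, s, options, d):
    """Estimate V(s) by a depth-d lookahead in abstract-state space
    (unbatched: s has shape (s_dim,), each option is a one-hot tensor)."""
    if d == 1:
        return core.value(s)
    backups = []
    for o in options:
        so = torch.cat([s, o], dim=-1)
        s_next = core.transition(so)
        backups.append(core.reward(so)
                       + core.gamma * plan_value(core, s_next, options, d - 1))
    # Mix the direct value estimate with the best deeper backup, so estimates
    # from all prediction depths contribute to the final value.
    return core.value(s) / d + (d - 1) / d * max(backups)
```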
Training VPNs
• V(s) of each abstract state is fit to the value obtained from planning
• Improves performance on 2D random grid worlds and some Atari games when combined with asynchronous n-step Q-learning
  – surpasses observation prediction on the grid worlds
QMDP-net (not RL but Imitation Learning)
• P. Karkus, D. Hsu, and W. S. Lee, "QMDP-Net: Deep Learning for Planning under Partial Observability," in NIPS, 2017.
• A POMDP (partially observable MDP) and its solver are modeled as a single neural network and trained end-to-end to predict expert actions
  – Value Iteration Networks (NIPS 2016) were for fully observable domains
POMDPs and the QMDP algorithm
• In a POMDP
  – The agent can only observe o ~ O(s), not s
  – A belief state is considered instead: b(s) = probability of being in s
• QMDP: an approximate algorithm for solving a POMDP (sketched below)
  1. Compute Q_{MDP}(s,a) of the underlying MDP for each (s,a) pair
  2. Compute the current belief b(s) = probability of the current state being s
  3. Approximate Q(b,a) ≈ Σ_s b(s) Q_{MDP}(s,a)
  4. Choose argmax_a Q(b,a)
  – Assumes that any uncertainty in the belief disappears after the next action
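A sketch of steps 2–4 above in numpy; the tabular shapes chosen for T and O are my assumptions.

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """Weight the fully observable Q-values by the belief and act greedily.
    b: belief over states, shape (S,); Q_mdp: shape (S, A)."""
    Q_b = b @ Q_mdp          # Q(b, a) ≈ sum_s b(s) Q_MDP(s, a)
    return int(np.argmax(Q_b))

def bayes_filter(b, a, o, T, O):
    """One belief update: b'(s') ∝ O(o | s') * sum_s T(s' | s, a) b(s).
    T: shape (A, S, S) with T[a][s, s'] = P(s' | s, a); O: shape (S, n_obs)."""
    b_pred = b @ T[a]        # predict through the dynamics
    b_new = b_pred * O[:, o] # correct with the observation likelihood
    return b_new / b_new.sum()
```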
The QMDP-net architecture (1)
• Consists of a Bayesian filter and a QMDP planner
  – The Bayesian filter outputs b
  – The QMDP planner outputs Q(b,a)
The QMDP-net architecture (2)
• Everything is represented as a CNN
• Works on abstract observations/states/actions that can differ from the real ones
  – abstract state = a position in the 2D plane the CNNs operate on
Performance of the QMDP-net
• Expert actions are taken from successful trajectories of the QMDP algorithm, which solves the ground-truth POMDP
• QMDP-net surpasses plain recurrent nets and even the QMDP algorithm itself (which can fail)
MBRL with stability guarantees
• F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, "Safe Model-based Reinforcement Learning with Stability Guarantees," in NIPS, 2017.
• Aims to guarantee stability (= recoverability to stable states) in continuous control despite uncertainty in the estimated model
  – Achieves both safe policy updates and safe exploration
• Repeat:
  – Estimate the region of attraction
  – Safely explore to reduce uncertainty in the model
  – Update the model (e.g. a Gaussian process)
  – Safely improve the policy to maximize some objective
How it works
• Can safely optimize a neural network policy on a simulated inverted pendulum without the pendulum ever falling down
RL on a learned model
• A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning," 2017.
• If you can optimize a policy on a learned model, you may need less data from the environment
  – And NNs are good at prediction
• One way to act with a learned model: Model Predictive Control (MPC), sketched below
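A sketch of random-shooting MPC on a learned model, the style of controller used in this line of work; the `dynamics` interface, the action bounds, and the hyperparameters are illustrative assumptions, and the reward function is taken as known.

```python
import numpy as np

def mpc_action(dynamics, reward_fn, s, action_dim,
               horizon=10, n_candidates=1000):
    """Random-shooting MPC: sample candidate action sequences, roll each out
    through the model, and execute only the first action of the best-scoring
    sequence (then replan at the next step).
    `dynamics(s, a) -> s_next` is the learned NN model."""
    seqs = np.random.uniform(-1.0, 1.0, (n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(seqs):
        s_sim = s
        for a in seq:
            s_next = dynamics(s_sim, a)        # imagined transition
            returns[i] += reward_fn(s_sim, a, s_next)
            s_sim = s_next
    return seqs[np.argmax(returns), 0]         # first action of best plan
```

Replanning at every step is what makes MPC tolerant of moderate model errors: only the first action of each imagined plan is ever executed.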
Model learning is difficult
• Even small prediction errors compound and eventually diverge
  – Policies learned/computed purely from simulated experience may fail
Fine-tuning a policy with model-free RL
• Outperforms pure model-free RL with the following pipeline:
  1. Collect data, fit a model, and apply MPC
  2. Train a NN policy to imitate the actions of MPC
  3. Fine-tune the policy with model-free RL (TRPO)
Model ensemble
• T. Kurutach and A. Tamar, "Model-Ensemble Trust-Region Policy Optimization," in NIPS Deep Reinforcement Learning Symposium, 2017.
• Another way to learn a policy on a learned model
  – Apply model-free RL to a learned model
• Model-Ensemble Trust Region Policy Optimization (ME-TRPO), sketched below:
  1. Fit an ensemble of NN models to predict next states
     ◦ Why an ensemble? To maintain model uncertainty
  2. Optimize a policy on simulated experience with TRPO until performance stops improving
  3. Collect new data for model learning and return to step 1
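A sketch of the ensemble part; the `fit`/`predict` interfaces are assumed, and how the original work measures "performance stops improving" across models is simplified away here.

```python
import numpy as np

class ModelEnsemble:
    """K dynamics models trained on the same real data; their disagreement
    stands in for model uncertainty."""
    def __init__(self, models):
        self.models = models

    def fit(self, states, actions, next_states):
        for m in self.models:  # e.g. different inits / bootstrapped batches
            m.fit(states, actions, next_states)

    def step(self, s, a):
        # During simulated rollouts, sample which model predicts each step,
        # so the policy cannot exploit the errors of any single model.
        m = self.models[np.random.randint(len(self.models))]
        return m.predict(s, a)

    def disagreement(self, s, a):
        preds = np.stack([m.predict(s, a) for m in self.models])
        return preds.std(axis=0).mean()  # proxy for model uncertainty
```

Sampling a different model per simulated step (or per rollout) keeps the policy from overfitting to the idiosyncratic errors of any one model.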
Effect on sample complexity
• Improves sample complexity on MuJoCo-based continuous control tasks
  – the x-axis shows time steps on a log scale
Effect of the ensemble size
• More models, better performance
Summary
• MBRL is hot
  – There were more papers than I could introduce
• Popular ideas
  – Incorporating a model/planning structure into a NN
  – Using model-based simulations to reduce sample complexity
• (Deep) MBRL can be a solution to the drawbacks of deep RL
• However, MBRL has its own challenges
  – How to learn a good model
  – How to make use of a possibly bad model
