Continuous Control with
Deep Reinforcement Learning
2016 ICLR
Timothy P. Lillicrap, et al. (Google DeepMind)
Presenter : Hyemin Ahn
Introduction
2016-04-17 CPSLAB (EECS) 2
 Another work combining
Deep Learning
+ Reinforcement Learning
from Google DeepMind!
 Extends their Deep Q Network,
which handles discrete action spaces,
to continuous action spaces.
Results : Preview
2016-04-17 CPSLAB (EECS) 3
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 4
Agent
How can we formalize our behavior?
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 5
At each time step $t$,
the agent receives an observation $x_t$
from the environment $E$.
Wow
so scare
such gun
so many bullets
nice suit btw
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 6
The agent takes
an action $a_t \in \mathcal{A} \subseteq \mathbb{R}^{N}$,
and receives a scalar reward $r_t$.
(Figure: the agent sends $a_t$ to the environment and receives $x_t$ back.)
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 7
For selecting the action, there is a policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$,
which maps states to a probability distribution over actions.
(Figure: $\pi(s_t)$ assigns probabilities to candidate actions $a_1$ and $a_2$.)
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 8
(Figure: an MDP rollout $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3$, with each $a_t$ drawn from $\pi$, transition probabilities $p(s_2 \mid s_1, a_1)$, $p(s_3 \mid s_2, a_2)$, and rewards $r(s_1, a_1)$, $r(s_2, a_2)$, $r(s_3, a_3)$.)
$R_t = \sum_{i=t}^{T} \gamma^{i-t}\, r(s_i, a_i)$ : cumulative sum of discounted rewards
over the sequence ($\gamma \in [0,1]$: discounting factor).
$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[R_t \mid s_t, a_t\right]$ : state-action value function.
Objective of RL
: find $\pi$ maximizing $\mathbb{E}_{\pi}(R_1)$!
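To make the return concrete, here is a minimal NumPy sketch (mine, not from the slides) that computes $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ from a recorded reward sequence:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99, t=0):
    """Cumulative discounted reward R_t for a recorded episode.

    rewards : sequence of r(s_i, a_i) for i = 0..T
    gamma   : discounting factor in [0, 1]
    t       : time step at which the return is evaluated
    """
    rewards = np.asarray(rewards, dtype=np.float64)[t:]
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: rewards [1, 2, 3] with gamma = 0.9 -> 1 + 0.9*2 + 0.81*3 = 5.23
print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))
```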
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 9
$Q^{\pi_{\mathrm{trinity}}}(s_t, a_t) < Q^{\pi_{\mathrm{neo}}}(s_t, a_t)$
Reinforcement Learning : overview
2016-04-17 CPSLAB (EECS) 10
• From the environment E,
  $x_t$ : observation, $s_t \in \mathcal{S}$ : state.
• If E is fully observed, $s_t = x_t$.
• $a_t \in \mathcal{A}$ : agent’s action.
• $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ : a policy defining the agent’s behavior;
  it maps states to a probability distribution over the actions.
• With $\mathcal{S}$, $\mathcal{A}$, an initial state distribution $p(s_1)$,
  transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$,
  the agent’s behavior can be modeled as a Markov Decision Process (MDP).
• $R_t = \sum_{i=t}^{T} \gamma^{i-t}\, r(s_i, a_i)$ : the sum of discounted future rewards
  with a discounting factor $\gamma \in [0,1]$.
• Objective of RL : learning a policy $\pi$ maximizing $\mathbb{E}_{\pi}(R_1)$.
 For this, the state-action value function $Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}[R_t \mid s_t, a_t]$ is used.
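The interaction loop implied by this formalization can be sketched as follows; this is a generic gym-style loop of my own, with `env`, `policy`, and their interfaces assumed rather than taken from the paper:

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and accumulate the discounted return R_1."""
    s = env.reset()                           # initial state s_1 ~ p(s_1)
    ret, discount = 0.0, 1.0
    for t in range(max_steps):
        a = policy(s)                         # a_t sampled from pi(. | s_t)
        s_next, r, done, info = env.step(a)   # transition dynamics + reward r(s_t, a_t)
        ret += discount * r
        discount *= gamma
        s = s_next
        if done:
            break
    return ret
```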
Q-learning finds the greedy policy $\mu(s) = \arg\max_{a} Q(s, a)$
Reinforcement Learning : Q-Learning
2016-04-17 CPSLAB (EECS) 11
$\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ (stochastic policy)  vs.  $\mu: \mathcal{S} \to \mathcal{A}$ (deterministic policy)
The Bellman equation describes this recursive relationship (reconstructed below).
It gets hard to compute with a stochastic policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$,
so let us consider a deterministic policy instead of a stochastic one.
Can we do this in a continuous action space?
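For reference, the recursive relationship referred to above is, in the form given in the DDPG paper, the Bellman equation for $Q^{\pi}$ and its simplification under a deterministic policy $\mu$:

```latex
% Bellman equation under a stochastic policy \pi
Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\Big[ r(s_t, a_t)
    + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi}\big[ Q^{\pi}(s_{t+1}, a_{t+1}) \big] \Big]

% With a deterministic policy \mu : S -> A, the inner expectation over actions disappears
Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\Big[ r(s_t, a_t)
    + \gamma \, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \Big]
```

Dropping the inner expectation makes the deterministic case tractable, and the remaining expectation depends only on the environment, which is why $Q^{\mu}$ can be learned off-policy.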
Reinforcement Learning : continuous space?
2016-04-17 CPSLAB (EECS) 12
What DQN (Deep Q-Network) did was learn the Q-function with a neural network.
With a function approximator parameterized by $\theta^{Q}$,
model a network $Q(s, a \mid \theta^{Q})$,
and find $\theta^{Q}$
minimizing the loss function $L(\theta^{Q}) = \mathbb{E}\big[(Q(s_t, a_t \mid \theta^{Q}) - y_t)^2\big]$.
Problem 1: How can we know this real target value $y_t$?
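As a concrete (PyTorch-style) illustration of the loss above, here is a minimal critic sketch; the class name, layer sizes, and dimensions are mine, not the paper's architecture:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a | theta^Q) approximated by a small fully connected network."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def critic_loss(critic, s, a, y):
    """L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2].

    Problem 1 on the slide: the target y_t is not known; it itself depends on Q,
    which is why DDPG introduces slowly tracking target networks later.
    """
    return ((critic(s, a) - y) ** 2).mean()
```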
Also model the policy with a network parameterized by $\theta^{\mu}$,
and find a parameter $\theta^{\mu}$
that maximizes $Q\big(s, \mu(s \mid \theta^{\mu})\big)$.
Reinforcement Learning : continuous space?
2016-04-17 CPSLAB (EECS) 13
How can we find the greedy action $\arg\max_a Q(s, a)$ in a continuous action space?
Anyway, if we assume that we know $Q(s, a \mid \theta^{Q})$:
Silver, David, et al. "Deterministic policy gradient algorithms." ICML. 2014.
The gradient of the policy’s performance can then be defined as the deterministic policy gradient (reconstructed below).
Problem 2: How can we successfully explore this action space?
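The policy-gradient expression referenced above is the deterministic policy gradient of Silver et al. (2014), which the DDPG paper writes (with $\rho^{\beta}$ the state distribution of the behavior policy) as:

```latex
\nabla_{\theta^{\mu}} J \approx
  \mathbb{E}_{s_t \sim \rho^{\beta}}\Big[
      \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t)} \;
      \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_t}
  \Big]
```

In code this is usually realized by ascending $Q(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q})$ with respect to $\theta^{\mu}$ only, i.e., backpropagating through the critic into the actor.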
Objective : Learn $Q(s, a \mid \theta^{Q})$ and $\mu(s \mid \theta^{\mu})$ in a continuous action space!
Reinforcement Learning : continuous space?
2016-04-17 CPSLAB (EECS) 14
How can we successfully explore this action space?
→ a problem for the policy network $\mu(s \mid \theta^{\mu})$.
How can we know this real target value?
→ a problem for the Q-network $Q(s, a \mid \theta^{Q})$.
Both are neural networks.
The authors suggest using additional ‘target networks’
$Q'(s, a \mid \theta^{Q'})$ and $\mu'(s \mid \theta^{\mu'})$.
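A minimal sketch of the actor network and the two target copies $Q'$ and $\mu'$; the `Actor` class and names are illustrative (the `Critic` is the one sketched earlier), not the paper's exact architecture:

```python
import copy
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu) producing bounded continuous actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

actor = Actor(obs_dim=3, act_dim=1)
critic = Critic(obs_dim=3, act_dim=1)          # Critic from the earlier sketch
# Target networks Q'(.|theta^{Q'}) and mu'(.|theta^{mu'}) start as exact copies
# of the learned networks and are then updated only slowly (next slides).
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```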
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 15
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 16
Our objective
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 17
A finite-sized cache of transitions $(s_t, a_t, r_t, s_{t+1})$
Our objective
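The "finite-sized cache" is the replay buffer; a minimal sketch (the capacity of $10^6$ follows the paper, the class itself is mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache R of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are discarded first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # Uniform minibatch sampling decorrelates updates, as in DQN.
        return random.sample(self.buffer, batch_size)
```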
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 18
A finite-sized cache of transitions $(s_t, a_t, r_t, s_{t+1})$
Our objective
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 19
A finite-sized cache of transitions $(s_t, a_t, r_t, s_{t+1})$
Our objective
Explored reward + sum of future rewards from target policy network
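In the paper's notation this target is $y_i = r_i + \gamma\, Q'\!\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$; a minimal PyTorch-style sketch (function and variable names are mine):

```python
import torch

def td_targets(r, s_next, target_actor, target_critic, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))  for a sampled minibatch.

    r is assumed to be a (batch, 1) tensor; s_next a (batch, obs_dim) tensor.
    """
    with torch.no_grad():                      # targets are held fixed during the update
        a_next = target_actor(s_next)          # mu'(s_{i+1} | theta^{mu'})
        return r + gamma * target_critic(s_next, a_next)
```

These targets plug into the squared-error critic loss sketched earlier.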
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 20
A finite-sized cache of transitions $(s_t, a_t, r_t, s_{t+1})$
Our objective
$\tau \ll 1$ to avoid divergence
Explored reward + sum of future rewards from target policy network
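The soft update $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ with $\tau \ll 1$ might look like this (the paper uses $\tau = 0.001$; the function name is mine):

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta'  -- keeps the targets slow-moving."""
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)

# Applied after every learning step, e.g.:
# soft_update(target_critic, critic); soft_update(target_actor, actor)
```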
DDPG(Deep DPG) Algorithm
2016-04-17 CPSLAB (EECS) 21
Assume these target networks provide the right, ‘real’ values (treat their outputs as ground truth when forming targets).
Exploration
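For exploration in the continuous action space, the paper adds noise sampled from an Ornstein-Uhlenbeck process to the actor's output, $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$; a minimal NumPy sketch ($\theta = 0.15$, $\sigma = 0.2$ follow the paper, the class itself is my reconstruction):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise N_t added to mu(s_t)."""
    def __init__(self, act_dim, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(act_dim)

    def sample(self):
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# During training:  action = actor(state) + noise.sample()
```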
Experiment : Results
2016-04-17 CPSLAB (EECS) 22