Continuous Control with
Deep Reinforcement Learning
2016 ICLR
Timothy P. Lillicrap, et al. (Google DeepMind)
Presenter : Hyemin Ahn
Introduction
2016-04-17 CPSLAB (EECS) 2
 Another deep learning + reinforcement learning work from Google DeepMind!
 It extends their Deep Q-Network (DQN), which deals with discrete action spaces, to continuous action spaces.
Results : Preview
Reinforcement Learning : overview
Agent
How can we formalize our behavior?
At each time step t, the agent receives an observation x_t from the environment E.
The agent takes an action a_t ∈ 𝒜 ⊆ ℝ^N, and receives a scalar reward r_t.
For selecting the action there is a policy π : 𝒮 → 𝒫(𝒜), which maps states to a probability distribution over actions.
The interaction unrolls as a Markov Decision Process (MDP): from state s_1 the policy π selects action a_1, the environment transitions via p(s_2 | s_1, a_1) to s_2, then π selects a_2, the environment transitions via p(s_3 | s_2, a_2), and so on, yielding rewards r(s_1, a_1), r(s_2, a_2), r(s_3, a_3), …
R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) : cumulative sum of rewards over sequences (γ ∈ [0, 1] : discounting factor).
Q^π(s_t, a_t) = E_π[R_t | s_t, a_t] : state-action value function.
Objective of RL : find π maximizing E_π(R_1)!
Q^π_trinity(s_t, a_t) < Q^π_neo(s_t, a_t)
• From environment E,
x_t : observation
s_t ∈ 𝒮 : state.
• If E is fully observed, s_t = x_t.
• a_t ∈ 𝒜 : the agent's action.
• π : 𝒮 → 𝒫(𝒜) : a policy defining the agent's behavior; it maps states to a probability distribution over the actions.
• With 𝒮, 𝒜, an initial state distribution p(s_1), transition dynamics p(s_{t+1} | s_t, a_t), and reward function r(s_t, a_t), the agent's behavior can be modeled as a Markov Decision Process (MDP).
• R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) : the sum of discounted future rewards, with a discounting factor γ ∈ [0, 1].
• Objective of RL : learn a policy π maximizing E_π(R_1).
 For this, the state-action value function Q^π(s_t, a_t) = E_π[R_t | s_t, a_t] is used.
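The discounted return R_t above can be computed with a simple backward recursion; a minimal sketch (a toy helper, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    # R_t = sum_{i=t}^{T} gamma^(i-t) * r_i, accumulated from the last step back
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```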
Reinforcement Learning : Q-Learning
The Bellman equation refers to the recursive relationship Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a_{t+1}~π}[Q^π(s_{t+1}, a_{t+1})]].
The inner expectation is hard to compute under a stochastic policy π : 𝒮 → 𝒫(𝒜), so let us think about a deterministic policy μ : 𝒮 → 𝒜 instead of a stochastic one.
Q-learning is finding μ(s) = argmax_a Q(s, a), the greedy policy.
Can we do this in a continuous action space?
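In a small discrete problem, the greedy policy and its Bellman backup can be sketched as (toy tabular example, names assumed):

```python
def greedy_action(Q, s, actions):
    # mu(s) = argmax_a Q(s, a) -- the greedy (deterministic) policy
    return max(actions, key=lambda a: Q[(s, a)])

def bellman_target(Q, r, s_next, actions, gamma=0.99):
    # y = r + gamma * Q(s', mu(s')) -- greedy Bellman backup
    return r + gamma * Q[(s_next, greedy_action(Q, s_next, actions))]

# tiny tabular example: two states, two actions
Q = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 0.0}
print(greedy_action(Q, 0, [0, 1]))                    # action 1
print(bellman_target(Q, 1.0, 1, [0, 1], gamma=0.5))   # 1.0 + 0.5*0.5 = 1.25
```

The argmax over a finite action set is exactly what stops working when 𝒜 is continuous.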
Reinforcement Learning : continuous space?
What DQN (Deep Q-Network) did was learn the Q-function with a neural network: with a function approximator parameterized by θ^Q, model a network Q(s, a | θ^Q), and find the θ^Q minimizing the loss function L(θ^Q) = E[(Q(s_t, a_t | θ^Q) − y_t)^2].
Problem 1: How can we know this real target value y_t?
Also model the policy with a network parameterized by θ^μ, and find a θ^μ that maximizes the expected return E[Q(s, μ(s | θ^μ) | θ^Q)].
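As a sketch of this loss with a toy linear critic (the hypothetical Q_theta below stands in for the deep network):

```python
def critic_loss(theta, batch):
    """Mean squared Bellman error L(theta) = mean[(Q_theta(s, a) - y)^2]
    for a toy linear critic Q_theta(s, a) = theta[0]*s + theta[1]*a."""
    total = 0.0
    for s, a, y in batch:
        q = theta[0] * s + theta[1] * a   # critic's estimate
        total += (q - y) ** 2             # squared error against target y
    return total / len(batch)

# one (s, a, target) sample: Q = 1*1 + 0*0.5 = 1.0, target 2.0 -> loss 1.0
print(critic_loss([1.0, 0.0], [(1.0, 0.5, 2.0)]))  # 1.0
```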
Anyway, if we assume that we know Q(s, a | θ^Q), how can we find argmax_a Q(s, a) in a continuous action space? The gradient of the policy's performance can be defined as ∇_{θ^μ} J ≈ E[∇_a Q(s, a | θ^Q)|_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ)] (Silver, David, et al. "Deterministic Policy Gradient Algorithms." ICML, 2014).
Problem 2: How can we successfully explore this action space?
Objective : learn Q(s, a | θ^Q) and μ(s | θ^μ) in a continuous action space!
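A scalar finite-difference sketch of this gradient chain rule, assuming a toy actor μ(θ, s) = θ·s and a made-up critic (not the paper's networks):

```python
def dpg_step(theta, states, Q, lr=0.05, eps=1e-4):
    # grad_theta J ≈ mean_s [ dQ/da |_{a=mu(s)} * dmu/dtheta ],
    # with mu(theta, s) = theta * s, so dmu/dtheta = s
    grad = 0.0
    for s in states:
        a = theta * s
        dq_da = (Q(s, a + eps) - Q(s, a - eps)) / (2 * eps)  # central difference
        grad += dq_da * s
    return theta + lr * grad / len(states)  # ascend the expected return

# toy critic rewarding a == s: the optimal actor weight is theta = 1
Q = lambda s, a: -(a - s) ** 2
theta = 0.0
for _ in range(100):
    theta = dpg_step(theta, [0.5, 1.0, 2.0], Q)
print(theta)  # approaches 1.0
```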
Both Q(s, a | θ^Q) and μ(s | θ^μ) are neural networks, so we face the problem of how to know the real target value (for Q) and the problem of how to successfully explore the action space (for μ). The authors suggest using additional 'target networks' Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}).
DDPG(Deep DPG) Algorithm
Our objective : learn Q(s, a | θ^Q) and μ(s | θ^μ) in a continuous action space.
Replay buffer : a finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1}), from which minibatches are sampled for training.
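The replay buffer can be sketched as (a minimal Python version, not the authors' implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform minibatch sampling breaks temporal correlations between samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from old and new experience is what lets DDPG learn off-policy with uncorrelated minibatches.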
Target value : the explored reward plus the discounted sum of future rewards from the target networks, y_t = r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′}).
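A sketch of forming these targets from a sampled minibatch (q_target and mu_target are hypothetical callables standing in for the target networks):

```python
def target_values(batch, q_target, mu_target, gamma=0.99):
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})): explored reward plus
    # discounted future reward estimated by the target networks
    return [r + gamma * q_target(s2, mu_target(s2)) for (s, a, r, s2) in batch]

# toy stand-ins: target critic always says 2.0, target actor always acts 0
ys = target_values([(0, 0, 1.0, 1)], lambda s, a: 2.0, lambda s: 0, gamma=0.5)
print(ys)  # [2.0]
```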
Soft target updates : θ′ ← τθ + (1 − τ)θ′, with τ ≪ 1 for avoiding divergence.
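The soft update can be sketched over plain parameter lists (the paper applies it to network weights):

```python
def soft_update(target_params, params, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta';
    # tau << 1 makes the targets track the learned networks slowly
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

print(soft_update([0.0], [1.0], tau=0.5))  # [0.5]
```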
When computing targets, the slowly-updated target networks are treated as if they were the right, real networks.
Exploration : add noise sampled from a noise process 𝒩 (an Ornstein-Uhlenbeck process in the paper) to the actor's output, a_t = μ(s_t | θ^μ) + 𝒩.
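A minimal Ornstein-Uhlenbeck process sketch (parameter names assumed; the paper uses θ = 0.15 and σ = 0.2):

```python
import random

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise added to the actor's output for exploration."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1):
        # mean-reverting drift plus Gaussian diffusion
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x
```

The mean reversion keeps successive noise samples correlated, which suits exploration in physical control tasks with inertia.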
Experiment : Results
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 

0415_seminar_DeepDPG

  • 1. Continuous Control with Deep Reinforcement Learning 2016 ICLR Timothy P. Lillicrap, et al. (Google DeepMind) Presenter : Hyemin Ahn
  • 2. Introduction 2016-04-17 CPSLAB (EECS) — Another deep learning + reinforcement learning work from Google DeepMind! It extends their Deep Q-Network (DQN), which handles only discrete action spaces, to continuous action spaces.
  • 4. Reinforcement Learning : overview — How can we formalize the agent's behavior?
  • 5. Reinforcement Learning : overview — At each time t, the agent receives an observation x_t from the environment E.
  • 6. Reinforcement Learning : overview — The agent takes an action a_t ∈ 𝒜 ⊆ ℝ^N and receives a scalar reward r_t.
  • 7. Reinforcement Learning : overview — For selecting the action there is a policy π : 𝒮 → 𝒫(𝒜), which maps states to probability distributions over actions.
  • 8. Reinforcement Learning : overview — The interaction forms a Markov decision process (MDP): in state s_i the policy π selects action a_i, the agent receives reward r(s_i, a_i), and the next state follows the transition dynamics p(s_{i+1} | s_i, a_i). The return R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) is the discounted cumulative reward over the sequence (discount factor γ ∈ [0, 1]), and Q^π(s_t, a_t) = 𝔼_π[R_t | s_t, a_t] is the state-action value function. Objective of RL: find the π maximizing 𝔼_π(R_1)!
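The return defined above can be sketched in a few lines. This is a generic illustration (the function name `discounted_return` is mine, not from the slides); iterating backward over a finished episode's rewards implements the recursion R_t = r_t + γ·R_{t+1}:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_1 = sum_{i=1}^{T} gamma^(i-1) * r_i for a finished episode."""
    R = 0.0
    for r in reversed(rewards):   # fold from the last reward backward
        R = r + gamma * R
    return R
```

For example, with γ = 0.5 the rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75.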
  • 9. Reinforcement Learning : overview — Q^{π_trinity}(s_t, a_t) < Q^{π_neo}(s_t, a_t): a better policy attains higher state-action values.
  • 10. Reinforcement Learning : overview — From the environment E, x_t is the observation and s_t ∈ 𝒮 the state; if E is fully observed, s_t = x_t. a_t ∈ 𝒜 is the agent's action, and π : 𝒮 → 𝒫(𝒜) is a policy defining the agent's behavior by mapping states to probability distributions over actions. With 𝒮, 𝒜, an initial state distribution p(s_1), transition dynamics p(s_{t+1} | s_t, a_t), and reward function r(s_t, a_t), the agent's behavior can be modeled as a Markov decision process (MDP). R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) is the sum of discounted future rewards with discount factor γ ∈ [0, 1]. Objective of RL: learn a policy π maximizing 𝔼_π(R_1); for this, the state-action value function Q^π(s_t, a_t) = 𝔼_π[R_t | s_t, a_t] is used.
  • 11. Reinforcement Learning : Q-Learning — Q-learning finds the greedy policy μ(s) = argmax_a Q(s, a). The Bellman equation gives the recursive relationship Q^π(s_t, a_t) = 𝔼[r(s_t, a_t) + γ 𝔼_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]], but the inner expectation gets hard to compute with a stochastic policy π : 𝒮 → 𝒫(𝒜). Consider a deterministic policy μ : 𝒮 → 𝒜 instead: the relationship simplifies to Q^μ(s_t, a_t) = 𝔼[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))]. Can we do this in a continuous action space?
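As a concrete illustration of the greedy Bellman backup, here is one tabular Q-learning step — a hypothetical sketch (the table and names are mine), not the function approximation the talk moves to next:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward the Bellman target
    y = r + gamma * max_a' Q(s', a'), i.e. the greedy (deterministic) policy."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move estimate toward the target
    return Q
```

The `max` over `actions` is exactly what stops working when the action space is continuous, which motivates the deterministic actor below.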
  • 12. Reinforcement Learning : continuous space? — What DQN (Deep Q-Network) did was learn the Q-function with a neural network: with a function approximator parameterized by θ^Q, model a network Q(s, a | θ^Q) and find the θ^Q minimizing the loss L(θ^Q) = 𝔼[(Q(s_t, a_t | θ^Q) − y_t)²], where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q). Problem 1: how can we know this real target value?
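A minimal sketch of that squared Bellman-error loss, substituting a linear critic for DQN's neural network so it fits in a few lines (the names `critic_loss` and `phi` are illustrative assumptions, not from the paper):

```python
def critic_loss(theta_q, phi, phi_next, r, gamma=0.99):
    """Squared Bellman error for a linear critic Q(s, a) = theta_q . phi(s, a).
    phi / phi_next are feature vectors for (s_t, a_t) and (s_{t+1}, mu(s_{t+1}))."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    y = r + gamma * dot(theta_q, phi_next)   # bootstrapped target y_t
    q = dot(theta_q, phi)                    # current estimate Q(s_t, a_t)
    return (q - y) ** 2
```

Note that the target y itself depends on θ^Q — the "Problem 1" the slide raises, which the target networks later address.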
  • 13. Reinforcement Learning : continuous space? — Also model the policy with a network parameterized by θ^μ, and find a θ^μ maximizing J = 𝔼[Q(s, μ(s | θ^μ) | θ^Q)]. But how can we find max_a Q(s, a) in a continuous action space? If we assume Q is known anyway, the gradient of the policy's performance can be written as ∇_{θ^μ} J ≈ 𝔼[∇_a Q(s, a | θ^Q)|_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ)] (Silver, David, et al. "Deterministic policy gradient algorithms." ICML 2014). Problem 2: how can we successfully explore this action space?
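The deterministic policy gradient above can be made concrete in a toy 1-D linear case (a hypothetical sketch, not the paper's networks): with actor a = θ^μ·s and critic Q(s, a) = w_s·s + w_a·a, the chain rule gives dJ/dθ^μ = (∂Q/∂a)·(∂μ/∂θ^μ) = w_a·s:

```python
def dpg_gradient(w_a, s):
    """Deterministic policy gradient (Silver et al., 2014) in a 1-D linear toy:
    actor a = theta_mu * s, critic Q(s, a) = w_s*s + w_a*a,
    so dJ/dtheta_mu = dQ/da |_{a=mu(s)} * dmu/dtheta_mu = w_a * s."""
    dq_da = w_a        # critic gradient w.r.t. the action (constant for a linear critic)
    dmu_dtheta = s     # actor-output gradient w.r.t. its parameter theta_mu
    return dq_da * dmu_dtheta
```

The point of the factorization: the actor never needs an explicit max over actions, only the critic's gradient with respect to the action.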
  • 14. Reinforcement Learning : continuous space? — Objective: learn Q(s, a | θ^Q) and μ(s | θ^μ), both neural networks, in a continuous action space. Problem 1: how can we know the real target value? Problem 2: how can we successfully explore the action space? The authors suggest using additional 'target networks' Q′ and μ′.
  • 16. DDPG (Deep DPG) Algorithm — our objective.
  • 17. DDPG (Deep DPG) Algorithm — a replay buffer: a finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1}).
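A replay buffer as described — a finite-sized cache that drops the oldest transition when full — can be sketched as follows (a minimal illustration; DDPG additionally samples uniform minibatches from it for training):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1});
    when full, the oldest transition is discarded."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform minibatch without replacement
        return random.sample(self.buffer, batch_size)
```

Sampling uniformly from old experience breaks the temporal correlation between consecutive transitions, which stabilizes the network updates.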
  • 18. DDPG (Deep DPG) Algorithm — a replay buffer: a finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1}).
  • 19. DDPG (Deep DPG) Algorithm — replay buffer: a finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1}). Target value: the explored reward plus the discounted sum of future rewards from the target policy network.
  • 20. DDPG (Deep DPG) Algorithm — replay buffer: a finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1}). Target value: the explored reward plus the discounted sum of future rewards from the target policy network. Target networks are updated slowly, with τ ≪ 1, to avoid divergence.
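The "soft" target update θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1 is small enough to write out directly (a sketch over plain parameter lists rather than network weights):

```python
def soft_update(target_params, source_params, tau=0.001):
    """Soft target update: theta' <- tau*theta + (1 - tau)*theta'.
    With tau << 1 the target networks change slowly, keeping the
    bootstrapped target values stable."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With τ = 0.001 the target network trails the learned network by roughly a thousand updates, which is what keeps the moving target from diverging.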
  • 21. DDPG (Deep DPG) Algorithm — the target networks' outputs are assumed to be the right, "real" targets, and exploration is handled by adding noise to the actor's action selection.
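For exploration, the DDPG paper adds temporally correlated noise from an Ornstein-Uhlenbeck process to the actor's action; a minimal scalar sketch (the class name is mine; θ = 0.15 and σ = 0.2 follow the paper's reported choices):

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, as used by DDPG for temporally
    correlated exploration: x <- x + theta*(mu - x) + sigma*N(0, 1)."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = mu   # start at the long-run mean

    def sample(self):
        # mean-reverting drift toward mu plus Gaussian diffusion
        self.x += self.theta * (self.mu - self.x) + self.sigma * random.gauss(0.0, 1.0)
        return self.x
```

The mean-reverting drift makes successive noise samples correlated, which suits physical control tasks with momentum better than independent Gaussian noise.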