CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May 19, 2020
2/21
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning → DQN (2015), which adds: 1. a replay buffer, 2. a deep neural network, 3. a target network
3/21
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network makes the training process stable
[Block diagram: the environment sends $s_t$ to the Q-network, which selects $a_t = \mathrm{argmax}_a\, Q(s_t, a; \theta)$ and receives $r_t$ and $s_{t+1}$; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; sampled batches feed $Q(s_t, a_t; \theta)$ and the target network's $\max_a Q(s_{t+1}, a; \theta')$ into the DQN loss, which updates $\theta$; $\theta$ is periodically copied to $\theta'$.]
Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\,]$

DQN: $Q_{\pi_\theta}(s_t, a_t) \leftarrow Q_{\pi_\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q_{\pi_{\theta'}}(s_{t+1}, a) - Q_{\pi_\theta}(s_t, a_t)\,]$

Policy ($\pi$): $a_t = \mathrm{argmax}_a\, Q_{\pi_\theta}(s_t, a)$

Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward to go
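A minimal sketch of the DQN update described above, assuming PyTorch; the network sizes, environment dimensions, and hyperparameters are illustrative, not taken from the slides:

```python
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, ·; θ)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, ·; θ′)
target_net.load_state_dict(q_net.state_dict())                             # copy θ → θ′
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)  # stores (s, a, r, s_next, done) tensor tuples
gamma = 0.99

def dqn_update(batch_size=32):
    """One gradient step on the DQN loss using a sampled minibatch."""
    batch = random.sample(replay_buffer, batch_size)      # breaks sample correlation
    s, a, r, s_next, done = (torch.stack(x) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; θ)
    with torch.no_grad():                                 # target uses the slow copy θ′
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```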
4/21
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete and low-dimensional action spaces
- As the action dimension grows, the number of actions (output nodes) grows exponentially
- e.g., $k$ discrete actions per dimension in $n$ dimensions -> $k^n$ actions (the DDPG paper's example: a 7-DoF arm with 3 values per joint already gives $3^7 = 2187$ actions)
• DQN cannot be straightforwardly applied to continuous domains
• Why? Both the policy and the update require a maximization over actions, which becomes an inner optimization problem at every step when actions are continuous:
1. Policy ($\pi$): $a_t = \mathrm{argmax}_a\, Q_{\pi_\theta}(s_t, a)$
2. Update: $Q_{\pi_\theta}(s_t, a_t) \leftarrow Q_{\pi_\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q_{\pi_{\theta'}}(s_{t+1}, a) - Q_{\pi_\theta}(s_t, a_t)\,]$
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
5/21
DDPG: DQN with policy gradient methods
Q-learning → DQN (1. replay buffer, 2. deep neural network, 3. target network)
Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces)
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
6/21
Policy gradient: The goal of reinforcement learning
[Diagram: the agent's policy $\pi(a_t|s_t)$ sends the action $a_t$ to the world; the world's model $p(s_{t+1}|s_t, a_t)$ returns the reward $r_t$ and next state $s_{t+1}$. As a Markov decision process, the trajectory $\tau$ unrolls $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots$ with transitions $p(s_2|s_1, a_1)$, $p(s_3|s_2, a_2)$, $p(s_4|s_3, a_3)$, ...]

Trajectory distribution: $p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$

Goal of reinforcement learning (objective $J(\theta)$): $\theta^* = \mathrm{argmax}_\theta\, E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$

Policy ($\pi_\theta$): a stochastic policy with weights $\theta$
7/21
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: 𝑎 𝑡 ~ 𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: given a state $s_t$, the policy network outputs a probability $\pi_\theta(a_t|s_t)$ for each discrete action, e.g. 0.1, 0.1, 0.2, 0.2, 0.4.]
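A quick sketch of sampling $a_t \sim \pi_\theta(a_t|s_t)$ from such a distribution, assuming PyTorch; the probabilities are the illustrative ones from the figure:

```python
import torch

probs = torch.tensor([0.1, 0.1, 0.2, 0.2, 0.4])             # π_θ(·|s_t) from the figure
a_t = torch.distributions.Categorical(probs=probs).sample()  # stochastic action choice
```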
8/21
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$

$J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$

$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\big)\big(\sum_{t=1}^{T} r(s_t, a_t)\big)\Big]$

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\big)\big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\big)$, where $N$ is the number of episodes

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

$\theta$: weights of the actor network, $\alpha$: learning rate

Problem: the agent must experience entire episodes (returns $r_1, r_2, \dots, r_N$ from the initial state) before each update, which causes
1. a slow training process
2. high gradient variance

Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
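A minimal sketch of this Monte-Carlo update, assuming PyTorch; the policy network shape and the episode data format are illustrative:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # logits of π_θ(a|s)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(episodes):
    """episodes: N trajectories, each a list of (state, action, reward)."""
    loss = 0.0
    for traj in episodes:
        states = torch.stack([s for s, _, _ in traj])
        actions = torch.tensor([a for _, a, _ in traj])
        total_return = sum(r for _, _, r in traj)            # Σ_t r(s_t, a_t)
        log_probs = torch.distributions.Categorical(
            logits=policy(states)).log_prob(actions)         # log π_θ(a_t|s_t)
        # minimizing -J(θ) ascends the Monte-Carlo policy gradient above
        loss = loss - log_probs.sum() * total_return / len(episodes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the whole-episode return multiplies every log-probability term, which is exactly why the estimate has high variance.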
9/21
Policy gradient: Actor-critic
• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions
Replacing REINFORCE's Monte-Carlo return with the learned critic $Q_\phi(s_t, a_t)$ addresses its two problems: 1. high gradient variance, 2. slow training.

1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $i$ times
2. Fit $Q_\phi(s_t, a_t)$ to the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q_\phi(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

$\phi$: weights of the critic network

[Diagram: the actor $\pi_\theta(a_t|s_t)$ sends $a_t$ to the environment and observes $s_t$; the transitions $(s_t, a_t, r_t, s_{t+1})_{0 \sim i}$ update the critic $Q_\phi(s_t, a_t)$, which supplies $\nabla J(\theta)$ back to the actor. Training alternates from the initial state: sample data $i$ times, update critic & actor, and repeat.]

Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
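A sketch of one actor-critic step (steps 1-4 above), assuming PyTorch; the networks, the one-step TD target used to fit the critic, and the tensor shapes are illustrative choices, not from the slides:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # π_θ logits
critic = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 1))  # Q_φ(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_update(s, a, r, s_next):
    """s: (i, 4) states, a: (i,) long actions, r: (i,) rewards, s_next: (i, 4)."""
    # 2. fit Q_φ to the sampled data using a one-step TD target
    with torch.no_grad():
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()
        q_next = critic(torch.cat([s_next, a_next.float().unsqueeze(1)], 1)).squeeze(1)
        target = r + gamma * q_next
    q = critic(torch.cat([s, a.float().unsqueeze(1)], 1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 3-4. policy gradient weighted by the critic's evaluation Q_φ(s_t, a_t)
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```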
10/21
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: 𝑎t = 𝜇 𝜃(𝑠t)
[Figure: a two-dimensional action $(a_x, a_y)$ selected for a state $s_t$.]
• Stochastic policy $\pi_\theta(a_t|s_t)$: discretizing each of the 2 action dimensions into 5 bins requires 10 output nodes
• Deterministic policy $\mu_\theta(s_t)$: only 2 outputs are needed, one value per action dimension
11/21
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$

1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $i$ times
2. Fit $Q_\phi(s_t, a_t)$ to the samples
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a)\big|_{a=\mu_\theta(s_t)}$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Trajectory distribution $\tau$ (stochastic policy): $p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$
Trajectory distribution $\tau$ (deterministic policy): $p_\theta(s_1, s_2, s_3, \dots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, \mu_\theta(s_t))$

Objective: $J(\theta) = E_{s,a \sim p_\theta(\tau)}\big[Q_\phi(s_t, a_t)\big]$ becomes $J(\theta) = E_{s \sim p_\theta(\tau)}\big[Q_\phi(s, \mu_\theta(s))\big]$

Critic TD error: $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$ (the critic loss is the square of this error)
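A sketch of the actor update in step 3, assuming PyTorch; backpropagation applies the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s, a)|_{a=\mu_\theta(s)}$ automatically when $Q_\phi(s, \mu_\theta(s))$ is differentiated with respect to $\theta$ (networks and shapes are illustrative):

```python
import torch
import torch.nn as nn

mu = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())  # μ_θ(s)
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))     # Q_φ(s, a)
actor_opt = torch.optim.Adam(mu.parameters(), lr=1e-4)

def dpg_actor_step(s):
    """Ascend E[Q_φ(s, μ_θ(s))] in θ for a batch of states s of shape (i, 4)."""
    actor_loss = -critic(torch.cat([s, mu(s)], dim=1)).mean()  # minimize -J(θ)
    actor_opt.zero_grad()
    actor_loss.backward()  # gradient flows through a = μ_θ(s); only the actor steps
    actor_opt.step()
```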
12/21
DDPG: DQN + DPG
Q-learning (- discrete action spaces, - low-dimensional observation spaces) → DQN:
+ off policy: replay buffer
+ stable update: target network
+ high-dimensional observation spaces
- still discrete action spaces
Policy gradient (REINFORCE, - high variance) → Actor-critic (+ lower variance) → DPG:
+ continuous action spaces
- no replay buffer: sample correlation
- no target network: unstable
DQN + DPG → DDPG, combining the strengths of both.
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
13/21
DDPG: algorithm(1/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Policy: from $a = \mathrm{argmax}_a\, Q_{\pi_\theta}(s, a)$ to $a = \mu_\theta(s)$
• Exploration: add noise to the deterministic action, $\mu'(s) = \mu_\theta(s) + \mathcal{N}$
- the original DDPG paper uses temporally correlated Ornstein-Uhlenbeck noise, though uncorrelated white Gaussian noise is also common
• Soft target update: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ where $\tau \ll 1$
- the target networks are constrained to change slowly
- this stabilizes the training process
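A sketch of the soft target update, assuming PyTorch modules for the online and target networks:

```python
import torch

def soft_update(net, target_net, tau=0.001):
    """θ′ ← τθ + (1 − τ)θ′, applied parameter-by-parameter."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```

Because $\tau \ll 1$, the target parameters trail the learned ones slowly, which is what keeps the bootstrapped targets stable.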
14/21
DDPG: algorithm (2/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
[Block diagram: four networks: policy $\mu_\theta$, target policy $\mu_{\theta'}$, critic $Q_\phi$, target critic $Q_{\phi'}$.
1. Select the action $a_t = \mu_\theta(s_t) + \mathcal{N}$ and apply it to the environment: $(s_t, a_t, r_t) \rightarrow s_{t+1}$
2. Store the transitions $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer and sample a batch of $i$ transitions
3. Update the critic by minimizing the loss $L(\phi)$, with the target action $\mu_{\theta'}(s_{t+1})$ supplied by the target policy
4. Update the actor with $\nabla J(\theta)$ evaluated through $\mu_\theta(s_t)$
5. Soft-update the target weights $\theta'$ and $\phi'$.]
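Putting the pieces together, a compact self-contained sketch of one full DDPG update, assuming PyTorch; the shapes, layer sizes, and hyperparameters are illustrative:

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 4, 2, 0.99, 0.001
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One update from a replay-buffer minibatch of transitions."""
    # critic: regress Q_φ(s_t, a_t) onto r_t + γ Q_φ′(s_{t+1}, μ_θ′(s_{t+1}))
    with torch.no_grad():
        q_next = target_critic(torch.cat([s_next, target_actor(s_next)], 1)).squeeze(1)
        target = r + gamma * q_next
    q = critic(torch.cat([s, a], 1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: deterministic policy gradient, ascend Q_φ(s, μ_θ(s))
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft target updates: θ′ ← τθ + (1 − τ)θ′, and likewise for φ′
    with torch.no_grad():
        for net, targ in ((actor, target_actor), (critic, target_critic)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```

Acting in the environment uses $a_t = \mu_\theta(s_t) + \mathcal{N}$ and stores transitions in the replay buffer, exactly as in the diagram above.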
15/21
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
16/21
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG is used as the local planner for long-range navigation
17/21
DDPG example: multi agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
18/21
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG handles continuous action spaces by combining the deterministic policy gradient method with an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: apply DDPG to continuous-action decision-making problems
- e.g., navigation, obstacle avoidance
19/21
Appendix: Objective gradient derivation
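The derivation on this slide was an image; a standard reconstruction of the likelihood-ratio (log-derivative) derivation it presumably showed, using the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$ and writing $r(\tau) = \sum_t r(s_t, a_t)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]$$

Since $\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \big(\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\big)$ and only the policy terms depend on $\theta$,

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\big(\textstyle\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\big)\big(\textstyle\sum_{t=1}^{T} r(s_t, a_t)\big)\Big]$$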
20/21
Appendix: DPG objective
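This slide's content was also an image; the deterministic policy gradient theorem from Silver et al. (2014), which it presumably stated, is

$$\nabla_\theta J(\theta) = E_{s \sim \rho^\mu}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\big]$$

where $\rho^\mu$ is the discounted state distribution under $\mu_\theta$; this is the expectation form of step 3 on the DPG slide.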
21/21
Appendix: DDPG algorithm