This document provides an overview of deep deterministic policy gradient (DDPG), which combines aspects of DQN and policy gradient methods to enable deep reinforcement learning with continuous action spaces. It summarizes DQN and its limitations for continuous domains. It then explains policy gradient methods like REINFORCE, actor-critic, and deterministic policy gradient (DPG) that can handle continuous action spaces. DDPG adopts key elements of DQN like experience replay and target networks, and models the policy as a deterministic function like DPG, to apply deep reinforcement learning to complex continuous control tasks.
CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building, Room 810
Web: https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning → DQN (2015), which adds:
1. replay buffer
2. deep neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network stabilizes the training process
[Figure: DQN training loop — the Q network selects actions $a_t = \arg\max_a Q(s_t, a; \theta)$ in the environment; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; the target network computes $\max_a Q(s_{t+1}, a; \theta')$ for the DQN loss, which updates $\theta$; $\theta$ is periodically copied to $\theta'$.]
Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

DQN:
$$Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\right]$$

Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$

Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward-to-go
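The update above is easy to see in code. As a minimal sketch (not from the slides), here is the tabular Q-learning step; DQN replaces the table with a deep network trained on replay-buffer minibatches, with the bootstrap term computed by the target network $\theta'$:

```python
import numpy as np

# Illustrative sizes; any small MDP works.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    td_target = r + gamma * Q[s_next].max()   # r_{t+1} + gamma * max_a Q(s', a)
    Q[s, a] += alpha * (td_target - Q[s, a])
```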
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete and low-dimensional action spaces
- As the action dimension increases, the number of actions (output nodes) grows exponentially
- i.e. $k$ discrete actions per dimension with $n$ dimensions → $k^n$ actions
• DQN cannot be straightforwardly applied to continuous domains, for two reasons:
1. The policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$ requires maximizing over the whole action space at every step
2. The update: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\right]$ requires the same maximization for the bootstrap target
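A quick illustration of this exponential blow-up (the sizes below are illustrative, not from the slides):

```python
# k bins per dimension, n action dimensions -> k**n joint discrete actions.
for k, n in [(5, 2), (10, 3), (10, 7)]:
    print(f"k={k}, n={n}: {k**n:,} discrete actions")
# k=10, n=7 (e.g. a 7-DoF arm) already needs 10,000,000 output nodes.
```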
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: DQN with Policy gradient methods
Q-learning → DQN (1. replay buffer, 2. deep neural network, 3. target network)
Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces)
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: The goal of Reinforcement learning
[Figure: agent–world loop — the agent's policy $\pi(a_t|s_t)$ outputs an action $a_t$; the world model $p(s_{t+1}|s_t, a_t)$ returns the reward $r_t$ and next state $s_{t+1}$.]

[Figure: Markov decision process — the chain $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$ with transition probabilities $p(s_{t+1}|s_t, a_t)$.]

Trajectory distribution:
$$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

Goal of reinforcement learning (objective $J(\theta)$):
$$\theta^* = \arg\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

where $\tau = (s_1, a_1, s_2, a_2, \ldots)$ is a trajectory and $\pi_\theta$ is a stochastic policy with weights $\theta$.
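Since $J(\theta)$ is an expectation over trajectories, it can be estimated by rolling the policy out and averaging returns. A minimal sketch, assuming placeholder `env_reset`, `env_step`, and `policy` functions (not defined in the slides):

```python
import numpy as np

def estimate_J(env_reset, env_step, policy, n_episodes=100, horizon=200):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[sum_t r(s_t, a_t)]."""
    returns = []
    for _ in range(n_episodes):
        s, total = env_reset(), 0.0
        for _ in range(horizon):
            a = policy(s)            # a_t ~ pi_theta(a_t | s_t)
            s, r = env_step(s, a)    # s_{t+1} ~ p(s_{t+1} | s_t, a_t), reward r_t
            total += r
        returns.append(total)
    return np.mean(returns)
```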
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: given $s_t$, the policy network $\pi_\theta(a_t|s_t)$ outputs a probability for each action, e.g. (0.1, 0.1, 0.2, 0.2, 0.4), and the action is sampled from this distribution.]
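As a small illustration of such a stochastic policy head (layer sizes are assumptions, not from the slides):

```python
import torch
import torch.nn as nn

# Hypothetical policy head with 5 discrete actions, matching the
# probability bars in the figure above.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 5))

state = torch.randn(4)                            # placeholder s_t
dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()                            # a_t ~ pi_theta(a_t | s_t)
log_prob = dist.log_prob(action)                  # log pi_theta(a_t|s_t), used by REINFORCE
```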
Policy gradient: REINFORCE
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)$$

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $N$ is the number of sampled episodes.
Problem: the policy can only be updated after experiencing complete episodes, which causes
1. a slow training process
2. high gradient variance
[Figure: $N$ episodes sampled from the initial state, with returns $r_1, r_2, \ldots, r_N$.]
$\theta$: weights of the actor network
$\alpha$: learning rate
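A minimal sketch of the estimator above (the names `log_probs` and `rewards` are illustrative; they would be collected with sampling code like the earlier snippet):

```python
import torch

def reinforce_loss(log_probs, rewards):
    """Surrogate loss whose gradient matches the REINFORCE estimator:
    (1/N) sum_i [(sum_t grad log pi_theta(a_t|s_t)) * (sum_t r(s_t, a_t))].

    log_probs[i]: list of log pi_theta(a_t|s_t) tensors for episode i
    rewards[i]:   list of float rewards for episode i
    """
    per_episode = []
    for lp, r in zip(log_probs, rewards):
        episode_return = sum(r)                         # sum_t r(s_t, a_t)
        per_episode.append(-torch.stack(lp).sum() * episode_return)
    return torch.stack(per_episode).mean()              # minimizing this ascends J

# optimizer.step() on this loss implements theta <- theta + alpha * grad J(theta).
```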
Policy gradient: Actor-critic
(the Monte Carlo return $\sum_t r(s_t, a_t)$ is replaced by a learned critic $Q_\phi(s_t, a_t)$)
• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions
[Figure: training timeline — starting from the initial state, repeat: sample data $i$ times, then update critic & actor.]
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ to fit the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i)\, Q_\phi(s_i, a_i)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
$\phi$: weights of the critic network
Compared with REINFORCE, the critic mitigates: 1. high gradient variance, 2. the slow training process (the policy updates every $i$ samples instead of waiting for full episodes), as sketched in the code at the end of this slide.
[Figure: actor-critic loop — the actor $\pi_\theta(a_t|s_t)$ sends $a_t$ to the environment, which returns $s_{t+1}$ and $r_t$; transitions $(s_t, a_t, r_t, s_{t+1})_{0 \ldots i}$ update the critic $Q_\phi(s_t, a_t)$, which supplies $\nabla J(\theta)$ to the actor.]
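A minimal sketch of steps 1–4 for a discrete-action case (network sizes and names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))   # pi_theta
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # Q_phi(s, .)

def actor_critic_losses(s, a, r, s_next):
    """Losses for one sampled transition (steps 2-3 above)."""
    q_sa = critic(s)[a]
    with torch.no_grad():                              # bootstrapped TD target
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()
        td_target = r + gamma * critic(s_next)[a_next]
    critic_loss = (q_sa - td_target) ** 2              # step 2: fit Q_phi to the data

    log_pi = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -log_pi * q_sa.detach()               # step 3: grad log pi * Q_phi
    return actor_loss, critic_loss
```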
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
[Figure: stochastic vs. deterministic policy heads for a 2-dimensional action $(a_x, a_y)$ given $s_t$.]
• Stochastic policy $\pi_\theta(a_t|s_t)$: 5 discretized values per dimension over 2 dimensions require 10 output nodes
• Deterministic policy $\mu_\theta(s_t)$: only 2 output nodes are needed
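A minimal sketch of why this matters for learning (architecture and sizes are illustrative): with a deterministic actor, the policy gradient flows from the critic straight through the action, with no $\arg\max$ over actions. This is the core of DPG, and DDPG adds DQN's replay buffer and target networks on top:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())        # a = mu_theta(s), 2 outputs
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                        # Q_phi(s, a)

s = torch.randn(obs_dim)                   # placeholder state
a = actor(s)                               # continuous action, no argmax needed
actor_loss = -critic(torch.cat([s, a]))    # ascend Q_phi(s, mu_theta(s))
actor_loss.backward()                      # chain rule: grad_a Q * grad_theta mu
```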
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG used as local planner for long range navigation
DDPG example: multi-agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG can handle continuous action spaces via policy gradient methods and an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: apply DDPG to continuous-action decision-making problems
• e.g. navigation, obstacle avoidance