This document provides an overview of deep deterministic policy gradient (DDPG), which combines aspects of DQN and policy gradient methods to enable deep reinforcement learning with continuous action spaces. It summarizes DQN and its limitations for continuous domains. It then explains policy gradient methods like REINFORCE, actor-critic, and deterministic policy gradient (DPG) that can handle continuous action spaces. DDPG adopts key elements of DQN like experience replay and target networks, and models the policy as a deterministic function like DPG, to apply deep reinforcement learning to complex continuous control tasks.
CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building, Room 810
Web: https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning → DQN (2015), which adds:
1. replay buffer
2. deep neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network stabilizes the training process
[Figure: DQN training loop — the Q network selects actions $a_t = \arg\max_a Q(s_t, a; \theta)$ in the environment; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; the target network computes $\max_a Q(s_{t+1}, a; \theta')$ for the DQN loss, which updates $\theta$; $\theta$ is periodically copied to $\theta'$.]
Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

DQN:
$$Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\right]$$

Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$

Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward-to-go
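The update above is easy to see in code. As a minimal sketch (not from the slides), here is the tabular Q-learning step; DQN replaces the table with a deep network trained on replay-buffer minibatches, with the bootstrap term computed by the target network $\theta'$:

```python
import numpy as np

# Illustrative sizes; any small MDP works.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    td_target = r + gamma * Q[s_next].max()   # r_{t+1} + gamma * max_a Q(s', a)
    Q[s, a] += alpha * (td_target - Q[s, a])
```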
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete and low-dimensional action spaces
- As the action dimension increases, the number of actions (output nodes) grows exponentially
- i.e. $k$ discrete actions per dimension with $n$ dimensions → $k^n$ actions
• DQN cannot be straightforwardly applied to continuous domains, for two reasons:
1. The policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$ requires maximizing over the whole action space at every step
2. The update: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\right]$ requires the same maximization for the bootstrap target
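A quick illustration of this exponential blow-up (the sizes below are illustrative, not from the slides):

```python
# k bins per dimension, n action dimensions -> k**n joint discrete actions.
for k, n in [(5, 2), (10, 3), (10, 7)]:
    print(f"k={k}, n={n}: {k**n:,} discrete actions")
# k=10, n=7 (e.g. a 7-DoF arm) already needs 10,000,000 output nodes.
```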
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: DQN with Policy gradient methods
Q-learning → DQN (1. replay buffer, 2. deep neural network, 3. target network)
Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces)
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: The goal of Reinforcement learning
[Figure: agent–world loop — the agent's policy $\pi(a_t|s_t)$ outputs an action $a_t$; the world model $p(s_{t+1}|s_t, a_t)$ returns the reward $r_t$ and next state $s_{t+1}$.]

[Figure: Markov decision process — the chain $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$ with transition probabilities $p(s_{t+1}|s_t, a_t)$.]

Trajectory distribution:
$$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

Goal of reinforcement learning (objective $J(\theta)$):
$$\theta^* = \arg\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

where $\tau = (s_1, a_1, s_2, a_2, \ldots)$ is a trajectory and $\pi_\theta$ is a stochastic policy with weights $\theta$.
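Since $J(\theta)$ is an expectation over trajectories, it can be estimated by rolling the policy out and averaging returns. A minimal sketch, assuming placeholder `env_reset`, `env_step`, and `policy` functions (not defined in the slides):

```python
import numpy as np

def estimate_J(env_reset, env_step, policy, n_episodes=100, horizon=200):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[sum_t r(s_t, a_t)]."""
    returns = []
    for _ in range(n_episodes):
        s, total = env_reset(), 0.0
        for _ in range(horizon):
            a = policy(s)            # a_t ~ pi_theta(a_t | s_t)
            s, r = env_step(s, a)    # s_{t+1} ~ p(s_{t+1} | s_t, a_t), reward r_t
            total += r
        returns.append(total)
    return np.mean(returns)
```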
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: given $s_t$, the policy network $\pi_\theta(a_t|s_t)$ outputs a probability for each action, e.g. (0.1, 0.1, 0.2, 0.2, 0.4), and the action is sampled from this distribution.]
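As a small illustration of such a stochastic policy head (layer sizes are assumptions, not from the slides):

```python
import torch
import torch.nn as nn

# Hypothetical policy head with 5 discrete actions, matching the
# probability bars in the figure above.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 5))

state = torch.randn(4)                            # placeholder s_t
dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()                            # a_t ~ pi_theta(a_t | s_t)
log_prob = dist.log_prob(action)                  # log pi_theta(a_t|s_t), used by REINFORCE
```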
Policy gradient: REINFORCE
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)$$

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $N$ is the number of sampled episodes.
Problem: the policy can only be updated after experiencing complete episodes, which causes
1. a slow training process
2. high gradient variance
[Figure: $N$ episodes sampled from the initial state, with returns $r_1, r_2, \ldots, r_N$.]
$\theta$: weights of the actor network
$\alpha$: learning rate
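A minimal sketch of the estimator above (the names `log_probs` and `rewards` are illustrative; they would be collected with sampling code like the earlier snippet):

```python
import torch

def reinforce_loss(log_probs, rewards):
    """Surrogate loss whose gradient matches the REINFORCE estimator:
    (1/N) sum_i [(sum_t grad log pi_theta(a_t|s_t)) * (sum_t r(s_t, a_t))].

    log_probs[i]: list of log pi_theta(a_t|s_t) tensors for episode i
    rewards[i]:   list of float rewards for episode i
    """
    per_episode = []
    for lp, r in zip(log_probs, rewards):
        episode_return = sum(r)                         # sum_t r(s_t, a_t)
        per_episode.append(-torch.stack(lp).sum() * episode_return)
    return torch.stack(per_episode).mean()              # minimizing this ascends J

# optimizer.step() on this loss implements theta <- theta + alpha * grad J(theta).
```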
Policy gradient: Actor-critic
(the Monte Carlo return $\sum_t r(s_t, a_t)$ is replaced by a learned critic $Q_\phi(s_t, a_t)$)
• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions
[Figure: training timeline — starting from the initial state, repeat: sample data $i$ times, then update critic & actor.]
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ to fit the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i)\, Q_\phi(s_i, a_i)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
$\phi$: weights of the critic network
Compared with REINFORCE, the critic mitigates: 1. high gradient variance, 2. the slow training process (the policy updates every $i$ samples instead of waiting for full episodes), as sketched in the code at the end of this slide.
[Figure: actor-critic loop — the actor $\pi_\theta(a_t|s_t)$ sends $a_t$ to the environment, which returns $s_{t+1}$ and $r_t$; transitions $(s_t, a_t, r_t, s_{t+1})_{0 \ldots i}$ update the critic $Q_\phi(s_t, a_t)$, which supplies $\nabla J(\theta)$ to the actor.]
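A minimal sketch of steps 1–4 for a discrete-action case (network sizes and names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))   # pi_theta
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # Q_phi(s, .)

def actor_critic_losses(s, a, r, s_next):
    """Losses for one sampled transition (steps 2-3 above)."""
    q_sa = critic(s)[a]
    with torch.no_grad():                              # bootstrapped TD target
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()
        td_target = r + gamma * critic(s_next)[a_next]
    critic_loss = (q_sa - td_target) ** 2              # step 2: fit Q_phi to the data

    log_pi = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -log_pi * q_sa.detach()               # step 3: grad log pi * Q_phi
    return actor_loss, critic_loss
```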
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
[Figure: stochastic vs. deterministic policy heads for a 2-dimensional action $(a_x, a_y)$ given $s_t$.]
• Stochastic policy $\pi_\theta(a_t|s_t)$: 5 discretized values per dimension over 2 dimensions require 10 output nodes
• Deterministic policy $\mu_\theta(s_t)$: only 2 output nodes are needed
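A minimal sketch of why this matters for learning (architecture and sizes are illustrative): with a deterministic actor, the policy gradient flows from the critic straight through the action, with no $\arg\max$ over actions. This is the core of DPG, and DDPG adds DQN's replay buffer and target networks on top:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())        # a = mu_theta(s), 2 outputs
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                        # Q_phi(s, a)

s = torch.randn(obs_dim)                   # placeholder state
a = actor(s)                               # continuous action, no argmax needed
actor_loss = -critic(torch.cat([s, a]))    # ascend Q_phi(s, mu_theta(s))
actor_loss.backward()                      # chain rule: grad_a Q * grad_theta mu
```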
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG used as local planner for long range navigation
DDPG example: multi-agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG can handle continuous action spaces via policy gradient methods and an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: apply DDPG to continuous-action decision-making problems
• e.g. navigation, obstacle avoidance