This slide deck reviews deep reinforcement learning, specifically Q-Learning and its variants. We introduce the Bellman operator and approximate the Q-function with a deep neural network. Last but not least, we review the classic DeepMind paper in which DQN agents reach human-level performance on Atari games, together with some tips for stabilizing DQN training.
Deep Reinforcement Learning: Q-Learning
1. DQN algorithm
kv
Physics Department, National Taiwan University
kelispinor@gmail.com
The slides are largely adapted from David Silver's slides and CS294.
July 16, 2018
What is Reinforcement Learning?
RL is a general framework for AI.
RL is for agents with the ability to interact with an environment
Each action influences the agent's future state
Success is measured by a scalar reward signal
RL in a nutshell: Select actions to maximize future reward.
Reinforcement Learning Framework
In reinforcement learning, the agent observes the current state $S_t$, receives a reward $R_t$, and then interacts with the environment by taking action $A_t$ under a policy.
Figure: Agent–environment interaction loop (agent takes action $a_t$; environment returns reward $r_{t+1}$ and new state $s_{t+1}$)
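To make the loop concrete, here is a minimal sketch of one episode of interaction, assuming a Gym-style environment with `reset`/`step` methods and a hypothetical `agent.act`; none of these names come from the slides.

```python
def run_episode(env, agent, max_steps=1000):
    """Roll out one episode of the agent-environment loop (sketch)."""
    state = env.reset()                  # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)        # choose A_t under the current policy
        next_state, reward, done, info = env.step(action)  # receive R_{t+1}, S_{t+1}
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```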
Markov Decision Process
Markov Property
The future is independent of the past given the present.
$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_t, S_{t-1}, \ldots, S_2, S_1)$$
An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$, defined by the following components:
S: state space
A: action space
$P(r, s' \mid s, a)$: transition probability for the transition $s, a \to r, s'$
R: reward
$\gamma$: discount factor
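As a toy illustration (not from the slides), such a tuple can be written down directly as arrays; the two-state, two-action MDP below is made up purely for the examples that follow.

```python
import numpy as np

# A made-up 2-state, 2-action MDP to make the tuple <S, A, P, R, gamma> concrete.
n_states, n_actions = 2, 2
gamma = 0.9
# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
```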
Policy
A policy is any function mapping states to actions, $\pi : S \to A$
Deterministic policy: $a = \pi(s)$
Stochastic policy: $a \sim \pi(a \mid s)$
Policy Evaluation and Value Functions
Policy optimization: maximize the expected reward with respect to the policy $\pi$
$$\underset{\pi}{\text{maximize}} \;\; \mathbb{E}\Big[\sum_t r_t\Big]$$
Policy evaluation: compute the expected return for a given $\pi$
State value function: $V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s\Big]$
State-action value function: $Q^\pi(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s, a_0 = a\Big]$
Value Functions
Q-function or state-action value function: the expected total reward starting from state s and action a under a policy $\pi$
$$Q^\pi(s, a) = \mathbb{E}_\pi\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\big] \qquad (1)$$
State value function: the expected (long-term) return starting from state s
$$V^\pi(s) = \mathbb{E}_\pi\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\big] \qquad (2)$$
$$\phantom{V^\pi(s)} = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big] \qquad (3)$$
Advantage function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \qquad (4)$$
Bellman Equation
The state-action value function can be unrolled recursively
$$Q^\pi(s, a) = \mathbb{E}\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s, a\big] \qquad (5)$$
$$\phantom{Q^\pi(s, a)} = \mathbb{E}_{s'}\big[r + \gamma Q^\pi(s', a') \mid s, a\big] \qquad (6)$$
The optimal Q-function $Q^*(s, a)$ can be unrolled recursively
$$Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\big] \qquad (7)$$
The value iteration algorithm solves the Bellman equation
$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\big] \qquad (8)$$
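A minimal sketch of tabular Q-value iteration implementing Eq. (8), assuming the MDP is given as the arrays `P[s, a, s']` and `R[s, a]` from the toy example above; nothing here is prescribed by the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iters=100):
    """Tabular Q-value iteration: repeatedly apply the backup in Eq. (8)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q_{i+1}(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_i(s', a')
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q
```

Run on the toy MDP above, `q_value_iteration(P, R, gamma)` converges toward $Q^*$, and the greedy policy is then $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$.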
Bellman Backup Operator
The Q-function with an explicit time index
$$Q^\pi(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q^\pi(s_1, a_1)\big]\Big] \qquad (9)$$
Define the Bellman backup operator, acting on a Q-function
$$[T^\pi Q](s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q(s_1, a_1)\big]\Big] \qquad (10)$$
$Q^\pi$ is a fixed point of this operator
$$T^\pi Q^\pi = Q^\pi \qquad (11)$$
If we apply $T^\pi$ repeatedly to any Q, the sequence converges to $Q^\pi$
$$Q, \; T^\pi Q, \; (T^\pi)^2 Q, \; \ldots \;\to\; Q^\pi \qquad (12)$$
Introducing Q∗
Denote by $\pi^*$ an optimal policy.
$$Q^*(s, a) = Q^{\pi^*}(s, a) = \max_\pi Q^\pi(s, a)$$
It satisfies $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$
Then the Bellman equation
$$Q^\pi(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q^\pi(s_1, a_1)\big]\Big] \qquad (13)$$
becomes
$$Q^*(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma \max_{a_1} Q^*(s_1, a_1)\Big] \qquad (14)$$
We can also define the corresponding Bellman backup operator.
Bellman Backup Operator on Q∗
The Bellman backup operator, acting on a Q-function
$$[T Q](s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma \max_{a_1} Q(s_1, a_1)\Big] \qquad (15)$$
$Q^*$ is a fixed point of this operator
$$T Q^* = Q^* \qquad (16)$$
If we apply $T$ repeatedly to any Q, the sequence converges to $Q^*$
$$Q, \; T Q, \; T^2 Q, \; \ldots \;\to\; Q^* \qquad (17)$$
Deep Q-Learning
Represent the value function by a deep Q-network with weights $w$
$$Q(s, a; w) \approx Q^\pi(s, a)$$
The objective over Q-values is defined as a mean-squared error
$$L(w) = \mathbb{E}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{TD target}} - Q(s, a; w)\big)^2\Big]$$
Q-learning gradient
$$\frac{\partial L(w)}{\partial w} = \mathbb{E}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{TD target}} - Q(s, a; w)\big)\,\frac{\partial Q(s, a; w)}{\partial w}\Big]$$
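A rough PyTorch-style sketch of this loss for one mini-batch of transitions; the network `q_net` and the tensor layout are assumptions made for illustration, not part of the slides.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error for a batch of transitions (sketch).

    batch: tensors (states, actions, rewards, next_states, dones),
    with `actions` a long tensor and `dones` a float tensor of 0./1.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; w) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target r + gamma * max_a' Q(s', a'; w); no gradient flows through it
    with torch.no_grad():
        td_target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, td_target)
```

In practice the TD target is computed with frozen parameters, as introduced a few slides below.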
Deep Q-Learning
Backup estimate: $T Q_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$
To approximate $Q \leftarrow T Q_t$, minimize $\big(T Q_t - Q(s_t, a_t)\big)^2$
$T$ is a contraction under $\|\cdot\|_\infty$, not under $\|\cdot\|_2$
Stability Issues
1 Data is sequential
Successive samples are non-i.i.d. and highly correlated
2 The policy changes rapidly with slight changes in the Q-values
π may oscillate
The distribution of data may swing
3 The scale of rewards and Q-values is unknown
Large gradients can cause unstable backpropagation
Deep Q Network
Proposed solutions
1 Use experience replay
Breaks correlations in the data, recovering an approximately i.i.d. setting
2 Fix the target network
The old Q-function is frozen for many timesteps between updates
Breaks the correlation between the Q-function and its target
3 Clip rewards and normalize adaptively to a sensible range
Yields robust gradients
Stabilize DQN: Experience Replay
Goal: remove correlations by building a data-set from the agent's experience
$a_t$ is sampled from an $\epsilon$-greedy policy
Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
Optimize the MSE between the Q-network and the Q-learning target
$$L(w) = \mathbb{E}_{s, a, r, s' \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\big)^2\Big]$$
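A minimal replay-buffer sketch; the class and method names are illustrative, and transitions are stored as plain tuples for simplicity.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory D of transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```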
Stabilize DQN: Fixed Target
Goal: avoid oscillations by fixing the parameters used in the target
Compute the Q-learning target with respect to old, fixed parameters $w^-$
$$r + \gamma \max_{a'} Q(s', a'; w^-)$$
Optimize the MSE between the Q-network and the Q-learning target
$$L(w) = \mathbb{E}_{s, a, r, s' \sim D}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w^-)}_{\text{fixed target}} - Q(s, a; w)\big)^2\Big]$$
Periodically update the fixed parameters: $w^- \leftarrow w$
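A sketch of how a fixed target network might be maintained in PyTorch; `q_net` is assumed to be the online network (a `torch.nn.Module`), and the update interval is an illustrative choice.

```python
import copy
import torch.nn as nn

def make_target_net(q_net: nn.Module) -> nn.Module:
    """Create a frozen copy of the online network (initializes w^- = w)."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad = False        # target parameters receive no gradients
    return target_net

def hard_update(target_net: nn.Module, q_net: nn.Module) -> None:
    """Periodic hard update w^- <- w (e.g. every few thousand gradient steps)."""
    target_net.load_state_dict(q_net.state_dict())
```

In the earlier loss sketch, the TD target would then be computed with `target_net(next_states)` instead of `q_net(next_states)`.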
Stabilize DQN: Reward / Value Range
Clip rewards to [−1, 1]
Ensures gradients are well-conditioned
DQN in Atari
End-to-end learning of Q from pixels s
Input s is a stack of the last 4 frames
Output: Q(s, a) for each of 18 actions
Reward is the change in score for that step
Figure: Q-Network Architecture
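A sketch of a convolutional Q-network in this spirit, using the layer sizes commonly reported for the Nature DQN (three conv layers over 4 stacked 84×84 frames, then two fully connected layers); the exact numbers here are assumptions rather than a reading of the figure.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """Conv net mapping 4 stacked grayscale frames to one Q-value per action (sketch)."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q(s, a) per action
        )

    def forward(self, x):                # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(x))
```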
Do Q-values have meaning?
But Q-values are usually overestimated.
Double Q Learning
$$\mathbb{E}_{X_1, X_2}\big[\max(X_1, X_2)\big] \;\geq\; \max\big(\mathbb{E}[X_1],\, \mathbb{E}[X_2]\big)$$
Q-values are noisy and overestimated
Solution: use two networks and compute the max with the other network
$$Q_A(s, a) \leftarrow r + \gamma\, Q_B\big(s', \operatorname{argmax}_{a'} Q_A(s', a')\big)$$
$$Q_B(s, a) \leftarrow r + \gamma\, Q_A\big(s', \operatorname{argmax}_{a'} Q_B(s', a')\big)$$
Original DQN
$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q_{\text{target}}(s', a') = r + \gamma\, Q_{\text{target}}\big(s', \operatorname{argmax}_{a'} Q_{\text{target}}(s', a')\big)$$
Double DQN
$$Q(s, a) \leftarrow r + \gamma\, Q_{\text{target}}\big(s', \operatorname{argmax}_{a'} Q(s', a')\big) \qquad (18)$$
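A PyTorch-style sketch of the Double DQN target in Eq. (18): the online network selects the action and the target network evaluates it; `q_net` and `target_net` are the assumed online and frozen networks from the earlier sketches.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: r + gamma * Q_target(s', argmax_a' Q(s', a')).

    `dones` is a float tensor of 0./1. marking terminal transitions.
    """
    with torch.no_grad():
        # Select the greedy action with the online network ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the (fixed) target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * next_q
```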