This slide deck reviews deep reinforcement learning, specifically Q-Learning and its variants. We introduce the Bellman operator and approximate the Q-function with a deep neural network. Last but not least, we review the classic DeepMind paper in which DQN agents reach human-level performance on Atari games, together with some tips for stabilizing DQN training.
Deep Reinforcement Learning: Q-Learning
1. DQN algorithm
kv
Physics Department, National Taiwan University
kelispinor@gmail.com
The slides are largely adapted from David Silver's slides and CS294.
July 16, 2018
What is Reinforcement Learning?
RL is a general framework for AI.
RL is for agents with the ability to interact with an environment
Each action influences the agent's future state
Success is measured by a scalar reward signal
RL in a nutshell: Select actions to maximize future reward.
Reinforcement Learning Framework
In reinforcement learning, the agent observes the current state $S_t$, receives a reward $R_t$, and then interacts with the environment by taking action $A_t$ under a policy.
Figure: Agent–environment interaction loop (agent takes action $a_t$; environment returns reward $r_{t+1}$ and new state $s_{t+1}$)
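To make the loop concrete, here is a minimal sketch of one episode of interaction, assuming a Gym-style environment with `reset`/`step` methods and a hypothetical `agent.act`; none of these names come from the slides.

```python
def run_episode(env, agent, max_steps=1000):
    """Roll out one episode of the agent-environment loop (sketch)."""
    state = env.reset()                  # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)        # choose A_t under the current policy
        next_state, reward, done, info = env.step(action)  # receive R_{t+1}, S_{t+1}
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```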
Markov Decision Process
Markov Property
The future is independent of the past given the present.
$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_t, S_{t-1}, \ldots, S_2, S_1)$$
An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$, defined by the following components:
S: state space
A: action space
$P(r, s' \mid s, a)$: transition probability for the transition $s, a \to r, s'$
R: reward
$\gamma$: discount factor
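As a toy illustration (not from the slides), such a tuple can be written down directly as arrays; the two-state, two-action MDP below is made up purely for the examples that follow.

```python
import numpy as np

# A made-up 2-state, 2-action MDP to make the tuple <S, A, P, R, gamma> concrete.
n_states, n_actions = 2, 2
gamma = 0.9
# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
```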
Policy
A policy is any function mapping states to actions, $\pi : S \to A$
Deterministic policy: $a = \pi(s)$
Stochastic policy: $a \sim \pi(a \mid s)$
Policy Evaluation and Value Functions
Policy optimization: maximize the expected reward with respect to the policy $\pi$
$$\underset{\pi}{\text{maximize}} \;\; \mathbb{E}\Big[\sum_t r_t\Big]$$
Policy evaluation: compute the expected return for a given $\pi$
State value function: $V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s\Big]$
State-action value function: $Q^\pi(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s, a_0 = a\Big]$
Value Functions
Q-function or state-action value function: the expected total reward starting from state s and action a under a policy $\pi$
$$Q^\pi(s, a) = \mathbb{E}_\pi\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\big] \qquad (1)$$
State value function: the expected (long-term) return starting from state s
$$V^\pi(s) = \mathbb{E}_\pi\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\big] \qquad (2)$$
$$\phantom{V^\pi(s)} = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big] \qquad (3)$$
Advantage function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \qquad (4)$$
Bellman Equation
The state-action value function can be unrolled recursively
$$Q^\pi(s, a) = \mathbb{E}\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s, a\big] \qquad (5)$$
$$\phantom{Q^\pi(s, a)} = \mathbb{E}_{s'}\big[r + \gamma Q^\pi(s', a') \mid s, a\big] \qquad (6)$$
The optimal Q-function $Q^*(s, a)$ can be unrolled recursively
$$Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\big] \qquad (7)$$
The value iteration algorithm solves the Bellman equation
$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\big] \qquad (8)$$
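A minimal sketch of tabular Q-value iteration implementing Eq. (8), assuming the MDP is given as the arrays `P[s, a, s']` and `R[s, a]` from the toy example above; nothing here is prescribed by the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iters=100):
    """Tabular Q-value iteration: repeatedly apply the backup in Eq. (8)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q_{i+1}(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_i(s', a')
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q
```

Run on the toy MDP above, `q_value_iteration(P, R, gamma)` converges toward $Q^*$, and the greedy policy is then $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$.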
Bellman Backup Operator
The Q-function with an explicit time index
$$Q^\pi(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q^\pi(s_1, a_1)\big]\Big] \qquad (9)$$
Define the Bellman backup operator, acting on a Q-function
$$[T^\pi Q](s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q(s_1, a_1)\big]\Big] \qquad (10)$$
$Q^\pi$ is a fixed point of this operator
$$T^\pi Q^\pi = Q^\pi \qquad (11)$$
If we apply $T^\pi$ repeatedly to any Q, the sequence converges to $Q^\pi$
$$Q, \; T^\pi Q, \; (T^\pi)^2 Q, \; \ldots \;\to\; Q^\pi \qquad (12)$$
Introducing Q∗
Denote by $\pi^*$ an optimal policy.
$$Q^*(s, a) = Q^{\pi^*}(s, a) = \max_\pi Q^\pi(s, a)$$
It satisfies $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$
Then the Bellman equation
$$Q^\pi(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\big[Q^\pi(s_1, a_1)\big]\Big] \qquad (13)$$
becomes
$$Q^*(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma \max_{a_1} Q^*(s_1, a_1)\Big] \qquad (14)$$
We can also define the corresponding Bellman backup operator.
Bellman Backup Operator on Q∗
The Bellman backup operator, acting on a Q-function
$$[T Q](s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \mid s_0, a_0)}\Big[r_0 + \gamma \max_{a_1} Q(s_1, a_1)\Big] \qquad (15)$$
$Q^*$ is a fixed point of this operator
$$T Q^* = Q^* \qquad (16)$$
If we apply $T$ repeatedly to any Q, the sequence converges to $Q^*$
$$Q, \; T Q, \; T^2 Q, \; \ldots \;\to\; Q^* \qquad (17)$$
Deep Q-Learning
Represent the value function by a deep Q-network with weights $w$
$$Q(s, a; w) \approx Q^\pi(s, a)$$
The objective over Q-values is defined as a mean-squared error
$$L(w) = \mathbb{E}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{TD target}} - Q(s, a; w)\big)^2\Big]$$
Q-learning gradient
$$\frac{\partial L(w)}{\partial w} = \mathbb{E}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{TD target}} - Q(s, a; w)\big)\,\frac{\partial Q(s, a; w)}{\partial w}\Big]$$
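A rough PyTorch-style sketch of this loss for one mini-batch of transitions; the network `q_net` and the tensor layout are assumptions made for illustration, not part of the slides.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error for a batch of transitions (sketch).

    batch: tensors (states, actions, rewards, next_states, dones),
    with `actions` a long tensor and `dones` a float tensor of 0./1.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; w) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target r + gamma * max_a' Q(s', a'; w); no gradient flows through it
    with torch.no_grad():
        td_target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, td_target)
```

In practice the TD target is computed with frozen parameters, as introduced a few slides below.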
Deep Q-Learning
Backup estimate: $T Q_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$
To approximate $Q \leftarrow T Q_t$, minimize $\big(T Q_t - Q(s_t, a_t)\big)^2$
$T$ is a contraction under $\|\cdot\|_\infty$, not under $\|\cdot\|_2$
Stability Issues
1 Data is sequential
Successive samples are non-i.i.d. and highly correlated
2 The policy changes rapidly with slight changes in the Q-values
π may oscillate
The distribution of data may swing
3 The scale of rewards and Q-values is unknown
Large gradients can cause unstable backpropagation
Deep Q Network
Proposed solutions
1 Use experience replay
Breaks correlations in the data, recovering an approximately i.i.d. setting
2 Fix the target network
The old Q-function is frozen for many timesteps between updates
Breaks the correlation between the Q-function and its target
3 Clip rewards and normalize adaptively to a sensible range
Yields robust gradients
Stabilize DQN: Experience Replay
Goal: remove correlations by building a data-set from the agent's experience
$a_t$ is sampled from an $\epsilon$-greedy policy
Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
Optimize the MSE between the Q-network and the Q-learning target
$$L(w) = \mathbb{E}_{s, a, r, s' \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\big)^2\Big]$$
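A minimal replay-buffer sketch; the class and method names are illustrative, and transitions are stored as plain tuples for simplicity.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory D of transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```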
Stabilize DQN: Fixed Target
Goal: avoid oscillations by fixing the parameters used in the target
Compute the Q-learning target with respect to old, fixed parameters $w^-$
$$r + \gamma \max_{a'} Q(s', a'; w^-)$$
Optimize the MSE between the Q-network and the Q-learning target
$$L(w) = \mathbb{E}_{s, a, r, s' \sim D}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; w^-)}_{\text{fixed target}} - Q(s, a; w)\big)^2\Big]$$
Periodically update the fixed parameters: $w^- \leftarrow w$
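A sketch of how a fixed target network might be maintained in PyTorch; `q_net` is assumed to be the online network (a `torch.nn.Module`), and the update interval is an illustrative choice.

```python
import copy
import torch.nn as nn

def make_target_net(q_net: nn.Module) -> nn.Module:
    """Create a frozen copy of the online network (initializes w^- = w)."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad = False        # target parameters receive no gradients
    return target_net

def hard_update(target_net: nn.Module, q_net: nn.Module) -> None:
    """Periodic hard update w^- <- w (e.g. every few thousand gradient steps)."""
    target_net.load_state_dict(q_net.state_dict())
```

In the earlier loss sketch, the TD target would then be computed with `target_net(next_states)` instead of `q_net(next_states)`.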
Stabilize DQN: Reward / Value Range
Clip rewards to [−1, 1]
Ensures gradients are well-conditioned
DQN in Atari
End-to-end learning of Q from pixels s
Input s is a stack of the last 4 frames
Output: Q(s, a) for each of 18 actions
Reward is the change in score for that step
Figure: Q-Network Architecture
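A sketch of a convolutional Q-network in this spirit, using the layer sizes commonly reported for the Nature DQN (three conv layers over 4 stacked 84×84 frames, then two fully connected layers); the exact numbers here are assumptions rather than a reading of the figure.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """Conv net mapping 4 stacked grayscale frames to one Q-value per action (sketch)."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q(s, a) per action
        )

    def forward(self, x):                # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(x))
```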
Do Q-values have meaning?
But Q-values are usually overestimated.
Double Q Learning
$$\mathbb{E}_{X_1, X_2}\big[\max(X_1, X_2)\big] \;\geq\; \max\big(\mathbb{E}[X_1],\, \mathbb{E}[X_2]\big)$$
Q-values are noisy and overestimated
Solution: use two networks and compute the max with the other network
$$Q_A(s, a) \leftarrow r + \gamma\, Q_B\big(s', \operatorname{argmax}_{a'} Q_A(s', a')\big)$$
$$Q_B(s, a) \leftarrow r + \gamma\, Q_A\big(s', \operatorname{argmax}_{a'} Q_B(s', a')\big)$$
Original DQN
$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q_{\text{target}}(s', a') = r + \gamma\, Q_{\text{target}}\big(s', \operatorname{argmax}_{a'} Q_{\text{target}}(s', a')\big)$$
Double DQN
$$Q(s, a) \leftarrow r + \gamma\, Q_{\text{target}}\big(s', \operatorname{argmax}_{a'} Q(s', a')\big) \qquad (18)$$
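A PyTorch-style sketch of the Double DQN target in Eq. (18): the online network selects the action and the target network evaluates it; `q_net` and `target_net` are the assumed online and frozen networks from the earlier sketches.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: r + gamma * Q_target(s', argmax_a' Q(s', a')).

    `dones` is a float tensor of 0./1. marking terminal transitions.
    """
    with torch.no_grad():
        # Select the greedy action with the online network ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the (fixed) target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * next_q
```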