DQN algorithm
kv
Physics Department, National Taiwan University
kelispinor@gmail.com
These slides are largely credited to David Silver's slides and CS294.
July 16, 2018
Overview
1 Overview
2 Introduction
What is Reinforcement Learning
Markov Decision Process
Dynamic Programming
What is Reinforcement Learning?
RL is a general framework for AI.
RL is for agents with the ability to interact with an environment
Each action influences the agent's future states
Success is measured by a scalar reward signal
RL in a nutshell: Select actions to maximize future reward.
Reinforcement Learning Framework
In reinforcement learning, the agent observes the current state S_t, receives
a reward R_t, then interacts with the environment through an action A_t chosen
according to its policy.
Figure: Agent-environment interaction loop. The agent sends action a_t to the environment and receives the new state s_{t+1} and reward r_{t+1}.
Markov Decision Process
Markov Property
The future is independent of the past given the present.
P(S_{t+1} | S_t) = P(S_{t+1} | S_t, S_{t−1}, ..., S_2, S_1)
An MDP is a tuple ⟨S, A, P, R, γ⟩ defined by the following components:
S: state space
A: action space
P(r, s′ | s, a): transition probability for the transition s, a → r, s′
R: reward function
γ: discount factor
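To make the tuple concrete, the following is a minimal sketch (not from the original slides) of how the ⟨S, A, P, R, γ⟩ components of a made-up two-state MDP could be written down in Python; all states, actions, probabilities, and rewards here are invented purely for illustration.

# Toy MDP written as plain Python data structures (illustrative values only).
S = ["s0", "s1"]             # state space
A = ["stay", "move"]         # action space
gamma = 0.9                  # discount factor

# P[(s, a)] lists (probability, reward, next_state) outcomes,
# i.e. a tabular version of P(r, s' | s, a).
P = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "move"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
    ("s1", "move"): [(1.0, 0.0, "s0")],
}

# The expected one-step reward R(s, a) follows directly from P.
def R(s, a):
    return sum(p * r for p, r, _ in P[(s, a)])

print(R("s0", "move"))   # 0.8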
Policy
Policy: any function mapping states to actions, π : S → A
Deterministic policy a = π(s)
Stochastic policy a ∼ π(a|s)
Policy Evaluation and Value Functions
Policy optimization: maximize the expected reward with respect to the policy π
maximize_π E[ Σ_t r_t ]
Policy evaluation: compute the expected return for a given π
State value function: V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t | S_0 = s ]
State-action value function: Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | S_0 = s, A_0 = a ]
Value Functions
Q-function or state-action value function: expected total reward from
state s and action a under a policy π
Q^π(s, a) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s, a_0 = a ]   (1)
State value function: expected (long-term) return starting from s
V^π(s) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s ]   (2)
       = E_{a∼π}[ Q^π(s, a) ]   (3)
Advantage function
A^π(s, a) = Q^π(s, a) − V^π(s)   (4)
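As a quick illustration of Eqs. (1)-(4) (not part of the original slides), the quantity inside the expectations is simply the discounted return of a single rollout; V^π, Q^π, and A^π are averages of such returns over many rollouts under π. A minimal sketch:

# Discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ... of one sampled rollout.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):      # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Monte Carlo estimate of Q^pi(s, a): average the return over rollouts that
# start in s, take action a, then follow pi (the reward sequences are assumed given).
def mc_q_estimate(reward_sequences, gamma=0.99):
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71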
Bellman Equation
The state-action value function can be unrolled recursively
Q^π(s, a) = E[ r_0 + γ r_1 + γ^2 r_2 + ... | s, a ]   (5)
          = E_{s′,a′}[ r + γ Q^π(s′, a′) | s, a ]   (6)
The optimal Q-function Q∗(s, a) can be unrolled recursively
Q∗(s, a) = E_{s′}[ r + γ max_{a′} Q∗(s′, a′) | s, a ]   (7)
The value iteration algorithm solves the Bellman equation
Q_{i+1}(s, a) = E_{s′}[ r + γ max_{a′} Q_i(s′, a′) | s, a ]   (8)
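A minimal tabular sketch of the value-iteration update in Eq. (8) on a made-up two-state MDP (this example is not from the slides; the dynamics and rewards are invented for illustration):

# Tabular Q-value iteration: Q_{i+1}(s, a) = E_{s'}[ r + gamma * max_{a'} Q_i(s', a') ].
gamma = 0.9
S = ["s0", "s1"]
A = ["stay", "move"]
# P[(s, a)] lists (probability, reward, next_state) outcomes.
P = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "move"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
    ("s1", "move"): [(1.0, 0.0, "s0")],
}

Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(200):                       # apply the backup until it converges
    Q = {(s, a): sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in A))
                     for p, r, s2 in P[(s, a)])
         for (s, a) in Q}

greedy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
print(Q, greedy)   # Q converges to Q*; the greedy policy w.r.t. Q* is optimal here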
Bellman Backup Operator
The Q-function with an explicit time index
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q^π(s_1, a_1) ] ]   (9)
Define the Bellman backup operator T^π, acting on Q-functions
[T^π Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q(s_1, a_1) ] ]   (10)
Q^π is a fixed point of T^π
T^π Q^π = Q^π   (11)
If we apply T^π repeatedly to any Q, the sequence converges to Q^π
Q, T^π Q, (T^π)^2 Q, ... → Q^π   (12)
Introducing Q∗
Denote by π∗ an optimal policy.
Q∗(s, a) = Q^{π∗}(s, a) = max_π Q^π(s, a)
It satisfies π∗(s) = argmax_a Q∗(s, a)
Then the Bellman equation
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q^π(s_1, a_1) ] ]   (13)
becomes
Q∗(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ max_{a_1} Q∗(s_1, a_1) ]   (14)
We can also define the corresponding Bellman backup operator.
Bellman Backup Operator on Q∗
The Bellman optimality backup operator T, acting on Q-functions
[T Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ max_{a_1} Q(s_1, a_1) ]   (15)
Q∗ is a fixed point of T
T Q∗ = Q∗   (16)
If we apply T repeatedly to any Q, the sequence converges to Q∗
Q, T Q, T^2 Q, ... → Q∗   (17)
Deep Q-Learning
Represent the value function by a deep Q-network with weights w
Q(s, a; w) ≈ Q^π(s, a)
The objective over Q-values is the mean-squared error
L(w) = E[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) )^2 ]
where r + γ max_{a′} Q(s′, a′; w) is the TD target
Q-learning gradient
∂L(w)/∂w = E[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) ) ∂Q(s, a; w)/∂w ]
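Below is a minimal PyTorch-style sketch (not from the original slides) of how this loss and a semi-gradient step might look for a small fully-connected Q-network; PyTorch is assumed to be available, and the network size, batch, and hyperparameters are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99          # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fake mini-batch of transitions (s, a, r, s', done), just to show the shapes.
B = 32
s, a = torch.randn(B, obs_dim), torch.randint(0, n_actions, (B,))
r, s2, done = torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B)

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; w)
with torch.no_grad():                                       # TD target treated as a constant
    td_target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values

loss = F.mse_loss(q_sa, td_target)                          # L(w)
opt.zero_grad()
loss.backward()                                             # semi-gradient w.r.t. w
opt.step()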
Deep Q-Learning
Backup estimation: T Q_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
To approximate Q ← T Q_t, minimize ‖ T Q_t − Q(s_t, a_t) ‖^2
T is a contraction under ‖·‖_∞, not under ‖·‖_2
Stability Issues
1 Data is sequential
Successive samples are highly correlated, not i.i.d.
2 Policy changes rapidly with slight changes in the Q-values
π may oscillate
The distribution of the data may swing
3 The scale of rewards and Q-values is unknown
Large gradients can make backpropagation unstable
Deep Q Network
Proposed solutions
1 Use experience replay
Breaks correlations in the data, restoring an i.i.d.-like setting
2 Fix the target network
The old Q-function is frozen for many timesteps before each update
Breaks correlations between the Q-function and its target
3 Clip rewards and normalize adaptively to a sensible range
Robust gradients
Stabilize DQN: Experience Replay
Goal: remove correlations by building a data-set of the agent's experience
a_t is sampled from an ε-greedy policy
Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample a random mini-batch of transitions (s, a, r, s′) from D
Optimize the MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) )^2 ]
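A minimal sketch of such a replay memory D (not from the original slides; the capacity and batch size are arbitrary):

import random
from collections import deque

class ReplayMemory:
    # Fixed-size buffer D that stores transitions and samples them uniformly.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are dropped

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks correlations
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)

# Usage: store every transition; start learning only once enough are collected.
memory = ReplayMemory()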
Stabilize DQN: Fixed Target
Goal: avoid oscillations by fixing the parameters used in the target
Compute the Q-learning target with respect to old, fixed parameters w−
r + γ max_{a′} Q(s′, a′; w−)
Optimize the MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[ ( r + γ max_{a′} Q(s′, a′; w−) − Q(s, a; w) )^2 ]
where r + γ max_{a′} Q(s′, a′; w−) is the fixed target
Periodically update the fixed parameters: w− ← w
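A short sketch of the fixed-target idea (not from the original slides; it assumes PyTorch, and the network size and sync interval are placeholders):

import copy
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)           # w- starts as a copy of w

def td_target(r, s_next, done):
    # The target uses the old, frozen parameters w-, so it does not move every step.
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

SYNC_EVERY = 1000                           # illustrative interval
for step in range(5000):
    # ... sample a mini-batch and take a gradient step on q_net here ...
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # w- <- w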
Stabilize DQN: Reward / Value Range
Clip rewards to [−1, 1]
Ensures gradients are well-conditioned
DQN in Atari
Figure: Deep Q Learning
DQN in Atari
End-to-end learning of Q(s, a) from pixels s
Input s is a stack of the last 4 frames
Output: Q(s, a) for 18 actions
Reward is the change in score for that step
Figure: Q-Network Architecture
DQN Results
DQN Results
DQN Results
Do Q-values have meaning?
Do Q-values have meaning?
But Q-values are usually overestimated.
Double Q-Learning
E_{X1,X2}[ max(X1, X2) ] ≥ max( E_{X1,X2}[X1], E_{X1,X2}[X2] )
Q-values are noisy, so the max over them is overestimated
Solution: use two networks; select the maximizing action with one and evaluate it with the other
Q_A(s, a) ← r + γ Q_B(s′, argmax_{a′} Q_A(s′, a′))
Q_B(s, a) ← r + γ Q_A(s′, argmax_{a′} Q_B(s′, a′))
Original DQN
Q(s, a) ← r + γ Q_target(s′, a′) = r + γ Q_target(s′, argmax_{a′} Q_target(s′, a′))
Double DQN
Q(s, a) ← r + γ Q_target(s′, argmax_{a′} Q(s′, a′))   (18)
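A short sketch contrasting the two targets (not from the original slides; it assumes PyTorch, and the networks and batch are placeholders):

import copy
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # online network
target_net = copy.deepcopy(q_net)                                      # target network

r, s_next = torch.zeros(32), torch.randn(32, 4)

with torch.no_grad():
    # Original DQN target: the target network both selects and evaluates a',
    # so the same estimation noise is maximized over, which overestimates.
    dqn_target = r + gamma * target_net(s_next).max(dim=1).values

    # Double DQN target: the online network selects a', the target network evaluates it.
    a_star = q_net(s_next).argmax(dim=1, keepdim=True)
    ddqn_target = r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)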