Dueling Network Architectures for
Deep Reinforcement Learning
2016-06-28
Taehoon Kim
Motivation
• Recent advances
• Design improved control and RL algorithms
• Incorporate existing NN into RL methods
• We,
• focus on innovating a NN that is better suited for model-free RL
• Separate
• the representation of state value
• (state-dependent) action advantages
2
Overview
3
[Overview figure] A shared convolutional feature learning module feeds a state value function stream and an advantage function stream; an aggregating layer combines them into the state-action value function.
Dueling network
• Single Q network with two streams
• Produce separate estimations of state value func and advantage func
• without any extra supervision
• which states are valuable?
• without having to learn the effect of each action for each state
4
Saliency map on the Atari game Enduro
5
[Saliency figure] Value stream: 1. focuses on the horizon, where new cars appear; 2. focuses on the score. Advantage stream: pays little attention when there are no cars in front; attends to the car immediately in front, where its choice of action is very relevant.
Definitions
• Value V(s), how good it is to be in a particular state s
• Advantage 𝐴(𝑠, 𝑎)
• Policy 𝜋
• Return R_t = ∑_{τ=t}^{∞} γ^{τ−t} r_τ, where γ ∈ [0,1] (sketch below)
• Q function Q^π(s, a) = 𝔼[ R_t | s_t = s, a_t = a, π ]
• State-value function V^π(s) = 𝔼_{a∼π(s)}[ Q^π(s, a) ]
6
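To make the return definition concrete, here is a minimal Python sketch (my own illustration, not from the slides) that computes R_t for a finite reward sequence; the rewards and discount are made-up numbers.

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """R_t = sum_{tau=t}^{T} gamma^(tau-t) * r_tau for a finite reward list."""
    return sum(gamma ** (tau - t) * rewards[tau] for tau in range(t, len(rewards)))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]           # illustrative reward sequence
print(discounted_return(rewards, gamma=0.9))  # 0.9^2 * 1 + 0.9^4 * 5 ≈ 4.09
```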
Bellman equation
• Recursively with dynamic programming
• Q^π(s, a) = 𝔼_{s'}[ r + γ 𝔼_{a'∼π(s')}[ Q^π(s', a') ] | s, a, π ]
• Optimal Q*(s, a) = max_π Q^π(s, a)
• Deterministic policy a = argmax_{a'∈𝒜} Q*(s, a')
• Optimal V*(s) = max_a Q*(s, a)
• Bellman equation Q*(s, a) = 𝔼_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ] (tabular backup sketch below)
7
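The Bellman optimality equation can be applied as a repeated backup until Q converges to Q*. Below is a minimal tabular sketch; the two-state, two-action MDP (transition tensor P and reward matrix R) is a made-up example, not from the slides.

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(200):                     # repeated Bellman optimality backups
    Q = R + gamma * (P @ Q.max(axis=1))  # Q*(s,a) = E_s'[r + gamma * max_a' Q*(s',a')]
V_star = Q.max(axis=1)                   # V*(s) = max_a Q*(s, a)
print(Q)
print(V_star)
```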
Advantage function
• Bellman equation Q*(s, a) = 𝔼_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
• Advantage function A^π(s, a) = Q^π(s, a) − V^π(s)
• 𝔼_{a∼π(s)}[ A^π(s, a) ] = 0 (numeric check below)
8
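A quick numeric check, with illustrative numbers, that the advantages average to zero under the policy:

```python
import numpy as np

pi  = np.array([0.2, 0.5, 0.3])   # pi(a|s) for a single state s
q   = np.array([1.0, 3.0, -2.0])  # Q^pi(s, a)
v   = pi @ q                      # V^pi(s) = E_{a~pi(s)}[Q^pi(s, a)]
adv = q - v                       # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(pi @ adv)                   # ~0.0 up to floating-point error
```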
Advantage function
• Value V(s), how good it is to be in a particular state s
• Q(s, a), the value of choosing a particular action a when in state s
• A = Q − V gives a relative measure of the importance of each action
9
Deep Q-network (DQN)
• Model-free
• states and rewards are produced by the environment
• Off-policy
• states and rewards are obtained with a behavior policy (ε-greedy)
• different from the online policy that is being learned
10
Deep Q-network: 1) Target network
• Deep Q-network Q(s, a; θ)
• Target network Q(s, a; θ⁻)
• L_i(θ_i) = 𝔼_{s,a,r,s'}[ (y_i^DQN − Q(s, a; θ_i))² ]
• y_i^DQN = r + γ max_{a'} Q(s', a'; θ⁻)
• Freeze the target parameters θ⁻ for a fixed number of iterations
• ∇_{θ_i} L_i(θ_i) = 𝔼_{s,a,r,s'}[ (y_i^DQN − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ] (loss sketch below)
11
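A minimal PyTorch sketch of the DQN loss with a frozen target network θ⁻; this is my own illustration (layer sizes, input dimensions, and the periodic copy are assumptions), not the original DQN code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_q_net(n_obs=4, n_actions=2):
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())   # theta^- starts as a copy of theta

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """s: (B, n_obs) float, a: (B,) int64, r/done: (B,) float, s_next: (B, n_obs)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_i)
    with torch.no_grad():                                   # frozen target parameters
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                              # (y^DQN - Q(s, a; theta))^2

# every C updates: target_net.load_state_dict(q_net.state_dict())
```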
Deep Q-network: 2) Experience memory
• Experience e_t = (s_t, a_t, r_t, s_{t+1})
• Accumulates a dataset 𝒟_t = {e_1, e_2, …, e_t} (buffer sketch below)
• L_i(θ_i) = 𝔼_{(s,a,r,s')∼𝒰(𝒟)}[ (y_i^DQN − Q(s, a; θ_i))² ]
12
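A minimal sketch of the experience memory with uniform sampling 𝒰(𝒟); the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)            # oldest experiences are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))     # e_t = (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # (s, a, r, s') ~ U(D)
        return tuple(zip(*batch))                       # columns: s, a, r, s', done
```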
Double Deep Q-network (DDQN)
• In DQN
• the max operator uses the same values to both select and evaluate an action
• leading to overoptimistic value estimates
• y_i^DQN = r + γ max_{a'} Q(s', a'; θ⁻)
• To mitigate this problem, in DDQN (target sketch below)
• y_i^DDQN = r + γ Q(s', argmax_{a'} Q(s', a'; θ_i); θ⁻)
13
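A sketch of the DDQN target (PyTorch, my own illustration): the online network θ_i selects the action a', and the target network θ⁻ evaluates it.

```python
import torch

def ddqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: theta_i
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: theta^-
        return r + gamma * (1 - done) * q_eval
```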
Prioritized Replay (Schaul et al., 2016)
• To increase the replay probability of experience tuples
• that have a high expected learning progress
• the expected learning progress is measured via the proxy of absolute TD-error
• transitions with high absolute TD-errors are sampled more often, with importance-sampling weights correcting the resulting bias (sampling sketch below)
• Led to faster learning and to better final policy quality
14
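A simplified sketch of proportional prioritized sampling: transitions are drawn with probability proportional to |TD-error|^α and reweighted by importance-sampling weights. The constants α, β, ε are illustrative, and a practical implementation would use a sum-tree rather than recomputing the full distribution.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size=32, alpha=0.6, beta=0.4, eps=1e-6):
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()                                 # P(i) proportional to |delta_i|^alpha
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)  # importance-sampling correction
    return idx, weights / weights.max()                 # normalized for stability
```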
Dueling Network Architecture : Key insight
• For many states
• unnecessary to estimate the value of each action choice
• For example, moving left or right only matters when a collision is imminent
• In most states, the choice of action has no effect on what happens
• For bootstrapping-based algorithms
• the estimation of state value is of great importance for every state
• bootstrapping: updating estimates on the basis of other estimates
15
Formulation
• A^π(s, a) = Q^π(s, a) − V^π(s)
• V^π(s) = 𝔼_{a∼π(s)}[ Q^π(s, a) ]
• A^π(s, a) = Q^π(s, a) − 𝔼_{a∼π(s)}[ Q^π(s, a) ]
• 𝔼_{a∼π(s)}[ A^π(s, a) ] = 0
• For a deterministic policy, a* = argmax_{a'∈𝒜} Q(s, a')
• Q(s, a*) = V(s) and A(s, a*) = 0
16
Formulation
• Dueling network = CNN + fully-connected layers that output
• a scalar V(s; θ, β)
• an |𝒜|-dimensional vector A(s, a; θ, α)
• It is tempting to construct the aggregating module as
• Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
17
Aggregation module 1: simple add
• But Q(s, a; θ, α, β) is only a parameterized estimate of the true Q-function
• Unidentifiable
• given Q, V and A cannot be uniquely recovered
• So force A to be zero at the chosen (best) action
• Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − max_{a'∈𝒜} A(s, a'; θ, α) )
18
Aggregation module 2: subtract max
• For a* = argmax_{a'∈𝒜} Q(s, a'; θ, α, β) = argmax_{a'∈𝒜} A(s, a'; θ, α)
• we obtain Q(s, a*; θ, α, β) = V(s; θ, β)
• i.e., Q(s, a*) = V(s)
• An alternative module replaces the max operator with an average
• Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|𝒜|) ∑_{a'} A(s, a'; θ, α) )
19
Aggregation module 3: subtract average
• An alternative module replaces the max operator with an average
• Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|𝒜|) ∑_{a'} A(s, a'; θ, α) )
• V and A now lose their original semantics
• because they are off-target by a constant, (1/|𝒜|) ∑_{a'} A(s, a'; θ, α)
• But it increases the stability of the optimization
• the advantages A only need to change as fast as the mean
• instead of having to compensate for any change to the optimal action's advantage max_{a'∈𝒜} A(s, a'; θ, α)
20
Aggregation module 3: subtract average
• Subtracting the mean is the best choice
• it helps identifiability
• it does not change the relative rank of A (and hence Q)
• The aggregation module is part of the network, not an algorithmic step
• training the dueling network requires only back-propagation (head sketch below)
21
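A compact PyTorch sketch of the dueling head with the mean-subtraction aggregation described above; the layer sizes are assumptions, and the shared convolutional trunk that produces `features` is omitted. Because the head outputs ordinary Q-values, it can be dropped into the DQN or DDQN loss sketched earlier without further changes.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, n_features, n_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                   nn.Linear(64, 1))           # V(s; theta, beta)
        self.adv = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))     # A(s, a; theta, alpha)

    def forward(self, features):
        v, a = self.value(features), self.adv(features)
        # Q = V + (A - mean_a' A): identifiable, stable, relative ranks of A preserved
        return v + a - a.mean(dim=1, keepdim=True)
```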
Compatibility
• Because the output of the dueling network is a Q function, it can be trained with
• DQN
• DDQN
• SARSA
• On-policy, off-policy, whatever
22
Definition: Generalized policy iteration
23
Experiments: Policy evaluation
• Useful for evaluating network architecture
• devoid of confounding factors such as choice of exploration strategy, and
interaction between policy improvement and policy evaluation
• In experiment, employ temporal difference learning
• optimizing y_i = r + γ 𝔼_{a'∼π(s')}[ Q(s', a'; θ_i) ]
• Corridor environment
• the exact Q^π(s, a) can be computed separately for all (s, a) ∈ 𝒮×𝒜
24
Experiments: Policy evaluation
• Test for 5, 10, and 20 actions (first tackled by DDQN)
• The stream V(s; θ, β) learns a general value that is shared across many
similar actions at s
• hence leading to faster convergence
25
Performance gap increasing with the number of actions
Experiments: General Atari Game-Playing
• Similar to DQN (Mnih et al., 2015) and add fully-connected layers
• Rescale the combined gradient entering the last convolutional layer
by 1/√2, which mildly increases stability
• Clip gradients so that their norm is less than or equal to 10 (sketch below)
• gradient clipping is not a standard practice in RL
26
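A sketch (PyTorch, my own approximation) of the two optimization tweaks above: the combined gradient flowing back from the two streams into the shared trunk is scaled by 1/√2 via a tensor hook, and the global gradient norm is clipped to 10. The names `trunk`, `value_head`, and `adv_head` are hypothetical.

```python
import math
import torch

def dueling_forward(trunk, value_head, adv_head, x):
    features = trunk(x)
    if features.requires_grad:
        # both streams back-propagate into `features`; scale that combined gradient
        features.register_hook(lambda g: g / math.sqrt(2))
    v, a = value_head(features), adv_head(features)
    return v + a - a.mean(dim=1, keepdim=True)

def optimizer_step(loss, model, optimizer):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # ||g|| <= 10
    optimizer.step()
```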
Performance: Up to 30 no-op random starts
• Duel Clip > Single Clip > Single
• Good job Dueling network
27
Performance: Human start
28
• An agent does not necessarily have to generalize well to play the Atari games
• it can achieve good performance by simply remembering sequences of actions
• To obtain a more robust measure, use 100 starting points sampled from a
human expert's trajectory
• from each starting point, evaluate up to 108,000 frames
• again, good job Dueling network
Combining with Prioritized Experience Replay
• Prioritization and the dueling architecture address very different
aspects of the learning process
• Although orthogonal in their objectives, these extensions
(prioritization, dueling and gradient clipping) interact in subtle ways
• Prioritization interacts with gradient clipping
• Sampling transitions with high absolute TD-errors more often leads to gradients with higher
norms, so the hyperparameters were re-tuned
29
References
1. [Wang, 2015] Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling network architectures for
deep reinforcement learning. arXiv preprint arXiv:1511.06581.
2. [Van, 2015] Van Hasselt, H., Guez, A., & Silver, D. (2015). Deep reinforcement learning with
double Q-learning. CoRR, abs/1509.06461.
3. [Schaul, 2015] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience
replay. arXiv preprint arXiv:1511.05952.
4. [Sutton, 1998] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol.
1, No. 1). Cambridge: MIT Press.
30