Deep Q-Networks
Volodymyr Mnih
Deep RL Bootcamp, Berkeley
Recap: Q-Learning
■ Learning a parametric Q function: Q(s, a; θ)
■ Remember: Q*(s, a) = E[ r + γ max_a′ Q*(s′, a′) | s, a ]
■ Update: θ ← θ + α ( r + γ max_a′ Q(s′, a′; θ) − Q(s, a; θ) ) ∇_θ Q(s, a; θ)
■ For a tabular function, Q(s, a; θ) = θ_(s,a), we recover the familiar update (a small sketch follows the recap slides below):
  Q(s, a) ← Q(s, a) + α ( r + γ max_a′ Q(s′, a′) − Q(s, a) )
■ Converges to optimal values (*)
■ Does it work with neural network Q functions?
■ Yes, but with some care.
Recap: (Tabular) Q-Learning
Recap: Approximate Q-Learning
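To make the recap concrete, here is a minimal sketch of the familiar tabular update; states and actions are assumed to be integer indices into a NumPy table, and `alpha`/`gamma` are illustrative defaults rather than values from the slides.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update on a |S| x |A| table Q."""
    target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])    # move Q(s, a) toward the bootstrapped target
    return Q
```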
DQN
● High-level idea: make Q-learning look like supervised learning.
● Two main ideas for stabilizing Q-learning (a minimal sketch follows the citation below).
● Apply Q-updates on batches of past experience instead of online:
○ Experience replay (Lin, 1993).
○ Previously used for better data efficiency.
○ Makes the data distribution more stationary.
● Use an older set of weights to compute the targets (target network):
○ Keeps the target function from changing too quickly.
“Human-Level Control Through Deep Reinforcement Learning”, Mnih, Kavukcuoglu, Silver et al. (2015)
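A minimal sketch of the two stabilizers above; the class and function names are illustrative and not taken from the original Lua+Torch code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay (Lin, 1993): store past transitions, sample uniform minibatches."""
    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)

def sync_target(online_params, target_params):
    """Target network: copy the online weights into the older target set every C steps,
    so the bootstrap targets change slowly. Parameters are assumed to be plain dicts here."""
    target_params.update(online_params)
```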
Target Network Intuition
[Diagram: transition from state s to s′]
● Changing the value of one action will change the value of other actions and similar states.
● The network can end up chasing its own tail because of bootstrapping.
● Somewhat surprising fact: bigger networks are less prone to this because they alias less.
DQN Training Algorithm
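A minimal, framework-agnostic sketch of the training loop this slide refers to, reusing the `ReplayBuffer` from above; `env`, `q_net`, `target_net`, `epsilon_greedy`, `epsilon`, `gradient_step`, `copy_weights`, and the loop constants are assumed placeholders named only for illustration, not the original implementation.

```python
# Sketch only: placeholder names, not the original Lua+Torch code.
buffer = ReplayBuffer()
copy_weights(src=q_net, dst=target_net)              # theta_minus <- theta

s = env.reset()
for step in range(total_steps):
    a = epsilon_greedy(q_net, s, epsilon(step))      # annealed epsilon-greedy exploration
    s_next, r, done = env.step(a)
    buffer.add(s, a, r, s_next, done)
    s = env.reset() if done else s_next

    if len(buffer) >= learning_start:
        batch = buffer.sample(32)
        # Regression target per transition, computed with the older weights theta_minus:
        #   y = r                                  if the episode ended
        #   y = r + gamma * max_a' Q(s', a'; theta_minus)   otherwise
        gradient_step(q_net, target_net, batch)

    if step % target_update_period == 0:
        copy_weights(src=q_net, dst=target_net)      # refresh the target network
```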
DQN Details
● Uses Huber loss instead of squared loss on the Bellman error (a sketch follows below).
● Uses RMSProp instead of vanilla SGD.
○ Optimization in RL really matters.
● It helps to anneal the exploration rate.
○ Start ε at 1 and anneal it to 0.1 or 0.05 over the first million frames.
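A small NumPy sketch of the two details above: the Huber loss is quadratic near zero and linear beyond a threshold, keeping gradients bounded on large Bellman errors, and ε is annealed linearly; the threshold `k` and the schedule endpoints are the commonly used values, stated here as assumptions.

```python
import numpy as np

def huber(delta, k=1.0):
    """Huber loss on the Bellman error delta: quadratic for |delta| <= k, linear beyond."""
    return np.where(np.abs(delta) <= k,
                    0.5 * delta ** 2,
                    k * (np.abs(delta) - 0.5 * k))

def epsilon(step, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal the exploration rate from 1.0 down to 0.1 over the first million frames."""
    frac = min(step / anneal_frames, 1.0)
    return start + frac * (end - start)
```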
DQN on ATARI
[Screenshots: Pong, Enduro, Beamrider, Q*bert]
• 49 ATARI 2600 games.
• From pixels to actions.
• The change in score is the reward.
• Same algorithm.
• Same function approximator, w/ 3M free parameters.
• Same hyperparameters.
• Roughly human-level performance on 29 out of 49 games.
ATARI Network Architecture
[Diagram: stack of 4 previous frames (4x84x84 input) → convolutional layer of rectified linear units (16 8x8 filters) → convolutional layer of rectified linear units (32 4x4 filters) → fully-connected layer of rectified linear units (256 hidden units) → fully-connected linear output layer]
● Convolutional neural network architecture (a sketch follows below):
○ History of frames as input.
○ One output per action: the expected return Q(s, a) for that action.
○ Final results used a slightly bigger network (3 convolutional + 1 fully-connected hidden layers).
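A sketch of the architecture in PyTorch; the filter counts, sizes, and hidden units come from the slide, while the strides of 4 and 2 are taken from the 2015 paper, and PyTorch itself is an assumption (the original code was Lua+Torch).

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q-network sketch: 4x84x84 input, two conv layers, one hidden FC layer,
    and a linear output layer with one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 filters
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 256 hidden units
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # Q(s, a) for each action
        )

    def forward(self, x):
        return self.net(x)
```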
Stability Techniques
Atari Results
“Human-Level Control Through Deep Reinforcement Learning”, Mnih, Kavukcuoglu, Silver et al. (2015)
DQN Playing ATARI
Action Values on Pong
Learned Value Functions
Sacrificing Immediate Rewards
DQN Source Code
● The DQN source code (in Lua+Torch) is available:
https://sites.google.com/a/deepmind.com/dqn/
Neural Fitted Q Iteration
● NFQ (Riedmiller, 2005) trains neural networks with Q-learning.
● Alternates between collecting new data and fitting a new Q-function to all previous experience with batch gradient descent.
● DQN can be seen as an online variant of NFQ.
Lin’s Networks
● Long-Ji Lin’s thesis “Reinforcement Learning for Robots using Neural Networks” (1993) also trained neural nets with Q-learning.
● Introduced experience replay among other things.
● Lin’s networks did not share parameters among actions.
[Diagram: Lin’s architecture uses a separate network per action, each outputting a single Q(aᵢ, s); DQN uses one network whose outputs are Q(a₁, s), Q(a₂, s), Q(a₃, s) for all actions.]
Double DQN
● There is an upward bias in max_a Q(s, a; θ).
● DQN maintains two sets of weights, θ and θ⁻, so reduce the bias by using:
○ θ for selecting the best action.
○ θ⁻ for evaluating the best action.
● Double DQN loss (a sketch follows the citation below):
  L(θ) = ( r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ⁻) − Q(s, a; θ) )²
“Deep Reinforcement Learning with Double Q-Learning”, van Hasselt et al. (2016)
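A minimal sketch of the Double DQN target in PyTorch notation (the framework is an assumption): the online network θ picks the argmax action, the older network θ⁻ evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: select a' with the online net, evaluate it with the target net."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q(s', a'; theta)
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # Q(s', a*; theta_minus)
        return r + gamma * (1.0 - done) * q_eval
```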
Prioritized Experience Replay
● Replaying all transitions with equal probability is highly suboptimal.
● Replay transitions in proportion to the absolute Bellman error | r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ) | (a sampling sketch follows below).
● Leads to much faster learning.
“Prioritized Experience Replay”, Schaul et al. (2016)
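A sketch of proportional prioritized sampling, assuming priorities (absolute Bellman errors) are stored alongside transitions; the published method additionally uses a sum-tree for efficiency and importance-sampling weights, both omitted here.

```python
import numpy as np

def sample_prioritized(priorities, batch_size=32, alpha=0.6):
    """Sample transition indices with probability proportional to |Bellman error|^alpha."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    p = p / p.sum()
    return np.random.choice(len(p), size=batch_size, p=p)
```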
Dueling DQN
● Value-Advantage decomposition of Q: Q(s, a) = V(s) + A(s, a)
● Dueling DQN (Wang et al., 2015) estimates V(s) and A(s, a) with separate streams and combines them into Q(s, a) (a sketch of the dueling head follows below).
[Diagram: the standard DQN head outputs Q(s, a) directly; the dueling head splits into a V(s) stream and an A(s, a) stream that are combined into Q(s, a).]
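A sketch of the dueling head in PyTorch (the framework and the layer sizes are assumptions): separate value and advantage streams are combined into Q, subtracting the mean advantage for identifiability as in the paper.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feature_dim, n_actions, hidden=256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                    # V(s): shape (batch, 1)
        a = self.advantage(features)                # A(s, a): shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # combine into Q(s, a)
```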
Atari Results
“Dueling Network Architectures for Deep Reinforcement Learning”, Wang et al. (2016)
Noisy Nets for Exploration
● Add noise to network parameters for better exploration [Fortunato, Azar, Piot et al. (2017)].
● Standard linear layer: y = Wx + b
● Noisy linear layer (a sketch follows below): y = (W + σ_W ⊙ ε_W)x + (b + σ_b ⊙ ε_b)
● ε_W and ε_b contain noise.
● σ_W and σ_b are learned parameters that determine the amount of noise.
“Noisy Networks for Exploration”, Fortunato, Azar, Piot et al. (2017)
Also see “Parameter Space Noise for Exploration”, Plappert et al. (2017)
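A minimal sketch of a noisy linear layer in PyTorch (the framework is an assumption), with independent Gaussian noise per weight; the published NoisyNet uses factorised noise and a particular σ initialisation, both simplified away here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (W + sigma_W * eps_W) x + (b + sigma_b * eps_b); sigma is learned, eps is noise."""
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)   # noise, resampled on every forward pass
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.weight + self.sigma_w * eps_w
        bias = self.bias + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```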
Questions?
