https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also covers advances in the development of deep neural networks trained with little or no supervision, both for discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
Solving the Optimal Policy

The optimal policy π* is the one capable of achieving the optimal value functions: the optimal state-value function V*(s) and the optimal Q-value function Q*(s,a).
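Written out (standard definitions, added here for reference rather than taken from the slides):

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s)
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
\qquad
\pi^{*}(s) = \arg\max_{a \in \mathcal{A}} Q^{*}(s,a)
```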
Solving the Optimal Policy: Q-Learning

Tabular Q-learning is feasible for small state-action spaces: Q(s,a) is stored as a table with one row per state and one column per action. For example, the 4×4 FrozenLake grid world (S = start, F = frozen, H = hole, G = goal):

S F F F
F H F H
F F F F
H F F G
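A minimal tabular Q-learning sketch for this environment (assuming the Gymnasium FrozenLake-v1 implementation; the slides only show the grid, and the hyperparameters here are illustrative):

```python
import numpy as np
import gymnasium as gym  # assumes the gymnasium package is installed

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))  # 16 states x 4 actions

alpha, gamma, eps = 0.1, 0.99, 0.1
for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection from the current Q table
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # tabular Q-learning update: move Q(s,a) toward the TD target
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not terminated) - Q[s, a])
        s = s_next
```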
Q-Function estimation with a table

How would you compute the number of rows in a tabular Q-learning solution of Space Invaders?
Exploring all possible states would require:
● generating all possible (valid) pixel combinations,
● running all possible actions,
● each several times, to allow estimation.
This is not scalable, as the rough count sketched below shows.
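A back-of-the-envelope count (an added illustration; it assumes DQN-style preprocessing to 84×84 grayscale frames with 256 intensity levels per pixel):

```python
import math

# Number of distinct 84x84 grayscale frames with 256 intensity levels per pixel:
# 256 ** (84 * 84) possible states -- far too many to enumerate in a table.
digits = math.log10(256) * 84 * 84
print(f"the state count has about {digits:.0f} decimal digits")  # ~16992 digits
```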
Q-Function approximation with NN

Deep neural networks are powerful function approximators, also of Q*(s,a):

Q(s,a,w) ≈ Q*(s,a)

where w are the neural network parameters.
[Figure: a network that takes the state and the action as input and outputs a single Q value]

Can you think of a more efficient way of estimating Q values? Hint: how many passes through the network do we need for sampling an action when using this architecture?
For discrete actions, practical implementations actually feed only the state into the NN and estimate a Q-value for each action, so a single forward pass scores every action at once.
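A minimal sketch of this architecture (in PyTorch, which is an assumption here; the fully connected body is a simplification of the convolutional network used for Atari):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # one output head per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Greedy action selection needs a single forward pass:
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)
```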
Q-Learning with Neural Networks

An early example is TD-Gammon: Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215-219.
Deep Q-learning (DQN)

Deep Q-Network (DQN) was the first method to combine value-based RL (in particular, Q-learning) with deep neural networks.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
[Figure: the DQN network architecture; the size of the output layer equals the number of actions, between 4 and 18 depending on the Atari game]
Some challenges arise when combining Q-learning with neural networks trained
with gradient descent:
1. When updating Q(s,a) for a given (s,a) tuple, we will be changing the Q value
estimate for all (s,a). This did not happen in the tabular version of Q-learning.
2. Gradient-based techniques assume that the training samples are independent
and identically distributed (i.i.d.). This assumption is broken when the training
data is generated by an agent interacting with the environment.
Outline
1. Motivation
2. Function approximation
3. Deep Q-Networks (DQN)
○ Online & Target Networks
○ Replay Memory
4. Improvements to DQN
DQN: Online & Target Networks

● Q-network parameters in the online policy network determine the next training samples ➡ this can lead to bad feedback loops.
● A different, more stable target network is used to estimate the TD targets:

y = r + γ · max_a′ Q(s′, a′; w⁻)   (TD target)

Source: Arthur Juliani, “Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond” (2016)
The target network (w⁻) is updated by copying the parameters of the online (policy) network (w) periodically.

[Figure: the online (policy) network (w) is periodically copied into the target network (w⁻), which provides the TD target]
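A sketch of the periodic hard update (PyTorch, reusing the QNetwork sketch above; the copy period is a placeholder):

```python
import copy

online_net = QNetwork(state_dim=4, n_actions=2)
target_net = copy.deepcopy(online_net)  # target starts as a copy of the online net
target_net.eval()

COPY_EVERY = 10_000  # assumption: copy period in environment steps
for step in range(1, 1_000_000):
    ...  # act, store the transition, train the online network
    if step % COPY_EVERY == 0:
        # hard update: overwrite target parameters w- with online parameters w
        target_net.load_state_dict(online_net.state_dict())
```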
DQN: Replay Memory

● Learning from batches of consecutive samples is problematic because the samples are too correlated ➡ inefficient learning.
● Continually update a replay memory table of interactions (s, a, r, s′) as episodes are collected.
● Train the online network (w) with random minibatches of transitions drawn from the replay memory, instead of consecutive samples.

[Figure: the replay memory, a table with one (s, a, r, s′) transition per row]

Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
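A minimal replay memory sketch (an illustration; the class and names are placeholders):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # uniform random sampling breaks the temporal correlation of episodes
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```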
Deep Q-learning (DQN)

Algorithm:
1. Collect transitions (s, a, r, s′) and store them in a replay memory D
2. Sample a random mini-batch of transitions (s, a, r, s′) from the replay memory D
3. Compute the TD-learning targets with respect to (wrt.) the old parameters w⁻
4. Optimise with an MSE loss using gradient descent:

L(w) = E_(s,a,r,s′)∼D [ ( r + γ · max_a′ Q(s′, a′; w⁻) − Q(s, a; w) )² ]

where r + γ · max_a′ Q(s′, a′; w⁻) is the TD target.

David Silver, UCL course on RL (Lecture 6)
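One optimisation step, as a sketch (PyTorch; reuses the QNetwork and ReplayMemory sketches above and assumes the sampled batch has already been collated into tensors):

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the MSE between Q(s,a;w) and the TD target."""
    s, a, r, s_next, done = batch  # a: long action indices; done: float 0/1 mask
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a; w)
    with torch.no_grad():  # the target uses the frozen parameters w-
        max_q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * max_q_next * (1 - done)
    loss = F.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```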
Beyond DQN by Víctor Campos (2020)
Multiple improvements have been proposed since the original DQN was
published:
● Double DQN
● Prioritized Experience Replay
● Dueling Networks
● Distributional DQN
● Noisy Networks
Double DQN

DQN suffers from overestimation bias. It can be reduced by using the online network to select the action whose Q(s,a) will be used for bootstrapping, while the target, Y, still evaluates that action with the target network:

Y = r + γ · Q(s′, argmax_a′ Q(s′, a′; w); w⁻)

where the inner argmax uses the online network (w) and the outer Q uses the target network (w⁻).

van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI 2016
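In code, the only change relative to the one-step DQN target is which network picks a′ (a sketch, same conventions as the update step above):

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: the online network selects a', the target network evaluates it."""
    with torch.no_grad():
        a_next = online_net(s_next).argmax(dim=1)  # selection with w
        q_next = target_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)  # evaluation with w-
        return r + gamma * q_next * (1 - done)
```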
Prioritized Experience Replay (PER)

In the original DQN, all samples have the same probability of being sampled for replay (i.e. for learning). One might want to replay interesting transitions more often: PER assigns a different replay priority to each transition, for instance based on the TD error (transitions with a larger TD error are sampled more often).

Schaul et al., Prioritized Experience Replay, ICLR 2016
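A sketch of the proportional variant, where the sampling probability is proportional to |TD error|^α (the function and names here are illustrative):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha  # eps keeps every priority > 0
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```

The full method also corrects the bias introduced by this non-uniform sampling with importance-sampling weights.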
Dueling Networks

Instead of estimating Q(s,a) directly, dueling networks estimate V(s) and A(s,a) and combine them to obtain an estimate of Q(s,a). This architecture helps with generalization across states.

[Figure: a standard DQN head outputs Q(s,a) directly; a dueling head splits into a V(s) stream and an A(s,a) stream that are combined into Q(s,a)]

Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, ICML 2016
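The combination used in the paper subtracts the mean advantage so that V and A are identifiable:

```latex
Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)
```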
Distributional DQN

Standard Q-learning estimates the expected return for each state and action. Distributional Q-learning instead models the full return distribution, not only its expectation.

Bellemare et al., A Distributional Perspective on Reinforcement Learning, ICML 2017
Noisy Networks

DQN explores with ε-greedy. Noisy DQN instead adds (learnable) noise to the Q-network parameters when sampling actions.

Fortunato et al., Noisy Networks for Exploration, ICLR 2018
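A sketch of a noisy linear layer with independent Gaussian noise per weight (a simplification; the paper also proposes a cheaper factorised variant):

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are mu + sigma * epsilon, with learnable mu and sigma."""
    def __init__(self, in_f: int, out_f: int, sigma0: float = 0.017):
        super().__init__()
        self.mu = nn.Parameter(torch.empty(out_f, in_f).uniform_(-1, 1) / in_f ** 0.5)
        self.sigma = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(self.sigma)  # fresh noise drawn on each forward pass
        return x @ (self.mu + self.sigma * eps).t() + self.bias
```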
Rainbow: combining all improvements

Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018
What about continuous actions?

DQN derives a policy from a Q-value estimate greedily:

π(s) = argmax_{a∈A} Q(s,a)

For discrete actions, all Q(s,a) for a given state are computed and we take the action with the largest value. How can we compute this argmax when dealing with continuous (i.e. infinite) actions?
We can train another neural network to predict the action that maximizes Q(s,a).
Deep Deterministic Policy Gradient (DDPG)

[Figure: an actor network maps the state to an action; a critic network takes the state and that action and outputs Q]

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016
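A sketch of the two networks and the actor's objective (PyTorch, with placeholder dimensions; full DDPG also uses target networks and a replay memory, as in DQN):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # placeholder dimensions

actor = nn.Sequential(  # pi(s): state -> continuous action
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, action_dim), nn.Tanh(),  # actions bounded in [-1, 1]
)
critic = nn.Sequential(  # Q(s,a): (state, action) -> scalar value
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Actor update: ascend the critic's estimate of Q(s, pi(s)).
s = torch.randn(32, state_dim)  # a batch of states from the replay memory
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_loss.backward()  # gradients flow through the critic into the actor
# (in full DDPG, only the actor's optimiser steps on this loss)
```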