https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also covers advances in the development of deep neural networks trained with little or no supervision, both for discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
Solving the Optimal Policy

The optimal policy π* is the one capable of achieving the optimal value functions: the optimal state-value function V*(s) and the optimal Q-value function Q*(s,a).
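Written out (standard definitions, added here for reference rather than taken from the slides):

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s)
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
\qquad
\pi^{*}(s) = \arg\max_{a \in \mathcal{A}} Q^{*}(s,a)
```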
Solving the Optimal Policy: Q-Learning

Tabular Q-learning is feasible for small state-action spaces: Q(s,a) is stored as a table with one row per state and one column per action. For example, the 4×4 FrozenLake grid world (S = start, F = frozen, H = hole, G = goal):

S F F F
F H F H
F F F F
H F F G
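A minimal tabular Q-learning sketch for this environment (assuming the Gymnasium FrozenLake-v1 implementation; the slides only show the grid, and the hyperparameters here are illustrative):

```python
import numpy as np
import gymnasium as gym  # assumes the gymnasium package is installed

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))  # 16 states x 4 actions

alpha, gamma, eps = 0.1, 0.99, 0.1
for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection from the current Q table
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # tabular Q-learning update: move Q(s,a) toward the TD target
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not terminated) - Q[s, a])
        s = s_next
```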
Q-Function estimation with a table

How would you compute the number of rows in a tabular Q-learning solution of Space Invaders?
Exploring all possible states would require:
● generating all possible (valid) pixel combinations,
● running all possible actions,
● each several times, to allow estimation.
This is not scalable, as the rough count sketched below shows.
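A back-of-the-envelope count (an added illustration; it assumes DQN-style preprocessing to 84×84 grayscale frames with 256 intensity levels per pixel):

```python
import math

# Number of distinct 84x84 grayscale frames with 256 intensity levels per pixel:
# 256 ** (84 * 84) possible states -- far too many to enumerate in a table.
digits = math.log10(256) * 84 * 84
print(f"the state count has about {digits:.0f} decimal digits")  # ~16992 digits
```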
Q-Function approximation with NN

Deep neural networks are powerful function approximators, also of Q*(s,a):

Q(s,a,w) ≈ Q*(s,a)

where w are the neural network parameters.
[Figure: a network that takes the state and the action as input and outputs a single Q value]

Can you think of a more efficient way of estimating Q values? Hint: how many passes through the network do we need for sampling an action when using this architecture?
For discrete actions, practical implementations actually feed only the state into the NN and estimate a Q-value for each action, so a single forward pass scores every action at once.
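A minimal sketch of this architecture (in PyTorch, which is an assumption here; the fully connected body is a simplification of the convolutional network used for Atari):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # one output head per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Greedy action selection needs a single forward pass:
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)
```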
Q-Learning with Neural Networks

An early example is TD-Gammon: Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215-219.
Deep Q-learning (DQN)

Deep Q-Network (DQN) was the first method to combine value-based RL (in particular, Q-learning) with deep neural networks.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
[Figure: the DQN network architecture; the size of the output layer equals the number of actions, between 4 and 18 depending on the Atari game]
Some challenges arise when combining Q-learning with neural networks trained
with gradient descent:
1. When updating Q(s,a) for a given (s,a) tuple, we will be changing the Q value
estimate for all (s,a). This did not happen in the tabular version of Q-learning.
2. Gradient-based techniques assume that the training samples are independent
and identically distributed (i.i.d.). This assumption is broken when the training
data is generated by an agent interacting with the environment.
Outline
1. Motivation
2. Function approximation
3. Deep Q-Networks (DQN)
○ Online & Target Networks
○ Replay Memory
4. Improvements to DQN
DQN: Online & Target Networks

● Q-network parameters in the online policy network determine the next training samples ➡ this can lead to bad feedback loops.
● A different, more stable target network is used to estimate the TD targets:

y = r + γ · max_a′ Q(s′, a′; w⁻)   (TD target)

Source: Arthur Juliani, “Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond” (2016)
The target network (w⁻) is updated by copying the parameters of the online (policy) network (w) periodically.

[Figure: the online (policy) network (w) is periodically copied into the target network (w⁻), which provides the TD target]
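A sketch of the periodic hard update (PyTorch, reusing the QNetwork sketch above; the copy period is a placeholder):

```python
import copy

online_net = QNetwork(state_dim=4, n_actions=2)
target_net = copy.deepcopy(online_net)  # target starts as a copy of the online net
target_net.eval()

COPY_EVERY = 10_000  # assumption: copy period in environment steps
for step in range(1, 1_000_000):
    ...  # act, store the transition, train the online network
    if step % COPY_EVERY == 0:
        # hard update: overwrite target parameters w- with online parameters w
        target_net.load_state_dict(online_net.state_dict())
```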
DQN: Replay Memory

● Learning from batches of consecutive samples is problematic because the samples are too correlated ➡ inefficient learning.
● Continually update a replay memory table of interactions (s, a, r, s′) as episodes are collected.
● Train the online network (w) with random minibatches of transitions drawn from the replay memory, instead of consecutive samples.

[Figure: the replay memory, a table with one (s, a, r, s′) transition per row]

Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
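A minimal replay memory sketch (an illustration; the class and names are placeholders):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # uniform random sampling breaks the temporal correlation of episodes
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```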
Deep Q-learning (DQN)

Algorithm:
1. Collect transitions (s, a, r, s′) and store them in a replay memory D
2. Sample a random mini-batch of transitions (s, a, r, s′) from the replay memory D
3. Compute the TD-learning targets with respect to (wrt.) the old parameters w⁻
4. Optimise with an MSE loss using gradient descent:

L(w) = E_(s,a,r,s′)∼D [ ( r + γ · max_a′ Q(s′, a′; w⁻) − Q(s, a; w) )² ]

where r + γ · max_a′ Q(s′, a′; w⁻) is the TD target.

David Silver, UCL course on RL (Lecture 6)
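One optimisation step, as a sketch (PyTorch; reuses the QNetwork and ReplayMemory sketches above and assumes the sampled batch has already been collated into tensors):

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the MSE between Q(s,a;w) and the TD target."""
    s, a, r, s_next, done = batch  # a: long action indices; done: float 0/1 mask
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a; w)
    with torch.no_grad():  # the target uses the frozen parameters w-
        max_q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * max_q_next * (1 - done)
    loss = F.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```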
Beyond DQN by Víctor Campos (2020)
Multiple improvements have been proposed since the original DQN was
published:
● Double DQN
● Prioritized Experience Replay
● Dueling Networks
● Distributional DQN
● Noisy Networks
Double DQN

DQN suffers from overestimation bias. It can be reduced by using the online network to select the action whose Q(s,a) will be used for bootstrapping, while the target, Y, still evaluates that action with the target network:

Y = r + γ · Q(s′, argmax_a′ Q(s′, a′; w); w⁻)

where the inner argmax uses the online network (w) and the outer Q uses the target network (w⁻).

van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI 2016
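In code, the only change relative to the one-step DQN target is which network picks a′ (a sketch, same conventions as the update step above):

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: the online network selects a', the target network evaluates it."""
    with torch.no_grad():
        a_next = online_net(s_next).argmax(dim=1)  # selection with w
        q_next = target_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)  # evaluation with w-
        return r + gamma * q_next * (1 - done)
```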
Prioritized Experience Replay (PER)

In the original DQN, all samples have the same probability of being sampled for replay (i.e. for learning). One might want to replay interesting transitions more often: PER assigns a different replay priority to each transition, for instance based on the TD error (transitions with a larger TD error are sampled more often).

Schaul et al., Prioritized Experience Replay, ICLR 2016
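A sketch of the proportional variant, where the sampling probability is proportional to |TD error|^α (the function and names here are illustrative):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha  # eps keeps every priority > 0
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```

The full method also corrects the bias introduced by this non-uniform sampling with importance-sampling weights.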
Dueling Networks

Instead of estimating Q(s,a) directly, dueling networks estimate V(s) and A(s,a) and combine them to obtain an estimate of Q(s,a). This architecture helps with generalization across states.

[Figure: a standard DQN head outputs Q(s,a) directly; a dueling head splits into a V(s) stream and an A(s,a) stream that are combined into Q(s,a)]

Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, ICML 2016
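The combination used in the paper subtracts the mean advantage so that V and A are identifiable:

```latex
Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)
```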
Distributional DQN

Standard Q-learning estimates the expected return for each state and action. Distributional Q-learning instead models the full return distribution, not only its expectation.

Bellemare et al., A Distributional Perspective on Reinforcement Learning, ICML 2017
Noisy Networks

DQN explores with ε-greedy. Noisy DQN instead adds (learnable) noise to the Q-network parameters when sampling actions.

Fortunato et al., Noisy Networks for Exploration, ICLR 2018
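A sketch of a noisy linear layer with independent Gaussian noise per weight (a simplification; the paper also proposes a cheaper factorised variant):

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are mu + sigma * epsilon, with learnable mu and sigma."""
    def __init__(self, in_f: int, out_f: int, sigma0: float = 0.017):
        super().__init__()
        self.mu = nn.Parameter(torch.empty(out_f, in_f).uniform_(-1, 1) / in_f ** 0.5)
        self.sigma = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(self.sigma)  # fresh noise drawn on each forward pass
        return x @ (self.mu + self.sigma * eps).t() + self.bias
```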
Rainbow: combining all improvements

Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018
What about continuous actions?

DQN derives a policy from a Q-value estimate greedily:

π(s) = argmax_{a∈A} Q(s,a)

For discrete actions, all Q(s,a) for a given state are computed and we take the action with the largest value. How can we compute this argmax when dealing with continuous (i.e. infinite) actions?
We can train another neural network to predict the action that maximizes Q(s,a).
Deep Deterministic Policy Gradient (DDPG)

[Figure: an actor network maps the state to an action; a critic network takes the state and that action and outputs Q]

Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016
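A sketch of the two networks and the actor's objective (PyTorch, with placeholder dimensions; full DDPG also uses target networks and a replay memory, as in DQN):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # placeholder dimensions

actor = nn.Sequential(  # pi(s): state -> continuous action
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, action_dim), nn.Tanh(),  # actions bounded in [-1, 1]
)
critic = nn.Sequential(  # Q(s,a): (state, action) -> scalar value
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Actor update: ascend the critic's estimate of Q(s, pi(s)).
s = torch.randn(32, state_dim)  # a batch of states from the replay memory
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_loss.backward()  # gradients flow through the critic into the actor
# (in full DDPG, only the actor's optimiser steps on this loss)
```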