
Deep Q-learning explained

  1. Hi
  2. Deep Q-Network: Deep Reinforcement Learning. Azzeddine CHENINE (@azzeddineCH_)
  3. 15 October 2019
  4. Q-Learning: quick reminder
  5. Quick reminder: learning by experience
  6. Quick reminder: learning by experience; episodic tasks and continuing tasks
  7. Quick reminder: learning by experience; episodic tasks and continuing tasks; an action yields a reward and results in a new state
  8. Quick reminder: value-based methods try to reach an optimal policy by using a value function
  9. Quick reminder: value-based methods try to reach an optimal policy by using a value function; the Q function of an optimal policy maximises the expected return G
  10. Quick reminder: value-based methods try to reach an optimal policy by using a value function; the Q function of an optimal policy maximises the expected return G; epsilon-greedy policies address the exploration/exploitation trade-off (see the epsilon-greedy sketch after the slide list)
  11. Limitations of Q-Learning
  12. Limitation: impractical for environments with large or infinite state or action spaces (games, the stock market, robots)
  13. Modern problems require modern solutions
  14. Deep Q-Networks (DeepMind, 2015)
  15. No more given definitions…
  16. No more given definitions… figuring it out by ourselves
  17. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function.
  18. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network.
  19. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network. The output layer holds the Q-value of each possible action ⟹ the action space must be discrete.
  20. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network. The output layer holds the Q-value of each possible action ⟹ the action space must be discrete. The input layer takes the desired representation of the state ⟹ a pipeline of convolutions, a Dense layer, or a sequence of LSTM/RNN cells (see the Q-network sketch after the slide list).
  21. 1 - How to replace the Q-table ✅
  22. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S).
  23. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S). The agent moved to the next state (S’) and received a reward (r).
  24. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S). The agent moved to the next state (S’) and received a reward (r). The discount rate 𝞭 reflects how much we care about future rewards: 𝞭 ≈ 1 ⟹ we mostly care about future rewards; 𝞭 ≈ 0 ⟹ we mostly care about the immediate reward.
  25. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? A good Q-value for an action (a) given a state (S) is a good approximation of the expected discounted cumulative reward ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). Update rule: Q'(S, a) ⟸ Q(S, a) + 𝛼 [ r(S, a, S’) + 𝞭 max Q(S’, a') - Q(S, a) ] (see the tabular update sketch after the slide list).
  26-31. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How are we going to update the Q-network weights? ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). (Diagram built up across slides 26-31: State ⟶ Q-network ⟶ Q-values ⟶ action selection ⟶ Environment, which returns a new state S’ and reward R; the target r(S, a, S’) + 𝞭 max Q(S’, a') is compared with Q(S, a) to form a loss, and the optimizer produces new network weights.)
  32. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How are we going to update the Q-network weights? ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). Repeat for n episodes: take the state, update the weights, move to the new state.
  33. We are chasing a moving target
  34-37. (Slides 34-37 animate the point: the prediction Q(S, a) and the target r(S, a, S’) + 𝞭 max Q(S’, a') come from the same network, so every weight update also moves the target being chased.)
  38. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. How are we going to update the Q-network weights? We use a local network and a target network.
  39. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights.
  40. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights. The local network predicts the Q-values for the current state.
  41. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights. The local network predicts the Q-values for the current state. The target network predicts the Q-values for the next state.
  42-43. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. (Diagram, slides 42-43: the local network maps the state to Q-values and selects the action sent to the Environment; from the new state S’ and reward R, the target network supplies r(S, a, S’) + 𝞭 max Q(S’, a'); the loss against Q(S, a) drives the optimizer, which produces new weights for the local network. See the training-step sketch after the slide list.)
  44. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy?
  45. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy? If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence.
  46. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy? If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence. In our case there is a strong correlation between consecutive states, which may lead to inefficient learning.
  47. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target ✅ 4 - Experience replay. How to break the correlation between states: store the experiences in a buffer; after the specified number of steps, pick N random experiences; proceed with optimizing the loss on that learning batch (see the replay-buffer sketch after the slide list).
  48. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target ✅ 4 - Experience replay ✅ Store the experiences in a buffer; after the specified number of steps, pick N random experiences; proceed with optimizing the loss on that learning batch.
  49. Build a DQL agent: putting it all together (see the full training-loop sketch after the slide list).
  50. Saha ftourkoum (enjoy your iftar!)
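
The sketches below expand on the slides above; each is a minimal illustration, not the presenter's own code. First, slide 10's epsilon-greedy policy, assuming Python/NumPy and a tabular Q function; the table shape and the epsilon value in the usage lines are illustrative.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon explore (random action), otherwise exploit
    the action with the highest Q-value for this state."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(q_table[state]))     # exploit

# Illustrative usage: 5 states, 3 actions, 10% exploration.
q_table = np.zeros((5, 3))
action = epsilon_greedy_action(q_table, state=2, epsilon=0.1)
```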
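
Slides 17-20 replace the Q-table with a network whose output layer holds one Q-value per discrete action. A minimal sketch assuming PyTorch and a flat vector state; for image states the first layers would be the convolution pipeline mentioned on slide 20, and the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # output layer: Q(S, a) for every action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action = argmax over the output layer, mirroring a Q-table lookup.
q_net = QNetwork(state_dim=8, n_actions=4)
state = torch.zeros(1, 8)
action = q_net(state).argmax(dim=1).item()
```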
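
Slide 25's tabular update rule Q'(S, a) ⟸ Q(S, a) + 𝛼 [ r(S, a, S’) + 𝞭 max Q(S’, a') - Q(S, a) ] as a one-function sketch; `gamma` stands for the discount rate written 𝞭 in the slides, and the alpha/gamma defaults and table shape are illustrative.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning: move Q(S, a) toward the TD target
    r + gamma * max_a' Q(S', a')."""
    td_target = r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (td_target - q_table[s, a])

# Illustrative usage: one transition (S=0, a=1, r=1.0, S'=2) on a 5x3 table.
q_table = np.zeros((5, 3))
q_learning_update(q_table, s=0, a=1, r=1.0, s_next=2)
```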
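
Slides 26-43 turn that update into a regression problem: the local network predicts Q(S, a), the target network supplies r(S, a, S’) + 𝞭 max Q(S’, a') for the next state, the loss between them is minimised, and the target network is slowly updated toward the local one. A minimal sketch assuming PyTorch and the QNetwork above; the soft-update rate `tau` and the terminal-state masking via `dones` are illustrative details the slides do not spell out.

```python
import torch
import torch.nn.functional as F

def dqn_training_step(local_net, target_net, optimizer, batch, gamma=0.99, tau=1e-3):
    """One optimisation step on a batch of transitions (S, a, r, S', done)."""
    states, actions, rewards, next_states, dones = batch

    # Q(S, a) predicted by the local network for the actions actually taken.
    q_pred = local_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r + gamma * max_a' Q_target(S', a'), held fixed for this step.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    # The loss between prediction and target drives the optimizer.
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: the target network drifts slightly toward the local network.
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```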
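
Slides 44-48 motivate experience replay: consecutive states are strongly correlated, so transitions are stored in a buffer and learning happens on randomly sampled batches. A minimal sketch built on a deque; the default capacity is an illustrative assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (S, a, r, S', done) transitions and samples them uniformly at
    random, breaking the correlation between consecutive states."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```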
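
Slide 49's "putting it all together", sketched as an outer loop that wires the previous pieces together. It assumes a classic Gym-style environment whose reset() returns a state and whose step(action) returns (next_state, reward, done, info); the episode count, batch size, learning rate and epsilon schedule are all illustrative choices, not values from the presentation.

```python
import numpy as np
import torch

def train(env, state_dim, n_actions, episodes=500, batch_size=64,
          gamma=0.99, eps=1.0, eps_min=0.01, eps_decay=0.995):
    """DQN training loop using the QNetwork, ReplayBuffer and dqn_training_step sketches above."""
    local_net = QNetwork(state_dim, n_actions)
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(local_net.state_dict())
    optimizer = torch.optim.Adam(local_net.parameters(), lr=5e-4)
    buffer = ReplayBuffer()

    for episode in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy over the local network's Q-values.
            if np.random.rand() < eps:
                action = np.random.randint(n_actions)
            else:
                with torch.no_grad():
                    q = local_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1).item())

            next_state, reward, done, _ = env.step(action)
            buffer.add(state, action, reward, next_state, float(done))
            state = next_state

            # Learn from a random batch once enough experience is stored.
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
                batch_t = (torch.as_tensor(states, dtype=torch.float32),
                           torch.as_tensor(actions, dtype=torch.int64),
                           torch.as_tensor(rewards, dtype=torch.float32),
                           torch.as_tensor(next_states, dtype=torch.float32),
                           torch.as_tensor(dones, dtype=torch.float32))
                dqn_training_step(local_net, target_net, optimizer, batch_t, gamma)

        eps = max(eps_min, eps * eps_decay)  # explore less as training progresses
```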
