2. Task: Learn how to behave successfully to achieve a
goal while interacting with an external environment
Learn via experiences!
Examples
• Game playing: the player knows whether it wins or loses, but not how to move at each step
• Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
Reinforcement Learning
3. RL is learning from interaction
The agent acts on its environment and receives some evaluation of its action (reinforcement).
The goal of the agent is to learn a policy that maximizes its total (future) reward.
S_t → A_t → R_t → S_{t+1} → A_{t+1} → R_{t+1} → S_{t+2} → …
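As a sketch, this interaction loop can be written as follows (the `env` object with `reset`/`step` methods is a hypothetical interface, not a specific library):

```python
# Minimal sketch of the agent-environment loop (hypothetical env interface).
def run_episode(env, policy):
    state = env.reset()                          # S_t: initial state
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # A_t: agent acts on the environment
        state, reward, done = env.step(action)   # R_t and S_{t+1}: evaluation + next state
        total_reward += reward                   # the quantity the agent tries to maximize
    return total_reward
```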
4. At each state S, choose the action a which maximizes the function Q(S, a)
Q is the estimated utility function – it tells us how
good an action is given a certain state
Q-Learning Basics
All decisions are based on the Q-table (the best policy), but where does the Q-table come from?
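In code, the Q-table is just a state-by-action array that starts empty and is filled in by learning. A minimal sketch (the sizes `n_states`/`n_actions` are illustrative, chosen to match the 6-room example later):

```python
import numpy as np

# A Q-table for a small discrete problem: one row per state, one column per action.
n_states, n_actions = 6, 6
Q = np.zeros((n_states, n_actions))   # all zeros before any learning

def best_action(Q, state):
    # The policy derived from the table: pick the action a maximizing Q(S, a).
    return int(np.argmax(Q[state]))
```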
5. At each step we draw a random number between 0 and 1. If this number > epsilon, then we do “exploitation” (this means we use what we already know to select the best action); otherwise we do “exploration” (we try a random action to learn more about the environment).
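A sketch of this epsilon-greedy rule (the default epsilon value is an assumption):

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    # With probability 1 - epsilon exploit, otherwise explore.
    if random.random() > epsilon:
        return int(np.argmax(Q[state]))    # exploitation: best known action
    return random.randrange(Q.shape[1])    # exploration: uniformly random action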
6. Bellman equation (Q-table Update Rule)
[Figure: a state graph with states s0–s3 connected by actions a, b, c, d, f. Q(s0, b) is the max, illustrating that the update takes the maximum Q value of the next state over all possible actions.]
Q(S, a) = R(S, a) + γ · max_{a'} Q(S', a'),  with 0 ≤ γ < 1.

Here R(S, a) is the immediate reward, γ is the discount rate, and max_{a'} Q(S', a') is the future reward from the next state S'. This is a recursive definition.
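One application of this update rule, as a sketch (the simple form above has no learning rate, so the code doesn't either; GAMMA = 0.8 is an assumed value within the allowed range):

```python
GAMMA = 0.8   # assumed discount rate, 0 <= gamma < 1

def q_update(Q, R, state, action, next_state):
    # Bellman update: Q(S, a) = R(S, a) + gamma * max_a' Q(S', a')
    Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
```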
8. Initially we explore the environment and update the Q-table. When the Q-table is ready, the agent starts to exploit the environment and takes better actions.
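One common way to implement this shift from exploration to exploitation is to decay epsilon across episodes; a sketch (the schedule constants are assumptions):

```python
epsilon = 1.0            # start fully exploratory
EPSILON_DECAY = 0.995    # assumed decay per episode
EPSILON_MIN = 0.05       # assumed floor so some exploration remains

for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy(Q, state, epsilon) ...
    epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```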
9. This Q-table becomes a reference table for our agent to select the best action.
Algorithm to utilize the Q matrix (a code sketch follows the list):
1. Set current state = initial state.
2. From the current state, find the action with the highest Q value.
3. Set current state = next state.
4. Repeat Steps 2 and 3 until current state = goal state.
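These four steps translate directly into a greedy walk over the Q-table; a sketch (it assumes, as in the room example below, that taking action a moves the agent to the state with index a):

```python
import numpy as np

def utilize_q(Q, start, goal):
    state = start                            # Step 1: current state = initial state
    path = [state]
    while state != goal:                     # Step 4: repeat until the goal state
        action = int(np.argmax(Q[state]))    # Step 2: action with the highest Q value
        state = action                       # Step 3: current state = next state
        path.append(state)
    return path
```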
14. Example: Q-Learning By Hand
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
The outside of the building can be thought of as one big
room (5). Notice that doors 1 and 4 lead into the building
from room 5 (outside).
15. The -1's in the table represent null values (i.e., where there isn't a link between nodes). For example, State 0 cannot go to State 1.
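Putting the pieces together for this example: the reward matrix below is the one from the linked tutorial, and GAMMA = 0.8 follows the tutorial (the episode count is an assumption):

```python
import numpy as np

# Reward matrix R for the 6-room example (rows = states, columns = actions/next rooms).
# -1 marks a missing link; 100 rewards any action that reaches the goal, room 5.
R = np.array([
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
Q = np.zeros_like(R, dtype=float)

for episode in range(1000):                       # assumed number of training episodes
    state = np.random.randint(6)                  # start each episode in a random room
    while True:
        actions = np.flatnonzero(R[state] >= 0)   # doors that exist from this room
        action = int(np.random.choice(actions))   # explore: pick a random door
        # Bellman update: Q(S, a) = R(S, a) + gamma * max_a' Q(S', a')
        Q[state, action] = R[state, action] + GAMMA * Q[action].max()
        state = action                            # the chosen door leads to the room of that index
        if state == 5:                            # goal reached (outside): end the episode
            break
```

With enough episodes the entries stop changing, and the largest entry approaches 500, matching the convergence claim on the next slide.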
17. If our agent learns more through further episodes, the values in matrix Q will finally converge.
This matrix Q can then be normalized (i.e., converted to percentages) by dividing all non-zero entries by the highest number (500 in this case).
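Continuing the sketch above, the normalization is one line (zeros stay zero, so only non-zero entries change):

```python
Q_normalized = np.round(Q / Q.max() * 100)   # divide by the largest entry (500) to get percentages
```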