1. Introduction to Machine Learning
Lecture 22
Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 21
Value functions
Vπ(s): Long-term reward estimation
from state s following policy π
Qπ(s,a): Long-term reward estimation
from state s executing action a and then following policy π
The long-term reward is a recency-weighted average of the received rewards
… st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 → …
Slide 2
Artificial Intelligence Machine Learning
3. Recap of Lecture 21
Policy
A policy, π, is a mapping from states, s∈S, and actions,
a∈A(s), to the probability π(s, a) of taking action a when in
state s.
4. Today’s Agenda
Bellman equations for value functions
Optimal policy
Learning the optimal policy
Q-learning
5. Let’s Estimate the Future Reward
We want to estimate the expected future reward, given a certain state and a policy π
For the state-value function Vπ(s):
    Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s ]
For the action-value function Qπ(s,a):
    Qπ(s,a) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s, at = a ]
6. Bellman Equation for a Policy π
Playing a little with the equations:
    Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s ]
Therefore:
    Vπ(s) = Eπ[ rt+1 + γ Vπ(st+1) | st = s ]
Finally:
    Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
7. Q-value Bellman Equation
If we estimate the q-value:
    Qπ(s,a) = Σs' P(s'|s,a) [ R(s,a,s') + γ Σa' π(s',a') Qπ(s',a') ]
8. Calculation of Value Functions
How to calculate the value functions for a given policy:
1. Solve a set of linear equations: the Bellman equation for Vπ is a system of |S| linear equations in |S| unknowns
2. Iterative method (convergence proved): calculate the values by sweeping repeatedly through the states
3. Greedy methods
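The iterative method (option 2 above) can be sketched in Python. The 2-state MDP below is an invented toy example, not from the lecture:

```python
# Iterative policy evaluation: sweep through the states, replacing
# V(s) with the right-hand side of the Bellman equation until the
# values stop changing. The 2-state MDP is an illustrative toy.

# P[s][a] = list of (probability, next_state, reward) transitions
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
policy = {s: {0: 0.5, 1: 0.5} for s in P}  # equiprobable pi(s, a)
gamma = 0.9

def policy_evaluation(P, policy, gamma, theta=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:  # one sweep through the state space
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # converged
            return V

V = policy_evaluation(P, policy, gamma)
```

At convergence V satisfies the Bellman equation for π to within the tolerance theta.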
9. Example: The Gridworld
Rewards
-1 if the agent goes off the grid
0 in all other states, except for states A and B
From A, all four actions yield a reward of 10 and take the agent to A’
From B, all four actions yield a reward of 5 and take the agent to B’
[Figure (b): the resulting state-value function Vπ, obtained by solving the Bellman equations]
Policy = equal probability for each movement
γ = 0.9
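The gridworld values can be obtained by solving the Bellman linear system directly. A sketch, assuming the classic 5×5 layout with A at (0,1), A' at (4,1), B at (0,3), B' at (2,3) (the grid size and cell positions are not stated in the text):

```python
import numpy as np

# Direct solution of the gridworld's Bellman system,
# (I - gamma * P_pi) v = r_pi, for the equiprobable policy.
# The 5x5 layout and the positions of A, A', B, B' below are
# assumptions (they are not stated in the text).
N = 5
gamma = 0.9
A, A2 = (0, 1), (4, 1)   # from A every action gives +10, jumps to A'
B, B2 = (0, 3), (2, 3)   # from B every action gives +5, jumps to B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

idx = lambda row, col: row * N + col
P = np.zeros((N * N, N * N))   # P[s, s'] under the equiprobable policy
R = np.zeros(N * N)            # expected one-step reward per state
for row in range(N):
    for col in range(N):
        s = idx(row, col)
        for dr, dc in MOVES:   # each move has probability 1/4
            if (row, col) == A:
                P[s, idx(*A2)] += 0.25
                R[s] += 0.25 * 10
            elif (row, col) == B:
                P[s, idx(*B2)] += 0.25
                R[s] += 0.25 * 5
            elif not (0 <= row + dr < N and 0 <= col + dc < N):
                P[s, s] += 0.25        # off the grid: stay put, reward -1
                R[s] += 0.25 * -1
            else:
                P[s, idx(row + dr, col + dc)] += 0.25

V = np.linalg.solve(np.eye(N * N) - gamma * P, R).reshape(N, N)
```

Under this layout the highest value appears at A, even though its value is below 10: from A the agent is sent to A' near the bottom edge, where it tends to bump into the wall.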
10. Looking for the Optimal Policy
11. Optimal Policy
We search for a policy that achieves a lot of reward over the long run
Value functions enable us to define a partial order over
policies
A policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states: π ≥ π’ iff Vπ(s) ≥ Vπ’(s) for all s ∈ S
Optimal policies π* share the same optimal state-value function V*, which can be written as:
    V*(s) = max_π Vπ(s), for all s ∈ S
13. Focusing on the Objective
We want to find the optimal policy
There are many methods for this purpose
Dynamic programming
Policy iteration
Value iteration
[Asynchronous versions]
RL algorithms
Q-learning
Sarsa
TD-learning
We are going to see Q-learning
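Before turning to Q-learning, the dynamic-programming route can be illustrated with a minimal value-iteration sketch; the 2-state deterministic MDP below is an invented toy example:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) = max_a [ r + gamma * V(s') ] until the values converge.
# The deterministic 2-state MDP is an illustrative toy.
gamma = 0.9
# P[s][a] = (next_state, reward)
P = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 2.0), "go": (0, 0.0)},
}

def value_iteration(P, gamma, theta=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup over all actions
            v = max(r + gamma * V[s2] for s2, r in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = value_iteration(P, gamma)
```

Here the optimal policy is to reach state 1 and stay there, giving V(1) = 2 / (1 - γ) = 20 and V(0) = 1 + γ·20 = 19.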
14. Q-learning
RL algorithms
Learning by doing
Temporal difference method
Learn directly from raw experience without a model of the
environment’s dynamics
Advantages
No model of the world needed
Good policies before learning the optimal policy
Reacts to changes in the environment
15. Dynamic Programming in Brief
Needs a model of the environment to compute true expected values
A very informative backup
16. Temporal Difference Learning
No model of the world needed
Most incremental method: each update is based on a single sampled transition
17. Q-learning
Based on Q-backups:
    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
The learned action-value function Q directly approximates Q*, independent of the policy being followed
18. Q-learning: Pseudo code
Pseudo code for Q-learning:
    Initialize Q(s,a) arbitrarily
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of the episode):
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a; observe r and s'
            Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            s ← s'
        until s is terminal
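A minimal executable version of this algorithm, run on an invented 1-D corridor task (states 0 to 4; reaching the terminal state 4 yields reward 1):

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions
# move left (-1) / move right (+1); reaching state 4 ends the
# episode with reward 1. The environment is an invented example.
N_STATES, TERMINAL = 5, 4
ACTIONS = (-1, +1)

def env_step(s, a):
    s2 = min(max(s + a, 0), TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # epsilon-greedy choice, breaking ties randomly
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: (Q[(s, b)], rng.random()))
            s2, r = env_step(s, a)
            # Q-backup: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Note that the backup uses max over next actions regardless of how the action was actually chosen, which is why Q approximates Q* independently of the exploration policy.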
30. Some Last Remarks
Exploration regime
Explore vs. exploit
ε-greedy action selection
Soft-max action selection
Initialization of Q-values: be optimistic
Learning rate α
In stationary environments
α(s) = 1 / (number of visits to state s)
In non-stationary environments
α takes a constant value
The higher the value, the higher the influence of recent experiences
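The action-selection rules and the stationary learning-rate schedule above can be sketched as follows; all numeric values are illustrative assumptions:

```python
import math
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps explore (random action), else exploit."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_action(q_values, tau, rng=random):
    """Soft-max (Boltzmann) selection; higher tau -> closer to uniform."""
    m = max(q_values)
    prefs = [math.exp((q - m) / tau) for q in q_values]  # shifted for stability
    x, acc = rng.random() * sum(prefs), 0.0
    for a, p in enumerate(prefs):
        acc += p
        if x < acc:
            return a
    return len(q_values) - 1

def alpha_stationary(visits_to_s):
    """alpha(s) = 1 / (number of visits to state s): averages all
    experiences equally, suited to stationary environments."""
    return 1.0 / visits_to_s
```

Unlike ε-greedy, soft-max selection grades exploration by the Q-values themselves: clearly bad actions are tried less often than near-ties.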
31. Next Class
Reinforcement learning with LCSs