1. Introduction to Machine Learning
Lecture 22
Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 21
Value functions
Vπ(s): Long-term reward estimation
from state s following policy π
Qπ(s,a): Long-term reward estimation
from state s executing action a and then following policy π
The long-term reward is a recency-weighted average of the received rewards
… st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 → …
Slide 2
Artificial Intelligence Machine Learning
3. Recap of Lecture 21
Policy
A policy, π, is a mapping from states, s∈S, and actions,
a∈A(s), to the probability π(s, a) of taking action a when in
state s.
4. Today’s Agenda
Bellman equations for value functions
Optimal policy
Learning the optimal policy
Q-learning
5. Let’s Estimate the Future Reward
We want to estimate the expected future reward, given a certain state and a policy π
For the state-value function Vπ(s):
    Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s ]
For the action-value function Qπ(s,a):
    Qπ(s,a) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s, at = a ]
6. Bellman Equation for a Policy π
Playing a little with the equations:
    Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ^2 rt+3 + … | st = s ]
Therefore:
    Vπ(s) = Eπ[ rt+1 + γ Vπ(st+1) | st = s ]
Finally:
    Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
7. Q-value Bellman Equation
If we estimate the q-value:
    Qπ(s,a) = Σs' P(s'|s,a) [ R(s,a,s') + γ Σa' π(s',a') Qπ(s',a') ]
8. Calculation of Value Functions
How to calculate the value functions for a given policy:
1. Solve a set of linear equations: the Bellman equation for Vπ is a system of |S| linear equations in |S| unknowns
2. Iterative method (convergence proved): calculate the values by sweeping repeatedly through the states
3. Greedy methods
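The iterative method (option 2 above) can be sketched in Python. The 2-state MDP below is an invented toy example, not from the lecture:

```python
# Iterative policy evaluation: sweep through the states, replacing
# V(s) with the right-hand side of the Bellman equation until the
# values stop changing. The 2-state MDP is an illustrative toy.

# P[s][a] = list of (probability, next_state, reward) transitions
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
policy = {s: {0: 0.5, 1: 0.5} for s in P}  # equiprobable pi(s, a)
gamma = 0.9

def policy_evaluation(P, policy, gamma, theta=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:  # one sweep through the state space
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # converged
            return V

V = policy_evaluation(P, policy, gamma)
```

At convergence V satisfies the Bellman equation for π to within the tolerance theta.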
9. Example: The Gridworld
Rewards
-1 if the agent goes off the grid
0 in all other states, except for states A and B
From A, all four actions yield a reward of 10 and take the agent to A’
From B, all four actions yield a reward of 5 and take the agent to B’
[Figure (b): the resulting state-value function Vπ, obtained by solving the Bellman equations]
Policy = equal probability for each movement
γ = 0.9
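The gridworld values can be obtained by solving the Bellman linear system directly. A sketch, assuming the classic 5×5 layout with A at (0,1), A' at (4,1), B at (0,3), B' at (2,3) (the grid size and cell positions are not stated in the text):

```python
import numpy as np

# Direct solution of the gridworld's Bellman system,
# (I - gamma * P_pi) v = r_pi, for the equiprobable policy.
# The 5x5 layout and the positions of A, A', B, B' below are
# assumptions (they are not stated in the text).
N = 5
gamma = 0.9
A, A2 = (0, 1), (4, 1)   # from A every action gives +10, jumps to A'
B, B2 = (0, 3), (2, 3)   # from B every action gives +5, jumps to B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

idx = lambda row, col: row * N + col
P = np.zeros((N * N, N * N))   # P[s, s'] under the equiprobable policy
R = np.zeros(N * N)            # expected one-step reward per state
for row in range(N):
    for col in range(N):
        s = idx(row, col)
        for dr, dc in MOVES:   # each move has probability 1/4
            if (row, col) == A:
                P[s, idx(*A2)] += 0.25
                R[s] += 0.25 * 10
            elif (row, col) == B:
                P[s, idx(*B2)] += 0.25
                R[s] += 0.25 * 5
            elif not (0 <= row + dr < N and 0 <= col + dc < N):
                P[s, s] += 0.25        # off the grid: stay put, reward -1
                R[s] += 0.25 * -1
            else:
                P[s, idx(row + dr, col + dc)] += 0.25

V = np.linalg.solve(np.eye(N * N) - gamma * P, R).reshape(N, N)
```

Under this layout the highest value appears at A, even though its value is below 10: from A the agent is sent to A' near the bottom edge, where it tends to bump into the wall.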
10. Looking for the Optimal Policy
11. Optimal Policy
We search for a policy that achieves a lot of reward over the long run
Value functions enable us to define a partial order over
policies
A policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states: π ≥ π’ iff Vπ(s) ≥ Vπ’(s) for all s ∈ S
Optimal policies π* share the same optimal state-value function V*, which can be written as:
    V*(s) = max_π Vπ(s), for all s ∈ S
13. Focusing on the Objective
We want to find the optimal policy
There are many methods for this purpose
Dynamic programming
Policy iteration
Value iteration
[Asynchronous versions]
RL algorithms
Q-learning
Sarsa
TD-learning
We are going to see Q-learning
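Before turning to Q-learning, the dynamic-programming route can be illustrated with a minimal value-iteration sketch; the 2-state deterministic MDP below is an invented toy example:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) = max_a [ r + gamma * V(s') ] until the values converge.
# The deterministic 2-state MDP is an illustrative toy.
gamma = 0.9
# P[s][a] = (next_state, reward)
P = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 2.0), "go": (0, 0.0)},
}

def value_iteration(P, gamma, theta=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup over all actions
            v = max(r + gamma * V[s2] for s2, r in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = value_iteration(P, gamma)
```

Here the optimal policy is to reach state 1 and stay there, giving V(1) = 2 / (1 - γ) = 20 and V(0) = 1 + γ·20 = 19.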
14. Q-learning
RL algorithms
Learning by doing
Temporal difference method
Learn directly from raw experience without a model of the
environment’s dynamics
Advantages
No model of the world needed
Good policies before learning the optimal policy
Reacts to changes in the environment
15. Dynamic Programming in Brief
Needs a model of the environment to compute true expected values
A very informative backup
16. Temporal Difference Learning
No model of the world needed
Most incremental method: each update is based on a single sampled transition
17. Q-learning
Based on Q-backups:
    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
The learned action-value function Q directly approximates Q*, independent of the policy being followed
18. Q-learning: Pseudo code
Pseudo code for Q-learning:
    Initialize Q(s,a) arbitrarily
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of the episode):
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a; observe r and s'
            Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            s ← s'
        until s is terminal
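A minimal executable version of this algorithm, run on an invented 1-D corridor task (states 0 to 4; reaching the terminal state 4 yields reward 1):

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions
# move left (-1) / move right (+1); reaching state 4 ends the
# episode with reward 1. The environment is an invented example.
N_STATES, TERMINAL = 5, 4
ACTIONS = (-1, +1)

def env_step(s, a):
    s2 = min(max(s + a, 0), TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # epsilon-greedy choice, breaking ties randomly
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: (Q[(s, b)], rng.random()))
            s2, r = env_step(s, a)
            # Q-backup: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Note that the backup uses max over next actions regardless of how the action was actually chosen, which is why Q approximates Q* independently of the exploration policy.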
30. Some Last Remarks
Exploration regime
Explore vs. exploit
ε-greedy action selection
Soft-max action selection
Initialization of Q-values: be optimistic
Learning rate α
In stationary environments
α(s) = 1 / (number of visits to state s)
In non-stationary environments
α takes a constant value
The higher the value, the higher the influence of recent experiences
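The action-selection rules and the stationary learning-rate schedule above can be sketched as follows; all numeric values are illustrative assumptions:

```python
import math
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps explore (random action), else exploit."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_action(q_values, tau, rng=random):
    """Soft-max (Boltzmann) selection; higher tau -> closer to uniform."""
    m = max(q_values)
    prefs = [math.exp((q - m) / tau) for q in q_values]  # shifted for stability
    x, acc = rng.random() * sum(prefs), 0.0
    for a, p in enumerate(prefs):
        acc += p
        if x < acc:
            return a
    return len(q_values) - 1

def alpha_stationary(visits_to_s):
    """alpha(s) = 1 / (number of visits to state s): averages all
    experiences equally, suited to stationary environments."""
    return 1.0 / visits_to_s
```

Unlike ε-greedy, soft-max selection grades exploration by the Q-values themselves: clearly bad actions are tried less often than near-ties.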
31. Next Class
Reinforcement learning with LCSs