This document provides an introduction to reinforcement learning. It discusses how reinforcement learning aims to learn behaviors through trial-and-error interaction with an environment to maximize rewards. The document outlines the basic components of a reinforcement learning problem including states, actions, rewards, and policies. It provides examples of reinforcement learning problems like pole balancing and the mountain car problem to illustrate these concepts. The next class will cover how to learn policies to solve reinforcement learning problems.
Introduction to Reinforcement Learning
1. Introduction to Machine Learning
Lecture 21
Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lectures 5-18
Supervised learning
Data classification
Labeled data
Build a model that covers all the space
Unsupervised learning
Clustering
Unlabeled data
Group similar objects
Association rule analysis
Unlabeled data
Get the most frequent/important associations
Genetic Fuzzy Systems
3. Today’s Agenda
Introduction
Reinforcement Learning
Some examples before going further
4. Introduction
What does reinforcement learning aim at?
Learning from interaction (with environment)
Goal-directed learning
[Diagram: agent-environment interaction loop. The agent perceives the state, takes an action on the environment, and pursues a goal.]
Learning what to do and the effect it has
Trial-and-error search and delayed reward
5. Introduction
Learn reactive behaviors
Behaviors as a mapping between perceptions and actions
The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
6. How Can We Learn It?
1. Look-up tables:
   Perception   Action
   State 1      Action 1
   State 2      Action 2
   State 3      Action 3
   …            …
2. Neural networks
3. Rules
4. Finite automata
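To make option 1 concrete, a look-up table behavior is just a state-to-action map. A minimal sketch in Python (the state and action names are hypothetical placeholders):

```python
# A behavior stored as a look-up table: one action per perceived state.
# State and action names are illustrative placeholders.
policy_table = {
    "state_1": "action_1",
    "state_2": "action_2",
    "state_3": "action_3",
}

def act(state: str) -> str:
    """Return the action the table prescribes for the given perception."""
    return policy_table[state]

print(act("state_2"))  # -> action_2
```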
8. Reinforcement Learning
[Diagram: agent-environment loop. The agent receives state s_t and reward r_t and emits action a_t.]
Reward function: r: S → R, or r: S × A → R
Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent:
observes the state at step t: s_t ∈ S
produces action a_t at step t: a_t ∈ A(s_t)
gets the resulting reward: r_{t+1} ∈ R
moves on to the next state s_{t+1}
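This interaction protocol can be written as a plain loop. Below is a minimal sketch with a toy two-state environment and a random agent; all class and method names are assumptions for illustration, not from any particular library:

```python
import random

class ToyEnvironment:
    """Illustrative environment: two states, binary rewards (placeholder dynamics)."""
    def __init__(self):
        self.state = "s0"

    def step(self, action):
        # Transition to a random next state s_{t+1} and emit reward r_{t+1}.
        self.state = random.choice(["s0", "s1"])
        return self.state, random.choice([0, 1])

class RandomAgent:
    """Agent that picks actions uniformly at random from A(s)."""
    def act(self, state):
        return random.choice(["a0", "a1"])

# Agent and environment interact at discrete time steps t = 0, 1, 2, ...
env, agent = ToyEnvironment(), RandomAgent()
state, total = env.state, 0
for t in range(10):
    action = agent.act(state)         # a_t in A(s_t)
    state, reward = env.step(action)  # observe s_{t+1}, receive r_{t+1}
    total += reward
print("cumulative reward:", total)
```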
9. Reinforcement Learning
[Diagram: the same agent-environment loop as on the previous slide.]
Trace of a trial:
s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → …
Agent goal:
Maximize the total amount of reward it receives
That means maximizing not only the immediate reward, but the cumulative reward in the long run
10. Example of RL
Example: Recycling robot
State
charge level of battery
Actions
look for cans, wait for cans, go recharge
Reward
positive for finding cans, negative for running out of battery
11. More precisely…
Restricting ourselves to Markov Decision Processes (MDPs):
Finite set of states
Finite set of actions
Transition probabilities
Reward probabilities
This means that:
The agent needs to have complete information about the world
State s_{t+1} depends only on state s_t and action a_t
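Formally, the second condition is the Markov property: the distribution of the next state is unchanged by conditioning on any earlier history:

$$ P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) \;=\; P(s_{t+1} = s' \mid s_t, a_t) $$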
12. Recycling Robot Example
[Transition diagram, reconstructed as a table:]
s      a         s'      T(s,a,s')   reward
high   search    high    α           R^search
high   search    low     1 − α       R^search
high   wait      high    1           R^wait
low    search    low     β           R^search
low    search    high    1 − β       −3  (battery depleted; the robot is rescued and recharged)
low    wait      low     1           R^wait
low    recharge  high    1           0
13. Recycling Robot Example
S = {high, low}
A(high) = {wait, search}
A(low) = {wait, search, recharge}
R^search: expected # cans while searching
R^wait: expected # cans while waiting
R^search > R^wait
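The transition diagram from the previous slide can be written down directly as data. A minimal sketch, with placeholder numeric values for α, β, R^search and R^wait (the lecture leaves them symbolic):

```python
# Recycling robot MDP: (state, action) -> list of (next_state, probability, reward).
# alpha, beta and the reward magnitudes are illustrative placeholders.
alpha, beta = 0.9, 0.8
R_SEARCH, R_WAIT = 2.0, 1.0  # expected cans; note R_SEARCH > R_WAIT

T = {
    ("high", "search"):   [("high", alpha, R_SEARCH), ("low", 1 - alpha, R_SEARCH)],
    ("high", "wait"):     [("high", 1.0, R_WAIT)],
    ("low",  "search"):   [("low",  beta,  R_SEARCH), ("high", 1 - beta, -3.0)],
    ("low",  "wait"):     [("low",  1.0, R_WAIT)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

# Sanity check: outgoing transition probabilities sum to 1 for every (s, a).
assert all(abs(sum(p for _, p, _ in outs) - 1.0) < 1e-9 for outs in T.values())
```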
14. Breaking the Markov Property
Possible problems that do not satisfy the MDP assumptions:
When actions and states are not finite
Solution: discretize the set of actions and states (see the sketch after this list)
When transition probabilities do not depend only on the current state
Possible solution: represent states as structures built up over time from sequences of sensations
This is a POMDP (partially observable MDP)
Use POMDP algorithms to solve these problems
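For the discretization workaround mentioned in the first item above, here is a minimal sketch, assuming a single continuous state variable (the range matches the car position in the mountain car problem later in this lecture):

```python
import numpy as np

# Discretize a continuous state variable into a finite number of bins,
# so that tabular (finite-state) MDP methods can be applied.
LOW, HIGH, N_BINS = -1.2, 0.6, 10          # illustrative range and resolution
bin_edges = np.linspace(LOW, HIGH, N_BINS + 1)

def discretize(x: float) -> int:
    """Map a continuous value to an integer bin index in [0, N_BINS - 1]."""
    return int(np.clip(np.digitize(x, bin_edges) - 1, 0, N_BINS - 1))

print(discretize(-1.2), discretize(0.0), discretize(0.6))  # 0 6 9
```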
16. Elements of RL
Policy: what to do
Reward: what's good
Value: what's good because it predicts reward
Model: what follows what
17. Components of an RL Agent
Policy (behavior)
Mapping from states to actions
π*: S → A
Reward
Local (immediate) reward at step t: r_t
Model
Probability of transitioning from state s to s' by executing action a: T(s, a, s')
The transition probabilities depend only on these parameters
The model is not known by the agent
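These components are easy to picture as data: the policy is a state-to-action table, and T(s, a, ·) is a distribution over next states that can be sampled. A minimal sketch with hypothetical states, actions, and probabilities:

```python
import random

pi = {"s0": "a1", "s1": "a0"}  # policy: a table from states to actions

def T(s, a):
    """Model: distribution over next states for (s, a). Values are illustrative."""
    return {"s0": 0.7, "s1": 0.3}

def sample_next_state(s, a):
    """Draw s' according to T(s, a, .)."""
    states, probs = zip(*T(s, a).items())
    return random.choices(states, weights=probs)[0]

s = "s0"
s_next = sample_next_state(s, pi[s])  # one simulated step under policy pi
```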
18. Components of an RL Agent
Value functions
V^π(s): long-term reward estimate from state s, following policy π
Q^π(s,a): long-term reward estimate from state s, executing action a and then following policy π
A simple example: a maze
Note that the agent does not know its own position; it can only perceive what it has in the surrounding states
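In standard notation (with γ the discount factor introduced later in this lecture), these two estimates are defined as expected discounted returns:

$$ V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right] \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right] $$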
20. Pursuing the goal: Maximize long-term reward
21. Goals and Rewards
OK, but I need to maximize my long-term reward. How do I get the long-term reward?
The long-term reward is defined in terms of the goal of the agent
The agent receives the local reward at each time step
How?
Intuitive idea: sum all the rewards obtained so far
Problem: the sum can grow without bound in non-ending tasks
22. Goals and Rewards
How can we deal with non-ending tasks?
Weighted addition of local rewards
The γ parameter (0 < γ < 1) is the discounting factor
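Written out, this weighted addition is the standard discounted return:

$$ R_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} $$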
[Diagram: the same trace of a trial as on slide 9.]
Note the bias for immediate rewards
If you want to avoid it, set γ close to 1
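A minimal sketch computing this quantity for a finite trace of rewards (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k+1} over a finite trace r_{t+1}, r_{t+2}, ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 1, 1]                # r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, 0.9))   # 1 + 0.9**3 + 0.9**4 = 2.3851
print(discounted_return(rewards, 0.99))  # close to the plain sum of 3
```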
23. Some examples
24. Pole balancing
Balance the pole
The cart can move forward and backward
Avoid failure:
the pole falling beyond a certain critical angle
the cart hitting the end of the track
Reward
−1 upon failure
return of −γ^k, for failure occurring k steps later
25. Mountain Car Problem
Objective
Get to the top of the hill as quickly as possible
State definition:
Car position and speed
Actions
Forward, reverse, none
Reward
−1 for each step that the car is not on the top of the hill
(i.e., the return is minus the number of steps before reaching the top of the hill)
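For experimentation, an environment with exactly this setup is available in the Gymnasium library as MountainCar-v0 (an assumption about tooling; the lecture does not prescribe any library). A minimal random-policy rollout:

```python
# Random rollout on the mountain car task using the Gymnasium library
# (pip install gymnasium): state = (position, velocity), three actions
# (reverse, none, forward), reward = -1 per step until the goal is reached.
import gymnasium as gym

env = gym.make("MountainCar-v0")
obs, info = env.reset(seed=0)            # obs = [position, velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # 0 = reverse, 1 = none, 2 = forward
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated       # episodes are truncated after 200 steps
print("return:", total_reward)           # -200.0 unless the random policy gets lucky
```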
26. Next Class
How to learn the policies
27. Introduction to Machine Learning
Lecture 21
Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull