Introduction to Reinforcement Learning
Part II: Basic tabular methods in RL
Mikko Mäkipää 31.1.2022
Agenda
• Last time: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman
equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy
iteration
Agenda
• This time: Part II
• Some more building blocks: GPI, bandits, exploration, TD updates,…
• Basic model-free methods using tabular value representation
• …illustrated on Blackjack: Monte Carlo on- vs off-policy; Sarsa, Expected Sarsa, Q-learning
• Next time: Part III
• Value function approximation-based methods
• Semi-gradient descent with Sarsa and different linear representations; polynomial, tile
coding, Fourier cosine basis
• Batch updates LSPI-LSTDQ
Recap – we’ll be briefly revisiting the
following concepts
• RL problem setting; Agent and environment
• Markov Decision Process, MDP
• Policy
• Policy Iteration
• Discounted return, Utility
• Value function, state-value function, action value function
• Bellman equations, update rules (backups)
RL problem setting: Agent and environment
Agent performs action
Agent observes
environment state
and reward
Environment
Agent
*) This would be a fully observable environment
RL problem setting: Agent and environment
Agent performs action
Agent observes
environment state
and reward
Environment
Agent
Agent models the environment as
a Markov Decision Process
Agent maintains a policy
that defines what action to
take when in a state
Agent approximates the value function
of each state and action
Agent creates an internal
representation
of state
Markov Decision Process (MDP)
A Markov Decision Process is a tuple (S, A, P, R, γ), with
• States S
• Actions A
• Transition probabilities P(s′ | s, a)
• Rewards R, given by a reward function R(s, a, s′)
• Discount factor γ, with 0 ≤ γ ≤ 1
• When we know all of these, we have a fully defined MDP
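For tabular methods, everything in this tuple can be written down explicitly. A minimal Python sketch of such a fully defined MDP; the dictionary layout and the two-state example are illustrative assumptions, not taken from the slides:

```python
from typing import Dict, List, Tuple

# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. the transition probabilities and the reward function together.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

states: List[str] = ["s0", "s1"]                     # S
actions: List[str] = ["stay", "move"]                # A
P: Transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "move": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "move": [(1.0, "s0", 0.0)]},
}
gamma = 0.9                                          # discount factor
```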
From fully defined MDPs to model-free
methods
Methods for fully defined MDPs
• We know all of: states, actions, transition probabilities, rewards and discount factor
• We can use Dynamic Programming algorithms, such as policy iteration, to find the optimal value function and policy
• Algorithms typically use full sweeps of the state space to update values

Model-based methods
• We know states, actions, and discount factor (from the problem definition)
• A model gives a prediction of the next state and reward when taking an action in a state, so we use a model to estimate transition probabilities and rewards
• We can use the model-augmented MDP with DP methods
• Or learn our model from experience, as in model-free methods
• Or create simulated experience based on our model for use with model-free methods

Model-free methods
• We know states, actions, and discount factor (from the problem definition)
• We don't know transition probabilities or rewards; we don't have a model, and don't need one
• We employ an agent to explore the environment, and use the agent's direct experience to update our estimates of the value function and policy
• We can use episodes or individual action steps to update values
• We need some approach for selecting which states and actions the agent explores
We discussed Policy iteration
– a DP algorithm for fully-defined MDPs
• Perform policy evaluation - evaluate
state values under current policy*
• Improve the policy by determining the
best action in each state using the
state-value function determined
during policy evaluation step
• Stop when policy no longer changes
• Each improvement step improves the
value function and when improvement
stops, the optimal policy and value
function have been found
*) using either iterative policy evaluation or solving the linear system
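A compact sketch of policy iteration over the dictionary-style MDP above, using iterative policy evaluation rather than solving the linear system; the convergence threshold `theta` is an illustrative choice:

```python
def policy_iteration(states, actions, P, gamma, theta=1e-8):
    """Policy iteration for a fully defined tabular MDP.
    P[s][a] is a list of (probability, next_state, reward) triples."""
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}            # arbitrary initial policy

    def backup(s, a):                                   # one-step lookahead
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in states:
                v_new = backup(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: make the policy greedy w.r.t. the evaluated values.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: backup(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                      # policy unchanged -> optimal
            return policy, V

# e.g. policy_iteration(states, actions, P, gamma) on the two-state example above
```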
Towards Generalized Policy Iteration
source: David Silver: UCL Course on RL
The process of making a new policy that improves on
an original policy, by making it greedy with respect to
the value function of the original policy, is called
policy improvement
Policy improvement theorem: if q_π(s, π′(s)) ≥ v_π(s) for all states s, then π′ is at least as good as π, i.e. v_π′(s) ≥ v_π(s) for all s
Blackjack as an MDP
• States: current set of cards for dealer and
player
• Actions: HIT – more cards, STAND – no
more cards
• Transition probabilities: A stochastic
environment due to randomly drawing
cards – exact probabilities difficult to
determine, though
• Rewards: At the end of episode: +1 for
winning, -1 for losing, 0 for a draw; and 0
otherwise
• Discounting: not used, γ = 1
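A simplified sketch of this Blackjack MDP as a simulator, assuming an infinite deck (cards drawn with replacement), ignoring the dealer's hole card, and using the fixed dealer rule of hitting below 17; all names here are illustrative:

```python
import random

def draw():
    """Draw a card from an infinite deck: ace counts as 1 here, face cards as 10."""
    return min(random.randint(1, 13), 10)

def total(raw_sum, aces):
    """Best hand value: count one ace as 11 when that does not bust ("usable ace")."""
    if aces and raw_sum + 10 <= 21:
        return raw_sum + 10, True
    return raw_sum, False

def play_episode(policy):
    """One episode; `policy` maps a state (dealer_card, player_total, usable_ace)
    to 'HIT' or 'STAND'. Returns the terminal reward: +1 win, -1 loss, 0 draw."""
    dealer_card = draw()
    raw, aces = 0, 0
    for _ in range(2):                       # player's starting hand
        c = draw(); raw += c; aces += (c == 1)
    while True:
        player, usable = total(raw, aces)
        if player > 21:
            return -1                        # player busts
        if policy((dealer_card, player, usable)) == 'STAND':
            break
        c = draw(); raw += c; aces += (c == 1)
    d_raw, d_aces = dealer_card, int(dealer_card == 1)
    while total(d_raw, d_aces)[0] < 17:      # dealer hits until reaching 17
        c = draw(); d_raw += c; d_aces += (c == 1)
    dealer, _ = total(d_raw, d_aces)
    if dealer > 21 or player > dealer:
        return 1
    return 0 if player == dealer else -1
```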
Blackjack as an MDP - example
[Diagram: the dealer shows a 5, the player's cards sum to 13 with no usable ace; the state is written (5,13,no ace) and the chosen action is HIT]
Blackjack as an MDP - example
[Diagram: from state (5,13,no ace), action HIT leads here to state (5,20,no ace) with reward 0; from (5,20,no ace), action STAND lets the dealer play out, the dealer busts, and the reward is +1]
• If in the previous state the player's sum is 13, then here we have 8 possible states where the player's sum is 14, 15, …, 21
• If in the previous state the player has 20, then the only state here is 21; all other cards lead to a loss
Policy
• Policy defines how the agent behaves
in an MDP environment
• Policy is a mapping from each state
to an action
• A deterministic policy always
returns the same action for a state
• A stochastic policy gives a
probability for an action in a state
One possible deterministic policy for the maze
Flipsism (Höpsismi) as a policy
• A random policy
• For MDPs with two actions in each
state
• Equal probability for choosing either
action = 0,5
Multi-armed bandits
– a slightly more formal approach to stochastic policies
• We can choose from four actions:
a, b, c or d
• Whenever we choose an action,
we receive a reward with an
unknown probability distribution
• We have now chosen an action
six times, a and b twice, c and d
once
• We have received the rewards
shown
• We want to maximize the reward
we receive over time
• What action would you select
next, why?
This would be a 4-armed bandit
Multi-armed bandits
• Now we have selected each
action six times and the
reward situation is as
shown
• How would you continue
from here? Why?
Exploration vs exploitation
• Exploitation: we exploit the
information we already have to
maximize reward
• Maintain estimates of the
values of actions
• Select the action whose
estimated value is greatest
• This is called the greedy
action
• Exploration: we choose some
other action than the greedy
one to gain information and to
improve our estimates
We have now chosen an action 40 000 times: 10 000 times each of a, b, c and d
We can estimate that we have lost about 85 000 in value compared to the optimal
strategy of choosing b every time
Epsilon-greedy policy
• The ε-greedy policy is a strategy to balance exploration and exploitation
• We choose the greedy action with probability 1 − ε, and a random action with probability ε
• For this to work in theory, all state-action pairs are to be visited infinitely often and ε needs
to decrease towards zero, so that the policy converges to the greedy policy*
• In practice, it might be enough to decrease epsilon so that the policy moves towards the greedy policy
• A simple, often proposed strategy is to decrease epsilon as ε = 1/k with episode count k, but this might
be a bit fast in practice
*) GLIE: Greedy in the Limit with Infinite Exploration
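Two possible decay schedules, sketched for comparison; the 1/k schedule is the often-proposed one mentioned above, while the visit-count-scaled variant (with made-up constants) decays more slowly:

```python
def epsilon_one_over_k(k):
    """The often-proposed GLIE-style schedule: epsilon decays as 1/k with episode count k."""
    return 1.0 / max(k, 1)

def epsilon_visit_scaled(n_visits, eps0=0.2, scale=100.0):
    """A slower alternative: stays near eps0 for rarely visited state-action pairs
    and decays only as the visit count n_visits grows (constants are illustrative)."""
    return eps0 * scale / (scale + n_visits)
```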
ε-greedy method for bandits
• It was said that ”we maintain estimates
of the values of actions”
• For this, we use incrementally
computed sample averages:
N(a) ← N(a) + 1, Q(a) ← Q(a) + (1/N(a)) · (R − Q(a))
• And use an ε-greedy policy for selecting an
action
Source: Sutton-Barto 2nd ed
For calculating the incremental mean, we maintain two parameters:
N, the current visit count for each action (selecting a bandit), and
Q, the current estimated value for the action
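A minimal sketch of this sample-average ε-greedy bandit method, keeping the N (visit count) and Q (estimated value) tables; the reward distributions in the example are made up:

```python
import random

def run_bandit(arms, steps=1000, epsilon=0.1):
    """Sample-average epsilon-greedy bandit. `arms` is a list of callables,
    one per action, each returning a sampled reward."""
    k = len(arms)
    N = [0] * k                    # visit count per action
    Q = [0.0] * k                  # incrementally computed sample-average value
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                    # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])      # exploit: greedy action
        r = arms[a]()
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental mean: Q <- Q + (1/N)(R - Q)
    return Q, N

# Example: a 4-armed bandit with Gaussian rewards (means are illustrative).
Q, N = run_bandit([lambda: random.gauss(1.0, 1.0), lambda: random.gauss(3.0, 1.0),
                   lambda: random.gauss(0.5, 1.0), lambda: random.gauss(2.0, 1.0)])
```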
General update rule for RL
• Note the format of the update rule in the method on the previous slide
• We can consider the form
NewEstimate ← OldEstimate + StepSize · [ Target − OldEstimate ]
as a general update rule, where Target represents our current target value,
[ Target − OldEstimate ] is the error of our current estimate, and StepSize is a decreasing step-size or
learning-rate parameter
• Expect to see more of these soon…
Discounted return, utility
• An agent exploring the MDP environment would observe a sequence S_0, A_0, R_1, S_1, A_1, R_2, S_2, …
• Discounted return, or utility, from time step t onwards is the sum of discounted
rewards received:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
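A small check of the definition, accumulating the discounted return backwards over a plain list of rewards (the example numbers are arbitrary):

```python
def discounted_return(rewards, gamma):
    """Utility G_t for the reward sequence R_{t+1}, R_{t+2}, ..., given as a list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
    return g

# Rewards 0, 0, +1 with gamma = 0.9 give 0 + 0.9*0 + 0.9**2 * 1 = 0.81
assert abs(discounted_return([0, 0, 1], 0.9) - 0.81) < 1e-9
```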
The state-value function
• If the agent was following a policy π, then in each state s, the agent would select
the action defined by that policy
• The state-value function of a state s under policy π, denoted v_π(s), is the expected
discounted return when following the policy from state s onwards:
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]
• The recursive relationship between the value of a state and its successor states is
called the Bellman expectation equation for the state-value function
The action-value function
• The action-value function q_π(s, a) for policy π defines the expected utility when
starting in state s, performing action a and following the policy π thereafter:
q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
State-action value function
[Action-value table for the earlier Blackjack example, with the states (5,13,no ace) and (5,20,no ace) highlighted]
• State-action value when the state is (5,13,no ace) and the action is HIT: Q(S,A) ≈ −0,255
• State-action value when the state is (5,20,no ace) and the action is STAND: Q(S,A) ≈ 0,669
Greedy policy from action-value function
• To derive the policy from the state-value
function v(s), we need to know the
transition probabilities and rewards:
π(s) = argmax_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ v(s′) ]
• But we can extract the policy directly
from the action-value function:
π(s) = argmax_a q(s, a)
• So, working with q(s, a) enables us to be
model-free
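The two routes sketched side by side: extracting a greedy policy from v requires the model (transition triples as in the MDP sketch earlier), while extracting it from q is a plain argmax over the table; the function names are illustrative:

```python
def greedy_from_v(states, actions, P, gamma, V):
    """Needs the model: one-step lookahead through transition probabilities and rewards."""
    return {s: max(actions,
                   key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in states}

def greedy_from_q(states, actions, Q):
    """Model-free: read the greedy action straight off the action-value table Q[(s, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```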
First RL algorithm: Monte Carlo
• Sample a full episode from the MDP using an ε-greedy policy
• For each state-action pair, estimate its value using the average of sampled returns
• Maintain visit counts for each state-action pair
• Update value estimates as an incremental average of the observed returns
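A sketch of this on-policy Monte Carlo control loop (every-visit variant), assuming a gym-style environment with reset() returning a state, step(a) returning (next_state, reward, done), and an `actions` attribute listing the action set:

```python
import random
from collections import defaultdict

def mc_control_on_policy(env, n_episodes, gamma=1.0, epsilon=0.1):
    """On-policy every-visit Monte Carlo control with epsilon-greedy exploration."""
    Q = defaultdict(float)            # state-action values
    N = defaultdict(int)              # state-action visit counts
    actions = list(env.actions)       # assumed: the env exposes its action set

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        # 1. Sample a full episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        # 2. Walk backwards, accumulating returns and updating incremental averages.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental average of returns
    return Q
```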
One more concept: on-policy vs off-policy
• On-policy learning: apply a policy to choose actions and learn the value-function
for that policy
• Monte Carlo algorithm presented in the previous slide is an on-policy method
• In practice, we start with a stochastic policy to sample all possible state-action
pairs
• and gradually adjust the policy towards a deterministic optimal policy (GLIE?)
• Off-policy learning: apply one policy, but learn the value function for some other policy
• Typically in off-policy learning, we follow a behavior policy that allows for
exploration, and learn about an optimal target policy
Towards Off-policy Monte Carlo
• To use returns generated by the behavior policy b to evaluate the target policy π, we
apply importance sampling, a technique to estimate expected values for one
distribution using samples from another
• The probability of observing a sequence of actions and states A_t, S_{t+1}, A_{t+1}, …, S_T under policy π is
Π_{k=t}^{T−1} π(A_k | S_k) p(S_{k+1} | S_k, A_k)
• We form importance sampling ratios, the ratio of the probabilities of the sequence
under the target and behavior policies; the unknown transition probabilities cancel out, leaving
ρ_{t:T−1} = Π_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)
• And apply those to weight our observed returns
Off-policy Monte Carlo
generate episode
iterate backwards
accumulate discounted returns
MC update, now with importance sampling
policy improvement, greedy wrt value func
incremental weight update
Source: Sutton-Barto 2nd ed
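A sketch in the spirit of the referenced Sutton-Barto pseudocode: off-policy MC control with weighted importance sampling, a uniformly random behavior policy, and a greedy target policy; the gym-style environment interface (reset()/step()/actions) is an assumption:

```python
import random
from collections import defaultdict

def mc_control_off_policy(env, n_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling."""
    actions = list(env.actions)
    Q = defaultdict(float)
    C = defaultdict(float)                      # cumulative importance-sampling weights
    target = {}                                 # greedy target policy

    for _ in range(n_episodes):
        # Generate an episode with the random behavior policy b.
        episode, s, done = [], env.reset(), False
        while not done:
            a = random.choice(actions)
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):       # iterate backwards
            G = r + gamma * G                   # accumulate discounted return
            C[(s, a)] += W                      # incremental weight update
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])     # weighted IS update
            target[s] = max(actions, key=lambda x: Q[(s, x)])  # greedy policy improvement
            if a != target[s]:                  # remaining ratio would be zero
                break
            W *= len(actions)                   # pi(a|s) = 1 (greedy), b(a|s) = 1/|A|
    return Q, target
```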
Temporal-difference methods
• Recall our general update rule from a couple of slides back
• Monte Carlo methods use the return from a full episode as the learning target
• Temporal-difference methods instead bootstrap: the target is a one-step sample return,
the immediate reward plus the discounted current value estimate of the next state
• We can apply temporal-difference methods with incomplete sequences, or when
we don’t have terminating episodes
If one had to identify one idea as central and novel to reinforcement learning, it would
undoubtedly be temporal-difference (TD) learning
- Sutton and Barto
Recap: Bellman eqs
First TD algorithm: Sarsa
• Generate samples from the MDP using an ε-greedy policy
• For each sample, update the state-action value using the discounted one-step sample return:
Q(S, A) ← Q(S, A) + α [ R + γ Q(S′, A′) − Q(S, A) ]
where R + γ Q(S′, A′) is the TD-target, the term in brackets is the TD-error,
and α is the learning-rate parameter
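A tabular Sarsa sketch matching the update above, again assuming a gym-style environment interface (reset()/step()/actions); the constants are illustrative:

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa: on-policy one-step TD control."""
    actions = list(env.actions)
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            td_target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])   # Q <- Q + alpha * TD-error
            s, a = s2, a2
    return Q
```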
Three TD algorithms in just one slide
• Sarsa: samples the target R + γ Q(S′, A′)
• Q-learning: samples the target R + γ max_a Q(S′, a)
• Expected Sarsa: samples the target R + γ Σ_a π(a | S′) Q(S′, a)
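The three one-step targets written out, assuming Q is a table keyed by (state, action) and pi(a, s) gives the policy's probability of action a in state s:

```python
# All three use the update Q(S, A) <- Q(S, A) + alpha * (target - Q(S, A));
# they differ only in how the next state's value enters the target.

def sarsa_target(r, s2, a2, Q, gamma):
    return r + gamma * Q[(s2, a2)]                              # sampled next action

def q_learning_target(r, s2, actions, Q, gamma):
    return r + gamma * max(Q[(s2, a)] for a in actions)         # greedy next action

def expected_sarsa_target(r, s2, actions, Q, gamma, pi):
    return r + gamma * sum(pi(a, s2) * Q[(s2, a)] for a in actions)  # expectation over policy
```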
Q-learning again
• Considered as “one of the early
breakthroughs in RL”
• published by Watkins in 1989
• It is an off-policy algorithm that directly
approximates the optimal action-value
function
• State-action pairs are selected for
evaluation by an ε-greedy behavior policy
• But the next action, and thus the next
state-action value used in the update, is
replaced by the greedy action for that
state
Source: Sutton-Barto 2nd ed
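A tabular Q-learning sketch: the behavior policy is ε-greedy, but the update bootstraps from the greedy action in the next state; the gym-style environment interface is assumed:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning: off-policy one-step TD control."""
    actions = list(env.actions)
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                       # behavior policy explores
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target
            s = s2
    return Q
```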
Simulation experiments: Reference result
Greedy policy Action-value function Difference in value between actions
Monte Carlo Off-policy; 100 000 000 episodes; random behavior policy, ; no discounting
Monte Carlo
On-policy
• 100 000 learning episodes
• Decreasing epsilon
according to state-action
visit count:
• Initial epsilon
•
Learning results: Action value function
So, this illustrates Monte Carlo on-policy
after 100 000 learning episodes
Battle of TD-agents
• Participating agents:
• Monte Carlo on-policy as episodic reference, on-policy, decreasing epsilon
• Sarsa, on-policy, decreasing epsilon
• Expected Sarsa, as on-policy, decreasing epsilon
• Expected Sarsa, as off-policy, random behavior policy,
• Q-learning, random behavior policy,
• 100 000 learning episodes for each
• Schedule for alpha: exponential, with target at
• Target rounds 90 000, initial 0,2 → target 0,01
• Schedule for epsilon: scaled by state-action visit count
MSE and wrong action calls*
*) When compared to reference case
Q-learning
• 100 000 learning episodes
• Constant epsilon:
So…
• We have covered basic model-free RL
algorithms
• Algorithms that learn from episodes or
from TD-updates
• That apply GPI; they work with a value
function, in particular the state-action value
function, and derive the corresponding policy from it
• That store the values of state-action pairs, i.e.
use a tabular value representation
