Chapter 5: Monte Carlo Methods
Seungjae Ryan Lee
New method: Monte Carlo method
● Does not assume complete knowledge of the environment
○ Only need experience
○ Can use simulated experience
● Average sample returns
● Use Generalized Policy Iteration (GPI)
○ Prediction: compute value functions
○ Policy Improvement: improve policy from value functions
○ Control: discover optimal policy
Monte Carlo Prediction: v_π
● Estimate v_π from sample returns
● Converges as more returns are observed
Return 1 observed 10 times
Return 0 observed 2 times
Return -1 observed 0 times
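With the counts above, the sample-average estimate works out to (a small worked example, not on the original slide):

```latex
V(s) \approx \frac{10 \cdot 1 + 2 \cdot 0 + 0 \cdot (-1)}{10 + 2 + 0} = \frac{10}{12} \approx 0.83
```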
First-visit MC vs. Every-visit MC
● First-visit
○ Average of returns following first visits to states
○ Studied widely
○ Primary focus for this chapter
● Every-visit
○ Average of returns following all visits to states
○ Extends more naturally to function approximation (Ch. 9) and eligibility traces (Ch. 12)
Sample trajectory: s1 → s2 → s3 → s2 → s3
First-visit MC prediction in Practice:
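The pseudocode image did not survive extraction. Below is a minimal Python sketch of first-visit MC prediction; the `generate_episode(policy)` helper returning a list of (state, action, reward) tuples is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    `generate_episode(policy)` is an assumed helper that returns a list of
    (state, action, reward) tuples for one terminated episode.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for s, _, _ in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the return.
        for t in range(len(episode) - 1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only update if the state does not appear earlier.
            if state not in states[:t]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```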
Blackjack Example
● States: (Sum of cards, Has usable ace, Dealer’s card)
● Action: Hit (request card), Stick (stop)
● Reward: +1, 0, -1 for win, draw, loss
● Policy: request cards if and only if sum < 20
● Difficult to use DP even though the environment dynamics are known
Blackjack Example Results
● Less common experiences have more uncertain estimates
○ ex) States with usable ace
MC vs. DP
● No bootstrapping
● Estimates for each state are independent
● Can estimate the value of a subset of all states
[Figure: Monte Carlo vs. Dynamic Programming]
Soap Bubble Example
● Compute shape of soap surface for a closed wire frame
● Height of surface is average of heights at neighboring points
● Surface must meet the wire frame at its boundary
Soap Bubble Example: DP vs. MC
http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%205.pdf
DP
● Update each point’s height from its neighboring heights
● Iteratively sweep the grid
MC
● Take random walks until the boundary is reached
● Average the sampled boundary heights
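As a concrete illustration of the MC approach (not from the original slides), here is a hedged Python sketch that estimates the height at one interior point by averaging the wire-frame heights reached by random walks; `is_boundary` and `boundary_height` are assumed callbacks describing the frame.

```python
import random

def mc_soap_height(start, is_boundary, boundary_height, num_walks=10000):
    """Estimate the soap-film height at `start` by Monte Carlo random walks.

    `is_boundary(point)` and `boundary_height(point)` are assumed callbacks
    describing the wire frame; each walk moves to a random 4-neighbor until
    it hits the boundary, then records the boundary height there.
    """
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        while not is_boundary((x, y)):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_height((x, y))
    return total / num_walks
```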
Monte Carlo Prediction: q_π
● More useful if model is not available
○ Can determine policy without model
● Converges quadratically to q_π(s, a) given infinitely many samples
● Need exploration: all state–action pairs must be visited infinitely often
https://www.youtube.com/watch?v=qaMdN6LS9rA
Exploring Starts (ES)
● Each episode starts from a specified state–action pair
● Cannot be used when learning from actual interactions
Monte Carlo ES
● Control: approximate optimal policies
● Use Generalized Policy Iteration (GPI)
○ Maintain approximate policy and approximate value function
○ Policy evaluation: Monte Carlo Prediction for one episode with start chosen by ES
○ Policy Improvement: Greedy selection
● No proof of convergence
Monte Carlo ES Pseudocode
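The pseudocode image is missing; the following Python sketch shows the structure of Monte Carlo ES under an assumed environment interface (`env.random_state_action()`, `env.run_episode(state, action, policy)`, and `env.actions(state)` are all assumptions).

```python
from collections import defaultdict

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo ES (sketch): evaluate one episode, then improve greedily.

    Assumed interface: env.random_state_action() returns an exploring-start
    (state, action) pair; env.run_episode(state, action, policy) plays one
    episode and returns [(state, action, reward), ...]; env.actions(state)
    lists the legal actions.
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {}

    for _ in range(num_episodes):
        state, action = env.random_state_action()          # exploring start
        episode = env.run_episode(state, action, policy)
        visited = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:                   # first visit of (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Greedy policy improvement at s.
                policy[s] = max(env.actions(s), key=lambda b: Q[(s, b)])
    return policy, Q
```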
Blackjack Example Revisited
● Prediction → Control
ε-soft Policy
● Avoid exploring starts → Add exploration to policy
● Soft policy: every action has nonzero probability of being selected
● ε-soft policy: every action has probability at least ε/|A(s)| of being selected
● ex) ε-greedy policy
○ Select the greedy action with probability 1 − ε
○ Select an action uniformly at random with probability ε (possibly the greedy action); see the sketch below
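A minimal Python sketch of ε-greedy action selection (the function name and arguments are illustrative, not from the slides):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Select an action under an ε-greedy policy derived from action values Q.

    With probability ε pick uniformly among all actions (which may happen to
    be the greedy one); otherwise pick the greedy action. Every action thus
    has probability at least ε/|A(s)| of being selected.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```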
ε-soft vs ε-greedy
[Figure: policies ordered from greedy (ε = 0) to fully random (ε = 1); an ε-greedy policy with ε = 0.1 is one example of an ε-soft policy with ε = 0.1]
On-policy ε-soft MC control Pseudocode
● On-policy: Evaluate / improve policy that is used to make decisions
On-policy vs. Off-policy
● On-policy: Evaluate / improve policy that is used to make decisions
○ Requires an ε-soft policy: near-optimal, but never optimal
○ Simple, low variance
● Off-policy: Evaluate / improve policy different from that used to generate data
○ Target policy π: the policy to evaluate and improve
○ Behavior policy b: the policy used to take actions (generate data)
○ More powerful and general
○ High variance, slower convergence
○ Can learn from non-learning controller or human expert
Coverage assumption for off-policy learning
● To estimate values under π, every action that π might take must also be taken, at least occasionally, by b
● b must be stochastic in states where it is not identical to π
Importance Sampling
● Trajectories have different probabilities under different policies
● Estimate expected value from one distribution given samples from another
● Weight returns by importance sampling ratio
○ Relative probability of trajectory occurring under the target and behavior policies
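Reconstructed from the book (the equation image was lost), the importance sampling ratio for a trajectory from time t to termination at T is:

```latex
\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```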
Ordinary Importance Sampling
● Zero bias but unbounded variance
● With single return:
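The estimator the slide refers to is presumably the standard one from the book: a simple average of importance-sampling-scaled returns, which with a single observed return reduces to the scaled return itself.

```latex
V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|},
\qquad \text{single return: } V(s) = \rho_{t:T-1} G_t
```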
Ordinary Importance Sampling: Zero Bias
Ordinary Importance Sampling: Unbounded Variance
● 1-state, 2-action undiscounted MDP
● Off-policy first-visit MC
● Variance of an estimator:
Ordinary Importance Sampling: Unbounded Variance
● Consider only the all-left episodes, of varying lengths
○ Any trajectory that ever takes right has an importance sampling ratio of 0
○ An all-left trajectory of length k has an importance sampling ratio of 2^k
Ordinary Importance Sampling: Unbounded Variance
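The missing equation presumably showed the book's calculation: only all-left episodes contribute, and the second moment of the scaled return diverges, so the estimator's variance is infinite.

```latex
\mathbb{E}_b\!\left[\left(\rho_{0:T-1} G_0\right)^2\right]
= \sum_{k=1}^{\infty} 0.1\,(0.9)^{k-1}(0.5)^{k}\left(2^{k}\right)^{2}
= 0.2 \sum_{k=1}^{\infty} 1.8^{\,k-1} = \infty
```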
Weighted Importance Sampling
● Has bias that converges asymptotically to zero
● Strongly preferred due to lower variance
● With single return:
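Again reconstructed from the book: weighted importance sampling uses a weighted average, so a single observed return yields exactly that return (the ratio cancels), which is the source of the bias and of the much lower variance.

```latex
V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},
\qquad \text{single return: } V(s) = G_t
```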
Blackjack example for Importance Sampling
● Evaluated for a single state
○ player’s sum = 13, has usable ace, dealer’s card = 2
○ Behavior policy: uniform random policy
○ Target policy: stick iff player’s sum >= 20
Incremental Monte Carlo
● Update value without tracking all returns
● Ordinary importance sampling:
● Weighted importance sampling:
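The update rules the slide lists (equation images lost) are, in the book's notation, with W_n the importance sampling ratio attached to the n-th return G_n and C_0 = 0:

```latex
\text{Ordinary: } V_{n+1} = V_n + \frac{1}{n}\left(W_n G_n - V_n\right)
\qquad
\text{Weighted: } V_{n+1} = V_n + \frac{W_n}{C_n}\left(G_n - V_n\right), \quad C_{n+1} = C_n + W_n
```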
Incremental Monte Carlo Pseudocode
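The pseudocode image is missing; here is a hedged Python sketch of incremental off-policy prediction with weighted importance sampling (every-visit), where `generate_episode()`, `target_prob(a, s)`, and `behavior_prob(a, s)` are assumed helpers.

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, target_prob, behavior_prob,
                             num_episodes, gamma=1.0):
    """Incremental every-visit off-policy prediction, weighted importance sampling.

    Assumed helpers: generate_episode() returns [(state, action, reward), ...]
    produced by the behavior policy; target_prob(a, s) = pi(a|s) and
    behavior_prob(a, s) = b(a|s).
    """
    V = defaultdict(float)
    C = defaultdict(float)                                 # cumulative weights per state

    for _ in range(num_episodes):
        episode = generate_episode()
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            W *= target_prob(a, s) / behavior_prob(a, s)   # ratio includes A_t for V(s)
            if W == 0.0:
                break                                      # earlier states also get zero weight
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])                # weighted-IS incremental update
    return V
```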
Off-policy Monte Carlo Control
● Off-policy: target policy π and behavior policy b
● Monte Carlo: Learn from samples without bootstrapping
● Control: Find optimal policy through GPI
Off-policy Monte Carlo Control Pseudocode
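The pseudocode image is missing; the sketch below follows the structure of the book's off-policy MC control with weighted importance sampling, using a uniformly random (soft) behavior policy; `env.run_episode(policy_fn)` and `env.actions(state)` are assumed.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (sketch).

    Assumed interface: env.run_episode(policy_fn) plays one episode with
    actions drawn from policy_fn(state) and returns
    [(state, action, reward), ...]; env.actions(state) lists legal actions.
    """
    Q = defaultdict(float)
    C = defaultdict(float)
    greedy = {}                                    # target policy: greedy w.r.t. Q

    def behavior(state):
        # Any soft policy works as behavior; here, uniformly random.
        return random.choice(env.actions(state))

    for _ in range(num_episodes):
        episode = env.run_episode(behavior)
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(env.actions(s), key=lambda b: Q[(s, b)])
            if a != greedy[s]:
                break                              # pi(a|s) = 0, ratio becomes zero
            W *= len(env.actions(s))               # pi(a|s)/b(a|s) = 1 / (1/|A(s)|)
    return greedy, Q
```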
Discounting-aware Importance Sampling: Intuition*
● Exploit return’s internal structure to reduce variance
○ Return = Discounted sum of rewards
● Consider a myopic discount, γ = 0
○ Ratio factors after the first reward are irrelevant to the return: they only add variance
Discounting as Partial Termination*
● Consider discount as degree of partial termination
○ If γ = 0, all episodes terminate after receiving the first reward
○ If 0 < γ < 1, an episode partially terminates after n steps with degree (1 − γ)γ^(n−1)
○ Premature termination results in partial returns
● Full return as a γ-weighted sum of flat (undiscounted) partial returns (see the decomposition below)
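The decomposition this builds on (reconstructed from the book, since the equation image was lost) writes the discounted return as a γ-weighted sum of flat partial returns:

```latex
\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \dots + R_h,
\qquad
G_t = (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\,\bar{G}_{t:h} \;+\; \gamma^{\,T-t-1}\,\bar{G}_{t:T}
```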
Discounting-aware Ordinary Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for Ordinary importance sampling:
● Estimator for Discounting-aware ordinary importance sampling
Discounting-aware Weighted Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for Weighted importance sampling
● Estimator for Discounting-aware weighted importance sampling
Per-decision Importance Sampling: Intuition*
● Unroll returns as sum of rewards
● Ratio factors for steps after a reward are uncorrelated with it (expectation one), so they can be dropped
Per-decision Importance Sampling: Process*
● Simplify expectation
● Equivalent expectation for return
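The simplification presumably referred to is the identity that importance-sampling factors for time steps after a reward have expectation one and drop out:

```latex
\mathbb{E}\bigl[\rho_{t:T-1} R_{t+k}\bigr] = \mathbb{E}\bigl[\rho_{t:t+k-1} R_{t+k}\bigr]
```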
Per-decision Ordinary Importance Sampling*
● Estimator for Ordinary Importance Sampling:
● Estimator for Per-decision Ordinary Importance Sampling (see below):
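Reconstructed from the book (the equation images were lost): the ordinary estimator scales the whole return by ρ_{t:T−1}, whereas the per-decision estimator scales each reward only by the ratio up to its own time step:

```latex
\tilde{G}_t \doteq \rho_{t:t} R_{t+1} + \gamma\,\rho_{t:t+1} R_{t+2} + \dots + \gamma^{\,T-t-1}\rho_{t:T-1} R_T,
\qquad
V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}
```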
Per-decision Weighted Importance Sampling?*
● Unclear whether per-decision weighted importance sampling is possible
● All estimators proposed so far are inconsistent
○ They do not converge asymptotically to the true value
Summary
● Learn from experience (sample episodes)
○ Learn directly from interaction without model
○ Can learn with simulation
○ Can focus on a subset of states
○ No bootstrapping → less harmed by violation of Markov property
● Need to maintain exploration for Control
○ Exploring starts: usually not applicable when learning from real experience
○ On-policy: maintain exploration in policy
○ Off-policy: separate behavior and target policies
■ Importance Sampling
● Ordinary importance sampling
● Weighted importance sampling
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai