Chapter 1: Introduction
Richard S. Sutton and Andrew G. Barto
Reinforcement Learning: A Computational Approach to Learning from Interaction with an Environment
Policies
Value Functions
Rewards
Models
Motivating Example: Cartpole Balancing
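The original slide presents the cartpole task as a figure. As a stand-in, here is a minimal interaction loop against the CartPole environment from the Gymnasium library (an assumption for illustration; the slides do not name a specific library). A random policy stands in where a learned one would go:

```python
import gymnasium as gym  # pip install gymnasium

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random action; a learned policy would go here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")  # +1 per step the pole stays balanced
```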
Mapping States to Actions to Maximize a Reward Signal
Key Challenges of RL:
1. Search: Exploration-Exploitation
2. Delayed Reward
- Agents must consider more than the immediate reward, because acting greedily can result in less future reward
Exploration-Exploitation
● To obtain a lot of reward, a reinforcement learning agent must
prefer actions that it has tried in the past and found to be
effective in producing reward
● But to discover such actions, it has to try actions that it has not
selected before
Exploration-Exploitation
● Exploit (act greedily) w.r.t. what it has already experienced, to maximize reward
● Explore (act non-greedily): take actions that do not have the maximum expected reward, in order to learn more about them and make better selections in the future (see the ε-greedy sketch below)
● In stochastic tasks, each action must be tried many times to gain a reliable estimate of its expected reward
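A minimal sketch of this trade-off in Python: ε-greedy selection over per-action value estimates. The ε-greedy scheme and the names here are illustrative; the slides only describe the trade-off in words.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index from a list of estimated action values.

    With probability epsilon, explore (uniform random action);
    otherwise exploit (the action with the highest current estimate).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Example: with estimates [0.2, 0.9, 0.5], action 1 is chosen ~90% of the time.
action = epsilon_greedy([0.2, 0.9, 0.5], epsilon=0.1)
```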
4 Key Elements of Reinforcement Learning
● Policy
● Reward
● Value Function
● Model (Optional)
Policy
● The mapping from states to actions
● Defines the agent’s behavior
● Policies are often stochastic: we sample an action from a probability distribution, whereas in supervised learning we would take the argmax of the distribution (see the sketch below)
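A sketch of the sampling-vs-argmax distinction, assuming a hypothetical policy output over three actions for a single state:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
action_probs = np.array([0.7, 0.2, 0.1])  # hypothetical policy distribution pi(a|s)

sampled_action = rng.choice(len(action_probs), p=action_probs)  # RL: sample an action
greedy_action = int(np.argmax(action_probs))                    # supervised-style: argmax
```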
Reward
● Defines the goal of the RL agent
● The environment sends a reward at each time step (usually 0)
● The agent tries to maximize cumulative reward
● Primary basis for altering the policy
○ (Also Novelty Search / Intrinsic Motivation)
● Reward signals may be stochastic functions of the state and the action
Value Function
● Assigns values to states
● Specifies what is good in the long run, vs. reward, which is an immediate signal
● The value of a state is the total reward the agent can expect to accumulate starting from that state
● Values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state
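The slides do not write the definition out. In Sutton and Barto's standard notation, the value of a state $s$ under a policy $\pi$ is the expected return, i.e. the (discounted) sum of future rewards starting from $s$; the discount factor $\gamma \in [0, 1]$ is introduced later in the book:

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s\right]
```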
Model (Optional → Model-Based vs. Model-Free)
● Mimics the behavior of the environment
● Allows inference about how the environment might behave
● Given a state and action, the model might predict the resultant
next state and next reward
● Models are used for planning, considering future situations
before experiencing them
● Model-Based (Models and Planning)
● Model-Free (Explicitly Trial-and-Error Learners)
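As a sketch of "given a state and action, the model might predict the resultant next state and next reward", here is a minimal tabular one-step model. The function names are made up, and using stored transitions for planning in this way is the Dyna-style idea from later in the book:

```python
# Tabular one-step model: remember the last observed outcome of each (state, action).
model = {}

def update_model(state, action, next_state, reward):
    """Learn the model from real experience."""
    model[(state, action)] = (next_state, reward)

def simulate(state, action):
    """Planning: query the model instead of the real environment."""
    return model[(state, action)]
```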
Reinforcement vs. Supervised Learning
● Supervised learning tells the agent the exact correct action for every state, for the purpose of generalizing to states not seen in the training set
● Reinforcement learning generally has a much sparser signal: the agent does not know the correct action for every state, but receives rewards based on a series of states and actions
Examples of
Reinforcement Learning
Chess
● A move is informed by planning (anticipating possible responses and counter-responses) and by judgments of particular positions and moves
Petroleum Refinery
● An adaptive controller adjusts parameters of a petroleum
refinery’s operation in real time
● Optimizes a reward function of yield/cost/quality without sticking
strictly to the set points originally suggested by engineers
A good real-world example: DeepMind reduced Google’s data center cooling bill by 40% (link in description)
Gazelle Calf
● Struggles to its feet minutes after being born
● Half an hour later, it is running at 20 miles per hour
Cleaning Robot
● A mobile robot decides whether to explore a new room in search of more trash or to head back and recharge its battery
● It decides based on the charge level of its battery and its sense of how quickly it can get to the recharger
Phil Making Breakfast
● Closely examined, it contains a complex web of behavior and interlocking goal-subgoal relationships
● Walk to the cupboard, open it, select a cereal box, reach for it, grasp it, retrieve the box
● Each step is guided by goals and is in service of other goals, such as “grasping a spoon”
The Agent seeks to achieve a goal
despite uncertainty about its
environment
Actions change future states
● Chess moves
● Levels of reservoirs of the refinery
● Robot’s next location and charge level of its battery
→ Impacting actions available to the agent in the future
Goals are explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly
● Chess player knows whether or not he wins
● The refinery controller knows how much petroleum is being
produced
● The gazelle calf knows when it falls
● The mobile robot knows when its batteries run down
● Phil knows whether or not he is enjoying his breakfast
Rewards are given directly by the environment,
but values must be estimated and re-estimated
from the sequences of observations an agent
makes over its entire lifetime
The most important component of almost all RL algorithms is a method for efficiently estimating values
The central role of value estimation is arguably the most important breakthrough in RL over the last six decades
Evolutionary Methods and RL
● Apply multiple static policies, each with a separate instance of the environment
● Policies obtaining the most reward are carried over to the next generation of policies
● Value function estimation is skipped in the process
Evolutionary Methods ignore crucial information
● The frequency of wins gives an estimate of the probability of winning with that policy, which is used to direct the next policy selection
● What happens during the game is ignored
→ If the player wins, all of its behavior in the game is given credit
● Value function methods allow individual states to be evaluated
● Learning a value function takes advantage of information available during the
course of play
Tic-Tac-Toe against an imperfect player
● The policy describes the move to make given the state of the board
● Value Function → an estimate of the probability of winning from each state, obtained by playing the game many times
● State A has a higher value than state B if the current estimate of the probability of winning is higher from A than from B
Tic-Tac-Toe
● Most of the time we move greedily, selecting the action that leads
to the state with the greatest value
● Exploratory moves → select randomly, regardless of what the value function would prefer
● Update the values of states throughout experience
Updating Value Functions (Temporal Difference Learning)
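The slide shows the update rule as a figure. In the book's Chapter 1 notation it is the temporal-difference update V(S_t) ← V(S_t) + α [V(S_{t+1}) − V(S_t)], where α is a small step-size parameter; a minimal sketch:

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move the value of `state` a fraction alpha toward the value of `next_state`.

    `values` maps board states to current estimates of the probability of winning,
    as in the tic-tac-toe example of Sutton & Barto, Chapter 1.
    """
    values[state] += alpha * (values[next_state] - values[state])
```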
Lessons learned from Tic-Tac-Toe
● Tic-tac-toe has a relatively small, finite state set
● Compared with backgammon’s ~10^20 states
● This many states makes it impossible to experience more than a small fraction
of them
● The artificial neural network provides the program with the ability to generalize
from its experience so that in new states it selects moves based on information
saved from similar states faced in the past
Self-Play
● What if the agent played against itself with both sides learning?
● Would it learn a different policy for selecting moves?
Symmetries
● Many tic-tac-toe positions appear different but are really the same because of
symmetries. How might we amend the learning process described above to take
advantage of this?
● In what ways would this change improve the learning process?
● Suppose the opponent did not take advantage of symmetries.
● Is it true then, that symmetrically equivalent positions should have the same value?
Greedy Play
● Suppose the RL player were greedy, that is, it always played the move that brought it to the position that it rated the best.
● Might it learn to play better, or worse, than a non-greedy player?
What problems might occur?
Learning from Exploration
● Suppose learning updates occurred after all moves, including exploratory moves.
● If the step-size parameter is appropriately reduced over time (but not the tendency
to explore), then the state values would converge to a different set of probabilities.
● What are the two sets of probabilities computed when we do and when we do not
learn from exploratory moves?
● Assuming that we do continue to make exploratory moves, which set of probabilities
might be better to learn? Which would result in more wins?