Reinforcement Learning: Chapter 17 Frontiers
*Copyright Notice:
Most materials in this presentation are
taken from the book “Reinforcement Learning:
An Introduction, 2nd edition” authored by
Richard S. Sutton and Andrew G. Barto. The
other quoted sources are mentioned in the
respective slides. This presentation itself
adopts a Creative Commons license.
Generalization
Generalize beyond reward to permit predictions
about arbitrary signals.
The signal being predicted is called the cumulant, Ct.
Such a prediction is called a General Value Function (GVF), or forecast.
We seek to approximate it with a parameterized form,
v̂(s, w), although of course there would have to be
a different w for each prediction, that is, for each
choice of π, γ, and C.
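In the book's notation (cumulant Ct, termination function γ, and policy π), the target of this approximation is, roughly:

```latex
\hat{v}(s,\mathbf{w}) \approx
\mathbb{E}\!\left[\,\sum_{k=t}^{\infty}
  \Bigl(\prod_{i=t+1}^{k}\gamma(S_i)\Bigr) C_{k+1}
  \;\Bigm|\; S_t = s,\ A_{t:\infty}\sim\pi \right]
```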
Along with the learned predictions, we might also
learn policies to maximize the predictions in the
usual ways by Generalized Policy Iteration (GPI)
or by actor–critic methods.
Auxiliary tasks
Auxiliary tasks are extra tasks, in addition to the main
task of maximizing reward.
The ability to predict and control a diverse multitude
of signals can constitute a powerful kind of
environmental model.
If good features can be found early on easy auxiliary
tasks, then those features may significantly speed
learning on the main task.
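As an illustrative sketch only (none of these names come from the slides), a shared representation trained on the main value prediction plus a few auxiliary cumulant predictions might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class AgentWithAuxiliaryHeads(nn.Module):
    """Sketch: one shared trunk feeds the main value estimate
    plus several auxiliary cumulant predictions."""
    def __init__(self, obs_dim, n_aux):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.value_head = nn.Linear(64, 1)      # main task: value of the reward
        self.aux_heads = nn.Linear(64, n_aux)   # auxiliary tasks: predict cumulants

    def forward(self, obs):
        h = self.trunk(obs)
        return self.value_head(h), self.aux_heads(h)

# Training sketch: the auxiliary losses shape the shared features.
model = AgentWithAuxiliaryHeads(obs_dim=8, n_aux=3)
obs = torch.randn(32, 8)
value_target = torch.randn(32, 1)       # e.g. a TD target for the reward
aux_targets = torch.randn(32, 3)        # e.g. TD targets for the cumulants
value, aux = model(obs)
loss = nn.functional.mse_loss(value, value_target) \
     + 0.1 * nn.functional.mse_loss(aux, aux_targets)
loss.backward()
```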
*Taken from https://bit.ly/2g9Yv2A
*Taken from https://bit.ly/2uA6tHl
*Taken from https://bit.ly/2yLYlaK
Temporal abstraction
Semi-Markov Decision Process (SMDP): The
amount of time between one decision and the next is
a random variable, either real- or integer-valued.
Option: A pair consisting of a policy, π, and a
state-dependent termination function, γ; such a pair
is a generalized notion of action termed an option.
Option
To execute an option ω = (πω, γω) at time t is to obtain
the action to take, At, from πω(·|St), then terminate
at time t + 1 with probability γω(St+1). If the option
does not terminate at t + 1, then At+1 is selected from
πω(·|St+1) and the option terminates at t + 2 with
probability γω(St+2), and so on until eventual
termination.
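Not from the slides, a minimal sketch of this execution procedure, assuming a hypothetical env.step interface and Python callables policy and termination standing in for πω and γω:

```python
import numpy as np

def execute_option(env, state, policy, termination, rng=None):
    """Run one option (policy, termination) until it terminates.

    policy(state)      -> probability vector over actions, standing in for pi_omega(.|s)
    termination(state) -> termination probability, standing in for gamma_omega(s)
    env.step(action)   -> next state (assumed environment interface)
    """
    rng = rng or np.random.default_rng()
    states = [state]
    while True:
        action_probs = policy(state)                   # pi_omega(.|St)
        action = rng.choice(len(action_probs), p=action_probs)
        state = env.step(action)                       # observe St+1
        states.append(state)
        if rng.random() < termination(state):          # terminate w.p. gamma_omega(St+1)
            return states
```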
Partial observability
In parametric function approximation we retained the
assumption that the learned value functions (and
policies) are functions of the environment’s state, but
allowed these functions to be arbitrarily restricted by
the parameterization.
The case of parameterized function approximation
includes the case of partial observability.
Observations
The environment would emit not its states, but only
observations—signals that depend on its state but
provide only partial information about it.
The environmental interaction would simply be an
alternating sequence of actions At and observations Ot:
A0, O1, A1, O2, A2, O3, A3, O4, . . .
Markov property
Use the word history, and the notation Ht, for an initial portion
of the trajectory up to an observation: Ht = A0, O1, . . . , At-1, Ot.
A summary of the history, the agent's state (called a Markov state),
must be a function of the history, St = f(Ht), and have the Markov
property.
A function f has the Markov property if and only if any two
histories h and h’ that are mapped by f to the same state
(f(h) = f(h’)) also have the same probabilities for their next
observation:
f(h) = f(h’) ⇒ Pr{Ot+1 = o | Ht = h, At = a} = Pr{Ot+1 = o | Ht = h’, At = a}
State-update function
For computational reasons we prefer to obtain the
same effect as f with an incremental, recursive
update that computes St+1 from St, incorporating the
next increment of data, At and Ot+1:
St+1 = u(St, At, Ot+1), for all t ≧ 0,
with the first state S0 given. The function u is called
the state-update function.
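A minimal sketch of how a state-update function slots into the agent-environment loop (env.step, policy, and u are hypothetical interfaces, not from the slides):

```python
def run_agent(env, policy, u, s0, n_steps):
    """Agent-environment loop driven by a state-update function u."""
    state = s0                                  # S0 given
    for _ in range(n_steps):
        action = policy(state)                  # At chosen from the agent state St
        observation = env.step(action)          # environment emits Ot+1
        state = u(state, action, observation)   # St+1 = u(St, At, Ot+1)
    return state
```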
POMDP
Partially Observable MDP (POMDP): The
environment is assumed to have a well-defined latent
state Xt that underlies and produces the
environment’s observations, but is never available to
the agent. The Markov state, St, for a POMDP is the
distribution over the latent states given the history,
called the belief state.
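Not spelled out on the slides, but the belief-state update has the standard Bayes-rule form (assuming the POMDP's latent-state transition and observation probabilities), where b_t(x) = Pr{Xt = x | Ht}, a is the action taken, and o is the observation received:

```latex
b_{t+1}(x') \;\propto\; \sum_{x} \Pr\{X_{t+1}=x',\, O_{t+1}=o \mid X_t=x,\, A_t=a\}\; b_t(x)
```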
*Taken from https://bit.ly/2NtAiWb
PSR
Predictive State Representation (PSR): The semantics
of the agent state is grounded in predictions about future
observations and actions, which are readily observable.
In PSRs, a Markov state is defined as a d-vector of the
probabilities of d specially chosen “core” tests. The vector
is then updated by a state-update function u that is
analogous to Bayes rule, but with a semantics grounded
in observable data, which arguably makes it easier to
learn.
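For reference (from the PSR literature, not from the slides), after taking action a and observing o, the prediction of each core test qi is updated by the test-update identity below; in a linear PSR, both the numerator and the denominator are linear functions of the current core-test prediction vector, which is what makes the update analogous to Bayes' rule yet grounded in observable quantities:

```latex
p(q_i \mid h\,a\,o) \;=\; \frac{p(a\,o\,q_i \mid h)}{p(a\,o \mid h)}
```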
*Taken from https://bit.ly/2xCkZzU
Approximate state
kth-order history: The last k observations and actions,
St=Ot, At−1, Ot−1, . . . , At−k, for some k ≧ 1, which can be
achieved by a state-update function that just shifts the
new data in and the oldest data out.
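A minimal sketch of such a shifting state-update function (illustrative only; storing (action, observation) pairs is an implementation choice, not from the slides):

```python
from collections import deque

def make_kth_order_update(k):
    """State-update function u for a kth-order history state:
    the state is simply the last k (action, observation) pairs."""
    def u(state, action, observation):
        new_state = deque(state, maxlen=k)        # drop the oldest pair if full
        new_state.append((action, observation))   # shift the newest pair in
        return tuple(new_state)
    return u

# Usage sketch: S0 is an empty history; then St+1 = u(St, At, Ot+1).
u = make_kth_order_update(k=3)
state = ()                       # S0
state = u(state, "a0", "o1")     # S1
state = u(state, "a1", "o2")     # S2
```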
The general idea is that a state that is good for some
predictions is also good for others.
Both POMDP and PSR approaches can be applied with
approximate states.
Designing reward
Designing a reward signal is a critical part of any
application of reinforcement learning!
One challenge is to design a reward signal so that as an
agent learns, its behavior approaches, and ideally
eventually achieves, what the application’s designer
actually desires.
Reinforcement learning agents can discover unexpected
ways to make their environments deliver reward, some of
which might be undesirable, or even dangerous.
Sparse reward
Delivering non-zero reward frequently enough to
allow the agent to achieve the goal once, let alone to
learn to achieve it efficiently from multiple initial
conditions, can be a daunting challenge.
State–action pairs that clearly deserve to trigger
reward may be few and far between, and rewards
that mark progress toward a goal can be infrequent
because progress is difficult or even impossible to
detect.
Shaping
When it is difficult for an agent to receive any non-zero
reward signal at all, either due to the sparseness of
rewarding situations or their inaccessibility given
initial behavior, an effective, and sometimes
indispensable, strategy is to start with an easier problem
whose reward signal is not sparse and to incrementally
increase its difficulty as the agent learns, gradually
modifying the reward signal toward the one suited to the
problem of original interest.
Learning from experts
Learning by demonstration
Imitation learning
Apprenticeship learning
Learning from an expert’s behavior can be done either
directly, by supervised learning, or by extracting a
reward signal using what is known as “inverse
reinforcement learning” and then using a reinforcement
learning algorithm with that reward signal to learn a policy.
Evolution
The search for a good reward signal can be automated by
defining a space of feasible candidates and applying an
optimization algorithm.
The optimization algorithm, analogous to evolution,
evaluates each candidate reward signal by running the
reinforcement learning system with that signal for some
number of steps, and then scoring the overall result by a
“high-level” objective function intended to encode the
designer’s true goal. Evolution provides reward signals
that are sensitive to observable predictors of evolutionary
fitness.
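A rough sketch of this outer search loop (run_rl and high_level_objective are hypothetical stand-ins, not from the slides):

```python
def search_reward_signal(candidates, run_rl, high_level_objective, n_steps=10_000):
    """Evaluate each candidate reward signal and keep the best one.

    run_rl(reward_fn, n_steps)   -> behaviour produced by an RL agent trained
                                    with that reward signal for n_steps
    high_level_objective(result) -> score encoding the designer's true goal
    """
    best_score, best_reward_fn = float("-inf"), None
    for reward_fn in candidates:                 # each candidate reward signal
        result = run_rl(reward_fn, n_steps)      # train an agent with it
        score = high_level_objective(result)     # score against the true goal
        if score > best_score:
            best_score, best_reward_fn = score, reward_fn
    return best_reward_fn
```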
Remaining issues
We still need powerful parametric function
approximation methods that work well in fully
incremental and online settings.
Catastrophic forgetting
Replay buffers
We still need methods for learning features such that
subsequent learning generalizes well.
Representation learning
Meta learning
We still need scalable methods for planning with learned
environment models.
Automating the choice of the tasks that an agent works
on and uses to structure its developing competence.
The interaction between behavior and learning via some
computational analog of curiosity.
Developing methods to make it acceptably safe to embed
reinforcement learning agents into physical environments.
Moving forward…
Potential contributions
With its focus on learning by interacting with dynamic
environments, reinforcement learning, as it develops,
will be a critical component of agents whose abilities
approach our own.
Shedding light on fundamental questions about the
mind and how it emerges from the brain.
As an aid to human decision making.
Safety concern
The safety of artificial intelligence applications involving
reinforcement learning is a topic that deserves careful
attention.
Simulators provide safe environments
Learning from simulated experience can make virtually
unlimited data available for learning, and because simulations
typically run much faster than real time, learning can often
occur more quickly than if it relied on real experience.
The full potential of reinforcement learning requires
reinforcement learning agents to be embedded into the
flow of real-world experience.
It is often difficult, sometimes impossible, to simulate real-
world experience with enough fidelity to make the
resulting policies, whether derived by reinforcement
learning or by other methods, work well—and safely—
when directing real actions.
Another challenge, if reinforcement learning agents are to
act and learn in the real world, is not just what they
might learn eventually, but how they will behave
while they are learning.