Reinforcement Learning
Chapter 21
TB Artificial Intelligence
Outline
Agents and Machine Learning (chap. 2)
Markov Decision Problems (chap. 18)
Passive Reinforcement Learning
Active Reinforcement Learning
Topic
Agents and Machine Learning (chap. 2)
Markov Decision Problems (chap. 18)
Passive Reinforcement Learning
Active Reinforcement Learning
Simple Reflex Agent
[Agent architecture diagram: sensors tell the agent "what the world is like now", condition-action rules determine "what action I should do now", and actuators act on the environment.]
Model-based Reflex Agent
[Agent architecture diagram: in addition to sensors and condition-action rules, the agent keeps an internal state, a model of how the world evolves, and a model of what its actions do, in order to work out what the world is like now.]
Goal-based Agent
[Agent architecture diagram: using its state and models, the agent predicts "what it will be like if I do action A" and chooses the action that achieves its goals.]
Utility-based Agent
[Agent architecture diagram: the agent predicts "what it will be like if I do action A", evaluates "how happy I will be in such a state" with a utility function, and picks the action accordingly.]
Learning Agent
Four basic kinds of intelligent agents (in increasing order of complexity):
1. Simple reflex agents
2. Model-based reflex agents
3. Goal-based agents
4. Utility-based agents
But programming such agents entirely by hand can be tedious...
Learning Agent (cont.)
[Learning agent architecture diagram: a critic compares the agent's behavior against a performance standard and sends feedback to the learning element; the learning element makes changes to the knowledge of the performance element and sets learning goals for a problem generator, which suggests exploratory actions.]
Four Main Aspects
1. Which feedback from the environment is available?
(supervised learning, unsupervised learning, reinforcement learning)
2. How is knowledge modeled?
(algebraic expressions, production rules, graphs, networks, sequences, ...)
Is there prior knowledge?
3. Which agent components must learn?
  - State → action mapping?
  - The environment?
  - How the world evolves?
  - The predictable results of actions?
  - The desirability of actions?
  - The states that maximise utility?
4. Online or batch?
Machine Learning
- Supervised learning
  - Learning with labelled instances
  - Ex.: decision trees, neural networks, SVMs, ...
- Unsupervised learning
  - Learning without labels
  - Ex.: k-means, clustering, ...
- Reinforcement learning
  - Learning with rewards
  - Applications: robots, autonomous vehicles, ...
Reinforcement Learning
- Supervised learning is the simplest and best-studied type of learning
- Another type of learning task is learning behaviors when we don't have a teacher to tell us how
- The agent has a task to perform; it takes some actions in the world; at some later point it gets feedback telling it how well it did on performing the task
- The agent performs the same task over and over again
- The agent gets carrots for good behavior and sticks for bad behavior
- It's called reinforcement learning because the agent gets positive reinforcement for tasks done well and negative reinforcement for tasks done poorly
Reinforcement Learning (cont.)
- The problem of getting an agent to act in the world so as to maximize its rewards
- Consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem
- We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs

Example: SDyna for video games
Reinforcement Learning: Basic Model
[Kaelbling et al., 1996]
- i: input (some indications about s)
- s: state of the environment
- a: action
- r: reinforcement signal
- B: agent's behavior
- T: transition function
- I: input function (what is seen about the environment)

[Diagram of the interaction loop: the agent's behavior B receives input i and reinforcement r and emits action a; the environment's transition function T updates the state s, from which the input function I and the reward function R produce the next i and r.]
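To make the interaction loop concrete, here is a minimal Python sketch of one way to implement it; the `Environment` and `RandomAgent` classes and their dynamics are illustrative assumptions, not part of the model from the slides.

```python
# Minimal sketch of the agent-environment loop: behavior B maps (i, r) to an action a,
# and the environment applies T, I and R to produce the next input and reinforcement.
# Everything here (class names, dynamics) is illustrative.
import random

class Environment:
    """Toy two-state environment: action 1 taken in state 0 usually pays off."""
    def __init__(self):
        self.s = 0                                               # state s (hidden)

    def step(self, a):
        self.s = 1 if (a == 1 and random.random() < 0.8) else 0  # transition function T
        r = 1.0 if self.s == 1 else 0.0                          # reinforcement signal r
        i = self.s                                               # input function I (fully observed here)
        return i, r

class RandomAgent:
    """Behavior B: a trivial random policy, just to drive the loop."""
    def act(self, i, r):
        return random.choice([0, 1])

env, agent = Environment(), RandomAgent()
i, r = 0, 0.0
for t in range(5):
    a = agent.act(i, r)        # agent chooses action a from its current input and reward
    i, r = env.step(a)         # environment returns the next input i and reinforcement r
    print(t, a, i, r)
```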
Topic
Agents and Machine Learning (chap. 2)
Markov Decision Problems (chap. 18)
Passive Reinforcement Learning
Active Reinforcement Learning
Markov Decision Process
Definition (Markov Decision Process (MDP))
A sequential decision problem for a fully observable, stochastic environment with a Markovian transition function and additive rewards is called a Markov decision process, and is defined by a tuple ⟨S, A, T, R⟩:
- A set of states s ∈ S
- A set of actions a ∈ A
- A stochastic transition function T(s, a, s′) = P(s′ | s, a)
- A reward function R(s)
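For illustration, such a tuple can be written down directly as data; the two-state MDP below is an invented toy example, not the grid world used on the following slides.

```python
# Minimal sketch of an MDP <S, A, T, R> as plain Python data (toy two-state example).

S = ["s1", "s2"]
A = ["stay", "go"]

# T[(s, a)] maps each next state s' to P(s' | s, a)
T = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "go"):   {"s1": 0.9, "s2": 0.1},
}

# R[s] is the reward received in state s
R = {"s1": -0.04, "s2": 1.0}

# sanity check: every transition distribution sums to one
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```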
Markov Decision Process (cont.)
[4×3 grid world: the agent starts at START, the two terminal states have rewards +1 and −1, and every other state has reward −0.04.]
Each action goes straight with probability 0.8, and veers left or right with probability 0.1 each.
- In a deterministic environment, a solution would be [↑, ↑, →, →, →]
- In our stochastic environment, the probability of reaching the goal state +1 with this sequence of actions is only about 0.33
Markov Decision Process: Policy
- A policy is a function π : S → A that specifies what action the agent should take in any given state
- Executing a policy can give rise to many action sequences!
- How can we determine the quality of a policy?

[Figure: an example policy for the 4×3 grid world, shown as one arrow per non-terminal state.]
Markov Decision Process: Utility
- Utility is an internal measure of an agent's success
- Agent's own internal performance measure
- Surrogate for success and happiness
- The utility is a function of the rewards:

  U([s0, s1, s2, ...]) = γ^0 R(s0) + γ^1 R(s1) + γ^2 R(s2) + ... = Σ_{t=0}^{∞} γ^t R(st)

with γ a discount factor
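As a quick worked check of this sum, the snippet below accumulates discounted rewards for an invented reward sequence and discount factor.

```python
# Minimal sketch: U = sum_t gamma^t * R(s_t) for an illustrative reward sequence.
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]                       # R(s_0), ..., R(s_3)
utility = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(utility, 4))                                   # about 0.6206
```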
Markov Decision Process: Utility (cont.)
[Figures: the optimal policy for the 4×3 grid world (one arrow per non-terminal state) and the corresponding utilities of the non-terminal states: 0.705, 0.762, 0.812, 0.655, 0.868, 0.611, 0.660, 0.918, 0.388.]
Value Iteration Algorithm
- Given an MDP, recursively formulate the utility of starting at state s:

  U(s) = R(s) + γ max_{a∈A(s)} Σ_{s′} P(s′ | s, a) U(s′)    (Bellman equation)

- This suggests an iterative algorithm:

  U_{i+1}(s) ← R(s) + γ max_{a∈A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

- Once we have U(s) for all states s, we can construct the optimal policy:

  π*(s) = argmax_{a∈A(s)} Σ_{s′} P(s′ | s, a) U(s′)
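A minimal sketch of value iteration implementing the update above, run on an invented two-state MDP rather than the 4×3 grid world:

```python
# Minimal sketch of value iteration on a tiny illustrative MDP (not the grid world).

GAMMA = 0.9
STATES = ["s1", "s2"]
ACTIONS = ["stay", "go"]
R = {"s1": -0.04, "s2": 1.0}
# T[(s, a)] maps next state s' to P(s' | s, a)
T = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "go"):   {"s1": 0.9, "s2": 0.1},
}

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        # Bellman update: U_{i+1}(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U_i(s')
        U_new = {
            s: R[s] + GAMMA * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in ACTIONS
            )
            for s in STATES
        }
        if max(abs(U_new[s] - U[s]) for s in STATES) < eps:
            return U_new
        U = U_new

U = value_iteration()
# extract the greedy policy from the converged utilities
policy = {
    s: max(ACTIONS, key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
    for s in STATES
}
print(U, policy)
```

The stopping threshold `eps` is an illustrative choice; in practice it is set from the desired accuracy of the utilities.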
Policy Iteration Algorithm
- Policy evaluation: given π_i, compute U_i

  U_i(s) = U^{π_i}(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ]

- Policy improvement: given U_i, compute π_{i+1}

  π_{i+1}(s) = argmax_{a∈A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

- Repeat until the policy is stable
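A corresponding sketch of policy iteration on the same invented two-state MDP; here policy evaluation uses simple iterative sweeps rather than an exact linear solve.

```python
# Minimal sketch of policy iteration (evaluation by fixed-point sweeps, then greedy improvement).

GAMMA = 0.9
STATES = ["s1", "s2"]
ACTIONS = ["stay", "go"]
R = {"s1": -0.04, "s2": 1.0}
T = {("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s1": 0.2, "s2": 0.8},
     ("s2", "stay"): {"s2": 1.0}, ("s2", "go"): {"s1": 0.9, "s2": 0.1}}

def evaluate(pi, sweeps=200):
    """Policy evaluation: iterate U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: R[s] + GAMMA * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
             for s in STATES}
    return U

def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}          # arbitrary initial policy
    while True:
        U = evaluate(pi)                          # policy evaluation
        new_pi = {s: max(ACTIONS,
                         key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
                  for s in STATES}                # policy improvement
        if new_pi == pi:                          # stable: done
            return pi, U
        pi = new_pi

print(policy_iteration())
```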
Why Reinforcement Learning?
- We could find an optimal policy for an MDP if we knew the transition model P(s′ | s, a)
- But an agent in an unknown environment knows neither the transition model nor, in advance, what rewards it will get in new states
- We want the agent to learn to behave rationally in an unsupervised process

The purpose of RL is to learn the optimal policy based only on the received rewards.
Topic
Agents and Machine Learning (chap. 2)
Markov Decision Problems (chap. 18)
Passive Reinforcement Learning
Active Reinforcement Learning
Passive RL
- In passive RL, the agent's policy π is fixed; it only needs to learn how good that policy is
- The agent runs a number of trials, starting in (1,1) and continuing until it reaches a terminal state
- The utility of a state is the expected total remaining reward (reward-to-go)
- Each trial provides a sample of the reward-to-go for each visited state
- The agent keeps a running average for each state, which will converge to the true value
- This is the direct utility estimation method
Direct Utility Estimation
Two trials under the fixed policy (γ = 1, R = −0.04 in non-terminal states):

Trial 1: (1,1) ↑ −0.04, (1,2) ↑ −0.04, (1,3) → −0.04, (2,3) → −0.04, (3,3) → −0.04, (3,2) ↑ −0.04, (3,3) → −0.04, (4,3) exit +1 (done)
Trial 2: (1,1) ↑ −0.04, (1,2) ↑ −0.04, (1,3) → −0.04, (2,3) → −0.04, (3,3) → −0.04, (3,2) ↑ −0.04, (4,2) exit −1 (done)

[Figure: the fixed policy on the 4×3 grid world.]

Resulting estimates:
V(2,3) ≈ (0.84 − 1.12)/2 = −0.14
V(3,3) ≈ (0.96 + 0.88 − 1.08)/3 ≈ 0.25
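The running-average computation can be sketched as follows; the trial data mirror the two trials above, and the variable names are illustrative.

```python
# Minimal sketch of direct utility estimation: average the observed reward-to-go
# over every visit to a state (gamma = 1).
from collections import defaultdict

trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
          ((3, 3), -0.04), ((3, 2), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
trial2 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
          ((3, 3), -0.04), ((3, 2), -0.04), ((4, 2), -1.0)]

totals, counts = defaultdict(float), defaultdict(int)
for trial in (trial1, trial2):
    for t, (state, _) in enumerate(trial):
        reward_to_go = sum(r for _, r in trial[t:])   # remaining reward from this visit
        totals[state] += reward_to_go
        counts[state] += 1

V = {s: totals[s] / counts[s] for s in totals}
print(round(V[(2, 3)], 2), round(V[(3, 3)], 2))       # -0.14 and about 0.25
```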
Problems with Direct Utility Estimation
- Direct utility estimation fails to exploit the fact that the utilities of states are not independent, as shown by the Bellman equations:

  U(s) = R(s) + γ max_{a∈A(s)} Σ_{s′} P(s′ | s, a) U(s′)

- Learning can be sped up by using these dependencies
- Direct utility estimation can be seen as inductive learning that searches a hypothesis space which is too large because it contains many hypotheses violating the Bellman equations
Adaptive Dynamic Programming
- An ADP agent uses the dependencies between states to speed up value estimation
- It follows a policy π and uses the observed transitions to incrementally build the transition model P(s′ | s, π(s))
- It can then plug the learned transition model and the observed rewards R(s) into the Bellman equations to compute U(s)
- The equations are linear because there is no max operator → easier to solve
- The result is U(s) for the given policy π
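A minimal sketch of the ADP idea under these assumptions: transition probabilities are estimated from counts of observed transitions (the observation list below is invented), and the resulting linear Bellman equations are solved here by simple iteration rather than an exact solver.

```python
# Minimal sketch of ADP for a fixed policy pi: estimate P(s' | s, pi(s)) from counts,
# then evaluate the policy with the (linear) Bellman equations.
from collections import defaultdict

GAMMA = 0.9
R = {"s1": -0.04, "s2": 1.0}
observed = [("s1", "s1"), ("s1", "s2"), ("s1", "s2"), ("s2", "s1"), ("s2", "s2")]

counts = defaultdict(lambda: defaultdict(int))
for s, s2 in observed:                        # transitions seen while following pi
    counts[s][s2] += 1
P = {s: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
     for s, nexts in counts.items()}          # estimated P(s' | s, pi(s))

U = {s: 0.0 for s in R}
for _ in range(200):                          # solve U = R + gamma * P U by iteration
    U = {s: R[s] + GAMMA * sum(p * U[s2] for s2, p in P[s].items()) for s in R}
print(U)
```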
Temporal-difference Learning
- TD is another passive utility-learning algorithm based on the Bellman equations
- Instead of solving the equations, TD uses the observed transitions to adjust the utilities of the observed states so that they agree with the Bellman equations
- TD uses a learning rate parameter α to set the size of each utility adjustment
- TD does not need a transition model to perform its updates, only the observed transitions
Temporal-difference Learning (cont.)
- TD update rule for a transition from s to s′:

  V^π(s) ← V^π(s) + α (R(s) + γ V^π(s′) − V^π(s))

  where R(s) + γ V^π(s′) is a (noisy) sample of the value at s based on the next state s′
- So the update maintains a running "mean" of the (noisy) value samples
- If the learning rate decreases appropriately with the number of samples (e.g. 1/n), then the value estimates will converge to the true values!

  V^π(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) V^π(s′)

(Dan Klein, UC Berkeley)
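A sketch of this TD(0) update applied to a stream of observed transitions; the transition data and the 1/n learning-rate schedule are illustrative.

```python
# Minimal sketch of the TD(0) update on observed (state, reward, next state) transitions.
GAMMA = 0.9
V = {"s1": 0.0, "s2": 0.0}
n_visits = {"s1": 0, "s2": 0}

# transitions observed while following the fixed policy pi (invented data)
transitions = [("s1", -0.04, "s2"), ("s2", 1.0, "s1"), ("s1", -0.04, "s1")] * 50

for s, r, s_next in transitions:
    n_visits[s] += 1
    alpha = 1.0 / n_visits[s]                        # decreasing learning rate
    V[s] += alpha * (r + GAMMA * V[s_next] - V[s])   # move V(s) toward the sample
print(V)
```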
Topic
Agents and Machine Learning (chap. 2)
Markov Decision Problems (chap. 18)
Passive Reinforcement Learning
Active Reinforcement Learning
Active Reinforcement Learning
- While a passive RL agent executes a fixed policy π, an active RL agent has to decide which actions to take
- An active RL agent is an extension of a passive one, e.g. the passive ADP agent, with the following additions:
  - It needs to learn a complete transition model for all actions (not just π), using passive ADP learning
  - Utilities need to reflect the optimal policy π*, as expressed by the Bellman equations
  - The equations can be solved by the VI or PI methods described before
  - The action selected is the optimal/maximizing one

(Roar Fjellheim, University of Oslo)
Exploitation vs. Exploration
- The active RL agent may select maximizing actions based on a faulty learned model, and fail to incorporate observations that might lead to a more correct model
- To avoid this, the agent design could include selecting actions that lead to more correct models at the cost of reduced immediate rewards
- This is called the exploitation vs. exploration tradeoff
- The question of the optimal exploration policy is studied in a subfield of statistical decision theory dealing with so-called bandit problems

(Roar Fjellheim, University of Oslo)
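One common way to realise this tradeoff (not discussed on the slides) is ε-greedy action selection; the sketch below is an illustrative assumption, not the deck's prescribed scheme.

```python
# Minimal sketch of epsilon-greedy action selection over learned Q-values.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best one."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

Q = {("s1", "go"): 0.8, ("s1", "stay"): 0.1}                   # illustrative Q-values
print(epsilon_greedy(Q, "s1", ["stay", "go"]))
```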
Q-Learning
- An action-utility function Q assigns an expected utility to taking a given action in a given state: Q(a, s) is the value of doing action a in state s
- Q-values are related to utility values:

  U(s) = max_a Q(a, s)

- Q-values are sufficient for decision making without needing a transition model P(s′ | s, a)
- They can be learned directly from rewards using a TD method based on the update equation:

  Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a′} Q(s′, a′) − Q(s, a))

(Roar Fjellheim, University of Oslo)
Q-Learning
- Q-learning: sample-based Q-value iteration
- Learn Q*(s, a) values:
  - Receive a sample (s, a, s′, r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate:

    Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]

    sample = R(s, a, s′) + γ max_{a′} Q*(s′, a′)

  - Incorporate the new estimate into a running average:

    Q(s, a) ← (1 − α) Q(s, a) + α · sample

(Dan Klein, UC Berkeley)
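Putting the pieces together, a minimal tabular Q-learning sketch using the running-average update above; the tiny environment, ε-greedy behaviour policy, and hyperparameters are invented for illustration.

```python
# Minimal sketch of tabular Q-learning on an illustrative two-state environment.
import random
from collections import defaultdict

ACTIONS, GAMMA, ALPHA, EPS = ["stay", "go"], 0.9, 0.1, 0.1
Q = defaultdict(float)                        # Q[(s, a)], initialised to 0

def step(s, a):
    """Illustrative dynamics: 'go' from s1 usually reaches the rewarding state s2."""
    if s == "s1" and a == "go" and random.random() < 0.8:
        return "s2", 1.0
    return "s1", -0.04

s = "s1"
for _ in range(5000):
    a = random.choice(ACTIONS) if random.random() < EPS else \
        max(ACTIONS, key=lambda x: Q[(s, x)])             # epsilon-greedy behaviour
    s2, r = step(s, a)
    sample = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample  # running-average update
    s = s2
print(dict(Q))
```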
SARSA
- Updating the Q-value depends on the current state of the agent s1, the action the agent chooses a1, the reward r the agent gets for choosing this action, the state s2 the agent will be in after taking that action, and finally the next action a2 the agent will choose in its new state (hence the name S,A,R,S,A):

  Q(s, a) ← Q(s, a) + α (R(s) + γ Q(s′, a′) − Q(s, a))

- SARSA learns the Q-values associated with the policy it follows itself, while Q-learning learns the Q-values associated with the exploitation policy while following an exploration/exploitation policy:

  Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a′} Q(s′, a′) − Q(s, a))
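For comparison, a SARSA sketch on the same invented environment as the Q-learning example; note that the update uses the action actually chosen in the next state by the same ε-greedy behaviour policy, which is what makes it on-policy.

```python
# Minimal sketch of SARSA on an illustrative two-state environment.
import random
from collections import defaultdict

ACTIONS, GAMMA, ALPHA, EPS = ["stay", "go"], 0.9, 0.1, 0.1
Q = defaultdict(float)

def step(s, a):
    if s == "s1" and a == "go" and random.random() < 0.8:
        return "s2", 1.0
    return "s1", -0.04

def choose(s):
    """Epsilon-greedy behaviour policy (also the policy being evaluated)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda x: Q[(s, x)])

s, a = "s1", choose("s1")
for _ in range(5000):
    s2, r = step(s, a)
    a2 = choose(s2)                                             # next action from the same policy
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])  # on-policy update
    s, a = s2, a2
print(dict(Q))
```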
Generalization in RL
- In simple domains, U and Q can be represented by tables indexed by state s
- However, for large state spaces the tables become too large to be feasible, e.g. chess has about 10^40 states
- Instead, function approximation can sometimes be used, e.g.

  Û(s) = Σ_i parameter_i × feature_i(s)

- Instead of e.g. 10^40 table entries, U can be estimated by e.g. 20 parameterized features
- The parameters can be found by supervised learning
- Problem: such a function may not exist, and the learning process may therefore fail to converge

(Roar Fjellheim, University of Oslo)
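A minimal sketch of this idea: Û(s) as a weighted sum of hand-chosen features, with parameters fitted by gradient steps toward observed reward-to-go samples. The features, samples, and learning rate are all invented for illustration.

```python
# Minimal sketch of linear function approximation for U(s) on a grid-like state space.

def features(s):
    x, y = s
    return [1.0, x, y]                     # e.g. a constant plus the grid coordinates

theta = [0.0, 0.0, 0.0]                    # one parameter per feature
ALPHA = 0.05                               # learning rate for the gradient steps

def U_hat(s):
    return sum(t * f for t, f in zip(theta, features(s)))

# (state, observed reward-to-go) pairs, e.g. collected from trials (illustrative values)
samples = [((1, 1), 0.705), ((3, 3), 0.918), ((4, 1), 0.388)] * 200

for s, target in samples:
    err = target - U_hat(s)                # supervised-learning style error
    theta = [t + ALPHA * err * f for t, f in zip(theta, features(s))]

# the fitted parameters generalize to states that were never sampled, e.g. (2, 3)
print([round(t, 3) for t in theta], round(U_hat((2, 3)), 3))
```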
Summary
- Reinforcement learning (RL) examines how an agent can learn to act in an unknown environment based only on percepts and rewards
- Three RL designs are: model-based, using a model P and a utility function U; model-free, using an action-utility function Q; and reflex, using a policy
- The utility of a state is the expected sum of rewards received up to the terminal state. Three estimation methods are direct estimation, adaptive dynamic programming (ADP), and temporal-difference (TD) learning
- Action-utility functions (Q-functions) can be learned by ADP or TD approaches
- In passive learning the agent just observes the environment, while an active learner must select actions to trade off immediate reward vs. exploration for improved model precision
- In domains with very large state spaces, utility tables U are replaced by approximate functions
- Policy search works directly on a representation of the policy, improving it in an iterative cycle
- Reinforcement learning is a very active research area, especially in robotics
