Lecture 4: Model-Free Prediction
Lecture 4: Model-Free Prediction
David Silver
Lecture 4: Model-Free Prediction
Outline
1 Introduction
2 Monte-Carlo Learning
3 Temporal-Difference Learning
4 TD(λ)
Lecture 4: Model-Free Prediction
Introduction
Model-Free Reinforcement Learning
Last lecture:
Planning by dynamic programming
Solve a known MDP
This lecture:
Model-free prediction
Estimate the value function of an unknown MDP
Next lecture:
Model-free control
Optimise the value function of an unknown MDP
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs
All episodes must terminate
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Monte-Carlo Policy Evaluation
Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., Sk ∼ π
Recall that the return is the total discounted reward:
Gt = Rt+1 + γRt+2 + ... + γ^{T−1} RT
Recall that the value function is the expected return:
vπ(s) = Eπ [Gt | St = s]
Monte-Carlo policy evaluation uses empirical mean return
instead of expected return
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
First-Visit Monte-Carlo Policy Evaluation
To evaluate state s
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V (s) = S(s)/N(s)
By law of large numbers, V (s) → vπ(s) as N(s) → ∞
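As a concrete illustration, a minimal first-visit MC evaluation sketch in Python, assuming episodes have already been generated under π and are encoded as lists of (state, reward) pairs (the encoding and function name are illustrative, not from the lecture):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean return following the first visit to s per episode.

    Each episode is a list of (state, reward) pairs, where reward is the
    R_{t+1} received after leaving that state.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        # Compute G_t for every time-step by scanning the episode backwards.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # Only the first visit to each state contributes in this episode.
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G_t
    return {s: S[s] / N[s] for s in N}
```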
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Every-Visit Monte-Carlo Policy Evaluation
To evaluate state s
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V (s) = S(s)/N(s)
Again, V (s) → vπ(s) as N(s) → ∞
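The every-visit variant simply drops the first-visit filter; a compact sketch under the same illustrative episode encoding:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Every occurrence of s in an episode contributes its return G_t."""
    N, S = defaultdict(int), defaultdict(float)
    for episode in episodes:
        G = 0.0
        # Walking backwards, G is exactly the return from the current step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}
```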
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Blackjack Example
Blackjack Example
States (200 of them):
Current sum (12-21)
Dealer’s showing card (ace-10)
Do I have a “useable” ace? (yes-no)
Action stick: Stop receiving cards (and terminate)
Action twist: Take another card (no replacement)
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise
Transitions: automatically twist if sum of cards < 12
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Blackjack Example
Blackjack Value Function after Monte-Carlo Learning
Policy: stick if sum of cards ≥ 20, otherwise twist
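A hedged sketch of reproducing this experiment, assuming the gymnasium package and its Blackjack-v1 environment (observation = (player sum, dealer's showing card, usable ace); action 0 = stick, 1 = twist/hit); these names are external assumptions, not part of the lecture:

```python
import gymnasium as gym            # assumption: gymnasium with Blackjack-v1 is installed
from collections import defaultdict

env = gym.make("Blackjack-v1")     # obs = (player sum, dealer's showing card, usable ace)
N, S = defaultdict(int), defaultdict(float)

for _ in range(500_000):
    obs, _ = env.reset()
    episode, done = [], False
    while not done:
        # Fixed policy: stick (0) if sum >= 20, otherwise twist (1).
        # Blackjack-v1 also exposes sums below 12; the policy twists there anyway,
        # which matches the lecture's "automatically twist if sum < 12".
        action = 0 if obs[0] >= 20 else 1
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, reward))
        obs, done = next_obs, terminated or truncated
    G = 0.0
    for state, reward in reversed(episode):   # every-visit MC, gamma = 1
        G += reward
        N[state] += 1
        S[state] += G

V = {s: S[s] / N[s] for s in N}               # approximate v_pi for the stick-on-20 policy
```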
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Incremental Monte-Carlo
Incremental Mean
The mean µ1, µ2, ... of a sequence x1, x2, ... can be computed
incrementally,
µk = (1/k) Σ_{j=1}^{k} xj
   = (1/k) ( xk + Σ_{j=1}^{k−1} xj )
   = (1/k) ( xk + (k − 1) µk−1 )
   = µk−1 + (1/k) ( xk − µk−1 )
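The same identity in code, as a quick sanity check (names illustrative):

```python
def incremental_mean(xs):
    """Running mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
    return mu

assert abs(incremental_mean([1, 2, 3, 4]) - 2.5) < 1e-12
```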
Lecture 4: Model-Free Prediction
Monte-Carlo Learning
Incremental Monte-Carlo
Incremental Monte-Carlo Updates
Update V (s) incrementally after episode S1, A1, R2, ..., ST
For each state St with return Gt
N(St) ← N(St) + 1
V(St) ← V(St) + (1/N(St)) (Gt − V(St))
In non-stationary problems, it can be useful to track a running
mean, i.e. forget old episodes.
V (St) ← V (St) + α (Gt − V (St))
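A minimal sketch of the incremental update, with either the exact 1/N(St) step-size or a constant α for non-stationary problems (episode encoding as in the earlier sketches, i.e. illustrative):

```python
from collections import defaultdict

def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Update V in place after one episode of (state, reward) pairs.

    alpha=None reproduces the exact running mean with step-size 1/N(s);
    a constant alpha gives the 'forgetful' update for non-stationary problems.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])

V, N = defaultdict(float), defaultdict(int)
incremental_mc_update(V, N, [("A", 0.0), ("B", 1.0)], alpha=0.1)
```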
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Temporal-Difference Learning
TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
MC and TD
Goal: learn vπ online from experience under policy π
Incremental every-visit Monte-Carlo
Update value V (St) toward actual return Gt
V (St) ← V (St) + α (Gt − V (St))
Simplest temporal-difference learning algorithm: TD(0)
Update value V (St) toward estimated return Rt+1 + γV (St+1)
V (St) ← V (St) + α (Rt+1 + γV (St+1) − V (St))
Rt+1 + γV (St+1) is called the TD target
δt = Rt+1 + γV (St+1) − V (St) is called the TD error
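A minimal TD(0) prediction sketch. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the policy callable are illustrative assumptions, not a specific library API:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0): after every step, move V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + gamma * (0.0 if done else V[next_state])
            td_error = td_target - V[state]
            V[state] += alpha * td_error
            state = next_state
    return V
```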
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Driving Home Example
Driving Home Example
State              | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office     | 0                      | 30                   | 30
reach car, raining | 5                      | 35                   | 40
exit highway       | 20                     | 15                   | 35
behind truck       | 30                     | 10                   | 40
home street        | 40                     | 3                    | 43
arrive home        | 43                     | 0                    | 43
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Driving Home Example
Driving Home Example: MC vs. TD
[Figure: changes recommended by Monte-Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Driving Home Example
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
TD can learn online after every step
MC must wait until end of episode before return is known
TD can learn without the final outcome
TD can learn from incomplete sequences
MC can only learn from complete sequences
TD works in continuing (non-terminating) environments
MC only works for episodic (terminating) environments
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Driving Home Example
Bias/Variance Trade-Off
Return Gt = Rt+1 + γRt+2 + ... + γT−1RT is unbiased
estimate of vπ(St)
True TD target Rt+1 + γvπ(St+1) is unbiased estimate of
vπ(St)
TD target Rt+1 + γV (St+1) is biased estimate of vπ(St)
TD target is much lower variance than the return:
Return depends on many random actions, transitions, rewards
TD target depends on one random action, transition, reward
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Driving Home Example
Advantages and Disadvantages of MC vs. TD (2)
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Random Walk Example
Random Walk Example
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Random Walk Example
Random Walk: MC vs. TD
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Batch MC and TD
Batch MC and TD
MC and TD converge: V (s) → vπ(s) as experience → ∞
But what about batch solution for finite experience?
s^1_1, a^1_1, r^1_2, ..., s^1_{T_1}
...
s^K_1, a^K_1, r^K_2, ..., s^K_{T_K}
e.g. Repeatedly sample episode k ∈ [1, K]
Apply MC or TD(0) to episode k
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Batch MC and TD
AB Example
Two states A, B; no discounting; 8 episodes of experience
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V (A), V (B)?
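The answer can be checked numerically; a small sketch that applies batch MC and repeatedly replayed TD(0) to the eight episodes (episode encoding illustrative):

```python
# Eight episodes, each a list of (state, reward) pairs; no discounting.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC: V(s) = mean of the returns observed from s.
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward in reversed(ep):
        G += reward
        returns[state].append(G)
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}   # V(A) = 0, V(B) = 0.75

# Batch TD(0): replay the fixed episodes until the values stop changing.
V_td = {"A": 0.0, "B": 0.0}
for _ in range(2000):
    for ep in episodes:
        for i, (state, reward) in enumerate(ep):
            next_v = V_td[ep[i + 1][0]] if i + 1 < len(ep) else 0.0
            V_td[state] += 0.01 * (reward + next_v - V_td[state])
# V_td approaches V(B) = 0.75 and V(A) = 0.75, the certainty-equivalence answer.
```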
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Batch MC and TD
Certainty Equivalence
MC converges to solution with minimum mean-squared error
Best fit to the observed returns
Σ_{k=1}^{K} Σ_{t=1}^{T_k} ( G^k_t − V(s^k_t) )^2
In the AB example, V (A) = 0
TD(0) converges to solution of max likelihood Markov model
Solution to the MDP ⟨S, A, P̂, R̂, γ⟩ that best fits the data
P̂^a_{s,s'} = (1/N(s,a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k} 1(s^k_t, a^k_t, s^k_{t+1} = s, a, s')
R̂^a_s = (1/N(s,a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k} 1(s^k_t, a^k_t = s, a) r^k_t
In the AB example, V (A) = 0.75
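A sketch of the certainty-equivalence computation on the AB data: build the maximum-likelihood model from counts, then solve it. The AB episodes have no actions, so the sketch estimates a Markov reward process rather than the full ⟨S, A, P̂, R̂, γ⟩ (encoding illustrative):

```python
from collections import defaultdict

# Each step is (state, reward, next_state), with None marking termination.
episodes = [[("A", 0, "B"), ("B", 0, None)]] + \
           [[("B", 1, None)]] * 6 + [[("B", 0, None)]]

n = defaultdict(int)                                # N(s)
trans = defaultdict(lambda: defaultdict(int))       # counts of s -> s'
r_sum = defaultdict(float)                          # summed immediate reward from s

for ep in episodes:
    for s, r, s_next in ep:
        n[s] += 1
        r_sum[s] += r
        if s_next is not None:
            trans[s][s_next] += 1

R_hat = {s: r_sum[s] / n[s] for s in n}
P_hat = {s: {s2: c / n[s] for s2, c in nexts.items()} for s, nexts in trans.items()}

# Solve the fitted model by simple value iteration (gamma = 1, terminal value 0).
V = defaultdict(float)
for _ in range(100):
    for s in n:
        V[s] = R_hat[s] + sum(p * V[s2] for s2, p in P_hat.get(s, {}).items())
# Yields V(B) = 0.75 and V(A) = 0.75, the TD(0) batch solution.
```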
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Batch MC and TD
Advantages and Disadvantages of MC vs. TD (3)
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Unified View
Monte-Carlo Backup
V (St) ← V (St) + α (Gt − V (St))
[Backup diagram: Monte-Carlo backs up V(St) along one sampled trajectory, all the way from St to a terminal state T]
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Unified View
Temporal-Difference Backup
V (St) ← V (St) + α (Rt+1 + γV (St+1) − V (St))
[Backup diagram: TD(0) backs up V(St) from one sampled step (Rt+1, St+1), bootstrapping from V(St+1)]
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Unified View
Dynamic Programming Backup
V (St) ← Eπ [Rt+1 + γV (St+1)]
[Backup diagram: dynamic programming backs up V(St) with a full-width expectation over all actions and successor states St+1, bootstrapping from V(St+1)]
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Unified View
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample
TD samples
Lecture 4: Model-Free Prediction
Temporal-Difference Learning
Unified View
Unified View of Reinforcement Learning
Lecture 4: Model-Free Prediction
TD(λ)
n-Step TD
n-Step Prediction
Let TD target look n steps into the future
Lecture 4: Model-Free Prediction
TD(λ)
n-Step TD
n-Step Return
Consider the following n-step returns for n = 1, 2, ∞:
n = 1 (TD):  G_t^(1) = Rt+1 + γV(St+1)
n = 2:       G_t^(2) = Rt+1 + γRt+2 + γ^2 V(St+2)
...
n = ∞ (MC):  G_t^(∞) = Rt+1 + γRt+2 + ... + γ^{T−1} RT
Define the n-step return
G_t^(n) = Rt+1 + γRt+2 + ... + γ^{n−1} Rt+n + γ^n V(St+n)
n-step temporal-difference learning
V(St) ← V(St) + α ( G_t^(n) − V(St) )
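A sketch of computing G_t^(n) from a recorded episode, dropping the bootstrap term when t + n runs past the terminal step (array layout illustrative):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards[k] holds R_{k+1}, values[k] holds V(S_k), and the episode
    terminates at T = len(rewards); beyond T the return is the MC return.
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        G += gamma ** n * values[t + n]       # bootstrap from V(S_{t+n})
    return G

# n-step TD update for a single time-step t:
# values[t] += alpha * (n_step_return(rewards, values, t, n) - values[t])
```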
Lecture 4: Model-Free Prediction
TD(λ)
n-Step TD
Large Random Walk Example
Lecture 4: Model-Free Prediction
TD(λ)
n-Step TD
Averaging n-Step Returns
We can average n-step returns over different n
e.g. average the 2-step and 4-step returns
(1/2) G^(2) + (1/2) G^(4)
Combines information from two different
time-steps
Can we efficiently combine information from all
time-steps?
One backup
Lecture 4: Model-Free Prediction
TD(λ)
Forward View of TD(λ)
λ-return
The λ-return G_t^λ combines all n-step returns G_t^(n)
Using weight (1 − λ)λ^{n−1}
G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^(n)
Forward-view TD(λ)
V(St) ← V(St) + α ( G_t^λ − V(St) )
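A sketch of the λ-return for a finite episode. Past the terminal step, all remaining weight collapses onto the full (MC) return, so the weights still sum to one (array layout as in the n-step sketch above, i.e. illustrative):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) * G_t^(n), truncated at the end.

    For an episode of length T, the final (MC) return receives the leftover
    weight lambda^(T - t - 1), so that all weights sum to one.
    """
    T = len(rewards)
    G_lam, G_n = 0.0, 0.0
    for n in range(1, T - t + 1):
        G_n += gamma ** (n - 1) * rewards[t + n - 1]          # reward part of G_t^(n)
        if t + n < T:
            g = G_n + gamma ** n * values[t + n]              # bootstrapped n-step return
            weight = (1 - lam) * lam ** (n - 1)
        else:
            g = G_n                                           # full MC return
            weight = lam ** (n - 1)
        G_lam += weight * g
    return G_lam

# Forward-view TD(lambda) update for time-step t:
# values[t] += alpha * (lambda_return(rewards, values, t, lam) - values[t])
```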
Lecture 4: Model-Free Prediction
TD(λ)
Forward View of TD(λ)
TD(λ) Weighting Function
G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^(n)
Lecture 4: Model-Free Prediction
TD(λ)
Forward View of TD(λ)
Forward-view TD(λ)
Update value function towards the λ-return
Forward-view looks into the future to compute G_t^λ
Like MC, can only be computed from complete episodes
Lecture 4: Model-Free Prediction
TD(λ)
Forward View of TD(λ)
Forward-View TD(λ) on Large Random Walk
Lecture 4: Model-Free Prediction
TD(λ)
Backward View of TD(λ)
Backward View TD(λ)
Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
Lecture 4: Model-Free Prediction
TD(λ)
Backward View of TD(λ)
Eligibility Traces
Credit assignment problem: did bell or light cause shock?
Frequency heuristic: assign credit to most frequent states
Recency heuristic: assign credit to most recent states
Eligibility traces combine both heuristics
E0(s) = 0
Et(s) = γλEt−1(s) + 1(St = s)
Lecture 4: Model-Free Prediction
TD(λ)
Backward View of TD(λ)
Backward View TD(λ)
Keep an eligibility trace for every state s
Update value V (s) for every state s
In proportion to TD-error δt and eligibility trace Et(s)
δt = Rt+1 + γV (St+1) − V (St)
V (s) ← V (s) + αδtEt(s)
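A sketch of backward-view TD(λ) with accumulating traces, under the same illustrative env/policy interface as the TD(0) sketch earlier:

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                 # eligibility traces, reset per episode
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            delta = reward + gamma * (0.0 if done else V[next_state]) - V[state]
            E[state] += 1.0                    # accumulate trace for the visited state
            for s in list(E):                  # every traced state moves by alpha*delta*E(s)
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam            # decay all traces
            state = next_state
    return V
```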
Lecture 4: Model-Free Prediction
TD(λ)
Relationship Between Forward and Backward TD
TD(λ) and TD(0)
When λ = 0, only current state is updated
Et(s) = 1(St = s)
V (s) ← V (s) + αδtEt(s)
This is exactly equivalent to TD(0) update
V (St) ← V (St) + αδt
Lecture 4: Model-Free Prediction
TD(λ)
Relationship Between Forward and Backward TD
TD(λ) and MC
When λ = 1, credit is deferred until end of episode
Consider episodic environments with offline updates
Over the course of an episode, total update for TD(1) is the
same as total update for MC
Theorem
The sum of offline updates is identical for forward-view and
backward-view TD(λ)
Σ_{t=1}^{T} α δt Et(s) = Σ_{t=1}^{T} α ( G_t^λ − V(St) ) 1(St = s)
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
MC and TD(1)
Consider an episode where s is visited once at time-step k,
TD(1) eligibility trace discounts time since visit,
Et(s) = γEt−1(s) + 1(St = s) = { 0 if t < k;  γ^{t−k} if t ≥ k }
TD(1) updates accumulate error online
Σ_{t=1}^{T−1} α δt Et(s) = α Σ_{t=k}^{T−1} γ^{t−k} δt = α ( Gk − V(Sk) )
By end of episode it accumulates total error
δk + γδk+1 + γ^2 δk+2 + ... + γ^{T−1−k} δ_{T−1}
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Telescoping in TD(1)
When λ = 1, sum of TD errors telescopes into MC error,
δt + γδt+1 + γ^2 δt+2 + ... + γ^{T−1−t} δ_{T−1}
= Rt+1 + γV(St+1) − V(St)
+ γRt+2 + γ^2 V(St+2) − γV(St+1)
+ γ^2 Rt+3 + γ^3 V(St+3) − γ^2 V(St+2)
...
+ γ^{T−1−t} RT + γ^{T−t} V(ST) − γ^{T−1−t} V(S_{T−1})
= Rt+1 + γRt+2 + γ^2 Rt+3 + ... + γ^{T−1−t} RT − V(St)
= Gt − V(St)
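This identity is easy to confirm numerically for arbitrary reward and value sequences; a small sanity check (not from the lecture):

```python
import random

gamma, T, t = 0.9, 6, 0
R = [random.random() for _ in range(T)]          # R[k] stands for R_{k+1}
V = [random.random() for _ in range(T)] + [0.0]  # V(S_0)..V(S_{T-1}); V(S_T) = 0

delta = [R[k] + gamma * V[k + 1] - V[k] for k in range(T)]        # TD errors
lhs = sum(gamma ** (k - t) * delta[k] for k in range(t, T))        # summed discounted TD errors
G_t = sum(gamma ** (k - t) * R[k] for k in range(t, T))            # MC return from step t
assert abs(lhs - (G_t - V[t])) < 1e-9
```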
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
TD(λ) and TD(1)
TD(1) is roughly equivalent to every-visit Monte-Carlo
Error is accumulated online, step-by-step
If value function is only updated offline at end of episode
Then total update is exactly the same as MC
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Telescoping in TD(λ)
For general λ, TD errors also telescope to the λ-error, G_t^λ − V(St)
G_t^λ − V(St) = −V(St) + (1 − λ)λ^0 (Rt+1 + γV(St+1))
              + (1 − λ)λ^1 (Rt+1 + γRt+2 + γ^2 V(St+2))
              + (1 − λ)λ^2 (Rt+1 + γRt+2 + γ^2 Rt+3 + γ^3 V(St+3))
              + ...
= −V(St) + (γλ)^0 (Rt+1 + γV(St+1) − γλV(St+1))
         + (γλ)^1 (Rt+2 + γV(St+2) − γλV(St+2))
         + (γλ)^2 (Rt+3 + γV(St+3) − γλV(St+3))
         + ...
= (γλ)^0 (Rt+1 + γV(St+1) − V(St))
+ (γλ)^1 (Rt+2 + γV(St+2) − V(St+1))
+ (γλ)^2 (Rt+3 + γV(St+3) − V(St+2))
+ ...
= δt + γλδt+1 + (γλ)^2 δt+2 + ...
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Forwards and Backwards TD(λ)
Consider an episode where s is visited once at time-step k,
TD(λ) eligibility trace discounts time since visit,
Et(s) = γλEt−1(s) + 1(St = s) = { 0 if t < k;  (γλ)^{t−k} if t ≥ k }
Backward TD(λ) updates accumulate error online
Σ_{t=1}^{T} α δt Et(s) = α Σ_{t=k}^{T} (γλ)^{t−k} δt = α ( G_k^λ − V(Sk) )
By end of episode it accumulates total error for λ-return
For multiple visits to s, Et(s) accumulates many errors
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Offline Equivalence of Forward and Backward TD
Offline updates
Updates are accumulated within episode
but applied in batch at the end of episode
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Online Equivalence of Forward and Backward TD
Online updates
TD(λ) updates are applied online at each step within episode
Forward and backward-view TD(λ) are slightly different
NEW: Exact online TD(λ) achieves perfect equivalence
By using a slightly different form of eligibility trace
Sutton and von Seijen, ICML 2014
Lecture 4: Model-Free Prediction
TD(λ)
Forward and Backward Equivalence
Summary of Forward and Backward TD(λ)
Offline updates | λ = 0 | λ ∈ (0, 1)    | λ = 1
Backward view   | TD(0) | TD(λ)         | TD(1)
                |   =   |   =           |   =
Forward view    | TD(0) | Forward TD(λ) | MC

Online updates  | λ = 0              | λ ∈ (0, 1)         | λ = 1
Backward view   | TD(0)              | TD(λ)              | TD(1)
                |   =                |   =                |   =
Forward view    | TD(0)              | Forward TD(λ)      | MC
                |   =                |   =                |   =
                | Exact Online TD(0) | Exact Online TD(λ) | Exact Online TD(1)

= here indicates equivalence in total update at end of episode.