Value Function Geometry and Gradient TD
Ashwin Rao
Stanford University, ICME
October 27, 2018
Ashwin Rao (Stanford) Value Function Geometry October 27, 2018 1 / 19
Overview
1 Motivation and Notation
2 Geometric/Vector Space Representation of Value Functions
3 Solutions with Linear System Formulations
4 Residual Gradient TD
5 Gradient TD
Motivation for understanding Value Function Geometry
Helps us better understand transformations of Value Functions (VFs)
Across the various DP and RL algorithms
Particularly helps when VFs are approximated, esp. with linear approx
Provides insights into stability and convergence
Particularly when dealing with the “Deadly Triad”
Deadly Triad := [Bootstrapping, Func Approx, Off-Policy]
Leads us to Gradient TD
Notation
Assume state space S consists of n states: {s1, s2, . . . , sn}
Action space A consisting of finite number of actions
This exposition extends easily to continuous state/action spaces too
This exposition is for a fixed (often stochastic) policy denoted π(a|s)
VF for a policy π is denoted as vπ : S → R
m feature functions φ1, φ2, . . . , φm : S → R
Feature vector for a state s ∈ S denoted as φ(s) ∈ Rm
For linear function approximation of VF with weights w = (w1, w2, . . . , wm), VF vw : S → R is defined as:
vw(s) = wT · φ(s) = Σ_{j=1}^{m} wj · φj(s) for any s ∈ S
µπ : S → [0, 1] denotes the states’ probability distribution under π
VF Vector Space and VF Linear Approximations
n-dimensional space, with each dim corresponding to a state in S
A vector in this space is a specific VF (typically denoted v): S → R
Each dimension’s coordinate is the VF value for that dimension’s state
Coordinates of vector vπ for policy π are: [vπ(s1), vπ(s2), . . . , vπ(sn)]
Consider m vectors where jth vector is: [φj (s1), φj (s2), . . . , φj (sn)]
These m vectors are the m columns of n × m matrix Φ = [φj (si )]
Their span represents an m-dim subspace within this n-dim space
Spanned by the set of all w = [w1, w2, . . . , wm] ∈ Rm
Vector vw = Φ · w in this subspace has coordinates
[vw(s1), vw(s2), . . . , vw(sn)]
Vector vw is fully specified by w (so we often say w to mean vw)
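As a concrete sketch of this vector-space view (all numbers below are hypothetical), the VF vector vw is just the matrix-vector product Φ · w, and its coordinates are the per-state dot products wT · φ(si):

```python
import numpy as np

# Hypothetical example: n = 3 states, m = 2 feature functions.
# Row i of Phi is the feature vector phi(s_i); column j of Phi is the
# j-th basis vector [phi_j(s_1), ..., phi_j(s_n)] spanning the subspace.
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
w = np.array([2.0, -1.0])

# Coordinates of the VF vector v_w = Phi . w in the n-dim space
v_w = Phi @ w

# Same thing state-by-state: v_w(s_i) = w^T . phi(s_i)
assert all(np.isclose(v_w[i], w @ Phi[i]) for i in range(3))
print(v_w)
```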
Some more notation
Denote r(s, a) as the Expected Reward upon action a in state s
Denote p(s, s′, a) as the probability of transition s → s′ upon action a
Define
R(s) = Σ_{a∈A} π(a|s) · r(s, a)
P(s, s′) = Σ_{a∈A} π(a|s) · p(s, s′, a)
Denote R as the vector [R(s1), R(s2), . . . , R(sn)]
Denote P as the matrix [P(si, si′)], 1 ≤ i, i′ ≤ n
Denote γ as the MDP discount factor
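A minimal sketch of how R and P arise from the model primitives under a fixed policy (the 2-state, 2-action MDP numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: r[s, a] = expected reward; p[s, s2, a] = transition
# probability of s -> s2 under action a; pi[s, a] = policy pi(a|s).
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
p = np.array([[[0.8, 0.3], [0.2, 0.7]],
              [[0.4, 0.9], [0.6, 0.1]]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# R(s) = sum_a pi(a|s) r(s,a);  P(s,s2) = sum_a pi(a|s) p(s,s2,a)
R = (pi * r).sum(axis=1)
P = np.einsum('sa,sta->st', pi, p)

assert np.allclose(P.sum(axis=1), 1.0)   # rows of P are distributions
print(R, P)
```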
Bellman operator Bπ
Bellman operator Bπ for policy π operating on VF vector v defined as:
Bπv = R + γP · v
Note that vπ is the fixed point of operator Bπ (meaning Bπvπ = vπ)
If we start with an arbitrary VF vector v and repeatedly apply Bπ, by
Contraction Mapping Theorem, we will reach the fixed point vπ
This is the Dynamic Programming Policy Evaluation algorithm
Monte Carlo without func approx also converges to vπ (albeit slowly)
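The DP Policy Evaluation iteration above can be sketched as follows, on an assumed small MRP (R, P, γ); the fixed point matches the closed form vπ = (I − γP)−1 · R:

```python
import numpy as np

# Hypothetical 3-state MRP (an MDP under a fixed policy): R, P, gamma.
R = np.array([1.0, 0.5, 2.0])
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])
gamma = 0.9

v = np.zeros(3)                # arbitrary starting VF vector
for _ in range(1000):
    v = R + gamma * P @ v      # apply the Bellman operator B_pi

# The fixed point matches the closed form v_pi = (I - gamma P)^-1 . R
v_pi = np.linalg.solve(np.eye(3) - gamma * P, R)
assert np.allclose(v, v_pi)
```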
Projection operator ΠΦ
First we define a “distance” d(v1, v2) between VF vectors v1, v2
Weighted by µπ across the n dimensions of v1, v2
d(v1, v2) = Σ_{i=1}^{n} µπ(si) · (v1(si) − v2(si))² = (v1 − v2)T · D · (v1 − v2)
where D is the square diagonal matrix consisting of µπ(si ), 1 ≤ i ≤ n
Projection operator for subspace spanned by Φ is denoted as ΠΦ
ΠΦ performs an orthogonal projection of VF vector v on subspace Φ
So we need to minimize d(v, ΠΦv)
This is a weighted least squares regression with solution:
w = (ΦT · D · Φ)−1 · ΦT · D · v
So, the Projection operator ΠΦ can be written as:
ΠΦ = Φ · (ΦT · D · Φ)−1 · ΦT · D
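A quick numeric sketch (with an assumed distribution µπ) confirming two defining properties of ΠΦ: it is idempotent, and the D-weighted residual of a projection is orthogonal to every column of Φ:

```python
import numpy as np

# Hypothetical setup: 3 states, 2 features, assumed distribution mu_pi.
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
D = np.diag([0.3, 0.4, 0.3])          # D = diag(mu_pi)

# Pi_Phi = Phi (Phi^T D Phi)^-1 Phi^T D
Pi_Phi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

v = np.array([1.0, 3.0, -2.0])        # an arbitrary VF vector
v_proj = Pi_Phi @ v

# Projecting twice changes nothing, and the weighted residual is
# orthogonal to the subspace spanned by the columns of Phi
assert np.allclose(Pi_Phi @ v_proj, v_proj)
assert np.allclose(Phi.T @ D @ (v - v_proj), 0.0)
```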
4 VF vectors of interest in the Φ subspace
Note: We will refer to the Φ-subspace VF vectors by their weights w
1 Projection of vπ: wπ = arg minw d(vπ, vw)
This is the VF we seek when doing linear function approximation
Because it is the VF vector “closest” to vπ in the Φ subspace
Monte-Carlo with linear func approx will (slowly) converge to wπ
2 Bellman Error (BE)-minimizing: wBE = arg minw d(Bπvw, vw)
This is the solution to a linear system (covered later)
In model-free setting, Residual Gradient TD Algorithm (covered later)
Cannot learn if we can only access features, and not underlying states
4 VF vectors of interest in the Φ subspace (continued)
3 Projected Bellman Error (PBE)-minimizing:
wPBE = arg minw d((ΠΦ · Bπ)vw, vw)
The minimum is 0, i.e., Φ · wPBE is the fixed point of operator ΠΦ · Bπ
This fixed point is the solution to a linear system (covered later)
Alternatively, if we start with an arbitrary vw and repeatedly apply
ΠΦ · Bπ, we will converge to Φ · wPBE
This is a DP-like process with approximation: we are repeatedly thrown out of the Φ subspace (by applying the Bellman operator Bπ), then land back in the Φ subspace (by applying the Projection operator ΠΦ)
In model-free setting, Gradient TD Algorithms (covered later)
4 Temporal Difference Error (TDE)-minimizing:
wTDE = arg minw Eπ[δ²]
δ is the TD error
Minimizes the expected square of the TD error when following policy π
Naive Residual Gradient TD Algorithm (covered later)
Solution of wBE with a Linear System Formulation
wBE = arg minw d(vw, R + γP · vw)
= arg minw d(Φ · w, R + γP · Φ · w)
= arg minw d(Φ · w − γP · Φ · w, R)
= arg minw d((Φ − γP · Φ) · w, R)
This is a weighted least-squares linear regression of R versus Φ − γP · Φ
with weights µπ, whose solution is:
wBE = ((Φ − γP · Φ)T · D · (Φ − γP · Φ))−1 · (Φ − γP · Φ)T · D · R
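A sketch of this weighted least-squares solution on assumed small arrays (the MRP, features, and µπ numbers are hypothetical):

```python
import numpy as np

# Hypothetical MRP, features, and state distribution.
R = np.array([1.0, 0.5, 2.0])
P = np.array([[0.1, 0.6, 0.3], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
D = np.diag([0.3, 0.4, 0.3])

# Weighted least-squares regression of R against X = Phi - gamma P Phi
X = Phi - gamma * P @ Phi
w_BE = np.linalg.solve(X.T @ D @ X, X.T @ D @ R)

# Sanity check: w_BE satisfies the weighted normal equations
assert np.allclose(X.T @ D @ (X @ w_BE - R), 0.0)
```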
Solution of wPBE with a Linear System Formulation
Φ · wPBE is the fixed point of operator ΠΦ · Bπ. We know:
ΠΦ = Φ · (ΦT · D · Φ)−1 · ΦT · D
Bπv = R + γP · v
Therefore,
Φ · (ΦT · D · Φ)−1 · ΦT · D · (R + γP · Φ · wPBE) = Φ · wPBE
Since columns of Φ are assumed to be independent (full rank),
(ΦT · D · Φ)−1 · ΦT · D · (R + γP · Φ · wPBE) = wPBE
ΦT · D · (R + γP · Φ · wPBE) = ΦT · D · Φ · wPBE
ΦT · D · (Φ − γP · Φ) · wPBE = ΦT · D · R
This is a square linear system of the form A · wPBE = b whose solution is:
wPBE = A−1 · b = (ΦT · D · (Φ − γP · Φ))−1 · ΦT · D · R
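A sketch of solving this linear system and verifying the fixed-point property of ΠΦ · Bπ (all numbers assumed):

```python
import numpy as np

# Hypothetical MRP, features, and state distribution.
R = np.array([1.0, 0.5, 2.0])
P = np.array([[0.1, 0.6, 0.3], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
D = np.diag([0.3, 0.4, 0.3])

# A w = b with A = Phi^T D (Phi - gamma P Phi), b = Phi^T D R
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ R
w_PBE = np.linalg.solve(A, b)

# Verify the fixed-point property: Pi_Phi B_pi (Phi w) = Phi w
Pi_Phi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
v = Phi @ w_PBE
assert np.allclose(Pi_Phi @ (R + gamma * P @ v), v)
```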
Model-Free Learning of wPBE
How do we construct matrix A = ΦT · D · (Φ − γP · Φ) and vector
b = ΦT · D · R without a model?
Following policy π, each time we perform a model-free transition from
s to s′ getting reward r, we get a sample estimate of A and b
Estimate of A is the outer-product of vectors φ(s) and φ(s) − γ · φ(s′)
Estimate of b is scalar r times vector φ(s)
Average these estimates across many such model-free transitions
This algorithm is called Least Squares Temporal Difference (LSTD)
Alternative: Semi-Gradient Temporal Difference (TD) descent
This semi-gradient descent converges to wPBE with weight updates:
∆w = α · (r + γ · wT · φ(s′) − wT · φ(s)) · φ(s)
r + γ · wT · φ(s′) − wT · φ(s) is the TD error, denoted δ
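A minimal LSTD sketch on an assumed 3-state chain: transitions are sampled on-policy, and for simplicity the expected reward R(s) is used as the reward sample (a simplifying assumption, not part of the original slides):

```python
import numpy as np

# Hypothetical 3-state chain with 2 features (numbers assumed).
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])
R = np.array([1.0, 0.5, 2.0])
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

rng = np.random.default_rng(0)
A_hat, b_hat = np.zeros((2, 2)), np.zeros(2)
s = 0
for t in range(1, 50001):
    s2 = rng.choice(3, p=P[s])          # sample next state on-policy
    r = R[s]                            # expected reward as the sample
    # Per-transition estimates of A and b, averaged incrementally
    A_hat += (np.outer(Phi[s], Phi[s] - gamma * Phi[s2]) - A_hat) / t
    b_hat += (r * Phi[s] - b_hat) / t
    s = s2
w_lstd = np.linalg.solve(A_hat, b_hat)
print(w_lstd)   # approaches w_PBE as the number of transitions grows
```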
Naive Residual Gradient Algorithm to solve for wTDE
We defined wTDE as the vector in the Φ subspace that minimizes the
expected square of the TD error δ when following policy π.
wTDE = arg minw Σ_{s∈S} µπ(s) Σ_{r,s′} probπ(r, s′|s) · (r + γ · wT · φ(s′) − wT · φ(s))²
To perform SGD, we have to estimate the gradient of the expected
square of TD error by sampling
The weight update for each sample in the SGD will be:
∆w = −(1/2) · α · ∇w(r + γ · wT · φ(s′) − wT · φ(s))²
= α · (r + γ · wT · φ(s′) − wT · φ(s)) · (φ(s) − γ · φ(s′))
This algorithm (named Naive Residual Gradient) converges robustly,
but not to a desirable place
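The per-sample update can be sketched as follows (the transition and step size below are hypothetical):

```python
import numpy as np

def naive_residual_gradient_update(w, phi_s, phi_s2, r, gamma, alpha):
    """One SGD step on the squared TD error for a single sampled
    transition (s, r, s2) with feature vectors phi_s, phi_s2."""
    delta = r + gamma * w @ phi_s2 - w @ phi_s
    return w + alpha * delta * (phi_s - gamma * phi_s2)

# Hypothetical single transition
w = np.zeros(2)
w = naive_residual_gradient_update(w, np.array([1.0, 0.0]),
                                   np.array([0.0, 1.0]), r=1.0,
                                   gamma=0.9, alpha=0.1)
print(w)
```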
Residual Gradient Algorithm to solve for wBE
We defined wBE as the vector in the Φ subspace that minimizes BE
But BE for a state is the expected TD error in that state
So we want to do SGD with gradient of square of expected TD error
∆w = −(1/2) · α · ∇w(Eπ[δ])²
= −α · Eπ[r + γ · wT · φ(s′) − wT · φ(s)] · ∇w Eπ[δ]
= α · (Eπ[r + γ · wT · φ(s′)] − wT · φ(s)) · (φ(s) − γ · Eπ[φ(s′)])
This is called the Residual Gradient algorithm
Requires two independent samples of s′ transitioning from s
In that case, converges to wBE robustly (even for non-linear approx)
But it is slow, and doesn’t converge to a desirable place
Cannot learn if we can only access features, and not underlying states
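A sketch of one Residual Gradient update, where phi_s2_a and phi_s2_b stand for the feature vectors of the two independently sampled next states from s (names and numbers are hypothetical); δ is computed from one sample and the gradient factor from the other, so their product is an unbiased estimate of Eπ[δ] · ∇w Eπ[δ]:

```python
import numpy as np

def residual_gradient_update(w, phi_s, phi_s2_a, phi_s2_b, r_a,
                             gamma, alpha):
    """One update using two independent next-state samples out of s:
    the TD error uses sample a, the gradient factor uses sample b."""
    delta_a = r_a + gamma * w @ phi_s2_a - w @ phi_s
    return w + alpha * delta_a * (phi_s - gamma * phi_s2_b)

w = np.zeros(2)
w = residual_gradient_update(w, np.array([1.0, 0.0]),
                             np.array([0.0, 1.0]), np.array([0.5, 0.5]),
                             r_a=1.0, gamma=0.9, alpha=0.1)
print(w)
```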
Gradient TD Algorithms to solve for wPBE
For on-policy linear func approx, semi-gradient TD works
For non-linear func approx or off-policy, we need Gradient TD
GTD: The original Gradient TD algorithm
GTD-2: Second-generation GTD
TDC: TD with Gradient correction
We need to set up the loss function whose gradient will drive SGD
wPBE = arg minw d(ΠΦBπvw, vw) = arg minw d(ΠΦBπvw, ΠΦvw)
So we define the loss function (denoting Bπvw − vw as δw) as:
L(w) = (ΠΦ · δw)T · D · (ΠΦ · δw) = δwT · ΠΦT · D · ΠΦ · δw
= δwT · (Φ · (ΦT · D · Φ)−1 · ΦT · D)T · D · (Φ · (ΦT · D · Φ)−1 · ΦT · D) · δw
= δwT · (D · Φ · (ΦT · D · Φ)−1 · ΦT) · D · (Φ · (ΦT · D · Φ)−1 · ΦT · D) · δw
= (δwT · D · Φ) · (ΦT · D · Φ)−1 · (ΦT · D · Φ) · (ΦT · D · Φ)−1 · (ΦT · D · δw)
= (ΦT · D · δw)T · (ΦT · D · Φ)−1 · (ΦT · D · δw)
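The final identity can be checked numerically on assumed small arrays (d below stands in for δw = Bπvw − vw):

```python
import numpy as np

# Check: (Pi_Phi d)^T D (Pi_Phi d) == (Phi^T D d)^T (Phi^T D Phi)^-1 (Phi^T D d)
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
D = np.diag([0.3, 0.4, 0.3])
d = np.array([0.7, -1.2, 0.4])        # stands in for delta_w

Pi_Phi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
lhs = (Pi_Phi @ d) @ D @ (Pi_Phi @ d)
c = Phi.T @ D @ d
rhs = c @ np.linalg.inv(Phi.T @ D @ Phi) @ c
assert np.isclose(lhs, rhs)
```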
TDC Algorithm to solve for wPBE
We derive the TDC Algorithm based on ∇w L(w)
∇w L(w) = 2 · (∇w(ΦT · D · δw)T) · (ΦT · D · Φ)−1 · (ΦT · D · δw)
Now we express each of these 3 terms as expectations over model-free
transitions s → (r, s′) (with states s occurring under distribution µ), denoting r + γ · wT · φ(s′) − wT · φ(s) as δ
ΦT · D · δw = E[δ · φ(s)]
∇w(ΦT · D · δw)T = ∇w(E[δ · φ(s)])T = E[(∇w δ) · φ(s)T] = E[(γ · φ(s′) − φ(s)) · φ(s)T]
ΦT · D · Φ = E[φ(s) · φ(s)T]
Substituting, we get:
∇w L(w) = 2 · E[(γ · φ(s′) − φ(s)) · φ(s)T] · E[φ(s) · φ(s)T]−1 · E[δ · φ(s)]
Weight Updates of TDC Algorithm
∆w = −(1/2) · α · ∇w L(w)
= α · E[(φ(s) − γ · φ(s′)) · φ(s)T] · E[φ(s) · φ(s)T]−1 · E[δ · φ(s)]
= α · (E[φ(s) · φ(s)T] − γ · E[φ(s′) · φ(s)T]) · E[φ(s) · φ(s)T]−1 · E[δ · φ(s)]
= α · (E[δ · φ(s)] − γ · E[φ(s′) · φ(s)T] · E[φ(s) · φ(s)T]−1 · E[δ · φ(s)])
= α · (E[δ · φ(s)] − γ · E[φ(s′) · φ(s)T] · θ)
where θ = E[φ(s) · φ(s)T ]−1 · E[δ · φ(s)] is the solution to a weighted
least-squares linear regression of Bπv − v against Φ, with weights as µπ.
Cascade Learning: Update both w and θ (θ converging faster)
∆w = α · δ · φ(s) − α · γ · φ(s′) · (θT · φ(s))
∆θ = β · (δ − θT · φ(s)) · φ(s)
Note: θT · φ(s) operates as estimate of TD error δ for current state s
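The cascade (two-timescale) updates can be sketched as follows, on a single hypothetical transition with assumed step sizes:

```python
import numpy as np

def tdc_update(w, theta, phi_s, phi_s2, r, gamma, alpha, beta):
    """One TDC cascade step for a sampled transition (s, r, s2):
    w moves with the corrected TD update (using the current theta),
    then theta moves toward the TD error on a faster timescale beta."""
    delta = r + gamma * w @ phi_s2 - w @ phi_s
    w = w + alpha * delta * phi_s - alpha * gamma * phi_s2 * (theta @ phi_s)
    theta = theta + beta * (delta - theta @ phi_s) * phi_s
    return w, theta

w, theta = np.zeros(2), np.zeros(2)
w, theta = tdc_update(w, theta, np.array([1.0, 0.0]),
                      np.array([0.0, 1.0]), r=1.0, gamma=0.9,
                      alpha=0.1, beta=0.5)
print(w, theta)
```

Note that the w update uses the pre-update θ, matching the cascade structure above.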
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Value Function Geometry and Gradient TD

  • 1. Value Function Geometry and Gradient TD
       Ashwin Rao
       Stanford University, ICME
       October 27, 2018
  • 2. Overview
       1 Motivation and Notation
       2 Geometric/Vector Space Representation of Value Functions
       3 Solutions with Linear System Formulations
       4 Residual Gradient TD
       5 Gradient TD
  • 3. Motivation for understanding Value Function Geometry
       Helps us better understand transformations of Value Functions (VFs)
       across the various DP and RL algorithms
       Particularly helps when VFs are approximated, especially with linear approximation
       Provides insights into stability and convergence,
       particularly when dealing with the “Deadly Triad”
       Deadly Triad := [Bootstrapping, Function Approximation, Off-Policy]
       Leads us to Gradient TD
  • 4. Notation
       Assume state space S consists of n states: {s_1, s_2, ..., s_n}
       Action space A consists of a finite number of actions
       This exposition extends easily to continuous state/action spaces too
       This exposition is for a fixed (often stochastic) policy denoted π(a|s)
       The VF for a policy π is denoted v_π : S → R
       m feature functions φ_1, φ_2, ..., φ_m : S → R
       The feature vector for a state s ∈ S is denoted φ(s) ∈ R^m
       For linear function approximation of the VF with weights w = (w_1, w_2, ..., w_m),
       the VF v_w : S → R is defined as:
       v_w(s) = w^T · φ(s) = Σ_{j=1}^{m} w_j · φ_j(s)  for any s ∈ S
       μ_π : S → [0, 1] denotes the states' probability distribution under π
  • 5. VF Vector Space and VF Linear Approximations
       An n-dimensional space, with each dimension corresponding to a state in S
       A vector in this space is a specific VF (typically denoted v): S → R
       Each dimension's coordinate is the VF value for that dimension's state
       Coordinates of vector v_π for policy π are: [v_π(s_1), v_π(s_2), ..., v_π(s_n)]
       Consider m vectors where the j-th vector is: [φ_j(s_1), φ_j(s_2), ..., φ_j(s_n)]
       These m vectors are the m columns of the n × m matrix Φ = [φ_j(s_i)]
       Their span represents an m-dimensional subspace within this n-dimensional space,
       parameterized by the set of all w = [w_1, w_2, ..., w_m] ∈ R^m
       The vector v_w = Φ · w in this subspace has coordinates [v_w(s_1), v_w(s_2), ..., v_w(s_n)]
       The vector v_w is fully specified by w (so we often say w to mean v_w)
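The Φ matrix and the subspace vector v_w = Φ · w can be sketched numerically. This is a minimal illustration with made-up random features (the sizes n = 5, m = 2 and all values are assumptions for the example, not part of the slides):

```python
import numpy as np

np.random.seed(0)
n, m = 5, 2                  # 5 states, 2 features (illustrative sizes)
Phi = np.random.rand(n, m)   # Phi[i, j] = phi_j(s_i): j-th feature at i-th state
w = np.array([0.7, -0.3])    # weight vector in R^m

v_w = Phi @ w                # coordinates [v_w(s_1), ..., v_w(s_n)] in VF space
```

Each coordinate of `v_w` is the inner product w^T · φ(s_i), so the whole VF vector lives in the m-dimensional column span of Φ.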
  • 6. Some more notation
       Denote r(s, a) as the Expected Reward upon action a in state s
       Denote p(s, s′, a) as the probability of transition s → s′ upon action a
       Define R(s) = Σ_{a∈A} π(a|s) · r(s, a)
       and P(s, s′) = Σ_{a∈A} π(a|s) · p(s, s′, a)
       Denote R as the vector [R(s_1), R(s_2), ..., R(s_n)]
       Denote P as the matrix [P(s_i, s_{i′})], 1 ≤ i, i′ ≤ n
       Denote γ as the MDP discount factor
  • 7. Bellman operator B_π
       The Bellman operator B_π for policy π operating on a VF vector v is defined as:
       B_π v = R + γP · v
       Note that v_π is the fixed point of the operator B_π (meaning B_π v_π = v_π)
       If we start with an arbitrary VF vector v and repeatedly apply B_π, then by the
       Contraction Mapping Theorem, we will reach the fixed point v_π
       This is the Dynamic Programming Policy Evaluation algorithm
       Monte Carlo without function approximation also converges to v_π (albeit slowly)
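The fixed-point iteration of B_π can be checked in a few lines. A minimal sketch, assuming a small random MDP (the R, P, γ values here are invented for illustration); the iterated value agrees with the exact fixed point v_π = (I − γP)^{-1} · R:

```python
import numpy as np

np.random.seed(1)
n, gamma = 5, 0.9
R = np.random.rand(n)                      # expected one-step rewards under pi
P = np.random.rand(n, n)
P /= P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix under pi

def bellman_op(v):
    return R + gamma * P @ v               # B_pi v = R + gamma * P * v

v = np.zeros(n)                            # arbitrary starting VF vector
for _ in range(1000):                      # repeated application contracts to v_pi
    v = bellman_op(v)

v_pi = np.linalg.solve(np.eye(n) - gamma * P, R)   # exact fixed point
```

By the Contraction Mapping Theorem the error shrinks by a factor γ per application, so after 1000 iterations `v` and `v_pi` coincide to machine precision.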
  • 8. Projection operator Π_Φ
       First we define the “distance” d(v_1, v_2) between VF vectors v_1, v_2,
       weighted by μ_π across the n dimensions of v_1, v_2:
       d(v_1, v_2) = Σ_{i=1}^{n} μ_π(s_i) · (v_1(s_i) − v_2(s_i))^2 = (v_1 − v_2)^T · D · (v_1 − v_2)
       where D is the square diagonal matrix consisting of μ_π(s_i), 1 ≤ i ≤ n
       The Projection operator for the subspace spanned by Φ is denoted Π_Φ
       Π_Φ performs an orthogonal projection of a VF vector v on the Φ subspace
       So we need to minimize d(v, Π_Φ v)
       This is a weighted least squares regression with solution:
       w = (Φ^T · D · Φ)^{-1} · Φ^T · D · v
       So the Projection operator Π_Φ can be written as:
       Π_Φ = Φ · (Φ^T · D · Φ)^{-1} · Φ^T · D
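The formula for Π_Φ can be verified numerically: the projection is idempotent, and the residual v − Π_Φ v is D-orthogonal to the columns of Φ. A sketch with invented random data (sizes and values are assumptions for illustration):

```python
import numpy as np

np.random.seed(2)
n, m = 5, 2
Phi = np.random.rand(n, m)
mu = np.random.rand(n)
mu /= mu.sum()                 # state distribution mu_pi
D = np.diag(mu)                # diagonal weight matrix

# Projection operator: Pi_Phi = Phi (Phi^T D Phi)^{-1} Phi^T D
Pi_Phi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

v = np.random.rand(n)          # arbitrary VF vector
proj = Pi_Phi @ v              # its D-orthogonal projection onto the Phi subspace
```

Idempotence (projecting twice changes nothing) and D-orthogonality of the residual are exactly the defining properties of a weighted least-squares projection.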
  • 9. 4 VF vectors of interest in the Φ subspace
       Note: We will refer to the Φ-subspace VF vectors by their weights w
       1 Projection of v_π: w_π = arg min_w d(v_π, v_w)
         This is the VF we seek when doing linear function approximation,
         because it is the VF vector “closest” to v_π in the Φ subspace
         Monte-Carlo with linear function approximation will (slowly) converge to w_π
       2 Bellman Error (BE)-minimizing: w_BE = arg min_w d(B_π v_w, v_w)
         This is the solution to a linear system (covered later)
         In a model-free setting: the Residual Gradient TD Algorithm (covered later)
         Cannot be learned if we can only access features, and not underlying states
  • 10. 4 VF vectors of interest in the Φ subspace (continued)
       3 Projected Bellman Error (PBE)-minimizing: w_PBE = arg min_w d((Π_Φ · B_π) v_w, v_w)
         The minimum is 0, i.e., Φ · w_PBE is the fixed point of the operator Π_Φ · B_π
         This fixed point is the solution to a linear system (covered later)
         Alternatively, if we start with an arbitrary v_w and repeatedly apply Π_Φ · B_π,
         we will converge to Φ · w_PBE
         This is a DP-like process with approximation: repeatedly thrown out of the
         Φ subspace (applying the Bellman operator B_π), followed by landing back in
         the Φ subspace (applying the Projection operator Π_Φ)
         In a model-free setting: Gradient TD Algorithms (covered later)
       4 Temporal Difference Error (TDE)-minimizing: w_TDE = arg min_w E_π[δ^2]
         where δ is the TD error
         Minimizes the expected square of the TD error when following policy π
         The Naive Residual Gradient TD Algorithm (covered later)
  • 11. (figure slide; the diagram was not captured in the text extraction)
  • 12. Solution of w_BE with a Linear System Formulation
       w_BE = arg min_w d(v_w, R + γP · v_w)
            = arg min_w d(Φ · w, R + γP · Φ · w)
            = arg min_w d(Φ · w − γP · Φ · w, R)
            = arg min_w d((Φ − γP · Φ) · w, R)
       This is a weighted least-squares linear regression of R versus Φ − γP · Φ
       with weights μ_π, whose solution is:
       w_BE = ((Φ − γP · Φ)^T · D · (Φ − γP · Φ))^{-1} · (Φ − γP · Φ)^T · D · R
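The closed-form solution above is just a weighted least-squares solve with regressor matrix X = Φ − γP · Φ. A sketch with invented random data (all sizes and values are assumptions for illustration); at the least-squares minimum the gradient of the weighted squared error vanishes:

```python
import numpy as np

np.random.seed(3)
n, m, gamma = 5, 2, 0.9
Phi = np.random.rand(n, m)
P = np.random.rand(n, n)
P /= P.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
R = np.random.rand(n)                  # expected rewards
mu = np.random.rand(n)
mu /= mu.sum()
D = np.diag(mu)

X = Phi - gamma * P @ Phi              # regressors: Phi - gamma * P * Phi
# w_BE = (X^T D X)^{-1} X^T D R  -- weighted least-squares of R versus X
w_BE = np.linalg.solve(X.T @ D @ X, X.T @ D @ R)

# gradient of ||X w - R||_D^2 at the minimizer
grad = 2 * X.T @ D @ (X @ w_BE - R)
```

The vanishing gradient confirms `w_BE` minimizes the μ_π-weighted Bellman Error over the Φ subspace.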
  • 13. Solution of w_PBE with a Linear System Formulation
       Φ · w_PBE is the fixed point of the operator Π_Φ · B_π. We know:
       Π_Φ = Φ · (Φ^T · D · Φ)^{-1} · Φ^T · D
       B_π v = R + γP · v
       Therefore:
       Φ · (Φ^T · D · Φ)^{-1} · Φ^T · D · (R + γP · Φ · w_PBE) = Φ · w_PBE
       Since the columns of Φ are assumed to be independent (full rank):
       (Φ^T · D · Φ)^{-1} · Φ^T · D · (R + γP · Φ · w_PBE) = w_PBE
       Φ^T · D · (R + γP · Φ · w_PBE) = Φ^T · D · Φ · w_PBE
       Φ^T · D · (Φ − γP · Φ) · w_PBE = Φ^T · D · R
       This is a square linear system of the form A · w_PBE = b, whose solution is:
       w_PBE = A^{-1} · b = (Φ^T · D · (Φ − γP · Φ))^{-1} · Φ^T · D · R
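The derivation can be checked end to end: solve the square system A · w = b and then verify that Φ · w_PBE really is a fixed point of Π_Φ · B_π. A sketch with invented random data (sizes and values are assumptions for illustration):

```python
import numpy as np

np.random.seed(4)
n, m, gamma = 5, 2, 0.9
Phi = np.random.rand(n, m)
P = np.random.rand(n, n)
P /= P.sum(axis=1, keepdims=True)
R = np.random.rand(n)
mu = np.random.rand(n)
mu /= mu.sum()
D = np.diag(mu)

A = Phi.T @ D @ (Phi - gamma * P @ Phi)   # A = Phi^T D (Phi - gamma P Phi)
b = Phi.T @ D @ R                         # b = Phi^T D R
w_PBE = np.linalg.solve(A, b)

# fixed-point check: Pi_Phi B_pi (Phi w_PBE) == Phi w_PBE
Pi_Phi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
v = Phi @ w_PBE
fixed_point_residual = Pi_Phi @ (R + gamma * P @ v) - v
```

The residual is zero to machine precision, confirming the linear-system formulation.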
  • 14. Model-Free Learning of w_PBE
       How do we construct the matrix A = Φ^T · D · (Φ − γP · Φ) and the vector
       b = Φ^T · D · R without a model?
       Following policy π, each time we perform a model-free transition from s to s′
       getting reward r, we get a sample estimate of A and b:
       The estimate of A is the outer product of the vectors φ(s) and φ(s) − γ · φ(s′)
       The estimate of b is the scalar r times the vector φ(s)
       Average these estimates across many such model-free transitions
       This algorithm is called Least Squares Temporal Difference (LSTD)
       Alternative: Semi-Gradient Temporal Difference (TD) descent
       This semi-gradient descent converges to w_PBE with weight updates:
       ∆w = α · (r + γ · w^T · φ(s′) − w^T · φ(s)) · φ(s)
       r + γ · w^T · φ(s′) − w^T · φ(s) is the TD error, denoted δ
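The LSTD construction of A and b from sampled transitions can be sketched directly; the helper `lstd`, the toy feature map, and the transition tuples below are all invented for illustration:

```python
import numpy as np

def lstd(transitions, phi, gamma):
    """Accumulate sample estimates of A and b over (s, r, s') transitions,
    then solve A w = b for w_PBE."""
    m = len(phi(transitions[0][0]))
    A = np.zeros((m, m))
    b = np.zeros(m)
    for s, r, s_next in transitions:
        x, x_next = phi(s), phi(s_next)
        A += np.outer(x, x - gamma * x_next)   # outer product phi(s) (phi(s) - gamma phi(s'))^T
        b += r * x                             # r times phi(s)
    return np.linalg.solve(A, b)

# toy two-state example with hypothetical features phi(s) = [1, s]
phi = lambda s: np.array([1.0, float(s)])
transitions = [(0, 1.0, 1), (1, 0.5, 0), (0, 1.0, 1), (1, 0.5, 0)]
w = lstd(transitions, phi, gamma=0.9)
```

Summing the sample estimates rather than averaging them gives the same solution, since the normalization cancels in A^{-1} · b.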
  • 15. Naive Residual Gradient Algorithm to solve for w_TDE
       We defined w_TDE as the vector in the Φ subspace that minimizes the expected
       square of the TD error δ when following policy π:
       w_TDE = arg min_w Σ_{s∈S} μ_π(s) Σ_{r,s′} prob_π(r, s′|s) · (r + γ · w^T · φ(s′) − w^T · φ(s))^2
       To perform SGD, we have to estimate the gradient of the expected square of the
       TD error by sampling
       The weight update for each sample in the SGD will be:
       ∆w = −(1/2) · α · ∇_w (r + γ · w^T · φ(s′) − w^T · φ(s))^2
          = α · (r + γ · w^T · φ(s′) − w^T · φ(s)) · (φ(s) − γ · φ(s′))
       This algorithm (named Naive Residual Gradient) converges robustly,
       but not to a desirable place
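The per-sample SGD update above is a one-liner. A minimal sketch (the function name and the numeric inputs below are invented for illustration):

```python
import numpy as np

def naive_residual_gradient_update(w, phi_s, phi_s_next, r, gamma, alpha):
    """One SGD step on the squared TD error (Naive Residual Gradient)."""
    delta = r + gamma * w @ phi_s_next - w @ phi_s   # TD error
    return w + alpha * delta * (phi_s - gamma * phi_s_next)

# single illustrative step from w = 0 with made-up features and reward
w = np.zeros(2)
w = naive_residual_gradient_update(
    w, phi_s=np.array([1.0, 0.0]), phi_s_next=np.array([0.0, 1.0]),
    r=1.0, gamma=0.9, alpha=0.1)
```

Note that the gradient multiplies δ by φ(s) − γ · φ(s′), not by φ(s) alone; that extra −γ · φ(s′) term is precisely what distinguishes this from the semi-gradient TD update.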
  • 16. Residual Gradient Algorithm to solve for w_BE
       We defined w_BE as the vector in the Φ subspace that minimizes the BE
       But the BE for a state is the expected TD error in that state
       So we want to do SGD with the gradient of the square of the expected TD error:
       ∆w = −(1/2) · α · ∇_w (E_π[δ])^2
          = −α · E_π[r + γ · w^T · φ(s′) − w^T · φ(s)] · ∇_w E_π[δ]
          = α · (E_π[r + γ · w^T · φ(s′)] − w^T · φ(s)) · (φ(s) − γ · E_π[φ(s′)])
       This is called the Residual Gradient algorithm
       It requires two independent samples of s′ transitioning from s
       In that case, it converges to w_BE robustly (even for non-linear approximations)
       But it is slow, and it doesn't converge to a desirable place
       Cannot be learned if we can only access features, and not underlying states
  • 17. Gradient TD Algorithms to solve for w_PBE
       For on-policy linear function approximation, semi-gradient TD works
       For non-linear function approximation or off-policy, we need Gradient TD:
       GTD: the original Gradient TD algorithm
       GTD-2: second-generation GTD
       TDC: TD with Gradient Correction
       We need to set up the loss function whose gradient will drive SGD
       w_PBE = arg min_w d(Π_Φ B_π v_w, v_w) = arg min_w d(Π_Φ B_π v_w, Π_Φ v_w)
       So we define the loss function (denoting B_π v_w − v_w as δ_w) as:
       L(w) = (Π_Φ δ_w)^T · D · (Π_Φ δ_w) = δ_w^T · Π_Φ^T · D · Π_Φ · δ_w
            = δ_w^T · (Φ · (Φ^T·D·Φ)^{-1} · Φ^T · D)^T · D · (Φ · (Φ^T·D·Φ)^{-1} · Φ^T · D) · δ_w
            = δ_w^T · (D · Φ · (Φ^T·D·Φ)^{-1} · Φ^T) · D · (Φ · (Φ^T·D·Φ)^{-1} · Φ^T · D) · δ_w
            = (δ_w^T · D · Φ) · (Φ^T·D·Φ)^{-1} · (Φ^T·D·Φ) · (Φ^T·D·Φ)^{-1} · (Φ^T · D · δ_w)
            = (Φ^T · D · δ_w)^T · (Φ^T·D·Φ)^{-1} · (Φ^T · D · δ_w)
  • 18. TDC Algorithm to solve for w_PBE
       We derive the TDC Algorithm based on ∇_w L(w):
       ∇_w L(w) = 2 · (∇_w (Φ^T · D · δ_w)^T) · (Φ^T · D · Φ)^{-1} · (Φ^T · D · δ_w)
       Now we express each of these 3 terms as expectations over model-free transitions
       s → (r, s′) with s distributed as μ_π, denoting r + γ · w^T · φ(s′) − w^T · φ(s) as δ:
       Φ^T · D · δ_w = E[δ · φ(s)]
       ∇_w (Φ^T · D · δ_w)^T = ∇_w (E[δ · φ(s)])^T = E[(∇_w δ) · φ(s)^T]
                             = E[(γ · φ(s′) − φ(s)) · φ(s)^T]
       Φ^T · D · Φ = E[φ(s) · φ(s)^T]
       Substituting, we get:
       ∇_w L(w) = 2 · E[(γ · φ(s′) − φ(s)) · φ(s)^T] · E[φ(s) · φ(s)^T]^{-1} · E[δ · φ(s)]
  • 19. Weight Updates of TDC Algorithm
       ∆w = −(1/2) · α · ∇_w L(w)
          = α · E[(φ(s) − γ · φ(s′)) · φ(s)^T] · E[φ(s) · φ(s)^T]^{-1} · E[δ · φ(s)]
          = α · (E[φ(s) · φ(s)^T] − γ · E[φ(s′) · φ(s)^T]) · E[φ(s) · φ(s)^T]^{-1} · E[δ · φ(s)]
          = α · (E[δ · φ(s)] − γ · E[φ(s′) · φ(s)^T] · E[φ(s) · φ(s)^T]^{-1} · E[δ · φ(s)])
          = α · (E[δ · φ(s)] − γ · E[φ(s′) · φ(s)^T] · θ)
       where θ = E[φ(s) · φ(s)^T]^{-1} · E[δ · φ(s)] is the solution to a weighted
       least-squares linear regression of B_π v − v against Φ, with weights μ_π
       Cascade Learning: update both w and θ (with θ converging faster):
       ∆w = α · δ · φ(s) − α · γ · φ(s′) · (θ^T · φ(s))
       ∆θ = β · (δ − θ^T · φ(s)) · φ(s)
       Note: θ^T · φ(s) operates as an estimate of the TD error δ for the current state s
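The cascade updates for w and θ translate directly into a per-transition step. A minimal sketch (the function name and the numeric inputs below are invented for illustration):

```python
import numpy as np

def tdc_update(w, theta, phi_s, phi_s_next, r, gamma, alpha, beta):
    """One TDC step: main weights w plus the auxiliary (cascade) weights theta."""
    delta = r + gamma * w @ phi_s_next - w @ phi_s        # TD error
    # Delta w = alpha * delta * phi(s) - alpha * gamma * phi(s') * (theta^T phi(s))
    w_new = w + alpha * (delta * phi_s - gamma * phi_s_next * (theta @ phi_s))
    # Delta theta = beta * (delta - theta^T phi(s)) * phi(s)
    theta_new = theta + beta * (delta - theta @ phi_s) * phi_s
    return w_new, theta_new

# single illustrative step with made-up features and reward
w, theta = np.zeros(2), np.zeros(2)
w, theta = tdc_update(
    w, theta, phi_s=np.array([1.0, 0.0]), phi_s_next=np.array([0.0, 1.0]),
    r=1.0, gamma=0.9, alpha=0.1, beta=0.5)
```

In practice β is chosen larger than α so that θ tracks its least-squares target on a faster timescale than w, which is what makes the two-timescale convergence argument go through.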