A MM framework for RL
SungYub Kim
Management Science/Optimization Lab
Department of Industrial Engineering
Seoul National University
January 6, 2018
SungYub Kim January 6, 2018 1 / 26
Limitations of policy gradients
MM framework?
MM framework for RL
Natural policy gradients
MM algorithms for RL
1. Trust region policy optimization (TRPO)
2. Actor-critic using Kronecker-Factored Trust Region (ACKTR)
3. Proximal policy optimization (PPO)
1. Kakade, S., & Langford, J. (2002, July). Approximately optimal approximate
reinforcement learning. In ICML (Vol. 2, pp. 267-274).
2. Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information
processing systems (pp. 1531-1538).
3. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region
policy optimization. In Proceedings of the 32nd International Conference on Machine
Learning (ICML-15) (pp. 1889-1897).
4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347.
5. Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region
method for deep reinforcement learning using Kronecker-factored approximation. arXiv
preprint arXiv:1708.05144.
6. Martens, J., & Grosse, R. (2015, June). Optimizing neural networks with
Kronecker-factored approximate curvature. In International Conference on Machine
Learning (pp. 2408-2417).
7. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained Policy
Optimization. arXiv preprint arXiv:1705.10528.
8. UC Berkeley CS294 2017 Fall Lecture note Oct 11
Policy Gradients Review
Definition 1.1 (Markov Decision Process)
A Markov decision process is a tuple $(S, A, T, R, \gamma)$, where
1. $S$ is a set of states,
2. $A$ is a set of actions,
3. $T^a_{s,s'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the transition probability that action $a$ in state $s$ at time $t$ leads to state $s'$ at time $t+1$,
4. $R^a_s = E[R_t \mid S_t = s, A_t = a]$ is the expected reward received when the agent in state $s$ chooses action $a$,
5. $\gamma \in [0, 1]$ is the discount factor, which represents the difference in importance between future rewards and present rewards.
Definition 1.2
A policy π is a distribution over actions given states,
π(s, a) = P[At = a|St = s].
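Definition 1.3
A (discounted) return is the sum of the discounted rewards:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.$$
We want to maximize the expected return. Therefore, the optimization problem is
$$\max_\theta J(\pi_\theta), \qquad J(\pi_\theta) = E_{\tau\sim\pi_\theta}[G_0].$$
For first-order optimization, we need the gradient of the objective function:
$$\nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\Big[G_0 \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(s_t, a_t)\Big] = E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, A^{\pi_\theta}(s_t, a_t)\Big].$$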
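The score-function form of the gradient can be sanity-checked numerically. The sketch below is an illustration only, under assumptions not in the slides: a one-step problem (so $G_0 = R_0$) with a softmax policy over three actions and made-up rewards. It compares the exact score-function expectation with a finite-difference gradient of $J$.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_return(theta, rewards):
    # J(theta) = sum_a pi_theta(a) * r(a) for a one-step problem
    return float(softmax(theta) @ rewards)

def score_function_gradient(theta, rewards):
    # E_{a ~ pi}[ grad log pi(a) * r(a) ], computed exactly by enumerating actions
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # d/dtheta log softmax(theta)[a] = e_a - pi
        grad += pi[a] * grad_log_pi * rewards[a]
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    # Central differences on J, used only as an independent check
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_return(theta + step, rewards)
                   - expected_return(theta - step, rewards)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])      # hypothetical policy parameters
rewards = np.array([1.0, 0.0, 2.0])     # hypothetical rewards per action
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
```

In practice the expectation is replaced by an average over sampled trajectories, which is where the sample-efficiency issue discussed next comes from.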
Limitations of policy gradients
Sample efficiency
Recall that the policy gradient is $\nabla_\theta E_{\tau\sim\pi_\theta}[G_0]$. To estimate this gradient, we need an on-policy expectation, so samples collected under older policies cannot be reused. (No experience replay!)
Importance sampling can be used to obtain an off-policy gradient, but the resulting estimator is unstable.
Relation between parameter space and performance measure?
MM framework?
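MM (majorization-minimization / minorization-maximization) algorithms consist of two steps.
Majorization/minorization step: build a surrogate objective function $\hat{f}_{\theta^{(m)}}$ satisfying (in the minorization case)
$$f(\theta) \ge \hat{f}_{\theta^{(m)}}(\theta) \ \text{for all } \theta, \qquad f(\theta^{(m)}) = \hat{f}_{\theta^{(m)}}(\theta^{(m)}),$$
using local information about the original objective function (function value, gradient, Hessian, etc.).
Minimization/maximization step: optimize the surrogate objective function, $\theta^{(m+1)} = \arg\max_\theta \hat{f}_{\theta^{(m)}}(\theta)$.
Does this algorithm work? Yes, the objective never decreases:
$$\begin{aligned}
f(\theta^{(m)}) &= f(\theta^{(m)}) - \hat{f}_{\theta^{(m)}}(\theta^{(m)}) + \hat{f}_{\theta^{(m)}}(\theta^{(m)}) \\
&\le f(\theta^{(m+1)}) - \hat{f}_{\theta^{(m)}}(\theta^{(m+1)}) + \hat{f}_{\theta^{(m)}}(\theta^{(m)}) \\
&\le f(\theta^{(m+1)}) - \hat{f}_{\theta^{(m)}}(\theta^{(m+1)}) + \hat{f}_{\theta^{(m)}}(\theta^{(m+1)}) \\
&= f(\theta^{(m+1)}).
\end{aligned}$$
The first inequality holds because $f - \hat{f}_{\theta^{(m)}} \ge 0$ everywhere with equality at $\theta^{(m)}$; the second holds because $\theta^{(m+1)}$ maximizes the surrogate.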
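As a minimal sketch of the monotone-improvement argument, the code below runs minorization-maximization on a one-dimensional toy objective. The objective `f`, the curvature bound `CURVATURE`, and the quadratic surrogate are illustrative assumptions, not taken from the slides; the surrogate minorizes `f` because its second derivative is bounded below by `-CURVATURE`.

```python
import math

def f(theta):
    # Toy objective to maximize; f''(theta) = -sin(theta) - 1 lies in [-2, 0]
    return math.sin(theta) - 0.5 * theta ** 2

def f_prime(theta):
    return math.cos(theta) - theta

CURVATURE = 2.0  # bound on |f''|, which makes the quadratic below a minorizer

def surrogate(theta, theta_m):
    # Touches f at theta_m and lies below f everywhere else
    d = theta - theta_m
    return f(theta_m) + f_prime(theta_m) * d - 0.5 * CURVATURE * d ** 2

def mm_step(theta_m):
    # Closed-form maximizer of the concave quadratic surrogate
    return theta_m + f_prime(theta_m) / CURVATURE

def run_mm(theta0, iterations=50):
    thetas = [theta0]
    for _ in range(iterations):
        thetas.append(mm_step(thetas[-1]))
    return thetas

thetas = run_mm(theta0=-3.0)
values = [f(t) for t in thetas]  # non-decreasing along the iterates
```

With this particular surrogate, each MM step is a gradient-ascent step with step size `1 / CURVATURE`; other surrogates give other algorithms with the same monotonicity guarantee.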
Relative policy performance identity
In the supervised learning framework, we usually consider the parameter space and the distribution space. In the RL framework, we also need to consider the performance space.
The relative policy performance identity [1] gives us the relation between the distribution space and the performance space.
Lemma 3.1 (Relative policy performance identity, RPPI)
$$J(\pi') - J(\pi) = E_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\Big]$$
Good news: this relates the policy space to the performance space.
Bad news: we cannot compute the expectation over $\pi'$, because $\pi'$ is the policy we are still looking for.
Proof of relative policy performance identity
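$$\begin{aligned}
E_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\Big]
&= E_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)\big)\Big] \\
&= J(\pi') + E_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}(S_{t+1}) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(S_t)\Big] \\
&= J(\pi') + E_{\tau\sim\pi'}\Big[\sum_{t=1}^{\infty} \gamma^{t} V^{\pi}(S_t) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(S_t)\Big] \\
&= J(\pi') - E_{\tau\sim\pi'}\big[V^{\pi}(S_0)\big] \\
&= J(\pi') - J(\pi)
\end{aligned}$$
The last step uses the fact that the distribution of the initial state $S_0$ does not depend on the policy.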
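Definition 3.2 (Discounted future state distribution, "stationary distribution")
$$d^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(S_t = s \mid \pi)$$
(Example: for a trajectory $\tau = [H, T, H, T, \ldots]$, the states are $s \in \{H, T\}$.)
With this definition, the identity can be rewritten over states and actions:
$$\begin{aligned}
J(\pi') - J(\pi) &= E_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\Big] \\
&= \frac{1}{1-\gamma}\, E_{s\sim d^{\pi'},\, a\sim\pi'}\big[A^{\pi}(s, a)\big] \\
&= \frac{1}{1-\gamma}\, E_{s\sim d^{\pi'},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)} A^{\pi}(s, a)\Big]
\end{aligned}$$
If $d^{\pi'} \approx d^{\pi}$, then
$$J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\, E_{s\sim d^{\pi},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)} A^{\pi}(s, a)\Big] = L_{\pi}(\pi')$$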
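The identity and the surrogate can be checked exactly on a small tabular MDP. The sketch below is only an illustration under assumptions of my own: a randomly generated MDP with 4 states and 3 actions, a uniform initial state distribution `mu`, and two random policies. Values are obtained by solving the Bellman linear system rather than by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Random tabular MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)  # initial state distribution (assumed uniform)

def random_policy():
    pi = rng.random((n_states, n_actions))
    return pi / pi.sum(axis=1, keepdims=True)

def state_transition(pi):
    # P_pi[s, s'] = sum_a pi(s, a) P[s, a, s']
    return np.einsum("sa,sat->st", pi, P)

def value(pi):
    # V^pi solves (I - gamma P_pi) V = r_pi
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * state_transition(pi), r_pi)

def performance(pi):
    return float(mu @ value(pi))

def advantage(pi):
    v = value(pi)
    q = R + gamma * P @ v  # Q[s, a]
    return q - v[:, None]

def discounted_state_distribution(pi):
    # d^pi = (1 - gamma) * mu^T (I - gamma P_pi)^{-1}
    return (1 - gamma) * np.linalg.solve(
        (np.eye(n_states) - gamma * state_transition(pi)).T, mu)

pi_old, pi_new = random_policy(), random_policy()
adv_old = advantage(pi_old)

true_gap = performance(pi_new) - performance(pi_old)
# Exact identity: states drawn from d^{pi'}, actions from pi'
identity_rhs = (discounted_state_distribution(pi_new)[:, None]
                * pi_new * adv_old).sum() / (1 - gamma)
# Surrogate L_pi(pi'): states drawn from d^{pi} instead
surrogate = (discounted_state_distribution(pi_old)[:, None]
             * pi_new * adv_old).sum() / (1 - gamma)
```

Here `identity_rhs` matches `true_gap` up to floating-point error, while `surrogate` differs from it by an amount that depends on how far `pi_new` is from `pi_old`.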
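Let $C$ be a constant and let $B$ be one of the following policy-distance terms:
$$B \in \Big\{\alpha^2\ [1],\quad D^{\max}_{TV}(\pi \,\|\, \pi'),\quad D^{\max}_{KL}(\pi \,\|\, \pi')\ [3],\quad E_{s\sim d^{\pi}}\big[D_{KL}(\pi \,\|\, \pi')(s)\big]\ [7]\Big\}$$
where
$$D^{\max}_{TV}(\pi \,\|\, \pi') = \max_{s\in S} D_{TV}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big), \qquad D^{\max}_{KL}(\pi \,\|\, \pi') = \max_{s\in S} D_{KL}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big).$$
If we define $M(\pi') = J(\pi) + L_{\pi}(\pi') - CB$, then
$$M(\pi) = J(\pi), \qquad M(\pi') \le J(\pi').$$
Therefore, we have defined an MM algorithm for reinforcement learning. We call $B$ a relative policy performance bound.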
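Let $M_i(\pi_\theta) = J(\pi_i) + L_{\pi_i}(\pi_\theta) - CB$. We would like to solve
$$\max_\theta M_i(\pi_\theta)$$
instead of
$$\max_\theta J(\pi_\theta).$$
This can be seen as exploiting the information in the current policy $\pi_i$.
⇒ We can control how much the model changes (through the KL term).
Both $L_{\pi_i}(\pi_\theta)$ and $E_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big]$ can be estimated using only samples from the current policy.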
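A minimal sketch of how both quantities might be estimated from a batch collected with the current policy. The function names, the tiny batch, and the assumption that the sampled states approximate draws from $d^{\pi_i}$ are all illustrative choices, not part of the slides.

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, advantages, gamma):
    # Monte Carlo estimate of L_{pi_i}(pi_theta) from (s, a) pairs sampled with pi_i
    ratio = np.exp(logp_new - logp_old)
    return float(np.mean(ratio * advantages) / (1.0 - gamma))

def mean_kl_estimate(probs_old, probs_new):
    # Average over sampled states of D_KL(pi_i(s, .) || pi_theta(s, .))
    kl_per_state = np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)
    return float(np.mean(kl_per_state))

# Hypothetical batch: 3 sampled states, 2 actions, actions taken under pi_i
probs_old = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])
probs_new = np.array([[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]])
actions = np.array([0, 1, 1])
advantages = np.array([1.0, -0.5, 0.2])   # stand-ins for estimated advantages
idx = np.arange(len(actions))

surrogate = surrogate_estimate(np.log(probs_new[idx, actions]),
                               np.log(probs_old[idx, actions]),
                               advantages, gamma=0.9)
mean_kl = mean_kl_estimate(probs_old, probs_new)
```

Neither estimator needs a rollout of the candidate policy: only its probabilities on the states and actions already visited by the current policy are evaluated.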
Natural policy gradient
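Specifically, let
$$M_i(\pi_\theta) = J(\pi_i) + L_{\pi_i}(\pi_\theta) + C\Big(\delta - E_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big]\Big).$$
Then $\max_\theta M_i(\pi_\theta)$ can be seen as the Lagrangian relaxation of
$$\max_\theta L_{\pi_i}(\pi_\theta) \quad \text{s.t.} \quad E_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big] \le \delta,$$
and this is equivalent to
$$\max_\theta L_{\pi_i}(\pi_\theta) \quad \text{s.t.} \quad \sqrt{E_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big]} \le \delta_2 = \sqrt{\delta}.$$
Evaluating this optimization problem (zeroth-order information) is easy.
But no direct primal algorithm for this problem exists, because the KL term is nonlinear.
⇒ Taylor expansion!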
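Note that
$$L_{\pi_i}(\pi_\theta) \approx L_{\pi_i}(\pi_{\theta_i}) + \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i),$$
$$D_{KL}(\pi_i \,\|\, \pi_\theta) \approx \frac{1}{2} (\theta - \theta_i)^T F_i (\theta - \theta_i),$$
where $\nabla_{\theta_i} L(\pi_i)$ denotes the gradient of $L_{\pi_i}(\pi_\theta)$ evaluated at $\theta = \theta_i$, and $F_i$ is the Fisher information matrix
$$F_i = E_{s\sim d^{\pi_i},\, a\sim\pi_i}\Big[\nabla_\theta \log \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)^T\Big]\Big|_{\theta=\theta_i}.$$
Therefore, we get
$$\max_\theta\ \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_i)^T F_i (\theta - \theta_i) \le \delta.$$
The solution to this optimization problem is
$$\theta_{i+1} = \theta_i + \sqrt{\frac{2\delta}{\nabla_{\theta_i} L(\pi_i)^T F_i^{-1} \nabla_{\theta_i} L(\pi_i)}}\; F_i^{-1} \nabla_{\theta_i} L(\pi_i),$$
and we call $\hat{g} = F_i^{-1} \nabla_{\theta_i} L(\pi_i)$ the natural policy gradient.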
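The closed-form update can be sketched in a few lines. In the illustration below, the score vectors and the surrogate gradient are random stand-ins (assumptions, not real policy data); the point is only that the scaled natural-gradient step lands exactly on the boundary of the quadratic KL constraint.

```python
import numpy as np

def natural_gradient_step(grad, fisher, delta):
    # g_hat = F^{-1} grad, scaled so that 0.5 * step^T F step = delta
    g_hat = np.linalg.solve(fisher, grad)
    step_size = np.sqrt(2.0 * delta / (grad @ g_hat))
    return step_size * g_hat

rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 4))         # stand-ins for grad log pi samples
fisher = scores.T @ scores / len(scores)   # empirical Fisher matrix
grad = rng.normal(size=4)                  # stand-in for the surrogate gradient
delta = 0.01

step = natural_gradient_step(grad, fisher, delta)
quadratic_kl = 0.5 * step @ fisher @ step  # equals delta up to rounding
```

Solving the linear system with `np.linalg.solve` is only feasible for small parameter vectors; for neural networks the product $F_i^{-1} \nabla_{\theta_i} L(\pi_i)$ has to be approximated.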
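Note that ordinary SGD improves the performance by a small movement in parameter space:
$$\max_{\Delta\theta}\ \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_2^2 \le \delta.$$
Similarly, the natural policy gradient improves the performance by a small movement in policy space, where distance is measured with the Fisher information matrix:
$$\max_{\Delta\theta}\ \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_{F_i}^2 = \Delta\theta^T F_i\, \Delta\theta \le \delta.$$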
Properties of Natural policy gradient
Now we would like to answer the question
Relation between parameter space and performance measure?
The answer is in [2].
Theorem 4.1 (Greedy update in the exponential family)
For $\pi_\theta(s,a) \propto \exp(\theta^T \phi_{sa})$, assume that $\hat{g}$ is non-zero. Let
$\pi_\infty(s,a) = \lim_{\alpha\to\infty} \pi_{\theta+\alpha\hat{g}}(s,a)$. Then $\pi_\infty(s,a) \ne 0$ iff $a \in \arg\max_{a'} A^{\pi_\theta}(s,a')$.
Theorem 4.2 (Greedy update for a general parametric policy)
Let the parameter update be $\theta' = \theta + \alpha\hat{g}$. Then
$$\pi_{\theta'}(s,a) = \pi_\theta(s,a)\big(1 + \alpha A^{\pi_\theta}(s,a)\big) + O(\alpha^2).$$
Therefore, the natural policy gradient moves the policy toward the greedy improvement step, so it is a natural scheme for policy iteration methods in RL.
MM algorithms for RL
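How to find $\hat{g}$
Now our task is to compute
$$\hat{g} = F_i^{-1} \nabla_{\theta_i} L(\pi_i).$$
This is numerically difficult for neural networks, because $F_i$ is too large to form and invert explicitly. Two solutions are
Conjugate Gradient (CG) method
Kronecker-Factored Approximate Curvature (K-FAC)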
Conjugate Gradient method(CG)
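The conjugate gradient algorithm solves the linear equation $Ax = b$, for a symmetric positive definite matrix $A$, by projecting onto the Krylov subspaces $\mathrm{span}\{b, Ab, A^2 b, \ldots, A^{n-1} b\}$.
Algorithm 5.1
Conjugate gradient algorithm
Let $x_0 = 0$, $g_0 = Ax_0 - b$ and $d_0 = -g_0$.
For $k = 0, \ldots, n-1$:
If $g_k = 0$, then STOP and return $x_k$.
$x_{k+1} = x_k + \lambda_k d_k$, where $\lambda_k = -\dfrac{g_k^T d_k}{d_k^T A d_k}$
$g_{k+1} = Ax_{k+1} - b$
$d_{k+1} = -g_{k+1} + \beta_k d_k$, where $\beta_k = \dfrac{g_{k+1}^T A d_k}{d_k^T A d_k}$
Running CG for all $n$ iterations costs $O(n^3)$ for a dense matrix (the same order as Gauss-Jordan elimination). However, the error usually drops quickly during the first iterations, at a rate that depends on the conditioning of $A$, so we can stop much earlier than $n$.
⇒ Truncated Natural Policy Gradient.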
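A minimal sketch of the algorithm above, written so that it only needs a matrix-vector product `matvec` (in the policy-gradient setting this would be a Fisher-vector product). The test matrix, the tolerance, and the iteration counts are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iters=10, tol=1e-10):
    # Solves A x = b for symmetric positive definite A, given only v -> A v
    x = np.zeros_like(b)
    g = matvec(x) - b   # g_0 = A x_0 - b
    d = -g              # d_0 = -g_0
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        Ad = matvec(d)
        lam = -(g @ d) / (d @ Ad)
        x = x + lam * d
        g = matvec(x) - b
        beta = (g @ Ad) / (d @ Ad)
        d = -g + beta * d
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 6.0 * np.eye(6)   # symmetric positive definite test matrix
b = rng.normal(size=6)

x_full = conjugate_gradient(lambda v: A @ v, b, max_iters=6)
x_truncated = conjugate_gradient(lambda v: A @ v, b, max_iters=3)
x_exact = np.linalg.solve(A, b)
```

Because only `matvec` is required, the matrix never has to be stored; truncating at a small `max_iters` trades accuracy for cost, which is the idea behind the truncated natural policy gradient.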
Trust region policy optimization(TRPO)
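Algorithm 5.2
Trust region policy optimization [3]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
    Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$.
    (Estimate advantages $\hat{A}_t$.)
    Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $F^{-1}_{n,0}$ with $D_n$.
    Approximate $\hat{g}$ with a fixed number of CG iterations.
    For minibatch $k = 0, 1, 2, \ldots, T$
        Perform a backtracking line search with exponentially decaying step size $\alpha^j$ to obtain
        $$\theta_{n,k+1} = \theta_{n,k} + \alpha^j \sqrt{\frac{2\delta}{\hat{g}^T F_{n,0}\, \hat{g}}}\; \hat{g}$$
        such that
        $$\hat{L}_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0 \quad \text{and} \quad \hat{D}_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta.$$
        If $k$ is $T$, then $\theta_{n+1,0} = \theta_{n,T}$.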
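The backtracking line search in the inner loop can be sketched as follows. The `surrogate` and `kl` callables, the decay factor 0.5, and the toy problem are assumptions made for illustration; in TRPO they would be the sample estimates $\hat{L}$ and $\hat{D}_{KL}$.

```python
import numpy as np

def backtracking_line_search(theta, full_step, surrogate, kl, delta,
                             decay=0.5, max_backtracks=10):
    # Try theta + decay**j * full_step for j = 0, 1, ... and accept the first
    # candidate that improves the surrogate and satisfies the KL constraint.
    for j in range(max_backtracks):
        candidate = theta + decay ** j * full_step
        if surrogate(candidate) >= 0.0 and kl(candidate) <= delta:
            return candidate, j
    return theta, max_backtracks  # no acceptable step: keep the old parameters

# Toy stand-ins (assumptions): a concave surrogate with surrogate(theta_old) = 0
# and a quadratic KL proxy centered at theta_old.
theta_old = np.zeros(2)
surrogate = lambda th: float(th.sum() - 2.0 * th @ th)
kl = lambda th: float(0.5 * (th - theta_old) @ (th - theta_old))

theta_new, backtracks = backtracking_line_search(
    theta_old, full_step=np.array([1.0, 1.0]),
    surrogate=surrogate, kl=kl, delta=0.05)
```

The line search is what turns the Taylor-approximated step into one that respects the original, non-approximated constraint: the full step is only a proposal, and it is shrunk until both checks pass.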
Kronecker-Factored Approximate Curvature(K-FAC)
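(In this presentation, we only consider the block-diagonal version of K-FAC.)
Note that $F_i = E\big[D\theta_i\, D\theta_i^T\big]$, where
$$D\theta_i = \big[\mathrm{vec}(DW_1)^T \ \cdots \ \mathrm{vec}(DW_l)^T\big]^T,$$
so that
$$F_i = \begin{bmatrix}
E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_l)^T] \\
E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_l)^T] \\
\vdots & \vdots & \ddots & \vdots \\
E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_l)^T]
\end{bmatrix}.$$
We approximate this by
$$\hat{F}_i = \begin{bmatrix}
E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_1)^T] & 0 & \cdots & 0 \\
0 & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_2)^T] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_l)^T]
\end{bmatrix}.$$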
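Since $v \otimes u = \mathrm{vec}(uv^T)$, $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$, each diagonal block can be approximated by
$$\begin{aligned}
E[\mathrm{vec}(DW_i)\mathrm{vec}(DW_i)^T] &= E\big[\mathrm{vec}(\nabla_{s_i} L\, a_{i-1}^T)\, \mathrm{vec}(\nabla_{s_i} L\, a_{i-1}^T)^T\big] \\
&= E\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1} \otimes \nabla_{s_i} L)^T\big] \\
&= E\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1}^T \otimes \nabla_{s_i} L^T)\big] \\
&= E\big[a_{i-1} a_{i-1}^T \otimes \nabla_{s_i} L\, \nabla_{s_i} L^T\big] \\
&\approx E\big[a_{i-1} a_{i-1}^T\big] \otimes E\big[\nabla_{s_i} L\, \nabla_{s_i} L^T\big] \\
&= A \otimes S \\
&= \bar{F}_i.
\end{aligned}$$
Since $(P \otimes Q)^{-1} = P^{-1} \otimes Q^{-1}$ and $(P \otimes Q)\mathrm{vec}(T) = \mathrm{vec}(QTP^T)$,
$$\mathrm{vec}(\Delta W_i) = \bar{F}_i^{-1}\, \mathrm{vec}\big(\nabla_{W_i} L(\theta_i)\big) = (A^{-1} \otimes S^{-1})\, \mathrm{vec}\big(\nabla_{W_i} L(\theta_i)\big) = \mathrm{vec}\big(S^{-1}\, \nabla_{W_i} L(\theta_i)\, A^{-1}\big).$$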
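The Kronecker identities used above can be verified numerically. In this sketch the factors `A` and `S` are random symmetric positive definite stand-ins for the activation and pre-activation-gradient second moments, and the layer sizes are arbitrary; `vec` stacks columns, which is the convention under which `vec(u v^T)` equals `kron(v, u)`.

```python
import numpy as np

def vec(matrix):
    # Column-stacking vectorization, so that vec(u v^T) equals kron(v, u)
    return matrix.flatten(order="F")

rng = np.random.default_rng(0)
n_in, n_out = 3, 2

def random_spd(n):
    m = rng.normal(size=(n, n))
    return m @ m.T + n * np.eye(n)

A = random_spd(n_in)    # stands in for E[a a^T]
S = random_spd(n_out)   # stands in for E[grad_s L grad_s L^T]
grad_W = rng.normal(size=(n_out, n_in))  # gradient for one layer's weights

# Naive route: build the (n_in * n_out)-dimensional Kronecker block and solve
naive = np.linalg.solve(np.kron(A, S), vec(grad_W))

# K-FAC route: two small inverses, never forming the Kronecker product
delta_W = np.linalg.solve(S, grad_W) @ np.linalg.inv(A)
```

The naive route scales with the product of the layer dimensions, while the factored route only inverts one matrix per factor, which is the source of the computational saving.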
Actor-Critic using Kronecker-Factored Trust Region
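Algorithm 5.3
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
    Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$.
    (Estimate advantages $\hat{A}_t$.)
    Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $F^{-1}_{n,0}$ with $D_n$.
    Approximate $\hat{g}$ by K-FAC.
    For minibatch $k = 0, 1, 2, \ldots, T$
        Perform a backtracking line search with exponentially decaying step size $\alpha^j$ to obtain
        $$\theta_{n,k+1} = \theta_{n,k} + \alpha^j \sqrt{\frac{2\delta}{\hat{g}^T F_{n,0}\, \hat{g}}}\; \hat{g}$$
        such that
        $$\hat{L}_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0 \quad \text{and} \quad \hat{D}_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta.$$
        If $k$ is $T$, then $\theta_{n+1,0} = \theta_{n,T}$.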
Clipped surrogate objective
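In PPO, we would like to avoid the difficulty of computing $\bar{F}_i^{-1}$ and the cost of the line search by using a clipped surrogate objective:
$$L^{CLIP}_{\theta_k}(\theta) = E\Big[\min\big(r_k(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_k(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t\big)\Big],$$
where
$$r_k(\theta) = \frac{\pi_\theta(s,a)}{\pi_{\theta_k}(s,a)}.$$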
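A minimal sketch of the clipped surrogate on a batch. The ratio and advantage values are made up for illustration, and `epsilon=0.2` is an assumed setting rather than one stated in the slides.

```python
import numpy as np

def clipped_surrogate(ratio, advantages, epsilon=0.2):
    # Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

ratio = np.array([1.5, 0.5, 1.1, 0.5])        # hypothetical probability ratios
advantages = np.array([1.0, -1.0, 2.0, 1.0])  # hypothetical advantage estimates
objective = clipped_surrogate(ratio, advantages)
```

In this batch the first sample is capped at `1.2 * 1.0` and the second at `0.8 * (-1.0)`, while the last two keep their unclipped values: the minimum removes the incentive to push the ratio outside the interval only when doing so would increase the objective.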
PPO with Clipped surrogate objective
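Algorithm 5.4
PPO with Clipped surrogate objective [4]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
    Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$.
    (Estimate advantages $\hat{A}_t$.)
    For minibatch $k = 0, 1, 2, \ldots, T$
        Compute the policy update
        $$\theta_{n,k+1} = \theta_{n,k} + \alpha\, \nabla_\theta L^{CLIP}_{\theta_{n,0}}(\theta)\Big|_{\theta=\theta_{n,k}}$$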
Benchmarks
Questions & Answers
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...

MM framework for RL

  • 1. A MM framework for RL. SungYub Kim, Management Science/Optimization Lab, Department of Industrial Engineering, Seoul National University. January 6, 2018.
  • 2. Contents
      Limitations of policy gradients
      MM framework?
      MM framework for RL
      Natural policy gradients
      MM algorithms for RL
        1. Trust region policy optimization (TRPO)
        2. Actor-critic using Kronecker-Factored Trust Region (ACKTR)
        3. Proximal policy optimization (PPO)
      Benchmarks
  • 3. References
      [1] Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In ICML (Vol. 2, pp. 267-274).
      [2] Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems (pp. 1531-1538).
      [3] Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 1889-1897).
      [4] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
      [5] Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. arXiv preprint arXiv:1708.05144.
      [6] Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (pp. 2408-2417).
      [7] Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. arXiv preprint arXiv:1705.10528.
      [8] UC Berkeley CS294, Fall 2017, lecture notes of Oct 11.
  • 4. Limitations of policy gradients: Policy gradients review
      Definition 1.1 (Markov decision process).
        1. $S$ is a set of states.
        2. $A$ is a set of actions.
        3. $T^a_{s,s'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the probability that taking action $a$ in state $s$ at time $t$ leads to state $s'$ at time $t+1$.
        4. $R^a_s = E[R_t \mid S_t = s, A_t = a]$ is the expected reward received when the agent chooses action $a$ in state $s$.
        5. $\gamma \in [0, 1]$ is the discount factor, which sets the relative importance of future and present rewards.
      Definition 1.2. A policy $\pi$ is a distribution over actions given states, $\pi(s, a) = P[A_t = a \mid S_t = s]$.
  • 5. Limitations of policy gradients: Policy gradients review
      Definition 1.3. The (discounted) return is the discounted sum of rewards,
        $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$.
      We want to maximize the expected return, so the optimization problem is
        $\max_\theta J(\pi_\theta) \doteq E_{\tau \sim \pi_\theta}[G_0]$.
      For first-order optimization we need the gradient of the objective function:
        $\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[ G_0 \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(s_t, a_t) \right] = E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(s_t, a_t)\, A^{\pi_\theta}(s_t, a_t) \right]$.
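The return in Definition 1.3 can be computed for a recorded reward sequence as in the minimal sketch below. It assumes a finite (truncated) episode, and the function name `discounted_returns` is made up for this illustration; it does not come from the slides.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards, using the recursion G_t = R_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass costs one multiplication per step, instead of re-summing the tail of the episode for every t.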
  • 6. Limitations of policy gradients
      Sample efficiency: recall that the policy gradient is $\nabla_\theta E_{\tau \sim \pi_\theta}[G_0]$. Estimating it requires an on-policy expectation, so old samples cannot be reused (no experience replay!).
      Importance sampling can be used to obtain an off-policy gradient, but the resulting estimator is unstable.
      What is the relation between the parameter space and the performance measure?
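  • 7. MM framework?
      MM (majorization-minimization / minorization-maximization) algorithms consist of two steps.
      Majorization/minorization step: build a surrogate objective function $\hat f_{\theta^{(m)}}$ from local information about the original objective (evaluation, gradient, Hessian, etc.). In the minorization-maximization case it satisfies
        $f(\theta) \ge \hat f_{\theta^{(m)}}(\theta)$ for all $\theta$, and $f(\theta^{(m)}) = \hat f_{\theta^{(m)}}(\theta^{(m)})$.
      Minimization/maximization step: optimize the surrogate objective function to obtain $\theta^{(m+1)}$.
      Does this algorithm work? Yes, the objective never decreases:
        $f(\theta^{(m)}) = f(\theta^{(m)}) - \hat f_{\theta^{(m)}}(\theta^{(m)}) + \hat f_{\theta^{(m)}}(\theta^{(m)})$
        $\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m)})$
        $\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m+1)})$
        $= f(\theta^{(m+1)})$.
      The first inequality holds because $f - \hat f_{\theta^{(m)}}$ is zero at $\theta^{(m)}$ and nonnegative everywhere; the second holds because $\theta^{(m+1)}$ maximizes the surrogate.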
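A toy instance of minorization-maximization may make the two steps concrete. The sketch below is an illustration chosen for this note, not an example from the slides: it maximizes $f(\theta) = \sin\theta$, using the fact that $|f''| \le 1$ so that a quadratic with curvature $-1$ is a valid minorizer. The function name `mm_maximize` is hypothetical.

```python
import math


def mm_maximize(theta, steps=20):
    """Minorization-maximization for f(theta) = sin(theta).

    Since |f''| <= 1, the quadratic
        f_hat(t) = sin(theta_m) + cos(theta_m) * (t - theta_m) - 0.5 * (t - theta_m) ** 2
    satisfies f_hat(t) <= sin(t) for all t, with equality at t = theta_m.
    Its maximizer is theta_m + cos(theta_m).
    """
    history = [math.sin(theta)]
    for _ in range(steps):
        theta = theta + math.cos(theta)  # maximization step on the surrogate
        history.append(math.sin(theta))
    return theta, history


theta, history = mm_maximize(0.0)
print(theta, history[:4])
```

Starting from 0, the objective values in `history` never decrease, and `theta` approaches $\pi/2$, the nearest maximizer of the sine.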
  • 8. MM framework for RL: Relative policy performance identity
      In the supervised learning framework we usually consider the parameter space and the distribution space. In the RL framework we also need to consider the performance space.
      The relative policy performance identity [1] relates the distribution space to the performance space.
      Lemma 3.1 (Relative policy performance identity, RPPI).
        $J(\pi') - J(\pi) = E_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t) \right]$
      Good news: it gives a relation between policy space and performance space.
      Bad news: we cannot compute the expectation under $\pi'$ without sampling from $\pi'$.
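  • 9. MM framework for RL: Proof of the relative policy performance identity
        $E_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t) \right]$
        $= E_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t) \right) \right]$
        $= J(\pi') + E_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}(S_{t+1}) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(S_t) \right]$
        $= J(\pi') + E_{\tau \sim \pi'}\left[ \sum_{t=1}^{\infty} \gamma^t V^{\pi}(S_t) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(S_t) \right]$
        $= J(\pi') - E_{\tau \sim \pi'}\left[ V^{\pi}(S_0) \right]$
        $= J(\pi') - J(\pi)$
      The last step uses the fact that the initial state $S_0$ has the same distribution under both policies.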
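  • 10. MM framework for RL
      Definition 3.2 (Discounted future state distribution, "stationary distribution").
        $d^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(S_t = s \mid \pi)$
      Example: for a trajectory $\tau = [H, T, H, T, \ldots]$ the states are $s \in \{H, T\}$.
      Then
        $J(\pi') - J(\pi) = E_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t) \right]$
        $= \frac{1}{1-\gamma} E_{s \sim d^{\pi'},\, a \sim \pi'}\left[ A^{\pi}(s, a) \right]$
        $= \frac{1}{1-\gamma} E_{s \sim d^{\pi'},\, a \sim \pi}\left[ \frac{\pi'(s, a)}{\pi(s, a)} A^{\pi}(s, a) \right]$.
      If $d^{\pi'} \approx d^{\pi}$, then
        $J(\pi') - J(\pi) \approx \frac{1}{1-\gamma} E_{s \sim d^{\pi},\, a \sim \pi}\left[ \frac{\pi'(s, a)}{\pi(s, a)} A^{\pi}(s, a) \right] \doteq L_{\pi}(\pi')$.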
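The identity and the surrogate $L_\pi(\pi')$ can be checked numerically on a small tabular MDP, where everything is available in closed form. The sketch below is an illustration under assumptions not in the slides: a randomly generated MDP with 4 states and 3 actions, a uniform initial state distribution `mu`, and a hypothetical helper `evaluate`.

```python
import numpy as np


def evaluate(P, R, pi, gamma, mu):
    """Exact V, A, J and discounted state distribution d for a tabular policy."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    A = R + gamma * P @ V - V[:, None]      # A(s, a) = Q(s, a) - V(s)
    d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
    return V, A, mu @ V, d


rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # P[s, a, :] is a distribution over next states
R = rng.random((n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)      # assumed uniform initial state distribution


def random_policy():
    p = rng.random((n_states, n_actions))
    return p / p.sum(axis=1, keepdims=True)


pi = random_policy()
pi_new = 0.9 * pi + 0.1 * random_policy()   # a policy close to pi

_, A_pi, J_pi, d_pi = evaluate(P, R, pi, gamma, mu)
_, _, J_new, d_new = evaluate(P, R, pi_new, gamma, mu)

# Right-hand side of the identity: states drawn from d^{pi'}, actions from pi'.
exact = (d_new[:, None] * pi_new * A_pi).sum() / (1 - gamma)
# Surrogate L_pi(pi'): states drawn from d^{pi} instead.
surrogate = (d_pi[:, None] * pi_new * A_pi).sum() / (1 - gamma)
print(J_new - J_pi, exact, surrogate)
```

The first two printed numbers agree up to floating-point error, while the third is only an approximation whose quality depends on how close $d^{\pi'}$ is to $d^{\pi}$.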
  • 11. MM framework for RL
      Let $B \in \{ \alpha^2\ [1],\ D^{\max}_{TV}(\pi \| \pi'),\ D^{\max}_{KL}(\pi \| \pi')\ [3],\ E_{s \sim d^{\pi}}[D_{KL}(\pi \| \pi')(s)]\ [7] \}$, where
        $D^{\max}_{TV}(\pi \| \pi') \doteq \max_{s \in S} D_{TV}(\pi(s, \cdot) \| \pi'(s, \cdot))$,
        $D^{\max}_{KL}(\pi \| \pi') \doteq \max_{s \in S} D_{KL}(\pi(s, \cdot) \| \pi'(s, \cdot))$.
      If we define $M(\pi') \doteq J(\pi) + L_{\pi}(\pi') - C B$ for a suitable constant $C$, then
        $M(\pi) = J(\pi)$ and $M(\pi') \le J(\pi')$.
      Therefore $M$ is a minorizer of $J$, and we have defined MM algorithms for reinforcement learning. We call $B$ a relative policy performance bound.
  • 12. MM framework for RL
      Let $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) - C B$. We would like to solve $\max_\theta M_i(\pi_\theta)$ instead of $\max_\theta J(\pi_\theta)$.
      This can be seen as exploiting the information of the current policy $\pi_i$. ⇒ We can control how much the model changes (through the KL term).
      Both $L_{\pi_i}(\pi_\theta)$ and $E_{s \sim d^{\pi_i}}[D_{KL}(\pi_i \| \pi_\theta)(s)]$ can be estimated using only samples from the current policy.
  • 13. Natural policy gradient
      Specifically, let
        $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) + C \left( \delta - \sqrt{E_{s \sim d^{\pi_i}}[D_{KL}(\pi_i \| \pi_\theta)(s)]} \right)$.
      Then $\max_\theta M_i(\pi_\theta)$ can be seen as a Lagrangian relaxation of
        $\max_\theta L_{\pi_i}(\pi_\theta)$ subject to $\sqrt{E_{s \sim d^{\pi_i}}[D_{KL}(\pi_i \| \pi_\theta)(s)]} \le \delta$,
      which is equivalent to
        $\max_\theta L_{\pi_i}(\pi_\theta)$ subject to $E_{s \sim d^{\pi_i}}[D_{KL}(\pi_i \| \pi_\theta)(s)] \le \delta^2 = \delta'$.
      Evaluating this optimization problem (zero-order information) is easy, but no direct primal algorithm for it is available because the KL term is nonlinear. ⇒ Taylor expansion!
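  • 14. Natural policy gradient
      Note that, expanding around $\theta_i$,
        $L_{\pi_i}(\pi_\theta) \approx L_{\pi_i}(\pi_{\theta_i}) + \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i)$, where $L_{\pi_i}(\pi_{\theta_i}) = 0$,
        $D_{KL}(\pi_i \| \pi_\theta) \approx \frac{1}{2} (\theta - \theta_i)^T F_i (\theta - \theta_i)$,
      where $F_i$ is the Fisher information matrix
        $F_i = -E_{s \sim d^{\pi_i},\, a \sim \pi_i}\left[ \nabla^2_\theta \log \pi_\theta(s, a) \right]\big|_{\theta = \theta_i} = E_{s \sim d^{\pi_i},\, a \sim \pi_i}\left[ \nabla_\theta \log \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)^T \right]\big|_{\theta = \theta_i}$.
      Therefore we get
        $\max_\theta \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i)$ subject to $\frac{1}{2} (\theta - \theta_i)^T F_i (\theta - \theta_i) \le \delta$.
      The solution to this optimization problem is
        $\theta_{i+1} = \theta_i + \sqrt{\frac{2\delta}{\nabla_{\theta_i} L(\pi_i)^T F_i^{-1} \nabla_{\theta_i} L(\pi_i)}}\; F_i^{-1} \nabla_{\theta_i} L(\pi_i)$,
      and we call $\hat g \doteq F_i^{-1} \nabla_{\theta_i} L(\pi_i)$ the natural policy gradient.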
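For a small dense Fisher matrix, the closed-form update can be evaluated directly. The sketch below is only an illustration: the gradient and Fisher matrix are made-up numbers, and the function name `natural_gradient_step` is hypothetical. It also checks that the step lands on the boundary of the quadratic KL constraint.

```python
import numpy as np


def natural_gradient_step(theta, grad, fisher, delta):
    """Return theta + sqrt(2 * delta / (g^T F^-1 g)) * F^-1 g for a small dense F."""
    nat_grad = np.linalg.solve(fisher, grad)            # g_hat = F^-1 g
    step_size = np.sqrt(2.0 * delta / (grad @ nat_grad))
    return theta + step_size * nat_grad


theta = np.zeros(2)
grad = np.array([1.0, 2.0])                             # made-up gradient
fisher = np.array([[2.0, 0.5], [0.5, 1.0]])             # made-up positive definite F
delta = 0.01

new_theta = natural_gradient_step(theta, grad, fisher, delta)
step = new_theta - theta
print(0.5 * step @ fisher @ step)                       # close to delta
```

Solving the linear system instead of forming $F^{-1}$ explicitly is the usual choice even at this size; for neural networks the solve itself has to be approximated.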
  • 15. Natural policy gradient
      Ordinary (stochastic) gradient ascent tries to improve the performance by a small movement in parameter space:
        $\max_{\Delta\theta} \nabla_{\theta_i} L(\pi_i)^T \Delta\theta$ subject to $\|\Delta\theta\|_2^2 \le \delta$.
      Similarly, the natural policy gradient tries to improve the performance by a small movement in policy space, measured by the Fisher metric:
        $\max_{\Delta\theta} \nabla_{\theta_i} L(\pi_i)^T \Delta\theta$ subject to $\|\Delta\theta\|_{F_i}^2 = \Delta\theta^T F_i \Delta\theta \le \delta$.
  • 16. Natural policy gradient: Properties of the natural policy gradient
      Now we would like to answer the question: what is the relation between the parameter space and the performance measure? The answer is in [2].
      Theorem 4.1 (Greedy update in the exponential family). For $\pi_\theta(s, a) \propto \exp(\theta^T \phi_{sa})$, assume that $\hat g$ is non-zero. Let $\pi_\infty(s, a) = \lim_{\alpha \to \infty} \pi_{\theta + \alpha \hat g}(s, a)$. Then $\pi_\infty(s, a) \ne 0$ iff $a \in \arg\max_{a'} A^{\pi_\theta}(s, a')$.
      Theorem 4.2 (Greedy update for a general parametric policy). Let the parameter update be $\theta' = \theta + \alpha \hat g$. Then
        $\pi_{\theta'}(s, a) = \pi_\theta(s, a) \left( 1 + \alpha A^{\pi_\theta}(s, a) \right) + O(\alpha^2)$.
      Therefore the natural policy gradient is a natural analogue of policy iteration methods in RL.
  • 17. MM algorithms for RL: How to find $\hat g$
      Our task is now to find $\hat g \doteq F_i^{-1} \nabla_{\theta_i} L(\pi_i)$.
      For neural networks this is numerically hard, because $F_i$ is too large to form and invert directly. Two solutions are:
        Conjugate gradient (CG) method
        Kronecker-Factored Approximate Curvature (K-FAC)
  • 18. MM algorithms for RL: Conjugate gradient (CG) method
      The conjugate gradient algorithm solves the linear system $Ax = b$, with $A$ symmetric positive definite, by projecting onto the Krylov subspaces $\mathrm{span}\{b, Ab, A^2 b, \ldots, A^{n-1} b\}$.
      Algorithm 5.1 (Conjugate gradient algorithm).
        Let $x_0 \doteq 0$, $g_0 \doteq A x_0 - b$ and $d_0 \doteq -g_0$.
        For $k = 0, \ldots, n-1$:
          If $g_k = 0$, then stop and return $x_k$.
          Else:
            $x_{k+1} \doteq x_k + \lambda_k d_k$, where $\lambda_k \doteq -\frac{g_k^T d_k}{d_k^T A d_k}$
            $g_{k+1} \doteq A x_{k+1} - b$
            $d_{k+1} \doteq -g_{k+1} + \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k} d_k$
      Running all $n$ iterations costs $O(n^3)$ for a dense $A$ (the same order as Gauss-Jordan elimination), but CG usually gets close to the solution in far fewer than $n$ iterations, so we can stop early. ⇒ Truncated natural policy gradient.
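Algorithm 5.1 translates almost line by line into NumPy. The sketch below assumes a dense symmetric positive definite matrix and adds a residual tolerance `tol` and an iteration cap `max_iters`, which are choices made for this illustration; in a TRPO-style setting the product `A @ d` would typically be replaced by a Fisher-vector product.

```python
import numpy as np


def conjugate_gradient(A, b, max_iters=None, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = np.zeros_like(b, dtype=float)
    g = A @ x - b                        # residual g_k = A x_k - b
    d = -g                               # first search direction
    for _ in range(max_iters or len(b)):
        if np.linalg.norm(g) < tol:
            break
        Ad = A @ d
        lam = -(g @ d) / (d @ Ad)        # exact line search along d
        x = x + lam * d
        g_new = g + lam * Ad             # equals A @ x - b for the new x
        beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d            # next A-conjugate direction
        g = g_new
    return x


rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)            # made-up SPD test matrix
b = rng.standard_normal(5)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))
```

Passing a small `max_iters` gives the truncated variant: the loop stops early and returns the current approximation of $A^{-1} b$.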
  • 19. MM algorithms for RL: Trust region policy optimization (TRPO)
      Algorithm 5.2 (Trust region policy optimization [3]).
        Given policy parameters $\theta_{0,0}$.
        For $n = 0, 1, 2, \ldots$
          Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$. (Estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$.)
          Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and the Fisher matrix $F_{n,0}$ with $D_n$.
          Approximate $\hat g = F_{n,0}^{-1} \nabla_{\theta_{n,0}} L(\theta_{n,0})$ with a fixed number of CG iterations.
          For minibatch $k = 0, 1, 2, \ldots, T$
            Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$ such that
              $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \| \theta_{n,k+1}) \le \delta$.
          After the last minibatch, set $\theta_{n+1,0} = \theta_{n,T}$.
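The backtracking step of Algorithm 5.2 can be sketched on its own. In the sketch below, `surrogate` and `kl` are hypothetical callables standing in for the sample estimates $\hat L$ and $\hat D_{KL}$; the halving factor and the maximum number of backtracks are assumptions made for this illustration, and the toy functions in the usage lines are made up.

```python
def backtracking_line_search(theta, direction, surrogate, kl, delta,
                             max_backtracks=10, decay=0.5):
    """Shrink the step until the surrogate is nonnegative and the KL estimate is within delta."""
    step = 1.0
    for _ in range(max_backtracks):
        candidate = [t + step * d for t, d in zip(theta, direction)]
        if surrogate(candidate) >= 0.0 and kl(candidate) <= delta:
            return candidate
        step *= decay                    # exponentially decaying step size
    return list(theta)                   # no acceptable step: keep the current parameters


# Toy one-dimensional stand-ins for the two estimates.
toy_surrogate = lambda th: th[0] - th[0] ** 2
toy_kl = lambda th: 0.5 * th[0] ** 2
result = backtracking_line_search([0.0], [1.0], toy_surrogate, toy_kl, delta=0.01)
print(result)  # [0.125]
```

Falling back to the current parameters when every candidate fails keeps the update inside the trust region at the cost of a wasted iteration.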
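  • 20. MM algorithms for RL: Kronecker-Factored Approximate Curvature (K-FAC)
      (In this presentation we only consider the block-diagonal version of K-FAC.)
      Note that $F_i = E[D\theta_i\, D\theta_i^T]$, where $D\theta_i = \left[ \mathrm{vec}(DW_1)^T\ \mathrm{vec}(DW_2)^T\ \cdots\ \mathrm{vec}(DW_l)^T \right]^T$. Therefore
        $F_i = \begin{pmatrix}
          E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_l)^T] \\
          E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_l)^T] \\
          \vdots & \vdots & \ddots & \vdots \\
          E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_1)^T] & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_2)^T] & \cdots & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_l)^T]
        \end{pmatrix}$
      We approximate this by
        $\hat F_i = \begin{pmatrix}
          E[\mathrm{vec}(DW_1)\mathrm{vec}(DW_1)^T] & 0 & \cdots & 0 \\
          0 & E[\mathrm{vec}(DW_2)\mathrm{vec}(DW_2)^T] & \cdots & 0 \\
          \vdots & \vdots & \ddots & \vdots \\
          0 & 0 & \cdots & E[\mathrm{vec}(DW_l)\mathrm{vec}(DW_l)^T]
        \end{pmatrix}$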
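  • 21. MM algorithms for RL: Kronecker-Factored Approximate Curvature (K-FAC)
      Since $v \otimes u = \mathrm{vec}(u v^T)$, $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$, each diagonal block can be approximated as follows, where $a_{i-1}$ is the input activation of layer $i$ and $s_i$ is its pre-activation:
        $E[\mathrm{vec}(DW_i)\mathrm{vec}(DW_i)^T] = E[\mathrm{vec}(\nabla_{s_i} L\, a_{i-1}^T)\, \mathrm{vec}(\nabla_{s_i} L\, a_{i-1}^T)^T]$
        $= E[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1} \otimes \nabla_{s_i} L)^T]$
        $= E[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1}^T \otimes \nabla_{s_i} L^T)]$
        $= E[a_{i-1} a_{i-1}^T \otimes \nabla_{s_i} L\, \nabla_{s_i} L^T]$
        $\approx E[a_{i-1} a_{i-1}^T] \otimes E[\nabla_{s_i} L\, \nabla_{s_i} L^T] \doteq A \otimes S \doteq \bar F_i$.
      Since $(P \otimes Q)^{-1} = P^{-1} \otimes Q^{-1}$ and $(P \otimes Q)\mathrm{vec}(T) = \mathrm{vec}(Q T P^T)$, and since $A$ is symmetric,
        $\mathrm{vec}(\Delta W_i) = \bar F_i^{-1} \mathrm{vec}(\nabla_{W_i} L) = (A^{-1} \otimes S^{-1})\, \mathrm{vec}(\nabla_{W_i} L) = \mathrm{vec}(S^{-1}\, \nabla_{W_i} L\, A^{-1})$,
      that is, $\Delta W_i = S^{-1}\, \nabla_{W_i} L\, A^{-1}$.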
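The last identity is what makes K-FAC cheap: the large Kronecker-structured system never has to be formed. The sketch below checks it numerically with made-up matrices (random symmetric positive definite `A` and `S` standing in for the two expectations, and a random `G` standing in for $\nabla_{W_i} L$); `vec` is taken to be column stacking, which is the convention under which $(P \otimes Q)\mathrm{vec}(T) = \mathrm{vec}(Q T P^T)$ holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 3, 2


def random_spd(n):
    """Random symmetric positive definite matrix (illustrative stand-in)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)


def vec(X):
    """Column-stacking vectorization."""
    return X.flatten(order="F")


A = random_spd(n_in)                       # stands in for E[a a^T]
S = random_spd(n_out)                      # stands in for E[grad_s grad_s^T]
G = rng.standard_normal((n_out, n_in))     # stands in for the gradient w.r.t. W

dense = np.linalg.solve(np.kron(A, S), vec(G))             # (A kron S)^-1 vec(G)
factored = vec(np.linalg.solve(S, G) @ np.linalg.inv(A))   # vec(S^-1 G A^-1)
print(np.allclose(dense, factored))        # True
```

The dense route solves a system of size `n_in * n_out`, while the factored route only touches an `n_out`-by-`n_out` and an `n_in`-by-`n_in` matrix.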
  • 22. MM algorithms for RL: Actor-Critic using Kronecker-Factored Trust Region (ACKTR)
      Algorithm 5.3.
        Given policy parameters $\theta_{0,0}$.
        For $n = 0, 1, 2, \ldots$
          Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$. (Estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$.)
          Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and the Fisher matrix $F_{n,0}$ with $D_n$.
          Approximate $\hat g$ by K-FAC.
          For minibatch $k = 0, 1, 2, \ldots, T$
            Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$ such that
              $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \| \theta_{n,k+1}) \le \delta$.
          After the last minibatch, set $\theta_{n+1,0} = \theta_{n,T}$.
  • 23. MM algorithms for RL: Clipped surrogate objective
      PPO avoids both the difficulty of computing $\bar F_i^{-1}$ and the cost of the line search by using a clipped surrogate objective:
        $L^{CLIP}_{\theta_k}(\theta) = E\left[ \min\left( r_k(\theta)\, \hat A_t^{\pi_{\theta_k}},\ \mathrm{clip}(r_k(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat A_t^{\pi_{\theta_k}} \right) \right]$,
      where $r_k(\theta) = \frac{\pi_\theta(s, a)}{\pi_{\theta_k}(s, a)}$.
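The clipped objective is simple to evaluate once the probability ratios and advantage estimates are available as arrays. The sketch below is a NumPy illustration only: the function name `clipped_surrogate` and the default `epsilon=0.2` are assumptions made here, and a real implementation would compute this inside an automatic-differentiation framework so that it can be maximized in $\theta$.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped).mean()


# First sample: ratio too large for a positive advantage, so the clipped term 1.2 is used.
# Second sample: ratio too small for a negative advantage, so the clipped term -0.8 is used.
print(clipped_surrogate([1.5, 0.5], [1.0, -1.0]))
```

Taking the minimum makes the objective a pessimistic estimate: moving the ratio further outside $[1 - \epsilon, 1 + \epsilon]$ in the direction that would look like an improvement no longer increases it.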
  • 24. MM algorithms for RL: PPO with clipped surrogate objective
      Algorithm 5.4 (PPO with clipped surrogate objective [4]).
        Given policy parameters $\theta_{0,0}$.
        For $n = 0, 1, 2, \ldots$
          Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$. (Estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$.)
          For minibatch $k = 0, 1, 2, \ldots, T$
            Compute the policy update $\theta_{n,k+1} = \theta_{n,k} + \alpha \nabla_\theta \hat L^{CLIP}_{\theta_{n,0}}(\theta)\big|_{\theta = \theta_{n,k}}$.
  • 25. Benchmarks
  • 26. Questions & Answers