MM framework for RL
1. A MM framework for RL
A MM framework for RL
SungYub Kim
Management Science/Optimization Lab
Department of Industrial Engineering
Seoul National University
January 6, 2018
SungYub Kim January 6, 2018 1 / 26
2. A MM framework for RL
Contents
Limitations of policy gradients
MM framework?
MM framework for RL
Natural policy gradients
MM algorithms for RL
1. Trust region policy optimization (TRPO)
2. Actor-critic using Kronecker-Factored Trust Region (ACKTR)
3. Proximal policy optimization (PPO)
Benchmarks
SungYub Kim January 6, 2018 2 / 26
3. A MM framework for RL
References
1. Kakade, S., & Langford, J. (2002, July). Approximately optimal approximate
reinforcement learning. In ICML (Vol. 2, pp. 267-274).
2. Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information
processing systems (pp. 1531-1538).
3. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region
policy optimization. In Proceedings of the 32nd International Conference on Machine
Learning (ICML-15) (pp. 1889-1897).
4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347.
5. Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region
method for deep reinforcement learning using Kronecker-factored approximation. arXiv
preprint arXiv:1708.05144.
6. Martens, J., & Grosse, R. (2015, June). Optimizing neural networks with
Kronecker-factored approximate curvature. In International Conference on Machine
Learning (pp. 2408-2417).
7. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained Policy
Optimization. arXiv preprint arXiv:1705.10528.
8. UC Berkeley CS294 (Fall 2017). Lecture notes, October 11.
SungYub Kim January 6, 2018 3 / 26
4. Limitations of policy gradients A MM framework for RL
Policy Gradients Review
Definition 1.1
Markov Decision Process
1. $\mathcal{S}$ is a set of states,
2. $\mathcal{A}$ is a set of actions,
3. $\mathcal{T}^{a}_{s,s'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the probability that taking action $a$ in state $s$ at time $t$ leads to state $s'$ at time $t+1$,
4. $\mathcal{R}^{a}_{s} = \mathbb{E}[R_t \mid S_t = s, A_t = a]$ is the expected reward received when the agent chooses action $a$ in state $s$,
5. $\gamma \in [0, 1]$ is the discount factor, which represents the difference in importance between future rewards and present rewards.
Definition 1.2
A policy $\pi$ is a distribution over actions given states,
$$\pi(s, a) = \mathbb{P}[A_t = a \mid S_t = s].$$
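To make the notation concrete, here is a minimal sketch of how the components of Definition 1.1 can be stored as plain arrays (the numerical values are made up purely for illustration):

```python
import numpy as np

# A tiny MDP in the notation of Definition 1.1: two states {0, 1}, two actions {0, 1}.
n_states, n_actions = 2, 2
T = np.zeros((n_actions, n_states, n_states))   # T[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
T[0] = [[0.9, 0.1], [0.2, 0.8]]
T[1] = [[0.5, 0.5], [0.6, 0.4]]
R = np.array([[1.0, 0.0],                        # R[a, s] = E[R_t | S_t=s, A_t=a]
              [0.5, 2.0]])
gamma = 0.99                                     # discount factor
```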
SungYub Kim January 6, 2018 4 / 26
5. Limitations of policy gradients A MM framework for RL
Policy Gradients Review
Definition 1.3
A (discounted) return is the sum of the discounted rewards,
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.$$
We want to maximize the expected return. Therefore, the optimization problem is
$$\max_{\theta}\; J(\pi_\theta) \doteq \mathbb{E}_{\tau\sim\pi_\theta}[G_0].$$
For first-order optimization, we need the gradient of the objective function,
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[G_0 \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(s_t, a_t)\Big] = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, A^{\pi_\theta}(s_t, a_t)\Big].$$
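As a rough illustration of the second (advantage-weighted) form of the gradient, the sketch below estimates it from a single sampled trajectory. The linear-softmax parameterization and the given advantage estimates are assumptions made for the example, not something from the slides:

```python
import numpy as np

def policy_grad_estimate(theta, states, actions, advantages, gamma=0.99):
    """Monte-Carlo estimate of grad_theta J for a linear-softmax policy.
    states: (T, d) features, actions: (T,) ints, advantages: (T,) floats,
    theta: (d, n_actions) weight matrix of the policy logits."""
    grad = np.zeros_like(theta)
    for t, (s, a, adv) in enumerate(zip(states, actions, advantages)):
        logits = s @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # grad_theta log pi(s, a) for logits = s^T theta
        dlogp = -np.outer(s, probs)
        dlogp[:, a] += s
        grad += (gamma ** t) * adv * dlogp
    return grad
```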
SungYub Kim January 6, 2018 5 / 26
6. Limitations of policy gradients A MM framework for RL
Limitations of policy gradients
Sample efficiency
Recall that the policy gradient is $\nabla_\theta \mathbb{E}_{\tau\sim\pi_\theta}[G_0]$. To estimate this gradient we must evaluate an on-policy expectation. (No experience replay!)
Importance sampling can be used to obtain an off-policy gradient, but this approach tends to be unstable.
Relation between parameter space and performance measure?
SungYub Kim January 6, 2018 6 / 26
7. MM framework? A MM framework for RL
MM framework?
MM (majorization-minimization / minorization-maximization) algorithms consist of two steps:
Majorization/Minorization step: construct a surrogate objective $\hat f_{\theta^{(m)}}$ satisfying (shown here for the minorize-maximize case)
$$f(\theta) \ge \hat f_{\theta^{(m)}}(\theta) \quad \text{for all } \theta, \qquad f(\theta^{(m)}) = \hat f_{\theta^{(m)}}(\theta^{(m)}),$$
built from local information of the original objective (value, gradient, Hessian, etc.).
Maximization/Minimization step: optimize the surrogate objective.
Does this algorithm work? With $\theta^{(m+1)} \in \arg\max_\theta \hat f_{\theta^{(m)}}(\theta)$,
$$\begin{aligned}
f(\theta^{(m)}) &= f(\theta^{(m)}) - \hat f_{\theta^{(m)}}(\theta^{(m)}) + \hat f_{\theta^{(m)}}(\theta^{(m)}) \\
&\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m)}) && \text{(the gap } f - \hat f_{\theta^{(m)}}\text{ is }0\text{ at }\theta^{(m)}\text{ and }\ge 0\text{ everywhere)} \\
&\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m+1)}) && (\theta^{(m+1)}\text{ maximizes }\hat f_{\theta^{(m)}}) \\
&= f(\theta^{(m+1)}),
\end{aligned}$$
so each iteration monotonically improves the objective.
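As a toy instance of the minorize-maximize recipe (not from the slides): for an objective with an $L$-Lipschitz gradient, the quadratic $f(\theta^{(m)}) + \nabla f(\theta^{(m)})^T(\theta - \theta^{(m)}) - \tfrac{L}{2}\|\theta - \theta^{(m)}\|^2$ is a valid minorant touching $f$ at $\theta^{(m)}$, and maximizing it exactly recovers a gradient-ascent step:

```python
import numpy as np

def mm_maximize(grad, theta0, L, iters=100):
    """Minorize-maximize loop: at each iterate the quadratic minorant is
    maximized exactly, which amounts to a gradient-ascent step of size 1/L."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = theta + grad(theta) / L   # argmax of the current minorant
    return theta

# Example: f(theta) = -||theta - 1||^2, grad f = -2(theta - 1), L = 2.
grad = lambda th: -2.0 * (th - 1.0)
print(mm_maximize(grad, theta0=np.zeros(3), L=2.0))  # -> approx [1. 1. 1.]
```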
SungYub Kim January 6, 2018 7 / 26
8. MM framework for RL A MM framework for RL
Relative policy performance identity
In the supervised learning framework, we usually consider only the parameter space and the distribution space. In the RL framework, we also need to consider the performance space.
The relative policy performance identity [1] gives us the relation between the distribution space and the performance space.
Lemma 3.1
Relative policy performance identity (RPPI)
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\Big]$$
Good news: this relates the policy space to the performance space.
Bad news: we cannot compute the expectation under $\pi'$.
SungYub Kim January 6, 2018 8 / 26
9. MM framework for RL A MM framework for RL
Proof of relative policy performance identity
$$\begin{aligned}
\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_t, a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)\big)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t+1} V^{\pi}(S_{t+1}) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(S_t)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=1}^{\infty}\gamma^{t} V^{\pi}(S_t) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(S_t)\Big] \\
&= J(\pi') - \mathbb{E}_{\tau\sim\pi'}\big[V^{\pi}(S_0)\big] \\
&= J(\pi') - J(\pi)
\end{aligned}$$
SungYub Kim January 6, 2018 9 / 26
10. MM framework for RL A MM framework for RL
MM framework for RL
Definition 3.2
Discounted future state distribution (stationary distribution)
$$d^{\pi}(s) = (1 - \gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}\big[S_t = s \mid \pi\big]$$
For example, if a trajectory deterministically alternates between two states, $\tau = [H, T, H, T, \ldots]$ with $s \in \{H, T\}$, then $d^{\pi}(H) = (1-\gamma)(1 + \gamma^2 + \cdots) = \tfrac{1}{1+\gamma}$ and $d^{\pi}(T) = \tfrac{\gamma}{1+\gamma}$.
Then
$$\begin{aligned}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_t, a_t)\Big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi'}\big[A^{\pi}(s, a)\big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)}\, A^{\pi}(s, a)\Big]
\end{aligned}$$
If $d^{\pi} \approx d^{\pi'}$, then
$$J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)}\, A^{\pi}(s, a)\Big] \doteq L_{\pi}(\pi')
$$
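In practice $L_{\pi}(\pi')$ is estimated from on-policy samples via the likelihood ratio. A minimal sketch (the log-probability arrays and advantage estimates are assumed inputs, and the constant $1/(1-\gamma)$ factor, which does not affect the maximizer, is dropped):

```python
import numpy as np

def surrogate_L(logp_new, logp_old, advantages):
    """Monte-Carlo estimate of the surrogate L_pi(pi').
    logp_old: log pi(s_t, a_t) recorded when the data was collected,
    logp_new: log pi'(s_t, a_t) under the candidate parameters,
    advantages: estimates of A^pi(s_t, a_t)."""
    ratio = np.exp(logp_new - logp_old)          # pi'(s, a) / pi(s, a)
    return np.mean(ratio * advantages)
```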
SungYub Kim January 6, 2018 10 / 26
11. MM framework for RL A MM framework for RL
MM framework for RL
Let
$$B \in \Big\{\alpha^2\ [1],\; D^{\max}_{TV}(\pi \,\|\, \pi'),\; D^{\max}_{KL}(\pi \,\|\, \pi')\ [3],\; \mathbb{E}_{s\sim d^{\pi}}\big[D_{KL}(\pi \,\|\, \pi')(s)\big]\ [7]\Big\}$$
where
$$D^{\max}_{TV}(\pi \,\|\, \pi') \doteq \max_{s\in S} D_{TV}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big), \qquad D^{\max}_{KL}(\pi \,\|\, \pi') \doteq \max_{s\in S} D_{KL}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big).$$
If we define $M(\pi') \doteq J(\pi) + L_{\pi}(\pi') - C\,B$, then
$$M(\pi) = J(\pi), \qquad M(\pi') \le J(\pi').$$
These are exactly the minorization conditions, so we have defined an MM algorithm for reinforcement learning. We call $B$ a relative policy performance bound.
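For concreteness, the choice $B = D^{\max}_{KL}(\pi \,\|\, \pi')$ corresponds to the bound proved in [3],
$$J(\pi') \;\ge\; J(\pi) + L_{\pi}(\pi') - \frac{4\,\epsilon\,\gamma}{(1-\gamma)^2}\, D^{\max}_{KL}(\pi \,\|\, \pi'), \qquad \epsilon = \max_{s,a}\,\big|A^{\pi}(s,a)\big|,$$
i.e., $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ for this choice of $B$.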
SungYub Kim January 6, 2018 11 / 26
12. MM framework for RL A MM framework for RL
MM framework for RL
Let $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) - C\,B$, and we would like to solve
$$\max_{\theta}\; M_i(\pi_\theta)$$
instead of
$$\max_{\theta}\; J(\pi_\theta).$$
This can be seen as exploiting the information of the current policy $\pi_i$.
⇒ The change of the policy can be controlled (via the KL term).
Both $L_{\pi_i}(\pi_\theta)$ and $\mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big]$ can be estimated using samples from the current policy only.
SungYub Kim January 6, 2018 12 / 26
13. Natural policy gradient A MM framework for RL
Natural policy gradient
Specifically, let $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) + C\big(\delta - \mathbb{E}_{s\sim d^{\pi_i}}[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)]\big)$. Then
$$\max_{\theta}\; M_i(\pi_\theta)$$
can be seen as a Lagrangian relaxation of the trust-region problem
$$\max_{\theta}\; L_{\pi_i}(\pi_\theta) \quad \text{s.t.} \quad \mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big] \le \delta,$$
and for an appropriate correspondence between the penalty coefficient $C$ and the radius $\delta$ the two formulations are equivalent.
Evaluation (zeroth-order) of this optimization problem is easy.
But a direct primal algorithm for this problem does not exist (because of the nonlinearity of the KL term).
⇒ Taylor expansion!
SungYub Kim January 6, 2018 13 / 26
14. Natural policy gradient A MM framework for RL
Natural policy gradient
Note that
$$L_{\pi_i}(\pi_\theta) \approx L_{\pi_i}(\pi_{\theta_i}) + \nabla_{\theta} L_{\pi_i}(\pi_\theta)\big|_{\theta=\theta_i}^{T}\,(\theta - \theta_i)$$
$$\mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big] \approx \tfrac{1}{2}\,(\theta - \theta_i)^T F_i\, (\theta - \theta_i)$$
where $F_i$ is the Fisher information matrix,
$$F_i = \mathbb{E}_{s\sim d^{\pi_i},\, a\sim\pi_i}\big[\nabla_\theta \log \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a)^T\big]\Big|_{\theta=\theta_i} = -\,\mathbb{E}_{s\sim d^{\pi_i},\, a\sim\pi_i}\big[\nabla^2_\theta \log \pi_\theta(s,a)\big]\Big|_{\theta=\theta_i}.$$
Therefore, we get
$$\max_{\theta}\; \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i) \quad \text{s.t.} \quad \tfrac{1}{2}\,(\theta - \theta_i)^T F_i\,(\theta - \theta_i) \le \delta.$$
The solution to this optimization problem is
$$\theta_{i+1} = \theta_i + \sqrt{\frac{2\delta}{\nabla_{\theta_i} L(\pi_i)^T F_i^{-1} \nabla_{\theta_i} L(\pi_i)}}\; F_i^{-1} \nabla_{\theta_i} L(\pi_i),$$
and we call $\hat g \doteq F_i^{-1} \nabla_{\theta_i} L(\pi_i)$ the natural policy gradient.
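A minimal sketch of the resulting update, assuming for simplicity that $F_i$ has been formed explicitly (in practice it is only accessed through Fisher-vector products; see the conjugate gradient slide later):

```python
import numpy as np

def npg_step(theta, grad_L, F, delta=0.01):
    """One natural policy gradient step for the trust region 0.5 d^T F d <= delta.
    F is assumed symmetric positive definite."""
    g_nat = np.linalg.solve(F, grad_L)                    # ghat = F^{-1} grad L
    step_size = np.sqrt(2.0 * delta / (grad_L @ g_nat))   # sqrt(2 delta / g^T F^{-1} g)
    return theta + step_size * g_nat
```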
SungYub Kim January 6, 2018 14 / 26
15. Natural policy gradient A MM framework for RL
Natural policy gradient
Note that the original SGD means we would like to improve the performance by small movement
in parameter space.
$$\max_{\Delta\theta}\; \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_2^2 \le \delta \qquad \text{(SGD)}$$
$$\max_{\Delta\theta}\; \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_{F_i}^2 = \Delta\theta^T F_i\,\Delta\theta \le \delta \qquad \text{(natural policy gradient)}$$
Similarly, natural policy gradient means we would like to improve the performance by small
movement in policy space.
SungYub Kim January 6, 2018 15 / 26
16. Natural policy gradient A MM framework for RL
Properties of Natural policy gradient
Now we would like to answer the question:
Relation between parameter space and performance measure?
The answer is in [2].
Theorem 4.1
Greedy update in the exponential family
For $\pi_\theta(s, a) \propto \exp(\theta^T \phi_{sa})$, assume that $\hat g$ is non-zero. Let
$\pi_\infty(s, a) = \lim_{\alpha\to\infty} \pi_{\theta + \alpha\hat g}(s, a)$. Then $\pi_\infty(s, a) \ne 0$ iff $a \in \arg\max_{a'} A^{\pi_\theta}(s, a')$.
Theorem 4.2
Greedy update for a general parametric policy
Let the parameter update be $\theta' = \theta + \alpha\hat g$. Then
$$\pi_{\theta'}(s, a) = \pi_\theta(s, a)\big(1 + \alpha\, A^{\pi_\theta}(s, a)\big) + O(\alpha^2).$$
Therefore, the natural policy gradient is a natural parametric counterpart of policy iteration methods in RL.
SungYub Kim January 6, 2018 16 / 26
17. MM algorithms for RL A MM framework for RL
How to find ˆg
Now our task is to compute
$$\hat g \doteq F_i^{-1}\, \nabla_{\theta_i} L(\pi_i).$$
Forming and inverting $F_i$ directly is numerically infeasible for neural networks with many parameters. Two practical solutions are:
Conjugate gradient (CG) method
Kronecker-factored approximate curvature (K-FAC)
SungYub Kim January 6, 2018 17 / 26
18. MM algorithms for RL A MM framework for RL
Conjugate Gradient method(CG)
The conjugate gradient algorithm solves the linear system $Ax = b$ by finding projections onto the Krylov subspaces $\mathrm{span}\{b, Ab, A^2 b, \ldots\}$.
Algorithm 5.1
Conjugate gradient algorithm
Let $x_0 \doteq 0$, $g_0 \doteq Ax_0 - b$ and $d_0 \doteq -g_0$.
For $k = 0, \ldots, n-1$:
  If $g_k = 0$, then STOP and return $x_k$.
  Else:
    $x_{k+1} \doteq x_k + \lambda_k d_k$, where $\lambda_k \doteq \dfrac{g_k^T g_k}{d_k^T A d_k}$
    $g_{k+1} \doteq A x_{k+1} - b$
    $d_{k+1} \doteq -g_{k+1} + \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}\, d_k$.
Although running CG to completion costs $O(n^3)$ time (comparable to Gauss-Jordan elimination), the error shrinks geometrically with the number of iterations (at a rate governed by the condition number of $A$), so in practice we can stop much earlier than $n$ iterations.
⇒ Truncated Natural Policy Gradient.
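A direct transcription of Algorithm 5.1 as a sketch. In TRPO the matrix $A$ is never formed explicitly: `matvec` would return the Fisher-vector product $F_i v$ computed by automatic differentiation, and $b$ would be the estimated policy gradient:

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A given only matvec(v) = A v."""
    x = np.zeros_like(b)
    g = matvec(x) - b          # g_0 = A x_0 - b
    d = -g                     # d_0 = -g_0
    for _ in range(iters):
        if np.dot(g, g) < tol:
            break
        Ad = matvec(d)
        lam = np.dot(g, g) / np.dot(d, Ad)    # step length lambda_k
        x = x + lam * d
        g_new = g + lam * Ad                  # = A x_{k+1} - b
        beta = np.dot(g_new, g_new) / np.dot(g, g)
        d = -g_new + beta * d
        g = g_new
    return x

# Example on a small SPD system:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b, iters=10)
print(np.allclose(A @ x, b))   # True
```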
SungYub Kim January 6, 2018 18 / 26
19. MM algorithms for RL A MM framework for RL
Trust region policy optimization(TRPO)
Algorithm 5.2
Trust region policy optimization [3]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $F_{n,0}^{-1}$ with $D_n$.
  Approximate $\hat g$ with a fixed number of CG iterations.
  For minibatch $k = 0, 1, 2, \ldots, T$
    Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain
    $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$
    such that
    $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta$.
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
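A minimal sketch of a single TRPO parameter update with the backtracking line search. Here `surrogate`, `kl`, and `fvp` are assumed helpers that evaluate $\hat L_{\theta_{\text{old}}}$, $\hat D_{KL}$, and Fisher-vector products from the collected batch $D_n$, and `ghat` comes from truncated CG:

```python
import numpy as np

def trpo_update(theta_old, ghat, fvp, surrogate, kl, delta=0.01,
                backtrack_ratio=0.8, max_backtracks=15):
    # Rescale ghat so that the full step sits on the trust-region boundary
    # 0.5 * step^T F step = delta (same scaling as the natural gradient slide).
    step = np.sqrt(2.0 * delta / (ghat @ fvp(ghat))) * ghat
    for j in range(max_backtracks):
        theta_new = theta_old + (backtrack_ratio ** j) * step   # exponential decay
        if surrogate(theta_new) >= 0.0 and kl(theta_new) <= delta:
            return theta_new
    return theta_old   # no acceptable step found; keep the old parameters
```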
SungYub Kim January 6, 2018 19 / 26
20. MM algorithms for RL A MM framework for RL
Kronecker-Factored Approximate Curvature(K-FAC)
(In this presentation, we only consider the block-diagonal approximation version of K-FAC.)
Note that $F_i = \mathbb{E}\big[D\theta_i\, D\theta_i^T\big]$, where
$$D\theta_i = \big[\operatorname{vec}(DW_1)^T \;\; \operatorname{vec}(DW_2)^T \;\; \cdots \;\; \operatorname{vec}(DW_l)^T\big]^T$$
and, following [6], $DW_j$ denotes the gradient of the log-likelihood with respect to the weights of layer $j$. Therefore
$$F_i = \begin{bmatrix}
\mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_l)^T] \\
\mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_l)^T] \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_l)^T]
\end{bmatrix}$$
We approximate this by the block-diagonal matrix
$$\hat F_i = \begin{bmatrix}
\mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_1)^T] & 0 & \cdots & 0 \\
0 & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_2)^T] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_l)^T]
\end{bmatrix}$$
SungYub Kim January 6, 2018 20 / 26
21. MM algorithms for RL A MM framework for RL
Kronecker-Factored Approximate Curvature(K-FAC)
Since $v \otimes u = \operatorname{vec}(u v^T)$, $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$, each block-diagonal submatrix can be approximated by
$$\begin{aligned}
\mathbb{E}\big[\operatorname{vec}(DW_i)\operatorname{vec}(DW_i)^T\big]
&= \mathbb{E}\big[\operatorname{vec}(\nabla_{s_i} L\, a_{i-1}^T)\operatorname{vec}(\nabla_{s_i} L\, a_{i-1}^T)^T\big] \\
&= \mathbb{E}\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1} \otimes \nabla_{s_i} L)^T\big] \\
&= \mathbb{E}\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1}^T \otimes \nabla_{s_i} L^T)\big] \\
&= \mathbb{E}\big[a_{i-1} a_{i-1}^T \otimes \nabla_{s_i} L\, \nabla_{s_i} L^T\big] \\
&\approx \mathbb{E}\big[a_{i-1} a_{i-1}^T\big] \otimes \mathbb{E}\big[\nabla_{s_i} L\, \nabla_{s_i} L^T\big] \doteq A \otimes S \doteq \bar F_i,
\end{aligned}$$
where $a_{i-1}$ is the activation of the previous layer and $s_i$ is the pre-activation of layer $i$. Since $(P \otimes Q)^{-1} = P^{-1} \otimes Q^{-1}$ and $(P \otimes Q)\operatorname{vec}(T) = \operatorname{vec}(Q T P^T)$,
$$\Delta W_i = \bar F_i^{-1}\, \nabla_{W_i} L(\theta_i) = \big(A^{-1} \otimes S^{-1}\big)\, \operatorname{vec}\big(\nabla_{W_i} L(\theta_i)\big) = S^{-1}\, \nabla_{W_i} L(\theta_i)\, A^{-1}.$$
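A minimal per-layer sketch of this update. The activation and pre-activation-gradient batches and the layer gradient are assumed inputs, and a small damping term (not on the slide) is added to keep the factor matrices invertible:

```python
import numpy as np

def kfac_layer_step(a_prev, g_s, grad_W, damping=1e-3):
    """a_prev: (batch, n_in) activations a_{i-1}; g_s: (batch, n_out) gradients
    w.r.t. pre-activations s_i; grad_W: (n_out, n_in) gradient of the surrogate
    w.r.t. this layer's weights. Returns the K-FAC preconditioned update."""
    n = a_prev.shape[0]
    A = a_prev.T @ a_prev / n                  # A = E[a a^T]
    S = g_s.T @ g_s / n                        # S = E[grad_s L grad_s L^T]
    A += damping * np.eye(A.shape[0])
    S += damping * np.eye(S.shape[0])
    # Fbar^{-1} vec(grad_W) = (A^{-1} kron S^{-1}) vec(grad_W) = vec(S^{-1} grad_W A^{-1})
    return np.linalg.solve(S, grad_W) @ np.linalg.inv(A)
```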
SungYub Kim January 6, 2018 21 / 26
22. MM algorithms for RL A MM framework for RL
Actor-Critic using Kronecker-Factored Trust Region
Algorithm 5.3
Actor-critic using Kronecker-factored trust region (ACKTR) [5]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $\hat F_{n,0}^{-1}$ with $D_n$.
  Approximate $\hat g$ by K-FAC.
  For minibatch $k = 0, 1, 2, \ldots, T$
    Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain
    $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$
    such that
    $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta$.
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
SungYub Kim January 6, 2018 22 / 26
23. MM algorithms for RL A MM framework for RL
Clipped surrogate objective
In PPO, we sidestep both the difficulty of computing $\bar F_i^{-1}$ and the cost of the line search by using a clipped surrogate objective,
$$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}\Big[\min\Big(r_k(\theta)\, \hat A_t^{\pi_{\theta_k}},\; \operatorname{clip}\big(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat A_t^{\pi_{\theta_k}}\Big)\Big]$$
where
$$r_k(\theta) = \frac{\pi_\theta(s, a)}{\pi_{\theta_k}(s, a)}.$$
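A minimal NumPy sketch of the clipped objective; in practice it is written in an autodiff framework so that it can be differentiated with respect to $\theta$:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate L^CLIP for one batch of (s_t, a_t, A_hat_t) samples."""
    ratio = np.exp(logp_new - logp_old)                      # r_k(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```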
SungYub Kim January 6, 2018 23 / 26
24. MM algorithms for RL A MM framework for RL
PPO with Clipped surrogate objective
Algorithm 5.4
PPO with Clipped surrogate objective [4]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  For minibatch $k = 0, 1, 2, \ldots, T$
    Compute the policy update
    $$\theta_{n,k+1} = \theta_{n,k} + \alpha\, \nabla_\theta \hat L^{CLIP}_{\theta_{n,0}}(\theta)\Big|_{\theta=\theta_{n,k}}.$$
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
SungYub Kim January 6, 2018 24 / 26
25. Benchmarks A MM framework for RL
Benchmarks
SungYub Kim January 6, 2018 25 / 26
26. A MM framework for RL
Questions & Answers
SungYub Kim January 6, 2018 26 / 26