MM framework for RL
1. A MM framework for RL
A MM framework for RL
SungYub Kim
Management Science/Optimization Lab
Department of Industrial Engineering
Seoul National University
January 6, 2018
SungYub Kim January 6, 2018 1 / 26
2. A MM framework for RL
Contents
Limitations of policy gradients
MM framework?
MM framework for RL
Natural policy gradients
MM algorithms for RL
1. Trust region policy optimization (TRPO)
2. Actor-critic using Kronecker-Factored Trust Region (ACKTR)
3. Proximal policy optimization (PPO)
Benchmarks
SungYub Kim January 6, 2018 2 / 26
3. A MM framework for RL
References
1. Kakade, S., & Langford, J. (2002, July). Approximately optimal approximate
reinforcement learning. In ICML (Vol. 2, pp. 267-274).
2. Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information
processing systems (pp. 1531-1538).
3. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region
policy optimization. In Proceedings of the 32nd International Conference on Machine
Learning (ICML-15) (pp. 1889-1897).
4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347.
5. Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region
method for deep reinforcement learning using Kronecker-factored approximation. arXiv
preprint arXiv:1708.05144.
6. Martens, J., & Grosse, R. (2015, June). Optimizing neural networks with
Kronecker-factored approximate curvature. In International Conference on Machine
Learning (pp. 2408-2417).
7. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained Policy
Optimization. arXiv preprint arXiv:1705.10528.
8. UC Berkeley CS294 (Fall 2017). Lecture notes, October 11.
SungYub Kim January 6, 2018 3 / 26
4. Limitations of policy gradients A MM framework for RL
Policy Gradients Review
Definition 1.1
Markov Decision Process
1. $\mathcal{S}$ is a set of states,
2. $\mathcal{A}$ is a set of actions,
3. $\mathcal{T}^{a}_{s,s'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the probability that taking action $a$ in state $s$ at time $t$ leads to state $s'$ at time $t+1$,
4. $\mathcal{R}^{a}_{s} = \mathbb{E}[R_t \mid S_t = s, A_t = a]$ is the expected reward received when the agent chooses action $a$ in state $s$,
5. $\gamma \in [0, 1]$ is the discount factor, which represents the difference in importance between future rewards and present rewards.
Definition 1.2
A policy $\pi$ is a distribution over actions given states,
$$\pi(s, a) = \mathbb{P}[A_t = a \mid S_t = s].$$
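To make the notation concrete, here is a minimal sketch of how the components of Definition 1.1 can be stored as plain arrays (the numerical values are made up purely for illustration):

```python
import numpy as np

# A tiny MDP in the notation of Definition 1.1: two states {0, 1}, two actions {0, 1}.
n_states, n_actions = 2, 2
T = np.zeros((n_actions, n_states, n_states))   # T[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
T[0] = [[0.9, 0.1], [0.2, 0.8]]
T[1] = [[0.5, 0.5], [0.6, 0.4]]
R = np.array([[1.0, 0.0],                        # R[a, s] = E[R_t | S_t=s, A_t=a]
              [0.5, 2.0]])
gamma = 0.99                                     # discount factor
```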
SungYub Kim January 6, 2018 4 / 26
5. Limitations of policy gradients A MM framework for RL
Policy Gradients Review
Definition 1.3
A (discounted) return is the sum of the discounted rewards,
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.$$
We want to maximize the expected return. Therefore, the optimization problem is
$$\max_{\theta}\; J(\pi_\theta) \doteq \mathbb{E}_{\tau\sim\pi_\theta}[G_0].$$
For first-order optimization, we need the gradient of the objective function,
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[G_0 \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(s_t, a_t)\Big] = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, A^{\pi_\theta}(s_t, a_t)\Big].$$
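As a rough illustration of the second (advantage-weighted) form of the gradient, the sketch below estimates it from a single sampled trajectory. The linear-softmax parameterization and the given advantage estimates are assumptions made for the example, not something from the slides:

```python
import numpy as np

def policy_grad_estimate(theta, states, actions, advantages, gamma=0.99):
    """Monte-Carlo estimate of grad_theta J for a linear-softmax policy.
    states: (T, d) features, actions: (T,) ints, advantages: (T,) floats,
    theta: (d, n_actions) weight matrix of the policy logits."""
    grad = np.zeros_like(theta)
    for t, (s, a, adv) in enumerate(zip(states, actions, advantages)):
        logits = s @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # grad_theta log pi(s, a) for logits = s^T theta
        dlogp = -np.outer(s, probs)
        dlogp[:, a] += s
        grad += (gamma ** t) * adv * dlogp
    return grad
```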
SungYub Kim January 6, 2018 5 / 26
6. Limitations of policy gradients A MM framework for RL
Limitations of policy gradients
Sample efficiency
Recall that the policy gradient is $\nabla_\theta \mathbb{E}_{\tau\sim\pi_\theta}[G_0]$. To estimate this gradient we must evaluate an on-policy expectation. (No experience replay!)
Importance sampling can be used to obtain an off-policy gradient, but this approach tends to be unstable.
Relation between parameter space and performance measure?
SungYub Kim January 6, 2018 6 / 26
7. MM framework? A MM framework for RL
MM framework?
MM (majorization-minimization / minorization-maximization) algorithms consist of two steps:
Majorization/Minorization step: construct a surrogate objective $\hat f_{\theta^{(m)}}$ satisfying (shown here for the minorize-maximize case)
$$f(\theta) \ge \hat f_{\theta^{(m)}}(\theta) \quad \text{for all } \theta, \qquad f(\theta^{(m)}) = \hat f_{\theta^{(m)}}(\theta^{(m)}),$$
built from local information of the original objective (value, gradient, Hessian, etc.).
Maximization/Minimization step: optimize the surrogate objective.
Does this algorithm work? With $\theta^{(m+1)} \in \arg\max_\theta \hat f_{\theta^{(m)}}(\theta)$,
$$\begin{aligned}
f(\theta^{(m)}) &= f(\theta^{(m)}) - \hat f_{\theta^{(m)}}(\theta^{(m)}) + \hat f_{\theta^{(m)}}(\theta^{(m)}) \\
&\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m)}) && \text{(the gap } f - \hat f_{\theta^{(m)}}\text{ is }0\text{ at }\theta^{(m)}\text{ and }\ge 0\text{ everywhere)} \\
&\le f(\theta^{(m+1)}) - \hat f_{\theta^{(m)}}(\theta^{(m+1)}) + \hat f_{\theta^{(m)}}(\theta^{(m+1)}) && (\theta^{(m+1)}\text{ maximizes }\hat f_{\theta^{(m)}}) \\
&= f(\theta^{(m+1)}),
\end{aligned}$$
so each iteration monotonically improves the objective.
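As a toy instance of the minorize-maximize recipe (not from the slides): for an objective with an $L$-Lipschitz gradient, the quadratic $f(\theta^{(m)}) + \nabla f(\theta^{(m)})^T(\theta - \theta^{(m)}) - \tfrac{L}{2}\|\theta - \theta^{(m)}\|^2$ is a valid minorant touching $f$ at $\theta^{(m)}$, and maximizing it exactly recovers a gradient-ascent step:

```python
import numpy as np

def mm_maximize(grad, theta0, L, iters=100):
    """Minorize-maximize loop: at each iterate the quadratic minorant is
    maximized exactly, which amounts to a gradient-ascent step of size 1/L."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = theta + grad(theta) / L   # argmax of the current minorant
    return theta

# Example: f(theta) = -||theta - 1||^2, grad f = -2(theta - 1), L = 2.
grad = lambda th: -2.0 * (th - 1.0)
print(mm_maximize(grad, theta0=np.zeros(3), L=2.0))  # -> approx [1. 1. 1.]
```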
SungYub Kim January 6, 2018 7 / 26
8. MM framework for RL A MM framework for RL
Relative policy performance identity
In the supervised learning framework, we usually consider only the parameter space and the distribution space. In the RL framework, we also need to consider the performance space.
The relative policy performance identity [1] gives us the relation between the distribution space and the performance space.
Lemma 3.1
Relative policy performance identity (RPPI)
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\Big]$$
Good news: this relates the policy space to the performance space.
Bad news: we cannot compute the expectation under $\pi'$.
SungYub Kim January 6, 2018 8 / 26
9. MM framework for RL A MM framework for RL
Proof of relative policy performance identity
$$\begin{aligned}
\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_t, a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)\big)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t+1} V^{\pi}(S_{t+1}) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(S_t)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=1}^{\infty}\gamma^{t} V^{\pi}(S_t) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(S_t)\Big] \\
&= J(\pi') - \mathbb{E}_{\tau\sim\pi'}\big[V^{\pi}(S_0)\big] \\
&= J(\pi') - J(\pi)
\end{aligned}$$
SungYub Kim January 6, 2018 9 / 26
10. MM framework for RL A MM framework for RL
MM framework for RL
Definition 3.2
Discounted future state distribution (stationary distribution)
$$d^{\pi}(s) = (1 - \gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}\big[S_t = s \mid \pi\big]$$
For example, if a trajectory deterministically alternates between two states, $\tau = [H, T, H, T, \ldots]$ with $s \in \{H, T\}$, then $d^{\pi}(H) = (1-\gamma)(1 + \gamma^2 + \cdots) = \tfrac{1}{1+\gamma}$ and $d^{\pi}(T) = \tfrac{\gamma}{1+\gamma}$.
Then
$$\begin{aligned}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_t, a_t)\Big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi'}\big[A^{\pi}(s, a)\big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)}\, A^{\pi}(s, a)\Big]
\end{aligned}$$
If $d^{\pi} \approx d^{\pi'}$, then
$$J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\, a\sim\pi}\Big[\frac{\pi'(s,a)}{\pi(s,a)}\, A^{\pi}(s, a)\Big] \doteq L_{\pi}(\pi')
$$
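In practice $L_{\pi}(\pi')$ is estimated from on-policy samples via the likelihood ratio. A minimal sketch (the log-probability arrays and advantage estimates are assumed inputs, and the constant $1/(1-\gamma)$ factor, which does not affect the maximizer, is dropped):

```python
import numpy as np

def surrogate_L(logp_new, logp_old, advantages):
    """Monte-Carlo estimate of the surrogate L_pi(pi').
    logp_old: log pi(s_t, a_t) recorded when the data was collected,
    logp_new: log pi'(s_t, a_t) under the candidate parameters,
    advantages: estimates of A^pi(s_t, a_t)."""
    ratio = np.exp(logp_new - logp_old)          # pi'(s, a) / pi(s, a)
    return np.mean(ratio * advantages)
```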
SungYub Kim January 6, 2018 10 / 26
11. MM framework for RL A MM framework for RL
MM framework for RL
Let
$$B \in \Big\{\alpha^2\ [1],\; D^{\max}_{TV}(\pi \,\|\, \pi'),\; D^{\max}_{KL}(\pi \,\|\, \pi')\ [3],\; \mathbb{E}_{s\sim d^{\pi}}\big[D_{KL}(\pi \,\|\, \pi')(s)\big]\ [7]\Big\}$$
where
$$D^{\max}_{TV}(\pi \,\|\, \pi') \doteq \max_{s\in S} D_{TV}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big), \qquad D^{\max}_{KL}(\pi \,\|\, \pi') \doteq \max_{s\in S} D_{KL}\big(\pi(s,\cdot) \,\|\, \pi'(s,\cdot)\big).$$
If we define $M(\pi') \doteq J(\pi) + L_{\pi}(\pi') - C\,B$, then
$$M(\pi) = J(\pi), \qquad M(\pi') \le J(\pi').$$
These are exactly the minorization conditions, so we have defined an MM algorithm for reinforcement learning. We call $B$ a relative policy performance bound.
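For concreteness, the choice $B = D^{\max}_{KL}(\pi \,\|\, \pi')$ corresponds to the bound proved in [3],
$$J(\pi') \;\ge\; J(\pi) + L_{\pi}(\pi') - \frac{4\,\epsilon\,\gamma}{(1-\gamma)^2}\, D^{\max}_{KL}(\pi \,\|\, \pi'), \qquad \epsilon = \max_{s,a}\,\big|A^{\pi}(s,a)\big|,$$
i.e., $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ for this choice of $B$.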
SungYub Kim January 6, 2018 11 / 26
12. MM framework for RL A MM framework for RL
MM framework for RL
Let $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) - C\,B$, and we would like to solve
$$\max_{\theta}\; M_i(\pi_\theta)$$
instead of
$$\max_{\theta}\; J(\pi_\theta).$$
This can be seen as exploiting the information of the current policy $\pi_i$.
⇒ The change of the policy can be controlled (via the KL term).
Both $L_{\pi_i}(\pi_\theta)$ and $\mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big]$ can be estimated using samples from the current policy only.
SungYub Kim January 6, 2018 12 / 26
13. Natural policy gradient A MM framework for RL
Natural policy gradient
Specifically, let $M_i(\pi_\theta) \doteq J(\pi_i) + L_{\pi_i}(\pi_\theta) + C\big(\delta - \mathbb{E}_{s\sim d^{\pi_i}}[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)]\big)$. Then
$$\max_{\theta}\; M_i(\pi_\theta)$$
can be seen as a Lagrangian relaxation of the trust-region problem
$$\max_{\theta}\; L_{\pi_i}(\pi_\theta) \quad \text{s.t.} \quad \mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big] \le \delta,$$
and for an appropriate correspondence between the penalty coefficient $C$ and the radius $\delta$ the two formulations are equivalent.
Evaluation (zeroth-order) of this optimization problem is easy.
But a direct primal algorithm for this problem does not exist (because of the nonlinearity of the KL term).
⇒ Taylor expansion!
SungYub Kim January 6, 2018 13 / 26
14. Natural policy gradient A MM framework for RL
Natural policy gradient
Note that
$$L_{\pi_i}(\pi_\theta) \approx L_{\pi_i}(\pi_{\theta_i}) + \nabla_{\theta} L_{\pi_i}(\pi_\theta)\big|_{\theta=\theta_i}^{T}\,(\theta - \theta_i)$$
$$\mathbb{E}_{s\sim d^{\pi_i}}\big[D_{KL}(\pi_i \,\|\, \pi_\theta)(s)\big] \approx \tfrac{1}{2}\,(\theta - \theta_i)^T F_i\, (\theta - \theta_i)$$
where $F_i$ is the Fisher information matrix,
$$F_i = \mathbb{E}_{s\sim d^{\pi_i},\, a\sim\pi_i}\big[\nabla_\theta \log \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a)^T\big]\Big|_{\theta=\theta_i} = -\,\mathbb{E}_{s\sim d^{\pi_i},\, a\sim\pi_i}\big[\nabla^2_\theta \log \pi_\theta(s,a)\big]\Big|_{\theta=\theta_i}.$$
Therefore, we get
$$\max_{\theta}\; \nabla_{\theta_i} L(\pi_i)^T (\theta - \theta_i) \quad \text{s.t.} \quad \tfrac{1}{2}\,(\theta - \theta_i)^T F_i\,(\theta - \theta_i) \le \delta.$$
The solution to this optimization problem is
$$\theta_{i+1} = \theta_i + \sqrt{\frac{2\delta}{\nabla_{\theta_i} L(\pi_i)^T F_i^{-1} \nabla_{\theta_i} L(\pi_i)}}\; F_i^{-1} \nabla_{\theta_i} L(\pi_i),$$
and we call $\hat g \doteq F_i^{-1} \nabla_{\theta_i} L(\pi_i)$ the natural policy gradient.
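A minimal sketch of the resulting update, assuming for simplicity that $F_i$ has been formed explicitly (in practice it is only accessed through Fisher-vector products; see the conjugate gradient slide later):

```python
import numpy as np

def npg_step(theta, grad_L, F, delta=0.01):
    """One natural policy gradient step for the trust region 0.5 d^T F d <= delta.
    F is assumed symmetric positive definite."""
    g_nat = np.linalg.solve(F, grad_L)                    # ghat = F^{-1} grad L
    step_size = np.sqrt(2.0 * delta / (grad_L @ g_nat))   # sqrt(2 delta / g^T F^{-1} g)
    return theta + step_size * g_nat
```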
SungYub Kim January 6, 2018 14 / 26
15. Natural policy gradient A MM framework for RL
Natural policy gradient
Note that the original SGD means we would like to improve the performance by small movement
in parameter space.
$$\max_{\Delta\theta}\; \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_2^2 \le \delta \qquad \text{(SGD)}$$
$$\max_{\Delta\theta}\; \nabla_{\theta_i} L(\pi_i)^T \Delta\theta \quad \text{s.t.} \quad \|\Delta\theta\|_{F_i}^2 = \Delta\theta^T F_i\,\Delta\theta \le \delta \qquad \text{(natural policy gradient)}$$
Similarly, natural policy gradient means we would like to improve the performance by small
movement in policy space.
SungYub Kim January 6, 2018 15 / 26
16. Natural policy gradient A MM framework for RL
Properties of Natural policy gradient
Now we would like to answer the question:
Relation between parameter space and performance measure?
The answer is in [2].
Theorem 4.1
Greedy update in the exponential family
For $\pi_\theta(s, a) \propto \exp(\theta^T \phi_{sa})$, assume that $\hat g$ is non-zero. Let
$\pi_\infty(s, a) = \lim_{\alpha\to\infty} \pi_{\theta + \alpha\hat g}(s, a)$. Then $\pi_\infty(s, a) \ne 0$ iff $a \in \arg\max_{a'} A^{\pi_\theta}(s, a')$.
Theorem 4.2
Greedy update for a general parametric policy
Let the parameter update be $\theta' = \theta + \alpha\hat g$. Then
$$\pi_{\theta'}(s, a) = \pi_\theta(s, a)\big(1 + \alpha\, A^{\pi_\theta}(s, a)\big) + O(\alpha^2).$$
Therefore, the natural policy gradient is a natural parametric counterpart of policy iteration methods in RL.
SungYub Kim January 6, 2018 16 / 26
17. MM algorithms for RL A MM framework for RL
How to find ˆg
Now our task is to compute
$$\hat g \doteq F_i^{-1}\, \nabla_{\theta_i} L(\pi_i).$$
Forming and inverting $F_i$ directly is numerically infeasible for neural networks with many parameters. Two practical solutions are:
Conjugate gradient (CG) method
Kronecker-factored approximate curvature (K-FAC)
SungYub Kim January 6, 2018 17 / 26
18. MM algorithms for RL A MM framework for RL
Conjugate Gradient method(CG)
The conjugate gradient algorithm solves the linear system $Ax = b$ by finding projections onto the Krylov subspaces $\mathrm{span}\{b, Ab, A^2 b, \ldots\}$.
Algorithm 5.1
Conjugate gradient algorithm
Let $x_0 \doteq 0$, $g_0 \doteq Ax_0 - b$ and $d_0 \doteq -g_0$.
For $k = 0, \ldots, n-1$:
  If $g_k = 0$, then STOP and return $x_k$.
  Else:
    $x_{k+1} \doteq x_k + \lambda_k d_k$, where $\lambda_k \doteq \dfrac{g_k^T g_k}{d_k^T A d_k}$
    $g_{k+1} \doteq A x_{k+1} - b$
    $d_{k+1} \doteq -g_{k+1} + \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}\, d_k$.
Although running CG to completion costs $O(n^3)$ time (comparable to Gauss-Jordan elimination), the error shrinks geometrically with the number of iterations (at a rate governed by the condition number of $A$), so in practice we can stop much earlier than $n$ iterations.
⇒ Truncated Natural Policy Gradient.
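A direct transcription of Algorithm 5.1 as a sketch. In TRPO the matrix $A$ is never formed explicitly: `matvec` would return the Fisher-vector product $F_i v$ computed by automatic differentiation, and $b$ would be the estimated policy gradient:

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A given only matvec(v) = A v."""
    x = np.zeros_like(b)
    g = matvec(x) - b          # g_0 = A x_0 - b
    d = -g                     # d_0 = -g_0
    for _ in range(iters):
        if np.dot(g, g) < tol:
            break
        Ad = matvec(d)
        lam = np.dot(g, g) / np.dot(d, Ad)    # step length lambda_k
        x = x + lam * d
        g_new = g + lam * Ad                  # = A x_{k+1} - b
        beta = np.dot(g_new, g_new) / np.dot(g, g)
        d = -g_new + beta * d
        g = g_new
    return x

# Example on a small SPD system:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b, iters=10)
print(np.allclose(A @ x, b))   # True
```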
SungYub Kim January 6, 2018 18 / 26
19. MM algorithms for RL A MM framework for RL
Trust region policy optimization(TRPO)
Algorithm 5.2
Trust region policy optimization [3]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $F_{n,0}^{-1}$ with $D_n$.
  Approximate $\hat g$ with a fixed number of CG iterations.
  For minibatch $k = 0, 1, 2, \ldots, T$
    Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain
    $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$
    such that
    $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta$.
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
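A minimal sketch of a single TRPO parameter update with the backtracking line search. Here `surrogate`, `kl`, and `fvp` are assumed helpers that evaluate $\hat L_{\theta_{\text{old}}}$, $\hat D_{KL}$, and Fisher-vector products from the collected batch $D_n$, and `ghat` comes from truncated CG:

```python
import numpy as np

def trpo_update(theta_old, ghat, fvp, surrogate, kl, delta=0.01,
                backtrack_ratio=0.8, max_backtracks=15):
    # Rescale ghat so that the full step sits on the trust-region boundary
    # 0.5 * step^T F step = delta (same scaling as the natural gradient slide).
    step = np.sqrt(2.0 * delta / (ghat @ fvp(ghat))) * ghat
    for j in range(max_backtracks):
        theta_new = theta_old + (backtrack_ratio ** j) * step   # exponential decay
        if surrogate(theta_new) >= 0.0 and kl(theta_new) <= delta:
            return theta_new
    return theta_old   # no acceptable step found; keep the old parameters
```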
SungYub Kim January 6, 2018 19 / 26
20. MM algorithms for RL A MM framework for RL
Kronecker-Factored Approximate Curvature(K-FAC)
(In this presentation, we only consider the block-diagonal approximation version of K-FAC.)
Note that $F_i = \mathbb{E}\big[D\theta_i\, D\theta_i^T\big]$, where
$$D\theta_i = \big[\operatorname{vec}(DW_1)^T \;\; \operatorname{vec}(DW_2)^T \;\; \cdots \;\; \operatorname{vec}(DW_l)^T\big]^T$$
and, following [6], $DW_j$ denotes the gradient of the log-likelihood with respect to the weights of layer $j$. Therefore
$$F_i = \begin{bmatrix}
\mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_l)^T] \\
\mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_l)^T] \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_1)^T] & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_2)^T] & \cdots & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_l)^T]
\end{bmatrix}$$
We approximate this by the block-diagonal matrix
$$\hat F_i = \begin{bmatrix}
\mathbb{E}[\operatorname{vec}(DW_1)\operatorname{vec}(DW_1)^T] & 0 & \cdots & 0 \\
0 & \mathbb{E}[\operatorname{vec}(DW_2)\operatorname{vec}(DW_2)^T] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbb{E}[\operatorname{vec}(DW_l)\operatorname{vec}(DW_l)^T]
\end{bmatrix}$$
SungYub Kim January 6, 2018 20 / 26
21. MM algorithms for RL A MM framework for RL
Kronecker-Factored Approximate Curvature(K-FAC)
Since $v \otimes u = \operatorname{vec}(u v^T)$, $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$, each block-diagonal submatrix can be approximated by
$$\begin{aligned}
\mathbb{E}\big[\operatorname{vec}(DW_i)\operatorname{vec}(DW_i)^T\big]
&= \mathbb{E}\big[\operatorname{vec}(\nabla_{s_i} L\, a_{i-1}^T)\operatorname{vec}(\nabla_{s_i} L\, a_{i-1}^T)^T\big] \\
&= \mathbb{E}\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1} \otimes \nabla_{s_i} L)^T\big] \\
&= \mathbb{E}\big[(a_{i-1} \otimes \nabla_{s_i} L)(a_{i-1}^T \otimes \nabla_{s_i} L^T)\big] \\
&= \mathbb{E}\big[a_{i-1} a_{i-1}^T \otimes \nabla_{s_i} L\, \nabla_{s_i} L^T\big] \\
&\approx \mathbb{E}\big[a_{i-1} a_{i-1}^T\big] \otimes \mathbb{E}\big[\nabla_{s_i} L\, \nabla_{s_i} L^T\big] \doteq A \otimes S \doteq \bar F_i,
\end{aligned}$$
where $a_{i-1}$ is the activation of the previous layer and $s_i$ is the pre-activation of layer $i$. Since $(P \otimes Q)^{-1} = P^{-1} \otimes Q^{-1}$ and $(P \otimes Q)\operatorname{vec}(T) = \operatorname{vec}(Q T P^T)$,
$$\Delta W_i = \bar F_i^{-1}\, \nabla_{W_i} L(\theta_i) = \big(A^{-1} \otimes S^{-1}\big)\, \operatorname{vec}\big(\nabla_{W_i} L(\theta_i)\big) = S^{-1}\, \nabla_{W_i} L(\theta_i)\, A^{-1}.$$
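A minimal per-layer sketch of this update. The activation and pre-activation-gradient batches and the layer gradient are assumed inputs, and a small damping term (not on the slide) is added to keep the factor matrices invertible:

```python
import numpy as np

def kfac_layer_step(a_prev, g_s, grad_W, damping=1e-3):
    """a_prev: (batch, n_in) activations a_{i-1}; g_s: (batch, n_out) gradients
    w.r.t. pre-activations s_i; grad_W: (n_out, n_in) gradient of the surrogate
    w.r.t. this layer's weights. Returns the K-FAC preconditioned update."""
    n = a_prev.shape[0]
    A = a_prev.T @ a_prev / n                  # A = E[a a^T]
    S = g_s.T @ g_s / n                        # S = E[grad_s L grad_s L^T]
    A += damping * np.eye(A.shape[0])
    S += damping * np.eye(S.shape[0])
    # Fbar^{-1} vec(grad_W) = (A^{-1} kron S^{-1}) vec(grad_W) = vec(S^{-1} grad_W A^{-1})
    return np.linalg.solve(S, grad_W) @ np.linalg.inv(A)
```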
SungYub Kim January 6, 2018 21 / 26
22. MM algorithms for RL A MM framework for RL
Actor-Critic using Kronecker-Factored Trust Region
Algorithm 5.3
Actor-critic using Kronecker-factored trust region (ACKTR) [5]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  Estimate $\nabla_{\theta_{n,0}} L(\theta_{n,0})$ and $\hat F_{n,0}^{-1}$ with $D_n$.
  Approximate $\hat g$ by K-FAC.
  For minibatch $k = 0, 1, 2, \ldots, T$
    Perform a backtracking line search with exponentially decaying step size $\alpha_j$ to obtain
    $\theta_{n,k+1} = \theta_{n,k} + \alpha_j \hat g$
    such that
    $\hat L_{\theta_{n,k}}(\theta_{n,k+1}) \ge 0$ and $\hat D_{KL}(\theta_{n,k} \,\|\, \theta_{n,k+1}) \le \delta$.
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
SungYub Kim January 6, 2018 22 / 26
23. MM algorithms for RL A MM framework for RL
Clipped surrogate objective
In PPO, we sidestep both the difficulty of computing $\bar F_i^{-1}$ and the cost of the line search by using a clipped surrogate objective,
$$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}\Big[\min\Big(r_k(\theta)\, \hat A_t^{\pi_{\theta_k}},\; \operatorname{clip}\big(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat A_t^{\pi_{\theta_k}}\Big)\Big]$$
where
$$r_k(\theta) = \frac{\pi_\theta(s, a)}{\pi_{\theta_k}(s, a)}.$$
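A minimal NumPy sketch of the clipped objective; in practice it is written in an autodiff framework so that it can be differentiated with respect to $\theta$:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate L^CLIP for one batch of (s_t, a_t, A_hat_t) samples."""
    ratio = np.exp(logp_new - logp_old)                      # r_k(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```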
SungYub Kim January 6, 2018 23 / 26
24. MM algorithms for RL A MM framework for RL
PPO with Clipped surrogate objective
Algorithm 5.4
PPO with Clipped surrogate objective [4]
Given policy parameters $\theta_{0,0}$.
For $n = 0, 1, 2, \ldots$
  Collect a sample trajectory set $D_n$ following policy $\pi(\theta_{n,0})$
  (estimate advantages $\hat A_t^{\pi_{\theta_{n,0}}}$).
  For minibatch $k = 0, 1, 2, \ldots, T$
    Compute the policy update
    $$\theta_{n,k+1} = \theta_{n,k} + \alpha\, \nabla_\theta \hat L^{CLIP}_{\theta_{n,0}}(\theta)\Big|_{\theta=\theta_{n,k}}.$$
  If $k = T$, then $\theta_{n+1,0} = \theta_{n,T}$.
SungYub Kim January 6, 2018 24 / 26
25. Benchmarks A MM framework for RL
Benchmarks
SungYub Kim January 6, 2018 25 / 26
26. A MM framework for RL
Questions & Answers
SungYub Kim January 6, 2018 26 / 26