1. Safe and Efficient Off-Policy Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
2. Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function of one policy π,
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a],
from data collected by another policy µ ≠ π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π, µ (safe)
+ It makes the best use of samples when π and µ are close to each other (efficient)
+ Its variance is lower than that of importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods (Q∗(λ), Tree-Backup)
3. Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function Qπ(x, a) := Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := ∑a π(a|x) Q(x, a)
4. Policy evaluation
▶ Learning the value function of a policy π:
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ You can learn optimal control if π is greedy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π itself
▶ Off-policy: learning from data collected by µ ≠ π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
5. On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at t:
δt = rt + γQ(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
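The one-step estimate is easy to state in code. Below is a minimal tabular sketch (mine, not the authors'; NumPy and the Q[x, a] array layout are assumptions):

```python
import numpy as np

def td_error(Q, x, a, r, x_next, a_next, gamma=0.99):
    """One-step TD error: delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t).

    Q is a hypothetical tabular estimate of shape [n_states, n_actions].
    """
    return r + gamma * Q[x_next, a_next] - Q[x, a]
```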
6. TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance (a code sketch follows after this list)
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
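As a rough illustration of the update above (the trajectory format and function names are my assumptions, not the paper's), the forward-view increment for the starting pair sums the discounted, λ-weighted TD errors:

```python
def td_lambda_increment(Q, trajectory, gamma=0.99, lam=0.9):
    """Forward-view TD(lambda) increment for (x_0, a_0):
    sum over t of (gamma * lam)**t * delta_t, for an on-policy
    trajectory given as (x, a, r, x_next, a_next) tuples.
    """
    total = 0.0
    for t, (x, a, r, x_next, a_next) in enumerate(trajectory):
        delta = r + gamma * Q[x_next, a_next] - Q[x, a]
        total += (gamma * lam) ** t * delta
    return total  # apply with a step size to Q[x_0, a_0]
```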
7. Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γEπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) than in the on-policy case
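A sketch of this corrected TD error under the same hypothetical tabular setup (pi[x] holding the target policy's action distribution is my assumption):

```python
import numpy as np

def expected_td_error(Q, pi, x, a, r, x_next, gamma=0.99):
    """Off-policy TD error: delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)."""
    expected_q = np.dot(pi[x_next], Q[x_next])  # E_pi Q(x_{t+1}, .)
    return r + gamma * expected_q - Q[x, a]
```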
8. Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance, since π(as|xs)/µ(as|xs) is not bounded
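The variance problem is visible in the cumulative correction term itself; a sketch (hypothetical pi, mu arrays of shape [n_states, n_actions]):

```python
import numpy as np

def is_trace(pi, mu, states, actions):
    """Cumulative importance weights prod_{1<=s<=t} pi(a_s|x_s) / mu(a_s|x_s).

    Each ratio is unbounded above, so the product, and with it the
    update's variance, can grow without limit.
    """
    ratios = [pi[x, a] / mu[x, a] for x, a in zip(states, actions)]
    return np.cumprod(ratios)
```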
9. Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small: λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
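The condition is easy to check numerically; a sketch (array shapes are my assumptions):

```python
import numpy as np

def qpi_lambda_bound(pi, mu, gamma):
    """Largest safe lambda for Q^pi(lambda): lambda < (1 - gamma) / (gamma * eps),
    with eps = max_x || pi(.|x) - mu(.|x) ||_1.
    """
    eps = np.abs(pi - mu).sum(axis=1).max()
    return (1.0 - gamma) / (gamma * eps)
```

For example, gamma = 0.99 and eps = 0.2 give a bound of about 0.05, so only very short traces are safe unless µ tracks π closely.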
10. Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏1≤s≤t π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏1≤s≤t π(as|xs) decays rapidly when near on-policy
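That decay is easy to see in the trace product (same hypothetical setup as the earlier sketches):

```python
import numpy as np

def tree_backup_trace(pi, states, actions, lam=1.0):
    """Tree-Backup trace prod_{1<=s<=t} lam * pi(a_s|x_s).

    Bounded by 1 for any mu, but even when mu equals pi the product
    shrinks geometrically, so long traces are cut short near on-policy.
    """
    return np.cumprod([lam * pi[x, a] for x, a in zip(states, actions)])
```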
11. A unified view
▶ General algorithm: ∆Q(x, a) = ∑t≥0 γ^t (∏1≤s≤t cs) δt (a code sketch follows after the list below)
▶ None of the existing methods achieves all of the following (↔ marks the method that fails each property):
▶ Low variance (↔ IS)
▶ "Safe", i.e. convergent for any π and µ (↔ Qπ(λ))
▶ "Efficient", i.e. uses full returns when on-policy (↔ Tree-Backup)
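All of the above fit one generic update parameterized by the trace coefficient cs; a sketch of that framing (trajectory format and names are my assumptions):

```python
import numpy as np

def general_update(Q, pi, mu, trajectory, coeff, gamma=0.99):
    """Generic multi-step increment for (x_0, a_0):
    sum over t of gamma**t * (prod_{1<=s<=t} c_s) * delta_t,
    for a trajectory [(x, a, r, x_next), ...] collected under mu.
    coeff(pi, mu, x, a) returns c_s; its choice recovers IS,
    Q^pi(lambda), Tree-Backup, or Retrace(lambda).
    """
    total, trace = 0.0, 1.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:
            trace *= coeff(pi, mu, x, a)  # extend prod_{1<=s<=t} c_s
        delta = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
        total += gamma ** t * trace * delta
    return total
```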
12. Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
RQ(x, a) = Q(x, a) + Eµ[∑t≥0 γ^t (∏1≤s≤t cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), then R is a contraction and Qπ is its fixed point (thus the algorithm is "safe"):
|RQ(x, a) − Qπ(x, a)| ≤ η(x, a) ∥Q − Qπ∥,
where η(x, a) := 1 − (1 − γ) Eµ[∑t≥0 γ^t (∏1≤s≤t cs)]
▶ η = 0 for cs = 1 ("efficient"); at the other extreme, cs = 0 gives the one-step case with η = γ
▶ Variance
▶ cs ≤ 1 results in low variance, since ∏1≤s≤t cs ≤ 1
13. Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t λ min(1, π(as|xs)/µ(as|xs))) δt
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn't work if µ is unknown or non-Markov (↔ Tree-Backup)
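In the generic sketch from slide 11, Retrace(λ) is just one choice of coefficient, the λ-scaled truncated importance weight (the helper below and its comparisons are my framing):

```python
def retrace_coeff(pi, mu, x, a, lam=1.0):
    """Retrace(lambda) coefficient: c_s = lam * min(1, pi(a|x) / mu(a|x)).

    Truncation at 1 bounds the variance; when pi == mu the coefficient
    equals lam, so full lambda-weighted returns are used on-policy.
    """
    return lam * min(1.0, pi[x, a] / mu[x, a])

# Other methods in the same interface (for comparison):
#   IS:            c_s = pi[x, a] / mu[x, a]
#   Q^pi(lambda):  c_s = lam
#   Tree-Backup:   c_s = lam * pi[x, a]
```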
14. Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih et al. 2016]
▶ Each thread has a private replay memory holding 62,500 transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
15. Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1 correspond to the worst and best scores, respectively, for a particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
16. Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗(λ) performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
17. Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step variants on Atari 2600
▶ (has already been applied to A3C in another paper [Wang et al. 2016])
18. References
[1] Anna Harutyunyan et al. "Q(λ) with Off-Policy Corrections". In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. "Asynchronous Methods for Deep Reinforcement Learning". 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation". In: ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. "Sample Efficient Actor-Critic with Experience Replay". 2016. arXiv: 1611.01224.