Safe and Efficient Off-Policy
Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function for one policy π
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
from data collected by another policy µ ≠ π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π, µ (safe)
+ It makes the best use of samples if π and µ are close to
each other (efficient)
+ Its variance is lower than importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and
the existing multi-step methods (Q∗(λ), Tree-Backup)
Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function
Qπ(x, a) := Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := ∑a π(a|x) Q(x, a)
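To make the last definition concrete, here is a minimal NumPy sketch (not from the slides; the tabular shapes and the names Q, pi, n_states, n_actions are illustrative assumptions) of computing EπQ(x, ·) for all states at once:

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))             # tabular Q(x, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # π(a|x); each row sums to 1

# EπQ(x, ·) = Σ_a π(a|x) Q(x, a), evaluated for every state x at once
expected_Q = (pi * Q).sum(axis=1)
print(expected_Q)  # one expected action-value per state
```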
Policy evaluation
▶ Learning the value function for a policy π:
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ You can learn optimal control if π is greedy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π
▶ Off-policy: learning from data collected by µ ̸= π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at t:
δt = rt + γQ(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy
evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance (see the sketch after this list)
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
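As referenced above, a minimal sketch of the forward-view TD(λ) update for the first state-action pair of one on-policy trajectory. The trajectory layout (states and actions of length T+1, rewards of length T, Q as a 2-D NumPy array) is an illustrative assumption, not the authors' code:

```python
import numpy as np

def td_lambda_update(Q, states, actions, rewards, gamma=0.99, lam=0.9):
    """Sum of (γλ)^t δ_t: the total forward-view correction for Q[states[0], actions[0]]."""
    total = 0.0
    for t, r in enumerate(rewards):
        # one-step TD error δ_t = r_t + γ Q(x_{t+1}, a_{t+1}) − Q(x_t, a_t)
        delta = r + gamma * Q[states[t + 1], actions[t + 1]] - Q[states[t], actions[t]]
        total += (gamma * lam) ** t * delta
    return total

# usage on a toy tabular problem (illustrative values)
Q = np.zeros((4, 3))
print(td_lambda_update(Q, states=[0, 1, 2], actions=[1, 0, 2], rewards=[0.0, 1.0]))
```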
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γEπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) compared to the
on-policy case
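A minimal sketch of this expected-bootstrap TD error (assumptions: tabular Q as a 2-D NumPy array, π as a state-indexed matrix of action probabilities; the names are illustrative):

```python
import numpy as np

def expected_td_error(Q, pi, x, a, r, x_next, gamma=0.99):
    """δ_t = r_t + γ EπQ(x_{t+1}, ·) − Q(x_t, a_t); bootstraps with π's expectation,
    not with the action actually taken by the behaviour policy µ."""
    expected_next = float(np.dot(pi[x_next], Q[x_next]))  # EπQ(x_{t+1}, ·)
    return r + gamma * expected_next - Q[x, a]
```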
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance since π(as|xs)/µ(as|xs) is not bounded
Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small: λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏1≤s≤t π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏1≤s≤t π(as|xs) decays rapidly when near on-policy
A unified view
▶ General algorithm (sketched in code after this list): ∆Q(x, a) = ∑t≥0 γ^t (∏1≤s≤t cs) δt
▶ None of the existing methods is perfect
▶ Low variance (↔ IS)
▶ “Safe” i.e. convergent for any π and µ (↔ Qπ(λ))
▶ “Efficient” i.e. using full returns when on-policy (↔
Tree-Backup)
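A minimal sketch of this template (assumptions: tabular Q, π and µ as state-indexed probability matrices, one trajectory collected by µ; names are illustrative). Each method above is recovered by one choice of the trace coefficient cs:

```python
import numpy as np

def trace_coefficient(method, pi_sa, mu_sa, lam):
    """c_s for each algorithm, given π(a_s|x_s), µ(a_s|x_s) and λ."""
    if method == "IS":           # importance sampling [Precup et al. 2000]
        return pi_sa / mu_sa
    if method == "Qpi(lambda)":  # [Harutyunyan et al. 2016]
        return lam
    if method == "TB":           # Tree-Backup [Precup et al. 2000]
        return lam * pi_sa
    if method == "Retrace":      # Retrace(λ) [Munos et al. 2016]
        return lam * min(1.0, pi_sa / mu_sa)
    raise ValueError(method)

def general_update(Q, pi, mu, states, actions, rewards, method, gamma=0.99, lam=1.0):
    """∆Q(x0, a0) = Σ_{t≥0} γ^t (Π_{1≤s≤t} c_s) δ_t along one µ-trajectory."""
    total, trace = 0.0, 1.0
    for t, r in enumerate(rewards):
        if t > 0:  # the product over 1 ≤ s ≤ t is empty at t = 0
            s, a = states[t], actions[t]
            trace *= trace_coefficient(method, pi[s, a], mu[s, a], lam)
        expected_next = float(np.dot(pi[states[t + 1]], Q[states[t + 1]]))
        delta = r + gamma * expected_next - Q[states[t], actions[t]]  # δ_t
        total += gamma ** t * trace * delta
    return total
```

With method="Retrace", the truncated ratio keeps each factor at most λ, so the trace stays bounded, while on-policy (π = µ) every factor equals λ and full returns are used at λ = 1.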
Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
RQ(x, a) = Q(x, a) + Eµ[∑t≥0 γ^t (∏1≤s≤t cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”):
|RQ(x, a) − Qπ(x, a)| ≤ η(x, a)∥Q − Qπ∥
η(x, a) := 1 − (1 − γ)Eµ[∑t≥0 γ^t (∏1≤s≤t cs)]
▶ η = 0 for cs = 1 (“efficient”; worked out below)
▶ Variance
▶ cs ≤ 1 results in low variance since ∏1≤s≤t cs ≤ 1
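A quick arithmetic check of the “efficient” point (worked out here, not on the slide): with cs = 1 the product ∏1≤s≤t cs equals 1 for every t, so assuming γ < 1,

η(x, a) = 1 − (1 − γ) ∑t≥0 γ^t = 1 − (1 − γ) · 1/(1 − γ) = 0,

i.e. whenever cs = 1 is admissible under the safety condition (roughly, on-policy), the contraction is immediate and the update amounts to using full returns.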
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t λ min(1, π(as|xs)/µ(as|xs))) δt
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn’t work if µ is unknown or non-Markov (↔
Tree-Backup)
Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih
et al. 2016]
▶ Each thread has private replay memory holding 62,500
transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1
respectively correspond to the worst and best scores for a
particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗(λ) performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step
variants on Atari 2600
▶ (is already applied to A3C in another paper [Wang et al.
2016])
References I
[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings
of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. “Asynchronous Methods for Deep Reinforcement
Learning”. 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S Sutton, and Satinder P Singh. “Eligibility Traces for
Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth
International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:
arXiv (2016), pp. 1–20. arXiv: 1611.01224.
