Hierarchical Reinforcement Learning with
Option-Critic Architecture
Oğuz Şerbetci
April 4, 2018
Modelling of Cognitive Processes
TU Berlin
Reinforcement Learning
Hierarchical Reinforcement Learning
Demonstration
Resources
Appendix
Reinforcement Learning

[Agent-environment loop diagram: the agent selects action a_t, the environment returns the next state s_t and reward r_t.]
Reinforcement Learning

MDP: $\langle S, A, p(s' \mid s, a), r(s, a) \rangle$

Policy: $\pi(a \mid s) : S \times A \to [0, 1]$

Goal: an optimal policy $\pi^*$ that maximizes $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$
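To make the interaction loop and the discounted-return objective concrete, here is a minimal sketch assuming a Gymnasium-style environment; the FrozenLake-v1 id and the uniformly random policy are illustrative choices, not part of the slides.

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset(seed=0)

gamma, ret, t, done = 0.99, 0.0, 0, False
while not done:
    action = env.action_space.sample()            # a_t ~ pi(a|s_t), here uniform
    state, reward, terminated, truncated, _ = env.step(action)
    ret += gamma**t * reward                      # accumulate sum_t gamma^t r_t
    t += 1
    done = terminated or truncated
print("discounted return of this episode:", ret)
```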
Problems

• lack of planning and commitment
• inefficient exploration
• temporal credit assignment problem
• inability to divide-and-conquer
Hierarchical Reinforcement Learning
Temporal Abstractions

Icons made by Smashicons and Freepik from Freepik
Options Framework (Sutton, Precup, et al. 1999)

SMDP: $\langle S, A, p(s', k \mid s, a), r(s, a) \rangle$

Option ω:
• $I_\omega$: initiation set
• $\pi_\omega$: intra-option-policy
• $\beta_\omega$: termination-policy

$\pi_\Omega$: option-policy

(Sutton, Precup, et al. 1999)
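As a rough illustration of what an option bundles together, here is a minimal data-structure sketch for a discrete state space; the names and the example option are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]             # I_w: states in which the option may be started
    policy: Callable[[int], int]         # pi_w: intra-option policy, state -> action
    termination: Callable[[int], float]  # beta_w: termination probability, state -> [0, 1]

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set

# Hypothetical example: always take action 3 ("move right"), terminate in state 12.
go_right = Option(initiation_set={0, 1, 2, 3},
                  policy=lambda s: 3,
                  termination=lambda s: 1.0 if s == 12 else 0.0)
```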
Option-Critic (Bacon et al. 2017)

Given the number of options, Option-Critic learns βω, πω and πΩ:

• online and end-to-end, including in continuous state/action spaces
• with non-linear function approximators (deep RL)
Value Functions

The state value function:

$V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = \sum_{a \in A_s} \pi(s, a) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, V(s') \right]$

The action value function:

$Q_\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right] = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a' \in A} \pi(s', a') \, Q(s', a')$

Bellman Equations (Bellman 1952)
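A minimal sketch of how the Bellman equation above turns into iterative policy evaluation for a small tabular MDP; the array layout (P[s, a, s'], R[s, a], pi[s, a]) is an assumption made for illustration.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99, tol=1e-8):
    """P[s, a, s1] = p(s1|s, a), R[s, a] = r(s, a), pi[s, a] = pi(a|s)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # V(s) = sum_a pi(s, a) [ r(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]
        V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def action_values(P, R, V, gamma=0.99):
    # Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
    return R + gamma * P @ V
```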
Value methods

TD Learning:

$Q^\pi(s, a) \leftarrow Q^\pi(s, a) + \alpha \big[ \underbrace{r + \gamma V^\pi(s')}_{\text{TD target}} - Q^\pi(s, a) \big]$, where the bracketed term is the TD error

$V^\pi(s') = \max_a Q(s', a)$ (Q-Learning)

$\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$ (greedy policy)

$\pi(a \mid s) = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \operatorname{argmax}_a Q(s, a) & \text{with probability } 1 - \varepsilon \end{cases}$ (ε-greedy policy)
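A minimal tabular Q-learning sketch combining the TD update, the max backup and an ε-greedy behaviour policy; it assumes a Gymnasium-style environment with discrete observation and action spaces.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target uses max_a' Q(s', a') (Q-learning); no bootstrap at terminal states
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])   # step size times TD error
            s = s_next
    return Q
```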
Option Value Functions

The option value function:

$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s) \, Q_U(s, \omega, a)$

The action value function:

$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, U(\omega, s')$

The state value function upon arrival:

$U(\omega, s') = (1 - \beta_\omega(s')) \, Q_\Omega(s', \omega) + \beta_\omega(s') \, V_\Omega(s')$
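For a tabular model, the three quantities above can be computed directly from one another. The following sketch assumes hypothetical arrays pi_w[w, s, a], beta[w, s], P[s, a, s'], R[s, a] and a current estimate Q_omega[s, w], and uses the greedy policy over options for V_Ω.

```python
import numpy as np

def V_Omega(Q_omega):
    # greedy policy over options: V_Omega(s) = max_w Q_Omega(s, w)
    return Q_omega.max(axis=1)

def U(Q_omega, beta, w):
    # value upon arrival in s': continue with w, or terminate and re-choose
    return (1.0 - beta[w]) * Q_omega[:, w] + beta[w] * V_Omega(Q_omega)

def Q_U(P, R, Q_omega, beta, w, gamma=0.99):
    # Q_U(s, w, a) = r(s, a) + gamma * sum_s' p(s'|s, a) U(w, s')
    return R + gamma * P @ U(Q_omega, beta, w)

def Q_Omega_from_QU(pi_w, P, R, Q_omega, beta, w, gamma=0.99):
    # Q_Omega(s, w) = sum_a pi_w(a|s) Q_U(s, w, a)
    return (pi_w[w] * Q_U(P, R, Q_omega, beta, w, gamma)).sum(axis=1)
```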
Policy Gradient Methods

$\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$ vs. $\pi(a \mid s, \theta) = \operatorname{softmax}_a h(s, a, \theta)$

Objective: $J(\theta) = V^{\pi_\theta}(s_0)$, improved by following $\nabla_\theta J(\theta)$.

Policy Gradient Theorem (Sutton, McAllester, et al. 2000):

$\nabla_\theta J(\theta) = \sum_s \mu_\pi(s) \sum_a Q_\pi(s, a) \, \nabla_\theta \pi(a \mid s, \theta)$

$\nabla_\theta J(\theta) = \mathbb{E}\left[ \gamma^t \sum_a Q_\pi(s, a) \, \nabla_\theta \pi(a \mid s, \theta) \right]$
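A minimal REINFORCE-style sketch of following the policy gradient with a tabular softmax policy, using the Monte-Carlo return in place of Q_π(s, a); the setup (Gymnasium-style discrete environment, preferences h(s, a, θ) = θ[s, a]) is an illustrative assumption.

```python
import numpy as np

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, episodes=1000, alpha=0.01, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        # sample one episode with the current policy
        s, _ = env.reset()
        traj, done = [], False
        while not done:
            probs = softmax_policy(theta, s)
            a = int(rng.choice(len(probs), p=probs))
            s_next, r, terminated, truncated, _ = env.step(a)
            traj.append((s, a, r))
            s, done = s_next, terminated or truncated
        # the Monte-Carlo return G_t stands in for Q_pi(s_t, a_t) in the gradient
        G = 0.0
        for t, (s_t, a_t, r_t) in reversed(list(enumerate(traj))):
            G = r_t + gamma * G
            probs = softmax_policy(theta, s_t)
            grad_log = -probs
            grad_log[a_t] += 1.0            # d log softmax / d theta[s_t, :]
            theta[s_t] += alpha * (gamma ** t) * G * grad_log
    return theta
```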
Actor-Critic (Sutton 1984)

$\theta \leftarrow \theta + \alpha \gamma^t \underbrace{\delta}_{\text{TD error}} \, \nabla_\theta \log \pi(a \mid s, \theta)$

Figure taken from Pierre-Luc Bacon
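A minimal one-step actor-critic sketch matching the update above: a tabular state-value critic supplies the TD error δ, and a tabular softmax actor ascends δ ∇_θ log π(a|s, θ). The concrete setup (Gymnasium-style discrete environment) is again an illustrative assumption.

```python
import numpy as np

def actor_critic(env, episodes=1000, alpha_th=0.01, alpha_v=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    nS, nA = env.observation_space.n, env.action_space.n
    theta, V = np.zeros((nS, nA)), np.zeros(nS)
    for _ in range(episodes):
        s, _ = env.reset()
        done, t = False, 0
        while not done:
            probs = np.exp(theta[s] - theta[s].max())
            probs /= probs.sum()
            a = int(rng.choice(nA, p=probs))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # critic: TD error delta = r + gamma V(s') - V(s)
            delta = r + (0.0 if terminated else gamma * V[s_next]) - V[s]
            V[s] += alpha_v * delta
            # actor: ascend delta * grad log pi(a|s, theta)
            grad_log = -probs
            grad_log[a] += 1.0
            theta[s] += alpha_th * (gamma ** t) * delta * grad_log
            s, t = s_next, t + 1
    return theta, V
```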
Option-Critic (Bacon et al. 2017)

[Architecture diagram taken from (Bacon et al. 2017)]
Option-Critic (Bacon et al. 2017)

The gradient w.r.t. the intra-option-policy parameters θ:

$\nabla_\theta Q_\Omega(s, \omega) = \mathbb{E}\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta} \, Q_U(s, \omega, a) \right]$

Take better primitive actions inside options.

The gradient w.r.t. the termination-policy parameters ϑ:

$\nabla_\vartheta U(\omega, s') = \mathbb{E}\left[ -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \, \big( Q_\Omega(s', \omega) - V_\Omega(s') \big) \right]$, with $V_\Omega(s')$ e.g. $\max_\omega Q_\Omega(s', \omega)$

Shorten options whose advantage is poor.
Demonstration
Complex Environment i
(Bacon et al. 2017)
Complex Environment ii
(Harb et al. 2017)
But... i
But... ii
(Dilokthanakul et al. 2017)
Resources
• Sutton & Barto, Reinforcement Learning: An Introduction,
Second Edition Draft
• David Silver’s Reinforcement Learning Course
References i
Bacon, Pierre-Luc, Jean Harb, and Doina Precup (2017). “The
Option-Critic architecture”. In: Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, February 4-9, 2017,
San Francisco, California, USA, pp. 1726–1734.
Bellman, Richard (1952). “On the theory of dynamic
programming”. In: Proceedings of the National Academy of
Sciences 38.8, pp. 716–719. doi: 10.1073/pnas.38.8.716.
Dilokthanakul, N., C. Kaplanis, N. Pawlowski, and M. Shanahan
(2017). “Feature Control as Intrinsic Motivation for Hierarchical
Reinforcement Learning”. In: ArXiv e-prints. arXiv:
1705.06769 [cs.LG].
References ii
Harb, Jean, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup
(2017). “When waiting is not an option: Learning options with a
deliberation cost”. In: arXiv: 1709.04571.
Sutton, Richard S (1984). “Temporal credit assignment in
reinforcement learning”. AAI8410337. PhD thesis.
Sutton, Richard S, David A McAllester, Satinder P Singh, and
Yishay Mansour (2000). “Policy gradient methods for
reinforcement learning with function approximation”. In:
Advances in Neural Information Processing Systems,
pp. 1057–1063.
References iii
Sutton, Richard S, Doina Precup, and Satinder Singh (1999).
“Between MDPs and Semi-MDPs: A framework for temporal
abstraction in reinforcement learning”. In: Artificial
Intelligence 112.1-2, pp. 181–211. doi:
10.1016/S0004-3702(99)00052-1.
Appendix
Option-Critic (Bacon et al. 2017)

procedure train(α, NΩ)
    s ← s0
    choose ω ∼ πΩ(ω|s)                               ▷ option-policy
    repeat
        choose a ∼ πω,θ(a|s)                         ▷ intra-option-policy
        take the action a in s, observe s′ and r

        1. Options evaluation
        g ← r                                        ▷ TD target
        if s′ is not terminal then
            g ← g + γ(1 − βω,ϑ(s′)) QΩ(s′, ω) + γ βω,ϑ(s′) maxω′ QΩ(s′, ω′)

        2. Critic improvement
        δU ← g − QU(s, ω, a)
        QU(s, ω, a) ← QU(s, ω, a) + αU δU

        3. Intra-option Q-learning
        δΩ ← g − QΩ(s, ω)
        QΩ(s, ω) ← QΩ(s, ω) + αΩ δΩ

        4. Options improvement
        θ ← θ + αθ (∂ log πω,θ(a|s) / ∂θ) QU(s, ω, a)
        ϑ ← ϑ − αϑ (∂βω,ϑ(s′) / ∂ϑ) (QΩ(s′, ω) − maxω′ QΩ(s′, ω′) + ξ)

        if terminate ∼ βω,ϑ(s′) then                 ▷ termination-policy
            choose ω ∼ πΩ(ω|s′)
        s ← s′
    until s is terminal
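A minimal tabular sketch of the training loop above, assuming a Gymnasium-style discrete environment; the parameterizations (softmax intra-option policies, sigmoid terminations, ε-greedy policy over options) are illustrative choices, not the exact setup of Bacon et al. (2017).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def option_critic(env, n_options=4, episodes=2000, gamma=0.99, epsilon=0.1,
                  alpha_U=0.5, alpha_Omega=0.5, alpha_th=0.25, alpha_vt=0.25,
                  xi=0.01, seed=0):
    rng = np.random.default_rng(seed)
    nS, nA = env.observation_space.n, env.action_space.n
    Q_Omega = np.zeros((nS, n_options))          # option value function
    Q_U = np.zeros((nS, n_options, nA))          # option-action value function
    theta = np.zeros((n_options, nS, nA))        # intra-option policy parameters
    vartheta = np.zeros((n_options, nS))         # termination parameters

    def choose_option(s):
        # epsilon-greedy stand-in for the policy over options pi_Omega
        if rng.random() < epsilon:
            return int(rng.integers(n_options))
        return int(np.argmax(Q_Omega[s]))

    for _ in range(episodes):
        s, _ = env.reset()
        w = choose_option(s)
        done = False
        while not done:
            probs = softmax(theta[w, s])
            a = int(rng.choice(nA, p=probs))
            s1, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # 1. options evaluation: TD target g
            g = r
            if not terminated:
                beta = sigmoid(vartheta[w, s1])
                g += gamma * ((1 - beta) * Q_Omega[s1, w]
                              + beta * np.max(Q_Omega[s1]))

            # 2. critic improvement
            Q_U[s, w, a] += alpha_U * (g - Q_U[s, w, a])

            # 3. intra-option Q-learning
            Q_Omega[s, w] += alpha_Omega * (g - Q_Omega[s, w])

            # 4. options improvement
            grad_log = -probs
            grad_log[a] += 1.0                       # d log softmax / d theta[w, s, :]
            theta[w, s] += alpha_th * grad_log * Q_U[s, w, a]
            if not terminated:
                beta = sigmoid(vartheta[w, s1])
                advantage = Q_Omega[s1, w] - np.max(Q_Omega[s1]) + xi
                # d beta / d vartheta = beta * (1 - beta) for the sigmoid termination
                vartheta[w, s1] -= alpha_vt * beta * (1 - beta) * advantage

            # termination: maybe switch to a new option in s'
            if not done and rng.random() < sigmoid(vartheta[w, s1]):
                w = choose_option(s1)
            s = s1
    return Q_Omega, Q_U, theta, vartheta
```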