SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
Exploration Strategies in
Reinforcement Learning
Dongmin Lee
SNU Robot Learning Laboratory
AI Robotics KR
October 3, 2019
Outline
• Introduction
• Reinforcement Learning
• Exploration Strategies in RL
• Entropy Regularization in RL
1
Introduction
2
How do we build intelligent systems / machines?
• Data à Information
• Information à Decision
intelligence
intelligence
Introduction
3
Intelligent systems must be able to adapt to uncertainty
• Uncertainty:
• Other cars
• Weather
• Road construction
• Passenger
Introduction
4
RL provides a formalism for behaviors
• Problem of a goal-directed agent interacting with an uncertain environment
• Interaction à adaptation
feedback & decision
Reinforcement Learning
5
• State of the environment / system : 𝑠" ∈ 𝑆
• Action determined by the agent : 𝑎" ∈ 𝐴
• Reward: evaluates the benefit of state and action
• Policy (mapping from state to action): 𝜋
• Deterministic
𝜋 𝑠" = 𝑎"
• Stochastic (randomized)
𝜋 𝑎 𝑠 = Prob(𝑎" = 𝑎|𝑠" = 𝑠)
à Policy is what we want to optimize!
6
A Markov decision process (MDP) is a tuple < 𝑆, 𝐴, 𝑝, 𝑟, 𝛾 >, consisting of
• 𝑆: set of states (state space)
e.g., 𝑆 = 1, … , 𝑛 (discrete), 𝑆 = ℝ: (continuous)
• 𝐴: set of actions (action space)
e.g., 𝐴 = 1, … , 𝑚 (discrete), 𝐴 = ℝ< (continuous)
• 𝑝: state transition probability
𝑝 𝑠= 𝑠, 𝑎 ≜ Prob 𝑠"?@ = 𝑠= 𝑠" = 𝑠, 𝑎" = 𝑎 d
• 𝑟: reward function
𝑟 𝑠", 𝑎" = 𝑟"d
• 𝛾 ∈ (0,1]: discount factor
Reinforcement Learning
7
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
max
F∈∏
𝔼F I
"JK
L
𝛾" 𝑟 𝑠", 𝑎"
Reinforcement Learning
8
What are the challenges of RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
Reinforcement Learning
9
What are the challenges of RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
Reinforcement Learning
Exploration Strategies in RL
10
Common issue: exploration
• Is there a value of exploring unknown regions of the environment?
• Which action should we try to explore unknown regions of the environment?
Exploration Strategies in RL
11
How can we algorithmically solve this issue exploration vs exploitation?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration
5. Noisy networks, Curiosity-driven, etc.
Exploration Strategies in RL
12
How can we algorithmically solve this issue exploration vs exploitation?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration
5. Noisy networks, Curiosity-driven, etc.
Exploration Strategies in RL
13
How can we algorithmically solve this issue exploration vs exploitation?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration
5. Noisy networks, Curiosity-driven, etc.
• Q) What is a means of assessing these strategies?
Exploration Strategies in RL
14
What is regret?
• An additional means of assessing the algorithms’ performance
• A loss that we incur due to time spent during the learning
• The expected cumulative regret:
ℛO ≜ 𝔼F max
P∈Q
I
"JK
O
𝑟" 𝑎 − 𝑟"(𝑎")
Regret problem
• To find a strategy that minimizes the expected cumulative regret ℛO
• Maximize expected cumulative reward ≡ minimize expected cumulative regret!
Exploration Strategies in RL
15
Method 1: 𝜖-Greedy
Idea:
• Choose a greedy action (arg max 𝑄F
(𝑠, 𝑎)) with probability (1 − 𝜖)
• Choose a random action with probability 𝜖
Advantages:
1. Very simple
2. Light computation
Disadvantages:
1. Fine tuning 𝜖
2. Not quite systematic: No focus on unexplored regions
3. Inefficient
Exploration Strategies in RL
16
A comparison of greedy, 𝜖-greedy, decaying 𝜖-greedy
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf
Exploration Strategies in RL
17
Method 2: Optimism in the face of uncertainty (OFU)
Idea:
1. Construct a confidence set for MDP parameters using collected samples
2. Choose MDP parameters that give the highest rewards
3. Construct an optimal policy of the optimistically chosen MDP
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Complicated
2. Computation intensive
Exploration Strategies in RL
18
Method 3: Thompson (posterior) sampling
Idea:
1. Sample MDP parameters from posterior distribution
2. Construct an optimal policy of the sampled MDP parameters
3. Update the posterior distribution
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Somewhat complicated
Exploration Strategies in RL
19
Method 4: Information theoretic exploration
Idea:
• Use high entropy policy to explore more
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. More theoretic analyses needed
Entropy Regularization in RL
20
What is entropy?
• Average rate at which information is produced by a stochastic source of data
• The well-known standard Shannon-Gibbs entropy:
𝐻 𝑃 ≜ 𝔼
X~Z
− log 𝑃 𝑋 = − I
]∈X
𝑃(𝑥) log 𝑃(𝑥)
• More generally, entropy refers to disorder
• Standard MDP problem:
max
F
𝔼F I
"JK
L
𝛾" 𝑟(𝑠", 𝑎")
• Maximum entropy MDP problem:
max
F
𝔼F I
"JK
L
𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝐻 𝜋 ⋅ 𝑠"
where 𝐻 𝜋 ⋅ 𝑠" = 𝔼P~F − log 𝜋 𝑎 𝑠 = − ∑P 𝜋 𝑎 𝑠 log 𝜋 𝑎 𝑠 .
T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017 21
Entropy Regularization in RL
What’s the benefit of maximum entropy regularization?
• Computation:
No maximization involved
• Exploration:
Potentially explore unexplored regions using high entropy policy
• Structural similarity:
Can combine it with many RL methods for standard MDP
22
Entropy Regularization in RL
https://sites.google.com/view/sac-and-applications 23
Entropy Regularization in RL
• Soft Actor-Critic on Minitaur ­ Training
https://sites.google.com/view/sac-and-applications 24
Entropy Regularization in RL
• Soft Actor-Critic on Minitaur ­ Testing
25
Entropy Regularization in RL
Tsallis entropy: Generalization of the Shannon-Gibbs entropy
K. Lee, et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning”, arXiv preprint 2019
• Maximum entropy MDP problem:
max
F
𝔼F I
"JK
L
𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝐻 𝜋 ⋅ 𝑠"
• Maximum Tsallis entropy MDP problem:
max
F
𝔼F I
"JK
L
𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝑆c 𝜋 ⋅ 𝑠"
• Tsallis entropy
𝑆c 𝜋 ⋅ 𝑠" = 𝔼P~F − lnc(𝜋 𝑎 𝑠 )
where, lnc(𝑥) = e
log(𝑥) , 𝑖𝑓 𝑞 = 1 𝑎𝑛𝑑 𝑥 > 0
]jklm@
cm@
, 𝑖𝑓 𝑞 ≠ 1 𝑎𝑛𝑑 𝑥 > 0
, 𝑞 is called an entropic index
K. Lee, et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning”, arXiv preprint 2019 26
Entropy Regularization in RL
Thank You!
Any Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycleRamjee Ganti
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning Chandra Meena
 
An Introduction to Soft Computing
An Introduction to Soft ComputingAn Introduction to Soft Computing
An Introduction to Soft ComputingTameem Ahmad
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceSahil Kumar
 
Flowchart of GA
Flowchart of GAFlowchart of GA
Flowchart of GAIshucs
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCE
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCEIntelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCE
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCEKhushboo Pal
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural NetworksDatabricks
 
Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Vicky Tyagi
 
Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 DigiGurukul
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 
Introduction to agents and multi-agent systems
Introduction to agents and multi-agent systemsIntroduction to agents and multi-agent systems
Introduction to agents and multi-agent systemsAntonio Moreno
 

Was ist angesagt? (20)

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycle
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
Fuzzy logic ppt
Fuzzy logic pptFuzzy logic ppt
Fuzzy logic ppt
 
An Introduction to Soft Computing
An Introduction to Soft ComputingAn Introduction to Soft Computing
An Introduction to Soft Computing
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial Intelligence
 
Agent architectures
Agent architecturesAgent architectures
Agent architectures
 
Flowchart of GA
Flowchart of GAFlowchart of GA
Flowchart of GA
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCE
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCEIntelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCE
Intelligent Agent PPT ON SLIDESHARE IN ARTIFICIAL INTELLIGENCE
 
Soft computing
Soft computingSoft computing
Soft computing
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural Networks
 
Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Planning in AI(Partial order planning)
Planning in AI(Partial order planning)
 
Human Action Recognition
Human Action RecognitionHuman Action Recognition
Human Action Recognition
 
Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Introduction to agents and multi-agent systems
Introduction to agents and multi-agent systemsIntroduction to agents and multi-agent systems
Introduction to agents and multi-agent systems
 

Ähnlich wie Exploration Strategies in Reinforcement Learning

Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 
An Updated Survey on Niching Methods and Their Applications
An Updated Survey on Niching Methods and Their ApplicationsAn Updated Survey on Niching Methods and Their Applications
An Updated Survey on Niching Methods and Their ApplicationsSajib Sen
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Jisu Han
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...Seldon
 
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...Neelabha Pant
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1YasutoTamura1
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3Luis Borbon
 
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Weiyang Tong
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Xiaohu ZHU
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 

Ähnlich wie Exploration Strategies in Reinforcement Learning (20)

Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
An Updated Survey on Niching Methods and Their Applications
An Updated Survey on Niching Methods and Their ApplicationsAn Updated Survey on Niching Methods and Their Applications
An Updated Survey on Niching Methods and Their Applications
 
Bottle sum
Bottle sumBottle sum
Bottle sum
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Metaheuristics
MetaheuristicsMetaheuristics
Metaheuristics
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
 
From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
 
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ...
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3
 
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
Multi-Domain Diversity Preservation to Mitigate Particle Stagnation and Enab...
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 

Mehr von Dongmin Lee

Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation LearningDongmin Lee
 
Character Controllers using Motion VAEs
Character Controllers using Motion VAEsCharacter Controllers using Motion VAEs
Character Controllers using Motion VAEsDongmin Lee
 
Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation LearningDongmin Lee
 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Dongmin Lee
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...Dongmin Lee
 
Let's do Inverse RL
Let's do Inverse RLLet's do Inverse RL
Let's do Inverse RLDongmin Lee
 
모두를 위한 PG여행 가이드
모두를 위한 PG여행 가이드모두를 위한 PG여행 가이드
모두를 위한 PG여행 가이드Dongmin Lee
 
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement LearningDongmin Lee
 
안.전.제.일. 강화학습!
안.전.제.일. 강화학습!안.전.제.일. 강화학습!
안.전.제.일. 강화학습!Dongmin Lee
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsDongmin Lee
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 
강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2Dongmin Lee
 
강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1Dongmin Lee
 
강화학습의 개요
강화학습의 개요강화학습의 개요
강화학습의 개요Dongmin Lee
 

Mehr von Dongmin Lee (14)

Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
 
Character Controllers using Motion VAEs
Character Controllers using Motion VAEsCharacter Controllers using Motion VAEs
Character Controllers using Motion VAEs
 
Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
 
Let's do Inverse RL
Let's do Inverse RLLet's do Inverse RL
Let's do Inverse RL
 
모두를 위한 PG여행 가이드
모두를 위한 PG여행 가이드모두를 위한 PG여행 가이드
모두를 위한 PG여행 가이드
 
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement Learning
 
안.전.제.일. 강화학습!
안.전.제.일. 강화학습!안.전.제.일. 강화학습!
안.전.제.일. 강화학습!
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular Methods
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2
 
강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1
 
강화학습의 개요
강화학습의 개요강화학습의 개요
강화학습의 개요
 

Kürzlich hochgeladen

kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stageAbc194748
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwaitjaanualu31
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...Health
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfsmsksolar
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 

Kürzlich hochgeladen (20)

kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 

Exploration Strategies in Reinforcement Learning

  • 1. Exploration Strategies in Reinforcement Learning Dongmin Lee SNU Robot Learning Laboratory AI Robotics KR October 3, 2019
  • 2. Outline • Introduction • Reinforcement Learning • Exploration Strategies in RL • Entropy Regularization in RL 1
  • 3. Introduction 2 How do we build intelligent systems / machines? • Data à Information • Information à Decision intelligence intelligence
  • 4. Introduction 3 Intelligent systems must be able to adapt to uncertainty • Uncertainty: • Other cars • Weather • Road construction • Passenger
  • 5. Introduction 4 RL provides a formalism for behaviors • Problem of a goal-directed agent interacting with an uncertain environment • Interaction à adaptation feedback & decision
  • 6. Reinforcement Learning 5 • State of the environment / system : 𝑠" ∈ 𝑆 • Action determined by the agent : 𝑎" ∈ 𝐴 • Reward: evaluates the benefit of state and action • Policy (mapping from state to action): 𝜋 • Deterministic 𝜋 𝑠" = 𝑎" • Stochastic (randomized) 𝜋 𝑎 𝑠 = Prob(𝑎" = 𝑎|𝑠" = 𝑠) à Policy is what we want to optimize!
  • 7. 6 A Markov decision process (MDP) is a tuple < 𝑆, 𝐴, 𝑝, 𝑟, 𝛾 >, consisting of • 𝑆: set of states (state space) e.g., 𝑆 = 1, … , 𝑛 (discrete), 𝑆 = ℝ: (continuous) • 𝐴: set of actions (action space) e.g., 𝐴 = 1, … , 𝑚 (discrete), 𝐴 = ℝ< (continuous) • 𝑝: state transition probability 𝑝 𝑠= 𝑠, 𝑎 ≜ Prob 𝑠"?@ = 𝑠= 𝑠" = 𝑠, 𝑎" = 𝑎 d • 𝑟: reward function 𝑟 𝑠", 𝑎" = 𝑟"d • 𝛾 ∈ (0,1]: discount factor Reinforcement Learning
  • 8. 7 The MDP problem • To find an optimal policy that maximizes the expected cumulative reward: max F∈∏ 𝔼F I "JK L 𝛾" 𝑟 𝑠", 𝑎" Reinforcement Learning
  • 9. 8 What are the challenges of RL? • Huge # of samples: millions • Fast, stable learning • Hyperparameter tuning • Exploration • Sparse reward signals • Safety / reliability • Simulator Reinforcement Learning
  • 10. 9 What are the challenges of RL? • Huge # of samples: millions • Fast, stable learning • Hyperparameter tuning • Exploration • Sparse reward signals • Safety / reliability • Simulator Reinforcement Learning
  • 11. Exploration Strategies in RL 10 Common issue: exploration • Is there a value of exploring unknown regions of the environment? • Which action should we try to explore unknown regions of the environment?
  • 12. Exploration Strategies in RL 11 How can we algorithmically solve this issue exploration vs exploitation? 1. 𝜖-Greedy 2. Optimism in the face of uncertainty 3. Thompson (posterior) sampling 4. Information theoretic exploration 5. Noisy networks, Curiosity-driven, etc.
  • 13. Exploration Strategies in RL 12 How can we algorithmically solve this issue exploration vs exploitation? 1. 𝜖-Greedy 2. Optimism in the face of uncertainty 3. Thompson (posterior) sampling 4. Information theoretic exploration 5. Noisy networks, Curiosity-driven, etc.
  • 14. Exploration Strategies in RL 13 How can we algorithmically solve this issue exploration vs exploitation? 1. 𝜖-Greedy 2. Optimism in the face of uncertainty 3. Thompson (posterior) sampling 4. Information theoretic exploration 5. Noisy networks, Curiosity-driven, etc. • Q) What is a means of assessing these strategies?
  • 15. Exploration Strategies in RL 14 What is regret? • An additional means of assessing the algorithms’ performance • A loss that we incur due to time spent during the learning • The expected cumulative regret: ℛO ≜ 𝔼F max P∈Q I "JK O 𝑟" 𝑎 − 𝑟"(𝑎") Regret problem • To find a strategy that minimizes the expected cumulative regret ℛO • Maximize expected cumulative reward ≡ minimize expected cumulative regret!
  • 16. Exploration Strategies in RL 15 Method 1: 𝜖-Greedy Idea: • Choose a greedy action (arg max 𝑄F (𝑠, 𝑎)) with probability (1 − 𝜖) • Choose a random action with probability 𝜖 Advantages: 1. Very simple 2. Light computation Disadvantages: 1. Fine tuning 𝜖 2. Not quite systematic: No focus on unexplored regions 3. Inefficient
  • 17. Exploration Strategies in RL 16 A comparison of greedy, 𝜖-greedy, decaying 𝜖-greedy http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf
  • 18. Exploration Strategies in RL 17 Method 2: Optimism in the face of uncertainty (OFU) Idea: 1. Construct a confidence set for MDP parameters using collected samples 2. Choose MDP parameters that give the highest rewards 3. Construct an optimal policy of the optimistically chosen MDP Advantages: 1. Regret optimal 2. Systematic: More focus on unexplored regions Disadvantages: 1. Complicated 2. Computation intensive
  • 19. Exploration Strategies in RL 18 Method 3: Thompson (posterior) sampling Idea: 1. Sample MDP parameters from posterior distribution 2. Construct an optimal policy of the sampled MDP parameters 3. Update the posterior distribution Advantages: 1. Regret optimal 2. Systematic: More focus on unexplored regions Disadvantages: 1. Somewhat complicated
  • 20. Exploration Strategies in RL 19 Method 4: Information theoretic exploration Idea: • Use high entropy policy to explore more Advantages: 1. Simple 2. Good empirical performance Disadvantages: 1. More theoretic analyses needed
  • 21. Entropy Regularization in RL 20 What is entropy? • Average rate at which information is produced by a stochastic source of data • The well-known standard Shannon-Gibbs entropy: 𝐻 𝑃 ≜ 𝔼 X~Z − log 𝑃 𝑋 = − I ]∈X 𝑃(𝑥) log 𝑃(𝑥) • More generally, entropy refers to disorder
  • 22. • Standard MDP problem: max F 𝔼F I "JK L 𝛾" 𝑟(𝑠", 𝑎") • Maximum entropy MDP problem: max F 𝔼F I "JK L 𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝐻 𝜋 ⋅ 𝑠" where 𝐻 𝜋 ⋅ 𝑠" = 𝔼P~F − log 𝜋 𝑎 𝑠 = − ∑P 𝜋 𝑎 𝑠 log 𝜋 𝑎 𝑠 . T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017 21 Entropy Regularization in RL
  • 23. What’s the benefit of maximum entropy regularization? • Computation: No maximization involved • Exploration: Potentially explore unexplored regions using high entropy policy • Structural similarity: Can combine it with many RL methods for standard MDP 22 Entropy Regularization in RL
  • 24. https://sites.google.com/view/sac-and-applications 23 Entropy Regularization in RL • Soft Actor-Critic on Minitaur ­ Training
  • 25. https://sites.google.com/view/sac-and-applications 24 Entropy Regularization in RL • Soft Actor-Critic on Minitaur ­ Testing
  • 26. 25 Entropy Regularization in RL Tsallis entropy: Generalization of the Shannon-Gibbs entropy K. Lee, et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning”, arXiv preprint 2019
  • 27. • Maximum entropy MDP problem: max F 𝔼F I "JK L 𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝐻 𝜋 ⋅ 𝑠" • Maximum Tsallis entropy MDP problem: max F 𝔼F I "JK L 𝛾" 𝑟 𝑠", 𝑎" + 𝛼𝑆c 𝜋 ⋅ 𝑠" • Tsallis entropy 𝑆c 𝜋 ⋅ 𝑠" = 𝔼P~F − lnc(𝜋 𝑎 𝑠 ) where, lnc(𝑥) = e log(𝑥) , 𝑖𝑓 𝑞 = 1 𝑎𝑛𝑑 𝑥 > 0 ]jklm@ cm@ , 𝑖𝑓 𝑞 ≠ 1 𝑎𝑛𝑑 𝑥 > 0 , 𝑞 is called an entropic index K. Lee, et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning”, arXiv preprint 2019 26 Entropy Regularization in RL