I gave a talk on "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information-theoretic exploration (e.g., Entropy Regularization in RL)
Thank you.
5. Introduction
RL provides a formalism for behaviors
• Problem of a goal-directed agent interacting with an uncertain environment
• Interaction → adaptation (feedback & decision)
6. Reinforcement Learning
• State of the environment / system: $s_t \in S$
• Action determined by the agent: $a_t \in A$
• Reward: evaluates the benefit of state and action
• Policy (mapping from state to action): $\pi$
  • Deterministic: $\pi(s_t) = a_t$
  • Stochastic (randomized): $\pi(a \mid s) = \mathrm{Prob}(a_t = a \mid s_t = s)$
→ Policy is what we want to optimize!
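To make the two policy types above concrete, here is a minimal Python sketch; the states, actions, and probabilities are illustrative, not from the slides:

```python
import random

# Deterministic policy: a lookup table mapping each state to one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act(policy, state):
    """Return pi(s) for a deterministic table, or sample a ~ pi(.|s)."""
    rule = policy[state]
    if isinstance(rule, dict):
        actions, probs = zip(*rule.items())
        return random.choices(actions, weights=probs)[0]
    return rule

print(act(deterministic_policy, "s0"))  # always "left"
print(act(stochastic_policy, "s0"))     # "left" about 90% of the time
```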
7. Reinforcement Learning
A Markov decision process (MDP) is a tuple $\langle S, A, p, r, \gamma \rangle$, consisting of
• $S$: set of states (state space)
  e.g., $S = \{1, \dots, n\}$ (discrete), $S = \mathbb{R}^n$ (continuous)
• $A$: set of actions (action space)
  e.g., $A = \{1, \dots, m\}$ (discrete), $A = \mathbb{R}^m$ (continuous)
• $p$: state transition probability
  $p(s' \mid s, a) \triangleq \mathrm{Prob}(s_{t+1} = s' \mid s_t = s, a_t = a)$
• $r$: reward function
  $r(s_t, a_t) = r_t$
• $\gamma \in (0, 1]$: discount factor
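As a purely illustrative representation (the field names are an assumption, not a standard API), the tuple can be held in a small Python container:

```python
from dataclasses import dataclass

# Minimal container for the tuple <S, A, p, r, gamma>.
@dataclass
class MDP:
    states: list         # S
    actions: list        # A
    p: dict              # p[(s, a)] -> {s_next: prob}, the transition kernel
    r: dict              # r[(s, a)] -> immediate reward
    gamma: float = 0.99  # discount factor in (0, 1]

# A 2-state, 2-action toy MDP.
toy = MDP(
    states=[0, 1],
    actions=["stay", "move"],
    p={(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0},
       (1, "stay"): {1: 1.0}, (1, "move"): {0: 1.0}},
    r={(0, "stay"): 0.0, (0, "move"): 1.0,
       (1, "stay"): 0.5, (1, "move"): 0.0},
)
print(toy.p[(0, "move")])  # {1: 1.0}
```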
8. Reinforcement Learning
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
$$\max_{\pi \in \Pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$$
9. Reinforcement Learning
What are the challenges of RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
11. Exploration Strategies in RL
Common issue: exploration
• Is there value in exploring unknown regions of the environment?
• Which action should we try in order to explore unknown regions of the environment?
12. Exploration Strategies in RL
How can we algorithmically resolve this exploration-vs-exploitation trade-off?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information-theoretic exploration
5. Noisy networks, curiosity-driven exploration, etc.
• Q) What is a means of assessing these strategies?
15. Exploration Strategies in RL
What is regret?
• An additional means of assessing the algorithms’ performance
• The loss we incur due to time spent learning
• The expected cumulative regret (computed for a toy example below):
$$\mathcal{R}_T \triangleq \mathbb{E}_\pi \left[ \max_{a \in A} \sum_{t=0}^{T} r_t(a) - r_t(a_t) \right]$$
Regret problem
• To find a strategy that minimizes the expected cumulative regret $\mathcal{R}_T$
• Maximize expected cumulative reward ≡ minimize expected cumulative regret!
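A small sketch of the definition in the simplest setting, a multi-armed bandit (an MDP with a single state, so the same regret notion applies); the arm means are made up:

```python
import random

true_means = [0.3, 0.7]  # unknown to the learner
best = max(true_means)

# Cumulative regret of a strategy that never learns (uniform arm choice):
# each step adds best_mean - mean(chosen arm), so regret grows linearly.
regret = 0.0
for t in range(1000):
    arm = random.randrange(len(true_means))
    regret += best - true_means[arm]
print(regret)  # about 0.2 * 1000 = 200 for these means
```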
16. Exploration Strategies in RL
Method 1: 𝜖-Greedy
Idea:
• Choose a greedy action ($\arg\max_a Q^\pi(s, a)$) with probability $1 - \epsilon$
• Choose a random action with probability $\epsilon$ (sketched in code below)
Advantages:
1. Very simple
2. Light computation
Disadvantages:
1. Requires fine-tuning $\epsilon$
2. Not quite systematic: No focus on unexplored regions
3. Inefficient
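A minimal sketch of the action-selection rule; the Q-values are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action argmax_a Q(s, a) w.p. 1 - epsilon; uniform w.p. epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

q = [0.1, 0.5, 0.2]       # Q(s, .) for one state; indices are actions
print(epsilon_greedy(q))  # usually 1, occasionally a random action

# Decaying epsilon-greedy (compared on the next slide) simply shrinks
# epsilon over time, e.g., epsilon_t = 1 / (t + 1).
```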
17. Exploration Strategies in RL
A comparison of greedy, 𝜖-greedy, and decaying 𝜖-greedy
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf
18. Exploration Strategies in RL
Method 2: Optimism in the face of uncertainty (OFU)
Idea:
1. Construct a confidence set for MDP parameters using collected samples
2. Choose MDP parameters that give the highest rewards
3. Construct an optimal policy for the optimistically chosen MDP
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Complicated
2. Computation intensive
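The slides describe OFU for full MDPs (confidence sets over MDP parameters); the same principle in the simpler bandit setting is UCB1, sketched here with made-up arm means:

```python
import math
import random

true_means = [0.3, 0.7]  # unknown to the learner
n_arms = len(true_means)
counts = [0] * n_arms    # pulls per arm
sums = [0.0] * n_arms    # total reward per arm

for t in range(1, 1001):
    # Upper confidence bound: empirical mean plus an optimism bonus that
    # shrinks as an arm is pulled more often (unpulled arms go first).
    ucb = [float("inf") if counts[a] == 0
           else sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
           for a in range(n_arms)]
    arm = max(range(n_arms), key=ucb.__getitem__)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print(counts)  # the better arm accumulates most of the pulls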
19. Exploration Strategies in RL
Method 3: Thompson (posterior) sampling
Idea:
1. Sample MDP parameters from posterior distribution
2. Construct an optimal policy for the sampled MDP
3. Update the posterior distribution
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Somewhat complicated
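Again in the bandit simplification (the full method samples MDP parameters), Beta-Bernoulli Thompson sampling gives a compact sketch of the idea; the prior and arm means are illustrative:

```python
import random

true_means = [0.3, 0.7]  # unknown to the learner
n_arms = len(true_means)
alpha = [1.0] * n_arms   # Beta(1, 1) = uniform prior per arm
beta = [1.0] * n_arms

for t in range(1000):
    # 1. Sample a mean for each arm from its posterior.
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
    # 2. Act greedily with respect to the sampled means.
    arm = max(range(n_arms), key=samples.__getitem__)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    # 3. Update the posterior of the pulled arm.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print(alpha, beta)  # the posterior concentrates on the better arm
```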
20. Exploration Strategies in RL
Method 4: Information theoretic exploration
Idea:
• Use a high-entropy policy to explore more
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. More theoretical analysis is needed
21. Entropy Regularization in RL
What is entropy?
• Average rate at which information is produced by a stochastic source of data
• The well-known standard Shannon-Gibbs entropy:
$$H(P) \triangleq \mathbb{E}_{X \sim P}\left[-\log P(X)\right] = -\sum_{x \in \mathcal{X}} P(x) \log P(x)$$
• More generally, entropy refers to disorder
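A direct transcription of the formula above (natural log; the distributions are illustrative):

```python
import math

def shannon_entropy(probs):
    """H(P) = -sum_x P(x) log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # ~0.693 (log 2): maximal disorder
print(shannon_entropy([0.99, 0.01]))  # ~0.056: nearly deterministic
```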
22. Entropy Regularization in RL
• Standard MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$$
• Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right) \right]$$
where $H(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s_t)\right] = -\sum_a \pi(a \mid s_t) \log \pi(a \mid s_t)$.
T. Haarnoja et al., “Reinforcement Learning with Deep Energy-Based Policies,” ICML 2017.
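A minimal sketch of the per-step regularized reward; the temperature alpha = 0.1 and the policies are illustrative:

```python
import math

def entropy(policy_probs):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s)."""
    return -sum(p * math.log(p) for p in policy_probs if p > 0)

def soft_reward(reward, policy_probs, alpha=0.1):
    """One term of the regularized objective: r(s_t, a_t) + alpha * H(pi(.|s_t))."""
    return reward + alpha * entropy(policy_probs)

# A near-uniform policy earns a larger bonus than a near-greedy one,
# which is exactly the pressure that keeps the agent exploring.
print(soft_reward(1.0, [0.25, 0.25, 0.25, 0.25]))  # 1.0 + ~0.139
print(soft_reward(1.0, [0.97, 0.01, 0.01, 0.01]))  # 1.0 + ~0.017
```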
23. Entropy Regularization in RL
What's the benefit of maximum entropy regularization?
• Computation:
no hard maximization over actions is involved
• Exploration:
a high-entropy policy keeps visiting otherwise unexplored regions
• Structural similarity:
it can be combined with many RL methods for the standard MDP
26. Entropy Regularization in RL
Tsallis entropy: a generalization of the Shannon-Gibbs entropy
K. Lee et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning,” arXiv preprint, 2019.
27. Entropy Regularization in RL
• Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right) \right]$$
• Maximum Tsallis entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha S_q(\pi(\cdot \mid s_t)) \right) \right]$$
• Tsallis entropy:
$$S_q(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi}\left[-\ln_q(\pi(a \mid s_t))\right]$$
where
$$\ln_q(x) = \begin{cases} \log(x) & \text{if } q = 1 \text{ and } x > 0 \\[4pt] \dfrac{x^{q-1} - 1}{q - 1} & \text{if } q \neq 1 \text{ and } x > 0 \end{cases}$$
and $q$ is called the entropic index.
K. Lee et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning,” arXiv preprint, 2019.
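A direct transcription of $\ln_q$ and $S_q$ as defined above; the test distribution is illustrative:

```python
import math

def ln_q(x, q):
    """Tsallis q-logarithm: log(x) if q = 1, else (x**(q-1) - 1) / (q - 1)."""
    if q == 1:
        return math.log(x)
    return (x ** (q - 1) - 1.0) / (q - 1)

def tsallis_entropy(policy_probs, q):
    """S_q(pi(.|s)) = E_{a~pi}[-ln_q pi(a|s)] = -sum_a pi(a|s) ln_q pi(a|s)."""
    return -sum(p * ln_q(p, q) for p in policy_probs if p > 0)

probs = [0.7, 0.2, 0.1]
print(tsallis_entropy(probs, q=1))  # recovers the Shannon-Gibbs entropy
print(tsallis_entropy(probs, q=2))  # quadratic case: equals 1 - sum(p**2)
```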