I gave a talk on "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information-theoretic exploration (e.g., Entropy Regularization in RL)
Thank you.
5. Introduction
RL provides a formalism for behaviors
• Problem of a goal-directed agent interacting with an uncertain environment
• Interaction → adaptation (feedback & decision)
6. Reinforcement Learning
• State of the environment / system: $s_t \in S$
• Action determined by the agent: $a_t \in A$
• Reward: evaluates the benefit of state and action
• Policy (mapping from state to action): $\pi$
  • Deterministic: $\pi(s_t) = a_t$
  • Stochastic (randomized): $\pi(a \mid s) = \mathrm{Prob}(a_t = a \mid s_t = s)$
→ Policy is what we want to optimize!
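To make the two policy types above concrete, here is a minimal Python sketch; the states, actions, and probabilities are illustrative, not from the slides:

```python
import random

# Deterministic policy: a lookup table mapping each state to one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act(policy, state):
    """Return pi(s) for a deterministic table, or sample a ~ pi(.|s)."""
    rule = policy[state]
    if isinstance(rule, dict):
        actions, probs = zip(*rule.items())
        return random.choices(actions, weights=probs)[0]
    return rule

print(act(deterministic_policy, "s0"))  # always "left"
print(act(stochastic_policy, "s0"))     # "left" about 90% of the time
```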
7. Reinforcement Learning
A Markov decision process (MDP) is a tuple $\langle S, A, p, r, \gamma \rangle$, consisting of
• $S$: set of states (state space)
  e.g., $S = \{1, \dots, n\}$ (discrete), $S = \mathbb{R}^n$ (continuous)
• $A$: set of actions (action space)
  e.g., $A = \{1, \dots, m\}$ (discrete), $A = \mathbb{R}^m$ (continuous)
• $p$: state transition probability
  $p(s' \mid s, a) \triangleq \mathrm{Prob}(s_{t+1} = s' \mid s_t = s, a_t = a)$
• $r$: reward function
  $r(s_t, a_t) = r_t$
• $\gamma \in (0, 1]$: discount factor
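As a purely illustrative representation (the field names are an assumption, not a standard API), the tuple can be held in a small Python container:

```python
from dataclasses import dataclass

# Minimal container for the tuple <S, A, p, r, gamma>.
@dataclass
class MDP:
    states: list         # S
    actions: list        # A
    p: dict              # p[(s, a)] -> {s_next: prob}, the transition kernel
    r: dict              # r[(s, a)] -> immediate reward
    gamma: float = 0.99  # discount factor in (0, 1]

# A 2-state, 2-action toy MDP.
toy = MDP(
    states=[0, 1],
    actions=["stay", "move"],
    p={(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0},
       (1, "stay"): {1: 1.0}, (1, "move"): {0: 1.0}},
    r={(0, "stay"): 0.0, (0, "move"): 1.0,
       (1, "stay"): 0.5, (1, "move"): 0.0},
)
print(toy.p[(0, "move")])  # {1: 1.0}
```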
8. Reinforcement Learning
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
$$\max_{\pi \in \Pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$$
9. Reinforcement Learning
What are the challenges of RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
11. Exploration Strategies in RL
Common issue: exploration
• Is there value in exploring unknown regions of the environment?
• Which action should we try in order to explore unknown regions of the environment?
12. Exploration Strategies in RL
How can we algorithmically resolve this exploration-vs-exploitation trade-off?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information-theoretic exploration
5. Noisy networks, curiosity-driven exploration, etc.
• Q) What is a means of assessing these strategies?
15. Exploration Strategies in RL
What is regret?
• An additional means of assessing the algorithms’ performance
• The loss we incur due to time spent learning
• The expected cumulative regret (computed for a toy example below):
$$\mathcal{R}_T \triangleq \mathbb{E}_\pi \left[ \max_{a \in A} \sum_{t=0}^{T} r_t(a) - r_t(a_t) \right]$$
Regret problem
• To find a strategy that minimizes the expected cumulative regret $\mathcal{R}_T$
• Maximize expected cumulative reward ≡ minimize expected cumulative regret!
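A small sketch of the definition in the simplest setting, a multi-armed bandit (an MDP with a single state, so the same regret notion applies); the arm means are made up:

```python
import random

true_means = [0.3, 0.7]  # unknown to the learner
best = max(true_means)

# Cumulative regret of a strategy that never learns (uniform arm choice):
# each step adds best_mean - mean(chosen arm), so regret grows linearly.
regret = 0.0
for t in range(1000):
    arm = random.randrange(len(true_means))
    regret += best - true_means[arm]
print(regret)  # about 0.2 * 1000 = 200 for these means
```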
16. Exploration Strategies in RL
Method 1: 𝜖-Greedy
Idea:
• Choose a greedy action ($\arg\max_a Q^\pi(s, a)$) with probability $1 - \epsilon$
• Choose a random action with probability $\epsilon$ (sketched in code below)
Advantages:
1. Very simple
2. Light computation
Disadvantages:
1. Requires fine-tuning $\epsilon$
2. Not quite systematic: No focus on unexplored regions
3. Inefficient
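A minimal sketch of the action-selection rule; the Q-values are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action argmax_a Q(s, a) w.p. 1 - epsilon; uniform w.p. epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

q = [0.1, 0.5, 0.2]       # Q(s, .) for one state; indices are actions
print(epsilon_greedy(q))  # usually 1, occasionally a random action

# Decaying epsilon-greedy (compared on the next slide) simply shrinks
# epsilon over time, e.g., epsilon_t = 1 / (t + 1).
```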
17. Exploration Strategies in RL
A comparison of greedy, 𝜖-greedy, and decaying 𝜖-greedy
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf
18. Exploration Strategies in RL
Method 2: Optimism in the face of uncertainty (OFU)
Idea:
1. Construct a confidence set for MDP parameters using collected samples
2. Choose MDP parameters that give the highest rewards
3. Construct an optimal policy for the optimistically chosen MDP
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Complicated
2. Computation intensive
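The slides describe OFU for full MDPs (confidence sets over MDP parameters); the same principle in the simpler bandit setting is UCB1, sketched here with made-up arm means:

```python
import math
import random

true_means = [0.3, 0.7]  # unknown to the learner
n_arms = len(true_means)
counts = [0] * n_arms    # pulls per arm
sums = [0.0] * n_arms    # total reward per arm

for t in range(1, 1001):
    # Upper confidence bound: empirical mean plus an optimism bonus that
    # shrinks as an arm is pulled more often (unpulled arms go first).
    ucb = [float("inf") if counts[a] == 0
           else sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
           for a in range(n_arms)]
    arm = max(range(n_arms), key=ucb.__getitem__)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print(counts)  # the better arm accumulates most of the pulls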
19. Exploration Strategies in RL
Method 3: Thompson (posterior) sampling
Idea:
1. Sample MDP parameters from posterior distribution
2. Construct an optimal policy for the sampled MDP
3. Update the posterior distribution
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Somewhat complicated
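Again in the bandit simplification (the full method samples MDP parameters), Beta-Bernoulli Thompson sampling gives a compact sketch of the idea; the prior and arm means are illustrative:

```python
import random

true_means = [0.3, 0.7]  # unknown to the learner
n_arms = len(true_means)
alpha = [1.0] * n_arms   # Beta(1, 1) = uniform prior per arm
beta = [1.0] * n_arms

for t in range(1000):
    # 1. Sample a mean for each arm from its posterior.
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
    # 2. Act greedily with respect to the sampled means.
    arm = max(range(n_arms), key=samples.__getitem__)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    # 3. Update the posterior of the pulled arm.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print(alpha, beta)  # the posterior concentrates on the better arm
```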
20. Exploration Strategies in RL
Method 4: Information theoretic exploration
Idea:
• Use a high-entropy policy to explore more
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. More theoretical analysis is needed
21. Entropy Regularization in RL
What is entropy?
• Average rate at which information is produced by a stochastic source of data
• The well-known standard Shannon-Gibbs entropy:
$$H(P) \triangleq \mathbb{E}_{X \sim P}\left[-\log P(X)\right] = -\sum_{x \in \mathcal{X}} P(x) \log P(x)$$
• More generally, entropy refers to disorder
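A direct transcription of the formula above (natural log; the distributions are illustrative):

```python
import math

def shannon_entropy(probs):
    """H(P) = -sum_x P(x) log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # ~0.693 (log 2): maximal disorder
print(shannon_entropy([0.99, 0.01]))  # ~0.056: nearly deterministic
```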
22. Entropy Regularization in RL
• Standard MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$$
• Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right) \right]$$
where $H(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s_t)\right] = -\sum_a \pi(a \mid s_t) \log \pi(a \mid s_t)$.
T. Haarnoja et al., “Reinforcement Learning with Deep Energy-Based Policies,” ICML 2017.
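A minimal sketch of the per-step regularized reward; the temperature alpha = 0.1 and the policies are illustrative:

```python
import math

def entropy(policy_probs):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s)."""
    return -sum(p * math.log(p) for p in policy_probs if p > 0)

def soft_reward(reward, policy_probs, alpha=0.1):
    """One term of the regularized objective: r(s_t, a_t) + alpha * H(pi(.|s_t))."""
    return reward + alpha * entropy(policy_probs)

# A near-uniform policy earns a larger bonus than a near-greedy one,
# which is exactly the pressure that keeps the agent exploring.
print(soft_reward(1.0, [0.25, 0.25, 0.25, 0.25]))  # 1.0 + ~0.139
print(soft_reward(1.0, [0.97, 0.01, 0.01, 0.01]))  # 1.0 + ~0.017
```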
23. Entropy Regularization in RL
What's the benefit of maximum entropy regularization?
• Computation:
no hard maximization over actions is involved
• Exploration:
a high-entropy policy keeps visiting otherwise unexplored regions
• Structural similarity:
it can be combined with many RL methods for the standard MDP
26. Entropy Regularization in RL
Tsallis entropy: a generalization of the Shannon-Gibbs entropy
K. Lee et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning,” arXiv preprint, 2019.
27. Entropy Regularization in RL
• Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right) \right]$$
• Maximum Tsallis entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha S_q(\pi(\cdot \mid s_t)) \right) \right]$$
• Tsallis entropy:
$$S_q(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi}\left[-\ln_q(\pi(a \mid s_t))\right]$$
where
$$\ln_q(x) = \begin{cases} \log(x) & \text{if } q = 1 \text{ and } x > 0 \\[4pt] \dfrac{x^{q-1} - 1}{q - 1} & \text{if } q \neq 1 \text{ and } x > 0 \end{cases}$$
and $q$ is called the entropic index.
K. Lee et al., “Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning,” arXiv preprint, 2019.
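A direct transcription of $\ln_q$ and $S_q$ as defined above; the test distribution is illustrative:

```python
import math

def ln_q(x, q):
    """Tsallis q-logarithm: log(x) if q = 1, else (x**(q-1) - 1) / (q - 1)."""
    if q == 1:
        return math.log(x)
    return (x ** (q - 1) - 1.0) / (q - 1)

def tsallis_entropy(policy_probs, q):
    """S_q(pi(.|s)) = E_{a~pi}[-ln_q pi(a|s)] = -sum_a pi(a|s) ln_q pi(a|s)."""
    return -sum(p * ln_q(p, q) for p in policy_probs if p > 0)

probs = [0.7, 0.2, 0.1]
print(tsallis_entropy(probs, q=1))  # recovers the Shannon-Gibbs entropy
print(tsallis_entropy(probs, q=2))  # quadratic case: equals 1 - sum(p**2)
```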