Curiosity-Bottleneck: Exploration by Distilling Task-Specific Novelty
1. Code available at: http://vision.snu.ac.kr/projects/cb
ICML 2019
Youngjin Kim
Hyunwoo Kim*
Wontae Nam*
Jihoon Kim
Gunhee Kim
(*equal contribution)
3. Extrinsic Reward vs. Intrinsic Reward
§ Extrinsic: +500 score for getting an item!
§ Extrinsic: -150 score for stepping on a bomb :(
§ Intrinsic: +200 motivation score, as I've never been to this place!
§ Intrinsic: -150 motivation score; I've been here too many times
7. Problematic situation: Exploration under Distraction
§ 1. Distractive environments are widespread
§ Real-world observations contain novel but task-irrelevant information.
[Figure: a navigating robot in (a) a known place and (b) a known place with strangers]
8. Problematic situation: Exploration under Distraction
§ 2. Prior novelty-based exploration strategies degenerate
§ Due to task-agnostic intrinsic rewards
§ Need mechanisms to prioritize task-relevant novelty
[Figure: the same navigating robot scenes, with (a) the known place labeled "Not Novel" and (b) the known place with strangers labeled "Novel"]
9. Our approach: Curiosity-Bottleneck
§ Quantify the 'degree of compression' using a compressive value network
[Diagram: the environment emits observation x_t to a Compressor, whose code z feeds a Value Predictor producing the estimate ŷ_t; the Compressor yields the intrinsic reward r_t^i, the environment yields the external reward r_t^e, and the policy π outputs action a_t]
10. Our approach: Curiosity-Bottleneck
§ Encode a rare x into a lengthy code and a common x into a shorter code
§ Discard information about x during compression
[Diagram: the same compressive value network, with the Compressor highlighted]
11. Our approach: Curiosity-Bottleneck
§ Prevent the Compressor from discarding task-related information
[Diagram: the same compressive value network, with the Value Predictor highlighted]
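The compressive value network described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the layer sizes and linear maps are hypothetical stand-ins for the actual CNN compressor, and q(Z) is assumed to be a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the paper uses a CNN compressor.
obs_dim, latent_dim = 8, 4
W_mu = rng.normal(size=(obs_dim, latent_dim)) * 0.1
W_ls = rng.normal(size=(obs_dim, latent_dim)) * 0.1
w_y = rng.normal(size=latent_dim) * 0.1

def compressor(x):
    # Maps an observation to a diagonal Gaussian p(z|x) = N(mu, sigma^2).
    return x @ W_mu, x @ W_ls  # mu, log_sigma

def value_predictor(z):
    # Predicts the value estimate y from the compressed code z.
    return z @ w_y

def intrinsic_reward(mu, log_sigma):
    # KL[p(Z|x) || q(Z)] with q(Z) = N(0, I): the code length the
    # compressor spends on x, used as the curiosity bonus r_t^i.
    s2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(s2 + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

x = rng.normal(size=(2, obs_dim))                        # batch of observations x_t
mu, log_sigma = compressor(x)
z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)   # reparameterized sample
y_hat = value_predictor(z)                                # value estimate
r_int = intrinsic_reward(mu, log_sigma)                   # intrinsic reward r_t^i
```

Because the intrinsic reward is a KL divergence, it is always non-negative, and it grows when the code for x sits far from the shared marginal q(Z), i.e., when x is rare.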
12. Our approach: Curiosity-Bottleneck
1. Objective Function
§ Minimize the average code length of representation Z and discard information about observation X:
  min H(Z) − H(Z|X) = min I(X;Z)
§ Preserve information related to the value estimate Y:
  max I(Z;Y)
§ Combined objective:
  L = −I(Z;Y) + β·I(X;Z)
2. Intrinsic Reward: Per-instance Mutual Information
  r^i(x) = ∫ p(z|x) log [ p(x,z) / (p(x)·p(z)) ] dz = KL[p(Z|x) || p(Z)]
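The objective above can be made concrete with a standard variational bound: −I(Z;Y) is upper-bounded by the value-prediction error −log q(y|z), and I(X;Z) by KL[p(Z|x) || q(Z)] against a variational marginal q(Z). A minimal sketch, assuming a diagonal Gaussian p(z|x), a standard-normal q(Z), and a Gaussian q(y|z) (so the prediction term reduces to squared error):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over latent dims.
    s2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(s2 + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

def cb_loss(mu, log_sigma, y_pred, y_target, beta):
    # Variational bound on L = -I(Z;Y) + beta * I(X;Z):
    # the value-prediction error bounds -I(Z;Y), the KL term bounds I(X;Z).
    nll = 0.5 * (y_pred - y_target) ** 2
    return np.mean(nll + beta * kl_to_standard_normal(mu, log_sigma))

# A common observation (code near the prior) earns a small intrinsic reward;
# a rare one (code far from the prior) earns a large one.
mu = np.array([[0.0, 0.0], [2.0, -2.0]])
log_sigma = np.zeros((2, 2))
r_int = kl_to_standard_normal(mu, log_sigma)  # r_int[1] > r_int[0]
```

Note how the single KL term plays both roles: during training it is the compression penalty inside the loss, and at rollout time it is read off per observation as the intrinsic reward r^i(x).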
16. Experiment: Treasure Hunt
§ The agent is depicted as a circle
§ An item (triangle) with reward is hidden somewhere
§ The item appears only when the agent is nearby
§ Once the agent obtains an item, the next item is spawned in another area (also hidden)
§ The traces (pentagons) of eaten items remain
§ Goal: get the maximum score!
[Figures: outline of the game and an example of gameplay]
17. Experiment: Treasure Hunt
Two types of onset conditions for distraction:
§ Movement condition: distraction appears when the agent stays in the same location
§ Location condition: distraction appears when the agent stays in the corners of the map
18. Experiment: Treasure Hunt
§ CB consistently outperforms the baselines (CB-noKL, RND, Dynamics, SimHash) under both distraction settings
[Figure: mean episodic reward over training steps (×1e6) for (a) the movement condition and (b) the location condition]
19. Experiment: Treasure Hunt
Illustration of the adaptive exploration strategy
[Figure: how the posteriors p_θ(Z|x_1), p_θ(Z|x_2) and the marginal q(Z) shift over the range of experiences, and how the gradient of the compression term (∇KL) trades off against the gradient of the value-prediction term (−∇log q_φ), shown with target values y and predictions (a) in early training steps and (b) after collecting rewards]
20. Experiment: Treasure Hunt
§ The compression loss term KL[p_θ(Z|x) || q(Z)] induces task-agnostic exploration in early stages
[Figure: Grad-CAM visualization of the adaptive exploration strategy for (a) Input, (b) CB-Early, (c) CB, (d) CB-noKL, (e) RND, (f) Dynamics, (g) SimHash]
21. Experiment: Treasure Hunt
§ The value-prediction loss term −log q_φ(y|z) induces task-specific exploration after collecting external rewards
[Figure: Grad-CAM visualization of the adaptive exploration strategy for (a) Input, (b) CB-Early, (c) CB, (d) CB-noKL, (e) RND, (f) Dynamics, (g) SimHash]
23. Contributions
• First work to discriminate information by task-relevance
→ Focus on task-relevant novelty and filter out distractive information
• Utilize the information bottleneck as a novelty measure
→ The KL-divergence term serves as the degree of compression
• Extensive experiments
→ A custom grid-world environment demonstrates situations where previous methods suffer; Atari environments show generality
• Psychologically plausible