Curiosity-Bottleneck: Exploration by Distilling Task-Specific Novelty
1. Code available at: http://vision.snu.ac.kr/projects/cb
ICML 2019
Youngjin Kim
Hyunwoo Kim*
Wontae Nam*
Jihoon Kim
Gunhee Kim
(*equal contribution)
3. Extrinsic Reward vs. Intrinsic Reward
§ Extrinsic: +500 score for getting an item!
§ Extrinsic: -150 score for stepping on a bomb :(
§ Intrinsic: +200 motivation score, as I've never been to this place!
§ Intrinsic: -150 motivation score; I've been here too many times
7. Problematic situation: Exploration under Distraction
§ 1. Distractive environments are widespread
§ Real-world observations contain novel but task-irrelevant information.
[Figure: a navigating robot in (a) a known place and (b) a known place with strangers]
8. Problematic situation: Exploration under Distraction
§ 2. Prior novelty-based exploration strategies degenerate
§ Due to task-agnostic intrinsic rewards
§ Need mechanisms to prioritize task-relevant novelty
[Figure: the same navigating robot scenes, with (a) the known place labeled "Not Novel" and (b) the known place with strangers labeled "Novel"]
9. Our approach: Curiosity-Bottleneck
§ Quantify the 'degree of compression' using a compressive value network
[Diagram: the environment emits observation x_t to a Compressor, whose code z feeds a Value Predictor producing the estimate ŷ_t; the Compressor yields the intrinsic reward r_t^i, the environment yields the external reward r_t^e, and the policy π outputs action a_t]
10. Our approach: Curiosity-Bottleneck
§ Encode a rare x into a lengthy code and a common x into a shorter code
§ Discard information about x during compression
[Diagram: the same compressive value network, with the Compressor highlighted]
11. Our approach: Curiosity-Bottleneck
§ Prevent the Compressor from discarding task-related information
[Diagram: the same compressive value network, with the Value Predictor highlighted]
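The compressive value network described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the layer sizes and linear maps are hypothetical stand-ins for the actual CNN compressor, and q(Z) is assumed to be a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the paper uses a CNN compressor.
obs_dim, latent_dim = 8, 4
W_mu = rng.normal(size=(obs_dim, latent_dim)) * 0.1
W_ls = rng.normal(size=(obs_dim, latent_dim)) * 0.1
w_y = rng.normal(size=latent_dim) * 0.1

def compressor(x):
    # Maps an observation to a diagonal Gaussian p(z|x) = N(mu, sigma^2).
    return x @ W_mu, x @ W_ls  # mu, log_sigma

def value_predictor(z):
    # Predicts the value estimate y from the compressed code z.
    return z @ w_y

def intrinsic_reward(mu, log_sigma):
    # KL[p(Z|x) || q(Z)] with q(Z) = N(0, I): the code length the
    # compressor spends on x, used as the curiosity bonus r_t^i.
    s2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(s2 + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

x = rng.normal(size=(2, obs_dim))                        # batch of observations x_t
mu, log_sigma = compressor(x)
z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)   # reparameterized sample
y_hat = value_predictor(z)                                # value estimate
r_int = intrinsic_reward(mu, log_sigma)                   # intrinsic reward r_t^i
```

Because the intrinsic reward is a KL divergence, it is always non-negative, and it grows when the code for x sits far from the shared marginal q(Z), i.e., when x is rare.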
12. Our approach: Curiosity-Bottleneck
1. Objective Function
§ Minimize the average code length of representation Z and discard information about observation X:
  min H(Z) − H(Z|X) = min I(X;Z)
§ Preserve information related to the value estimate Y:
  max I(Z;Y)
§ Combined objective:
  L = −I(Z;Y) + β·I(X;Z)
2. Intrinsic Reward: Per-instance Mutual Information
  r^i(x) = ∫ p(z|x) log [ p(x,z) / (p(x)·p(z)) ] dz = KL[p(Z|x) || p(Z)]
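The objective above can be made concrete with a standard variational bound: −I(Z;Y) is upper-bounded by the value-prediction error −log q(y|z), and I(X;Z) by KL[p(Z|x) || q(Z)] against a variational marginal q(Z). A minimal sketch, assuming a diagonal Gaussian p(z|x), a standard-normal q(Z), and a Gaussian q(y|z) (so the prediction term reduces to squared error):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over latent dims.
    s2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(s2 + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

def cb_loss(mu, log_sigma, y_pred, y_target, beta):
    # Variational bound on L = -I(Z;Y) + beta * I(X;Z):
    # the value-prediction error bounds -I(Z;Y), the KL term bounds I(X;Z).
    nll = 0.5 * (y_pred - y_target) ** 2
    return np.mean(nll + beta * kl_to_standard_normal(mu, log_sigma))

# A common observation (code near the prior) earns a small intrinsic reward;
# a rare one (code far from the prior) earns a large one.
mu = np.array([[0.0, 0.0], [2.0, -2.0]])
log_sigma = np.zeros((2, 2))
r_int = kl_to_standard_normal(mu, log_sigma)  # r_int[1] > r_int[0]
```

Note how the single KL term plays both roles: during training it is the compression penalty inside the loss, and at rollout time it is read off per observation as the intrinsic reward r^i(x).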
16. Experiment: Treasure Hunt
§ The agent is depicted as a circle
§ An item (triangle) with reward is hidden somewhere
§ The item appears only when the agent is nearby
§ Once the agent obtains an item, the next item is spawned in another area (also hidden)
§ The traces (pentagons) of eaten items remain
§ Goal: get the maximum score!
[Figures: outline of the game and an example of gameplay]
17. Experiment: Treasure Hunt
Two types of onset conditions for distraction:
§ Movement condition: distraction appears when the agent stays in the same location
§ Location condition: distraction appears when the agent stays in the corners of the map
18. Experiment: Treasure Hunt
§ CB consistently outperforms the baselines (CB-noKL, RND, Dynamics, SimHash) under both distraction settings
[Figure: mean episodic reward over training steps (×1e6) for (a) the movement condition and (b) the location condition]
19. Experiment: Treasure Hunt
Illustration of the adaptive exploration strategy
[Figure: how the posteriors p_θ(Z|x_1), p_θ(Z|x_2) and the marginal q(Z) shift over the range of experiences, and how the gradient of the compression term (∇KL) trades off against the gradient of the value-prediction term (−∇log q_φ), shown with target values y and predictions (a) in early training steps and (b) after collecting rewards]
20. Experiment: Treasure Hunt
§ The compression loss term KL[p_θ(Z|x) || q(Z)] induces task-agnostic exploration in early stages
[Figure: Grad-CAM visualization of the adaptive exploration strategy for (a) Input, (b) CB-Early, (c) CB, (d) CB-noKL, (e) RND, (f) Dynamics, (g) SimHash]
21. Experiment: Treasure Hunt
§ The value-prediction loss term −log q_φ(y|z) induces task-specific exploration after collecting external rewards
[Figure: Grad-CAM visualization of the adaptive exploration strategy for (a) Input, (b) CB-Early, (c) CB, (d) CB-noKL, (e) RND, (f) Dynamics, (g) SimHash]
23. Contributions
• First work to discriminate information by task-relevance
→ Focus on task-relevant novelty and filter out distractive information
• Utilize the information bottleneck as a novelty measure
→ The KL-divergence term serves as the degree of compression
• Extensive experiments
→ A custom grid-world environment demonstrates situations where previous methods suffer; Atari environments show generality
• Psychologically plausible