Imagination-Augmented Agents
for Deep Reinforcement Learning
Theophane Weber, Sebastien Racaniere, David P. Reichert, Lars Buesing et al.
DeepMind
Presented by Choi Seong Jae
Introduction
• Reinforcement Learning is a method for solving Markov Decision Process (MDP) problems
  • S: a set of states
  • A: a set of actions
  • T(s'|s, a): the transition function, mapping a state-action pair to a distribution over successor states
  • R(s, a, s') -> r: the reinforcement function, mapping state-action-successor-state triples to a scalar return
• The optimal policy acts greedily with respect to the action-value function (a tabular sketch follows below):

  π*(s) = argmax_{a ∈ A} Q(s, a),   where   Q(s, a) = E[ Σ_{t=0}^∞ r_t | s, a ]
Introduction
• Model-Free RL
  • Learns a function (a neural network) that maps raw observations directly to values or actions
  • Requires large amounts of training data and cannot generalize to novel tasks in the same environment
  • DQN, A3C, policy gradient, etc.
• Model-Based RL
  • Assumes the transition matrix T, the reward function R, and the state-action spaces S, A are known
  • Becomes computationally intractable when S and A grow large (see the value-iteration sketch below)
  • Can address the weaknesses of Model-Free RL
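To make the model-based assumption concrete, here is a minimal value-iteration sketch (not from the slides) that plans directly with a known T and R; every sweep touches all (s, a, s') triples, which is why this approach stops scaling once S and A become large.

```python
import numpy as np

# Tiny made-up MDP with a known transition matrix T[s, a, s'] and
# reward function R[s, a, s'] (the model-based assumption).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # rows over s' sum to 1
R = rng.normal(size=(n_states, n_actions, n_states))

V = np.zeros(n_states)
for _ in range(1000):                       # value-iteration sweeps
    # Q(s, a) = sum_{s'} T(s'|s, a) * (R(s, a, s') + gamma * V(s'))
    Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

print("greedy actions per state:", Q.argmax(axis=1))
```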
Overview: I2A
• Adds Model-Based concepts to a Model-Free model
• Simulates future situations in advance and applies the information obtained from those simulations to the present in order to select an appropriate action
• https://www.youtube.com/watch?v=iUowxivGfv0
Overview: I2A Architecture
I2A: Environment Model
• Built from a ResNet and applied recurrently
• Takes O_t (or O_t together with an action) as input and outputs an imagined trajectory T̂
• T̂ contains the predicted next observations O_{t+i} and next rewards r_{t+i}
• The environment model is pre-trained on data generated by a standard model-free agent
• The predicted O_{t+i} and r_{t+i} are not perfect, but are assumed to carry information beyond the predicted values themselves (a rollout sketch follows below)
I2A: Rollout Encoder
โ€ข ๊ฐ action ๋ณ„๋กœ Rollout Encoder๊ฐ€ ์กด์žฌ
โ€ข Rollout Encoder์˜ ๊ฐ Encoder๋Š” LSTM cell
๋กœ ๊ตฌ์„ฑ
โ€ข Predicted ๐‘‚๐‘ก+๐‘–์™€ ๐‘Ÿ๐‘ก+๐‘–์ด ์™„๋ฒฝํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ
Encoder๋ฅผ ํ†ตํ•ด ์ถ”๊ฐ€์ ์ธ ์ •๋ณด๋ฅผ ์ถ”์ถœ
โ€ข Aggregator์—์„œ ๊ฐ Rollout Encoder์—์„œ ๋‚˜
์˜จ Encoded values๋ฅผ ๋‹จ์ˆœ concatenate
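A minimal PyTorch-style sketch of how the rollout encoders and the aggregator could fit together with the model-free path; layer sizes and the feature handling are simplified stand-ins, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RolloutEncoder(nn.Module):
    """Encodes one imagined rollout (one per action) into a fixed-size vector with an LSTM."""
    def __init__(self, feat_dim, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 1, hidden, batch_first=True)  # +1 for the predicted reward

    def forward(self, obs_feats, rewards):
        # obs_feats: (batch, depth, feat_dim), rewards: (batch, depth)
        x = torch.cat([obs_feats, rewards.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.lstm(x)              # final hidden state summarizes the rollout
        return h[-1]                          # (batch, hidden)

class I2AHead(nn.Module):
    """Aggregates per-action rollout codes by concatenation and combines them with model-free features."""
    def __init__(self, n_actions, hidden=512, mf_dim=512):
        super().__init__()
        self.fc = nn.Linear(n_actions * hidden + mf_dim, 512)
        self.policy = nn.Linear(512, n_actions)
        self.value = nn.Linear(512, 1)

    def forward(self, rollout_codes, model_free_feats):
        c_ia = torch.cat(rollout_codes, dim=-1)            # aggregator: simple concatenation
        h = torch.relu(self.fc(torch.cat([c_ia, model_free_feats], dim=-1)))
        return self.policy(h), self.value(h)
```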
Experiments
• I2A clearly outperforms the other methods
• The copy-model I2A performs similarly to the standard agent
• The no-reward I2A reaches I2A-level performance only when trained for about 3e9 steps
  • Predicted rewards can help, but predicted observations alone are already sufficiently informative
• Beyond 5 rollout steps there is no further performance improvement
Experiments
• With a noisy environment model, I2A shows no drop in performance
  • With the poor model, however, no performance difference can be seen between 3 and 5 rollout steps
  • This indicates that the rollout encoder learns to ignore the useless information produced by the environment model
• For the rollout-encoder-free agent, the accuracy of the environment model has a large impact on performance
  • With an accurate environment model it performs similarly to the standard agent
Experiments
• MCTS, as used in AlphaGo, reaches up to 95% performance (given a perfect model)
• However, to reach comparable performance its computation cost is roughly 18x higher
• The agent was trained with 4 boxes and then evaluated with more boxes, yet it still performed similarly to the 4-box standard agent
Conclusion
• Instead of choosing the currently useful action from past data alone, the agent infers future situations, takes them in as additional information, and improves its performance
• Adds Model-Based RL concepts to Model-Free RL
• When merged with an existing model-based planning method, Monte Carlo Tree Search (MCTS), it attains the same performance at a much lower computation cost
• Open questions
  • In the Sokoban environment used in the experiments, the state following a single action always carries important information; in a real-time, frame-by-frame environment the next frame may be largely uninformative. How would I2A perform in such environments?
  • This paper extracts information from predicted observations and rewards produced by a pre-trained environment model; in an environment that can actually be simulated rather than predicted, would performance be even better?
Appendix
• Standard model-free baseline agent
  • For Sokoban: a 3-layer CNN with kernel sizes 8x8, 4x4, 3x3, strides 4, 2, 1, and 32, 64, 64 output channels; the following FC layer has 512 units (a sketch follows below)
• The rollout encoder LSTM has 512 hidden units (for Sokoban), and all rollouts are concatenated into a single vector c_ia of length 2560 (one rollout encoder per action)
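A minimal PyTorch sketch of the Sokoban baseline described above; the 80x80 input resolution is an assumption made only so the example runs, and the 5 actions are inferred from the 2560 = 5 x 512 concatenated rollout vector:

```python
import torch
import torch.nn as nn

class SokobanBaseline(nn.Module):
    """3-layer CNN (8x8/4, 4x4/2, 3x3/1 with 32, 64, 64 channels) followed by a 512-unit FC layer."""
    def __init__(self, in_channels=3, n_actions=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(512), nn.ReLU())
        self.policy = nn.Linear(512, n_actions)
        self.value = nn.Linear(512, 1)

    def forward(self, obs):
        h = self.fc(self.conv(obs))
        return self.policy(h), self.value(h)

# Example with an assumed 80x80 RGB Sokoban frame (resolution is a guess, not from the slides)
logits, value = SokobanBaseline()(torch.zeros(1, 3, 80, 80))
```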
Appendix
• Sokoban environment
  • Every time step, a penalty of -0.1 is applied to the agent
  • Whenever the agent pushes a box onto a target, it receives a reward of +1
  • Whenever the agent pushes a box off a target, it receives a penalty of -1
  • Finishing the level gives the agent a reward of +10, and the level terminates (the reward scheme is summarized below)
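For reference, the reward scheme above fits in a few lines; the boolean step outcomes are hypothetical names, not an actual Sokoban API:

```python
def sokoban_reward(pushed_box_on_target, pushed_box_off_target, level_finished):
    """Per-step reward for the Sokoban environment as described above."""
    reward = -0.1                      # time penalty applied every step
    if pushed_box_on_target:
        reward += 1.0
    if pushed_box_off_target:
        reward -= 1.0
    if level_finished:
        reward += 10.0                 # the episode also terminates here
    return reward
```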
