1. DQN with Differentiable Memory Architectures
Okada Shintarou, SP team
(Mentor: Fujita, Kusumoto)
2. What I Did in This Internship
- Implemented Chainer versions of DRQN, MQN, RMQN, and FRMQN (the originals are
implemented in Torch)
- DRQN requires a different training mechanism from DQN
- MQN, RMQN, FRMQN have a Key-Value store memory
- Implemented RogueGym
- A 3D first-person (FPS-style) RL platform
- Scene images are available without OpenGL
- OpenAI Gym-like interface
- High customizability
3. Background and Problem
- DQN has been shown to successfully learn to play many Atari 2600 games (e.g.
Pong).
- However, DQN is not good at some games where
- agents cannot observe the whole state of the environment and
- have to keep some memories to complete missions (e.g., I-Maze, where agents
have to look at an indicator tile and then go to the correct, remote goal tile).
(Figure: a fully observable environment vs. a partially observable one)
4. Deep Q-Networks = Q-Learning + DNNs
- Q(s,a) is the action-value ("quality") function.
- In general, Q(s,a) is approximated by a function approximator because of the
combinatorial explosion of states s and actions a.
- In DQN, Q(s,a) is approximated by DNNs (a minimal sketch follows below).
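Below is a minimal Chainer sketch, not the internship implementation, of how a DNN stands in for Q(s,a) and is regressed onto the one-step TD target y = r + gamma * max_a' Q_target(s', a'). The class and function names, the plain fully connected layers, and the layer sizes are illustrative assumptions.

```python
# Minimal DQN sketch (illustrative, not the internship code).
import chainer
import chainer.functions as F
import chainer.links as L
import numpy as np

class QNetwork(chainer.Chain):
    def __init__(self, n_obs, n_actions):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(n_obs, 64)        # hidden size 64 is an arbitrary choice
            self.l2 = L.Linear(64, n_actions)

    def __call__(self, obs):
        return self.l2(F.relu(self.l1(obs)))     # Q(s, a) for every action a

def dqn_loss(q_net, target_net, obs, act, rew, next_obs, done, gamma=0.99):
    q = F.select_item(q_net(obs), act)                   # Q(s, a) of the action taken
    next_q = F.max(target_net(next_obs), axis=1).array   # max_a' Q_target(s', a')
    target = (rew + gamma * (1.0 - done) * next_q).astype(np.float32)
    return F.mean_squared_error(q, target)               # minimize (Q(s, a) - y)^2
```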
5. Feedback Recurrent Memory Q-Network (FRMQN) [1]
How can past information be conveyed? (A sketch of the shared key-value memory read follows below.)
(a) DQN
- Feed forward only
- Past M frames as input
(b) DRQN
- LSTM
(c) MQN
- No LSTM
- Key-Value store memory
(d) RMQN
- LSTM
- Key-Value store memory
(e) FRMQN
- LSTM
- Key-Value store memory
- Feedback from past memory
[1] Oh, Junhyuk, et al. "Control of Memory, Active Perception, and Action in Minecraft." ICML (2016).
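As referenced above, the following is a minimal sketch of the key-value memory read shared by MQN, RMQN, and FRMQN: the controller state attends over stored keys and returns a weighted sum of the stored values. It is a simplification of the mechanism in [1]; the function name and shapes are assumptions.

```python
# Sketch of a content-based key-value memory read (simplified from [1]).
import numpy as np

def memory_read(h, keys, values):
    """h: (d,) query from the controller; keys, values: (M, d) memory slots."""
    scores = keys @ h                       # similarity of the query to each stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax attention over the M memory slots
    return weights @ values                 # read vector: weighted sum of stored values
```

In FRMQN, the read vector is additionally fed back into the controller at the next step, which is the "feedback from past memory" listed above.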
7. Project Malmo[2]
- The original paper employed a Minecraft-based environment
- So at first we tried “Project Malmo”, a Minecraft-based RL platform developed
by Microsoft
- But Malmo
- Lacks stability (which makes machine learning surprisingly difficult)
- Uses OpenGL (please tell me how to run Minecraft over SSH on Ubuntu 16.04 servers
with a TITAN X and no display)
- Is slow (about 4 sec of overhead per episode, over 30,000 episodes)
[2] Johnson, M., Hofmann, K., Hutton, T., and Bignell, D. "The Malmo Platform for Artificial Intelligence Experimentation." Proc. 25th International Joint Conference on Artificial Intelligence, Ed. Kambhampati, S., p. 4246. AAAI Press, Palo Alto, California, USA (2016). https://github.com/Microsoft/malmo
8. So I Developed RogueGym
- RogueGym is a rogue-like environment for reinforcement learning, inspired by
Project Malmo
- 3D scene images and the types of surrounding blocks are available as agents'
observations (a usage sketch follows below).
(Figure: the agent's observation alongside the world's state, viewed from the top)
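As mentioned above, RogueGym exposes an OpenAI Gym-like interface. The snippet below is only a toy stand-in illustrating the reset/step contract this implies; RogueGym's actual class, method, and observation names are not shown here, and everything below is an assumption for illustration.

```python
# Toy stand-in for a Gym-like environment loop (not the real RogueGym API).
import random

class ToyGymLikeEnv:
    """Illustrates the reset/step contract a Gym-like interface implies."""
    def reset(self):
        self.t = 0
        return {"image": None, "blocks": ["air"] * 8}    # assumed observation layout

    def step(self, action):
        self.t += 1
        obs = {"image": None, "blocks": ["air"] * 8}
        reward, done = -0.04, self.t >= 50               # 50-step limit, as in I-Maze
        return obs, reward, done, {}

env = ToyGymLikeEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, r, done, _ = env.step(random.choice(["forward", "backward", "left", "right"]))
    total += r
```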
9. I-Maze Environment
Environment:
- One long corridor, two goals and one indicator tile
- Agents need to reach the correct goal, indicated by a green or yellow tile
- Agents spawn facing a random direction
- 50-step limit
Actions: move forward / move backward / turn left / turn right
Rewards (sketched in code below):
- Every step: -0.04
- Reaching the blue tile: 1 if the indicator tile is green, -1 if it is yellow
- Reaching the red tile: -1 if the indicator tile is green, 1 if it is yellow
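A minimal sketch of the reward rule listed above; whether the terminal goal reward replaces or is added to the -0.04 step cost is an assumption here, and the function and argument names are illustrative.

```python
def i_maze_reward(reached_tile, indicator):
    r = -0.04                                        # living cost paid every step
    if reached_tile == "blue":
        r += 1.0 if indicator == "green" else -1.0   # blue is correct when the indicator is green
    elif reached_tile == "red":
        r += -1.0 if indicator == "green" else 1.0   # red is correct when the indicator is yellow
    return r
```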
10. Experiment 1: Block Input
Comparison of (DQN), DRQN, MQN, RMQN, FRMQN
Environment:
- I-Maze (vertical corridor length 5)
Observations: the types of blocks in front of the agent, expressed as one-hot vectors (encoding sketched below).
(Figure: corridor = 5, showing the agent's observation range)
Raw observation (block types around the agent):
{stone, air, stone,
stone, air, stone,
air, agent, air}
One-hot encoding of block types:
air → 100000, stone → 010000, green_tile → 001000,
yellow_tile → 000100, red_tile → 000010, blue_tile → 000001
Input (the agent's own cell is omitted):
{010000, 100000, 010000, 010000, 100000, 010000, 100000, 100000}
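A small sketch, assuming a hypothetical helper, of the block-to-one-hot encoding shown above; the BLOCKS ordering and the function name are illustrative, not the actual RogueGym code.

```python
import numpy as np

BLOCKS = ["air", "stone", "green_tile", "yellow_tile", "red_tile", "blue_tile"]

def encode(blocks):
    """Map block names to one-hot rows; the agent's own cell is skipped."""
    idx = [BLOCKS.index(b) for b in blocks if b != "agent"]
    return np.eye(len(BLOCKS), dtype=np.float32)[idx]

# encode(["stone", "air", "agent", "stone"]) -> rows 010000, 100000, 010000
```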
11. There Are Several Choices in How to Train Recurrent Models
- How to select training batches:
- randomly extracted frames from the whole set of episodes, or
- randomly extracted frames from randomly extracted episodes (we chose this)
- What to calculate the loss over:
- the full episode, or
- some randomly extracted successive frames
- Which Q values to include when calculating the loss:
- those of each frame, or
- only that of the last frame (the earlier frames are ignored)
We chose "full episode" and "each frame" (sketched below).
(But the original implementation, published on 9/21, seems to use "randomly
extracted frames" and "only the last frame"... this may be a point where our
implementation differs from the original.)
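The snippet below sketches the choice described above (sample a full episode and let every frame's Q value contribute to the loss) for a recurrent Q-model in Chainer; the reset_state() method, the data layout, and the function names are assumptions, not the actual training code.

```python
import random
import numpy as np
import chainer.functions as F

def recurrent_q_loss(model, target_model, replay_episodes, gamma=0.99):
    episode = random.choice(replay_episodes)   # "full episode" instead of a frame window
    model.reset_state()                        # assumed: the recurrent model can reset its state
    target_model.reset_state()
    losses = []
    for obs, act, rew, next_obs, done in episode:
        q = model(obs)[:, act]                              # Q(s_t, a_t) of this frame
        next_q = F.max(target_model(next_obs), axis=1).array
        target = (rew + gamma * (1.0 - done) * next_q).astype(np.float32)
        losses.append(F.mean_squared_error(q, target))      # every frame contributes
    return sum(losses) / len(losses)
```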
12. Experiment 1: Results
(Plots: total reward vs. episode for each model)
- DQN is trained with randomly extracted batches of 12 frames.
- The other models are trained with randomly extracted full episodes.
13. Generalization Performance
(Plot: total reward, averaged over 100 runs, vs. vertical corridor length)
- The memory size limit is changed from 11 to 49.
- FRMQN does not lose performance on long vertical corridors.
- DRQN also holds up fairly well.
14. Conclusion
- FRMQN has high generalization performance
- The size of the introduced differentiable key-value store memory module can be
changed after training
- DRQN is not so bad either
- Recurrent networks are useful for partially observable environments
- Being able to run iterations quickly is important
15. WIP: Experiment 2: Scene Image Input
Comparison of DQN, DRQN, MQN, RMQN, FRMQN
Environment: I-Maze, with the vertical corridor length randomly chosen from 5, 7,
or 9 for every episode.
Observations: the scene images that agents see
Training details: random frames, only the last frame's Q value
(Figure: example agent observations for corridor = 5, 7, and 9)