Deep parking
1. Deep parking:
an implementation of automatic parking
with deep reinforcement learning
Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineering internship at Preferred Networks
Mentors: Abe-san, Fujita-san
2. About me
Shintaro Shiba
• Graduate student at the University of
Tokyo
– Major in neuroscience and animal behavior
• Part-time engineer (internship) at
Preferred Networks, Inc.
– Blog post: https://research.preferred.jp/2017/03/deep-parking/
3. Contents
• Original Idea
• Background: DQN and Double-DQN
• Task definition
– Environment: car simulator
– Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
• Discussion
• Summary
5. Original Idea: DQN for parking
https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/
Succeeded in driving smoothly with DQN
Input: 32 virtual sensors, 3 previous actions + current speed and steering
Output: 9 actions
Is it possible for a car agent to learn to park itself, with camera images as input?
9. Reinforcement learning in this project
Environment: car simulator
Agent: different sensors + a different neural network per agent
[Diagram: agent-environment loop; the agent sends an action, the environment returns the state (sensor input) and a reward]
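The interaction is the standard RL loop: the agent observes the state (sensor input), picks one of the actions, and receives a reward from the car simulator. A minimal sketch of that loop, with `env` and `agent` as hypothetical stand-ins (Gym-style names, not the original project's code):

```python
def run_episode(env, agent, max_steps=500):
    """One episode of the agent-environment loop."""
    state = env.reset()                        # state = sensor input
    total_reward = 0.0
    for _ in range(max_steps):                 # "time up" after max_steps actions
        action = agent.act(state)              # epsilon-greedy over the 9 actions
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:                               # goal reached or field out
            break
    return total_reward
```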
10. Environment:
Car simulator
Forces acting on the car:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force
F = F_traction + F_aero + F_rr + F_c + F_brake + F_cf
(F_c: centrifugal force, F_cf: cornering force)
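A rough sketch of how such a force sum could drive the simulator's longitudinal dynamics; the constants and the drag/rolling-resistance forms are common textbook choices, not values from this project, and the lateral terms (centrifugal, cornering) are omitted for brevity:

```python
import math

def total_longitudinal_force(v, throttle, braking,
                             engine_force=8000.0,   # assumed constants,
                             c_drag=0.4257,         # not from the slides
                             c_rr=12.8,
                             brake_force=10000.0):
    """Sum the longitudinal forces on the car at speed v (m/s)."""
    f_traction = engine_force * throttle        # F_traction
    f_aero = -c_drag * v * abs(v)               # F_aero: quadratic air resistance
    f_rr = -c_rr * v                            # F_rr: rolling resistance
    f_brake = (-brake_force * braking * math.copysign(1.0, v)) if v else 0.0
    return f_traction + f_aero + f_rr + f_brake

# The acceleration then follows from Newton's second law: a = F / mass.
```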
11. Common specifications:
state, action, reward
Input (States)
– Features specific to each agent + car speed, car steering
Output (Actions)
– 9: accelerate, decelerate, steer right, steer left, throw (do
nothing), accelerate + steer right, accelerate + steer left,
decelerate + steer right, decelerate + steer left
Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)
Goal
– Car inside the goal region, with no other conditions such as car direction
Termination
– Time up: after 500 actions (changed to 450 afterward)
– Field out: the car leaves the field
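The original reward scheme, written out as a function; a sketch assuming `distance_to_goal` is provided by the simulator in units that keep the shaping term small:

```python
def reward(in_goal, out_of_field, distance_to_goal):
    """Original reward: +1 at the goal, -1 for leaving the field,
    otherwise a small distance-shaped term (changed afterward)."""
    if in_goal:
        return 1.0
    if out_of_field:
        return -1.0
    return 0.01 - 0.01 * distance_to_goal
```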
12. Common specifications:
hyperparameters
Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95,
eps=0.01
– changed afterward: lr=0.00015, alpha=0.95,
momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly decreased from 1.0 at first to 0.1 at the end
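RMSpropGraves is the optimizer class in Chainer (Preferred Networks' framework), so the setup presumably looked roughly like this; the Q-network itself and the exact epsilon decay horizon are placeholders:

```python
from chainer import optimizers

GAMMA = 0.97
MAX_EPISODES = 50000

# Hyperparameters from this slide (momentum was later changed to 0)
optimizer = optimizers.RMSpropGraves(lr=0.00015, alpha=0.95,
                                     momentum=0.95, eps=0.01)
# optimizer.setup(q_function)  # q_function: the DQN model, not defined here

def epsilon(step, final_step=10**6):
    """Exploration rate, linearly decreased from 1.0 to a final 0.1.
    The decay horizon final_step is an assumption, not from the slides."""
    frac = min(step / final_step, 1.0)
    return 1.0 - 0.9 * frac
```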
21. Bird’s-eye view agent
Result after 18k episodes: ?
But we had already spent about six months on this agent, so we moved on to the next one…
22. Subjective view agent
Input features
– One subjective-view image per camera, taken from the car
– Number of cameras: three or four
– FoV = 120 deg
[Figure: example input images for the four-camera agent: front (+0 deg), right (+90 deg), back (+180 deg), left (+270 deg)]
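A sketch of how the per-camera frames might be stacked into a single network input; the 80x80 resolution comes from slide 25, the yaw offsets and FoV from this slide, and `render_camera` is a hypothetical simulator hook:

```python
import numpy as np

FOV_DEG = 120  # field of view per camera

def make_observation(render_camera, yaws=(0, 90, 180, 270), size=80):
    """Render one (size, size, 3) uint8 image per camera yaw and stack
    them channel-wise into a (3 * n_cameras, size, size) float array.
    The three-camera variant would pass yaws=(0, 120, -120)."""
    frames = [render_camera(yaw, FOV_DEG, size) for yaw in yaws]
    obs = np.concatenate([f.transpose(2, 0, 1) for f in frames], axis=0)
    return obs.astype(np.float32) / 255.0
```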
25. Subjective view agent
Problem
– Calculation time (GeForce GTX TITAN X)
• At first: 3 [min/ep] x 50k [ep] = 100 days
• After review by Abe-san: 1.6 [min/ep] x 50k [ep] = 55 days
– Mostly due to copying and synchronization between GPU and CPU
– Learning was interrupted whenever the DNN output diverged
– (Fortunately) the agent "learned" the goal within ~10k episodes in some trials
– Memory usage
• In DQN we need to store 1M previous inputs
– 1M x (80 x 80 x 3 ch x 4 cameras)
• Solution: save images to disk and read them back each time
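A minimal sketch of that disk-backed replay memory: each observation is written to disk once and the in-RAM buffer keeps only file paths plus the small scalars, trading read latency for memory (layout and names are illustrative; next-state handling is omitted for brevity):

```python
import os
import numpy as np

class DiskReplayBuffer:
    """Replay memory that stores observations on disk instead of in RAM."""

    def __init__(self, directory, capacity=10**6):
        self.directory, self.capacity = directory, capacity
        self.entries = []      # (path, action, reward, done) tuples
        self.cursor = 0
        os.makedirs(directory, exist_ok=True)

    def append(self, obs, action, reward, done):
        slot = self.cursor % self.capacity
        path = os.path.join(self.directory, "%d.npy" % slot)
        np.save(path, obs)                    # e.g. a (12, 80, 80) float array
        entry = (path, action, reward, done)
        if len(self.entries) < self.capacity:
            self.entries.append(entry)
        else:
            self.entries[slot] = entry        # overwrite the oldest slot
        self.cursor += 1

    def sample(self, batch_size, rng=np.random):
        idx = rng.randint(0, len(self.entries), size=batch_size)
        batch = [self.entries[i] for i in idx]
        obs = np.stack([np.load(p) for p, _, _, _ in batch])  # disk reads
        return obs, batch
```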
26. Subjective view agent
Result: three cameras, 6k episodes
[Figure: trajectory of the car agent (left) and the subjective-view inputs to the DQN (right); cameras at 0 deg, +120 deg, -120 deg]
27. Subjective view agent
Result: three cameras, 50k episodes
The policy looks like "just keep moving"? >> revisit the reward setting
The agent seems unable to reach the goal every time; only "easy" goals are achieved >> make the task difficulty variable (curriculum)
[Figure annotation: goals are frequent in this region]
29. Modify reward
Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise
New
– +1 - speed when the car is in the goal
• in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise
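The new scheme as code; a sketch assuming `speed` is normalized to [0, 1] so the goal bonus stays positive for slow arrivals:

```python
def reward_v2(in_goal, out_of_field, speed):
    """Modified reward: the goal bonus shrinks with speed (to make the
    car stop), and a small constant penalty replaces distance shaping."""
    if in_goal:
        return 1.0 - speed   # slower arrival -> larger bonus
    if out_of_field:
        return -1.0
    return -0.005            # per-step penalty
```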
30. Modify difficulty
Difficulty: initial car direction & position
– Constraint
• Car always starts near the middle of the field
• Car always starts facing toward the center: ±π/4
– Curriculum
• Car direction: ±(π/12)·n, where n = curriculum level
• Criterion: mean reward of 0.6 over 100 episodes
[Figure: goal region and initial headings for curriculum levels n = 1 and n = 2]
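A sketch of this curriculum: at level n the initial heading is drawn from ±(π/12)·n around the direction toward the field center, and the level is promoted once the mean reward over the last 100 episodes reaches 0.6 (the level cap is an assumption):

```python
import math
import random
from collections import deque

class DirectionCurriculum:
    """Widen the range of initial headings as the agent improves."""

    def __init__(self, threshold=0.6, window=100, max_level=12):
        self.level = 1
        self.threshold = threshold
        self.rewards = deque(maxlen=window)
        self.max_level = max_level            # assumed cap: pi/12 * 12 = pi

    def initial_heading(self, heading_to_center):
        half_range = (math.pi / 12) * self.level
        return heading_to_center + random.uniform(-half_range, half_range)

    def report(self, episode_reward):
        self.rewards.append(episode_reward)
        if (len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) >= self.threshold):
            self.level = min(self.level + 1, self.max_level)
            self.rewards.clear()              # restart the window at new level
```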
31. Subjective view agent: modifications

N cameras | Reward   | Difficulty | Learning result
3         | default  | default    | ~6k: o; 50k: x
3         | modified | default    | ~16k: o
3         | modified | constraint | ? (still learning)
3         | modified | curriculum | o (though only at curriculum level 1 yet)
4         | default  | default    | x
4         | modified | curriculum | △ (not bad, but not yet successful at 6k)
32. Subjective view agent:
modifications
Curriculum + Three cameras
At curriculum level 1: the promotion criterion needs to be modified
[Plots: mean reward (0.0 to 1.0) and reward sum (0 to 500) vs. episode (0 to 20k)]
33. Discussion
1. The initial settings included situations where the car cannot reach the goal
– e.g. starting headed toward the edge of the field
– This made learning unstable
2. Why was the coordinate agent still successful?
– Despite the fact that such situations could also occur there
34. Discussion
3. Comparison between three and four cameras
– Considering success rate and execution time, three cameras are better
– Why was the four-camera agent not successful? Does it just need several more trials?
4. DQN often diverged
– roughly one run in three, as a personal impression
• slightly more often with four cameras
– Shows the importance of the dataset for learning
• memory size, batch size
35. Discussion
5. Curriculum
– Ideally it would be better to quantify the "difficulty of the task"
• In this case, maybe it is roughly captured by the "bias of the distribution" of the selected actions?
accelerate
decelerate
throw (do nothing)
steer right
steer left
accelerate + steer right
accelerate + steer left
decelerate + steer right
decelerate + steer left
equal counts for each action >> go straight
biased distribution of selected actions >> go right/left (see the sketch below)
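One way to make that precise, as a sketch: the normalized entropy of the episode's action histogram is 1.0 when all nine actions are used equally (roughly "go straight") and approaches 0 when the selection is strongly biased (e.g. mostly steering one way):

```python
import math
from collections import Counter

N_ACTIONS = 9

def action_distribution_bias(actions):
    """Normalized entropy of the selected-action histogram, in [0, 1]."""
    counts = Counter(actions)
    total = len(actions)
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts.values())
    return entropy / math.log(N_ACTIONS)
```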
36. Summary
• A car agent can park itself using subjective camera views, though learning is not always stable
• Trade-off between reward design and
learning difficulty
– Simple reward: difficult to learn
• Try other algorithms like A3C
– Complex reward: difficult to design
• Try other settings for the distance_to_goal term