Deep parking
1. Deep parking:
an implementation of automatic parking
with deep reinforcement learning
Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineering internship at Preferred Networks
Mentors: Abe-san, Fujita-san
2. About me
Shintaro Shiba
• Graduate student at the University of
Tokyo
– Major in neuroscience and animal behavior
• Part-time engineer (internship) at
Preferred Networks, Inc.
– Blog post: https://research.preferred.jp/2017/03/deep-parking/
3. Contents
• Original Idea
• Background: DQN and Double-DQN
• Task definition
– Environment: car simulator
– Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
• Discussion
• Summary
5. Original Idea: DQN for parking
https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/
Succeeded in driving smoothly with DQN
Input: 32 virtual sensors, 3 previous actions + current speed and steering
Output: 9 actions
Is it possible for a car agent to learn to park itself, with camera images as input?
9. Reinforcement learning in this project
Environment: car simulator
Agent: different sensors + a different neural network per agent
[Diagram: agent-environment loop; the agent sends an action, the environment returns the state (sensor input) and a reward]
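The interaction is the standard RL loop: the agent observes the state (sensor input), picks one of the actions, and receives a reward from the car simulator. A minimal sketch of that loop, with `env` and `agent` as hypothetical stand-ins (Gym-style names, not the original project's code):

```python
def run_episode(env, agent, max_steps=500):
    """One episode of the agent-environment loop."""
    state = env.reset()                        # state = sensor input
    total_reward = 0.0
    for _ in range(max_steps):                 # "time up" after max_steps actions
        action = agent.act(state)              # epsilon-greedy over the 9 actions
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:                               # goal reached or field out
            break
    return total_reward
```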
10. Environment:
Car simulator
Forces acting on the car:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force
F = F_traction + F_aero + F_rr + F_c + F_brake + F_cf
(F_c: centrifugal force, F_cf: cornering force)
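A rough sketch of how such a force sum could drive the simulator's longitudinal dynamics; the constants and the drag/rolling-resistance forms are common textbook choices, not values from this project, and the lateral terms (centrifugal, cornering) are omitted for brevity:

```python
import math

def total_longitudinal_force(v, throttle, braking,
                             engine_force=8000.0,   # assumed constants,
                             c_drag=0.4257,         # not from the slides
                             c_rr=12.8,
                             brake_force=10000.0):
    """Sum the longitudinal forces on the car at speed v (m/s)."""
    f_traction = engine_force * throttle        # F_traction
    f_aero = -c_drag * v * abs(v)               # F_aero: quadratic air resistance
    f_rr = -c_rr * v                            # F_rr: rolling resistance
    f_brake = (-brake_force * braking * math.copysign(1.0, v)) if v else 0.0
    return f_traction + f_aero + f_rr + f_brake

# The acceleration then follows from Newton's second law: a = F / mass.
```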
11. Common specifications:
state, action, reward
Input (States)
– Features specific to each agent + car speed, car steering
Output (Actions)
– 9: accelerate, decelerate, steer right, steer left, throw (do
nothing), accelerate + steer right, accelerate + steer left,
decelerate + steer right, decelerate + steer left
Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)
Goal
– Car inside the goal region, with no other conditions such as car direction
Termination
– Time up: after 500 actions (changed to 450 afterward)
– Field out: the car leaves the field
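The original reward scheme, written out as a function; a sketch assuming `distance_to_goal` is provided by the simulator in units that keep the shaping term small:

```python
def reward(in_goal, out_of_field, distance_to_goal):
    """Original reward: +1 at the goal, -1 for leaving the field,
    otherwise a small distance-shaped term (changed afterward)."""
    if in_goal:
        return 1.0
    if out_of_field:
        return -1.0
    return 0.01 - 0.01 * distance_to_goal
```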
12. Common specifications:
hyperparameters
Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95,
eps=0.01
– changed afterward: lr=0.00015, alpha=0.95,
momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly decreased from 1.0 at first to 0.1 at the end
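RMSpropGraves is the optimizer class in Chainer (Preferred Networks' framework), so the setup presumably looked roughly like this; the Q-network itself and the exact epsilon decay horizon are placeholders:

```python
from chainer import optimizers

GAMMA = 0.97
MAX_EPISODES = 50000

# Hyperparameters from this slide (momentum was later changed to 0)
optimizer = optimizers.RMSpropGraves(lr=0.00015, alpha=0.95,
                                     momentum=0.95, eps=0.01)
# optimizer.setup(q_function)  # q_function: the DQN model, not defined here

def epsilon(step, final_step=10**6):
    """Exploration rate, linearly decreased from 1.0 to a final 0.1.
    The decay horizon final_step is an assumption, not from the slides."""
    frac = min(step / final_step, 1.0)
    return 1.0 - 0.9 * frac
```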
21. Bird’s-eye view agent
Result after 18k episodes: ?
But we had already spent about six months on this agent, so we moved on to the next one…
22. Subjective view agent
Input features
– One subjective-view image per camera, taken from the car
– Number of cameras: three or four
– FoV = 120 deg
[Figure: example input images for the four-camera agent: front (+0 deg), right (+90 deg), back (+180 deg), left (+270 deg)]
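A sketch of how the per-camera frames might be stacked into a single network input; the 80x80 resolution comes from slide 25, the yaw offsets and FoV from this slide, and `render_camera` is a hypothetical simulator hook:

```python
import numpy as np

FOV_DEG = 120  # field of view per camera

def make_observation(render_camera, yaws=(0, 90, 180, 270), size=80):
    """Render one (size, size, 3) uint8 image per camera yaw and stack
    them channel-wise into a (3 * n_cameras, size, size) float array.
    The three-camera variant would pass yaws=(0, 120, -120)."""
    frames = [render_camera(yaw, FOV_DEG, size) for yaw in yaws]
    obs = np.concatenate([f.transpose(2, 0, 1) for f in frames], axis=0)
    return obs.astype(np.float32) / 255.0
```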
25. Subjective view agent
Problem
– Calculation time (GeForce GTX TITAN X)
• At first: 3 [min/ep] x 50k [ep] = 100 days
• After review by Abe-san: 1.6 [min/ep] x 50k [ep] = 55 days
– Mostly due to copying and synchronization between GPU and CPU
– Learning was interrupted whenever the DNN output diverged
– (Fortunately) the agent "learned" the goal within ~10k episodes in some trials
– Memory usage
• In DQN we need to store 1M previous inputs
– 1M x (80 x 80 x 3 ch x 4 cameras)
• Solution: save images to disk and read them back each time
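A minimal sketch of that disk-backed replay memory: each observation is written to disk once and the in-RAM buffer keeps only file paths plus the small scalars, trading read latency for memory (layout and names are illustrative; next-state handling is omitted for brevity):

```python
import os
import numpy as np

class DiskReplayBuffer:
    """Replay memory that stores observations on disk instead of in RAM."""

    def __init__(self, directory, capacity=10**6):
        self.directory, self.capacity = directory, capacity
        self.entries = []      # (path, action, reward, done) tuples
        self.cursor = 0
        os.makedirs(directory, exist_ok=True)

    def append(self, obs, action, reward, done):
        slot = self.cursor % self.capacity
        path = os.path.join(self.directory, "%d.npy" % slot)
        np.save(path, obs)                    # e.g. a (12, 80, 80) float array
        entry = (path, action, reward, done)
        if len(self.entries) < self.capacity:
            self.entries.append(entry)
        else:
            self.entries[slot] = entry        # overwrite the oldest slot
        self.cursor += 1

    def sample(self, batch_size, rng=np.random):
        idx = rng.randint(0, len(self.entries), size=batch_size)
        batch = [self.entries[i] for i in idx]
        obs = np.stack([np.load(p) for p, _, _, _ in batch])  # disk reads
        return obs, batch
```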
26. Subjective view agent
Result: three cameras, 6k episodes
[Figure: trajectory of the car agent (left) and the subjective-view inputs to the DQN (right); cameras at 0 deg, +120 deg, -120 deg]
27. Subjective view agent
Result: three cameras, 50k episodes
The policy looks like "just keep moving"? >> revisit the reward setting
The agent seems unable to reach the goal every time; only "easy" goals are achieved >> make the task difficulty variable (curriculum)
[Figure annotation: goals are frequent in this region]
29. Modify reward
Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise
New
– +1 - speed when the car is in the goal
• in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise
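The new scheme as code; a sketch assuming `speed` is normalized to [0, 1] so the goal bonus stays positive for slow arrivals:

```python
def reward_v2(in_goal, out_of_field, speed):
    """Modified reward: the goal bonus shrinks with speed (to make the
    car stop), and a small constant penalty replaces distance shaping."""
    if in_goal:
        return 1.0 - speed   # slower arrival -> larger bonus
    if out_of_field:
        return -1.0
    return -0.005            # per-step penalty
```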
30. Modify difficulty
Difficulty: initial car direction & position
– Constraint
• Car always starts near the middle of the field
• Car always starts facing toward the center: ±π/4
– Curriculum
• Car direction: ±(π/12)·n, where n = curriculum level
• Criterion: mean reward of 0.6 over 100 episodes
[Figure: goal region and initial headings for curriculum levels n = 1 and n = 2]
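A sketch of this curriculum: at level n the initial heading is drawn from ±(π/12)·n around the direction toward the field center, and the level is promoted once the mean reward over the last 100 episodes reaches 0.6 (the level cap is an assumption):

```python
import math
import random
from collections import deque

class DirectionCurriculum:
    """Widen the range of initial headings as the agent improves."""

    def __init__(self, threshold=0.6, window=100, max_level=12):
        self.level = 1
        self.threshold = threshold
        self.rewards = deque(maxlen=window)
        self.max_level = max_level            # assumed cap: pi/12 * 12 = pi

    def initial_heading(self, heading_to_center):
        half_range = (math.pi / 12) * self.level
        return heading_to_center + random.uniform(-half_range, half_range)

    def report(self, episode_reward):
        self.rewards.append(episode_reward)
        if (len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) >= self.threshold):
            self.level = min(self.level + 1, self.max_level)
            self.rewards.clear()              # restart the window at new level
```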
31. Subjective view agent: modifications

N cameras | Reward   | Difficulty | Learning result
3         | default  | default    | ~6k: o; 50k: x
3         | modified | default    | ~16k: o
3         | modified | constraint | ? (still learning)
3         | modified | curriculum | o (though only at curriculum level 1 yet)
4         | default  | default    | x
4         | modified | curriculum | △ (not bad, but not yet successful at 6k)
32. Subjective view agent:
modifications
Curriculum + Three cameras
At curriculum level 1: the promotion criterion needs to be modified
[Plots: mean reward (0.0 to 1.0) and reward sum (0 to 500) vs. episode (0 to 20k)]
33. Discussion
1. The initial settings included situations where the car cannot reach the goal
– e.g. starting headed toward the edge of the field
– This made learning unstable
2. Why was the coordinate agent still successful?
– Despite the fact that such situations could also occur there
34. Discussion
3. Comparison between three and four cameras
– Considering success rate and execution time, three cameras are better
– Why was the four-camera agent not successful? Does it just need several more trials?
4. DQN often diverged
– roughly one run in three, as a personal impression
• slightly more often with four cameras
– Shows the importance of the dataset for learning
• memory size, batch size
35. Discussion
5. Curriculum
– Ideally it would be better to quantify the "difficulty of the task"
• In this case, maybe it is roughly captured by the "bias of the distribution" of the selected actions?
accelerate
decelerate
throw (do nothing)
steer right
steer left
accelerate + steer right
accelerate + steer left
decelerate + steer right
decelerate + steer left
equal counts for each action >> go straight
biased distribution of selected actions >> go right/left (see the sketch below)
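One way to make that precise, as a sketch: the normalized entropy of the episode's action histogram is 1.0 when all nine actions are used equally (roughly "go straight") and approaches 0 when the selection is strongly biased (e.g. mostly steering one way):

```python
import math
from collections import Counter

N_ACTIONS = 9

def action_distribution_bias(actions):
    """Normalized entropy of the selected-action histogram, in [0, 1]."""
    counts = Counter(actions)
    total = len(actions)
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts.values())
    return entropy / math.log(N_ACTIONS)
```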
36. Summary
• A car agent can park itself using subjective camera views, though learning is not always stable
• Trade-off between reward design and
learning difficulty
– Simple reward: difficult to learn
• Try other algorithms like A3C
– Complex reward: difficult to design
• Try other settings for the distance_to_goal term