Value Iteration Networks
A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel
Dept. of Electrical Engineering and Computer Sciences, UC Berkeley
Presenter: Keisuke Fujimoto
(Twitter @peisuke)

Purpose: Machine-learning-based robot path planning. The planner also works in
new environments that are not included in the training data set.
Strategy: Prediction of the optimal action. The method learns the reward of each
location and the actions that lead to high rewards.
Result: Planning on 28 x 28 grid maps; also applicable to continuous-control robots.
[Figure: inputs (map, pose, velocity, goal) and output (action)]
Presenter: Keisuke Fujimoto (ABEJA)

Background
Target: Autonomous robots
• Manipulation robots, navigation robots, transfer robots
Problem:
• Policies trained with reinforcement learning do not work outside of their
training environments.
[Figure: a manipulation robot with its target object, and a navigation robot with its goal]

Contribution
• Value Iteration Networks (VIN)
• Model-free training
• It does not require robot dynamics models.
• Generalized action prediction in new environments
• It can work outside of the training environments.
• Key approach
• Represents value-iteration planning as a CNN
• Predicts a reward map and computes the sum of
future rewards.

Overview of VIN
Input: State of the robot (pose, velocity), goal, map (left fig.)
Output: Action (direction, motor torque)
Strategy: Determination of the optimal action using predicted
rewards (right fig.).
[Figure: state (left), rewards (right)]

Reward propagation
• The action can be determined from the sum
of future rewards, which is computed by
reward propagation
[Figure: one-step propagation example. Panels: map; reward from map; left-move action; up-move action. Cell values shown include -10 and 1 in the reward map, and -9 and 0.9 after one propagation step.]
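The one-step propagation above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not stated on the slide: the discount factor 0.9 is inferred from the 1 -> 0.9 and -10 -> -9 values in the figure, and the names `GAMMA`, `ACTIONS` and `propagate` are hypothetical.

```python
GAMMA = 0.9  # assumed discount, inferred from 1 -> 0.9 and -10 -> -9 on the slide

# Hypothetical action set: offset (d_row, d_col) of the cell each move leads to.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def propagate(reward, action):
    """Return, for every cell, the discounted reward of the cell reached by `action`.

    Cells whose move would leave the grid get None (nothing to propagate).
    """
    d_row, d_col = ACTIONS[action]
    rows, cols = len(reward), len(reward[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + d_row, c + d_col
            if 0 <= nr < rows and 0 <= nc < cols:
                out[r][c] = GAMMA * reward[nr][nc]
    return out


# Toy reward map (not the one from the slide): goal = 1, one penalised cell = -10.
reward = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [-10.0, 0.0, 0.0],
]

left = propagate(reward, "left")  # the cell right of the goal receives 0.9 * 1
up = propagate(reward, "up")      # the cell below the goal receives 0.9 * 1
```

Each action yields its own propagated map, which is why the slide shows one panel per move.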

Determination of action
• At each cell reached by propagation, the optimal action is the
one with the maximum propagated reward (middle fig.)
• The robot's optimal action is then determined from the
propagated reward map (right fig.)
[Figure: left: the left-move and up-move maps; middle: their per-cell max; right: the map after reward propagation, with the current robot pose marked]
After reward propagation:
-10 -10 -9 -8 -10
-10 -10 -9 1 0.9
-9 -10 -10 0.9 0.8
-8 -9 -9 0.8 0.7
-7 -8 -8 0.7 0.6
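A minimal sketch of repeated propagation with a max over actions, followed by greedy action selection. The assumptions here are mine, not the slide's: the reward is collected in the current cell, the goal is treated as terminal, the discount is 0.9, and all function names are hypothetical. The toy map is not the one in the figure.

```python
GAMMA = 0.9  # assumed discount factor
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbours(value, r, c):
    """Yield (action, value of the cell that action leads to) for in-grid moves."""
    for name, (d_row, d_col) in ACTIONS.items():
        nr, nc = r + d_row, c + d_col
        if 0 <= nr < len(value) and 0 <= nc < len(value[0]):
            yield name, value[nr][nc]


def value_iteration(reward, goal, steps):
    """Propagate rewards `steps` times, taking the max over actions at each cell."""
    value = [row[:] for row in reward]
    for _ in range(steps):
        new = [row[:] for row in value]
        for r in range(len(reward)):
            for c in range(len(reward[0])):
                if (r, c) == goal:
                    continue  # terminal cell keeps its own reward
                best = max(v for _, v in neighbours(value, r, c))
                new[r][c] = reward[r][c] + GAMMA * best
        value = new
    return value


def greedy_action(value, r, c):
    """Pick the move that leads to the neighbouring cell with the highest value."""
    return max(neighbours(value, r, c), key=lambda pair: pair[1])[0]


reward = [
    [0.0, 0.0, 0.0],
    [0.0, -10.0, 0.0],  # penalised cell in the middle
    [0.0, 0.0, 1.0],    # goal in the bottom-right corner
]
values = value_iteration(reward, goal=(2, 2), steps=10)
action = greedy_action(values, 2, 0)  # robot in the bottom-left corner
```

With this toy map the values decay by a factor of 0.9 per step away from the goal, so following the greedy action from any free cell walks toward the goal.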

Value Iteration Module
• Reward propagation with a convolutional neural network
• Input is the reward map; output is the map of summed future rewards
• Q is the hidden (per-action) reward map; V is the map of summed future rewards
[Figure: reward map -> convolution -> Q -> max -> V (output)]
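The convolution-plus-max structure can be illustrated with hand-set kernels. This is a sketch under stated assumptions, not the authors' implementation: in a VIN the kernels are learned, whereas here each of four Q channels uses a fixed 3x3 kernel that copies the reward and adds the discounted value of one neighbouring cell. Zero padding, the discount 0.9, and the names `conv3x3`, `one_hot_kernel` and `vi_module` are all assumptions.

```python
GAMMA = 0.9  # assumed discount factor
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def conv3x3(image, kernel):
    """3x3 cross-correlation with zero padding; output has the input's shape."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += kernel[i][j] * image[rr][cc]
            out[r][c] = total
    return out


def one_hot_kernel(d_row, d_col, weight):
    """3x3 kernel that is zero except for `weight` at offset (d_row, d_col)."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[d_row + 1][d_col + 1] = weight
    return kernel


# One (reward kernel, value kernel) pair per Q channel; hand-set here, learned in a VIN.
KERNELS = [(one_hot_kernel(0, 0, 1.0), one_hot_kernel(dr, dc, GAMMA)) for dr, dc in MOVES]


def vi_module(reward, iterations):
    """Repeat: Q = conv(reward, value), then V = max over the Q channels."""
    rows, cols = len(reward), len(reward[0])
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        q_maps = []
        for w_reward, w_value in KERNELS:
            q_r = conv3x3(reward, w_reward)
            q_v = conv3x3(value, w_value)
            q_maps.append([[q_r[r][c] + q_v[r][c] for c in range(cols)]
                           for r in range(rows)])
        value = [[max(q[r][c] for q in q_maps) for c in range(cols)]
                 for r in range(rows)]
    return value


corridor = [[0.0, 0.0, 1.0]]          # 1x3 reward map with the goal on the right
value_map = vi_module(corridor, 2)    # two recurrences of the module
```

Because every step is a convolution followed by a channel-wise max, the whole recurrence is differentiable and can be trained end to end.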

Deep Architecture of Value Iteration Networks
• Input is the map and the state; f_R predicts the reward map
• The attention module crops the value map around the robot position
• ψ outputs the optimal action

Attention function
• The attention module crops a subset of the values around
the current robot pose.
• The optimal action depends only on the values near the current robot pose.
• Thanks to this attention module, predicting the optimal
action becomes easy.
Value map:
-10 -10 -9 -8 -10
-10 -10 -9 1 0.9
-9 -10 -10 0.9 0.8
-8 -9 -9 0.8 0.7
-7 -8 -8 0.7 0.6
If the robot is at row 4, column 4 (value 0.8), the selected area is:
-10 0.9 0.8
-9 0.8 0.7
-8 0.7 0.6
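The crop can be sketched as a small helper. The function name `crop_around`, the `radius` parameter, and the zero fill for cells outside the map are assumptions made for illustration; the value map is the one from the slide.

```python
def crop_around(value, row, col, radius=1, fill=0.0):
    """Return the (2*radius+1)-square window of `value` centred on (row, col).

    Positions outside the map are filled with `fill` (an assumption).
    """
    patch = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < len(value) and 0 <= c < len(value[0])
            line.append(value[r][c] if inside else fill)
        patch.append(line)
    return patch


# Value map from the slide (0-indexed rows and columns).
value_map = [
    [-10, -10, -9, -8, -10],
    [-10, -10, -9, 1, 0.9],
    [-9, -10, -10, 0.9, 0.8],
    [-8, -9, -9, 0.8, 0.7],
    [-7, -8, -8, 0.7, 0.6],
]

# Robot at 0-indexed (3, 3), i.e. row 4, column 4 when counting from 1.
patch = crop_around(value_map, 3, 3)
```

The returned 3x3 window reproduces the selected area shown on the slide, and it is this small patch, not the full map, that the final layer uses to choose an action.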

Grid-World Domain
Environment:
Occupancy grid maps; test sizes range from 8x8 to 28x28
The number of recurrences (VI iterations) is 20 for the 28x28 maps
Training dataset: 5000 maps, 7 trajectories per map
Compared methods:
CNN-based Deep Q-Network; direct action prediction using an FCN
Network architecture:
[Figure: Map, Goal -> CNN -> Reward map -> VI module -> Attention (uses Current Position) -> FC layer -> Action. Labels: 3-layer net, 150 hidden nodes; 10 channels in Q-layer; 80 parameters]

Results of Grid-World Domain
[Figure panels: predicted path; reward; sum of future rewards]

Mars Rover Navigation
Environment:
• A rover navigating the surface of Mars.
• It predicts a path from the surface image alone, without obstacle
information.
• Success rate is 90.3%.
Red points mark sharp elevation changes. At prediction time, VIN
does not use the elevation information.

Continuous Control
Environment:
• Applied to a continuous control space.
• Grid size is 28x28.
• Input is position and velocity,
given as floating-point values.
• Output is a 2D continuous control
parameter.
[Figure: comparison of the final distance to the goal]
This result is from the authors' presentation.

WebNav Challenge
Environment :
• Navigate website links to find a query
• Features: average word embeddings
• Using an approximate graph for planning
Evaluation:
• Success rate within the top-4 predictions
• Test set 1: start from index page
• Test set 2: start from random page
Result:

Conclusion
Purpose:
• Machine-learning-based robot path planning.
Method:
• Learn the reward of each location and predict the action
using the propagated rewards.
Result:
• VIN policies learn an approximate planning
computation relevant for solving the task.
• This holds for grid-worlds, continuous control, and even
navigation of Wikipedia links.

Code:
https://github.com/peisuke/vin
This code is implemented in Chainer!
Twitter:
@peisuke
We are hiring !!
https://www.wantedly.com/companies/abeja
