2. References
■ Fast and Furious: Real Time End-to-End 3D
Detection, Tracking and Motion Forecasting
with a Single Convolutional Net
■ IntentNet: Learning to Predict Intention from
Raw Sensor Data
■ Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ Learning Lane Graph Representations for
Motion Forecasting
■ Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
■ PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
■ Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
■ End-to-end Interpretable Neural Motion
Planner
■ DSDNet: Deep Structured Self-Driving
Network
■ MP3: A Unified Model to Map, Perceive,
Predict and Plan
■ LookOut: Diverse Multi-Future Prediction and
Planning for Self-Driving
■ Deep Multi-Task Learning for Joint
Localization, Perception, and Prediction
3. Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
■ deep neural network that is able to jointly reason about 3D detection, tracking
and motion forecasting given data captured by a 3D sensor.
■ By jointly reasoning about these tasks, this holistic approach is more robust to
occlusion as well as sparse data at range.
■ This approach performs 3D convolutions across space and time over a bird’s
eye view representation of the 3D world, which is very efficient in terms of both
memory and computation.
■ By sharing computation, perform all tasks in as little as 30 ms.
4. Overview: FaF network takes multiple frames as input and performs detection,
tracking and motion forecasting.
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
5. Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
6. Motion forecasting
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
7. IntentNet: Learning to Predict Intention
from Raw Sensor Data
■ In order to plan a safe maneuver, self-driving vehicles need to understand the
intent of other traffic participants.
■ intent as a combination of discrete high level behaviors as well as continuous
trajectories describing future motion.
■ A one-stage detector and forecaster that exploits both 3D point clouds
produced by a LiDAR sensor as well as dynamic maps of the environment.
■ This multi-task model reduce reaction time in self-driving applications.
11. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The motion planners used in self-driving vehicles need to generate trajectories that are safe,
comfortable, and obey the traffic rules.
■ This is usually achieved by two modules: behavior planner, which handles high-level decisions and
produces a coarse trajectory, and trajectory planner that generates a smooth, feasible trajectory
for the duration of the planning horizon.
■ These planners, however, are typically developed separately, and changes in the behavior planner
might affect the trajectory planner in unexpected ways.
■ Furthermore, the final trajectory outputted by the trajectory planner might differ significantly from
the one generated by the behavior planner, as they do not share the same objective.
■ This paper proposes a jointly learnable behavior and trajectory planner, while unlike most existing
learnable motion planners that address either only behavior planning, or use an uninterpretable
neural network to represent the entire logic from sensors to driving commands.
■ This approach features an interpretable cost function on top of perception, prediction and vehicle
dynamics, and a joint learning algorithm that learns a shared cost function employed by the
behavior and trajectory components.
12. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
The learnable motion planner has discrete and continuous components,
minimizing the same cost function with a same set of learned cost weights.
13. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
A: Given a scenario, we generate a set of possible SDV (self-driving vehicles) behaviors. B:
Left and right lane boundaries and the driving path that are relevant to the intended
behavior are considered in the cost function. C: SDV geometry for spatiotemporal
overlapping cost are approximated using circles. D: The SDV yields to pedestrians through
stop lines on the driving paths.
14. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The objective of the planner is then to find a behavior and a trajectory that is safe, comfortable,
and progressing along the route.
■ Given sets of candidate behaviors and trajectories, the cost function is to choose the best (b, τ ).
■ The cost function consists of sub-costs c that focus on different aspects of the trajectories such
as safety, comfort, feasibility, mission completion, and traffic rules.
■ Obstacles: A safe trajectory for the SDV should not only be collision free, but also satisfy a safety-
distance to the surrounding obstacles, including both the static and dynamic objects such as
vehicles, pedestrians, cyclists, unknown objects, etc. The overlap cost coverlap is then 1 if a
trajectory violates the spatial occupancy of any obstacle in a given predicted trajectory, and is
averaged across all possible predicted trajectories weighted by their probabilities. The obstacle
cost cobstacle penalizes the squared distance of the violation of the safety-distance dsafe. This cost
is scaled by the speed of the SDV, making the distance violation more costly at higher speeds.
15. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Driving-path and lane boundary: The SDV is expected to adhere to the structure of the road,
i.e. , it should not go out of the lane boundary and should stay close to the center of the lane.
The driving-path cost cpath is the squared distance towards the driving path. The lane
boundary cost clane is the squared violation distance of a safety threshold.
■ Headway: SDV should keep a safe longitudinal distance that depends on the speed of the
SDV and the leading vehicle for lane following or lane change behavior. The headway cost is
computed as the violation of the safety distance after applying a comfortable constant
deceleration, assuming that the leading vehicle applies a hard brake.
■ Yield: When a pedestrian is predicted reaching the boundary of the SDV lane or crossing it,
they impose a stopping point at a safe longitudinal distance and penalize any trajectory that
violates it. The yield cost cyield penalizes the squared longitudinal violation distance weighted
by the pedestrian prediction probability.
16. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
Left: Headway cost penalizes unsafe distance to leading vehicles.
Right: for each sampled trajectory, a weight function determines how
relevant an obstacle is to the SDV in terms of its lateral offset.
17. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Route: A behavior is desirable if the goal lane is closer to the route than the current lane. So
they penalize the number of lane-changes that is required to converge to the route.
■ Cost-to-go: they compute the deceleration needed for slowing-down to possible upcoming
speed-limits and use the square of the violation of the comfortable deceleration as cost-to-go.
■ Speed limit, travel distance and dynamics: Using the speed-limit of a lane, which is available in
the map data, they introduce a cost that penalizes a trajectory if it goes above the eligible
speed. The speed limit cost cspeed is the squared violation in speed. In order to favor trajectories
that advance in the route, they use the travelled longitudinal distance as a reward. Since the
SDV is physically limited to certain ranges of acceleration, curvature, etc, they prune trajectories
that violate such constraints. Additionally, they introduce costs that penalize aggressive motions
to promote comfortable driving. Specifically, the dynamics cost cdyn consists of the squared
values of jerk and violation thereof, acceleration and violation thereof, lateral acceleration and
violation thereof, lateral jerk and violation thereof, curvature, twist, and wrench.
18. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The inference process contains two stages of
optimization.
– In the behavioral planning stage, it adopts a
coarse-level parameterization for trajectory
generation. The resulting trajectory is found
by selecting the one with the lowest cost.
– In the trajectory planning stage, it uses a
fine-level parameterization where they
model the trajectory as a function of vehicle
control variables. The trajectory is initialized
with the output of the behavior planner, and
optimized through a continuous optimization
solver.
19. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
Example trajectories in a nudging scenario Behavioral decisions include obstacle side
assignment and lane information, which are
sent through the behavioral trajectory
interface.
20. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ It uses a combination of max-margin objective and
imitation learning as the loss function
max-margin loss
imitation learning loss
gradient of the learning objective
21. Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Two real-world driving datasets: ManualDrive is a set of human driving recordings where the
drivers are instructed to drive smoothly and carefully respecting all traffic rules; TOR-4D is
composed of very challenging scenarios.
22. LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ LiRaNet, a end-to-end trajectory prediction method which utilizes radar sensor
information along with widely used lidar and high definition (HD) maps.
■ Automotive radar provides rich, complementary information, allowing for longer
range vehicle detection as well as instantaneous radial velocity measurements.
■ However, there are factors that make the fusion of lidar and radar information
challenging, such as the relatively low angular resolution of radar measurements,
their sparsity and the lack of exact time synchronization with lidar.
■ To overcome these challenges, propose an efficient spatio-temporal radar feature
extraction scheme which achieves state-of-the-art performance on multiple large-
scale datasets.
23. LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
An example scene from X17k in bird’s eye view where lidar points (light blue) and radar point velocities (orange) are
visualized with labels (white) for current, past and future frames. Vehicle A is a turning bus that has multiple radar points
across frames. By effectively combining them over space and time a full 2D velocity and turning rate can be recovered.
Vehicle B shows the high positional noise that inherently comes with radar. Vehicle C shows a case with sparse lidar
points where implicitly associating them across time can be challenging. However, radar points present around C can
add context for the model to detect and predict the trajectory.
24. LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
LiRaNet overview: The radar feature extraction network (A) extracts spatio-temporal features from raw radar points in
2 steps: (1) for each frame we create a graph between the BEV grid cells and radar points to learn spatial features of
each cell using a non-rigid convolution, (2) these spatial features are further fused temporally by stacking across
channel dimension and using an MLP to get a radar feature volume. This feature volume is then fused with feature
volumes from other sensors and fed to a joint perception-prediction network (B) which produces detections and their
future trajectories. An example prediction for a scene from X17k can be seen in (C).
25. LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ Input domain consists of radar points and output domain consists of BEV cells.
■ For each cell j, calculate the features hm
j for the sweep m as
■ where Am
j is the set of associated radar points, xm
i is the 2D coordinates of the associated
radar point, xm
j is the 2D coordinate of the BEV cell’s center, ⊕ denotes the concatenation
operation, fm
i is the feature vector for the radar point and gm() is an multi-layer perceptron
(MLP) with learnable weights shared across all the cells.
■ Calculate Am
j, using nearest neighbor algorithm with a distance threshold.
■ By using a threshold larger than the size of a cell, this method compensates for positional
errors in radar.
■ For each cell j, calculate the final spatio-temporal feature vector (hj ) by concatenating the
per sweep features hj,m and using an MLP to combine them.
26. LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
Radar
is
fused
with
LiDAR
and
Map
to
generate
multi-sensor
feature
volume
in
BEV
which
is
used
for
detection
and
trajectory
prediction.
Each
learnable
operator
is
denoted
as
(op,
kernel,
stride,
number
of
features).
Conv
denotes
convolution
and
Res
denotes
a
residual
block.
28. Learning Lane Graph Representations for
Motion Forecasting
■ A motion forecasting model that exploits a novel structured map representation as well as actor-
map interactions.
■ Instead of encoding vectorized maps as raster images, construct a lane graph from raw map data
to explicitly preserve the map structure.
■ To capture the complex topology and long range dependencies of the lane graph, propose
LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane
dilation.
■ To capture the complex interactions between actors and maps, exploit a fusion network consisting
of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor.
■ Powered by LaneGCN and actor-map interactions, the model is able to predict accurate and
realistic multi-modal trajectories.
■ This approach significantly outperforms the state-of-the-art on the large scale Argoverse motion
forecasting benchmark.
■ https://github.com/uber-research/LaneGCN
29. Learning Lane Graph Representations for
Motion Forecasting
It constructs a lane graph from raw map data and use LaneGCN to extract map features. In parallel,
ActorNet extracts actor features from observed past trajectories. Then it uses FusionNet to model the
Interactions between actors themselves and the map, and predict the future trajectories.
30. Learning Lane Graph Representations for
Motion Forecasting
The model is composed of four modules. (1) ActorNet receives the past actor trajectories as input, and uses 1D
convolution to extract actor node features. (2) MapNet constructs a lane graph from HD maps, and uses a LaneGCN to
exact lane node features. (3) FusionNet is a stack of 4 interaction blocks. The actor to lane block fuses real-time traffic
information from actor nodes to lane nodes. The lane to lane block propagates information over the lane graph and
updates lane features. The lane to actor block fuses updated map information from lane nodes to actor nodes. The actor to
actor block performs interactions among actors. It uses another LaneGCN for the lane to lane block, and spatial attention
layers for the other blocks. (4) The prediction header uses after-fusion actor features to produce multi-modal trajectories.
31. Learning Lane Graph Representations for
Motion Forecasting
LaneGCN is a stack of 4 multi-scale LaneConv
residual blocks, each of which consists of a LaneConv
(1, 2, 4, 8, 16, 32) and a linear layer with a residual
connection. All layers have 128 feature channels.
Left: The lane centerline of interest, its
predecessor, successor, left and right neighbor
are denoted with red, orange, blue, purple, and
green lines, respectively. Each centerline is given
as a sequence of BEV points (hollow circles).
Right: Derived lane graph with an example lane
node. The lane node of interest, its predecessor,
successor, left and right neighbor are denoted
with red, orange, blue, purple and green circles
respectively.
32. Learning Lane Graph Representations for
Motion Forecasting
LaneConv Operator:
parameterize the node feature as follows,
LaneConv operator as
Ai and Wi are the adjacency and the weight matrices
corresponding to the i-th connection type respectively.
MLP indicates a multi-layer perceptron and the
two subscripts refer to shape and location,
respectively.
k-dilation LaneConv operator as
Ak
pre is the k-th matrix power of Apre.
In regular grid graphs, a dilated convolution operator can effectively
capture the long range dependency by enlarging the receptive field.
33. Learning Lane Graph Representations for
Motion Forecasting
LaneGCN: LaneConv(k1, .., kC) this multi- scale layer.
• In the model, use spatial attention and LaneGCN to capture a complete set of actor-map interactions.
FusionNet:
• Build a stack of four fusion modules to capture all information between actors and lane nodes, i.e.,
actors to lanes (A2L), lanes to lanes (L2L), lanes to actors (L2A) and actors to actors (A2A).
• L2L module by LaneGCN, but other three modules by:
Prediction Header: The header has two branches, a regression branch to predict the trajectory of each
mode and a classification branch to predict the confidence score of each mode.
• regression branch
• classification branch
35. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
■ In this paper, aim to learn scene-consistent motion forecasts of complex urban trac directly
from sensor data.
■ In particular, propose to characterize the joint distribution over future trajectories via an
implicit latent variable model.
■ It models the scene as an interaction graph and employ powerful graph neural networks to
learn a distributed latent representation of the scene.
■ Coupled with a deterministic decoder, obtain trajectory samples that are consistent across
trac participants, achieving state-of-the-art results in motion forecasting and interaction
understanding.
■ This motion forecasts result in safer and more comfortable motion planning.
36. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Graphical models of trajectory distribution. Dashed arrows/circles denote that only some approaches within
the group use those components. Double circle in (c) denotes that it is a deterministic mapping of its inputs.
Actor Feature Extraction. Given LiDAR and maps, the backbone CNN detects the actors in the scene, and
individual feature vectors per actor are extracted via RRoI Align, followed by a CNN with spatial pooling.
37. Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Implicit Latent Variable Model encodes the scene into a latent space, from which it can efficiently sample
multiple future realizations in parallel, each with socially consistent trajectories.
40. PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
■ The problem of joint perception and motion forecasting in the context of self-driving
vehicles.
■ Towards this goal, PnPNet, an end-to-end model that takes as input sequential sensor
data, and outputs at each time step object tracks and their future trajectories.
■ The key component is a tracking module that generates object tracks online from
detections and exploits trajectory level features for motion forecasting.
■ Specifically, the object tracks get updated at each time step by solving both the data
association problem and the trajectory estimation problem.
■ Importantly, the whole model is end-to-end trainable and benefits from joint optimization
of all tasks.
■ Validate PnPNet on two large-scale driving datasets, improvements over the state-of-the-
art with better occlusion recovery and more accurate future prediction.
41. PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
Three paradigms for perception and prediction. Traditional approach (a) adopts the modular design that
decomposes the stack into subtasks and solves them with individual models. End-to-end method (b) uses a
joint model to solve detection and prediction simultaneously, but performs tracking as post-processing. As
a result, the full temporal history contained in tracks is not used by detection and prediction. This
approach (c) brings tracking into the loop so that all tasks benefit from rich temporal context.
42. PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
PnPNet for end-to-end perception and prediction. The model consists of three modules that perform 3D object
detection, discrete-continuous tracking, and motion forecasting sequentially. To extract trajectory level actor
representations used for tracking and prediction, also equip the model with two explicit memories: one for
global sensor feature maps, and one for past object trajectories. Both memories get updated at each time step
with up-to-date sensor features and tracking results.
43. PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
The trajectory level object representation. Given an object trajectory, we first extract its sensor observation
and motion features at each time step, and then apply an LSTM network to model the temporal dynamics
45. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
■ In this paper propose an end-to-end learnable network that performs joint perception,
prediction and motion planning for self-driving vehicles and produces interpretable
intermediate representations.
■ Unlike existing neural motion planners, this motion planning costs are consistent with
perception and prediction estimates.
■ This is achieved by a novel differentiable semantic occupancy representation that is explicitly
used as cost by the motion planning process.
■ This network is learned end-to-end from human demonstrations.
■ The experiments in a large-scale manual-driving dataset and closed-loop simulation show that
the proposed model significantly outperforms state-of-the-art planners in imitating the human
behaviors while producing much safer trajectories.
46. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
The overview of end-to-end learnable autonomy system that takes raw sensor data, an
HD map and a high level route as input and produces safe maneuvers for the self-
driving vehicle via our novel semantic interpretable intermediate representations.
47. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Semantic classes in occupancy forecasting. Colors match between drawing and hierarchy. Shadowed area
corresponds to SDV route. Black vehicle, pedestrian and bike icons represent the agents' true current location.
48. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Inference diagram of the perception and recurrent occupancy forecasting
model. || symbolizes concatenation along the feature dimension, ⊕ element-
wise sum and ∆ bilinear interpolation used to downscale the occupancy.
49. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Examples of the motion planner cost functions: (a) collision, (b) driving-path,
(c) lane boundary, (d) traffic light, (e) comfort, (f) route, (g) progress.
Cost related comfort, traffic rules and progress in the route:
Safe cost:
50. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Learn the model parameters by exploiting these two loss functions:
Semantic Occupancy Loss:
Planning Loss:
51. Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
52. End-to-end Interpretable Neural Motion
Planner
■ A neural motion planner (NMP) for learning to drive autonomously in complex urban
scenarios that include traffic-light handling, yielding, and interactions with multiple
road-users.
■ Towards this goal, design a holistic model that takes as input raw LIDAR data and a
HD map and produces interpretable intermediate representations in the form of 3D
detections and their future trajectories, as well as a cost volume defining the
goodness of each position that the self-driving car can take within the planning
horizon.
■ Then sample a set of diverse physically possible trajectories and choose the one
with the minimum learned cost.
■ Importantly, the cost volume is able to naturally capture multi-modality.
53. End-to-end interpretable neural motion planner (NMP). Backbone network takes LiDAR data and maps
as inputs, and outputs bounding boxes of other actors for future timesteps (perception), as well as a cost
volume for planning with T filters. Next, for each trajectory proposal from the sampler, its cost is indexed
from different filters of the cost volume and summed together. The trajectory with the minimal cost will be
our final planning.
54. NMP
■ An end-to-end learnable motion planner that generates accurate space-time trajectories over a
planning horizon of a few seconds.
■ Importantly, the model takes as input LiDAR point clouds and a HD map and produces
interpretable intermediate representations in the form of 3D detections and their future motion
forecasted over the planning horizon.
■ The final output representation is a space-time cost volume that represents the “goodness” of
each possible location that the SDV can take within the planning horizon.
■ The planner then scores a series of trajectory proposals using the learned cost volume and
chooses the one with the minimum cost.
■ Train the model end-to-end with a multi-task objective. The planning loss encourages the
minimum cost plan to be similar to the trajectory performed by human demonstrators.
■ Note: this loss is sparse as a GT trajectory only occupies small portion of the space.
■ Introduce another perception loss that encourages the intermediate representations to produce
accurate 3D detections and motion forecasting.
■ This ensures the interpretability of the intermediate representations and enables much faster
learning.
55. NMP
■ Given the input LiDAR sweeps and the HD map, compute the corresponding cost
volume c by feedforward convolutional operations.
■ The final trajectory can then be computed by minimizing.
■ Note, however, that this optimization is NP hard.
■ Thus rely on sampling to obtain a low cost trajectory. Towards this goal, sample a
wide variety of trajectories that can be executed by the SDV and produce as final
output the one with minimal cost according to learned cost volume.
■ To efficiently sample physically possible trajectories during inference.
■ Since the cost of a trajectory is computed by indexing from the cost volume, the
planner is fast enough for real-time inference.
56. Trajectory Representation. First sample a set of parameters of a Clothoid to determine the
shape of a trajectory. Then sample a velocity profile to determine how fast the Self-driving
vehicle (SDV) go along this trajectory. Combining these two, get a space time trajectory.
a Clothoid curve, also known as Euler spiral or
Cornu spiral, to represent the 2D path of the SDV
57. End-to-End Learning
The ultimate goal is to plan a safe trajectory while following the rules of traffic.
Let the model to understand where obstacles are and where they will be in the future in order
to avoid collisions.
Therefore, use a multi-task training with supervision from detection, motion forecasting as
well as human driven trajectories for the ego-car.
Note that no supervision for cost volume.
Thus adopt max-margin loss to push the network to learn to discriminate between good and
bad trajectories.
The overall loss function is then:
58. Cost Volume across Time. The planned trajectory in red and ground-truth in blue. Overlay lower cost
region for different timesteps in the same figure, using different colors (indicated by legend). Detection
and corresponding prediction results are in cyan.
59. DSDNet: Deep Structured Self-Driving
Network
■ Deep Structured self-Driving Network (DSDNet), which performs object detection,
motion prediction, and motion planning with a single neural network.
■ Towards this goal, develop a deep structured energy based model which considers
the interactions between actors and produces socially consistent multimodal future
predictions.
■ Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to
plan a safe maneuver by using a structured planning cost.
■ The sample-based formulation allows to overcome the difficulty in probabilistic
inference of continuous random variables
60. DSDNet overview: The model takes LiDAR and map as inputs, processes them with a CNN backbone,
and jointly performs object detection, multi-modal socially-consistent prediction, and safe planning
under uncertainty. Rainbow patterns mean highly likely actors' future positions predicted by model.
In order to plan a safe maneuver, a self-driving vehicle has to first understand its surroundings as well
as predict how the future might evolve. It should then plan its motion by considering all possibilities of
the future weighting them properly. This is not trivial as the future is very multi-modal and actors
interact with each other. Moreover, the inference procedure needs to be performed in a fraction of
a second in order to have practical value.
61. Backbone Feature Network and Object Detection
■ Let X be the LiDAR point clouds and the HD map given as input to system. Since LiDAR point
clouds can be very sparse and the actors‘ motion is an important cue for detection and
prediction, use the past 10 LiDAR sweeps (e.g., 1s of measurements) and voxelize them into a
3D tensor. utilize HD maps as they provide a strong prior about the scene.
■ then process this 3D tensor with a deep convolutional network backbone and compute a
backbone feature map. employ a single-shot detection header on this feature map to output
detection bounding boxes for the actors in the scene.
■ Apply two Conv2D layers separately on feature map, one for classifying if a location is
occupied by an actor, the other for regressing the position offset, size, orientation and speed
of each actor.
■ prediction and planning modules will then take these detections and the feature map as input
to produce both a distribution over the actors‘ behaviors and a safe planning maneuver.
62. Probabilistic Multimodal Social Prediction
■ As actors move on the ground, represent their possible future behavior using
a trajectory defined as a sequence of 2D waypoints on birds eye view (BEV)
sampled at T discrete timestamps.
■ Note that T is the same duration as our planning horizon, and compute the
motion prediction distribution and a motion plan each time a new sensor
measurement arises (i.e., every 100ms).
■ Modeling this joint distribution and performing efficient inference is
challenging, as each actor has a high-dimensional continuous action space.
■ propose to approximate this high-dimensional continuous space with a finite
number of samples, and construct a non-parametric distribution over the
sampled space.
■ Also use a combination of straight, circle, and clothoid curves.
63. Probabilistic Multimodal Social Prediction
joint distribution of all actors' future behaviors
decompose the energy E into two terms. The first term encodes the goodness of a future
trajectory (independent of other actors) while the second term explicitly encodes the fact that
pairs of actors should not collide in the future.
64. Details of the multimodal social prediction
module: For each actor, sample a set of
physically valid trajectories, and use a neural
network Etraj to assign energies (probabilities) to
them. To make different actors' behaviors
socially consistent, employ message passing
steps which explicitly model interactions and can
encode human prior knowledge (collision
avoidance). The final predicted socially-
consistent distribution is shown on top right.
Neural header
for evaluating
Etraj and Ctraj.
65. Probabilistic Multimodal Social Prediction
■ For safety, our motion planner needs to take all possible actor‘s future into
consideration. Therefore, motion forecasting needs to infer the probability of each
actor taking a particular future trajectory.
■ utilize sum-product message passing to estimate the marginal distribution per actor,
taking into account the effects of all other actors by marginalization.
■ The marginal reflects the uncertainty and multimodality in an actor's future behavior
and will be leveraged by planner.
■ Through this message passing procedure, actors communicate with each others their
future intentions and how probable those intentions are. The collision energy helps to
coordinate intentions from different actors such that the behaviors are compliant and
do not result in collision.
66. Safe Motion Planning under Uncertain Future
■ The motion planning module fulfills final goal, that is, navigating towards a
destination while avoiding collision and obeying traffic rules.
■ Towards this goal, build a cost function C, which assigns lower cost values to “good“
trajectory proposals and higher values to “bad” ones.
■ Planning is performed by finding the optimal trajectory with the minimum cost.
■ Planning Cost:
■ conduct exact minimization over P. Ctraj is a net based cost and evaluate all K
possible trajectories with a single batch forward pass. Ccoll is computed with a GPU
based collision checker. Efficiently evaluate for all K samples and select the
trajectory with minimum-cost as final planning result.
67. Learning
train the full model (backbone, detection, prediction and planning) jointly with a multi-class loss
planning loss: utilize a max-margin loss, using expert behavior as positive examples and
randomly sampled trajectories as negative ones.
detection loss: Ldetection is a sum of classification and regression loss. Use a cross-entropy
classification loss and assign an anchor‘s label based on its IoU with any actor. The
regression loss is a smooth L1 between model regression outputs and the ground-truth
targets. Those targets include position, size, orientation and velocity.
employ cross-entropy between discrete distribution and the true target as prediction loss. Once
sampled K trajectories per actor, this loss can be regarded as a standard classification loss
over K classes (one for each trajectory sample). The target class is set to be the closest
trajectory sample to the ground-truth future trajectory (in L2 distance).
68. Qualitative results on ATG4D. The prediction module can capture multi-modalities when vehicles
approach intersections, while being certain and accurate when vehicles drive along a single lane (top).
The model can produce smooth trajectories which follow the lane and are compliant with other actors
(bottom). Cyan boxes: detection. Cyan trajectory: prediction. Red box: ego-car. Red trajectory:
planning. Overlay the predicted marginal distribution for different timestamps with different colors and
only show high-probability regions.
69. MP3: A Unified Model to Map, Perceive,
Predict and Plan
■ Unfortunately, building HD maps has proven hard to scale due to their cost
as well as the requirements they impose in the localization system that has
to work everywhere with centimeter-level accuracy.
■ Being able to drive without an HD map would be very beneficial to scale self-
driving solutions as well as to increase the failure tolerance of existing ones
(e.g., if localization fails or the map is not up-to-date).
■ MP3, an end-to-end approach to mapless driving where the input is raw
sensor data and a high-level command (e.g., turn left at the intersection).
■ MP3 predicts intermediate representations in the form of an online map and
the current and future state of dynamic agents, and exploits them in a novel
neural motion planner to make interpretable decisions taking into account
uncertainty.
70. Left: a localization error makes the SDV follow a wrong route when using an HD map,
driving into traffic. Right: mapless driving can interpret the scene from sensors and
achieve a safe plan that follows a high-level command.
71. MP3 predicts probabilistic scene representations that are leveraged
in motion planning as interpretable cost functions.
73. The motion field warps the occupancy over time. Transparency denotes
probability. Color differences the predicted layers by the network and the
future occupancy. Particular case of unimodal motion (K = 1).
74. Backbone Network. The output features within one CNN block are fixed. All kernel strides and dilation are 1.
Mapping Network. The output features within one CNN block are fixed. All kernel strides are 1.
75. Perception and Prediction Network. It outputs the initial occupancy and current and future motion. The
output features within one CNN block are fixed. All kernel strides are 1.
Routing Network. The output features within one CNN block are fixed. All kernel strides are 1. The
architecture is the same for the 3 branches, but with different learnable parameters.
76. Qualitative results. The predicted scene representations and motion plan for different high-level actions.
77. LookOut: Diverse Multi-Future
Prediction and Planning for Self-Driving
■ LOOKOUT, autonomy system that perceives the environment, predicts a diverse set
of futures of how the scene might unroll and estimates the trajectory of the SDV by
optimizing a set of contingency plans over these future realizations.
■ In particular, learn a diverse joint distribution over multi-agent future trajectories in a
traffic scene that covers a wide range of future modes with high sample efficiency
while leveraging the expressive power of generative models.
■ Unlike previous work in diverse motion forecasting, the diversity objective explicitly
rewards sampling future scenarios that require distinct reactions from the self-
driving vehicle for improved safety.
■ This contingency planner then finds comfortable and non-conservative trajectories
that ensure safe reactions to a wide range of future scenarios.
78. The future is highly uncertain and multi-modal by showing 2 distinct futures at the scene-level. In such
scenario, LOOKOUT plans a short-term executable action that leads to 2 different contingent plans to stay
safe in both cases.
LOOKOUT inference. For the learnable components, the colors denote different training stages. The
backbone CNN, actor CNN and prediction decoder are first trained, the diverse sampler next, and lastly the
scenario scorer.
79. Joint Perception and Motion Forecasting
■ In order to extract features useful for both detection and motion forecasting,
employ a convolutional backbone network, which takes as input a history of
voxelized LiDAR sweeps and a raster HD map, both in bird’s eye view (BEV)
centered around the SDV.
■ Then perform multi-class object detection with a shallow convolutional
header to recognize the presence, BEV pose and dimensions of vehicles,
pedestrians and bicyclists, and apply rotated RoI align to extract small feature
crops from the scene context around each actor’s location.
■ An actor CNN with max-pooling reduces the feature map of each actor into a
feature vector.
■ It parameterizes the trajectory of each actor with a temporal series of the
actor centroid in 2-dimensional Euclidean space, where each trajectory is
predicted in the actor’s relative coordinate frame in Bird’s Eye View (BEV)
defined by its centroid and heading.
■ The latent variable model then characterizes the joint distribution over actors’
trajectories as follows:
80. Joint Perception and Motion Forecasting
■ adopt an implicit decoder, parameterized by a spatially-aware Graph Neural Network
(GNN). Since from observational data only obtain (X, Y) pairs, a posterior or encoder
function is introduced to approximate the true posterior distribution during training,
also parameterized by a GNN.
■ This encoder function helps this model learn a powerful decoder, since given only X
there could be many feasible Y due to the inherent multi-modality and uncertainty of
the future.
■ use binary cross-entropy with hard negative mining per class for the presence of an
actor, and Huber loss for the regression targets (i.e., pose and dimension).
■ use the CVAE framework for the latent variable model, which optimizes the evidence
lower bound (ELBO) of the log-likelihood.
81. Planning Centric Diverse Sampler
■ learn a diverse sampling function that maps the actor contexts coming from
sensor data around each actor into a compact set of scene latent samples
whose decoded trajectories achieve good coverage.
■ This sampler will then replace the Monte Carlo sampling from the prior
distribution during inference.
■ To sample a set of latents that are distinct enough from each other such that
they will be decoded into a set of diverse futures, use the reparameterization
trick to map a shared noise across K latent mappings
■ To handle the fact that the input can vary in size, parameterize M with a pair
of GNNs: one to generate the means and another to generate the
covariances.
■ Both GNNs assume a fully connected graph where each node is anchored to
an actor, and initialize the node states.
82. Planning Centric Diverse Sampler
■ perform message passing to aggregate info over the whole scene at each
node.
■ Finally, each node in the first GNN predicts via an MLP. Similarly, each node
in the second GNN predicts via another MLP.
■ Through sampling and decoding, obtain a set of K future trajectory
realizations of all actors in the scene Y.
■ The objective of this diverse sampler is to be able to generate a set of futures
Y that are diverse while recovering well the ground-truth observations.
■ energy function is composed of a few terms that promote the diversity while
preserving data reconstruction:
83. Scenario Probability Estimation
■ The diverse set of K future realizations Y provides the coverage needed for
safe motion planning. However, for accurate risk assessment need to
estimate the probability distribution over each future realization in the set.
■ augment the model to also output a score for all future realizations, via a
GNN that takes as input the actor features and all K sample future scenarios.
■ then easily recover a distribution over such scores by renormalization.
■ Thus, the probability of each sample is
■ Train the scoring function s to match the approximate categorical distribution
over future scenarios under the KL divergence.
■ define this approximate distribution as follows
84. Contingency Planner
■ The goal of the motion planning module is to generate safe, comfortable and
not overly conservative trajectories for the SDV to execute.
■ achieve this through Model Predictive Control, where a trajectory is planned
considering a finite horizon, and is executed until a new trajectory is
replanned upon availability of a new LiDAR sweep.
■ However, the above formulation does not exploit the fact that only one of the
predicted scenarios will happen in the future and is conversely optimizing for
a single trajectory that is ”good” in expectation.
■ Instead of finding a single motion plan for multiple futures, generate a single
common immediate action, followed by a set of future trajectories, one for
each future realization of the scene.
85. Contingency Planner
■ This contingency planning finds an immediate action that is safe with respect to all the
possible realizations in Y and comfortably bridges into a set of contingent trajectories, where
each is specifically planned for a single future realization.
■ Such decision-postponing avoids over-conservative behaviors while staying safe until more
information is obtained.
■ The safe motion planning is only possible if the set of predicted future scenarios is diverse,
and covers possible realizations, including low likelihood events.
■ The planner cost function is a linear combination of various carefully crafted sub-costs that
encode different aspects of driving including safety, comfort, traffic-rules and the route.
sampling strategy: (i) first a set of (spatial) paths are
generated, (ii) for each path, a set of initial velocity
profiles are sampled, creating the set of short-term
trajectories, (iii) conditioned on the end state of
these initial trajectories, another set of velocity
profiles are sampled for the rest of the planning
horizon assuming the SDV follows the same path.
86. Obtaining K latent samples from an implicit latent variable model (ILVM) implies
samplingK times independently from the prior. In contrast, our diverse sampler
exploits a GNN mapping to predict K latent samples from a single noise (in parallel).
Contingency planning paradigm. The cost-to-go of a short term ego-action is captured
by the ability to react to the K diverse predicted futures with the K-most suitable ego-
trajectories.
87. Diverse multi-future predictions and plans in closed-loop, zoomed in. Object detections and motion
forecasts are blue for vehicles and pink for pedestrians. The green bounding box is the SDV, its
immediate action (1s) is shown in black (starting from its rear axle), and its contingent trajectories
planned for each possible future scenario are shown in distinct colors. LiDAR points are not visualized.
88. Scenario 1 Scenario 2 Scenario 1 Scenario 2
Motion forecasting visualizations. blend 15 future scenarios with transparency (assume equal probability for
visualization purposes). Time is encoded in the rainbow color map ranging from red (0s) to pink (5s). This can
be seen as a sample-based characterization of the per-actor marginal distributions.
89. Deep Multi-Task Learning for Joint
Localization, Perception, and Prediction
■ Tremendous progress on many subtasks of autonomous driving including
perception, motion forecasting, and motion planning.
■ However, these systems often assume that the car is accurately localized against a
high-definition map.
■ In state-of-the-art autonomy stacks, localization error exists.
■ Based on observations, design a system LP2 that jointly performs perception,
prediction, and localization.
■ The architecture is able to reuse computation between the three tasks, and is thus
able to correct localization errors efficiently.
90. P2: Perception-Prediction
■ P2 models are tasked with perceiving actors and predicting their future trajectories to ensure that
motion planning has access to safety-critical information about the scene for the entire duration
of the planning horizon.
■ Implicit Latent Variable Model (ILVM) encodes the whole scene in a latent random variable and
uses a deterministic decoder to efficiently sample multiple scene-consistent trajectories for all the
actors in the scene.
■ Besides LiDAR, the ILVM backbone takes as input a multichannel image with semantic aspects of
the rasterized map, which the model is expected to use to improve detection and forecasting.
■ While the LiDAR scans are always processed in the vehicle frame, the scans and the map are
aligned using the pose of the car.
■ Thus, localization error results in a misalignment between the semantic map and the LiDAR scan.
91. Motion Planning (MP)
■ Motion Planning (MP): Given a map and a set of dynamic agents and their future
behaviours, the task of the motion planner is to provide a route that is safe,
comfortable, and physically realizable to the control module.
■ Path Lateral Time (PLT) motion planner is a method that samples physically
realizable trajectories, evaluates them, and selects the one with the minimal cost.
■ The PLT planner receives S =50 Monte Carlo samples from the joint distribution over
the trajectories of all actors from the P2 module.
■ It then samples a small set of trajectories given the map, high level routing and the
current state of the SDV.
■ The planned trajectory is then computed to be the one with the minimum expected
cost over the predicted futures, as defined by a cost function c that takes into
account safety and comfort.
■ In this case, bad localization gives the planner a wrong idea about the layout of the
static parts of the scene.
92. The effects of localization error on perception-prediction and motion planning. (Top) The effects of
perturbing the ego-pose on P2. SFDE is the mean displacement error across all samples at the 5s
mark, and mAP@0.7 is the mean average precision evaluated at an IOU of 0.7. (Bottom) The effects
of perturbing the ego-pose on planning. Collision rate is the percentage of examples for which the
planned path collides with another vehicle or pedestrian within the 5s simulation, and L2 human is
the distance between the planned path and the ground truth human-driven path at the 5s mark.
93. The architecture of the combined localization and perception-prediction (LP2) model.
94. LP2
■ A system receives LiDAR as input, which is then converted to a BEV voxelization with
the channels of the 2D input corresponding to the height dimension.
■ P2 contains four main submodules. (i) A lightweight network processes a rasterized
semantic map centred at the current vehicle pose. (ii) Another neural backbone
extracts features from a coarsely voxelized LiDAR sweep. These two features maps
are concatenatedand passed to (iii) a detector-predictor that encodes the scene into
a latent variable Z, and (iv) a gnn where each node represents a detected actor, and
which deterministically decodes samples into samples of the joint posterior
distribution over all actor trajectories.
■ Ground intensity localization with deep LiDAR embeddings: This localizer learns deep
functions that produce spatial embeddings of both the map and LiDAR sweep before
alignment.
■ Multi-Resolution Feature Sharing: upsampling a crop of the LiDAR feature map to
match the feature resolution in the localization, then add the feature maps together
using a weighted sum to produce the final localization embedding.