Jointly Mapping, Localization,
Perception Prediction and Planning
(from Raquel Urtasun’s papers)
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
References
■ Fast and Furious: Real Time End-to-End 3D
Detection, Tracking and Motion Forecasting
with a Single Convolutional Net
■ IntentNet: Learning to Predict Intention from
Raw Sensor Data
■ Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ Learning Lane Graph Representations for
Motion Forecasting
■ Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
■ PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
■ Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
■ End-to-end Interpretable Neural Motion
Planner
■ DSDNet: Deep Structured Self-Driving
Network
■ MP3: A Unified Model to Map, Perceive,
Predict and Plan
■ LookOut: Diverse Multi-Future Prediction and
Planning for Self-Driving
■ Deep Multi-Task Learning for Joint
Localization, Perception, and Prediction
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
■ A deep neural network that jointly reasons about 3D detection, tracking
and motion forecasting given data captured by a 3D sensor.
■ By jointly reasoning about these tasks, this holistic approach is more robust to
occlusion as well as sparse data at range.
■ This approach performs 3D convolutions across space and time over a bird’s
eye view representation of the 3D world, which is very efficient in terms of both
memory and computation.
■ By sharing computation, the network performs all tasks in as little as 30 ms (a minimal sketch of a shared space-time BEV backbone follows below).
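To make the idea of convolving across space and time over a bird's eye view grid concrete, here is a minimal PyTorch-style sketch. It is not the authors' architecture: the channel counts, grid size, and the choice of a single early-fusion 3D convolution followed by a shared 2D trunk are illustrative assumptions based on the description above.

```python
import torch
import torch.nn as nn

class SpaceTimeBEVBackbone(nn.Module):
    """Toy backbone: fuse a short history of BEV LiDAR frames with a 3D
    convolution over (time, x, y), then continue with 2D convolutions.
    Shapes and channel counts are illustrative, not the paper's."""

    def __init__(self, height_channels=29, n_frames=5):
        super().__init__()
        # Early temporal fusion: collapse the time axis with one 3D conv.
        self.temporal_fuse = nn.Conv3d(
            in_channels=height_channels, out_channels=64,
            kernel_size=(n_frames, 3, 3), padding=(0, 1, 1))
        # Shared 2D trunk reused by the detection / forecasting heads.
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU())
        # One head per task; forecasting regresses positions T steps ahead.
        self.det_head = nn.Conv2d(128, 7, 1)             # class + box params
        self.forecast_head = nn.Conv2d(128, 10 * 2, 1)   # (x, y) for 10 steps

    def forward(self, bev):  # bev: (B, height, T, X, Y)
        fused = self.temporal_fuse(bev).squeeze(2)       # (B, 64, X, Y)
        feats = self.trunk(fused)
        return self.det_head(feats), self.forecast_head(feats)

# Example: 5 past frames voxelized into a 29-channel, 200x200 BEV grid.
boxes, futures = SpaceTimeBEVBackbone()(torch.zeros(1, 29, 5, 200, 200))
```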
Overview: FaF network takes multiple frames as input and performs detection,
tracking and motion forecasting.
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
Motion forecasting
Fast and Furious: Real Time End-to-End 3D Detection, Tracking
and Motion Forecasting with a Single Convolutional Net
IntentNet: Learning to Predict Intention
from Raw Sensor Data
■ In order to plan a safe maneuver, self-driving vehicles need to understand the
intent of other traffic participants.
■ Intent is modeled as a combination of discrete high-level behaviors and continuous
trajectories describing future motion.
■ A one-stage detector and forecaster exploits both 3D point clouds produced by a
LiDAR sensor and dynamic maps of the environment.
■ This multi-task model reduces reaction time in self-driving applications.
IntentNet: Learning to Predict Intention
from Raw Sensor Data
IntentNet: Learning to Predict Intention
from Raw Sensor Data
IntentNet: Learning to Predict Intention
from Raw Sensor Data
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The motion planners used in self-driving vehicles need to generate trajectories that are safe,
comfortable, and obey the traffic rules.
■ This is usually achieved by two modules: a behavior planner, which handles high-level decisions and
produces a coarse trajectory, and a trajectory planner, which generates a smooth, feasible trajectory
for the duration of the planning horizon.
■ These planners, however, are typically developed separately, and changes in the behavior planner
might affect the trajectory planner in unexpected ways.
■ Furthermore, the final trajectory outputted by the trajectory planner might differ significantly from
the one generated by the behavior planner, as they do not share the same objective.
■ This paper proposes a jointly learnable behavior and trajectory planner, unlike most existing
learnable motion planners, which either address only behavior planning or use an uninterpretable
neural network to represent the entire logic from sensors to driving commands.
■ This approach features an interpretable cost function on top of perception, prediction and vehicle
dynamics, and a joint learning algorithm that learns a shared cost function employed by the
behavior and trajectory components.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
The learnable motion planner has discrete and continuous components,
minimizing the same cost function with the same set of learned cost weights.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
A: Given a scenario, we generate a set of possible SDV (self-driving vehicle) behaviors. B:
Left and right lane boundaries and the driving path that are relevant to the intended
behavior are considered in the cost function. C: The SDV geometry for the spatiotemporal
overlapping cost is approximated using circles. D: The SDV yields to pedestrians through
stop lines on the driving paths.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The objective of the planner is then to find a behavior and a trajectory that are safe, comfortable,
and progress along the route.
■ Given sets of candidate behaviors and trajectories, the cost function is used to choose the best pair (b, τ).
■ The cost function consists of sub-costs c that focus on different aspects of the trajectories, such
as safety, comfort, feasibility, mission completion, and traffic rules (a sketch of the weighted combination appears after the cost terms below).
■ Obstacles: A safe trajectory for the SDV should not only be collision free, but also satisfy a safety
distance to the surrounding obstacles, including both static and dynamic objects such as
vehicles, pedestrians, cyclists, unknown objects, etc. The overlap cost c_overlap is 1 if a
trajectory violates the spatial occupancy of any obstacle in a given predicted trajectory, and is
averaged across all possible predicted trajectories weighted by their probabilities. The obstacle
cost c_obstacle penalizes the squared distance of the violation of the safety distance d_safe. This cost
is scaled by the speed of the SDV, making the distance violation more costly at higher speeds.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Driving-path and lane boundary: The SDV is expected to adhere to the structure of the road,
i.e., it should not go out of the lane boundary and should stay close to the center of the lane.
The driving-path cost c_path is the squared distance to the driving path. The lane
boundary cost c_lane is the squared violation distance of a safety threshold.
■ Headway: SDV should keep a safe longitudinal distance that depends on the speed of the
SDV and the leading vehicle for lane following or lane change behavior. The headway cost is
computed as the violation of the safety distance after applying a comfortable constant
deceleration, assuming that the leading vehicle applies a hard brake.
■ Yield: When a pedestrian is predicted to reach the boundary of the SDV lane or cross it,
the planner imposes a stopping point at a safe longitudinal distance and penalizes any trajectory that
violates it. The yield cost c_yield penalizes the squared longitudinal violation distance weighted
by the pedestrian prediction probability.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
Left: Headway cost penalizes unsafe distance to leading vehicles.
Right: for each sampled trajectory, a weight function determines how
relevant an obstacle is to the SDV in terms of its lateral offset.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Route: A behavior is desirable if the goal lane is closer to the route than the current lane, so
they penalize the number of lane changes required to converge to the route.
■ Cost-to-go: they compute the deceleration needed for slowing-down to possible upcoming
speed-limits and use the square of the violation of the comfortable deceleration as cost-to-go.
■ Speed limit, travel distance and dynamics: Using the speed limit of a lane, which is available in
the map data, they introduce a cost that penalizes a trajectory if it exceeds the eligible
speed. The speed limit cost c_speed is the squared violation in speed. In order to favor trajectories
that advance along the route, they use the travelled longitudinal distance as a reward. Since the
SDV is physically limited to certain ranges of acceleration, curvature, etc., they prune trajectories
that violate such constraints. Additionally, they introduce costs that penalize aggressive motions
to promote comfortable driving. Specifically, the dynamics cost c_dyn consists of the squared
values of jerk and violation thereof, acceleration and violation thereof, lateral acceleration and
violation thereof, lateral jerk and violation thereof, curvature, twist, and wrench.
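As a concrete illustration of how such sub-costs can be combined, the sketch below scores a candidate (behavior, trajectory) pair as a learned weighted sum of sub-costs and picks the minimum-cost candidate. The sub-cost functions, their thresholds, and the weights are placeholders; only the structure (linear combination with learned weights, minimized over candidates) follows the paper's description.

```python
import numpy as np

# Hypothetical sub-cost functions of a candidate trajectory `tau`
# (an array of (x, y, v) states). Real sub-costs use obstacles, lane
# geometry, etc.; here they are stand-ins that depend only on tau.
def c_obstacle(tau): return float(np.sum(np.maximum(0.0, 2.0 - tau[:, 0]) ** 2))
def c_path(tau):     return float(np.sum(tau[:, 1] ** 2))           # offset from path
def c_speed(tau):    return float(np.sum(np.maximum(0.0, tau[:, 2] - 15.0) ** 2))
def c_progress(tau): return -float(tau[-1, 0] - tau[0, 0])           # reward distance

SUB_COSTS = [c_obstacle, c_path, c_speed, c_progress]

def total_cost(tau, weights):
    """Linear combination of sub-costs with learned weights."""
    return sum(w * c(tau) for w, c in zip(weights, SUB_COSTS))

def plan(candidates, weights):
    """Pick the candidate (behavior, trajectory) pair with the lowest cost."""
    costs = [total_cost(tau, weights) for _, tau in candidates]
    return candidates[int(np.argmin(costs))]

# Two toy candidates over a 10-step horizon.
rng = np.random.default_rng(0)
cands = [("keep_lane", rng.normal(size=(10, 3))),
         ("nudge",     rng.normal(size=(10, 3)))]
best_behavior, best_traj = plan(cands, weights=np.array([1.0, 0.5, 0.2, 1.0]))
```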
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ The inference process contains two stages of
optimization.
– In the behavioral planning stage, it adopts a
coarse-level parameterization for trajectory
generation. The resulting trajectory is found
by selecting the one with the lowest cost.
– In the trajectory planning stage, it uses a
fine-level parameterization where they
model the trajectory as a function of vehicle
control variables. The trajectory is initialized
with the output of the behavior planner, and
optimized through a continuous optimization
solver.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
Example trajectories in a nudging scenario. Behavioral decisions include obstacle side
assignment and lane information, which are sent through the behavioral-trajectory interface.
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ It uses a combination of a max-margin objective and imitation learning as the
loss function (a sketch of the general form follows below):
max-margin loss
imitation learning loss
gradient of the learning objective
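The slide only names the loss terms; the equations themselves are images in the original deck and are not reproduced here. Below is a hedged sketch of the general form such a combination often takes: a max-margin (hinge) term that requires the human-driven trajectory to have lower cost than any candidate by a margin, plus an imitation term penalizing the distance between the selected trajectory and the human one. The margin definition and the weighting are assumptions, not the paper's exact formulas.

```python
import numpy as np

def max_margin_loss(costs, cost_gt, margins):
    """Hinge: every candidate should cost at least `margin` more than the
    human trajectory; candidates that violate this are penalized."""
    return float(np.max(np.maximum(0.0, cost_gt + margins - costs)))

def imitation_loss(best_traj, gt_traj):
    """L2 distance between the planner's lowest-cost trajectory and the
    human-driven one."""
    return float(np.mean(np.sum((best_traj - gt_traj) ** 2, axis=-1)))

# Toy setup: K candidate trajectories with costs from the learned cost
# function, task-dependent margins, and the human (ground-truth) trajectory.
rng = np.random.default_rng(0)
trajs = rng.normal(size=(8, 10, 2))          # K=8 candidates, 10 steps, (x, y)
costs = rng.uniform(1.0, 2.0, size=8)
gt_traj, cost_gt = rng.normal(size=(10, 2)), 0.9
margins = np.linalg.norm(trajs - gt_traj, axis=-1).mean(axis=-1)  # assumed margin

loss = max_margin_loss(costs, cost_gt, margins) \
       + 0.5 * imitation_loss(trajs[int(np.argmin(costs))], gt_traj)
```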
Jointly Learnable Behavior and Trajectory
Planning for Self-Driving Vehicles
■ Two real-world driving datasets: ManualDrive is a set of human driving recordings where the
drivers are instructed to drive smoothly and carefully respecting all traffic rules; TOR-4D is
composed of very challenging scenarios.
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ LiRaNet is an end-to-end trajectory prediction method that utilizes radar sensor
information along with widely used LiDAR and high-definition (HD) maps.
■ Automotive radar provides rich, complementary information, allowing for longer
range vehicle detection as well as instantaneous radial velocity measurements.
■ However, there are factors that make the fusion of lidar and radar information
challenging, such as the relatively low angular resolution of radar measurements,
their sparsity and the lack of exact time synchronization with lidar.
■ To overcome these challenges, propose an efficient spatio-temporal radar feature
extraction scheme which achieves state-of-the-art performance on multiple large-
scale datasets.
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
An example scene from X17k in bird’s eye view where lidar points (light blue) and radar point velocities (orange) are
visualized with labels (white) for current, past and future frames. Vehicle A is a turning bus that has multiple radar points
across frames. By effectively combining them over space and time a full 2D velocity and turning rate can be recovered.
Vehicle B shows the high positional noise that inherently comes with radar. Vehicle C shows a case with sparse lidar
points where implicitly associating them across time can be challenging. However, radar points present around C can
add context for the model to detect and predict the trajectory.
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
LiRaNet overview: The radar feature extraction network (A) extracts spatio-temporal features from raw radar points in
2 steps: (1) for each frame we create a graph between the BEV grid cells and radar points to learn spatial features of
each cell using a non-rigid convolution, (2) these spatial features are further fused temporally by stacking across
channel dimension and using an MLP to get a radar feature volume. This feature volume is then fused with feature
volumes from other sensors and fed to a joint perception-prediction network (B) which produces detections and their
future trajectories. An example prediction for a scene from X17k can be seen in (C).
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
■ The input domain consists of radar points and the output domain consists of BEV cells.
■ For each cell j, calculate the features h^m_j for sweep m, where A^m_j is the set of associated radar
points, x^m_i is the 2D coordinate of an associated radar point, x^m_j is the 2D coordinate of the BEV
cell’s center, ⊕ denotes the concatenation operation, f^m_i is the feature vector of the radar point,
and g^m(·) is a multi-layer perceptron (MLP) with learnable weights shared across all cells.
■ A^m_j is calculated using a nearest-neighbor algorithm with a distance threshold.
■ By using a threshold larger than the size of a cell, this method compensates for positional
errors in radar.
■ For each cell j, calculate the final spatio-temporal feature vector h_j by concatenating the
per-sweep features h^m_j and using an MLP to combine them (a minimal sketch follows below).
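Here is a minimal sketch of the per-cell aggregation described above, assuming a plain nearest-neighbour association within a distance threshold, a small two-layer MLP standing in for the learned g^m, and mean pooling over the associated points; LiRaNet's actual network sizes, point features, and aggregation rule may differ.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Tiny 2-layer MLP standing in for the learned g^m."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def radar_cell_features(cell_centers, radar_xy, radar_feat, params, radius=2.0):
    """For each BEV cell j, gather radar points within `radius`, concatenate
    each point's feature with its offset to the cell center, run the shared
    MLP, and average the results (the pooling choice is an assumption)."""
    w1, b1, w2, b2 = params
    out = np.zeros((len(cell_centers), b2.shape[0]))
    for j, c in enumerate(cell_centers):
        assoc = np.where(np.linalg.norm(radar_xy - c, axis=1) < radius)[0]
        if len(assoc) == 0:
            continue
        inputs = np.concatenate(
            [radar_feat[assoc], radar_xy[assoc] - c], axis=1)  # f_i^m ⊕ (x_i^m - x_j^m)
        out[j] = mlp(inputs, w1, b1, w2, b2).mean(axis=0)
    return out

# Toy example: 4 radar points with 3-d features, a 2x2 grid of BEV cells.
rng = np.random.default_rng(0)
radar_xy, radar_feat = rng.uniform(0, 4, (4, 2)), rng.normal(size=(4, 3))
cells = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]])
params = (rng.normal(size=(5, 16)), np.zeros(16),
          rng.normal(size=(16, 8)), np.zeros(8))
h_m = radar_cell_features(cells, radar_xy, radar_feat, params)  # per-sweep features
# Temporal fusion: concatenate h_m across sweeps and feed another MLP.
```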
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
Radar is fused with LiDAR and the map to generate a multi-sensor feature volume in BEV, which is used
for detection and trajectory prediction. Each learnable operator is denoted as (op, kernel, stride,
number of features). Conv denotes convolution and Res denotes a residual block.
LiRaNet: E2E Trajectory Prediction using
Spatio-Temporal Radar Fusion
Learning Lane Graph Representations for
Motion Forecasting
■ A motion forecasting model that exploits a novel structured map representation as well as actor-
map interactions.
■ Instead of encoding vectorized maps as raster images, construct a lane graph from raw map data
to explicitly preserve the map structure.
■ To capture the complex topology and long range dependencies of the lane graph, propose
LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane
dilation.
■ To capture the complex interactions between actors and maps, exploit a fusion network consisting
of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor.
■ Powered by LaneGCN and actor-map interactions, the model is able to predict accurate and
realistic multi-modal trajectories.
■ This approach significantly outperforms the state-of-the-art on the large scale Argoverse motion
forecasting benchmark.
■ https://github.com/uber-research/LaneGCN
Learning Lane Graph Representations for
Motion Forecasting
It constructs a lane graph from raw map data and uses LaneGCN to extract map features. In parallel,
ActorNet extracts actor features from observed past trajectories. Then it uses FusionNet to model the
interactions between the actors themselves and the map, and to predict the future trajectories.
Learning Lane Graph Representations for
Motion Forecasting
The model is composed of four modules. (1) ActorNet receives the past actor trajectories as input, and uses 1D
convolution to extract actor node features. (2) MapNet constructs a lane graph from HD maps, and uses a LaneGCN to
extract lane node features. (3) FusionNet is a stack of 4 interaction blocks. The actor to lane block fuses real-time traffic
information from actor nodes to lane nodes. The lane to lane block propagates information over the lane graph and
updates lane features. The lane to actor block fuses updated map information from lane nodes to actor nodes. The actor to
actor block performs interactions among actors. It uses another LaneGCN for the lane to lane block, and spatial attention
layers for the other blocks. (4) The prediction header uses after-fusion actor features to produce multi-modal trajectories.
Learning Lane Graph Representations for
Motion Forecasting
LaneGCN is a stack of 4 multi-scale LaneConv
residual blocks, each of which consists of a LaneConv
(1, 2, 4, 8, 16, 32) and a linear layer with a residual
connection. All layers have 128 feature channels.
Left: The lane centerline of interest, its
predecessor, successor, left and right neighbor
are denoted with red, orange, blue, purple, and
green lines, respectively. Each centerline is given
as a sequence of BEV points (hollow circles).
Right: Derived lane graph with an example lane
node. The lane node of interest, its predecessor,
successor, left and right neighbor are denoted
with red, orange, blue, purple and green circles
respectively.
Learning Lane Graph Representations for
Motion Forecasting
LaneConv Operator:
parameterize the node feature as follows,
LaneConv operator as
A_i and W_i are the adjacency and weight matrices
corresponding to the i-th connection type, respectively.
MLP indicates a multi-layer perceptron and the
two subscripts refer to shape and location,
respectively.
k-dilation LaneConv operator as
A^k_pre is the k-th matrix power of A_pre.
In regular grid graphs, a dilated convolution operator can effectively
capture long range dependencies by enlarging the receptive field (a sketch of the general form follows below).
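The LaneConv equations on this slide are rendered as images in the original deck. The sketch below shows the general form suggested by the text: the output combines the node features with propagation along each connection type through its own adjacency and weight matrix, plus a dilated variant that uses the k-th power of the predecessor/successor adjacency. Normalization, nonlinearities, and the exact node-feature parameterization are not reproduced and should be checked against the paper.

```python
import numpy as np

def lane_conv(X, adjacency, weights, W0):
    """Y = X W0 + sum_i A_i X W_i over connection types
    (predecessor, successor, left neighbor, right neighbor)."""
    Y = X @ W0
    for A_i, W_i in zip(adjacency, weights):
        Y += A_i @ X @ W_i
    return Y

def dilated_lane_conv(X, A_pre, A_suc, W0, W_pre, W_suc, k=4):
    """k-dilation: propagate along the lane using the k-th matrix power of the
    predecessor/successor adjacency to enlarge the receptive field."""
    A_pre_k = np.linalg.matrix_power(A_pre, k)
    A_suc_k = np.linalg.matrix_power(A_suc, k)
    return X @ W0 + A_pre_k @ X @ W_pre + A_suc_k @ X @ W_suc

# Toy lane graph with 6 lane nodes and 8-d node features.
rng = np.random.default_rng(0)
N, C = 6, 8
X = rng.normal(size=(N, C))
A_pre = np.diag(np.ones(N - 1), k=1)      # toy chain: node i+1 precedes node i
A_suc = A_pre.T
A_left = np.zeros((N, N)); A_right = np.zeros((N, N))
Ws = [rng.normal(size=(C, C)) for _ in range(4)]
Y = lane_conv(X, [A_pre, A_suc, A_left, A_right], Ws, rng.normal(size=(C, C)))
Y_dilated = dilated_lane_conv(X, A_pre, A_suc, Ws[0], Ws[1], Ws[2], k=2)
```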
Learning Lane Graph Representations for
Motion Forecasting
LaneGCN: LaneConv(k1, ..., kC) denotes this multi-scale layer.
• In the model, spatial attention and LaneGCN are used to capture a complete set of actor-map interactions.
FusionNet:
• Build a stack of four fusion modules to capture all information between actors and lane nodes, i.e.,
actors to lanes (A2L), lanes to lanes (L2L), lanes to actors (L2A) and actors to actors (A2A).
• The L2L module is implemented with LaneGCN, while the other three modules are implemented by:
Prediction Header: The header has two branches, a regression branch to predict the trajectory of each
mode and a classification branch to predict the confidence score of each mode.
• regression branch
• classification branch
Learning Lane Graph Representations for
Motion Forecasting
Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
■ In this paper, the aim is to learn scene-consistent motion forecasts of complex urban traffic directly
from sensor data.
■ In particular, propose to characterize the joint distribution over future trajectories via an
implicit latent variable model.
■ It models the scene as an interaction graph and employs powerful graph neural networks to
learn a distributed latent representation of the scene.
■ Coupled with a deterministic decoder, it obtains trajectory samples that are consistent across
traffic participants, achieving state-of-the-art results in motion forecasting and interaction
understanding.
■ These motion forecasts result in safer and more comfortable motion planning.
Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Graphical models of trajectory distribution. Dashed arrows/circles denote that only some approaches within
the group use those components. Double circle in (c) denotes that it is a deterministic mapping of its inputs.
Actor Feature Extraction. Given LiDAR and maps, the backbone CNN detects the actors in the scene, and
individual feature vectors per actor are extracted via RRoI Align, followed by a CNN with spatial pooling.
Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Implicit Latent Variable Model encodes the scene into a latent space, from which it can efficiently sample
multiple future realizations in parallel, each with socially consistent trajectories.
Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
Implicit Latent Variable Model for Scene-
Consistent Motion Forecasting
PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
■ The problem of joint perception and motion forecasting in the context of self-driving
vehicles.
■ Towards this goal, PnPNet, an end-to-end model that takes as input sequential sensor
data, and outputs at each time step object tracks and their future trajectories.
■ The key component is a tracking module that generates object tracks online from
detections and exploits trajectory level features for motion forecasting.
■ Specifically, the object tracks get updated at each time step by solving both the data
association problem and the trajectory estimation problem.
■ Importantly, the whole model is end-to-end trainable and benefits from joint optimization
of all tasks.
■ PnPNet is validated on two large-scale driving datasets, showing improvements over the state of
the art with better occlusion recovery and more accurate future prediction.
PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
Three paradigms for perception and prediction. Traditional approach (a) adopts the modular design that
decomposes the stack into subtasks and solves them with individual models. End-to-end method (b) uses a
joint model to solve detection and prediction simultaneously, but performs tracking as post-processing. As
a result, the full temporal history contained in tracks is not used by detection and prediction. This
approach (c) brings tracking into the loop so that all tasks benefit from rich temporal context.
PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
PnPNet for end-to-end perception and prediction. The model consists of three modules that perform 3D object
detection, discrete-continuous tracking, and motion forecasting sequentially. To extract trajectory level actor
representations used for tracking and prediction, also equip the model with two explicit memories: one for
global sensor feature maps, and one for past object trajectories. Both memories get updated at each time step
with up-to-date sensor features and tracking results.
PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
The trajectory-level object representation. Given an object trajectory, we first extract its sensor observation
and motion features at each time step, and then apply an LSTM network to model the temporal dynamics
(a minimal sketch follows below).
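Below is a minimal sketch in the spirit of this figure: per-time-step sensor and motion features are concatenated and summarized by an LSTM whose final hidden state serves as the trajectory-level feature. The feature sizes and the exact per-step inputs are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Summarize a track's per-step features with an LSTM (sketch)."""

    def __init__(self, sensor_dim=128, motion_dim=4, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(sensor_dim + motion_dim, hidden, batch_first=True)

    def forward(self, sensor_feats, motion_feats):
        # sensor_feats: (B, T, sensor_dim) cropped from the BEV feature map
        # motion_feats: (B, T, motion_dim), e.g. position, heading, velocity
        steps = torch.cat([sensor_feats, motion_feats], dim=-1)
        _, (h_n, _) = self.lstm(steps)
        return h_n[-1]          # (B, hidden): trajectory-level feature

# A track observed for 5 steps; the output can feed association scoring,
# trajectory refinement, and motion forecasting heads.
enc = TrajectoryEncoder()
track_feat = enc(torch.zeros(2, 5, 128), torch.zeros(2, 5, 4))
```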
PnPNet: End-to-End Perception and
Prediction with Tracking in the Loop
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
■ This paper proposes an end-to-end learnable network that performs joint perception,
prediction and motion planning for self-driving vehicles and produces interpretable
intermediate representations.
■ Unlike existing neural motion planners, the motion planning costs are consistent with
perception and prediction estimates.
■ This is achieved by a novel differentiable semantic occupancy representation that is explicitly
used as cost by the motion planning process.
■ This network is learned end-to-end from human demonstrations.
■ Experiments on a large-scale manual-driving dataset and in closed-loop simulation show that
the proposed model significantly outperforms state-of-the-art planners in imitating human
behavior while producing much safer trajectories.
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
The overview of end-to-end learnable autonomy system that takes raw sensor data, an
HD map and a high level route as input and produces safe maneuvers for the self-
driving vehicle via our novel semantic interpretable intermediate representations.
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Semantic classes in occupancy forecasting. Colors match between drawing and hierarchy. Shadowed area
corresponds to SDV route. Black vehicle, pedestrian and bike icons represent the agents' true current location.
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Inference diagram of the perception and recurrent occupancy forecasting
model. || symbolizes concatenation along the feature dimension, ⊕ element-
wise sum and ∆ bilinear interpolation used to downscale the occupancy.
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Examples of the motion planner cost functions: (a) collision, (b) driving-path,
(c) lane boundary, (d) traffic light, (e) comfort, (f) route, (g) progress.
Costs related to comfort, traffic rules and progress along the route:
Safety cost (a sketch of the occupancy-indexed collision cost follows below):
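To illustrate the core idea that the planning cost directly indexes the predicted semantic occupancy, the sketch below penalizes a candidate trajectory by the occupancy probability accumulated under the SDV footprint (plus a margin) at each future time step. The footprint rasterization and weighting are simplified assumptions, not the paper's exact cost.

```python
import numpy as np

def occupancy_collision_cost(occupancy, traj_cells, margin=1):
    """occupancy: (T, H, W) predicted probability that a BEV cell is occupied
    at each future step; traj_cells: (T, 2) SDV center cell per step.
    Sum the occupancy inside a (2*margin+1)^2 footprint around the SDV."""
    T, H, W = occupancy.shape
    cost = 0.0
    for t, (r, c) in enumerate(traj_cells):
        r0, r1 = max(0, r - margin), min(H, r + margin + 1)
        c0, c1 = max(0, c - margin), min(W, c + margin + 1)
        cost += float(occupancy[t, r0:r1, c0:c1].sum())
    return cost

# Toy example: 5-step horizon on a 20x20 BEV grid, one straight candidate.
rng = np.random.default_rng(0)
occ = rng.uniform(0, 0.1, size=(5, 20, 20))
candidate = np.stack([np.full(5, 10), np.arange(5, 10)], axis=1)
collision_cost = occupancy_collision_cost(occ, candidate)
# The full planner adds comfort, traffic-rule, route and progress costs and
# picks the lowest-cost trajectory among many sampled candidates.
```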
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
Learn the model parameters by exploiting these two loss functions:
Semantic Occupancy Loss:
Planning Loss:
Perceive, Predict, and Plan: Safe Motion
Planning Through Interpretable Semantic
Representations
End-to-end Interpretable Neural Motion
Planner
■ A neural motion planner (NMP) for learning to drive autonomously in complex urban
scenarios that include traffic-light handling, yielding, and interactions with multiple
road-users.
■ Towards this goal, design a holistic model that takes as input raw LiDAR data and an
HD map and produces interpretable intermediate representations in the form of 3D
detections and their future trajectories, as well as a cost volume defining the
goodness of each position that the self-driving car can take within the planning
horizon.
■ Then sample a set of diverse physically possible trajectories and choose the one
with the minimum learned cost.
■ Importantly, the cost volume is able to naturally capture multi-modality.
End-to-end interpretable neural motion planner (NMP). Backbone network takes LiDAR data and maps
as inputs, and outputs bounding boxes of other actors for future timesteps (perception), as well as a cost
volume for planning with T filters. Next, for each trajectory proposal from the sampler, its cost is indexed
from different filters of the cost volume and summed together. The trajectory with the minimal cost is
chosen as the final plan.
NMP
■ An end-to-end learnable motion planner that generates accurate space-time trajectories over a
planning horizon of a few seconds.
■ Importantly, the model takes as input LiDAR point clouds and a HD map and produces
interpretable intermediate representations in the form of 3D detections and their future motion
forecasted over the planning horizon.
■ The final output representation is a space-time cost volume that represents the “goodness” of
each possible location that the SDV can take within the planning horizon.
■ The planner then scores a series of trajectory proposals using the learned cost volume and
chooses the one with the minimum cost.
■ Train the model end-to-end with a multi-task objective. The planning loss encourages the
minimum cost plan to be similar to the trajectory performed by human demonstrators.
■ Note: this loss is sparse, as a ground-truth trajectory only occupies a small portion of the space.
■ Introduce another perception loss that encourages the intermediate representations to produce
accurate 3D detections and motion forecasting.
■ This ensures the interpretability of the intermediate representations and enables much faster
learning.
NMP
■ Given the input LiDAR sweeps and the HD map, compute the corresponding cost
volume c by feedforward convolutional operations.
■ The final trajectory can then be computed by minimizing the trajectory cost under this cost volume.
■ Note, however, that this optimization is NP-hard.
■ Thus rely on sampling to obtain a low cost trajectory. Towards this goal, sample a
wide variety of trajectories that can be executed by the SDV and produce as final
output the one with minimal cost according to learned cost volume.
■ Physically possible trajectories are sampled efficiently during inference.
■ Since the cost of a trajectory is computed by indexing from the cost volume, the
planner is fast enough for real-time inference.
Trajectory Representation. First sample a set of parameters of a Clothoid to determine the
shape of a trajectory. Then sample a velocity profile to determine how fast the self-driving
vehicle (SDV) goes along this trajectory. Combining these two yields a space-time trajectory
(a minimal sketch follows below).
A Clothoid curve, also known as an Euler spiral or Cornu spiral, is used to represent the 2D path of the SDV.
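To illustrate this trajectory representation, the sketch below numerically integrates a clothoid (heading with linearly varying curvature) to get a 2D path, samples a constant-acceleration velocity profile, and combines them into a space-time trajectory by mapping time to arc length. The parameter ranges and the simple integration scheme are illustrative only, not the paper's sampler.

```python
import numpy as np

def clothoid_path(kappa0, kappa_rate, length, n=200):
    """Integrate heading theta(s) = kappa0*s + 0.5*kappa_rate*s^2 into (x, y)."""
    s = np.linspace(0.0, length, n)
    theta = kappa0 * s + 0.5 * kappa_rate * s ** 2
    ds = length / (n - 1)
    x = np.cumsum(np.cos(theta)) * ds
    y = np.cumsum(np.sin(theta)) * ds
    return s, np.stack([x, y], axis=1)

def space_time_trajectory(kappa0, kappa_rate, v0, accel, horizon=3.0, dt=0.1):
    """Combine a clothoid path with a constant-acceleration velocity profile."""
    t = np.arange(0.0, horizon + dt, dt)
    arc = np.maximum(0.0, v0 * t + 0.5 * accel * t ** 2)      # distance travelled
    s, path = clothoid_path(kappa0, kappa_rate, length=max(arc[-1], 1e-3))
    xy = np.stack([np.interp(arc, s, path[:, 0]),
                   np.interp(arc, s, path[:, 1])], axis=1)
    return np.concatenate([t[:, None], xy], axis=1)            # rows of (t, x, y)

# Sample one candidate: gentle left turn, starting at 8 m/s, mild braking.
traj = space_time_trajectory(kappa0=0.0, kappa_rate=0.02, v0=8.0, accel=-0.5)
```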
End-to-End Learning
The ultimate goal is to plan a safe trajectory while following the rules of traffic.
The model must understand where obstacles are and where they will be in the future in order
to avoid collisions.
Therefore, use multi-task training with supervision from detection, motion forecasting as
well as human-driven trajectories for the ego-car.
Note that there is no direct supervision for the cost volume.
Thus adopt a max-margin loss to push the network to learn to discriminate between good and
bad trajectories.
The overall loss function is then:
Cost Volume across Time. The planned trajectory in red and ground-truth in blue. Overlay lower cost
region for different timesteps in the same figure, using different colors (indicated by legend). Detection
and corresponding prediction results are in cyan.
DSDNet: Deep Structured Self-Driving
Network
■ Deep Structured self-Driving Network (DSDNet), which performs object detection,
motion prediction, and motion planning with a single neural network.
■ Towards this goal, develop a deep structured energy based model which considers
the interactions between actors and produces socially consistent multimodal future
predictions.
■ Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to
plan a safe maneuver by using a structured planning cost.
■ The sample-based formulation overcomes the difficulty of probabilistic
inference over continuous random variables.
DSDNet overview: The model takes LiDAR and a map as inputs, processes them with a CNN backbone,
and jointly performs object detection, multi-modal socially-consistent prediction, and safe planning
under uncertainty. Rainbow patterns indicate the highly likely future positions of actors as predicted by the model.
In order to plan a safe maneuver, a self-driving vehicle has to first understand its surroundings as well
as predict how the future might evolve. It should then plan its motion by considering all possibilities of
the future weighting them properly. This is not trivial as the future is very multi-modal and actors
interact with each other. Moreover, the inference procedure needs to be performed in a fraction of
a second in order to have practical value.
Backbone Feature Network and Object Detection
■ Let X be the LiDAR point clouds and the HD map given as input to the system. Since LiDAR point
clouds can be very sparse and the actors’ motion is an important cue for detection and
prediction, use the past 10 LiDAR sweeps (i.e., 1 s of measurements) and voxelize them into a
3D tensor. Utilize HD maps as they provide a strong prior about the scene.
■ Then process this 3D tensor with a deep convolutional network backbone and compute a
backbone feature map. Employ a single-shot detection header on this feature map to output
detection bounding boxes for the actors in the scene.
■ Apply two Conv2D layers separately on feature map, one for classifying if a location is
occupied by an actor, the other for regressing the position offset, size, orientation and speed
of each actor.
■ prediction and planning modules will then take these detections and the feature map as input
to produce both a distribution over the actors‘ behaviors and a safe planning maneuver.
Probabilistic Multimodal Social Prediction
■ As actors move on the ground, represent their possible future behavior using
a trajectory defined as a sequence of 2D waypoints in bird’s eye view (BEV)
sampled at T discrete timestamps.
■ Note that T is the same duration as our planning horizon, and compute the
motion prediction distribution and a motion plan each time a new sensor
measurement arises (i.e., every 100ms).
■ Modeling this joint distribution and performing efficient inference is
challenging, as each actor has a high-dimensional continuous action space.
■ propose to approximate this high-dimensional continuous space with a finite
number of samples, and construct a non-parametric distribution over the
sampled space.
■ Also use a combination of straight, circle, and clothoid curves.
Probabilistic Multimodal Social Prediction
joint distribution of all actors' future behaviors
Decompose the energy E into two terms: the first term encodes the goodness of a future
trajectory (independent of other actors), while the second term explicitly encodes the fact that
pairs of actors should not collide in the future (a sketch of this decomposition follows below).
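Here is a minimal sketch of that decomposition: each actor's candidate trajectory gets a unary "goodness" energy, pairs of trajectories that collide add a pairwise penalty, and the joint probability of a scene configuration is proportional to exp(-E). The collision test and energy values are toy placeholders, not the learned networks.

```python
import numpy as np

def collision_energy(traj_a, traj_b, radius=2.0, penalty=10.0):
    """Large energy if two trajectories come closer than `radius` at the
    same time step (toy circle-overlap check)."""
    dists = np.linalg.norm(traj_a - traj_b, axis=1)
    return penalty * float(np.any(dists < radius))

def joint_energy(trajs, unary_energies):
    """E(Y) = sum_i E_traj(y_i) + sum_{i<j} E_coll(y_i, y_j)."""
    E = float(np.sum(unary_energies))
    n = len(trajs)
    for i in range(n):
        for j in range(i + 1, n):
            E += collision_energy(trajs[i], trajs[j])
    return E

# Toy scene: 3 actors, each with one candidate 10-step trajectory. The
# unnormalized probability of this joint future is exp(-E).
rng = np.random.default_rng(0)
trajs = [rng.normal(scale=5.0, size=(10, 2)) for _ in range(3)]
unaries = rng.uniform(0.0, 1.0, size=3)      # would come from the E_traj network
unnormalized_prob = np.exp(-joint_energy(trajs, unaries))
```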
Details of the multimodal social prediction
module: For each actor, sample a set of
physically valid trajectories, and use a neural
network E_traj to assign energies (probabilities) to
them. To make different actors' behaviors
socially consistent, employ message-passing
steps which explicitly model interactions and can
encode human prior knowledge (collision
avoidance). The final predicted socially-
consistent distribution is shown on the top right.
Neural header for evaluating E_traj and C_traj.
Probabilistic Multimodal Social Prediction
■ For safety, the motion planner needs to take all possible actor futures into
consideration. Therefore, motion forecasting needs to infer the probability of each
actor taking a particular future trajectory.
■ utilize sum-product message passing to estimate the marginal distribution per actor,
taking into account the effects of all other actors by marginalization.
■ The marginal reflects the uncertainty and multimodality in an actor's future behavior
and will be leveraged by planner.
■ Through this message-passing procedure, actors communicate to each other their
future intentions and how probable those intentions are. The collision energy helps to
coordinate intentions from different actors such that the behaviors are compliant and
do not result in collision (see the message-passing sketch below).
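Below is a sketch of sum-product message passing over the sampled trajectory sets, assuming a fully connected pairwise graph: each actor holds K candidate trajectories, messages combine the neighbours' beliefs with the pairwise collision energies, and per-actor marginals come from renormalizing the unary beliefs times the incoming messages. The number of iterations, the synchronous schedule, and the toy potentials are illustrative assumptions.

```python
import numpy as np

def sum_product_marginals(unary, pairwise, n_iters=5):
    """unary: (N, K) energies E_traj per actor/sample.
    pairwise: (N, N, K, K) collision energies E_coll between samples.
    Returns per-actor marginal probabilities over the K samples."""
    N, K = unary.shape
    phi = np.exp(-unary)                          # unary potentials
    psi = np.exp(-pairwise)                       # pairwise potentials
    msgs = np.ones((N, N, K))                     # message from i to j over y_j
    for _ in range(n_iters):
        new = np.ones_like(msgs)
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                # belief of i excluding the message coming back from j
                incoming = np.prod(msgs[:, i, :], axis=0) / msgs[j, i, :]
                b_i = phi[i] * incoming
                new[i, j] = b_i @ psi[i, j]       # marginalize over y_i
                new[i, j] /= new[i, j].sum()      # normalize for stability
        msgs = new
    beliefs = phi * np.prod(msgs, axis=0)
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Toy scene: 3 actors, K=4 trajectory samples each.
rng = np.random.default_rng(0)
marginals = sum_product_marginals(rng.uniform(size=(3, 4)),
                                  rng.uniform(size=(3, 3, 4, 4)))
```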
Safe Motion Planning under Uncertain Future
■ The motion planning module fulfills the final goal, that is, navigating towards a
destination while avoiding collision and obeying traffic rules.
■ Towards this goal, build a cost function C, which assigns lower cost values to "good"
trajectory proposals and higher values to "bad" ones.
■ Planning is performed by finding the optimal trajectory with the minimum cost.
■ Planning Cost:
■ Conduct exact minimization over P. C_traj is a network-based cost, and all K
possible trajectories are evaluated with a single batched forward pass. C_coll is computed with a GPU-based
collision checker. Efficiently evaluate all K samples and select the
trajectory with the minimum cost as the final planning result (a sketch follows below).
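To make the planning step concrete, the sketch below scores each of K candidate ego trajectories with a trajectory cost plus an expected collision cost computed against the predicted per-actor marginals, and selects the minimum-cost candidate. The collision check and cost values are toy placeholders standing in for the learned C_traj and the GPU collision checker.

```python
import numpy as np

def collides(ego, other, radius=2.0):
    return bool(np.any(np.linalg.norm(ego - other, axis=1) < radius))

def plan(ego_candidates, ctraj, actor_samples, actor_marginals):
    """ego_candidates: (K, T, 2); ctraj: (K,) learned trajectory costs;
    actor_samples: list of (S, T, 2) per actor; actor_marginals: list of (S,).
    The expected collision cost weights each predicted future by its probability."""
    costs = ctraj.copy()
    for k, ego in enumerate(ego_candidates):
        for samples, probs in zip(actor_samples, actor_marginals):
            costs[k] += sum(p * collides(ego, y) for y, p in zip(samples, probs))
    return int(np.argmin(costs)), costs

# Toy example: 6 ego candidates, 2 actors with 4 predicted futures each.
rng = np.random.default_rng(0)
ego = rng.normal(scale=5.0, size=(6, 10, 2))
best, costs = plan(ego, rng.uniform(size=6),
                   [rng.normal(scale=5.0, size=(4, 10, 2)) for _ in range(2)],
                   [np.full(4, 0.25) for _ in range(2)])
```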
Learning
train the full model (backbone, detection, prediction and planning) jointly with a multi-task loss
planning loss: utilize a max-margin loss, using expert behavior as positive examples and
randomly sampled trajectories as negative ones.
detection loss: L_detection is the sum of a classification and a regression loss. Use a cross-entropy
classification loss and assign an anchor's label based on its IoU with any actor. The
regression loss is a smooth L1 between the model's regression outputs and the ground-truth
targets. Those targets include position, size, orientation and velocity.
Employ the cross-entropy between the discrete distribution and the true target as the prediction loss. Once
K trajectories are sampled per actor, this loss can be regarded as a standard classification loss
over K classes (one for each trajectory sample). The target class is set to be the trajectory
sample closest to the ground-truth future trajectory (in L2 distance).
Qualitative results on ATG4D. The prediction module can capture multi-modalities when vehicles
approach intersections, while being certain and accurate when vehicles drive along a single lane (top).
The model can produce smooth trajectories which follow the lane and are compliant with other actors
(bottom). Cyan boxes: detection. Cyan trajectory: prediction. Red box: ego-car. Red trajectory:
planning. Overlay the predicted marginal distribution for different timestamps with different colors and
only show high-probability regions.
MP3: A Unified Model to Map, Perceive,
Predict and Plan
■ Unfortunately, building HD maps has proven hard to scale due to their cost
as well as the requirements they impose in the localization system that has
to work everywhere with centimeter-level accuracy.
■ Being able to drive without an HD map would be very beneficial to scale self-
driving solutions as well as to increase the failure tolerance of existing ones
(e.g., if localization fails or the map is not up-to-date).
■ MP3, an end-to-end approach to mapless driving where the input is raw
sensor data and a high-level command (e.g., turn left at the intersection).
■ MP3 predicts intermediate representations in the form of an online map and
the current and future state of dynamic agents, and exploits them in a novel
neural motion planner to make interpretable decisions taking into account
uncertainty.
Left: a localization error makes the SDV follow a wrong route when using an HD map,
driving into traffic. Right: mapless driving can interpret the scene from sensors and
achieve a safe plan that follows a high-level command.
MP3 predicts probabilistic scene representations that are leveraged
in motion planning as interpretable cost functions.
Interpretable scene representations. For occupancy and motion,
all time steps and classes are visualized in the same image to save space.
The motion field warps the occupancy over time. Transparency denotes
probability. Color distinguishes the layers predicted by the network from the
future occupancy. This is the particular case of unimodal motion (K = 1).
Backbone Network. The output features within one CNN block are fixed. All kernel strides and dilation are 1.
Mapping Network. The output features within one CNN block are fixed. All kernel strides are 1.
Perception and Prediction Network. It outputs the initial occupancy and current and future motion. The
output features within one CNN block are fixed. All kernel strides are 1.
Routing Network. The output features within one CNN block are fixed. All kernel strides are 1. The
architecture is the same for the 3 branches, but with different learnable parameters.
Qualitative results. The predicted scene representations and motion plan for different high-level actions.
LookOut: Diverse Multi-Future
Prediction and Planning for Self-Driving
■ LookOut is an autonomy system that perceives the environment, predicts a diverse set
of futures of how the scene might unroll, and estimates the trajectory of the SDV by
optimizing a set of contingency plans over these future realizations.
■ In particular, learn a diverse joint distribution over multi-agent future trajectories in a
traffic scene that covers a wide range of future modes with high sample efficiency
while leveraging the expressive power of generative models.
■ Unlike previous work in diverse motion forecasting, the diversity objective explicitly
rewards sampling future scenarios that require distinct reactions from the self-
driving vehicle for improved safety.
■ This contingency planner then finds comfortable and non-conservative trajectories
that ensure safe reactions to a wide range of future scenarios.
The future is highly uncertain and multi-modal, as illustrated by 2 distinct futures at the scene level. In such
a scenario, LookOut plans a short-term executable action that leads to 2 different contingent plans to stay
safe in both cases.
LOOKOUT inference. For the learnable components, the colors denote different training stages. The
backbone CNN, actor CNN and prediction decoder are first trained, the diverse sampler next, and lastly the
scenario scorer.
Joint Perception and Motion Forecasting
■ In order to extract features useful for both detection and motion forecasting,
employ a convolutional backbone network, which takes as input a history of
voxelized LiDAR sweeps and a raster HD map, both in bird’s eye view (BEV)
centered around the SDV.
■ Then perform multi-class object detection with a shallow convolutional
header to recognize the presence, BEV pose and dimensions of vehicles,
pedestrians and bicyclists, and apply rotated RoI align to extract small feature
crops from the scene context around each actor’s location.
■ An actor CNN with max-pooling reduces the feature map of each actor into a
feature vector.
■ It parameterizes the trajectory of each actor with a temporal series of the
actor centroid in 2-dimensional Euclidean space, where each trajectory is
predicted in the actor’s relative coordinate frame in Bird’s Eye View (BEV)
defined by its centroid and heading.
■ The latent variable model then characterizes the joint distribution over actors’
trajectories as follows:
Joint Perception and Motion Forecasting
■ Adopt an implicit decoder, parameterized by a spatially-aware graph neural network
(GNN). Since only (X, Y) pairs are obtained from observational data, a posterior or encoder
function is introduced to approximate the true posterior distribution during training,
also parameterized by a GNN.
■ This encoder function helps this model learn a powerful decoder, since given only X
there could be many feasible Y due to the inherent multi-modality and uncertainty of
the future.
■ use binary cross-entropy with hard negative mining per class for the presence of an
actor, and Huber loss for the regression targets (i.e., pose and dimension).
■ use the CVAE framework for the latent variable model, which optimizes the evidence
lower bound (ELBO) of the log-likelihood.
Planning Centric Diverse Sampler
■ learn a diverse sampling function that maps the actor contexts coming from
sensor data around each actor into a compact set of scene latent samples
whose decoded trajectories achieve good coverage.
■ This sampler will then replace the Monte Carlo sampling from the prior
distribution during inference.
■ To sample a set of latents that are distinct enough from each other that they
will be decoded into a set of diverse futures, use the reparameterization
trick to map a shared noise across K latent mappings (see the sketch below).
■ To handle the fact that the input can vary in size, parameterize M with a pair
of GNNs: one to generate the means and another to generate the
covariances.
■ Both GNNs assume a fully connected graph where each node is anchored to
an actor, and initialize the node states.
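Here is a sketch of the key trick: a single shared noise vector is mapped through K learned mean/scale mappings into K distinct scene latents, so the K decoded futures are produced in parallel and can be trained to be diverse. The toy linear maps below stand in for the paper's two GNN heads, and the diagonal scale parameterization is an assumption.

```python
import numpy as np

def diverse_latents(scene_feat, mean_maps, scale_maps, rng):
    """Map one shared noise sample through K learned (mean, scale) mappings.
    scene_feat: (D,) pooled scene/actor context; mean_maps, scale_maps:
    lists of (D, Z) matrices standing in for the two GNN heads."""
    eps = rng.standard_normal(mean_maps[0].shape[1])       # shared noise
    latents = []
    for A_mu, A_sigma in zip(mean_maps, scale_maps):
        mu = scene_feat @ A_mu
        sigma = np.exp(scene_feat @ A_sigma)                # per-dim scale > 0
        latents.append(mu + sigma * eps)                    # reparameterization
    return np.stack(latents)                                # (K, Z)

rng = np.random.default_rng(0)
D, Z, K = 16, 8, 5
scene = rng.normal(size=D)
Z_k = diverse_latents(scene,
                      [rng.normal(scale=0.1, size=(D, Z)) for _ in range(K)],
                      [rng.normal(scale=0.1, size=(D, Z)) for _ in range(K)],
                      rng)
# Each latent Z_k is decoded (by the shared deterministic decoder) into one
# full-scene future; the K futures form the diverse prediction set.
```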
Planning Centric Diverse Sampler
■ perform message passing to aggregate info over the whole scene at each
node.
■ Finally, each node in the first GNN predicts the mean via an MLP; similarly, each node
in the second GNN predicts the covariance via another MLP.
■ Through sampling and decoding, obtain a set of K future trajectory
realizations of all actors in the scene Y.
■ The objective of this diverse sampler is to be able to generate a set of futures
Y that are diverse while recovering well the ground-truth observations.
■ energy function is composed of a few terms that promote the diversity while
preserving data reconstruction:
Scenario Probability Estimation
■ The diverse set of K future realizations Y provides the coverage needed for
safe motion planning. However, for accurate risk assessment need to
estimate the probability distribution over each future realization in the set.
■ augment the model to also output a score for all future realizations, via a
GNN that takes as input the actor features and all K sample future scenarios.
■ Then easily recover a distribution over such scores by renormalization (a sketch follows below).
■ Thus, the probability of each sample is:
■ Train the scoring function s to match the approximate categorical distribution
over future scenarios under the KL divergence.
■ define this approximate distribution as follows
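A short sketch of the renormalization step: the scorer assigns a real-valued score to each of the K sampled futures, probabilities are obtained with a softmax, and training matches them (via KL divergence) to an approximate target distribution. How the target distribution is constructed from the ground truth is not shown here and the numbers are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def scenario_probabilities(scores):
    """Renormalize raw scenario scores s_k into a categorical distribution."""
    return softmax(scores)

def kl_to_target(scores, target):
    """Training loss: KL(target || softmax(scores))."""
    p = scenario_probabilities(scores)
    return float(np.sum(target * (np.log(target + 1e-12) - np.log(p + 1e-12))))

scores = np.array([1.3, -0.2, 0.5, 0.1])     # from the scoring GNN
target = np.array([0.6, 0.1, 0.2, 0.1])      # approximate target distribution
probs, loss = scenario_probabilities(scores), kl_to_target(scores, target)
```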
Contingency Planner
■ The goal of the motion planning module is to generate safe, comfortable and
not overly conservative trajectories for the SDV to execute.
■ achieve this through Model Predictive Control, where a trajectory is planned
considering a finite horizon, and is executed until a new trajectory is
replanned upon availability of a new LiDAR sweep.
■ However, the above formulation does not exploit the fact that only one of the
predicted scenarios will happen in the future and is conversely optimizing for
a single trajectory that is ”good” in expectation.
■ Instead of finding a single motion plan for multiple futures, generate a single
common immediate action, followed by a set of future trajectories, one for
each future realization of the scene.
Contingency Planner
■ This contingency planning finds an immediate action that is safe with respect to all the
possible realizations in Y and comfortably bridges into a set of contingent trajectories, where
each is specifically planned for a single future realization.
■ Such decision-postponing avoids over-conservative behaviors while staying safe until more
information is obtained.
■ The safe motion planning is only possible if the set of predicted future scenarios is diverse,
and covers possible realizations, including low likelihood events.
■ The planner cost function is a linear combination of various carefully crafted sub-costs that
encode different aspects of driving including safety, comfort, traffic-rules and the route.
Sampling strategy: (i) first, a set of (spatial) paths is
generated; (ii) for each path, a set of initial velocity
profiles is sampled, creating the set of short-term
trajectories; (iii) conditioned on the end state of
these initial trajectories, another set of velocity
profiles is sampled for the rest of the planning
horizon, assuming the SDV follows the same path
(a sketch of the contingency objective follows below).
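The sketch below illustrates the contingency-planning decision rule described above: each short-term action is scored by its own cost plus the probability-weighted cost of the best contingent continuation under every predicted future, and the action with the lowest total is executed. The cost values here are arbitrary numbers purely for illustration, not the paper's cost function.

```python
import numpy as np

def contingency_plan(actions, continuations, scenario_probs, action_cost, traj_cost):
    """actions: short-term ego actions; continuations[a]: long-horizon
    trajectories reachable from action a; scenario_probs: (K,) probabilities.
    For each action, add the expected cost of the best reaction per scenario."""
    best_action, best_total = None, np.inf
    for a in actions:
        total = action_cost(a)
        for k, p_k in enumerate(scenario_probs):
            total += p_k * min(traj_cost(tau, k) for tau in continuations[a])
        if total < best_total:
            best_action, best_total = a, total
    return best_action, best_total

# Toy problem: 2 immediate actions, 3 contingent trajectories each, K=2 futures.
rng = np.random.default_rng(0)
conts = {"keep": ["k0", "k1", "k2"], "brake": ["b0", "b1", "b2"]}
cost_table = {(tau, k): rng.uniform() for taus in conts.values()
              for tau in taus for k in range(2)}
action, cost = contingency_plan(
    actions=["keep", "brake"],
    continuations=conts,
    scenario_probs=np.array([0.7, 0.3]),
    action_cost=lambda a: 0.0 if a == "keep" else 0.2,
    traj_cost=lambda tau, k: cost_table[(tau, k)])
```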
Obtaining K latent samples from an implicit latent variable model (ILVM) implies
sampling K times independently from the prior. In contrast, the diverse sampler
exploits a GNN mapping to predict K latent samples from a single noise (in parallel).
Contingency planning paradigm. The cost-to-go of a short term ego-action is captured
by the ability to react to the K diverse predicted futures with the K-most suitable ego-
trajectories.
Diverse multi-future predictions and plans in closed-loop, zoomed in. Object detections and motion
forecasts are blue for vehicles and pink for pedestrians. The green bounding box is the SDV, its
immediate action (1s) is shown in black (starting from its rear axle), and its contingent trajectories
planned for each possible future scenario are shown in distinct colors. LiDAR points are not visualized.
Motion forecasting visualizations. 15 future scenarios are blended with transparency (equal probability is assumed
for visualization purposes). Time is encoded in the rainbow color map ranging from red (0 s) to pink (5 s). This can
be seen as a sample-based characterization of the per-actor marginal distributions.
Deep Multi-Task Learning for Joint
Localization, Perception, and Prediction
■ Tremendous progress on many subtasks of autonomous driving including
perception, motion forecasting, and motion planning.
■ However, these systems often assume that the car is accurately localized against a
high-definition map.
■ In state-of-the-art autonomy stacks, localization error exists.
■ Based on these observations, they design LP2, a system that jointly performs perception,
prediction, and localization.
■ The architecture is able to reuse computation between the three tasks, and is thus
able to correct localization errors efficiently.
P2: Perception-Prediction
■ P2 models are tasked with perceiving actors and predicting their future trajectories to ensure that
motion planning has access to safety-critical information about the scene for the entire duration
of the planning horizon.
■ Implicit Latent Variable Model (ILVM) encodes the whole scene in a latent random variable and
uses a deterministic decoder to efficiently sample multiple scene-consistent trajectories for all the
actors in the scene.
■ Besides LiDAR, the ILVM backbone takes as input a multichannel image with semantic aspects of
the rasterized map, which the model is expected to use to improve detection and forecasting.
■ While the LiDAR scans are always processed in the vehicle frame, the scans and the map are
aligned using the pose of the car.
■ Thus, localization error results in a misalignment between the semantic map and the LiDAR scan.
Motion Planning (MP)
■ Motion Planning (MP): Given a map and a set of dynamic agents and their future
behaviours, the task of the motion planner is to provide a route that is safe,
comfortable, and physically realizable to the control module.
■ Path Lateral Time (PLT) motion planner is a method that samples physically
realizable trajectories, evaluates them, and selects the one with the minimal cost.
■ The PLT planner receives S = 50 Monte Carlo samples from the joint distribution over
the trajectories of all actors from the P2 module.
■ It then samples a small set of trajectories given the map, high level routing and the
current state of the SDV.
■ The planned trajectory is then computed to be the one with the minimum expected
cost over the predicted futures, as defined by a cost function c that takes into
account safety and comfort.
■ In this case, bad localization gives the planner a wrong idea about the layout of the
static parts of the scene.
The effects of localization error on perception-prediction and motion planning. (Top) The effects of
perturbing the ego-pose on P2. SFDE is the mean displacement error across all samples at the 5s
mark, and mAP@0.7 is the mean average precision evaluated at an IOU of 0.7. (Bottom) The effects
of perturbing the ego-pose on planning. Collision rate is the percentage of examples for which the
planned path collides with another vehicle or pedestrian within the 5s simulation, and L2 human is
the distance between the planned path and the ground truth human-driven path at the 5s mark.
The architecture of the combined localization and perception-prediction (LP2) model.
LP2
■ The system receives LiDAR as input, which is converted to a BEV voxelization with
the channels of the 2D input corresponding to the height dimension.
■ P2 contains four main submodules. (i) A lightweight network processes a rasterized
semantic map centred at the current vehicle pose. (ii) Another neural backbone
extracts features from a coarsely voxelized LiDAR sweep. These two feature maps
are concatenated and passed to (iii) a detector-predictor that encodes the scene into
a latent variable Z, and (iv) a GNN where each node represents a detected actor, and
which deterministically decodes samples into samples of the joint posterior
distribution over all actor trajectories.
■ Ground intensity localization with deep LiDAR embeddings: This localizer learns deep
functions that produce spatial embeddings of both the map and LiDAR sweep before
alignment.
■ Multi-Resolution Feature Sharing: upsample a crop of the LiDAR feature map to
match the feature resolution used in localization, then add the feature maps together
using a weighted sum to produce the final localization embedding (a minimal sketch follows below).
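Here is a minimal sketch of that multi-resolution feature sharing step, assuming standard bilinear upsampling and a learned scalar-weighted sum; the crop selection and the actual weighting scheme in LP2 may differ.

```python
import torch
import torch.nn.functional as F

def share_features(lidar_feats, loc_feats, crop, weights):
    """Upsample a crop of the coarse P2 LiDAR feature map to the resolution of
    the localization embedding and blend the two with a weighted sum.
    lidar_feats: (B, C, h, w); loc_feats: (B, C, H, W); crop: (top, left, h_c, w_c)."""
    top, left, h_c, w_c = crop
    cropped = lidar_feats[:, :, top:top + h_c, left:left + w_c]
    upsampled = F.interpolate(cropped, size=loc_feats.shape[-2:],
                              mode="bilinear", align_corners=False)
    w_lidar, w_loc = weights                        # learned scalars (assumed)
    return w_lidar * upsampled + w_loc * loc_feats  # final localization embedding

# Toy tensors: a coarse 50x50 shared feature map and a 100x100 localization map.
emb = share_features(torch.zeros(1, 32, 50, 50), torch.zeros(1, 32, 100, 100),
                     crop=(10, 10, 25, 25), weights=(0.5, 0.5))
```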
Thanks
 
Dotnet t-drive enhancing driving directions with taxi drivers’ intelligence
Dotnet  t-drive enhancing driving directions with taxi drivers’ intelligenceDotnet  t-drive enhancing driving directions with taxi drivers’ intelligence
Dotnet t-drive enhancing driving directions with taxi drivers’ intelligence
 
Double Deep Q-Learning and Faster R-CNN-Based Autonomous Vehicle Navigation a...
Double Deep Q-Learning and Faster R-CNN-Based Autonomous Vehicle Navigation a...Double Deep Q-Learning and Faster R-CNN-Based Autonomous Vehicle Navigation a...
Double Deep Q-Learning and Faster R-CNN-Based Autonomous Vehicle Navigation a...
 
(Paper) A Method for Pedestrian Position Estimation using Inter-Vehicle Comm...
 (Paper) A Method for Pedestrian Position Estimation using Inter-Vehicle Comm... (Paper) A Method for Pedestrian Position Estimation using Inter-Vehicle Comm...
(Paper) A Method for Pedestrian Position Estimation using Inter-Vehicle Comm...
 
Week 8 Expository Essay 2 Revision 65 pointsContent and De.docx
Week 8 Expository Essay 2 Revision 65 pointsContent and De.docxWeek 8 Expository Essay 2 Revision 65 pointsContent and De.docx
Week 8 Expository Essay 2 Revision 65 pointsContent and De.docx
 
FinalReport
FinalReportFinalReport
FinalReport
 
Lane Detection and Obstacle Aviodance Revised
Lane Detection and Obstacle Aviodance RevisedLane Detection and Obstacle Aviodance Revised
Lane Detection and Obstacle Aviodance Revised
 
Review of Environment Perception for Intelligent Vehicles
Review of Environment Perception for Intelligent VehiclesReview of Environment Perception for Intelligent Vehicles
Review of Environment Perception for Intelligent Vehicles
 
IRJET-To Analyze Calibration of Car-Following Behavior of Vehicles
IRJET-To Analyze Calibration of Car-Following Behavior of VehiclesIRJET-To Analyze Calibration of Car-Following Behavior of Vehicles
IRJET-To Analyze Calibration of Car-Following Behavior of Vehicles
 
T drive enhancing driving directions with taxi drivers’ intelligence
T drive enhancing driving directions with taxi drivers’ intelligenceT drive enhancing driving directions with taxi drivers’ intelligence
T drive enhancing driving directions with taxi drivers’ intelligence
 
JAVA 2013 IEEE DATAMINING PROJECT T drive enhancing driving directions with t...
JAVA 2013 IEEE DATAMINING PROJECT T drive enhancing driving directions with t...JAVA 2013 IEEE DATAMINING PROJECT T drive enhancing driving directions with t...
JAVA 2013 IEEE DATAMINING PROJECT T drive enhancing driving directions with t...
 
Zeleke_Poster14
Zeleke_Poster14Zeleke_Poster14
Zeleke_Poster14
 

Mehr von Yu Huang

Mehr von Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucks
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 

Kürzlich hochgeladen

Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 

Kürzlich hochgeladen (20)

Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 

Jointly mapping, localization, perception, prediction and planning

  • 1. Jointly Mapping, Localization, Perception Prediction and Planning (from Raquel Urtasum’s papers) Yu Huang Sunnyvale, California Yu.huang07@gmail.com
  • 2. References ■ Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net ■ IntentNet: Learning to Predict Intention from Raw Sensor Data ■ Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion ■ Learning Lane Graph Representations for Motion Forecasting ■ Implicit Latent Variable Model for Scene- Consistent Motion Forecasting ■ PnPNet: End-to-End Perception and Prediction with Tracking in the Loop ■ Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations ■ End-to-end Interpretable Neural Motion Planner ■ DSDNet: Deep Structured Self-Driving Network ■ MP3: A Unified Model to Map, Perceive, Predict and Plan ■ LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving ■ Deep Multi-Task Learning for Joint Localization, Perception, and Prediction
  • 3. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net ■ deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. ■ By jointly reasoning about these tasks, this holistic approach is more robust to occlusion as well as sparse data at range. ■ This approach performs 3D convolutions across space and time over a bird’s eye view representation of the 3D world, which is very efficient in terms of both memory and computation. ■ By sharing computation, perform all tasks in as little as 30 ms.
  • 4. Overview: FaF network takes multiple frames as input and performs detection, tracking and motion forecasting. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
  • 5. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
  • 6. Motion forecasting Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
  • 7. IntentNet: Learning to Predict Intention from Raw Sensor Data ■ In order to plan a safe maneuver, self-driving vehicles need to understand the intent of other traffic participants. ■ intent as a combination of discrete high level behaviors as well as continuous trajectories describing future motion. ■ A one-stage detector and forecaster that exploits both 3D point clouds produced by a LiDAR sensor as well as dynamic maps of the environment. ■ This multi-task model reduce reaction time in self-driving applications.
  • 8. IntentNet: Learning to Predict Intention from Raw Sensor Data
  • 9. IntentNet: Learning to Predict Intention from Raw Sensor Data
  • 10. IntentNet: Learning to Predict Intention from Raw Sensor Data
  • 11. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ The motion planners used in self-driving vehicles need to generate trajectories that are safe, comfortable, and obey the traffic rules. ■ This is usually achieved by two modules: behavior planner, which handles high-level decisions and produces a coarse trajectory, and trajectory planner that generates a smooth, feasible trajectory for the duration of the planning horizon. ■ These planners, however, are typically developed separately, and changes in the behavior planner might affect the trajectory planner in unexpected ways. ■ Furthermore, the final trajectory outputted by the trajectory planner might differ significantly from the one generated by the behavior planner, as they do not share the same objective. ■ This paper proposes a jointly learnable behavior and trajectory planner, while unlike most existing learnable motion planners that address either only behavior planning, or use an uninterpretable neural network to represent the entire logic from sensors to driving commands. ■ This approach features an interpretable cost function on top of perception, prediction and vehicle dynamics, and a joint learning algorithm that learns a shared cost function employed by the behavior and trajectory components.
  • 12. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles The learnable motion planner has discrete and continuous components, minimizing the same cost function with the same set of learned cost weights.
  • 13. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles A: Given a scenario, we generate a set of possible SDV (self-driving vehicle) behaviors. B: Left and right lane boundaries and the driving path that are relevant to the intended behavior are considered in the cost function. C: The SDV geometry for the spatiotemporal overlap cost is approximated using circles. D: The SDV yields to pedestrians through stop lines on the driving paths.
  • 14. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ The objective of the planner is then to find a behavior and a trajectory that is safe, comfortable, and progresses along the route. ■ Given sets of candidate behaviors and trajectories, the cost function is used to choose the best (b, τ). ■ The cost function consists of sub-costs c that focus on different aspects of the trajectories such as safety, comfort, feasibility, mission completion, and traffic rules. ■ Obstacles: A safe trajectory for the SDV should not only be collision free, but also satisfy a safety distance to the surrounding obstacles, including both static and dynamic objects such as vehicles, pedestrians, cyclists, unknown objects, etc. The overlap cost c_overlap is 1 if a trajectory violates the spatial occupancy of any obstacle in a given predicted trajectory, and is averaged across all possible predicted trajectories weighted by their probabilities. The obstacle cost c_obstacle penalizes the squared violation of the safety distance d_safe. This cost is scaled by the speed of the SDV, making the distance violation more costly at higher speeds.
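To make the cost structure above concrete, below is a minimal numpy sketch of a speed-scaled obstacle sub-cost and of assembling the total cost as a weighted sum of interpretable sub-costs. The safety distance, the weights, and all helper names are illustrative assumptions, not the paper's implementation or learned values.

```python
import numpy as np

def obstacle_cost(dists_to_obstacles, sdv_speeds, d_safe=2.0):
    """Penalize squared violation of the safety distance, scaled by SDV speed.

    dists_to_obstacles: (T, N) distances from the SDV to N obstacles at T timesteps.
    sdv_speeds:         (T,)   SDV speed at each timestep.
    """
    violation = np.maximum(d_safe - dists_to_obstacles, 0.0)   # (T, N)
    per_step = (violation ** 2).sum(axis=1)                    # (T,)
    return float((sdv_speeds * per_step).sum())                # costlier at high speed

def total_cost(sub_costs, weights):
    """Linear combination of interpretable sub-costs: c(b, tau) = sum_i w_i * c_i."""
    return sum(weights[name] * value for name, value in sub_costs.items())

# toy example: one candidate trajectory over a 3-step horizon with 2 obstacles
dists = np.array([[5.0, 3.0], [4.0, 1.5], [3.5, 1.0]])
speeds = np.array([8.0, 8.0, 7.5])
sub_costs = {"obstacle": obstacle_cost(dists, speeds), "path": 0.4, "dynamics": 1.2}
weights = {"obstacle": 10.0, "path": 1.0, "dynamics": 0.5}
print(total_cost(sub_costs, weights))
```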
  • 15. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ Driving-path and lane boundary: The SDV is expected to adhere to the structure of the road, i.e., it should not go out of the lane boundary and should stay close to the center of the lane. The driving-path cost c_path is the squared distance towards the driving path. The lane boundary cost c_lane is the squared violation distance of a safety threshold. ■ Headway: The SDV should keep a safe longitudinal distance that depends on the speed of the SDV and the leading vehicle for lane-following or lane-change behavior. The headway cost is computed as the violation of the safety distance after applying a comfortable constant deceleration, assuming that the leading vehicle applies a hard brake. ■ Yield: When a pedestrian is predicted to reach the boundary of the SDV lane or cross it, they impose a stopping point at a safe longitudinal distance and penalize any trajectory that violates it. The yield cost c_yield penalizes the squared longitudinal violation distance weighted by the pedestrian prediction probability.
  • 16. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles Left: Headway cost penalizes unsafe distance to leading vehicles. Right: for each sampled trajectory, a weight function determines how relevant an obstacle is to the SDV in terms of its lateral offset.
  • 17. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ Route: A behavior is desirable if the goal lane is closer to the route than the current lane, so they penalize the number of lane changes required to converge to the route. ■ Cost-to-go: they compute the deceleration needed for slowing down to possible upcoming speed limits and use the square of the violation of the comfortable deceleration as the cost-to-go. ■ Speed limit, travel distance and dynamics: Using the speed limit of a lane, which is available in the map data, they introduce a cost that penalizes a trajectory if it goes above the eligible speed. The speed limit cost c_speed is the squared violation in speed. In order to favor trajectories that advance along the route, they use the travelled longitudinal distance as a reward. Since the SDV is physically limited to certain ranges of acceleration, curvature, etc., they prune trajectories that violate such constraints. Additionally, they introduce costs that penalize aggressive motions to promote comfortable driving. Specifically, the dynamics cost c_dyn consists of the squared values of jerk and violation thereof, acceleration and violation thereof, lateral acceleration and violation thereof, lateral jerk and violation thereof, curvature, twist, and wrench.
  • 18. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ The inference process contains two stages of optimization. – In the behavioral planning stage, it adopts a coarse-level parameterization for trajectory generation. The resulting trajectory is found by selecting the one with the lowest cost. – In the trajectory planning stage, it uses a fine-level parameterization where they model the trajectory as a function of vehicle control variables. The trajectory is initialized with the output of the behavior planner, and optimized through a continuous optimization solver.
  • 19. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles Example trajectories in a nudging scenario. Behavioral decisions include obstacle side assignment and lane information, which are sent through the behavioral trajectory interface.
  • 20. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ It uses a combination of a max-margin objective and an imitation learning objective as the loss function; the slide defines the max-margin loss, the imitation learning loss, and the gradient of the overall learning objective.
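As a rough illustration of the max-margin part of such an objective, here is a hedged numpy sketch in which the human-driven trajectory is pushed to have lower learned cost than any sampled candidate by a task-loss margin. The margin definition and the hinge-over-the-worst-sample choice are assumptions made for illustration; the paper's exact formulation is the one given on the slide.

```python
import numpy as np

def max_margin_loss(cost_expert, costs_samples, margins):
    """Hinge loss: the expert trajectory should be cheaper than every sampled
    trajectory by a task-loss margin; only the most violating sample contributes.

    cost_expert:   scalar learned cost of the human-driven trajectory.
    costs_samples: (K,) learned costs of K sampled candidate trajectories.
    margins:       (K,) task loss (e.g. L2 distance to the expert) per sample.
    """
    violations = cost_expert - costs_samples + margins   # want this to be <= 0
    return float(np.maximum(violations, 0.0).max())

costs_samples = np.array([3.0, 2.5, 4.0])
margins = np.array([1.0, 0.5, 2.0])
print(max_margin_loss(cost_expert=2.2, costs_samples=costs_samples, margins=margins))
```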
  • 21. Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles ■ Two real-world driving datasets: ManualDrive is a set of human driving recordings where the drivers are instructed to drive smoothly and carefully respecting all traffic rules; TOR-4D is composed of very challenging scenarios.
  • 22. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion ■ LiRaNet, an end-to-end trajectory prediction method which utilizes radar sensor information along with widely used lidar and high definition (HD) maps. ■ Automotive radar provides rich, complementary information, allowing for longer range vehicle detection as well as instantaneous radial velocity measurements. ■ However, there are factors that make the fusion of lidar and radar information challenging, such as the relatively low angular resolution of radar measurements, their sparsity and the lack of exact time synchronization with lidar. ■ To overcome these challenges, propose an efficient spatio-temporal radar feature extraction scheme which achieves state-of-the-art performance on multiple large-scale datasets.
  • 23. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion An example scene from X17k in bird’s eye view where lidar points (light blue) and radar point velocities (orange) are visualized with labels (white) for current, past and future frames. Vehicle A is a turning bus that has multiple radar points across frames. By effectively combining them over space and time a full 2D velocity and turning rate can be recovered. Vehicle B shows the high positional noise that inherently comes with radar. Vehicle C shows a case with sparse lidar points where implicitly associating them across time can be challenging. However, radar points present around C can add context for the model to detect and predict the trajectory.
  • 24. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion LiRaNet overview: The radar feature extraction network (A) extracts spatio-temporal features from raw radar points in 2 steps: (1) for each frame we create a graph between the BEV grid cells and radar points to learn spatial features of each cell using a non-rigid convolution, (2) these spatial features are further fused temporally by stacking across channel dimension and using an MLP to get a radar feature volume. This feature volume is then fused with feature volumes from other sensors and fed to a joint perception-prediction network (B) which produces detections and their future trajectories. An example prediction for a scene from X17k can be seen in (C).
  • 25. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion ■ The input domain consists of radar points and the output domain consists of BEV cells. ■ For each cell j, calculate the features h_j^m for sweep m as given on the slide, where A_j^m is the set of associated radar points, x_i^m is the 2D coordinate of the associated radar point, x_j^m is the 2D coordinate of the BEV cell's center, ⊕ denotes the concatenation operation, f_i^m is the feature vector of the radar point and g^m(·) is a multi-layer perceptron (MLP) with learnable weights shared across all the cells. ■ Calculate A_j^m using a nearest-neighbor algorithm with a distance threshold. ■ By using a threshold larger than the size of a cell, this method compensates for positional errors in radar. ■ For each cell j, calculate the final spatio-temporal feature vector h_j by concatenating the per-sweep features h_j^m and using an MLP to combine them.
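The following PyTorch sketch mirrors the described per-cell aggregation: radar points are associated to BEV cells by a distance threshold, an MLP is applied to each point feature concatenated with its offset to the cell center, the results are summed per cell, and the per-sweep features are concatenated and mixed by a second MLP. All sizes, the sum aggregation, and the module names are assumptions made for a self-contained example rather than LiRaNet's exact architecture.

```python
import torch
import torch.nn as nn

class RadarCellFeatures(nn.Module):
    """Sketch of per-cell spatio-temporal radar features (assumed shapes/sizes)."""
    def __init__(self, point_dim=4, hidden=32, num_sweeps=5, radius=3.0):
        super().__init__()
        self.radius = radius
        self.spatial_mlp = nn.Sequential(nn.Linear(point_dim + 2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden))
        self.temporal_mlp = nn.Sequential(nn.Linear(hidden * num_sweeps, hidden), nn.ReLU())

    def forward(self, cell_centers, sweeps):
        """cell_centers: (C, 2) BEV cell centers.
        sweeps: list of num_sweeps tuples (points_xy (N, 2), feats (N, point_dim))."""
        per_sweep = []
        for points_xy, feats in sweeps:
            # associate radar points to cells with a distance threshold
            d = torch.cdist(cell_centers, points_xy)                      # (C, N)
            assoc = (d < self.radius).float()                             # (C, N)
            offsets = points_xy.unsqueeze(0) - cell_centers.unsqueeze(1)  # (C, N, 2)
            inp = torch.cat([feats.unsqueeze(0).expand(cell_centers.size(0), -1, -1),
                             offsets], dim=-1)                            # (C, N, point_dim + 2)
            h = self.spatial_mlp(inp) * assoc.unsqueeze(-1)               # mask non-associated points
            per_sweep.append(h.sum(dim=1))                                # (C, hidden)
        return self.temporal_mlp(torch.cat(per_sweep, dim=-1))            # (C, hidden)

# toy usage: 10 BEV cells, 5 radar sweeps with 6 points each
cells = torch.rand(10, 2) * 20
sweeps = [(torch.rand(6, 2) * 20, torch.rand(6, 4)) for _ in range(5)]
print(RadarCellFeatures()(cells, sweeps).shape)   # torch.Size([10, 32])
```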
  • 26. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion Radar is fused with LiDAR and Map to generate multi-sensor feature volume in BEV which is used for detection and trajectory prediction. Each learnable operator is denoted as (op, kernel, stride, number of features). Conv denotes convolution and Res denotes a residual block.
  • 27. LiRaNet: E2E Trajectory Prediction using Spatio-Temporal Radar Fusion
  • 28. Learning Lane Graph Representations for Motion Forecasting ■ A motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. ■ Instead of encoding vectorized maps as raster images, construct a lane graph from raw map data to explicitly preserve the map structure. ■ To capture the complex topology and long range dependencies of the lane graph, propose LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane dilation. ■ To capture the complex interactions between actors and maps, exploit a fusion network consisting of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor. ■ Powered by LaneGCN and actor-map interactions, the model is able to predict accurate and realistic multi-modal trajectories. ■ This approach significantly outperforms the state-of-the-art on the large-scale Argoverse motion forecasting benchmark. ■ https://github.com/uber-research/LaneGCN
  • 29. Learning Lane Graph Representations for Motion Forecasting It constructs a lane graph from raw map data and use LaneGCN to extract map features. In parallel, ActorNet extracts actor features from observed past trajectories. Then it uses FusionNet to model the Interactions between actors themselves and the map, and predict the future trajectories.
  • 30. Learning Lane Graph Representations for Motion Forecasting The model is composed of four modules. (1) ActorNet receives the past actor trajectories as input, and uses 1D convolution to extract actor node features. (2) MapNet constructs a lane graph from HD maps, and uses a LaneGCN to extract lane node features. (3) FusionNet is a stack of 4 interaction blocks. The actor to lane block fuses real-time traffic information from actor nodes to lane nodes. The lane to lane block propagates information over the lane graph and updates lane features. The lane to actor block fuses updated map information from lane nodes to actor nodes. The actor to actor block performs interactions among actors. It uses another LaneGCN for the lane to lane block, and spatial attention layers for the other blocks. (4) The prediction header uses after-fusion actor features to produce multi-modal trajectories.
  • 31. Learning Lane Graph Representations for Motion Forecasting LaneGCN is a stack of 4 multi-scale LaneConv residual blocks, each of which consists of a LaneConv (1, 2, 4, 8, 16, 32) and a linear layer with a residual connection. All layers have 128 feature channels. Left: The lane centerline of interest, its predecessor, successor, left and right neighbor are denoted with red, orange, blue, purple, and green lines, respectively. Each centerline is given as a sequence of BEV points (hollow circles). Right: Derived lane graph with an example lane node. The lane node of interest, its predecessor, successor, left and right neighbor are denoted with red, orange, blue, purple and green circles respectively.
  • 32. Learning Lane Graph Representations for Motion Forecasting LaneConv Operator: the node feature is parameterized as shown on the slide, and the LaneConv operator is defined with A_i and W_i being the adjacency and weight matrices corresponding to the i-th connection type, respectively. MLP indicates a multi-layer perceptron and the two subscripts refer to shape and location, respectively. The k-dilation LaneConv operator uses A_pre^k, the k-th matrix power of A_pre. In regular grid graphs, a dilated convolution operator can effectively capture long-range dependencies by enlarging the receptive field.
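A small numpy sketch of a LaneConv-style update is given below: node features are mixed with one weight matrix per connection type through the corresponding adjacency matrix, and dilation along the lane direction is obtained from matrix powers of the predecessor adjacency. Shapes, weights, and the toy chain graph are assumptions; the exact operator is the one defined by the slide's equations.

```python
import numpy as np

def lane_conv(X, adj, weights, W0):
    """LaneConv-style update (sketch): Y = X @ W0 + sum_i A_i @ X @ W_i.

    X:       (N, F) lane-node features.
    adj:     dict of connection type -> (N, N) adjacency matrix
             (e.g. predecessor, successor, left, right).
    weights: dict of connection type -> (F, F) weight matrix.
    W0:      (F, F) weight matrix for the node itself.
    """
    Y = X @ W0
    for name, A in adj.items():
        Y = Y + A @ X @ weights[name]
    return Y

def dilated_pre_adj(A_pre, k):
    """k-dilation along the lane direction: the k-th matrix power of A_pre."""
    return np.linalg.matrix_power(A_pre, k)

# toy lane graph: 4 lane nodes in a chain
N, F = 4, 8
A_pre = np.eye(N, k=-1)            # predecessor connectivity
adj = {"pre": A_pre, "suc": A_pre.T, "pre_dilated": dilated_pre_adj(A_pre, 2)}
rng = np.random.default_rng(0)
X = rng.standard_normal((N, F))
weights = {name: rng.standard_normal((F, F)) * 0.1 for name in adj}
W0 = np.eye(F)
print(lane_conv(X, adj, weights, W0).shape)   # (4, 8)
```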
  • 33. Learning Lane Graph Representations for Motion Forecasting LaneGCN: LaneConv(k1, ..., kC) denotes this multi-scale layer. • In the model, spatial attention and LaneGCN are used to capture a complete set of actor-map interactions. FusionNet: • Build a stack of four fusion modules to capture all information between actors and lane nodes, i.e., actors to lanes (A2L), lanes to lanes (L2L), lanes to actors (L2A) and actors to actors (A2A). • The L2L module is implemented with LaneGCN, while the other three modules use the spatial attention layer shown on the slide. Prediction Header: The header has two branches, a regression branch to predict the trajectory of each mode and a classification branch to predict the confidence score of each mode.
  • 34. Learning Lane Graph Representations for Motion Forecasting
  • 35. Implicit Latent Variable Model for Scene- Consistent Motion Forecasting ■ In this paper, aim to learn scene-consistent motion forecasts of complex urban traffic directly from sensor data. ■ In particular, propose to characterize the joint distribution over future trajectories via an implicit latent variable model. ■ It models the scene as an interaction graph and employs powerful graph neural networks to learn a distributed latent representation of the scene. ■ Coupled with a deterministic decoder, obtain trajectory samples that are consistent across traffic participants, achieving state-of-the-art results in motion forecasting and interaction understanding. ■ These motion forecasts result in safer and more comfortable motion planning.
  • 36. Implicit Latent Variable Model for Scene- Consistent Motion Forecasting Graphical models of trajectory distribution. Dashed arrows/circles denote that only some approaches within the group use those components. Double circle in (c) denotes that it is a deterministic mapping of its inputs. Actor Feature Extraction. Given LiDAR and maps, the backbone CNN detects the actors in the scene, and individual feature vectors per actor are extracted via RRoI Align, followed by a CNN with spatial pooling.
  • 37. Implicit Latent Variable Model for Scene- Consistent Motion Forecasting Implicit Latent Variable Model encodes the scene into a latent space, from which it can efficiently sample multiple future realizations in parallel, each with socially consistent trajectories.
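The sketch below caricatures the implicit latent variable idea in PyTorch: a prior network maps actor features to a latent distribution, one joint latent draw is taken per future, and a deterministic decoder turns that draw into all actors' trajectories, so trajectories within one sample come from the same latent realization. In the paper the prior and decoder are graph neural networks over an interaction graph; the per-actor MLP decoder and all dimensions here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TinySceneLatentModel(nn.Module):
    """Minimal sketch of an implicit latent variable model over a scene:
    a prior maps actor features to a per-actor latent distribution; a
    deterministic decoder maps one joint latent sample (all actors at once)
    to all actors' future trajectories, so each sample is scene-consistent."""
    def __init__(self, feat_dim=16, latent_dim=8, horizon=10):
        super().__init__()
        self.prior = nn.Linear(feat_dim, 2 * latent_dim)          # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(feat_dim + latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, horizon * 2))  # (x, y) per future step
        self.horizon = horizon

    def sample_futures(self, actor_feats, num_samples=5):
        """actor_feats: (N, feat_dim) -> (num_samples, N, horizon, 2)."""
        mu, logvar = self.prior(actor_feats).chunk(2, dim=-1)
        out = []
        for _ in range(num_samples):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # one joint latent draw
            traj = self.decoder(torch.cat([actor_feats, z], dim=-1))
            out.append(traj.view(-1, self.horizon, 2))
        return torch.stack(out)

feats = torch.randn(3, 16)                                  # 3 detected actors
print(TinySceneLatentModel().sample_futures(feats).shape)   # (5, 3, 10, 2)
```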
  • 38. Implicit Latent Variable Model for Scene- Consistent Motion Forecasting
  • 39. Implicit Latent Variable Model for Scene- Consistent Motion Forecasting
  • 40. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop ■ The problem of joint perception and motion forecasting in the context of self-driving vehicles. ■ Towards this goal, PnPNet, an end-to-end model that takes as input sequential sensor data, and outputs at each time step object tracks and their future trajectories. ■ The key component is a tracking module that generates object tracks online from detections and exploits trajectory-level features for motion forecasting. ■ Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. ■ Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. ■ Validate PnPNet on two large-scale driving datasets, showing improvements over the state-of-the-art with better occlusion recovery and more accurate future prediction.
  • 41. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop Three paradigms for perception and prediction. Traditional approach (a) adopts the modular design that decomposes the stack into subtasks and solves them with individual models. End-to-end method (b) uses a joint model to solve detection and prediction simultaneously, but performs tracking as post-processing. As a result, the full temporal history contained in tracks is not used by detection and prediction. This approach (c) brings tracking into the loop so that all tasks benefit from rich temporal context.
  • 42. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop PnPNet for end-to-end perception and prediction. The model consists of three modules that perform 3D object detection, discrete-continuous tracking, and motion forecasting sequentially. To extract trajectory level actor representations used for tracking and prediction, also equip the model with two explicit memories: one for global sensor feature maps, and one for past object trajectories. Both memories get updated at each time step with up-to-date sensor features and tracking results.
  • 43. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop The trajectory level object representation. Given an object trajectory, we first extract its sensor observation and motion features at each time step, and then apply an LSTM network to model the temporal dynamics
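A minimal PyTorch sketch of such a trajectory-level representation: the per-time-step sensor and motion features of one track are concatenated and fed to an LSTM whose final hidden state summarizes the track. Feature dimensions and names are assumptions, not PnPNet's actual sizes.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Sketch: fuse per-step sensor and motion features of one object track
    with an LSTM; the last hidden state is the trajectory-level representation."""
    def __init__(self, obs_dim=64, motion_dim=4, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + motion_dim, hidden, batch_first=True)

    def forward(self, obs_feats, motion_feats):
        """obs_feats:    (B, T, obs_dim) sensor features along the track.
        motion_feats: (B, T, motion_dim), e.g. (x, y, heading, speed) per step."""
        x = torch.cat([obs_feats, motion_feats], dim=-1)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                      # (B, hidden)

enc = TrajectoryEncoder()
print(enc(torch.randn(2, 5, 64), torch.randn(2, 5, 4)).shape)   # torch.Size([2, 128])
```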
  • 44. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop
  • 45. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations ■ In this paper, propose an end-to-end learnable network that performs joint perception, prediction and motion planning for self-driving vehicles and produces interpretable intermediate representations. ■ Unlike existing neural motion planners, the motion planning costs here are consistent with the perception and prediction estimates. ■ This is achieved by a novel differentiable semantic occupancy representation that is explicitly used as cost by the motion planning process. ■ This network is learned end-to-end from human demonstrations. ■ The experiments in a large-scale manual-driving dataset and closed-loop simulation show that the proposed model significantly outperforms state-of-the-art planners in imitating the human behaviors while producing much safer trajectories.
  • 46. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations The overview of the end-to-end learnable autonomy system that takes raw sensor data, an HD map and a high-level route as input and produces safe maneuvers for the self-driving vehicle via our novel semantic interpretable intermediate representations.
  • 47. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations Semantic classes in occupancy forecasting. Colors match between drawing and hierarchy. Shadowed area corresponds to SDV route. Black vehicle, pedestrian and bike icons represent the agents' true current location.
  • 48. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations Inference diagram of the perception and recurrent occupancy forecasting model. || symbolizes concatenation along the feature dimension, ⊕ element-wise sum, and ∆ bilinear interpolation used to downscale the occupancy.
  • 49. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations Examples of the motion planner cost functions: (a) collision, (b) driving-path, (c) lane boundary, (d) traffic light, (e) comfort, (f) route, (g) progress. The costs related to comfort, traffic rules and progress along the route, as well as the safety cost, are defined by the equations on the slide.
  • 50. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations Learn the model parameters by exploiting two loss functions, a semantic occupancy loss and a planning loss, defined by the equations on the slide.
  • 51. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations
  • 52. End-to-end Interpretable Neural Motion Planner ■ A neural motion planner (NMP) for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. ■ Towards this goal, design a holistic model that takes as input raw LIDAR data and a HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. ■ Then sample a set of diverse physically possible trajectories and choose the one with the minimum learned cost. ■ Importantly, the cost volume is able to naturally capture multi-modality.
  • 53. End-to-end interpretable neural motion planner (NMP). Backbone network takes LiDAR data and maps as inputs, and outputs bounding boxes of other actors for future timesteps (perception), as well as a cost volume for planning with T filters. Next, for each trajectory proposal from the sampler, its cost is indexed from different filters of the cost volume and summed together. The trajectory with the minimal cost will be our final planning.
  • 54. NMP ■ An end-to-end learnable motion planner that generates accurate space-time trajectories over a planning horizon of a few seconds. ■ Importantly, the model takes as input LiDAR point clouds and an HD map and produces interpretable intermediate representations in the form of 3D detections and their future motion forecasted over the planning horizon. ■ The final output representation is a space-time cost volume that represents the “goodness” of each possible location that the SDV can take within the planning horizon. ■ The planner then scores a series of trajectory proposals using the learned cost volume and chooses the one with the minimum cost. ■ Train the model end-to-end with a multi-task objective. The planning loss encourages the minimum-cost plan to be similar to the trajectory performed by human demonstrators. ■ Note: this loss is sparse as a GT trajectory only occupies a small portion of the space. ■ Introduce another perception loss that encourages the intermediate representations to produce accurate 3D detections and motion forecasting. ■ This ensures the interpretability of the intermediate representations and enables much faster learning.
  • 55. NMP ■ Given the input LiDAR sweeps and the HD map, compute the corresponding cost volume c by feedforward convolutional operations. ■ The final trajectory can then be computed by minimizing the trajectory cost under this learned cost volume. ■ Note, however, that this optimization is NP-hard. ■ Thus rely on sampling to obtain a low-cost trajectory. Towards this goal, sample a wide variety of trajectories that can be executed by the SDV and produce as final output the one with minimal cost according to the learned cost volume, using a procedure that efficiently samples physically possible trajectories during inference. ■ Since the cost of a trajectory is computed by indexing from the cost volume, the planner is fast enough for real-time inference.
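Conceptually, scoring a candidate trajectory amounts to indexing the space-time cost volume at the cells the trajectory visits and summing over time; the numpy sketch below does exactly that for a batch of sampled trajectories and returns the cheapest one. The grid resolution, clipping, and shapes are assumptions made for illustration.

```python
import numpy as np

def plan_from_cost_volume(cost_volume, trajectories, resolution=0.5):
    """Score sampled trajectories by indexing a learned space-time cost volume
    and return the index of the cheapest one.

    cost_volume:  (T, H, W) learned cost per BEV cell and planning timestep.
    trajectories: (K, T, 2) candidate SDV (x, y) waypoints in metres.
    resolution:   metres per BEV cell.
    """
    T, H, W = cost_volume.shape
    idx = np.clip((trajectories / resolution).astype(int), 0, [H - 1, W - 1])  # (K, T, 2)
    t = np.arange(T)
    costs = cost_volume[t, idx[:, :, 0], idx[:, :, 1]].sum(axis=1)             # (K,)
    return int(costs.argmin()), costs

rng = np.random.default_rng(0)
cost_volume = rng.random((10, 100, 100))
trajectories = rng.random((50, 10, 2)) * 40        # 50 candidates, 10 timesteps
best, costs = plan_from_cost_volume(cost_volume, trajectories)
print(best, costs[best])
```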
  • 56. Trajectory Representation. First sample a set of parameters of a Clothoid to determine the shape of a trajectory. Then sample a velocity profile to determine how fast the self-driving vehicle (SDV) goes along this trajectory. Combining these two, get a space-time trajectory. A Clothoid curve, also known as an Euler spiral or Cornu spiral, is used to represent the 2D path of the SDV.
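The following numpy sketch shows one way such a representation could be constructed: a Clothoid path whose curvature varies linearly with arc length, combined with a constant-acceleration velocity profile to place the SDV along the path over time. The specific parameter choices and the constant-acceleration profile are simplifying assumptions; the paper samples richer velocity profiles.

```python
import numpy as np

def clothoid_path(kappa0, kappa_rate, length, n=200):
    """2D Clothoid (Euler spiral) sketch: curvature varies linearly with arc length."""
    s = np.linspace(0.0, length, n)
    theta = kappa0 * s + 0.5 * kappa_rate * s ** 2          # heading along the path
    ds = length / (n - 1)
    x = np.cumsum(np.cos(theta)) * ds
    y = np.cumsum(np.sin(theta)) * ds
    return np.stack([x, y], axis=1), s

def space_time_trajectory(path_xy, arc_len, v0, accel, horizon=3.0, dt=0.1):
    """Combine the 2D path with a constant-acceleration velocity profile."""
    t = np.arange(0.0, horizon + 1e-9, dt)
    s_t = np.clip(v0 * t + 0.5 * accel * t ** 2, 0.0, arc_len[-1])  # distance travelled
    x = np.interp(s_t, arc_len, path_xy[:, 0])
    y = np.interp(s_t, arc_len, path_xy[:, 1])
    return np.stack([t, x, y], axis=1)                              # (T, 3): t, x, y

path, s = clothoid_path(kappa0=0.0, kappa_rate=0.02, length=40.0)
print(space_time_trajectory(path, s, v0=8.0, accel=1.0).shape)      # (31, 3)
```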
  • 57. End-to-End Learning The ultimate goal is to plan a safe trajectory while following the rules of traffic. The model must understand where obstacles are and where they will be in the future in order to avoid collisions. Therefore, use multi-task training with supervision from detection, motion forecasting as well as human-driven trajectories for the ego-car. Note that there is no direct supervision for the cost volume; thus adopt a max-margin loss to push the network to learn to discriminate between good and bad trajectories. The overall loss function, combining the perception and planning terms, is given on the slide.
  • 58. Cost Volume across Time. The planned trajectory in red and ground-truth in blue. Overlay lower cost region for different timesteps in the same figure, using different colors (indicated by legend). Detection and corresponding prediction results are in cyan.
  • 59. DSDNet: Deep Structured Self-Driving Network ■ The Deep Structured Self-Driving Network (DSDNet) performs object detection, motion prediction, and motion planning with a single neural network. ■ Towards this goal, develop a deep structured energy-based model which considers the interactions between actors and produces socially consistent multimodal future predictions. ■ Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to plan a safe maneuver by using a structured planning cost. ■ The sample-based formulation makes it possible to overcome the difficulty of probabilistic inference over continuous random variables.
  • 60. DSDNet overview: The model takes LiDAR and map as inputs, processes them with a CNN backbone, and jointly performs object detection, multi-modal socially-consistent prediction, and safe planning under uncertainty. Rainbow patterns mean highly likely actors' future positions predicted by model. In order to plan a safe maneuver, a self-driving vehicle has to first understand its surroundings as well as predict how the future might evolve. It should then plan its motion by considering all possibilities of the future weighting them properly. This is not trivial as the future is very multi-modal and actors interact with each other. Moreover, the inference procedure needs to be performed in a fraction of a second in order to have practical value.
  • 61. Backbone Feature Network and Object Detection ■ Let X be the LiDAR point clouds and the HD map given as input to system. Since LiDAR point clouds can be very sparse and the actors‘ motion is an important cue for detection and prediction, use the past 10 LiDAR sweeps (e.g., 1s of measurements) and voxelize them into a 3D tensor. utilize HD maps as they provide a strong prior about the scene. ■ then process this 3D tensor with a deep convolutional network backbone and compute a backbone feature map. employ a single-shot detection header on this feature map to output detection bounding boxes for the actors in the scene. ■ Apply two Conv2D layers separately on feature map, one for classifying if a location is occupied by an actor, the other for regressing the position offset, size, orientation and speed of each actor. ■ prediction and planning modules will then take these detections and the feature map as input to produce both a distribution over the actors‘ behaviors and a safe planning maneuver.
  • 62. Probabilistic Multimodal Social Prediction ■ As actors move on the ground, represent their possible future behavior using a trajectory defined as a sequence of 2D waypoints on birds eye view (BEV) sampled at T discrete timestamps. ■ Note that T is the same duration as our planning horizon, and compute the motion prediction distribution and a motion plan each time a new sensor measurement arises (i.e., every 100ms). ■ Modeling this joint distribution and performing efficient inference is challenging, as each actor has a high-dimensional continuous action space. ■ propose to approximate this high-dimensional continuous space with a finite number of samples, and construct a non-parametric distribution over the sampled space. ■ Also use a combination of straight, circle, and clothoid curves.
  • 63. Probabilistic Multimodal Social Prediction The joint distribution of all actors' future behaviors is defined through an energy E, which decomposes into two terms. The first term encodes the goodness of a future trajectory (independent of other actors), while the second term explicitly encodes the fact that pairs of actors should not collide in the future.
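As a toy illustration of turning sampled-trajectory energies into a distribution, the numpy sketch below sums a per-actor trajectory energy with a pairwise collision energy and converts the energies of K samples into probabilities with a softmax-style renormalization. The energy values are placeholders; in DSDNet both terms are produced by learned networks.

```python
import numpy as np

def joint_energy(traj_energies, collision_energy):
    """E(Y) = sum_i E_traj(y_i) + sum_(i,j) E_coll(y_i, y_j)   (sketch, scalars only)."""
    return float(np.sum(traj_energies) + collision_energy)

def sample_distribution(energies):
    """Non-parametric distribution over K sampled trajectories: p_k ∝ exp(-E_k)."""
    e = energies - energies.min()            # subtract the minimum for numerical stability
    p = np.exp(-e)
    return p / p.sum()

# toy: one actor with K = 4 sampled future trajectories
energies = np.array([1.2, 0.3, 2.5, 0.9])
print(sample_distribution(energies))         # the lowest energy gets the highest probability

# toy: joint energy of one particular choice of trajectories for 3 actors
print(joint_energy(traj_energies=np.array([0.3, 1.1, 0.7]), collision_energy=0.4))
```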
  • 64. Details of the multimodal social prediction module: For each actor, sample a set of physically valid trajectories, and use a neural network E_traj to assign energies (probabilities) to them. To make different actors' behaviors socially consistent, employ message passing steps which explicitly model interactions and can encode human prior knowledge (collision avoidance). The final predicted socially-consistent distribution is shown on top right. Neural header for evaluating E_traj and C_traj.
  • 65. Probabilistic Multimodal Social Prediction ■ For safety, our motion planner needs to take every actor's possible futures into consideration. Therefore, motion forecasting needs to infer the probability of each actor taking a particular future trajectory. ■ utilize sum-product message passing to estimate the marginal distribution per actor, taking into account the effects of all other actors by marginalization. ■ The marginal reflects the uncertainty and multimodality in an actor's future behavior and will be leveraged by the planner. ■ Through this message passing procedure, actors communicate with each other their future intentions and how probable those intentions are. The collision energy helps to coordinate intentions from different actors such that the behaviors are compliant and do not result in collision.
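Below is a hedged numpy sketch of loopy sum-product message passing over a fully connected actor graph with K trajectory samples per actor: the unary terms come from exponentiated trajectory energies, the pairwise terms from exponentiated collision energies, and the returned beliefs approximate each actor's marginal. The update schedule, normalization, and toy potentials are assumptions made for illustration.

```python
import numpy as np

def sum_product_marginals(unary, pairwise, iters=10):
    """Loopy sum-product sketch: approximate per-actor marginals over K
    trajectory samples under pairwise collision potentials.

    unary:    (N, K)       exp(-E_traj), one row per actor.
    pairwise: (N, N, K, K) exp(-E_coll); pairwise[j, i][a, b] couples actor j
              taking sample a with actor i taking sample b.
    """
    N, K = unary.shape
    msgs = np.ones((N, N, K))                      # msgs[j, i]: message from j to i, over y_i
    for _ in range(iters):
        new = np.ones_like(msgs)
        for j in range(N):
            for i in range(N):
                if i == j:
                    continue
                incoming = np.ones(K)
                for k in range(N):                 # product of messages into j, except from i
                    if k not in (i, j):
                        incoming *= msgs[k, j]
                b_j = unary[j] * incoming
                m = pairwise[j, i].T @ b_j         # marginalize over actor j's samples
                new[j, i] = m / m.sum()            # normalize for numerical stability
        msgs = new
    beliefs = unary * np.prod(msgs, axis=0)
    return beliefs / beliefs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, K = 3, 4
unary = rng.random((N, K)) + 0.1
pairwise = rng.random((N, N, K, K)) + 0.1
print(sum_product_marginals(unary, pairwise))
```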
  • 66. Safe Motion Planning under Uncertain Future ■ The motion planning module fulfills the final goal, that is, navigating towards a destination while avoiding collision and obeying traffic rules. ■ Towards this goal, build a cost function C, which assigns lower cost values to “good” trajectory proposals and higher values to “bad” ones. ■ Planning is performed by finding the optimal trajectory with the minimum cost. ■ Planning Cost: ■ conduct exact minimization over P. C_traj is a network-based cost, and all K possible trajectories are evaluated with a single batch forward pass. C_coll is computed with a GPU-based collision checker. Efficiently evaluate all K samples and select the minimum-cost trajectory as the final planning result.
  • 67. Learning: train the full model (backbone, detection, prediction and planning) jointly with a multi-task loss. Planning loss: utilize a max-margin loss, using the expert behavior as a positive example and randomly sampled trajectories as negative ones. Detection loss: L_detection is a sum of a classification and a regression loss. Use a cross-entropy classification loss and assign an anchor's label based on its IoU with any actor. The regression loss is a smooth L1 between the model's regression outputs and the ground-truth targets; those targets include position, size, orientation and velocity. Prediction loss: employ the cross-entropy between the discrete distribution and the true target. Once K trajectories are sampled per actor, this loss can be regarded as a standard classification loss over K classes (one for each trajectory sample). The target class is set to be the trajectory sample closest to the ground-truth future trajectory (in L2 distance).
  • 68. Qualitative results on ATG4D. The prediction module can capture multi-modalities when vehicles approach intersections, while being certain and accurate when vehicles drive along a single lane (top). The model can produce smooth trajectories which follow the lane and are compliant with other actors (bottom). Cyan boxes: detection. Cyan trajectory: prediction. Red box: ego-car. Red trajectory: planning. Overlay the predicted marginal distribution for different timestamps with different colors and only show high-probability regions.
  • 69. MP3: A Unified Model to Map, Perceive, Predict and Plan ■ Unfortunately, building HD maps has proven hard to scale due to their cost as well as the requirements they impose in the localization system that has to work everywhere with centimeter-level accuracy. ■ Being able to drive without an HD map would be very beneficial to scale self- driving solutions as well as to increase the failure tolerance of existing ones (e.g., if localization fails or the map is not up-to-date). ■ MP3, an end-to-end approach to mapless driving where the input is raw sensor data and a high-level command (e.g., turn left at the intersection). ■ MP3 predicts intermediate representations in the form of an online map and the current and future state of dynamic agents, and exploits them in a novel neural motion planner to make interpretable decisions taking into account uncertainty.
  • 70. Left: a localization error makes the SDV follow a wrong route when using an HD map, driving into traffic. Right: mapless driving can interpret the scene from sensors and achieve a safe plan that follows a high-level command.
  • 71. MP3 predicts probabilistic scene representations that are leveraged in motion planning as interpretable cost functions.
  • 72. Interpretable Scene representations. For occupancy and motion, visualize all time steps and classes in the same image to save space.
  • 73. The motion field warps the occupancy over time. Transparency denotes probability. Color distinguishes the layers predicted by the network from the future occupancy. This is the particular case of unimodal motion (K = 1).
  • 74. Backbone Network. The output features within one CNN block are fixed. All kernel strides and dilation are 1. Mapping Network. The output features within one CNN block are fixed. All kernel strides are 1.
  • 75. Perception and Prediction Network. It outputs the initial occupancy and current and future motion. The output features within one CNN block are fixed. All kernel strides are 1. Routing Network. The output features within one CNN block are fixed. All kernel strides are 1. The architecture is the same for the 3 branches, but with different learnable parameters.
  • 76. Qualitative results. The predicted scene representations and motion plan for different high-level actions.
  • 77. LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving ■ LOOKOUT, an autonomy system that perceives the environment, predicts a diverse set of futures of how the scene might unroll and estimates the trajectory of the SDV by optimizing a set of contingency plans over these future realizations. ■ In particular, learn a diverse joint distribution over multi-agent future trajectories in a traffic scene that covers a wide range of future modes with high sample efficiency while leveraging the expressive power of generative models. ■ Unlike previous work in diverse motion forecasting, the diversity objective explicitly rewards sampling future scenarios that require distinct reactions from the self-driving vehicle for improved safety. ■ This contingency planner then finds comfortable and non-conservative trajectories that ensure safe reactions to a wide range of future scenarios.
  • 78. The future is highly uncertain and multi-modal, as illustrated by 2 distinct futures at the scene level. In such a scenario, LOOKOUT plans a short-term executable action that leads to 2 different contingent plans to stay safe in both cases. LOOKOUT inference: for the learnable components, the colors denote different training stages. The backbone CNN, actor CNN and prediction decoder are trained first, the diverse sampler next, and lastly the scenario scorer.
  • 79. Joint Perception and Motion Forecasting ■ In order to extract features useful for both detection and motion forecasting, employ a convolutional backbone network, which takes as input a history of voxelized LiDAR sweeps and a raster HD map, both in bird’s eye view (BEV) centered around the SDV. ■ Then perform multi-class object detection with a shallow convolutional header to recognize the presence, BEV pose and dimensions of vehicles, pedestrians and bicyclists, and apply rotated RoI align to extract small feature crops from the scene context around each actor’s location. ■ An actor CNN with max-pooling reduces the feature map of each actor into a feature vector. ■ It parameterizes the trajectory of each actor with a temporal series of the actor centroid in 2-dimensional Euclidean space, where each trajectory is predicted in the actor’s relative coordinate frame in Bird’s Eye View (BEV) defined by its centroid and heading. ■ The latent variable model then characterizes the joint distribution over actors’ trajectories as follows:
  • 80. Joint Perception and Motion Forecasting ■ adopt an implicit decoder, parameterized by a spatially-aware Graph Neural Network (GNN). Since only (X, Y) pairs are obtained from observational data, a posterior (encoder) function, also parameterized by a GNN, is introduced to approximate the true posterior distribution during training. ■ The encoder helps the model learn a powerful decoder, since given only X there could be many feasible Y due to the inherent multi-modality and uncertainty of the future. ■ use binary cross-entropy with hard negative mining per class for the presence of an actor, and Huber loss for the regression targets (i.e., pose and dimension). ■ use the CVAE framework for the latent variable model, which optimizes the evidence lower bound (ELBO) of the log-likelihood.
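For reference, a generic form of the CVAE evidence lower bound mentioned above, with q the approximate posterior (encoder) and p(Z \mid X) the prior (a standard textbook form, not copied from the paper):

\log p(Y \mid X) \;\ge\; \mathbb{E}_{Z \sim q(Z \mid X, Y)}\big[\log p(Y \mid X, Z)\big] \;-\; \mathrm{KL}\big(q(Z \mid X, Y) \,\|\, p(Z \mid X)\big).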
  • 81. Planning Centric Diverse Sampler ■ learn a diverse sampling function that maps the actor contexts coming from sensor data around each actor into a compact set of scene latent samples whose decoded trajectories achieve good coverage. ■ This sampler then replaces Monte Carlo sampling from the prior distribution during inference. ■ To sample a set of latents that are distinct enough from each other to be decoded into a set of diverse futures, use the reparameterization trick to map a shared noise through K latent mappings. ■ To handle the fact that the input can vary in size, parameterize the mapping M with a pair of GNNs: one to generate the means and another to generate the covariances. ■ Both GNNs assume a fully connected graph where each node is anchored to an actor, and initialize the node states.
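A hedged sketch of the shared-noise reparameterization described above, assuming the two GNNs produce per-sample means \mu_k(X) and (diagonal) scales \sigma_k(X) (symbols are illustrative):

\epsilon \sim \mathcal{N}(0, I), \qquad Z_k = \mu_k(X) + \sigma_k(X) \odot \epsilon, \quad k = 1, \dots, K,

so all K latent samples are deterministic functions of a single noise draw, which lets a diversity objective shape the whole set jointly.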
  • 82. Planning Centric Diverse Sampler ■ perform message passing to aggregate information over the whole scene at each node. ■ Finally, each node in the first GNN predicts the latent means via an MLP, and each node in the second GNN predicts the covariances via another MLP. ■ Through sampling and decoding, obtain a set of K future trajectory realizations Y of all actors in the scene. ■ The objective of this diverse sampler is to generate a set of futures Y that are diverse while recovering the ground-truth observations well. ■ The energy function is composed of a few terms that promote diversity while preserving data reconstruction:
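The exact sub-terms are defined in the LookOut paper; a generic hedged form of such an energy, combining a reconstruction term that rewards covering the ground-truth future with a pairwise diversity term over the decoded futures, is

E(Y_{1:K}) = \min_{k} d\big(Y_k, Y_{\text{GT}}\big) \;+\; \lambda \sum_{k \neq k'} r\big(Y_k, Y_{k'}\big),

where d is a reconstruction distance, r penalizes futures that are too similar (in LookOut the diversity is planning-centric, i.e., measured through the SDV reactions the futures induce), and \lambda trades off the two terms; all symbols here are assumptions for illustration.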
  • 83. Scenario Probability Estimation ■ The diverse set of K future realizations Y provides the coverage needed for safe motion planning. However, for accurate risk assessment the probability of each future realization in the set also needs to be estimated. ■ augment the model to also output a score for each future realization, via a GNN that takes as input the actor features and all K sampled future scenarios. ■ then easily recover a distribution over such scores by renormalization, so the probability of each sample is given below. ■ Train the scoring function s to match an approximate categorical distribution over future scenarios under the KL divergence. ■ define this approximate target distribution from how well each sampled scenario matches the ground-truth future.
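The renormalization mentioned above is a softmax over the predicted scores; writing s_k = s(X, Y_k) for the score of the k-th sampled future (notation assumed), the probability of each sample is

p_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)},

and the scorer is trained by minimizing the KL divergence between this distribution and the approximate target distribution over the K scenarios.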
  • 84. Contingency Planner ■ The goal of the motion planning module is to generate safe, comfortable and not overly conservative trajectories for the SDV to execute. ■ achieve this through Model Predictive Control, where a trajectory is planned considering a finite horizon and executed until a new trajectory is replanned upon availability of a new LiDAR sweep. ■ However, the above formulation does not exploit the fact that only one of the predicted scenarios will happen in the future, and instead optimizes for a single trajectory that is "good" in expectation. ■ Instead of finding a single motion plan for multiple futures, generate a single common immediate action, followed by a set of future trajectories, one for each future realization of the scene.
  • 85. Contingency Planner ■ This contingency planning finds an immediate action that is safe with respect to all the possible realizations in Y and comfortably bridges into a set of contingent trajectories, where each is specifically planned for a single future realization. ■ Such decision-postponing avoids over-conservative behaviors while staying safe until more information is obtained. ■ Safe motion planning is only possible if the set of predicted future scenarios is diverse and covers the possible realizations, including low-likelihood events. ■ The planner cost function is a linear combination of carefully crafted sub-costs that encode different aspects of driving, including safety, comfort, traffic rules and the route. The sampling strategy is as follows: (i) first, a set of (spatial) paths is generated; (ii) for each path, a set of initial velocity profiles is sampled, creating the set of short-term trajectories; (iii) conditioned on the end state of these initial trajectories, another set of velocity profiles is sampled for the rest of the planning horizon, assuming the SDV follows the same path.
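A hedged way to write the contingency-planning objective sketched above, with \tau_0 the shared immediate action, \tau_k the contingent trajectory planned for future realization Y_k, and p_k the estimated scenario probabilities (symbols are illustrative, not the paper's exact notation):

\tau_0^{\ast} = \arg\min_{\tau_0} \Big[\, c(\tau_0) + \sum_{k=1}^{K} p_k \min_{\tau_k \,\text{extending}\, \tau_0} c(\tau_k ; Y_k) \,\Big],

so the planner commits only to the short-term action now, while each contingent continuation is evaluated against the single future it is planned for.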
  • 86. Obtaining K latent samples from an implicit latent variable model (ILVM) implies sampling K times independently from the prior. In contrast, the diverse sampler exploits a GNN mapping to predict K latent samples from a single noise draw (in parallel). Contingency planning paradigm: the cost-to-go of a short-term ego-action is captured by the ability to react to the K diverse predicted futures with the K most suitable ego-trajectories.
  • 87. Diverse multi-future predictions and plans in closed-loop, zoomed in. Object detections and motion forecasts are blue for vehicles and pink for pedestrians. The green bounding box is the SDV, its immediate action (1s) is shown in black (starting from its rear axle), and its contingent trajectories planned for each possible future scenario are shown in distinct colors. LiDAR points are not visualized.
  • 88. Motion forecasting visualizations (panels: Scenario 1 and Scenario 2). Blend 15 future scenarios with transparency (equal probability is assumed for visualization purposes). Time is encoded in the rainbow color map ranging from red (0s) to pink (5s). This can be seen as a sample-based characterization of the per-actor marginal distributions.
  • 89. Deep Multi-Task Learning for Joint Localization, Perception, and Prediction ■ Tremendous progress has been made on many subtasks of autonomous driving, including perception, motion forecasting, and motion planning. ■ However, these systems often assume that the car is accurately localized against a high-definition map. ■ In state-of-the-art autonomy stacks, localization error exists. ■ Based on these observations, design a system, LP2, that jointly performs perception, prediction, and localization. ■ The architecture is able to reuse computation between the three tasks, and is thus able to correct localization errors efficiently.
  • 90. P2: Perception-Prediction ■ P2 models are tasked with perceiving actors and predicting their future trajectories to ensure that motion planning has access to safety-critical information about the scene for the entire duration of the planning horizon. ■ Implicit Latent Variable Model (ILVM) encodes the whole scene in a latent random variable and uses a deterministic decoder to efficiently sample multiple scene-consistent trajectories for all the actors in the scene. ■ Besides LiDAR, the ILVM backbone takes as input a multichannel image with semantic aspects of the rasterized map, which the model is expected to use to improve detection and forecasting. ■ While the LiDAR scans are always processed in the vehicle frame, the scans and the map are aligned using the pose of the car. ■ Thus, localization error results in a misalignment between the semantic map and the LiDAR scan.
  • 91. Motion Planning (MP) ■ Motion Planning (MP): given a map and a set of dynamic agents and their future behaviours, the task of the motion planner is to provide the control module with a trajectory that is safe, comfortable, and physically realizable. ■ The Path Lateral Time (PLT) motion planner is a method that samples physically realizable trajectories, evaluates them, and selects the one with the minimal cost. ■ The PLT planner receives S = 50 Monte Carlo samples from the joint distribution over the trajectories of all actors from the P2 module. ■ It then samples a small set of trajectories given the map, high-level routing and the current state of the SDV. ■ The planned trajectory is then computed to be the one with the minimum expected cost over the predicted futures, as defined by a cost function c that takes into account safety and comfort. ■ In this case, bad localization gives the planner a wrong idea about the layout of the static parts of the scene.
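Under this setup, the selected plan minimizes the Monte Carlo estimate of the expected cost over the S = 50 sampled futures (a generic form with assumed notation; \tau ranges over the sampled PLT trajectories and \mathcal{M} denotes the map):

\tau^{\ast} = \arg\min_{\tau} \frac{1}{S} \sum_{s=1}^{S} c\big(\tau, Y^{(s)}, \mathcal{M}\big),

which makes explicit why a misaligned map \mathcal{M} corrupts the planner's view of the static scene.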
  • 92. The effects of localization error on perception-prediction and motion planning. (Top) The effects of perturbing the ego-pose on P2. SFDE is the mean displacement error across all samples at the 5s mark, and mAP@0.7 is the mean average precision evaluated at an IOU of 0.7. (Bottom) The effects of perturbing the ego-pose on planning. Collision rate is the percentage of examples for which the planned path collides with another vehicle or pedestrian within the 5s simulation, and L2 human is the distance between the planned path and the ground truth human-driven path at the 5s mark.
  • 93. The architecture of the combined localization and perception-prediction (LP2) model.
  • 94. LP2 ■ A system receives LiDAR as input, which is then converted to a BEV voxelization with the channels of the 2D input corresponding to the height dimension. ■ P2 contains four main submodules. (i) A lightweight network processes a rasterized semantic map centered at the current vehicle pose. (ii) Another neural backbone extracts features from a coarsely voxelized LiDAR sweep. These two feature maps are concatenated and passed to (iii) a detector-predictor that encodes the scene into a latent variable Z, and (iv) a GNN in which each node represents a detected actor and which deterministically decodes latent samples into samples of the joint posterior distribution over all actor trajectories. ■ Ground intensity localization with deep LiDAR embeddings: this localizer learns deep functions that produce spatial embeddings of both the map and the LiDAR sweep before alignment. ■ Multi-Resolution Feature Sharing: upsample a crop of the LiDAR feature map to match the feature resolution used in localization, then add the feature maps together using a weighted sum to produce the final localization embedding.
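A minimal PyTorch-style sketch of the multi-resolution feature sharing step described above; the module name, tensor shapes, and the single learnable mixing weight are assumptions for illustration, not the paper's implementation:

import torch
import torch.nn.functional as F
from torch import nn

class MultiResolutionFeatureSharing(nn.Module):
    # Fuse a crop of the coarse shared LiDAR feature map into the finer localization embedding.
    def __init__(self):
        super().__init__()
        # learnable scalar mixing weight for the weighted sum (assumed; could also be per-channel)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, loc_feat, lidar_feat_crop):
        # loc_feat: (B, C, H, W) localization embedding at fine resolution
        # lidar_feat_crop: (B, C, h, w) crop of the shared LiDAR backbone features (coarser)
        upsampled = F.interpolate(
            lidar_feat_crop, size=loc_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        # weighted sum produces the final localization embedding
        return self.alpha * loc_feat + (1.0 - self.alpha) * upsampled

Both inputs are assumed to share the channel count C; in practice a 1x1 convolution could reconcile differing channel dimensions.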