Driving Behaviors for ADAS and Autonomous Driving XII
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• SCALE-Net: Scalable Vehicle Trajectory Prediction Network under Random Number of Interacting Vehicles via
Edge-enhanced Graph CNN (2)
• MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps
(3.15)
• PiP: Planning-informed Trajectory Prediction for Autonomous Driving (3.25)
• Shared Cross-Modal Trajectory Prediction for Autonomous Driving (4.1)
• TPNet: Trajectory Proposal Network for Motion Prediction (4.26)
• VTGNet: A Vision-based Trajectory Generation Network for Autonomous Vehicles in Urban Environments
(4.27)
• UST: Unifying Spatio-Temporal Context for Trajectory Prediction in Autonomous Driving (5.6)
• Robust Trajectory Forecasting for Multiple Intelligent Agents in Dynamic Scene (5.27)
• PnPNet: End-to-End Perception and Prediction with Tracking in the Loop (5.29)
• The Importance of Prior Knowledge in Precise Multimodal Prediction (6.4)

SCALE-Net: Scalable Vehicle Trajectory Prediction Network
under Random Number of Interacting Vehicles via Edge-
enhanced GCNN
• Predicting the future trajectory of surrounding vehicles in a randomly varying traffic level is one of
the most challenging problems in developing an autonomous vehicle.
• Since the number of interacting vehicles is not fixed in advance, the prediction network
has to scale with the number of vehicles in order to keep both accuracy and computational
load consistent.
• The fully scalable trajectory prediction network, SCALE-Net, can ensure both higher prediction
performance and consistent computational load regardless of the number of surrounding vehicles.
• SCALE-Net employs an Edge-enhanced Graph Convolutional Neural Network (EGCN) as the
inter-vehicular interaction embedding network.
• Since the EGCN is inherently scalable with respect to the graph node (an agent in this study), the
model can be operated independently from the total number of vehicles considered.
• Experiments show that SCALE-Net consistently outperforms previous models in both
computation time and prediction performance, regardless of the level of traffic
complexity.

Comparison of state-input-based and scene-input-based prediction models: variation of
computation time and accuracy per driving scene with respect to the number of
surrounding vehicles.

Overall architecture of SCALE-Net for interactive, scalable trajectory prediction. The historical states of
the ego and surrounding vehicles, illustrated with dotted red lines, are used as the input of the proposed
architecture. After passing through the EGCN-based scene embedding layer and the LSTM-based trajectory predictor,
the future trajectories of the surrounding vehicles are generated, as shown in the right-most figure with blue dotted lines.

Overall flow diagram of the EGCN layer for interaction embedding. Node number 4, indexed with 5, is updated
using the blue-colored elements. Left: in the edge-enhanced attention process, the weights of the vehicles around
vehicle number 4 are calculated from the relative states of all vehicles in order to generate the weighted adjacency
matrix, A_adj. Right: using A_adj, the node information of vehicle 4 is updated with the GCN weight, W_gcn.
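The two steps in this caption can be sketched in a few lines. This is a minimal illustration, not the SCALE-Net implementation: the learned edge attention is replaced by a hypothetical hand-written `distance_score`, the GCN weight is an identity matrix, and all function names are assumptions. It shows why the layer is scalable: nothing in the code depends on the number of vehicles.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_adjacency(states, score):
    """Row i holds the attention weights of all vehicles as seen from vehicle i."""
    adj = []
    for ref in states:
        relative = [[o - r for o, r in zip(other, ref)] for other in states]
        adj.append(softmax([score(rel) for rel in relative]))
    return adj

def gcn_update(adj, feats, w_gcn):
    """One graph convolution, H' = ReLU(A_adj H W_gcn), written with plain lists."""
    n, d_in, d_out = len(feats), len(feats[0]), len(w_gcn[0])
    mixed = [[sum(adj[i][j] * feats[j][k] for j in range(n)) for k in range(d_in)]
             for i in range(n)]
    return [[max(0.0, sum(mixed[i][k] * w_gcn[k][c] for k in range(d_in)))
             for c in range(d_out)] for i in range(n)]

def distance_score(rel):
    # Hypothetical stand-in for the learned edge attention: closer means larger score.
    return -math.hypot(rel[0], rel[1])

# Three vehicles with (x, y) states; the same code runs for any number of vehicles.
states = [[0.0, 0.0], [5.0, 0.0], [40.0, 3.5]]
adj = weighted_adjacency(states, distance_score)
w_gcn = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
updated = gcn_update(adj, states, w_gcn)
```

Each row of `adj` sums to one, so every node update is a weighted average of its neighbours' features before the linear map.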

Examples of trajectory prediction results at various traffic levels, where the green and
transparent blue lines are predicted by SCALE-Net and V-LSTM, respectively.
Critically interacting scene, where the maneuver of the vehicles depends highly on the
interaction effect from adjacent vehicles.

MotionNet: Joint Perception and Motion Prediction for
Autonomous Driving Based on Bird’s Eye View Maps
• The ability to reliably perceive the environmental states, particularly the existence of objects and their
motion behavior, is crucial for autonomous driving.
• An efficient deep model, called MotionNet, jointly performs perception and motion prediction from 3D
point clouds.
• MotionNet takes a sequence of LiDAR sweeps as input and outputs a bird’s eye view (BEV) map, which
encodes the object category and motion information in each grid cell.
• The backbone of MotionNet is a novel spatio-temporal pyramid network, which extracts deep spatial
and temporal features in a hierarchical fashion.
• To enforce the smoothness of predictions over both space and time, the training of MotionNet is
further regularized with novel spatial and temporal consistency losses.
• Extensive experiments show that the method overall outperforms the state of the art, including the
latest scene-flow-based and 3D-object-detection-based methods.
• This indicates the potential value of the proposed method serving as a backup to the bounding-box-
based system, and providing complementary information to the motion planner in autonomous driving.
• Code is available at https://github.com/pxiangwu/MotionNet.

Top: MotionNet is a system based on bird’s eye view (BEV) map,
and performs perception and motion prediction jointly without
using bounding boxes.
It can potentially serve as a backup to the standard
bounding-box-based system and provide complementary
information for motion planning.
Bottom: During testing, with (a) LiDAR data (BEV), given an
object (e.g., a disabled person in a wheelchair, as illustrated in
(d)) that never appears in the training data, 3D object detection
tends to fail; see plots (b) and (c).
In contrast, MotionNet is still able to perceive the object and
forecast its motion; see plots (e) and (f), where the color
represents the category and the arrow denotes the future
displacement.

Overview of MotionNet. Given a sequence of LiDAR sweeps, the raw point clouds are first represented as BEV maps,
which are essentially 2D images with multiple channels. Each pixel (cell) in a BEV map is associated with a
feature vector along the height dimension. The BEV maps are then fed into the spatio-temporal pyramid network
(STPN) for feature extraction. The output of the STPN is finally delivered to three heads: (1) cell classification,
which perceives the category of each cell, such as vehicle, pedestrian or background; (2) motion prediction,
which predicts the future trajectory of each cell; (3) state estimation, which estimates the current motion
status of each cell, such as static or moving. The final output is a BEV map, which includes both perception and
motion prediction information.
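The first step, turning a point cloud into a BEV map with a feature vector along the height dimension, can be sketched as follows. This is a minimal illustration that assumes a binary occupancy encoding; the function name, the ranges and the cell size are hypothetical values chosen for the example, not those used by MotionNet.

```python
def points_to_bev(points, x_range, y_range, z_range, cell, z_bins):
    """Binary occupancy grid indexed as bev[ix][iy][iz]; iz runs along the height."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    bev = [[[0] * z_bins for _ in range(ny)] for _ in range(nx)]
    z_step = (z_range[1] - z_range[0]) / z_bins
    for x, y, z in points:
        inside = (x_range[0] <= x < x_range[1]
                  and y_range[0] <= y < y_range[1]
                  and z_range[0] <= z < z_range[1])
        if not inside:
            continue  # points outside the region of interest are dropped
        ix = min(int((x - x_range[0]) / cell), nx - 1)
        iy = min(int((y - y_range[0]) / cell), ny - 1)
        iz = min(int((z - z_range[0]) / z_step), z_bins - 1)
        bev[ix][iy][iz] = 1
    return bev

# Two points fall into the same cell at different heights; the third is out of range.
cloud = [(0.5, 0.5, 0.1), (0.6, 0.4, 1.5), (10.0, 0.0, 0.0)]
bev = points_to_bev(cloud, (-2.0, 2.0), (-2.0, 2.0), (0.0, 2.0), cell=1.0, z_bins=2)
```

Stacking one such map per LiDAR sweep gives the sequence of 2D multi-channel images that the caption describes as the network input.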

Spatio-temporal pyramid network. Each STC block
consists of two consecutive 2D convolutions
followed by one pseudo-1D convolution. The
temporal pooling is applied to the temporal
dimension and squeezes it to length 1.

PiP: Planning-informed Trajectory Prediction for
Autonomous Driving
• It is critical to predict the motion of surrounding vehicles for self-driving planning,
especially in a socially compliant and flexible way.
• However, future prediction is challenging due to the interaction and uncertainty in
driving behaviors.
• Planning-informed trajectory prediction (PiP) tackles the prediction problem in the
multi-agent setting.
• It differs from the traditional manner of prediction, which is based only on historical
information and is decoupled from planning.
• By informing the prediction process with the planning of the ego vehicle, it achieves
state-of-the-art performance in multi-agent forecasting on highway datasets.
• Moreover, it enables a novel pipeline which couples the prediction and planning, by
conditioning PiP on multiple candidate trajectories of the ego vehicle, which is highly
beneficial for autonomous driving in interactive scenarios.

Comparison between the traditional prediction approach (left) and PiP (right) in a lane merging scenario. Assume the
ego vehicle (red) intends to merge into the left lane. It has to predict the trajectories of the surrounding vehicles (blue). To
alleviate the uncertainty caused by future interaction, PiP incorporates the future plans (dotted red curve) of the ego vehicle in
addition to the history tracks (grey curve). While the traditional prediction result is produced independently of the ego's
future, PiP produces predictions in one-to-one correspondence with the candidate future trajectories by enabling the novel
planning-prediction-coupled pipeline. Therefore, PiP evaluates the planning safety more precisely and achieves more
flexible driving behavior (solid red curve) compared with the traditional pipeline.
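The coupled pipeline in this caption can be illustrated with a toy loop that scores each candidate ego plan against the prediction conditioned on that same plan. This is only a sketch under stated assumptions: `toy_predict` is a hypothetical hand-written stand-in for the PiP network, and the selection rule (safest gap first, then most progress) is an illustrative choice that the slides do not specify.

```python
import math

def min_gap(plan, prediction):
    """Smallest distance between the ego and the other vehicle over the horizon."""
    return min(math.dist(p, q) for p, q in zip(plan, prediction))

def select_plan(candidates, predict, safe_gap):
    """Return (plan, progress, gap) for the safe candidate with most progress, or None."""
    best = None
    for plan in candidates:
        gap = min_gap(plan, predict(plan))  # prediction conditioned on this plan
        progress = plan[-1][0] - plan[0][0]
        if gap >= safe_gap and (best is None or progress > best[1]):
            best = (plan, progress, gap)
    return best

def toy_predict(plan):
    # Hypothetical predictor: a left-lane vehicle yields when the ego plan enters its lane.
    speed = 6.0 if any(y > 1.75 for _, y in plan) else 10.0
    return [(-8.0 + speed * (t + 1), 3.5) for t in range(len(plan))]

# Candidate ego plans as (x, y) waypoints; the left lane is centred at y = 3.5.
keep_lane = [(8.0, 0.0), (16.0, 0.0), (24.0, 0.0)]
merge_left = [(12.0, 1.0), (24.0, 2.5), (36.0, 3.5)]
best = select_plan([keep_lane, merge_left], toy_predict, safe_gap=3.0)
```

Because the prediction changes with the plan, the merge is judged against a yielding neighbour and is selected here; a predictor that ignores the ego plan would evaluate both candidates against the same neighbour trajectory.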

Overview of the PiP architecture: PiP consists of three key modules: the planning coupled module, the target fusion module,
and the maneuver-based decoding module. Each predicted target is first encoded in the planning coupled module by aggregating
all information within the target-centric area (blue square). A target tensor is then set up within the ego-vehicle-centric area
(red square) by placing the target encodings into the spatial grid based on their locations. Afterward, the target tensor is passed
through the target fusion module to learn the interdependency between targets, and eventually a fused target tensor is generated.
Finally, the prediction of each target is decoded from the corresponding fused target encoding in the maneuver-based decoding
module. The marked target vehicle serves as the example for planning coupled encoding and multi-modal trajectory decoding.
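The step of placing target encodings into an ego-centric spatial grid can be sketched as below. This is a minimal illustration: the grid shape, the cell size and the function name are hypothetical, and the encodings are plain lists rather than learned features.

```python
import math

def build_target_tensor(targets, ego_pos, grid_shape, cell_size, enc_dim):
    """Scatter target encodings into an ego-centred grid[row][col] of feature vectors.

    targets: list of ((longitudinal, lateral) position, encoding) pairs.
    cell_size: (longitudinal, lateral) cell extent in metres.
    """
    rows, cols = grid_shape
    grid = [[[0.0] * enc_dim for _ in range(cols)] for _ in range(rows)]
    for (lon, lat), encoding in targets:
        # The ego vehicle sits in the centre cell; offsets are rounded to the nearest cell.
        col = cols // 2 + math.floor((lon - ego_pos[0]) / cell_size[0] + 0.5)
        row = rows // 2 + math.floor((lat - ego_pos[1]) / cell_size[1] + 0.5)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = list(encoding)  # targets outside the grid are ignored
    return grid

targets = [((6.0, -3.5), [1.0, 2.0]), ((100.0, 0.0), [9.0, 9.0])]
grid = build_target_tensor(targets, (0.0, 0.0), grid_shape=(3, 5),
                           cell_size=(5.0, 4.0), enc_dim=2)
```

Empty cells keep zero vectors, so a fusion network operating on the grid sees both which targets are present and where they are relative to the ego vehicle.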

Shared Cross-Modal Trajectory Prediction for
Autonomous Driving
• A framework for predicting future trajectories of traffic agents in highly interactive
environments.
• Since autonomous vehicles are equipped with various types of sensors (e.g., LiDAR
scanner, RGB camera, etc.), this work aims to benefit from the use of multiple input
modalities that are complementary to each other.
• The proposed approach is composed of two stages: (i) feature encoding, which discovers
the motion behavior of the target agent with respect to other directly and indirectly
observable influences, extracting such behaviors from multiple perspectives such as the
top-down and frontal views; (ii) cross-modal embedding, which embeds the set of learned
behavior representations into a single cross-modal latent space.
• Construct a generative model and formulate the objective functions with an additional
regularizer specifically designed for future prediction.
• An extensive evaluation is conducted to show the efficacy of the proposed framework
using two benchmark driving datasets.

Given a sequence of images and past positions, the feature encoder analyzes internal, external, and social stimuli of
agents. The features generated from multiple sensory data (e.g., top-down view LiDAR and frontal view RGB) are used
to condition the generative model that aims to embed different input modalities into a single cross-modal latent
space. The following decoder predicts future trajectory in top-down or frontal view using the latent variable sampled
from the learned embedding space. Note that the dotted shapes and arrows are only visible at training time.

Detailed illustration of the feature encoder. The past image sequence is used to model the spatio-temporal
factors given by the external environment. The internal factors of the target agent are encoded from its past
motion as well as the surrounding local perceptual context. In addition, the relative motion between the target
and every other interactive agent is considered to construct the social interactions.

TPNet: Trajectory Proposal Network for
Motion Prediction
• Making accurate motion prediction of the surrounding traffic agents such as pedestrians, vehicles,
and cyclists is crucial for autonomous driving.
• Recent data-driven motion prediction methods have attempted to learn to directly regress the
exact future position or its distribution from massive amounts of trajectory data.
• However, it remains difficult for these methods to provide multimodal predictions as well as
integrate physical constraints such as traffic rules and movable areas.
• This work proposes a two-stage motion prediction framework, the Trajectory Proposal Network (TPNet).
• TPNet first generates a candidate set of future trajectories as hypothesis proposals, then makes
the final predictions by classifying and refining the proposals that meet the physical constraints.
• By steering the proposal generation process, safe and multimodal predictions are realized.
• The framework thus effectively reduces the complexity of the motion prediction problem while
ensuring multimodal output.
• Experiments on four large-scale trajectory prediction datasets, i.e. the ETH, UCY, Apollo and
Argoverse datasets, show that TPNet achieves state-of-the-art results.

The movement of traffic agents is often constrained by the
movable areas (white areas for vehicles and gray areas for
pedestrians), while there may be multiple plausible future
paths for each agent. A motion prediction system therefore
has to incorporate the traffic constraints and output
multimodal predictions. This framework generates
predictions with different intentions under physical
constraints for both vehicles and pedestrians.

Framework of the Trajectory Proposal Network (TPNet). In the first stage, a rough end point is
regressed to reduce the search space, and then proposals are generated. In the second
stage, proposals are classified and refined to generate the final predictions. The dotted proposals
are those that lie outside the movable area and are additionally penalized.

Illustration of proposal generation. Proposals
are generated around the end point predicted
in the first stage. γ is used to control the shape
of the proposal.
Illustration of multimodal proposal generation using
road information. The reference lines indicate the
possible lane center lines that the vehicle could drive in.
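A rough sketch of proposal generation around a regressed end point follows. The slides do not state the curve model, so a quadratic Bezier curve is used here as a stand-in, with γ moving the control point sideways to bend the proposal. The offsets, the γ values and the corridor-shaped movable area are all hypothetical.

```python
def bezier(p0, p1, p2, n_steps):
    """Sample a quadratic Bezier curve at n_steps points, excluding the start point."""
    points = []
    for k in range(1, n_steps + 1):
        t = k / n_steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points

def generate_proposals(current, end_point, offsets, gammas, n_steps):
    """One proposal per candidate end point (grid around end_point) and per gamma."""
    proposals = []
    for dx in offsets:
        for dy in offsets:
            end = (end_point[0] + dx, end_point[1] + dy)
            mid = ((current[0] + end[0]) / 2, (current[1] + end[1]) / 2)
            normal = (-(end[1] - current[1]), end[0] - current[0])
            for gamma in gammas:
                # gamma shifts the control point along the normal; gamma = 0 is a straight line
                control = (mid[0] + gamma * normal[0], mid[1] + gamma * normal[1])
                proposals.append(bezier(current, control, end, n_steps))
    return proposals

def in_movable_area(point):
    # Hypothetical movable area: a straight corridor with |y| <= 1.
    return abs(point[1]) <= 1.0

proposals = generate_proposals((0.0, 0.0), (10.0, 0.0),
                               offsets=[-1.0, 0.0, 1.0], gammas=[-0.2, 0.0, 0.2], n_steps=5)
feasible = [p for p in proposals if all(in_movable_area(q) for q in p)]
```

In the framework described above, proposals outside the movable area are penalized during classification rather than simply dropped; the hard filter here only makes the constraint visible.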

VTGNet: A Vision-based Trajectory Generation Network
for Autonomous Vehicles in Urban Environments
• Navigating as reliably as expert human drivers in urban environments is a critical capability for
autonomous vehicles.
• Traditional methods for autonomous driving are implemented with many building blocks from
perception, planning and control, making them difficult to generalize to varied scenarios due to
complex assumptions and interdependencies.
• VTGNet is an end-to-end trajectory generation method based on imitation learning.
• It extracts spatiotemporal features from front-view camera images for scene
understanding, then generates collision-free trajectories several seconds into the future.
• The network consists of three sub-networks, which are selectively activated for three common
driving tasks: keep straight, turn left and turn right.
• The experimental results suggest that under various weather and lighting conditions, the network
can reliably generate trajectories in different urban environments, such as turning at intersections
and slowing down for collision avoidance.
• Furthermore, by integrating the network into a navigation system, good generalization
performance is presented in an unseen simulated world for autonomous driving on different
types of vehicles, such as cars and trucks.

Different approaches for trajectory planning and decision-making for autonomous vehicles.

The architecture of VTGNet, which consists of a feature extractor and a trajectory generator. MobileNet V2, with
17 bottleneck convolutional layers, is used as the feature extractor. A long short-term memory (LSTM) network is used
in the decoder to process the spatiotemporal information. The output of VTGNet is a vector of size 22×3 describing
the trajectory over the next 22 frames (velocity and x, y positions in the body frame). Note that the width of the
network layers indicates the number of output channels.
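The output format and the command-dependent branch selection can be sketched as follows. The three sub-networks are replaced by hypothetical lambdas that return 66 numbers, and the frame-major layout with three values per frame is an assumption; the slides give only the 22×3 size and the three driving tasks.

```python
def select_branch(branches, command):
    """Pick the sub-network that matches the current high-level driving command."""
    if command not in branches:
        raise ValueError(f"unknown command: {command}")
    return branches[command]

def unpack_trajectory(flat, n_frames=22, n_values=3):
    """Split a flat output vector into one tuple of n_values numbers per future frame."""
    if len(flat) != n_frames * n_values:
        raise ValueError("unexpected output size")
    return [tuple(flat[i * n_values:(i + 1) * n_values]) for i in range(n_frames)]

# Hypothetical stand-ins for the three sub-networks: each maps features to 66 numbers.
branches = {
    "keep_straight": lambda features: [0.0] * 66,
    "turn_left": lambda features: [float(i) for i in range(66)],
    "turn_right": lambda features: [-float(i) for i in range(66)],
}
trajectory = unpack_trajectory(select_branch(branches, "turn_left")(None))
```

Only the branch matching the command is evaluated, which mirrors the selective activation of sub-networks described for VTGNet.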

Different baselines in this work. The feature extractor for these networks is the same as the one in
the proposed VTGNet. The size of the output features is shown above the layers.

UST: Unifying Spatio-Temporal Context for
Trajectory Prediction in Autonomous Driving
• Trajectory prediction has always been a challenging problem for autonomous driving, since it
needs to infer the latent intention from the behaviors and interactions of traffic participants.
• This problem is intrinsically hard, because each participant may behave differently under different
environments and interactions.
• The key is to effectively model the interlaced influence of both spatial and temporal context.
• Existing work usually encodes these two types of context separately, which leads to inferior
modeling of the scenarios.
• UST is a unified approach that treats the time and space dimensions equally when modeling spatio-temporal context.
• The module is simple and can be implemented in a few lines of code.
• In contrast to existing methods, which rely heavily on RNNs for temporal context and hand-crafted
structures for spatial context, it can auto-partition the spatio-temporal space to adapt to the data.
• The approach is tested on two recently proposed trajectory prediction datasets, ApolloScape and Argoverse.
• The encouraging results on both datasets further support the approach.

Illustration and representations of the trajectory prediction task. Blue, green, red colors show
trajectories for vehicles, bicycles, pedestrians respectively. (b) shows the common representation,
which represents the surrounding agents as sequences of positions in 2D spatial space. (c) shows
our proposed trajectory representation in a unified spatio-temporal space.

• Design spatio-temporal point sets to represent the raw input;
• Treat the snapshot of status of agent n at time step t as a single point with metadata in a 3D space
spanned by 2D location and time;
• By this uniform representation, unify space and time into one representation, which eases the
subsequent context modeling task.
• To deal with such unordered and variable length data, the structure and operations of this feature
extractor should be deliberately designed to fit the nature of the data.
• Inspired by PointNet, context extraction uses two key components, followed by a refinement step.
• Embedding: maps S-T points into a hidden representation in which the spatial context and
temporal context are unified;
• Permutation Invariant Aggregator: forms the global context feature; max pooling is used
as the aggregator by default.
• Recursive Refinement: concatenates the global context feature to every individual feature and
recursively applies the aforementioned steps. In the second step, the embedding is aware of
the status of each individual agent and of the global context, and can thus capture the interactions.
• Finally, feed the encoded spatio-temporal feature into a standard LSTM.
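The embedding, aggregation and refinement steps above can be sketched with plain lists. This is a minimal illustration in the spirit of PointNet, not the UST code: the weights are random rather than learned, each embedding is a single linear layer with ReLU, and the hidden size of 4 is arbitrary.

```python
import random

def linear_relu(vector, weights):
    """Shared per-point embedding: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in weights]

def max_pool(features):
    """Permutation-invariant aggregator: element-wise maximum over all points."""
    return [max(column) for column in zip(*features)]

def encode_context(points, w_embed, w_refine):
    local = [linear_relu(p, w_embed) for p in points]   # embedding of each S-T point
    global_feat = max_pool(local)                        # global context feature
    # Refinement: every point sees its own feature concatenated with the global context.
    refined = [linear_relu(f + global_feat, w_refine) for f in local]
    return max_pool(refined)

rng = random.Random(0)
HIDDEN = 4
w_embed = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(HIDDEN)]
w_refine = [[rng.uniform(-1, 1) for _ in range(2 * HIDDEN)] for _ in range(HIDDEN)]

# Each point is (x, y, t): two agents observed at two time steps.
points = [(0.0, 0.0, 0.0), (1.0, 0.2, 1.0), (5.0, 3.5, 0.0), (5.5, 3.5, 1.0)]
context = encode_context(points, w_embed, w_refine)
```

Reordering the points or changing their number leaves the output size unchanged, which is what lets one feature extractor handle unordered, variable-length input.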

(a) and (f) show activation patterns of two typical neurons in the pooled spatio-temporal features.
The numbers in the other subfigures give the activation value of this neuron for each case.

Robust Trajectory Forecasting for Multiple
Intelligent Agents in Dynamic Scene
• Trajectory forecasting, or trajectory prediction, of multiple interacting agents in dynamic scenes,
is an important problem for many applications, such as robotic systems and autonomous driving.
• The problem is a great challenge because of the complex interactions among the agents and their
interactions with the surrounding scenes.
• This work proposes a method for robust trajectory forecasting of multiple intelligent agents in dynamic scenes.
• The method consists of three major interrelated components: an interaction net for global
spatiotemporal interactive feature extraction, an environment net for decoding dynamic scenes
(i.e., the surrounding road topology of an agent), and a prediction net that combines the
spatiotemporal feature, the scene feature, the past trajectories of agents and some random noise for
robust trajectory prediction of the agents.
• Experiments on pedestrian-walking and vehicle-pedestrian heterogeneous datasets demonstrate
that the method outperforms state-of-the-art prediction methods in terms of prediction accuracy.

The method contains three components, a spatio-temporal interaction network, an
environment feature extraction network, and a trajectory prediction network.

PnPNet: End-to-End Perception and Prediction
with Tracking in the Loop
• This work addresses the problem of joint perception and motion forecasting in the context
of self-driving vehicles.
• Towards this goal, PnPNet is an end-to-end model that takes sequential sensor data as
input and outputs object tracks and their future trajectories at each time step.
• The key component is a tracking module that generates object tracks online from
detections and exploits trajectory level features for motion forecasting.
• Specifically, the object tracks get updated at each time step by solving both the data
association problem and the trajectory estimation problem.
• Importantly, the whole model is end-to-end trainable and benefits from joint
optimization of all tasks.
• PnPNet is validated on two large-scale driving datasets and shows improvements over the
state of the art, with better occlusion recovery and more accurate future prediction.

Three paradigms for perception and prediction. Traditional approach (a) adopts the modular design that
decomposes the stack into subtasks and solves them with individual models. End-to-end method (b) uses a
joint model to solve detection and prediction simultaneously, but performs tracking as post-processing. As
a result, the full temporal history contained in tracks is not used by detection and prediction. This
approach (c) brings tracking into the loop so that all tasks benefit from rich temporal context.

PnPNet for end-to-end perception and prediction. The model consists of three modules that perform 3D object
detection, discrete-continuous tracking, and motion forecasting sequentially. To extract the trajectory level actor
representations used for tracking and prediction, the model is also equipped with two explicit memories: one for
global sensor feature maps, and one for past object trajectories. Both memories are updated at each time step
with up-to-date sensor features and tracking results.
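The per-step track update, with a discrete association followed by a continuous trajectory estimate, can be illustrated with a deliberately simple tracker. Both parts are stand-ins and not PnPNet's learned modules: association is greedy nearest-neighbour matching, trajectory estimation is exponential smoothing, and `max_dist` and `alpha` are hypothetical parameters.

```python
import math

def associate(tracks, detections, max_dist):
    """Greedy nearest-neighbour matching of track ids to detection indices."""
    pairs = sorted((math.dist(track[-1], det), tid, di)
                   for tid, track in tracks.items()
                   for di, det in enumerate(detections))
    matches, used = {}, set()
    for dist, tid, di in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if tid not in matches and di not in used:
            matches[tid] = di
            used.add(di)
    return matches

def update_tracks(tracks, detections, max_dist=2.0, alpha=0.5):
    """Extend matched tracks with a smoothed position and start tracks for the rest."""
    matches = associate(tracks, detections, max_dist)
    for tid, di in matches.items():
        (lx, ly), (dx, dy) = tracks[tid][-1], detections[di]
        tracks[tid].append((alpha * dx + (1 - alpha) * lx,
                            alpha * dy + (1 - alpha) * ly))
    next_id = max(tracks, default=-1) + 1
    matched = set(matches.values())
    for di, det in enumerate(detections):
        if di not in matched:
            tracks[next_id] = [det]  # unmatched detection starts a new track
            next_id += 1
    return tracks

tracks = {0: [(0.0, 0.0)], 1: [(10.0, 0.0)]}
tracks = update_tracks(tracks, [(10.5, 0.0), (0.4, 0.0), (50.0, 50.0)])
```

The `tracks` dictionary plays the role of the memory of past object trajectories: it is read for association and written back with the new estimates at every time step.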

The trajectory level object representation. Given an object trajectory, its sensor observation
and motion features are first extracted at each time step, and then an LSTM network is applied to model the temporal dynamics.

The Importance of Prior Knowledge in Precise
Multimodal Prediction
• Roads have well defined geometries, topologies, and traffic rules.
• While this has been widely exploited in motion planning methods to produce maneuvers that obey the
law, little work has been devoted to utilizing these priors in perception and motion forecasting methods.
• This is a method to incorporate these structured priors as a loss function.
• In contrast to imposing hard constraints, this approach allows the model to handle non-compliant
maneuvers when those happen in the real world.
• Safe motion planning is the end goal, and thus a probabilistic characterization of the possible future
developments of the scene is key to choose the plan with the lowest expected cost.
• Towards this goal, design a framework that leverages REINFORCE to incorporate non-differentiable
priors over sample trajectories from a probabilistic model, thus optimizing the whole distribution.
• The approach is evaluated on real-world self-driving datasets containing complex road topologies and
multi-agent interactions.
• Despite the importance of this evaluation, it has been often overlooked by previous perception and
motion forecasting works.
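The role of REINFORCE can be shown on a one-dimensional toy problem. This is a sketch under strong simplifications and not the paper's model: the "trajectory" is a single lateral offset drawn from a Gaussian, the prior is a hypothetical off-road indicator that has no useful gradient, and only the mean of the distribution is optimized. The score-function estimator still yields a gradient for the distribution because it differentiates the log-probability of the samples, not the cost.

```python
import random

def off_road_cost(x, half_width=1.75):
    """Non-differentiable prior: 1 if the lateral offset leaves the lane, else 0."""
    return 1.0 if abs(x) > half_width else 0.0

def reinforce_gradient(mu, sigma, rng, n_samples):
    """Score-function estimate of d/dmu E[cost(x)] for x ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        # d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2
        total += off_road_cost(x) * (x - mu) / sigma ** 2
    return total / n_samples

rng = random.Random(0)
mu, sigma, lr = 2.5, 1.0, 0.5  # the distribution starts centred outside the lane
for _ in range(50):
    mu -= lr * reinforce_gradient(mu, sigma, rng, n_samples=2000)
```

Because the cost enters only as a weight on the samples, it acts as a soft penalty: probability mass moves toward compliant behavior without forbidding non-compliant samples outright.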

• Human driving behavior is highly structured: in the majority of scenarios, drivers will follow the
road topology and traffic rules.
• To leverage this informative prior, but not overly penalize non-compliant behavior, define a
flexible traffic-rule informed loss that is conditioned on ground-truth behavior.
• To this end, leverage a lane-graph representation where the nodes encode lane segments and the
edges represent relationships between lane segments such as adjacency, predecessor, and
successor (taking into account direction of traffic flow).
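A lane graph of this kind can be held in a small adjacency structure. The example below is a hypothetical two-lane road with made-up segment ids; it only illustrates how typed edges support queries such as "which segments can be reached by driving forward".

```python
from collections import deque

# Hypothetical lane graph: nodes are lane segments, edges are typed relations.
LANE_GRAPH = {
    "a1": {"successor": ["a2"], "predecessor": [], "adjacent": ["b1"]},
    "a2": {"successor": ["a3"], "predecessor": ["a1"], "adjacent": ["b2"]},
    "a3": {"successor": [], "predecessor": ["a2"], "adjacent": []},
    "b1": {"successor": ["b2"], "predecessor": [], "adjacent": ["a1"]},
    "b2": {"successor": [], "predecessor": ["b1"], "adjacent": ["a2"]},
}

def reachable(graph, start, relations):
    """Breadth-first search over the selected edge types, starting from one segment."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for relation in relations:
            for neighbour in graph[node][relation]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen

forward_only = reachable(LANE_GRAPH, "a1", ("successor",))
with_lane_changes = reachable(LANE_GRAPH, "a1", ("successor", "adjacent"))
```

Keeping successor and predecessor as separate edge types is one simple way to encode the direction of traffic flow.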

• It is more important to precisely characterize motion of vehicles that might interact with the SDV
(self-driving vehicle), rather than other traffic participants that do not influence the SDV behavior.
• Approximate area of interest with the SDV’s route (i.e. high-level command), which is defined as
union of all lane segments that the SDV can travel on to reach a preset goal, given the lane-graph.
• The horizon is set equal to the prediction horizon (5 s), and the target lane is given by a route planner.
• This gives a safe approximation of its possible future locations.
• Define positive trajectories as those with more than one waypoint falling within the SDV route, and
negative trajectories otherwise.
• Achieve high precision and high recall under this definition, taking into account whether the
ground-truth trajectory intersects the route (positive) or not (negative).
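The positive/negative labelling and the resulting precision and recall can be sketched as below. The route is a hypothetical straight lane, the trajectories are made up, and the waypoint threshold is left as an explicit parameter (set to 2 here, following the "more than one waypoint" wording above).

```python
def is_positive(trajectory, in_route, min_waypoints):
    """A trajectory is positive if enough of its waypoints fall within the SDV route."""
    return sum(1 for point in trajectory if in_route(point)) >= min_waypoints

def precision_recall(predicted, ground_truth):
    """Precision and recall of predicted labels against ground-truth labels."""
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p and not g)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def in_route(point):
    # Hypothetical SDV route: one straight lane, 50 m long and 3.5 m wide.
    return 0.0 <= point[0] <= 50.0 and abs(point[1]) <= 1.75

predicted_trajs = [
    [(5.0, 0.0), (10.0, 0.0)],               # stays on the route
    [(5.0, 3.5), (10.0, 1.0), (15.0, 0.0)],  # predicted to cut into the route
    [(5.0, 3.5), (10.0, 3.5)],               # stays in the adjacent lane
]
ground_truth_trajs = [
    [(5.0, 0.0), (10.0, 0.0)],
    [(5.0, 3.5), (10.0, 3.5)],
    [(5.0, 3.5), (10.0, 3.5)],
]
pred_labels = [is_positive(t, in_route, min_waypoints=2) for t in predicted_trajs]
gt_labels = [is_positive(t, in_route, min_waypoints=2) for t in ground_truth_trajs]
precision, recall = precision_recall(pred_labels, gt_labels)
```

In this toy case the falsely predicted cut-in lowers precision, while recall stays perfect because the one actor that truly enters the route is found.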
