Pedestrian Behavior/Intention Modeling for Autonomous Driving VI

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• CoMoGCN: Coherent Motion Aware Trajectory Prediction with Graph Representation (5.5)
• STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory
Prediction
• AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction (5.17)
• Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction (5.18)
• Intention-aware Residual Bidirectional LSTM for Long-term Pedestrian Trajectory
Prediction (6.30)
• It Is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction (7.6)
• Graph2Kernel Grid-LSTM: A Multi-Cued Model for Pedestrian Trajectory Prediction by
Learning Adaptive Neighborhoods (7.8)
• Probabilistic Crowd GAN: Multimodal Pedestrian Trajectory Prediction using a Graph
Vehicle-Pedestrian Attention Network (7.12)

CoMoGCN: Coherent Motion Aware Trajectory
Prediction with Graph Representation
• Forecasting human trajectories is critical for tasks such as robot crowd navigation and
autonomous driving.
• Modeling social interactions is of great importance for accurate group-wise motion
prediction.
• However, most existing methods do not consider information about coherence within
the crowd, but rather only pairwise interactions.
• A framework, coherent motion aware graph convolutional network (CoMoGCN), for
trajectory prediction in crowded scenes with group constraints.
• First, cluster pedestrian trajectories into groups according to motion coherence.
• Then, use graph convolutional networks to aggregate crowd information efficiently.
• The CoMoGCN also takes advantage of variational autoencoders to capture the
multimodal nature of the human trajectories by modeling the distribution.

System overview. The procedure has three steps:
1. Obtain coherent motion labels for each human in an offline data pre-processing step.
2. Based on the coherent motion labels, establish graphs capturing intergroup and intragroup
relationships. The encoder LSTM takes past trajectories as input and feeds the encoded features
into two GCNs.
3. The embeddings from the two GCNs are concatenated and forwarded to an MLP to create a
distribution. Features are then sampled from the distribution and fed into a decoder LSTM for
trajectory prediction.
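
A minimal sketch of step 2, assuming the coherent motion labels are plain integer group ids and that each GCN is a single row-normalised averaging layer. The function names, the self-loops, and the random stand-in features are illustrative assumptions, not taken from the CoMoGCN paper.

```python
import numpy as np

def group_adjacencies(labels):
    """Build intragroup and intergroup adjacency matrices from group ids.

    labels[i] is a hypothetical coherent-motion cluster id of pedestrian i.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    intra = same.astype(float)        # same-group edges, diagonal included
    inter = (~same).astype(float)     # edges between different groups
    inter += np.eye(len(labels))      # assumed self-loops so no row is empty
    return intra, inter

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: neighbour averaging, linear map, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.maximum((adj / deg) @ feats @ weight, 0.0)

labels = [0, 0, 1, 1, 1]
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))       # stand-in for encoder LSTM outputs
w_intra = rng.normal(size=(4, 8))
w_inter = rng.normal(size=(4, 8))

intra, inter = group_adjacencies(labels)
# Concatenate the two GCN embeddings, as in step 3 of the overview.
embedding = np.concatenate(
    [gcn_layer(intra, feats, w_intra), gcn_layer(inter, feats, w_inter)], axis=1
)
```

The concatenated `embedding` is what would be passed on to the MLP that parameterises the distribution.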

STINet: Spatio-Temporal-Interactive Network for
Pedestrian Detection and Trajectory Prediction
• Detecting pedestrians and predicting future trajectories for them are critical tasks for
numerous applications, such as autonomous driving.
• Previous methods either treat the detection and prediction as separate tasks or simply
add a trajectory regression head on top of a detector.
• An end-to-end two-stage network: Spatio-Temporal-Interactive Network (STINet).
• In addition to 3D geometry modeling of pedestrians, model the temporal information for
each of the pedestrians.
• It predicts both current and past locations in the first stage, so that each pedestrian can
be linked across frames and the comprehensive spatio-temporal information can be
captured in the second stage.
• Also, model the interaction among objects with an interaction graph, to gather the
information among the neighboring objects.
• Comprehensive experiments on the Lyft Dataset and the recently released large-scale
Waymo Open Dataset for both object detection and future trajectory prediction.

Overview of STINet. It takes a sequence of point clouds as input, detects pedestrians and predicts
their future trajectories simultaneously. The point clouds are processed by Pillar Feature Encoding
to generate Pillar Features. Each Pillar Feature is then fed into a backbone ResUNet to get backbone
features. A Temporal Region Proposal Network (T-RPN) takes the backbone features and generates
temporal proposals with past and current boxes for each object. The Spatio-Temporal-Interactive
(STI) Feature Extractor learns features for each temporal proposal, which are used for final
detection and trajectory prediction.
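
To make the pillar step concrete, here is a hedged sketch that bins 3D points into vertical columns on an x-y grid. The cell size, the square extent, and the hand-crafted per-pillar feature (point count and mean height) are assumptions for illustration; the actual encoder learns pillar features with a network.

```python
import numpy as np

def pillarize(points, cell=0.5, extent=4.0):
    """Group 3D points into pillars on an x-y grid covering [0, extent)^2.

    Returns a dict mapping (ix, iy) to an array of the points in that pillar.
    """
    points = np.asarray(points, dtype=float)
    inside = np.all((points[:, :2] >= 0.0) & (points[:, :2] < extent), axis=1)
    pillars = {}
    for p in points[inside]:
        key = (int(p[0] // cell), int(p[1] // cell))
        pillars.setdefault(key, []).append(p)
    return {key: np.stack(pts) for key, pts in pillars.items()}

def pillar_features(pillars):
    """Toy feature per pillar: (number of points, mean height z)."""
    return {key: (len(pts), float(pts[:, 2].mean())) for key, pts in pillars.items()}

cloud = [[0.1, 0.1, 1.0], [0.2, 0.3, 2.0], [1.6, 0.1, 0.5], [9.0, 9.0, 0.0]]
pillars = pillarize(cloud)
features = pillar_features(pillars)
```

Scattering such per-pillar features back onto the grid yields the pseudo image that a 2D backbone can process.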

Backbone. Upper: overview of the backbone. The
input point cloud sequence is fed to Voxelization and
PointNet to generate pseudo images, which are then
processed by a ResNet U-Net to generate the final
backbone feature sequence. Lower: detailed design
of the ResNet U-Net.

Spatio-Temporal-Interactive Feature Extractor
(STI-FE): Local geometry, local dynamics and
history path features are extracted given a
temporal proposal. For local geometry and
local dynamics features, the yellow areas are
used for feature extraction. Relational
reasoning is performed across proposals’ local
features to generate interactive features.

AC-VRNN: Attentive Conditional-VRNN for
Multi-Future Trajectory Prediction
• Anticipating human motion in crowded scenarios is essential for developing intelligent
transportation systems, social-aware robots and advanced video-surveillance
applications.
• An important aspect of this task is the inherently multi-modal nature of human paths,
which admits multiple socially acceptable futures when human interactions are
involved.
• A generative model for multi-future trajectory prediction based on Conditional
Variational Recurrent Neural Networks (C-VRNNs).
• Conditioning relies on prior belief maps, representing most likely moving directions and
forcing the model to consider the collective agents’ motion.
• Human interactions are modeled in a structured way with a graph attention mechanism,
providing an online attentive hidden state refinement of the recurrent estimation.
• Compared to sequence-to-sequence methods, this model operates step-by-step,
generating more refined and accurate predictions.

Trajectory prediction framework for a single time-step. The overall model is composed of a training module (left)
and an inference module (right). The former is composed of a recurrent variational autoencoder conditioned on
prior belief maps. The hidden state of the RNN is refined with an attentive module for the next step of
recurrence. The latter generates displacements through the prior network on hidden states and
computes online the adjacency matrix, which defines connections between pairs of nodes.

(a) Scheme of the attentive hidden state refinement process. The adjacency matrix is an irregular block matrix where each
block size is defined by the number of pedestrians in the current scene. (b) Belief map during training for one sample
using a heat-similarity-based strategy. The map is centred at t − 1 to display the sampled displacement distribution at t.
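
The block structure of the adjacency matrix can be sketched as follows, assuming a batch that stacks several scenes and full connectivity inside each scene. The plain dot-product attention used for the refinement is a simplified stand-in chosen for illustration, not the graph attention layer of AC-VRNN.

```python
import numpy as np

def block_adjacency(scene_sizes):
    """Block-diagonal adjacency: one all-ones block per scene in the batch."""
    n = sum(scene_sizes)
    adj = np.zeros((n, n))
    start = 0
    for size in scene_sizes:
        adj[start:start + size, start:start + size] = 1.0
        start += size
    return adj

def refine_hidden(hidden, adj):
    """Refine hidden states with attention restricted to connected nodes."""
    scores = hidden @ hidden.T
    scores = np.where(adj > 0, scores, -np.inf)   # mask pairs in different scenes
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ hidden

adj = block_adjacency([2, 3])              # two scenes with 2 and 3 pedestrians
hidden = np.arange(20, dtype=float).reshape(5, 4) / 10.0
refined = refine_hidden(hidden, adj)
```

Because the mask is block-diagonal, pedestrians in one scene never attend to pedestrians in another, even though all hidden states share one matrix.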

Spatio-Temporal Graph Transformer Networks for
Pedestrian Trajectory Prediction
• Understanding crowd motion dynamics is critical to real-world applications, e.g.,
surveillance systems and autonomous driving.
• This is challenging because it requires effectively modeling the socially aware crowd
spatial interaction and complex temporal dependencies.
• The authors consider attention to be the most important factor for trajectory prediction.
• STAR, a Spatio-Temporal grAph tRansformer framework, tackles trajectory prediction
using only attention mechanisms. STAR models intra-graph crowd interaction with TGConv, a
Transformer-based graph convolution mechanism.
• The inter-graph temporal dependencies are modeled by separate temporal Transformers.
• STAR captures complex spatio-temporal interactions by interleaving between spatial and
temporal Transformers.
• To calibrate the temporal prediction for the long-lasting effect of disappeared
pedestrians, STAR applies a read-writable external memory module, which is continuously
updated by the temporal Transformer.

STAR models the crowd as a graph and learns spatio-temporal interaction of the crowd motion
by interleaving between a graph-based spatial Transformer and a temporal Transformer

Temporal Transformer and Spatial Transformer. (a) The Temporal Transformer treats each
pedestrian independently and extracts the temporal dependencies with a Transformer model (h
is the embedding of pedestrian positions; Q, K and V are the query, key and value matrices in
Transformers). (b) The Spatial Transformer models the crowd as a graph and applies TGConv, a
Transformer-based message passing graph convolution, to model the social interactions (m_{i→j}
is the message from node i to j represented by Transformer attention).
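
The Q, K, V computation in the caption is standard scaled dot-product self-attention. The sketch below is a single-head version under simplifying assumptions: random projection matrices, no multi-head split, no feed-forward sublayer. The same function covers both uses: a causal mask over time steps for the temporal case, and an adjacency mask over pedestrians for the spatial case.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(h, w_q, w_k, w_v, mask=None):
    """Single-head scaled dot-product self-attention over the rows of h."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked entries get ~zero weight
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 6, 8
h = rng.normal(size=(T, d))                     # embeddings of one pedestrian over time
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

causal = np.tril(np.ones((T, T), dtype=bool))   # step t sees only steps <= t
out = self_attention(h, w_q, w_k, w_v, mask=causal)
```

For the spatial case, `h` would hold one row per pedestrian at a fixed time step and `mask` would be the boolean adjacency of the crowd graph.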

Network structure of STAR with application to trajectory prediction. In STAR, trajectory prediction is
achieved completely by attention mechanisms. STAR interleaves the spatial Transformer and the temporal
Transformer in two encoder blocks to extract spatio-temporal pedestrian dependencies. An external
read-writable graph memory module helps to smooth the graph embeddings and improve the
consistency of temporal predictions. The prediction at T_obs + 1 is added back to the history to predict the
pedestrian poses at T_obs + 2.
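
The last sentence of the caption describes an autoregressive rollout. A minimal sketch, with a constant-velocity predictor standing in for the network (the predictor and function names are assumptions for illustration only):

```python
def rollout(history, predict_next, horizon):
    """Autoregressive rollout: each prediction is appended to the history
    before the next step is predicted."""
    track = list(history)
    for _ in range(horizon):
        track.append(predict_next(track))
    return track[len(history):]

def constant_velocity(track):
    """Placeholder one-step predictor: repeat the last displacement."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

future = rollout([(0, 0), (1, 0.5)], constant_velocity, horizon=3)
```

Swapping `constant_velocity` for a learned one-step model leaves the loop unchanged.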

Intention-aware Residual Bidirectional LSTM for
Long-term Pedestrian Trajectory Prediction
• Trajectory prediction is one of the key capabilities for robots to safely navigate and interact with
pedestrians.
• Critical insights from human intention and behavioral patterns need to be effectively integrated
into long-term pedestrian behavior forecasting.
• An intention-aware motion prediction framework consists of a Residual Bidirectional LSTM (ReBiL)
and a mutable intention filter.
• Instead of learning step-wise displacements, the model learns offsets that warp a nominal
intention-aware linear prediction, giving residual learning a physical intuition.
• The intention filter is inspired by genetic algorithms and particle filtering, where particles mutate
intention hypotheses throughout the pedestrian’s motion with ReBiL as the motion model.
• Experiments on a publicly available dataset under abnormal intention-changing scenarios.
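
The residual idea can be illustrated with a small sketch. It assumes the nominal prediction is a uniformly spaced straight line from the current position to an intention goal; the slide does not specify how the nominal path is built, and the offsets below are hand-written placeholders for what the network would output.

```python
def nominal_path(current, goal, steps):
    """Straight-line waypoints from `current` towards `goal` (assumed nominal model)."""
    cx, cy = current
    gx, gy = goal
    return [
        (cx + (gx - cx) * k / steps, cy + (gy - cy) * k / steps)
        for k in range(1, steps + 1)
    ]

def warp(nominal, offsets):
    """Add a per-step offset (the learned residual) to the nominal path."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(nominal, offsets)]

nominal = nominal_path((0.0, 0.0), (4.0, 2.0), steps=4)
offsets = [(0.0, 0.25)] * 4       # placeholder for network-predicted offsets
prediction = warp(nominal, offsets)
```

With all offsets at zero the prediction falls back to the straight line, so the network only has to model deviations from it.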

Overview of the motion prediction framework. ReBiL (dashed-line arrow) performs both truncated
prediction for particle weight update and long-term prediction at t after mutation. The mutable intention
filter takes truncated prediction results to update particle weights, and it implements Sequential
Importance Resampling (SIR) and a mutation mechanism.
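
One update of such a filter can be sketched as weight update, resampling, then mutation. Everything specific here is an assumption for illustration: particles are indices into a hypothetical list of goals, the likelihood is Gaussian in the prediction error, and `predict` is a trivial stand-in for the ReBiL truncated prediction.

```python
import math
import random

def filter_step(particles, weights, observed, predict, goals, rng,
                mutation_rate=0.1, sigma=0.5):
    """One update of a toy mutable intention filter.

    particles: list of goal indices (intention hypotheses)
    predict(goal): predicted position under that hypothesis (stand-in model)
    """
    # 1. Weight update from the error between prediction and observation.
    new_w = []
    for g, w in zip(particles, weights):
        px, py = predict(goals[g])
        err2 = (px - observed[0]) ** 2 + (py - observed[1]) ** 2
        new_w.append(w * math.exp(-err2 / (2.0 * sigma ** 2)))
    total = sum(new_w)
    new_w = [w / total for w in new_w]
    # 2. Sequential Importance Resampling: draw particles by weight.
    resampled = rng.choices(particles, weights=new_w, k=len(particles))
    # 3. Mutation: a few particles jump to a random intention hypothesis.
    mutated = [
        rng.randrange(len(goals)) if rng.random() < mutation_rate else g
        for g in resampled
    ]
    uniform = [1.0 / len(mutated)] * len(mutated)
    return mutated, uniform

goals = [(10.0, 0.0), (0.0, 10.0)]
predict = lambda goal: (0.1 * goal[0], 0.1 * goal[1])   # one short step towards goal
particles = [0, 1] * 50
weights = [1.0 / 100] * 100
rng = random.Random(0)
particles, weights = filter_step(particles, weights, (1.0, 0.0), predict, goals, rng)
```

Mutation keeps some probability mass on unlikely hypotheses, which is what lets the filter recover when a pedestrian changes intention mid-trajectory.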

It Is Not the Journey but the Destination: Endpoint
Conditioned Trajectory Prediction
• Human trajectory forecasting with multiple socially interacting agents is of
critical importance for autonomous navigation in human environments,
e.g., for self-driving cars and social robots.
• Predicted Endpoint Conditioned Network (PECNet) for flexible human
trajectory prediction.
• PECNet infers distant trajectory endpoints to assist in long-range
multi-modal trajectory prediction.
• A non-local social pooling layer enables PECNet to infer diverse yet socially
compliant trajectories.
• Additionally, a simple “truncation trick” for improving few-shot
multi-modal trajectory prediction performance.
• Code https://karttikeya.github.io/publication/htf/
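
The slide does not spell out the truncation trick, so the sketch below only shows the generic idea of drawing VAE latents from a truncated Gaussian so that a small number of samples stays in the high-density region. The per-coordinate clipping threshold and the rejection loop are assumptions, not PECNet's exact procedure.

```python
import numpy as np

def sample_truncated_latents(n_samples, dim, sigma, clip, rng):
    """Draw latents from N(0, sigma^2 I), redrawing any coordinate with |z| > clip."""
    z = rng.normal(0.0, sigma, size=(n_samples, dim))
    bad = np.abs(z) > clip
    while bad.any():
        z[bad] = rng.normal(0.0, sigma, size=int(bad.sum()))
        bad = np.abs(z) > clip
    return z

rng = np.random.default_rng(0)
latents = sample_truncated_latents(n_samples=20, dim=16, sigma=1.0, clip=1.5, rng=rng)
```

Each row of `latents` would be decoded into one candidate endpoint; a tighter `clip` trades diversity for samples closer to the most likely endpoints.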

Architecture of PECNet: PECNet uses past
history, along with the ground-truth endpoint,
to train a VAE for multi-modal endpoint
inference. Ground-truth endpoints are
denoted by ⋆ whereas × denotes the
sampled endpoints Gc. The sampled
endpoints condition the social-pooling and
predictor networks for multi-agent
multi-modal trajectory forecasting. Red
connections denote the parts utilized only
during training. Shades of the same color
denote spatio-temporal neighbours
encoded with the block diagonal social
mask in the social pooling module.

Graph2Kernel Grid-LSTM: A Multi-Cued Model for Pedestrian
Trajectory Prediction by Learning Adaptive Neighborhoods
• Pedestrian trajectory prediction is a prominent research track that has advanced towards
modelling of crowd social and contextual interactions, with extensive usage of Long
Short-Term Memory (LSTM) for temporal representation of walking trajectories.
• Existing approaches use virtual neighborhoods as a fixed grid for pooling social states of
pedestrians, with a tuning process that controls how social interactions are captured.
• This entails performance customization to specific scenes but lowers the generalization
capability of the approaches.
• Grid-LSTM is a recent extension of LSTM that operates over multidimensional feature
inputs.
• A perspective on interaction modeling: pedestrian neighborhoods can be made adaptive
by design.
• Grid-LSTM serves as an encoder to learn potential future neighborhoods and their
influence on pedestrian motion given the visual and the spatial boundaries.
• The experimental results illustrate the generalization of the approach across
datasets.
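
For contrast with adaptive neighborhoods, here is a hedged sketch of the fixed-grid pooling that the bullets attribute to existing approaches: a square window of hand-tuned size around one pedestrian, split into cells that count neighbours. The window size, cell count, and function name are illustrative assumptions.

```python
import numpy as np

def occupancy_grid(positions, index, neighborhood=4.0, cells=4):
    """Count neighbours of pedestrian `index` in a fixed grid centred on them.

    neighborhood: side length of the square window (a hand-tuned parameter)
    cells: number of grid cells per side
    """
    positions = np.asarray(positions, dtype=float)
    grid = np.zeros((cells, cells), dtype=int)
    cell = neighborhood / cells
    # Shift coordinates so the window spans [0, neighborhood) on both axes.
    rel = positions - positions[index] + neighborhood / 2.0
    for j, (x, y) in enumerate(rel):
        if j == index:
            continue
        if 0.0 <= x < neighborhood and 0.0 <= y < neighborhood:
            grid[int(y // cell), int(x // cell)] += 1
    return grid

positions = [(0.0, 0.0), (0.5, 0.5), (-1.5, 1.2), (5.0, 5.0)]
grid = occupancy_grid(positions, index=0)
```

The pedestrian at (5.0, 5.0) falls outside the window and is ignored, which shows how `neighborhood` and `cells` hard-code the range and resolution of interaction per scene.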

The static neighborhood grid fO segments the
scene image into several local regions. The
dynamic grid fS takes pedestrians’ trajectories x1, x2
along with their looking angles to derive their
social interactions. The output static grid has
a few highlighted areas, which indicate future
neighborhoods where pedestrians would walk.

Full pipeline of the G2K kernel. The SRI
network encodes Vislets and positional
trajectories for each pedestrian,
then maps them into a social
grid mask using NLSTMv. The GNN
network discretizes the static context using
NLSTMo into ’Visuospatial’
neighborhoods and stores pedestrian
contextual awareness in fO. At the
subsequent step, SRI takes fO and fS
and maps them into the weighted
adjacency matrix. This generates
the edge set ν as a means of completing
the graph at time-step t.

Gated Neighborhood Network pipeline. At the beginning, 2DCONV encodes a static image of the
scene and forwards the features into an NLSTM cell, which discretizes the environment into a virtual grid.

Probabilistic Crowd GAN: Multimodal Pedestrian Trajectory
Prediction using a Graph Vehicle-Pedestrian Attention Network
• Understanding and predicting the intention of pedestrians is essential to enable
autonomous vehicles and mobile robots to navigate crowds.
• This problem becomes increasingly complex when we consider the uncertainty and
multimodality of pedestrian motion, as well as the implicit interactions between
members of a crowd, including any response to a vehicle.
• Probabilistic Crowd GAN extends recent work in trajectory prediction, combining
Recurrent Neural Networks (RNNs) with Mixture Density Networks (MDNs) to output
probabilistic multimodal predictions, from which likely modal paths are found and used
for adversarial training.
• Use of a Graph Vehicle-Pedestrian Attention Network (GVAT), which models social
interactions and allows input of a shared vehicle feature; inclusion of this
module leads to improved trajectory prediction both with and without the presence of a
vehicle.
• Evaluation on various datasets illustrates how the true multimodal and
uncertain nature of crowd interactions can be directly modelled.

Observed pedestrian trajectories are passed to the Generator’s encoder LSTM, whilst the relative
positions of all agents, including any vehicle, are passed to the GVAT Pooling module. The
Generator outputs a GMM for each agent, from which the MultiPAC module finds the likely
modal paths, which are compared to ground truth paths by the Discriminator.
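
A small sketch of working with a per-agent GMM output, assuming isotropic 2D components at a single time step. Picking the component mean with the highest mixture density is a simple stand-in for mode finding and is not the MultiPAC algorithm; all function names are illustrative.

```python
import numpy as np

def gmm_density(point, weights, means, sigmas):
    """Density of an isotropic 2D Gaussian mixture evaluated at `point`."""
    point = np.asarray(point, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    sq_dist = ((means - point) ** 2).sum(axis=1)
    comp = np.exp(-sq_dist / (2.0 * sigmas ** 2)) / (2.0 * np.pi * sigmas ** 2)
    return float((weights * comp).sum())

def likeliest_component_mean(weights, means, sigmas):
    """Component mean at which the mixture density is highest."""
    means = np.asarray(means, dtype=float)
    densities = [gmm_density(m, weights, means, sigmas) for m in means]
    return means[int(np.argmax(densities))]

def nll(target, weights, means, sigmas):
    """Negative log-likelihood of a ground-truth position under the mixture."""
    return -np.log(gmm_density(target, weights, means, sigmas))

weights = [0.7, 0.3]
means = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "go straight" vs "turn left"
sigmas = [0.2, 0.2]
mode = likeliest_component_mean(weights, means, sigmas)
```

The negative log-likelihood is the usual training signal for a mixture density output, while a point estimate such as `mode` is what a discriminator could compare against ground truth.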

Node features of agent i (red) in GVAT. The
distance from i to the vehicle is appended to
each of the other ped-ped distance inputs before
encoding, to account for the impact of the
vehicle on i’s relationships within the graph.
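
The node input described in the caption can be sketched as pairs of distances. The function name and the 0.0 placeholder used when no vehicle is present are assumptions made for illustration; the caption does not state how the vehicle-free case is encoded.

```python
import math

def gvat_node_inputs(peds, i, vehicle=None):
    """For agent i, return (distance to pedestrian j, distance to vehicle)
    for every other pedestrian j, before any encoding."""
    xi, yi = peds[i]
    if vehicle is not None:
        d_vehicle = math.hypot(vehicle[0] - xi, vehicle[1] - yi)
    else:
        d_vehicle = 0.0               # assumed placeholder when no vehicle exists
    return [
        (math.hypot(xj - xi, yj - yi), d_vehicle)
        for j, (xj, yj) in enumerate(peds)
        if j != i
    ]

peds = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
inputs = gvat_node_inputs(peds, i=0, vehicle=(0.0, 5.0))
```

Because the same vehicle distance is attached to every pair, the attention over neighbours can depend on how close agent i is to the vehicle.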
