Pedestrian behavior/intention modeling for autonomous driving V

pedestrian, behavior, intention, preference, detection, tracking, segmentation, inter-city, zebra crossing, deep learning, prediction, pose estimation, crowd, reinforcement learning, imitation learning, GAN, social, obstacle.

  1. Pedestrian Behavior/Intention Modeling for Autonomous Driving V
     Yu Huang, Yu.huang07@gmail.com, Sunnyvale, California
  2. Outline
     • Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection (17.2.18)
     • Group LSTM: Group Trajectory Prediction in Crowded Scenarios (ECCV2018 workshop)
     • Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks (7.17)
     • The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs (8.23)
     • Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM (8.23)
     • STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction (ICCV19)
     • Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons (ICCV19)
     • GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction (3.26)
     • Recursive Social Behavior Graph for Trajectory Prediction (4.22)
  3. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection
     • As humans we possess an intuitive ability for navigation, which we master through years of practice; however, existing approaches to modelling this trait for diverse tasks, including monitoring pedestrian flow and detecting abnormal events, have been limited by their reliance on a variety of hand-crafted features.
     • Recent research in deep learning has demonstrated the power of learning features directly from the data, and related research in recurrent neural networks has shown exemplary results in sequence-to-sequence problems such as neural machine translation and neural image caption generation.
     • Motivated by these approaches, the paper proposes a method to predict the future motion of a pedestrian given a short history of their own and their neighbours' past behaviour.
     • The novelty of the method is the combined attention model, which utilises both “soft attention” and “hard-wired” attention to map the trajectory information from the local neighbourhood to the future positions of the pedestrian of interest.
     • It shows how a simple approximation of attention weights (i.e. hard-wired) can be merged with soft attention weights to make the model applicable to challenging real-world scenarios with hundreds of neighbours.
  4. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection
     Scene (left): the trajectory of the pedestrian of interest is shown in green; the pedestrian has two neighbours (shown in purple) to the left, one in front and none on the right.
     Neighbourhood encoding scheme (right): trajectory information is encoded with LSTM encoders. A soft attention context vector embeds the trajectory information from the pedestrian of interest, and a hardwired attention context vector is used for neighbouring trajectories. The soft attention context vector is generated with a soft attention function. The merged context vector is then used to predict the future trajectory of the pedestrian of interest (shown in red).
  5. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection
     The Soft + Hardwired Attention model utilises the trajectory information from both the pedestrian of interest and the neighbouring trajectories. The trajectory information from the pedestrian of interest is embedded with the soft attention context vector, while neighbouring trajectories are embedded with the aid of a hardwired attention context vector. The soft attention context vector is generated with a soft attention function. The merged context vector is then used to predict the future state.
  6. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection
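The merging of a learned soft attention context with a fixed hardwired context, as described on the Soft + Hardwired Attention slides, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the dot-product scoring, the inverse-distance hardwired weights, the merge by concatenation, and all function names are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def soft_context(encoder_states, decoder_state):
    """Soft attention over the encoded history of the pedestrian of interest.

    Assumption: dot-product scoring stands in for the learned scoring function.
    """
    scores = [sum(h * s for h, s in zip(state, decoder_state))
              for state in encoder_states]
    return weighted_sum(softmax(scores), encoder_states)

def hardwired_context(neighbour_states, distances, eps=1e-6):
    """Hardwired attention over neighbours with fixed, non-learned weights.

    Assumption: weights are the inverse of the distance to each neighbour.
    """
    weights = [1.0 / (d + eps) for d in distances]
    return weighted_sum(weights, neighbour_states)

def merged_context(soft, hard):
    """Merge both context vectors (assumption: simple concatenation)."""
    return soft + hard
```

Because the hardwired weights need no learned scoring pass, their cost stays small as the number of neighbours grows, which is the motivation the slides give for scenes with hundreds of neighbours.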
  7. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
     • The analysis of crowded scenes is one of the most challenging scenarios in visual surveillance, and a variety of factors need to be taken into account, such as the structure of the environment and the presence of mutual occlusions and obstacles.
     • Traditional prediction methods (such as RNN, LSTM, VAE, etc.) focus on anticipating an individual's future path based on the precise motion history of a pedestrian.
     • However, since tracking algorithms are generally not reliable in highly dense scenes, these methods are not easily applicable in real environments.
     • Nevertheless, it is very common for people (friends, couples, family members, etc.) to exhibit coherent motion patterns.
     • Motivated by this phenomenon, the paper proposes an approach to predict future trajectories in crowded scenes at the group level.
     • First, by exploiting motion coherency, trajectories that have similar motion trends are clustered.
     • In this way, pedestrians within the same group can be well segmented.
     • Then, an improved social-LSTM is adopted for future path prediction.
  8. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
     Representation of the Social hidden-state tensor. The black dot represents the pedestrian of interest. Other pedestrians are color-coded: green for pedestrians belonging to the same set, and red for pedestrians belonging to a different set. The neighborhood of the pedestrian of interest is described by N0 × N0 cells, which preserve the spatial information by pooling spatially adjacent neighbors. Pedestrians belonging to the same set are not used for the final computation of the pooling layer.
  9. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
     The figure represents the chain structure of the LSTM network between two consecutive time steps. At each time step, the inputs of the LSTM cell are the previous position and the Social pooling tensor Ht. The output of the LSTM cell is the current position.
  10. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
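The Social hidden-state tensor described for Group LSTM (an N0 × N0 grid around the pedestrian of interest, with members of the same set left out of the pooling) could be computed along these lines. This is a sketch under assumptions: the cell size, the sum pooling, the data layout, and the function name `social_pooling` are illustrative and not taken from the paper.

```python
def social_pooling(poi_pos, poi_group, neighbours, n0=4, cell_size=1.0):
    """Pool neighbour hidden states into an n0 x n0 grid centred on the
    pedestrian of interest, skipping neighbours from the same group.

    neighbours: list of ((x, y), group_id, hidden_state) tuples (hypothetical layout).
    Returns a nested list indexed as tensor[row][col][hidden_dim].
    """
    hidden_dim = len(neighbours[0][2]) if neighbours else 0
    tensor = [[[0.0] * hidden_dim for _ in range(n0)] for _ in range(n0)]
    half = n0 * cell_size / 2.0
    for (x, y), group_id, hidden in neighbours:
        if group_id == poi_group:
            continue  # same-set pedestrians do not contribute to the pooling
        # Shift coordinates so the grid's lower-left corner is the origin.
        col = int((x - poi_pos[0] + half) // cell_size)
        row = int((y - poi_pos[1] + half) // cell_size)
        if 0 <= row < n0 and 0 <= col < n0:
            cell = tensor[row][col]
            for d in range(hidden_dim):
                cell[d] += hidden[d]  # sum pooling within a cell
    return tensor
```

The resulting tensor would then be flattened and fed to the LSTM cell together with the previous position, as in the chain structure described above.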
  11. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
      • Predicting the future trajectories of multiple interacting agents in a scene has become an increasingly important problem for many applications, ranging from control of autonomous vehicles and social robots to security and surveillance.
      • This problem is compounded by the presence of social interactions between humans and their physical interactions with the scene.
      • While the existing literature has explored some of these cues, it has mainly ignored the multimodal nature of each human’s future trajectory.
      • Social-BiGAT is a graph-based generative adversarial network that generates realistic, multimodal trajectory predictions by better modelling the social interactions of pedestrians in a scene.
      • It is based on a graph attention network (GAT) that learns reliable feature representations encoding the social interactions between humans in the scene, and a recurrent encoder-decoder architecture that is trained adversarially to predict the humans’ paths from these features.
      • The multimodal nature of the prediction problem is accounted for by forming a reversible transformation between each scene and its latent noise vector, as in Bicycle-GAN.
  12. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
      Architecture of the Social-BiGAT model. The model consists of a single generator, two discriminators (one at local pedestrian scale and one at global scene scale), and a latent encoder that learns noise from scenes. The model makes use of a graph attention network (GAT) and self-attention on an image to consider the social and physical features of a scene.
  13. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
      Training process for the Social-BiGAT model. The generator and discriminators are trained using traditional adversarial learning techniques, with an additional L2 loss on generated samples to encourage consistency. The latent encoder is further trained by ensuring it can recreate the noise passed into the generator, and by making sure it mirrors a normal distribution.
  14. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
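The two training signals named for the Social-BiGAT latent encoder (recreating the noise passed into the generator, and mirroring a normal distribution) might be written as below. This is a hedged sketch: the L1 form of the reconstruction term, the diagonal-Gaussian parameterisation, the `kl_weight` value, and all function names are assumptions for illustration; only the KL formula for a diagonal Gaussian against a standard normal is standard.

```python
import math

def noise_reconstruction_loss(z_true, z_recovered):
    """Mean absolute error between the noise fed to the generator and the
    noise recovered by the latent encoder (L1 form is an assumption)."""
    return sum(abs(a - b) for a, b in zip(z_true, z_recovered)) / len(z_true)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def latent_encoder_loss(z_true, z_recovered, mu, log_var, kl_weight=0.01):
    """Combined latent-encoder objective; kl_weight is a hypothetical value."""
    return (noise_reconstruction_loss(z_true, z_recovered)
            + kl_weight * kl_to_standard_normal(mu, log_var))
```

The KL term vanishes only when the encoder outputs zero mean and unit variance, which is one way to encourage the encoder's output to mirror a normal distribution.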
  15. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs
      • Developing safe human-robot interaction systems is a necessary step towards the widespread integration of autonomous agents in society.
      • A key component of such systems is the ability to reason about the many potential futures (e.g. trajectories) of other agents in the scene.
      • The Trajectron is a graph-structured model that predicts many potential future trajectories of multiple agents simultaneously in both highly dynamic and multimodal scenarios (i.e. where the number of agents in the scene is time-varying and there are many possible highly distinct futures for each agent).
      • It combines tools from recurrent sequence modeling and variational deep generative modeling to produce a distribution of future trajectories for each agent in a scene.
      • The model is evaluated on several datasets, obtaining state-of-the-art results on standard trajectory prediction metrics; a new metric is also introduced for comparing models that output distributions.
  16. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs
      Top: an example graph with four nodes. a is the modeled node and is of type T3. It has three neighbors: b of type T1, c of type T2, and d of type T1. Here, c is about to connect with a.
      Bottom: the corresponding architecture for node a. Overall, the Trajectron employs a hybrid edge combination scheme combining aspects of Social Attention and the Structural-RNN.
  17. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs
      • The Trajectron combines elements of variational deep generative models (in particular, CVAEs), recurrent sequence models (LSTMs), and dynamic spatiotemporal graphical structures to produce high-quality multimodal trajectories that model and predict the future behaviors of multiple humans.
      • The Trajectron actually models a human’s velocity, which is then numerically integrated to produce spatial trajectories.
      • A graph G = (V, E) is built to represent the scene, with nodes representing agents and edges based on agents’ spatial proximity.
      • A Node History Encoder (NHE) encodes a node’s state history.
      • Edge Encoders (EEs) incorporate influence from neighboring nodes.
      • With the previous outputs in hand, a concatenated representation is formed, which then parameterizes the recognition and prior distributions in the CVAE framework.
  18. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs
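The Trajectron slides state that the model predicts velocities, which are then numerically integrated into positions. A minimal sketch of that last step follows; forward-Euler integration with a fixed time step is an assumption, since the slides do not name the integration scheme, and `integrate_velocities` is a hypothetical helper name.

```python
def integrate_velocities(start, velocities, dt):
    """Turn a sequence of predicted (vx, vy) velocities into positions.

    Assumption: forward-Euler integration with a constant time step dt.
    start is the last observed (x, y) position; the returned path holds
    one position per predicted velocity.
    """
    x, y = start
    path = []
    for vx, vy in velocities:
        x += vx * dt
        y += vy * dt
        path.append((x, y))
    return path
```

For example, calling `integrate_velocities((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)], dt=0.5)` steps first along x and then along y, starting from the last observed position.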
  19. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
      • The paper presents a trajectory prediction system that incorporates scene information (Scene-LSTM) as well as individual pedestrian movement (Pedestrian-LSTM), trained simultaneously within static crowded scenes.
      • A two-level grid structure (grid cells and subgrids) is superimposed on the scene to encode spatial granularity plus common human movements.
      • The Scene-LSTM captures the commonly traveled paths, which can be used to significantly influence the accuracy of human trajectory prediction in local areas (i.e. grid cells).
      • Scene data filters, consisting of a hard filter and a soft filter, are further designed to select the relevant scene information in a local region when necessary and combine it with the Pedestrian-LSTM for forecasting a pedestrian’s future locations.
      • The experimental results on several publicly available datasets demonstrate that the system produces more accurate predicted trajectories in different scene contexts.
  20. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
      Scene-LSTM learns common human movements on a two-level grid structure. The common human movement is filtered and used in combination with individual movement (Pedestrian-LSTM) to predict a pedestrian’s future locations.
  21. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
      The system consists of three main modules: Pedestrian Movement (PM), Scene Data (SD) and Scene Data Filter (SDF). PM models the individual movement of pedestrians. SD encodes common human movements in each grid cell. SDF selects relevant scene data to update the Pedestrian-LSTM, which is used to predict the future locations. ⊗ denotes elementwise multiplication. ⊕ denotes vector addition. h_i and h_s are the hidden states of the Pedestrian-LSTM and Scene-LSTM, respectively.
  22. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
      Illustration of the hard filter, which determines whether the scene data should be applied in predicting the future locations of a pedestrian.
      (a) The frame image is first divided into n × n grid cells (n = 4 in this example) to capture all human movements in each grid cell.
      (b) & (c) Only non-linear grid cells are selected for further processing at the subgrid level; the scene data is not applied for pedestrians in a linear grid cell.
      (d) A non-linear grid cell is further divided into m × m subgrids (m = 4) and each trajectory is parsed into subgrid paths.
      (e) The common subgrids are those occupied by common subgrid paths.
      (f) At prediction time, the decision whether or not to use scene data depends on the current location of each pedestrian. If the pedestrian’s current location is in the common subgrids, the scene data is used (red pedestrian); otherwise, it is not used (green pedestrian).
  23. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
      Illustration of the soft filter. The relevant information of the scene data (i.e. Scene-LSTM) is selected using each pedestrian’s walking behavior. The filtered grid-cell memory of each pedestrian is then used in combination with pedestrian movements (Pedestrian-LSTM) to predict the future trajectories.
  24. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
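The hard filter's prediction-time decision described above (look up the pedestrian's grid cell, then check whether the current subgrid is a common one) can be sketched as follows. The sketch assumes the sets of non-linear cells and common subgrids have already been computed from training trajectories; the data layout and the names `cell_index`, `subgrid_index` and `use_scene_data` are hypothetical.

```python
def cell_index(pos, frame_size, n):
    """Map an (x, y) pixel position to a (row, col) cell in an n x n grid."""
    width, height = frame_size
    col = min(int(pos[0] / width * n), n - 1)
    row = min(int(pos[1] / height * n), n - 1)
    return row, col

def subgrid_index(pos, frame_size, n, m):
    """Map a position to its (row, col) subgrid inside its grid cell.

    The frame is treated as a fine (n*m) x (n*m) grid; the remainder modulo m
    gives the subgrid coordinates within the enclosing cell.
    """
    fine_row, fine_col = cell_index(pos, frame_size, n * m)
    return fine_row % m, fine_col % m

def use_scene_data(pos, frame_size, n, m, nonlinear_cells, common_subgrids):
    """Hard-filter decision for one pedestrian at prediction time.

    nonlinear_cells: set of (row, col) cells flagged as non-linear.
    common_subgrids: dict mapping a cell to its set of common subgrids.
    Both are assumed to be precomputed from training data.
    """
    cell = cell_index(pos, frame_size, n)
    if cell not in nonlinear_cells:
        return False  # linear cell: scene data is not applied
    return subgrid_index(pos, frame_size, n, m) in common_subgrids.get(cell, set())
```

With n = 4 and m = 4 as in the slide's example, a pedestrian in a non-linear cell uses scene data only while standing on one of that cell's common subgrids.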
  25. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
      • Human trajectory prediction is challenging and critical in various applications (e.g., autonomous vehicles and social robots).
      • Because of the continuity and foresight of pedestrian movements, moving pedestrians in crowded spaces consider both spatial and temporal interactions to avoid future collisions.
      • However, most existing methods ignore the temporal correlations of interactions with other pedestrians involved in a scene.
      • The paper proposes a Spatial-Temporal Graph Attention network (STGAT), based on a sequence-to-sequence architecture, to predict future trajectories of pedestrians.
      • Besides the spatial interactions captured by the graph attention mechanism at each time step, an extra LSTM is adopted to encode the temporal correlations of interactions.
      • Tested on two publicly available crowd datasets (ETH and UCY), STGAT produces more “socially” plausible trajectories for pedestrians.
  26. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
      The architecture of the STGAT model. The framework is based on a seq2seq model and consists of three parts: Encoder, Intermediate State and Decoder. The Encoder module includes three components: two types of LSTMs and a Graph Attention Network (GAT). The Intermediate State encapsulates the spatial and temporal information of all observed trajectories. The Decoder module generates the future trajectories based on the Intermediate State.
  27. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
  28. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
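The per-time-step graph attention that STGAT applies to pedestrian states can be illustrated with a single-head attention step in the style of a standard graph attention network. This is a sketch under assumptions, not the STGAT code: a fully connected scene graph with self-loops, one attention head, plain Python lists, and hypothetical parameter names `W` (projection matrix) and `a` (attention vector).

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    """Multiply matrix W (list of rows) with vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_step(states, W, a):
    """One single-head graph-attention step over all pedestrians in a frame.

    states: one hidden vector per pedestrian at the current time step.
    W: projection matrix; a: attention vector of length 2 * projected dim.
    Each pedestrian attends to every pedestrian, including itself.
    """
    projected = [matvec(W, h) for h in states]
    updated = []
    for hi in projected:
        # Unnormalised score for each pair: LeakyReLU(a . [W h_i || W h_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, hi + hj)))
                  for hj in projected]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        updated.append([sum(al * hj[d] for al, hj in zip(alphas, projected))
                        for d in range(len(hi))])
    return updated
```

In the architecture described above, the sequence of such per-step outputs would then be passed to the extra LSTM that encodes the temporal correlations of interactions.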
  29. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons
      • Although Deep Inverse Reinforcement Learning (D-IRL) based modelling paradigms offer flexibility and robustness when anticipating human behaviour across long time horizons, compared to their supervised learning counterparts, no existing state-of-the-art D-IRL methods consider path planning in situations where there are multiple moving pedestrians in the environment.
      • To address this, the paper proposes a recurrent neural network based method for embedding pedestrian dynamics in a D-IRL setting where there are multiple moving agents.
      • The motion of the pedestrian of interest, as well as the motion of other pedestrians in the neighbourhood, is captured through Long Short-Term Memory networks.
      • The neighbourhood dynamics are encoded into a feature map, preserving the spatial integrity of the observed trajectories.
      • Utilising the maximum-entropy based non-linear inverse reinforcement learning framework, these features are mapped to a reward map.
      • The paper highlights the importance of capturing the dynamic evolution of the environment using the embedding scheme.
  30. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons
      The architecture used to embed the neighbourhood context: the trajectory of the pedestrian of interest is shown in blue, with three neighbours shown in green. Heading directions are indicated with circles. The trajectories are encoded using LSTMs, where soft attention is utilised to embed the information from the pedestrian of interest and hard-wired attention is used for the neighbours. Next, a feature map is generated to embed this information spatially, based on the Cartesian points of each trajectory.
  31. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons
      The architecture of the four-layer fully convolutional network used to map the feature map G to the reward map R. The first three layers each contain 32 convolution kernels of size 1 × 1 with a ReLU activation, and the final layer contains a single 1 × 1 convolution kernel. The learned reward map covers all areas of the environment, encapsulating structural factors such as buildings and pathways that influence pedestrian behaviour.
  32. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons
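The four-layer network of 1 × 1 convolutions that maps the feature map G to the reward map R (three layers with 32 kernels and ReLU, then one layer with a single kernel) amounts to applying the same small fully connected network at every grid location. A rough sketch follows; the random weights are placeholders for learned parameters, and the input channel count and helper names are assumptions for illustration.

```python
import random

def conv1x1(feature_map, weights, bias, relu):
    """Apply a 1 x 1 convolution: the same linear map at every pixel.

    feature_map: nested list [row][col][channel].
    weights: one row of input-channel weights per output kernel.
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            vals = [sum(w * x for w, x in zip(kernel, pixel)) + b
                    for kernel, b in zip(weights, bias)]
            out_row.append([max(0.0, v) for v in vals] if relu else vals)
        out.append(out_row)
    return out

def build_layers(in_channels, rng):
    """Layer sizes from the slide: 32, 32, 32 kernels, then 1 kernel.

    Weights are random placeholders, not learned values.
    """
    sizes = [in_channels, 32, 32, 32, 1]
    layers = []
    for i in range(4):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(sizes[i])]
                   for _ in range(sizes[i + 1])]
        layers.append((weights, [0.0] * sizes[i + 1]))
    return layers

def reward_map(feature_map, layers):
    """Map feature map G to a single-channel reward map R."""
    x = feature_map
    for i, (weights, bias) in enumerate(layers):
        x = conv1x1(x, weights, bias, relu=(i < len(layers) - 1))
    return [[pixel[0] for pixel in row] for row in x]
```

Because every kernel is 1 × 1, the reward at a location depends only on the features at that location, so the output keeps the spatial size of the input feature map.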
  33. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction
      • Trajectory prediction is a fundamental and challenging task: forecasting the future path of agents in autonomous applications with multi-agent interaction, where the agents need to predict the future movements of their neighbors to avoid collisions.
      • To respond to the environment in a timely and precise manner, high efficiency and accuracy are required in the prediction.
      • Conventional approaches, e.g., LSTM-based models, incur considerable computation costs in the prediction, especially for long sequence prediction.
      • To support more efficient and accurate trajectory predictions, the paper proposes GraphTCN, a CNN-based spatial-temporal graph framework that captures the spatial and temporal interactions in an input-aware manner.
      • The spatial interaction between agents at each time step is captured with an edge graph attention network (EGAT), and the temporal interaction across time steps is modeled with a modified gated convolutional network.
      • In contrast to conventional models, both the spatial and temporal modeling in GraphTCN are computed within each local time window.
      • Therefore, GraphTCN can be executed in parallel for much higher efficiency, while achieving accuracy comparable to the best-performing approaches.
  34. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction
      Overview of GraphTCN. EGAT captures the spatial interaction between agents for each time step; based on the spatial and historical trajectory embedding, TCN further captures the temporal interaction across time steps. The decoder module then produces multiple socially acceptable trajectories for all the agents simultaneously.
  35. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction
      TCN with a stack of 3 causal convolution layers of kernel size 3. In each layer, left padding is applied based on the kernel size. The input contains the spatial information captured by preceding modules. The output of TCN is collected by concatenating all the outputs across time.
  36. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction
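The causal convolution with left padding described for the GraphTCN temporal module can be illustrated on a scalar sequence. This is a simplified sketch: real inputs are feature vectors per time step, and the gating used in the modified gated convolutional network is omitted here; `causal_conv1d` and `tcn_stack` are hypothetical helper names.

```python
def causal_conv1d(sequence, kernel):
    """1-D causal convolution: output[t] depends only on inputs up to t.

    The sequence is left-padded with len(kernel) - 1 zeros so the output has
    the same length as the input. kernel[-1] multiplies the current sample.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(sequence)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(sequence))]

def tcn_stack(sequence, kernels):
    """Apply a stack of causal convolution layers, one kernel per layer."""
    x = list(sequence)
    for kernel in kernels:
        x = causal_conv1d(x, kernel)
    return x
```

With three layers of kernel size 3, each output step sees the current step and the six steps before it (a receptive field of 7), and all output steps can be computed in parallel because no step waits for a recurrent state.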
  37. Recursive Social Behavior Graph for Trajectory Prediction
      • Social interaction is an important topic in trajectory prediction for generating plausible paths.
      • Force-based models use distance to compute force, and they fail when the interaction is complicated.
      • For pooling methods, the distance between two people at a single timestep is used as the criterion to calculate the strength of the relationship.
      • Attention methods face the same problem: Euclidean distance is used to guide the attention mechanism.
      • The paper offers the insight of a group-based social interaction model to explore relationships among pedestrians.
      • Social representations are recursively extracted under the supervision of group-based annotations and formulated into a social behavior graph, called the Recursive Social Behavior Graph (RSBG).
      • The recursive mechanism substantially increases the representation power.
      • A graph CNN is used to propagate social interaction information in such a graph.
  38. Recursive Social Behavior Graph for Trajectory Prediction
      Overview. For individual representation, BiLSTMs are used to encode historical trajectory features, and CNNs are used to encode human context features. For relational social representation, the RSBG is first generated recursively and a GCN is then used to propagate social features. At the decoding stage, social features are concatenated with individual features, which are finally decoded by an LSTM-based decoder.
  39. Recursive Social Behavior Graph for Trajectory Prediction
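Propagating social features over a relationship graph with a graph convolution, as mentioned for the Recursive Social Behavior Graph, could look like the single step below. This is a generic sketch and not the paper's formulation: the added self-loops, the row normalisation, the ReLU, non-negative edge weights, and the function names are all assumptions for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_step(adjacency, features, weight):
    """One graph-convolution step: ReLU(normalise(A + I) @ X @ W).

    adjacency: n x n non-negative relation strengths between pedestrians.
    features: n x d matrix, one feature row per pedestrian.
    weight: d x d_out projection (a placeholder for learned parameters).
    """
    n = len(adjacency)
    # Self-loops keep each pedestrian's own features in the mix.
    with_self = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    # Row-normalise so each pedestrian averages over its relations.
    normalised = [[v / sum(row) for v in row] for row in with_self]
    propagated = matmul(matmul(normalised, features), weight)
    return [[max(0.0, v) for v in row] for row in propagated]
```

In this sketch, two pedestrians joined by a strong edge end up with similar propagated features, while an isolated pedestrian keeps only its own.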
