Simulation for Autonomous Driving at Uber ATG

Simulation for Autonomous
Driving @Uber ATG
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com

References
• Testing Safety of SDVs by Simulating Perception and Prediction
• LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World
• Recovering and Simulating Pedestrians in the Wild
• S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling
• SceneGen: Learning to Generate Realistic Traffic Scenes
• TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors
• GeoSim: Realistic Video Simulation via Geometry-Aware Composition for
Self-Driving
• AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles
• Appendix: (Waymo)
• SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving

Testing the Safety of Self-driving Vehicles by
Simulating Perception and Prediction
• Testing the safety of self-driving vehicles in simulation, with an
alternative to sensor simulation.
• Directly simulate the outputs of the self-driving vehicle’s perception
and prediction system, enabling realistic motion planning testing.
• Specifically, use paired data in the form of ground truth labels and
real perception and prediction outputs to train a model that predicts
what the online system will produce.
• Importantly, the inputs to the system consist of high-definition maps,
bounding boxes, and trajectories, which can be sketched in a matter of
minutes, making this a much more scalable solution.

The goal is to simulate the outputs of the SDV’s perception and prediction system to realistically test
its motion planner. For each timestep, the system ingests an HD map and a set of actors (bounding
boxes and trajectories) and produces noisy outputs similar to those from the real system. To test the
motion planner, replace the real outputs with the simulated ones.

Perturbation models for perception and prediction simulation. NoNoise assumes perfect perception and
prediction. GaussianNoise and MultimodalNoise use marginal noise distributions to perturb each actor’s
shape, position, and whether it is misdetected. ActorNoise accounts for inter-actor variability by predicting
perturbations conditioned on each actor’s bounding box and positions over time.
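A minimal sketch of the GaussianNoise idea, assuming independent zero-mean Gaussian noise on each actor's position, size, and heading plus a fixed misdetection probability. The `Box` type, the function name, and all noise parameters are hypothetical and chosen only for illustration; the source does not state the actual noise distributions.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    x: float        # center x in meters
    y: float        # center y in meters
    length: float   # meters
    width: float    # meters
    heading: float  # radians


def gaussian_noise(boxes, rng, pos_std=0.2, size_std=0.1,
                   heading_std=0.05, miss_prob=0.05):
    """Perturb each ground-truth box independently (marginal noise).

    All standard deviations and miss_prob are illustrative assumptions.
    """
    noisy = []
    for box in boxes:
        if rng.random() < miss_prob:
            continue  # simulated misdetection: the actor is dropped
        noisy.append(replace(
            box,
            x=box.x + rng.gauss(0.0, pos_std),
            y=box.y + rng.gauss(0.0, pos_std),
            length=max(0.1, box.length + rng.gauss(0.0, size_std)),
            width=max(0.1, box.width + rng.gauss(0.0, size_std)),
            heading=box.heading + rng.gauss(0.0, heading_std),
        ))
    return noisy


ground_truth = [Box(0.0, 0.0, 4.5, 2.0, 0.0), Box(10.0, 3.5, 4.8, 2.1, 0.1)]
simulated = gaussian_noise(ground_truth, random.Random(0))
```

Because every actor is perturbed with the same marginal distribution, this kind of model ignores context such as occlusion or distance, which is the limitation that ActorNoise and ContextNoise are described as addressing.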

ContextNoise for perception and prediction simulation. Given BEV rasterized images of the scene (drawn
from bounding boxes and HD maps), the model simulates outputs similar to those from the real perception
and prediction system. It consists of: (i) a shared backbone feature extractor; (ii) a perception head for
simulating bounding box outputs; and (iii) a prediction head for simulating future states outputs.
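The BEV input described above can be illustrated with a small rasterizer. This is only a sketch under stated assumptions: it draws oriented bounding boxes into a single occupancy channel centered on the ego vehicle, and it omits the HD map channels and the network itself. The function name, grid size, and resolution are hypothetical.

```python
import math


def rasterize_boxes(boxes, grid_size=32, resolution=1.0):
    """Rasterize oriented boxes into a bird's-eye-view occupancy grid.

    boxes: iterable of (cx, cy, length, width, heading) in meters/radians.
    Returns a grid_size x grid_size list of 0/1 values, ego at the center.
    """
    half_extent = grid_size * resolution / 2.0
    grid = [[0] * grid_size for _ in range(grid_size)]
    for cx, cy, length, width, heading in boxes:
        cos_h, sin_h = math.cos(heading), math.sin(heading)
        for row in range(grid_size):
            for col in range(grid_size):
                # Metric coordinates of the cell center.
                px = (col + 0.5) * resolution - half_extent
                py = (row + 0.5) * resolution - half_extent
                dx, dy = px - cx, py - cy
                # Rotate the offset into the box frame.
                local_x = cos_h * dx + sin_h * dy
                local_y = -sin_h * dx + cos_h * dy
                if abs(local_x) <= length / 2 and abs(local_y) <= width / 2:
                    grid[row][col] = 1
    return grid


bev = rasterize_boxes([(0.0, 0.0, 4.0, 2.0, 0.0)])
```

In a full pipeline, one such channel per timestep and per map layer would presumably be stacked into the multi-channel image consumed by the backbone.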

Simulation results on ATG4D: PLT motion planning results when given real perception and prediction (top)
versus simulations from NoNoise (middle) and ContextNoise (bottom). ContextNoise faithfully simulates a
misprediction due to multi-modality and induces a lane-change behavior from the motion planner.

LiDARsim: Realistic LiDAR Simulation by Leveraging the
Real World
• Addresses the problem of producing realistic simulations of LiDAR point clouds, the sensor
of preference for most self-driving vehicles.
• By leveraging real data, simulate the complex world more realistically compared
to employing virtual worlds built from CAD/procedural models.
• Towards this goal, first build a large catalog of 3D static maps and 3D dynamic
objects by driving around several cities with our self-driving fleet.
• Then generate scenarios by selecting a scene from the catalog and "virtually" placing
the self-driving vehicle (SDV) and a set of dynamic objects from the catalog in
plausible locations in the scene.
• To produce realistic simulations, develop a simulator that combines the
power of physics-based and learning-based simulation.
• First utilize ray casting over the 3D scene and then use a deep neural network to
produce deviations from the physics-based simulation, producing realistic LiDAR
point clouds.

First create the assets from real data, and then compose them into a
scene and simulate the sensor with physics and machine learning.

Collect real data from multiple trajectories in the same area, remove moving objects, aggregate
and align the data, and create a mesh surfel representation of the background.
From left to right: Individual sweep, Accumulated cloud,
Symmetry completion, outlier removal and surfel meshing

Raydrop physics: Multiple real-world
factors and sensor biases determine if the
signal is detected by LiDAR receiver
Raydrop network: Using ML and real data
to approximate the raydropping process
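The raydrop step can be sketched as a per-ray Bernoulli mask applied to the ray-casted returns. In the sketch below, `raydrop_probability` is a hand-written heuristic standing in for the learned raydrop network; its form and coefficients are assumptions for illustration only and do not come from the source.

```python
import math
import random


def raydrop_probability(range_m, incidence_rad, max_range=100.0):
    """Hypothetical stand-in for the learned raydrop network.

    Assumes rays are dropped more often at long range and at grazing
    incidence angles; the coefficients are arbitrary.
    """
    range_term = min(range_m / max_range, 1.0)
    angle_term = 1.0 - abs(math.cos(incidence_rad))
    return min(1.0, 0.1 + 0.6 * range_term + 0.3 * angle_term)


def apply_raydrop(hits, rng):
    """Keep each ray-casted hit with probability 1 - p_drop.

    hits: list of (range_m, incidence_rad) produced by ray casting.
    """
    return [hit for hit in hits
            if rng.random() >= raydrop_probability(*hit)]


ray_casted = [(10.0, 0.0), (50.0, 0.5), (90.0, 1.2)] * 100
kept = apply_raydrop(ray_casted, random.Random(0))
```

The physics-based ray casting supplies the candidate returns, and the mask removes some of them so the simulated sweep is not unrealistically dense.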

Top: Scale of the vehicle bank
(displaying several hundred
vehicles out of 25,000). Bottom:
Diversity of vehicle bank colored
by intensity, overlaid on vehicle
dimension scatter plot; Examples
(left to right): opened hood, bikes
on top of vehicle, opened trunk,
pickup with bucket, intensity
shows text, traffic cones on truck,
van with trailer, tractor on truck

Recovering and Simulating Pedestrians in the Wild
• Sensor simulation is a key component for testing performance of SDVs and
for data augmentation to better train perception systems.
• Recover the shape and motion of pedestrians from sensor readings
captured in the wild by a self-driving car driving around.
• Towards this goal, formulate the problem as energy minimization in a deep
structured model that exploits human shape priors, reprojection
consistency with 2D poses extracted from images, and a ray-caster that
encourages the reconstructed mesh to agree with the LiDAR readings.
• The method does not require any ground-truth 3D scans or 3D pose annotations.
• Then incorporate the reconstructed pedestrian asset bank in a realistic
LiDAR simulation system by performing motion retargeting.
• The simulated LiDAR data can be used to significantly reduce the amount of
annotated real-world data required for visual perception tasks.

Recover realistic 3D human meshes and poses from sequences of LiDAR and camera readings,
which can then be used in sensor simulation for perception algorithm training and testing.
LiDAR for human Mesh Estimation (LiME): Given sensory observations, a sensor fusion regression
network predicts the human parameters which minimize the objective function; energy
minimization is then performed over the sequence to obtain an optimized shape and 3D pose.

Result on real world data using LiME: (1)
Camera image. (2) Reconstructed mesh. (3)
Ray-casted points on recovered mesh,
overlapped with GT LiDAR points. (4) Side view.
Interpolating a mesh for a query trajectory from
recovered asset mesh sequences.

Quantitative results of the method on the 3DPW dataset. The sensory input consists of a camera
image and synthetic LiDAR points. Results are shown for the method using both the SMPL model and the human
model, compared with SPIN.

S3: Neural Shape, Skeleton, and Skinning
Fields for 3D Human Modeling
• Constructing and animating humans is an important component for building
virtual worlds in a wide variety of applications such as virtual reality or robotics
testing in simulation.
• As there are exponentially many variations of humans with different shape, pose
and clothing, it is critical to develop methods that can automatically reconstruct
and animate humans at scale from real world data.
• Towards this goal, represent the pedestrian’s shape, pose and skinning weights as
neural implicit functions that are directly learned from data.
• This representation enables us to handle a wide variety of different pedestrian
shapes and poses without explicitly fitting a human parametric body model,
allowing it to handle a wider range of human geometries and topologies.
• Generate 3D human animations at scale from a single RGB image (and/or an
optional LiDAR sweep) as input.

Given a single image and/or a single LiDAR sweep as input, model infers shape, skeleton and
skinning jointly, which can then be used to generate animated 3D characters in novel poses.

From left to right: Process input sensor data into spatial feature representations.
Query points adaptively from 3D space and extract their point encoding, used to
query neural implicit representations of shape, pose, and skinning. Apply
post-processing to construct the final explicit representation of an animatable person.

SceneGen: Learning to Generate Realistic Traffic Scenes
• To generate realistic traffic scenes automatically, existing
methods typically insert actors into the scene according to a set of hand-crafted
heuristics and are limited in their ability to model the true complexity and
diversity of real traffic scenes, thus inducing a content gap between synthesized
traffic scenes versus real ones.
• As a result, existing simulators (SUMO, CORSIM, VISSIM, and MITSIM) lack the
fidelity necessary to train and test self-driving vehicles.
• To address this limitation, SceneGen is a neural autoregressive model of traffic
scenes that eschews the need for rules and heuristics.
• In particular, given the ego-vehicle state and a high definition map of surrounding
area, SceneGen inserts actors of various classes into the scene and synthesizes
their sizes, orientations, and velocities.
• SceneGen coupled with sensor simulation can be used to train perception models
that generalize to the real world.

Autoregressive Traffic Scene Generation and Probabilistic Actor Model
Given the ego SDV’s state and an HD map of the surrounding area, SceneGen generates
a traffic scene by inserting actors one at a time. Model each actor probabilistically, as a
product over distributions of its class, position, bounding box, and velocity.
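The one-actor-at-a-time procedure can be sketched as a sampling loop in which each actor is drawn as class, then position, then bounding box, then velocity. Everything concrete below is a toy assumption: the class list, the stop token, the fixed box sizes, and the uniform and Gaussian distributions stand in for conditionals that the text describes as learned and conditioned on the HD map and the ego state.

```python
import random

# Assumed class vocabulary; "end" is a hypothetical stop token.
CLASSES = ["vehicle", "pedestrian", "bicyclist", "end"]
# Hypothetical fixed (length, width) in meters per class.
SIZES = {"vehicle": (4.5, 2.0), "pedestrian": (0.6, 0.6),
         "bicyclist": (1.8, 0.6)}


def sample_actor(existing, rng):
    """Sample one actor as class -> position -> box -> velocity.

    The only dependence on previously inserted actors in this toy is
    that the stop token becomes more likely as the scene fills up.
    """
    weights = [0.5, 0.2, 0.1, 0.2 + 0.1 * len(existing)]
    cls = rng.choices(CLASSES, weights=weights)[0]
    if cls == "end":
        return None
    position = (rng.uniform(-40.0, 40.0), rng.uniform(-40.0, 40.0))
    length, width = SIZES[cls]
    heading = rng.uniform(-3.14159, 3.14159)
    speed = abs(rng.gauss(0.0, 5.0 if cls == "vehicle" else 1.5))
    return {"class": cls, "position": position,
            "box": (length, width, heading), "speed": speed}


def generate_scene(rng, max_actors=50):
    """Insert actors one at a time until the stop token is sampled."""
    actors = []
    while len(actors) < max_actors:
        actor = sample_actor(actors, rng)
        if actor is None:
            break
        actors.append(actor)
    return actors


scene = generate_scene(random.Random(0))
```

The factorized order matters: sampling the class first lets the later distributions (position, size, speed) depend on it, which is the "product over distributions" structure named in the caption.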

The input multi-channel image to SceneGen for ATG4D

Traffic scenes generated by SceneGen conditioned on HD maps from ATG4D (top) and Argoverse (bottom).

Qualitative comparison of traffic scenes generated by SceneGen and various baselines

TrafficSim: Learning to Simulate Realistic
Multi-Agent Behaviors
• To close the gap between simulation and the real world, simulate realistic multi-agent behaviors.
• Heuristic-based models directly encode traffic rules and therefore cannot capture
irregular maneuvers (nudging, U-turns) and complex interactions (yielding, merging).
• In contrast, leverage real-world data to learn directly from human demonstration
and thus capture a more diverse set of actor behaviors.
• TRAFFICSIM, a multi-agent behavior model for realistic traffic simulation.
• Leverage an implicit latent variable model to parameterize a joint actor policy that
generates socially consistent plans for all actors in the scene jointly.
• To learn a robust policy for long horizon simulation, unroll the policy in training and
optimize through the fully differentiable simulation across time.
• The learning objective incorporates both demonstrations and common sense.
• Exploit trajectories generated by TRAFFICSIM as effective data augmentation for
training a better motion planner.

Generating realistic multi-agent behaviors
is a key component for simulation
Complex human driving behavior observed in the real
world: red is actor of interest, green are interacting actors

TRAFFICSIM architecture: global map module (a) is run once per map for repeated simulation
runs. At each timestep, local observation module (b) extracts motion and map features, then
joint behavior module (c) produces a multi-agent plan.

TRAFFICSIM models all actors jointly to simulate realistic traffic scenarios
through time. Sample at each timestep to obtain parallel simulations

Optimize policy with back-propagation through the differentiable simulation (left),
and apply imitation and common sense loss at each simulated state (right).

Time-adaptive multi-task loss: the adaptive weight is a decreasing
function of the simulation timestep.
Collision loss: a differentiable relaxation approximates each vehicle as 5 circles and
considers the distance between the closest centroids.
GeoSim: Realistic Video Simulation via Geometry-Aware
Composition for Self-Driving
• GeoSim, a geometry-aware image composition process which synthesizes novel
urban driving scenarios by augmenting existing images with dynamic objects
extracted from other scenes and rendered at novel poses.
• Towards this goal, first build a diverse bank of 3D objects with both realistic
geometry and appearance from sensor data.
• During simulation, perform a novel geometry-aware simulation-by-composition
procedure:
• 1) proposes plausible and realistic object placements into a given scene,
• 2) renders novel views of dynamic objects from the asset bank,
• 3) composes and blends the rendered image segments.
• The resulting synthetic images are realistic, traffic-aware, and geometrically
consistent, allowing the approach to scale to complex use cases.
• Applications: long-range realistic video simulation across multiple camera sensors, and synthetic
data generation for data augmentation on downstream segmentation tasks.

Realistic video simulation via geometry-aware composition for self-driving. A data-driven image manipulation
approach that inserts dynamic objects into existing videos. The resulting synthetic video footage is highly
realistic, layout-aware, and geometrically consistent, allowing image simulation to scale to complex use cases.

Realistic 3D assets creation. Left: multi-view multi-sensor reconstruction network; Right: 3D
asset samples. For each sample we show one of the source images and the 3D mesh.

3D reconstruction network architecture. Left: Image feature extraction
backbone; Right: Multi-view image fusion block.

3D-aware object placement, segment retrieval, and temporal simulation.

Geometry-aware composition with occlusion reasoning followed by an image synthesis module.

Schematics of shadow generation. (left to right): result without shadow, schematics of
virtual scene, shadow weight (ratio of intensity between rendered image with inserted
object and without inserted object), result with shadow
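The shadow weight in the caption is a per-pixel intensity ratio, which can be sketched directly. The sketch assumes grayscale images stored as nested lists of floats in [0, 1]; the function names and the `eps` guard against division by zero are illustrative additions, not part of the source.

```python
def shadow_weight(with_object, without_object, eps=1e-6):
    """Per-pixel ratio of rendered intensity with the inserted object
    to rendered intensity without it (values below 1 mark shadow)."""
    return [
        [w / max(wo, eps) for w, wo in zip(row_w, row_wo)]
        for row_w, row_wo in zip(with_object, without_object)
    ]


def apply_shadow(image, weight):
    """Darken the real image by the shadow weight, clamped to 1.0."""
    return [
        [min(1.0, pixel * w) for pixel, w in zip(row_i, row_w)]
        for row_i, row_w in zip(image, weight)
    ]


rendered_with = [[0.5, 1.0]]
rendered_without = [[1.0, 1.0]]
real_image = [[0.8, 0.4]]
shadowed = apply_shadow(real_image,
                        shadow_weight(rendered_with, rendered_without))
```

Using a ratio rather than an absolute difference means the shadow darkens the real image proportionally, so textured backgrounds keep their detail under the inserted shadow.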

AdvSim: Generating Safety-Critical Scenarios
for Self-Driving Vehicles
• As self-driving systems become better, simulating scenarios where the autonomy
stack may fail becomes more important.
• Traditionally, those scenarios are generated for a few scenes with respect to a planning module that
takes ground-truth actor states as input, which does not scale and cannot identify
all possible autonomy failures, such as perception failures due to occlusion.
• AdvSim, an adversarial framework, generates safety critical scenarios for any LiDAR-
based autonomy system.
• Given an initial traffic scenario, AdvSim modifies the actors’ trajectories in a
physically plausible manner and updates the LiDAR sensor data to match the
perturbed world.
• Importantly, by simulating directly from sensor data, AdvSim obtains adversarial scenarios
that are safety-critical for the full autonomy stack.
• The robustness and safety of these systems can be further improved by training
them with scenarios generated by AdvSim.

The goal is to perturb the maneuvers of interactive actors in an existing scenario with adversarial behaviors that
cause realistic autonomy system failures. Given an existing scenario and its original sensor data, perturb the
scenario and update the LiDAR sensor data to reflect how the SDV would observe the new scene
configuration. Then evaluate the autonomy system on the modified scenario, compute an adversarial objective,
and update the proposed perturbation using a search algorithm.
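The perturb, evaluate, and update loop can be sketched as a black-box search. The sketch below uses random hill climbing as a stand-in for whichever search algorithm is actually used, perturbs only the lateral offsets of a single actor, keeps the ego path fixed instead of re-running the autonomy stack, and uses the negative closest approach as a toy adversarial objective. All of these are simplifying assumptions, and every name is hypothetical.

```python
import math
import random


def min_distance(ego_path, actor_path, offsets):
    """Closest approach between ego and the laterally shifted actor."""
    return min(
        math.dist(ego, (ax, ay + offset))
        for ego, (ax, ay), offset in zip(ego_path, actor_path, offsets)
    )


def adversarial_search(ego_path, actor_path, rng, iters=200,
                       step=0.5, max_shift=3.0):
    """Random hill climbing over per-timestep lateral offsets.

    max_shift is a crude plausibility bound on the perturbation; the
    objective is a toy proxy for evaluating the autonomy system.
    """
    best = [0.0] * len(actor_path)
    best_score = -min_distance(ego_path, actor_path, best)
    for _ in range(iters):
        candidate = [
            max(-max_shift, min(max_shift, o + rng.gauss(0.0, step)))
            for o in best
        ]
        score = -min_distance(ego_path, actor_path, candidate)
        if score > best_score:  # keep the more dangerous perturbation
            best, best_score = candidate, score
    return best, best_score


ego = [(float(t), 0.0) for t in range(10)]
actor = [(float(t), 3.5) for t in range(10)]
offsets, score = adversarial_search(ego, actor, random.Random(0))
```

In the setting described above, each candidate would additionally require regenerating the LiDAR data and running the full autonomy stack before the objective can be computed, which is why the search is treated as black-box.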

Realistic LiDAR simulation for scenario perturbations. Given a scenario perturbation on the actors’ motions, the
previously recorded LiDAR data is modified to accurately reflect the updated scene configuration. Remove the
original actor LiDAR observations and replace with simulated actor LiDAR observations at the perturbed
locations, while ensuring sensor realism. The above example perturbs all actors left by 5 meters.
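The remove-and-replace step can be illustrated on a toy point cloud. This sketch makes strong simplifications that should be read as assumptions: the actor is identified by an axis-aligned 2D box, "left" is taken as the +y direction, and the original actor points are simply translated instead of being re-simulated at the new location, so occlusion and sensor realism are not handled.

```python
def in_box(point, box):
    """True if the point's (x, y) lies in the axis-aligned box
    given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max


def shift_actor(points, box, lateral_shift):
    """Remove the actor's points and re-insert them shifted in y.

    Translation is a stand-in for simulating new LiDAR observations
    of the actor at the perturbed location.
    """
    background = [p for p in points if not in_box(p, box)]
    actor = [p for p in points if in_box(p, box)]
    moved = [(x, y + lateral_shift, z) for x, y, z in actor]
    return background + moved


sweep = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.2), (10.0, 10.0, 0.0)]
perturbed = shift_actor(sweep, box=(-1.0, -1.0, 2.0, 1.0),
                        lateral_shift=5.0)
```

A realistic version would also need to fill in the background that the original actor occluded and drop background points that the moved actor now hides.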

Visualization of autonomy system’s output plan on original and corresponding adversarial
scenes. A: IL avoids the high-speed lane-changing vehicle behind but collides with the front
one. B: NMP collides with the merging vehicle. C: PLT collides with one vehicle and two
occluded pedestrians at crossroads. D: P3 collides with the crossing pedestrian.

SurfelGAN: Synthesizing Realistic Sensor Data for
Autonomous Driving
• Autonomous driving system development is critically dependent on
the ability to replay complex and diverse traffic scenarios in
simulation.
• In such scenarios, the ability to accurately simulate the vehicle
sensors such as cameras, lidar or radar is essential.
• However, current sensor simulators leverage gaming engines such as
Unreal or Unity, requiring manual creation of environments, objects
and material properties.
• They have limited scalability and fail to produce realistic
approximations of camera, lidar, and radar data without significant
additional work.

• SurfelGAN generates realistic scenario sensor data, based only on a
limited amount of lidar and camera data collected by an autonomous vehicle.
• It uses texture-mapped surfels to reconstruct the scene from an initial
vehicle pass or set of passes, preserving rich information about object
3D geometry and appearance, as well as the scene conditions.
• Then leverage a SurfelGAN network to reconstruct realistic camera images for
novel positions and orientations of the SDV and moving objects in the scene.
• It creates a dataset where two self-driving vehicles observe the same
scene at the same time, which is used to provide additional
evaluation and demonstrate the usefulness of the SurfelGAN model.

a) The goal is the generation of camera images for autonomous driving simulation. When
provided with a novel trajectory of the self-driving vehicle in simulation, the system generates
realistic visual sensor data that is useful for downstream modules such as an object detector, a
behavior predictor, or a motion planner. At a high level, the method consists of two steps: b) First,
scan the target environment and reconstruct a scene consisting of rich textured surfels. c) Surfels
are rendered at the camera pose of the novel trajectory, alongside semantic and instance
segmentation masks. Through a GAN, generate realistic-looking camera images.

SurfelGAN training paradigm. The training setup has two symmetric encoder-decoder generators mapping from surfel
renderings to real images and vice versa. Additionally, there are two discriminators, which specialize in the surfel and the real
domain. The losses are shown as colored arrows. Green: supervised reconstruction loss. Red: adversarial loss. Blue/Yellow:
cycle-consistency losses. When training with paired data, e.g. WOD-TRAIN, translate the surfel renderings to real images
and apply either a one-directional supervised reconstruction loss only (SurfelGAN-S) or add an adversarial loss
(SurfelGAN-SA). When training with unpaired data, start from either the surfel renderings (e.g. WOD-TRAIN-NV) or the real
images (e.g. Internal Camera Dataset), use one of the encoder-decoder networks to get to the other domain and back, then
apply a cycle-consistency loss (SurfelGAN-SAC). The encoder-decoder networks consist of 8 convolutional and 8
deconvolutional layers. Discriminators consist of 5 convolutional layers. All networks operate on 256x256 inputs.

Qualitative comparison
between different
SurfelGAN variants and
the baseline on WOD-EVAL
under different
weather conditions.
