3D Interpretation from Single 2D Image
for Autonomous Driving IV
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
• CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D
Object Detection
• Ground-aware Monocular 3D Object Detection for Autonomous Driving
• Categorical Depth Distribution Network for Monocular 3D Object Detection
• Depth-conditioned Dynamic Message Propagation for Monocular 3D Object
Detection
• Geometry-based Distance Decomposition for Monocular 3D Object Detection
• Geometry-aware data augmentation for monocular 3D object detection
• Lidar Point Cloud Guided Monocular 3D Object Detection
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
• Pseudo-LiDAR-based methods for monocular 3D object detection have attracted considerable
attention in the community.
• This has created a distorted impression of the superiority of Pseudo-LiDAR approaches
over methods working with RGB images only.
• The 1st contribution is analysing and showing experimentally that the validation results
published by Pseudo-LiDAR-based methods are substantially biased.
• The source of the bias resides in an overlap between the KITTI3D object detection validation
set and the training/validation sets used to train depth predictors feeding Pseudo-LiDAR-
based methods.
• Surprisingly, the bias remains also after geographically removing the overlap, revealing the
presence of a more structured contamination.
• This leaves the test set as the only reliable means of comparison, where published Pseudo-
LiDAR-based methods do not excel.
• The second contribution brings Pseudo-LiDAR based methods back up in the ranking with
the introduction of a 3D confidence prediction module.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
The work analyzes the cause of the performance bias
of monocular Pseudo-LiDAR-based (PL)
methods, which consists in a substantial
drop between the results on the KITTI3D
validation and test sets. It shows that this bias
is due to the fact that the depth estimators
on which PL methods heavily rely have been
trained on a depth training set (black lines)
that includes 30% of the detection
validation set data (red lines). It proposes to
remove this bias by creating an alternative
unbiased depth training set (green lines)
that eliminates the overlap as well as
introduces a geographical distance w.r.t.
the detection validation data.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
The table shows that certain sub-tasks like
rotation (R) and shape (W;H;L) prediction,
despite the substitution with ground-truth
values, do not significantly improve
performance. In contrast, substituting the
predicted depth estimate (Z) with ground
truth improves performance substantially, meaning that
depth is by far the most crucial component
for 3D object detection.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
BTS (“From big to small: Multi-scale local planar guidance for monocular depth estimation”)
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
• The 3D object detection task requires associating each object with a 3D bounding box and
a corresponding confidence value.
• This confidence should generally reflect the quality of the 3D bounding box and can be
thought of as a measure of how reliable the particular estimate is.
• The existing Pseudo-LiDAR methods do not perform 3D confidence estimation in any
way and rely on the class probability coming along with the 2D detections.
• By doing so, the confidence adopted by current PL-based methods is actually agnostic to
the quality of the 3D predictions and therefore not effective for the role it should take.
• 2D detectors are often too confident, and the need for a 3D confidence seems essential.
• The proposal is to endow PL-based methods with the ability to estimate the 3D confidence of
their predictions.
• This architecture can be divided into three main branches namely 2D Detection, Pseudo-
LiDAR and 3D Detection.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Architecture of a Pseudo-LiDAR-based method integrating the 3D confidence component.
1."Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving." CVPR 2019.
2. PatchNet: "Rethinking pseudo-lidar representation" ECCV'2020
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Example of the final part of the architecture, showing the Confidence Head added to the PatchNet architecture.
The Confidence Branch requires minimal modifications to the original architecture, adds negligible
computational complexity and inference time, and is compatible with most Pseudo-LiDAR approaches.
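To make the idea concrete, below is a minimal sketch of what such a 3D confidence head could look like, assuming (this is an assumption, not detailed in the slides) that it is a small MLP attached to per-object 3D detection features and trained to regress the 3D IoU between predicted and ground-truth boxes; all module and tensor names are hypothetical.

```python
import torch
import torch.nn as nn

class Confidence3DHead(nn.Module):
    """Hedged sketch: a small MLP that maps per-object 3D detection
    features to a confidence score in [0, 1]."""
    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, box_feats: torch.Tensor) -> torch.Tensor:
        # box_feats: (num_objects, feat_dim) features from the 3D detection branch
        return self.mlp(box_feats).squeeze(-1)

# Training target (assumption): the 3D IoU between each predicted box and its matched
# ground-truth box, so the score reflects 3D quality rather than the 2D class probability.
def confidence_loss(pred_conf, iou_3d):
    return nn.functional.binary_cross_entropy(pred_conf, iou_3d.clamp(0, 1))
```

At inference, such a 3D confidence (possibly combined with the 2D class score) would replace the 2D-only score used for ranking detections.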
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
• Starting from a synthetic dataset, pre-train an RGB-to-Depth Auto-Encoder (AE).
• The embedding learnt from this AE is then used to train a 3D Object Detector
(3DOD) CNN which is used to regress the parameters of 3D object poses after the
encoder from the AE generates a latent embedding from the RGB image.
• The AE is pre-trained once using paired RGB and depth images from simulation data, and
subsequently only the 3DOD network is trained using real data, comprising RGB
images and 3D object pose labels (without the requirement of dense depth).
• The 3DOD network utilizes a particular 'cubification' of 3D space around the
camera, where each cuboid is tasked with predicting N object poses, along with
their class and confidence values.
• A method for 3D object detection using a single monocular image, CubifAE-3D,
including AE pre-training + dividing 3D space around the camera into cuboids.
• The first part refers to the cubification/voxelization of the monocular camera space as a pre-
processing step, and AE refers to the auto-encoding of the RGB-to-depth space.
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
CubifAE-3D high-level architecture
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
• Figure (the next page): CubifAE-3D architecture.
• The RGB-to-depth auto-encoder is first trained in a supervised way with a
combination of MSE and edge-aware smoothing losses.
• Once trained, the decoder is detached, the encoder weights are frozen, and the
encoder output is fed to the 3DOD model, which is trained with a
combination of xyz_loss, whl_loss, orientation_loss, iou_loss, and conf_loss
(see the training-flow sketch after this list).
• A 2D bounding box is obtained for each object by projecting its detected 3D
bounding box onto the camera image plane; the crop is then resized to 64x64
and fed to the classifier model (bottom branch) along with the normalized
whl vector for class prediction.
• The dimensions indicated correspond to the output tensor for each block.
• The encoder head can also be replaced by a pretrained backbone network (VGG-16),
which yields improved performance.
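A minimal sketch of the two-stage training flow described above, assuming a PyTorch-style encoder and 3DOD head; the class names, target keys, and loss choices are hypothetical placeholders, not the authors' code, and the IoU loss from the paper is omitted for brevity.

```python
import torch
import torch.nn as nn

def train_cubifae_3dod(encoder: nn.Module, det_head: nn.Module, loader, optimizer):
    """Stage 2 (sketch): the RGB-to-depth encoder is frozen and only the
    3D object detection head is trained on real data."""
    for p in encoder.parameters():
        p.requires_grad = False          # freeze the pre-trained encoder
    encoder.eval()

    l1 = nn.L1Loss()
    bce = nn.BCEWithLogitsLoss()
    for rgb, targets in loader:          # targets: dict of per-cuboid regression labels
        with torch.no_grad():
            z = encoder(rgb)             # latent embedding from the frozen encoder
        pred = det_head(z)               # per-cuboid pose, size, orientation, confidence
        loss = (l1(pred["xyz"], targets["xyz"])
                + l1(pred["whl"], targets["whl"])
                + l1(pred["orientation"], targets["orientation"])
                + bce(pred["conf"], targets["conf"]))   # iou_loss omitted in this sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```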
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Detailed model architecture of CubifAE-3D
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
The RGB-to-depth auto-encoder
The total loss function for this model
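The equation itself is not reproduced in the slide text; a plausible form of the total auto-encoder loss, consistent with the MSE plus edge-aware smoothing combination mentioned on the previous slide (the weighting λ is an assumption), is:

```latex
\mathcal{L}_{AE} =
\underbrace{\frac{1}{N}\sum_{p}\big(\hat{d}(p) - d(p)\big)^{2}}_{\text{MSE}}
\;+\;
\lambda \underbrace{\frac{1}{N}\sum_{p}\Big(|\partial_x \hat{d}(p)|\,e^{-|\partial_x I(p)|}
+ |\partial_y \hat{d}(p)|\,e^{-|\partial_y I(p)|}\Big)}_{\text{edge-aware smoothness}}
```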
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Monocular RGB to Depth Map prediction
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Cubification of the camera space: the perception RoI is divided into a 4x4xM grid (the x and y directions
aligned with the image plane, with M cuboids stacked along the z direction at each grid cell). Each cuboid
is responsible for predicting up to N object poses. The object coordinates and dimensions are then
normalized to [0, 1] according to a prior computed from data statistics.
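As an illustration of how an object could be assigned to a cuboid and its targets normalized, here is a hedged sketch; the RoI bounds, grid resolution in z, and size prior are placeholder assumptions (the real values come from the dataset statistics), not the authors' configuration.

```python
import numpy as np

# Assumed perception RoI bounds in camera coordinates (placeholders, in meters)
X_RANGE, Y_RANGE, Z_RANGE = (-20.0, 20.0), (-2.0, 6.0), (0.0, 80.0)
GRID_XY, GRID_Z = 4, 8          # 4x4 grid in x/y, M=8 cuboids stacked in z (assumption)

def assign_and_normalize(center_xyz, whl, whl_prior=(1.8, 1.6, 4.0)):
    """Return the cuboid index and 0-1 normalized regression targets for one object."""
    x, y, z = center_xyz
    ix = int((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]) * GRID_XY)
    iy = int((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]) * GRID_XY)
    iz = int((z - Z_RANGE[0]) / (Z_RANGE[1] - Z_RANGE[0]) * GRID_Z)
    ix, iy, iz = [int(np.clip(v, 0, n - 1)) for v, n in ((ix, GRID_XY), (iy, GRID_XY), (iz, GRID_Z))]
    # Normalize the center within its cuboid and the size by a dataset prior
    cell = np.array([(X_RANGE[1] - X_RANGE[0]) / GRID_XY,
                     (Y_RANGE[1] - Y_RANGE[0]) / GRID_XY,
                     (Z_RANGE[1] - Z_RANGE[0]) / GRID_Z])
    origin = np.array([X_RANGE[0], Y_RANGE[0], Z_RANGE[0]]) + np.array([ix, iy, iz]) * cell
    xyz_norm = (np.array(center_xyz) - origin) / cell
    whl_norm = np.array(whl) / np.array(whl_prior)
    return (ix, iy, iz), xyz_norm, whl_norm
```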
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Samples of qualitative results on the KITTI dataset. The top part of each image shows
bounding boxes obtained as 2D projections of the predicted 3D poses. The bottom part shows
a bird's-eye view of the object poses with the ego-vehicle positioned at the center of the red
circle drawn on the left, pointing towards the right of the image.
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Qualitative results on the KITTI dataset
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
• Most existing algorithms are based on the geometric constraints of
2D-3D correspondence, which stem from generic 6D object pose
estimation.
• First identify how the ground plane provides additional clues in depth
reasoning in 3D detection in driving scenes.
• Based on this observation, then improve the processing of 3D anchors
and introduce a neural network module to fully utilize such
application-specific priors in the framework of deep learning.
• Introduce a neural network embedded with the proposed module for
3D object detection.
• Further verify the power of the proposed module with a neural network
designed for monocular depth prediction.
• https://www.github.com/Owen-Liuyuxuan/visualDet3D
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Perspective geometry for the GAC module. When calculating the
vertical offsets, assume pixels are foreground object centers. When
computing the depth priors z, assume pixels are on the ground because
they are features to be queried.
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
offset
inverse depth
depth
Relation of depth and height
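The formulas themselves are not reproduced in the slide text. For reference, a hedged reconstruction of the standard ground-plane geometry the GAC module exploits (camera mounted at height $h_{cam}$ above flat ground, focal length $f_y$, principal point row $c_y$; the paper's exact notation may differ):

```latex
% Depth prior for a pixel at image row v assumed to lie on the ground plane:
z_{ground}(v) = \frac{f_y \, h_{cam}}{v - c_y}, \qquad v > c_y
% Relation of depth and height: an object of physical height H spanning h pixels vertically
z = \frac{f_y \, H}{h}
```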
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Ground-Aware Convolution (GAC) Module
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Object detection
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Mono 3D object detection is a key problem for autonomous vehicles, as it provides a
solution with simple configuration compared to typical multi-sensor systems.
• The main challenge in mono 3D detection lies in accurately predicting object depth, inferred
from object and scene cues due to the lack of direct range measurement.
• Many methods attempt to directly estimate depth to assist in 3D detection, but show limited
performance as a result of depth inaccuracy.
• This solution, Categorical Depth Distribution Network (CaDDN), uses a predicted
categorical depth distribution for each pixel to project rich contextual feature information to
the appropriate depth interval in 3D space.
• Then use the computationally efficient bird’s-eye-view (BEV) projection and single-stage
detector to produce the final output detections.
• CaDDN is a fully differentiable, end-to-end (E2E) joint depth estimation and object detection method.
• https://github.com/TRAILab/CaDDN
Categorical Depth Distribution Network for
Monocular 3D Object Detection
(a) Input image. (b) Without depth distribution supervision, BEV features from CaDDN suffer
from smearing effects. (c) Depth distribution supervision encourages BEV features from
CaDDN to encode meaningful depth confidence, in which objects can be accurately detected.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Direct methods estimate 3D detections directly from images without
predicting an intermediate 3D scene representation; they can incorporate the
geometric relationship between the 2D image plane and 3D space to assist
with detections.
• Depth-based methods perform the 3D detection task using pixel-wise depth
maps as an additional input, where the depth maps are precomputed using
monocular depth estimation architectures; Estimated depth maps can be
used in combination with images to perform the 3D detection task.
• Grid-based methods avoid estimating raw depth values by predicting a BEV
grid representation, to be used as input for 3D detection architectures;
Multiple voxels can be projected to the same image feature, leading to
repeated features along the projection ray and reduced detection accuracy.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
CaDDN architecture. The network is composed of three modules that generate 3D feature representations and
one that performs 3D detection. Frustum features G are generated using depth distributions D and transformed
into voxel features V. The voxel features are collapsed to BEV features B for 3D object detection.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• The purpose of the frustum feature network is to project image information
into 3D space by associating image features with estimated depths.
• It follows the design of the semantic segmentation network DeepLabV3 to
estimate the categorical depth distributions from image features (Depth
Distribution Network), modifying the network to produce pixel-wise
probability scores of belonging to depth bins rather than semantic classes,
with a downsample-upsample architecture.
• In parallel to estimating depth distributions, channel reduction
(Image Channel Reduce) is performed on the image features to generate the final image
features, using a 1x1 convolution + BatchNorm + ReLU layer.
• Channel reduction is required to reduce the high memory footprint of the
ResNet-101 features that will be populated in the 3D frustum grid.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Each feature pixel F(u; v) is weighted by its depth
distribution probabilities D(u; v) of belonging to D
discrete depth bins to generate frustum features G(u; v).
Sampling points in each voxel are projected into the
frustum grid. Frustum features are sampled using
trilinear interpolation to populate voxels in V.
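A hedged sketch of the frustum feature construction described in this caption: per-pixel features are weighted by the categorical depth distribution via an outer product along the depth axis. Tensor shapes follow the caption's description; the function and variable names are placeholders.

```python
import torch

def build_frustum_features(image_feats: torch.Tensor, depth_dist: torch.Tensor) -> torch.Tensor:
    """image_feats: (B, C, H, W) reduced image features F
    depth_dist:  (B, D, H, W) per-pixel categorical depth distribution (softmax over D bins)
    returns      (B, C, D, H, W) frustum features G(u, v) = D(u, v) (outer product) F(u, v)
    """
    # Outer product between the C feature channels and the D depth bins at each pixel
    return image_feats.unsqueeze(2) * depth_dist.unsqueeze(1)

# Example shapes
feats = torch.randn(1, 64, 48, 160)
dist = torch.softmax(torch.randn(1, 80, 48, 160), dim=1)
G = build_frustum_features(feats, dist)   # (1, 64, 80, 48, 160)
```

Voxel features V would then be obtained by projecting voxel sampling points into this frustum grid and sampling with trilinear interpolation (e.g., torch.nn.functional.grid_sample on the 5D tensor).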
Categorical Depth Distribution Network for
Monocular 3D Object Detection
The continuous depth space is
discretized in order to define the
set of D bins used in the depth
distributions D. Depth
discretization can be performed
with uniform discretization (UD)
with a fixed bin size, spacing-
increasing discretization (SID)
with increasing bin sizes in log
space, or linear-increasing
discretization (LID) with linearly
increasing bin sizes.
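For concreteness, a hedged sketch of the three discretization schemes; the LID bin-edge expression follows the linearly increasing bin-size construction, and the exact constants and the example depth range should be treated as assumptions rather than the paper's verbatim equations.

```python
import numpy as np

def depth_bin_edges(d_min, d_max, num_bins, mode="LID"):
    i = np.arange(num_bins + 1, dtype=np.float64)
    if mode == "UD":     # uniform discretization: constant bin size
        return d_min + (d_max - d_min) * i / num_bins
    if mode == "SID":    # spacing-increasing: uniform in log space (requires d_min > 0)
        return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / num_bins)
    if mode == "LID":    # linear-increasing: bin size grows linearly with the bin index
        return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))
    raise ValueError(mode)

edges = depth_bin_edges(2.0, 46.8, 80, mode="LID")  # example KITTI-like range (assumption)
```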
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Apply depth distribution labels to supervise predicted depth distributions.
• Depth distribution labels are generated by projecting LiDAR point clouds into
the image frame to create sparse depth maps.
• Depth completion is then performed to generate depth values at each pixel in the image.
• Depth information is required at each image feature pixel, so the
depth maps of size WI x HI are downsampled to the image feature size WF x HF.
• The depth maps are converted to bin indices using the LID discretization
method, followed by a conversion into a one-hot encoding to generate the
depth distribution labels (see the sketch after this list).
• A one-hot encoding ensures the depth distribution labels are sharp, which is essential
to encourage sharpness in the predicted depth distributions via supervision.
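A hedged sketch of the label-generation step just described, reusing the bin edges from the previous sketch; the completed, downsampled depth map is assumed to be given.

```python
import numpy as np

def depth_distribution_labels(depth_map, edges):
    """depth_map: (H_F, W_F) completed, downsampled depth in meters
    edges:     (D+1,) monotonically increasing bin edges (e.g., from LID)
    returns    (D, H_F, W_F) one-hot depth distribution labels
    """
    num_bins = len(edges) - 1
    # np.digitize returns indices in [1, D] for values inside the range; shift to [0, D-1]
    bin_idx = np.clip(np.digitize(depth_map, edges) - 1, 0, num_bins - 1)
    one_hot = np.zeros((num_bins,) + depth_map.shape, dtype=np.float32)
    rows, cols = np.indices(depth_map.shape)
    one_hot[bin_idx, rows, cols] = 1.0
    return one_hot
```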
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• The objective is to learn context- and depth-aware feature representation to
solve the problem of monocular 3D object detection.
• (i) propose a depth conditioned dynamic message propagation (DDMP)
network to effectively integrate the multi-scale depth information with the
image context;
• (ii) this is achieved by first adaptively sampling context-aware nodes in the
image context and then dynamically predicting hybrid depth-dependent
filter weights and affinity matrices for propagating information;
• (iii) augmenting the network with a center-aware depth encoding (CDE) task alleviates the
inaccurate depth prior;
• (iv) the effectiveness is thoroughly demonstrated, with state-of-the-art (SoA) results among
the monocular-based approaches on the KITTI benchmark dataset.
• Code and models are released at https://github.com/fudan-zvg/DDMP
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Left: DDMP adaptively
samples context-aware
nodes (top) in the image
context and dynamically
predicts hybrid depth-
dependent filter weights and
affinity matrices (bottom) for
propagating information.
Right: the improvement of
DDMP-3D (red) over the
baseline (yellow) via center-
aware depth encoding.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• Two branches are involved: the 3D detection branch (blue) and the depth feature
extraction branch (green).
• The RGB images are fed into the upper branch for feature extraction, while the
corresponding depth maps, estimated via an off-the-shelf depth estimator, are sent into the
depth branch for extracting depth-aware features.
• The DDMP (dynamic message propagation) modules in yellow perform the depth-conditioned
dynamic message propagation: they dynamically sample context-aware nodes in the upper
image branch and predict the hybrid filter weights and affinities from multi-scale depth
features of the bottom branch for message propagation (see the simplified sketch after this list).
• Common 3D heads for 3D center, dimension, and orientation regression then produce the
final 3D object boxes.
• CDE (center-aware depth feature encoding) is an auxiliary task for joint-optimization
training that implicitly guides the depth sub-network to learn center-aware depth features for
better object localization; it is discarded during inference.
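Below is a heavily simplified, hedged sketch of depth-conditioned dynamic filtering in the spirit of DDMP, not the authors' implementation: per-position sampling offsets and filter weights are predicted from depth features and used to aggregate sampled image features. The hyperparameters and layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthConditionedMessage(nn.Module):
    """Simplified sketch: depth features predict K sampling offsets and K weights
    per location; image features are sampled at those offsets and aggregated."""
    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.k = k
        self.offset_pred = nn.Conv2d(channels, 2 * k, kernel_size=3, padding=1)
        self.weight_pred = nn.Conv2d(channels, k, kernel_size=3, padding=1)

    def forward(self, img_feat, depth_feat):
        b, c, h, w = img_feat.shape
        offsets = self.offset_pred(depth_feat).view(b, self.k, 2, h, w)   # offsets in pixels
        weights = torch.softmax(self.weight_pred(depth_feat), dim=1)      # (B, K, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img_feat.device),
                                torch.linspace(-1, 1, w, device=img_feat.device),
                                indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                              # (H, W, 2)

        out = torch.zeros_like(img_feat)
        for i in range(self.k):
            # Convert pixel offsets to normalized coordinates and sample image features
            off = offsets[:, i].permute(0, 2, 3, 1)                       # (B, H, W, 2) as (dx, dy)
            scale = torch.tensor([w / 2.0, h / 2.0], device=img_feat.device)
            grid = base.unsqueeze(0) + off / scale
            sampled = F.grid_sample(img_feat, grid, align_corners=True)
            out = out + sampled * weights[:, i:i + 1]
        return out
```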
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Schematic illustration of DDMP-3D
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Illustration of the DDMP module in a single-scale setting. Dynamic nodes are first sampled from
the image and depth feature graphs; for these sampled nodes, the filter weights and affinity
matrices are learned from depth features to propagate the depth-conditioned message.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• Generally, depth maps lose appearance details or fail to discriminate between foreground
instances and the background, giving an unreliable depth prior for depth-assisted 3D object detection.
• It has already been shown that a multi-task strategy can boost each single task to some degree,
benefiting from the multi-fold regularization effect of joint optimization.
• The method therefore augments the main 3D detection task with an auxiliary task that is jointly optimized with it.
• The augmented task, supervised with xyz coordinates in 3D space, uniquely determines a point in the 2D
image plane, which imposes spatial constraints that yield a 3D instance-level understanding.
• With the better instance awareness brought by CDE, the model is able to alleviate the
inaccurate depth prior in situations such as occlusion and distant objects.
• The depth branch adopts a similar network architecture, with a head that only predicts 3D
centers without predefined anchors.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Qualitative comparison of ground truth (green), the baseline (yellow), and our method
(red) on KITTI val set. For better visualization, the first and second columns show RGB
and BEV images of point clouds converted from pre-estimated depth, respectively.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
• The core challenge of monocular 3D object detection is to predict the distance of objects in the
absence of explicit depth information.
• Unlike most existing methods, which regress the distance as a single variable, MonoRCNN uses a
geometry-based distance decomposition to recover the distance from its factors.
• The decomposition factors the distance of objects into the most representative and stable
variables, i.e., the physical height and the projected visual height in the image plane.
• In MonoRCNN, the decomposition maintains the self-consistency between the two heights,
leading to robust distance prediction even when both predicted heights are inaccurate.
• The decomposition also makes it possible to trace the cause of distance uncertainty for different scenarios.
• Such a decomposition makes the distance prediction interpretable, accurate, and robust.
• It directly predicts 3D bounding boxes from RGB images with a compact architecture,
making training and inference simple and efficient.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
The distance decomposition is based on the imaging geometry of a pinhole camera. The
distance from the center of an object to the camera, denoted as Z, can be calculated by Z =
fH/h , where f denotes the focal length of the camera, H denotes the physical height of the
object, and h denotes the length of the projected central line (PCL). The PCL represents the
projection of the vertical line at the center of the 3D bounding box. This equation shows that the
distance of an object is determined by its physical height and its projected visual height.
Note: objects are abstracted as the vertical lines at the centers of their 3D bounding boxes, and their
visual projections as the projections of these vertical lines; the distance is then recovered from them
based on the imaging geometry.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
The architecture of MonoRCNN. It is built upon Faster R-CNN and adds a carefully designed 3D distance
head. The 3D distance head is based on the geometry-based distance decomposition. Specifically, the method
regresses the physical height H, the reciprocal of the projected visual height h_rec = 1/h, and their uncertainties, then
recovers the distance by Z = f H h_rec. Blue arrows represent operations in the network during training and
inference, and orange arrows represent operations to recover 3D bounding boxes during inference.
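A small worked sketch of the distance recovery described above; the KITTI-like focal length and object sizes are assumed example values.

```python
# Geometry-based distance recovery, Z = f * H * h_rec, as described above.
def recover_distance(f_pixels: float, H_meters: float, h_rec: float) -> float:
    return f_pixels * H_meters * h_rec

# Example (assumed values): f ~ 720 px, a car of physical height 1.5 m whose
# projected central line spans 40 px (h_rec = 1/40) -> Z = 720 * 1.5 / 40 = 27 m.
print(recover_distance(720.0, 1.5, 1.0 / 40.0))  # 27.0
```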
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Comparison between the predicted eight projected corners (red boxes) and the predicted visual
height (blue lines). Predicting the eight projected corners fails under challenging cases, such as
occlusion, truncation, and extreme lighting conditions, while predicting the visual height is simpler
and more robust. The images are from the KITTI validation split.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Uncertainty-aware Regression 3D Attribute Head
keypoint loss function
physical size and yaw angle
The loss functions for H and hrec
Overall Loss
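The loss equations are not reproduced in the slide text. For reference, a common form of uncertainty-aware regression (Laplacian aleatoric uncertainty) that matches the description of regressing H, h_rec, and their uncertainties is sketched below; treat the exact terms and weights as assumptions rather than the paper's verbatim formulas.

```latex
% Uncertainty-aware regression for a variable y with prediction \hat{y} and predicted scale \sigma:
\mathcal{L}_{unc}(y, \hat{y}, \sigma) = \frac{|y - \hat{y}|}{\sigma} + \log \sigma
% Applied to the two distance factors:
\mathcal{L}_{dist} = \mathcal{L}_{unc}(H, \hat{H}, \sigma_H)
                   + \mathcal{L}_{unc}(h_{rec}, \hat{h}_{rec}, \sigma_{h_{rec}})
% Overall loss (sketch): 2D detection losses plus keypoint, size/yaw, and distance terms
\mathcal{L} = \mathcal{L}_{2D} + \mathcal{L}_{kpt} + \mathcal{L}_{size,yaw} + \mathcal{L}_{dist}
```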
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
KITTI Examples
nuScenes Cross-Test Examples
Geometry-aware data augmentation for
monocular 3D object detection
• This work first conducts a thorough analysis to reveal how existing methods fail to robustly
estimate depth when different geometric shifts occur.
• Through image-based and instance-based manipulations, it illustrates how vulnerable these methods
are at capturing consistent relationships between depth and both object apparent sizes and positions.
• Those manipulations are converted into four corresponding 3D-aware data augmentation techniques.
• At the image level, the camera system is randomly manipulated, including its focal length,
receptive field and location, to generate new training images with geometric shifts.
• At the instance level, foreground objects are cropped and randomly pasted into other scenes
to generate new training instances.
• All the proposed augmentation techniques share the virtue that the geometric relationships of
objects are preserved while their geometry is manipulated.
• Not only is the instability of depth recovery effectively alleviated, but the final 3D
detection performance is also significantly improved.
Geometry-aware data augmentation for
monocular 3D object detection
Geometric manipulations
Geometry-aware data augmentation for
monocular 3D object detection
3D point to image coordinate
Use focal length to infer depth
Use vertical position to infer depth
Shifting the camera focal length
Manipulate the camera receptive field
Moving the camera
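The equations on this slide are not reproduced; a hedged reconstruction of the underlying pinhole relations (standard geometry, with $h_{cam}$ the camera height above the ground and $(c_x, c_y)$ the principal point; the paper's notation may differ) is:

```latex
% 3D point (X, Y, Z) in camera coordinates to image coordinates:
u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y
% Depth from apparent size (focal-length cue): an object of height H spanning h pixels
Z = \frac{f_y H}{h}
% Depth from vertical position (ground-contact cue): a ground point projected at row v
Z = \frac{f_y h_{cam}}{v - c_y}
```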
Geometry-aware data augmentation for
monocular 3D object detection
Visualization of the geometric relationships between
depth and both object apparent sizes and positions.
Geometry-aware data augmentation for
monocular 3D object detection
Rotation matrix R from egocentric orientation angle
8 corner points in the object coordinate
coordinate of point
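A hedged sketch of the corner computation these labels refer to, using the common KITTI-style convention (camera y-axis pointing down, yaw Ry about the y-axis, box origin at the bottom center); the corner ordering is a choice for illustration, not necessarily the paper's definition.

```python
import numpy as np

def box3d_corners(center, dims, ry):
    """center: (x, y, z) of the box bottom center in camera coordinates
    dims:   (h, w, l) height, width, length
    ry:     yaw angle around the camera y-axis (egocentric orientation)
    returns (8, 3) corner coordinates in camera coordinates
    """
    h, w, l = dims
    # 8 corner points in the object coordinate frame (origin at the bottom center)
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    corners = np.stack([x, y, z], axis=0)                   # (3, 8)
    # Rotation matrix R around the y-axis from the egocentric orientation angle
    c, s = np.cos(ry), np.sin(ry)
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])
    return (R @ corners).T + np.asarray(center)             # (8, 3)
```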
Geometry-aware data augmentation for
monocular 3D object detection
Empirical analysis of the monocular detector under geometric manipulations.
Both anchor-free (e.g., CenterNet) and anchor-based (e.g., M3D-RPN) models are tried.
Geometry-aware data augmentation for
monocular 3D object detection
Visualization of copy-paste data augmentation with and without geometry awareness
Geometry-aware data augmentation for
monocular 3D object detection
Geometry-aware data augmentation for
monocular 3D object detection
Geometry-aware data augmentation for
monocular 3D object detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
• LiDAR point clouds, which provide accurate depth measurement, can offer
beneficial information for the training of monocular methods.
• Prior works only use LiDAR point clouds to train a depth estimator; this implicit use does not
fully exploit LiDAR point clouds, consequently leading to suboptimal performance.
• To effectively take advantage of LiDAR point clouds, a general, simple yet effective
framework for monocular methods is proposed.
• Specifically, LiDAR point clouds are used to directly guide the training of
monocular 3D detectors, allowing them to learn the desired objectives
while eliminating the extra annotation cost.
• Thanks to the general design, this method can be plugged into any
monocular 3D detection method, significantly boosting the performance.
Lidar Point Cloud Guided Monocular 3D
Object Detection
LiDAR-guided monocular 3D object
detection: LiDAR point clouds are
directly used to guide the training of
the monocular 3D detector.
Lidar Point Cloud Guided Monocular 3D
Object Detection
Qualitative examples of pseudo-
LiDAR-based monocular 3D
detection. From top to bottom:
the RGB image and the 3D
predictions on the bird's-eye-view
(BEV) map. The estimated 3D box
center typically lies near the
converted point cloud, meaning
the approach works well if the
provided depths are accurate
but has difficulty correcting
poorly predicted object depth.
Lidar Point Cloud Guided Monocular 3D
Object Detection
• Specifically, training a depth-map-based method roughly comprises
two stages: (1) training a dense depth estimation network; (2) training a
monocular 3D detector.
• As a common practice, current monocular depth-map-based methods all
utilize projected LiDAR point clouds as ground truth to train the
depth estimator.
• The number of LiDAR point clouds used for training heavily affects the final 3D
detection accuracy.
• This method directly utilizes LiDAR point clouds to generate massive pseudo 3D box
labels for monocular methods.
• This simple yet effective approach allows monocular 3D detectors to learn the desired
objectives while eliminating the extra annotation cost.
• It is able to work in either supervised or unsupervised mode, depending on the
reliance on manual 3D box annotations.
Lidar Point Cloud Guided Monocular 3D
Object Detection
3D boxes are generated from
LiDAR point clouds in order to
train the monocular 3D detector.
Such 3D boxes are predicted via
a pre-trained LiDAR 3D
detector (supervised mode) or
obtained directly from the point
cloud without training
(unsupervised mode).
Lidar Point Cloud Guided Monocular 3D
Object Detection
• To take advantage of available 3D box annotations, first train a LiDAR-based 3D
detector from scratch with LiDAR point clouds and associated 3D box annotations.
• The pre-trained LiDAR-based 3D detector is then utilized to infer 3D boxes on new
LiDAR point clouds.
• Such results are treated as pseudo labels to train monocular 3D detectors.
• Due to the precise depth measurement, pseudo labels predicted by the LiDAR-
based 3D detector are considerably accurate and can be used directly in the
training of monocular 3D detectors.
• Interestingly, with different training settings for the LiDAR-based 3D detector,
the monocular 3D detectors guided by them show similar performance.
• This indicates that monocular methods can indeed benefit from the guidance of
LiDAR point clouds, and that only a small number of 3D box annotations is
sufficient to push the monocular method to high performance.
• Thus the manual annotation cost can also be greatly reduced.
Lidar Point Cloud Guided Monocular 3D
Object Detection
• The process of generating pseudo labels on the LiDAR point cloud can be roughly divided
into three steps: 2D box and mask prediction, RoI point selection and clustering, and 3D
box estimation.
• In the beginning, an off-the-shelf 2D instance segmentation model is adopted to perform
segmentation on the RGB image, obtaining 2D box and mask estimates.
• These estimates are used for building camera frustums in order to select associated LiDAR RoI
points for every object, where those boxes without any LiDAR point inside are ignored.
• To eliminate irrelevant points, an unsupervised clustering approach, i.e.,
DBSCAN, is used to divide the RoI point cloud into different groups according to density
(see the sketch after this list).
• Points that are close in 3D space are aggregated into a cluster.
• The cluster containing the most points is then regarded as the target corresponding to the object.
• Finally, the minimum 3D bounding box that covers all target points is sought.
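A hedged sketch of the frustum-point clustering step, using sklearn's DBSCAN; the eps/min_samples values are illustrative assumptions, and the selection of RoI points inside the 2D-box frustum is assumed to have happened upstream.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_target_points(roi_points: np.ndarray, eps: float = 0.5, min_samples: int = 5):
    """roi_points: (N, 3) LiDAR points falling inside one object's 2D-box frustum.
    Returns the points of the densest cluster (assumed to belong to the object),
    or None if no valid cluster is found."""
    if len(roi_points) < min_samples:
        return None
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(roi_points)
    valid = labels[labels >= 0]                      # -1 marks noise points
    if valid.size == 0:
        return None
    target_label = np.bincount(valid).argmax()       # cluster with the most points
    return roi_points[labels == target_label]
```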
Lidar Point Cloud Guided Monocular 3D
Object Detection
• To simplify the problem of solving the 3D bounding box, project points onto the bird’s-eye-
view map, reducing parameters since the height and y of the object can be easily obtained.
• The convex hull of the object points is computed, and the box is then obtained using rotating
calipers (see the sketch after this list).
• Specifically, the edges of the convex hull are enumerated to produce enclosing rectangles, and the
rectangle with the smallest area is chosen as the resulting BEV box (parameterized by the box
center (x, z), the dimensions (w, l), and the orientation Ry).
• Other parameters of the 3D box can be calculated from statistics on remaining points.
• The height h can be represented by the max spatial offset along the y-axis of points, and the
center coordinate y is calculated by averaging y coordinates of points.
• Consequently, the complete 3D box is generated.
• Pseudo labels can fail when there are not enough LiDAR points to describe the outline of an object.
• Object dimensions are therefore restricted to eliminate boxes that are likely to be outliers,
since the dimensions of most valid objects are close to each other.
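A hedged sketch of the BEV box fitting just described (rotating-calipers-style minimum-area rectangle over the convex hull, plus height statistics along the y-axis); this is a simplified standalone implementation, not the authors' code, and the (w, l) assignment convention is a choice.

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_bev_box(points_xyz: np.ndarray):
    """points_xyz: (N, 3) target points of one object in camera coordinates (x, y, z).
    Returns (x, z, w, l, ry, h, y_center) of a BEV-fitted 3D box."""
    xz = points_xyz[:, [0, 2]]
    hull = xz[ConvexHull(xz).vertices]                       # (M, 2) hull vertices
    best = None
    for i in range(len(hull)):                               # enumerate hull edges
        edge = hull[(i + 1) % len(hull)] - hull[i]
        ry = np.arctan2(edge[1], edge[0])
        c, s = np.cos(ry), np.sin(ry)
        rot = np.array([[c, s], [-s, c]])                    # rotate this edge onto the x-axis
        proj = xz @ rot.T
        mins, maxs = proj.min(axis=0), proj.max(axis=0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            center = rot.T @ ((mins + maxs) / 2.0)
            best = (area, center, maxs - mins, ry)
    _, (x, z), (l, w), ry = best
    # Height and vertical center from statistics of the remaining points (y-axis)
    y_min, y_max = points_xyz[:, 1].min(), points_xyz[:, 1].max()
    h, y_center = y_max - y_min, (y_min + y_max) / 2.0
    return x, z, w, l, ry, h, y_center
```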
Lidar Point Cloud Guided Monocular 3D
Object Detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
3-d interpretation from single 2-d image IV

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Depth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep LearningDepth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep Learning
 
Pose estimation from RGB images by deep learning
Pose estimation from RGB images by deep learningPose estimation from RGB images by deep learning
Pose estimation from RGB images by deep learning
 
Depth Fusion from RGB and Depth Sensors III
Depth Fusion from RGB and Depth Sensors  IIIDepth Fusion from RGB and Depth Sensors  III
Depth Fusion from RGB and Depth Sensors III
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
Deep vo and slam iii
Deep vo and slam iiiDeep vo and slam iii
Deep vo and slam iii
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
3-d interpretation from single 2-d image for autonomous driving
3-d interpretation from single 2-d image for autonomous driving3-d interpretation from single 2-d image for autonomous driving
3-d interpretation from single 2-d image for autonomous driving
 
Deep vo and slam ii
Deep vo and slam iiDeep vo and slam ii
Deep vo and slam ii
 
camera-based Lane detection by deep learning
camera-based Lane detection by deep learningcamera-based Lane detection by deep learning
camera-based Lane detection by deep learning
 
Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
BEV Semantic Segmentation
BEV Semantic SegmentationBEV Semantic Segmentation
BEV Semantic Segmentation
 
Driving behaviors for adas and autonomous driving XII
Driving behaviors for adas and autonomous driving XIIDriving behaviors for adas and autonomous driving XII
Driving behaviors for adas and autonomous driving XII
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
Survey 1 (project overview)
Survey 1 (project overview)Survey 1 (project overview)
Survey 1 (project overview)
 
Driving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xivDriving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xiv
 

Ähnlich wie 3-d interpretation from single 2-d image IV

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
Prathamesh Joshi
 

Ähnlich wie 3-d interpretation from single 2-d image IV (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
Understanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdfUnderstanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdf
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
3d object detection and recognition : a review
3d object detection and recognition : a review3d object detection and recognition : a review
3d object detection and recognition : a review
 
Goal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D cameraGoal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D camera
 
Simulation of collision avoidance by navigation
Simulation of collision avoidance by navigationSimulation of collision avoidance by navigation
Simulation of collision avoidance by navigation
 
pydataPointCloud.pptx
pydataPointCloud.pptxpydataPointCloud.pptx
pydataPointCloud.pptx
 
On constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized ImagesOn constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized Images
 
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro..."High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Nadia2013 research
Nadia2013 researchNadia2013 research
Nadia2013 research
 
Detection of a user-defined object in an image using feature extraction- Trai...
Detection of a user-defined object in an image using feature extraction- Trai...Detection of a user-defined object in an image using feature extraction- Trai...
Detection of a user-defined object in an image using feature extraction- Trai...
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 

Mehr von Yu Huang

Mehr von Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucks
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 

3-d interpretation from single 2-d image IV

  • 1. 3D Interpretation from Single 2D Image for Autonomous Driving IV Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Ground-aware Monocular 3D Object Detection for Autonomous Driving • Categorical Depth Distribution Network for Monocular 3D Object Detection • Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection • Geometry-based Distance Decomposition for Monocular 3D Object Detection • Geometry-aware data augmentation for monocular 3D object detection • Lidar Point Cloud Guided Monocular 3D Object Detection
  • 3. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • Pseudo-LiDAR-based methods for monocular 3D object detection have generated large attention in the community. • This generated a distorted impression about the superiority of Pseudo-LiDAR approaches against methods working with RGB-images only. • The 1st contribution is analysing and showing experimentally that the validation results published by Pseudo-LiDAR-based methods are substantially biased. • The source of the bias resides in an overlap between the KITTI3D object detection validation set and the training/validation sets used to train depth predictors feeding Pseudo-LiDAR- based methods. • Surprisingly, the bias remains also after geographically removing the overlap, revealing the presence of a more structured contamination. • This leaves the test set as the only reliable mean of comparison, where published Pseudo- LiDAR-based methods do not excel. • The second contribution brings Pseudo-LiDAR based methods back up in the ranking with the introduction of a 3D confidence prediction module.
  • 4. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection It analyze the cause of the performance bias of monocular Pseudo-LiDAR-based (PL) methods, which consists in a substantial drop between the results on the KITTI3D validation and test set. It show that this bias is due to the fact that the depth estimators on which PL methods heavily rely have been trained on a depth training set (black lines) which includes 30% of the detection validation set data (red lines). It propose to solve this bias by creating an alternative unbiased depth training set (green lines) which eliminates the overlap as well as introduces a geographical distance w.r.t. detection validation data.
  • 5. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection The table shows that certain sub-tasks like rotation (R) and shape (W;H;L) prediction, despite the substitution with ground-truth values, do not significantly improve performance. In contrast, substituting the predicted depth estimation (Z) with ground truth improves substantially, meaning that depth is by-far the most crucial component for 3D object detection.
  • 6. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection BTS (“From big to small: Multi-scale local planar guidance for monocular depth estimation”)
  • 7. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • The 3D object detection task, requires to associate each object with a 3D bounding box and a corresponding confidence value. • This confidence should generally reflect the quality of the 3D bounding box and can be thought as a measure of how much the particular estimate is reliable. • The existing Pseudo-LiDAR methods do not perform the 3D confidence estimation in any way and rely on the class probability coming along with the 2D detections. • By doing so, the confidence adopted by current PL-based methods is actually agnostic to the quality of the 3D predictions and therefore not effective for the role it should take. • 2D detectors are often too confident and the need for a 3D confidence seems essential. • Propose to do endow PL-based methods with the ability of estimating the 3D confidence of their predictions. • This architecture can be divided into three main branches namely 2D Detection, Pseudo- LiDAR and 3D Detection.
  • 8. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection Architecture of a Pseudo-LiDAR-based method integrating the 3D confidence component. 1."Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving." CVPR 2019. 2. PatchNet: "Rethinking pseudo-lidar representation" ECCV'2020
  • 9. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection Example of final part of the architecture, where adding the Confidence Head to the PatchNet architecture. The Confidence Branch requires minimal modifications to the original architecture, adds negligible computational complexity and inference time and is compatible with most Pseudo-LiDAR approaches.
  • 10. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
  • 11. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
  • 12. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Starting from a synthetic dataset, pre-train an RGB-to-Depth Auto-Encoder (AE). • The embedding learnt from this AE is then used to train a 3D Object Detector (3DOD) CNN which is used to regress the parameters of 3D object poses after the encoder from the AE generates a latent embedding from the RGB image. • It pre-train the AE using paired RGB and depth images from simulation data once and subsequently only train the 3DOD network using real data, comprising of RGB images and 3D object pose labels (without the requirement of dense depth). • The 3DOD network utilizes a particular‘cubification’ of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. • A method for 3D object detection using a single monocular image, CubifAE-3D, including AE pre-training+ dividing 3D space around the camera into cuboids. • The first part refers to cubification/voxellization of mono camera space as a pre- processing step, and AE refers to the Auto- Encoding of RGB-to-depth space.
  • 13. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection CubifAE-3D high-level architecture
  • 14. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Figure (the next page): CubifAE-3D architecture. • The RGB-to-depth auto-encoder is first trained in a supervised way with a combination of MSE and Edge-Aware Smoothing Loss. • Once trained, the decoder is detached, encoder weights are frozen, and the encoder output is fed to the 3DOD model, which is trained with a combination of xyzloss, whlloss, orientationloss, iouloss, and confloss. • A 2D bounding-box is obtained for each object by projecting its detected 3D bounding-box onto the camera image plane, cropped, and resized to 64x64 and fed to the classifier model (bottom branch) along with the normalized whl vector for class prediction. • The dimensions indicated correspond to the output tensor for each block. • Also replace the encoder head by a pretrained backbone network (VGG-16) and observe an improved performance.
  • 15. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Detailed model architecture of CubifAE-3D
  • 16. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection The RGB-to-depth auto-encoder The total loss function for this model
  • 17. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Monocular RGB to Depth Map prediction
  • 18. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Cubification of the camera space: The perception RoI is divided into a 4x4xM grid (x and y directions aligned with image plane, where each grid has stacked on it, M cuboids in the z direction). Each cuboid is responsible for predicting up to N object poses. The object coordinates and dimensions are then normalized 0-1 in accordance with a prior that is computed from data statistics.
  • 19. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Samples of qualitative results on the KITTI dataset. The top part of each image shows a bounding box obtained as a 2D projection of their 3D poses. The bottom part shows a birds-eye view of the object poses with the ego-vehicle positioned at the center of red circle drawn on the left; pointing towards the right of the image.
  • 20. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Qualitative results on the KITTI dataset
• 21. Ground-aware Monocular 3D Object Detection for Autonomous Driving
• Most existing algorithms are based on geometric constraints in the 2D-3D correspondence, which stem from generic 6D object pose estimation.
• First identify how the ground plane provides additional clues for depth reasoning in 3D detection in driving scenes.
• Based on this observation, improve the processing of 3D anchors and introduce a neural network module that fully utilizes such application-specific priors in the deep learning framework.
• Introduce a neural network embedded with the proposed module for 3D object detection.
• Further verify the power of the proposed module with a neural network designed for monocular depth prediction.
• https://www.github.com/Owen-Liuyuxuan/visualDet3D
  • 22. Ground-aware Monocular 3D Object Detection for Autonomous Driving Perspective geometry for the GAC module. When calculating the vertical offsets, assume pixels are foreground object centers. When computing the depth priors z, assume pixels are on the ground because they are features to be queried.
• 23. Ground-aware Monocular 3D Object Detection for Autonomous Driving Equations (figure): the vertical offset, the inverse depth, and the depth, illustrating the relation between depth and height.
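The geometric relation behind the ground-aware prior can be sketched as follows: for a pixel assumed to lie on the ground plane, the depth follows from the camera mounting height and the pixel's vertical offset below the horizon. The 1.65 m camera height and the KITTI-like intrinsics in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ground_depth_prior(v, fy, cy, cam_height=1.65):
    """Depth prior for pixels assumed to lie on the ground plane:
    z ~ fy * h_cam / (v - cy).  cam_height=1.65 m is a typical KITTI
    mounting height, used here only for illustration."""
    v = np.asarray(v, dtype=np.float64)
    dv = v - cy
    # pixels at or above the horizon (dv <= 0) get no finite prior
    return np.where(dv > 1e-3, fy * cam_height / np.maximum(dv, 1e-3), np.inf)

# Example with KITTI-like intrinsics (fy ~ 721, cy ~ 173)
print(ground_depth_prior([200, 250, 350], fy=721.5, cy=172.8))  # ~44, ~15, ~7 m
```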
• 24. Ground-aware Monocular 3D Object Detection for Autonomous Driving Ground-Aware Convolution (GAC) Module
  • 25. Ground-aware Monocular 3D Object Detection for Autonomous Driving Object detection
  • 26. Ground-aware Monocular 3D Object Detection for Autonomous Driving
• 27. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with a simple configuration compared to typical multi-sensor systems.
• The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurements.
• Many methods attempt to directly estimate depth to assist 3D detection, but show limited performance as a result of depth inaccuracy.
• This solution, the Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space.
• It then uses a computationally efficient bird's-eye-view (BEV) projection and a single-stage detector to produce the final output detections.
• CaDDN is a fully differentiable end-to-end method for joint depth estimation and object detection.
• https://github.com/TRAILab/CaDDN
  • 28. Categorical Depth Distribution Network for Monocular 3D Object Detection (a) Input image. (b) Without depth distribution supervision, BEV features from CaDDN suffer from smearing effects. (c) Depth distribution supervision encourages BEV features from CaDDN to encode meaningful depth confidence, in which objects can be accurately detected.
• 29. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Direct methods estimate 3D detections directly from images without predicting an intermediate 3D scene representation; they can incorporate the geometric relationship between the 2D image plane and 3D space to assist with detections.
• Depth-based methods perform the 3D detection task using pixel-wise depth maps as an additional input, where the depth maps are precomputed using monocular depth estimation architectures; estimated depth maps can be used in combination with images to perform the 3D detection task.
• Grid-based methods avoid estimating raw depth values by predicting a BEV grid representation to be used as input for 3D detection architectures; multiple voxels can be projected to the same image feature, leading to repeated features along the projection ray and reduced detection accuracy.
  • 30. Categorical Depth Distribution Network for Monocular 3D Object Detection CaDDN Architecture. The network is composed of 3 modules to generate 3D feature representations and one to perform 3D detection. Frustum features G are generated using depth distributions D, transformed into voxel features V. The voxel features are collapsed to BEV features B for 3D object detection.
• 31. Categorical Depth Distribution Network for Monocular 3D Object Detection
• The purpose of the frustum feature network is to project image information into 3D space by associating image features with estimated depths.
• It follows the design of the semantic segmentation network DeepLabV3 to estimate the categorical depth distributions from image features (Depth Distribution Network), modifying the downsample-upsample architecture to produce pixel-wise probability scores of belonging to depth bins rather than semantic classes.
• In parallel to estimating depth distributions, channel reduction (Image Channel Reduce) is performed on the image features to generate the final image features, using a 1x1 convolution + BatchNorm + ReLU layer.
• Channel reduction is required to reduce the high memory footprint of the ResNet-101 features that will be populated into the 3D frustum grid.
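A minimal sketch of such a channel-reduce block in PyTorch; the 2048-to-64 channel counts are assumptions for illustration, not necessarily CaDDN's exact configuration.

```python
import torch.nn as nn

# Image Channel Reduce: 1x1 convolution + BatchNorm + ReLU
channel_reduce = nn.Sequential(
    nn.Conv2d(2048, 64, kernel_size=1, bias=False),  # assumed channel counts
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```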
  • 32. Categorical Depth Distribution Network for Monocular 3D Object Detection Each feature pixel F(u; v) is weighted by its depth distribution probabilities D(u; v) of belonging to D discrete depth bins to generate frustum features G(u; v). Sampling points in each voxel are projected into the frustum grid. Frustum features are sampled using trilinear interpolation to populate voxels in V.
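The frustum feature generation amounts to an outer product between the per-pixel categorical depth distribution and the image feature vector; a sketch is given below (the tensor shapes in the example are illustrative only).

```python
import torch

def frustum_features(image_feat, depth_dist):
    """Weight each image feature F(u, v) by its depth probabilities D(u, v).
    image_feat: (B, C, H, W); depth_dist: (B, D, H, W), softmaxed over D.
    Returns the frustum grid G of shape (B, C, D, H, W)."""
    return image_feat.unsqueeze(2) * depth_dist.unsqueeze(1)

B, C, D, H, W = 2, 64, 80, 24, 78
G = frustum_features(torch.rand(B, C, H, W),
                     torch.softmax(torch.rand(B, D, H, W), dim=1))
print(G.shape)  # torch.Size([2, 64, 80, 24, 78])
```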
• 33. Categorical Depth Distribution Network for Monocular 3D Object Detection The continuous depth space is discretized in order to define the set of D bins used in the depth distributions D. Depth discretization can be performed with uniform discretization (UD) with a fixed bin size, spacing-increasing discretization (SID) with bin sizes increasing in log space, or linear-increasing discretization (LID) with linearly increasing bin sizes.
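The three schemes can be sketched by their bin edges as below; the formulas follow the standard UD/SID/LID definitions, and the 2.0-46.8 m range with D=80 in the example is an illustrative KITTI-like setting.

```python
import numpy as np

def depth_bin_edges(d_min, d_max, D, mode="LID"):
    """Return the D+1 bin edges for the chosen discretization scheme."""
    i = np.arange(D + 1, dtype=np.float64)
    if mode == "UD":                      # uniform bin size
        return d_min + (d_max - d_min) * i / D
    if mode == "SID":                     # bin size grows in log space
        return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / D)
    if mode == "LID":                     # bin size grows linearly
        delta = 2.0 * (d_max - d_min) / (D * (D + 1))
        return d_min + delta * i * (i + 1) / 2.0
    raise ValueError(mode)

edges = depth_bin_edges(2.0, 46.8, D=80, mode="LID")
print(edges[:3], edges[-1])   # small near bins; last edge equals d_max
```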
• 34. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Depth distribution labels are applied to supervise the predicted depth distributions.
• Depth distribution labels are generated by projecting LiDAR point clouds into the image frame to create sparse depth maps.
• Depth completion is performed to generate depth values at each pixel in the image.
• Depth information is required at each image feature pixel, so the depth maps of size WI x HI are downsampled to the image feature size WF x HF.
• The depth maps are converted to bin indices using the LID discretization method, followed by a conversion into a one-hot encoding to generate the depth distribution labels.
• A one-hot encoding ensures the depth distribution labels are sharp, which is essential to encourage sharpness in the predicted depth distributions via supervision.
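The label conversion step can be sketched as follows: each depth value is digitized into its LID bin and then one-hot encoded. The depth range and bin count are again illustrative.

```python
import numpy as np

def lid_edges(d_min, d_max, D):
    i = np.arange(D + 1, dtype=np.float64)
    delta = 2.0 * (d_max - d_min) / (D * (D + 1))
    return d_min + delta * i * (i + 1) / 2.0

def depth_to_onehot(depth_map, edges):
    """Convert an (H, W) depth map into one-hot labels of shape (D, H, W)."""
    D = len(edges) - 1
    bins = np.clip(np.digitize(depth_map, edges) - 1, 0, D - 1)
    onehot = np.zeros((D,) + depth_map.shape, dtype=np.float32)
    rows, cols = np.indices(depth_map.shape)
    onehot[bins, rows, cols] = 1.0
    return onehot

labels = depth_to_onehot(np.random.uniform(2.0, 46.8, (24, 78)),
                         lid_edges(2.0, 46.8, 80))
print(labels.shape)   # (80, 24, 78), exactly one hot bin per pixel
```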
  • 35. Categorical Depth Distribution Network for Monocular 3D Object Detection
  • 36. Categorical Depth Distribution Network for Monocular 3D Object Detection
  • 37. Categorical Depth Distribution Network for Monocular 3D Object Detection
• 38. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• The objective is to learn context- and depth-aware feature representations to solve the problem of monocular 3D object detection.
• (i) propose a depth-conditioned dynamic message propagation (DDMP) network to effectively integrate multi-scale depth information with the image context;
• (ii) this is achieved by first adaptively sampling context-aware nodes in the image context and then dynamically predicting hybrid depth-dependent filter weights and affinity matrices for propagating information;
• (iii) by augmenting a center-aware depth encoding (CDE) task, the inaccurate depth prior is alleviated;
• (iv) thoroughly demonstrate the effectiveness and show state-of-the-art results among monocular-based approaches on the KITTI benchmark dataset.
• Code and models are released at https://github.com/fudan-zvg/DDMP
• 39. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Left: DDMP adaptively samples context-aware nodes (top) in the image context and dynamically predicts hybrid depth-dependent filter weights and affinity matrices (bottom) for propagating information. Right: the improvement of DDMP-3D (red) over the baseline (yellow) via center-aware depth encoding.
• 40. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• Two branches are involved: a 3D detection branch (blue) and a depth feature extraction branch (green).
• The RGB images are fed into the upper branch for feature extraction, while the corresponding depth maps, estimated by an off-the-shelf depth estimator, are sent into the depth branch for extracting depth-aware features.
• The DDMP (dynamic message propagation) modules in yellow perform the depth-conditioned dynamic message propagation: they dynamically sample context-aware nodes in the upper image branch and predict hybrid filter weights and affinities based on multi-scale depth features from the bottom branch for message propagation.
• Common 3D heads for 3D center, dimension, and orientation regression follow, producing the final 3D object boxes.
• CDE (center-aware depth feature encoding) is an auxiliary task for joint training that implicitly guides the depth sub-network to learn center-aware depth features for better object localization; it is discarded during inference.
  • 41. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Schematic illustration of DDMP-3D
• 42. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Illustration of the DDMP module in a single-scale pattern. Dynamic nodes are first sampled from the image and depth feature graphs; for these sampled nodes, the filter weights and affinity matrices are learned from depth features to propagate the depth-conditioned message.
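A much-simplified sketch of the underlying idea of depth-conditioned dynamic filtering is shown below: depth features predict per-pixel weights that are applied over a fixed 3x3 neighborhood of the image features. This replaces DDMP's adaptive node sampling and hybrid affinity matrices with a plain local neighborhood, so it only illustrates the concept, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthConditionedDynamicFilter(nn.Module):
    """Depth features predict a per-pixel 3x3 kernel applied to image features
    (simplified stand-in for DDMP; assumes both branches share channel count)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.weight_pred = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, img_feat, depth_feat):
        b, c, h, w = img_feat.shape
        weights = torch.softmax(self.weight_pred(depth_feat), dim=1)   # (B, k*k, H, W)
        patches = F.unfold(img_feat, self.k, padding=self.k // 2)      # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        weights = weights.view(b, 1, self.k * self.k, h * w)
        out = (patches * weights).sum(dim=2).view(b, c, h, w)
        return img_feat + out                                          # residual message

m = DepthConditionedDynamicFilter(channels=64)
print(m(torch.rand(2, 64, 32, 96), torch.rand(2, 64, 32, 96)).shape)  # (2, 64, 32, 96)
```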
• 43. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• In general, depth maps lose appearance details and may fail to discriminate between foreground instances and the background, providing an unreliable depth prior for depth-assisted 3D object detection.
• It has been shown that a multi-task strategy can boost each individual task to some degree, benefiting from the multi-fold regularization effect of joint optimization.
• An auxiliary task is therefore augmented and jointly optimized with the main 3D detection task.
• The augmented task with xyz supervision in 3D space uniquely determines a point in the 2D image plane, which imposes spatial constraints to gain 3D instance-level understanding.
• With the better instance awareness brought by CDE, the model is able to alleviate the inaccurate depth prior in situations such as occlusion and distant objects.
• A similar network architecture is adopted for the depth branch, with a head that only predicts 3D centers without predefined anchors.
  • 44. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
  • 45. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
  • 46. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Qualitative comparison of ground truth (green), the baseline (yellow), and our method (red) on KITTI val set. For better visualization, the first and second columns show RGB and BEV images of point clouds converted from pre-estimated depth, respectively.
• 47. Geometry-based Distance Decomposition for Monocular 3D Object Detection
• The core challenge of monocular 3D object detection is to predict the distance of objects in the absence of explicit depth information.
• Unlike most existing methods, which regress the distance as a single variable, MonoRCNN uses a geometry-based distance decomposition to recover the distance from its factors.
• The decomposition factors the distance of an object into its most representative and stable variables, i.e., the physical height and the projected visual height in the image plane.
• The decomposition maintains self-consistency between the two heights, leading to robust distance prediction even when both predicted heights are inaccurate.
• The decomposition also enables tracing the cause of distance uncertainty in different scenarios.
• Such a decomposition makes the distance prediction interpretable, accurate, and robust.
• MonoRCNN directly predicts 3D bounding boxes from RGB images with a compact architecture, making training and inference simple and efficient.
• 48. Geometry-based Distance Decomposition for Monocular 3D Object Detection
The distance decomposition is based on the imaging geometry of a pinhole camera. The distance from the center of an object to the camera, denoted Z, can be calculated as Z = fH/h, where f denotes the focal length of the camera, H the physical height of the object, and h the length of the projected central line (PCL). The PCL is the projection of the vertical line at the center of the 3D bounding box. This equation shows that the distance of an object is determined by its physical height and its projected visual height. Note: objects are abstracted as the vertical lines at the centers of their 3D bounding boxes, and their visual projection as the projection of these vertical lines; the distance is then recovered from them based on the imaging geometry.
• 49. Geometry-based Distance Decomposition for Monocular 3D Object Detection
The architecture of MonoRCNN. It is built upon Faster R-CNN and adds a carefully designed 3D distance head based on the geometry-based distance decomposition. Specifically, the method regresses the physical height H, the reciprocal of the projected visual height h_rec = 1/h, and their uncertainties, then recovers the distance as Z = f * H * h_rec. Blue arrows represent operations in the network during training and inference; orange arrows represent operations to recover 3D bounding boxes during inference.
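The recovery step itself is a one-liner; the sketch below shows it with KITTI-like numbers (focal length of roughly 721 px and a 1.5 m tall car whose PCL spans 36 px), which are illustrative values only.

```python
def recover_distance(f_y, H, h_rec):
    """MonoRCNN distance recovery: Z = f * H * h_rec, with H the predicted
    physical height and h_rec = 1/h the predicted reciprocal PCL length."""
    return f_y * H * h_rec

print(recover_distance(721.5, 1.5, 1.0 / 36.0))  # ~30.1 m
```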
• 50. Geometry-based Distance Decomposition for Monocular 3D Object Detection Comparison between the predicted eight projected corners (red boxes) and the predicted visual height (blue lines). Predicting the eight projected corners fails under challenging cases such as occlusion, truncation, and extreme lighting conditions, while predicting the visual height is simpler and more robust. The images are from the KITTI validation split.
• 51. Geometry-based Distance Decomposition for Monocular 3D Object Detection Equations (figure): the uncertainty-aware regression of the 3D distance head, the 3D attribute head (keypoint loss, physical size and yaw angle), the loss functions for H and h_rec, and the overall loss.
  • 52. Geometry-based Distance Decomposition for Monocular 3D Object Detection
• 53. Geometry-based Distance Decomposition for Monocular 3D Object Detection KITTI examples and nuScenes cross-test examples.
• 54. Geometry-aware data augmentation for monocular 3D object detection
• This work first conducts a thorough analysis to reveal how existing methods fail to robustly estimate depth when different geometric shifts occur.
• Through image-based and instance-based manipulations, it illustrates that such methods are vulnerable at capturing consistent relationships between depth and both object apparent sizes and positions.
• These manipulations are converted into four corresponding 3D-aware data augmentation techniques.
• At the image level, the camera system is randomly manipulated, including its focal length, receptive field, and location, to generate new training images with geometric shifts.
• At the instance level, foreground objects are cropped and randomly pasted into other scenes to generate new training instances.
• All the proposed augmentation techniques share the virtue that the geometric relationships of objects are preserved while their geometry is manipulated.
• Not only is the instability of depth recovery effectively alleviated, but the final 3D detection performance is also significantly improved.
  • 55. Geometry-aware data augmentation for monocular 3D object detection Geometric manipulations
• 56. Geometry-aware data augmentation for monocular 3D object detection Equations (figure): projection of a 3D point to image coordinates, using the focal length to infer depth, using the vertical position to infer depth, shifting the camera focal length, manipulating the camera receptive field, and moving the camera.
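A hedged sketch of the focal-length-shift idea follows: magnifying the image about the principal point is equivalent to scaling the focal length, and under a fixed-intrinsics detector the same image is consistent with objects whose depth is divided by the scale while their physical size and lateral position stay unchanged. The centered crop (which assumes the principal point is near the image center) and the depth-only label update are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
import cv2

def focal_shift_augment(image, boxes_3d, scale):
    """Zoom-in focal-length augmentation (scale >= 1 assumed).
    boxes_3d: (N, 7) array [x, y, z, w, h, l, ry] in camera coordinates."""
    h, w = image.shape[:2]
    resized = cv2.resize(image, None, fx=scale, fy=scale)
    # crop back to the original resolution about the image center (approximation)
    y0 = max((resized.shape[0] - h) // 2, 0)
    x0 = max((resized.shape[1] - w) // 2, 0)
    out = resized[y0:y0 + h, x0:x0 + w]
    boxes = np.asarray(boxes_3d, dtype=np.float64).copy()
    boxes[:, 2] /= scale        # only the depth labels rescale: z' = z / scale
    return out, boxes
```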
  • 57. Geometry-aware data augmentation for monocular 3D object detection Visualization of the geometric relationships between depth and both object apparent sizes and positions.
• 58. Geometry-aware data augmentation for monocular 3D object detection Equations (figure): the rotation matrix R from the egocentric orientation angle, the 8 corner points in the object coordinate frame, and the camera coordinates of each point.
• 59. Geometry-aware data augmentation for monocular 3D object detection Empirical analysis of monocular detectors under geometric manipulations. Both anchor-free (e.g., CenterNet) and anchor-based (e.g., M3D-RPN) models are tried.
• 60. Geometry-aware data augmentation for monocular 3D object detection Visualization of copy-paste data augmentation with and without geometry awareness.
  • 61. Geometry-aware data augmentation for monocular 3D object detection
  • 62. Geometry-aware data augmentation for monocular 3D object detection
  • 63. Geometry-aware data augmentation for monocular 3D object detection
• 64. Lidar Point Cloud Guided Monocular 3D Object Detection
• LiDAR point clouds, which provide accurate depth measurements, can offer beneficial information for the training of monocular methods.
• Prior works only use LiDAR point clouds to train a depth estimator; this implicit use does not fully exploit LiDAR point clouds and consequently leads to suboptimal performance.
• To take effective advantage of LiDAR point clouds, a general, simple yet effective framework for monocular methods is proposed.
• Specifically, LiDAR point clouds are used to directly guide the training of monocular 3D detectors, allowing them to learn the desired objectives while eliminating the extra annotation cost.
• Thanks to the general design, this method can be plugged into any monocular 3D detection method, significantly boosting its performance.
• 65. Lidar Point Cloud Guided Monocular 3D Object Detection LiDAR-guided monocular 3D object detection: LiDAR point clouds are directly used to guide the training of the monocular 3D detector.
• 66. Lidar Point Cloud Guided Monocular 3D Object Detection Qualitative examples of pseudo-LiDAR based monocular 3D detection. From top to bottom: the RGB image and the 3D predictions on the bird's-eye-view (BEV) map. The estimated 3D box center typically lies near the converted point cloud, meaning that the detector works well when the provided depths are accurate but has difficulty revising poorly predicted object depth.
• 67. Lidar Point Cloud Guided Monocular 3D Object Detection
• Training a depth-map-based method roughly comprises two stages: (1) training a dense depth estimation network; (2) training a monocular 3D detector.
• As a common practice, current monocular depth-map-based methods all utilize projected LiDAR point clouds as ground truth to train the depth estimator.
• The number of LiDAR point clouds used for training heavily affects the final 3D detection accuracy.
• Here, LiDAR point clouds are directly utilized to generate massive pseudo 3D box labels for monocular methods.
• This simple yet effective approach allows monocular 3D detectors to learn the desired objectives while eliminating the extra annotation cost.
• It can work in either a supervised or an unsupervised mode, depending on the reliance on manual 3D box annotations.
• 68. Lidar Point Cloud Guided Monocular 3D Object Detection 3D boxes are generated from LiDAR point clouds to train the monocular 3D detector. Such 3D boxes are predicted by a pre-trained LiDAR 3D detector (supervised mode) or obtained directly from the point cloud without training (unsupervised mode).
• 69. Lidar Point Cloud Guided Monocular 3D Object Detection
• To take advantage of available 3D box annotations, a LiDAR-based 3D detector is first trained from scratch with LiDAR point clouds and the associated 3D box annotations.
• The pre-trained LiDAR-based 3D detector is then used to infer 3D boxes on new LiDAR point clouds.
• These results are treated as pseudo labels to train monocular 3D detectors.
• Due to the precise depth measurement, pseudo labels predicted by the LiDAR-based 3D detector are considerably accurate and qualified to be used directly in the training of monocular 3D detectors.
• Interestingly, with different training settings for the LiDAR-based 3D detector, the monocular 3D detectors guided by them show similar performance.
• This indicates that monocular methods can indeed benefit from the guidance of LiDAR point clouds, and that only a small number of 3D box annotations is sufficient to push a monocular method to high performance.
• Thus the manual annotation cost can also be greatly reduced.
• 70. Lidar Point Cloud Guided Monocular 3D Object Detection
• The process of generating pseudo labels on the LiDAR point cloud can be roughly divided into three steps: 2D box and mask prediction, RoI point selection and clustering, and 3D box estimation.
• In the beginning, an off-the-shelf 2D instance segmentation model is adopted to perform segmentation on the RGB image, obtaining 2D box and mask estimates.
• These estimates are used to build camera frustums in order to select the associated LiDAR RoI points for every object; boxes without any LiDAR point inside are ignored.
• To eliminate irrelevant points, the unsupervised clustering approach DBSCAN is used to divide the RoI point cloud into different groups according to density (see the sketch below).
• Points that are close in 3D space are aggregated into a cluster.
• The cluster containing the most points is regarded as the target corresponding to the object.
• Finally, the minimum 3D bounding box that covers all target points is sought.
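A minimal sketch of the RoI point selection and clustering step, assuming the LiDAR points have already been transformed into camera coordinates; the DBSCAN eps/min_samples values are illustrative, not the paper's tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_object_points(points, mask, K):
    """Keep LiDAR points (camera coords, N x 3) that project inside the
    instance mask, then keep the largest DBSCAN cluster as the object."""
    uvw = K @ points.T
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    h, w = mask.shape
    inside = (points[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    ui, vi = u[inside].astype(int), v[inside].astype(int)
    roi = points[inside][mask[vi, ui] > 0]          # points hitting the mask
    if len(roi) == 0:
        return None
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(roi)
    valid = labels[labels >= 0]
    if len(valid) == 0:
        return roi                                  # fall back: keep all points
    target = np.bincount(valid).argmax()            # largest cluster = object
    return roi[labels == target]
```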
• 71. Lidar Point Cloud Guided Monocular 3D Object Detection
• To simplify the problem of solving the 3D bounding box, the points are projected onto the bird's-eye-view map, reducing the number of parameters since the height and the y coordinate of the object can be obtained easily.
• The convex hull of the object points is computed, and the box is obtained using rotating calipers.
• Specifically, the edges of the convex hull are enumerated to produce enclosing rectangles, of which the rectangle with the smallest area is chosen as the resulting BEV box (parameterized by the box center (x, z), the dimensions (w, l), and the orientation Ry).
• The other parameters of the 3D box can be calculated from statistics on the remaining points.
• The height h can be represented by the maximum spatial offset along the y-axis of the points, and the center coordinate y is calculated by averaging the y coordinates of the points.
• Consequently, the complete 3D box is generated.
• Pseudo labels can fail when there are not enough LiDAR points to describe the outline of an object.
• Object dimensions are therefore restricted to eliminate boxes that are likely to be outliers, as most valid objects have similar dimensions.
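The BEV box fit can be sketched by enumerating convex-hull edges and keeping the minimum-area aligned rectangle; the height and y center from the point statistics are omitted here, and the returned parameterization is an assumed KITTI-like convention.

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_bev_box(points_xz):
    """Fit the minimum-area rectangle to BEV points (N x 2, columns x and z).
    Returns (center_x, center_z, w, l, ry)."""
    pts = np.asarray(points_xz, dtype=np.float64)
    hull = pts[ConvexHull(pts).vertices]
    best = None
    for i in range(len(hull)):
        edge = hull[(i + 1) % len(hull)] - hull[i]
        theta = np.arctan2(edge[1], edge[0])
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, s], [-s, c]])          # rotate the edge onto the x-axis
        proj = hull @ R.T
        mins, maxs = proj.min(axis=0), proj.max(axis=0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            center = R.T @ ((mins + maxs) / 2.0)  # back to the original frame
            best = (area, center, maxs - mins, theta)
    _, (cx, cz), (l, w), ry = best
    return cx, cz, w, l, ry
```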
  • 72. Lidar Point Cloud Guided Monocular 3D Object Detection
  • 73. Lidar Point Cloud Guided Monocular 3D Object Detection
  • 74. Lidar Point Cloud Guided Monocular 3D Object Detection