3D Interpretation from Single 2D Image
for Autonomous Driving IV
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
• CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D
Object Detection
• Ground-aware Monocular 3D Object Detection for Autonomous Driving
• Categorical Depth Distribution Network for Monocular 3D Object Detection
• Depth-conditioned Dynamic Message Propagation for Monocular 3D Object
Detection
• Geometry-based Distance Decomposition for Monocular 3D Object Detection
• Geometry-aware data augmentation for monocular 3D object detection
• Lidar Point Cloud Guided Monocular 3D Object Detection
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
• Pseudo-LiDAR-based methods for monocular 3D object detection have attracted considerable
attention in the community.
• This has created a distorted impression of the superiority of Pseudo-LiDAR approaches
over methods working with RGB images only.
• The 1st contribution is analysing and showing experimentally that the validation results
published by Pseudo-LiDAR-based methods are substantially biased.
• The source of the bias resides in an overlap between the KITTI3D object detection validation
set and the training/validation sets used to train depth predictors feeding Pseudo-LiDAR-
based methods.
• Surprisingly, the bias remains also after geographically removing the overlap, revealing the
presence of a more structured contamination.
• This leaves the test set as the only reliable means of comparison, where published Pseudo-
LiDAR-based methods do not excel.
• The second contribution brings Pseudo-LiDAR based methods back up in the ranking with
the introduction of a 3D confidence prediction module.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
The work analyzes the cause of the performance bias
of monocular Pseudo-LiDAR-based (PL)
methods, which consists in a substantial
drop between the results on the KITTI3D
validation and test sets. It shows that this bias
is due to the fact that the depth estimators
on which PL methods heavily rely have been
trained on a depth training set (black lines)
that includes 30% of the detection
validation set data (red lines). It proposes to
remove this bias by creating an alternative
unbiased depth training set (green lines)
that eliminates the overlap as well as
introduces a geographical distance w.r.t.
the detection validation data.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
The table shows that certain sub-tasks like
rotation (R) and shape (W;H;L) prediction,
despite the substitution with ground-truth
values, do not significantly improve
performance. In contrast, substituting the
predicted depth estimate (Z) with ground
truth improves performance substantially, meaning that
depth is by far the most crucial component
for 3D object detection.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
BTS (“From big to small: Multi-scale local planar guidance for monocular depth estimation”)
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
• The 3D object detection task requires associating each object with a 3D bounding box and
a corresponding confidence value.
• This confidence should generally reflect the quality of the 3D bounding box and can be
thought of as a measure of how reliable the particular estimate is.
• The existing Pseudo-LiDAR methods do not perform 3D confidence estimation in any
way and rely on the class probability coming along with the 2D detections.
• By doing so, the confidence adopted by current PL-based methods is actually agnostic to
the quality of the 3D predictions and therefore not effective for the role it should take.
• 2D detectors are often too confident, and the need for a 3D confidence seems essential.
• The proposal is to endow PL-based methods with the ability to estimate the 3D confidence of
their predictions.
• This architecture can be divided into three main branches namely 2D Detection, Pseudo-
LiDAR and 3D Detection.
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Architecture of a Pseudo-LiDAR-based method integrating the 3D confidence component.
1."Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving." CVPR 2019.
2. PatchNet: "Rethinking pseudo-lidar representation" ECCV'2020
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Example of the final part of the architecture, showing the Confidence Head added to the PatchNet architecture.
The Confidence Branch requires minimal modifications to the original architecture, adds negligible
computational complexity and inference time, and is compatible with most Pseudo-LiDAR approaches.
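To make the idea concrete, below is a minimal sketch of what such a 3D confidence head could look like, assuming (this is an assumption, not detailed in the slides) that it is a small MLP attached to per-object 3D detection features and trained to regress the 3D IoU between predicted and ground-truth boxes; all module and tensor names are hypothetical.

```python
import torch
import torch.nn as nn

class Confidence3DHead(nn.Module):
    """Hedged sketch: a small MLP that maps per-object 3D detection
    features to a confidence score in [0, 1]."""
    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, box_feats: torch.Tensor) -> torch.Tensor:
        # box_feats: (num_objects, feat_dim) features from the 3D detection branch
        return self.mlp(box_feats).squeeze(-1)

# Training target (assumption): the 3D IoU between each predicted box and its matched
# ground-truth box, so the score reflects 3D quality rather than the 2D class probability.
def confidence_loss(pred_conf, iou_3d):
    return nn.functional.binary_cross_entropy(pred_conf, iou_3d.clamp(0, 1))
```

At inference, such a 3D confidence (possibly combined with the 2D class score) would replace the 2D-only score used for ranking detections.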
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
Demystifying Pseudo-LiDAR for Monocular
3D Object Detection
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
• Starting from a synthetic dataset, pre-train an RGB-to-Depth Auto-Encoder (AE).
• The embedding learnt from this AE is then used to train a 3D Object Detector
(3DOD) CNN which is used to regress the parameters of 3D object poses after the
encoder from the AE generates a latent embedding from the RGB image.
• The AE is pre-trained once using paired RGB and depth images from simulation data, and
subsequently only the 3DOD network is trained using real data, comprising RGB
images and 3D object pose labels (without the requirement of dense depth).
• The 3DOD network utilizes a particular 'cubification' of 3D space around the
camera, where each cuboid is tasked with predicting N object poses, along with
their class and confidence values.
• A method for 3D object detection using a single monocular image, CubifAE-3D,
including AE pre-training + dividing 3D space around the camera into cuboids.
• The first part refers to the cubification/voxelization of the monocular camera space as a pre-
processing step, and AE refers to the auto-encoding of the RGB-to-depth space.
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
CubifAE-3D high-level architecture
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
• Figure (the next page): CubifAE-3D architecture.
• The RGB-to-depth auto-encoder is first trained in a supervised way with a
combination of MSE and edge-aware smoothing losses.
• Once trained, the decoder is detached, the encoder weights are frozen, and the
encoder output is fed to the 3DOD model, which is trained with a
combination of xyz_loss, whl_loss, orientation_loss, iou_loss, and conf_loss
(see the training-flow sketch after this list).
• A 2D bounding box is obtained for each object by projecting its detected 3D
bounding box onto the camera image plane; the crop is then resized to 64x64
and fed to the classifier model (bottom branch) along with the normalized
whl vector for class prediction.
• The dimensions indicated correspond to the output tensor for each block.
• The encoder head can also be replaced by a pretrained backbone network (VGG-16),
which yields improved performance.
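A minimal sketch of the two-stage training flow described above, assuming a PyTorch-style encoder and 3DOD head; the class names, target keys, and loss choices are hypothetical placeholders, not the authors' code, and the IoU loss from the paper is omitted for brevity.

```python
import torch
import torch.nn as nn

def train_cubifae_3dod(encoder: nn.Module, det_head: nn.Module, loader, optimizer):
    """Stage 2 (sketch): the RGB-to-depth encoder is frozen and only the
    3D object detection head is trained on real data."""
    for p in encoder.parameters():
        p.requires_grad = False          # freeze the pre-trained encoder
    encoder.eval()

    l1 = nn.L1Loss()
    bce = nn.BCEWithLogitsLoss()
    for rgb, targets in loader:          # targets: dict of per-cuboid regression labels
        with torch.no_grad():
            z = encoder(rgb)             # latent embedding from the frozen encoder
        pred = det_head(z)               # per-cuboid pose, size, orientation, confidence
        loss = (l1(pred["xyz"], targets["xyz"])
                + l1(pred["whl"], targets["whl"])
                + l1(pred["orientation"], targets["orientation"])
                + bce(pred["conf"], targets["conf"]))   # iou_loss omitted in this sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```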
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Detailed model architecture of CubifAE-3D
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
The RGB-to-depth auto-encoder
The total loss function for this model
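The equation itself is not reproduced in the slide text; a plausible form of the total auto-encoder loss, consistent with the MSE plus edge-aware smoothing combination mentioned on the previous slide (the weighting λ is an assumption), is:

```latex
\mathcal{L}_{AE} =
\underbrace{\frac{1}{N}\sum_{p}\big(\hat{d}(p) - d(p)\big)^{2}}_{\text{MSE}}
\;+\;
\lambda \underbrace{\frac{1}{N}\sum_{p}\Big(|\partial_x \hat{d}(p)|\,e^{-|\partial_x I(p)|}
+ |\partial_y \hat{d}(p)|\,e^{-|\partial_y I(p)|}\Big)}_{\text{edge-aware smoothness}}
```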
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Monocular RGB to Depth Map prediction
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Cubification of the camera space: the perception RoI is divided into a 4x4xM grid (the x and y directions
aligned with the image plane, with M cuboids stacked along the z direction at each grid cell). Each cuboid
is responsible for predicting up to N object poses. The object coordinates and dimensions are then
normalized to [0, 1] according to a prior computed from data statistics.
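As an illustration of how an object could be assigned to a cuboid and its targets normalized, here is a hedged sketch; the RoI bounds, grid resolution in z, and size prior are placeholder assumptions (the real values come from the dataset statistics), not the authors' configuration.

```python
import numpy as np

# Assumed perception RoI bounds in camera coordinates (placeholders, in meters)
X_RANGE, Y_RANGE, Z_RANGE = (-20.0, 20.0), (-2.0, 6.0), (0.0, 80.0)
GRID_XY, GRID_Z = 4, 8          # 4x4 grid in x/y, M=8 cuboids stacked in z (assumption)

def assign_and_normalize(center_xyz, whl, whl_prior=(1.8, 1.6, 4.0)):
    """Return the cuboid index and 0-1 normalized regression targets for one object."""
    x, y, z = center_xyz
    ix = int((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]) * GRID_XY)
    iy = int((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]) * GRID_XY)
    iz = int((z - Z_RANGE[0]) / (Z_RANGE[1] - Z_RANGE[0]) * GRID_Z)
    ix, iy, iz = [int(np.clip(v, 0, n - 1)) for v, n in ((ix, GRID_XY), (iy, GRID_XY), (iz, GRID_Z))]
    # Normalize the center within its cuboid and the size by a dataset prior
    cell = np.array([(X_RANGE[1] - X_RANGE[0]) / GRID_XY,
                     (Y_RANGE[1] - Y_RANGE[0]) / GRID_XY,
                     (Z_RANGE[1] - Z_RANGE[0]) / GRID_Z])
    origin = np.array([X_RANGE[0], Y_RANGE[0], Z_RANGE[0]]) + np.array([ix, iy, iz]) * cell
    xyz_norm = (np.array(center_xyz) - origin) / cell
    whl_norm = np.array(whl) / np.array(whl_prior)
    return (ix, iy, iz), xyz_norm, whl_norm
```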
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Samples of qualitative results on the KITTI dataset. The top part of each image shows
bounding boxes obtained as 2D projections of the predicted 3D poses. The bottom part shows
a bird's-eye view of the object poses with the ego-vehicle positioned at the center of the red
circle drawn on the left, pointing towards the right of the image.
CubifAE-3D: Monocular Camera Space Cubification
for Auto-Encoder based 3D Object Detection
Qualitative results on the KITTI dataset
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
• Most existing algorithms are based on the geometric constraints of
2D-3D correspondence, which stem from generic 6D object pose
estimation.
• First identify how the ground plane provides additional clues in depth
reasoning in 3D detection in driving scenes.
• Based on this observation, then improve the processing of 3D anchors
and introduce a neural network module to fully utilize such
application-specific priors in the framework of deep learning.
• Introduce a neural network embedded with the proposed module for
3D object detection.
• Further verify the power of the proposed module with a neural network
designed for monocular depth prediction.
• https://www.github.com/Owen-Liuyuxuan/visualDet3D
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Perspective geometry for the GAC module. When calculating the
vertical offsets, assume pixels are foreground object centers. When
computing the depth priors z, assume pixels are on the ground because
they are features to be queried.
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
offset
inverse depth
depth
Relation of depth and height
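The formulas themselves are not reproduced in the slide text. For reference, a hedged reconstruction of the standard ground-plane geometry the GAC module exploits (camera mounted at height $h_{cam}$ above flat ground, focal length $f_y$, principal point row $c_y$; the paper's exact notation may differ):

```latex
% Depth prior for a pixel at image row v assumed to lie on the ground plane:
z_{ground}(v) = \frac{f_y \, h_{cam}}{v - c_y}, \qquad v > c_y
% Relation of depth and height: an object of physical height H spanning h pixels vertically
z = \frac{f_y \, H}{h}
```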
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Ground-Aware Convolution (GAC) Module
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Object detection
Ground-aware Monocular 3D Object
Detection for Autonomous Driving
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Mono 3D object detection is a key problem for autonomous vehicles, as it provides a
solution with simple configuration compared to typical multi-sensor systems.
• The main challenge in mono 3D detection lies in accurately predicting object depth, inferred
from object and scene cues due to the lack of direct range measurement.
• Many methods attempt to directly estimate depth to assist in 3D detection, but show limited
performance as a result of depth inaccuracy.
• This solution, Categorical Depth Distribution Network (CaDDN), uses a predicted
categorical depth distribution for each pixel to project rich contextual feature information to
the appropriate depth interval in 3D space.
• Then use the computationally efficient bird’s-eye-view (BEV) projection and single-stage
detector to produce the final output detections.
• CaDDN is a fully differentiable, end-to-end (E2E) joint depth estimation and object detection method.
• https://github.com/TRAILab/CaDDN
Categorical Depth Distribution Network for
Monocular 3D Object Detection
(a) Input image. (b) Without depth distribution supervision, BEV features from CaDDN suffer
from smearing effects. (c) Depth distribution supervision encourages BEV features from
CaDDN to encode meaningful depth confidence, in which objects can be accurately detected.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Direct methods estimate 3D detections directly from images without
predicting an intermediate 3D scene representation; they can incorporate the
geometric relationship between the 2D image plane and 3D space to assist
with detections.
• Depth-based methods perform the 3D detection task using pixel-wise depth
maps as an additional input, where the depth maps are precomputed using
monocular depth estimation architectures; Estimated depth maps can be
used in combination with images to perform the 3D detection task.
• Grid-based methods avoid estimating raw depth values by predicting a BEV
grid representation, to be used as input for 3D detection architectures;
Multiple voxels can be projected to the same image feature, leading to
repeated features along the projection ray and reduced detection accuracy.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
CaDDN architecture. The network is composed of three modules that generate 3D feature representations and
one that performs 3D detection. Frustum features G are generated using depth distributions D and transformed
into voxel features V. The voxel features are collapsed to BEV features B for 3D object detection.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• The purpose of the frustum feature network is to project image information
into 3D space by associating image features with estimated depths.
• It follows the design of the semantic segmentation network DeepLabV3 to
estimate the categorical depth distributions from image features (Depth
Distribution Network), modifying the network to produce pixel-wise
probability scores of belonging to depth bins rather than semantic classes,
with a downsample-upsample architecture.
• In parallel to estimating depth distributions, channel reduction
(Image Channel Reduce) is performed on the image features to generate the final image
features, using a 1x1 convolution + BatchNorm + ReLU layer.
• Channel reduction is required to reduce the high memory footprint of the
ResNet-101 features that will be populated in the 3D frustum grid.
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Each feature pixel F(u; v) is weighted by its depth
distribution probabilities D(u; v) of belonging to D
discrete depth bins to generate frustum features G(u; v).
Sampling points in each voxel are projected into the
frustum grid. Frustum features are sampled using
trilinear interpolation to populate voxels in V.
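A hedged sketch of the frustum feature construction described in this caption: per-pixel features are weighted by the categorical depth distribution via an outer product along the depth axis. Tensor shapes follow the caption's description; the function and variable names are placeholders.

```python
import torch

def build_frustum_features(image_feats: torch.Tensor, depth_dist: torch.Tensor) -> torch.Tensor:
    """image_feats: (B, C, H, W) reduced image features F
    depth_dist:  (B, D, H, W) per-pixel categorical depth distribution (softmax over D bins)
    returns      (B, C, D, H, W) frustum features G(u, v) = D(u, v) (outer product) F(u, v)
    """
    # Outer product between the C feature channels and the D depth bins at each pixel
    return image_feats.unsqueeze(2) * depth_dist.unsqueeze(1)

# Example shapes
feats = torch.randn(1, 64, 48, 160)
dist = torch.softmax(torch.randn(1, 80, 48, 160), dim=1)
G = build_frustum_features(feats, dist)   # (1, 64, 80, 48, 160)
```

Voxel features V would then be obtained by projecting voxel sampling points into this frustum grid and sampling with trilinear interpolation (e.g., torch.nn.functional.grid_sample on the 5D tensor).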
Categorical Depth Distribution Network for
Monocular 3D Object Detection
The continuous depth space is
discretized in order to define the
set of D bins used in the depth
distributions D. Depth
discretization can be performed
with uniform discretization (UD)
with a fixed bin size, spacing-
increasing discretization (SID)
with increasing bin sizes in log
space, or linear-increasing
discretization (LID) with linearly
increasing bin sizes.
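For concreteness, a hedged sketch of the three discretization schemes; the LID bin-edge expression follows the linearly increasing bin-size construction, and the exact constants and the example depth range should be treated as assumptions rather than the paper's verbatim equations.

```python
import numpy as np

def depth_bin_edges(d_min, d_max, num_bins, mode="LID"):
    i = np.arange(num_bins + 1, dtype=np.float64)
    if mode == "UD":     # uniform discretization: constant bin size
        return d_min + (d_max - d_min) * i / num_bins
    if mode == "SID":    # spacing-increasing: uniform in log space (requires d_min > 0)
        return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / num_bins)
    if mode == "LID":    # linear-increasing: bin size grows linearly with the bin index
        return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))
    raise ValueError(mode)

edges = depth_bin_edges(2.0, 46.8, 80, mode="LID")  # example KITTI-like range (assumption)
```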
Categorical Depth Distribution Network for
Monocular 3D Object Detection
• Apply depth distribution labels to supervise predicted depth distributions.
• Depth distribution labels are generated by projecting LiDAR point clouds into
the image frame to create sparse depth maps.
• Depth completion is then performed to generate depth values at each pixel in the image.
• Depth information is required at each image feature pixel, so the
depth maps of size WI x HI are downsampled to the image feature size WF x HF.
• The depth maps are converted to bin indices using the LID discretization
method, followed by a conversion into a one-hot encoding to generate the
depth distribution labels (see the sketch after this list).
• A one-hot encoding ensures the depth distribution labels are sharp, which is essential
to encourage sharpness in the predicted depth distributions via supervision.
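A hedged sketch of the label-generation step just described, reusing the bin edges from the previous sketch; the completed, downsampled depth map is assumed to be given.

```python
import numpy as np

def depth_distribution_labels(depth_map, edges):
    """depth_map: (H_F, W_F) completed, downsampled depth in meters
    edges:     (D+1,) monotonically increasing bin edges (e.g., from LID)
    returns    (D, H_F, W_F) one-hot depth distribution labels
    """
    num_bins = len(edges) - 1
    # np.digitize returns indices in [1, D] for values inside the range; shift to [0, D-1]
    bin_idx = np.clip(np.digitize(depth_map, edges) - 1, 0, num_bins - 1)
    one_hot = np.zeros((num_bins,) + depth_map.shape, dtype=np.float32)
    rows, cols = np.indices(depth_map.shape)
    one_hot[bin_idx, rows, cols] = 1.0
    return one_hot
```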
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Categorical Depth Distribution Network for
Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• The objective is to learn context- and depth-aware feature representation to
solve the problem of monocular 3D object detection.
• (i) propose a depth conditioned dynamic message propagation (DDMP)
network to effectively integrate the multi-scale depth information with the
image context;
• (ii) this is achieved by first adaptively sampling context-aware nodes in the
image context and then dynamically predicting hybrid depth-dependent
filter weights and affinity matrices for propagating information;
• (iii) augmenting the network with a center-aware depth encoding (CDE) task alleviates the
inaccurate depth prior;
• (iv) the effectiveness is thoroughly demonstrated, with state-of-the-art (SoA) results among
the monocular-based approaches on the KITTI benchmark dataset.
• Code and models are released at https://github.com/fudan-zvg/DDMP
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Left: DDMP adaptively
samples context-aware
nodes (top) in the image
context and dynamically
predicts hybrid depth-
dependent filter weights and
affinity matrices (bottom) for
propagating information.
Right: the improvement of
DDMP-3D (red) over the
baseline (yellow) via center-
aware depth encoding.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• Two branches are involved: the 3D detection branch (blue) and the depth feature
extraction branch (green).
• The RGB images are fed into the upper branch for feature extraction, while the
corresponding depth maps, estimated via an off-the-shelf depth estimator, are sent into the
depth branch for extracting depth-aware features.
• The DDMP (dynamic message propagation) modules in yellow perform the depth-conditioned
dynamic message propagation: they dynamically sample context-aware nodes in the upper
image branch and predict the hybrid filter weights and affinities from multi-scale depth
features of the bottom branch for message propagation (see the simplified sketch after this list).
• Common 3D heads for 3D center, dimension, and orientation regression then produce the
final 3D object boxes.
• CDE (center-aware depth feature encoding) is an auxiliary task for joint-optimization
training that implicitly guides the depth sub-network to learn center-aware depth features for
better object localization; it is discarded during inference.
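Below is a heavily simplified, hedged sketch of depth-conditioned dynamic filtering in the spirit of DDMP, not the authors' implementation: per-position sampling offsets and filter weights are predicted from depth features and used to aggregate sampled image features. The hyperparameters and layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthConditionedMessage(nn.Module):
    """Simplified sketch: depth features predict K sampling offsets and K weights
    per location; image features are sampled at those offsets and aggregated."""
    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.k = k
        self.offset_pred = nn.Conv2d(channels, 2 * k, kernel_size=3, padding=1)
        self.weight_pred = nn.Conv2d(channels, k, kernel_size=3, padding=1)

    def forward(self, img_feat, depth_feat):
        b, c, h, w = img_feat.shape
        offsets = self.offset_pred(depth_feat).view(b, self.k, 2, h, w)   # offsets in pixels
        weights = torch.softmax(self.weight_pred(depth_feat), dim=1)      # (B, K, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img_feat.device),
                                torch.linspace(-1, 1, w, device=img_feat.device),
                                indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                              # (H, W, 2)

        out = torch.zeros_like(img_feat)
        for i in range(self.k):
            # Convert pixel offsets to normalized coordinates and sample image features
            off = offsets[:, i].permute(0, 2, 3, 1)                       # (B, H, W, 2) as (dx, dy)
            scale = torch.tensor([w / 2.0, h / 2.0], device=img_feat.device)
            grid = base.unsqueeze(0) + off / scale
            sampled = F.grid_sample(img_feat, grid, align_corners=True)
            out = out + sampled * weights[:, i:i + 1]
        return out
```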
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Schematic illustration of DDMP-3D
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Illustration of the DDMP module in a single-scale setting. Dynamic nodes are first sampled from
the image and depth feature graphs; for these sampled nodes, the filter weights and affinity
matrices are learned from depth features to propagate the depth-conditioned message.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
• Generally, depth maps lose appearance details or fail to discriminate between foreground
instances and the background, giving an unreliable depth prior for depth-assisted 3D object detection.
• It has already been shown that a multi-task strategy can boost each single task to some degree,
benefiting from the multi-fold regularization effect of joint optimization.
• The method therefore augments the main 3D detection task with an auxiliary task that is jointly optimized with it.
• The augmented task, supervised with xyz coordinates in 3D space, uniquely determines a point in the 2D
image plane, which imposes spatial constraints that yield a 3D instance-level understanding.
• With the better instance awareness brought by CDE, the model is able to alleviate the
inaccurate depth prior in situations such as occlusion and distant objects.
• The depth branch adopts a similar network architecture, with a head that only predicts 3D
centers without predefined anchors.
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Depth-conditioned Dynamic Message
Propagation for Monocular 3D Object Detection
Qualitative comparison of ground truth (green), the baseline (yellow), and our method
(red) on KITTI val set. For better visualization, the first and second columns show RGB
and BEV images of point clouds converted from pre-estimated depth, respectively.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
• The core challenge of monocular 3D object detection is to predict the distance of objects in the
absence of explicit depth information.
• Unlike most existing methods, which regress the distance as a single variable, MonoRCNN uses a
geometry-based distance decomposition to recover the distance from its factors.
• The decomposition factors the distance of objects into the most representative and stable
variables, i.e., the physical height and the projected visual height in the image plane.
• In MonoRCNN, the decomposition maintains the self-consistency between the two heights,
leading to robust distance prediction even when both predicted heights are inaccurate.
• The decomposition also makes it possible to trace the cause of distance uncertainty for different scenarios.
• Such a decomposition makes the distance prediction interpretable, accurate, and robust.
• It directly predicts 3D bounding boxes from RGB images with a compact architecture,
making training and inference simple and efficient.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
The distance decomposition is based on the imaging geometry of a pinhole camera. The
distance from the center of an object to the camera, denoted as Z, can be calculated by Z =
fH/h , where f denotes the focal length of the camera, H denotes the physical height of the
object, and h denotes the length of the projected central line (PCL). The PCL represents the
projection of the vertical line at the center of the 3D bounding box. This equation shows that the
distance of an object is determined by its physical height and its projected visual height.
Note: objects are abstracted as the vertical lines at the centers of their 3D bounding boxes, and their
visual projections as the projections of these vertical lines; the distance is then recovered from them
based on the imaging geometry.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
The architecture of MonoRCNN. It is built upon Faster R-CNN and adds a carefully designed 3D distance
head. The 3D distance head is based on the geometry-based distance decomposition. Specifically, the method
regresses the physical height H, the reciprocal of the projected visual height h_rec = 1/h, and their uncertainties, then
recovers the distance by Z = f H h_rec. Blue arrows represent operations in the network during training and
inference, and orange arrows represent operations to recover 3D bounding boxes during inference.
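A small worked sketch of the distance recovery described above; the KITTI-like focal length and object sizes are assumed example values.

```python
# Geometry-based distance recovery, Z = f * H * h_rec, as described above.
def recover_distance(f_pixels: float, H_meters: float, h_rec: float) -> float:
    return f_pixels * H_meters * h_rec

# Example (assumed values): f ~ 720 px, a car of physical height 1.5 m whose
# projected central line spans 40 px (h_rec = 1/40) -> Z = 720 * 1.5 / 40 = 27 m.
print(recover_distance(720.0, 1.5, 1.0 / 40.0))  # 27.0
```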
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Comparison between the predicted eight projected corners (red boxes) and the predicted visual
height (blue lines). Predicting the eight projected corners fails under challenging cases, such as
occlusion, truncation, and extreme lighting conditions, while predicting the visual height is simpler
and more robust. The images are from the KITTI validation split.
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Uncertainty-aware Regression 3D Attribute Head
keypoint loss function
physical size and yaw angle
The loss functions for H and hrec
Overall Loss
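The loss equations are not reproduced in the slide text. For reference, a common form of uncertainty-aware regression (Laplacian aleatoric uncertainty) that matches the description of regressing H, h_rec, and their uncertainties is sketched below; treat the exact terms and weights as assumptions rather than the paper's verbatim formulas.

```latex
% Uncertainty-aware regression for a variable y with prediction \hat{y} and predicted scale \sigma:
\mathcal{L}_{unc}(y, \hat{y}, \sigma) = \frac{|y - \hat{y}|}{\sigma} + \log \sigma
% Applied to the two distance factors:
\mathcal{L}_{dist} = \mathcal{L}_{unc}(H, \hat{H}, \sigma_H)
                   + \mathcal{L}_{unc}(h_{rec}, \hat{h}_{rec}, \sigma_{h_{rec}})
% Overall loss (sketch): 2D detection losses plus keypoint, size/yaw, and distance terms
\mathcal{L} = \mathcal{L}_{2D} + \mathcal{L}_{kpt} + \mathcal{L}_{size,yaw} + \mathcal{L}_{dist}
```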
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
Geometry-based Distance Decomposition
for Monocular 3D Object Detection
KITTI Examples
nuScenes Cross-Test Examples
Geometry-aware data augmentation for
monocular 3D object detection
• This work first conducts a thorough analysis to reveal how existing methods fail to robustly
estimate depth when different geometric shifts occur.
• Through image-based and instance-based manipulations, it illustrates how vulnerable these methods
are at capturing consistent relationships between depth and both object apparent sizes and positions.
• Those manipulations are converted into four corresponding 3D-aware data augmentation techniques.
• At the image level, the camera system is randomly manipulated, including its focal length,
receptive field and location, to generate new training images with geometric shifts.
• At the instance level, foreground objects are cropped and randomly pasted into other scenes
to generate new training instances.
• All the proposed augmentation techniques share the virtue that the geometric relationships of
objects are preserved while their geometry is manipulated.
• Not only is the instability of depth recovery effectively alleviated, but the final 3D
detection performance is also significantly improved.
Geometry-aware data augmentation for
monocular 3D object detection
Geometric manipulations
Geometry-aware data augmentation for
monocular 3D object detection
3D point to image coordinate
Use focal length to infer depth
Use vertical position to infer depth
Shifting the camera focal length
Manipulate the camera receptive field
Moving the camera
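The equations on this slide are not reproduced; a hedged reconstruction of the underlying pinhole relations (standard geometry, with $h_{cam}$ the camera height above the ground and $(c_x, c_y)$ the principal point; the paper's notation may differ) is:

```latex
% 3D point (X, Y, Z) in camera coordinates to image coordinates:
u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y
% Depth from apparent size (focal-length cue): an object of height H spanning h pixels
Z = \frac{f_y H}{h}
% Depth from vertical position (ground-contact cue): a ground point projected at row v
Z = \frac{f_y h_{cam}}{v - c_y}
```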
Geometry-aware data augmentation for
monocular 3D object detection
Visualization of the geometric relationships between
depth and both object apparent sizes and positions.
Geometry-aware data augmentation for
monocular 3D object detection
Rotation matrix R from egocentric orientation angle
8 corner points in the object coordinate
coordinate of point
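A hedged sketch of the corner computation these labels refer to, using the common KITTI-style convention (camera y-axis pointing down, yaw Ry about the y-axis, box origin at the bottom center); the corner ordering is a choice for illustration, not necessarily the paper's definition.

```python
import numpy as np

def box3d_corners(center, dims, ry):
    """center: (x, y, z) of the box bottom center in camera coordinates
    dims:   (h, w, l) height, width, length
    ry:     yaw angle around the camera y-axis (egocentric orientation)
    returns (8, 3) corner coordinates in camera coordinates
    """
    h, w, l = dims
    # 8 corner points in the object coordinate frame (origin at the bottom center)
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    corners = np.stack([x, y, z], axis=0)                   # (3, 8)
    # Rotation matrix R around the y-axis from the egocentric orientation angle
    c, s = np.cos(ry), np.sin(ry)
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])
    return (R @ corners).T + np.asarray(center)             # (8, 3)
```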
Geometry-aware data augmentation for
monocular 3D object detection
Empirical analysis of the monocular detector under geometric manipulations.
Both anchor-free (e.g., CenterNet) and anchor-based (e.g., M3D-RPN) models are tried.
Geometry-aware data augmentation for
monocular 3D object detection
Visualization of copy-paste data augmentation with and without geometry awareness
Geometry-aware data augmentation for
monocular 3D object detection
Geometry-aware data augmentation for
monocular 3D object detection
Geometry-aware data augmentation for
monocular 3D object detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
• LiDAR point clouds, which provide accurate depth measurement, can offer
beneficial information for the training of monocular methods.
• Prior works only use LiDAR point clouds to train a depth estimator; this implicit use does not
fully exploit LiDAR point clouds, consequently leading to suboptimal performance.
• To effectively take advantage of LiDAR point clouds, a general, simple yet effective
framework for monocular methods is proposed.
• Specifically, LiDAR point clouds are used to directly guide the training of
monocular 3D detectors, allowing them to learn the desired objectives
while eliminating the extra annotation cost.
• Thanks to the general design, this method can be plugged into any
monocular 3D detection method, significantly boosting the performance.
Lidar Point Cloud Guided Monocular 3D
Object Detection
LiDAR-guided monocular 3D object
detection: LiDAR point clouds are
directly used to guide the training of
the monocular 3D detector.
Lidar Point Cloud Guided Monocular 3D
Object Detection
Qualitative examples of pseudo-
LiDAR-based monocular 3D
detection. From top to bottom:
the RGB image and the 3D
predictions on the bird's-eye-view
(BEV) map. The estimated 3D box
center typically lies near the
converted point cloud, meaning
the approach works well if the
provided depths are accurate
but has difficulty correcting
poorly predicted object depth.
Lidar Point Cloud Guided Monocular 3D
Object Detection
• Specifically, training a depth-map-based method roughly comprises
two stages: (1) training a dense depth estimation network; (2) training a
monocular 3D detector.
• As a common practice, current monocular depth-map-based methods all
utilize projected LiDAR point clouds as ground truth to train the
depth estimator.
• The number of LiDAR point clouds used for training heavily affects the final 3D
detection accuracy.
• This method directly utilizes LiDAR point clouds to generate massive pseudo 3D box
labels for monocular methods.
• This simple yet effective approach allows monocular 3D detectors to learn the desired
objectives while eliminating the extra annotation cost.
• It is able to work in either supervised or unsupervised mode, depending on the
reliance on manual 3D box annotations.
Lidar Point Cloud Guided Monocular 3D
Object Detection
3D boxes are generated from
LiDAR point clouds in order to
train the monocular 3D detector.
Such 3D boxes are predicted via
a pre-trained LiDAR 3D
detector (supervised mode) or
obtained directly from the point
cloud without training
(unsupervised mode).
Lidar Point Cloud Guided Monocular 3D
Object Detection
• To take advantage of available 3D box annotations, first train a LiDAR-based 3D
detector from scratch with LiDAR point clouds and associated 3D box annotations.
• The pre-trained LiDAR-based 3D detector is then utilized to infer 3D boxes on new
LiDAR point clouds.
• Such results are treated as pseudo labels to train monocular 3D detectors.
• Due to the precise depth measurement, pseudo labels predicted by the LiDAR-
based 3D detector are considerably accurate and can be used directly in the
training of monocular 3D detectors.
• Interestingly, with different training settings for the LiDAR-based 3D detector,
the monocular 3D detectors guided by them show similar performance.
• This indicates that monocular methods can indeed benefit from the guidance of
LiDAR point clouds, and that only a small number of 3D box annotations is
sufficient to push the monocular method to high performance.
• Thus the manual annotation cost can also be greatly reduced.
Lidar Point Cloud Guided Monocular 3D
Object Detection
• The process of generating pseudo labels on the LiDAR point cloud can be roughly divided
into three steps: 2D box and mask prediction, RoI point selection and clustering, and 3D
box estimation.
• In the beginning, an off-the-shelf 2D instance segmentation model is adopted to perform
segmentation on the RGB image, obtaining 2D box and mask estimates.
• These estimates are used for building camera frustums in order to select associated LiDAR RoI
points for every object, where those boxes without any LiDAR point inside are ignored.
• To eliminate irrelevant points, an unsupervised clustering approach, i.e.,
DBSCAN, is used to divide the RoI point cloud into different groups according to density
(see the sketch after this list).
• Points that are close in 3D space are aggregated into a cluster.
• The cluster containing the most points is then regarded as the target corresponding to the object.
• Finally, the minimum 3D bounding box that covers all target points is sought.
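A hedged sketch of the frustum-point clustering step, using sklearn's DBSCAN; the eps/min_samples values are illustrative assumptions, and the selection of RoI points inside the 2D-box frustum is assumed to have happened upstream.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_target_points(roi_points: np.ndarray, eps: float = 0.5, min_samples: int = 5):
    """roi_points: (N, 3) LiDAR points falling inside one object's 2D-box frustum.
    Returns the points of the densest cluster (assumed to belong to the object),
    or None if no valid cluster is found."""
    if len(roi_points) < min_samples:
        return None
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(roi_points)
    valid = labels[labels >= 0]                      # -1 marks noise points
    if valid.size == 0:
        return None
    target_label = np.bincount(valid).argmax()       # cluster with the most points
    return roi_points[labels == target_label]
```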
Lidar Point Cloud Guided Monocular 3D
Object Detection
• To simplify the problem of solving the 3D bounding box, project points onto the bird’s-eye-
view map, reducing parameters since the height and y of the object can be easily obtained.
• The convex hull of the object points is computed, and the box is then obtained using rotating
calipers (see the sketch after this list).
• Specifically, the edges of the convex hull are enumerated to produce enclosing rectangles, and the
rectangle with the smallest area is chosen as the resulting BEV box (parameterized by the box
center (x, z), the dimensions (w, l), and the orientation Ry).
• Other parameters of the 3D box can be calculated from statistics on remaining points.
• The height h can be represented by the max spatial offset along the y-axis of points, and the
center coordinate y is calculated by averaging y coordinates of points.
• Consequently, the complete 3D box is generated.
• Pseudo labels can fail when there are not enough LiDAR points to describe the outline of an object.
• Object dimensions are therefore restricted to eliminate boxes that are likely to be outliers,
since the dimensions of most valid objects are close to each other.
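A hedged sketch of the BEV box fitting just described (rotating-calipers-style minimum-area rectangle over the convex hull, plus height statistics along the y-axis); this is a simplified standalone implementation, not the authors' code, and the (w, l) assignment convention is a choice.

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_bev_box(points_xyz: np.ndarray):
    """points_xyz: (N, 3) target points of one object in camera coordinates (x, y, z).
    Returns (x, z, w, l, ry, h, y_center) of a BEV-fitted 3D box."""
    xz = points_xyz[:, [0, 2]]
    hull = xz[ConvexHull(xz).vertices]                       # (M, 2) hull vertices
    best = None
    for i in range(len(hull)):                               # enumerate hull edges
        edge = hull[(i + 1) % len(hull)] - hull[i]
        ry = np.arctan2(edge[1], edge[0])
        c, s = np.cos(ry), np.sin(ry)
        rot = np.array([[c, s], [-s, c]])                    # rotate this edge onto the x-axis
        proj = xz @ rot.T
        mins, maxs = proj.min(axis=0), proj.max(axis=0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            center = rot.T @ ((mins + maxs) / 2.0)
            best = (area, center, maxs - mins, ry)
    _, (x, z), (l, w), ry = best
    # Height and vertical center from statistics of the remaining points (y-axis)
    y_min, y_max = points_xyz[:, 1].min(), points_xyz[:, 1].max()
    h, y_center = y_max - y_min, (y_min + y_max) / 2.0
    return x, z, w, l, ry, h, y_center
```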
Lidar Point Cloud Guided Monocular 3D
Object Detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
Lidar Point Cloud Guided Monocular 3D
Object Detection
3-d interpretation from single 2-d image IV

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Depth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep LearningDepth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep Learning
 
Pose estimation from RGB images by deep learning
Pose estimation from RGB images by deep learningPose estimation from RGB images by deep learning
Pose estimation from RGB images by deep learning
 
Depth Fusion from RGB and Depth Sensors III
Depth Fusion from RGB and Depth Sensors  IIIDepth Fusion from RGB and Depth Sensors  III
Depth Fusion from RGB and Depth Sensors III
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
Deep vo and slam iii
Deep vo and slam iiiDeep vo and slam iii
Deep vo and slam iii
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
3-d interpretation from single 2-d image for autonomous driving
3-d interpretation from single 2-d image for autonomous driving3-d interpretation from single 2-d image for autonomous driving
3-d interpretation from single 2-d image for autonomous driving
 
Deep vo and slam ii
Deep vo and slam iiDeep vo and slam ii
Deep vo and slam ii
 
camera-based Lane detection by deep learning
camera-based Lane detection by deep learningcamera-based Lane detection by deep learning
camera-based Lane detection by deep learning
 
Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
BEV Semantic Segmentation
BEV Semantic SegmentationBEV Semantic Segmentation
BEV Semantic Segmentation
 
Driving behaviors for adas and autonomous driving XII
Driving behaviors for adas and autonomous driving XIIDriving behaviors for adas and autonomous driving XII
Driving behaviors for adas and autonomous driving XII
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
Survey 1 (project overview)
Survey 1 (project overview)Survey 1 (project overview)
Survey 1 (project overview)
 
Driving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xivDriving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xiv
 

Ähnlich wie 3-d interpretation from single 2-d image IV

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
Prathamesh Joshi
 

Ähnlich wie 3-d interpretation from single 2-d image IV (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
Understanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdfUnderstanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdf
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
3d object detection and recognition : a review
3d object detection and recognition : a review3d object detection and recognition : a review
3d object detection and recognition : a review
 
Goal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D cameraGoal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D camera
 
Simulation of collision avoidance by navigation
Simulation of collision avoidance by navigationSimulation of collision avoidance by navigation
Simulation of collision avoidance by navigation
 
pydataPointCloud.pptx
pydataPointCloud.pptxpydataPointCloud.pptx
pydataPointCloud.pptx
 
On constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized ImagesOn constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized Images
 
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro..."High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Nadia2013 research
Nadia2013 researchNadia2013 research
Nadia2013 research
 
Detection of a user-defined object in an image using feature extraction- Trai...
Detection of a user-defined object in an image using feature extraction- Trai...Detection of a user-defined object in an image using feature extraction- Trai...
Detection of a user-defined object in an image using feature extraction- Trai...
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 

Mehr von Yu Huang

Mehr von Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucks
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 

3-d interpretation from single 2-d image IV

  • 1. 3D Interpretation from Single 2D Image for Autonomous Driving IV Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Ground-aware Monocular 3D Object Detection for Autonomous Driving • Categorical Depth Distribution Network for Monocular 3D Object Detection • Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection • Geometry-based Distance Decomposition for Monocular 3D Object Detection • Geometry-aware data augmentation for monocular 3D object detection • Lidar Point Cloud Guided Monocular 3D Object Detection
  • 3. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • Pseudo-LiDAR-based methods for monocular 3D object detection have generated large attention in the community. • This generated a distorted impression about the superiority of Pseudo-LiDAR approaches against methods working with RGB-images only. • The 1st contribution is analysing and showing experimentally that the validation results published by Pseudo-LiDAR-based methods are substantially biased. • The source of the bias resides in an overlap between the KITTI3D object detection validation set and the training/validation sets used to train depth predictors feeding Pseudo-LiDAR- based methods. • Surprisingly, the bias remains also after geographically removing the overlap, revealing the presence of a more structured contamination. • This leaves the test set as the only reliable mean of comparison, where published Pseudo- LiDAR-based methods do not excel. • The second contribution brings Pseudo-LiDAR based methods back up in the ranking with the introduction of a 3D confidence prediction module.
  • 4. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection It analyze the cause of the performance bias of monocular Pseudo-LiDAR-based (PL) methods, which consists in a substantial drop between the results on the KITTI3D validation and test set. It show that this bias is due to the fact that the depth estimators on which PL methods heavily rely have been trained on a depth training set (black lines) which includes 30% of the detection validation set data (red lines). It propose to solve this bias by creating an alternative unbiased depth training set (green lines) which eliminates the overlap as well as introduces a geographical distance w.r.t. detection validation data.
  • 5. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection The table shows that certain sub-tasks like rotation (R) and shape (W;H;L) prediction, despite the substitution with ground-truth values, do not significantly improve performance. In contrast, substituting the predicted depth estimation (Z) with ground truth improves substantially, meaning that depth is by-far the most crucial component for 3D object detection.
  • 6. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection BTS (“From big to small: Multi-scale local planar guidance for monocular depth estimation”)
  • 7. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection • The 3D object detection task, requires to associate each object with a 3D bounding box and a corresponding confidence value. • This confidence should generally reflect the quality of the 3D bounding box and can be thought as a measure of how much the particular estimate is reliable. • The existing Pseudo-LiDAR methods do not perform the 3D confidence estimation in any way and rely on the class probability coming along with the 2D detections. • By doing so, the confidence adopted by current PL-based methods is actually agnostic to the quality of the 3D predictions and therefore not effective for the role it should take. • 2D detectors are often too confident and the need for a 3D confidence seems essential. • Propose to do endow PL-based methods with the ability of estimating the 3D confidence of their predictions. • This architecture can be divided into three main branches namely 2D Detection, Pseudo- LiDAR and 3D Detection.
  • 8. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection Architecture of a Pseudo-LiDAR-based method integrating the 3D confidence component. 1."Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving." CVPR 2019. 2. PatchNet: "Rethinking pseudo-lidar representation" ECCV'2020
  • 9. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection Example of final part of the architecture, where adding the Confidence Head to the PatchNet architecture. The Confidence Branch requires minimal modifications to the original architecture, adds negligible computational complexity and inference time and is compatible with most Pseudo-LiDAR approaches.
  • 10. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
  • 11. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
  • 12. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Starting from a synthetic dataset, pre-train an RGB-to-Depth Auto-Encoder (AE). • The embedding learnt from this AE is then used to train a 3D Object Detector (3DOD) CNN which is used to regress the parameters of 3D object poses after the encoder from the AE generates a latent embedding from the RGB image. • It pre-train the AE using paired RGB and depth images from simulation data once and subsequently only train the 3DOD network using real data, comprising of RGB images and 3D object pose labels (without the requirement of dense depth). • The 3DOD network utilizes a particular‘cubification’ of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. • A method for 3D object detection using a single monocular image, CubifAE-3D, including AE pre-training+ dividing 3D space around the camera into cuboids. • The first part refers to cubification/voxellization of mono camera space as a pre- processing step, and AE refers to the Auto- Encoding of RGB-to-depth space.
  • 13. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection CubifAE-3D high-level architecture
  • 14. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection • Figure (the next page): CubifAE-3D architecture. • The RGB-to-depth auto-encoder is first trained in a supervised way with a combination of MSE and Edge-Aware Smoothing Loss. • Once trained, the decoder is detached, encoder weights are frozen, and the encoder output is fed to the 3DOD model, which is trained with a combination of xyzloss, whlloss, orientationloss, iouloss, and confloss. • A 2D bounding-box is obtained for each object by projecting its detected 3D bounding-box onto the camera image plane, cropped, and resized to 64x64 and fed to the classifier model (bottom branch) along with the normalized whl vector for class prediction. • The dimensions indicated correspond to the output tensor for each block. • Also replace the encoder head by a pretrained backbone network (VGG-16) and observe an improved performance.
  • 15. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Detailed model architecture of CubifAE-3D
  • 16. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection The RGB-to-depth auto-encoder The total loss function for this model
  • 17. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Monocular RGB to Depth Map prediction
  • 18. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Cubification of the camera space: The perception RoI is divided into a 4x4xM grid (x and y directions aligned with image plane, where each grid has stacked on it, M cuboids in the z direction). Each cuboid is responsible for predicting up to N object poses. The object coordinates and dimensions are then normalized 0-1 in accordance with a prior that is computed from data statistics.
  • 19. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Samples of qualitative results on the KITTI dataset. The top part of each image shows a bounding box obtained as a 2D projection of their 3D poses. The bottom part shows a birds-eye view of the object poses with the ego-vehicle positioned at the center of red circle drawn on the left; pointing towards the right of the image.
  • 20. CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection Qualitative results on the KITTI dataset
• 21. Ground-aware Monocular 3D Object Detection for Autonomous Driving
• Most existing algorithms are based on geometric constraints in the 2D-3D correspondence, which stem from generic 6D object pose estimation.
• First identify how the ground plane provides additional clues for depth reasoning in 3D detection in driving scenes.
• Based on this observation, improve the processing of 3D anchors and introduce a neural network module that fully utilizes such application-specific priors in the deep learning framework.
• Introduce a neural network embedded with the proposed module for 3D object detection.
• Further verify the power of the proposed module with a neural network designed for monocular depth prediction.
• https://www.github.com/Owen-Liuyuxuan/visualDet3D
  • 22. Ground-aware Monocular 3D Object Detection for Autonomous Driving Perspective geometry for the GAC module. When calculating the vertical offsets, assume pixels are foreground object centers. When computing the depth priors z, assume pixels are on the ground because they are features to be queried.
• 23. Ground-aware Monocular 3D Object Detection for Autonomous Driving Equations (figure): the vertical offset, the inverse depth, and the depth, illustrating the relation between depth and height.
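The geometric relation behind the ground-aware prior can be sketched as follows: for a pixel assumed to lie on the ground plane, the depth follows from the camera mounting height and the pixel's vertical offset below the horizon. The 1.65 m camera height and the KITTI-like intrinsics in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ground_depth_prior(v, fy, cy, cam_height=1.65):
    """Depth prior for pixels assumed to lie on the ground plane:
    z ~ fy * h_cam / (v - cy).  cam_height=1.65 m is a typical KITTI
    mounting height, used here only for illustration."""
    v = np.asarray(v, dtype=np.float64)
    dv = v - cy
    # pixels at or above the horizon (dv <= 0) get no finite prior
    return np.where(dv > 1e-3, fy * cam_height / np.maximum(dv, 1e-3), np.inf)

# Example with KITTI-like intrinsics (fy ~ 721, cy ~ 173)
print(ground_depth_prior([200, 250, 350], fy=721.5, cy=172.8))  # ~44, ~15, ~7 m
```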
• 24. Ground-aware Monocular 3D Object Detection for Autonomous Driving Ground-Aware Convolution (GAC) Module
  • 25. Ground-aware Monocular 3D Object Detection for Autonomous Driving Object detection
  • 26. Ground-aware Monocular 3D Object Detection for Autonomous Driving
• 27. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with a simple configuration compared to typical multi-sensor systems.
• The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurements.
• Many methods attempt to directly estimate depth to assist 3D detection, but show limited performance as a result of depth inaccuracy.
• This solution, the Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space.
• It then uses a computationally efficient bird's-eye-view (BEV) projection and a single-stage detector to produce the final output detections.
• CaDDN is a fully differentiable end-to-end method for joint depth estimation and object detection.
• https://github.com/TRAILab/CaDDN
  • 28. Categorical Depth Distribution Network for Monocular 3D Object Detection (a) Input image. (b) Without depth distribution supervision, BEV features from CaDDN suffer from smearing effects. (c) Depth distribution supervision encourages BEV features from CaDDN to encode meaningful depth confidence, in which objects can be accurately detected.
• 29. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Direct methods estimate 3D detections directly from images without predicting an intermediate 3D scene representation; they can incorporate the geometric relationship between the 2D image plane and 3D space to assist with detections.
• Depth-based methods perform the 3D detection task using pixel-wise depth maps as an additional input, where the depth maps are precomputed using monocular depth estimation architectures; estimated depth maps can be used in combination with images to perform the 3D detection task.
• Grid-based methods avoid estimating raw depth values by predicting a BEV grid representation to be used as input for 3D detection architectures; multiple voxels can be projected to the same image feature, leading to repeated features along the projection ray and reduced detection accuracy.
  • 30. Categorical Depth Distribution Network for Monocular 3D Object Detection CaDDN Architecture. The network is composed of 3 modules to generate 3D feature representations and one to perform 3D detection. Frustum features G are generated using depth distributions D, transformed into voxel features V. The voxel features are collapsed to BEV features B for 3D object detection.
• 31. Categorical Depth Distribution Network for Monocular 3D Object Detection
• The purpose of the frustum feature network is to project image information into 3D space by associating image features with estimated depths.
• It follows the design of the semantic segmentation network DeepLabV3 to estimate the categorical depth distributions from image features (Depth Distribution Network), modifying the downsample-upsample architecture to produce pixel-wise probability scores of belonging to depth bins rather than semantic classes.
• In parallel to estimating depth distributions, channel reduction (Image Channel Reduce) is performed on the image features to generate the final image features, using a 1x1 convolution + BatchNorm + ReLU layer.
• Channel reduction is required to reduce the high memory footprint of the ResNet-101 features that will be populated into the 3D frustum grid.
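A minimal sketch of such a channel-reduce block in PyTorch; the 2048-to-64 channel counts are assumptions for illustration, not necessarily CaDDN's exact configuration.

```python
import torch.nn as nn

# Image Channel Reduce: 1x1 convolution + BatchNorm + ReLU
channel_reduce = nn.Sequential(
    nn.Conv2d(2048, 64, kernel_size=1, bias=False),  # assumed channel counts
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```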
  • 32. Categorical Depth Distribution Network for Monocular 3D Object Detection Each feature pixel F(u; v) is weighted by its depth distribution probabilities D(u; v) of belonging to D discrete depth bins to generate frustum features G(u; v). Sampling points in each voxel are projected into the frustum grid. Frustum features are sampled using trilinear interpolation to populate voxels in V.
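The frustum feature generation amounts to an outer product between the per-pixel categorical depth distribution and the image feature vector; a sketch is given below (the tensor shapes in the example are illustrative only).

```python
import torch

def frustum_features(image_feat, depth_dist):
    """Weight each image feature F(u, v) by its depth probabilities D(u, v).
    image_feat: (B, C, H, W); depth_dist: (B, D, H, W), softmaxed over D.
    Returns the frustum grid G of shape (B, C, D, H, W)."""
    return image_feat.unsqueeze(2) * depth_dist.unsqueeze(1)

B, C, D, H, W = 2, 64, 80, 24, 78
G = frustum_features(torch.rand(B, C, H, W),
                     torch.softmax(torch.rand(B, D, H, W), dim=1))
print(G.shape)  # torch.Size([2, 64, 80, 24, 78])
```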
• 33. Categorical Depth Distribution Network for Monocular 3D Object Detection The continuous depth space is discretized in order to define the set of D bins used in the depth distributions D. Depth discretization can be performed with uniform discretization (UD) with a fixed bin size, spacing-increasing discretization (SID) with bin sizes increasing in log space, or linear-increasing discretization (LID) with linearly increasing bin sizes.
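The three schemes can be sketched by their bin edges as below; the formulas follow the standard UD/SID/LID definitions, and the 2.0-46.8 m range with D=80 in the example is an illustrative KITTI-like setting.

```python
import numpy as np

def depth_bin_edges(d_min, d_max, D, mode="LID"):
    """Return the D+1 bin edges for the chosen discretization scheme."""
    i = np.arange(D + 1, dtype=np.float64)
    if mode == "UD":                      # uniform bin size
        return d_min + (d_max - d_min) * i / D
    if mode == "SID":                     # bin size grows in log space
        return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / D)
    if mode == "LID":                     # bin size grows linearly
        delta = 2.0 * (d_max - d_min) / (D * (D + 1))
        return d_min + delta * i * (i + 1) / 2.0
    raise ValueError(mode)

edges = depth_bin_edges(2.0, 46.8, D=80, mode="LID")
print(edges[:3], edges[-1])   # small near bins; last edge equals d_max
```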
• 34. Categorical Depth Distribution Network for Monocular 3D Object Detection
• Depth distribution labels are applied to supervise the predicted depth distributions.
• Depth distribution labels are generated by projecting LiDAR point clouds into the image frame to create sparse depth maps.
• Depth completion is performed to generate depth values at each pixel in the image.
• Depth information is required at each image feature pixel, so the depth maps of size WI x HI are downsampled to the image feature size WF x HF.
• The depth maps are converted to bin indices using the LID discretization method, followed by a conversion into a one-hot encoding to generate the depth distribution labels.
• A one-hot encoding ensures the depth distribution labels are sharp, which is essential to encourage sharpness in the predicted depth distributions via supervision.
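The label conversion step can be sketched as follows: each depth value is digitized into its LID bin and then one-hot encoded. The depth range and bin count are again illustrative.

```python
import numpy as np

def lid_edges(d_min, d_max, D):
    i = np.arange(D + 1, dtype=np.float64)
    delta = 2.0 * (d_max - d_min) / (D * (D + 1))
    return d_min + delta * i * (i + 1) / 2.0

def depth_to_onehot(depth_map, edges):
    """Convert an (H, W) depth map into one-hot labels of shape (D, H, W)."""
    D = len(edges) - 1
    bins = np.clip(np.digitize(depth_map, edges) - 1, 0, D - 1)
    onehot = np.zeros((D,) + depth_map.shape, dtype=np.float32)
    rows, cols = np.indices(depth_map.shape)
    onehot[bins, rows, cols] = 1.0
    return onehot

labels = depth_to_onehot(np.random.uniform(2.0, 46.8, (24, 78)),
                         lid_edges(2.0, 46.8, 80))
print(labels.shape)   # (80, 24, 78), exactly one hot bin per pixel
```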
  • 35. Categorical Depth Distribution Network for Monocular 3D Object Detection
  • 36. Categorical Depth Distribution Network for Monocular 3D Object Detection
  • 37. Categorical Depth Distribution Network for Monocular 3D Object Detection
• 38. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• The objective is to learn context- and depth-aware feature representations to solve the problem of monocular 3D object detection.
• (i) propose a depth-conditioned dynamic message propagation (DDMP) network to effectively integrate multi-scale depth information with the image context;
• (ii) this is achieved by first adaptively sampling context-aware nodes in the image context and then dynamically predicting hybrid depth-dependent filter weights and affinity matrices for propagating information;
• (iii) by augmenting a center-aware depth encoding (CDE) task, the inaccurate depth prior is alleviated;
• (iv) thoroughly demonstrate the effectiveness and show state-of-the-art results among monocular-based approaches on the KITTI benchmark dataset.
• Code and models are released at https://github.com/fudan-zvg/DDMP
• 39. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Left: DDMP adaptively samples context-aware nodes (top) in the image context and dynamically predicts hybrid depth-dependent filter weights and affinity matrices (bottom) for propagating information. Right: the improvement of DDMP-3D (red) over the baseline (yellow) via center-aware depth encoding.
• 40. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• Two branches are involved: a 3D detection branch (blue) and a depth feature extraction branch (green).
• The RGB images are fed into the upper branch for feature extraction, while the corresponding depth maps, estimated by an off-the-shelf depth estimator, are sent into the depth branch for extracting depth-aware features.
• The DDMP (dynamic message propagation) modules in yellow perform the depth-conditioned dynamic message propagation: they dynamically sample context-aware nodes in the upper image branch and predict hybrid filter weights and affinities based on multi-scale depth features from the bottom branch for message propagation.
• Common 3D heads for 3D center, dimension, and orientation regression follow, producing the final 3D object boxes.
• CDE (center-aware depth feature encoding) is an auxiliary task for joint training that implicitly guides the depth sub-network to learn center-aware depth features for better object localization; it is discarded during inference.
  • 41. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Schematic illustration of DDMP-3D
• 42. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Illustration of the DDMP module in a single-scale pattern. Dynamic nodes are first sampled from the image and depth feature graphs; for these sampled nodes, the filter weights and affinity matrices are learned from depth features to propagate the depth-conditioned message.
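A much-simplified sketch of the underlying idea of depth-conditioned dynamic filtering is shown below: depth features predict per-pixel weights that are applied over a fixed 3x3 neighborhood of the image features. This replaces DDMP's adaptive node sampling and hybrid affinity matrices with a plain local neighborhood, so it only illustrates the concept, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthConditionedDynamicFilter(nn.Module):
    """Depth features predict a per-pixel 3x3 kernel applied to image features
    (simplified stand-in for DDMP; assumes both branches share channel count)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.weight_pred = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, img_feat, depth_feat):
        b, c, h, w = img_feat.shape
        weights = torch.softmax(self.weight_pred(depth_feat), dim=1)   # (B, k*k, H, W)
        patches = F.unfold(img_feat, self.k, padding=self.k // 2)      # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        weights = weights.view(b, 1, self.k * self.k, h * w)
        out = (patches * weights).sum(dim=2).view(b, c, h, w)
        return img_feat + out                                          # residual message

m = DepthConditionedDynamicFilter(channels=64)
print(m(torch.rand(2, 64, 32, 96), torch.rand(2, 64, 32, 96)).shape)  # (2, 64, 32, 96)
```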
• 43. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
• In general, depth maps lose appearance details and may fail to discriminate between foreground instances and the background, providing an unreliable depth prior for depth-assisted 3D object detection.
• It has been shown that a multi-task strategy can boost each individual task to some degree, benefiting from the multi-fold regularization effect of joint optimization.
• An auxiliary task is therefore augmented and jointly optimized with the main 3D detection task.
• The augmented task with xyz supervision in 3D space uniquely determines a point in the 2D image plane, which imposes spatial constraints to gain 3D instance-level understanding.
• With the better instance awareness brought by CDE, the model is able to alleviate the inaccurate depth prior in situations such as occlusion and distant objects.
• A similar network architecture is adopted for the depth branch, with a head that only predicts 3D centers without predefined anchors.
  • 44. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
  • 45. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection
  • 46. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection Qualitative comparison of ground truth (green), the baseline (yellow), and our method (red) on KITTI val set. For better visualization, the first and second columns show RGB and BEV images of point clouds converted from pre-estimated depth, respectively.
• 47. Geometry-based Distance Decomposition for Monocular 3D Object Detection
• The core challenge of monocular 3D object detection is to predict the distance of objects in the absence of explicit depth information.
• Unlike most existing methods, which regress the distance as a single variable, MonoRCNN uses a geometry-based distance decomposition to recover the distance from its factors.
• The decomposition factors the distance of an object into its most representative and stable variables, i.e., the physical height and the projected visual height in the image plane.
• The decomposition maintains self-consistency between the two heights, leading to robust distance prediction even when both predicted heights are inaccurate.
• The decomposition also enables tracing the cause of distance uncertainty in different scenarios.
• Such a decomposition makes the distance prediction interpretable, accurate, and robust.
• MonoRCNN directly predicts 3D bounding boxes from RGB images with a compact architecture, making training and inference simple and efficient.
• 48. Geometry-based Distance Decomposition for Monocular 3D Object Detection
The distance decomposition is based on the imaging geometry of a pinhole camera. The distance from the center of an object to the camera, denoted Z, can be calculated as Z = fH/h, where f denotes the focal length of the camera, H the physical height of the object, and h the length of the projected central line (PCL). The PCL is the projection of the vertical line at the center of the 3D bounding box. This equation shows that the distance of an object is determined by its physical height and its projected visual height. Note: objects are abstracted as the vertical lines at the centers of their 3D bounding boxes, and their visual projection as the projection of these vertical lines; the distance is then recovered from them based on the imaging geometry.
• 49. Geometry-based Distance Decomposition for Monocular 3D Object Detection
The architecture of MonoRCNN. It is built upon Faster R-CNN and adds a carefully designed 3D distance head based on the geometry-based distance decomposition. Specifically, the method regresses the physical height H, the reciprocal of the projected visual height h_rec = 1/h, and their uncertainties, then recovers the distance as Z = f * H * h_rec. Blue arrows represent operations in the network during training and inference; orange arrows represent operations to recover 3D bounding boxes during inference.
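The recovery step itself is a one-liner; the sketch below shows it with KITTI-like numbers (focal length of roughly 721 px and a 1.5 m tall car whose PCL spans 36 px), which are illustrative values only.

```python
def recover_distance(f_y, H, h_rec):
    """MonoRCNN distance recovery: Z = f * H * h_rec, with H the predicted
    physical height and h_rec = 1/h the predicted reciprocal PCL length."""
    return f_y * H * h_rec

print(recover_distance(721.5, 1.5, 1.0 / 36.0))  # ~30.1 m
```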
• 50. Geometry-based Distance Decomposition for Monocular 3D Object Detection Comparison between the predicted eight projected corners (red boxes) and the predicted visual height (blue lines). Predicting the eight projected corners fails under challenging cases such as occlusion, truncation, and extreme lighting conditions, while predicting the visual height is simpler and more robust. The images are from the KITTI validation split.
• 51. Geometry-based Distance Decomposition for Monocular 3D Object Detection Equations (figure): the uncertainty-aware regression of the 3D distance head, the 3D attribute head (keypoint loss, physical size and yaw angle), the loss functions for H and h_rec, and the overall loss.
  • 52. Geometry-based Distance Decomposition for Monocular 3D Object Detection
• 53. Geometry-based Distance Decomposition for Monocular 3D Object Detection KITTI examples and nuScenes cross-test examples.
• 54. Geometry-aware data augmentation for monocular 3D object detection
• This work first conducts a thorough analysis to reveal how existing methods fail to robustly estimate depth when different geometric shifts occur.
• Through image-based and instance-based manipulations, it illustrates that such methods are vulnerable at capturing consistent relationships between depth and both object apparent sizes and positions.
• These manipulations are converted into four corresponding 3D-aware data augmentation techniques.
• At the image level, the camera system is randomly manipulated, including its focal length, receptive field, and location, to generate new training images with geometric shifts.
• At the instance level, foreground objects are cropped and randomly pasted into other scenes to generate new training instances.
• All the proposed augmentation techniques share the virtue that the geometric relationships of objects are preserved while their geometry is manipulated.
• Not only is the instability of depth recovery effectively alleviated, but the final 3D detection performance is also significantly improved.
  • 55. Geometry-aware data augmentation for monocular 3D object detection Geometric manipulations
• 56. Geometry-aware data augmentation for monocular 3D object detection Equations (figure): projection of a 3D point to image coordinates, using the focal length to infer depth, using the vertical position to infer depth, shifting the camera focal length, manipulating the camera receptive field, and moving the camera.
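A hedged sketch of the focal-length-shift idea follows: magnifying the image about the principal point is equivalent to scaling the focal length, and under a fixed-intrinsics detector the same image is consistent with objects whose depth is divided by the scale while their physical size and lateral position stay unchanged. The centered crop (which assumes the principal point is near the image center) and the depth-only label update are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
import cv2

def focal_shift_augment(image, boxes_3d, scale):
    """Zoom-in focal-length augmentation (scale >= 1 assumed).
    boxes_3d: (N, 7) array [x, y, z, w, h, l, ry] in camera coordinates."""
    h, w = image.shape[:2]
    resized = cv2.resize(image, None, fx=scale, fy=scale)
    # crop back to the original resolution about the image center (approximation)
    y0 = max((resized.shape[0] - h) // 2, 0)
    x0 = max((resized.shape[1] - w) // 2, 0)
    out = resized[y0:y0 + h, x0:x0 + w]
    boxes = np.asarray(boxes_3d, dtype=np.float64).copy()
    boxes[:, 2] /= scale        # only the depth labels rescale: z' = z / scale
    return out, boxes
```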
  • 57. Geometry-aware data augmentation for monocular 3D object detection Visualization of the geometric relationships between depth and both object apparent sizes and positions.
• 58. Geometry-aware data augmentation for monocular 3D object detection Equations (figure): the rotation matrix R from the egocentric orientation angle, the 8 corner points in the object coordinate frame, and the camera coordinates of each point.
• 59. Geometry-aware data augmentation for monocular 3D object detection Empirical analysis of monocular detectors under geometric manipulations. Both anchor-free (e.g., CenterNet) and anchor-based (e.g., M3D-RPN) models are tried.
• 60. Geometry-aware data augmentation for monocular 3D object detection Visualization of copy-paste data augmentation with and without geometry awareness.
  • 61. Geometry-aware data augmentation for monocular 3D object detection
  • 62. Geometry-aware data augmentation for monocular 3D object detection
  • 63. Geometry-aware data augmentation for monocular 3D object detection
• 64. Lidar Point Cloud Guided Monocular 3D Object Detection
• LiDAR point clouds, which provide accurate depth measurements, can offer beneficial information for the training of monocular methods.
• Prior works only use LiDAR point clouds to train a depth estimator; this implicit use does not fully exploit LiDAR point clouds and consequently leads to suboptimal performance.
• To take effective advantage of LiDAR point clouds, a general, simple yet effective framework for monocular methods is proposed.
• Specifically, LiDAR point clouds are used to directly guide the training of monocular 3D detectors, allowing them to learn the desired objectives while eliminating the extra annotation cost.
• Thanks to the general design, this method can be plugged into any monocular 3D detection method, significantly boosting its performance.
• 65. Lidar Point Cloud Guided Monocular 3D Object Detection LiDAR-guided monocular 3D object detection: LiDAR point clouds are directly used to guide the training of the monocular 3D detector.
• 66. Lidar Point Cloud Guided Monocular 3D Object Detection Qualitative examples of pseudo-LiDAR based monocular 3D detection. From top to bottom: the RGB image and the 3D predictions on the bird's-eye-view (BEV) map. The estimated 3D box center typically lies near the converted point cloud, meaning that the detector works well when the provided depths are accurate but has difficulty revising poorly predicted object depth.
• 67. Lidar Point Cloud Guided Monocular 3D Object Detection
• Training a depth-map-based method roughly comprises two stages: (1) training a dense depth estimation network; (2) training a monocular 3D detector.
• As a common practice, current monocular depth-map-based methods all utilize projected LiDAR point clouds as ground truth to train the depth estimator.
• The number of LiDAR point clouds used for training heavily affects the final 3D detection accuracy.
• Here, LiDAR point clouds are directly utilized to generate massive pseudo 3D box labels for monocular methods.
• This simple yet effective approach allows monocular 3D detectors to learn the desired objectives while eliminating the extra annotation cost.
• It can work in either a supervised or an unsupervised mode, depending on the reliance on manual 3D box annotations.
• 68. Lidar Point Cloud Guided Monocular 3D Object Detection 3D boxes are generated from LiDAR point clouds to train the monocular 3D detector. Such 3D boxes are predicted by a pre-trained LiDAR 3D detector (supervised mode) or obtained directly from the point cloud without training (unsupervised mode).
• 69. Lidar Point Cloud Guided Monocular 3D Object Detection
• To take advantage of available 3D box annotations, a LiDAR-based 3D detector is first trained from scratch with LiDAR point clouds and the associated 3D box annotations.
• The pre-trained LiDAR-based 3D detector is then used to infer 3D boxes on new LiDAR point clouds.
• These results are treated as pseudo labels to train monocular 3D detectors.
• Due to the precise depth measurement, pseudo labels predicted by the LiDAR-based 3D detector are considerably accurate and qualified to be used directly in the training of monocular 3D detectors.
• Interestingly, with different training settings for the LiDAR-based 3D detector, the monocular 3D detectors guided by them show similar performance.
• This indicates that monocular methods can indeed benefit from the guidance of LiDAR point clouds, and that only a small number of 3D box annotations is sufficient to push a monocular method to high performance.
• Thus the manual annotation cost can also be greatly reduced.
• 70. Lidar Point Cloud Guided Monocular 3D Object Detection
• The process of generating pseudo labels on the LiDAR point cloud can be roughly divided into three steps: 2D box and mask prediction, RoI point selection and clustering, and 3D box estimation.
• In the beginning, an off-the-shelf 2D instance segmentation model is adopted to perform segmentation on the RGB image, obtaining 2D box and mask estimates.
• These estimates are used to build camera frustums in order to select the associated LiDAR RoI points for every object; boxes without any LiDAR point inside are ignored.
• To eliminate irrelevant points, the unsupervised clustering approach DBSCAN is used to divide the RoI point cloud into different groups according to density (see the sketch below).
• Points that are close in 3D space are aggregated into a cluster.
• The cluster containing the most points is regarded as the target corresponding to the object.
• Finally, the minimum 3D bounding box that covers all target points is sought.
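A minimal sketch of the RoI point selection and clustering step, assuming the LiDAR points have already been transformed into camera coordinates; the DBSCAN eps/min_samples values are illustrative, not the paper's tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_object_points(points, mask, K):
    """Keep LiDAR points (camera coords, N x 3) that project inside the
    instance mask, then keep the largest DBSCAN cluster as the object."""
    uvw = K @ points.T
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    h, w = mask.shape
    inside = (points[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    ui, vi = u[inside].astype(int), v[inside].astype(int)
    roi = points[inside][mask[vi, ui] > 0]          # points hitting the mask
    if len(roi) == 0:
        return None
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(roi)
    valid = labels[labels >= 0]
    if len(valid) == 0:
        return roi                                  # fall back: keep all points
    target = np.bincount(valid).argmax()            # largest cluster = object
    return roi[labels == target]
```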
• 71. Lidar Point Cloud Guided Monocular 3D Object Detection
• To simplify the problem of solving the 3D bounding box, the points are projected onto the bird's-eye-view map, reducing the number of parameters since the height and the y coordinate of the object can be obtained easily.
• The convex hull of the object points is computed, and the box is obtained using rotating calipers.
• Specifically, the edges of the convex hull are enumerated to produce enclosing rectangles, of which the rectangle with the smallest area is chosen as the resulting BEV box (parameterized by the box center (x, z), the dimensions (w, l), and the orientation Ry).
• The other parameters of the 3D box can be calculated from statistics on the remaining points.
• The height h can be represented by the maximum spatial offset along the y-axis of the points, and the center coordinate y is calculated by averaging the y coordinates of the points.
• Consequently, the complete 3D box is generated.
• Pseudo labels can fail when there are not enough LiDAR points to describe the outline of an object.
• Object dimensions are therefore restricted to eliminate boxes that are likely to be outliers, as most valid objects have similar dimensions.
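The BEV box fit can be sketched by enumerating convex-hull edges and keeping the minimum-area aligned rectangle; the height and y center from the point statistics are omitted here, and the returned parameterization is an assumed KITTI-like convention.

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_bev_box(points_xz):
    """Fit the minimum-area rectangle to BEV points (N x 2, columns x and z).
    Returns (center_x, center_z, w, l, ry)."""
    pts = np.asarray(points_xz, dtype=np.float64)
    hull = pts[ConvexHull(pts).vertices]
    best = None
    for i in range(len(hull)):
        edge = hull[(i + 1) % len(hull)] - hull[i]
        theta = np.arctan2(edge[1], edge[0])
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, s], [-s, c]])          # rotate the edge onto the x-axis
        proj = hull @ R.T
        mins, maxs = proj.min(axis=0), proj.max(axis=0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            center = R.T @ ((mins + maxs) / 2.0)  # back to the original frame
            best = (area, center, maxs - mins, theta)
    _, (cx, cz), (l, w), ry = best
    return cx, cz, w, l, ry
```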
  • 72. Lidar Point Cloud Guided Monocular 3D Object Detection
  • 73. Lidar Point Cloud Guided Monocular 3D Object Detection
  • 74. Lidar Point Cloud Guided Monocular 3D Object Detection