2. Outline
• CalibNet
• PointPillars
• Complex-YOLO
• Robust Deep Multi-modal Learning Based on GIF Network
• LATTE: Accelerate Lidar Point Cloud Annotation
• FVNet: 3D Front-View Proposal Generation for Object Detection from Point Cloud
• RGB and LiDAR fusion based 3D Semantic Segmentation
• Voxel-FPN: multi-scale voxel feature aggregation in 3D object detection from point clouds
• STD: Sparse-to-Dense 3D Object Detector for Point Cloud
• End-to-end sensor modeling for LiDAR Point Cloud
• Part-A2 Net
• StarNet: Targeted Computation for Object Detection in Point Clouds
• Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection
• Deep Hough Voting for 3D Object Detection in Point Clouds
• MLOD: A multi-view 3D object detection based on robust feature fusion method
3. CalibNet: Self-Supervised Extrinsic Calibration
using 3D Spatial Transformer Networks
• CalibNet: a self-supervised deep network capable of automatically estimating the 6-DoF
rigid body transformation between a 3D LiDAR and a 2D camera in real-time.
• CalibNet alleviates the need for calibration targets, thereby resulting in significant savings in
calibration efforts.
• During training, the network only takes as input a LiDAR point cloud, the corresponding
monocular image, and the camera calibration matrix K.
• At train time, no direct supervision is imposed (i.e., the network does not directly regress to the
calibration parameters).
• Instead, train the network to predict calibration parameters that maximize the geometric and
photometric consistency of the input images and point clouds.
• CalibNet learns to iteratively solve the underlying geometric problem and accurately predicts
extrinsic calibration parameters for a wide range of mis-calibrations, without requiring
retraining or domain adaptation.
• Code: https://github.com/epiception/CalibNet.
4. CalibNet: Self-Supervised Extrinsic Calibration
using 3D Spatial Transformer Networks
The network takes an RGB image (a) and a raw LiDAR point cloud (b) as input, and outputs a transformation T that best
aligns the two inputs. (c) shows the colorized point cloud output for a mis-calibrated setup, and (d) the output after calibration.
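The geometric link between the two inputs can be sketched as a pinhole projection: LiDAR points are moved into the camera frame by T and then projected with K. The following is a minimal pure-Python sketch, not CalibNet's code; the function name `project_points` and the sample values of K and T are illustrative assumptions.

```python
def project_points(points, K, T):
    """Project 3D LiDAR points into pixel coordinates.

    points: list of (x, y, z) in the LiDAR frame
    K: 3x3 camera calibration matrix (list of rows)
    T: 4x4 rigid transform from LiDAR to camera frame (list of rows)
    Returns (u, v, depth) for every point in front of the camera.
    """
    pixels = []
    for x, y, z in points:
        # Rigid-body transform: rotation plus translation.
        xc = T[0][0] * x + T[0][1] * y + T[0][2] * z + T[0][3]
        yc = T[1][0] * x + T[1][1] * y + T[1][2] * z + T[1][3]
        zc = T[2][0] * x + T[2][1] * y + T[2][2] * z + T[2][3]
        if zc <= 0:
            continue  # behind the image plane
        u = (K[0][0] * xc + K[0][1] * yc + K[0][2] * zc) / zc
        v = (K[1][0] * xc + K[1][1] * yc + K[1][2] * zc) / zc
        pixels.append((u, v, zc))
    return pixels


# Hypothetical intrinsics and extrinsics, for illustration only.
K = [[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]]
T_identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_shifted = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

cloud = [(0.0, 0.0, 10.0), (1.0, 0.5, 10.0), (0.0, 0.0, -5.0)]
pixels = project_points(cloud, K, T_identity)
pixels_shifted = project_points(cloud, K, T_shifted)
```

A 10 cm error in the translation already moves the projected points by several pixels, which is the kind of misalignment a consistency loss can penalize.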
7. PointPillars: Fast Encoders for Object
Detection from Point Clouds
• It addresses encoding a point cloud into a format appropriate for a detection pipeline.
• Two types of encoders: fixed encoders tend to be fast but sacrifice accuracy, while
encoders that are learned from data are more accurate, but slower.
• PointPillars is an encoder which utilizes PointNets to learn a representation of point
clouds organized in vertical columns (pillars).
• While the encoded features can be used with any standard 2D convolutional detection
architecture, a lean downstream network is used.
• Despite only using lidar, a full detection pipeline significantly outperforms the SoA, even
among fusion methods, w.r.t. both the 3D and bird’s eye view KITTI benchmarks.
• This detection performance is achieved while running at 62 Hz.
• A faster version matches the state of the art at 105 Hz.
8. PointPillars: Fast Encoders for Object
Detection from Point Clouds
Network overview. The components of the network are a Pillar Feature Network, Backbone(2D CNN),
and SSD Detection Head. The raw point cloud is converted to a stacked pillar tensor and pillar index tensor.
The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-
image for a CNN. The features from the backbone are used by the detection head to predict 3D bounding
boxes for objects.
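The pillar grouping and scatter steps in this overview can be sketched in a few lines. This is a simplified illustration, not the PointPillars implementation: the learned PointNet feature is replaced by a stand-in (maximum height per pillar), and the function names and grid parameters are assumptions.

```python
from collections import defaultdict


def build_pillars(points, x_range, y_range, pillar_size):
    """Group (x, y, z) points into vertical columns on an x-y grid."""
    pillars = defaultdict(list)
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the detection range
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        pillars[(ix, iy)].append((x, y, z))
    return pillars


def scatter_to_pseudo_image(pillars, grid_w, grid_h):
    """Write one feature per non-empty pillar back to its grid cell.

    Stand-in feature: the maximum height in the pillar. PointPillars
    learns this feature with a PointNet instead.
    """
    image = [[0.0] * grid_w for _ in range(grid_h)]
    for (ix, iy), pts in pillars.items():
        image[iy][ix] = max(p[2] for p in pts)
    return image


cloud = [(0.1, 0.1, 0.5), (0.2, 0.3, 1.5), (1.6, 0.4, 0.2), (9.0, 9.0, 1.0)]
pillars = build_pillars(cloud, x_range=(0.0, 2.0), y_range=(0.0, 1.0), pillar_size=0.5)
image = scatter_to_pseudo_image(pillars, grid_w=4, grid_h=2)
```

Only non-empty pillars are stored, which is what keeps the encoder cheap on sparse point clouds; the dense 2D layout is restored only at the scatter step.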
9. PointPillars: Fast Encoders for Object
Detection from Point Clouds
Qualitative analysis of KITTI results
Failure cases on KITTI
10. Complex-YOLO: An Euler-Region-Proposal for
Real-time 3D Object Detection on Point Clouds
• Complex-YOLO, a real-time 3D object detection network on point clouds only.
• A network that expands YOLOv2, a fast 2D standard object detector for RGB images, by a
specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space.
• A specific Euler-Region-Proposal Network (E-RPN) estimates the pose of the object by
adding an imaginary and a real fraction to the regression network.
• This results in a closed complex space and avoids the singularities that occur with
single angle estimation. The E-RPN helps the network generalize well during training.
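Why regressing an imaginary and a real part avoids the angle singularity can be shown with a small numeric sketch. The helper names below are assumptions; only the use of atan2 on the two regressed values reflects the idea described above.

```python
import math


def encode_angle(phi):
    """Represent an orientation as (imaginary, real) parts on the unit circle."""
    return math.sin(phi), math.cos(phi)


def decode_angle(t_im, t_re):
    """Recover the orientation from the two regressed values."""
    return math.atan2(t_im, t_re)


# Two nearly identical headings on either side of the +/- pi wrap-around.
phi_a = math.pi - 0.01
phi_b = -math.pi + 0.01

# Regressing the raw angle sees a huge difference between them ...
scalar_gap = abs(phi_a - phi_b)

# ... while the complex representation sees them as close neighbours.
im_a, re_a = encode_angle(phi_a)
im_b, re_b = encode_angle(phi_b)
complex_gap = math.hypot(im_a - im_b, re_a - re_b)

recovered = decode_angle(im_a, re_a)
```

Here `scalar_gap` is close to 2π while `complex_gap` is about 0.02, so a regression loss on the two components does not penalize physically similar headings.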
11. Complex-YOLO: An Euler-Region-Proposal for
Real-time 3D Object Detection on Point Clouds
Complex-YOLO is a very efficient model that operates directly on LiDAR-only
bird's-eye-view RGB-maps to estimate and localize accurate 3D
multi-class bounding boxes. The figure shows a bird's-eye view based on a
Velodyne HDL64 point cloud, together with the predicted objects.
12. Complex-YOLO: An Euler-Region-Proposal for
Real-time 3D Object Detection on Point Clouds
Complex-YOLO Pipeline. A pipeline for fast and accurate 3D box estimations on
point clouds. The RGB-map is fed into the CNN. The E-RPN grid runs simultaneously
on the last feature map and predicts five boxes per grid cell. Each box prediction is
composed of the regression parameters t and object scores p with a general
probability p0 and n class scores p1...pn.
15. Robust Deep Multi-modal Learning Based
on GIF Network
• The goal is a robust deep multimodal learning architecture for cases where some
modalities are degraded in quality.
• A deep fusion architecture for object detection processes each modality with a
separate convolutional neural network (CNN) and constructs the joint feature maps by
combining the intermediate features obtained by the CNNs.
• To improve robustness to degraded modalities, the gated information fusion (GIF)
network weights the contribution from each modality according to the input feature
maps to be fused.
• The combining weights are determined by applying the convolutional layers followed by the
sigmoid function to the concatenated intermediate feature maps.
• The network including the CNN backbone and GIF is trained in an end-to-end fashion.
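The gating step can be sketched for a single spatial location, where a 1x1 convolution reduces to a dot product. This is a minimal sketch under that simplification; the function name and the gate weights are hypothetical and would be learned end-to-end in the real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_a, f_b, w_gate_a, w_gate_b):
    """Fuse two modality feature vectors at one spatial location.

    f_a, f_b: feature vectors from the two modality-specific CNNs
    w_gate_a, w_gate_b: gate weights applied to the concatenated features
    Returns the gated, concatenated features and the two gate values.
    """
    concat = list(f_a) + list(f_b)
    # Each gate looks at both modalities before weighting one of them.
    g_a = sigmoid(sum(w * x for w, x in zip(w_gate_a, concat)))
    g_b = sigmoid(sum(w * x for w, x in zip(w_gate_b, concat)))
    fused = [g_a * x for x in f_a] + [g_b * x for x in f_b]
    return fused, (g_a, g_b)


# Hypothetical features and gate weights, for illustration only.
f_camera = [1.0, 2.0]
f_lidar = [0.5, -0.5]
fused, (g_camera, g_lidar) = gated_fusion(
    f_camera, f_lidar,
    w_gate_a=[0.0, 0.0, 0.0, 0.0],    # neutral gate: sigmoid(0) = 0.5
    w_gate_b=[10.0, 10.0, 0.0, 0.0],  # strongly open gate
)
```

Because each gate is a sigmoid, a degraded modality can be suppressed smoothly rather than switched off by a hard rule.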
17. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
• Annotating LiDAR point cloud data is challenging due to the following issues: 1) A LiDAR point cloud is
usually sparse and has low resolution, making it difficult for human annotators to recognize objects. 2)
Compared to annotation on 2D images, the operation of drawing 3D bounding boxes or even point-wise
labels on LiDAR point clouds is more complex and time-consuming. 3) LiDAR data are usually collected in
sequences, so consecutive frames are highly correlated, leading to repeated annotations.
• To tackle these challenges, the authors present LATTE, an open-sourced annotation tool for LiDAR point clouds.
• LATTE features the following innovations: 1) Sensor fusion: utilize image-based detection algorithms to
automatically pre-label a calibrated image, and transfer the labels to the point cloud. 2) One-click
annotation: Instead of drawing 3D bounding boxes or point-wise labels, simplify the annotation to just one
click on the target object, and automatically generate the bounding box for the target. 3) Tracking: integrate
tracking into sequence annotation so that labels can be transferred from one frame to subsequent ones,
which significantly reduces repeated labeling.
• Experiments show the features accelerate the annotation speed by 6.2x and significantly improve label
quality with 23.6% and 2.2% higher instance-level precision and recall, and 2.0% higher bounding box IoU.
• LATTE is open-sourced at https://github.com/bernwang/latte.
18. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
A screenshot of LATTE
19. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
Challenges of annotating LiDAR point clouds. (a) LiDAR point clouds have low resolution and therefore objects
are difficult for humans to recognize. The upper two figures are point clouds of a traffic pole and a cyclist, but
both are difficult to recognize. The lower two are the corresponding images. (b) Annotating 2D bounding boxes
on an image vs. 3D bounding boxes on a point cloud. Annotating 3D bounding boxes is more complicated due
to more degrees of freedom of 3D scaling and rotation. (c) Point clouds of two consecutive frames are shown
here. Even though the two frames are highly similar, target objects are moving and have different speeds. As a
20. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
The sensor-fusion pipeline of LATTE. A LiDAR point cloud is projected onto
its corresponding image. Next, Mask-RCNN is used to predict semantic
labels on the image. The labels are then transferred back to the LiDAR
point cloud.
21. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
Sensor fusion is also used to help annotators confirm the category of a selected object. Once a 3D
bounding box is chosen, all the points within the bounding box are projected onto the image and the
corresponding crop of the image is shown to human annotators for visual confirmation.
22. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
The one-click annotation pipeline of LATTE. For a given LiDAR point cloud, the ground is first removed.
After an annotator clicks on one point on a target object, clustering algorithms expand from
the clicked point to the entire object. Finally, a top-view 2D bounding box is estimated for the object.
23. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
The ground is modeled as a segment of planes.
After the cluster is found, a search-based
rectangle fitting is used to estimate bounding boxes.
Other methods, such as PCA-based ones,
can also be plugged into LATTE. To obtain
the optimal rectangle fit for a cluster,
the appropriate heading of the rectangle
must be known.
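A search-based rectangle fit can be sketched as a sweep over candidate headings. This is a minimal sketch, not LATTE's code: the minimum-area criterion, the one-degree step, and the name `fit_rectangle` are assumptions.

```python
import math


def fit_rectangle(points, step_deg=1.0):
    """Search candidate headings and keep the tightest enclosing rectangle.

    points: list of top-view (x, y) cluster points
    Returns (area, heading_rad, width, height) of the best rectangle.
    Headings in [0, 90) degrees suffice because a rectangle repeats every 90.
    """
    best = None
    for i in range(int(round(90.0 / step_deg))):
        theta = math.radians(i * step_deg)
        c, s = math.cos(theta), math.sin(theta)
        # Express the points in a frame rotated by the candidate heading.
        xs = [c * x + s * y for x, y in points]
        ys = [-s * x + c * y for x, y in points]
        width = max(xs) - min(xs)
        height = max(ys) - min(ys)
        area = width * height
        if best is None or area < best[0]:
            best = (area, theta, width, height)
    return best


# Synthetic cluster: corners of a 4 m x 2 m box rotated by 30 degrees.
true_heading = math.radians(30.0)
c, s = math.cos(true_heading), math.sin(true_heading)
corners = [(-2.0, -1.0), (2.0, -1.0), (2.0, 1.0), (-2.0, 1.0)]
cluster = [(c * x - s * y, s * x + c * y) for x, y in corners]

area, heading, width, height = fit_rectangle(cluster)
```

On real clusters only part of the object is visible, so a criterion other than minimum area may fit better; the sweep structure stays the same.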
24. LATTE: Accelerating LiDAR Point Cloud Annotation via
Sensor Fusion, One-Click Annotation, and Tracking
Tracking pipeline of LATTE. Annotators label a
bounding box in the initial frame. Next, use
Kalman filtering to predict the center position of
the bounding box at the next frame. Human
annotators then adjust the bounding box, and
use the new center position as a new
observation to update the Kalman filter.
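The predict/adjust/update loop can be sketched with a constant-velocity Kalman filter on one coordinate of the box center. The motion model, the noise values, and the class name `AxisKalman` are assumptions made for illustration; the slides do not specify LATTE's filter design.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate of a box center."""

    def __init__(self, pos, pos_var=1.0, vel_var=10.0,
                 process_var=0.01, meas_var=0.1):
        self.p, self.v = pos, 0.0                   # state: position, velocity
        self.P = [[pos_var, 0.0], [0.0, vel_var]]   # state covariance
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Propagate the state to the next frame and return the position."""
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Fold in the annotator-corrected position z as a new observation."""
        (a, b), (c, d) = self.P
        s = a + self.r                  # innovation variance
        k0, k1 = a / s, c / s           # Kalman gain
        y = z - self.p                  # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# An object moving about 1 m per frame; z is the center after human adjustment.
kf = AxisKalman(pos=0.0)
for z in [1.0, 2.0, 3.0, 4.0, 5.0]:
    kf.predict()
    kf.update(z)
next_center = kf.predict()  # proposal shown to the annotator in the next frame
```

One filter per axis keeps the sketch short; a full tracker would usually hold the x and y coordinates in a single state vector.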
25. FVNet: 3D Front-View Proposal Generation for
Real-Time Object Detection from Point Cloud
• A framework called FVNet for 3D front-view proposal generation and object
detection from point clouds.
• It consists of two stages: generation of front-view proposals and estimation of 3D
bounding box parameters.
• Instead of generating proposals from camera images or bird’s-eye-view maps, point
clouds are first projected onto a cylindrical surface to generate front-view feature maps
which retain rich information.
• A proposal generation network then predicts 3D region proposals from
the generated maps and further extrudes objects of interest from the whole point
cloud.
• Another network extracts the point-wise features from the extruded object points
and regresses the final 3D bounding box parameters in the canonical coordinates.
• The framework achieves real-time performance with 12ms per point cloud sample.
26. FVNet: 3D Front-View Proposal Generation for
Real-Time Object Detection from Point Cloud
The overview of (a) FVNet. It consists of two sub-networks: (b) Proposal Generation Network (PG-Net) for
generation of 3D region proposals and (c) Parameter Estimation Network (PE-Net) for estimation of 3D
bounding box parameters.
27. FVNet: 3D Front-View Proposal Generation for
Real-Time Object Detection from Point Cloud
The architecture of PG-Net. The bottom shows
the details of the residual block, the
convolutional block and the up-sampling block,
respectively.
28. FVNet: 3D Front-View Proposal Generation for
Real-Time Object Detection from Point Cloud
A 3D bounding box and its
corresponding cylinder
fragment. Left: the 3D bounding
box with dimension prior (Pw,
Ph), location prediction (bx, by)
and truncated distances
prediction (r1, r2). Right: the
corresponding cylinder
fragment in 3D space, which is
generated by truncating the
frustum with two radial
distances r1 and r2.
The projection functions
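The slide's equations are not reproduced in this text, so the following is only a generic cylindrical projection, which may differ from the functions FVNet actually uses. The function name, the use of height z for rows, and the resolutions `d_theta` and `d_z` are all assumptions.

```python
import math


def cylindrical_project(point, d_theta, d_z):
    """Map a 3D point to a (row, col) cell on a cylindrical front-view map.

    point: (x, y, z) in the sensor frame
    d_theta: angular resolution of a column, in radians
    d_z: height resolution of a row, in metres
    Returns (row, col, rho), where rho is the distance to the cylinder axis.
    """
    x, y, z = point
    theta = math.atan2(y, x)   # azimuth around the vertical axis
    rho = math.hypot(x, y)     # radial distance, usable as a map channel
    col = math.floor(theta / d_theta)
    row = math.floor(z / d_z)
    return row, col, rho


row, col, rho = cylindrical_project((10.0, 1.0, 1.1),
                                    d_theta=math.radians(1.0), d_z=0.25)
```

Storing `rho` in the map keeps the projection invertible, which is what lets a front-view proposal be extruded back into a 3D region of the point cloud.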
29. RGB and LiDAR fusion based 3D Semantic Segmentation for
Autonomous Driving
• LiDAR perception is gradually becoming mature for algorithms including object
detection and SLAM.
• However, semantic segmentation on LiDAR data remains relatively less explored.
• Motivated by the fact that semantic segmentation is mature on image
data, the work explores sensor-fusion-based 3D segmentation.
• The RGB image is converted to the polar-grid mapping representation used for LiDAR,
and early and mid-level fusion architectures are designed.
• Additionally, a hybrid fusion architecture combines both fusion
algorithms.
• The algorithm is evaluated on the KITTI dataset, which provides segmentation annotation
for cars, pedestrians and cyclists.
• Two state-of-the-art architectures, SqueezeSeg and PointSeg, are evaluated;
the mIoU score improves by 10% in both cases relative to the LiDAR-only
baseline.
30. RGB and LiDAR fusion based 3D Semantic Segmentation for
Autonomous Driving
Illustration of LiDAR Polar Grid Map representation.
31. RGB and LiDAR fusion based 3D Semantic Segmentation for
Autonomous Driving
Input frame and ground-truth
tensor. Top to bottom: X, Y, Z, D, I,
RGB and Ground Truth.
32. RGB and LiDAR fusion based 3D Semantic Segmentation for
Autonomous Driving
(a) LiDAR baseline architecture based on SqueezeSeg
33. RGB and LiDAR fusion based 3D Semantic Segmentation for
Autonomous Driving
(b) Proposed RGB+LiDAR mid-fusion architecture
Semantic segmentation network architectures. (a) shows the SqueezeSeg-based unimodal
baseline architecture. The architecture remains the same for early fusion except for the change in the
number of input planes. (b) shows the proposed mid-fusion architecture.
34. Part-A2 Net: 3D Part-Aware and Aggregation Neural
Network for Object Detection from Point Cloud
• The part-aware and aggregation neural network (Part-A2 Net) for 3D object detection from
point cloud.
• The whole framework consists of the part-aware stage and the part-aggregation stage.
• Firstly, the part-aware stage learns to simultaneously predict coarse 3D proposals and
accurate intra-object part locations with the free-of-charge supervisions derived from 3D
ground-truth boxes.
• The predicted intra-object part locations within the same proposals are grouped by the
newly designed RoI-aware point cloud pooling module, which results in an effective
representation to encode the features of 3D proposals.
• Then the part-aggregation stage learns to re-score the box and refine the box location
based on the pooled part locations.
• Extensive experiments on the KITTI 3D object detection dataset demonstrate that
both the predicted intra-object part locations and the proposed RoI-aware point cloud
pooling scheme benefit 3D object detection, and that Part-A2 Net outperforms state-of-the-art
methods while utilizing only point cloud data.
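The "free-of-charge" supervision can be sketched as follows: a ground-truth box already tells where each interior point sits inside the object. This minimal sketch assumes the box is given by its geometric center, size, and yaw, and that targets are normalized to [0, 1]; the function name and box layout are assumptions.

```python
import math


def part_location(point, box):
    """Relative position of a point inside a 3D box, each axis in [0, 1].

    point: (x, y, z)
    box: (cx, cy, cz, length, width, height, yaw), with (cx, cy, cz) taken
         as the geometric center of the box (an assumption of this sketch)
    Returns None for points outside the box (background).
    """
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    # Rotate the offset into the box's canonical frame.
    c, s = math.cos(yaw), math.sin(yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    loc = (local_x / length + 0.5, local_y / width + 0.5, dz / height + 0.5)
    if all(0.0 <= v <= 1.0 for v in loc):
        return loc
    return None


box = (10.0, 5.0, 0.0, 4.0, 2.0, 1.5, 0.0)
center_target = part_location((10.0, 5.0, 0.0), box)   # box center
front_target = part_location((12.0, 5.0, 0.0), box)    # middle of the front face
background = part_location((20.0, 5.0, 0.0), box)      # outside the box
```

No extra annotation is needed: every foreground point receives a target from the box label that already exists.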
35. Part-A2 Net: 3D Part-Aware and Aggregation Neural
Network for Object Detection from Point Cloud
Intra-object part locations and
segmentation masks can be robustly
predicted by the proposed part-aware
and aggregation network even when
objects are partially occluded. Such part
locations can assist accurate 3D object
detection.
36. Part-A2 Net: 3D Part-Aware and Aggregation Neural
Network for Object Detection from Point Cloud
The overall framework of part-aware and aggregation NN for 3D object detection. It consists of two stages: (a) The
first part-aware stage estimates intra-object part locations accurately and generates 3D proposals by feeding the
raw point cloud to a newly designed backbone network. (b) The second part-aggregation stage conducts the
proposed RoI-aware point cloud pooling operation to group the part information from each 3D proposal, then the
part-aggregation network is utilized to score boxes and refine locations based on the part features and information.
37. Part-A2 Net: 3D Part-Aware and Aggregation Neural
Network for Object Detection from Point Cloud
Sparse up-sampling and feature refinement
block. This module is adopted in the decoder of
sparse convolution based UNet backbone. The
lateral features and bottom features are first
fused and transformed by sparse convolution.
The fused feature is then up-sampled by the
sparse inverse convolution.
Illustration of RoI-aware point cloud feature pooling. Due to the
ambiguity shown in the above BEV figure, the original box shape
cannot be recovered with the previous point cloud pooling method.
The RoI-aware point cloud pooling method can encode the box
shape by keeping the empty voxels, which can be efficiently
processed by the following sparse convolution.
38. Part-A2 Net: 3D Part-Aware and Aggregation Neural
Network for Object Detection from Point Cloud
Qualitative results of Part-A2 Net on the KITTI test split. The predicted 3D boxes are drawn with green
3D bounding boxes, and the estimated intra-object part locations are visualized with different colors.
39. Voxel-FPN: multi-scale voxel feature aggregation
in 3D object detection from point clouds
• Object detection in point cloud data is one of the key components
in computer vision systems, especially for autonomous driving
applications.
• Voxel-FPN is a novel one-stage 3D object detector that
utilizes raw data from LIDAR sensors only.
• The core framework consists of an encoder network and a
corresponding decoder followed by a region proposal network.
• The encoder extracts multi-scale voxel information in a bottom-up
manner while the decoder fuses multiple feature maps from various
scales in a top-down way.
41. Voxel-FPN: multi-scale voxel feature aggregation
in 3D object detection from point clouds
Structure of voxel feature extraction network
42. Voxel-FPN: multi-scale voxel feature aggregation
in 3D object detection from point clouds
The detailed structure for RPN-FPN
43. Voxel-FPN: multi-scale voxel feature aggregation
in 3D object detection from point clouds
Visualized car detection results from the method: cubes in green color
denote ground truth 3D boxes and those in red indicate detection results.
44. STD: Sparse-to-Dense 3D Object Detector
for Point Cloud
• A two-stage 3D object detection framework, named sparse-to-dense 3D Object
Detector (STD).
• The first stage is a bottom-up proposal generation network that uses raw point
cloud as input to generate accurate proposals by seeding each point with a new
spherical anchor.
• It achieves a high recall with less computation compared with prior works.
• Then, PointsPool is applied for generating proposal features by transforming their
interior point features from sparse expression to compact representation, which
saves even more computation time.
• In box prediction, which is the second stage, a parallel intersection-over-union
(IoU) branch is implemented to increase awareness of localization accuracy, resulting in
further improved performance.
• Experiments are conducted on the KITTI dataset and evaluated in terms of 3D object and
Bird’s Eye View (BEV) detection.
• It outperforms other state-of-the-art methods by a large margin, especially on the hard set,
with an inference speed of more than 10 FPS.
45. STD: Sparse-to-Dense 3D Object Detector
for Point Cloud
Illustration of framework consisting of three different parts. The first is a proposal generation module (PGM) to
generate accurate proposals from man-made point-based spherical anchors. The second part is a PointsPool
layer to convert proposal features from sparse expression to compact representation. The final one is a box
prediction network. It classifies and regresses proposals, and picks high-quality predictions.
46. STD: Sparse-to-Dense 3D Object Detector
for Point Cloud
Illustration of networks in the proposal generation module. (a) 3D segmentation network (PointNet++). It takes a raw
point cloud (x, y, z, r) as input, and generates semantic segmentation scores as well as global context features for
each point by stacking SA layers and FP modules. (b) Proposal generation Network (PointNet). It treats normalized
coordinates and semantic features of points within anchors as input, and produces classification and regression
predictions.
47. STD: Sparse-to-Dense 3D Object Detector
for Point Cloud
Visualization of results on KITTI test set. Cars, pedestrians and cyclists are
highlighted in yellow, red and green respectively. The upper row in each image is
the 3D object detection result projected onto the RGB image. The other is the result
in the LiDAR phase.
48. End-to-end sensor modeling for LiDAR
Point Cloud
• Laser scanner sensors (LiDAR, Light Detection And Ranging) have become a fundamental choice
due to their long range and robustness to low-light driving conditions.
• The problem of designing a control software for self-driving cars is a complex task to
explicitly formulate in rule-based systems, thus recent approaches rely on machine learning
that can learn those rules from data.
• The major problem with such approaches is that the amount of training data required for
generalizing a machine learning model is big, and on the other hand LiDAR data annotation
is very costly compared to other car sensors.
• An accurate LiDAR sensor model can help address this problem.
• Moreover, its value goes beyond this because existing LiDAR development, validation, and
evaluation platforms and processes are very costly, and virtual testing and development
environments are still immature in terms of physical properties representation.
• The proposed approach is a Deep Learning-based LiDAR sensor model.
• It models the sensor echoes, using a Deep Neural Network to model echo pulse widths
learned from real data using Polar Grid Maps (PGM).
• Performance is benchmarked against comprehensive real sensor data.
51. End-to-end sensor modeling for LiDAR
Point Cloud
A comparison between real LiDAR data (left) and
synthetic data generated from the sensor model (right).
Each scan point's color represents its Echo Pulse
Width (EPW) value. The examples show that:
1) the approach has clearly mimicked EPW values
from real data; 2) the approach can mimic the noise
model in synthetically generated data in the far
perception range; 3) the model can learn how to
represent lanes as learned from real traces.
52. End-to-end sensor modeling for LiDAR
Point Cloud
Multidimensional lookup
table that the DNN needs to
learn.
53. End-to-end sensor modeling for LiDAR
Point Cloud
DNN pipeline that encapsulates the sensor model N-dimensional
lookup table.
54. End-to-end sensor modeling for LiDAR
Point Cloud
Annotated Polar Grid Map point cloud. The upper PGM is the
depth representation; the lower PGM is the point-level annotation.
The Polar Grid Map (PGM) is a representation for a LiDAR full scan in a 3D
tensor.
56. End-to-end sensor modeling for LiDAR
Point Cloud
Unet architecture. Each white box
corresponds to a multi-channel feature map.
The number of channels is denoted on top of
the box. The x-y-size is provided at the
middle of the box.
58. End-to-end sensor modeling for LiDAR
Point Cloud
Histogram Bayes Classifier output, one out of many selection
block.
59. End-to-end sensor modeling for LiDAR
Point Cloud
Summary: learn from real traces (left image) to transfer synthetic
data (middle image) to be more realistic (right image).
60. Fast Point RCNN
• A unified, efficient and effective framework for point-cloud based 3D
object detection.
• The two-stage approach utilizes both voxel representation and raw
point cloud data to exploit respective advantages.
• The first stage network, with voxel representation as input, only
consists of light convolutional operations, producing a small number of
high-quality initial predictions.
• The coordinates and indexed convolutional features of each point in the initial
predictions are effectively fused with the attention mechanism,
preserving both accurate localization and context information.
• The second stage works on interior points with their fused feature for
further refining the prediction.
61. Fast Point RCNN
Overview of the two-stage framework. In the first stage, the point cloud is voxelized and fed to VoxelRPN to
produce a small number of initial predictions. The box feature for each prediction is then generated by fusing
interior points’ coordinates and context features from VoxelRPN. Box features are fed to RefinerNet for further
refinement.
62. Fast Point RCNN
Network structure of VoxelRPN. The format of
layers used in the figure follows (kernel
size)(channels)/(stride), i.e. (kx, ky, kz)(chn)/(sx, sy,
sz). The default stride is 1 unless otherwise
specified.
Suppose the region of interest for the point
cloud is a cuboid of size (L, W, H) and each
voxel is of size (vl, vw, vh). The 3D space can then be
divided into a 3D voxel grid of size (L/vl, W/vw,
H/vh).
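The grid-size relation can be checked with a short sketch. The region of interest, voxel size, and function names below are illustrative assumptions, not values taken from the slides.

```python
import math


def voxel_grid_shape(roi_size, voxel_size):
    """Number of voxels per axis: (L/vl, W/vw, H/vh)."""
    # round() guards against floating-point error in the division.
    return tuple(int(round(extent / v)) for extent, v in zip(roi_size, voxel_size))


def voxel_index(point, roi_min, voxel_size, grid_shape):
    """Integer voxel coordinates of a point, or None if it lies outside the ROI."""
    idx = tuple(int(math.floor((p - lo) / v))
                for p, lo, v in zip(point, roi_min, voxel_size))
    if all(0 <= i < n for i, n in zip(idx, grid_shape)):
        return idx
    return None


# Hypothetical region of interest (metres) and voxel size.
roi_min = (0.0, -40.0, -3.0)
roi_size = (70.4, 80.0, 4.0)      # (L, W, H)
voxel_size = (0.1, 0.2, 0.1)      # (vl, vw, vh)

grid_shape = voxel_grid_shape(roi_size, voxel_size)
inside = voxel_index((35.25, -10.05, -0.95), roi_min, voxel_size, grid_shape)
outside = voxel_index((80.0, 0.0, 0.0), roi_min, voxel_size, grid_shape)
```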
63. Fast Point RCNN
Network Structure of RefinerNet.
Canonization of a box. The number denotes
the order of corner prediction in RefinerNet.
65. StarNet: Targeted Computation for Object
Detection in Point Clouds
• Previous work on object detection from LiDAR has emphasized re-purposing
convolutional approaches from traditional camera imagery.
• StarNet is an object detection system designed specifically for point cloud data, blending
aspects of one-stage and two-stage systems.
• Objects in point clouds are quite distinct from traditional camera images: objects are
sparse and vary widely in location, but do not exhibit scale distortions observed in
single camera perspective.
• It suggests that simple and cheap data-driven object proposals to maximize spatial
coverage or match the observed densities of point cloud data may suffice.
• This recognition paired with a local, non-convolutional, point-based network
permits building an object detector for point clouds that may be trained only once,
but adapted to different computational settings – targeted to different predictive
priorities or spatial regions.
• This flexibility and the targeted detection strategies are demonstrated on both the
KITTI detection dataset and the large-scale Waymo Open Dataset.
67. StarNet: Targeted Computation for Object
Detection in Point Clouds
StarNet point featurizer. (a)
StarNet Blocks take as input a set
of points, where each point has an
associated feature vector. Each
block first computes aggregate
statistics (max) across the point
cloud. Next, the global statistics
are concatenated back to each
point’s feature. Finally, two fully-
connected layers are applied, each
composed of BN, linear projection,
and ReLU activation. (b) The
StarNet point featurizer stacks
multiple StarNet Blocks and
performs a readout of each block’s
output using mean aggregation.
The readouts are concatenated
together to form the featurization
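The block structure described in this caption can be sketched in pure Python. Batch normalization is omitted to keep the sketch short, and the weights are hand-picked rather than learned; the function names are assumptions.

```python
def linear_relu(vec, weights):
    """Linear projection followed by ReLU (batch norm omitted in this sketch)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]


def starnet_block(features, w1, w2):
    """One block: global max, concatenate to each point, two FC layers."""
    dim = len(features[0])
    # Aggregate statistics (max) across all points in the local cloud.
    global_max = [max(f[i] for f in features) for i in range(dim)]
    out = []
    for f in features:
        h = list(f) + global_max      # concatenate global context to the point
        h = linear_relu(h, w1)
        h = linear_relu(h, w2)
        out.append(h)
    return out


def readout_mean(features):
    """Summarize a block's output by mean aggregation over points."""
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(len(features[0]))]


# Two points with 2-D features and hand-picked weights, for illustration.
features = [[1.0, 0.0], [0.0, 2.0]]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]   # 4 inputs -> 2 outputs
w2 = [[1.0, 0.0], [0.0, 1.0]]                        # identity
block_out = starnet_block(features, w1, w2)
readout = readout_mean(block_out)
```

Stacking several such blocks and concatenating their readouts gives one fixed-length descriptor per local point cloud, with no convolution involved.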
69. Class-balanced Grouping and Sampling for
Point Cloud 3D Object Detection
• This report presents a method which won the nuScenes 3D Detection
Challenge held at the Workshop on Autonomous Driving (WAD, CVPR 2019).
• In general, sparse 3D convolution is used to extract rich semantic features,
which are then fed into a class-balanced multi-head network to perform 3D
object detection.
• To handle the severe class imbalance problem inherent in autonomous
driving scenarios, a class-balanced sampling and augmentation
strategy is designed to generate a more balanced data distribution.
• A balanced grouping head boosts the performance for the categories with
similar shapes.
• Based on the Challenge results, it outperforms the PointPillars baseline by a
large margin across all metrics, achieving state-of-the-art (SOTA) detection
performance on the nuScenes dataset.
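The idea of class-balanced sampling can be sketched as drawing the same number of scenes for every class, so that scenes with rare classes are repeated. This is a simplified illustration and not the report's exact procedure; the function name, the per-class quota, and the toy scene labels are assumptions.

```python
import random


def class_balanced_resample(scenes, per_class, seed=0):
    """Resample scene indices so every class contributes the same number of draws.

    scenes: list of sets, each holding the class labels present in one scene
    per_class: number of scenes to draw (with replacement) for each class
    Returns a list of scene indices; rare-class scenes appear repeatedly.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, labels in enumerate(scenes):
        for label in labels:
            by_class.setdefault(label, []).append(idx)
    resampled = []
    for label in sorted(by_class):
        resampled.extend(rng.choices(by_class[label], k=per_class))
    return resampled


# Toy dataset: cars dominate, pedestrians and bicycles are rare.
scenes = [{"car"}, {"car"}, {"car"}, {"car", "pedestrian"}, {"bicycle"}]
resampled = class_balanced_resample(scenes, per_class=4)
bicycle_scenes = sum(1 for i in resampled if "bicycle" in scenes[i])
pedestrian_scenes = sum(1 for i in resampled if "pedestrian" in scenes[i])
```

In the original data one scene in five contains a bicycle; after resampling it is four in twelve, at the cost of showing the same scene several times, which is why augmentation is applied alongside.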
70. Class-balanced Grouping and Sampling for
Point Cloud 3D Object Detection
Network Architecture. 3D Feature Extractor is composed of submanifold and regular 3D sparse
convolutions. Outputs of 3D Feature Extractor are of 16× downscale ratio, which are flattened along
output axis and fed into following Region Proposal Network to generate 8× feature maps, followed by
the multi-group head network to generate final predictions. Number of groups in head is set according
to grouping specification.
71. Class-balanced Grouping and Sampling for
Point Cloud 3D Object Detection
Examples of detection results in validation split. Ground truth annotations are in green and detection results
are in blue. The token on top of each point cloud bird view image is its corresponding sample data token.
72. Deep Hough Voting for 3D Object
Detection in Point Clouds
• Code is open sourced at https://github.com/facebookresearch/votenet
• Current 3D object detection methods are heavily influenced by 2D detectors.
• In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids
(i.e., to voxel grids or to bird’s eye view images), or rely on detection in 2D images to propose 3D
boxes.
• Few works have attempted to directly detect objects in point clouds.
• The first principle is to construct a 3D detection pipeline for point cloud data that is as generic as
possible.
• However, due to the sparse nature of the data (samples from 2D manifolds in 3D space), there is a major
challenge when directly predicting bounding box parameters from scene points: a 3D object centroid
can be far from any surface point and is thus hard to regress accurately in one step.
• To address the challenge, VoteNet is proposed: an end-to-end 3D object detection network based on a
synergy of deep point set networks and Hough voting.
• This model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and
SUN RGB-D with a simple design, compact model size and high efficiency.
• Remarkably, VoteNet outperforms previous methods by using purely geometric information without
relying on color images.
73. Deep Hough Voting for 3D Object
Detection in Point Clouds
3D object detection in point clouds with a deep Hough voting model. Given a
point cloud of a 3D scene, VoteNet votes to object centers and then groups
and aggregates the votes to predict 3D bounding boxes and semantic classes
of objects.
74. Deep Hough Voting for 3D Object
Detection in Point Clouds
Illustration of the VoteNet architecture for 3D object detection in point clouds. Given an input point cloud of N points with XYZ
coordinates, a backbone network (PointNet++ layers) subsamples and learns deep features on the points and outputs a subset
of M points but extended by C-dim features. This subset of points is considered as seed points. Each seed independently
generates a vote through a voting module. Then the votes are grouped into clusters and processed by the proposal module to
generate the final proposals. The classified and NMS proposals become the final 3D Bboxes output.
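The voting and grouping steps can be sketched as follows. In VoteNet the offsets come from a learned voting module; here they are given by hand, and the greedy radius grouping is a simple stand-in for the clustering step. All function names are assumptions.

```python
import math


def cast_votes(seeds, offsets):
    """Each seed point votes for an object center: vote = seed + offset."""
    return [tuple(s + o for s, o in zip(seed, off))
            for seed, off in zip(seeds, offsets)]


def group_votes(votes, radius):
    """Greedy grouping: a vote joins the first cluster whose first member is near."""
    clusters = []
    for vote in votes:
        for members in clusters:
            if math.dist(vote, members[0]) <= radius:
                members.append(vote)
                break
        else:
            clusters.append([vote])
    return clusters


def cluster_center(members):
    n = len(members)
    return tuple(sum(m[i] for m in members) / n for i in range(3))


# Seeds lie on object surfaces; hand-written offsets point at the centers.
seeds = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0),
         (10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
offsets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0),
           (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

votes = cast_votes(seeds, offsets)
clusters = group_votes(votes, radius=0.5)
centers = [cluster_center(c) for c in clusters]
```

Surface points that are far apart end up close together after voting, which is what makes the centroid easier to localize than by direct regression from any single point.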
75. Deep Hough Voting for 3D Object
Detection in Point Clouds
Voting helps increase detection contexts. Seed
points that generate good boxes (BoxNet), or good
votes (VoteNet) which in turn generate good boxes,
are overlaid (in blue) on top of a representative
ScanNet scene. As the voting step effectively
increases context, VoteNet demonstrates a much
denser cover of the scene, therefore increasing the
likelihood of accurate detection.
76. MLOD: A multi-view 3D object detection
based on robust feature fusion method
• Multi-view Labelling Object Detector (MLOD).
• The detector takes an RGB image and a LIDAR point cloud as input and follows the two-stage object
detection framework.
• A Region Proposal Network (RPN) generates 3D proposals in a Bird’s Eye View (BEV) projection of the
point cloud.
• The second stage projects the 3D proposal bounding boxes to the image and BEV feature maps and
sends the corresponding map crops to a detection header for classification and bounding-box
regression.
• Unlike other multi-view based methods, the cropped image features are not directly fed to the
detection header, but masked by the depth information to filter out parts outside 3D bounding boxes.
• The fusion of image and BEV features is challenging, as they are derived from different perspectives.
• A multi-view detection header provides detection results not just from the fusion layer, but also from each
sensor channel. Hence the object detector can be trained on data labelled in different views to avoid
the degeneration of feature extractors.
• MLOD achieves state-of-the-art performance on the KITTI 3D object detection benchmark.
• Most importantly, the evaluation shows that the header architecture is effective in preventing image
feature extractor degeneration.
77. MLOD: A multi-view 3D object detection
based on robust feature fusion method
The multi-view header architecture diagram
Architectural diagram of the proposed method
78. MLOD: A multi-view 3D object detection
based on robust feature fusion method
The procedure of the foreground masking layer.
(a) Illustration of foreground masking layer procedure:
Step 1: calculating the median of nonzero values in
each grid; Step 2: obtaining a mask by Equation 1
(dmin = 6.8, dmax = 9.7 in this example); Step 3: applying
the mask to the feature maps. (b) A qualitative
example of a foreground mask and its application to
the original image. The bottom left background and
the top left and right background are masked.
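The three steps of the foreground masking layer can be sketched as below. Equation 1 is not reproduced in this text, so the mask rule used here (keep a cell when its median depth lies in [dmin, dmax]) is an assumption, as are the function names and the sample depth values.

```python
from statistics import median


def foreground_mask(depth_cells, d_min, d_max):
    """Steps 1 and 2: median of nonzero depths per grid cell, then threshold.

    depth_cells: 2D grid; each cell holds the sparse depth values falling in it
    Returns a 2D grid of 0/1 values.
    """
    mask = []
    for row in depth_cells:
        mask_row = []
        for cell in row:
            nonzero = [d for d in cell if d != 0]
            cell_depth = median(nonzero) if nonzero else 0.0
            # Assumed form of the mask rule: keep depths inside the box range.
            mask_row.append(1 if d_min <= cell_depth <= d_max else 0)
        mask.append(mask_row)
    return mask


def apply_mask(feature_map, mask):
    """Step 3: zero out feature cells that fall outside the 3D box."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]


# A 2x2 grid of depth samples; zeros mark pixels without a LiDAR return.
depth_cells = [[[0, 7.0, 7.4, 0], [20.1, 0, 19.5]],
               [[0, 0, 0], [9.0, 9.5, 30.0]]]
mask = foreground_mask(depth_cells, d_min=6.8, d_max=9.7)
masked = apply_mask([[0.3, 0.8], [0.5, 0.9]], mask)
```

Using the median rather than the mean keeps a single background return (the 30.0 above) from pushing a foreground cell out of range.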
79. MLOD: A multi-view 3D object detection
based on robust feature fusion method
Qualitative results of MLOD. In each image, detected cars are in green, pedestrians are in blue, and cyclists are in yellow.