The document outlines several methods for fusing RGB and depth sensor data using convolutional neural networks. Key methods discussed include:
- Propagating confidence maps through CNNs to produce dense depth completions from sparse LiDAR data with uncertainty estimates.
- Using CNNs to handle both sparse depth data and dense RGB data for tasks like depth completion and semantic segmentation, by changing only the last layer of the network.
- Fusing sparse 3D LiDAR and dense stereo depth with a CNN to produce high-precision depth estimations, encoding the complementary characteristics of each sensor type.
- Training a morphological neural network on a large RGB-D dataset to learn optimal filter shapes for depth completion from sparse inputs.
1. Depth Fusion from RGB and
Depth Sensors III
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• Propagating Confidences through CNNs for Sparse Data Regression
• Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation
• High-precision Depth Estimation with the 3D LiDAR and Stereo Fusion
• Learning Morphological Operators for Depth Completion
• DeepLiDAR: Deep Surface Normal Guided Depth Prediction from LiDAR and Color Image
• Dense Depth Posterior (DDP) from Single Image and Sparse Range
• DFuseNet: Fusion of RGB and Sparse Depth for Image Guided Dense Depth Completion
• 3D LiDAR and Stereo Fusion using Stereo Matching Network with Conditional Cost Volume Normalization
• Sparse and noisy LiDAR completion with RGB guidance and uncertainty
3. Propagating Confidences through CNNs for
Sparse Data Regression
• An algebraically-constrained convolution layer for CNNs with sparse input.
• Strategies for determining the confidence from the convolution operation and propagating
it to consecutive layers.
• An objective function that simultaneously minimizes the data error while maximizing the
output confidence.
• This approach produces a continuous pixel-wise confidence map enabling information
fusion, state inference, and decision support.
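The confidence-weighted (normalized) convolution can be sketched in NumPy; a minimal 1-D illustration of the data/confidence split and the propagated confidence, not the paper's implementation (function and variable names are illustrative):

```python
import numpy as np

def normalized_conv(signal, conf, kernel, eps=1e-8):
    """Normalized convolution over a 1-D sparse signal.

    The data term is weighted by per-sample confidence and renormalized by the
    convolved confidence, so missing samples (conf = 0) do not drag the output
    toward zero.  A propagated confidence is returned alongside the estimate,
    as in confidence-propagating CNN layers.
    """
    num = np.convolve(signal * conf, kernel, mode="same")
    den = np.convolve(conf, kernel, mode="same")
    out = num / (den + eps)
    # Propagated confidence: fraction of kernel mass covered by valid input.
    conf_out = den / kernel.sum()
    return out, conf_out

# Sparse constant signal: value 5.0 observed at only two positions.
signal = np.array([0.0, 5.0, 0.0, 0.0, 5.0, 0.0])
conf   = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
kernel = np.ones(3)

out, conf_out = normalized_conv(signal, conf, kernel)
# The constant signal is recovered everywhere, with reduced confidence
# where the kernel saw few valid samples.
```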
4. Propagating Confidences through CNNs for
Sparse Data Regression
The multi-scale architecture for the task of scene depth completion which utilizes normalized convolution layers
5. Propagating Confidences through CNNs for
Sparse Data Regression
Top-left: input RGB image; top-right: projected LiDAR point cloud;
bottom-left: output from the method; bottom-right: error map in logarithmic scale.
6. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
• CNNs are designed for dense data, but vision data is often sparse (stereo depth, point
clouds, pen strokes, etc.).
• A method to handle sparse depth data with optional dense RGB, and accomplish depth
completion and semantic segmentation by changing only the last layer.
• It comprises a sparse training strategy and a late fusion scheme for dense RGB + sparse depth.
• Following a study of sparse data and validity masks, no additional mask is used,
showing that the network learns sparsity-invariant features by itself.
• Ensure network robustness to varying input sparsities.
• It even works with densities as low as 0.8% (8-layer LiDAR), and outperforms all published
state of the art on the KITTI depth completion benchmark.
• Changing only the last layer, semantic segmentation is also performed on synthetic and
real datasets.
7. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
A network architecture adapted from NASNet; no validity mask is used.
8. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
A validity mask is a binary matrix of same size as the input data, with ones indicating available input
data and zeros elsewhere.
However, the validity information is quickly lost in the later layers.
This is a consequence of the normalization by the number of valid pixels, which processes a
mask with only one valid pixel in the same way as a fully valid mask.
Another consequence is that the network tends to produce blurry outputs.
9. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
A naive strategy consists of averaging separate predictions from each modality. An alternative
is early fusion: modalities are simply concatenated channel-wise and fed to the network.
It appears preferable to transform different representations (RGB intensities, distance values) to a
similar feature space before fusing them (known as late fusion).
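The early/late distinction can be sketched with toy linear "encoders" standing in for the convolutional branches (all names, shapes, and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-pixel features: 3 RGB channels and 1 sparse-depth channel.
rgb   = rng.random((4, 3))   # 4 pixels x 3 channels
depth = rng.random((4, 1))   # 4 pixels x 1 channel

# Early fusion: concatenate raw modalities channel-wise, encode jointly.
w_joint = rng.random((4, 8))
early = np.concatenate([rgb, depth], axis=1) @ w_joint        # (4, 8)

# Late fusion: encode each modality into a shared feature space first,
# then concatenate the learned features.
w_rgb, w_depth = rng.random((3, 4)), rng.random((1, 4))
late = np.concatenate([rgb @ w_rgb, depth @ w_depth], axis=1) # (4, 8)
```

In the late-fusion case, RGB intensities and distance values only meet after each has been mapped into a comparable feature space, which is the property the slide argues for.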
10. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
sD = sparse depth
11. Sparse and Dense Data with CNNs: Depth
Completion and Semantic Segmentation
12. High-precision Depth Estimation with the
3D LiDAR and Stereo Fusion
• A deep CNN architecture for high-precision depth estimation by jointly utilizing sparse 3D
LiDAR and dense stereo depth information.
• In this network, the complementary characteristics of sparse 3D LiDAR and dense stereo
depth are simultaneously encoded in a boosting manner.
• Tailored to the LiDAR and stereo fusion problem, this network differs from previous CNNs in
the incorporation of a compact convolution module, which can be deployed with the
constraints of mobile devices.
• As training data for the LiDAR and stereo fusion is rather limited, a simple yet effective
approach for reproducing the raw KITTI dataset is used.
• The raw LiDAR scans are augmented by adapting an off-the-shelf stereo algorithm and a
confidence measure.
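The augmentation idea in the last two bullets can be sketched as follows; the zero-encoding of missing returns, the threshold rule, and all names are assumptions, not the paper's exact procedure:

```python
import numpy as np

def densify_with_stereo(lidar_disp, stereo_disp, stereo_conf, tau=0.8):
    """Augment a sparse LiDAR disparity map with confident stereo disparities.

    Pixels without a LiDAR return (encoded here as 0) are filled from the
    stereo result only where the confidence measure exceeds tau; LiDAR values
    are kept wherever they exist.
    """
    out = lidar_disp.copy()
    fill = (lidar_disp == 0) & (stereo_conf > tau)
    out[fill] = stereo_disp[fill]
    return out

lidar  = np.array([[10.0, 0.0], [0.0, 0.0]])
stereo = np.array([[11.0, 12.0], [13.0, 14.0]])
conf   = np.array([[0.9, 0.9], [0.5, 0.95]])

dense = densify_with_stereo(lidar, stereo, conf)
# LiDAR pixel kept; confident stereo fills two holes; the low-confidence
# hole stays empty.
```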
14. High-precision Depth Estimation with the
3D LiDAR and Stereo Fusion
LiDAR and stereo fusion: (from top to bottom) input color image, LiDAR disparity, the result of SGM, and fusion.
17. Learning Morphological Operators for
Depth Completion
• A method for completing sparse depth images in a semantically accurate manner by training
a novel morphological NN.
• It approximates morphological operations by Contra-harmonic Mean Filter layers which are
trained in a contemporary NN framework.
• An early fusion U-Net architecture then combines dilated depth channels and RGB.
• Using a large scale RGB-D dataset to learn the optimal morphological and convolutional
filter shapes that produce a fully sampled depth image at the output.
• The resulting depth images are used to augment intelligent vehicle perception systems.
19. Learning Morphological Operators for
Depth Completion
• Morphological operators are the foundation of many image segmentation algorithms.
• Using so-called “structuring elements”, they represent non-linear operations which
compute the minimum, the maximum, or a combination of both within the element.
• In the context of depth completion, it is of interest to learn the shape and the operation type
that fits best the data.
• The approximation of morphological operators by the contra-harmonic mean (CHM) filter is
the best-founded technique and can easily be integrated in a deep learning framework.
The contra-harmonic mean filter function ψk(x) is modeled as the power-weighted 2D
convolution of the image f(x) and a filter w representing the structuring element.
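A minimal 1-D sketch of the CHM filter, ψk(x) = (f^(k+1) ∗ w)(x) / (f^k ∗ w)(x), with a fixed hand-chosen exponent rather than the learned k and w of the paper:

```python
import numpy as np

def chm_filter(f, w, k, eps=1e-12):
    """Contra-harmonic mean filter: psi_k(x) = (f^(k+1) * w) / (f^k * w).

    For large positive k the filter approaches a morphological dilation with
    structuring element w; for large negative k, an erosion.  Here k and w
    are fixed by hand; in the paper they are learned.
    """
    num = np.convolve(f ** (k + 1), w, mode="same")
    den = np.convolve(f ** k, w, mode="same")
    return num / (den + eps)

f = np.array([1.0, 5.0, 2.0, 1.0, 1.0])
w = np.ones(3)

dilated = chm_filter(f, w, k=20.0)
# With k = 20 the result is close to a max filter over each 3-sample window.
```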
21. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
• A deep learning architecture that produces accurate dense depth for the
outdoor scene from a single color image and a sparse depth.
• This network estimates surface normals as the intermediate
representation to produce dense depth, and can be trained end-to-end.
• With a modified encoder-decoder structure, this network effectively fuses
the dense color image and the sparse LiDAR depth.
• To address outdoor-specific challenges, it predicts a confidence mask to
handle mixed LiDAR signals near foreground boundaries due to occlusion, and
combines estimates from the color image and surface normals with
learned attention maps to improve the depth accuracy especially for
distant areas.
• Comprehensive analysis shows that this model generalizes well to the
input with higher sparsity or from indoor scenes.
22. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
It takes as input a color image
and a sparse depth image
from the LiDAR (Row 1), and
outputs a dense depth map
(Row 2). It estimates surface
normals (Row 3) as the
intermediate representation.
23. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
The pipeline consists of two pathways. Both start from an RGB image, a sparse depth, and a binary mask as
inputs. The surface-normal pathway produces a pixel-wise surface normal, which is further combined with the
sparse depth and a confidence mask from the color pathway to produce a dense depth. The color pathway
produces a dense depth too.
The final dense depth output is the weighted sum of the depths from two pathways using the estimated attention map.
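The final fusion step reduces to a per-pixel convex combination; the attention values below are hand-picked stand-ins for the estimated attention map:

```python
import numpy as np

# Toy per-pixel depths from the two pathways and an estimated attention map.
d_normal = np.array([[10.0, 20.0]])    # depth from the surface-normal pathway
d_color  = np.array([[12.0, 18.0]])    # depth from the color pathway
attn     = np.array([[0.75, 0.50]])    # attention on the normal pathway

# Final dense depth: per-pixel weighted sum of the two pathway outputs,
# with weights summing to one at every pixel.
fused = attn * d_normal + (1.0 - attn) * d_color
```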
24. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
Detailed architecture of deep completion unit. Occlusion and learned confidence.
25. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
26. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for
Outdoor Scene from Sparse LiDAR Data and Single Color Image
27. Dense Depth Posterior (DDP) from Single
Image and Sparse Range
• A deep learning system to infer the posterior distribution of a dense depth
map associated with an image, by exploiting sparse range measurements, for
instance from a LiDAR.
• While the LiDAR may provide a depth value for only a small percentage of the
pixels, the system exploits regularities reflected in the training set to complete
the map so as to have a probability over depth for each pixel in the image.
• It exploits a Conditional Prior Network, which allows associating a probability
to each depth value given an image, and combines it with a likelihood term
that uses the sparse measurements.
• Stereo can optionally be exploited during training, but in any case only a
single image and a sparse point cloud are required at run-time.
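The posterior idea can be illustrated with a toy MAP estimate in 1-D, where a simple smoothness penalty stands in for the learned Conditional Prior Network term; everything here is a deliberate simplification of the paper's formulation:

```python
import numpy as np

def map_depth(z, mask, alpha=1.0, iters=500, lr=0.1):
    """Toy MAP estimate of a dense 1-D depth profile.

    Minimizes a data term on the sparsely measured samples plus a smoothness
    penalty standing in for the learned prior,
        E(d) = sum_mask (d - z)^2 + alpha * sum_i (d[i+1] - d[i])^2,
    by gradient descent.  The paper's prior is a CNN conditioned on the image.
    """
    d = np.where(mask, z, z[mask].mean()).astype(float)
    for _ in range(iters):
        grad = 2.0 * mask * (d - z)          # data term on measured samples
        diff = np.diff(d)
        grad[:-1] -= 2.0 * alpha * diff      # smoothness-term gradient
        grad[1:]  += 2.0 * alpha * diff
        d = d - lr * grad
    return d

z    = np.array([1.0, 0.0, 0.0, 0.0, 5.0])          # two sparse measurements
mask = np.array([True, False, False, False, True])
d = map_depth(z, mask)
# The estimate interpolates smoothly between the two measurements.
```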
28. Dense Depth Posterior (DDP) from Single
Image and Sparse Range
(A): Architecture of a Conditional Prior Network (CPN) to learn the conditional of the
dense depth given a single image. (B): Depth Completion Network (DCN) for learning
the mapping from sparse depth map and image to dense depth map.
29. Dense Depth Posterior (DDP) from Single
Image and Sparse Range
An image (top) is insufficient to determine the
geometry of the scene
A point cloud alone (middle) is similarly ambiguous.
Combining a single image, the lidar point cloud, and
previously seen scenes allows inferring a dense
depth map (bottom) with high confidence
Color bar from left to right: zero to infinity.
30. DFuseNet: Deep Fusion of RGB and Sparse Depth
Information for Image Guided Dense Depth Completion
• A CNN that is designed to upsample a series of sparse range measurements based on the
contextual cues gleaned from a high-resolution intensity image.
• It draws inspiration from related work on super-resolution and inpainting.
• An architecture that seeks to pull contextual cues separately from the intensity image and the
depth features and then fuse them later in the network.
• It effectively exploits the relationship between the two modalities and produces accurate
results while respecting salient image structures.
Figure: input color image, LiDAR scan mask, and DFuseNet output.
31. DFuseNet: Deep Fusion of RGB and Sparse Depth
Information for Image Guided Dense Depth Completion
The network architecture uses two input branches for RGB and depth input, respectively. Spatial
Pyramid Pooling (SPP) blocks are used in the encoder, and a hierarchical representation of
decoder features is used to predict dense depth images.
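An SPP block can be sketched for a single-channel feature map; real SPP blocks also apply learned per-level convolutions, which are omitted in this illustration:

```python
import numpy as np

def spp_block(feat, levels=(1, 2, 4)):
    """Minimal Spatial Pyramid Pooling block (single-channel sketch).

    The feature map is average-pooled at several grid resolutions, each pooled
    map is upsampled back to the input size by nearest-neighbour repetition,
    and the results are stacked channel-wise with the input.  Assumes H and W
    are divisible by every level.
    """
    h, w = feat.shape
    outs = [feat]
    for n in levels:
        bh, bw = h // n, w // n
        pooled = feat.reshape(n, bh, n, bw).mean(axis=(1, 3))    # n x n grid
        outs.append(np.repeat(np.repeat(pooled, bh, 0), bw, 1))  # upsample
    return np.stack(outs)   # (1 + len(levels), H, W)

feat = np.arange(16, dtype=float).reshape(4, 4)
out = spp_block(feat)
# The coarsest level carries the global mean at every pixel, giving the
# decoder access to context at multiple scales.
```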
32. DFuseNet: Deep Fusion of RGB and Sparse Depth
Information for Image Guided Dense Depth Completion
Spatial Pyramid Pooling blocks are used in the encoder architecture.
Figure panels: input image, predicted depth without the stereo term, prediction with the stereo term.
Learning to extrapolate better using available information: by adding a stereo-depth-based loss
term, the network makes better extrapolations in regions where no ground truth or LiDAR exists.
33. 3D LiDAR and Stereo Fusion using Stereo Matching
Network with Conditional Cost Volume Normalization
• The complementary characteristics of active and passive depth sensing techniques motivate
the fusion of the LiDAR sensor and stereo camera for improved depth perception.
• Recent state-of-the-art deep stereo matching models are composed of two main
components: matching cost computation and cost volume regularization.
• Instead of directly fusing estimated depths across LiDAR and stereo modalities, the
stereo matching network is improved with two enhanced techniques: Input Fusion, which
incorporates the geometric information from sparse LiDAR depth with the RGB images for
learning joint feature representations, and Conditional Cost Volume Normalization
(CCVNorm), which adaptively regularizes cost volume optimization conditioned on LiDAR
measurements.
• The framework is generic and closely integrated with the cost volume component that is
commonly utilized in stereo matching neural networks.
• With a hierarchical extension of CCVNorm, the method brings only slight overhead to the
stereo matching network in terms of computation time and model size.
34. 3D LiDAR and Stereo Fusion using Stereo Matching
Network with Conditional Cost Volume Normalization
Overview of 3D LiDAR and stereo fusion framework: (1) Input Fusion that incorporates the geometric information
from sparse LiDAR depth with the RGB images as the input for the Cost Computation phase to learn joint feature
representations, and (2) CCVNorm, which replaces the batch normalization (BN) layer and modulates the cost
volume features F conditioned on LiDAR data, in the Cost Regularization phase of the stereo matching network.
35. 3D LiDAR and Stereo Fusion using Stereo Matching
Network with Conditional Cost Volume Normalization
Conditional Cost Volume Normalization.
At each pixel (red dashed bounding box),
based on the discretized disparity value of
the corresponding LiDAR data, categorical
CCVNorm selects the modulation
parameters γ from a D̂-entry lookup table,
while LiDAR points with invalid values
are handled separately with an additional
set of parameters (in gray). In contrast,
HierCCVNorm produces γ by a two-step
hierarchical modulation.
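The categorical CCVNorm lookup can be sketched as follows; the tables are random stand-ins for learned parameters, and the per-slice normalization is a simplification of true batch-norm statistics:

```python
import numpy as np

def categorical_ccvnorm(feat, lidar_disp, gamma_table, beta_table, eps=1e-5):
    """Categorical CCVNorm sketch over a (N, C) slice of cost-volume features.

    Features are first normalized (a stand-in for BN statistics), then each
    pixel's gamma/beta is looked up by its discretized LiDAR disparity.
    Index -1 marks pixels with no LiDAR return; the last table row is the
    extra parameter set reserved for them.
    """
    mu, var = feat.mean(axis=0), feat.var(axis=0)
    norm = (feat - mu) / np.sqrt(var + eps)
    gamma = gamma_table[lidar_disp]   # index -1 selects the "invalid" row
    beta = beta_table[lidar_disp]
    return gamma * norm + beta

rng = np.random.default_rng(0)
D_hat, C = 8, 4                            # discretized disparities, channels
gamma_table = rng.random((D_hat + 1, C))   # +1 row for invalid pixels
beta_table  = rng.random((D_hat + 1, C))

feat = rng.random((6, C))
lidar_disp = np.array([0, 3, -1, 7, -1, 2])  # -1 = no LiDAR measurement
out = categorical_ccvnorm(feat, lidar_disp, gamma_table, beta_table)
```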
36. 3D LiDAR and Stereo Fusion using Stereo Matching
Network with Conditional Cost Volume Normalization
Comparing to other baselines and variants, this method captures details in complex structure area (the white
dashed bounding box) by leveraging complementary characteristics of LiDAR and stereo modalities.
37. Sparse and noisy LiDAR completion with
RGB guidance and uncertainty
• It proposes a method to accurately complete sparse LiDAR maps guided by RGB images.
• Monocular depth prediction methods fail to generate absolute and precise depth maps.
• Stereoscopic approaches are still significantly outperformed by LiDAR based approaches.
• The goal of the depth completion task is to generate dense depth predictions from sparse
and irregular point clouds which are mapped to a 2D plane.
• A framework extracts both global and local information in order to produce proper depth
maps.
• Simple depth completion does not require a deep network; additionally, a fusion method
with RGB guidance from a monocular camera is proposed in order to leverage object
information and to correct mistakes in the sparse input.
• Confidence masks are exploited in order to take into account the uncertainty in the depth
predictions from each modality.
• Code with visualizations is available at https://github.com/wvangansbeke/Sparse-Depth-Completion.
38. Sparse and noisy LiDAR completion with
RGB guidance and uncertainty
The framework consists of two
parts: the global branch on top
and the local branch below. The
global path outputs three maps:
a guidance map, global depth
map and a confidence map. The
local branch predicts a confidence
map and a local depth map, also
taking into account the guidance
map of the global network. The
framework fuses global and
local information based on the
confidence maps in a late fusion
approach.
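The confidence-based late fusion can be sketched per pixel; the softmax weighting of the two confidence maps is one natural reading of the scheme, with illustrative names and values:

```python
import numpy as np

def fuse_global_local(d_global, d_local, conf_global, conf_local):
    """Late fusion of global and local depth maps by confidence weighting.

    A per-pixel softmax over the two confidence maps yields weights that sum
    to one, so each output pixel is a convex combination of the two
    predictions and the more confident branch dominates.
    """
    m = np.maximum(conf_global, conf_local)   # for numerical stability
    wg = np.exp(conf_global - m)
    wl = np.exp(conf_local - m)
    w = wg / (wg + wl)
    return w * d_global + (1.0 - w) * d_local

d_g = np.array([[30.0, 10.0]])
d_l = np.array([[28.0, 12.0]])
fused = fuse_global_local(d_g, d_l,
                          np.array([[5.0, 0.0]]),   # global branch confident
                          np.array([[0.0, 0.0]]))   # at the first pixel only
```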
39. Sparse and noisy LiDAR completion with
RGB guidance and uncertainty
The green box shows that the framework successfully
corrects the mistakes in the sparse LiDAR input frame.