This document presents MegaDepth, a new dataset for single-view depth prediction generated from internet photos using structure-from-motion and multi-view stereo techniques. It reviews the limitations of existing depth datasets and the contributions of MegaDepth, which comprises over 130,000 images of landmarks after filtering. An hourglass convolutional neural network is trained on MegaDepth using novel scale-invariant and ordinal depth losses to predict depth from single views with high accuracy and generalizability. Evaluation shows the model generalizes well to other datasets, while limitations remain for oblique surfaces, thin objects, and difficult materials.
Slide 2
Limitations in the available training data
• NYU: Indoor-only images
• Make3D: Small number of training examples
• KITTI: Sparse depth sampling
• (RGB image, depth map) pairs are difficult to collect
• RGB-D sensors (e.g., Kinect): limited to indoor use
• LiDAR: sparse depth maps
Contributions: MegaDepth
• Multi-view internet photo collections (a virtually unlimited data source)
• Generate training data via modern structure-from-motion (SfM) and multi-view stereo (MVS)
• Challenges: noise and unreconstructable objects
• New data cleaning methods & automatic augmentation of the data with ordinal depth relations generated using semantic segmentation
• High accuracy & Generalizability
Motivation
Slide 3
Download Internet photos from Flickr (Landmarks10K dataset)
COLMAP: a state-of-the-art SfM and MVS system
Outputs: camera poses, sparse point clouds, and dense depth maps
The MegaDepth Dataset
Slide 4
Raw depth maps from COLMAP
• Transient objects (people, cars, etc.)
• Noisy depth discontinuities
• Bleeding of background depths into foreground objects
Modified COLMAP
• At each iteration, keep the smaller (closer) of the two depth values at each pixel
• Apply a median filter to remove unstable depth values
Depth map refinement
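The two cleanup steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `depth_a` and `depth_b` are placeholder names for the two per-pixel depth estimates being compared, and the 3×3 median kernel is an assumed setting.

```python
import numpy as np

def refine_depth(depth_a: np.ndarray, depth_b: np.ndarray, kernel: int = 3) -> np.ndarray:
    """Keep the smaller (closer) depth at each pixel, then median-filter
    to suppress unstable depth values near discontinuities.
    Note: names and kernel size are illustrative assumptions."""
    merged = np.minimum(depth_a, depth_b)
    pad = kernel // 2
    padded = np.pad(merged, pad, mode="edge")
    out = np.empty_like(merged)
    h, w = merged.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + kernel, j:j + kernel])
    return out
```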
Slide 5
Depth enhancement via semantic segmentation
• Semantic filtering (transient objects & difficult-to-reconstruct objects)
- PSPNet (150 semantic categories)
- Divide the pixels into three subsets: foreground (F), background (B), and sky (S)
- If <50% of the pixels in a connected component C of F have a reconstructed depth, discard all depths from C
• Euclidean vs. ordinal depth (filtering training data)
- If >30% of an image I consists of valid depth values, keep I as training data for learning Euclidean depth
• Automatic ordinal depth labeling
- F_ord: pixels in a connected component C of F whose area is larger than 5% of the image
- B_ord: pixels p whose connected component C of B is larger than 5% of the image, and whose valid depth value lies in the last quartile of the full range of depths for I
Depth map refinement
F: statues, fountains, people, cars
B: buildings, towers, mountains, etc.
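The filtering and ordinal-labeling rules above can be sketched as below. This is a hedged sketch: connected-component masks are assumed to be precomputed (e.g., from PSPNet output), and the interpretation of "last quartile" as the top 25% of the image's depth range is an assumption.

```python
import numpy as np

def filter_foreground_depths(depth, comp_mask, min_coverage=0.5):
    """Discard all depths in a foreground component C if fewer than
    `min_coverage` of its pixels have a reconstructed (finite) depth."""
    valid = np.isfinite(depth) & comp_mask
    if valid.sum() < min_coverage * comp_mask.sum():
        depth = depth.copy()
        depth[comp_mask] = np.nan  # drop the unreliable component's depths
    return depth

def ordinal_masks(depth, fg_comp, bg_comp, min_frac=0.05):
    """F_ord: a foreground component covering >5% of the image.
    B_ord: pixels of a large background component whose valid depth
    falls in the last quartile of the image's full depth range."""
    area = depth.size
    f_ord = fg_comp if fg_comp.sum() > min_frac * area else np.zeros_like(fg_comp)
    b_ord = np.zeros_like(bg_comp)
    if bg_comp.sum() > min_frac * area:
        lo, hi = np.nanmin(depth), np.nanmax(depth)
        q3 = lo + 0.75 * (hi - lo)  # assumed start of the "last quartile"
        b_ord = bg_comp & np.isfinite(depth) & (depth >= q3)
    return f_ord, b_ord
```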
Slide 6
200 3D models from landmarks around the world
150K reconstructed images
After filtering: 130K valid images
Euclidean depth data: 100K images
Ordinal depth data: 30K images
Additional dataset: images from [18]
Creating a dataset
MegaDepth (MD)
[18] Knapitsch et al., "Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction," SIGGRAPH 2017
Slide 8
Unknown scale factor: predicted and ground-truth depths cannot be compared directly.
Ratios of pairs of depths are preserved under scaling.
Equivalently, in the log-depth domain, differences between pairs of log-depths are preserved.
Scale-invariant Loss function
Scale-invariant data term
Multi-scale scale-invariant gradient matching term
Robust ordinal depth loss
Slide 12
Generalization
• to new Internet photos from never-before-seen locations
• to other types of images from other datasets
The effect of terms in our loss function
Experimental Setup
• Test set: 46 of the 200 reconstructed models
• Training set / validation set: the remaining 154 models, randomly split 96% / 4%
Evaluation
Slide 13
Error metrics
• si-RMSE:
• SfM Disagreement Rate (SDR):
Evaluation and ablation study on MD test set
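The two metrics can be sketched as below. This is an illustrative reading of them, not the paper's exact evaluation code: si-RMSE is taken as the standard deviation of the log-depth residual, and the `delta` log-ratio tolerance for calling two depths "equal" in SDR is an assumed parameter.

```python
import numpy as np

def si_rmse(pred_log, gt_log, mask):
    """Scale-invariant RMSE: std. deviation of the log-depth residual,
    so a global depth scale (constant log offset) incurs no error."""
    r = (pred_log - gt_log)[mask]
    return float(np.sqrt(np.mean((r - r.mean()) ** 2)))

def sdr(pred_depth, sfm_pairs, delta=0.1):
    """SfM Disagreement Rate (sketch): fraction of sampled pixel pairs
    whose predicted depth ordering disagrees with the SfM-derived one.
    Each entry is ((iy, ix), (jy, jx), rel) with rel in {-1, 0, +1};
    the `delta` tolerance for 'equal' is an assumption."""
    wrong = 0
    for (iy, ix), (jy, jx), rel in sfm_pairs:
        log_ratio = np.log(pred_depth[iy, ix] / pred_depth[jy, jx])
        pred_rel = 0 if abs(log_ratio) < delta else (1 if log_ratio > 0 else -1)
        wrong += int(pred_rel != rel)
    return wrong / len(sfm_pairs)
```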
Slide 14
Effect of network and loss variants
Evaluation and ablation study on MD test set
Slide 20
Presented a new use for Internet-derived SfM+MVS data: generating large amounts of training data for single-view depth prediction.
The resulting model generalizes very well to other datasets.
Limitations:
• Oblique surfaces (e.g., ground), thin or complex objects (e.g., lampposts), and difficult materials (e.g., shiny glass)
• Does not predict metric depth (predictions are defined only up to a scale factor)
Conclusion