1. Big models without big data: Using deep networks for computer vision in data-scarce settings
Jon Almazan, Cesar de Souza, Yohann Cabon, Diane Larlus, Naila Murray, Jerome Revaud
3. Deep learning for computer vision: The data-scarcity challenge
Supervised deep learning:
✓ State-of-the-art for many CV tasks
✗ Requires lots of annotated data
Visual data is cheap and plentiful
Annotated data may be:
• Expensive
• Proprietary
• Infeasible
How to use deep learning in data-scarce settings?
24 hrs of Photography by Erik Kessels
7. Context: Attention prediction
[Figure: ground-truth attention map vs. prediction by PDP]
Task: predict a topographical attention map
Existing approaches: model it as a classification or regression task
Our approach: model attention as a stochastic process, using probability distribution prediction (PDP)
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.
8. Approach
Model the attention map as a generalized Bernoulli distribution
Apply novel loss functions that penalize the distance between the predicted (p) and target (t) distributions (a minimal sketch follows this slide)
Use a fully-convolutional architecture for probability distribution prediction
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.
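To make the loss concrete, here is a minimal PyTorch sketch of one distribution-matching objective, a KL divergence between normalized attention maps. The function name, tensor shapes, and the choice of KL are illustrative assumptions; the paper studies several such distances between distributions.

```python
import torch
import torch.nn.functional as F

def distribution_loss(pred_logits, target_map, eps=1e-8):
    """KL(t || p) between predicted and target attention distributions.

    pred_logits: (B, H*W) raw per-location network outputs
    target_map:  (B, H*W) non-negative ground-truth attention values
    """
    # Softmax turns the logits into a generalized Bernoulli distribution
    # over pixel locations (it sums to 1 per image).
    p = F.softmax(pred_logits, dim=1)
    # Normalize the target map so it is a distribution as well.
    t = target_map / (target_map.sum(dim=1, keepdim=True) + eps)
    # Penalize predicted mass that deviates from the target distribution.
    return (t * (torch.log(t + eps) - torch.log(p + eps))).sum(dim=1).mean()
```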
9. Data
Ground-truth attention data:
⢠Normally collected with eye-trackers
⢠Very expensive to collect
Jiang et al.*:
⢠introduce SALICON dataset
⢠use mouse-tracking as proxy:
We train our models with SALICON and fine-tune/test on
eye-tracking data
*Jiang et al. SALICON: Saliency in Context. CVPR 2015.
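A minimal sketch of this transfer recipe, assuming a toy PyTorch model and a hypothetical checkpoint name; the actual architecture and training schedule are those of the paper.

```python
import torch
import torch.nn as nn

# Toy fully-convolutional attention predictor (stand-in for the real model).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),  # one attention logit per spatial location
)

# Step 1: pre-train on SALICON mouse-tracking maps, then reload the weights.
# model.load_state_dict(torch.load("salicon_pretrained.pt"))  # hypothetical path

# Step 2: fine-tune on the small eye-tracking set with a low learning rate,
# so the scarce target data only gently adapts the pre-trained weights.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```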
10. Results
[Figures: convergence of AUC using different loss functions; performance on the SALICON test set]
Results in source domain: mouse-tracking prediction
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.
11. Results
Results in target domain: task-free eye-tracking prediction (OSIE dataset)
Results in target domain: task-dependent eye-tracking prediction (VOCA 2012 dataset)
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.
12. Conclusion
Problem: attention map prediction using limited target data
Solution: training with appropriate loss functions, and pre-training with proxy data
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.
15. Recent approaches
Recent methods leverage deep learning:
✓ Representations are compact and fast at test time!
But they use standard networks designed for image classification:
✗ Not designed for retrieval
✗ Results significantly below the state of the art
16. Can we learn to represent images for retrieval?
Yes, if:
1. Training data is available
2. The network architecture can capture fine details
3. Training focuses on retrieval
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
17. Obtaining Training Data
Public dataset of landmark images
⢠~200K images
⢠600 different landmarks (Eiffel tower, Rome colosseum, Big BenâŚ)
⢠Extremely noisy. Learning fails without clean data.
17
[Babenko et al, Neural codes @ ECCV14]
[Figure: examples of prototypical views, non-prototypical views, and wrong-category images]
18. Obtaining Training Data
We proposed an automatic cleaning technique (a minimal sketch follows this slide):
• Create a graph per class using image matching
• Prune edges corresponding to low matching scores
• Use verified keypoint matches to mine bounding boxes
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
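A minimal sketch of the per-class graph cleaning, assuming a pairwise match_score function (e.g., normalized verified-keypoint inliers) and an illustrative threshold; this shows the idea, not the paper's exact procedure.

```python
import networkx as nx

def clean_class(images, match_score, min_score=0.5):
    """Keep images of one landmark class that survive graph pruning.

    images: list of image ids; match_score(a, b) -> pairwise matching
    score in [0, 1] (e.g., normalized verified-keypoint inliers).
    """
    g = nx.Graph()
    g.add_nodes_from(images)
    for i, a in enumerate(images):
        for b in images[i + 1:]:
            if match_score(a, b) >= min_score:  # prune low-score edges
                g.add_edge(a, b)
    # Images that mutually match form a large connected component;
    # isolated or weakly connected images are likely noise.
    components = list(nx.connected_components(g))
    return max(components, key=len) if components else set()
```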
19. Obtaining Training Data
We proposed an automatic cleaning technique, resulting in:
⢠40K spatially verified images
⢠Approximate bounding box annotations
⢠A new cleaned dataset, now publicly available
19
Public dataset of landmark images
⢠~200K images
⢠600 different landmarks (Eiffel tower, Rome colosseum, Big BenâŚ)
⢠Extremely noisy. Learning fails without clean data.
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
20. Proposed approach
Learning to rank images:
We propose a new three-stream Siamese network designed specifically for retrieval (a ranking-loss sketch follows this slide).
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
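A minimal sketch of the ranking objective such a three-stream network can optimize: a triplet hinge loss over query, relevant, and irrelevant descriptors. The margin value and the squared-distance formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(q, pos, neg, margin=0.1):
    """Triplet loss over L2-normalized global descriptors of shape (B, D).

    q, pos, neg come from three weight-sharing streams fed with the
    query, a matching image, and a non-matching image.
    """
    d_pos = ((q - pos) ** 2).sum(dim=1)  # squared distance to the positive
    d_neg = ((q - neg) ** 2).sum(dim=1)  # squared distance to the negative
    # Hinge: the positive must be closer than the negative by `margin`,
    # otherwise the triplet contributes to the loss.
    return F.relu(margin + d_pos - d_neg).mean()
```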
21. Experimental evaluation on standard benchmarks
Oxford dataset
• 5k images
• 5k images + 100k distractor images
Paris dataset
• 6k images
INRIA Holidays dataset
• 1,491 images
23. Experiments: Paris 6k and INRIA Holidays
[Bar charts: mean average precision (mAP) on Paris 6K and INRIA Holidays, comparing deep and traditional baselines with our method. Baselines range from 79.7 to 86.5 mAP on Paris 6K and from 75.8 to 87.5 on INRIA Holidays; our method reaches 96.7 and 94.8, respectively.]
24. Qualitative results
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
25. Conclusion
Problem: efficient instance-level image retrieval using deep networks
Solution: training with reliable annotations and an appropriate model architecture
[Figure: example query and retrieved images]
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.
27. Synthetic Data for Computer Vision
Benefits
⢠Complete control
⢠Automatic annotations
⢠Quantity & variability
Challenges
⢠Chicken & egg problem?
⢠Technically feasible and cost-effective?
Our solution
⢠Off-the-shelf game engine (Unity)
⢠Seeding virtual worlds with limited real-world sensor data
⢠Automatic generation of all labels via shader programming
27
28. Synthetic Data for Computer Vision
Gaidon et al. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. CVPR 2016.
Ros et al. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. CVPR 2016.
Richter et al. Playing for Data: Ground Truth from Computer Games. ECCV 2016.
29. Virtual worlds for action classification
From modelling vehicles to modelling human actions:
Orders of magnitude increase in complexity:
⢠non-rigid motion
⢠complex interactions with objects and people
⢠large diversity in viewpoints and appearance
How to create diverse, realistic, and physically-plausible
training videos?
Our solution: Procedural Human Action Videos (PHAV):
⢠generative model of human action videos
29
de Souza, Cabon, Gaidon, Lopez. Procedural Generation of Videos to Train Deep Action Recognition Networks. CVPR 2017.
31. Procedural Human Action Videos
PHAV Data modalities:
⢠RGB
⢠Depth
⢠Semantic Segmentation
⢠Instance Segmentation
⢠Horizontal Flow
⢠Vertical Flow
Extracted using Multiple Render Targets
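As a toy illustration of turning render targets into annotations: assume the shader writes one integer object id per pixel into an extra render target, and a lookup table maps ids to semantic classes. All ids and classes below are made up.

```python
import numpy as np

# Hypothetical mapping from per-object instance ids (written by a shader
# into an integer render target) to semantic class ids.
instance_to_class = {1: 0, 2: 0, 3: 1, 4: 2}  # e.g. two cars, a road, a person

def semantic_from_instances(instance_pass: np.ndarray) -> np.ndarray:
    """Turn an (H, W) instance-id render into a semantic label map."""
    lut = np.zeros(max(instance_to_class) + 1, dtype=np.int32)
    for inst, cls in instance_to_class.items():
        lut[inst] = cls
    return lut[instance_pass]

# Tiny 2x2 "render": instance ids -> class ids
print(semantic_from_instances(np.array([[1, 3], [4, 2]])))  # [[0 1] [2 0]]
```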
32. Virtual worlds for action classification
de Souza, Cabon, Gaidon, Lopez. Procedural Generation of Videos to Train Deep Action Recognition Networks. CVPR 2017.
33. Virtual worlds for action classification
Adding PHAV helps training, particularly when real-world data is limited (one simple mixing scheme is sketched after this slide).
de Souza, Cabon, Gaidon, Lopez. Procedural Generation of Videos to Train Deep Action Recognition Networks. CVPR 2017.
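One simple way to combine the two sources during training is to draw every minibatch partly from the scarce real clips and partly from PHAV; the sampler below is an illustrative assumption, not necessarily the paper's exact scheme.

```python
import random

def mixed_minibatch(real_clips, synth_clips, batch_size=32, real_fraction=0.5):
    """Sample a minibatch that mixes scarce real clips with PHAV clips."""
    n_real = int(batch_size * real_fraction)
    batch = random.choices(real_clips, k=n_real)                 # real videos
    batch += random.choices(synth_clips, k=batch_size - n_real)  # synthetic
    random.shuffle(batch)
    return batch
```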
34. Conclusion
Problem: generate large-scale annotated synthetic videos useful for computer vision
Solution: modern game engine, real-to-virtual cloning, shaders
de Souza, Cabon, Gaidon, Lopez. Procedural Generation of Videos to Train Deep Action Recognition Networks. CVPR 2017.
38. Some numbers
Time to train the network: ~1 week on a single M40 GPU
Time to encode images: ~10 images per second on an M40 GPU
Total size per encoded image: 8 KB (128 images per MB; dim = 2048; arithmetic sketched below)
Time to compare images: millions of comparisons per second
• After PQ compression: 256 bytes/image with a minor decrease in accuracy
Training memory requirements: ~3 × 7 GB
• 3-stream residual networks do not naively fit in memory!
• Each stream is processed sequentially: only one stream is active at a time