The document is the text of Abhishek Sharma's PhD defense talk on learning from multiple views of data. It presents an overview of his work on semantic segmentation as a visual feature extractor, using a Recursive Context Propagation Network (RCPN) to incorporate contextual information, and on constructing a common representation space to match content across modalities such as images and text.
1. Learning From Multiple Views of Data
PhD Defense talk of Abhishek Sharma
Collaborators: David W. Jacobs, Larry S. Davis, Hal Daumé III, Oncel Tuzel, Ming-Yu Liu, Abhishek Kumar, Jonghyun Choi, Murad Al Haj, Sanja Fidler and Angjoo Kanazawa
2. Overview
1. Introduction
PART - I
  1. Content Extraction
    1. Semantic segmentation as a visual feature
    2. Contextual information
    3. Neural network model
PART - II
  1. Cross-modal content matching
    1. Challenges
    2. PLS-based common representation
    3. Generalized Multi-view Analysis
2. Future Directions
3. Match image and sentence
Image courtesy – UIUC sentence-Image dataset: http://vision.cs.uiuc.edu/pascal-sentences/
[Figure: a text view ("Two parked jet airplanes facing opposite directions") and an image view, both mapped into a canonical/common view.]
4. Find the image based on a sentence
Two parked jet airplanes facing opposite directions
8. A simple computer-based matching of sentence and image
1. Task understanding
2. Content from text and image
   1. jet airplanes
   2. two
   3. parked
   4. facing opposite directions
3. Content matching
9. Cross-view content matching challenges
Text – “Two parked jet airplanes facing opposite directions on a grassy land”
[Figure: the sentence as a sparse bag-of-words vector over a 10,000-word vocabulary (entries for "jet", "direction", "facing", ...) alongside a SIFT bag-of-words histogram of the image.]
• Dimension mismatch
• Semantic mismatch
• Insufficient content
• Deep?
10. Cross-view content matching challenge
Lack of correspondence
[Figure: column-wise vectorization of two views of a face; the same region lands at different indices in each 8-element vector, and some regions are missing entirely from one view. Deep?]
11. Other useful problems
Task – Face recognition
[Figure: a query face matched against a face database.]
Content extraction: pixel, attribute, SIFT, LBP, HOG, Gabor
Content matching: CCA, PLS, metric learning, SVMs
12. Other useful problems
Task – Forensic sketch photo matching
[Figure: a forensic sketch query matched against a suspect image database.]
Image courtesy – Lois Gibson, "Forensic Art Essentials: A Manual for Law Enforcement Artists"
Content extraction: SIFT, HOG, Gabor
Content matching: local LDA, PLS, CCA
13. This Dissertation
We are interested in extracting and matching task-dependent content across multiple modalities.
[Diagram: a task drives both content extraction and content matching.]
Tasks: pose-invariant face recognition, pose-lighting invariant face recognition, text-image matching, forensic image-photo matching
Content extraction: semantic segmentation
Content matching: Partial Least Squares, pose-error robust matching, Generalized Multi-view Analysis
16. Semantic Segmentation: Overview
1. Scene understanding, robotics, medical image analysis etc.
2. Related work
3. Problem formulation
4. Role of context
5. Intuitive picture
6. Mathematical picture
7. Complete Pipeline
8. Back-propagation and issues
9. Pure-node RCPN
10. Experiments
17. Related Work
1. Multi-scale CNN (Farabet, Pinheiro)
2. Deep CNN (DeepSeg)
3. Non-parametric template matching (Tighe_1, Tighe_2, Eigen, Yang)
4. CRF models (Gould, Munoz, Lempitsky, Kumar, Mottaghi, Yuille)
18. Semantic Segmentation: Problem formulation
Label each super-pixel, e.g. Road, Car, Ground
Image courtesy – http://www.cs.unc.edu/~jtighe/Papers/ECCV10/siftflow/baseFinal.html
[Figure: input image and super-segment overlaid image after super-segmentation.]
19. Semantic Segmentation: Context
• Labeling super-pixel in isolation is difficult
• Without context machines outperform humans: 77.4% vs 72.2%
(Mottaghi et al.)
[Figure: isolated super-pixels whose labels are ambiguous among Building, Train and Aeroplane.]
Image courtesy – Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun and Devi Parikh, “Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs”, IEEE CVPR 2013
21. Semantic Segmentation: Context
• Use context
• MRFs and CRFs
• Typically MRFs and CRFs use human-designed potential functions and features
• Complex human visual system – LEARN IT FROM DATA
Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun and Devi Parikh, “Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs”, IEEE CVPR 2013
22. Recursive Context Propagation Network or RCPN
1. Label each super-pixel using entire image
2. Fast feed-forward computations for real-time labeling
3. End-to-end learning
4. Modular with respect to the segmentation pipeline (the propagation is sketched below)
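The recursive propagation itself is compact enough to sketch. Below is a minimal numpy version, assuming single-matrix modules with tanh nonlinearities and using the sem/com/dec/lab module names from the gradient analysis later in this talk; all sizes and the toy parse tree are illustrative assumptions, and the actual RCPN modules may be deeper.

import numpy as np

rng = np.random.default_rng(0)
d_vis, d_sem, n_classes = 768, 60, 8   # hypothetical sizes

# One weight matrix per module; the real RCPN modules may be deeper.
W_sem = rng.standard_normal((d_sem, d_vis)) * 0.01      # semantic mapper
W_com = rng.standard_normal((d_sem, 2 * d_sem)) * 0.01  # combiner: merge two children
W_dec = rng.standard_normal((d_sem, 2 * d_sem)) * 0.01  # decombiner: child + parent context
W_lab = rng.standard_normal((n_classes, d_sem)) * 0.01  # labeler: class scores

# Toy parse tree over 3 super-pixels (leaves 0-2): node 3 = (0,1), node 4 = (3,2).
tree = [(3, (0, 1)), (4, (3, 2))]
v = rng.standard_normal((3, d_vis))  # super-pixel visual features

x = {i: np.tanh(W_sem @ v[i]) for i in range(3)}     # sem: visual -> semantic
for parent, (l, r) in tree:                          # com: bottom-up merging
    x[parent] = np.tanh(W_com @ np.concatenate([x[l], x[r]]))

ctx = {4: x[4]}                                      # the root already sees the whole image
for parent, (l, r) in reversed(tree):                # dec: top-down context propagation
    for child in (l, r):
        ctx[child] = np.tanh(W_dec @ np.concatenate([x[child], ctx[parent]]))

scores = {i: W_lab @ ctx[i] for i in range(3)}       # lab: classify enhanced leaf features
labels = {i: int(np.argmax(s)) for i, s in scores.items()}

Because every step is a feed-forward matrix product, labeling stays fast at test time, which is how the real-time claim above can hold.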
24. Semantic Segmentation - Pipeline
1. Super-pixel feature
• F_CNN = multi-scale CNN at scales 1, 2 and 4
• Architecture: 8×8×16 → 2×2 maxpool → 7×7×64 → 2×2 maxpool → 7×7×256
• 256×3 = 768-dimensional pixel feature
• Field of view (FOV) for every pixel = 47×47, 94×94 and 188×188 at the three scales
• Super-pixels by LiuSeg, ~100 super-pixels per image
• v_i = average of the pixel features inside each super-pixel (sketched below)
• Data augmentation by 5 random average sets
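The pooling step v_i is simple enough to show directly. A minimal numpy sketch, assuming a hypothetical (H, W, 768) array of per-pixel multi-scale CNN features and a hypothetical integer super-pixel id map; the deck's exact averaging and its 5-random-subset augmentation may differ.

import numpy as np

def superpixel_features(pixel_feats, sp_ids, n_sp):
    """Average 768-d per-pixel features inside each super-pixel.

    pixel_feats: (H, W, 768) multi-scale CNN output (hypothetical array)
    sp_ids:      (H, W) integer super-pixel id per pixel, in [0, n_sp)
    """
    H, W, D = pixel_feats.shape
    flat = pixel_feats.reshape(-1, D)
    ids = sp_ids.ravel()
    sums = np.zeros((n_sp, D))
    np.add.at(sums, ids, flat)                  # scatter-add features per super-pixel
    counts = np.bincount(ids, minlength=n_sp)   # pixels per super-pixel
    return sums / np.maximum(counts, 1)[:, None]  # v_i = mean feature (guard empty ids)

The random-average augmentation mentioned above would presumably replace the full mean with means over random subsets of each super-pixel's pixels.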
31. RCPN Back-propagation and Bypass Error
[Figure: a parse tree over super-pixels. Leaf visual features v1, v2 are mapped by the semantic mapper (sem, errors e1, e2) to x1, x2; the combiner (com) merges children into sub-tree features up to the root x9 (error e9); the decombiner (dec, error e6) produces context-enhanced features x̃; a categorizer maps the enhanced feature x̃ to label y1 for leaf l1, and the errors flow back along the tree.]
32. RCPN Back-propagation and Bypass Error
[Figure: the same parse tree, highlighting a gradient path from the label loss on y1 back to v1 that bypasses the combiner.]
• Combiner is bypassed → context lost → poor local minimum
• Gradient strengths across the sem, com, dec and lab modules:
  Empirical: ‖g_com‖ ≪ ‖g_sem‖ ≈ ‖g_dec‖ ≪ ‖g_lab‖
  Ideal: ‖g_sem‖ < ‖g_com‖ < ‖g_dec‖ < ‖g_lab‖
33. Pure-node RCPN or PN-RCPN
• RCPN + pure-node classification loss (pure nodes sketched below)
• Benefits
  • Roughly 65% more training data
  • Meaningful combinations by the combiner
  • Deeper and stronger gradients
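Pure nodes are, as I understand the deck, the internal parse-tree nodes whose descendant super-pixels all share one ground-truth label, so the combiner's output at those nodes can be supervised directly. A minimal sketch of finding them (function and variable names are mine):

def pure_nodes(tree, leaf_labels):
    """Return {node: label} for internal nodes whose leaves share one label.

    tree:        list of (parent, (left, right)) in bottom-up order
    leaf_labels: {leaf_id: ground-truth class}
    """
    labels = dict(leaf_labels)      # node -> label, or None once mixed
    pure = {}
    for parent, (l, r) in tree:
        ll, rl = labels.get(l), labels.get(r)
        labels[parent] = ll if (ll is not None and ll == rl) else None
        if labels[parent] is not None:
            pure[parent] = labels[parent]
    return pure  # extra supervised nodes for the classification loss

Supervising these extra nodes forces gradients through the combiner, which is how PN-RCPN addresses the bypass problem of the previous slides.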
35. Grad Strength: RCPN vs. PN-RCPN
[Figure: gradient-magnitude bars for the sem, com, dec and lab modules under RCPN (left) and PN-RCPN (right).]
RCPN: ‖g_com‖ ≪ ‖g_sem‖ ≈ ‖g_dec‖ ≪ ‖g_lab‖
PN-RCPN: ‖g_sem‖ < ‖g_com‖ ≈ ‖g_dec‖ < ‖g_lab‖
36. Experiments: Datasets
We conduct semantic segmentation experiments on three datasets
Stanford Background
Color images with 8 semantic classes
Train/Test – 572/143 images
SIFT Flow
Color images with 33 semantic classes
Train/Test – 2488/200
Daimler Urban Dataset
Gray-scale images with 6 semantic classes
Train/Test – 500/200
37. Experiments: Details
• Per-pixel subtraction of 0.5 (input normalization)
• 100 super-pixels/image for Stanford and SIFT Flow
• 800 for Daimler due to its larger images
• 10 random parse trees with 5 random feature sets during training to avoid over-fitting
• 20 random parse trees with max-voting at test time (sketched below)
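A minimal sketch of the test-time vote, assuming max-voting means a majority vote over the hard predictions of the 20 random parse trees (voting over summed scores would be a small variant); array names are mine.

import numpy as np

def max_vote(scores_per_tree):
    """Combine per-super-pixel class scores from several random parse trees.

    scores_per_tree: (n_trees, n_superpixels, n_classes) array
    Returns one label per super-pixel by majority vote over tree predictions.
    """
    preds = scores_per_tree.argmax(axis=2)                # (n_trees, n_sp) hard labels
    n_classes = scores_per_tree.shape[2]
    votes = np.apply_along_axis(
        lambda p: np.bincount(p, minlength=n_classes), 0, preds)  # (n_classes, n_sp)
    return votes.argmax(axis=0)                           # majority label per super-pixel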
38. Experiments: Performance metric
1. Per-pixel accuracy (PPA)
2. Mean-class accuracy (MCA)
3. Intersection over Union (IoU) – penalizes both under- and over-segmentation
4. Dynamic IoU (Dyn IoU) – IoU for dynamic objects
5. Time Per Image (TPI) – Both CPU and GPU
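For reference, the standard definitions of the first three metrics, which I assume match the talk's usage. Writing n_{c'c} for the number of pixels of class c' predicted as class c, t_c for the total pixels of class c, and C for the number of classes:

\mathrm{PPA} = \frac{\sum_{c} n_{cc}}{\sum_{c} t_{c}}, \qquad
\mathrm{MCA} = \frac{1}{C} \sum_{c} \frac{n_{cc}}{t_{c}}, \qquad
\mathrm{IoU} = \frac{1}{C} \sum_{c} \frac{n_{cc}}{t_{c} + \sum_{c'} n_{c'c} - n_{cc}}

PPA is dominated by frequent classes, MCA weights all classes equally, and IoU additionally charges false positives, which is why the three can rank methods differently in the tables that follow.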
39. Stanford Results
Method PPA MCA IoU TPI (CPU/GPU)
Gould 76.4 NA NA 30 – 600 / NA
Munoz 76.9 NA NA 12 / NA
Tighe_1 77.5 NA NA 4 / NA
Kumar 79.4 NA NA < 600 / NA
Socher 78.1 NA NA NA / NA
Lempitsky 81.9 72.4 NA > 60 / NA
Singh 74.1 62.2 NA 20 / NA
Farabet 81.4 76.0 NA 60.5 / NA
Eigen 75.3 66.5 NA 16.6 / NA
Pinheiro 80.2 69.6 NA 10 / NA
Plain-NN 80.1 69.7 56.4 1.1 / 0.4
RCPN 81.8 73.9 61.3 1.1 / 0.4
PN-RCPN 82.1 79.0 64.0 1.1 / 0.4
TM-RCPN 82.3 79.1 64.5 1.6-6.1 / 0.9-5.9
40. SIFT Flow results
Method PPA MCA IoU TPI (CPU/GPU)
Tighe 77.0 30.1 NA 8.4 / NA
Liu 76.7 NA NA 31 / NA
Singh 79.2 33.8 NA 20 / NA
Eigen 77.1 32.5 NA 16.6 / NA
Farabet 78.5 29.6 NA NA / NA
Bal. Farabet 72.3 50.8 NA NA / NA
Tighe, 24 78.6 39.2 NA 8.4 / NA
Pinheiro 77.7 29.8 NA NA / NA
Yang 79.8 48.7 NA < 12 / NA
Plain-NN 76.3 32.1 24.7 1.1 / 0.36
RCPN 79.6 33.6 26.9 1.1 / 0.4
Bal. RCPN 75.5 48.0 28.6 1.1 / 0.4
PN-RCPN 80.9 39.1 30.8 1.1 / 0.4
Bal. PN-RCPN 75.5 52.8 30.2 1.1 / 0.4
TM-RCPN 80.8 38.4 30.7 1.6-6.1 / 0.9-5.4
Bal. TM-RCPN 76.4 52.6 31.4 1.6-6.1 / 0.9-5.4
DeepSeg 85.2 51.7 39.1 NA / 0.2
46. PLS based multi-modal face recognition
[Diagram: two views X and Y (differing in pose, resolution, or photo vs. sketch) are projected by W_X and W_Y through a PLS bridge into a common subspace in which shape = identity.]
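The PLS bridge is easy to prototype. A minimal sketch with scikit-learn's PLSCanonical on synthetic stand-ins for the two views; the data sizes, component count and cosine-matching step are my assumptions, not necessarily the talk's exact protocol.

import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.default_rng(0)
n, dx, dy, k = 300, 100, 120, 20       # hypothetical sizes
X = rng.standard_normal((n, dx))       # view 1, e.g. frontal-pose features
Y = rng.standard_normal((n, dy))       # view 2, e.g. profile-pose features

pls = PLSCanonical(n_components=k)     # learns the projections W_X, W_Y
pls.fit(X, Y)                          # from paired training data

x_scores, y_scores = pls.transform(X, Y)   # both views in the common subspace
# Match by cosine similarity between the projected views.
a = x_scores / np.linalg.norm(x_scores, axis=1, keepdims=True)
b = y_scores / np.linalg.norm(y_scores, axis=1, keepdims=True)
sim = a @ b.T
match = sim.argmax(axis=1)             # best view-2 match for each view-1 probe

With real paired training data, cross-view recognition then reduces to nearest-neighbor search in the latent space.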
47. PLS based pose-invariant face recognition
[Bar chart: recognition accuracy of the proposed method vs. PGFR, TFA, LLR and ELF, roughly in the 0.75–1.0 range; a partial comparison, since the methods use different testing scenarios.]
• CMU PIE face data set for experiments
• 34 training and 34 testing subjects, intensity features
54. GMA cont..
• Multi-view extension of any feature extraction method formulated as a generalized eigenvalue problem
• GMA + LDA = GMLDA
  D = between-class scatter matrix; S = within-class scatter matrix
• GMA + MFA = GMMFA
  D = penalty graph; S = intrinsic graph
• GMA + LPP = GMLPP
  D = identity; S = graph Laplacian of the similarity matrix
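Concretely, each single-view method above maximizes a Rayleigh quotient, and GMA couples one such problem per view through the paired training samples. A sketch for two views; the exact weighting is my paraphrase of the GMA objective, with Z_1, Z_2 the paired data matrices and μ, γ trade-off hyper-parameters:

\max_{w}\; \frac{w^{\top} D\, w}{w^{\top} S\, w}
\;\;\Longleftrightarrow\;\;
D\, w = \lambda\, S\, w

\max_{w_1, w_2}\; w_1^{\top} D_1 w_1 + \mu\, w_2^{\top} D_2 w_2
  + 2\gamma\, w_1^{\top} Z_1 Z_2^{\top} w_2
\quad \text{s.t.}\quad w_1^{\top} S_1 w_1 + \mu\, w_2^{\top} S_2 w_2 = 1

The coupled problem is again a (joint, block-structured) generalized eigenvalue problem, which is what gives the closed-form solution claimed on the next slide.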
55. Pros and Cons
Cross-view classification and retrieval
Kernelizable
Closed form optimal solution
Supervised
Generalize to unseen classes
Domain agnostic
56. Pros and Cons
Still not ideal
Non-probabilistic
Shallow
Requires similar views across train and test
57. Final Picture
[Diagram: methods placed along view-1 and view-2 axes, from the original space to different latent spaces, with paired-data requirements marked: CCA/PLS/BLM, GMA and SVM-2K/HMFDA relative to an ideal method.]
58. Experiments
Pose and lighting invariant face recognition
• 129 training subjects in 5 illuminations
• 129 test subjects (same identities, different session) in 18 illuminations
• 120 subjects in 5 illuminations
• 129 test subjects (different identities, different session) in 18 illuminations
59. Text-Image Retrieval
• Wiki pages (2173 + 693)
• 10 different classes
• Latent Dirichlet Allocation based text features
• SIFT histogram based image features
• Precision-recall based Mean Average Precision (MAP) score (sketched below)
• SM – semantic matching (domain-dependent approach)
• SCM – semantic matching in CCA latent space (two-stage domain-dependent approach)
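Since MAP drives the comparison here, a minimal sketch of how it is typically computed for class-based cross-modal retrieval; function and variable names are mine, and I assume a gallery item counts as relevant when it shares the query's class, the usual convention for this benchmark.

import numpy as np

def mean_average_precision(sim, q_labels, g_labels):
    """MAP for cross-modal retrieval (e.g. text queries against images).

    sim:      (n_queries, n_gallery) similarity matrix
    q_labels: (n_queries,) class of each query
    g_labels: (n_gallery,) class of each gallery item
    """
    aps = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                  # best matches first
        rel = (g_labels[order] == q_labels[i])       # relevant = same class
        if not rel.any():
            continue
        hits = np.cumsum(rel)
        prec = hits / (np.arange(len(rel)) + 1)      # precision at each rank
        aps.append((prec * rel).sum() / rel.sum())   # average precision
    return float(np.mean(aps))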
60. Future Directions
• Deep learning based feature extraction
• Large-scale Data collection
• Deep multi-view algorithms vs. a common deep network
• Unsupervised training
62. References
Tighe_1: J. Tighe and S. Lazebnik. Superparsing. Int. J. Comput. Vision, 101(2):329–349, 2013
Tighe_2: J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. IEEE CVPR, 2013
Gould: S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. IEEE ICCV, 2009
Munoz: D. Munoz, J. A. Bagnell, and M. Hebert. Stacked hierarchical labeling. ECCV, 2010
Kumar: M. P. Kumar and D. Koller. Efficiently selecting regions for scene understanding. IEEE CVPR, 2010
Lempitsky: V. Lempitsky, A. Vedaldi, and A. Zisserman. A pylon model for semantic segmentation. NIPS, 2011
Farabet: C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE TPAMI, August 2013
Eigen: D. Eigen and R. Fergus. Nonparametric image parsing using adaptive neighbor sets. IEEE CVPR, 2012
Joint: L. Ladický, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, and P. Torr. Joint optimization for object class segmentation and dense stereo reconstruction. International Journal of Computer Vision, 100(2):122–133, 2012
Liu: C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. IEEE TPAMI, 33(12), Dec 2011
LiuSeg: M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. IEEE CVPR, 2011
Pinheiro: P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. ICML, 2014
Stixmantics: T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth. Stixmantics: A medium-level model for real-time semantic scene understanding. ECCV, 2014
Yang: J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. IEEE CVPR, pages 3294–3301, 2014