39. Neural Network Approaches for
Visual Recognition
L. Fei-Fei
Computer Science Dept.
Stanford University
41. Machine learning in computer vision
• Aug 12, Lecture 3: Neural Network
– Convolutional Nets for object recognition
(slides courtesy of Yann LeCun (NYU))
– Unsupervised feature learning via Deep Belief Net
7 August 2010 41
L. Fei-Fei, Dragon Star 2010,
Stanford
[Reference paper: Yann LeCun et al., Proc. IEEE, 1998]
42. Mammalian visual pathway is hierarchical
• The ventral (recognition) pathway in the visual cortex
– Retina → LGN → V1 → V2 → V4 → PIT → AIT (80-100ms)
[picture from Simon Thorpe]
43. Some challenges for machine learning in vision
• Efficient learning algorithms for multi-stage “deep architectures”
– Multi-stage learning is difficult
• Learning good internal representations of the world (features)
– A hierarchy of features that captures the relevant information and
eliminates irrelevant variabilities
• Learning with unlabeled samples (unsupervised learning)
– Animals can learn vision just by looking at the world
– No supervision is required
• Deep learning algorithms that can be applied to other modalities
– If we can learn feature hierarchies for vision, we can learn feature
hierarchies for audition, natural language, ...
44. The traditional “shallow” architecture for recognition
• The raw input is pre-processed by a hand-crafted feature extractor
• The features are not learned
• The trainable classifier is often generic (task-independent) and
“simple” (linear classifier, kernel machine, nearest neighbor, ...)
• The most common machine-learning architecture: the kernel machine
[Diagram: Pre-processing / Feature Extraction (mostly hand-crafted) →
Internal Representation → “Simple” Trainable Classifier]
45. Good representations are hierarchical
• In language: hierarchy in syntax and semantics
– Words → Parts of speech → Sentences → Text
– Objects, actions, attributes → Phrases → Statements → Stories
• In vision: part-whole hierarchy
– Pixels → Edges → Textons → Parts → Objects → Scenes
[Diagram: Trainable Feature Extractor → Trainable Feature Extractor →
Trainable Classifier]
46. “Deep” learning: learning hierarchical representations
• Deep learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to
object identities
• Representations become increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma
[Diagram: Trainable Feature Extractor → Trainable Feature Extractor →
Trainable Classifier (Learned Internal Representation)]
47. Do we really need deep architectures?
• We can approximate any function as closely as we want with a shallow
architecture. Why would we need deep ones?
– Kernel machines and 2-layer neural nets are “universal”.
• Deep machines are more efficient for representing certain classes of
functions, particularly those involved in visual recognition
– They can represent more complex functions with less “hardware”
• We need an efficient parameterization of the class of functions that
are useful for “AI” tasks.
48. Hierarchical/deep architectures for vision
• Multiple stages
• Each stage is composed of:
– A bank of local filters (convolutions)
– A non-linear layer (may include harsh non-linearities, such as
rectification, contrast normalization, etc.)
– A feature-pooling layer
• Multiple stages can be stacked to produce high-level representations
– Each stage makes the representation more global and more invariant
• The system can be trained with a combination of unsupervised and
supervised methods
[Diagram: (Filter Bank → Non-Linearity → Spatial Pooling) ×2 → Classifier]
49. Filtering + Non-Linearity + Pooling = one stage of a Convolutional Net
• [Hubel & Wiesel 1962]:
– Simple cells detect local features
– Complex cells “pool” the outputs of simple cells within a retinotopic
neighborhood
[Diagram: retinotopic feature maps; multiple convolutions (“simple cells”)
followed by pooling/subsampling (“complex cells”)]
50. Feature extraction by filtering and pooling
• Biologically inspired models of low-level feature extraction
– Inspired by [Hubel and Wiesel 1962]
• Only 3 types of operations are needed for traditional ConvNets:
– Convolution/filtering
– Pointwise non-linearity
– Pooling/subsampling
[Diagram: Filter Bank → Non-Linearity → Spatial Pooling]
51. Convolutional Network: Multi-Stage Trainable Architecture
• Hierarchical architecture
– Representations become more global, more invariant, and more abstract
as we go up the layers
• Alternated layers of filtering and spatial pooling
– Filtering detects conjunctions of features
– Pooling computes local disjunctions of features
• Fully trainable
– All the layers are trainable
[Diagram: Convolutions/Filtering → Pooling/Subsampling →
Convolutions/Filtering → Pooling/Subsampling → Convolutions/Filtering →
Convolutions/Classification]
52. Convolutional network architecture
53. Convolutional Nets and other Multistage Hubel-Wiesel Architectures
• Building a complete artificial vision system:
– Stack multiple stages of simple-cell / complex-cell layers
– Higher stages compute more global, more invariant features
– Stick a classification layer on top
– [Fukushima 1971-1982]: neocognitron
– [LeCun 1988-2007]: convolutional net
– [Poggio 2002-2006]: HMAX
– [Ullman 2002-2006]: fragment hierarchy
– [Lowe 2006]: HMAX
• QUESTION: How do we find (or learn) the filters?
54. Convolutional Net Training: Gradient Back-Propagation
59. Convolutional Net Architecture for Handwriting Recognition
[Architecture: input 1@32x32 → 5x5 convolution → Layer 1: 6@28x28 →
2x2 pooling/subsampling → Layer 2: 6@14x14 → 5x5 convolution →
Layer 3: 12@10x10 → 2x2 pooling/subsampling → Layer 4: 12@5x5 →
5x5 convolution → Layer 5: 100@1x1 → Layer 6: 10 outputs]
• Convolutional net for handwriting recognition (400,000 synapses)
• Convolutional layers (simple cells): all units in a feature plane share
the same weights
• Pooling/subsampling layers (complex cells): for invariance to small
distortions
• Supervised gradient-descent learning using back-propagation
• The entire network is trained end-to-end; all the layers are trained
simultaneously
[LeCun et al., Proc. IEEE, 1998]
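The layer sizes above follow from simple arithmetic, sketched here under the assumption of 'valid' convolutions (as in LeNet-5's convolutional layers):

```python
# Feature-map sizes through the handwriting net above: a 'valid' 5x5
# convolution shrinks an NxN map to (N-4)x(N-4), and 2x2 pooling halves it.
def conv(n, k=5):
    return n - k + 1   # 5x5 convolution

def pool(n):
    return n // 2      # 2x2 pooling/subsampling

sizes = [32]           # input: 1@32x32
for f in (conv, pool, conv, pool, conv):
    sizes.append(f(sizes[-1]))
print(sizes)  # [32, 28, 14, 10, 5, 1] -- matching Layers 1-5 above
```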
60. MNIST Handwritten Digit Dataset
• 60,000 training samples, 10,000 test samples
61. Results on MNIST Handwritten Digits

Classifier                        | Deformation | Preprocessing          | Error (%) | Reference
linear classifier (1-layer NN)    | none        | none                   | 12.00     | LeCun et al. 1998
linear classifier (1-layer NN)    | none        | deskewing              | 8.40      | LeCun et al. 1998
pairwise linear classifier        | none        | deskewing              | 7.60      | LeCun et al. 1998
K-nearest-neighbors (L2)          | none        | none                   | 3.09      | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2)          | none        | deskewing              | 2.40      | LeCun et al. 1998
K-nearest-neighbors (L2)          | none        | deskew, clean, blur    | 1.80      | Kenneth Wilder, U. Chicago
K-NN (L3), 2-pixel jitter         | none        | deskew, clean, blur    | 1.22      | Kenneth Wilder, U. Chicago
K-NN, shape context matching      | none        | shape context features | 0.63      | Belongie et al., IEEE PAMI 2002
40 PCA + quadratic classifier     | none        | none                   | 3.30      | LeCun et al. 1998
1000 RBF + linear classifier      | none        | none                   | 3.60      | LeCun et al. 1998
K-NN, Tangent Distance            | none        | subsample to 16x16     | 1.10      | LeCun et al. 1998
SVM, Gaussian kernel              | none        | none                   | 1.40      |
SVM, deg-4 polynomial             | none        | deskewing              | 1.10      | LeCun et al. 1998
Reduced-set SVM, deg-5 polynomial | none        | deskewing              | 1.00      | LeCun et al. 1998
Virtual SVM, deg-9 polynomial     | Affine      | none                   | 0.80      | LeCun et al. 1998
V-SVM, 2-pixel jittered           | none        | none                   | 0.68      | DeCoste & Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered           | none        | deskewing              | 0.56      | DeCoste & Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE           | none        | none                   | 4.70      | LeCun et al. 1998
2-layer NN, 300 HU, MSE           | Affine      | none                   | 3.60      | LeCun et al. 1998
2-layer NN, 300 HU                | none        | deskewing              | 1.60      | LeCun et al. 1998
3-layer NN, 500+150 HU            | none        | none                   | 2.95      | LeCun et al. 1998
3-layer NN, 500+150 HU            | Affine      | none                   | 2.45      | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg   | none        | none                   | 1.53      | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE            | none        | none                   | 1.60      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE            | Affine      | none                   | 1.10      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE           | Elastic     | none                   | 0.90      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE            | Elastic     | none                   | 0.70      | Simard et al., ICDAR 2003
Convolutional net LeNet-1         | none        | subsample to 16x16     | 1.70      | LeCun et al. 1998
Convolutional net LeNet-4         | none        | none                   | 1.10      | LeCun et al. 1998
Convolutional net LeNet-5         | none        | none                   | 0.95      | LeCun et al. 1998
Conv. net LeNet-5                 | Affine      | none                   | 0.80      | LeCun et al. 1998
Boosted LeNet-4                   | Affine      | none                   | 0.70      | LeCun et al. 1998
Conv. net, CE                     | Affine      | none                   | 0.60      | Simard et al., ICDAR 2003
Conv. net, CE                     | Elastic     | none                   | 0.40      | Simard et al., ICDAR 2003
63. Face Detection and Pose Estimation with Convolutional Nets
• Training: 52,850 32x32 grey-level images of faces, 52,850 non-faces.
• Each sample: used 5 times, with random variation in scale, in-plane rotation,
brightness, and contrast.
• 2nd phase: half of the initial negative set was replaced by false positives
of the initial version of the detector.
64. Face Detection: Results
Data Set →                 TILTED        PROFILE       MIT+CMU
False positives/image →    4.42   26.9   0.47   3.36   0.5    1.28
Our Detector               90%    97%    67%    83%    83%    88%
Jones & Viola (tilted)     90%    95%    x             x
Jones & Viola (profile)    x             70%    83%    x
Rowley et al               x                           89%    96%
Schneiderman & Kanade      x                           86%    93%
65. Face Detection and Pose Estimation: Results
66. Face Detection with a Convolutional Net
67. Industrial Applications of ConvNets (slide: Yann LeCun)
• AT&T/Lucent/NCR
– Check reading, OCR, handwriting recognition (deployed 1996)
• Vidient Inc
– Vidient Inc's “SmartCatch” system deployed in several airports
and facilities around the US for detecting intrusions, tailgating,
and abandoned objects (Vidient is a spin-off of NEC)
• NEC Labs
– Cancer cell detection, automotive applications, kiosks
• Google
– OCR, face and license plate removal from StreetView
• Microsoft
– OCR, handwriting recognition, speech detection
• France Telecom
– Face detection, HCI, cell phone-based applications
• Other projects: HRL (3D vision)....
68. Machine learning in computer vision
• Aug 12, Lecture 3: Neural Network
– Convolutional Nets for object recognition
– Unsupervised feature learning via Deep Belief Net
(slides courtesy of Honglak Lee (Stanford))
69. Machine Learning’s Success
• Data mining
– Web data mining
– Biomedical data mining
– Time series data mining
• Artificial Intelligence
– Computer vision
– Speech recognition
– Autonomous car driving
However, machine learning’s success has relied on having
a good feature representation of the data.
How can we develop good representations automatically?
70. The Learning Pipeline
Input → Learning Algorithm
[Plot: motorbikes vs. “non”-motorbikes in the raw input space
(axes: pixel 1, pixel 2)]
71. The Learning Pipeline
Input → Low-level features (“feature engineering”) → Learning Algorithm
[Plots: input space (axes: pixel 1, pixel 2) vs. feature space
(axes: “handle”, “wheel”), where motorbikes and “non”-motorbikes
become separable]
72. Computer vision features
SIFT Spin image
HoG RIFT
Textons GLOH
Drawbacks of feature engineering:
1. Needs expert knowledge
2. Time-consuming hand-tuning
73. Feature learning from unlabeled data
• Main idea
– Finding underlying structure (cause) or statistical
correlation from the input data.
• Sparse coding [Olshausen and Field, 1997]
– Objective: given input data {x}, search for a set of bases {b_j}
such that
    x = Σ_j a_j b_j
where the coefficients a_j are mostly zero.
74. Sparse coding on images
Natural Images Learned bases: “Edges”
= 0.8 * + 0.3 * + 0.5 *
x = 0.8 * b36
+ 0.3 * b42 + 0.5 * b65
[0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
New example
[Lee, Ng, et al. 2007]
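The reconstruction in this example is just a sparse weighted sum of bases. A minimal numpy sketch, in which the basis matrix and patch size are made-up stand-ins and only the three coefficients come from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
bases = rng.standard_normal((128, 196))   # 128 bases for 14x14 patches (illustrative sizes)

# A sparse code: all coefficients zero except three, as in the example above.
a = np.zeros(128)
a[36], a[42], a[65] = 0.8, 0.3, 0.5

# The patch is reconstructed as x = sum_j a_j * b_j, i.e. a weighted
# sum of a few "edge" bases.
x = a @ bases                             # same as 0.8*b36 + 0.3*b42 + 0.5*b65
print(np.allclose(x, 0.8 * bases[36] + 0.3 * bases[42] + 0.5 * bases[65]))  # True
```

The coefficient vector `a`, being mostly zeros, is the sparse feature representation passed on to a classifier.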
75. Optimization
Given input data {x(1), …, x(m)}, we want to find good
bases {b1, …, bn}:
    min_{a,b}  Σ_i || x^(i) − Σ_j a_j^(i) b_j ||²  +  β Σ_i Σ_j |a_j^(i)|
               (reconstruction error)                 (sparsity penalty)
    subject to ||b_j|| ≤ 1 for all j   (normalization constraint)
Solve by alternating minimization:
-- Keep b fixed, find optimal a.
-- Keep a fixed, find optimal b. [Lee, Ng, 2006]
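A toy numpy sketch of this alternating scheme. The inner solvers here (ISTA soft-thresholding for the a-step, least squares plus projection for the b-step) and the problem sizes are illustrative assumptions, not the algorithm of Lee & Ng:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))        # 50 inputs x^(i) of dimension 20
B = rng.standard_normal((20, 8))         # 8 bases b_j as columns
B /= np.linalg.norm(B, axis=0)           # enforce ||b_j|| <= 1
A = np.zeros((8, 50))                    # coefficients a^(i) as columns
beta = 0.1

def objective(B, A):
    return np.sum((X - B @ A) ** 2) + beta * np.abs(A).sum()

before = objective(B, A)
for _ in range(50):
    # a-step (B fixed): ISTA -- gradient steps on the reconstruction
    # error followed by soft-thresholding for the L1 penalty.
    L = 2.0 * np.linalg.norm(B.T @ B, 2)     # Lipschitz constant of the gradient
    t = 1.0 / L
    for _ in range(20):
        A = A - t * 2.0 * (B.T @ (B @ A - X))
        A = np.sign(A) * np.maximum(np.abs(A) - beta * t, 0.0)
    # b-step (A fixed): least squares, then project back onto ||b_j|| <= 1.
    B = X @ A.T @ np.linalg.pinv(A @ A.T)
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)

print(objective(B, A) < before)  # True: the objective decreases
```

Each half-step solves a convex subproblem, which is what makes alternating minimization tractable even though the joint problem is not convex.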
76. Image classification
• Evaluated on the Caltech-101 object category dataset.
[Pipeline: input image → learned bases → features (coefficients) →
classification algorithm (SVM)]
Previously reported results:
  Fei-Fei et al., 2004: 16%
  Berg et al., 2005: 17%
  Holub et al., 2005: 40%
  Serre et al., 2005: 35%
  Berg et al., 2005: 48%
  Lazebnik et al., 2006: 64%
  Varma et al., 2008: 78%
Algorithm                       | Accuracy
Baseline (Fei-Fei et al., 2004) | 16%
PCA                             | 37%
Our method                      | 47%  (36% error reduction)
77. Learning Feature Hierarchy
[Hierarchy: input image (pixels) → “sparse coding” (edges) →
higher layer (combinations of edges)]
• DBN (Hinton et al., 2006) with an additional sparseness constraint.
[Related work: Hinton, Bengio, LeCun, and others.]
78. Restricted Boltzmann Machine (RBM)
– Undirected, bipartite graphical model
– Inference is easy
– Training by approximating maximum likelihood
(Contrastive Divergence [Hinton, 2002])
visible nodes (data)
hidden nodes
The weights W (bases) encode the statistical relationship between h and v.
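A minimal sketch of RBM training with CD-1 on toy binary data. Biases are omitted for brevity, and the learning rate, data, and network sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data with two underlying patterns plus bit-flip noise.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]
data = np.abs(data - (rng.random(data.shape) < 0.05))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recon_error(v, W):
    """Mean squared error of a one-step reconstruction (using probabilities)."""
    return np.mean((v - sigmoid(sigmoid(v @ W) @ W.T)) ** 2)

W = 0.1 * rng.standard_normal((6, 4))   # weights (bases); biases omitted
err0 = recon_error(data, W)
for _ in range(300):
    # Contrastive Divergence (CD-1): one Gibbs step instead of a full
    # sample from the model distribution.
    ph0 = sigmoid(data @ W)                      # P(h=1 | v): inference is easy
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample hidden states
    pv1 = sigmoid(h0 @ W.T)                      # "reconstruction" of the visibles
    ph1 = sigmoid(pv1 @ W)
    # Approximate likelihood gradient: <v h>_data - <v h>_reconstruction
    W += 0.05 * (data.T @ ph0 - pv1.T @ ph1) / data.shape[0]

print(err0, recon_error(data, W))   # error before vs. after training
```

Inference is easy because the graph is bipartite: given v, the hidden units are conditionally independent, so `P(h|v)` factorizes into per-unit sigmoids.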
79. Deep Belief Network
• Deep Belief Network (DBN) [Hinton et al., 2006]
– Generative model with multiple hidden layers
– Successful applications
• Recognizing handwritten digits
• Learning motion capture data
• Collaborative filtering
– Bottom-up, layer-wise training
using Restricted Boltzmann machines
[Diagram: visible nodes V (data) → H1 → H2 → H3, with weights W1, W2, W3]
80. Sparse Restricted Boltzmann Machines [NIPS-2008]
• Main idea
– Constrain the hidden layer nodes to have “sparse”
average activation (similar to sparse coding)
– Regularize with a sparsity penalty:
      minimize_W  − Σ_i log P(v^(i))  +  λ Σ_j ( p − (1/m) Σ_i E[h_j | v^(i)] )²
  where the first term is the (negative) log-likelihood and the second
  penalizes the deviation of each hidden unit’s average activation from a
  target sparsity p.
81. Sparse representation from digits [NIPS 2008]
[Figures: training examples; learned sparse RBM bases resembling “pen-strokes”]
• Sparse RBMs often give readily interpretable features and good
discriminative power.
82. Learning object representations
• Learning objects and parts in images
• Large image patches contain interesting higher-level structures,
e.g., object parts and full objects.
83. Applying DBNs to large images
[Diagram: input image → W1 → W2 → W3 → faces?]
Problem:
– Typically, the input dimension is ~1,000 (30x30 pixels)
– It is computationally intractable to learn from realistic image sizes
(e.g., 200x200 pixels)
84. Convolutional architectures
• Weight sharing by convolution (e.g., [LeCun et al., 1989])
• “Max-pooling”:
– Invariance
– Computational efficiency
– Deterministic and feed-forward
• We develop a convolutional Restricted Boltzmann machine (CRBM).
• We define probabilistic max-pooling that combines bottom-up and
top-down information.
[Diagram: input → convolution → detection layer → max-pooling layer
(maximum over a 2x2 grid) → convolution → detection layer →
max-pooling layer]
85. Convolutional RBM (CRBM) [ICML 2009]
[Diagram: input data V (visible layer) → detection layer H →
max-pooling layer P]
• Hidden nodes are binary; “filter” weights W^k are shared.
• For “filter” k, at most one hidden node in each pooling block is active.
• Each “max-pooling” node is binary.
86. Probabilistic max-pooling
[Diagram: pooling node Y above detection nodes X1, X2, X3, X4]
• The X_j are stochastic binary and mutually exclusive.
• Collapses the 2^n configurations into n+1 configurations; permits
bottom-up and top-down inference.
Allowed configurations (Y : X1 X2 X3 X4):
  1 : 1 0 0 0
  1 : 0 1 0 0
  1 : 0 0 1 0
  1 : 0 0 0 1
  0 : 0 0 0 0
87. Probabilistic max-pooling: bottom-up inference
[Diagram: pooling node Y above detection nodes X1-X4; the bottom-up inputs
I1-I4 are the outputs of the convolution W*v from below]
• The posterior can be written as a softmax over the n+1 configurations;
sample from the resulting multinomial distribution.
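A small numeric sketch of this softmax; the bottom-up inputs I_j are made-up values, and the all-off configuration contributes an unnormalized weight of exp(0) = 1:

```python
import numpy as np

# Bottom-up inputs to the four detection units in one pooling block
# (outputs of the convolution W*v from below); values are illustrative.
I = np.array([1.2, -0.3, 0.5, 2.0])

# Softmax over the n+1 = 5 configurations: four mutually exclusive
# "one unit on" states plus the all-off state (pooling unit y = 0).
Z = 1.0 + np.sum(np.exp(I))
p_on = np.exp(I) / Z          # P(x_j = 1), mutually exclusive
p_off = 1.0 / Z               # P(all x_j = 0), i.e. y = 0

print(p_on.round(3), round(p_off, 3))
print(np.isclose(p_on.sum() + p_off, 1.0))  # True: a valid distribution
```

Because the five probabilities sum to one, a single multinomial draw simultaneously samples the detection units and the pooling unit, which is what allows top-down information to enter through the I_j terms.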
88. Convolutional Deep Belief Networks
• Bottom-up (greedy), layer-wise training
– Train one layer (convolutional RBM) at a time.
• Inference (approximate)
– Undirected connections for all layers (Markov net)
[Related work: Salakhutdinov and Hinton, 2009]
– Block Gibbs sampling or mean-field
– Hierarchical probabilistic inference
89. Unsupervised learning of object-parts
Faces Cars Elephants Chairs
90. Object category classification (Caltech-101)
• Our model is comparable to results using state-of-the-art features
(e.g., SIFT).
[Bar chart: test accuracy (40-75% range) for CDBN (first layer),
CDBN (first+second layers), Ranzato et al. (2007), Mutch and Lowe (2006),
Lazebnik et al. (2006), and Zhang et al. (2006)]
91. Unsupervised learning of object-parts
Trained from multiple classes
(cars, faces, motorbikes, airplanes)
Object-specific features
& shared features
“Grouping” the object parts
(highly specific)
92. Review of main contributions
Developed efficient algorithms for unsupervised feature learning.
Showed that unsupervised feature learning is useful for many
machine learning tasks.
Object recognition, image segmentation, audio classification, text
classification, robotic perception, and others.
[Figures: object parts; “filling in”]
93. Weaknesses & Criticisms
• Learning everything. Better to encode prior
knowledge about structure of images.
A: Compare with the machine-learning vs. linguists debate in NLP.
• Results not yet competitive with best
engineered systems.
A: Agreed. True for some domains.