2. What is (computer) vision?
• When we “see” something, what does it involve?
• Take a picture with a camera: it is just a bunch of colored dots (pixels)
• We want to make computers understand images
• Looks easy, but it is not…
[Figure: pipeline from an image (or video) through a sensing device and an interpreting device to interpretations, e.g. “corn / mature corn in a cornfield / plant / blue sky in the background,” etc.]
3. What is it related to?
[Diagram: computer vision at the intersection of related fields: biology, psychology, neuroscience, cognitive sciences, information engineering, computer science, robotics, speech, information retrieval, machine learning, physics, and maths.]
6. A picture is worth a thousand words.
--- Confucius
or Printers’ Ink Ad (1921)
7. A picture is worth a thousand words.
--- Confucius
or Printers’ Ink Ad (1921)
[Figure: low-level descriptions of image regions: horizontal lines, textured, blue on the top, vertical, white, shadow to the left, porous, oblique, large green patches]
8. A picture is worth a thousand words.
--- Confucius
or Printers’ Ink Ad (1921)
[Figure: high-level descriptions of the same scene: clear sky, autumn leaves, building, courtyard, bicycles, trees, campus, daytime, people talking, outdoor]
19. History: single object recognition
• Lowe, et al. 1999, 2003
• Mahamud and Hebert, 2000
• Ferrari, Tuytelaars, and Van Gool, 2004
• Rothganger, Lazebnik, and Ponce, 2004
• Moreels and Perona, 2005
•…
21. Object categorization:
the statistical viewpoint
p(zebra | image) vs. p(no zebra | image)
• Bayes rule:

  p(zebra | image)      p(image | zebra)      p(zebra)
  ------------------ = ------------------- · ------------
  p(no zebra | image)  p(image | no zebra)   p(no zebra)

  (posterior ratio)    (likelihood ratio)    (prior ratio)
22. Object categorization:
the statistical viewpoint
  p(zebra | image)      p(image | zebra)      p(zebra)
  ------------------ = ------------------- · ------------
  p(no zebra | image)  p(image | no zebra)   p(no zebra)

  (posterior ratio)    (likelihood ratio)    (prior ratio)

• Discriminative methods model the posterior
• Generative methods model the likelihood and the prior
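The ratio decision rule above can be made concrete with a few invented numbers; all probabilities below are illustrative, not from any real model.

```python
# Toy illustration of the posterior ratio:
# p(zebra|image)/p(no zebra|image) = likelihood ratio * prior ratio.

p_zebra = 0.01                  # prior: zebras are rare (assumed)
p_no_zebra = 1.0 - p_zebra
lik_zebra = 0.20                # p(image | zebra), assumed
lik_no_zebra = 0.001            # p(image | no zebra), assumed

prior_ratio = p_zebra / p_no_zebra
likelihood_ratio = lik_zebra / lik_no_zebra
posterior_ratio = likelihood_ratio * prior_ratio

# Decide "zebra" when the posterior ratio exceeds 1.
print(posterior_ratio)          # ~2.02: evidence outweighs the rare prior
```

Even with a strong prior against zebras, a large enough likelihood ratio flips the decision.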
23. Discriminative
• Direct modeling of p(zebra | image) / p(no zebra | image)
[Figure: decision boundary separating zebra from non-zebra examples]
24. Generative
• Model p(image | zebra) and p(image | no zebra)
[Figure: example images scored by each class-conditional model, ranging from high to middle to low values of p(image | zebra) and p(image | no zebra)]
25. Three main issues
• Representation
– How to represent an object category
• Learning
– How to form the classifier, given training data
• Recognition
– How the classifier is to be used on novel data
28. Representation
– Generative / discriminative / hybrid
– Appearance only, or location and appearance
– Invariances
  • Viewpoint
  • Illumination
  • Occlusion
  • Scale
  • Deformation
  • Clutter
  • etc.
29. Representation
– Generative / discriminative / hybrid
– Appearance only, or location and appearance
– Invariances
– Part-based, or global w/sub-window
30. Representation
– Generative / discriminative / hybrid
– Appearance only, or location and appearance
– Invariances
– Part-based, or global w/sub-window
– Use a set of features, or each pixel in the image
31. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
32. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
– Methods of training: generative vs. discriminative
[Figure: class-conditional densities p(x|C1) and p(x|C2) on the left; the corresponding posterior probabilities p(C1|x) and p(C2|x) on the right]
33. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
– What are you maximizing? Likelihood (generative) or performance on a train/validation set (discriminative)
– Level of supervision
  • Manual segmentation; bounding box; image labels; noisy labels
[Figure: a training image labeled only as “contains a motorbike”]
34. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
– What are you maximizing? Likelihood (generative) or performance on a train/validation set (discriminative)
– Level of supervision
  • Manual segmentation; bounding box; image labels; noisy labels
– Batch/incremental (on category and image level; user feedback)
35. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
– What are you maximizing? Likelihood (generative) or performance on a train/validation set (discriminative)
– Level of supervision
  • Manual segmentation; bounding box; image labels; noisy labels
– Batch/incremental (on category and image level; user feedback)
– Training images:
  • Issue of overfitting
  • Negative images for discriminative methods
36. Learning
– It is unclear how to model categories, so we learn what distinguishes them rather than manually specifying the difference -- hence the current interest in machine learning
– What are you maximizing? Likelihood (generative) or performance on a train/validation set (discriminative)
– Level of supervision
  • Manual segmentation; bounding box; image labels; noisy labels
– Batch/incremental (on category and image level; user feedback)
– Training images:
  • Issue of overfitting
  • Negative images for discriminative methods
– Priors
40. Analogy to documents
[Figure: two example documents side by side, each summarized by its most characteristic words. Left: a passage on visual perception (key words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel), describing how Hubel and Wiesel showed that the retinal image is not simply projected onto the cortex like a movie screen, but undergoes a step-wise analysis in a system of nerve cells stored in columns. Right: a news item on China’s trade surplus (key words: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value), forecasting a rise to $90bn–$100bn and noting US complaints that the yuan is deliberately undervalued.]
53. 2 case studies
1. Naïve Bayes classifier
– Csurka et al. 2004
2. Hierarchical Bayesian text models
(pLSA and LDA)
– Background: Hofmann 2001, Blei et al. 2004
– Object categorization: Sivic et al. 2005, Sudderth et
al. 2005
– Natural scene categorization: Fei-Fei et al. 2005
54. First, some notations
• wn: the nth patch in an image
  – wn = [0,0,…,1,…,0,0]T (one-hot codeword indicator)
• w: the collection of all N patches in an image
  – w = [w1, w2, …, wN]
• dj: the jth image in an image collection
• c: category of the image
• z: theme or topic of the patch
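As a sketch, the notation above in code; the codebook size and the codeword assignments are arbitrary choices for illustration.

```python
import numpy as np

# Each patch w_n is a one-hot indicator over the codebook, and an image
# is the collection of its N patches.

V = 5                                   # codebook size (assumed)

def one_hot(index, size):
    w = np.zeros(size, dtype=int)
    w[index] = 1
    return w

# An image whose 3 patches were assigned codewords 2, 0, 2:
w = np.stack([one_hot(i, V) for i in [2, 0, 2]])   # shape (N, V)
print(w.sum(axis=0))     # codeword histogram: [1 0 2 0 0]
```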
55. Case #1: the Naïve Bayes model
[Graphical model: c → w, plate over N]

  c* = arg maxc p(c | w) ∝ p(c) p(w | c) = p(c) ∏n=1..N p(wn | c)

  (object class decision; prior prob. of the object classes; image likelihood given the class)
Csurka et al. 2004
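A minimal sketch of this decision rule, assuming a small invented codebook and per-class codeword counts; the class names, counts, and Laplace smoothing are illustrative choices, not taken from Csurka et al.

```python
import numpy as np

# Naive Bayes over bag-of-codewords histograms:
# c* = argmax_c p(c) * prod_n p(w_n | c).

def train_nb(counts_by_class, alpha=1.0):
    """counts_by_class: class -> (n_images, codeword count vector)."""
    priors, likelihoods = {}, {}
    total = sum(n for n, _ in counts_by_class.values())
    for c, (n_images, counts) in counts_by_class.items():
        priors[c] = n_images / total
        counts = np.asarray(counts, dtype=float) + alpha   # Laplace smoothing
        likelihoods[c] = counts / counts.sum()             # p(codeword | c)
    return priors, likelihoods

def classify_nb(priors, likelihoods, w):
    """w: codeword histogram of a test image; log space for stability."""
    scores = {c: np.log(priors[c]) + np.dot(w, np.log(likelihoods[c]))
              for c in priors}
    return max(scores, key=scores.get)

# Two hypothetical classes over a 4-word codebook.
data = {"zebra": (10, [40, 5, 3, 2]), "background": (30, [5, 30, 30, 35])}
priors, liks = train_nb(data)
print(classify_nb(priors, liks, np.array([8, 1, 0, 1])))   # zebra
```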
56. Case #2: Hierarchical Bayesian
text models
Probabilistic Latent Semantic Analysis (pLSA)
[Graphical model: d → z → w, plate over N patches, outer plate over D images]
Hofmann, 2001
Latent Dirichlet Allocation (LDA)
[Graphical model: c, π → z → w, plate over N, outer plate over D]
Blei et al., 2001
57. Case #2: Hierarchical Bayesian
text models
Probabilistic Latent Semantic Analysis (pLSA)
[Graphical model: d → z → w, plate over N, outer plate over D; discovered topic: “face”]
Sivic et al. ICCV 2005
58. Case #2: Hierarchical Bayesian
text models
Latent Dirichlet Allocation (LDA)
[Graphical model: c, π → z → w, plate over N, outer plate over D; discovered scene category: “beach”]
Fei-Fei et al. ICCV 2005
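For intuition, a minimal sketch of the EM updates behind the d → z → w (pLSA) model above, run on a random word-by-document count matrix. The topic count, the data, and the iteration count are arbitrary; this is not the cited authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_docs, n_topics = 20, 8, 2
counts = rng.integers(0, 5, size=(n_words, n_docs)).astype(float)

# Random positive initialization of p(w|z) and p(z|d), column-normalized.
p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(axis=0)
p_z_d = rng.random((n_topics, n_docs)); p_z_d /= p_z_d.sum(axis=0)

for _ in range(50):
    # E-step: responsibilities p(z | w, d), shape (words, topics, docs)
    joint = p_w_z[:, :, None] * p_z_d[None, :, :]
    post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
    # M-step: re-estimate both factors from expected counts
    expected = counts[:, None, :] * post
    p_w_z = expected.sum(axis=2); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = expected.sum(axis=0); p_z_d /= p_z_d.sum(axis=0, keepdims=True)

# Both factors remain valid conditional distributions after training.
print(p_w_z.sum(axis=0), p_z_d.sum(axis=0))
```

In the vision setting, "documents" are images and "words" are quantized patches, exactly as in the analogy on the earlier slides.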
61. Invariance issues
• Scale and rotation
• Occlusion
– Implicit in the models
– Codeword distribution: small variations
– (In theory) Theme (z) distribution: different
occlusion patterns
62. Invariance issues
• Scale and rotation
• Occlusion
• Translation
– Encode (relative) location information
Sudderth et al. 2005
63. Invariance issues
• Scale and rotation
• Occlusion
• Translation
• Viewpoint (in theory)
– Codewords: detector
and descriptor
– Theme distributions:
different view points
Fergus et al. 2005
64. Model properties
• Intuitive
  – Analogy to documents
[Figure: the visual-perception passage from the earlier “analogy to documents” slide, with its characteristic words (sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel) highlighted]
65. Model properties
• Intuitive
• (Could use) generative models
  – Convenient for weakly- or un-supervised training
  – Prior information
  – Hierarchical Bayesian framework
Sivic et al., 2005; Sudderth et al., 2005
66. Model properties
• Intuitive
• (Could use) generative models
• Learning and recognition are relatively fast
  – Compared to other methods
67. Weakness of the model
• No rigorous geometric information about the object components
• It is intuitive to most of us that objects are made of parts, yet the model carries no such information
• Not yet extensively tested for
  – Viewpoint invariance
  – Scale invariance
• Segmentation and localization unclear
68. part-based models
Slides for “part-based models” courtesy of Rob Fergus
70. P. Bruegel, 1562
One-shot learning of object categories
Fei-Fei et al. ‘03, ‘04, ‘06
71. model representation
One-shot learning of object categories
Fei-Fei et al. ‘03, ‘04, ‘06
72. X (location): (x,y) coords. of the region center
A (appearance): each patch is normalized to an 11x11 window, then projected onto a PCA basis (coefficients c1, c2, …, c10)
73. The Generative Model
[Graphical model: parameters μX, ΓX (location) and μA, ΓA (appearance); hidden assignment h; observed X and A, inside the image plate I]
X (location): (x,y) coords. of the region center
A (appearance): each patch is normalized to an 11x11 window, then projected onto a PCA basis (coefficients c1, c2, …, c10)
74. The Generative Model
[Graphical model: μX, ΓX, μA, ΓA are parameters; h is a hidden variable; X and A are observed variables, inside the image plate I]
75. The Generative Model
ML/MAP: a single point estimate of the parameters
[Figure: the graphical model as before, with one point estimate chosen from candidate parameter settings θ1, θ2, …, θn]
where θ = {µX, ΓX, µA, ΓA}
Weber et al. ’98, ’00; Fergus et al. ’03
76. The Generative Model
ML/MAP
– μX, ΓX: shape model
– μA, ΓA: appearance model
where θ = {µX, ΓX, µA, ΓA}
78. The Generative Model
Bayesian
[Graphical model: hyperpriors {m0X, β0X, a0X, B0X} and {m0A, β0A, a0A, B0A} over the parameters μX, ΓX and μA, ΓA; hidden h; observed X and A inside the image plate I; outer plate P]
Parameters to estimate: {mX, βX, aX, BX, mA, βA, aA, BA}, i.e. the parameters of a Normal-Wishart distribution
Fei-Fei et al. ‘03, ‘04, ‘06
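As a one-dimensional sketch of the conjugate machinery named above: the Normal-Gamma prior on a (mean, precision) pair is the scalar analogue of the Normal-Wishart, and its posterior update has a closed form. The hyperparameter values and data below are invented for illustration.

```python
import numpy as np

def normal_gamma_update(m0, beta0, a0, b0, x):
    """Posterior hyperparameters of a Normal-Gamma prior after seeing x."""
    n = len(x)
    xbar = np.mean(x)
    ss = np.sum((x - xbar) ** 2)
    beta_n = beta0 + n
    m_n = (beta0 * m0 + n * xbar) / beta_n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * ss + 0.5 * beta0 * n * (xbar - m0) ** 2 / beta_n
    return m_n, beta_n, a_n, b_n

# A broad prior plus only three observations: even a handful of samples
# moves the posterior mean most of the way to the sample mean, which is
# the mechanism one-shot learning exploits with informative priors.
x = np.array([4.8, 5.1, 5.3])
m_n, beta_n, a_n, b_n = normal_gamma_update(m0=0.0, beta0=0.1, a0=1.0, b0=1.0, x=x)
print(m_n)   # ~4.90, pulled from the prior mean 0 toward the sample mean ~5.07
```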
79. The Generative Model
[Graphical model as before: {m0X, β0X, a0X, B0X, m0A, β0A, a0A, B0A} are priors; μX, ΓX, μA, ΓA are parameters]
Fei-Fei et al. ‘03, ‘04, ‘06
80. The Generative Model
[Graphical model as before, highlighting the prior distribution over the parameters]
Fei-Fei et al. ‘03, ‘04, ‘06
81. One-shot learning of object categories
[Outline: 1. human vision; 2. model representation; 3. learning & inferences; 4. evaluation & dataset & application]
82. learning & inferences
No labeling, no segmentation, no alignment
One-shot learning of object categories (Fei-Fei et al. 2003, 2004, 2006)
83. learning & inferences
Bayesian
[Figure: posterior distribution over candidate parameter settings θ1, θ2, …, θn]
One-shot learning of object categories (Fei-Fei et al. 2003, 2004, 2006)
84. Variational EM
[Diagram: random initialization → alternating E-step and M-step → new estimate of p(θ|train), combined with prior knowledge of p(θ)]
Attias, Jordan, Hinton, etc.
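The loop in the diagram can be sketched on a toy model; here plain EM on a 1-D mixture of two Gaussians with fixed unit variances stands in for the variational version, and the data, seed, and initialization are arbitrary.

```python
import numpy as np

# Random initialization, then alternate E- and M-steps, as in the diagram.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu = rng.normal(0, 1, size=2)          # random initialization
pi = np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibilities of each component for each point
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)
    r = np.exp(logp - logp.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)
    # M-step: re-estimate mixture weights and means
    pi = r.mean(0)
    mu = (r * x[:, None]).sum(0) / r.sum(0)

print(sorted(mu))   # component means recovered near -2 and 3
```

The variational version replaces the point estimate of θ with a distribution, updated against the prior p(θ), but the alternating structure is the same.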
85. evaluation & dataset
One-shot learning of object categories (Fei-Fei et al. 2004, 2006a, 2006b)
86. evaluation & dataset -- Caltech 101 Dataset
One-shot learning of object categories (Fei-Fei et al. 2004, 2006a, 2006b)
87. evaluation & dataset -- Caltech 101 Dataset
One-shot learning of object categories (Fei-Fei et al. 2004, 2006a, 2006b)
89. Discriminative methods
Object detection and recognition is formulated as a classification problem: the image is partitioned into a set of overlapping windows, and a decision is taken at each window about whether it contains the target object or not.
[Figure: “Where are the screens?” A bag of image patches mapped into some feature space, with a decision boundary separating computer screens from background]
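The window-enumeration step described above can be sketched as follows; the window size, stride, and the stand-in scoring function are illustrative assumptions, not a trained classifier.

```python
# Enumerate overlapping windows and run a classifier on each.

def sliding_windows(width, height, win=64, stride=16):
    """Yield (x, y, win, win) boxes covering the image with overlap."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)

def detect(score_fn, width, height, threshold=0.5):
    """Return all windows the classifier labels as containing the object."""
    return [box for box in sliding_windows(width, height)
            if score_fn(box) > threshold]

# Hypothetical scorer: pretend the object sits near (100, 50).
fake_score = lambda b: 1.0 if abs(b[0] - 100) < 20 and abs(b[1] - 50) < 20 else 0.0
hits = detect(fake_score, 320, 240)
print(len(hits))   # several overlapping windows fire near the object
```

Because neighboring windows overlap, a real detector follows this with non-maximum suppression to keep one box per object.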
90. Discriminative vs. generative
• Generative model (“the artist”)
• Discriminative model (“the lousy painter”)
• Classification function
[Figure: the three models plotted over the same 1-D data x: a class-conditional density, a posterior probability between 0 and 1, and a hard ±1 classification function]
92. Formulation
• Formulation: binary classification
  Features: x = x1, x2, …, xN (training), xN+1, xN+2, …, xN+M (test)
  Labels: y = -1, +1, -1, -1, … (known for the training data); ? (unknown for the test data)
  Training data: each image patch is labeled as containing the object or background
• Classification function F(x), where F belongs to some family of functions
• Minimize the misclassification error
  (Not that simple: we need some guarantee that there will be generalization)
93. Overview of section
• Object detection with classifiers
• Boosting
– Gentle boosting
– Weak detectors
– Object model
– Object detection
• Multiclass object detection
94. Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; requires no external optimization tools
95. Boosting
• Defines a classifier using an additive model:
  H(x) = Σt αt ht(x)
  (H: strong classifier; ht: weak classifiers; αt: weights; x: feature vector)
96. Boosting
• Defines a classifier using an additive model:
  H(x) = Σt αt ht(x)
  (H: strong classifier; ht: weak classifiers; αt: weights; x: feature vector)
• We need to define a family of weak classifiers from which each ht is chosen
97. Boosting
• It is a sequential procedure:
[Figure: data points xt in 2-D]
Each data point has a class label:
  yt = +1 or -1
and a weight:
  wt = 1
98. Toy example
Weak learners from the family of lines
Each data point has a class label:
  yt = +1 or -1
and a weight:
  wt = 1
h ⇒ p(error) = 0.5: it is at chance
99. Toy example
Each data point has a class label:
  yt = +1 or -1
and a weight:
  wt = 1
This one seems to be the best.
This is a ‘weak classifier’: it performs slightly better than chance.
100. Toy example
Each data point has a class label:
  yt = +1 or -1
We update the weights:
  wt ← wt exp{-yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
101. Toy example
Each data point has a class label:
  yt = +1 or -1
We update the weights:
  wt ← wt exp{-yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
102. Toy example
Each data point has a class label:
  yt = +1 or -1
We update the weights:
  wt ← wt exp{-yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
103. Toy example
Each data point has a class label:
  yt = +1 or -1
We update the weights:
  wt ← wt exp{-yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
104. Toy example
[Figure: four weak linear classifiers f1, f2, f3, f4]
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
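The whole toy example, weak learners, weight update, and weighted combination, can be sketched with decision stumps on 1-D data. AdaBoost-style weights are assumed and the dataset is invented; the slides' own toy example uses lines in 2-D.

```python
import numpy as np

# Decision-stump weak learners; after each round the weights are updated
# so the chosen stump is at chance on the reweighted problem.
x = np.array([-3., -2., -1.5, -1., 1., 1.5, 2., 3.])
y = np.array([-1, -1, -1, 1, 1, 1, 1, -1])   # not linearly separable

w = np.ones_like(x) / len(x)
stumps = []
for _ in range(5):
    best = None
    for thr in x:                      # candidate thresholds at data points
        for sign in (1, -1):
            pred = sign * np.where(x >= thr, 1, -1)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    stumps.append((alpha, thr, sign))
    h = sign * np.where(x >= thr, 1, -1)
    w *= np.exp(-alpha * y * h)        # the weight update on the slides
    w /= w.sum()

# Strong classifier: weighted vote of the weak (threshold) classifiers.
H = np.sign(sum(a * s * np.where(x >= t, 1, -1) for a, t, s in stumps))
print((H == y).all())   # True: the combination fits the training set
```

No single stump separates this data, but the weighted combination does, which is exactly the point of the figure above.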
105. From images to features:
Weak detectors
We will now define a family of visual features that can be used as weak classifiers (“weak detectors”). A weak detector takes an image as input and outputs a binary response.
106. Weak detectors
Textures of textures (Tieu and Viola, CVPR 2000)
Every combination of three filters generates a different feature.
This gives thousands of features. Boosting selects a sparse subset, so computations at test time are very efficient. Boosting also avoids overfitting to some extent.
107. Weak detectors
Haar filters and the integral image (Viola and Jones, ICCV 2001)
The average intensity in a block is computed with four sums, independently of the block size.
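The four-sum trick can be sketched directly; this is a generic integral-image routine written for illustration, not Viola and Jones's code.

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sum so lookups need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def block_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four corner lookups, any block size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(block_sum(ii, 1, 1, 3, 3))        # 30.0 == img[1:3, 1:3].sum()
```

A Haar filter response is then just a signed combination of a few such block sums, so each weak detector costs a handful of lookups regardless of scale.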
109. Weak detectors
Part-based: similar to part-based generative models. We create weak detectors by using parts and voting for the object center location.
[Figure: car model and screen model]
These features are used for the detector on the course web site.
110. Weak detectors
First we collect a set of part templates from a set of training objects.
Vidal-Naquet and Ullman, 2003
[Figure: example part templates]
111. Weak detectors
We now define a family of “weak detectors”:
[Figure: part template * image = response map, better than chance]
112. Weak detectors
We can do a better job using filtered images:
[Figure: filtered template * filtered image = response map; still a weak detector, but better than before]
113. Training
First we evaluate all the N features on all the training images. Then we sample the feature outputs at the object center and at random locations in the background.
114. Representation and object model
Selected features for the screen detector
[Figure: selected features for the screen detector, features 1, 2, 3, 4, …, 10, …, 100]