1. Novel Approaches to Natural Scene Categorization
Amit Prabhudesai
Roll No. 04307002
amitp@ee.iitb.ac.in
M.Tech Thesis Defence
Under the guidance of
Prof. Subhasis Chaudhuri
Indian Institute of Technology, Bombay
Natural Scene Categorization – p.1/32
2. Overview of topics to be covered
• Natural Scene Categorization: Challenges
• Our contribution
◦ Qualitative visual environment description
• Portable, real-time system to aid the visually impaired
• System has peripheral vision!
◦ Model-based approaches
• Use of stochastic models to capture semantics
• pLSA and maximum entropy models
• Conclusions and Future Work
Natural Scene Categorization – p.2/32
3. Natural Scene Categorization
• Interesting application of a CBIR system
• Images from a broad image domain: diverse and often
ambiguous
• Bridging the semantic gap
• Grouping scenes into semantically meaningful categories
could aid further retrieval
• Efficient schemes for grouping images into semantic
categories
Natural Scene Categorization – p.3/32
4. Qualitative Visual Environment Retrieval
[Figure: omnidirectional view of the environment with regions labelled BUILDING, SKY, LAWN, WOODS and WATER BODY; sectors FR, LT, RT, LB, RB and positions P1–P3]
• Use of omnidirectional images
• Challenges
◦ Unstructured environment
◦ No prior learning (unlike navigation/localization)
• Target application and objective
◦ Wearable computing community, emphasis on visually
challenged people
◦ Real-time operation
Natural Scene Categorization – p.4/32
6. System Overview (contd.)
• Environment representation
◦ Image database containing images belonging to 6
classes: Lawns(L), Woods(W), Buildings(B),
Waterbodies(H), Roads(R) and Traffic(T)
◦ Moderately large intra-class variance (in the feature
space) in images of each category
◦ Description relative to the person using the system: e.g.,
‘to left of’, ‘in the front’, etc.
◦ Topological relationships indicated by a graph
◦ Each node annotated by an identifier associated with a
class
Natural Scene Categorization – p.6/32
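The graph representation above can be sketched as follows. This is a minimal illustration only: the node names, directions and dictionary layout are hypothetical, not the system's actual data structures.

```python
# Class identifiers as defined for the image database.
CLASSES = {"L": "Lawns", "W": "Woods", "B": "Buildings",
           "H": "Waterbodies", "R": "Roads", "T": "Traffic"}

# Hypothetical annotated graph: each node carries a class identifier,
# a direction relative to the user, and its topological neighbours.
graph = {
    "P1": {"label": "H", "direction": "in front", "neighbours": ["P2"]},
    "P2": {"label": "W", "direction": "to left of", "neighbours": ["P1", "P3"]},
    "P3": {"label": "B", "direction": "to right of", "neighbours": ["P2"]},
}

def describe(node):
    """Render a node as a qualitative, user-relative description."""
    info = graph[node]
    return f"{CLASSES[info['label']]} {info['direction']}"
```

Traversing the neighbour lists then yields the topological relationships between scene regions.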
10. System Overview (contd.)
• Environment Retrieval
◦ Node annotation
• Objective: Robust retrieval against illumination
changes and intra-class variations
• Solution: Annotation decided by a simple voting
scheme
◦ Dynamic node annotation
• Temporal evolution of graph Gn with time tn
• Complete temporal evolution of the graph given by G,
obtained by concatenating the subgraphs Gn ,
i.e.,G = {G1 , G2 , . . . , Gk , . . .}
Natural Scene Categorization – p.8/32
12. System Overview (contd.)
• Environment Retrieval
◦ Real-time operation
• Colour histogram: compact feature vector
• Pre-computed histograms of all the database images
• Linear time complexity (O(N)): on a P-IV 2.0 GHz, ∼100 ms
for a single omnicam image
◦ Portable, low-cost system for visually impaired
• Modest hardware and software requirements
• Easily put together using off-the-shelf components
Natural Scene Categorization – p.9/32
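The retrieval step above (pre-computed global colour histograms compared against the query, with the annotation decided by voting) can be sketched as follows. The function names and the nearest-neighbour vote are assumptions for illustration, not the exact scheme used in the system.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Global colour histogram of an HxWx3 uint8 image, L1-normalized."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def annotate(query_hist, db_hists, db_labels, k=5):
    """Label a query by majority vote among the k most similar database
    histograms, using histogram intersection as the similarity measure."""
    sims = [np.minimum(query_hist, h).sum() for h in db_hists]
    top = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in top:
        votes[db_labels[i]] = votes.get(db_labels[i], 0) + 1
    return max(votes, key=votes.get)
```

Because the database histograms are pre-computed, only the comparison loop runs online, which is what keeps the per-image cost linear in the database size.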
13. System Overview (contd.)
• Results
◦ Cylindrical concentric mosaics
Natural Scene Categorization – p.10/32
18. System Overview (contd.)
• Results
◦ Omnivideo sequence
[Figure: node annotations (W, B; X = unlabelled) for the omnivideo sequence in the forward and backward directions, with ground-truth labels R/L plotted against the frame index n (1–25)]
Natural Scene Categorization – p.12/32
21. Analyzing our results
• System accuracy: close to 70%; this is not enough!
• Some scenes are inherently ambiguous!
• Often the second best class is the correct class
• Limitations
1. Limited discriminating power of global colour histogram
(GCH)
2. Local colour histogram (LCH) based on tiling cannot be
used
3. Each frame analyzed independently
• Possible solutions
1. Adding memory to the system
2. Clustering scheme before computing similarity measure
Natural Scene Categorization – p.13/32
23. Method I. Adding memory to the system
• System uses only the current observation in labeling
• Better to use all observations up to the current one
• Desired: A recursive implementation to calculate the
posterior (should be able to do it in real-time!)
• Hidden Markov Model: Parameter estimation using Kevin
Murphy’s HMM toolkit
• Challenges
1. Estimating the transition matrix; a possible solution is
to use a limited set of classes
2. Enormous training data required
Natural Scene Categorization – p.14/32
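The recursive posterior computation that the HMM provides can be sketched as a single Bayes-filter step. This is illustrative only; the function name and the toy numbers are assumptions, not the toolkit's interface.

```python
import numpy as np

def forward_step(belief, A, likelihood):
    """One recursive Bayes update over scene classes:
    predict with the transition matrix A, then reweight the prediction
    by the likelihood of the current observation."""
    predicted = A.T @ belief          # propagate yesterday's belief
    posterior = predicted * likelihood
    return posterior / posterior.sum()
```

With a "sticky" transition matrix, an ambiguous frame leaves the belief close to the previous class, which is precisely the memory effect being sought.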
24. Adding memory. . . (Results)
• Improved confidence in the results. However, negligible
improvement in the accuracy
• Reasons for poor performance
◦ Limited number of transitions between categories (as opposed
to locations)
◦ Typical training data for HMMs runs to thousands of labels:
difficult to collect such vast data
• Limitation: makes the system dependent on the training
sequence
Natural Scene Categorization – p.15/32
25. Method II. Preclustering the image
• Presence of clutter, images from a broad domain
• Premise: The part of the image indicative of the semantic
category forms a distinct part in the feature space
Some test images belonging to the ‘Water-bodies’ category
• Possible solution: segment out the clutter in the scene
Natural Scene Categorization – p.16/32
28. Preclustering the image. . .
• K-means clustering of the image
• Use only pixels from the largest cluster to compute the
colour histogram
Results of K-means clustering on the test images
• Results
◦ Accuracy improves significantly: for the ‘water-bodies’ class,
from 25% to about 72%
• Limitation: what about, say, a traffic scene?
Natural Scene Categorization – p.17/32
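The preclustering step above can be sketched as follows: run K-means on the pixel colours and keep only the largest cluster for histogram computation. The deterministic seeding and the function name are simplifications for this sketch, not the actual implementation.

```python
import numpy as np

def largest_cluster_mask(image, k=3, iters=10):
    """K-means on pixel colours; returns a boolean mask selecting the
    largest cluster, whose pixels alone feed the colour histogram."""
    pixels = image.reshape(-1, 3).astype(float)
    # deterministic init for the sketch: evenly spaced pixels as seeds
    centres = pixels[np.linspace(0, len(pixels) - 1, k).astype(int)]
    for _ in range(iters):
        dists = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centres[j] = pixels[assign == j].mean(0)
    biggest = np.bincount(assign, minlength=k).argmax()
    return (assign == biggest).reshape(image.shape[:2])
```

The mask then replaces the full image when computing the global colour histogram, so clutter pixels no longer dilute the feature vector.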
29. Model-based approaches
• Stochastic models used to learn semantic concepts from
training images
• Use of normal perspective images
• Use of local image features
• Two models examined
1. probabilistic Latent Semantic Analysis (pLSA)
2. Maximum entropy models
• Use of the ‘bag of words’ approach
Natural Scene Categorization – p.18/32
31. Bag of words approach
• Local features more robust to occlusions and spatial
variations
• Image represented as a collection of local patches
• Image patches are members of a learned (visual)
vocabulary
• Positional relationships not considered!
• Data representation by a co-occurrence matrix
• Notation
◦ D = {d1 , . . . , dN } : corpus of documents
◦ W = {w1 , . . . , wM } : dictionary of words
◦ Z = {z1 , . . . , zK } : (latent) topic variables
◦ N = {n(w, d)}: co-occurrence table
Natural Scene Categorization – p.19/32
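With the notation above, the co-occurrence table N = {n(w, d)} can be built as follows. This is a minimal sketch; the input format (each image as a list of visual-word indices) is an assumption.

```python
import numpy as np

def cooccurrence_matrix(docs, vocab_size):
    """Build n(w, d): the count of each visual word w in each image d.
    `docs` is a list of images, each a list of visual-word indices."""
    N = np.zeros((vocab_size, len(docs)), dtype=int)
    for d, words in enumerate(docs):
        for w in words:
            N[w, d] += 1
    return N
```

Note that only counts survive: all positional relationships between patches are discarded, as the 'bag of words' assumption requires.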
34. pLSA model . . .
• Generative model
◦ select a document d with probability P (d)
◦ select a latent class z with probability P (z|d)
◦ select a word w with probability P (w|z)
• Joint observation probability
P(d, w) = P(d) P(w|d), where
P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)
• Modeling assumptions
1. Observation pairs (d, w) generated independently
2. Conditional independence assumption
P (w, d|z) = P (w|z)P (d|z)
Natural Scene Categorization – p.20/32
36. pLSA model . . .
• Model fitting
◦ Maximize the log-likelihood function
L = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w)
◦ Minimizing the KL divergence between the empirical
distribution and the model
◦ EM algorithm to learn model parameters
• Evaluating model on unseen test images
◦ P (w|z) and P (z|d) learned from the training dataset
◦ ‘Fold-in’ heuristic for categorization: learned factors
P (w|z) are kept fixed, mixing coefficients P (z|dtest ) are
estimated using the EM iterations
Natural Scene Categorization – p.21/32
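The EM fitting described above can be sketched compactly as follows. This is an illustration, not the actual implementation used in the experiments; note the random initialization, which is exactly what makes different runs land in different local optima.

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """EM for pLSA on a word-document count matrix N (M words x D docs).
    Returns P(w|z) as an (M, K) array and P(z|d) as a (K, D) array."""
    M, D = N.shape
    rng = np.random.default_rng(seed)
    Pw_z = rng.random((M, K)); Pw_z /= Pw_z.sum(0, keepdims=True)
    Pz_d = rng.random((K, D)); Pz_d /= Pz_d.sum(0, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d)
        Pz_dw = Pw_z[:, None, :] * Pz_d.T[None, :, :]   # (M, D, K)
        Pz_dw /= Pz_dw.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate factors from expected counts n(d,w) P(z|d,w)
        expected = N[:, :, None] * Pz_dw                # (M, D, K)
        Pw_z = expected.sum(1)                          # (M, K)
        Pw_z /= Pw_z.sum(0, keepdims=True) + 1e-12
        Pz_d = expected.sum(0).T                        # (K, D)
        Pz_d /= Pz_d.sum(0, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```

The fold-in heuristic then keeps the learned P(w|z) fixed and runs only the P(z|d_test) updates on a new image.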
37. pLSA model . . .
• Details of experiment to evaluate model
◦ 5 categories: houses, forests, mountains, streets and
beaches
◦ Image dataset: COREL photo CDs, images from internet
search engines, and personal image collections
◦ 100 images of each category
◦ Modifications in Rob Fergus’s code for the experiments
◦ 128-dim SIFT feature used to represent a patch
◦ Visual codebook with 125 entries
• Image annotation
ẑ = arg max_i P(z_i | d_test)
Natural Scene Categorization – p.22/32
39. pLSA model. . . Results
• 50 runs of the experiment: with random partitioning on each
run
• Vastly different accuracy on different runs: best case ∼ 46%,
and worst case 5%
• Analysis of the results
◦ Confusion matrix gives us further insights
◦ Most of the labeling errors occur between houses and
streets
◦ Ambiguity between mountains and forests
Natural Scene Categorization – p.23/32
40. Results using the pLSA model
Figure: Some images that were wrongly annotated by our system
Natural Scene Categorization – p.24/32
41. Results of the pLSA model . . .
• Comparison with the naive Bayes’ classifier
Figure: Confusion matrices for the pLSA and naive Bayes models
• 10-fold cross validation test on the same dataset: mean
accuracy: ∼ 66%
Natural Scene Categorization – p.25/32
43. Analysis of our results
• Reasons for poor performance
◦ Model convergence!
◦ Local optima problem in the EM algorithm
◦ Optimum value of the objective function depends on the
initialized values
◦ We initialize the algorithm randomly at each run!
• Possible solution: Deterministic annealing EM (DAEM)
algorithm
• Even with DAEM, there is no guarantee of converging to the
globally optimal solution
Natural Scene Categorization – p.26/32
44. Maximum entropy models
• Maximum entropy prefers a uniform distribution when no
data are available
• Best model is the one that is:
1. Consistent with the constraints imposed by training data
2. Makes as few assumptions as possible
• Training dataset: {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}, where xi
represents an image and yi represents a label
• Predicate functions
◦ Unigram predicate: co-occurrence statistics of a word
and a label
f_{v1,LABEL}(x, y) = 1 if y = LABEL and v1 ∈ x, and 0 otherwise
Natural Scene Categorization – p.27/32
47. Maximum entropy models . . .
• Notation
◦ f : predicate function
◦ p̃(x, y): empirical distribution of the observed pairs
◦ p(y|x): stochastic model to be learnt
• Model fitting: the expected value of the predicate function
w.r.t. the stochastic model should equal its expected value
measured from the training data
• Constrained optimization problem
Maximize H(p) = −Σ_{x,y} p̃(x) p(y|x) log p(y|x)
s.t. Σ_{x,y} p̃(x, y) f(x, y) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
• Solution: p(y|x) = (1/Z(x)) exp(Σ_{i=1}^{k} λ_i f_i(x, y))
Natural Scene Categorization – p.28/32
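For the unigram predicates of the previous slide, the exponential-form solution above reduces to a multinomial logistic model over word-presence features, which can be fit by plain gradient ascent on the conditional log-likelihood. This is a sketch under those assumptions, not the toolkit actually used for the experiments.

```python
import numpy as np

def maxent_fit(X, y, n_labels, lr=0.5, iters=200):
    """Conditional maximum-entropy model with unigram predicates
    f_{v,label}(x, y) = [label == y and word v present in x].
    X is an N x V binary word-presence matrix; y holds integer labels.
    With these features p(y|x) = exp(lambda_y . x) / Z(x)."""
    lam = np.zeros((n_labels, X.shape[1]))
    for _ in range(iters):
        scores = X @ lam.T                    # N x n_labels
        scores -= scores.max(1, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(1, keepdims=True)          # model p(y|x_i)
        # gradient: empirical feature counts minus model expectation
        emp = np.zeros_like(lam)
        for i, yi in enumerate(y):
            emp[yi] += X[i]
        model = p.T @ X
        lam += lr * (emp - model) / len(y)
    return lam

def maxent_predict(X, lam):
    """Assign each image the label with the highest score."""
    return (X @ lam.T).argmax(1)
```

Because the objective is concave, gradient ascent here has none of the local-optima trouble seen with pLSA's EM iterations.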
50. Results for the maximum entropy model
• Same dataset, feature and codebook as used for the pLSA
experiment
• Evaluation using Zhang Le’s maximum entropy toolkit
• 25-fold cross-validation accuracy: ∼ 70%
• The second best label is often the correct label: accuracy
improves to 85%
Figure: Confusion matrices for the maximum entropy and naive Bayes models
Natural Scene Categorization – p.29/32
51. A comparative study
Method                    # of catg.   training # per catg.   perf (%)
Maximum entropy           5            50                     70
pLSA                      5            50                     46
Naive Bayes classifier    5            50                     66
Fei-Fei                   13           100                    64
Vogel                     6            ∼100                   89.3
Vogel                     6            ∼100                   67.2
Oliva                     8            250–300                89
Table: A performance comparison with other studies reported in the literature.
Natural Scene Categorization – p.30/32
52. Future Work
• Further investigations into the pLSA model
• Issue of model convergence
• DAEM algorithm is not the ideal solution
• Using a richer feature set, e.g., bank of Gabor filters
• For maximum entropy models, ways to define predicates
that will capture semantic information better
Natural Scene Categorization – p.31/32
53. THANK YOU
Natural Scene Categorization – p.32/32