1. Novel Approaches to Natural Scene Categorization
Amit Prabhudesai
Roll No. 04307002
amitp@ee.iitb.ac.in
M.Tech Thesis Defence
Under the guidance of
Prof. Subhasis Chaudhuri
Indian Institute of Technology, Bombay
Natural Scene Categorization – p.1/32
2. Overview of topics to be covered
• Natural Scene Categorization: Challenges
• Our contribution
◦ Qualitative visual environment description
• Portable, real-time system to aid the visually impaired
• System has peripheral vision!
◦ Model-based approaches
• Use of stochastic models to capture semantics
• pLSA and maximum entropy models
• Conclusions and Future Work
Natural Scene Categorization – p.2/32
3. Natural Scene Categorization
• Interesting application of a CBIR system
• Images from a broad image domain: diverse and often
ambiguous
• Bridging the semantic gap
• Grouping scenes into semantically meaningful categories
could aid further retrieval
• Efficient schemes for grouping images into semantic
categories
Natural Scene Categorization – p.3/32
4. Qualitative Visual Environment Retrieval
[Figure: omnidirectional view of the environment with regions labelled BUILDING, SKY, LAWN, WOODS and WATER BODY; sectors FR, LT, RT, LB, RB and positions P1–P3]
• Use of omnidirectional images
• Challenges
◦ Unstructured environment
◦ No prior learning (unlike navigation/localization)
• Target application and objective
◦ Wearable computing community, emphasis on visually
challenged people
◦ Real-time operation
Natural Scene Categorization – p.4/32
6. System Overview (contd.)
• Environment representation
◦ Image database containing images belonging to 6
classes: Lawns(L), Woods(W), Buildings(B),
Waterbodies(H), Roads(R) and Traffic(T)
◦ Moderately large intra-class variance (in the feature
space) in images of each category
◦ Description relative to the person using the system: e.g.,
‘to left of’, ‘in the front’, etc.
◦ Topological relationships indicated by a graph
◦ Each node annotated by an identifier associated with a
class
Natural Scene Categorization – p.6/32
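The graph representation above can be sketched as follows. This is a minimal illustration only: the node names, directions and dictionary layout are hypothetical, not the system's actual data structures.

```python
# Class identifiers as defined for the image database.
CLASSES = {"L": "Lawns", "W": "Woods", "B": "Buildings",
           "H": "Waterbodies", "R": "Roads", "T": "Traffic"}

# Hypothetical annotated graph: each node carries a class identifier,
# a direction relative to the user, and its topological neighbours.
graph = {
    "P1": {"label": "H", "direction": "in front", "neighbours": ["P2"]},
    "P2": {"label": "W", "direction": "to left of", "neighbours": ["P1", "P3"]},
    "P3": {"label": "B", "direction": "to right of", "neighbours": ["P2"]},
}

def describe(node):
    """Render a node as a qualitative, user-relative description."""
    info = graph[node]
    return f"{CLASSES[info['label']]} {info['direction']}"
```

Traversing the neighbour lists then yields the topological relationships between scene regions.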
10. System Overview (contd.)
• Environment Retrieval
◦ Node annotation
• Objective: Robust retrieval against illumination
changes and intra-class variations
• Solution: Annotation decided by a simple voting
scheme
◦ Dynamic node annotation
• Temporal evolution of graph Gn with time tn
• Complete temporal evolution of the graph given by G,
obtained by concatenating the subgraphs Gn ,
i.e.,G = {G1 , G2 , . . . , Gk , . . .}
Natural Scene Categorization – p.8/32
12. System Overview (contd.)
• Environment Retrieval
◦ Real-time operation
• Colour histogram: compact feature vector
• Pre-computed histograms of all the database images
• Linear time complexity (O(N)): on a P-IV 2.0 GHz, ∼100 ms
for a single omnicam image
◦ Portable, low-cost system for visually impaired
• Modest hardware and software requirements
• Easily put together using off-the-shelf components
Natural Scene Categorization – p.9/32
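The retrieval step above (pre-computed global colour histograms compared against the query, with the annotation decided by voting) can be sketched as follows. The function names and the nearest-neighbour vote are assumptions for illustration, not the exact scheme used in the system.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Global colour histogram of an HxWx3 uint8 image, L1-normalized."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def annotate(query_hist, db_hists, db_labels, k=5):
    """Label a query by majority vote among the k most similar database
    histograms, using histogram intersection as the similarity measure."""
    sims = [np.minimum(query_hist, h).sum() for h in db_hists]
    top = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in top:
        votes[db_labels[i]] = votes.get(db_labels[i], 0) + 1
    return max(votes, key=votes.get)
```

Because the database histograms are pre-computed, only the comparison loop runs online, which is what keeps the per-image cost linear in the database size.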
13. System Overview (contd.)
• Results
◦ Cylindrical concentric mosaics
Natural Scene Categorization – p.10/32
18. System Overview (contd.)
• Results
◦ Omnivideo sequence
[Figure: node annotations (W, B; X = unlabelled) for the omnivideo sequence in the forward and backward directions, with ground-truth labels R/L plotted against the frame index n (1–25)]
Natural Scene Categorization – p.12/32
21. Analyzing our results
• System accuracy: close to 70%; this is not enough!
• Some scenes are inherently ambiguous!
• Often the second best class is the correct class
• Limitations
1. Limited discriminating power of global colour histogram
(GCH)
2. Local colour histogram (LCH) based on tiling cannot be
used
3. Each frame analyzed independently
• Possible solutions
1. Adding memory to the system
2. Clustering scheme before computing similarity measure
Natural Scene Categorization – p.13/32
23. Method I. Adding memory to the system
• System uses only the current observation in labeling
• Better to use all observations up to the current one
• Desired: A recursive implementation to calculate the
posterior (should be able to do it in real-time!)
• Hidden Markov Model: Parameter estimation using Kevin
Murphy’s HMM toolkit
• Challenges
1. Estimating the transition matrix; a possible solution is
to use a limited set of classes
2. Enormous training data required
Natural Scene Categorization – p.14/32
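The recursive posterior computation that the HMM provides can be sketched as a single Bayes-filter step. This is illustrative only; the function name and the toy numbers are assumptions, not the toolkit's interface.

```python
import numpy as np

def forward_step(belief, A, likelihood):
    """One recursive Bayes update over scene classes:
    predict with the transition matrix A, then reweight the prediction
    by the likelihood of the current observation."""
    predicted = A.T @ belief          # propagate yesterday's belief
    posterior = predicted * likelihood
    return posterior / posterior.sum()
```

With a "sticky" transition matrix, an ambiguous frame leaves the belief close to the previous class, which is precisely the memory effect being sought.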
24. Adding memory. . . (Results)
• Improved confidence in the results. However, negligible
improvement in the accuracy
• Reasons for poor performance
◦ Limited number of transitions between categories (as opposed
to locations)
◦ Typical training data for HMMs runs to thousands of labels:
difficult to collect such vast data
• Limitation: makes the system dependent on the training
sequence
Natural Scene Categorization – p.15/32
25. Method II. Preclustering the image
• Presence of clutter, images from a broad domain
• Premise: The part of the image indicative of the semantic
category forms a distinct part in the feature space
Some test images belonging to the ‘Water-bodies’ category
• Possible solution: segment out the clutter in the scene
Natural Scene Categorization – p.16/32
28. Preclustering the image. . .
• K-means clustering of the image
• Use only pixels from the largest cluster to compute the
colour histogram
Results of K-means clustering on the test images
• Results
◦ Accuracy improves significantly: for the ‘water-bodies’ class,
from 25% to about 72%
• Limitation: what about, say, a traffic scene?
Natural Scene Categorization – p.17/32
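The preclustering step above can be sketched as follows: run K-means on the pixel colours and keep only the largest cluster for histogram computation. The deterministic seeding and the function name are simplifications for this sketch, not the actual implementation.

```python
import numpy as np

def largest_cluster_mask(image, k=3, iters=10):
    """K-means on pixel colours; returns a boolean mask selecting the
    largest cluster, whose pixels alone feed the colour histogram."""
    pixels = image.reshape(-1, 3).astype(float)
    # deterministic init for the sketch: evenly spaced pixels as seeds
    centres = pixels[np.linspace(0, len(pixels) - 1, k).astype(int)]
    for _ in range(iters):
        dists = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centres[j] = pixels[assign == j].mean(0)
    biggest = np.bincount(assign, minlength=k).argmax()
    return (assign == biggest).reshape(image.shape[:2])
```

The mask then replaces the full image when computing the global colour histogram, so clutter pixels no longer dilute the feature vector.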
29. Model-based approaches
• Stochastic models used to learn semantic concepts from
training images
• Use of normal perspective images
• Use of local image features
• Two models examined
1. probabilistic Latent Semantic Analysis (pLSA)
2. Maximum entropy models
• Use of the ‘bag of words’ approach
Natural Scene Categorization – p.18/32
31. Bag of words approach
• Local features more robust to occlusions and spatial
variations
• Image represented as a collection of local patches
• Image patches are members of a learned (visual)
vocabulary
• Positional relationships not considered!
• Data representation by a co-occurrence matrix
• Notation
◦ D = {d1 , . . . , dN } : corpus of documents
◦ W = {w1 , . . . , wM } : dictionary of words
◦ Z = {z1 , . . . , zK } : (latent) topic variables
◦ N = {n(w, d)}: co-occurrence table
Natural Scene Categorization – p.19/32
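With the notation above, the co-occurrence table N = {n(w, d)} can be built as follows. This is a minimal sketch; the input format (each image as a list of visual-word indices) is an assumption.

```python
import numpy as np

def cooccurrence_matrix(docs, vocab_size):
    """Build n(w, d): the count of each visual word w in each image d.
    `docs` is a list of images, each a list of visual-word indices."""
    N = np.zeros((vocab_size, len(docs)), dtype=int)
    for d, words in enumerate(docs):
        for w in words:
            N[w, d] += 1
    return N
```

Note that only counts survive: all positional relationships between patches are discarded, as the 'bag of words' assumption requires.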
34. pLSA model . . .
• Generative model
◦ select a document d with probability P (d)
◦ select a latent class z with probability P (z|d)
◦ select a word w with probability P (w|z)
• Joint observation probability
P(d, w) = P(d) P(w|d), where
P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)
• Modeling assumptions
1. Observation pairs (d, w) generated independently
2. Conditional independence assumption
P (w, d|z) = P (w|z)P (d|z)
Natural Scene Categorization – p.20/32
36. pLSA model . . .
• Model fitting
◦ Maximize the log-likelihood function
L = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w)
◦ Minimizing the KL divergence between the empirical
distribution and the model
◦ EM algorithm to learn model parameters
• Evaluating model on unseen test images
◦ P (w|z) and P (z|d) learned from the training dataset
◦ ‘Fold-in’ heuristic for categorization: learned factors
P (w|z) are kept fixed, mixing coefficients P (z|dtest ) are
estimated using the EM iterations
Natural Scene Categorization – p.21/32
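The EM fitting described above can be sketched compactly as follows. This is an illustration, not the actual implementation used in the experiments; note the random initialization, which is exactly what makes different runs land in different local optima.

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """EM for pLSA on a word-document count matrix N (M words x D docs).
    Returns P(w|z) as an (M, K) array and P(z|d) as a (K, D) array."""
    M, D = N.shape
    rng = np.random.default_rng(seed)
    Pw_z = rng.random((M, K)); Pw_z /= Pw_z.sum(0, keepdims=True)
    Pz_d = rng.random((K, D)); Pz_d /= Pz_d.sum(0, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d)
        Pz_dw = Pw_z[:, None, :] * Pz_d.T[None, :, :]   # (M, D, K)
        Pz_dw /= Pz_dw.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate factors from expected counts n(d,w) P(z|d,w)
        expected = N[:, :, None] * Pz_dw                # (M, D, K)
        Pw_z = expected.sum(1)                          # (M, K)
        Pw_z /= Pw_z.sum(0, keepdims=True) + 1e-12
        Pz_d = expected.sum(0).T                        # (K, D)
        Pz_d /= Pz_d.sum(0, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```

The fold-in heuristic then keeps the learned P(w|z) fixed and runs only the P(z|d_test) updates on a new image.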
37. pLSA model . . .
• Details of experiment to evaluate model
◦ 5 categories: houses, forests, mountains, streets and
beaches
◦ Image dataset: COREL photo CDs, images from internet
search engines, and personal image collections
◦ 100 images of each category
◦ Modifications in Rob Fergus’s code for the experiments
◦ 128-dim SIFT feature used to represent a patch
◦ Visual codebook with 125 entries
• Image annotation
ẑ = arg max_i P(z_i | d_test)
Natural Scene Categorization – p.22/32
39. pLSA model. . . Results
• 50 runs of the experiment: with random partitioning on each
run
• Vastly different accuracy on different runs: best case ∼ 46%,
and worst case 5%
• Analysis of the results
◦ Confusion matrix gives us further insights
◦ Most of the labeling errors occur between houses and
streets
◦ Ambiguity between mountains and forests
Natural Scene Categorization – p.23/32
40. Results using the pLSA model
Figure: Some images that were wrongly annotated by our system
Natural Scene Categorization – p.24/32
41. Results of the pLSA model . . .
• Comparison with the naive Bayes’ classifier
Figure: Confusion matrices for the pLSA and naive Bayes models
• 10-fold cross validation test on the same dataset: mean
accuracy: ∼ 66%
Natural Scene Categorization – p.25/32
43. Analysis of our results
• Reasons for poor performance
◦ Model convergence!
◦ Local optima problem in the EM algorithm
◦ Optimum value of the objective function depends on the
initialized values
◦ We initialize the algorithm randomly at each run!
• Possible solution: Deterministic annealing EM (DAEM)
algorithm
• Even with DAEM, there is no guarantee of converging to the
globally optimal solution
Natural Scene Categorization – p.26/32
44. Maximum entropy models
• Maximum entropy prefers a uniform distribution when no
data are available
• Best model is the one that is:
1. Consistent with the constraints imposed by training data
2. Makes as few assumptions as possible
• Training dataset: {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}, where xi
represents an image and yi represents a label
• Predicate functions
◦ Unigram predicate: co-occurrence statistics of a word
and a label
f_{v1,LABEL}(x, y) = 1 if y = LABEL and v1 ∈ x, and 0 otherwise
Natural Scene Categorization – p.27/32
47. Maximum entropy models . . .
• Notation
◦ f : predicate function
◦ p̃(x, y): empirical distribution of the observed pairs
◦ p(y|x): stochastic model to be learnt
• Model fitting: the expected value of the predicate function
w.r.t. the stochastic model should equal its expected value
measured from the training data
• Constrained optimization problem
Maximize H(p) = −Σ_{x,y} p̃(x) p(y|x) log p(y|x)
s.t. Σ_{x,y} p̃(x, y) f(x, y) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
• Solution: p(y|x) = (1/Z(x)) exp(Σ_{i=1}^{k} λ_i f_i(x, y))
Natural Scene Categorization – p.28/32
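For the unigram predicates of the previous slide, the exponential-form solution above reduces to a multinomial logistic model over word-presence features, which can be fit by plain gradient ascent on the conditional log-likelihood. This is a sketch under those assumptions, not the toolkit actually used for the experiments.

```python
import numpy as np

def maxent_fit(X, y, n_labels, lr=0.5, iters=200):
    """Conditional maximum-entropy model with unigram predicates
    f_{v,label}(x, y) = [label == y and word v present in x].
    X is an N x V binary word-presence matrix; y holds integer labels.
    With these features p(y|x) = exp(lambda_y . x) / Z(x)."""
    lam = np.zeros((n_labels, X.shape[1]))
    for _ in range(iters):
        scores = X @ lam.T                    # N x n_labels
        scores -= scores.max(1, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(1, keepdims=True)          # model p(y|x_i)
        # gradient: empirical feature counts minus model expectation
        emp = np.zeros_like(lam)
        for i, yi in enumerate(y):
            emp[yi] += X[i]
        model = p.T @ X
        lam += lr * (emp - model) / len(y)
    return lam

def maxent_predict(X, lam):
    """Assign each image the label with the highest score."""
    return (X @ lam.T).argmax(1)
```

Because the objective is concave, gradient ascent here has none of the local-optima trouble seen with pLSA's EM iterations.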
50. Results for the maximum entropy model
• Same dataset, feature and codebook as used for the pLSA
experiment
• Evaluation using Zhang Le’s maximum entropy toolkit
• 25-fold cross-validation accuracy: ∼ 70%
• The second best label is often the correct label: accuracy
improves to 85%
Figure: Confusion matrices for the maximum entropy and naive Bayes models
Natural Scene Categorization – p.29/32
51. A comparative study
Method                    # of catg.   training # per catg.   perf (%)
Maximum entropy           5            50                     70
pLSA                      5            50                     46
Naive Bayes classifier    5            50                     66
Fei-Fei                   13           100                    64
Vogel                     6            ∼100                   89.3
Vogel                     6            ∼100                   67.2
Oliva                     8            250–300                89
Table: A performance comparison with other studies reported in the literature.
Natural Scene Categorization – p.30/32
52. Future Work
• Further investigations into the pLSA model
• Issue of model convergence
• DAEM algorithm is not the ideal solution
• Using a richer feature set, e.g., bank of Gabor filters
• For maximum entropy models, ways to define predicates
that will capture semantic information better
Natural Scene Categorization – p.31/32
53. THANK YOU
Natural Scene Categorization – p.32/32