Attentional Object Detection

Why look for everything everywhere?




Sergey Karayev
for UC Berkeley Computer Vision Retreat 2011
Problem:
Recognition and localization of objects
          of multiple classes
         in cluttered scenes.
Object Detection

Proposals → Detectors → Post-process
Proposals

• Sliding window (...with priors/pruning)
• Voting
• Efficient search
• etc.

              Sliding window    Proposals




ā€¢Too slow: quadratic in number of search
dimensions (x,y,scale,class).
ā€¢Speed-ups:
 ā€¢Parallelization.
 ā˜…Priors/Pruning with non-detector
 features.
 ā˜…Algorithmic efficiency.
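The multiplicative blow-up can be made concrete with a rough count of classifier evaluations (the specific numbers here are illustrative, not from the talk):

```python
# Illustrative count of sliding-window evaluations.
# Cost multiplies across the search dimensions: x, y, scale, and class.
def num_windows(width, height, stride, num_scales, num_classes):
    """Rough count of classifier evaluations for dense sliding-window search."""
    positions_x = width // stride
    positions_y = height // stride
    return positions_x * positions_y * num_scales * num_classes

# A 640x480 image, 8-pixel stride, 10 scales, 20 PASCAL classes:
evals = num_windows(640, 480, stride=8, num_scales=10, num_classes=20)
print(evals)  # → 960000 classifier evaluations for a single image
```

Halving the stride quadruples the count; adding classes or scales multiplies it again, which is why pruning any dimension pays off directly.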
Proposals: Priors/pruning

• Uses non-detector features (location, geometry, context, depth, "objectness").
• Often done in post-processing.
Proposals: Voting, Efficient subwindow search

• Voting: currently only works for local features.
Proposals
• Priority ordered? How?
• Pruned / Exhaustive?
• Class-specific?

Detectors

Post-process
Detector

• Template/Parts
• Local features
• Decision stumps

[Background: excerpt from the Viola–Jones paper, including Figure 1 — example two-, three-, and four-rectangle features shown relative to the enclosing detection window; the sum of pixels in the white rectangles is subtracted from the sum in the grey rectangles. The excerpt describes AdaBoost as a feature-selection process and the attentional cascade, which speeds up detection by focusing on promising regions while keeping the false negative rate of early stages low.]
Proposals
• Priority ordered? How?
• Pruned / Exhaustive?
• Class-specific?

Detectors
• Local or global feature?
• Shared parts across classes?
• Cascaded?
• Confidence ≈ likelihood?

Post-process
Post-process

Proposals
• Priority ordered? How?
• Pruned / Exhaustive?
• Class-specific?

Detectors
• Local or global feature?
• Shared parts across classes?
• Cascaded?
• Confidence ≈ likelihood?

Post-process
• NMS/Meanshift?
• Context? (Inter-object?)
Where we are


Cascaded Deformable Part Models.
Per class, ~1 sec / medium-sized image.
Where we are

• PASCAL: ~5K test images, 20 classes. 28 hours to process.
• ImageNet '11: ~450K test images, 3000 classes. 375,000 hours to process.
Where we are


• Standard movie: ~130K frames. 36 hours per object class.
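These figures follow directly from the ~1 second per class per image budget; a quick check:

```python
# Back-of-the-envelope check of the slide numbers, assuming ~1 second
# per class per medium-sized image.
SEC_PER_CLASS_IMAGE = 1.0

def hours_to_process(num_images, num_classes):
    return num_images * num_classes * SEC_PER_CLASS_IMAGE / 3600

print(round(hours_to_process(5_000, 20)))       # PASCAL: 28 hours
print(round(hours_to_process(450_000, 3_000)))  # ImageNet '11: 375000 hours
print(round(hours_to_process(130_000, 1)))      # one class over a movie: 36 hours
```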
So what can we do?
Not look for everything
      everywhere!
New Performance Evaluation

• Goal: be able to stop detection at any time and have the most correct detections and the fewest incorrect detections.

[Plots: AP vs. time — compare detectors by the whole AP-over-time curve rather than a single final AP.]
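One way to make this evaluation concrete is to score whatever detections are available at each time checkpoint. A minimal sketch, using plain ranked-retrieval AP over binary correctness flags (a real evaluation would match detections to ground truth with an overlap criterion):

```python
# Minimal sketch of "AP as a function of time". Each detection is a
# (timestamp, confidence, correct) triple; `correct` stands in for
# matching against ground truth with an overlap test.
def average_precision(flags, num_gt):
    """AP of a confidence-ranked list; flags[i] is True if detection i is correct."""
    hits, ap = 0, 0.0
    for rank, correct in enumerate(flags, start=1):
        if correct:
            hits += 1
            ap += hits / rank
    return ap / num_gt if num_gt else 0.0

def ap_vs_time(detections, checkpoints, num_gt):
    """AP of the detections that have arrived by each checkpoint."""
    curve = []
    for t in checkpoints:
        ranked = sorted((d for d in detections if d[0] <= t),
                        key=lambda d: -d[1])        # rank by confidence
        curve.append((t, average_precision([d[2] for d in ranked], num_gt)))
    return curve

dets = [(0.1, 0.9, True), (0.4, 0.8, False), (0.7, 0.7, True)]
curve = ap_vs_time(dets, checkpoints=[0.2, 0.5, 1.0], num_gt=2)
```

A detector that finds its correct detections early dominates this curve even if both detectors end at the same final AP.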
How?
Attention

• Natural bottleneck in animal vision.
• Two kinds:
  • Bottom-up: rapid, driven by featurization.
  • Top-down: secondary, driven by task.
• Eye fixations are a good proxy for implicit attention. Necessary because of the fovea.
Judd, Ehinger, Durand, and Torralba. Learning to Predict Where Humans Look. ICCV 2009.

Basic ideas
• Single saliency map from which foci of attention are selected.
• Sequential selection due to "inhibition of return," or on information maximization.
• Influenced from the top.

[Background: paper excerpt. Eye tracking data from 15 viewers on 1003 images is used as training and testing examples to learn a model of saliency based on low-, middle-, and high-level image features; the database is publicly available.]
Attentional Object Detector

Assume we have a powerful but expensive per-class classifier.
• How should we pick locations to consider?
• What should we look for at a location?
Attentional Object Detector

Proposals
   ↓
Detector
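The Proposals → Detector loop could be sketched as a priority queue over proposals. This is a minimal sketch, assuming a cheap `propose` and an expensive `classify` (hypothetical interfaces — the talk does not fix a particular one):

```python
import heapq

# Sketch of an attentional detection loop:
# propose(image) cheaply yields (priority, window) pairs;
# classify(image, window) is the powerful but expensive per-class classifier.
def attentional_detect(image, propose, classify, budget):
    """Spend a fixed classifier budget on the most promising windows first."""
    queue = [(-priority, window) for priority, window in propose(image)]
    heapq.heapify(queue)                    # max-priority at the top
    detections = []
    while queue and budget > 0:
        _, window = heapq.heappop(queue)
        score = classify(image, window)     # the expensive call
        budget -= 1
        if score > 0:
            detections.append((window, score))
    return detections                       # an anytime, best-first result
```

Because the best proposals are examined first, stopping at any budget leaves the most promising windows already classified — exactly the anytime behavior the new evaluation rewards.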
Some related work

Vogel and de Freitas. Target-directed attention: Sequential decision-making for gaze planning. ICRA 2008.

• GIST and a simple regressor to compute a likelihood map.
• Reinforcement learning to find the best gaze sequence.
• "Heavier" feature and regressor to evaluate the fixation locations.
Vogel and de Freitas. Target-directed attention: Sequential decision-making for gaze planning. ICRA 2008.

• Evaluated only on Caltech Office scenes.
• Gaze planning improves over just using bottom-up saliency while being only slightly slower.
• Detection rate is lower than full image, but maximum precision is higher.
Gualdi, Prati, and Cucchiara. Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos. ECCV 2010.

• LogitBoost classifier with covariance descriptors.
• Score falls off over some region of support.
• Sample points in the image to estimate P(O|I). Resample close to promising points.

[Background: paper excerpt. Fig. 1: region of support for the cascade of LogitBoost classifiers trained on the INRIA pedestrian dataset, averaged over 62 pedestrian patches; a sufficiently wide region of support allows pruning the sliding-window search, while a too-wide region can generate de-localized detections.]
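The sample-then-refine idea above can be caricatured in one dimension (a simplification, not the paper's exact stage schedule; `score` stands in for the cascade response):

```python
import random

# 1-D caricature of multi-stage sampling: draw samples, keep the
# promising ones, and resample around them with a shrinking spread.
# (The paper samples over x, y, and scale.)
def multistage_sample(score, lo, hi, n=50, stages=3, spread_frac=0.25):
    samples = [random.uniform(lo, hi) for _ in range(n)]
    spread = (hi - lo) * spread_frac
    for _ in range(stages - 1):
        seeds = sorted(samples, key=score, reverse=True)[: n // 5]
        samples = [s + random.uniform(-spread, spread)
                   for s in seeds for _ in range(n // len(seeds))]
        spread *= 0.5                       # tighten at each stage
    return max(samples, key=score)

random.seed(0)
# Toy score peaked at x = 3; later samples concentrate near the peak.
best = multistage_sample(lambda x: -(x - 3.0) ** 2, lo=0.0, hi=10.0)
```

The total number of `score` calls is fixed (n per stage), but they concentrate where the response is high — the same budget buys much finer localization than one uniform pass.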
Gualdi et al. Multi-stage Sampling with Boosting
 Cascades for Pedestrian Detection in Images and
                 Videos. ECCV 2010.




• Evaluated on INRIA Pedestrians, Graz02, and some videos.
• Always reduces miss rate over sliding window, while being 2-6x faster.
Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009.

• Digital fovea placed sequentially to maximize expected information gain.
• Liken it to stochastic optimal control, and use a "multinomial infomax POMDP" to pick the sequence.

[Background: paper excerpt. Figure 1: a digital fovea — several concentric Image Patches arranged around a point of fixation, with the image portion in each rectangle reduced to a common size. The approach is data driven and detector independent, in contrast to the analytic Efficient Subwindow Search; it extends the greedy Najemnik & Geisler infomax model of eye movements with long-term POMDP planning.]
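A one-step version of the infomax idea can be sketched on a 1-D grid with a toy sensor model (this is a hypothetical stand-in, not the paper's I-POMDP): pick the fixation whose expected posterior entropy over the target's location is lowest.

```python
import math

# Toy one-step infomax fixation picker on a 1-D grid.
# belief[i] = P(target in cell i). A fixation at cell f yields a binary
# reading whose reliability about cell i decays with |i - f|
# (0.5 = a fair coin, i.e. uninformative about distant cells).
def reliability(dist):
    return max(0.5, 0.9 - 0.15 * dist)

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def expected_posterior_entropy(belief, f):
    total = 0.0
    for reading in (True, False):
        lik = [reliability(abs(i - f)) if reading
               else 1 - reliability(abs(i - f))
               for i in range(len(belief))]
        joint = [l * b for l, b in zip(lik, belief)]
        z = sum(joint)                       # P(reading)
        if z > 0:
            total += z * entropy([j / z for j in joint])
    return total

def best_fixation(belief):
    return min(range(len(belief)),
               key=lambda f: expected_posterior_entropy(belief, f))
```

With most belief mass on one cell, the picker fixates where the reading is most reliable about that cell; the paper's contribution is planning such fixations over the long term rather than one step at a time.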
Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009.

• Evaluate on own faces dataset against Viola-Jones: 2x speedup, but small decrease in accuracy.

[Figures: Figure 6 — successive fixation choices by the MI-POMDP policy; the face is found in six fixations, with a final localization error of 1.4 grid-cells. Figure 8 — varying the Viola-Jones scaling factor makes both methods faster and less accurate; MI-POMDP is usually closer to the origin on the time-error curve, i.e. a better speed-accuracy tradeoff. Both methods on average placed the face between one and two grid-cells off the true location.]
Vijayanarasimhan and Kapoor. Visual Recognition and
    Detection Under Bounded Computational Resources.
                        CVPR 2010.
[Figure 3. Grid weights learnt for each category in the ETHZ shape dataset.]

Table 2. Attributes of the features used in the experiments.

  Feature    Channel          Dim    Computation time (ms)
  SIFT       R, G, B, Gray    128    0.21
  T1a S2     Gray              68    1.2
  T2 S2      Gray              36    0.09
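Per-feature computation times like those in Table 2 let a Value-of-Information-style selector trade expected evidence against cost. A hypothetical sketch, where both the `expected_evidence` numbers and the evidence-per-millisecond utility are illustrative assumptions rather than the paper's exact criterion:

```python
# Per-feature computation costs in ms, from Table 2.
FEATURE_COST_MS = {"SIFT": 0.21, "T1a S2": 1.2, "T2 S2": 0.09}

def pick_feature_type(expected_evidence):
    """Pick the feature type with the highest expected evidence per
    millisecond of computation (an assumed VOI-style utility)."""
    return max(expected_evidence,
               key=lambda t: expected_evidence[t] / FEATURE_COST_MS[t])

# Illustrative evidence estimates, averaged over training features of each type.
choice = pick_feature_type({"SIFT": 0.4, "T1a S2": 0.9, "T2 S2": 0.1})
assert choice == "SIFT"  # 0.4/0.21 beats both 0.9/1.2 and 0.1/0.09
```

Note how the cheapest feature does not win by default: the ratio rewards evidence per unit time, so an expensive feature can still be chosen if its expected evidence is high enough.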
• Hough voting with multiple (five in our experiments) feature types.
• Uses Value of Information to pick the feature type to extract.
• Active approach extracts fewer features, takes less time, and has higher accuracy on ETHZ Shape and INRIA Horses than the passive selection baseline.

[Figure 2. A summary of our algorithm.]

The term p(g_i^(O,x) | f) depends on the feature f, which is time-consuming to extract. However, since we are only trying to determine the best feature type to extract, we instead estimate the expected value of the term p(g_i^(O,x) | f) for every feature type t, by considering all the features of type t in the training database and averaging the term. The feature type with the largest value can be interpreted as the one expected to provide the best evidence for object presence given the features. For example, for the "body" of a giraffe, texture-based features could provide the best evidence.

The conditional probability is modeled as a weighted sum over the nearest neighbors of f:

    p(g_i^(O,x) | f, l) = Σ_{h ∈ N(f)} q_i^h p(h | f)        (2)

where h is a feature in the training database, N(f) is the set of nearest neighbors of f, and q_i^h is the conditional probability p(g_i^(O,x) | h, l) that needs to be estimated from the training data for every feature h and every grid part g_i.

Datasets: We use two challenging object detection datasets, the ETHZ shape dataset and the INRIA horses dataset, to compare against several state-of-the-art Hough-based detection approaches [21, 24, 11, 10]. The ETHZ shape dataset contains 255 images of five shape-based classes (applelogos, bottles, giraffes, mugs and swans). The INRIA horses dataset contains 170 images with one or more side-views of horses and 170 images without the category. In both datasets, objects occur in highly cluttered natural scenes with large variations in both scale and appearance, and images sometimes contain multiple objects. We use the same training and testing setup used by [10] on both datasets for fair comparisons.

Implementation details: Parameter learning of the grid model is performed by first scaling all ground-truth bounding boxes to a fixed height (100 pixels in our experiments) while preserving the aspect ratio. Then points are uniformly sampled along the edges (using a Canny edge detector). An initial set of hypotheses is generated; each selection strategy is then run iteratively, updating the hypotheses as features get added, until the time elapsed reaches a budget (1 sec in our case).

[Figure 5. Qualitative results comparing the first 1000 points selected by the active approach and the passive selection baselines; bright dots denote selected feature points.]
Image Attributions
•   Girshick et al. - Cascaded deformable part models.
•   Viola & Jones - Rapid object detection.
•   Judd et al. - Learning to predict where humans look.
•   Chikkerur et al. - What and where? A Bayesian theory of attention.
•   ...and the papers reviewed.

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Ā 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Ā 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Ā 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Ā 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Ā 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Ā 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
Ā 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep Learning
Ā 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
Ā 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Ā 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
Ā 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
Ā 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
Ā 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
Ā 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationRadu Cotescu
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
Ā 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
Ā 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Ā 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...gurkirankumar98700
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
Ā 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
Ā 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
Ā 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
Ā 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
Ā 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĆŗjo
Ā 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
Ā 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organization
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Ā 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Ā 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Ā 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service šŸø 8923113531 šŸŽ° Avail...
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
Ā 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Ā 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Ā 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
Ā 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Ā 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Ā 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Ā 

Attentional Object Detection - introductory slides.

  • 1. Attentional Object Detection Why look for everything everywhere? Sergey Karayev for UC Berkeley Computer Vision Retreat 2011
  • 2. Problem: Recognition and localization of objects of multiple classes in cluttered scenes.
  • 3. Proposals Detectors Object Detection Post-process
  • 4. Proposals Detectors Object Detection Post-process
  • 5. Proposals: Sliding window · ...with priors/pruning · Voting · Efficient search · etc.
  • 6. Sliding window proposals: • Too slow: quadratic in the number of search dimensions (x, y, scale, class). • Speed-ups: • Parallelization. ★ Priors/pruning with non-detector features. ★ Algorithmic efficiency.
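The blow-up in search dimensions can be made concrete with a toy count of detector evaluations. This is an illustrative sketch, not code from the deck; `score` is a hypothetical stand-in for a real per-class detector.

```python
# Exhaustive sliding-window search: cost scales with the product of the
# search dimensions (x, y, scale, class).

def sliding_window_detections(img_w, img_h, win=64, stride=8,
                              scales=(1.0, 1.5, 2.0), n_classes=20,
                              score=lambda x, y, s, c: 0.0, thresh=0.5):
    """Score every (x, y, scale, class) cell; keep windows above thresh."""
    hits = []
    for c in range(n_classes):
        for s in scales:
            w = int(win * s)
            for y in range(0, img_h - w + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    if score(x, y, s, c) > thresh:
                        hits.append((x, y, s, c))
    return hits

def n_windows(img_w, img_h, win=64, stride=8, scales=(1.0, 1.5, 2.0),
              n_classes=20):
    """Number of detector evaluations the exhaustive search above performs."""
    total = 0
    for s in scales:
        w = int(win * s)
        nx = (img_w - w) // stride + 1
        ny = (img_h - w) // stride + 1
        total += nx * ny
    return total * n_classes
```

For a 640x480 image with these (assumed) settings, the count is already in the hundreds of thousands per image, and it multiplies directly by the number of classes — the motivation for the pruning and prioritization strategies on the following slides.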
  • 7. Proposals: priors/pruning • Uses non-detector features (location, geometry, context, depth, "objectness"). • Often done in post-processing.
  • 8. Proposals: Voting · Efficient subwindow search. Currently only works for local features.
  • 9. Proposals: • Priority ordered? How? • Pruned / Exhaustive? • Class-specific? Detectors. Post-process.
  • 10. Proposals: • Priority ordered? How? • Pruned / Exhaustive? • Class-specific? Detectors. Post-process.
  • 11. Detector: Template/Parts · Local features. (Slide excerpts Viola & Jones: each boosting stage selects a weak classifier that depends on a single rectangle feature, so AdaBoost can be viewed as feature selection; a cascade of successively more complex classifiers focuses attention on promising regions, since it is often possible to rapidly determine where in an image an object might occur and reserve expensive processing for those regions. The key measure of such an attentional filter is its false-negative rate.)
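The cascade idea excerpted above can be sketched in a few lines; the stage functions and thresholds here are hypothetical stand-ins, not the Viola-Jones features themselves.

```python
# A Viola-Jones-style attentional cascade: each stage is a cheap classifier,
# and a window is rejected as soon as any stage says "no", so most background
# windows never reach the expensive later stages.

def cascade_classify(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first.
    Returns (accepted, n_stages_evaluated)."""
    for i, (score_fn, thresh) in enumerate(stages):
        if score_fn(window) < thresh:
            return False, i + 1   # early rejection: later stages are skipped
    return True, len(stages)
```

The speed win comes from the second return value: easy negatives cost one stage, and only near-positives pay for the full cascade.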
  • 12. • Priority ordered? How? Proposals • Pruned / Exhaustive? • Class-specific? • Local or global feature? • Shared parts across classes? Detectors • Cascaded? • Confidence ≈ likelihood? Post-process
  • 13. • Priority ordered? How? Proposals • Pruned / Exhaustive? • Class-specific? • Local or global feature? • Shared parts across classes? Detectors • Cascaded? • Confidence ≈ likelihood? Post-process
  • 15. • Priority ordered? How? Proposals • Pruned / Exhaustive? • Class-specific? • Local or global feature? • Shared parts across classes? Detectors • Cascaded? • Confidence ≈ likelihood? • NMS/Meanshift? Post-process • Context? (Inter-object?)
  • 16. • Priority ordered? How? Proposals • Pruned / Exhaustive? • Class-specific? • Local or global feature? • Shared parts across classes? Detectors • Cascaded? • Confidence ≈ likelihood? • NMS/Meanshift? Post-process • Context? (Inter-object?)
  • 17. Where we are Cascaded Deformable Part Models. Per class, ~1 sec / medium-sized image.
  • 18. Where we are • PASCAL: ~5K test images, 20 classes. 28 hours to process. • ImageNet '11: ~450K test images, 3000 classes. 375,000 hours to process.
  • 19. Where we are • Standard movie: ~130K frames. 36 hours per object class.
  • 20. So what can we do? Not look for everything everywhere!
  • 21. New Performance Evaluation • Goal: Be able to stop detection and have the most correct detections and the fewest incorrect detections at any time. (Plot: AP vs. time.)
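The proposed evaluation can be sketched as follows: run the detector under an increasing time budget and, at each cutoff, compute average precision over the detections emitted so far. A minimal sketch, assuming detections are recorded as (time_emitted, score, is_correct) triples:

```python
# AP-vs-time curve for an anytime detector.

def average_precision(dets, n_positives):
    """AP of (score, is_correct) detections, ranked by descending score."""
    dets = sorted(dets, key=lambda d: -d[0])
    tp, ap = 0, 0.0
    for i, (score, correct) in enumerate(dets):
        if correct:
            tp += 1
            ap += tp / (i + 1)        # precision at each recall step
    return ap / n_positives if n_positives else 0.0

def ap_vs_time(timed_dets, n_positives, cutoffs):
    """For each time cutoff, AP of the detections emitted by that time."""
    return [average_precision([(s, c) for t, s, c in timed_dets if t <= cut],
                              n_positives)
            for cut in cutoffs]
```

A good attentional detector should push this curve up early: most of the final AP is reached long before the full budget is spent.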
  • 22. How?
  • 23. Attention • Natural bottleneck in animal vision. • Two kinds: • Bottom-up: rapid, driven by featurization. • Top-down: secondary, driven by task. • Eye fixations are a good proxy for implicit attention. Necessary because of the fovea.
  • 24. Basic ideas (from Judd, Ehinger, Durand, and Torralba's MIT eye-tracking study: 1003 images, 15 viewers): • Single saliency map from which foci of attention are selected. • Sequential selection due to "inhibition of return." • Information maximization. • Influenced from the top.
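The first two ideas — a single saliency map plus sequential selection via inhibition of return — can be sketched with a classic winner-take-all loop. The saliency map here is just a 2D list of scores; real maps come from learned feature combinations as in Judd et al.

```python
# Sequential fixation selection: repeatedly pick the most salient location,
# then suppress its neighborhood ("inhibition of return") so attention moves on.

def select_fixations(saliency, n_fixations, inhibit_radius=1):
    sal = [row[:] for row in saliency]        # copy; we mutate scores
    h, w = len(sal), len(sal[0])
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: the current maximum of the (suppressed) map.
        y, x = max(((i, j) for i in range(h) for j in range(w)),
                   key=lambda p: sal[p[0]][p[1]])
        fixations.append((y, x))
        # Inhibition of return: knock out a window around the winner.
        for i in range(max(0, y - inhibit_radius),
                       min(h, y + inhibit_radius + 1)):
            for j in range(max(0, x - inhibit_radius),
                           min(w, x + inhibit_radius + 1)):
                sal[i][j] = float("-inf")
    return fixations
```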
  • 25. (Slide excerpts a visual-search study: observers searched for targets (cars or pedestrians) and pressed a key to indicate detection; images contained on average 4.6 cars and 2.1 pedestrians. Search is guided by the scene description S, e.g. global properties such as illumination and scene identity, via the likelihood P(I|S).)
  • 26. Attentional Object Detector. Assume we have a powerful but expensive per-class classifier. • How should we pick locations to consider? • What should we look for at a location?
  • 27. Attentional Object Detector Proposals Detector
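The proposals-feeding-a-detector loop implied by these slides can be sketched as a priority queue: cheap proposal features rank (location, class) pairs, and the expensive classifier is spent on the most promising entries until a time budget runs out. `propose` and `classify` are hypothetical stand-ins for a cheap prior and the expensive per-class detector.

```python
# A budgeted attentional detection loop: evaluate the expensive classifier
# in priority order given by cheap proposal scores.
import heapq
import time

def attentional_detect(locations, classes, propose, classify,
                       budget_s=1.0, thresh=0.5):
    # Cheap pass: priority = proposal score for each (location, class).
    heap = [(-propose(loc, c), loc, c) for loc in locations for c in classes]
    heapq.heapify(heap)
    detections, start = [], time.monotonic()
    while heap and time.monotonic() - start < budget_s:
        _, loc, c = heapq.heappop(heap)       # most promising first
        score = classify(loc, c)              # expensive evaluation
        if score > thresh:
            detections.append((loc, c, score))
    return detections
```

Stopping this loop at any point yields the "most correct detections at any time" behavior that the AP-vs-time evaluation on slide 21 measures.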
  • 29. Vogel and Freitas. Target-directed attention: Sequential decision-making for gaze planning. ICRA 2008. • GIST and a simple regressor to compute a likelihood map. • Reinforcement learning to find the best gaze sequence. • "Heavier" feature and regressor to evaluate the fixation locations.
  • 30. Vogel and Freitas. Target-directed attention: Sequential decision-making for gaze planning. ICRA 2008. • Evaluated only on Caltech Office scenes. • Gaze planning improves over just using bottom-up saliency while being only slightly slower. • Detection rate is lower than full image, but maximum precision is higher.
  • 31. Gualdi, Prati, and Cucchiara. Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos. ECCV 2010. • LogitBoost classifier with covariance descriptors. • The classifier's response stays positive over a region of support around a true detection, so the sliding-window set can be pruned as long as some window lands in each region of support. • Sample points in the image to estimate P(O|I); resample close to promising points. (Slide shows the cascade's region of support averaged over INRIA pedestrian patches, and the distribution of samples across stages.)
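The multi-stage sampling idea can be sketched in one dimension: draw coarse samples, score them, then concentrate the next stage's samples in shrinking neighborhoods of the high-scoring points (the region of support). This is an illustrative sketch under assumed stage sizes; `score` stands in for the LogitBoost cascade response.

```python
# Multi-stage sampling: uniform first stage, then resampling around the
# most promising points with a shrinking neighborhood.
import random

def multistage_sample(score, n_per_stage=50, n_stages=3,
                      span=100.0, shrink=0.25, seed=0):
    rng = random.Random(seed)
    # Stage 1: uniform samples over the whole search range [0, span).
    samples = [rng.uniform(0, span) for _ in range(n_per_stage)]
    for _ in range(n_stages - 1):
        scored = sorted(samples, key=score, reverse=True)
        seeds = scored[:max(1, n_per_stage // 5)]   # keep the promising ones
        sigma = span * shrink
        # Next stage: Gaussian resampling around the retained seeds.
        samples = [rng.gauss(rng.choice(seeds), sigma)
                   for _ in range(n_per_stage)]
        shrink *= 0.5
    return max(samples, key=score)
```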
  • 32. Gualdi et al. Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos. ECCV 2010. • Evaluated on INRIA Pedestrians, Graz02, and some videos. • Always reduces miss rate over sliding window, while being 2-6x faster.
  • 33. Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009. • Digital fovea placed sequentially to maximize expected information gain. • Liken it to stochastic optimal control, and use a "multinomial infomax POMDP" to pick the fixation sequence. (Slide excerpts the paper: with fewer than 25 successive fixations, the foveated approach is faster than exhaustively applying the detector to a high-resolution image; it extends Najemnik & Geisler's greedy Infomax model of eye movements with long-term POMDP planning, and is data-driven and detector-independent, unlike the analytic Efficient Subwindow Search. Figure 1 shows the digital fovea: concentric image patches arranged around a fixation point.)
  • 34. Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009. ā€¢ Evaluate on their own faces dataset against Viola-Jones: 2x speedup, but a small decrease in accuracy. Both methods on average placed the face between one and two grid cells off the true face location. [Figure 6: Successive fixation choices by the MI-POMDP policy; the face is found in six fixations, with a final localization error of 1.4 grid cells. Figure 8: Varying the Viola-Jones scaling factor makes both methods faster and less accurate; MI-POMDP is usually closer to the origin on the time-error curve, showing a better speed-accuracy tradeoff than Viola-Jones alone.]
  • 35. Vijayanarasimhan and Kapoor. Visual Recognition and Detection Under Bounded Computational Resources. CVPR 2010. ā€¢ Hough voting with multiple (five in their experiments) feature types. ā€¢ Uses Value of Information to pick the region and the type of feature to extract next. ā€¢ The active approach extracts fewer features, takes less time, and has higher accuracy on ETHZ and INRIA Horses than passive and random selection baselines. [Table 2: feature attributes — SIFT (R, G, B, Gray; 128-dim; 0.21 ms), T1a S2 (Gray; 68-dim; 1.2 ms), T2 S2 (Gray; 36-dim; 0.09 ms). Datasets: the ETHZ shape dataset (255 images over five classes: applelogos, bottles, giraffes, mugs, swans) and the INRIA horses dataset (170 images of side-view horses in highly cluttered natural scenes), compared against several state-of-the-art Hough-based detection approaches.]
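The selection step above can be sketched as a value-of-information criterion: rank candidate (region, feature-type) pairs by expected evidence gain penalized by extraction cost, and extract the winner. This is a hypothetical scalarization for illustration — the paper learns its VOI criterion — but the per-feature costs come from the slide's Table 2.

```python
# Per-feature computation time in ms, from the slide's Table 2.
FEATURE_COST_MS = {"SIFT": 0.21, "T1a": 1.2, "T2": 0.09}

def pick_next_feature(candidates, cost_weight=0.5):
    """Greedy VOI-style selection: each candidate is
    (region, feature_type, expected_gain); return the (region,
    feature_type) pair maximizing gain minus weighted cost.
    `cost_weight` and `expected_gain` values are assumptions."""
    best, best_score = None, float("-inf")
    for region, feat, exp_gain in candidates:
        score = exp_gain - cost_weight * FEATURE_COST_MS[feat]
        if score > best_score:
            best, best_score = (region, feat), score
    return best

# Hypothetical candidates: two regions, three feature types.
cands = [((10, 20), "SIFT", 0.8),
         ((10, 20), "T1a", 0.9),
         ((40, 5), "T2", 0.3)]
pick = pick_next_feature(cands)  # SIFT wins: 0.8 - 0.5*0.21 = 0.695
```

Under this weighting the cheap-but-informative SIFT extraction at region (10, 20) beats the slightly more informative but much slower T1a feature — exactly the tradeoff a bounded-resource detector must make.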
  • 36. Image Attributions ā€¢ Girshick et al. - Cascaded deformable part models. ā€¢ Viola & Jones - Rapid object detection. ā€¢ Judd et al. - Learning to predict where humans look. ā€¢ Chikkerur et al. - What and where? A Bayesian theory of attention. ā€¢ ...and the papers reviewed.

Editor's Notes

  21. This is related to current research work in ML on anytime algorithms. I think that the only solution to this goal is attentional detection.
  27. Sequential decision problem. No post-processing, because at any point detection can be cut off.