Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
NIPS2009: Understand Visual Scenes - Part 1
1. Understanding Visual Scenes
Antonio Torralba
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Department of Electrical Engineering and Computer Science
4. SKY
TREE
PERSON BENCH
PERSON
PATH
LAKE PERSON
DUCK
PERSON
DUCK
SIGN DUCK
GRASS
8. Why do we care about recognition?
Perception of function: We can perceive the 3D shape, texture,
material properties, without knowing about objects.
But, the concept of category encapsulates also information
about what can we do with those objects.
“We therefore include the perception of function as a proper –indeed, crucial- subject
for vision science”, from Vision Science, chapter 9, Palmer.
9. The perception of function
• Direct perception (affordances): Gibson
Flat surface
Horizontal Sittable
Knee-high upon
…
• Mediated perception (Categorization)
Flat surface
Horizontal Chair Sittable
Knee-high upon
… Chair Chair
Chair?
10. Direct perception
Some aspects of an object function can be
perceived directly
• Functional form: Some forms clearly
indicate to a function (“sittable-upon”,
container, cutting device, …)
Sittable-upon Sittable-upon It does not seem easy
to sit-upon this…
Sittable-upon
14. Direct perception
Some aspects of an object function can be
perceived directly
• Observer relativity: Function is observer
dependent
From http://lastchancerescueflint.org
15. Limitations of Direct Perception
Objects of similar structure might have very different functions
Not all functions seem to be available from direct visual information only.
18. Object recognition
Is it really so hard?
Find the chair in this image Output of normalized correlation
This is a chair
19. Object recognition
Is it really so hard?
Find the chair in this image
Pretty much garbage
Simple template matching is not going to make it
20. So, let’s make the problem simpler:
Block world
Object Recognition in the Geometric Era:
a Retrospective. Joseph L. Mundy. 2006
21. Binford and generalized Recognition by
cylinders components
Irving Biederman
Recognition-by-Components: A Theory of Human Image
Understanding.
Psychological Review, 1987.
Object Recognition in the Geometric Era:
Introduced in computer vision by A. Pentland, 1986. a Retrospective. Joseph L. Mundy. 2006
22. Families of recognition algorithms
Voting models Shape matching
Bag of words models
Deformable models
Viola and Jones, ICCV 2001 Berg, Berg, Malik, 2005
Csurka, Dance, Fan, Willamowski, and Heisele, Poggio, et. al., NIPS 01
Cootes, Edwards, Taylor, 2001
Bray 2004 Schneiderman, Kanade 2004
Sivic, Russell, Freeman, Zisserman, Vidal-Naquet, Ullman 2003
ICCV 2005
Rigid template models
Constellation models
Fischler and Elschlager, 1973 Sirovich and Kirby 1987
Turk, Pentland, 1991
Burl, Leung, and Perona, 1995
Weber, Welling, and Perona, 2000 Dalal & Triggs, 2006
Fergus, Perona, & Zisserman, CVPR 2003
23. The face age
Feret dataset, 1996 DARPA
• The representation and matching of pictorial structures Fischler,
Elschlager (1973).
• Face recognition using eigenfaces M. Turk and A. Pentland (1991).
• Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
• Graded Learning for Object Detection - Fleuret, Geman (1999)
• Robust Real-time Object Detection - Viola, Jones (2001)
• Feature Reduction and Hierarchy of Classifiers for Fast Object Detection
in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
• ….
25. Haar-like filters and cascades
Viola and Jones, ICCV 2001
The average intensity in the
block is computed with four
sums independently of the
block size.
Also Fleuret and Geman, 2001
26. Generic objects:
Edge based descriptors
Gavrila, Philomin, ICCV 1999
Papageorgiou & Poggio (2000)
J. Shotton, A. Blake, R. Cipolla. PAMI 2008.
Opelt, Pinz, Zisserman, ECCV 2006
27. Histograms of oriented gradients
Shape context
SIFT, D. Lowe, ICCV 1999 Belongie, Malik, Puzicha, NIPS 2000
32. Evaluation of performance
Before plotting and ROC or precision-recall curves…
The detector challenge: by looking at the output of a detector on a random set
of images, can you guess which object is it trying to detect?
33. What object is detector trying to detect?
The detector challenge: by looking at the output of a detector on a random set
of images, can you guess which object is it trying to detect?
34. What object is detector trying to detect?
Table detector
The detector challenge: by looking at the output of a detector on a random set
of images, can you guess which object is it trying to detect?
43. Mary Potter (1976)
Mary Potter (1975, 1976) demonstrated that during a rapid sequential
visual presentation (100 msec per image), a novel picture is instantly
understood and observers seem to comprehend a lot of visual
information
44. Demo : Rapid image understanding
By Aude Oliva
Instructions: 9 photographs will be shown for
half a second each. Your task is to memorize
these pictures
55. Memory Test
Which of the following pictures have you seen ?
If you have seen the image
clap your hands once
If you have not seen the image
do nothing
68. You have seen these pictures
You were tested with these pictures
69. The gist of the scene
In a glance, we remember the meaning of an
image and its global layout but some
objects and details are forgotten
70. From objects to scenes
SceneType 2 {street, office, …} S
Object localization O1 O1 O1 O1 O2 O2 O2 O2
Local features L L L L Riesenhuber & Poggio (99); Vidal
-Naquet & Ullman (03); Serre &
Poggio, (05); Agarwal & Roth, (02),
Moghaddam, Pentland (97), Turk,
Pentland (91),Vidal-Naquet, Ullman,
I (03) Heisele, et al, (01), Agarwal &
Image Roth, (02), Kremp, Geman, Amit (02),
Dorko, Schmid, (03) Fergus, Perona,
Zisserman (03), Fei Fei, Fergus,
Perona, (03), Schneiderman, Kanade
(00), Lowe (99)
71. What makes scenes different?
Ceiling
Ceiling Lamp wall
Light painting
Door Door Painting mirror
Door mirror
Wall Door Wall Door wall wall Lamp
phone
Fireplace Bed alarm
armchair armchair
Floor Side-table
Coffee table carpet
Different objects, different spatial layout
72. What makes scenes different?
ceiling
lamp lamp
cabinet
glasses mirror
Window mirror mirror
shelves wall
beer
dishes faucet faucet towel counter
sink counter sink soap sink
counter
cabinet
cabinet cabinet
cabinet stool
Different objects, similar spatial layout
77. Scene emergent features
“Recognition via features that are not those of individual objects but “emerge” as
objects are brought into relation to each other to form a scene.” – Biederman 81
From “on the semantics of a glance at a scene”, Biederman, 1981
78. Examples of scene emergent features
Suggestive edges and junctions Simple geometric forms
Blobs Textures
79. Ensemble statistics
Ariely, 2001, Seeing sets: Representation by statistical properties
Chong, Treisman, 2003, Representation of statistical properties
Alvarez, Oliva, 2008, 2009, Spatial ensemble statistics
Conclusion: observers had
more accurate representation
of the mean than of the
individual members of the set.
80. From scenes to objects
SceneType 2 {street, office, …} S
Object localization O1 O1 O1 O1 O2 O2 O2 O2
G
• Scene emergent
Local features L L L L • Ensemble statistics
• Global features
I
Image
81. How far can we go without objects?
SceneType 2 {street, office, …} S
G
• Scene emergent
Local features L L L L • Ensemble statistics
• Global features
I
Image
83. Global image descriptors
Bag of words Spatially organized textures
M. Gorkani, R. Picard, ICPR 1994
A. Oliva, A. Torralba, IJCV 2001
Sivic et. al., ICCV 2005
Fei-Fei and Perona, CVPR 2005
Non localized textons
Walker, Malik. Vision Research 2004
… S. Lazebnik, et al, CVPR 2006 …
R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age,
ACM Computing Surveys, vol. 40, no. 2, pp. 5:1-60, 2008.
84. Gist descriptor
Oliva and Torralba, 2001
• Apply oriented Gabor filters
over different scales
• Average filter energy
in each bin
8 orientations
4 scales
x 16 bins
512 dimensions
Similar to SIFT (Lowe 1999) applied to the entire image
M. Gorkani, R. Picard, ICPR 1994; Walker, Malik. Vision Research 2004; Vogel et al. 2004;
Fei-Fei and Perona, CVPR 2005; S. Lazebnik, et al, CVPR 2006; …
86. Bag of words &
spatial pyramid matching
Sivic, Zisserman, 2003. Visual words = Kmeans of SIFT descriptors
S. Lazebnik, et al, CVPR 2006
87. The 15-scenes benchmark
Oliva & Torralba, 2001
Fei Fei & Perona, 2005
Lazebnik, et al 2006
Office
Skyscrapers Suburb Building facade Coast Forest Bedroom Living room
Industrial Street Highway Mountain Open country Kitchen Store
94. Training images Correct classifications Miss-classifications
Monastery Cathedral Castle
Abbey
Toy shop Van Discotheque
Airplane cabin
Subway Stage Restaurant
Airport terminal
Restaurant Courtyard Canal
patio
Alley
Harbor Coast Athletic
field
Amphitheater
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
95. Example of three different scenes
RIVER
BEACH
VILLAGE
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
96. But they are all part of the same picture
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
97. But they are all part of the same picture
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
98. Scene detection
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
99. Categories or a continuous space?
Check poster by Malisiewicz, Efros
100. Categories or a continuous space?
From the city to the mountains in 10 steps
101. Spatial envelope:
a continuous space of scenes
Highway
Street
City centre
Tall Building
Degree of Expansion
Degree of Openness Oliva & Torralba, 2001
102. Spatial envelope:
a continuous space of scenes
Coast
Countryside
Forest
Mountain
Degree of Ruggedness
Degree of Openness
Oliva & Torralba, 2001
105. Look‐Alikes by Joan Steiner
Even in high resolution, we can not shut down contextual processing and it is hard
to recognize the true identities of the elements that compose this scene.
109. Scene recognition without
object recognition
Scene
S
g
Scene
features
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
110. An integrated model of Scenes,
Objects, and Parts
Scene
S
Ncar
P(Ncar | S = street)
01 5 N g
P(Ncar | S = park)
Scene
gist
features
01 5 N
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
112. The layered structure of scenes
Assuming a human observer standing on the ground
p(x) p(x2|x1)
In a display with multiple targets present, the location of one target constraints the ‘y’
coordinate of the remaining targets, but not the ‘x’ coordinate.
Torralba, Oliva, Castelhano, Henderson. 2006
113. Context driven object detection
Scene
S
Ncar Zcar
P(Ncar | S = street)
01 5 N g
Scene
gist
features
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
114. An integrated model of Scenes,
Objects, and Parts
We train a multiview car detector.
car
Fi p(d | F=1) = N(d | µ1, σ1)
p(d | F=0) = N(d | µ0, σ0)
dcari xcari
N=4
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
115. An integrated model of Scenes,
Objects, and Parts
Scene
S
Ncar Zcar
car
Fi
g
Scene
dcari xcari gist
features
M=4
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.