Promising avenues for interdisciplinary research in vision

Dr. Oge Marques
Associate Professor
Computer Science and Engineering
Florida Atlantic University
Boca Raton, FL (USA)

June 2009

Take-home message

We postulate that many
challenging problems in
human and computer vision
research can be approached
in a truly interdisciplinary
way and show examples of
recent work on the topic of
“objects in context” that
support our claim.

Outline
•  Background and motivation
•  Visual perception
•  Object detection and recognition
•  Scene recognition and analysis
•  The role of context
•  Representative work
•  Concluding remarks

Background and motivation
Computer vision is not as easy as it seemed 40+ years ago

Background and motivation
•  Computer vision has many open research
questions
–  Object detection, recognition, and categorization
–  Scene analysis, recognition, and understanding
–  Objects in context

•  Research in human vision has grown
tremendously
–  Computational models of selected visual processes
have emerged

•  A truly interdisciplinary effort can help bring the
best of human vision research into selected
problems in computer vision.

The fundamental question of
vision

How are we able so quickly and
effortlessly to perceive meaningful,
coherent, 3D scenes from
incomplete, 2D patterns of light
that enter our eyes?

A selected related question

  How are we able to perceive, 
detect, categorize and recognize 
objects and scenes?

Vision Science
Interdisciplinary study of many areas of visual 
processing and function 

Areas of Research
•  Detection
•  Attention
•  Memory
•  Recognition
•  Motion perception
•  etc.

Disciplines
•  Psychology
•  Neuroscience
•  Biology
•  Computer Science
•  Engineering
•  etc.

Reverse engineering the
perceptual system

•  We know the
visual system
works
•  But how?

We don’t ‘see’ with our eyes

We see with our brains!

The hierarchical nature of the
scientific knowledge of the visual
system

The deeper in the system you go, the less we know…

What do we know about visual
perception?
Not much compared to what we don’t know

Ignorance

Knowledge

The perceptual process

Source: E.B. Goldstein, “Sensation and Perception”

Four Stages of Visual Perception
Inspired by work by David Marr
(1945-1980)

•  One of the most influential neuroscientists of vision.
•  Thought of vision as an information-processing
task.
•  In his book Vision (1982), he distinguished three
different levels of description involved in
understanding complex information processing
systems:
–  Computational level
–  Algorithmic level
–  Implementation level
•  An important point is that the levels can be
considered independently.


“cup”

The challenge of object
recognition
•  Why is it so difficult for computers to carry
out object recognition tasks that humans
can perform easily?

•  Although most human visual perception
appears to be almost effortless, it involves
complex “behind the scenes” processes.

recognition
Human vision scientist: “Let’s
look at selected behavioral
and neural processes that
make it possible for people to
perceive (i.e., detect and
recognize) objects.”

Computer vision scientist:
“Let’s model what is known –
and reasonable – and try it
out on standard databases
containing real-world images.”

perception
•  The stimulus on the receptors is ambiguous

perception

The inverse projection problem

perception

http://users.skynet.be/J.Beever/pave.htm

perception
•  Objects can be hidden or blurred

Can you find…
- the pencil?
- the glasses?

perception
•  Objects can be hidden or blurred

Who are these people?

perception
Objects look different from different viewpoints

The ability of humans to recognize an object seen from different
viewpoints is called viewpoint invariance.

perception
Objects look different from different viewpoints

Q: Which two faces correspond to the same person?

A1 (human): (a) and (c)
A2 (computer): (a) and (b)

Research question
How do we recognize objects from different
viewpoints?
Structural-Description Models
Propose that our ability to recognize 3D objects is based on
3D volumes (called volumetric features) that can be combined
to create the overall shape of an object.

Image-Description Models
Propose that our ability to recognize objects from different
viewpoints is based on stored 2D views of the object as it
would appear from different viewpoints.

Which Model Is Correct? The actual mechanism for object recognition probably
involves elements of both the structural-description and image-description models
(Palmeri & Gauthier, 2004)

Why do we care about object
recognition?

Because object recognition leads
to
perception of function.

So, what do we use: direct or
indirect perception?
“It seems exceedingly unlikely (though
logically possible) that we categorize
everything in our visual fields”, Palmer.

Hypothesis: we categorize the objects
that are relevant for a specific task that we
have at hand, but we only extract
affordances from the others.

Object detection and the
“Head in the coffee beans
problem”

“Head in the coffee beans
problem”
Can you find the head in this image?

So what does object recognition involve?

Slide by Fei-Fei, Fergus, Torralba

Verification: is that a lamp?


Detection: are there people?


Identification: is that Potala Palace?


Object categorization

mountain

tree
building
banner

street lamp

vendor
people

Scene and context categorization
•  outdoor
•  city
•  …


Is this space large or small?
How far are the buildings in the back?


Activity

What is this person doing?
What are these two doing?


What is a scene?
•  A scene is a view of a real-world
environment that contains multiple
surfaces and objects, organized in a
meaningful way.

–  A tour of scene understanding literature:
http://cvcl.mit.edu/SUNSarticles.htm

The “gist” of a scene
•  Mary Potter (1975, 1976) demonstrated that
during a rapid sequential visual presentation (100
msec per image), a novel scene picture is indeed
instantly understood and observers seem to
comprehend a lot of visual information, but a
delay of a few hundred msec (~300 msec) is
required for the picture to be consolidated in
memory.
•  The “gist” (a summary) refers to the visual
information perceived after/during a glance at an
image.
•  To simplify, the gist is often synonymous with the
basic level category of the scene or event (e.g.
wedding, bathroom, beach, forest, street)

What we (don’t) know about scene
analysis, recognition, and
classification
•  Humans are very good at recognizing and
classifying scenes
•  We are also very fast (100 ms or less)
•  We often sacrifice accuracy in the name of
speed (we capture the gist but miss many
details)

•  How exactly do we do it?

What is the basis for scene
identification?

•  Different schools of thought:
– Scene-centered
– Part-based (i.e., object-centered)
– Holistic

Objects in context
•  Objects do not exist isolated from a
context

•  Torralba’s challenge: “How far can you
go without using an object detector?”

The multiple personalities of a blob

Biederman 1982

•  Pictures shown for 150
ms.
•  Objects in appropriate
context were detected
more accurately than
objects in an
inappropriate context.
•  Scene consistency
affects object detection.

Objects and Scenes
Biederman’s violations (1981):

Biederman’s classes in Computer Vision
Galleguillos & Belongie, Tech Report (2008)

•  Interposition and support can be coded by
reference to physical space.
•  Probability, position and size are defined as
semantic relations because they require access
to the referential meaning of the object.
•  Semantic relations include information about
detailed interactions among objects in the scene
and they are often used as contextual features.

Dreaming of an ideal computer vision solution…

Types of context

•  Contextual features can be grouped into 3
categories:
–  semantic context (probability)
–  spatial context (position)
–  scale context (size).

•  Contextual knowledge can be any information that is
not directly produced by the appearance of an object.

•  It can be obtained from:
–  the nearby image data;
–  image tags or annotations;
–  the presence and location of other objects.

Acquiring and modeling context

•  Which of the three should one use?
–  Spatial and scale context are the most
exploited types of context by recognition
frameworks.
–  Generally, semantic context is implicitly
present in spatial context, as information about
object co-occurrences comes from identifying
objects for the spatial relations in the scene.
–  The same happens with scale context, as scale
is measured with respect to other objects.
–  Therefore, using spatial and scale context
involves using all forms of contextual
information in the scene.
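
To make the notion of semantic context (probability) concrete, here is a minimal sketch of how object co-occurrence statistics could be estimated from annotated scenes. The toy annotations, the function names, and the choice of a simple conditional frequency are all assumptions made for illustration; they are not taken from any of the works cited in this talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotations: each scene is the set of object labels it contains.
scenes = [
    {"car", "road", "building"},
    {"car", "road", "tree"},
    {"sofa", "lamp", "table"},
    {"road", "tree", "building"},
]

def cooccurrence_counts(scenes):
    """Count how often each object and each unordered object pair appears."""
    pair_counts = Counter()
    object_counts = Counter()
    for objects in scenes:
        object_counts.update(objects)
        for a, b in combinations(sorted(objects), 2):
            pair_counts[(a, b)] += 1
    return pair_counts, object_counts

def p_given(a, b, pair_counts, object_counts):
    """Estimate P(a is present | b is present) as a relative frequency."""
    if object_counts[b] == 0:
        return 0.0
    key = tuple(sorted((a, b)))
    return pair_counts[key] / object_counts[b]

pair_counts, object_counts = cooccurrence_counts(scenes)
print(p_given("car", "road", pair_counts, object_counts))
print(p_given("sofa", "road", pair_counts, object_counts))
```

A recognition framework could use such estimates to favor object hypotheses that are consistent with the other objects already found in the scene.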

Representative work
•  There are many research groups working
on the intersection of human and
computer vision in numerous topics,
including “objects in context”.
•  Most notable example: work by
Aude Oliva and Antonio Torralba (and
collaborators) at MIT.

Representative work
•  A case study:

–  L.W. Renninger and J. Malik (2004). When is
scene recognition just texture
recognition? Vision Research, 44,
2301-2311.

Renninger and Malik

•  Basic idea
–  Consider texture as an
early cue for scene
perception.
•  It’s simple
•  It’s fast (pre-attentive)
(Julesz, 1981)

Renninger and Malik
•  Approach
–  How well do humans discriminate scenes
with very limited exposure?
–  Build a texture-based model for scene
discrimination.
–  Compare performance!

Renninger and Malik
•  Task
–  2AFC
–  Subjects are shown an image
•  Image exposure time: 37, 50 and 69ms
–  Image followed by a jumbled scene mask
–  The task is to select one of two word choices
that best describes the image
–  Subject performance: 77%, 82% and 92%
correct
•  Get ready…

Texture Discrimination Model
–  Cluster response distributions from V1-like
filters to get prototypical responses (textons)
–  Remember what types of textons occur in
particular scenes (build histogram)
–  Label new image using a nearest neighbor
classifier
•  Compare texton histogram for new image to stored
representations (χ² distance)
(Malik and Perona, 1990)
(Malik et al., 1999)
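
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the filter responses are assumed to be already computed and stored as one vector per pixel, plain k-means stands in for the clustering step, and all function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means (an assumed choice); returns k centres, the 'textons'."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each response vector to its nearest centre.
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres

def texton_histogram(responses, textons):
    """Normalized histogram of the nearest texton for each pixel response."""
    d = ((responses[:, None, :] - textons[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify(hist, train_hists, train_labels):
    """Nearest-neighbor label under the chi-squared distance."""
    dists = [chi2_distance(hist, h) for h in train_hists]
    return train_labels[int(np.argmin(dists))]
```

In use, `kmeans` would be run once on filter responses pooled over the training images, `texton_histogram` would be computed for every training and test image, and `classify` would label each test image with the category of its closest training histogram.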

•  V1-like filters

•  Textons

Texture
Discrimination
Model

Confusion matrix

              Natural   Outdoor/MM   Indoor
Natural         50.56        33.26    16.19
Outdoor/MM      23.14        46.54    30.33
Indoor           8.12        18.69    73.18

Discrimination of
Superordinate Categories
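
As a reading aid for the confusion matrix above, the short sketch below computes per-category accuracy and the most frequent confusion. It assumes that rows are the true categories, columns are the responses, and entries are percentages (each row sums to roughly 100); the slide itself does not state this, and the unweighted mean further assumes equally sized categories.

```python
import numpy as np

labels = ["Natural", "Outdoor/MM", "Indoor"]

# Values copied from the slide; row = true category (assumed), column = response.
confusion = np.array([
    [50.56, 33.26, 16.19],
    [23.14, 46.54, 30.33],
    [8.12, 18.69, 73.18],
])

row_sums = confusion.sum(axis=1)   # sanity check: each close to 100
per_class = np.diag(confusion)     # percent correct per category
mean_accuracy = per_class.mean()   # unweighted mean over categories

# Largest off-diagonal entry = most frequent confusion.
off_diagonal = confusion.copy()
np.fill_diagonal(off_diagonal, 0.0)
i, j = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(dict(zip(labels, per_class)))
print(round(mean_accuracy, 2))
print(f"Most frequent confusion: {labels[i]} labeled as {labels[j]}")
```

Under these assumptions, indoor scenes are discriminated best, and natural scenes are most often confused with outdoor/MM scenes.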

Renninger and Malik
•  Conclusion
–  Early scene identification can be mostly
explained by a simple texture model

Our experience
•  Working with Dept of Psychology @ FAU
–  Two joint graduate-level courses
–  Joint student supervision
–  Joint grant proposals
–  Joint papers (in preparation)
–  Constant discussions
–  Promising days ahead…
•  Imaging Science & Technology Center
•  Multidisciplinary Vision Program

Our focus
•  To establish quantitative measures
of the importance of context
– Method: present subjects with degraded
(blocky, blurry, etc.) objects against a
context and ask them to recognize the
object as it becomes progressively more
visible.
– Human vision: behavioral experiments
– Computer vision: stimuli creation

Concluding remarks
•  Great potential
•  Cultural barriers
•  Open problems and challenges on both
sides
•  The time is ripe for interdisciplinary
research on vision, particularly “objects in
context”

Acknowledgments
•  Thanks to Prof. Elan Barenholtz (Dept of
Psychology, FAU) for allowing me to use some
of his slides and for the many interesting
discussions on the topics presented in this talk.

•  Many slides for this talk contain material made
publicly available on the Web by Antonio
Torralba and Aude Oliva (MIT) and Fei-Fei Li
(UIUC).

Thank you for attending my talk!

Questions?

Email: omarques@fau.edu
