CVPR 2010: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions
1. Grouplet: A Structured Image
Representation for Recognizing
Human and Object Interactions
Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
1
4. Background: Human-Object Interaction
• Schneiderman & Kanade, 2000
• Lowe, 1999
• Papageorgiou & Poggio, 2000
• Viola & Jones, 2001
• Belongie et al., 2002
• Fergus et al., 2003
• Fei-Fei et al., 2004
• Berg & Malik, 2005
• Dalal & Triggs, 2005
• Felzenszwalb & Huttenlocher, 2005
• Felzenszwalb et al., 2005
• Grauman & Darrell, 2005
• Leibe et al., 2005
• Mikolajczyk et al., 2005
• Ren et al., 2005
• Sivic et al., 2005
• Wu & Nevatia, 2005
• Lazebnik et al., 2006
• Ramanan, 2006
• Zhang et al., 2006
• Huang et al., 2007
• Savarese et al., 2007
• Ferrari et al., 2008
• Lampert et al., 2008
• Yang & Mori, 2008
• Andriluka et al., 2009
• Bourdev & Malik, 2009
• Desai et al., 2009
• Eichner & Ferrari, 2009
• Gehler & Nowozin, 2009
• Gupta et al., 2009
• Yao & Fei-Fei, 2010a
• Yao & Fei-Fei, 2010b
Context:
• Murphy et al., 2003
• Hoiem et al., 2006
• Shotton et al., 2006
• Rabinovich et al., 2007
• Heitz & Koller, 2008
• Divvala et al., 2009
4
6. Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion
6
8. Recognizing Human-Object Interaction is Challenging
Reference image: playing saxophone.
[Figure: compared with the reference image, other "playing saxophone" images show a different pose (or viewpoint), different lighting, and different backgrounds; a different instrument can involve a similar pose, and the same object (saxophone) can appear in different interactions.]
8
11. Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion
11
12. Grouplet representation (e.g. 2-Grouplet)
Notation:
• I: image.
• P: reference point in the image.
• Λ = {λ1, λ2}: grouplet.
• λi = {Ai, xi, σi}: feature unit.
  - Ai: visual codeword;
  - xi: image location;
  - σi: variance of the spatial distribution.
[Figure: a 2-grouplet with feature units λ1 and λ2, each drawn as a visual codeword with a Gaussian spatial distribution relative to the reference point P.]
12
13. Grouplet representation (e.g. 2-Grouplet)
Notation:
• I: image.
• P: reference point in the image.
• Λ = {λ1, λ2}: grouplet.
• λi = {Ai, xi, σi}: feature unit (visual codeword Ai, image location xi, spatial variance σi).
• ν(Λ, I): matching score of Λ and I.
• ν(λi, I): matching score of λi and I.
Because a grouplet requires the co-occurrence of all its feature units, the matching score between Λ and I is the minimum of the matching scores between I and each feature unit:
ν(Λ, I) = min_i ν(λi, I)
13
14. Grouplet representation (e.g. 2-Grouplet)
Notation (continued):
• For an image patch: a′ is its visual appearance, x′ its image location.
• Ω(x): image neighborhood of x.
The matching score of a feature unit sums, over the patches in the neighborhood of xi, the codeword assignment score weighted by the Gaussian density value:
ν(Λ, I) = min_i ν(λi, I) = min_i Σ_{x′ ∈ Ω(xi)} p(Ai | a′) · N(x′ | xi, σi)
14
15. Grouplet representation (e.g. 2-Grouplet)
Notation (continued):
• For an image patch: a′ is its visual appearance, x′ its image location.
• Ω(x): image neighborhood of x.
• Δ: a small shift of the location.
To make each feature unit resistant to small spatial variations, the Gaussian distribution is allowed to shift by Δ within a small neighborhood, and the maximum over these shifts is taken:
ν(Λ, I) = min_i ν(λi, I) = min_i max_Δ Σ_{x′_j ∈ Ω(xi + Δ)} p(Ai | a′_j) · N(x′_j | xi + Δ, σi)
(p(Ai | a′_j) is the codeword assignment score; N(x′_j | xi + Δ, σi) is the Gaussian density value.)
15
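The matching score on this slide can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the paper's implementation: each patch is assumed to carry a posterior over codewords (standing in for p(Ai | a′)), the Gaussian is isotropic, and the neighborhood Ω is approximated by a 3σ cutoff. All function and variable names here are made up for the sketch.

```python
import numpy as np

def unit_score(patches, A_i, x_i, sigma_i, shifts):
    """nu(lambda_i, I): max over small shifts Delta of the sum, over patches
    near x_i + Delta, of codeword-assignment score times Gaussian density.
    patches: list of (codeword posterior vector, 2D location) pairs."""
    best = 0.0
    for delta in shifts:
        mu = x_i + delta
        total = 0.0
        for p_codeword, x_patch in patches:
            d2 = float(np.sum((x_patch - mu) ** 2))
            if d2 > (3.0 * sigma_i) ** 2:   # crude stand-in for Omega(x_i + Delta)
                continue
            gauss = np.exp(-d2 / (2.0 * sigma_i**2)) / (2.0 * np.pi * sigma_i**2)
            total += p_codeword[A_i] * gauss
        best = max(best, total)
    return best

def grouplet_score(units, patches, shifts):
    """nu(Lambda, I) = min_i nu(lambda_i, I): a grouplet matches an image
    only if every one of its feature units matches."""
    return min(unit_score(patches, A, x, s, shifts) for (A, x, s) in units)
```

The outer `min` is what gives the grouplet its co-occurrence semantics: one poorly matching feature unit drives the whole grouplet's score toward zero, which is also the property the Apriori mining step exploits.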
20. A “Space” of Grouplets
[Figure: example grouplets on "playing violin" images vs. other interactions.]
20
21. A “Space” of Grouplets
[Figure: example grouplets on "playing saxophone" vs. other interactions, and on "playing violin" vs. other interactions.]
21
22. A “Space” of Grouplets
[Figure: as before, plus less useful grouplets: some lie on the background, and some are shared by different interactions.]
22
23. We only need discriminative Grouplets
A discriminative grouplet has large ν(Λ, I) on images of its interaction (e.g. playing saxophone, playing violin) and small ν(Λ, I) on other interactions. Grouplets that lie on the background, or are shared by different interactions, are not discriminative.
• Number of feature units: N, and N is large (192,200).
• Number of possible grouplets: 2^N — a very large space.
23
24. Obtaining discriminative grouplets for a class
Apriori mining [Agrawal & Srikant, 1994]:
• Obtain grouplets with large ν(Λ, I) on images of the class;
• Remove grouplets with large ν(Λ, I) on images from other classes.
Candidate 2-grouplets are generated only from the selected 1-grouplets, and so on for larger sizes.
With N feature units (N = 192,200), the full space of grouplets has size 2^N; Apriori mining obtains 1000~2000 grouplets while evaluating only about (2~100)·N of them.
24
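The Apriori idea above can be sketched as follows. This is a simplified illustration, not the paper's mining procedure: grouplet support is reduced to a single per-unit score on the target class (the paper scores grouplets per image and also updates spatial distributions), and the `mine_grouplets` / `unit_scores` names are invented for the sketch. The key property carried over is that ν(Λ, I) = min_i ν(λi, I), so a size-k candidate is worth evaluating only if all its size-(k-1) subsets already passed.

```python
from itertools import combinations

def mine_grouplets(unit_scores, threshold, max_size):
    """Apriori-style mining sketch.
    unit_scores: dict mapping feature unit -> matching score on the class.
    A grouplet's score is the min of its units' scores, so any grouplet
    containing a below-threshold unit can be pruned without evaluation."""
    # Level 1: single-unit grouplets that pass the threshold.
    selected = {frozenset([u]) for u, s in unit_scores.items() if s >= threshold}
    all_selected = set(selected)
    for k in range(2, max_size + 1):
        units = sorted({u for g in selected for u in g})
        candidates = set()
        for g in selected:                      # grow only from survivors
            for u in units:
                if u in g:
                    continue
                cand = g | {u}
                # Apriori pruning: every (k-1)-subset must have survived.
                if len(cand) == k and all(
                    frozenset(sub) in all_selected
                    for sub in combinations(cand, k - 1)
                ):
                    candidates.add(frozenset(cand))
        selected = {c for c in candidates
                    if min(unit_scores[u] for u in c) >= threshold}
        all_selected |= selected
    return all_selected
```

Because candidates are grown only from surviving grouplets, the number of evaluations stays roughly linear in the number of feature units, rather than exponential as in brute-force enumeration.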
25. Using Grouplets for Classification
Each image I is represented by its vector of matching scores against the N mined discriminative grouplets Λ1, …, ΛN:
[ν(Λ1, I), …, ν(ΛN, I)]
This feature vector is fed to an SVM.
25
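Concretely, the representation on this slide is just a fixed-length score vector per image. A minimal sketch, assuming a `score_fn` with the shape of the matching score ν (the helper name and interface are invented here):

```python
import numpy as np

def grouplet_feature(image, grouplets, score_fn):
    """Phi(I) = [nu(L1, I), ..., nu(LN, I)]: one matching score per mined
    discriminative grouplet, in a fixed order shared by all images."""
    return np.array([score_fn(g, image) for g in grouplets])
```

Any standard linear SVM (for instance scikit-learn's `LinearSVC`) can then be trained on these vectors, one per training image.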
26. Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion
26
28. Recognition Tasks on the People-Playing-Musical-Instruments (PPMI) Dataset
Classification:
• Playing different instruments (e.g. playing bassoon vs. playing French horn vs. playing saxophone vs. playing violin);
• Playing vs. not playing (e.g. playing violin vs. not playing violin).
Detection:
• Detecting people playing instruments (e.g. playing saxophone).
For each interaction, 100 training and 100 testing images.
28
29. Classification: Playing Different Instruments
• 7-class classification on PPMI+ images.
• Grouplet+SVM achieves 65.7% accuracy, outperforming the BoW, Constellation, DPM, and SPM baselines (37.7%–59.9%).
[Figure: left, classification accuracy per method; right, number of mined grouplets as a function of grouplet size (1–6).]
SPM: [Lazebnik et al., 2006]; DPM: [Felzenszwalb et al., 2008]; Constellation: [Fergus et al., 2003], [Niebles & Fei-Fei, 2007]
29
30. Classifying Playing vs. Not playing
• Seven 2-class classification problems: PPMI+ vs. PPMI- for each instrument (bassoon, erhu, flute, French horn, guitar, saxophone, violin).
[Table: per-instrument accuracy of BoW, DPM, SPM, and Grouplet+SVM, with the average over instruments; example PPMI+ and PPMI- images shown alongside.]
30
32. Detecting people playing musical instruments
Procedure:
• Face detection with a low threshold;
• Crop and normalize image regions;
• 8-class classification:
  - 7 classes of playing instruments;
  - 1 class of not playing any instrument.
[Figure: example regions labeled "playing saxophone" and "not playing".]
32
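The three-step procedure above can be sketched as a pipeline. Every callable here (`detect_faces`, `crop_region`, `classify8`) is an assumed interface standing in for components the slide names, not code from the paper:

```python
def detect_interactions(image, detect_faces, crop_region, classify8, labels):
    """Sketch of the slide's detection pipeline: run a face detector with a
    low threshold, crop/normalize a region around each detected face, then
    apply an 8-way classifier whose extra class ("not playing") absorbs both
    non-playing people and false face detections."""
    detections = []
    for face_box in detect_faces(image, threshold=0.1):   # deliberately low
        region = crop_region(image, face_box)
        scores = classify8(region)          # 7 instrument classes + 1 reject
        label = labels[max(range(len(scores)), key=scores.__getitem__)]
        if label != "not playing":
            detections.append((face_box, label))
    return detections
```

Running the face detector at a low threshold trades precision for recall; the 8th classifier class then filters the resulting false alarms.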
33. Detecting people playing musical instruments
Area under the precision-recall curve:
• Our method: 45.7%;
• Spatial pyramid: 37.3%.
[Figure: correct detections, e.g. playing bassoon, playing saxophone, playing French horn.]
33
34. Detecting people playing musical instruments
Area under the precision-recall curve:
• Our method: 45.7%;
• Spatial pyramid: 37.3%.
[Figure: failure cases, e.g. a false detection and a missed detection of playing French horn.]
34
35. Examples of Mined Grouplets
[Figure: mined grouplets for playing bassoon, playing saxophone, playing violin, and playing guitar.]
35
36. Conclusion
• Holistic image-based classification (this talk) vs. detailed understanding and reasoning via pose estimation & object detection (the next talk).
[B. Yao and L. Fei-Fei. "Grouplet: A structured image representation for recognizing human and object interactions." CVPR 2010.]
[B. Yao and L. Fei-Fei. "Modeling mutual context of object and human pose in human-object interaction activities." CVPR 2010.]
36
37. Thanks to
Juan Carlos Niebles, Jia Deng, Jia Li, Hao
Su, Silvio Savarese, and anonymous reviewers.
And You
37
Editor's Notes
Some people may ask about the difference between our work and Hough voting approaches.
Given an image and a reference point, the grouplet feature considers the co-occurrence of a set of highly related image patches. Those patches are encoded by feature units, which model specific appearance and location information. The appearance information is represented by a visual codeword, while the location information is represented by a 2D Gaussian distribution. Here we show a 2-grouplet, which contains two feature units.
Given an image of a human-object interaction, with the center of the human face as the reference point, we need to calculate the matching score between the image and the grouplet, which measures the likelihood of observing the grouplet in the image. Because the grouplet requires the co-occurrence of all its feature units, the matching score between the image and the grouplet is the minimum of the scores between the image and each feature unit in the grouplet.
Given one feature unit, we consider the image patches in the neighborhood of the center of the Gaussian distribution, and measure the probability of assigning each patch to the codeword of the feature unit.
Furthermore, to make the feature unit more resistant to small spatial variations, we allow the Gaussian distribution to shift within a small neighborhood. This yields a set of matching scores, and their maximum is taken as the matching score between the feature unit and the image.
For the first step, we apply an Apriori mining approach to make it tractable. In Apriori mining, we start from 1-grouplets, each consisting of one feature unit. We then generate candidate 2-grouplets from only the selected 1-grouplets; no other 2-grouplets need to be considered, because by our definition, once a feature unit has a small matching score with an image, every grouplet containing that feature unit also has a small matching score. The procedure continues with the candidate 2-grouplets until all grouplets with large matching scores on images of the class are obtained. With Apriori mining, when we want to obtain thousands of grouplets, the total number of grouplets that must be evaluated is linear in the number of feature units, instead of exponential as in the brute-force approach. Furthermore, in our method we assume an initial spatial distribution for each feature unit and use a probabilistic model to update those distributions.
However, there is not much work on activities of human-object interaction, nor a suitable dataset for this problem. We therefore collected a new dataset of people playing musical instruments, called PPMI. It currently covers seven musical instruments, and for each instrument there are not only PPMI+ images of people playing it, but also PPMI- images of people holding it without playing. The dataset therefore offers an opportunity to study how humans interact with objects, a property not captured by the sports dataset of Gupta et al., 2009, which may be the only existing dataset involving human-object interactions. We also normalize each original image by cropping the upper-body region of the person; both the original and normalized images are available on our website.
We also test the grouplet feature on detecting people playing musical instruments in the original images. This is a very challenging problem because we also want the detector to identify which instrument the person is playing. Instead of the traditional scanning-window method, we first run a face detector, and based on the face detection results we train an eight-class SVM classifier: seven classes for the different musical instruments, plus one class for cases where the face detection is a false alarm or the person is not playing any instrument. Preliminary results show that our method outperforms the spatial pyramid approach on this very challenging problem.