Grouplet: A Structured Image
 Representation for Recognizing
 Human and Object Interactions
         Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
   {bangpeng,feifeili}@cs.stanford.edu




                                                   1
Human-Object Interaction




Playing saxophone
    Human           Not playing saxophone
                          Saxophone
                                            2
Human-Object Interaction




Robots interact      Automatic sports           Medical care
 with objects          commentary
                  “Kobe is dunking the ball.”

                                                               3
Background: Human-Object Interaction


                                                                            To be done
•   Schneiderman & Kanade, 2000
•   Viola & Jones, 2001                 •    Lowe, 1999
•   Huang et al, 2007                   •    Belongie et al, 2002
•   Papageorgiou & Poggio, 2000         •    Fergus et al, 2003
•   Wu & Nevatia, 2005                  •    Fei-Fei et al, 2004
•   Dalal & Triggs, 2005                •    Berg & Malik, 2005
•   Mikolajczyk et al, 2005             •    Felzenszwalb et al, 2005
•   Leibe et al, 2005                   •    Grauman & Darrell, 2005
•   Bourdev & Malik, 2009               •    Sivic et al, 2005
•   Felzenszwalb & Huttenlocher, 2005   •    Lazebnik et al, 2006
•   Ren et al, 2005                     •    Zhang et al, 2006          • Gupta et al, 2009
•   Ramanan, 2006                       •    Savarese et al, 2007       • Yao & Fei-Fei, 2010a
•   Ferrari et al, 2008                 •    Lampert et al, 2008
                                                                        • Yao & Fei-Fei, 2010b
•   Yang & Mori, 2008                   •    Desai et al, 2009
•   Andriluka et al, 2009               •    Gehler & Nowozin, 2009
•   Eichner & Ferrari, 2009      context
                   • Murphy et al, 2003     • Rabinovich et al, 2007
                   • Hoiem et al, 2006      • Heitz & Koller, 2008
                   • Shotton et al, 2006    • Divvala et al, 2009
                                                                                                 4
Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion


                                         6
Recognizing Human-Object Interaction is Challenging

Reference image:
playing saxophone




                    Different pose       Different      Different
                    (or viewpoint)       lighting      background




                            Different                Same object
                       instrument, similar       (saxophone), different
                              pose                    interactions    8
Grouplet: our intuition
    [Figure: four panels comparing image representations: Bag-of-words, Spatial pyramid, Part-based, and Grouplet Representation]
                                                                                                                       9
Grouplet: our intuition
                                                  Grouplet
                                                  Representation:

                                                • Part-based
                                                configuration
                                                • Co-occurrence
                                                • Discriminative
                                                • Dense




Capture the subtle difference in human-object interactions.


                                                                10
Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion


                                         11
Grouplet representation (e.g. 2-Grouplet)

[Figure: image I with reference point P and a 2-grouplet Λ = {λ1, λ2}, where
 λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}; each feature unit combines a visual
 codeword with a Gaussian distribution.]

Notations
• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
  - a′: Its visual appearance;
  - x′: Its image location.
• Ω(x): Image neighborhood of x.

Matching score between Λ and I:
    \nu(\Lambda, I) = \min_i \nu(\lambda_i, I)

Matching score between λi and I:
    \nu(\lambda_i, I) = \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i)

Here p(Ai | a′) is the codeword assignment score and N(x′ | xi, σi) is the Gaussian density value.

                                                                                                                14
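The two matching scores above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that each image patch is given as a location plus a list of codeword assignment probabilities p(A | a′), that Ω(xi) is a disc of a chosen radius, and that the Gaussian is isotropic in 2-D. The names `unit_score`, `grouplet_score`, `patches` and `radius` are hypothetical.

```python
import math

def gaussian_density(x, mean, var):
    """Isotropic 2-D Gaussian density N(x | mean, var); isotropy is an assumption."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def unit_score(unit, patches, radius):
    """nu(lambda_i, I): sum over patches in Omega(x_i) of p(A_i | a') * N(x' | x_i, sigma_i)."""
    codeword, x_i, var_i = unit
    total = 0.0
    for x, probs in patches:
        # Omega(x_i) is modelled as a disc of the given radius (assumed).
        if math.hypot(x[0] - x_i[0], x[1] - x_i[1]) <= radius:
            total += probs[codeword] * gaussian_density(x, x_i, var_i)
    return total

def grouplet_score(grouplet, patches, radius=10.0):
    """nu(Lambda, I): the minimum of the feature-unit scores."""
    return min(unit_score(u, patches, radius) for u in grouplet)

# Toy image: (location, [p(A_0 | a'), p(A_1 | a')]) for three patches.
patches = [((0.0, 0.0), [0.9, 0.1]),
           ((5.0, 0.0), [0.2, 0.8]),
           ((50.0, 50.0), [0.5, 0.5])]

unit_a = (0, (0.0, 0.0), 4.0)        # codeword 0 expected near (0, 0)
unit_b = (1, (5.0, 0.0), 4.0)        # codeword 1 expected near (5, 0)
unit_far = (0, (100.0, 100.0), 4.0)  # no patch falls in its neighborhood

score_ab = grouplet_score([unit_a, unit_b], patches)
score_missing = grouplet_score([unit_a, unit_far], patches)
print(score_ab, score_missing)
```

Because the grouplet score is a minimum, one missing feature unit drives the whole score to zero, which is how the representation encodes co-occurrence.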
Grouplet representation (e.g. 2-Grouplet)

[Figure: image I with reference point P and a 2-grouplet Λ = {λ1, λ2}, where
 λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}; each feature unit combines a visual
 codeword with a Gaussian distribution.]

Notations
• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
  - a′: Its visual appearance;
  - x′: Its image location.
• Ω(x): Image neighborhood of x.
• Δ: A small shift of the location.

Matching score between Λ and I, with the small shift Δ:
    \nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \max_{\Delta} \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i + \Delta, \sigma_i)

Here p(Ai | a′) is the codeword assignment score and N(x′ | xi + Δ, σi) is the Gaussian density value.

                                                                                                                15
Grouplet representation

[Figure: image I with reference point P and a 2-grouplet Λ = {λ1, λ2}, where
 λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}.]

• Part-based configuration
• Co-occurrence
• Discriminative

Playing saxophone:    matching score: 0.6    matching score: 0.4
Other interactions:   matching score: 0.0    matching score: 0.1

                                                                                            16
Grouplet representation

[Figure: image I with reference point P and a 2-grouplet Λ = {λ1, λ2}, where
 λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}.]

• Part-based configuration
• Co-occurrence
• Discriminative
• Dense
  - All possible codewords
  - Densely sample image locations
  - Many possible spatial distributions
  - All possible combinations of feature units: 1-grouplet, 2-grouplet, 3-grouplet

                                                                                                         17
Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion


                                         18
A “Space” of Grouplets
  Playing      Other                           Playing      Other
saxophone   interactions                        violin   interactions




   On background




                                                Shared by different
                                                   interactions

                                                                        22
We only need discriminative Grouplets
  Playing         Other                     Playing         Other
saxophone      interactions                  violin      interactions




Large ν(Λ,I)    Small ν(Λ,I)              Large ν(Λ,I)    Small ν(Λ,I)




   On background



  Number of feature units: N.
           N is large (192,200)
                                              Shared by different
  Number of Grouplets: 2^N                       interactions
              very large space
                                                                         23
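To give a sense of how large this space is, the short calculation below counts the decimal digits of 2^N for the N quoted on the slide. It is only an illustration of the arithmetic; the variable names are arbitrary.

```python
import math

N = 192200  # number of feature units quoted on the slide

# 2**N is far too large to print, so count its decimal digits with logarithms.
digits = math.floor(N * math.log10(2)) + 1
print(f"2^{N} has {digits} decimal digits")
```

The result is a number with 57,858 digits, so enumerating every grouplet is out of the question.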
Obtaining discriminative grouplets for a class

Apriori Mining [Agrawal & Srikant, 1994]
• Obtain grouplets with large ν(Λ,I) on the class.
• Remove grouplets with large ν(Λ,I) from other classes.
• Selected 1-grouplets are combined into candidate 2-grouplets, and so on for larger grouplets.

Number of feature units: N. N is large (192,200).
Number of Grouplets: 2^N, a very large space.
Mine 1000~2000 grouplets, only need to evaluate (2~100) × N grouplets.

                                                                         24
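The mining step can be sketched as a level-wise Apriori search. Since ν(Λ,I) is a minimum over feature units, adding a unit can never raise the score, so any grouplet that scores poorly on the class can be pruned together with all of its supersets. The code below is a minimal sketch under stated assumptions, not the authors' procedure: "large ν(Λ,I) on the class" is taken to mean a mean score of at least `min_pos` over class images, "large on other classes" a mean score above `max_neg`, and per-image unit scores are assumed to be precomputed. The names `support`, `mine_grouplets`, `min_pos` and `max_neg` are hypothetical.

```python
from itertools import combinations

def support(grouplet, scores):
    """Mean over images of nu(Lambda, I), taken as the min over the grouplet's units."""
    return sum(min(row[u] for u in grouplet) for row in scores) / len(scores)

def mine_grouplets(pos, neg, min_pos, max_neg, max_size=3):
    """Level-wise (Apriori-style) search for discriminative grouplets.

    pos / neg: per-image lists of feature-unit scores for the class and for
    other classes. Thresholds and the mean-based support are assumptions.
    """
    n_units = len(pos[0])
    frequent = [frozenset([u]) for u in range(n_units)
                if support([u], pos) >= min_pos]
    selected, size = [], 1
    while frequent and size <= max_size:
        # Keep grouplets that score low on the other classes.
        selected += [g for g in frequent if support(g, neg) <= max_neg]
        # Join step: merge pairs of k-grouplets into (k+1)-grouplet candidates.
        known = set(frequent)
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == size + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in known for s in combinations(c, size))]
        frequent = sorted((c for c in candidates if support(c, pos) >= min_pos),
                          key=sorted)
        size += 1
    return selected

# Toy scores: rows are images, columns are feature units 0..2.
pos = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.0]]
neg = [[0.9, 0.0, 0.1], [0.0, 0.8, 0.0]]
mined = mine_grouplets(pos, neg, min_pos=0.5, max_neg=0.3)
print(mined)
```

In this toy example units 0 and 1 each fire on some images of the other classes, so neither is kept alone, but they never fire together there, so the 2-grouplet {0, 1} is selected.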
Using Grouplets for Classification

Discriminative grouplets: {Λ1, …, ΛN}

Image I  →  [ν(Λ1, I), …, ν(ΛN, I)]  →  SVM

                                                                         25
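The mapping from an image to its vector of matching scores can be sketched as follows. The sketch assumes per-image feature-unit scores are already available and that each mined grouplet is stored as a tuple of unit indices; `grouplet_features`, `mined` and `image_unit_scores` are hypothetical names. Training the SVM itself is not shown.

```python
def grouplet_features(unit_scores, grouplets):
    """Map one image to [nu(Lambda_1, I), ..., nu(Lambda_N, I)].

    unit_scores: nu(lambda_i, I) for every feature unit of this image.
    grouplets: mined grouplets, each a tuple of feature-unit indices (assumed layout).
    """
    return [min(unit_scores[u] for u in g) for g in grouplets]

mined = [(0, 1), (2,), (1, 3)]
image_unit_scores = [[0.6, 0.4, 0.0, 0.5],   # image 1
                     [0.0, 0.1, 0.3, 0.2]]   # image 2

# One row per image; these rows are what a classifier such as an SVM would consume.
X = [grouplet_features(row, mined) for row in image_unit_scores]
print(X)
```

Each row has one entry per mined grouplet, so with 1000~2000 mined grouplets per class the feature vector stays compact compared with the space of all grouplets.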
Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion


                                         26
People-Playing-Musical-Instruments (PPMI) Dataset




  PPMI+

   # Image:   (172)    (191)     (177)       (179)       (200)       (198)   (185)


  PPMI-

   # Image:   (164)    (148)     (133)       (149)       (188)       (169)   (167)




                      Original image        Normalized image
                                         (200 images each interaction)
                                                                                     27
Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset

Classification
  • Playing different instruments
    (e.g. playing French horn vs. playing violin)
  • Playing vs. Not playing
    (e.g. playing violin vs. not playing violin)
  For each interaction, 100 training and 100 testing images.

Detection
  [Figure: example image with detections labeled playing bassoon, playing saxophone, and playing saxophone]

                                                                           28
Classification: Playing Different Instruments
             • 7-class classification on PPMI+ images

             Method            Classification accuracy
             BoW               37.7%
             Constellation     39.0%
             DPM               54.9%
             SPM               59.9%
             Grouplet+SVM      65.7%

             [Plot: No. of mined Grouplets (axis 0 to 1200) vs. Grouplet size (1 to 6)]

             SPM: [Lazebnik et al, 2006]
             DPM: [Felzenszwalb et al, 2008]
             Constellation: [Fergus et al, 2003]
                            [Niebles & Fei-Fei, 2007]

                                                                                                                                             29
Classifying Playing vs. Not playing
           • Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

           [Bar chart: accuracy of BoW, DPM, SPM and Grouplet+SVM for each instrument
            (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin).
            Rows below the chart: PPMI+ images (average) and PPMI- images (average).]

                                                                                                          31
Detecting people playing musical instruments

                                      Procedure:
                                      • Face detection with a low threshold;
                                      • Crop and normalize image regions;
                                      • 8-class classification
                                        - 7 classes of playing instruments;
                                        - Another class of not playing any instrument.




                                 
Playing saxophone   Not playing             Not playing

                                                                                  32
Detecting people playing musical instruments

     Area under the precision-recall curve:
        • Our method: 45.7%;
        • Spatial pyramid: 37.3%.



                                                              Playing
              Playing     Playing                           saxophone
              bassoon   saxophone
  Playing
French horn                                     Playing
                                              French horn




                                                                   33
Detecting people playing musical instruments

       Area under the precision-recall curve:
          • Our method: 45.7%;
          • Spatial pyramid: 37.3%.




    Playing
  French horn




False detection                  Missed detection
                                                    34
Examples of Mined Grouplets

Playing                Playing
bassoon:               saxophone:




 Playing                Playing
 violin:                guitar:




                                         35
Conclusion
• Holistic image-based classification               • Detailed understanding and reasoning




                               Vs.




                          Playing     Playing
                          bassoon   saxophone
                Playing
              saxophone




                                                        Pose estimation & object detection


 [B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]

 The Next Talk:
 [B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]


                                                                                                36
Thanks to
Juan Carlos Niebles, Jia Deng, Jia Li, Hao
Su, Silvio Savarese, and anonymous reviewers.

                 And You



                                            37

ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
 
Quoc le tera-scale deep learning
Quoc le   tera-scale deep learningQuoc le   tera-scale deep learning
Quoc le tera-scale deep learning
 
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
 
Lecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learningLecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learning
 
Lecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognitionLecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognition
 
P05 deep boltzmann machines cvpr2012 deep learning methods for vision
P05 deep boltzmann machines cvpr2012 deep learning methods for visionP05 deep boltzmann machines cvpr2012 deep learning methods for vision
P05 deep boltzmann machines cvpr2012 deep learning methods for vision
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

CVPR2010: grouplet: a structured image representation for recognizing human and object interactions

  • 1. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University, {bangpeng,feifeili}@cs.stanford.edu
  • 2. Human-Object Interaction: playing saxophone vs. not playing saxophone. [Figure labels: Human; Saxophone]
  • 3. Human-Object Interaction: robots interact with objects; automatic sports commentary (“Kobe is dunking the ball.”); medical care.
  • 4. Background: Human-Object Interaction. To be done. • Schneiderman & Kanade, 2000 • Viola & Jones, 2001 • Lowe, 1999 • Huang et al, 2007 • Belongie et al, 2002 • Papageorgiou & Poggio, 2000 • Fergus et al, 2003 • Wu & Nevatia, 2005 • Dalal & Triggs, 2005 • Fei-Fei et al, 2004 • Berg & Malik, 2005 • Mikolajczyk et al, 2005 • Felzenszwalb et al, 2005 • Leibe et al, 2005 • Grauman & Darrell, 2005 • Bourdev & Malik, 2009 • Sivic et al, 2005 • Felzenszwalb & Huttenlocher, 2005 • Lazebnik et al, 2006 • Ren et al, 2005 • Zhang et al, 2006 • Gupta et al, 2009 • Ramanan, 2006 • Savarese et al, 2007 • Yao & Fei-Fei, 2010a • Ferrari et al, 2008 • Lampert et al, 2008 • Yao & Fei-Fei, 2010b • Yang & Mori, 2008 • Desai et al, 2009 • Andriluka et al, 2009 • Gehler & Nowozin, 2009 • Eichner & Ferrari, 2009. Context: • Murphy et al, 2003 • Rabinovich et al, 2007 • Hoiem et al, 2006 • Heitz & Koller, 2008 • Shotton et al, 2006 • Divvala et al, 2009
  • 6. Outline • Intuition of Grouplet Representation • Grouplet Feature Representation • Using Grouplet for Recognition • Dataset & Experiments • Conclusion
  • 8. Recognizing Human-Object Interaction is Challenging. Reference image: playing saxophone. Variations: different pose (or viewpoint); different lighting; different background; different instrument, similar pose; same object (saxophone), different interactions.
  • 9. Grouplet: our intuition. Bag-of-words; Spatial pyramid; Part-based; Grouplet Representation. [Figure: histogram plot; axis ticks omitted]
  • 10. Grouplet: our intuition. Grouplet Representation: • Part-based configuration • Co-occurrence • Discriminative • Dense. Captures the subtle differences in human-object interactions.
  • 12. Grouplet representation (e.g. 2-Grouplet). Notations: • I: Image. • P: Reference point in the image. • Λ: Grouplet, e.g. Λ = {λ1, λ2}. • λi: Feature unit, λi: {Ai, xi, σi} - Ai: Visual codeword; - xi: Image location; - σi: Variance of spatial distribution. [Figure labels: Visual codewords; Gaussian distribution]
  • 13. Grouplet representation (e.g. 2-Grouplet). Notations: • I: Image. • P: Reference point in the image. • Λ: Grouplet, e.g. Λ = {λ1, λ2}. • λi: Feature unit, λi: {Ai, xi, σi} - Ai: Visual codeword; - xi: Image location; - σi: Variance of spatial distribution. • ν(Λ,I): Matching score of Λ and I. • ν(λi,I): Matching score of λi and I. Equation: $\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$, where the left-hand side is the matching score between Λ and I and $\nu(\lambda_i, I)$ is the matching score between λi and I. [Figure labels: Visual codewords; Gaussian distribution]
  • 14. Grouplet representation (e.g. 2-Grouplet). Notations: • I: Image. • P: Reference point in the image. • Λ: Grouplet, e.g. Λ = {λ1, λ2}. • λi: Feature unit, λi: {Ai, xi, σi} - Ai: Visual codeword; - xi: Image location; - σi: Variance of spatial distribution. • ν(Λ,I): Matching score of Λ and I. • ν(λi,I): Matching score of λi and I. • For an image patch: - a′: Its visual appearance; - x′: Its image location. • Ω(x): Image neighborhood of x. Equation: $\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i)$, where $p(A_i \mid a')$ is the codeword assignment score and $N(x' \mid x_i, \sigma_i)$ is the Gaussian density value. [Figure labels: Visual codewords; Gaussian distribution]
  • 15. Grouplet representation (e.g. 2-Grouplet). Notations: • I: Image. • P: Reference point in the image. • Λ: Grouplet, e.g. Λ = {λ1, λ2}. • λi: Feature unit, λi: {Ai, xi, σi} - Ai: Visual codeword; - xi: Image location; - σi: Variance of spatial distribution. • ν(Λ,I): Matching score of Λ and I. • ν(λi,I): Matching score of λi and I. • For an image patch: - a′: Its visual appearance; - x′: Its image location. • Ω(x): Image neighborhood of x. • Δ: A small shift of the location. Equation: $\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \max_{\Delta} \sum_{x' \in \Omega(x_i + \Delta)} p(A_i \mid a') \cdot N(x' \mid x_i + \Delta, \sigma_i)$, where $p(A_i \mid a')$ is the codeword assignment score and $N(x' \mid x_i + \Delta, \sigma_i)$ is the Gaussian density value. [Figure labels: Visual codewords; Gaussian distribution]
  • 16. Grouplet representation: • Part-based configuration • Co-occurrence • Discriminative. [Figure: 2-grouplet Λ = {λ1, λ2} with λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}, reference point P, image I; example images of playing saxophone and of other interactions, with matching scores 0.6, 0.4, 0.0, 0.1]
  • 17. Grouplet representation: • Part-based configuration • Co-occurrence • Discriminative • Dense. All possible combinations of feature units: densely sample image locations; all possible codewords; many possible spatial distributions. 1-grouplet, 2-grouplet, 3-grouplet, … [Figure: 2-grouplet Λ = {λ1, λ2} with λ1: {A1, x1, σ1} and λ2: {A2, x2, σ2}, reference point P, image I]
  • 19. A “Space” of Grouplets
  • 20. A “Space” of Grouplets. Playing violin vs. other interactions.
  • 21. A “Space” of Grouplets. Playing saxophone vs. other interactions; playing violin vs. other interactions.
  • 22. A “Space” of Grouplets. Playing saxophone vs. other interactions; playing violin vs. other interactions. Some grouplets lie on the background; some are shared by different interactions.
  • 23. We only need discriminative Grouplets. Playing saxophone (large ν(Λ,I)) vs. other interactions (small ν(Λ,I)); playing violin (large ν(Λ,I)) vs. other interactions (small ν(Λ,I)). Other grouplets lie on the background or are shared by different interactions. Number of feature units: N. N is large (192200). Number of Grouplets: 2^N, a very large space.
  • 24. Obtaining discriminative grouplets for a class. Apriori Mining [Agrawal & Srikant, 1994]: selected 1-grouplets → candidate 2-grouplets → … • Obtain grouplets with large ν(Λ,I) on the class. • Remove grouplets with large ν(Λ,I) from other classes. Mine 1000~2000 grouplets; only need to evaluate (2~100) × N grouplets. Number of feature units: N. N is large (192200). Number of Grouplets: 2^N, a very large space.
  • 25. Using Grouplets for Classification. Image I → discriminative grouplets {Λ1, …, ΛN} → feature vector [ν(Λ1, I), …, ν(ΛN, I)] → SVM
  • 27. People-Playing-Musical-Instruments (PPMI) Dataset. PPMI+, # images per instrument: 172, 191, 177, 179, 200, 198, 185. PPMI-, # images per instrument: 164, 148, 133, 149, 188, 169, 167. Original image; Normalized image (200 images each interaction).
  • 28. Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset. Classification: • Playing different instruments (playing bassoon, playing saxophone, playing French horn, playing violin, …) • Playing vs. not playing (e.g. playing violin vs. not playing violin). Detection (e.g. playing saxophone). For each interaction, 100 training and 100 testing images.
  • 29. Classification: Playing Different Instruments • 7-class classification on PPMI+ images. [Bar chart: classification accuracy of BoW, DPM, SPM, Constellation and Grouplet+SVM; values shown: 65.7%, 59.9%, 54.9%, 39.0%, 37.7%] [Plot: number of mined Grouplets (0 to 1200) vs. Grouplet size (1 to 6)] SPM: [Lazebnik et al, 2006]; DPM: [Felzenszwalb et al, 2008]; Constellation: [Fergus et al, 2003], [Niebles & Fei-Fei, 2007]
  • 30. Classifying Playing vs. Not playing • Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument. [Bar chart: accuracy of BoW, DPM, SPM and Grouplet+SVM for Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin and Average; example PPMI+ and PPMI- images]
  • 31. Classifying Playing vs. Not playing • Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument. [Same bar chart with Guitar highlighted; example PPMI+ and PPMI- images]
  • 32. Detecting people playing musical instruments. Procedure: • Face detection with a low threshold; • Crop and normalize image regions; • 8-class classification - 7 classes of playing instruments; - Another class of not playing any instrument. [Example outputs: Playing saxophone; No playing; No playing]
  • 33. Detecting people playing musical instruments. Area under the precision-recall curve: • Our method: 45.7%; • Spatial pyramid: 37.3%. [Example detections: Playing saxophone; Playing bassoon; Playing saxophone; Playing French horn; Playing French horn]
  • 34. Detecting people playing musical instruments. Area under the precision-recall curve: • Our method: 45.7%; • Spatial pyramid: 37.3%. [Examples: Playing French horn; False detection; Missed detection]
  • 35. Examples of Mined Grouplets: playing bassoon; playing saxophone; playing violin; playing guitar.
  • 36. Conclusion • Holistic image-based classification (e.g. playing bassoon, playing saxophone) [B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.] vs. • Detailed understanding and reasoning (playing saxophone: pose estimation & object detection); the next talk [B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]
  • 37. Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers. And you.
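
The matching score defined on the "Grouplet representation" slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Patch` and `FeatureUnit` classes, the circular neighborhood `radius`, the explicit list of `shifts`, and the isotropic 2D Gaussian are all hypothetical choices made here, and patch locations are assumed to be expressed relative to the reference point P.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Patch:
    location: Point                   # x': patch location (assumed relative to P)
    codeword_probs: Sequence[float]   # p(A | a') for every codeword A


@dataclass
class FeatureUnit:
    codeword: int     # A_i: index of the visual codeword
    center: Point     # x_i: mean of the spatial distribution
    variance: float   # sigma_i: isotropic variance (an assumption of this sketch)


def gaussian_density(x: Point, mean: Point, variance: float) -> float:
    """Density of an isotropic 2D Gaussian N(x | mean, variance)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * variance)) / (2.0 * math.pi * variance)


def unit_score(unit: FeatureUnit, patches: Iterable[Patch],
               radius: float = 2.0,
               shifts: Sequence[Point] = ((0.0, 0.0),)) -> float:
    """nu(lambda_i, I): max over shifts of the summed patch scores near the center."""
    patches = list(patches)
    best = 0.0
    for sx, sy in shifts:
        center = (unit.center[0] + sx, unit.center[1] + sy)
        total = 0.0
        for patch in patches:
            # Omega(x_i + shift): patches within `radius` of the shifted center.
            if math.dist(patch.location, center) <= radius:
                total += (patch.codeword_probs[unit.codeword]
                          * gaussian_density(patch.location, center, unit.variance))
        best = max(best, total)
    return best


def grouplet_score(grouplet: Sequence[FeatureUnit], patches: Iterable[Patch],
                   **kwargs) -> float:
    """nu(Lambda, I): minimum over the feature units, enforcing co-occurrence."""
    patches = list(patches)
    return min(unit_score(unit, patches, **kwargs) for unit in grouplet)


def feature_vector(grouplets: Sequence[Sequence[FeatureUnit]],
                   patches: Iterable[Patch], **kwargs) -> List[float]:
    """[nu(Lambda_1, I), ..., nu(Lambda_N, I)], the vector passed to a classifier."""
    patches = list(patches)
    return [grouplet_score(g, patches, **kwargs) for g in grouplets]
```

Taking the minimum means a grouplet scores high only when every one of its feature units is matched, which is how the co-occurrence requirement appears in the score. The vector returned by `feature_vector` corresponds to the SVM input on the "Using Grouplets for Classification" slide.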

Editor's Notes

  1. Some people may ask about the difference between our work and Hough voting approaches.
  2. Given an image and a reference point, the grouplet feature considers the co-occurrence of a set of highly related image patches. Those patches are encoded by feature units, which model specific appearance and location information. The appearance information is represented by a visual codeword, while the location information is represented by a 2D Gaussian distribution. Here we show a 2-grouplet, which contains two feature units.
  3. Given an image of a human-object interaction, with the center of the human face as the reference point, we need to calculate the matching score between the image and the grouplet, which measures the likelihood of observing the grouplet in the image. Because the grouplet requires the co-occurrence of all its feature units, the matching score between the image and the grouplet is the minimum of the scores between the image and each feature unit in the grouplet.
  4. Given one feature unit, we consider the image patches in the neighborhood of the center of the Gaussian distribution, and measure the probability of assigning each patch to the codeword of the feature unit.
  5. Furthermore, to make the feature unit more resistant to small spatial variations, we allow the Gaussian distribution to shift within a small neighborhood, which results in a set of matching scores; their maximum is taken as the matching score of the feature unit and the image.
  6. For the first step, we apply an Apriori mining approach to make it tractable. In Apriori mining, we start from 1-grouplets, each of which consists of one feature unit. Then we generate candidate 2-grouplets based only on the selected 1-grouplets. None of the other 2-grouplets need to be considered. This is because, in our definition, once a feature unit has a small matching score with an image, all the grouplets that contain this feature unit will also have a small matching score. The Apriori mining approach continues with the candidate 2-grouplets until all the grouplets that have large matching scores with images of this class are obtained. With Apriori mining, when we want to obtain thousands of grouplets, the total number of grouplets that need to be evaluated is linear in the number of feature units, instead of exponential as in the brute-force approach. Furthermore, in our method, we assume an initial spatial distribution for each feature unit, and have a probabilistic model to update those distributions.
  7. However, there is not much work on activities involving human-object interactions, nor a suitable dataset for this problem. We therefore collected a new dataset of people playing musical instruments, called PPMI. Currently there are seven musical instruments. For each instrument, there are not only PPMI+ images of people playing the instrument, but also PPMI- images of a human holding the instrument without playing it. This dataset therefore offers us an opportunity to understand how humans interact with the object, a property that cannot be captured by the sports dataset of Gupta et al, 2009, which might be the only existing one that involves human and object interactions. Furthermore, we also normalize each original image by cropping the upper-body part of the person. Both original and normalized images are available at our website.
  8. We also test the performance of the grouplet feature for detecting people playing musical instruments in the original images. This is a very challenging problem because we also want the detector to tell which musical instrument the human is playing. Instead of using the traditional scanning window method, we first run a face detector; based on the face detection results, we train an eight-class SVM classifier. We have eight classes because there are seven different musical instruments, plus another class for cases where the face detection result is a false alarm or the human is not playing any instrument. The preliminary results show that our method outperforms the spatial pyramid approach on this very challenging problem.
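
The Apriori mining step described in note 6 can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the per-image matching scores of every feature unit are already computed, uses the mean grouplet score over a set of images as the selection criterion, and the names `min_support`, `max_other` and `max_size` are hypothetical thresholds introduced here. It leaves out the probabilistic update of the spatial distributions mentioned in the note.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence

Grouplet = FrozenSet[int]          # a grouplet as a set of feature-unit indices
UnitScores = Sequence[float]       # nu(lambda_u, I) for every unit u of one image


def grouplet_score(grouplet: Grouplet, unit_scores: UnitScores) -> float:
    """nu(Lambda, I): minimum of the unit scores, so adding units never raises it."""
    return min(unit_scores[u] for u in grouplet)


def support(grouplet: Grouplet, images: Sequence[UnitScores]) -> float:
    """Mean matching score of the grouplet over a set of images."""
    return sum(grouplet_score(grouplet, img) for img in images) / len(images)


def mine_grouplets(class_images: Sequence[UnitScores],
                   other_images: Sequence[UnitScores],
                   min_support: float,
                   max_other: float,
                   max_size: int = 3) -> List[Grouplet]:
    n_units = len(class_images[0])
    # Step 1: select 1-grouplets with a large score on the class.
    frequent = [frozenset([u]) for u in range(n_units)
                if support(frozenset([u]), class_images) >= min_support]
    mined: List[Grouplet] = []
    size = 1
    while frequent and size <= max_size:
        mined.extend(frequent)
        frequent_set = set(frequent)
        # Step 2: build (size+1)-candidates only from selected grouplets, keeping
        # a candidate only if all of its size-`size` subsets were selected.
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size + 1 and all(
                    frozenset(s) in frequent_set for s in combinations(union, size)):
                candidates.add(union)
        frequent = [c for c in sorted(candidates, key=sorted)
                    if support(c, class_images) >= min_support]
        size += 1
    # Step 3: drop grouplets that also score high on the other classes.
    return [g for g in mined if support(g, other_images) <= max_other]
```

Because the grouplet score is a minimum over its units, a grouplet can never score higher than any of its sub-grouplets, which is what makes the subset-based pruning in step 2 safe.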