6: Location and context
What makes a cow a cow?
How do you know?
  Google knows, because other people know.
  We think we know: “because it has four legs”.
  But the fact of the matter:
    not all cows show four legs
    nor are they brown …
    not all…
What is the object in the middle?




No segmentation …
Not even the pixel values of the object …
Where is evidence for an object?




                         Uijlings IJCV 2011
Where is evidence for an object?




                         Uijlings IJCV 2011
What is the visual extent of an object?




Uijlings IJCV 2012
Where: exhaustive search

Look everywhere for the object window
Imposes computational constraints on:
   the very many locations and windows
   (hence a coarse grid / fixed aspect ratio)
   the evaluation cost per location
   (hence weak features/classifiers)
Impressive, but slow.


Viola IJCV 2004, Dalal CVPR 2005,
Felzenszwalb PAMI 2010, Vedaldi ICCV 2009
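
To get a feel for "very many locations and windows", the sketch below counts the windows an exhaustive search would visit. The image size, window sizes, aspect ratios and strides are illustrative assumptions, not values taken from the cited papers.

```python
def count_windows(img_w, img_h, sizes, aspect_ratios, stride):
    """Count sliding-window positions over all (size, aspect ratio) pairs."""
    total = 0
    for size in sizes:
        for ratio in aspect_ratios:
            win_w = int(round(size * ratio))  # ratio = width / height
            win_h = size
            if win_w > img_w or win_h > img_h:
                continue  # this window does not fit in the image
            nx = (img_w - win_w) // stride + 1  # horizontal positions
            ny = (img_h - win_h) // stride + 1  # vertical positions
            total += nx * ny
    return total


# Hypothetical 500x375 image: dense search versus coarse grid with one aspect ratio.
dense = count_windows(500, 375, sizes=range(32, 376, 8),
                      aspect_ratios=[0.5, 1.0, 2.0], stride=1)
coarse = count_windows(500, 375, sizes=range(32, 376, 8),
                       aspect_ratios=[1.0], stride=8)
print(dense, coarse)
```

The dense count is orders of magnitude larger than the coarse one, which is why exhaustive methods restrict the grid and the aspect ratios, or keep the per-window classifier cheap.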
Where: the need for a hierarchy

  An image is intrinsically hierarchical.




                                            Gu CVPR 2009
Selective search

Windows formed by hierarchical grouping.




Adjacent grouping on color/texture/shape cues.
              Felzenszwalb 2004    Van de Sande ICCV 2011
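
The grouping idea can be sketched as a greedy loop: repeatedly merge the two most similar adjacent regions and keep the bounding box of every region ever formed. This is a simplified illustration, not the implementation of the cited papers: the initial regions are hand-made (in practice they would come from an over-segmentation), and the similarity uses only a colour histogram and a size term, both assumptions of this sketch.

```python
def merge_boxes(a, b):
    """Tight box around two (x0, y0, x1, y1) boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r, s, image_size):
    colour = sum(min(p, q) for p, q in zip(r["hist"], s["hist"]))  # histogram intersection
    size = 1.0 - (r["size"] + s["size"]) / image_size  # favour merging small regions first
    return colour + size


def hierarchical_grouping(regions, neighbours, image_size):
    """regions: {id: {"box", "size", "hist"}}; neighbours: set of frozenset({i, j})."""
    regions = dict(regions)
    neighbours = set(neighbours)
    boxes = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while neighbours:
        # Pick the most similar adjacent pair (ties broken by region ids).
        i, j = max(
            (tuple(sorted(p)) for p in neighbours),
            key=lambda p: (similarity(regions[p[0]], regions[p[1]], image_size), p),
        )
        a, b = regions.pop(i), regions.pop(j)
        size = a["size"] + b["size"]
        merged = {
            "box": merge_boxes(a["box"], b["box"]),
            "size": size,
            "hist": [(a["size"] * p + b["size"] * q) / size
                     for p, q in zip(a["hist"], b["hist"])],
        }
        # Neighbours of i or j become neighbours of the merged region.
        touching = {k for p in neighbours if p & {i, j} for k in p} - {i, j}
        neighbours = {p for p in neighbours if not p & {i, j}}
        neighbours |= {frozenset({next_id, k}) for k in touching}
        regions[next_id] = merged
        boxes.append(merged["box"])
        next_id += 1
    return boxes


# Toy example: three regions side by side in a hypothetical 8x4 image.
toy_regions = {
    0: {"box": (0, 0, 2, 4), "size": 8, "hist": [1.0, 0.0]},
    1: {"box": (2, 0, 4, 4), "size": 8, "hist": [0.9, 0.1]},
    2: {"box": (4, 0, 8, 4), "size": 16, "hist": [0.0, 1.0]},
}
toy_neighbours = {frozenset({0, 1}), frozenset({1, 2})}
print(hierarchical_grouping(toy_regions, toy_neighbours, image_size=32))
```

Every level of the hierarchy contributes candidate windows, so small and large objects are both covered without scanning a grid.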
Selective search example
Selective search example




Average best overlap ~88%

… looks like this




                    High recall
                                  cat
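
The slide does not define "average best overlap". The sketch below assumes the usual reading: overlap is intersection over union of two boxes, and for each ground-truth box the best overlap with any proposed window is taken, then averaged. The boxes in the example are made up.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not intersect
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_best_overlap(ground_truth, proposals):
    """Mean over ground-truth boxes of the best overlap with any proposal."""
    best = [max(overlap(g, p) for p in proposals) for g in ground_truth]
    return sum(best) / len(best)


# Hypothetical boxes: one object is hit exactly, the other only half.
gt = [(0, 0, 2, 2), (10, 10, 12, 12)]
windows = [(0, 0, 2, 2), (10, 10, 12, 14)]
print(average_best_overlap(gt, windows))
```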
Pairs of concepts




           Uijlings ICCV demo 2012
6 Conclusion

Selective search gives good localization.

Localization needed to understand pairs of concepts.
7 Data and metadata




         http://bit.ly/visualsearchengines
How many concepts?




   Li Fei Fei slide. Biederman, Psychological Rev. 1987
How many examples?




Once you have more than 100 to 1000 examples, success is there.
Amateur labeling




LabelMe 290,000 object annotations
                                     Russell IJCV 2008
Amateur labeling
Amateur labeling
Tag relevance by social annotation

Consistency in tagging between users on similar images.




                                         Xirong Li, TMM 2009
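
One way to turn "consistency in tagging between users on similar images" into a score is neighbour voting: count how often a tag appears among the visually nearest images and subtract the count expected by chance. The sketch below follows that idea only loosely. The two-dimensional features, the Euclidean distance and the form of the prior are assumptions for illustration, and the collection is invented.

```python
import math


def tag_relevance(query_feature, tag, collection, k):
    """Votes for `tag` among the k nearest images, minus the chance-level count."""
    neighbours = sorted(
        collection, key=lambda item: math.dist(query_feature, item[0])
    )[:k]
    votes = sum(1 for _, tags in neighbours if tag in tags)
    tag_frequency = sum(1 for _, tags in collection if tag in tags)
    prior = k * tag_frequency / len(collection)  # expected votes from random neighbours
    return votes - prior


# Hypothetical collection of (feature, user tags) pairs.
collection = [
    ((0, 0), {"snow"}),
    ((0, 1), {"snow", "winter"}),
    ((1, 0), {"snow"}),
    ((10, 10), {"beach"}),
    ((10, 11), {"beach", "snow"}),
    ((11, 10), {"beach"}),
]
print(tag_relevance((0.5, 0.5), "snow", collection, k=3))
print(tag_relevance((0.5, 0.5), "beach", collection, k=3))
```

A tag that many users apply to similar-looking images scores high; a tag that is frequent everywhere gets discounted by the prior.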
Tag relevance by social annotation




    Pretty good for snow, not so good for rainbow.
Social negative bootstrapping

Negative images are as important for learning as positive images.
Not just random negative images, but close ones.
• We want to learn positive examples from an expert, and obtain as many negative samples as we like for free from the web.
• We iteratively aim for the hardest negatives.


                           Xirong Li ACM MM 2009
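
The iteration over hard negatives can be sketched as follows. This is a toy version under stated assumptions: the classifier is a nearest-centroid scorer, the selected negatives are simply accumulated, and all points are invented. It is not the method of the cited paper, only the loop structure the bullets describe.

```python
import random


def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(positives, negatives):
    """Return a scoring function; a higher score means more positive-like."""
    cp, cn = centroid(positives), centroid(negatives)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)


def negative_bootstrap(positives, candidates, rounds=3, per_round=2, seed=0):
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    negatives, pool = pool[:per_round], pool[per_round:]  # start from random negatives
    for _ in range(rounds):
        if not pool:
            break
        score = train(positives, negatives)
        pool.sort(key=score, reverse=True)  # most positive-looking candidates first
        negatives += pool[:per_round]       # keep the hardest negatives
        pool = pool[per_round:]
    return train(positives, negatives), negatives


# Hypothetical data: expert positives near the origin, web negatives far and close.
positives = [(0, 0), (0, 1), (1, 0)]
candidates = [(10, 10), (11, 10), (10, 11), (12, 12), (2, 2), (2, 3)]
classifier, chosen = negative_bootstrap(positives, candidates, rounds=1)
print(chosen)
```

After one round the close negatives (2, 2) and (2, 3) are in the training set whatever the random start was, which is the point of selecting by classifier score instead of at random.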
Social negative bootstrapping




                     Xirong Li ICMR 2011
Knowledge ontology ImageNet
acknowledgement
    WordNet friends

Christiane Fellbaum
Dan Osherson (Princeton)
Kai Li (Princeton)
Alex Berg (Columbia)
Jia Deng (Princeton/Stanford)
Hao Su (Stanford)
PASCAL VOC

The PASCAL Visual Object Classes (VOC).

500,000 images downloaded from Flickr.
Queries like “car”, “vehicle”, “street”, “downtown”.
10,000 objects, 25,000 labels.

Mark Everingham, Luc Van Gool, Chris Williams, John Winn,
Andrew Zisserman
7. Conclusion

Data is king.

The data are beginning to reflect human cognitive
capacity [at a basic level].

Harvesting social data requires advanced computer
vision control.
8 Performance
PASCAL 2010
Aeroplane   Bicycle   Bird   Boat    Bottle




  Bus         Car      Cat   Chair   Cow
True Positives - Person
             UOCTTI_LSVM_MDPM




          NLPR_HOGLBP_MC_LCEGCHLC




       NUS_HOGLBP_CTX_CLS_RESCORE_V2
False Positives - Person
              UOCTTI_LSVM_MDPM




          NLPR_HOGLBP_MC_LCEGCHLC




        NUS_HOGLBP_CTX_CLS_RESCORE_V2
Non-birds & non-boats

Non-bird images:
Highest ranked




Non-boat images:
Highest ranked



Water texture and scene composition?
Non-chair
True Positives - Motorbike
             MITUCLA_HIERARCHY




         NLPR_HOGLBP_MC_LCEGCHLC




       NUS_HOGLBP_CTX_CLS_RESCORE_V2
False Positives - Motorbike
             MITUCLA_HIERARCHY




         NLPR_HOGLBP_MC_LCEGCHLC




       NUS_HOGLBP_CTX_CLS_RESCORE_V2
Object localization 2008-2010

[Figure: bar chart of max AP (%), axis from 0 to 60, per category for 2008, 2009 and 2010. Categories: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor.]

         Results on 2008 data improve for 2010 methods for all
         categories, by over 100% for some categories.
TRECvid evaluation standard
Concept detection

                    Aircraft

                    Beach

                    Mountain

                    People marching

                    Police/Security

                    Flower
Measuring performance

Results: a ranked list (1., 2., 3., 4., 5.).
Set of relevant items; set of retrieved items; set of relevant retrieved items.
• Precision and recall have an inverse relationship.
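
In terms of the sets on this slide, precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The sketch below computes both at a cut-off in a ranked list, plus non-interpolated average precision. Benchmarks use their own variants of average precision, so treat this as one common definition; the ranked list is invented.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall after retrieving the top k items."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical result list of five items; "f" is relevant but never retrieved.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
for k in (1, 3, 5):
    print(k, precision_recall(ranked, relevant, k))
print(average_precision(ranked, relevant))
```

Going down the list, recall can only grow while precision tends to drop, which is the inverse relationship the slide refers to.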
UvA-MediaMill@TRECVID




                 • other systems




                 Snoek et al, TRECVID 04-10
Performance doubled in just 3 years

   • 36 concept detectors

Even when using training data of different origin, great progress.
But the number of concepts is still limited.


                            Snoek & Smeulders, IEEE Computer 2010
8. Conclusion

Impressive results, improving quickly every year.

Very valuable competition.

The highest-ranked non-class images start to make sense!
9 Speed
SURF based on integral images

Integral images were introduced by Viola & Jones in the context of
face detection: sliding windows are evaluated on an integral image
that is accumulated left to right and top to bottom.
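
A minimal sketch of an integral image and the four-lookup box sum it enables. The one-pixel zero border and the half-open box convention are choices of this sketch; the 3x3 image is a toy example.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum, left to right
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum  # add everything above
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3))  # whole image
print(box_sum(ii, 1, 1, 3, 3))  # bottom-right 2x2 block
```

The cost of `box_sum` does not depend on the size of the box, which is what makes large sliding windows and large box filters cheap.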
SURF principle

Approximate Gaussian derivatives with box filters: Lxx, Lyy, Lxy.

LREC 2004, 26 May 2004, Lisbon
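
As an illustration, the sketch below approximates the second derivative in y with three stacked boxes weighted +1, -2, +1, each summed with four lookups in an integral image. The 9x9 layout (lobes 5 pixels wide and 3 pixels high) is an assumption of this sketch and the filter is left unnormalised; check the SURF paper or code for the exact filters.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def dyy_box(ii, cx, cy):
    """Box-filter approximation of Lyy centred on pixel (cx, cy)."""
    x0, x1 = cx - 2, cx + 3                    # lobes are 5 pixels wide
    top = box_sum(ii, x0, cy - 4, x1, cy - 1)  # rows cy-4 .. cy-2
    mid = box_sum(ii, x0, cy - 1, x1, cy + 2)  # rows cy-1 .. cy+1
    bot = box_sum(ii, x0, cy + 2, x1, cy + 5)  # rows cy+2 .. cy+4
    return top - 2 * mid + bot


# Toy 9x9 images: a flat image, a vertical ramp, and a bright horizontal line.
flat = [[7] * 9 for _ in range(9)]
ramp = [[y] * 9 for y in range(9)]
line = [[1 if y == 4 else 0 for _ in range(9)] for y in range(9)]
for name, img in (("flat", flat), ("ramp", ramp), ("line", line)):
    print(name, dyy_box(integral_image(img), 4, 4))
```

The flat image and the linear ramp give zero response, as a second derivative should; the bright line gives a strong negative response.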
SURF speed




[Figure, axis: scale]
Computation time: 6 times faster than DoG (~100 msec).
Independent of filter scale.
Dense descriptor extraction




  Pixel-wise Responses            Final Descriptor



             Factor 16 speed improvement,
             Another factor 2 by the use of matrix libs.
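
The slide gives no detail on where the speed-up comes from. The sketch below only illustrates the general idea suggested by the two panel titles: compute pixel-wise responses once, sum them into cells once, and let neighbouring descriptors share those cell sums. The scalar responses, the cell size and the 2x2 layout are assumptions for illustration.

```python
def cell_sums(resp, cell):
    """Sum pixel-wise responses over non-overlapping cell x cell blocks."""
    rows, cols = len(resp) // cell, len(resp[0]) // cell
    return [[sum(resp[r * cell + dy][c * cell + dx]
                 for dy in range(cell) for dx in range(cell))
             for c in range(cols)]
            for r in range(rows)]


def dense_descriptors(resp, cell):
    """One 2x2-cell descriptor per cell position; overlapping descriptors share cells."""
    cells = cell_sums(resp, cell)
    return {(r, c): [cells[r][c], cells[r][c + 1],
                     cells[r + 1][c], cells[r + 1][c + 1]]
            for r in range(len(cells) - 1)
            for c in range(len(cells[0]) - 1)}


# Toy 4x6 response map with 2x2-pixel cells.
resp = [[y * 6 + x for x in range(6)] for y in range(4)]
for position, descriptor in dense_descriptors(resp, cell=2).items():
    print(position, descriptor)
```

Each cell sum is computed once but used by up to four descriptors, so the work per descriptor shrinks as the sampling gets denser.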
Projection: Random Forest

Binary decision trees.

                        Moosmann et al. 2008
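
A sketch of projection with binary decision trees: each descriptor walks down every tree, and the leaf it ends in acts as its visual word. For brevity the trees here use random dimensions and thresholds and are not trained with class labels, which is an assumption of this sketch and differs from the cited work; the descriptors are random numbers.

```python
import random


def build_tree(depth, n_dims, rng):
    """Complete binary tree; each internal node tests one dimension against a threshold."""
    if depth == 0:
        return None  # leaf
    return {"dim": rng.randrange(n_dims),
            "thr": rng.random(),
            "left": build_tree(depth - 1, n_dims, rng),
            "right": build_tree(depth - 1, n_dims, rng)}


def leaf_index(tree, x):
    """Walk down the tree; the sequence of left/right turns encodes the leaf number."""
    index, node = 0, tree
    while node is not None:
        go_right = x[node["dim"]] >= node["thr"]
        index = 2 * index + int(go_right)
        node = node["right"] if go_right else node["left"]
    return index


def project(descriptors, forest, depth):
    """Bag-of-words histogram with one bin per leaf of every tree."""
    leaves = 2 ** depth
    hist = [0] * (leaves * len(forest))
    for x in descriptors:
        for t, tree in enumerate(forest):
            hist[t * leaves + leaf_index(tree, x)] += 1
    return hist


rng = random.Random(0)
forest = [build_tree(3, 8, rng) for _ in range(4)]              # 4 trees, 8 leaves each
descriptors = [[rng.random() for _ in range(8)] for _ in range(50)]
print(project(descriptors, forest, depth=3))
```

Assigning a descriptor costs one comparison per tree level, instead of a distance computation to every codebook word.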
Real-time bag of words
   Stage                   Method                                   Time (ms)
   Descriptor extraction   D-SURF 2x2                               15
   Projection              Pre-projection: <empty>;
                           actual projection: Random Forest         10
   Classification          SVM kernel: RBF                          13

   MAP: 0.370
   Total computation time is 38 milliseconds per image


26 frames per second on a normal PC, for any 20 concepts.
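
The frame rate follows from the stage times on this slide; the arithmetic is spelled out below.

```python
# Per-image stage times in milliseconds, as given on the slide.
stage_ms = {"descriptor extraction": 15, "projection": 10, "classification": 13}

total_ms = sum(stage_ms.values())
frames_per_second = 1000 / total_ms
print(total_ms, round(frames_per_second, 1))  # 38 26.3
```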
9. Conclusion

SURF is scale and rotation invariant.
Fast due to the use of integral images.
Download: http://www.vision.ee.ethz.ch/~surf/
D-SURF extraction is 6x faster than Dense-SIFT.
Projection using Random Forest is 50x faster than NN.
Internet Video Search: the beginning




[Diagram: video; measuring features; concept detection; lexicon learning; telling stories; browsing video]

Lecture 03 internet video search
