Features and Learning Methods for
Large-scale Image Annotation and Categorization




                Hideki Nakayama
              The University of Tokyo
         Department of Creative Informatics

                     2013/1/15
My research interest
 Generic image (object) recognition
   Whole-image level recognition
   Weakly supervised training samples




    Image annotation:
    assigning multiple words to
    a single whole image
    (without region correspondence)
The era of big data
 We can use gigantic weakly-labeled web data now!


                                    Tags:
                                    Nikon D200 DSLR Nikkor 60mmf28dmicro Nature
                                    Landscape
                                    Lake Idaho Ice Sunset Sun Mountain
                                    Sky Frozen AnAwesomeShot
                                    ImpressedBeauty isawyoufirst
                                    ABigFave Ljomi ljspot4 ColorPhotoAward

           http://www.flickr.com/



       Flickr: 6 billion images (2011)
       Facebook: 3 billion images every year
       YouTube: 8 years of video uploaded every day
More data helps recognition
 Simple k-NN using Flickr images & tags

 [Figure: a query image, its recognition results, and its nearest
  neighbors for three reference-set sizes]

   100K dataset: football soccer varsity girls boys travel party family
                 school high
   1.6M dataset: football soccer festival college futbol park people
                 cycling marchingband vacation
   12M dataset:  church stainedglass football bath city vacation travel
                 cathedral window glass
Growth of datasets
     Search engine: TinyImage, ARISTA
     Crowd sourcing: ImageNet, SUN397

 Dataset sizes, on a log scale from 10^2 to 10^9 images:

   Corel5K (2002): 5K         Caltech101 (2004): 9K     Pascal VOC: 20K
   Caltech256 (2007): 30K     SUN397 (2010): 100K       NUS-WIDE (2009): 200K
   ILSVRC (2010): 1.4M        ImageNet (2011): 14M      TinyImage (2008): 80M
   ARISTA (2008): 2B
Challenge: scaling to large training data
 Traditional methods are not scalable in training
      Bag-of-visual-words + kernel SVM (chi-square, etc.)
        training complexity O(N^2) ~ O(N^3), memory O(N^2)   ☹
                                         cf. [Yang et al., CVPR'09]

 Recent methods exploit linear methods
      With carefully designed image features, where the dot (linear)
       kernel approximates the similarity between instances
        training complexity O(N), memory O(1)   ☺
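One concrete way to obtain features whose dot product approximates a non-linear kernel is an explicit embedding. As an illustration (my choice, not the specific embedding used in the works cited here), random Fourier features approximate an RBF kernel with a plain linear kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, W, b):
    """Random Fourier features: z(x) . z(x') approximates the
    RBF kernel exp(-gamma * ||x - x'||^2)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

gamma, D = 0.5, 4000
X = rng.normal(size=(6, 8))
# omega ~ N(0, 2*gamma*I) is the Fourier measure of this RBF kernel
W = rng.normal(scale=np.sqrt(2 * gamma), size=(8, D))
b = rng.uniform(0.0, 2 * np.pi, size=D)

Z = rff(X, W, b)
K_approx = Z @ Z.T                                   # linear (dot) kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-gamma * sq)                        # true RBF kernel
```

Once the embedding is computed, any O(N)-time, O(1)-memory linear learner (e.g. SGD-trained linear SVM) can be applied to `Z` directly.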
Linear distance metric learning for image annotation

Example-based image annotation
 Standard approach for the image annotation problem
      1. Similar-image search: retrieve training images close to the query
      2. Label transfer from the neighbors: k-NN, kernel density
         estimation, etc.
         e.g. MBRM [Feng et al., 2004], JEC [Makadia et al., 2008],
         TagProp [Guillaumin et al., 2009]

 [Figure: a query image matched against image-and-label training samples
  such as (tiger, forest), (cow, street, city), (sea, wave, people),
  (plane, sky, jet), (grass, tiger, water), (people, tree, stone);
  output annotation: tiger, grass, water]

 Problem: how to define the similarity?
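A minimal sketch of this example-based label-transfer pipeline, with toy 2-D features and tags of my own invention (a real system would use the image features and learned metrics discussed below):

```python
import numpy as np
from collections import Counter

def annotate_knn(query, X_train, tags_train, k=2):
    """Transfer tags from the k visually nearest training images,
    ranked by how often each tag occurs among the neighbors."""
    dist = np.linalg.norm(X_train - query, axis=1)
    neighbors = np.argsort(dist)[:k]
    votes = Counter(t for i in neighbors for t in tags_train[i])
    return [tag for tag, _ in votes.most_common()]

# toy training set: 2-D "features" with weak (image-level) labels
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
tags_train = [["tiger", "grass"], ["tiger", "water"],
              ["city", "street"], ["city", "sea"]]

tags = annotate_knn(np.array([0.1, 0.0]), X_train, tags_train, k=2)
```

The query lands near the first two training images, so their shared tag ranks first.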
Fundamental problem: Semantic gap
 Visually similar ≠ semantically similar




                                            "I look like my dog" contest:
                                            http://www.hemmy.net/2006/06/25
                                            /i-look-like-my-dog-contest/




 Solution: Distance metric learning
Canonical Contextual Distance                          [Nakayama+, BMVC'10]

  Canonical Correlation Analysis (CCA)
     x : image features (e.g. BoVW), y : binary label vector
     CCA finds linear transformations
        s = A^T (x - mu_x),   t = B^T (y - mu_y)
     that maximize the correlation between s and t, by solving

        Sigma_xy Sigma_yy^-1 Sigma_yx A = Sigma_xx A Lambda^2,  A^T Sigma_xx A = I
        Sigma_yx Sigma_xx^-1 Sigma_xy B = Sigma_yy B Lambda^2,  B^T Sigma_yy B = I

        Sigma_* : covariance matrices,  Lambda : canonical correlations

  Similarity measure in the latent subspace using a probabilistic
  structure: probabilistic interpretation of CCA [Bach and Jordan, 2005],
  with a latent variable z shared by the image and label features

        z ~ N(0, I_d),   1 <= d <= min{p, q}
        x | z ~ N(W_x z + mu_x, Psi_x),   W_x in R^(p x d)
        y | z ~ N(W_y z + mu_y, Psi_y),   W_y in R^(q x d)
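The CCA step above can be implemented directly from the covariance matrices. A sketch in numpy using the standard whitened-SVD formulation; the small ridge `reg` is my own addition for numerical stability, not part of the cited method:

```python
import numpy as np

def cca(X, Y, d, reg=1e-6):
    """Fit CCA: returns A, B with A^T Sxx A = I and B^T Syy B = I,
    plus the canonical correlations lam (the diagonal of Lambda)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    Lx = np.linalg.inv(np.linalg.cholesky(Sxx))   # whitening for x
    Ly = np.linalg.inv(np.linalg.cholesky(Syy))   # whitening for y
    # SVD of the whitened cross-covariance gives the canonical directions
    U, lam, Vt = np.linalg.svd(Lx @ Sxy @ Ly.T)
    return Lx.T @ U[:, :d], Ly.T @ Vt[:d].T, lam[:d]

# toy data: two views sharing a 2-D latent factor
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
A, B, lam = cca(X, Y, d=2)
```

Because both views are noisy functions of the same latent factor, the leading canonical correlations come out close to 1.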
CCD for image auto-annotation

  Posterior of the latent variable (write s_i = A^T (x_i - mu_x),
  t_i = B^T (y_i - mu_y), and M_x, M_y with M_x M_y^T = Lambda):

    E[z | x_i, y_i]
      = [M_x; M_y]^T [ (I - Lambda^2)^-1          -Lambda (I - Lambda^2)^-1 ] [s_i]
                     [ -Lambda (I - Lambda^2)^-1   (I - Lambda^2)^-1        ] [t_i]

    var[z | x_i, y_i]
      = I - [M_x; M_y]^T [ (I - Lambda^2)^-1          -Lambda (I - Lambda^2)^-1 ] [M_x]
                         [ -Lambda (I - Lambda^2)^-1   (I - Lambda^2)^-1        ] [M_y]

    E[z | x] = M_x^T A^T (x - mu_x),   var[z | x] = I - M_x M_x^T

  Annotation score for a test image x_s:

    P(w | x_s) = (1/N) sum_i P(w | l_i) P(l_i | x_s)

    P(l_i | x) = int p(z | x_i, y_i) p(z | x) dz
                 / sum_j int p(z | x_j, y_j) p(z | x) dz

    P(w | l_i) prop. to delta_(w, l_i) IDF(w)
      delta_(w, l_i) in {0, 1} : annotation of training sample i
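Under probabilistic CCA [Bach and Jordan, 2005] these posterior statistics have closed forms. A sketch assuming the common choice M_x = M_y = Lambda^(1/2) (an assumption of this sketch; `s` and `t` stand for the canonical projections of one sample):

```python
import numpy as np

def posterior_given_x(s, lam):
    """E[z|x] and var[z|x] from the projection s = A^T (x - mu_x)."""
    Mx = np.diag(np.sqrt(lam))
    return Mx.T @ s, np.eye(len(lam)) - Mx @ Mx.T

def posterior_given_xy(s, t, lam):
    """E[z|x,y] and var[z|x,y] from both canonical projections."""
    d = len(lam)
    M = np.vstack([np.diag(np.sqrt(lam))] * 2)        # [M_x; M_y]
    C = np.block([[np.eye(d), np.diag(lam)],
                  [np.diag(lam), np.eye(d)]])         # [[I, Lam], [Lam, I]]
    Ci = np.linalg.inv(C)
    return M.T @ Ci @ np.concatenate([s, t]), np.eye(d) - M.T @ Ci @ M

lam = np.array([0.9, 0.5])
s = np.array([1.0, -0.5])
m1, v1 = posterior_given_x(s, lam)
m2, v2 = posterior_given_xy(s, s, lam)   # both views agree here
```

Observing the second (agreeing) view shrinks the posterior variance, which is what makes the joint posterior a sharper similarity measure than the image view alone.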
Features
 Image features
       BoVW, GIST, etc. (off-the-shelf ones)
       Need to be encoded in a Euclidean space

 Label features
       Binary occurrence vector      cf. [Guillaumin et al., CVPR'10]

   When the dictionary contains "plane, sea, sky, clouds, mountain":

       I_j (plane, sky, clouds):    y_j = (1, 0, 1, 1, 0)
       I_k (sky, clouds, mountain): y_k = (0, 0, 1, 1, 1)

       <y_j, y_k> = 2 : the dot product counts the number of common labels.
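The dot-product-counts-common-labels property is easy to check directly:

```python
import numpy as np

vocab = ["plane", "sea", "sky", "clouds", "mountain"]
y_j = np.array([1, 0, 1, 1, 0])   # I_j: plane, sky, clouds
y_k = np.array([0, 0, 1, 1, 1])   # I_k: sky, clouds, mountain

common = int(y_j @ y_k)           # counts labels present in both images
```

Here `common` is 2 (sky and clouds), matching the example above.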
Evaluation
 Benchmark datasets

                             Corel5K   IAPR-TC12   ESP Game
    # of words                 260        291        268
    # of training images      4,500     17,665     18,689
    # of testing images        499       1,962      2,081
    # of words per image
    (avg./max)                3.4/5      5.7/23     4.7/15
Evaluation
 Comparable performance to the state of the art

 [Bar charts: annotation scores of the proposed method and competing
  methods on Corel5K, IAPR-TC12, and ESP Game; y-axes up to 0.45, 0.45,
  and 0.35 respectively]
Image features for linear classifiers

Basic pipeline

 1. Local feature extraction          2. Coding the image-level
      1-1. feature detector               feature vector,
           (operator, grid)               e.g. (0.5, 1.2, 0.1, ...)
      1-2. descriptor
           (SIFT, SURF, ...)          How to encode similarity between
                                      distributions of local features?
Bag-of-Visual-Words (traditional)                          [Csurka et al. 2004]

 Vector quantization → histogram
      ○ computationally efficient
      × large reconstruction error
      × non-linear property (must be used with a non-linear kernel)

 [Figure: local features extracted from the training images are clustered
  into visual words; a query image is then described by its histogram of
  visual-word occurrences]
                                                           Credit: K. Yanai
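The quantization → histogram step can be sketched as below, assuming a codebook of visual words has already been learned (e.g. by k-means on training descriptors):

```python
import numpy as np

def bovw_histogram(descriptors, words):
    """Assign each local descriptor to its nearest visual word and
    return the L1-normalized occurrence histogram."""
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    assignment = d2.argmin(axis=1)               # vector quantization
    hist = np.bincount(assignment, minlength=len(words)).astype(float)
    return hist / hist.sum()

words = np.array([[0.0, 0.0], [1.0, 1.0]])       # toy 2-word codebook
descs = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 0.8]])
hist = bovw_histogram(descs, words)
```

Each descriptor keeps only the identity of its nearest word, which is exactly where the large reconstruction error comes from.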
New BoVW ①: sparse coding + max pooling

 Reduce reconstruction error by using multiple bases (words) per descriptor
 Max pooling leads to linearly separable image signatures
   (taking the max response for each visual word) cf. [Boureau et al., ICML'10]

                          [Yang+, CVPR'09]           [Wang+, CVPR'10]
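The pooling half of this idea can be sketched as below; the sparse codes themselves are assumed given (computing them requires a dictionary and a sparse solver such as LASSO or LLC, omitted here):

```python
import numpy as np

def max_pool(codes):
    """codes: (n_local_features, n_words) sparse-code responses for one
    image; keep only the strongest response per visual word."""
    return np.abs(codes).max(axis=0)

def avg_pool(codes):
    """Average pooling, the histogram-like alternative."""
    return np.abs(codes).mean(axis=0)

# toy codes: word 0 fires strongly on a single patch only
codes = np.array([[0.9, 0.0, 0.1],
                  [0.1, 0.0, 0.2],
                  [0.0, 0.3, 0.0]])
sig_max = max_pool(codes)   # the single strong response survives pooling
sig_avg = avg_pool(codes)   # the same response is diluted by averaging
```

Max pooling preserves a single strong, discriminative response that averaging washes out, which is one intuition for why the pooled signatures behave well with linear classifiers.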
New BoVW ②: encode higher-level statistics

                                  N: # of visual words (10^3 ~ 10^4)
                                  d: dimension of descriptor (10 ~ 100)

 Method                                 Statistics             Dim. of image signature
 BoVW                                   count (ratio)          N
 VLAD [Jegou+, CVPR'10]                 mean                   Nd
 Super vector [Zhou+, ECCV'10]          ratio+mean             N(d+1)
 Fisher vector [Perronnin+, ECCV'10]    mean+variance          2Nd
 Global Gaussian [Nakayama+, CVPR'10]   mean+covariance (N=1)  d(d+1)/2
 VLAT [Picard+, ICIP'11]                mean+covariance        Nd(d+1)/2

 Encoded in a feature vector so that the dot product
 approximates the distance between distributions
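As one representative of the table, VLAD concatenates per-word sums of descriptor residuals, giving the N*d-dimensional signature. A compact sketch (codebook assumed given; the usual power-normalization step is omitted):

```python
import numpy as np

def vlad(descriptors, centers):
    """VLAD signature: for each visual word, sum the residuals of the
    descriptors assigned to it; concatenate and L2-normalize (N*d dims)."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assignment = d2.argmin(axis=1)
    v = np.zeros_like(centers, dtype=float)
    for k in range(len(centers)):
        assigned = descriptors[assignment == k]
        if len(assigned):
            v[k] = (assigned - centers[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy codebook, N=2, d=2
descs = np.array([[0.2, 0.0], [0.0, -0.2], [1.3, 1.0]])
sig = vlad(descs, centers)                     # dimension N*d = 4
```

After L2 normalization, dot products between such signatures serve directly as the similarity for a linear classifier.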
Global Gaussian Coding                                   [Nakayama+, CVPR'10]

 Exploit the Riemannian manifold of Gaussians
  using the information geometry framework

    p(x; mu, Sigma) = (2 pi)^(-d/2) |Sigma|^(-1/2)
                      exp( -(1/2) (x - mu)^T Sigma^-1 (x - mu) )
                                                 x : local descriptor

 Affine coordinates

    eta = ( mu_1, ..., mu_d,
            mu_1^2 + sigma_11, mu_1 mu_2 + sigma_12, ..., mu_1 mu_d + sigma_1d,
            mu_2^2 + sigma_22, ..., mu_d^2 + sigma_dd )^T

 Inner product weighted by G(eta_bar), the inverse of the Fisher
 information matrix:

    <eta_i, eta_j> = eta_i^T G(eta_bar) eta_j

 We use G(eta_bar), the metric at the center of the samples, for the
 entire space; this somewhat approximates the KL-divergence.
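The resulting image signature is the Gaussian's affine (expectation) coordinates. A sketch of just the coordinate computation; the G(eta_bar) whitening from the slide above is omitted here:

```python
import numpy as np

def global_gaussian_signature(descriptors):
    """eta-coordinates of the Gaussian fitted to an image's local
    descriptors: (mu, upper triangle of E[x x^T] = Sigma + mu mu^T)."""
    mu = descriptors.mean(axis=0)
    second = descriptors.T @ descriptors / len(descriptors)
    iu = np.triu_indices(descriptors.shape[1])   # upper triangle only
    return np.concatenate([mu, second[iu]])

rng = np.random.default_rng(0)
descs = rng.normal(size=(50, 3))            # toy local descriptors, d = 3
eta = global_gaussian_signature(descs)      # dim = d + d(d+1)/2 = 9
```

The signature length grows as d(d+1)/2 (plus the mean), independent of any codebook size, matching the table's N=1 entry.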
Competition
 Large-scale visual recognition challenge 2010
      1000-class categorization
      1.2M training images, 150K testing images
      Evaluated by top-5 classification accuracy


 Part of ImageNet dataset      [Deng et al.]
      Labeled with Amazon Mechanical Turk
      14M images, 22K categories (as of 2011)
      Semantic structure according to WordNet




                                                   Credit: Fei-Fei Li
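The top-5 criterion counts a test image as correct if its true class appears among the classifier's five highest-scoring predictions; a minimal sketch:

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (n_images, n_classes) classifier outputs;
    labels: (n_images,) ground-truth class indices."""
    top5 = np.argsort(-scores, axis=1, kind="stable")[:, :5]
    hits = [label in row for row, label in zip(top5, labels)]
    return float(np.mean(hits))

# toy check: 3 images, 10 classes
scores = np.zeros((3, 10))
scores[0, 7] = 1.0          # image 0: class 7 ranked first -> hit
scores[1, 2] = 1.0          # image 1: true class 9 not in top 5 -> miss
scores[2, 9] = 0.5          # image 2: class 9 ranked first -> hit
acc = top5_accuracy(scores, np.array([7, 9, 9]))
```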
Result (2010)
 11 teams participated
      1. NEC+UIUC (72%)           80,000 ~ 260,000 dim ×6
      2. Xerox Research (64%)     260,000 dim ×2
      3. ISI (55%)                12,000 dim
      4. UC Irvine (53%)
      5. MIT (46%)


 Examples
      http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc/index.html
2010 Winner:                  NEC-UIUC
 LCC + super vector coding
     Ensemble of six classifiers using different features
 Parallelized feature extraction (Hadoop)
 Linear SVM (Averaging SGD)
      LCC → 2 days, super vector → 7 days (with an 8-core machine)
2011 Winner:            XRCE
 Fisher vector
   520K dim ×2 (SIFT, color)
   2 days with a 16-core machine
 Linear SVM (SGD)
   1.5 days with a 16-core machine
2012 Winner: Univ. Toronto
 Deep learning
     Huge convolutional neural network from raw images
     Two GPUs, one week




 [Chart: roughly a 10-percentage-point lead over the runner-up]
Summary
 Large-scale image recognition is now a hot topic
      Millions of training images, tens of thousands of categories

 Scalability is the key issue
      Linear training methods + compatibly-designed features
      If we somehow approximate the sample similarity with dot
       kernel, we can simply apply linear methods!
          Explicit embedding
          Fisher kernel
           KPCA + Nyström method


      Personal interest: Can we do this with graph kernels?

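The KPCA + Nyström route mentioned in the summary can be sketched as follows: embed each sample via its kernel values against m landmark points, so that dot products of the embeddings approximate the full kernel and linear methods apply. An RBF kernel is chosen here purely for illustration:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Pairwise RBF kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_embed(X, landmarks, gamma=0.5):
    """phi(X) = K(X, landmarks) W^(-1/2): dot products of the embeddings
    approximate the kernel, enabling O(N) linear training."""
    W = rbf(landmarks, landmarks, gamma)
    U, s, _ = np.linalg.svd(W)                     # W is symmetric PSD
    return rbf(X, landmarks, gamma) @ (U / np.sqrt(np.maximum(s, 1e-12)))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
phi = nystroem_embed(X, X)        # with landmarks = X the map is exact
K_approx = phi @ phi.T
K_exact = rbf(X, X)
```

In practice m << N landmarks are used, trading a small kernel-approximation error for linear-time training.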