CBIR in the Era of Deep Learning

Lei Wang
School of Computing and Information Technology
University of Wollongong, Australia
15-Oct-2016
CBIR in the Era of Deep Learning
-- A Perspective from Feature Representation

Outline
• Introduction of CBIR
• Evolution of CBIR
– Early days (before 2000)
– Days of the BoF model (2000 ~ 2012)
– Era of Deep learning (after 2012)
• Conclusion
Images courtesy of related papers and authors

Introduction
• Retrieval
– Getting back information that has been stored in a
database
• Image Retrieval

Introduction
• Text-based image retrieval (TBIR, since late 1970’s)
– Manually associate images with text annotations
– Interpret images with high-level semantics
– Retrieval by matching the associated text annotations
Retrieval result of Google Images for “Airplane”

Introduction
• Issues with text-based image retrieval
– Annotation is time consuming and labour intensive
– Annotations only partially describe the visual content
– Human perception is subjective
– Query by example is not supported
Drouin Post Office, front desks Iron Ore Fashion

Introduction
• Content-based image retrieval
– Human annotators are replaced by computers
– Text annotations are replaced by visual features
– Retrieval by comparing the associated visual features
Drouin Post Office, front desks Iron Ore Fashion

Introduction
• National Science Foundation (NSF) organised a special
workshop on the topic of visual information
management (Feb 1992, San Jose, CA)
• "It would be impossible to cope with this explosion of image
information, unless the images were organized for retrieval.
The fundamental problem is that images, video, and other
similar data differ from numeric data and text data format,
and hence they require a totally different technique of
organization, indexing and query processing."

Introduction
• CBIR categorisation
– No query: Randomly browse similar images
– Query by text (by typing “airplane” or description)
– Query by example
• by using an image, sketch, or graphic of airplane

Introduction
– Find images of similar colour, texture or shape
– Find images of similar object, scene, place, event, etc.

Introduction
– Narrow domain
– Broad domain

Introduction
CBIR
Image matching
Image Recognition
Image Segmentation
Object detection
Image annotation
More tasks …

Introduction
CBIR and its related fields: Computer Vision, Information Retrieval, Database, Machine Learning

Introduction
• Applications of CBIR
– Archival photo collection management
– Personal album management
– Crime investigation
– Fashion and design
– Education and entertainment
– Localisation and navigation
– Medical Image analysis
– ….

Introduction
• CBIR systems
– QBIC, Virage, Photobook, VisualSEEk, MARS, etc.
Source: http://vismod.media.mit.edu/vismod/demos/photobook/
Source: http://www.cse.unsw.edu.au/~jas/talks/curveix/notes.html

Early days
A new research problem that received great interest
[Diagram: CBIR and its research aspects: Application, Semantic gap, Domain knowledge, User model, Query mode, Visual features, Similarity measure, Interaction, Learning from data, System, Evaluation]

Early days
• Hand-crafted features
– Color, texture, shape, structure, etc.
– Goal: “Invariant and discriminative”
• Similarity or distance measure
– Euclidean distance, Manhattan distance, etc.
– Specific measures designed for specific features
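
As an illustration of hand-crafted features compared with a distance measure, here is a minimal sketch of retrieval by colour histogram. The function names, the 4-bins-per-channel quantisation, and the toy pixel lists are assumptions made for this example; they do not describe any particular system named in these slides.

```python
from math import sqrt

def color_histogram(pixels, bins_per_channel=4):
    """Quantise (r, g, b) pixels (0-255) into a normalised joint histogram."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def retrieve(query_hist, database, distance=euclidean):
    """Return database keys sorted from most to least similar to the query."""
    return sorted(database, key=lambda name: distance(query_hist, database[name]))

# Toy "images": flat lists of RGB pixels (hypothetical data).
red_image = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
blue_image = [(10, 10, 250)] * 100
mostly_red_query = [(240, 20, 20)] * 80 + [(20, 20, 240)] * 20

database = {"red": color_histogram(red_image), "blue": color_histogram(blue_image)}
ranking = retrieve(color_histogram(mostly_red_query), database)
print(ranking)
```

Swapping `distance=manhattan` into `retrieve` changes only the measure, which is the kind of design choice the slide refers to.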

Early days
• Relevance feedback
– Bring the user into the loop of CBIR to handle the “semantic gap”
– A focal point of machine learning research in CBIR

Early days
• Relevance feedback
– Learning from small sample
– Semi-supervised learning
– Transductive learning
– Feature selection, dimensionality reduction
– Kernel based learning
– Manifold learning
– Relation learning
– …
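
To make the idea of bringing the user into the loop concrete, the sketch below uses a Rocchio-style query update. This is a classical technique chosen here for illustration; it is not named in the slides, and the weights `alpha`, `beta`, `gamma` are conventional placeholder values, not recommendations.

```python
def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query feature towards images marked relevant and away from
    images marked irrelevant by the user (illustrative Rocchio-style rule)."""
    dim = len(query)

    def mean(vectors):
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    pos, neg = mean(relevant), mean(irrelevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

# Hypothetical 2-D features: the user marks two results relevant, one irrelevant.
new_query = rocchio_update([1.0, 0.0],
                           relevant=[[0.0, 4.0], [0.0, 4.0]],
                           irrelevant=[[4.0, 0.0]])
print(new_query)
```

The updated query is then used for another retrieval round, so each round of feedback uses only a handful of labelled samples, which is why learning from small samples appears in the list above.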

• Achievements
– Researched CBIR from various perspectives
– Identified the key issues and obstacles
– Many initial but insightful observations and attempts
– Machine learning started playing an important role
• To be improved
– Basic, hand-crafted features, limited invariance
– Depend considerably on domain theory
– Small-sized databases for evaluation

Days of the BoF model
Local Invariant Features
• SIFT, HOG, SURF, CENTRIST, filter-based, …
– Invariant to view angle, rotation, scale, illumination, ...
SIFT (Scale Invariant Feature Transform)
Image courtesy of David Lowe, IJCV04
http://www.robots.ox.ac.uk/~vgg/software/

http://www.robots.ox.ac.uk/~vgg/research/affine/#software/
Image A Image B

Source: http://ivt.sourceforge.net/examples.html
Image A Image B

Source: http://www.robots.ox.ac.uk/~vgg/share/SearchPractical2012.html
Image A Image B

The bag-of-features (BoF) model is borrowed from text analysis

Interest point detection or dense sampling
The cropped detected regions

A close-up view

Extract features from all training/test images
$x \in \mathbb{R}^d$

Cluster all features to generate “visual words” in $\mathbb{R}^d$

Generated “Visual Words”
[Figure: example image patches for Word 1, Word 2, Word 3, Word 4, …, Word k]

From an image to a histogram
$[n_1, n_2, \ldots, n_k] \in \mathbb{R}^k$, where $n_1$ is the number of occurrences of the 1st “word” in this image
$[0, 1, 0, \ldots, 0] \in \mathbb{R}^k$
$[1, 0, 0, \ldots, 0] \in \mathbb{R}^k$
$[0, 0, 1, \ldots, 0] \in \mathbb{R}^k$
… … … …
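
The steps from local features to a histogram can be sketched as follows: cluster descriptors into k visual words, then count, for each image, how many of its descriptors fall on each word. This is a minimal sketch with plain k-means and hard assignment; the farthest-point initialisation and the 2-D toy descriptors are assumptions made to keep the example small and deterministic (real SIFT descriptors are 128-dimensional).

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(x, centers):
    """Index of the visual word (cluster centre) closest to descriptor x."""
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))

def farthest_point_init(features, k):
    """Deterministic initialisation (an assumption of this sketch)."""
    centers = [list(features[0])]
    while len(centers) < k:
        far = max(features, key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means; returns k centres in R^d (the visual words)."""
    centers = farthest_point_init(features, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in features:
            groups[nearest(x, centers)].append(x)
        for j, group in enumerate(groups):
            if group:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bof_histogram(image_features, centers):
    """Hard-assign each descriptor to one word and count: [n_1, ..., n_k]."""
    hist = [0] * len(centers)
    for x in image_features:
        hist[nearest(x, centers)] += 1
    return hist

# Hypothetical local descriptors of two images (d = 2 for readability).
image1 = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (10.0, 0.1)]
image2 = [(10.1, 0.0), (9.9, 0.2), (0.0, 10.0), (0.1, 9.9)]

codebook = kmeans(image1 + image2, k=3)
hist1 = bof_histogram(image1, codebook)
hist2 = bof_histogram(image2, codebook)
print(hist1, hist2)
```

Each descriptor contributes a one-hot vector in R^k, and the image histogram is their sum; soft-assignment and other coding schemes replace only the `nearest` step.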

Classifying, clustering or retrieving images
$y = w^\top x + b$, with $x \in \mathbb{R}^k$

A Bag-of-Features Image Analysis System
Image database → Feature extraction → Codebook generation → Feature coding → Feature pooling → Classification, Clustering or Retrieval

1999: Local Invariant Features, such as SIFT (Lowe, ICCV99)
2003: Video Google (Sivic, ICCV03); Bag-of-keypoints (Csurka, SLCV@ECCV04)
2005: Pyramid Match Kernel (Grauman, ICCV05); Dense sampling (Jurie, ICCV05); Compact Codebook (Winn, ICCV05)
2006: Vocabulary tree (Nister, CVPR06); Randomized Clustering Forests (Moosmann, NIPS06); Spatial Pyramid Matching (Lazebnik, CVPR06)
2007: Comparative Study (Zhang, IJCV07); Coding with Fisher Kernels (Perronnin, CVPR07)
2008: Kernel Codebook (van Gemert, ECCV08); In Defense of Nearest Neighbor Classifier (Boiman, CVPR08)
2009: Sparse coding for BoF (Yang, CVPR09); Local Coordinate Coding (Yu, NIPS09)
2010: Locality-constrained Linear Coding for BoF (Wang, CVPR10); Coding & pooling scheme comparison (Boureau, CVPR10)
2011: Local Soft-assignment Coding & Mix-order pooling (Liu, ICCV11); Comparative Study on BoF model (Chatfield, BMVC11)

Key issues of CBIR with the BoF model
• How to quickly create a large visual codebook
– Hierarchical k-means clustering
– Approximate k-means clustering
Source: Nister and Stewenius, CVPR06

• How to incorporate spatial information
– The BoF model ignores the spatial information of SIFT features
– Spatial Pyramid Matching
– Re-ranking with spatial verification

Retrieval result before spatial verification
Query:

• Re-ranking with spatial verification
– 25 points matched under a consistent spatial relationship
– Only 4 points matched under a consistent spatial relationship
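
Re-ranking by the number of spatially consistent matches can be sketched as below. The sketch assumes a translation-only model and tries every match as a hypothesis, which is a strong simplification chosen for brevity; practical systems usually fit a richer transformation. The function names, the tolerance `tol`, and the toy coordinates are all hypothetical.

```python
def count_inliers(matches, tol=3.0):
    """matches: list of ((xq, yq), (xd, yd)) pairs of matched keypoints.
    Returns the largest number of matches that agree on one translation."""
    best = 0
    for (xq, yq), (xd, yd) in matches:
        dx, dy = xd - xq, yd - yq  # translation hypothesis from a single match
        inliers = sum(
            1
            for (aq, bq), (ad, bd) in matches
            if abs((ad - aq) - dx) <= tol and abs((bd - bq) - dy) <= tol
        )
        best = max(best, inliers)
    return best

def rerank(candidates, tol=3.0):
    """candidates: dict image_id -> matches. Sort ids by inlier count, descending."""
    return sorted(candidates, key=lambda i: count_inliers(candidates[i], tol), reverse=True)

# Four matches share the shift (10, 5); the fifth is an outlier.
good = [((0, 0), (10, 5)), ((5, 1), (15, 6)), ((2, 8), (12, 13)),
        ((7, 7), (17, 12)), ((3, 3), (40, 40))]
# Three matches with mutually inconsistent shifts.
bad = [((0, 0), (30, 2)), ((5, 1), (1, 20)), ((2, 8), (25, 25))]

order = rerank({"bad": bad, "good": good})
print(order, count_inliers(good), count_inliers(bad))
```

Because verification is costly, it is typically applied only to the top results returned by the BoF ranking.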

Retrieval result after spatial verification
Query:

• Large-scale image retrieval
– Memory, time, precision
– Approximate nearest-neighbor search
$[x_1, x_2, \ldots, x_d]$ → 0100101100…
How?

• Locality-sensitive hashing (LSH)
– Random projection, data independent, unsupervised
• Learning compact binary codes
– Preserving sample similarities, data dependent
[Figure: LSH, samples labelled with bits 1 and 0]
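
The random-projection variant of LSH can be sketched as follows: each bit records on which side of a random hyperplane the feature vector lies, and codes are compared by Hamming distance. The function names, the number of bits, and the seed are assumptions of this example.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in R^dim (data independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else 0
        for w in hyperplanes
    )

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

planes = make_hyperplanes(dim=4, n_bits=16)
x = [1.0, 2.0, -0.5, 0.3]
code_x = lsh_code(x, planes)
code_scaled = lsh_code([2.0 * v for v in x], planes)   # same direction as x
code_opposite = lsh_code([-v for v in x], planes)      # opposite direction
print(code_x, hamming(code_x, code_scaled), hamming(code_x, code_opposite))
```

Vectors separated by a small angle tend to share most bits, so Hamming distance on short codes stands in for the original similarity at a fraction of the memory. Learned binary codes replace the random hyperplanes with projections fitted to the data.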

Retrieval examples from the “Oxford5K” data set
Source: Philbin et al., Object retrieval with large vocabularies and fast spatial matching, CVPR07

Days of the BoF model (Summary)
• Achievements
– Local invariant features play a fundamental role
– Visual codebook creation, feature coding, and feature
pooling are extensively studied
– Multiple benchmark data sets are established
– Large-scale image retrieval is also researched
• To be improved
– Feature representation and recognition are handled separately
– Focused more on object-level retrieval and less on semantic-level retrieval

Era of Deep Learning
• Visual
– Images
– Videos
• Audio
– Speech
– Music
• Text
– Natural Language
• Planning
• …

• Image Recognition
– Faces, objects, poses, scenes, …
• Video content analysis
– Action, activities, events, summarization, …
• Visual information management
– Search, retrieval, indexing, browsing, …
• Potential Outcome: AI
– Computers can see and understand visual
information
– Robotics, self-driving cars, surveillance
– ….

Object detection (Source: Rich feature hierarchies for accurate object detection and
semantic segmentation, CVPR 2014)
Face Recognition (Source: DeepFace: Closing the Gap to Human-Level Performance in Face
Verification, CVPR 2014)

Pose estimation (Source: DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR 2014)
Image Segmentation (Source: SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation, IEEE TPAMI 2016)

• Fine-grained image recognition
• Human attribute classification
[Ning Zhang et al. CVPR 2014]
[Branson et al. arXiv 2014]

• Action Recognition
• Large-scale Video Classification
[Karpathy et al. CVPR 2014]
[Simonyan et al. arXiv 2014]

Feature Representation
• Invariant and discriminative features
[Diagram: Feature Extraction → Classification → “Panda”?, with Prior Knowledge, Experience]
[Example images: Pose, Occlusion, Multiple objects, Inter-class similarity]
Image courtesy of M. Ranzato

• From hand-crafted features to automatically learned ones
$\mathbb{R}^d$ → $\mathbb{R}^k$ → $y = w^\top x + b$

• Directly learn feature representations from data.
• Jointly learn the feature representation and the classifier.
Low-level Features → Mid-level Features → High-level Features → Classifier → “Panda”?
More abstract representation
Deep Learning: train layers of features so that the classifier works well.
Image courtesy of M. Ranzato

• Deep Learning
– Inspired by the way the human brain processes information
– Many layers of non-linear information processing stages

Have we been here before?
Yes.
• Basic ideas common to past neural networks research
• Standard machine learning strategies still relevant.
No.
[Diagram: Computational Power, Large-scale Data, New Algorithms → Deep Learning]

Convolutional Neural Networks (CNNs)
• A special multi-stage architecture inspired by the visual system

Convolutional Neural Networks (CNNs)
Fukushima 1980: Neocognitron
LeCun et al. 1989-1998: Hand-written digit reading
Rumelhart, Hinton, Williams 1986: “T” versus “C” problem
...
Krizhevsky, Sutskever, Hinton 2012: ImageNet classification breakthrough, “SuperVision” CNN
Source: Slide: Girshick

CNNs: ImageNet Breakthrough
● Krizhevsky et al. win 2012 ImageNet classification with a much bigger ConvNet
○ deeper: 7 stages vs 3 before
○ larger: 60 million parameters vs 1 million before
○ 16.4% error (top-5) vs next best 26.2% error
● This was made possible by:
○ fast hardware: GPU-optimized code
○ big dataset: 1.2 million images vs thousands before
○ better regularization: dropout, etc.
[Krizhevsky et al. NIPS 2012]
Image courtesy of Deng et al.

Learned Features of CNNs
[Matthew D. Zeiler et al. ECCV 2014]

CBIR: From SIFT to CNNs
• Three main approaches
– Directly use pre-trained CNN models
• to extract feature representations
– Fine-tune pre-trained CNN models
• with side information (pairwise or triplet similarity)
– Bag-of-features model on CNN features
• “Deep SIFT”

1. Directly use pre-trained CNNs
• How to use the feature representations?
– Which layer?
– How to pool the features in a convolutional layer?
– How to select the features in a convolutional layer?

– Which layer?
Fully connected layer
Convolutional layer

[Figure: convolutional feature map (Height × Width × Depth) → vector $[x_1, x_2, \ldots, x_n]$. How?]
• Sum-pooling
• Max-pooling
• Grid-based max-pooling
• Region-based pooling
• Mixed sum & max pooling
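
Three of the pooling options listed above can be sketched on a toy feature map. The map is represented here as a nested list indexed `fmap[y][x][channel]`, which is an assumption of this example; real CNN activations would come from a deep learning framework, and `grid_max_pool` assumes the map is at least `grid` positions high and wide.

```python
def sum_pool(fmap):
    """Sum each channel over all spatial positions -> Depth-dim vector."""
    depth = len(fmap[0][0])
    return [sum(cell[c] for row in fmap for cell in row) for c in range(depth)]

def max_pool(fmap):
    """Maximum of each channel over all spatial positions -> Depth-dim vector."""
    depth = len(fmap[0][0])
    return [max(cell[c] for row in fmap for cell in row) for c in range(depth)]

def grid_max_pool(fmap, grid=2):
    """Max-pool inside each cell of a grid x grid partition and concatenate,
    giving a (grid * grid * Depth)-dim vector that keeps coarse layout."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for gy in range(grid):
        for gx in range(grid):
            rows = fmap[gy * height // grid:(gy + 1) * height // grid]
            block = [row[gx * width // grid:(gx + 1) * width // grid] for row in rows]
            out.extend(max_pool(block))
    return out

# Hypothetical 2 x 2 feature map with 2 channels.
fmap = [[[1.0, 0.0], [2.0, 5.0]],
        [[0.0, 3.0], [4.0, 1.0]]]

print(sum_pool(fmap))       # global sum per channel
print(max_pool(fmap))       # global max per channel
print(grid_max_pool(fmap))  # one max-pooled vector per grid cell
```

Sum- and max-pooling discard position entirely, while the grid variant trades a longer descriptor for some spatial layout.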

– How to select the features in a convolutional layer?
• Weighting
• Activation magnitude
• Region detection
Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps
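
One simple reading of weighting by activation magnitude is sketched below: each spatial position is weighted by its total activation before sum-pooling, so strongly responding locations dominate the descriptor. This is an illustrative scheme under the assumption of non-negative (post-ReLU) activations; it is not the exact method of any paper cited in these slides.

```python
def spatial_weights(fmap):
    """Weight of each position = its total activation, normalised to sum to 1."""
    raw = [[sum(cell) for cell in row] for row in fmap]
    total = sum(sum(r) for r in raw) or 1.0
    return [[v / total for v in r] for r in raw]

def weighted_sum_pool(fmap, weights):
    """Sum-pool each channel with per-position weights."""
    depth = len(fmap[0][0])
    return [
        sum(weights[y][x] * fmap[y][x][c]
            for y in range(len(fmap)) for x in range(len(fmap[0])))
        for c in range(depth)
    ]

# Hypothetical 2 x 2 x 2 feature map: the left column is active, the right is not.
fmap = [[[1.0, 1.0], [0.0, 0.0]],
        [[3.0, 3.0], [0.0, 0.0]]]

weights = spatial_weights(fmap)
pooled = weighted_sum_pool(fmap, weights)
print(weights, pooled)
```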

2. Fine-tune pre-trained CNNs
• To incorporate extra information from a new image data set
– Side information (pairwise or triplet similarity)
– Distance metric learning
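
Pairwise and triplet side information is usually turned into a training objective through a loss on distances between embeddings. The sketch below shows two common forms, a contrastive (pairwise) loss and a triplet hinge loss, on plain Python lists. Only the losses are shown, not the fine-tuning loop; the margins and the use of squared Euclidean distance are assumptions of this example.

```python
from math import sqrt

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def contrastive_loss(u, v, similar, margin=1.0):
    """Pairwise side information: pull similar pairs together,
    push dissimilar pairs at least `margin` apart."""
    if similar:
        return sq_dist(u, v)
    return max(0.0, margin - sqrt(sq_dist(u, v))) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet side information: zero once the negative is farther from the
    anchor than the positive by at least `margin` (in squared distance)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

# Hypothetical 2-D embeddings.
satisfied = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])  # negative is far enough
violated = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])   # negative is closer
pair_similar = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True)
pair_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False, margin=6.0)
print(satisfied, violated, pair_similar, pair_dissimilar)
```

During fine-tuning, the gradient of such a loss with respect to the network weights adjusts the learned features so that distances reflect the supplied similarity.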

2. Fine-tune pre-trained CNNs
Source: MatchNet, CVPR 2015
Source: Learning Fine-Grained Image Similarity with Deep Ranking, CVPR 2014

3. Bag-of-features model on “Deep SIFT”
Source: Multi-scale Orderless Pooling of Deep Convolutional Activation Features, ECCV2014

“Deep SIFT”
Source: Cao et al., Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps

Codebook generation → Feature coding → Feature pooling → Classification, Clustering or Retrieval
Or

2012: Image Classification with DCNN (Krizhevsky, NIPS12)
2014: CNN Features off-the-shelf (Razavian, CVPRW14); Neural codes (Babenko, ECCV14); Deep ranking (Wang, CVPR14); Multi-scale orderless pooling (Gong, ECCV14); Encoding High Dimensional Local Features (Liu, NIPS14); Survey: Deep learning for CBIR (Wan, ACMMM14)
2015: Deep filter banks (Cimpoi, CVPR15); Exploiting Local Features from DNN (Ng, CVPRW15); SPoC (Babenko, ICCV15); MatchNet (Han, CVPR15)
2016: R-MAC (Tolias, ICLR16); CNN IR Learns from BoW (Radenovic, ECCV16); CroW (Kalantidis, ECCVW16); Where to focus (Cao, 2016)
Some papers appeared on arXiv

Summary
• A very limited (and biased) account of CBIR
• CBIR has made significant progress over the past two decades
• The development of feature representation plays a key role
• Issues to be resolved
– How to transfer the benefit of Deep Learning?
– How to deal with the unsupervised learning case?
– How to better handle the semantic gap?
– …

Color histogram, Gabor feature, Euclidean distance, User model, Query model, …
SIFT, Bag-of-features, Hashing, Fine-grained recognition, …
Deep features, Deep retrieval, Deep ranking, Deep hashing, …
Images courtesy of Google Images
