This document summarizes a research paper on deep image retrieval using global image representations. It presents three key ideas: 1) A siamese network trained with a triplet loss to learn image representations optimized for retrieval. 2) Replacing rigid region grids with a region proposal network to localize regions of interest. 3) Experiments showing their method outperforms classification features and achieves state-of-the-art results on standard retrieval datasets. Their work demonstrates an effective and scalable approach to image retrieval based on learning compact global image signatures.
1. Deep Image Retrieval: Learning Global Representations for Image Search
Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus
Original Slides by Albert Jiménez
Computer Vision Reading Group
[arXiv]
4. CNN-based retrieval
● CNNs trained for classification tasks
● Features are very robust to intra-class variability
● Lack of robustness to scaling, cropping and image clutter
Related Work
(Figure: example images of the "lamp" class)
We are interested in distinguishing between particular objects from the same class!
5. R-MAC
● Regional Maximum Activation of Convolutions
● Compact feature vectors encode image regions
Related Work
Giorgos Tolias, Ronan Sicre, Hervé Jégou, "Particular object retrieval with integral max-pooling of CNN activations" (Submitted to ICLR 2016)
6. R-MAC
● Regions selected using a rigid grid
● Compute a feature vector per region
● Combine all region feature vectors
○ Dimension → 256 / 512
Related Work
(Figure: the last layer of a ConvNet yields K feature maps of size W × H; rigid region grids at different scales are pooled by keeping the maximum activation per region)
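The R-MAC pipeline above can be sketched in a few lines. This is a simplified illustration assuming numpy feature maps: the real R-MAC uses overlapping square regions and PCA-whitening of the per-region vectors, both omitted here for brevity.

```python
import numpy as np

def rmac(feature_maps, scales=(1, 2, 3)):
    """Simplified R-MAC sketch over conv feature maps of shape (K, H, W).

    Regions come from rigid grids at several scales (non-overlapping here,
    unlike the original); each region is max-pooled into a K-dim vector,
    the vectors are L2-normalized, summed, and re-normalized.
    """
    K, H, W = feature_maps.shape
    regions = []
    for s in scales:
        rh, rw = H // s, W // s          # s x s grid cells at this scale
        for i in range(s):
            for j in range(s):
                regions.append((i * rh, j * rw, rh, rw))
    agg = np.zeros(K)
    for (y, x, rh, rw) in regions:
        v = feature_maps[:, y:y + rh, x:x + rw].max(axis=(1, 2))  # max-pool
        agg += v / (np.linalg.norm(v) + 1e-12)                    # L2-normalize
    return agg / (np.linalg.norm(agg) + 1e-12)                    # final L2-norm
```

The result is one compact, fixed-length vector per image regardless of the number of regions, which is what makes dot-product comparison possible later.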
8. 1st Contribution
● Three-stream siamese network
● PCA implemented as a shift + fully connected layer
● Optimize the weights (CNN + PCA) of the R-MAC representation with a triplet loss function
9. 1st Contribution
Ranking Loss Function
L(q, d+, d−) = ½ · max(0, m + ‖q − d+‖² − ‖q − d−‖²)
where:
● m is a scalar that controls the margin
● q, d+, d− are the descriptors for the query, positive and negative images
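The ranking loss is straightforward to write down; a minimal numpy sketch for a single triplet:

```python
import numpy as np

def triplet_ranking_loss(q, d_pos, d_neg, m=0.1):
    """Ranking loss for one triplet (query, relevant, non-relevant).

    L = 0.5 * max(0, m + ||q - d+||^2 - ||q - d-||^2)
    The loss is zero once the negative is at least m farther (in squared
    distance) from the query than the positive is.
    """
    d2_pos = np.sum((np.asarray(q) - np.asarray(d_pos)) ** 2)
    d2_neg = np.sum((np.asarray(q) - np.asarray(d_neg)) ** 2)
    return 0.5 * max(0.0, m + d2_pos - d2_neg)
```

In training, the gradient of this loss is backpropagated through all three streams of the siamese network, which share their weights.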
10. 2nd Contribution
● Localize regions of interest (ROIs)
● Train a Region Proposal Network with bounding boxes (similar to Faster R-CNN, [arXiv])
● Replace the rigid grid of R-MAC with a Region Proposal Network
11. 2nd Contribution
RPN in a nutshell
● Predict, for a set of candidate boxes of various sizes and aspect ratios, and at all possible image locations, a score describing how likely it is that each box contains an object of interest.
● Simultaneously, for each candidate box, perform a regression to refine its location.
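The location-refinement step uses the standard Faster R-CNN box parametrization; a small sketch of how predicted regression offsets are applied to a candidate box (illustrative only, assuming (x1, y1, x2, y2) corner coordinates):

```python
import numpy as np

def apply_box_deltas(anchor, deltas):
    """Refine a candidate box with predicted regression offsets.

    Standard Faster R-CNN parametrization: (dx, dy) shift the box center
    relative to its size, (dw, dh) rescale width/height in log-space.
    anchor: (x1, y1, x2, y2); deltas: (dx, dy, dw, dh).
    """
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h        # shifted center
    w, h = w * np.exp(dw), h * np.exp(dh)    # rescaled size
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])
```

The log-space size offsets keep the predicted width and height positive for any real-valued network output.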
12. Summary
● Able to encode an image into a compact feature vector in a single forward pass
● Images can be compared using the dot product
● Very efficient at test time
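The test-time efficiency comes from the descriptors being comparable with a plain dot product. A minimal sketch of ranking a database, assuming numpy arrays of L2-normalized descriptors:

```python
import numpy as np

def rank_by_dot_product(query, database):
    """Rank database images by similarity to the query.

    query: (D,) L2-normalized global descriptor of the query image.
    database: (N, D) L2-normalized descriptors, one row per database image.
    With unit-norm vectors the dot product equals cosine similarity, so one
    matrix-vector product scores the whole database.
    """
    scores = database @ query
    return np.argsort(-scores)  # indices, most similar first
```

Because descriptors are precomputed offline, retrieval reduces to this single matrix-vector product plus a sort.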
14. Datasets
● Training: Landmarks dataset, 214k images from 672 landmark sites
● Testing: Oxford 5k, Paris 6k, Oxford 105k, Paris 106k, INRIA Holidays
● Remove all images also contained in the Oxford 5k and Paris 6k datasets
○ Landmarks-full: 200k images from 592 landmarks
● Clean the Landmarks dataset (select the most relevant images, discard incorrect ones)
○ SIFT + Hessian-affine keypoint detection → construct a graph of similar images
○ Landmarks-clean: 52k images from 592 landmarks
15. Bounding Box Estimation
● RPN trained using automatically estimated bounding-box annotations:
1. Define an initial bounding box: the minimum rectangle that encloses all matched keypoints
2. For a pair (i, j), predict the bounding box Bj from Bi and an affine transform Aij
3. Update the estimates (merge using the geometric mean)
4. Iterate until convergence
(Figure: bounding-box projections; initial vs. final estimations)
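Steps 2 and 3 above can be sketched with numpy; this is an illustrative simplification (boxes as (x1, y1, x2, y2) with positive coordinates, A a 2×3 affine matrix), not the paper's exact implementation:

```python
import numpy as np

def project_box(box, A):
    """Project a box (x1, y1, x2, y2) through a 2x3 affine transform A,
    returning the axis-aligned box enclosing the projected corners."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0],
                        [x1, y2, 1.0], [x2, y2, 1.0]])
    p = corners @ A.T  # (4, 2) projected corners
    return np.array([p[:, 0].min(), p[:, 1].min(), p[:, 0].max(), p[:, 1].max()])

def merge_boxes(boxes):
    """Merge several box estimates by the elementwise geometric mean
    (coordinates must be positive)."""
    return np.exp(np.mean(np.log(np.asarray(boxes, dtype=float)), axis=0))
```

Iterating projection and merging propagates box estimates through the graph of matched image pairs until the annotations stabilize.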
16. Experimental Details
● VGG-16 network pre-trained on ImageNet
● Fine-tune with Landmarks dataset
● Select triplets in an efficient manner
○ Forward pass to obtain image representations
○ Select hard negatives (Large loss)
● Dimension of the feature vector = 512
● Evaluation: mean Average Precision (mAP)
(Figure: VGG16 architecture)
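The hard-negative selection described above can be sketched as follows, assuming numpy descriptors from the forward pass; the closer a non-relevant image scores to the query, the larger its triplet loss, so those are picked first:

```python
import numpy as np

def select_hard_negatives(query, database, relevant, n_neg=3):
    """Pick hard negatives for triplet mining.

    query: (D,) descriptor from a forward pass; database: (N, D) descriptors;
    relevant: indices of images showing the same landmark as the query.
    Returns the n_neg non-relevant images most similar to the query.
    """
    relevant = set(relevant)
    scores = database @ query                 # similarity to the query
    order = np.argsort(-scores)               # most similar first
    hard = [int(i) for i in order if i not in relevant]
    return hard[:n_neg]
```

Re-mining negatives periodically as the network improves keeps the triplets informative instead of trivially satisfied.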
25. Conclusions
● They proposed an effective and scalable method for image retrieval that encodes images into compact global signatures that can be compared with the dot product.
● They proposed a siamese network architecture trained for the specific task of image retrieval, using a ranking loss on triplets.
● They demonstrated the benefit of predicting the ROI of the images at encoding time using Region Proposal Networks.
The representation aggregates several image regions into a compact feature vector of fixed length and is thus robust to scale and translation. It can deal with high-resolution images of different aspect ratios and obtains very competitive accuracy.