The current state of research in landmark recognition highlights the good accuracy achievable by embedding techniques such as Fisher vectors and VLAD. None of these techniques exploits spatial information, i.e. they consider all the features and the corresponding descriptors without embedding their location in the image. This paper presents a new variant of the well-known VLAD (Vector of Locally Aggregated Descriptors) embedding technique which accounts, to a certain degree, for the location of features. The driving motivation comes from the observation that, usually, the most interesting part of an image (e.g., the landmark to be recognized) lies near the center, while the features at the borders are irrelevant and do not depend on the landmark. The proposed variant, called locVLAD (location-aware VLAD), computes the mean of two global descriptors: the VLAD computed on the entire original image, and the one computed on a cropped image which removes a certain percentage of the image borders. This simple variant achieves accuracy greater than existing state-of-the-art approaches. Experiments are conducted on two public datasets (ZuBuD and Holidays), used both for training and testing. Moreover, a more balanced version of ZuBuD is proposed.
1. A location-aware embedding technique for
accurate landmark recognition
Federico Magliani, Navid Mahmoudian Bidgoli, Andrea Prati
ICDSC 2017 – Stanford, USA – 5-7 September 2017
2. Agenda
➢ Motivations
➢ Summary of contribution
➢ Related works
➢ Introduction to VLAD
➢ Proposed approach (locVLAD)
➢ Experimental results
➢ Conclusions and Future Work
4. Motivations
➢ Challenges
○ high retrieval accuracy (precision)
○ fast retrieval (quick response to a query)
○ reduced memory footprint (mobile friendly)
○ scalability to big data (>100k images)
➢ Possible applications
○ augmented reality (tourism)
➢ Why mobile based?
○ everyone owns a mobile phone
○ mobile phones have powerful hardware that can run such applications
5. Motivations
“Changes in the image resolution, illumination conditions, viewpoint and the presence
of distractors such as trees or traffic signs (just to mention some) make the task of
matching features between a query image and the database rather difficult.”
➢ In order to mitigate these problems, existing approaches rely on feature
descriptors with a certain degree of invariance to scale, orientation and
illumination changes.
7. Summary of contribution
➢ A location-aware version of VLAD, called locVLAD, which outperforms the state
of the art on the intra-dataset problem. It addresses a weakness of VLAD by
reducing the noise contributed by features at the borders of the images
➢ The time for vocabulary creation is significantly reduced by using only a random ⅕
of the detected features
➢ A new balanced version of the public dataset ZuBuD is proposed and made available
to the scientific community (ZuBuD+)
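The vocabulary-creation speed-up above comes from clustering only a random fraction of the detected descriptors. A minimal sketch of that idea, using plain NumPy and a simple Lloyd k-means (the actual implementation, parameter names, and defaults here are illustrative assumptions, not the authors' code):

```python
import numpy as np

def build_vocabulary(descriptors, k=64, subsample=0.2, iters=20, seed=0):
    # Sample a random fraction of the detected descriptors
    # (the slides use 1/5, i.e. subsample=0.2).
    rng = np.random.default_rng(seed)
    n = len(descriptors)
    idx = rng.choice(n, size=max(k, int(n * subsample)), replace=False)
    sample = descriptors[idx].astype(float)
    # Plain Lloyd k-means on the subsample: the k centers become the visual words.
    centers = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each sampled descriptor to its nearest center
        dists = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        for i in range(k):
            pts = sample[assign == i]
            if len(pts):
                centers[i] = pts.mean(axis=0)
    return centers
```

Since k-means cost grows with the number of points, clustering ⅕ of the descriptors cuts vocabulary-creation time roughly proportionally.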
9. Related work
➢ Bag of Words (BoW): the first method for solving the problem (with different
techniques: vocabulary tree, …)
➢ Fisher vector: embedding based on the Fisher kernel
➢ VLAD and its variants: a simplified version of the Fisher vector
➢ Hamming embedding: embedding based on binarized descriptors
➢ CNN-based: deep neural networks whose final layers perform classification
12. VLAD (Vector of Locally Aggregated Descriptors)
C = {c_1, ..., c_k}: codebook of k visual words (obtained by K-means clustering)
1. Every local descriptor x_j extracted from the image is assigned to the closest
cluster center of the codebook: c_i = NN(x_j)
2. v_i = Σ_{x_j : NN(x_j) = c_i} (x_j − c_i) (sum of residuals per visual word)
3. The VLAD vector is the concatenation of the d-dimensional vectors v_i (i = 1, …, k)
4. VLAD normalization counters the burstiness problem
With 16 centroids and 128-d SIFT descriptors → D = 128x16 = 2048
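The three aggregation steps above can be sketched in a few lines of NumPy (a minimal illustration of the standard VLAD construction, not the authors' implementation; normalization is deliberately left out since it is covered separately):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors into an un-normalized VLAD vector.

    descriptors: (n, d) array of local descriptors (e.g. 128-d SIFT).
    codebook:    (k, d) array of cluster centers (visual words).
    Returns a flattened (k*d,) VLAD vector.
    """
    k, d = codebook.shape
    # 1. assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    # 2. accumulate the residuals (x_j - c_i) per visual word
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(axis=0)
    # 3. concatenate the k residual vectors into one global descriptor
    return v.ravel()
```

With k = 16 and d = 128 this yields the D = 2048 dimensionality quoted on the slide.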
13. VLAD normalization
➢ Signed Square Rooting normalization: sign(x_i) sqrt(|x_i|), followed by L2 normalization
➢ Residual normalization: independent L2 normalization of each residual, followed by
global L2 normalization
➢ Z-score normalization: residual normalization followed by subtraction of the mean
from every vector and division by the standard deviation
➢ Power normalization: sign(x_i) |x_i|^α (usually α = 0.2), followed by L2 normalization
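Three of the schemes above can be written down directly (a minimal NumPy sketch for illustration; the small epsilon guarding against division by zero is my own addition):

```python
import numpy as np

def ssr_normalize(v):
    # Signed Square Rooting: sign(x) * sqrt(|x|), then global L2 normalization
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + 1e-12)

def power_normalize(v, alpha=0.2):
    # Power law: sign(x) * |x|^alpha, then global L2 normalization
    # (SSR is the special case alpha = 0.5)
    v = np.sign(v) * np.abs(v) ** alpha
    return v / (np.linalg.norm(v) + 1e-12)

def residual_normalize(v, k):
    # L2-normalize each of the k per-word residual blocks independently,
    # then L2-normalize the concatenated vector
    blocks = v.reshape(k, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    blocks = blocks / np.where(norms > 0, norms, 1.0)
    flat = blocks.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)
```

These transforms dampen bursty components (a few very large residuals) so that no single visual word dominates the similarity between VLAD vectors.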
15. Proposed approach: locVLAD
➢ This method improves the performance of VLAD vectors in the recognition problem.
➢ It does so by reducing the influence of features found at the borders of the image.
How does it work?
It consists of a new global descriptor: the mean of the VLAD descriptor of the original
query image (v) and the VLAD descriptor computed on a cropped query image (v_cropped).
16. Proposed approach: locVLAD
The size of the cropped image is a parameter that depends on the dataset used:
➢ ZuBuD → 90% of the original query images
➢ Holidays → 70% of the original query images
[Figure: example query images, with 16424 and 367 features detected]
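A minimal sketch of the locVLAD descriptor under one simplifying assumption of mine: instead of re-running feature detection on a physically cropped image, cropping is approximated by discarding features whose keypoints fall outside the central region. The `vlad` helper is a plain un-normalized VLAD aggregation; function names and the final L2 normalization are illustrative choices, not the authors' code:

```python
import numpy as np

def vlad(descriptors, codebook):
    # minimal un-normalized VLAD: sum of residuals per nearest visual word
    k, d = codebook.shape
    v = np.zeros((k, d))
    if len(descriptors):
        dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        for i in range(k):
            m = descriptors[assign == i]
            if len(m):
                v[i] = (m - codebook[i]).sum(axis=0)
    return v.ravel()

def loc_vlad(keypoints, descriptors, codebook, img_size, crop=0.9):
    """Mean of the VLAD of all features and the VLAD of the features
    inside a central crop covering `crop` of each image dimension
    (0.9 for ZuBuD, 0.7 for Holidays in the slides).

    keypoints: (n, 2) array of (x, y) feature locations.
    """
    w, h = img_size
    mx, my = w * (1 - crop) / 2, h * (1 - crop) / 2  # border margins
    inside = ((keypoints[:, 0] >= mx) & (keypoints[:, 0] <= w - mx) &
              (keypoints[:, 1] >= my) & (keypoints[:, 1] <= h - my))
    v_full = vlad(descriptors, codebook)
    v_crop = vlad(descriptors[inside], codebook)
    v = (v_full + v_crop) / 2.0  # mean of the two global descriptors
    return v / (np.linalg.norm(v) + 1e-12)
```

Features near the center contribute to both terms of the mean, while border features contribute to only one, so their weight in the final descriptor is halved rather than discarded outright.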
17. Why does it increase the performance?
Because the features important for recognition are usually located in the center of the
image, while the features close to the borders are noisy.
Why not apply VLAD encoding directly on the cropped image?
Because useful information might be lost: there is no guarantee that the features at the
borders are only noise.
Why not create a cropped vocabulary?
Experiments were conducted, but the results were poor.
19. Datasets
➢ INRIA Holidays (1491 images at 2448x3264: 500 classes, 500 queries)
➢ ZuBuD (1005 images at 640x480: 201 classes, 115 queries at 320x240)
➢ ZuBuD+ (1005 images at 640x480: 201 classes, 1005 queries at 320x240)
22. ZuBuD+
The balanced version of ZuBuD:
➢ 1005 queries at 320x240 instead of 115.
➢ The new query images are randomly chosen from the database images, each different
from the other query images, and transformed by:
○ rotation (±90°) and resizing
○ resizing only
Download: http://implab.ce.unipr.it/?page_id=194
23. Evaluation Metrics
Different evaluation metrics are used for comparison with the state-of-the-art approaches:
➢ Top1 → retrieval accuracy, evaluating only the first position of the ranking
➢ 5 x Recall in Top5 → the average number of correct images among the top 5 results
of the ranking
➢ mAP (mean Average Precision) → the mean over all queries of the Average Precision,
which rewards correct results ranked in high positions
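The three metrics can be stated precisely in a few lines of plain Python (a minimal sketch; function names and the exact formulation of "5 x Recall in Top5" as an average count of correct images in the top 5 are my reading of the slide, not the authors' evaluation code):

```python
def top1(rankings, relevant):
    # rankings: one ranked list of image ids per query
    # relevant: one set of correct image ids per query
    # fraction of queries whose first-ranked image is correct
    return sum(r[0] in rel for r, rel in zip(rankings, relevant)) / len(rankings)

def recall_in_top5(rankings, relevant):
    # average number of correct images among the first 5 results
    # (at most 5.0 when each query has 5 relevant database images)
    return sum(len(set(r[:5]) & rel) for r, rel in zip(rankings, relevant)) / len(rankings)

def average_precision(ranking, rel):
    # precision averaged over the ranks where a correct image appears
    hits, precisions = 0, []
    for rank, img in enumerate(ranking, start=1):
        if img in rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings, relevant):
    # mAP: mean of the per-query Average Precision scores
    return sum(average_precision(r, rel) for r, rel in zip(rankings, relevant)) / len(rankings)
```

For example, a ranking ['a', 'b', 'c', 'd'] with relevant set {'a', 'c'} has precision 1/1 at rank 1 and 2/3 at rank 3, giving an Average Precision of 5/6.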
25. Results on ZuBuD (and ZuBuD+)
Method Descriptor size Top1 5 x Recall in Top5
Tree histogram (ZuBuD) [7] 10M 98.00 % -
Decision tree (ZuBuD) [9] n/a 91.00 % -
Sparse coding (ZuBuD) [22] 8k*64+1k*36 - 4.538
VLAD (ZuBuD) [12] 4281*128 99.00 % 4.416
VLAD (ZuBuD+) [12] 4281*128 99.00 % 4.526
locVLAD (ZuBuD) 4281*128 100.00 % 4.469
locVLAD (ZuBuD+) 4281*128 100.00 % 4.543
It is worth noting that on ZuBuD the method based on sparse coding slightly outperforms the proposed one.
This is due to the unbalanced query set and, probably, to its use of color information.
30. Conclusions
➢ The proposed locVLAD technique includes, to a certain degree, information on
the location of the features, mitigating the negative effects of distractors
found at the image borders.
➢ Experiments performed on two public datasets, namely ZuBuD and Holidays,
demonstrate superior recognition accuracy w.r.t. the state of the art.
31. Future work
➢ Compression: reduce the dimension of the descriptors while keeping the same
retrieval accuracy (mobile friendly).
➢ Indexing: create a system for evaluation on a large-scale domain (adding up to 1M
distractors), passing from the Nearest Neighbor problem to the Approximate Nearest
Neighbor problem. We are working with kd-trees and permutation-based methods.
➢ Sparse coding: new methods for the creation of the vocabulary and the assignment
of the features to the VLAD vector.
32. Thank you for your attention!
questions?
http://implab.ce.unipr.it