Presenter: Giorgos Kordopatis-Zilos
Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_13.pdf
Video: https://youtu.be/WR4I3CWjcR4
Abstract: We describe the participation of the CERTH/CEA-LIST team in the MediaEval 2016 Placing Task. We submitted five runs to the estimation-based sub-task: one based only on text by employing a Language Model-based approach with several refinements, one based on visual content, using geospatial clustering over the most visually similar images, and three based on a hybrid scheme exploiting both visual and textual cues from the multimedia items, trained on datasets of different size and origin. The best results were obtained by a hybrid approach trained with external training data and using two publicly available gazetteers.
MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features
1. Placing Images with Refined Language Models and
Similarity Search with PCA-reduced VGG Features
Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and
Yiannis Kompatsiaris1
1 Information Technologies Institute (ITI), CERTH, Greece
2 CEA LIST, 91190 Gif-sur-Yvette, France
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
2. Summary
#2
Tag-based location estimation (1 runs)
• Built upon the scheme of our 2015 participation [1] (Kordopatis-Zilos et
al., MediaEval 2015)
• Based on a refined probabilistic Language Model
Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images
Hybrid location estimation (3 run)
• Combination of the textual and visual approaches using a set of rules
Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap
3. Tag-based location estimation
#3
• Processing steps of the approach
– Offline: language model construction
– Online: location estimation
OpenStreetMap
4. Pre-processing
• Tags and titles of the training set items are processed
• Apply
– URL decoding
– lowercase transformation
– tokenization
• Remove
– accents
– symbols
– punctuations
• The multi-word tags are split into their individual terms,
which are also included in the item's term set
• Discarded numerics or less than three characters terms
#4
5. Language Model (LM)
• LM-based estimation
– Most Likely Cell (mlc) considered the cell with the highest probability and
used to produce the estimation
𝑚𝑙𝑐𝑗 = arg max 𝑖
𝑘=1
𝑇 𝑗
𝑝(𝑡 𝑘|𝑐𝑖) ∗ 𝑤(𝑡 𝑘)
Inspired from [4]: (Popescu, MediaEval 2013)
#5
• LM generation scheme
– divide earth surface in rectangular
cells with a side length of 0.01°
– calculate term-cell probabilities
𝑝(𝑡|𝑐) = 𝑁 𝑢/𝑁𝑡
6. Feature Selection and Weighting
#6
Feature Weighting
• Locality weight function, a function based on term relative position in T
• Spatial Entropy weight function, a Gaussian function based on the term’s
spatial entropy
• Linear combination of the two weights
Feature Selection
• Calculate terms locality using a grid of 0.01°×0.01°
• When a user uses a given term, he/she is assigned to the
entire cell neighborhood instead of a unique cell as in [1]
𝑙 𝑡 = 𝑁𝑡 ∗
σ 𝑐∈𝐶 σ 𝑢∈𝑈𝑡,𝑐
|{𝑢′|𝑢′ ∈ 𝑈𝑡,𝑐, 𝑢′ ≠ 𝑢}|
𝑁𝑡
2
• Terms with non-zero locality score form the term set 𝑇
7. Refinements
#7
• Multiple Grids
– Built an additional LM using a finer
grid (cell side length of 0.001°)
– combine the MLC of the individual
language models
• Similarity search [5] (Van Laere et al., ICMR 2011)
– determine 𝑘 𝑡 most similar training images in the MLC
– their center-of-gravity is the final location estimation
From [2]: (Kordopatis-Zilos et al., PAISI 2015)
8. Visual-based location estimation
#8
• Main Objectives
• Ensure that the visual features are generic and transferable
• Provide a compact representation of the features
• Model building
• CNN features extracted by fine-tuning the VGG model [4]
• Training: ~5K Points Of Interest (POIs), over 7M Flickr images using
queries with:
– the POI name and a radius of 5km around its coordinates
– the POI name and the associated city name
• Compressed outputs of fc7 layer (4096d) to 128d using PCA,
learned on a subset of 250,000 train images
• Similarity Search based on the PCA-reduced CNN features
9. Visual-based location estimation
#9
Location Estimation
• Geospatial clustering of 𝑘 𝑣 = 20 visually most similar images
• The largest cluster (or the first in case of equal size) is selected and
its centroid is used as the location estimate
Visual Confidence
• Confidence metric for the visual estimation is based on the size of
the largest cluster
𝑐𝑜𝑛𝑓𝑣 𝑖 = max(
𝑛 𝑖 − 𝑛 𝑡
𝑘 𝑣 − 𝑛 𝑡
, 0)
𝑛 𝑖 : number of neighbors in the largest cluster of image i
𝑛 𝑡: configuration parameter of the confidence score ‘’strictness’’
10. Hybrid-based location estimation
• A set of rules to determine the
source of estimation between the
text and visual approaches
• The visual estimation is chosen in
cases:
→ No estimation could be produced by
the text approach
→ Visual estimation fell inside the
borders of the mlc
→ By comparing the confidence scores
𝑐𝑜𝑛𝑓𝑣 and 𝑐𝑜𝑛𝑓𝑡 [1]
• Otherwise the text estimation is
selected
#10
11. Runs and Results
#11
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
RUN-E: Visual-based location estimation + entire YFCC dataset
Images
12. Runs and Results
#12
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
Videos
13. References
#13
[1] G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris.
Socialsensor at mediaeval placing task 2015. In MediaEval 2015 Placing Task,
2015.
[2] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social
media content with a refined language modelling approach. In Intelligence and
Security Informatics, pages 21–40, 2015.
[3] A. Popescu. CEA LIST's participation at mediaeval 2013 placing task. In
MediaEval 2013 Placing Task, 2013.
[4] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. In International Conference on Learning
Representations, 2015.
[5] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources
using language models and similarity search. ICMR ’11, pages 48:1–48:8, New
York, NY, USA, 2011. ACM.