Lecture given on January 28, 2019 to post-graduate students of the Computer Engineering and Media program, at the School of Journalism and Media, Aristotle University of Thessaloniki.
2. Our lab
Multimedia Knowledge and
Social Media Analytics Laboratory
• Part of the Information Technologies Institute (ITI) -
Centre for Research and Technology Hellas (CERTH)
• 60+ researchers (20+ post-docs)
• key areas: multimedia, social media, computer vision, data mining, machine learning
• applications: media, security, culture, environment
• involved in 60+ projects and published 600+ papers
https://mklab.iti.gr/
5. 500 hours of video per min =
720,000 hours per day >
82 years of video per day!
6. Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
13. Similarity-based media search
Two main problems
• How to compute similarity between two items (in accordance with my needs)?
• How to search (using the above similarity function) in very large collections in reasonable time?
15. What is similar?
• Variety of definitions of what can be considered similar
• Near-duplicate videos: definition by Wu et al. (2007)
  • photometric variations: gamma, contrast, brightness, etc.
  • editing operations: resize, shift, crop, flip
  • insertion of patterns: caption, logo, subtitles, sliding captions, etc.
  • re-encoding: video format, compression
  • video modifications: frame rate, frame insertion, deletion, swap
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In
Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
16. Hashing
• Cryptographic or checksum hashing: MD5, SHA1
  • Input: any bitstream (not just images or videos)
  • Output: hash code: 128-bit (MD5), 160-bit (SHA1), etc.
  • Property: minor changes in the input can lead to completely different hash codes
https://jenssegers.com/61/perceptual-image-hashes
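This avalanche property is easy to see with Python's standard hashlib (a minimal illustration; the input strings are arbitrary):

```python
import hashlib

# Cryptographic hashing is defined over arbitrary bitstreams, not just images.
msg_a = b"near-duplicate video"
msg_b = b"near-duplicate videos"  # one byte appended

md5_a = hashlib.md5(msg_a).hexdigest()
md5_b = hashlib.md5(msg_b).hexdigest()

# A one-byte change yields a completely different 128-bit digest,
# which is why MD5/SHA1 cannot be used for *similarity* search.
print(md5_a)
print(md5_b)
print(md5_a == md5_b)  # False
```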
18. Perceptual hashing
• Generate a fingerprint that can be used to compare images using the Hamming distance
• Instance: Average Hashing (aHash)
  • Reduce size → 8x8 pixels
  • Reduce colour → RGB to grayscale
  • Calculate average colour → among the 64 grayscale values
  • Compute hash → for each pixel, a binary value depending on whether it is higher or lower than the average
→ 64-bit signature
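The aHash steps can be sketched in a few lines of Python. This is a hedged sketch that assumes the resize-to-8x8 and grayscale conversion have already happened, so the input is just a grid of 64 values:

```python
import random

def average_hash(gray_8x8):
    """aHash sketch: gray_8x8 is an 8x8 grid of grayscale values (0-255),
    i.e. the image has already been resized and converted to grayscale."""
    pixels = [p for row in gray_8x8 for p in row]
    avg = sum(pixels) / len(pixels)          # average of the 64 values
    # one bit per pixel: 1 if brighter than average, else 0 -> 64-bit signature
    return sum(1 << i for i, p in enumerate(pixels) if p > avg)

def hamming(h1, h2):
    """Number of differing bits between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")

random.seed(0)
img = [[random.randrange(256) for _ in range(8)] for _ in range(8)]
noisy = [[min(255, p + 2) for p in row] for row in img]  # mild brightness shift

h1, h2 = average_hash(img), average_hash(noisy)
print(hamming(h1, h2))  # small distance: perceptually similar images
```

Unlike MD5, a mild photometric change moves only a few (often zero) bits of the signature.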
20. dHash and pHash
• dHash: Difference Hash
  • same steps as aHash
  • hash is generated based on whether the left pixel is brighter than the right one
  • fewer false positives compared to aHash
• pHash: Perceptual Hash
  • more complicated algorithm
  • resize to 32x32
  • DCT on luma (brightness) component
  • top-left 8x8 → hash by comparing each coefficient to the median value
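A minimal dHash sketch, assuming the image has already been reduced to a 9x8 grayscale grid so that each row yields 8 left-vs-right comparisons:

```python
def dhash(gray_9x8):
    """dHash sketch: input is a 9x8 grayscale grid (9 columns, 8 rows).
    Each bit records whether the left pixel is brighter than its right
    neighbour: 8 comparisons per row -> a 64-bit hash."""
    bits = 0
    i = 0
    for row in gray_9x8:
        for left, right in zip(row, row[1:]):
            if left > right:
                bits |= 1 << i
            i += 1
    return bits

# monotone gradient rows: the left pixel is never brighter -> hash is 0
grad = [[c * 28 for c in range(9)] for _ in range(8)]
print(dhash(grad))  # 0
```

Comparing gradients rather than absolute brightness makes dHash robust to uniform brightness changes, which is why it tends to produce fewer false positives than aHash.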
22. Pixel-based similarity doesn't match perception
All three variations of the first image are equidistant
from it in terms of L2 pixel distance!
http://cs231n.github.io/classification/
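The mismatch is easy to reproduce on tiny hypothetical 1-D "images": a one-pixel shift of the content, perceptually negligible, produces a larger L2 distance than erasing the content altogether:

```python
import math

def l2(a, b):
    """L2 (Euclidean) distance between two flattened pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

original = [0, 0, 255, 0, 0]   # a bright "edge" in a tiny 1-D image
shifted  = [0, 0, 0, 255, 0]   # same edge, shifted one pixel
darkened = [0, 0, 0, 0, 0]     # edge removed entirely

# Perceptually, the shift changes nothing and the erasure changes everything,
# yet L2 ranks the shift as the *more* distant image:
print(l2(original, shifted))   # 255 * sqrt(2)
print(l2(original, darkened))  # 255.0
```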
23. Global descriptors
• A single vector that attempts to capture the main visual properties of an image, e.g. distribution of colour, spatial layout of brightness, textures, etc.
• Popular choices include:
  • GIST - spatial envelope (Oliva & Torralba, 2001)
  • Colour: Dominant Color, Scalable Color, Color Structure, Color Layout Descriptor (MPEG-7, 2001)
  • Texture: Texture Browsing, Homogeneous Texture, Edge Histogram (MPEG-7, 2001)
A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001
Text of ISO/IEC 15938-3 Multimedia Content Description Interface - Part 3: Visual. Final Committee Draft, ISO/IEC/JTC1/SC29/WG11, Doc. N4062, Mar. 2001
24. GIST-based near-duplicate search
Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., & Schmid, C. (2009, July). Evaluation of GIST descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval (p. 19). ACM.
25. Local descriptors
• Basic scheme:
  • Detect a set of features (i.e. interest points) in an image
  • Extract one descriptor around each feature
• Plenty of options for both parts, e.g.:
  • Feature detectors: Canny, Sobel, Harris, FAST, Laplacian of Gaussian (LoG), Difference of Gaussians (DoG), Determinant of Hessian (DoH), MSER
  • Feature descriptors: SIFT, GLOH, SURF, ORB
• Much higher accuracy at the cost of increased complexity
26. Scale-Invariant Feature Transforms (SIFT)
Set of descriptors
A single descriptor (16 histograms of 8 bins → 128 dims)
http://faculty.ucmerced.edu/mhyang/project/iccv13_exemplar/ICCV13_exemplarCut/vlfeat-0.9.14/doc/overview/sift.html
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer
vision, 60(2), 91-110.
28. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
29. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
extract a set of local features from each image
30. Bag of Visual Words (BoVW)
• a representative sample of features is selected
• features are clustered
• cluster centroids (or medoids) are considered to be the visual codebook
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
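The clustering step can be sketched with a tiny k-means on toy descriptors (NumPy only; the 2-D vectors stand in for real 128-D SIFT descriptors):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "local descriptors" drawn around 3 centres (stand-ins for SIFT vectors)
descriptors = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

def kmeans(X, k, iters=20):
    """Tiny k-means: the final centroids serve as the visual codebook."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid (its visual word)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

codebook, words = kmeans(descriptors, k=3)
# BoVW vector of one "image": histogram of its visual-word assignments
bow = np.bincount(words[:50], minlength=3)
print(codebook.shape, bow)
```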
31. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
32. Indexing and Querying
• tf-idf weighting of visual words
  w(t,d) = tf(t,d) · log(D / d_t)
  (tf(t,d): frequency of visual word t in image d, D: total number of indexed images, d_t: number of images containing word t)
• Inverted file indexing structure for fast search
  • Retrieve candidates with at least one common visual word
  • Rank candidates, e.g. based on cosine similarity of their tf-idf representations
  sim(q, d) = (q · d) / (‖q‖ ‖d‖)
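The tf-idf weighting and cosine ranking can be sketched directly in plain Python (the counts and the four-word vocabulary are made up for illustration):

```python
import math

def tfidf(tf, df, D):
    """tf: visual-word counts for one image; df: document frequency of each
    word over the index; D: total number of indexed images."""
    return [t * math.log(D / d) if d else 0.0 for t, d in zip(tf, df)]

def cosine(a, b):
    """Cosine similarity: dot product over the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

D = 100
df = [90, 5, 40, 2]              # how many of the 100 images contain each word
query = tfidf([2, 1, 0, 0], df, D)
cand  = tfidf([1, 2, 0, 0], df, D)
print(round(cosine(query, cand), 3))
```

Note how the frequent first word (df = 90) contributes almost nothing to the score, while the rare second word dominates it.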
33. BoVW Discussion
• BoVW is a sparse representation: each image is associated with few visual words (compared to the whole vocabulary)
• Convenient for indexing and look-up
• Completely misses spatial layout → extensions
• Performance depends on:
  • size of vocabulary
  • dataset where the vocabulary was learned
38. From Image to Video Similarity
• A video can be considered a richer representation compared to images:
  • set of images (frames)
  • frames and motion
  • frames, motion and audio
• For efficiency purposes, we typically simplify or discard part of the information:
  • frames → descriptors → average
  • frames → visual words → bag of frame-words
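The frames → descriptors → average pipeline can be sketched with NumPy (random vectors stand in for real per-frame descriptors; the second video is a synthetic near-duplicate):

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in for per-frame global descriptors (e.g. 30 frames x 128 dims)
frames_a = rng.normal(size=(30, 128))
frames_b = frames_a + rng.normal(scale=0.05, size=(30, 128))  # near-duplicate

def video_descriptor(frames):
    """frames -> descriptors -> average: one vector per video."""
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)      # L2-normalise for cosine similarity

# cosine similarity of the two video-level vectors
sim = float(video_descriptor(frames_a) @ video_descriptor(frames_b))
print(round(sim, 3))  # close to 1 for near-duplicates
```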
40. Video indexing calls
/index (HTTP GET request)
Add the provided video to the video index
• url: the URL of the video that is going to be indexed
• async: flag for asynchronous processing
/youtube (HTTP GET request)
Query the YouTube API with either a video ID or a text query and add the retrieved videos to the video index
• video_id: video ID to query the YouTube API
• text: text to query the YouTube API
• max: maximum number of videos to be added to the video index
/delete (HTTP DELETE request)
Delete the provided video from the video index
• url: the URL of the video that is going to be deleted
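For illustration, the indexing requests can be assembled with the standard library; the base URL is a hypothetical deployment, and only the endpoint and parameter names come from the API above:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080"   # hypothetical deployment of the service

def index_call(video_url, asynchronous=True):
    """Build the /index GET request URL (base URL is an assumption)."""
    qs = urlencode({"url": video_url, "async": str(asynchronous).lower()})
    return f"{BASE}/index?{qs}"

def youtube_call(text=None, video_id=None, max_videos=10):
    """Build the /youtube GET request URL with either a video ID or a text query."""
    params = {"max": max_videos}
    if video_id:
        params["video_id"] = video_id
    if text:
        params["text"] = text
    return f"{BASE}/youtube?{urlencode(params)}"

print(index_call("https://example.com/v.mp4"))
```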
41. Video search calls
/search (HTTP GET request)
Video-level search: retrieve relevant videos by computing similarity between entire videos
• url: URL of the query video
• t_sim: similarity threshold
• t_rank: rank threshold
/partial (HTTP GET request)
Shot-level search: retrieve relevant video segments from the indexed videos in the database
• url: URL of the query video
• v_sim: video similarity threshold
• s_sim: shot similarity threshold
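Similarly, the two search calls can be assembled as GET URLs; the base URL and the default threshold values are assumptions, only the endpoint and parameter names come from the API above:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080"   # hypothetical deployment of the service

def search_call(query_url, t_sim=0.7, t_rank=50):
    """Video-level /search GET request (default thresholds are assumptions)."""
    return f"{BASE}/search?" + urlencode(
        {"url": query_url, "t_sim": t_sim, "t_rank": t_rank})

def partial_call(query_url, v_sim=0.5, s_sim=0.8):
    """Shot-level /partial GET request (default thresholds are assumptions)."""
    return f"{BASE}/partial?" + urlencode(
        {"url": query_url, "v_sim": v_sim, "s_sim": s_sim})

print(search_call("https://example.com/q.mp4"))
```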
42. Combining CNNs and BoVW
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
43. An improved setup
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
44. Learning similarity
Before training
After training
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, October). Near-Duplicate Video Retrieval with
Deep Metric Learning. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), (pp. 347-356). IEEE
46. FIVR-200K
a dataset for evaluating NDVR
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I. (2018).
FIVR: Fine-grained Incident Video Retrieval. arXiv preprint arXiv:1809.04094
47. FIVR-200K
• A video dataset to help research on the problem of Fine-grained Incident Video Retrieval
  • Duplicate Scene Videos (DSVs)
  • Complementary Scene Videos (CSVs)
  • Incident Scene Videos (ISVs)
• 225,960 videos around 4,687 news events from Jan 1st 2013 to Dec 31st 2017
57. Ideas
• Pick one video around one event between 2013 and 2017 and try to find similar versions of it
• Pick one of the event clusters in the Browse section and try to find some important videos that cover the event
• Given an event of interest, identify in which sources it is covered (language, country, type of channel)
• Add videos from a newer event and use them to perform new searches
59. Papers
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, October). Near-Duplicate Video Retrieval with Deep Metric Learning. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW) (pp. 347-356). IEEE
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I. (2018). FIVR: Fine-grained Incident Video Retrieval. arXiv preprint arXiv:1809.04094
60. Acknowledgements
• Giorgos Kordopatis-Zilos / near-duplicate video retrieval, back-end development, FIVR-200K collection and annotation
• Lazaros Apostolidis / web front-end development
• Polichronis Charitidis / FIVR-200K annotation
61. Thank you for your attention!
Akis Papadopoulos papadop@iti.gr
@sympap