Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
1. Near-Duplicate Video Retrieval by
Aggregating Intermediate CNN Layers
Giorgos Kordopatis-Zilos1,2, Symeon Papadopoulos1,
Ioannis Patras2 and Yiannis Kompatsiaris1
1Information Technologies Institute, CERTH, Thessaloniki, Greece
2Queen Mary University of London, Mile End Campus, UK, E1 4NS
23rd International Conference on MultiMedia Modeling
Reykjavík, Iceland, 4-6 January 2017
2. Problem & Motivation
• Near-Duplicate Video Retrieval (NDVR)
• Given a query video, search a video dataset to retrieve (visually)
highly similar videos
• Rank the candidate videos based on their similarity to the query
• Various applications
• content verification
• video retrieval, management and recommendation
• copyright protection
• Crucial importance of NDVR, due to the exponential growth
of video content
3. Near-Duplicate Videos: Definition
• Variety of definitions and understandings regarding the
near-duplicate videos
• Adopt definition by Wu et al. (2007)
• photometric variations: gamma, contrast, brightness, etc.
• editing operations: resize, shift, crop, flip
• insertion of patterns: caption, logo, subtitles, sliding captions, etc.
• re-encoding: video format, compression
• video modifications: frame rate, frame insertion, deletion, swap
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In
Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
4. Related Work
• Variety of approaches (Liu et al., 2013)
• Video-level matching: comparison of global signatures
• Global feature vectors
• Fingerprints
• Hash codes
• Frame-level matching: frames or sequences
• Local descriptors
• Spatiotemporal features
• Hybrid-level matching
• Filter-and-refine methods
• TRECVID content-based copy detection (Kraaij & Awad, 2011)
• duplicates artificially generated by standard transformations
W. Kraaij, and G. Awad. TRECVID 2011 content-based copy detection: Task overview. Proc. TRECVid 2010, 2011
J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and
future trends. ACM Computing Surveys, vol.45, no. 4, 44, 2013
5. Feature Extraction (1/2)
• Employ a pre-trained CNN with 𝐿 convolutional layers
• Apply max pooling on every channel of the feature map of
each layer (Zheng et al., 2016)
$v^l(i) = \max M^l(\cdot, \cdot, i), \quad i = 1, 2, \ldots, c^l, \quad l = 1, 2, \ldots, L$
• $L$ vectors are generated, one per layer, each $c^l$-dimensional
L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian. Good Practice in CNN Feature Transfer. arXiv:1604.00133, 2016
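The per-channel max pooling on this slide can be illustrated with a minimal sketch. It assumes each feature map is available as a nested list indexed as [row][column][channel]; the function names and the toy activation values are hypothetical and are not taken from the presentation.

```python
from typing import List

# A feature map is a nested list indexed as [row][col][channel],
# standing in for the activation tensor of one convolutional layer.
FeatureMap = List[List[List[float]]]


def max_pool_channels(feature_map: FeatureMap) -> List[float]:
    """Return one value per channel: the maximum activation over all positions."""
    n_channels = len(feature_map[0][0])
    return [
        max(cell[i] for row in feature_map for cell in row)
        for i in range(n_channels)
    ]


def extract_layer_vectors(feature_maps: List[FeatureMap]) -> List[List[float]]:
    """Apply per-channel max pooling to each layer, giving one vector per layer."""
    return [max_pool_channels(m) for m in feature_maps]


# Toy input (hypothetical values): two "layers" with 2 and 3 channels.
layer1 = [
    [[0.1, 0.9], [0.4, 0.2]],
    [[0.7, 0.3], [0.0, 0.5]],
]
layer2 = [
    [[1.0, 0.0, 2.0], [3.0, 0.5, 1.0]],
]

vectors = extract_layer_vectors([layer1, layer2])
print(vectors)
```

Each output vector has as many entries as its layer has channels, regardless of the spatial size of the feature map, which is what makes layers of different resolution comparable.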
6. Feature Extraction (2/2)
• Pre-trained CNN networks from Caffe (Jia et al., 2014):
a) AlexNet, b) VGGNet, c) GoogLeNet
• Feature extraction uses the convolution layers of the
architectures
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe:
Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM int. conference on
Multimedia, pp. 675-678, 2014
[Figure: network architectures of AlexNet, VGGNet and GoogLeNet]
12. Video Indexing and Querying
• tf-idf weighting of visual words
$w_{td} = n_{td} \cdot \log\left(|D_b| / n_t\right)$
• Inverted file indexing structure for fast search
• Retrieve candidates with at least one common visual word
• Rank candidates based on cosine similarity of their tf-idf
representations
$sim(q, p) = \dfrac{\mathbf{w}_q \cdot \mathbf{w}_p}{\|\mathbf{w}_q\| \, \|\mathbf{w}_p\|}$
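The indexing and querying steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: videos are represented as bags of integer visual-word ids, the count in the denominator of the idf term is taken to be the number of videos containing the word, and the toy collection and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy collection: each video is a bag of visual-word ids.
videos: Dict[str, List[int]] = {
    "a": [1, 1, 2],
    "b": [1, 2, 2, 3],
    "c": [4, 4, 5],
}


def document_frequency(collection: Dict[str, List[int]]) -> Counter:
    """Count, for each visual word, how many videos contain it (assumed n_t)."""
    df: Counter = Counter()
    for words in collection.values():
        df.update(set(words))
    return df


def tfidf(words: List[int], df: Counter, n_videos: int) -> Dict[int, float]:
    """Weight each word by its count in the video times log(n_videos / df)."""
    counts = Counter(words)
    # Words never seen in the collection are skipped (they match nothing).
    return {t: n * math.log(n_videos / df[t]) for t, n in counts.items() if t in df}


def cosine(u: Dict[int, float], v: Dict[int, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Indexing: tf-idf vector per video plus an inverted file (word -> video ids).
df = document_frequency(videos)
index_vectors = {vid: tfidf(words, df, len(videos)) for vid, words in videos.items()}
inverted = defaultdict(set)
for vid, words in videos.items():
    for t in words:
        inverted[t].add(vid)


def search(query_words: List[int]) -> List[Tuple[str, float]]:
    """Fetch videos sharing at least one word with the query, ranked by cosine."""
    q_vec = tfidf(query_words, df, len(videos))
    candidates = set()
    for t in set(query_words):
        candidates |= inverted.get(t, set())
    scored = [(vid, cosine(q_vec, index_vectors[vid])) for vid in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = search([1, 1, 2])
print(ranked)
```

In this toy run the query shares words only with videos "a" and "b", so "c" is never scored. Skipping such videos is the point of the inverted file: the cosine similarity is computed only for candidates with at least one common visual word.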
17. Results III
• Performance per query
• Best runs
• CNN-V: vector-based aggregation with GoogLeNet
• CNN-L: layer-based aggregation with VGGNet
• Lower precision in hard queries
• query 18 (Bus uncle)
• query 22 (Numa Gary)
18. Evaluation: Comparison to SoA
• Color Histograms (CH) (Wu et al., 2007) - Video-level matching, color histograms
• Auto Color Correlograms (ACC) (Cai et al., 2011) - Frame-level matching, auto-
color correlograms, BoW, tf-idf weighted cosine similarity
• Local Structure (LS) (Wu et al., 2007) - Hybrid-level matching, Color Histograms,
keyframes similarity of PCA-SIFT descriptors
• Multiple Feature Hashing (MFH) (Song et al., 2013) - Video-level matching, hashes
multiple features into Hamming space, combines the keyframe hash codes
into a global video representation
• Pattern-based approach (PPT) (Chou et al., 2015) - Hybrid-level matching,
pattern-based indexing tree (PI-tree), m-pattern-based dynamic programming
(mPDP), time-shift m-pattern similarity (TPS)
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In
Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
Y. Cai, L. Yang, W. Ping, F. Wang, T. Mei, X. S. Hua, and S. Li. Million-scale near-duplicate video retrieval system. In
Proceedings of the 19th ACM international conference on Multimedia, pp. 837-838, 2011
J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate
video retrieval. In IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013
C. L. Chou, H. T. Chen, and S. Y. Lee. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale
Videos. IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, 2015
20. Future Work
• Exploit the C3D features (Tran et al., 2015)
• Conduct more comprehensive evaluations
• More challenging datasets: larger scale, more similar but non-
relevant videos (distractors)
• Partial Duplicate Video Retrieval (PDVR)
• Assess the applicability of the approach on the PDVR problem
D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning spatiotemporal features with 3D convolutional networks.
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015
21. Thank you!
Get in touch:
• George Kordopatis-Zilos: georgekordopatis@iti.gr
• Symeon Papadopoulos: papadop@iti.gr / @sympap