Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers

  1. Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
     Giorgos Kordopatis-Zilos (1,2), Symeon Papadopoulos (1), Ioannis Patras (2) and Yiannis Kompatsiaris (1)
     (1) Information Technologies Institute, CERTH, Thessaloniki, Greece
     (2) Queen Mary University of London, Mile End Campus, UK, E1 4NS
     23rd International Conference on MultiMedia Modeling, Reykjavík, Iceland, 4-6 January 2017
  2. Problem & Motivation
     • Near-Duplicate Video Retrieval (NDVR)
       • Given a query video, search a video dataset to retrieve (visually) highly similar videos
       • Rank the candidate videos based on their similarity to the query
     • Various applications
       • content verification
       • video retrieval, management and recommendation
       • copyright protection
     • Crucial importance of NDVR, due to the exponential growth of video content
  3. Near-Duplicate Videos: Definition
     • Variety of definitions and understandings of near-duplicate videos
     • Adopt the definition by Wu et al. (2007)
       • photometric variations: gamma, contrast, brightness, etc.
       • editing operations: resize, shift, crop, flip
       • insertion of patterns: caption, logo, subtitles, sliding captions, etc.
       • re-encoding: video format, compression
       • video modifications: frame rate, frame insertion, deletion, swap
     X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
  4. Related Work
     • Variety of approaches (Liu et al., 2013)
       • Video-level matching: comparison of global signatures
         • Global feature vectors
         • Fingerprints
         • Hash codes
       • Frame-level matching: frames or sequences
         • Local descriptors
         • Spatiotemporal features
       • Hybrid-level matching
         • Filter-and-refine methods
     • TRECVID content-based copy detection (Kraaij & Awad, 2011)
       • duplicates artificially generated by standard transformations
     W. Kraaij, and G. Awad. TRECVID 2011 content-based copy detection: Task overview. Proc. TRECVid 2010, 2011
     J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys, vol. 45, no. 4, 44, 2013
  5. Feature Extraction (1/2)
     • Employ a pre-trained CNN with $L$ convolutional layers
     • Apply max pooling on every channel of the feature map of each layer (Zheng et al., 2016)
       $v^l(i) = \max M^l(\cdot, \cdot, i), \quad i = 1, 2, \dots, c^l, \quad l = 1, 2, \dots, L$
     • $L$ vectors generated, one per layer, each $c^l$-dimensional
     L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian. Good Practice in CNN Feature Transfer. arXiv:1604.00133, 2016
  6. Feature Extraction (2/2)
     • Pre-trained CNN networks from Caffe (Jia et al., 2014): a) AlexNet, b) VGGNet, c) GoogLeNet
     • Feature extraction uses the convolution layers of the architectures
     [Figure panels: AlexNet, VGGNet, GoogLeNet]
     Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM int. conference on Multimedia, pp. 675-678, 2014
  7. Vector Aggregation
  8. Vector Aggregation
  9. Vector Aggregation
  10. Vector Aggregation
  11. Layer Aggregation
  12. Video Indexing and Querying
      • tf-idf weighting of visual words
        $w_{td} = n_{td} \cdot \log(D_b / n_t)$
      • Inverted file indexing structure for fast search
      • Retrieve candidates with at least one common visual word
      • Rank candidates based on cosine similarity of their tf-idf representations
        $sim(q, p) = \frac{\mathbf{w}_q \cdot \mathbf{w}_p}{\lVert \mathbf{w}_q \rVert \, \lVert \mathbf{w}_p \rVert}$
  13. Evaluation: Dataset
      • Dataset: CC_WEB_VIDEO (http://vireo.cs.cityu.edu.hk/webvideo/)
        • Videos: 13,139
        • Keyframes: 397,965 images
      • [Figure: Dataset Annotation]
      • Evaluation metrics
        • precision-recall (PR)
        • mean Average Precision (mAP)
          $AP = \frac{1}{n} \sum_{i=0}^{n} \frac{i}{r_i}$
  14. Dataset Examples
      [Figure: query video and its near-duplicate videos]
  15. Results I: Impact of CNN architecture and vocabulary size
  16. Results II: Performance using individual layers
      [Figure panels: AlexNet, VGGNet, GoogLeNet]
  17. Results III
      • Performance per query
      • Best runs
        • CNN-V: Vector-based aggregation, GoogLeNet
        • CNN-L: Layer-based aggregation, VGGNet
      • Lower precision in hard queries
        • query 18 (Bus uncle)
        • query 22 (Numa Gary)
  18. Evaluation: Comparison to the State of the Art (SoA)
      • Color Histograms (CH) (Wu et al., 2007): video-level matching, color histograms
      • Auto Color Correlograms (ACC) (Cai et al., 2011): frame-level matching, auto-color correlograms, BoW, tf-idf weighted cosine similarity
      • Local Structure (LS) (Wu et al., 2007): hybrid-level matching, color histograms, keyframe similarity of PCA-SIFT descriptors
      • Multiple Feature Hashing (MFH) (Song et al., 2013): video-level matching, hashes multiple features into Hamming space, combines the keyframe hash codes into a global video representation
      • Pattern-based approach (PPT) (Chou et al., 2015): hybrid-level matching, pattern-based indexing tree (PI-tree), m-pattern-based dynamic programming (mPDP), time-shift m-pattern similarity (TPS)
      X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
      Y. Cai, L. Yang, W. Ping, F. Wang, T. Mei, X. S. Hua, and S. Li. Million-scale near-duplicate video retrieval system. In Proceedings of the 19th ACM international conference on Multimedia, pp. 837-838, 2011
      J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013
      C. L. Chou, H. T. Chen, and S. Y. Lee. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos. IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, 2015
  19. Results IV: Comparison against existing NDVR approaches
  20. Future Work
      • Exploit the C3D features (Tran et al., 2015)
      • Conduct more comprehensive evaluations
        • More challenging datasets: larger scale, more similar but non-relevant videos (distractors)
      • Partial Duplicate Video Retrieval (PDVR)
        • Assess the applicability of the approach on the PDVR problem
      D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015
  21. Thank you!
      Get in touch:
      • George Kordopatis-Zilos: georgekordopatis@iti.gr
      • Symeon Papadopoulos: papadop@iti.gr / @sympap
      With the support of: [logos]
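
The channel-wise max pooling on the Feature Extraction (1/2) slide can be illustrated with a minimal sketch. It assumes a feature map is available as a nested list indexed `[row][column][channel]`; this layout, the toy activations, and the function names `max_pool_channels` and `extract_layer_vectors` are hypothetical and not taken from the slides. In practice the feature maps would come from the pre-trained CNN.

```python
from typing import List

# Assumed layout: feature_map[row][col][channel]. This is an illustrative
# choice, not the data format used by the authors.
FeatureMap = List[List[List[float]]]


def max_pool_channels(feature_map: FeatureMap) -> List[float]:
    """For each channel i, return the maximum activation over all spatial positions."""
    n_channels = len(feature_map[0][0])
    return [
        max(cell[i] for row in feature_map for cell in row)
        for i in range(n_channels)
    ]


def extract_layer_vectors(feature_maps: List[FeatureMap]) -> List[List[float]]:
    """Pool each of the L layers separately, giving L vectors (one per layer)."""
    return [max_pool_channels(m) for m in feature_maps]


# Toy input (made-up numbers): a 2x2 map with 3 channels and a 1x2 map with 2 channels.
layer1 = [
    [[0.1, 0.5, 0.0], [0.7, 0.2, 0.3]],
    [[0.4, 0.9, 0.1], [0.2, 0.1, 0.8]],
]
layer2 = [[[1.0, -1.0], [0.5, 2.0]]]

vectors = extract_layer_vectors([layer1, layer2])
print(vectors)  # [[0.7, 0.9, 0.8], [1.0, 2.0]]
```

Each output vector has as many entries as its layer has channels, which matches the statement that the $l$-th vector is $c^l$-dimensional.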
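
The steps on the Video Indexing and Querying slide (tf-idf weighting, inverted file, candidate retrieval, cosine ranking) can be sketched as follows. The slide does not define its symbols, so this sketch assumes the usual tf-idf reading: $n_{td}$ is the count of visual word $t$ in video $d$, $n_t$ is the number of videos containing $t$, and $D_b$ is the number of videos in the database. All function names and the toy visual-word ids are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def build_index(videos: Dict[str, List[int]]):
    """Build tf-idf vectors and an inverted file from per-video visual-word lists."""
    n_videos = len(videos)                      # assumed meaning of D_b
    doc_freq: Counter = Counter()               # n_t: videos containing word t
    for words in videos.values():
        doc_freq.update(set(words))
    idf = {t: math.log(n_videos / n_t) for t, n_t in doc_freq.items()}

    tfidf: Dict[str, Dict[int, float]] = {}
    inverted: Dict[int, Set[str]] = defaultdict(set)  # word -> videos containing it
    for vid, words in videos.items():
        counts = Counter(words)                 # n_td: occurrences of t in video d
        tfidf[vid] = {t: n_td * idf[t] for t, n_td in counts.items()}
        for t in counts:
            inverted[t].add(vid)
    return tfidf, inverted, idf


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_words: List[int], tfidf, inverted, idf) -> List[Tuple[str, float]]:
    """Return candidates sharing at least one visual word, ranked by cosine similarity."""
    counts = Counter(t for t in query_words if t in idf)
    q = {t: n * idf[t] for t, n in counts.items()}
    candidates: Set[str] = set()
    for t in counts:
        candidates |= inverted[t]               # inverted-file lookup
    scored = [(vid, cosine(q, tfidf[vid])) for vid in candidates]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy database: video id -> visual words of its keyframes (made-up data).
videos = {"a": [1, 1, 2], "b": [1, 3], "c": [4, 4]}
tfidf, inverted, idf = build_index(videos)
results = search([1, 1, 2], tfidf, inverted, idf)
print(results)
```

In this toy run, video `c` shares no visual word with the query, so the inverted-file lookup never scores it. That pruning is what makes the search fast on a large collection.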

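
The AP formula on the Evaluation: Dataset slide can be worked through in a short sketch. It assumes the standard reading of the symbols, which the slide does not spell out: $n$ is the number of relevant (near-duplicate) videos for a query and $r_i$ is the rank at which the $i$-th relevant video is retrieved. The sum in the code runs over $i = 1, \dots, n$, since an $i = 0$ term would contribute nothing. Function names and the toy rankings are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """AP = (1/n) * sum of i / r_i over the relevant videos found in the ranking."""
    n = len(relevant)
    if n == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, vid in enumerate(ranking, start=1):
        if vid in relevant:
            hits += 1              # this is the i-th relevant video ...
            total += hits / rank   # ... retrieved at rank r_i
    # Relevant videos that never appear in the ranking add nothing to the sum.
    return total / n


def mean_average_precision(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]
) -> float:
    """mAP: the mean of the per-query AP values."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in ground_truth]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example with made-up video ids.
rankings = {
    "q1": ["v1", "v7", "v3", "v9"],
    "q2": ["v2", "v5"],
}
ground_truth = {
    "q1": {"v1", "v3"},   # AP = (1/1 + 2/3) / 2
    "q2": {"v5"},         # AP = (1/2) / 1
}
ap_q1 = average_precision(rankings["q1"], ground_truth["q1"])
ap_q2 = average_precision(rankings["q2"], ground_truth["q2"])
m_ap = mean_average_precision(rankings, ground_truth)
print(round(ap_q1, 4), round(ap_q2, 4), round(m_ap, 4))
```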