Video Thumbnail Selector

18 Nov 2021
  1. Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection
     E. Apostolidis [1,2], E. Adamantidou [1], V. Mezaris [1], I. Patras [2]
     [1] Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
     [2] School of EECS, Queen Mary University of London, London, UK
     2021 ACM International Conference on Multimedia Retrieval
  2. Outline • Problem statement • Related work • Developed approach • Experiments • Conclusions
  3. Problem statement
     Video is everywhere!
     • Captured by smart devices and instantly shared online
     • Constantly and rapidly increasing volumes of video content on the Web
     [Chart: hours of video content uploaded on YouTube every minute]
     Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
  4. Problem statement
     But how can we spot what we are looking for in endless collections of video content?
     Get a quick idea about a video's content by checking its thumbnail!
     Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
  5. Goal of video thumbnail selection technologies
     "Select one or a few video frames that provide a representative and aesthetically-pleasing overview of the video content"
     Analysis outcomes: a set of representative video frames
     Video title: "Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009"
     Video source: OVP dataset (video also available online at: https://www.youtube.com/watch?v=deRF9oEbRso)
  6. Related work
     Early visual-based approaches: use hand-crafted rules about the optimal thumbnail, and tailored features and mechanisms to assess the video frames' alignment with these rules
     • Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity, variance of luminance, scene steadiness, thematic relevance, absence of subtitles
     • Main shortcoming: rule definition and feature engineering are highly complex tasks
     Recent visual-based approaches: target a few commonly-desired characteristics of a video thumbnail, and exploit the learning efficiency of deep network architectures
     • Thumbnail selection associated with: learnable estimates of the frames' representativeness and aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
     Recent multimodal approaches: exploit data from additional modalities or auxiliary sources
     • Thumbnail selection associated with: keywords extracted from the video metadata, databases with visually-similar content, latent representations of textual and audio data, textual user queries
  7. Developed approach
     [Figure: high-level overview of the developed approach]
  8. Developed approach: network architecture
     • Thumbnail Selector
       • Estimating the frames' aesthetic quality
       • Estimating the frames' importance
       • Fusing the estimations and selecting a small set of candidate thumbnails
     • Thumbnail Evaluator
       • Evaluating the thumbnails' aesthetic quality
       • Evaluating the thumbnails' representativeness
       • Fusing the evaluations (rewards)
     • Thumbnail Evaluator → Thumbnail Selector
       • Using the overall reward for reinforcement learning
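     As a rough illustration (not from the slides), the Selector's final step could look like the minimal sketch below; the multiplicative fusion rule and the top-k picking are assumptions, since the slide does not detail the fusion mechanism.

```python
# Minimal sketch of the Thumbnail Selector's fusion step.
# ASSUMPTIONS: multiplicative fusion of the two per-frame scores and
# simple top-k picking; the paper's exact fusion rule may differ.
import numpy as np

def select_candidates(aesthetics, importance, k=3):
    """aesthetics, importance: per-frame scores in [0, 1], shape (T,)."""
    fused = aesthetics * importance    # assumed fusion rule
    order = np.argsort(fused)[::-1]    # frames sorted by descending fused score
    return order[:k]                   # indices of the k candidate thumbnails
```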
  9. Developed approach: learning objectives and pipeline
     We follow a step-wise learning approach.
     • Step 1: Update the Encoder based on:
       • LRecon: distance between the original and reconstructed feature vectors, based on a latent representation in the last hidden layer of the Discriminator
       • LPrior: information loss when using the Encoder's latent space to represent the prior distribution defined by the Variational Auto-Encoder
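     A minimal PyTorch sketch of this update step, assuming a unit-Gaussian VAE prior and an equally-weighted sum of the two losses (the slide does not state the weighting); the names mu, logvar, phi_orig and phi_recon are illustrative:

```python
# Step 1: Encoder update (sketch; weighting and Gaussian prior are assumptions).
import torch
import torch.nn.functional as F

def update_encoder(enc_optimizer, mu, logvar, phi_orig, phi_recon):
    # LRecon: distance between the Discriminator's last-hidden-layer
    # representations of the original (phi_orig) and reconstructed (phi_recon)
    # feature vectors
    l_recon = F.mse_loss(phi_recon, phi_orig)
    # LPrior: KL divergence between the Encoder's latent distribution
    # N(mu, sigma^2) and the N(0, I) prior of the Variational Auto-Encoder
    l_prior = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = l_recon + l_prior
    enc_optimizer.zero_grad()
    loss.backward()
    enc_optimizer.step()
    return loss.item()
```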
  10. Developed approach: learning objectives and pipeline
     • Step 2: Update the Decoder based on:
       • LRecon: distance between the original and reconstructed feature vectors, based on a latent representation in the last hidden layer of the Discriminator
       • LGEN: difference between the Discriminator's output when seeing the thumbnail-based reconstructed feature vectors and the label ("1") associated to the original video
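     A sketch of this generator-side update, assuming binary cross-entropy against the "1" label and an equally-weighted sum (both assumptions; d_out_recon denotes the Discriminator's output for the reconstruction):

```python
# Step 2: Decoder (generator) update (sketch; BCE and loss weighting assumed).
import torch
import torch.nn.functional as F

def update_decoder(dec_optimizer, phi_orig, phi_recon, d_out_recon):
    # LRecon: same feature-space reconstruction distance as in Step 1
    l_recon = F.mse_loss(phi_recon, phi_orig)
    # LGEN: push the Discriminator's output for the thumbnail-based
    # reconstruction towards the label ("1") of the original video
    l_gen = F.binary_cross_entropy(d_out_recon, torch.ones_like(d_out_recon))
    loss = l_recon + l_gen
    dec_optimizer.zero_grad()
    loss.backward()
    dec_optimizer.step()
    return loss.item()
```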
  11. Developed approach: learning objectives and pipeline
     • Step 3: Update the Discriminator based on:
       • LORIG: difference between the Discriminator's output when seeing the original feature vectors and the label ("1") associated to the original video
       • LSUM: difference between the Discriminator's output when seeing the thumbnail-based reconstructed feature vectors and the label ("0") associated to the thumbnail-based video summary
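     The corresponding Discriminator update, again sketched under the assumption that both terms are binary cross-entropy losses with the labels stated above:

```python
# Step 3: Discriminator update (sketch; BCE for both terms is an assumption).
import torch
import torch.nn.functional as F

def update_discriminator(disc_optimizer, d_out_orig, d_out_recon):
    # LORIG: the Discriminator should label the original feature vectors "1"
    l_orig = F.binary_cross_entropy(d_out_orig, torch.ones_like(d_out_orig))
    # LSUM: ...and the thumbnail-based reconstructions "0"
    l_sum = F.binary_cross_entropy(d_out_recon, torch.zeros_like(d_out_recon))
    loss = l_orig + l_sum
    disc_optimizer.zero_grad()
    loss.backward()
    disc_optimizer.step()
    return loss.item()
```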
  12. Developed approach: learning objectives and pipeline
     • Step 4: Update the Importance Estimator based on the Episodic REINFORCE algorithm
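     A minimal sketch of an Episodic REINFORCE update for the Importance Estimator: frame selections are sampled from the predicted per-frame probabilities, each episode receives the overall reward from the Thumbnail Evaluator, and a baseline reduces gradient variance. The baseline, the number of episodes, and the reward_fn interface are assumptions, not taken from the slides.

```python
# Step 4: Episodic REINFORCE update (sketch; baseline, episode count and
# reward_fn signature are illustrative assumptions).
import torch
from torch.distributions import Bernoulli

def update_importance_estimator(optimizer, frame_probs, reward_fn,
                                baseline=0.0, episodes=5):
    dist = Bernoulli(frame_probs)          # per-frame selection probabilities
    loss = 0.0
    for _ in range(episodes):
        actions = dist.sample()            # one sampled set of frame selections
        # Overall reward from the Thumbnail Evaluator (aesthetics +
        # representativeness); treated as a constant w.r.t. the gradient.
        reward = float(reward_fn(actions))
        log_prob = dist.log_prob(actions).sum()
        loss = loss - log_prob * (reward - baseline)
    loss = loss / episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```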
  13. Experiments: datasets
     • Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
       • 50 videos of various genres (e.g. documentary, educational, historical, lecture)
       • Video length: 46 sec. to 3.5 min.
       • Annotation: keyframe-based video summaries (5 per video)
     • Youtube (https://sites.google.com/site/vsummsite/download)
       • 50 videos of diverse content (e.g. news, TV shows, sports, commercials) collected from the Web
       • Video length: 9 sec. to 11 min.
       • Annotation: keyframe-based video summaries (5 per video)
  14. Experiments: evaluation approach
     • Ground-truth thumbnails: the top-3 keyframes selected by human annotators
     • Evaluation measures:
       • Precision at 1 (P@1): matching the ground truth with the top-1 machine-selected thumbnail
       • Precision at 3 (P@3): matching the ground truth with the top-3 machine-selected thumbnails
     • Performance is also measured using only the top-1 keyframe selected by human annotators as the ground truth
     • Experiments run on 10 different randomly-created splits of the used data (80% training; 20% testing); we report the average performance over these runs
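     For concreteness, a sketch of the P@k computation as described above, assuming a machine-selected thumbnail counts as a hit when it matches any ground-truth keyframe:

```python
# Precision at k (sketch; the hit criterion is an assumption based on the
# slide's description of P@1 and P@3).
def precision_at_k(ranked_thumbnails, ground_truth, k):
    top_k = ranked_thumbnails[:k]
    hits = sum(1 for frame in top_k if frame in ground_truth)
    return hits / k

# Example: ground truth {12, 48, 90}, machine picks [48, 7, 90]
# -> P@1 = 1/1 = 1.0, P@3 = 2/3 ≈ 0.667
```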
  15. Experiments: performance comparisons using the top-3 human-selected keyframes as ground truth

                                    OVP                 Youtube
                                    P@1      P@3        P@1      P@3
     Baseline (random)              15.79%   32.51%     7.53%    17.94%
     Mahasseni et al. (2017)        -        7.80%      -        11.34%
     Song et al. (2016)             -        11.72%     -        16.47%
     Gu et al. (2018)               -        12.18%     -        18.25%
     Apostolidis et al. (2021)      15.00%   24.00%     8.75%    15.00%
     Proposed approach              31.00%   40.00%     15.00%   20.00%

     B. Mahasseni et al. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991.
     Y. Song et al. (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM '16, pp. 659–668.
     H. Gu et al. (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018, pp. 1–6.
     E. Apostolidis et al. (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization. IEEE Trans. CSVT, vol. 31, no. 8, pp. 3278–3292, Aug. 2021.
  16. Experiments: performance comparisons using the top-1 human-selected keyframe as ground truth

                                    OVP                 Youtube
                                    P@1      P@3        P@1      P@3
     Baseline (random)              6.36%    16.66%     4.23%    9.98%
     Apostolidis et al. (2021)      7.00%    14.00%     6.25%    8.75%
     Proposed approach              17.00%   21.00%     10.00%   16.25%
  17. Experiments: ablation study
     Thumbnail selection criteria: aesthetics estimations used for frame picking ("Picking") and in the reward ("Reward"), and representativeness estimations used in the reward ("Repr. Reward"). All scores in %; "top-3" / "top-1" denote the human selections used as ground truth.

                           Aesthetics        Repr.      OVP (top-3)    OVP (top-1)    Youtube (top-3)  Youtube (top-1)
                           Picking  Reward   Reward     P@1    P@3     P@1    P@3     P@1    P@3       P@1    P@3
     Baseline (random)     -        -        -          15.79  32.51   6.36   16.66   7.53   17.94     4.23   9.98
     Variant #1            √        √        X          16.00  20.00   8.00   12.00   6.00   17.50     5.00   7.50
     Variant #2            X        X        √          20.00  30.00   8.00   13.00   10.00  18.75     3.75   8.75
     Variant #3            √        X        √          12.00  36.00   3.00   18.00   10.00  18.75     6.25   12.50
     Variant #4            X        √        √          30.00  39.00   18.00  23.00   13.75  16.25     10.00  12.50
     Proposed approach     √        √        √          31.00  40.00   17.00  21.00   15.00  20.00     10.00  16.25
  18. Conclusions
     • Deep network architecture for video thumbnail selection, trained by combining adversarial and reinforcement learning
     • Thumbnail selection relies on the representativeness and aesthetic quality of the video frames
       • Representativeness measured by an adversarially-trained Discriminator
       • Aesthetic quality estimated by a pretrained Fully Convolutional Network
       • An overall reward is used to train the Thumbnail Selector via reinforcement learning
     • Experiments on two benchmark datasets (OVP and Youtube):
       • Showed that our method outperforms other state-of-the-art video thumbnail selection and summarization approaches
       • Documented the importance of aesthetics for the video thumbnail selection task
  19. Thank you for your attention! Questions?
     Evlampios Apostolidis, apostolid@iti.gr
     Code and documentation publicly available at: https://github.com/e-apostolidis/Video-Thumbnail-Selector
     This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1.