SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Downloaden Sie, um offline zu lesen
Combining Adversarial and Reinforcement Learning for
Video Thumbnail Selection
E. Apostolidis1,2, E. Adamantidou1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
2021 ACM International Conference
on Multimedia Retrieval
Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions
1
Problem statement
2
Video is everywhere!
• Captured by smart devices and instantly
shared online
• Constantly and rapidly increasing volumes
of video content on the Web
Hours of video content uploaded on
YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-
like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
Problem statement
3
But how to spot what we are looking for in endless collections of video content?
Get a quick idea about a
video’s content by
checking its thumbnail!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Goal of video thumbnail selection technologies
4
Analysis outcomes: a set of
representative video frames
“Select one or a few video frames
that provide a representative and
aesthetically-pleasing overview of
the video content”
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also online available at: https://www.youtube.com/watch?v=deRF9oEbRso)
Related work
Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and
tailored features and mechanisms to assess video frames’ alignment with these rules
• Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity,
variance of luminance, scene steadiness, thematic relevance, absence of subtitles
• Main shortcoming: rule definition and features’ engineering are highly-complex tasks
Recent visual-based approaches: Target a few commonly-desired characteristics for a video
thumbnail, and exploit learning efficiency of deep network architectures
• Thumbnail selection associated with: learnable estimates about frames’ representativeness and
aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources
• Video thumbnail selection is associated with: extracted keywords from the video metadata, databases
with visually-similar content, latent representations of textual and audio data, textual user queries
5
Developed approach
High-level overview
6
Developed approach
Network architecture
• Thumbnail Selector
• Estimating frames’ aesthetic quality
• Estimating frames’ importance
• Fusing estimations and selecting a small set of
candidate thumbnails
• Thumbnail Evaluator
• Evaluating thumbnails’ aesthetic quality
• Evaluating thumbnails’ representativeness
• Fusing evaluations (rewards)
• Thumbnail Evaluator  Thumbnail Selector
• Using the overall reward for reinforcement learning
7
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 1: Update Encoder based on:
LRecon= “distance between original and reconstructed
feature vectors, based on a latent representation in
the last hidden layer of the Discriminator”
LPrior: “information loss when using Encoder’s latent
space to represent the prior distribution defined by
the Variational Auto-Encoder”
8
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 2: Update Decoder based on:
LRecon= “distance between original and reconstructed
feature vectors, based on a latent representation in the
last hidden layer of the Discriminator”
LGEN: “difference between Discriminator’s output when
seeing the thumbnail-based reconstructed feature vectors
and the label (“1”) associated to the original video”
8
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 3: Update Discriminator based on:
LORIG = “difference between Discriminator’s output when
seeing the original feature vectors and the label (“1”)
associated to the original video”
LSUM: “difference between Discriminator’s output when
seeing the thumbnail-based reconstructed feature
vectors and the label (“0”) associated to the thumbnail-
based video summary”
8
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 4: Update Importance Estimator based
on the Episodic REINFORCE algorithm
8
Experiments
9
Datasets
• Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
• 50 videos of various genres (e.g. documentary, educational, historical, lecture)
• Video length: 46 sec. to 3.5 min.
• Annotation: keyframe-based video summaries (5 per video)
• Youtube (https://sites.google.com/site/vsummsite/download)
• 50 videos of diverse content (e.g. news, TV-shows, sports, commercials) collected from the Web
• Video length: 9 sec. to 11 min.
• Annotation: keyframe-based video summaries (5 per video)
Experiments
10
Evaluation approach
• Ground-truth thumbnails: top-3 selected keyframes by human annotators
• Evaluation measures:
• Precision at 1 (P@1): matching ground-truth with top-1 machine-selected thumbnail
• Precision at 3 (P@3): matching ground-truth with top-3 machine-selected thumbnails
• Measure performance also when using only the top-1 selected keyframe by human
annotators as the ground-truth
• Run experiments on 10 different randomly-created splits of the used data (80% training;
20% testing) and report the average performance over these runs
Experiments
11
Performance comparisons using top-3 human selected keyframes as ground-truth
OVP Youtube
P@1 P@3 P@1 P@3
Baseline (random) 15.79% 32.51% 7.53% 17.94%
Mahasseni et al., (2017) - 7.80% - 11.34%
Song et al., (2016) - 11.72% - 16.47%
Gu et al., (2018) - 12.18% - 18.25%
Apostolidis et al., (2021) 15.00% 24.00% 8.75% 15.00%
Proposed approach 31.00% 40.00% 15.00% 20.00%
B. Mahasseni, et al., (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991.
Y. Song, et al., (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM ’16, pp. 659–668.
H. Gu et al., (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018 , pp. 1–6.
E. Apostolidis, et al., (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks
for Unsupervised Video Summarization. IEEE Trans. CSVT , vol. 31, no. 8, pp. 3278-3292, Aug. 2021.
Experiments
12
Performance comparisons using top-1 human selected keyframes as ground-truth
OVP Youtube
P@1 P@3 P@1 P@3
Baseline (random) 6.36% 16.66% 4.23% 9.98%
Apostolidis et al., (2021) 7.00% 14.00% 6.25% 8.75%
Proposed approach 17.00% 21.00% 10.00% 16.25%
Experiments
13
Ablation study
Thumbnail selection criteria OVP Youtube
Aesthetics
estimations
Represent.
estimations
Using top-3
human selections
Using top-1
human selections
Using top-3
human selections
Using top-1
human selections
Frame
picking
Reward P@1 P@3 P@1 P@3 P@1 P@3 P@1 P@3
Baseline
(random)
- - - 15.79 32.51 6.36 16.66 7.53 17.94 4.23 9.98
Variant #1 √ √ X 16.00 20.00 8.00 12.00 6.00 17.50 5.00 7.50
Variant #2 X X √ 20.00 30.00 8.00 13.00 10.00 18.75 3.75 8.75
Variant #3 √ X √ 12.00 36.00 3.00 18.00 10.00 18.75 6.25 12.50
Variant #4 X √ √ 30.00 39.00 18.00 23.00 13.75 16.25 10.00 12.50
Proposed
approach
√ √ √ 31.00 40.00 17.00 21.00 15.00 20.00 10.00 16.25
Conclusions
14
• Deep network architecture for video thumbnail selection, trained by combining adversarial
and reinforcement learning
• Thumbnail selection relies on representativeness and aesthetic quality of video frames
• Representativeness measured by an adversarially-trained discriminator
• Aesthetic quality estimated by a pretrained Fully Convolutional Network
• An overall reward is used to train Thumbnail Selector via reinforcement learning
• Experiments on two benchmark datasets (OVP and Youtube):
• Showed the advanced performance of our method against other SoA video thumbnail selection or
summarization approaches
• Documented the importance of aesthetics for the video thumbnail selection task
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/Video-Thumbnail-Selector
This work was supported by the EUs Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1

Weitere ähnliche Inhalte

Ähnlich wie Video Thumbnail Selector

Research Proposal Presentation Pitch
Research Proposal Presentation PitchResearch Proposal Presentation Pitch
Research Proposal Presentation Pitchtchoonyong
 
Fast object re-detection and localization in video for spatio-temporal fragme...
Fast object re-detection and localization in video for spatio-temporal fragme...Fast object re-detection and localization in video for spatio-temporal fragme...
Fast object re-detection and localization in video for spatio-temporal fragme...LinkedTV
 
Fast object re detection and localization in video for spatio-temporal fragme...
Fast object re detection and localization in video for spatio-temporal fragme...Fast object re detection and localization in video for spatio-temporal fragme...
Fast object re detection and localization in video for spatio-temporal fragme...MediaMixerCommunity
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Goergen Institute for Data Science
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Taskmultimediaeval
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
Human age and gender Detection
Human age and gender DetectionHuman age and gender Detection
Human age and gender DetectionAbhiAchalla
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Real Time Object Dectection using machine learning
Real Time Object Dectection using machine learningReal Time Object Dectection using machine learning
Real Time Object Dectection using machine learningpratik pratyay
 
IceBreaker Solving Cold Start Problem For Video Recommendation Engines
IceBreaker  Solving Cold Start Problem For Video Recommendation EnginesIceBreaker  Solving Cold Start Problem For Video Recommendation Engines
IceBreaker Solving Cold Start Problem For Video Recommendation EnginesJamie Boyd
 
Explaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionExplaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionVasileiosMezaris
 
Presentation of the InVID verification technologies at IPTC 2018
Presentation of the InVID verification technologies at IPTC 2018Presentation of the InVID verification technologies at IPTC 2018
Presentation of the InVID verification technologies at IPTC 2018InVID Project
 
A Framework for Adaptive Delivery of Omnidirectional Video
A Framework for Adaptive Delivery of Omnidirectional VideoA Framework for Adaptive Delivery of Omnidirectional Video
A Framework for Adaptive Delivery of Omnidirectional VideoAlpen-Adria-Universität
 
Movie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceMovie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceHarivamshi D
 
John W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn Vinti
 
Synthesizing pseudo 2.5 d content from monocular videos for mixed reality
Synthesizing pseudo 2.5 d content from monocular videos for mixed realitySynthesizing pseudo 2.5 d content from monocular videos for mixed reality
Synthesizing pseudo 2.5 d content from monocular videos for mixed realityNAVER Engineering
 
Video production pedagogy
Video production pedagogyVideo production pedagogy
Video production pedagogyChris Willmott
 

Ähnlich wie Video Thumbnail Selector (20)

Research Proposal Presentation Pitch
Research Proposal Presentation PitchResearch Proposal Presentation Pitch
Research Proposal Presentation Pitch
 
Fast object re-detection and localization in video for spatio-temporal fragme...
Fast object re-detection and localization in video for spatio-temporal fragme...Fast object re-detection and localization in video for spatio-temporal fragme...
Fast object re-detection and localization in video for spatio-temporal fragme...
 
Fast object re detection and localization in video for spatio-temporal fragme...
Fast object re detection and localization in video for spatio-temporal fragme...Fast object re detection and localization in video for spatio-temporal fragme...
Fast object re detection and localization in video for spatio-temporal fragme...
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
 
Defense_20140625
Defense_20140625Defense_20140625
Defense_20140625
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
Visual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic VideosVisual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic Videos
 
Human age and gender Detection
Human age and gender DetectionHuman age and gender Detection
Human age and gender Detection
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
 
slide-171212080528.pptx
slide-171212080528.pptxslide-171212080528.pptx
slide-171212080528.pptx
 
Real Time Object Dectection using machine learning
Real Time Object Dectection using machine learningReal Time Object Dectection using machine learning
Real Time Object Dectection using machine learning
 
IceBreaker Solving Cold Start Problem For Video Recommendation Engines
IceBreaker  Solving Cold Start Problem For Video Recommendation EnginesIceBreaker  Solving Cold Start Problem For Video Recommendation Engines
IceBreaker Solving Cold Start Problem For Video Recommendation Engines
 
Explaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionExplaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attention
 
Presentation of the InVID verification technologies at IPTC 2018
Presentation of the InVID verification technologies at IPTC 2018Presentation of the InVID verification technologies at IPTC 2018
Presentation of the InVID verification technologies at IPTC 2018
 
A Framework for Adaptive Delivery of Omnidirectional Video
A Framework for Adaptive Delivery of Omnidirectional VideoA Framework for Adaptive Delivery of Omnidirectional Video
A Framework for Adaptive Delivery of Omnidirectional Video
 
Movie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceMovie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial Intelligence
 
John W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final Presentation
 
Synthesizing pseudo 2.5 d content from monocular videos for mixed reality
Synthesizing pseudo 2.5 d content from monocular videos for mixed realitySynthesizing pseudo 2.5 d content from monocular videos for mixed reality
Synthesizing pseudo 2.5 d content from monocular videos for mixed reality
 
Video production pedagogy
Video production pedagogyVideo production pedagogy
Video production pedagogy
 

Mehr von VasileiosMezaris

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationVasileiosMezaris
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskVasileiosMezaris
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosVasileiosMezaris
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...VasileiosMezaris
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022VasileiosMezaris
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsVasileiosMezaris
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchVasileiosMezaris
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersVasileiosMezaris
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...VasileiosMezaris
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...VasileiosMezaris
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AIVasileiosMezaris
 
PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020VasileiosMezaris
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarizationVasileiosMezaris
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrievalVasileiosMezaris
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruningVasileiosMezaris
 
Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1VasileiosMezaris
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...VasileiosMezaris
 
Subclass deep neural networks
Subclass deep neural networksSubclass deep neural networks
Subclass deep neural networksVasileiosMezaris
 

Mehr von VasileiosMezaris (20)

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and Localization
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages Task
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees Videos
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for Explanations
 
Gated-ViGAT
Gated-ViGATGated-ViGAT
Gated-ViGAT
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video Search
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AI
 
LSTM Structured Pruning
LSTM Structured PruningLSTM Structured Pruning
LSTM Structured Pruning
 
PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarization
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrieval
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruning
 
Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...
 
Subclass deep neural networks
Subclass deep neural networksSubclass deep neural networks
Subclass deep neural networks
 

Kürzlich hochgeladen

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...Lokesh Kothari
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 

Kürzlich hochgeladen (20)

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 

Video Thumbnail Selector

  • 1. Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection E. Apostolidis1,2, E. Adamantidou1, V. Mezaris1, I. Patras2 1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece 2 School of EECS, Queen Mary University of London, London, UK 2021 ACM International Conference on Multimedia Retrieval
  • 2. Outline • Problem statement • Related work • Developed approach • Experiments • Conclusions 1
  • 3. Problem statement 2 Video is everywhere! • Captured by smart devices and instantly shared online • Constantly and rapidly increasing volumes of video content on the Web Hours of video content uploaded on YouTube every minute Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps- like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
  • 4. Problem statement 3 But how to spot what we are looking for in endless collections of video content? Get a quick idea about a video’s content by checking its thumbnail! Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
  • 5. Goal of video thumbnail selection technologies 4 Analysis outcomes: a set of representative video frames “Select one or a few video frames that provide a representative and aesthetically-pleasing overview of the video content” Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009” Video source: OVP dataset (video also online available at: https://www.youtube.com/watch?v=deRF9oEbRso)
  • 6. Related work Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and tailored features and mechanisms to assess video frames’ alignment with these rules • Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity, variance of luminance, scene steadiness, thematic relevance, absence of subtitles • Main shortcoming: rule definition and features’ engineering are highly-complex tasks Recent visual-based approaches: Target a few commonly-desired characteristics for a video thumbnail, and exploit learning efficiency of deep network architectures • Thumbnail selection associated with: learnable estimates about frames’ representativeness and aesthetic quality (focusing also on faces), learnable classifications of good and bad frames Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources • Video thumbnail selection is associated with: extracted keywords from the video metadata, databases with visually-similar content, latent representations of textual and audio data, textual user queries 5
  • 8. Developed approach Network architecture • Thumbnail Selector • Estimating frames’ aesthetic quality • Estimating frames’ importance • Fusing estimations and selecting a small set of candidate thumbnails • Thumbnail Evaluator • Evaluating thumbnails’ aesthetic quality • Evaluating thumbnails’ representativeness • Fusing evaluations (rewards) • Thumbnail Evaluator  Thumbnail Selector • Using the overall reward for reinforcement learning 7
  • 9. Developed approach Learning objectives and pipeline • We follow a step-wise learning approach • Step 1: Update Encoder based on: LRecon= “distance between original and reconstructed feature vectors, based on a latent representation in the last hidden layer of the Discriminator” LPrior: “information loss when using Encoder’s latent space to represent the prior distribution defined by the Variational Auto-Encoder” 8
  • 10. Developed approach Learning objectives and pipeline • We follow a step-wise learning approach • Step 2: Update Decoder based on: LRecon= “distance between original and reconstructed feature vectors, based on a latent representation in the last hidden layer of the Discriminator” LGEN: “difference between Discriminator’s output when seeing the thumbnail-based reconstructed feature vectors and the label (“1”) associated to the original video” 8
  • 11. Developed approach Learning objectives and pipeline • We follow a step-wise learning approach • Step 3: Update Discriminator based on: LORIG = “difference between Discriminator’s output when seeing the original feature vectors and the label (“1”) associated to the original video” LSUM: “difference between Discriminator’s output when seeing the thumbnail-based reconstructed feature vectors and the label (“0”) associated to the thumbnail- based video summary” 8
  • 12. Developed approach Learning objectives and pipeline • We follow a step-wise learning approach • Step 4: Update Importance Estimator based on the Episodic REINFORCE algorithm 8
  • 13. Experiments 9 Datasets • Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download) • 50 videos of various genres (e.g. documentary, educational, historical, lecture) • Video length: 46 sec. to 3.5 min. • Annotation: keyframe-based video summaries (5 per video) • Youtube (https://sites.google.com/site/vsummsite/download) • 50 videos of diverse content (e.g. news, TV-shows, sports, commercials) collected from the Web • Video length: 9 sec. to 11 min. • Annotation: keyframe-based video summaries (5 per video)
  • 14. Experiments 10 Evaluation approach • Ground-truth thumbnails: top-3 selected keyframes by human annotators • Evaluation measures: • Precision at 1 (P@1): matching ground-truth with top-1 machine-selected thumbnail • Precision at 3 (P@3): matching ground-truth with top-3 machine-selected thumbnails • Measure performance also when using only the top-1 selected keyframe by human annotators as the ground-truth • Run experiments on 10 different randomly-created splits of the used data (80% training; 20% testing) and report the average performance over these runs
  • 15. Experiments 11 Performance comparisons using top-3 human selected keyframes as ground-truth OVP Youtube P@1 P@3 P@1 P@3 Baseline (random) 15.79% 32.51% 7.53% 17.94% Mahasseni et al., (2017) - 7.80% - 11.34% Song et al., (2016) - 11.72% - 16.47% Gu et al., (2018) - 12.18% - 18.25% Apostolidis et al., (2021) 15.00% 24.00% 8.75% 15.00% Proposed approach 31.00% 40.00% 15.00% 20.00% B. Mahasseni, et al., (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991. Y. Song, et al., (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM ’16, pp. 659–668. H. Gu et al., (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018 , pp. 1–6. E. Apostolidis, et al., (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization. IEEE Trans. CSVT , vol. 31, no. 8, pp. 3278-3292, Aug. 2021.
  • 16. Experiments 12 Performance comparisons using top-1 human selected keyframes as ground-truth OVP Youtube P@1 P@3 P@1 P@3 Baseline (random) 6.36% 16.66% 4.23% 9.98% Apostolidis et al., (2021) 7.00% 14.00% 6.25% 8.75% Proposed approach 17.00% 21.00% 10.00% 16.25%
  • 17. Experiments 13 Ablation study Thumbnail selection criteria OVP Youtube Aesthetics estimations Represent. estimations Using top-3 human selections Using top-1 human selections Using top-3 human selections Using top-1 human selections Frame picking Reward P@1 P@3 P@1 P@3 P@1 P@3 P@1 P@3 Baseline (random) - - - 15.79 32.51 6.36 16.66 7.53 17.94 4.23 9.98 Variant #1 √ √ X 16.00 20.00 8.00 12.00 6.00 17.50 5.00 7.50 Variant #2 X X √ 20.00 30.00 8.00 13.00 10.00 18.75 3.75 8.75 Variant #3 √ X √ 12.00 36.00 3.00 18.00 10.00 18.75 6.25 12.50 Variant #4 X √ √ 30.00 39.00 18.00 23.00 13.75 16.25 10.00 12.50 Proposed approach √ √ √ 31.00 40.00 17.00 21.00 15.00 20.00 10.00 16.25
  • 18. Conclusions 14 • Deep network architecture for video thumbnail selection, trained by combining adversarial and reinforcement learning • Thumbnail selection relies on representativeness and aesthetic quality of video frames • Representativeness measured by an adversarially-trained discriminator • Aesthetic quality estimated by a pretrained Fully Convolutional Network • An overall reward is used to train Thumbnail Selector via reinforcement learning • Experiments on two benchmark datasets (OVP and Youtube): • Showed the advanced performance of our method against other SoA video thumbnail selection or summarization approaches • Documented the importance of aesthetics for the video thumbnail selection task
  • 19. Thank you for your attention! Questions? Evlampios Apostolidis, apostolid@iti.gr Code and documentation publicly available at: https://github.com/e-apostolidis/Video-Thumbnail-Selector This work was supported by the EUs Horizon 2020 research and innovation programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1