retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Unsupervised Video Summarization via Attention-Driven
Adversarial Learning
E. Apostolidis1,2, E. Adamantidou1, A. I. Metsai1, V. Mezaris1, I. Patras2
1 CERTH-ITI, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
26th Int. Conf. on Multimedia Modeling
Daejeon, Korea, January 2020
Outline
 Introduction
 Motivation
 Developed approach
 Experiments
 Summarization example
 Conclusions
Problem statement
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
(Example: original video → video summary presented as a storyboard)
Applications of video summarization
 Professional CMS: effective indexing,
browsing, retrieval & promotion of media
assets
 Video sharing platforms: improved viewer
experience, enhanced viewer engagement &
increased content consumption
 Other summarization scenarios: movie trailer production, sports highlights video generation,
video synopsis of 24h surveillance recordings
Related work
Deep-learning approaches
 Supervised methods that use feedforward neural networks (e.g. CNNs) to extract video semantics and identify important video parts based on sequence labeling [21], self-attention networks [7], or video-level metadata [17]
 Supervised approaches that capture the story flow using recurrent neural nets (e.g. LSTMs)
 In combination with statistical models to select a representative and diverse set of keyframes [27]
 In hierarchies to identify the video structure and select key-fragments [30, 31]
 In combination with DTR units and GANs to capture long-range frame dependency [28]
 To form attention-based encoder-decoders [9, 13], or memory-augmented networks [8]
 Unsupervised algorithms that do not rely on human annotations, and build summaries
 Using adversarial learning to: minimize the distance between videos and their summary-based
reconstructions [1, 16]; maximize the mutual information between summary and video [25]; learn a
mapping from raw videos to human-like summaries based on online available summaries [20]
 Through a decision-making process that is learned via RL and reward functions [32]
 By learning to extract key motions of appearing objects [29]
Motivation
Disadvantages of supervised learning
 Only a restricted amount of annotated data is available for the supervised training of a video summarization method
 The highly subjective nature of video summarization (it depends on the viewer’s demands and aesthetics); there is no “ideal” or commonly accepted summary that could be used for training an algorithm
Advantages of unsupervised learning
 No need for labeled data; this avoids the laborious and time-consuming labeling of video data
 Adaptability to different types of video; summarization is learned based on the video content
Contributions
 Introduce an attention mechanism in an unsupervised learning framework, whereas all
previous attention-based summarization methods ([7-9, 13]) were supervised
 Investigate the integration of an attention mechanism into a variational auto-encoder for video
summarization purposes
 Use attention to guide the generative adversarial training of the model, rather than using it to
rank the video fragments as in [9]
Developed approach
Building on adversarial learning
 Starting point: the SUM-GAN architecture
 Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it based on the selected keyframes
 Problem: how to define a good distance?
 Solution: use a trainable discriminator network!
 Goal: train the Summarizer to maximally confuse the Discriminator when distinguishing the original from the reconstructed video
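This adversarial objective corresponds to the standard GAN minimax game; a sketch with assumed notation (the slides do not give the formula), where x is the deep representation of the original video, x̂ its summary-based reconstruction, S the Summarizer and D the Discriminator:

\[
\min_{S}\ \max_{D}\ \ \mathbb{E}_{x}\big[\log D(x)\big] \;+\; \mathbb{E}_{x}\big[\log\big(1 - D(\hat{x})\big)\big]
\]

The trained discriminator thus stands in for the otherwise hard-to-define distance between a video and its reconstruction.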
Developed approach
Building on adversarial learning
 SUM-GAN-sl:
 Contains a linear compression layer that reduces the size of the CNN feature vectors
 Follows an incremental and fine-grained approach to train the model’s components, controlled by a regularization factor
Developed approach
Introducing an attention mechanism
 Examined approaches:
1) Integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl
2) Replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
Developed approach
1) Adversarial learning driven by a variational attention auto-encoder
 Variational attention was described in [4] and used for natural language modeling
 It models the attention vector as Gaussian distributed random variables
Developed approach
1) Adversarial learning driven by a variational attention auto-encoder
 Extended SUM-GAN-sl with variational attention, forming the SUM-GAN-VAAE architecture
 The attention weights for each frame are handled as random variables, and a latent space is computed for these values too
 At every time-step t, the attention component combines the encoder’s output at t and the decoder’s hidden state at t-1 to compute an attention weight vector
 The decoder is modified to update its hidden states based on both latent spaces during the reconstruction of the video
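In symbols, a sketch following [4] (the exact priors and weighting factors are assumptions, not given on the slides): each attention weight vector a_t is treated as a Gaussian latent variable, so a KL term for the attention latent space is added to the usual VAE objective alongside the one for the frame latent space z:

\[
a_t \sim \mathcal{N}\big(\mu_t,\ \operatorname{diag}(\sigma_t^2)\big), \qquad
\mathcal{L} = \mathcal{L}_{\mathrm{rec}}
\;+\; \lambda_1\, D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)
\;+\; \lambda_2\, D_{\mathrm{KL}}\big(q(a \mid x)\,\|\,p(a)\big)
\]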
Developed approach
2) Adversarial learning driven by a deterministic attention auto-encoder
 Inspired by the efficiency of the attention-based encoder-decoder network in [13]
 Built on the findings of [4] w.r.t. the impact of deterministic attention on the VAE
 The VAE is entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
Developed approach
2) Adversarial learning driven by a deterministic attention auto-encoder
Attention auto-encoder: processing pipeline
 Weighted feature vectors are fed to the Encoder
 The Encoder’s output (V) and the Decoder’s previous hidden state are fed to the Attention component
 For t > 1: use the hidden state of the previous Decoder step (ht-1)
 For t = 1: use the hidden state of the last Encoder step (He)
 Attention weights (αt) are computed using:
 An energy score function
 A soft-max function
 αt is multiplied with V to form the context vector vt’
 vt’ is combined with the Decoder’s previous output yt-1
 The Decoder gradually reconstructs the video
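One attention step of this pipeline can be sketched in plain Python. This is illustrative only: a simple dot product stands in for the model’s learned energy score function, and the vectors are toy values.

```python
import math

def softmax(xs):
    # Numerically stable soft-max over a list of energy scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_step(V, h_prev, energy):
    """One attention step: score every encoder output in V against the
    decoder's previous hidden state, normalise the scores with soft-max,
    and form the context vector v_t' as the weighted sum of V."""
    scores = [energy(h_prev, v_i) for v_i in V]
    alphas = softmax(scores)
    dim = len(V[0])
    context = [sum(a * v_i[d] for a, v_i in zip(alphas, V)) for d in range(dim)]
    return alphas, context

# Toy stand-in for the learned energy score function: a plain dot product
def dot(h, v):
    return sum(hi * vi for hi, vi in zip(h, v))

V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder outputs, one per time-step
alphas, ctx = attention_step(V, [2.0, 0.0], dot)  # second arg: decoder's previous state
```

In the real model the energy function is learned and the context vector is combined with the Decoder’s previous output before the next reconstruction step.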
Developed approach
Model’s I/O and summarization process
 Input: the CNN feature vectors of the (sampled) video frames
 Output: frame-level importance scores
 Summarization process:
 CNN features pass through the linear compression layer and the frame selector, and importance scores are computed at the frame level
 Given a video segmentation (computed with KTS [18]), fragment-level importance scores are calculated by averaging the scores of each fragment’s frames
 The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
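The fragment-selection step above can be sketched as follows. This is an illustrative implementation of the averaging and 0/1 Knapsack steps; the fragment boundaries and scores are toy values, and lengths are measured in frames.

```python
def fragment_scores(frame_scores, fragments):
    """Average frame-level importance scores within each fragment.
    `fragments` is a list of (start, end) frame indices, end exclusive."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in fragments]

def knapsack_select(values, lengths, budget):
    """0/1 Knapsack via dynamic programming: pick the set of fragments
    that maximises total importance without exceeding `budget` frames."""
    n = len(values)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[i - 1] <= b:
                cand = dp[i - 1][b - lengths[i - 1]] + values[i - 1]
                if cand > dp[i][b]:
                    dp[i][b] = cand
    # Backtrack to recover the indices of the selected fragments
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

# Toy example: a 100-frame video with a 15% budget (15 frames)
frames = [0.1] * 40 + [0.9] * 30 + [0.2] * 30
frags = [(0, 40), (40, 55), (55, 70), (70, 100)]
vals = fragment_scores(frames, frags)
lengths = [e - s for s, e in frags]
selected = knapsack_select(vals, lengths, budget=15)
```

Only the high-scoring fragments that fit within the 15% budget survive the selection.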
Experiments
Datasets
 SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
 25 videos capturing multiple events (e.g. cooking and sports)
 video length: 1.6 to 6.5 min
 annotation: fragment-based video summaries
 TVSum (https://github.com/yalesong/tvsum)
 50 videos from 10 categories of the TRECVid MED task
 video length: 1 to 5 min
 annotation: frame-level importance scores
Experiments
Evaluation protocol
 The generated summary should not exceed 15% of the video length
 Similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision and Recall measuring their temporal overlap (∩), where || · || denotes duration:
P = ||A ∩ G|| / ||A||, R = ||A ∩ G|| / ||G||, F = 2 · P · R / (P + R)
 These are the typical metrics for computing Precision and Recall at the frame level
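A minimal sketch of the frame-level computation, with summaries represented as binary frame-inclusion vectors (toy values):

```python
def fscore(summary, ground_truth):
    """Frame-level F-Score between two binary keyframe selections
    (1 = frame included in the summary)."""
    overlap = sum(a and g for a, g in zip(summary, ground_truth))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(summary)   # |A ∩ G| / |A|
    recall = overlap / sum(ground_truth) # |A ∩ G| / |G|
    return 2 * precision * recall / (precision + recall)

A = [1, 1, 0, 0, 1, 0]  # automatically generated summary
G = [1, 0, 0, 0, 1, 1]  # user (ground-truth) summary
score = fscore(A, G)
```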
Experiments
Evaluation protocol
 Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
 Most-used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33]): the generated summary is compared against each of the N available user summaries of a video, producing F-Score1, F-Score2, …, F-ScoreN; these are then aggregated per video (for SumMe the maximum is kept, for TVSum the average is taken)
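A sketch of this protocol (the per-dataset aggregation rule, maximum for SumMe and average for TVSum, is assumed from the standard evaluation convention of the cited works; the data are toy values):

```python
def fscore(a, g):
    # Frame-level F-Score between binary frame-inclusion vectors
    o = sum(x and y for x, y in zip(a, g))
    if o == 0:
        return 0.0
    p, r = o / sum(a), o / sum(g)
    return 2 * p * r / (p + r)

def eval_multiuser(summary, user_summaries, dataset):
    """Per-video score from the N user summaries: keep the maximum
    F-Score for SumMe, the average for TVSum (assumed convention)."""
    scores = [fscore(summary, g) for g in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)

users = [[1, 1, 0, 0], [0, 1, 1, 0]]  # two user summaries of the same video
s = [1, 1, 0, 0]                      # generated summary
summe_score = eval_multiuser(s, users, "SumMe")
tvsum_score = eval_multiuser(s, users, "TVSum")
```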
Experiments
Evaluation protocol
 Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
 Alternative approach (used in [9, 13, 16, 24, 25, 28]): a single ground-truth summary per video is built from the user annotations, and a single F-Score is computed against it
Experiments
Implementation details
 Videos were down-sampled to 2 fps
 Feature extraction was based on the pool5 layer of GoogLeNet, trained on ImageNet
 The linear compression layer reduces the size of these vectors from 1024 to 500
 All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
 Training was based on the Adam optimizer; the Summarizer’s learning rate is 10^-4 and the Discriminator’s learning rate is 10^-5
 Each dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
 Experiments were run on 5 differently created random splits, and the average performance at the training-epoch level (i.e. for the same training epoch) over these runs is reported
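The data partitioning over 5 random splits can be sketched as follows (illustrative only; the split sizes follow the slide, while the seeding is an assumption):

```python
import random

def make_splits(n_videos, n_splits=5, train_frac=0.8, seed=0):
    """Create `n_splits` random, non-overlapping 80/20 train/test
    partitions of the video indices; results are averaged over splits."""
    rng = random.Random(seed)  # deterministic for reproducibility
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_videos))
        rng.shuffle(idx)
        cut = int(round(train_frac * n_videos))
        splits.append((sorted(idx[:cut]), sorted(idx[cut:])))
    return splits

splits = make_splits(50)  # e.g. the 50 TVSum videos
```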
Experiments
 Step 1: Assessing the impact of the regularization factor σ
(F-Score curves over σ for SUM-GAN-sl, SUM-GAN-VAAE and SUM-GAN-AAE)
 Outcomes:
 The value of σ affects the models’ performance and needs fine-tuning
 This fine-tuning is dataset-dependent
 The best overall performance of each model is observed for a different σ value
Experiments
 Step 2: Comparison with SoA unsupervised approaches based on multiple user summaries
 Outcomes:
 A few SoA methods are comparable to (or even worse than) a random summary generator
 The best method on TVSum shows random-level performance on SumMe
 The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
 Variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of efficiently defining two latent spaces in parallel with the continuous update of the model’s components during training
 Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
(In the results table, +/- indicate better/worse performance compared to SUM-GAN-AAE; SUM-GAN is not listed as it follows the single gt-summary evaluation protocol)
Experiments
 Step 3: Evaluating the effect of the introduced AAE component
 Key-fragment selection: the attention mechanism leads to a much smoother series of importance scores
Experiments
 Step 3: Evaluating the effect of the introduced AAE component
 Training efficiency: much faster and more stable training of the model
(Loss curves for SUM-GAN-sl and SUM-GAN-AAE)
Experiments
 Step 4: Comparison with SoA supervised approaches based on multiple user summaries
 Outcomes:
 The best methods on TVSum (MAVS and the supervised Tessellation variant, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
 Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
 The performance of these methods ranges between 44.1-49.7 on SumMe and 56.1-61.4 on TVSum
 The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
Experiments
 Step 5: Comparison with SoA approaches based on single ground-truth summaries
 Impact of the regularization factor σ (best scores in bold):
 The model’s performance is affected by the value of σ
 The effect of σ depends (also) on the evaluation approach; the best performance when using multiple human summaries was observed for σ = 0.15
 SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value of σ
Experiments
 Step 5: Comparison with SoA approaches based on single ground-truth summaries
 Outcomes:
 The SUM-GAN-AAE model performs consistently well on both datasets
 SUM-GAN-AAE shows advanced performance compared to SoA supervised and unsupervised summarization methods (unsupervised approaches are marked with an asterisk in the results table)
Summarization example
Full video Generated summary
Conclusions
 Presented a video summarization method that combines:
 The effectiveness of attention mechanisms in spotting the most important parts of the video
 The learning efficiency of generative adversarial networks for unsupervised training
 Experimental evaluations on two benchmark datasets:
 Documented the positive contribution of the introduced attention auto-encoder component to the model’s training and summarization performance
 Highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
Key references
1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In: AI4TV, ACM MM 2019
2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014. pp. 6583-6587
3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018. pp. 29-41
4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING. pp. 1672-1682 (2018)
5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary, (last accessed on Oct. 18, 2019)
6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019. pp. 754-763
7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018. pp. 39-54
8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018. pp. 976-983
9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019. pp. 1579-1587
10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014. pp. 505-520
11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015. pp. 3090-3098
12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video
Technology (2019)
14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017. pp. 94-104
15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018. pp. 1410-1419
16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017. pp. 2982-2991
17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016. pp. 361-377
18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014. pp. 540-555
19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018. pp. 358-374
22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015. pp. 5179-5187
23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015. pp. 1-9
24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018. pp. 216-223
25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019. pp. 9143-9150
26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology
29(1), 226-237 (2019)
27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016. pp. 766-782
28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019. pp. 89:1-89:6
29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017. pp. 863-871
31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018. pp. 7405-7414
32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI
2018. pp. 7582-7589
33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/SUM-GAN-AAE
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV. The work of Ioannis Patras has been supported by EPSRC under grant No. EP/R026424/1.

Implementing Artificial Intelligence Strategies for Content Annotation and Pu...Implementing Artificial Intelligence Strategies for Content Annotation and Pu...
Implementing Artificial Intelligence Strategies for Content Annotation and Pu...ReTV project
 
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...INFOGAIN PUBLICATION
 
Using TV Metadata to optimise the repurposing and republication of TV Content...
Using TV Metadata to optimise the repurposing and republication of TV Content...Using TV Metadata to optimise the repurposing and republication of TV Content...
Using TV Metadata to optimise the repurposing and republication of TV Content...ReTV project
 
ReTV: Bringing Broadcaster Archives to the 21st-century Audiences
 ReTV: Bringing Broadcaster Archives to the 21st-century Audiences ReTV: Bringing Broadcaster Archives to the 21st-century Audiences
ReTV: Bringing Broadcaster Archives to the 21st-century AudiencesReTV project
 
CA-SUM Video Summarization
CA-SUM Video SummarizationCA-SUM Video Summarization
CA-SUM Video SummarizationVasileiosMezaris
 
Software Tools for Building Industry 4.0 Applications
Software Tools for Building Industry 4.0 ApplicationsSoftware Tools for Building Industry 4.0 Applications
Software Tools for Building Industry 4.0 ApplicationsPankesh Patel
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarizationVasileiosMezaris
 
Paper id 36201508
Paper id 36201508Paper id 36201508
Paper id 36201508IJRAT
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...Edge AI and Vision Alliance
 
Paper id 2120148
Paper id 2120148Paper id 2120148
Paper id 2120148IJRAT
 
UVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxUVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxnikitha992646
 
COCOMO methods for software size estimation
COCOMO methods for software size estimationCOCOMO methods for software size estimation
COCOMO methods for software size estimationPramod Parajuli
 

Ähnlich wie Unsupervised Video Summarization via Attention-Driven Adversarial Learning (20)

ReTV AI4TV Summarization
ReTV AI4TV SummarizationReTV AI4TV Summarization
ReTV AI4TV Summarization
 
PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020
 
Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1Icme2020 tutorial video_summarization_part1
Icme2020 tutorial video_summarization_part1
 
ICME 2020 Tutorial Part II: Video summary (re-)use and recommendation
ICME 2020 Tutorial Part II: Video summary (re-)use and recommendationICME 2020 Tutorial Part II: Video summary (re-)use and recommendation
ICME 2020 Tutorial Part II: Video summary (re-)use and recommendation
 
OOMEN MEZARIS ReTV
OOMEN MEZARIS ReTVOOMEN MEZARIS ReTV
OOMEN MEZARIS ReTV
 
Implementing artificial intelligence strategies for content annotation and pu...
Implementing artificial intelligence strategies for content annotation and pu...Implementing artificial intelligence strategies for content annotation and pu...
Implementing artificial intelligence strategies for content annotation and pu...
 
Implementing Artificial Intelligence Strategies for Content Annotation and Pu...
Implementing Artificial Intelligence Strategies for Content Annotation and Pu...Implementing Artificial Intelligence Strategies for Content Annotation and Pu...
Implementing Artificial Intelligence Strategies for Content Annotation and Pu...
 
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
 
Mini Project- Digital Video Editing
Mini Project- Digital Video EditingMini Project- Digital Video Editing
Mini Project- Digital Video Editing
 
Keyframe-based Video Summarization Designer
Keyframe-based Video Summarization DesignerKeyframe-based Video Summarization Designer
Keyframe-based Video Summarization Designer
 
Using TV Metadata to optimise the repurposing and republication of TV Content...
Using TV Metadata to optimise the repurposing and republication of TV Content...Using TV Metadata to optimise the repurposing and republication of TV Content...
Using TV Metadata to optimise the repurposing and republication of TV Content...
 
ReTV: Bringing Broadcaster Archives to the 21st-century Audiences
 ReTV: Bringing Broadcaster Archives to the 21st-century Audiences ReTV: Bringing Broadcaster Archives to the 21st-century Audiences
ReTV: Bringing Broadcaster Archives to the 21st-century Audiences
 
CA-SUM Video Summarization
CA-SUM Video SummarizationCA-SUM Video Summarization
CA-SUM Video Summarization
 
Software Tools for Building Industry 4.0 Applications
Software Tools for Building Industry 4.0 ApplicationsSoftware Tools for Building Industry 4.0 Applications
Software Tools for Building Industry 4.0 Applications
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarization
 
Paper id 36201508
Paper id 36201508Paper id 36201508
Paper id 36201508
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
 
Paper id 2120148
Paper id 2120148Paper id 2120148
Paper id 2120148
 
UVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxUVM_Full_Print_n.pptx
UVM_Full_Print_n.pptx
 
COCOMO methods for software size estimation
COCOMO methods for software size estimationCOCOMO methods for software size estimation
COCOMO methods for software size estimation
 

Mehr von VasileiosMezaris

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationVasileiosMezaris
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskVasileiosMezaris
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosVasileiosMezaris
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...VasileiosMezaris
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022VasileiosMezaris
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsVasileiosMezaris
 
Explaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionExplaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionVasileiosMezaris
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchVasileiosMezaris
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersVasileiosMezaris
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...VasileiosMezaris
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...VasileiosMezaris
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web applicationVasileiosMezaris
 
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalHard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalVasileiosMezaris
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AIVasileiosMezaris
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrievalVasileiosMezaris
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruningVasileiosMezaris
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...VasileiosMezaris
 

Mehr von VasileiosMezaris (20)

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and Localization
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages Task
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees Videos
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for Explanations
 
Gated-ViGAT
Gated-ViGATGated-ViGAT
Gated-ViGAT
 
Explaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attentionExplaining video summarization based on the focus of attention
Explaining video summarization based on the focus of attention
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video Search
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web application
 
Video Thumbnail Selector
Video Thumbnail SelectorVideo Thumbnail Selector
Video Thumbnail Selector
 
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalHard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AI
 
LSTM Structured Pruning
LSTM Structured PruningLSTM Structured Pruning
LSTM Structured Pruning
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrieval
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruning
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...
 

Kürzlich hochgeladen

Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 

Kürzlich hochgeladen (20)

Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 

Unsupervised Video Summarization via Attention-Driven Adversarial Learning

Related work

Deep-learning approaches
 Supervised methods that use feedforward neural nets (e.g. CNNs) to extract and exploit video semantics for identifying the important video parts, based on sequence labeling [21], self-attention networks [7], or video-level metadata [17]
 Supervised approaches that capture the story flow using recurrent neural nets (e.g. LSTMs)
   In combination with statistical models to select a representative and diverse set of keyframes [27]
   In hierarchies to identify the video structure and select key-fragments [30, 31]
   In combination with DTR units and GANs to capture long-range frame dependencies [28]
   To form attention-based encoder-decoders [9, 13], or memory-augmented networks [8]
 Unsupervised algorithms that do not rely on human annotations and build summaries
   Using adversarial learning to: minimize the distance between videos and their summary-based reconstructions [1, 16]; maximize the mutual information between summary and video [25]; learn a mapping from raw videos to human-like summaries based on summaries available online [20]
   Through a decision-making process that is learned via RL and reward functions [32]
   By learning to extract the key motions of appearing objects [29]
Motivation

Disadvantages of supervised learning
 Only a restricted amount of annotated data is available for the supervised training of a video summarization method
 Video summarization is highly subjective (it depends on the viewer's demands and aesthetics); there is no "ideal" or commonly accepted summary that could be used for training an algorithm

Advantages of unsupervised learning
 No need for training data; the laborious and time-demanding labeling of video data is avoided
 Adaptability to different types of video, as summarization is learned based on the video content
Contributions

 Introduce an attention mechanism in an unsupervised learning framework, whereas all previous attention-based summarization methods ([7-9, 13]) were supervised
 Investigate the integration of an attention mechanism into a variational auto-encoder for video summarization purposes
 Use attention to guide the generative adversarial training of the model, rather than using it to rank the video fragments as in [9]
Developed approach

Building on adversarial learning
 Starting point: the SUM-GAN architecture
 Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it based on the selected keyframes
 Problem: how to define a good distance?
 Solution: use a trainable discriminator network!
 Goal: train the Summarizer to maximally confuse the Discriminator when distinguishing the original from the reconstructed video
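The adversarial objective above can be sketched numerically. The following is an illustrative NumPy sketch, not the authors' implementation: the discriminator scores and the binary cross-entropy form of the losses are assumptions chosen for illustration.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between a scalar probability p and a target label y."""
    eps = 1e-12  # numerical guard against log(0)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Hypothetical discriminator outputs: probability that the input is an
# "original" (not reconstructed) video representation.
p_original = 0.9       # discriminator's score for the original video
p_reconstructed = 0.2  # discriminator's score for the summary-based reconstruction

# Discriminator objective: label originals 1 and reconstructions 0.
d_loss = bce(p_original, 1.0) + bce(p_reconstructed, 0.0)

# Summarizer objective: confuse the discriminator, i.e. push its score
# for the reconstruction towards 1 (indistinguishable from the original).
g_loss = bce(p_reconstructed, 1.0)
```

Training alternates between the two objectives: the discriminator sharpens its notion of "distance", while the summarizer selects keyframes whose reconstruction drives that distance down.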
Developed approach

Building on adversarial learning
 SUM-GAN-sl:
   Contains a linear compression layer that reduces the size of the CNN feature vectors
   Follows an incremental and fine-grained approach to train the model's components (with a regularization factor)
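The linear compression layer amounts to one trainable matrix multiplication applied to every frame's feature vector. A minimal sketch; the dimensions (1024-d CNN features compressed to 500-d for 120 frames) are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frame-level CNN features for T frames (dimensions are illustrative).
T, d_in, d_out = 120, 1024, 500
features = rng.standard_normal((T, d_in))

# Trainable linear compression layer: a single weight matrix (plus bias)
# that shrinks every frame vector before it enters the summarizer,
# reducing the model size of all downstream recurrent components.
W = rng.standard_normal((d_in, d_out)) * 0.01
b = np.zeros(d_out)

compressed = features @ W + b  # shape (120, 500)
```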
Developed approach

Introducing an attention mechanism
 Examined approaches:
 1) Integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl
 2) Replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
Developed approach

1) Adversarial learning driven by a variational attention auto-encoder
 Variational attention was described in [4] and used for natural language modeling
 It models the attention vector as Gaussian-distributed random variables
[Figure: variational auto-encoder]
  • 16. Developed approach: 1) Adversarial learning driven by a variational attention auto-encoder
     Extended SUM-GAN-sl with variational attention, forming the SUM-GAN-VAAE architecture
     The attention weights for each frame are handled as random variables, and a latent space is computed for these values too
     At every time-step t, the attention component combines the encoder’s output at t and the decoder’s hidden state at t-1 to compute an attention weight vector
     The decoder is modified to update its hidden states based on both latent spaces during the reconstruction of the video
    [Figure: Variational attention auto-encoder]
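The Gaussian modeling of the attention values can be illustrated with a minimal reparameterization sketch in pure Python. All names are illustrative assumptions; this is not the authors' implementation, which operates on latent spaces inside the adversarial architecture:

```python
import math
import random

def sample_variational_attention(mu, log_var, rng=None):
    """Treat each frame's attention value as a Gaussian random variable
    and draw a sample with the reparameterization trick:
        z = mu + sigma * eps,  eps ~ N(0, 1)
    so that, in a real (differentiable) implementation, gradients could
    flow through mu and log_var. Inputs are per-frame lists of means and
    log-variances (illustrative representation)."""
    rng = rng or random.Random(0)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

As the variance shrinks, the sampled attention values collapse to their means, recovering deterministic attention as a limiting case.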
  • 17. Developed approach: 2) Adversarial learning driven by a deterministic attention auto-encoder
     Inspired by the efficiency of the attention-based encoder-decoder network in [13]
     Built on the findings of [4] w.r.t. the impact of deterministic attention on the VAE
     The VAE is entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
  • 18. Developed approach: 2) Adversarial learning driven by a deterministic attention auto-encoder
    Attention auto-encoder processing pipeline:
     Weighted feature vectors are fed to the Encoder
     The Encoder’s output (V) and the Decoder’s previous hidden state are fed to the Attention component
       For t > 1: use the hidden state of the Decoder’s previous step (ht-1)
       For t = 1: use the hidden state of the Encoder’s last step (he)
     Attention weights (αt) are computed using an energy score function followed by a soft-max function
     αt is multiplied with V to form the context vector vt’
     vt’ is combined with the Decoder’s previous output yt-1
     The Decoder gradually reconstructs the video
    [Figure: Attention auto-encoder]
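One attention step of the pipeline above can be sketched in plain Python. The dot-product energy score is a stand-in (the paper's exact score function may differ), and all names are assumptions:

```python
import math

def attention_step(enc_outputs, dec_hidden):
    """One attention step: given the Encoder's output sequence V
    (enc_outputs, a list of vectors) and the Decoder's previous hidden
    state (dec_hidden; the Encoder's last hidden state at t = 1),
    return the attention weights a_t and the context vector v_t'."""
    # Energy score function: a simple dot product here (illustrative).
    energies = [sum(v_i * h_i for v_i, h_i in zip(v, dec_hidden))
                for v in enc_outputs]
    # Soft-max turns the energies into attention weights a_t.
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]
    # a_t multiplied with V forms the context vector v_t'.
    dim = len(enc_outputs[0])
    context = [sum(w * v[d] for w, v in zip(weights, enc_outputs))
               for d in range(dim)]
    return weights, context
```

In the full model, v_t' is then combined with the Decoder's previous output y_{t-1} before the Decoder advances one reconstruction step.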
  • 26. Developed approach: Model’s I/O and summarization process
     Input: the CNN feature vectors of the (sampled) video frames
     Output: frame-level importance scores
     Summarization process:
     CNN features pass through the linear compression layer and the frame selector  importance scores computed at the frame level
     Given a video segmentation (using KTS [18]), calculate fragment-level importance scores by averaging the scores of each fragment’s frames
     The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
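The fragment-scoring and Knapsack-based selection steps can be sketched as a minimal dynamic-programming routine. Names and the exact objective are illustrative assumptions, not the authors' code:

```python
def select_fragments(frame_scores, fragments, max_len):
    """Key-fragment selection sketch: average the frame-level scores of
    each fragment, then solve a 0/1 Knapsack so the selected fragments
    maximize total importance without exceeding max_len frames (e.g.
    15% of the video duration). `fragments` are (start, end) index
    pairs from a temporal segmentation such as KTS."""
    # Fragment-level importance = mean of its frames' scores.
    items = []
    for (s, e) in fragments:
        length = e - s
        value = sum(frame_scores[s:e]) / length
        items.append((value, length))
    # Classic 0/1 Knapsack DP over the summary-length budget.
    n = len(items)
    dp = [[0.0] * (max_len + 1) for _ in range(n + 1)]
    for i, (value, length) in enumerate(items, 1):
        for c in range(max_len + 1):
            dp[i][c] = dp[i - 1][c]
            if length <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - length] + value)
    # Backtrack to recover the chosen fragment indices.
    chosen, c = [], max_len
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= items[i - 1][1]
    return sorted(chosen)
```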
  • 27. Experiments: Datasets
     SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
     25 videos capturing multiple events (e.g. cooking and sports)
     Video length: 1.6 to 6.5 min
     Annotation: fragment-based video summaries
     TVSum (https://github.com/yalesong/tvsum)
     50 videos from 10 categories of the TRECVid MED task
     Video length: 1 to 5 min
     Annotation: frame-level importance scores
  • 28. Experiments: Evaluation protocol
     The generated summary should not exceed 15% of the video length
     Similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision and Recall measuring the temporal overlap (∩) between them (|| · || denotes duration):
      P = ||A ∩ G|| / ||A||,  R = ||A ∩ G|| / ||G||,  F = 2 · P · R / (P + R) × 100
     These are the typical metrics for computing Precision and Recall at the frame level
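With summaries represented as sets of (sampled) frame indices, the frame-level F-Score reduces to a few lines (illustrative names):

```python
def f_score(auto_frames, gt_frames):
    """Frame-level F-Score (%) between an automatic summary A and a
    ground-truth summary G, with Precision and Recall measuring their
    temporal overlap. Both summaries are sets of frame indices."""
    overlap = len(auto_frames & gt_frames)      # |A ∩ G|
    if overlap == 0:
        return 0.0
    precision = overlap / len(auto_frames)      # P = |A ∩ G| / |A|
    recall = overlap / len(gt_frames)           # R = |A ∩ G| / |G|
    return 100 * 2 * precision * recall / (precision + recall)
```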
  • 29. Experiments: Evaluation protocol
     Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
     Most used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33]): an F-Score is computed against each of the N available user summaries of a video (F-Score1, F-Score2, …, F-ScoreN), and these are combined into a single score per video
     SumMe: keep the maximum of the N F-Scores; TVSum: keep the average of the N F-Scores
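A sketch of how the per-user F-Scores are combined into one score per video, following the protocol as commonly reported in the literature (maximum for SumMe, average for TVSum); names are illustrative, not the authors' code:

```python
def combined_f_score(auto_frames, user_summaries, dataset):
    """Compute one F-Score per user summary, then reduce to a single
    per-video score: the maximum for SumMe, the average for TVSum.
    Summaries are sets of frame indices (illustrative representation)."""
    def f1(a, g):
        o = len(a & g)
        if o == 0:
            return 0.0
        p, r = o / len(a), o / len(g)
        return 100 * 2 * p * r / (p + r)

    scores = [f1(auto_frames, g) for g in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)
```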
  • 35. Experiments: Evaluation protocol
     Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
     Alternative approach (used in [9, 13, 16, 24, 25, 28]): a single ground-truth summary is defined per video, and one F-Score is computed against it
  • 37. Experiments: Implementation details
     Videos were down-sampled to 2 fps
     Feature extraction was based on the pool5 layer of GoogleNet trained on ImageNet
     The linear compression layer reduces the size of these vectors from 1024 to 500
     All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
     Training based on the Adam optimizer; Summarizer’s learning rate = 10^-4; Discriminator’s learning rate = 10^-5
     The dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
     Experiments were run on 5 differently created random splits, and the average performance at the training-epoch level (i.e. for the same training epoch) over these runs is reported
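The 5 random, non-overlapping 80/20 splits can be produced with a simple helper like the following (an illustrative sketch, not the authors' code; function and parameter names are assumptions):

```python
import random

def make_splits(num_videos, num_splits=5, train_ratio=0.8, seed=0):
    """Create `num_splits` differently created random splits of the
    dataset, each with 80% of the videos for training and the remaining
    20% for testing; the two sets of each split are non-overlapping."""
    rng = random.Random(seed)
    splits = []
    for _ in range(num_splits):
        idx = list(range(num_videos))
        rng.shuffle(idx)
        cut = int(train_ratio * num_videos)
        splits.append((sorted(idx[:cut]), sorted(idx[cut:])))
    return splits
```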
  • 38. Experiments: Step 1 — assessing the impact of the regularization factor σ
    [Figures: performance curves for SUM-GAN-sl, SUM-GAN-VAAE and SUM-GAN-AAE]
     Outcomes:
     The value of σ affects the models’ performance and needs fine-tuning
     Fine-tuning is dataset-dependent
     The best overall performance of each model is observed for a different σ value
  • 40. Experiments: Step 2 — comparison with SoA unsupervised approaches based on multiple user summaries
     Outcomes:
     A few SoA methods are comparable to (or even worse than) a random summary generator
     The best method on TVSum shows random-level performance on SumMe
     The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
     Variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of defining two latent spaces in parallel with the continuous update of the model’s components during training
     Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
    +/- indicate better/worse performance compared to SUM-GAN-AAE
    Note: SUM-GAN is not listed in this table as it follows the single ground-truth-summary evaluation protocol
  • 41. Experiments: Step 3 — evaluating the effect of the introduced AAE component
     Key-fragment selection: the attention mechanism leads to a much smoother series of importance scores
  • 42. Experiments: Step 3 — evaluating the effect of the introduced AAE component
     Training efficiency: much faster and more stable training of the model
    [Figure: loss curves for SUM-GAN-sl and SUM-GAN-AAE]
  • 43. Experiments: Step 4 — comparison with SoA supervised approaches based on multiple user summaries
     Outcomes:
     The best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
     Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
     The performance of these methods ranges between 44.1-49.7 on SumMe, and 56.1-61.4 on TVSum
     The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
  • 44. Experiments: Step 5 — comparison with SoA approaches based on single ground-truth summaries
     Impact of the regularization factor σ (best scores in bold):
     The model’s performance is affected by the value of σ
     The effect of σ depends (also) on the evaluation approach; the best performance when using multiple human summaries was observed for σ = 0.15
     SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value of σ
  • 45. Experiments: Step 5 — comparison with SoA approaches based on single ground-truth summaries
     Outcomes:
     The SUM-GAN-AAE model performs consistently well on both datasets
     SUM-GAN-AAE shows advanced performance compared to SoA supervised and unsupervised (*) summarization methods
    Unsupervised approaches are marked with an asterisk
  • 46. Summarization example
    [Videos: full video and generated summary]
  • 47. Conclusions
     Presented a video summarization method that combines:
     The effectiveness of attention mechanisms in spotting the most important parts of the video
     The learning efficiency of generative adversarial networks for unsupervised training
     Experimental evaluations on two benchmark datasets:
     Documented the positive contribution of the introduced attention auto-encoder component to the model’s training and summarization performance
     Highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
  • 48. Key references
    1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In: AI4TV, ACM MM 2019
    2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014. pp. 6583-6587
    3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018. pp. 29-41
    4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING. pp. 1672-1682 (2018)
    5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary (last accessed on Oct. 18, 2019)
    6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019. pp. 754-763
    7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018. pp. 39-54
    8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018. pp. 976-983
    9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019. pp. 1579-1587
    10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014. pp. 505-520
    11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015. pp. 3090-3098
    12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
    13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video Technology (2019)
    14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017. pp. 94-104
    15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018. pp. 1410-1419
    16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017. pp. 2982-2991
    17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016. pp. 361-377
  • 49. Key references (cont.)
    18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014. pp. 540-555
    19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
    20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
    21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018. pp. 358-374
    22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015. pp. 5179-5187
    23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015. pp. 1-9
    24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018. pp. 216-223
    25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019. pp. 9143-9150
    26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology 29(1), 226-237 (2019)
    27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016. pp. 766-782
    28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019. pp. 89:1-89:6
    29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
    30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017. pp. 863-871
    31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018. pp. 7405-7414
    32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI 2018. pp. 7582-7589
    33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
  • 50. Thank you for your attention! Questions?
    Vasileios Mezaris, bmezaris@iti.gr
    Code and documentation publicly available at: https://github.com/e-apostolidis/SUM-GAN-AAE
    This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV. The work of Ioannis Patras has been supported by EPSRC under grant No. EP/R026424/1.