
Icme2020 tutorial video_summarization_part1


Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.


  1. Video Summarization and Re-use Technologies and Tools. Part I: Automatic video summarization. Section I.1: Video summarization problem definition and literature overview. Vasileios Mezaris, Evlampios Apostolidis (CERTH-ITI, Greece). Tutorial at IEEE ICME 2020. Project: retv-project.eu, @ReTV_EU, @ReTVproject, retv-project, retv_project.
  2. Tutorial’s structure and time schedule. Part I: Automatic video summarization
     - Section I.1: Video summarization problem definition and literature overview (20’)
     - Q&A (5’)
     - Section I.2: In-depth discussion on a few unsupervised GAN-based methods (20’)
     - Q&A (5’)
     - Section I.3: Datasets, evaluation protocols and results, and future directions (20’)
     - 20’ Q&A and break, followed by the tutorial’s Part II: Video summaries re-use and recommendation
  3. Problem definition: Video is everywhere!
     - Hours of video content uploaded on YouTube every minute
     - Captured by smart devices and instantly shared online
     - Constantly and rapidly increasing volumes of video content
     Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
  4. Problem definition - video consumption side: But how to find what we are looking for in endless collections of video content? Quickly inspect a video’s content by checking its synopsis! Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
  6. Problem definition - video editing side: But how to reach different audiences for a given media item? Audience reactions shown in the figure: "Good", "Very interesting", "Boring", "Nice", "Much detailed". Use of technologies for content adaptation, re-use and re-purposing! Image source: https://marketingland.com/social-media-audience-critical-content-marketing-223647
  8. Problem definition. Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video. Figure: original video and its video summary (storyboard). Source: https://www.youtube.com/watch?v=deRF9oEbRso
  9. Problem definition: General applications of video summarization
     - Professional CMS: effective indexing, browsing, retrieval & promotion of media assets!
     - Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption!
     Sources: https://www.redbytes.in/how-to-build-an-app-like-hotstar/ and a screenshot of the BBC News channel on YouTube
  10. Problem definition: General applications of video summarization. Audience- and channel-specific content adaptation: video content re-use and re-distribution in the most appropriate way! Image source: https://www.databagg.com/online-video-sharing
  11. Problem definition: Domain-specific applications of video summarization. Full movie (e.g. 1h 30’-2h) => movie trailer (2’30’’). J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808. Source: https://www.youtube.com/watch?v=wb49-oV0F78
  12. Problem definition: Domain-specific applications of video summarization. Full game (e.g. 1h 30’) => game’s synopsis & highlights (1’32’’). Source: https://www.youtube.com/watch?v=oo-2IFTifUU
  13. Problem definition: Domain-specific applications of video summarization. Raw CCTV material (e.g. 24h) => summary of important actions/events (with timestamps). Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk
  14. Literature overview: Taxonomy of deep-learning-based methods for automatic video summarization
  15. Literature overview. Supervised approaches: using video semantics and metadata
     - [Zhang, 2016; Kaufman, 2017] learn and transfer the summary structure of semantically-similar videos
     - [Panda, 2017] metadata-driven video categorization and summarization by maximizing relevance with the video category
     - [Song, 2016; Zhou, 2018a] category-driven summarization by category feature preservation (e.g. keep the main parts of a wedding when summarizing a wedding video)
     - [Otani, 2016; Yuan, 2019] maximize relevance of visual (video) and textual (metadata) data in a common latent space
  16. Literature overview. Supervised approaches: considering temporal structure and dependency
     - [Zhang, 2016b] estimates frames’ importance by modeling their variable-range temporal dependency using RNNs
     - [Zhao, 2018] models and encodes the temporal structure of the video for defining the key-fragments using hierarchies of RNNs
     - [Ji, 2019] treats video-to-summary as a sequence-to-sequence learning problem, using an attention-driven encoder-decoder network
     - [Feng, 2018; Wang, 2019] estimate frames’ importance by modeling their long-range dependency using high-capacity memory networks
  17. Literature overview. Supervised approaches: imitating human summaries
     - [Zhang, 2019] summarization by confusing a trainable discriminator when making the distinction between a machine- and a human-generated summary; models the variable-range temporal dependency using RNNs and Dilated Temporal Units
     - [Fu, 2019] key-fragment selection by confusing a trainable discriminator when making the distinction between the machine- and the human-selected key-fragments; fragmentation based on an attention-based Pointer Network, and discrimination using a 3D-CNN classifier
  18. Literature overview. Supervised approaches: targeting specific properties of the summary
     - [Chu, 2019] models spatiotemporal information based on raw frames and optical flow maps, and learns frames’ importance from human annotations via a label distribution learning process
     - [Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors and estimates the level of activity and importance of each frame to create the summary
     - [Chen, 2019] summarization based on reinforcement learning and reward functions associated with the diversity and representativeness of the video summary
  19. Literature overview. Unsupervised approaches: inferring the original video
     - [Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when distinguishing the original from the summary-based reconstructed video using adversarial learning
     - [Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk and stride network and attention mechanism to assess variable-range dependencies and select the video key-frames
     - [Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine-grained training strategy and an attention auto-encoder to improve the key-fragment selection process
     - [Rochan, 2019] UnpairedVSN learns video summarization from unpaired data based on an adversarial process that defines a mapping function of a raw video to a human summary
  20. Literature overview. Unsupervised approaches: targeting specific properties of the summary
     - [Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via reinforcement learning and relevant reward functions
     - [Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns summarization by rewarding the maintenance of main spatiotemporal patterns in the summary
     - [Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and uses an online motion auto-encoder model to generate summaries that include the main objects in the video and the attractive actions made by each of these objects
  21. Some concluding remarks
     - DL-based video summarization methods mainly rely on combinations of CNNs and RNNs
     - Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to model the temporal dependency among video frames
     - The proposed video summarization approaches are mostly supervised
     - The best supervised approaches use tailored attention mechanisms or memory networks to capture variable- and long-range temporal dependencies, respectively
     - For unsupervised video summarization, GANs are the central direction and RL is another, less common approach
     - The best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with attention mechanisms
  22. Some concluding remarks
     - The generation of ground-truth data can be an expensive and laborious process
     - Video summarization is a subjective task and multiple summaries can be proposed for a video
     - Human annotations that vary a lot make it hard to train a method with the typical supervised training approaches
     - Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos
     - Unsupervised learning allows training a summarization method on different types of video content (TV shows, news) and then performing content-wise video summarization
     Unsupervised video summarization has great advantages, increases the applicability of summarization technologies, and its potential should be investigated
  24. Short break; coming up: Section I.2: Discussion on a few unsupervised GAN-based methods
  26. The SUM-GAN method [Mahasseni, 2017]
     - Problem formulation: video summarization via selecting a sparse subset of frames that optimally represents the video
     - Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations
     - Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized
     - Challenge: how to define a good distance?
     - Solution: use a Discriminator network and train it with the Summarizer in an adversarial manner
     B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. Figure courtesy of Mahasseni et al.
  29. The SUM-GAN method [Mahasseni, 2017]: Training pipeline and loss functions
     - Deep features of video frames in Frame Selector => normalized importance scores
     - Weighted features in Encoder => latent representation e
     - Latent representation e in Decoder => sequence of features for the frames of input video
     - Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary”
  34. The SUM-GAN method [Mahasseni, 2017]: Training pipeline and loss functions
     - Train Frame Selector and Encoder by minimizing L_sparsity + L_prior + L_reconst
     - Train Decoder by minimizing L_reconst + L_GAN
     - Train Discriminator by maximizing L_GAN
     - Update all components via backward propagation using Stochastic Gradient Variational Bayes estimation
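To make the grouping of loss terms on this slide concrete, here is a minimal sketch. It only shows which scalar losses feed which component; it is not the authors' training code. The sparsity term is written as the absolute gap between the mean importance score and a target ratio `sigma`, which is an assumed form, and all function and key names are hypothetical.

```python
import numpy as np


def sparsity_loss(scores, sigma):
    """Assumed sparsity regularizer: |mean(scores) - sigma|.

    `scores` are frame-level importance scores in [0, 1]; `sigma` is a
    hypothetical target ratio of frames to keep.
    """
    scores = np.asarray(scores, dtype=float)
    return float(abs(scores.mean() - sigma))


def component_objectives(l_sparsity, l_prior, l_reconst, l_gan):
    """Group scalar loss values per component, following the slide.

    Every returned value is meant to be minimized; the Discriminator
    maximizes L_GAN, which is written here as minimizing -L_GAN.
    """
    return {
        "frame_selector_and_encoder": l_sparsity + l_prior + l_reconst,
        "decoder": l_reconst + l_gan,
        "discriminator": -l_gan,
    }
```

In a real implementation each objective would drive a separate optimizer step over the parameters of the named component only.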
  35. The SUM-GAN method [Mahasseni, 2017]: Inference stage and video summarization
     - Deep features of video frames in Frame Selector => normalized importance scores
     - Pipeline shown in the figure: frame-level importance scores; video fragmentation using KTS; fragment-level importance scores; key-fragment selection as a Knapsack problem
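The inference steps named on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: KTS itself is not implemented (fragment boundaries are taken as given `[start, end)` frame index pairs), the fragment-level score is assumed to be the mean of its frame-level scores, and the length budget defaults to the 15% limit used in the established evaluation protocol. All function names are hypothetical.

```python
import numpy as np


def fragment_scores(frame_scores, boundaries):
    """Average the frame-level scores inside each [start, end) fragment."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    return [float(frame_scores[s:e].mean()) for s, e in boundaries]


def select_key_fragments(frame_scores, boundaries, budget_ratio=0.15):
    """Pick fragments maximizing total score within a length budget.

    Solves a 0/1 Knapsack problem by dynamic programming, where each
    fragment's weight is its length in frames and its value is its
    fragment-level score. Returns the sorted indices of chosen fragments.
    """
    capacity = int(len(frame_scores) * budget_ratio)
    values = fragment_scores(frame_scores, boundaries)
    lengths = [e - s for s, e in boundaries]
    n = len(boundaries)

    # best[i][c]: best total value using the first i fragments, capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v

    # Walk back through the table to recover the chosen fragments
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)
```

The selected fragments, concatenated in temporal order, would form the summary.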
  36. The SUM-GAN-sl method [Apostolidis, 2019]
     - Builds on the SUM-GAN architecture
     - Contains a linear compression layer that reduces the size of CNN feature vectors
     - Follows an incremental and fine-grained approach to train the model’s components
     E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
  39. The SUM-GAN-sl method [Apostolidis, 2019]: Training pipeline and loss functions. Step-wise training process
  43. The SUM-GAN-sl method [Apostolidis, 2019]: Inference stage and video summarization
     - Deep features of video frames in LC layer and Frame Selector => normalized importance scores
     - Pipeline shown in the figure: frame-level importance scores; video fragmentation using KTS; fragment-level importance scores; key-fragment selection as a Knapsack problem
  44. The SUM-GAN-AAE method [Apostolidis, 2020]
     - Builds on the SUM-GAN-sl algorithm
     - Introduces an attention mechanism by replacing the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
     E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020. Best paper award.
  46. The SUM-GAN-AAE method [Apostolidis, 2020]. The attention auto-encoder: Processing pipeline
     - Weighted feature vectors are fed to the Encoder
     - The Encoder’s output (V) and the Decoder’s previous hidden state are fed to the Attention component
       - For t > 1: use the hidden state of the Decoder’s previous step (e.g. h1 for t = 2)
       - For t = 1: use the hidden state of the Encoder’s last step (He)
     - Attention weights (αt) are computed using:
       - Energy score function
       - Soft-max function
     - αt is multiplied with V to form the Context Vector vt’
     - vt’ is combined with the Decoder’s previous output yt-1
     - The Decoder gradually reconstructs the video
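One decoding step of the attention component described in these slides can be sketched as below. The slides do not give the exact energy score function, so a bilinear (multiplicative) score with a hypothetical weight matrix `W_a` is assumed here; the shapes and names are likewise illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable soft-max over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_step(V, h_prev, W_a):
    """Compute attention weights and the context vector for one step.

    V      : (T, d) Encoder output, one row per frame
    h_prev : (d,)   Decoder's previous hidden state (Encoder's last
                    hidden state when t = 1)
    W_a    : (d, d) assumed learnable matrix of the energy score function
    """
    energies = V @ (W_a @ h_prev)  # (T,) one energy score per frame
    alpha = softmax(energies)      # (T,) attention weights, sum to 1
    context = alpha @ V            # (d,) weighted sum of Encoder outputs
    return alpha, context
```

The context vector would then be combined with the Decoder's previous output before the next Decoder step.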
  54. The SUM-GAN-AAE method [Apostolidis, 2020]: Training pipeline and loss functions
     - Training is performed in an incremental way as in SUM-GAN-sl
     - No prior loss is used
  55. The SUM-GAN-AAE method [Apostolidis, 2020]: Inference stage and video summarization
     - Deep features of video frames in LC layer and Frame Selector => normalized importance scores
     - Pipeline shown in the figure: frame-level importance scores; video fragmentation using KTS; fragment-level importance scores; key-fragment selection as a Knapsack problem
  56. The SUM-GAN-AAE method [Apostolidis, 2020]. Impact of the introduced attention mechanism: much smoother series of importance scores
  57. The SUM-GAN-AAE method [Apostolidis, 2020]. Impact of the introduced attention mechanism: much faster and more stable training of the model. Figures: average (over 5 splits) learning curve of SUM-GAN-sl and SUM-GAN-AAE on SumMe; loss curves for SUM-GAN-sl and SUM-GAN-AAE
  58. Some concluding remarks: Using GANs for video summarization
     - The most common strategy for learning summarization in an unsupervised way
     - A mechanism to build a representative summary by maximizing how well the full video can be inferred from it
     - Summarization performance is superior to other unsupervised learning approaches (e.g. reinforcement learning) and comparable to a few supervised learning methods
     - Step-wise training facilitates the training of complex GAN-based architectures
     - Introduction of attention mechanisms is beneficial to the quality of the created summary
     - There is room for further improving GAN-based unsupervised video summarization via: a) combination with reinforcement learning approaches, b) extension with memory networks
  59. Short break; coming up: Section I.3: Datasets, evaluation protocols and results, and future directions
  61. Datasets: most commonly used
     - SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
       - 25 videos capturing multiple events (e.g. cooking and sports)
       - video length: 1 to 6 min
       - annotation: fragment-based video summaries (15-18 per video)
     - TVSum (https://github.com/yalesong/tvsum)
       - 50 videos from 10 categories of TRECVid MED task
       - video length: 1 to 11 min
       - annotation: frame-level importance scores (20 per video)
  62. Datasets: less commonly used
     - Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
       - 50 videos of various genres (e.g. documentary, educational, historical, lecture)
       - video length: 1 to 4 min
       - annotation: keyframe-based video summaries (5 per video)
     - Youtube (https://sites.google.com/site/vsummsite/download)
       - 50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites
       - video length: 1 to 10 min
       - annotation: keyframe-based video summaries (5 per video)
  63. Evaluation protocols: Early approach
     - Agreement between automatically-created (A) and user-defined (U) summary is expressed by the F-Score
     - Matching of a pair of frames is based on color histograms, the Manhattan distance and a predefined similarity threshold
     - 80% of video samples are used for training and the remaining 20% for testing
     - The final evaluation outcome is obtained by:
       - Computing the average F-Score for a test video given the different user summaries for this video
       - Computing the average of the calculated F-Score values for the different test videos
  64. Evaluation protocols: Established approach
     - The generated summary should not exceed 15% of the video length
     - Agreement between automatically-generated (A) and user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring the temporal overlap (∩), where || || means duration: F = 2 * P * R / (P + R) * 100, with P = ||A ∩ U|| / ||A|| and R = ||A ∩ U|| / ||U||
     - Typical metrics for computing Precision and Recall at the frame-level
     - 80% of video samples are used for training and the remaining 20% for testing
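A minimal sketch of the frame-level F-Score described on this slide, assuming the usual harmonic-mean definition with Precision = overlap / duration of A and Recall = overlap / duration of U. Summaries are represented as binary per-frame selection vectors of equal length, so duration is counted in frames; this representation and the function name are assumptions made for illustration.

```python
import numpy as np


def f_score(auto_summary, user_summary):
    """F-Score (%) between two binary per-frame selection vectors.

    A value of 1 (or True) at index i means frame i is in the summary.
    """
    a = np.asarray(auto_summary, dtype=bool)
    u = np.asarray(user_summary, dtype=bool)
    overlap = np.logical_and(a, u).sum()  # frames selected in both
    if overlap == 0:
        return 0.0
    precision = overlap / a.sum()
    recall = overlap / u.sum()
    return float(2 * precision * recall / (precision + recall) * 100)
```

For example, two four-frame summaries that each select two frames and share one of them give Precision = Recall = 0.5 and an F-Score of 50.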
  68. Evaluation protocols: Established approach (a side note)
      • TVSum annotations need conversion from frame-level importance scores to key-fragments:
          1. Human annotations in TVSum: frame-level importance scores
          2. Video fragmentation using KTS (Kernel Temporal Segmentation)
          3. Fragment-level importance scores
          4. Key-fragment selection as a Knapsack problem
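The conversion can be sketched as below, under stated assumptions: the fragment boundaries are taken as given (KTS itself is not implemented here), a fragment's importance is the mean of its frame-level scores, and the knapsack capacity is 15% of the video length in frames. The function names are hypothetical.

```python
import numpy as np

def fragment_scores(frame_scores, boundaries):
    """Mean frame-level score per fragment; boundaries are (start, end) pairs, end exclusive."""
    return [float(np.mean(frame_scores[s:e])) for s, e in boundaries]

def knapsack_select(values, lengths, capacity):
    """0/1 knapsack by dynamic programming; returns the sorted indices of the chosen fragments."""
    n = len(values)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], lengths[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v
    # Backtrack to recover which fragments were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)

def select_key_fragments(frame_scores, boundaries, budget=0.15):
    """Pick the fragments that maximise total importance within the length budget."""
    scores = fragment_scores(frame_scores, boundaries)
    lengths = [e - s for s, e in boundaries]
    capacity = int(budget * len(frame_scores))
    return knapsack_select(scores, lengths, capacity)
```

Note that the highest-scoring fragment is not always selected: two shorter fragments that fit the budget together can yield a larger total importance.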
  74. Evaluation protocols: Established approach
      • Slight but important distinction w.r.t. what is eventually used as ground-truth summary
      • Most used approach: the generated summary is compared with each of the N user summaries, giving F-Score_1, F-Score_2, ..., F-Score_N, which are then combined per dataset:
          • SumMe: $\text{F-Score} = \max\{\text{F-Score}_i\}_{i=1}^{N}$
          • TVSum: $\text{F-Score} = \mathrm{mean}\{\text{F-Score}_i\}_{i=1}^{N}$
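The per-dataset aggregation of the N per-user F-Scores is small enough to state directly. The sketch below assumes the per-user scores have already been computed; the function name and the dataset strings are hypothetical.

```python
def aggregate_f_score(per_user_scores, dataset):
    """Combine the F-Scores against N user summaries: max for SumMe, mean for TVSum."""
    if not per_user_scores:
        raise ValueError("at least one per-user F-Score is required")
    if dataset == "SumMe":
        return max(per_user_scores)
    if dataset == "TVSum":
        return sum(per_user_scores) / len(per_user_scores)
    raise ValueError(f"unknown dataset: {dataset}")
```

With per-user scores of 40, 50 and 60, the same generated summary is reported as 60 under the SumMe rule and as 50 under the TVSum rule, which is why results on the two datasets are not directly comparable.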
  76. Evaluation protocols: Established approach
      • Slight but important distinction w.r.t. what is eventually used as ground-truth summary
      • Alternative approach: a single F-Score per video (diagram not preserved)
  77. Results: comparison of unsupervised methods (methods and references)

      Method            Reference
      Online Motion AE  [Zhang, 2018]
      SUM-FCN_unsup     [Rochan, 2018]
      DR-DSN            [Zhou, 2018b]
      EDSN              [Gonuguntla, 2019]
      UnpairedVSN       [Rochan, 2019]
      PCDL              [Zhao, 2019]
      ACGAN             [He, 2019]
      Tessellation      [Kaufman, 2017]
      SUM-GAN-sl        [Apostolidis, 2019]
      SUM-GAN-AAE       [Apostolidis, 2020]
      CSNet             [Jung, 2019]
  78. Results: comparison of unsupervised methods

                        SumMe        TVSum        AVG
      Method            FSc   Rnk    FSc   Rnk    Rnk
      Random summary    40.2  10     54.4  9      9.5
      Online Motion AE  37.7  11     51.5  11     11
      SUM-FCN_unsup     41.5  8      52.7  10     9
      DR-DSN            41.4  9      57.6  6      7.5
      EDSN              42.6  7      57.3  7      7
      UnpairedVSN       47.5  4      55.6  8      6
      PCDL              42.7  6      58.4  4      5
      ACGAN             46.0  5      58.5  3      4
      Tessellation      41.4  7      64.1  1      4
      SUM-GAN-sl        47.8  3      58.4  4      3.5
      SUM-GAN-AAE       48.9  2      58.3  5      3.5
      CSNet             51.3  1      58.8  2      1.5

      General remarks
      • Best-performing unsupervised methods rely on Generative Adversarial Networks
      • The use of attention mechanisms allows the identification of important parts of the video
      • The best method on TVSum is dataset-tailored, as it has random-level performance on SumMe
      • The use of rewards and reinforcement learning is less competitive than the use of GANs
      • A few methods show random-level performance on at least one of the used datasets
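The Rnk and AVG Rnk columns can be derived from the F-Scores. The tied TVSum entries (two methods at 58.4 sharing rank 4, followed by rank 5) suggest a dense ranking, so the sketch below assumes that scheme; this is an inference from the table, not something the slides state. The function names are hypothetical, and the example uses only four rows of the table, so the resulting ranks differ from those of the full comparison.

```python
def dense_ranks(scores):
    """Map each method to its dense rank (1 = highest score; tied scores share a rank)."""
    ordered = sorted(set(scores.values()), reverse=True)
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return {name: position[value] for name, value in scores.items()}

def average_ranks(summe, tvsum):
    """Average of the per-dataset dense ranks for every method present in both tables."""
    r_summe, r_tvsum = dense_ranks(summe), dense_ranks(tvsum)
    return {name: (r_summe[name] + r_tvsum[name]) / 2 for name in summe}

# Four rows taken from the table above.
summe = {"CSNet": 51.3, "SUM-GAN-AAE": 48.9, "SUM-GAN-sl": 47.8, "PCDL": 42.7}
tvsum = {"CSNet": 58.8, "SUM-GAN-AAE": 58.3, "SUM-GAN-sl": 58.4, "PCDL": 58.4}
avg = average_ranks(summe, tvsum)
```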
  85. Results: comparison of supervised methods (methods and references)

      Method            Reference
      vsLSTM            [Zhang, 2016b]
      dppLSTM           [Zhang, 2016b]
      SASUM_wsup        [Wei, 2018]
      ActionRanking     [Elfeki, 2019]
      ESS-VS            [Zhang, 2016a]
      H-RNN             [Zhao, 2017]
      vsLSTM+Att        [Lebron Casas, 2019]
      DSSE              [Yuan, 2019b]
      DR-DSN_sup        [Zhou, 2018b]
      Tessellation_sup  [Kaufman, 2017]
      dppLSTM+Att       [Lebron Casas, 2019]
      WS-HRL            [Chen, 2019]
      UnpairedVSN_sup   [Rochan, 2019]
      SUM-FCN           [Rochan, 2018]
      SF-CVS            [Huang, 2020]
      SASUM_sup         [Wei, 2018]
      CRSum             [Yuan, 2019c]
      PCDL_sup          [Zhao, 2019]
      MAVS              [Feng, 2018]
      HSA-RNN           [Zhao, 2018]
      DQSN              [Zhou, 2018a]
      ACGAN_sup         [He, 2019]
      SUM-DeepLab       [Rochan, 2018]
      CSNet_sup         [Jung, 2019]
      SMLD              [Chu, 2019]
      H-MAN             [Liu, 2019]
      VASNet            [Fajtl, 2019]
      SMN               [Wang, 2019]
      * SUM-GAN-AAE     [Apostolidis, 2020]
  86. Results: comparison of supervised methods

                        SumMe  TVSum  AVG
      Method            FSc    FSc    Rnk
      Random summary    40.2   54.4   22.5
      vsLSTM            37.6   54.2   24.5
      dppLSTM           38.6   54.7   23
      SASUM_wsup        40.6   53.9   22.5
      ActionRanking     40.1   56.3   21.5
      ESS-VS            40.9   -      20
      H-RNN             41.1   57.7   17.5
      vsLSTM+Att        43.2   -      17
      DSSE              -      57.0   17
      DR-DSN_sup        42.1   58.1   16
      Tessellation_sup  37.2   63.4   15
      dppLSTM+Att       43.8   -      14
      WS-HRL            43.6   58.4   14
      UnpairedVSN_sup   48.0   56.1   13
      SUM-FCN           47.5   56.8   13
      SF-CVS            46.0   58.0   13
      SASUM_sup         45.3   58.2   12.5
      CRSum             47.3   58.0   12
      PCDL_sup          43.7   59.2   12
      MAVS              40.3   66.8   11.5
      HSA-RNN           44.1   59.8   10
      DQSN              -      58.6   10
      ACGAN_sup         47.2   59.4   9
      SUM-DeepLab       48.8   58.4   8
      CSNet_sup         48.6   58.5   8
      SMLD              47.6   61.0   6
      H-MAN             51.8   60.4   4
      VASNet            49.7   61.4   3.5
      SMN               58.3   64.5   1.5
      * SUM-GAN-AAE     48.9   58.3   8.5
  92. Qualitative comparison: keyframe-based overview of video #15 of TVSum (1 keyframe / shot)
  93. Qualitative comparison: generated summaries by five summarization methods
  97. Qualitative comparison: video #15 of TVSum, “How to Clean Your Dog’s Ears - Vetoquinol USA”
  98. Qualitative comparison: automatically generated summaries (VASNet, SUM-GAN-AAE, DR-DSN)
  99. Use of video summarization technologies: Tool for content adaptation / re-purposing
      • Developed by CERTH-ITI
      • Builds on GAN-based methods for unsupervised learning [Apostolidis, 2019], [Apostolidis, 2020]
      • Enables content adaptation for distribution via multiple communication channels
      • Facilitates summary creation based on the audience needs for: Twitter, Facebook (feed & stories), Instagram (feed & stories), YouTube, TikTok
      E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
      E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020.
  100. Use of video summarization technologies: Tool for content adaptation / re-purposing
      • Learns content-specific summarization
      • Separate models can be trained and used for different video content (e.g. TV shows)
      • Creating these models does not require manually-generated training data, so they come (almost) for free
  101. Use of video summarization technologies: Tool for content adaptation / re-purposing
      • Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html
      • Demo video: https://youtu.be/LbjPLJzeNII
  102. Future directions: Analysis-oriented
      • Unsupervised video summarization based on combining adversarial and reinforcement learning
      • Advanced attention mechanisms and memory networks for capturing long-range temporal dependencies among parts of the video
      • Exploiting augmented/extended training data
      • Introducing editorial rules in unsupervised video summarization
      • Examining the potential of transfer learning in video summarization
  103. Future directions: Application-oriented
      • There is a lack of integrated technologies for automating video summarization; CERTH’s web application is one of the first complete tools
      • Automated summarization that adapts to the distribution channel, the targeted audience or the video content has strong potential
      • Further applications of video summarization should be investigated by:
          • monitoring the modern media/social media ecosystem
          • identifying new application domains for content adaptation / re-purposing
          • translating the needs of these application domains into analysis requirements
  104. Key references
      [Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25.
      [Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504.
      [Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
      [Chen, 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6.
      [Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111.
      [Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc. of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
      [Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 754–763.
  105. Key references (cont.)
      [Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing, 2019, pp. 39–54.
      [Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983.
      [Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587.
      [Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the 2019 British Machine Vision Conference (BMVC), 2019.
      [Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial learning,” ArXiv, vol. abs/1711.04755, 2017.
      [Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 505–520.
      [Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098.
      [Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018.
  106. Key references (cont.)
      [He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 2296–2304.
      [Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
      [Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020.
      [Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
      [Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544.
      [Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104.
      [Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now Publishers Inc., 2012.
      [Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480.
  107. Key references (cont.)
      [Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer International Publishing, 2019, pp. 67–79.
      [Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381.
      [Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–2991.
      [Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, “Video summarization using deep semantic features,” in Proc. of the 13th Asian Conference on Computer Vision (ACCV’16), 2016.
      [Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686.
      [Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop on Adversarial Training, 2016.
      [Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 540–555.
  108. Key references (cont.)
      [Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 358–374.
      [Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
      [Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019.
      [Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808.
      [Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187.
      [Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6.
      [Szegedy, 2015] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1–9.
  109. Key references (cont.)
      [Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700.
      [Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844.
      [Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. of the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36.
      [Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018.
      [Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of the 2017 AAAI Conf. on Artificial Intelligence (AAAI). AAAI Press, 2017, pp. 2852–2858.
      [Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019.
      [Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019.
      [Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, “Spatiotemporal modeling for video summarization using convolutional recurrent neural network,” IEEE Access, vol. 7, pp. 64676–64685, 2019.
  110. Key references
     [Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1059–1067.
     [Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 766–782.
     [Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online motion auto-encoder,” Pattern Recognition Letters, 2018.
     [Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA: ACM, 2019, pp. 89:1–89:6.
     [Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871.
     [Zhao, 2018] B. Zhao, X. Li, and X. Lu, “HSA-RNN: Hierarchical structure-adaptive RNN for video summarization,” in Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
     [Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
  111. Key references
     [Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in Proc. of the 2018 British Machine Vision Conference (BMVC), 2018.
     [Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
  112. Questions?
     Vasileios Mezaris (bmezaris@iti.gr), Evlampios Apostolidis (apostolid@iti.gr), CERTH-ITI, Greece
     info@retv-project.eu
     Following the Q&A session and the break, we will be back with Part II of the tutorial, on video summaries re-use and recommendation.
     This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV.
