AutomEditor: Video blooper recognition and localization for automatic monologue video editing

Multimodal video action (blooper) recognition and localization methods for spatio-temporal feature fusion using Face, Body, Audio, and Emotion features

  1. AutomEditor: Video blooper recognition and localization for automatic monologue video editing. Carlos Toxtli
  2. Multimodal video action (blooper) recognition and localization methods for spatio-temporal feature fusion using Face, Body, Audio, and Emotion features
  3. Index ● Basic concepts ● Video bloopers dataset ● Feature extraction ● Blooper recognition ● Blooper localization ● System implementation ● Conclusions
  4. Previous work on Automatic Video Editing (AVE) ● Previous work on automatic video editing focuses mostly on enhancing existing videos by adding music, transitions, zooms, and camera changes, among other improvements. ● Simple silence-detection mechanisms are just beginning to be implemented by commercial software; however, silences are easy to detect visually. ● Video summarization techniques involve editing but are content based. ● Video action recognition is the area that studies behavioral patterns in videos. ● Video action recognition applied to blooper detection is an area that has not yet been studied in the literature.
  5. Problem ● According to online sources, basic video editing can take 30 minutes to an hour for each minute of finished video (a 4-minute video could take 4 hours to edit). More advanced editing (adding animations, VFX, and compositing) can take much longer. ● The time a video takes to edit discourages users from producing periodic content.
  6. Solution: AutomEditor A system that automates monologue video editing. ● AutomEditor is fed with example video clips (1-3 seconds each) of bloopers and non-bloopers (separated into folders). ● Extracts features and trains a model. ● Evaluates its performance. ● Localizes the blooper fragments in full-length videos. ● Shows the results in a web interface.
  7. End-to-end solution From database creation to the web application. https://github.com/toxtli/AutomEditor
  8. Main contributions ● Creation of a video bloopers dataset (Blooper DB) ● Feature extraction methods for video blooper recognition ● Video blooper recognition models ● Video blooper localization techniques ● Web interface for automatic video editing Problem: Every contribution by itself is enough for an individual publication; I could not cover all of them in depth.
  9. Creation of a monologue video bloopers dataset ● ~600 videos ● Between 1 and 3 seconds per video ● Train, validation, and test splits ○ Train: 464 ○ Test: 66 ○ Validation: 66 ● 2 categories ○ Blooper ○ No blooper ● Stratified data
  10. Criteria ● I split long bloopers (more than 2 seconds) into a non-blooper clip (before the mistake) and a blooper clip (containing the mistake). ● For short bloopers (1 to 2 seconds), I found other clips of about the same length from the same video to use as non-bloopers. ● The clips do not contain truncated phrases. ● I tried to avoid, as much as possible, a green-screen vs. non-green-screen imbalance between classes.
  11. Examples: No blooper / Blooper
  12. Examples: No blooper / Blooper
  13. Feature extraction methods for blooper recognition The main goal of this process is to extract features invariant to person descriptors (i.e., gender, age, etc.), scale, position, background, and language. ● Audio ■ General (1) Audio handcrafted features per clip (OpenSMILE) ■ Temporal (20) Audio handcrafted features per clip (OpenSMILE) ● Images ○ Face ■ Temporal (20) Face handcrafted features (OpenFace) ■ Temporal (20) Face deep features (VGG16) ○ Body ■ Temporal (20) Body handcrafted features (OpenPose) ■ Temporal (20) Body deep features (VGG16) ○ Emotions ■ General (1) FER predictions (EmoPy and others) ■ Temporal (20) FER predictions (EmoPy and others)
  14. Audio features OpenSMILE (1582 features): The audio is extracted from the videos and processed by OpenSMILE, which extracts audio features such as loudness, pitch, and jitter. It was tested on the whole video clip (general) and on 20 fragments (temporal).
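Below is a minimal sketch of how such clip-level audio features could be produced with ffmpeg and openSMILE's SMILExtract command-line tool. The file paths and the emobase2010 configuration (a 1582-dimensional functional set) are assumptions; the deck does not state the exact config used.

```python
# Sketch (not the author's exact pipeline): strip a clip's audio with ffmpeg
# and run SMILExtract to get one feature vector per clip.
import subprocess

def extract_audio_features(video_path, wav_path="clip.wav", csv_path="features.csv",
                           smile_config="opensmile/config/emobase2010.conf"):
    # 1) Extract the audio track as a mono 16 kHz WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
                   check=True)
    # 2) Run SMILExtract: -C config, -I input wav, -O output file of functionals.
    subprocess.run(["SMILExtract", "-C", smile_config, "-I", wav_path, "-O", csv_path],
                   check=True)
    return csv_path

# extract_audio_features("clips/blooper_001.mp4")
```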
  15. Face features OpenFace (709 features): A facial behavior analysis tool that provides accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. We get points that represent the face. VGG16 FC6 (4096 features): The faces are cropped (224×224×3), aligned, and background-zeroed, then passed through a pretrained VGG16 to take a 4096-dimensional feature vector from the FC6 layer.
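A sketch of the deep face descriptor step, assuming Keras' ImageNet-pretrained VGG16 (the deck does not say which VGG16 weights were used). In Keras the first 4096-unit dense layer is named fc1, which corresponds to FC6 in the original VGG nomenclature.

```python
# Extract a 4096-dim FC6 descriptor from an aligned, background-zeroed face crop.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=True)
fc6_extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def face_deep_features(face_crop):
    """face_crop: aligned face crop with zeroed background, shape (224, 224, 3)."""
    x = preprocess_input(face_crop.astype("float32")[None, ...])
    return fc6_extractor.predict(x)[0]   # 4096-dimensional vector
```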
  16. Body Features OpenPose (BODY_25) (11 features): The normalized angles between the joints. I did not use the raw calculated features because they were 25×224×224. VGG16 FC6 skeleton image (4096 features): I drew the skeleton (neck in the center) on a black background, fed it to a VGG16, and extracted the FC6-layer feature vector.
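A sketch of computing normalized joint angles from OpenPose BODY_25 keypoints. The specific joint triples below are illustrative assumptions; the deck only says that 11 normalized angles were used.

```python
# Hand-crafted body features: angles at joints, normalized to [0, 1].
import numpy as np

# (a, b, c): angle measured at joint b, between segments b->a and b->c.
# BODY_25 indices: 0 nose, 1 neck, 2/5 shoulders, 3/6 elbows, 4/7 wrists, 8 mid-hip.
ANGLE_TRIPLES = [(1, 2, 3), (2, 3, 4),   # right shoulder, right elbow
                 (1, 5, 6), (5, 6, 7),   # left shoulder, left elbow
                 (0, 1, 8)]              # neck tilt

def joint_angles(keypoints_xy):
    """keypoints_xy: (25, 2) array of BODY_25 (x, y) coordinates for one frame."""
    feats = []
    for a, b, c in ANGLE_TRIPLES:
        v1 = keypoints_xy[a] - keypoints_xy[b]
        v2 = keypoints_xy[c] - keypoints_xy[b]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        feats.append(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)  # normalize by pi
    return np.array(feats)
```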
  17. Emotion features EmoPy (7 features): A deep neural net toolkit for emotion analysis via Facial Expression Recognition (FER). Other (28 features): Four other models from different FER contest participants, 7 categories per model (35 features in total). 20 samples per video clip were predicted (temporal), and from these I computed their normalized sum (general).
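A sketch of the aggregation described above: per-frame FER probabilities are stacked into a 20-step temporal matrix and collapsed into one "general" vector via a normalized sum. The fer_models list is a hypothetical stand-in for EmoPy plus the four contest models.

```python
# Aggregate per-frame FER predictions into temporal (20 x 35) and general (35,) features.
import numpy as np

def emotion_features(frames, fer_models):
    """frames: list of 20 face crops; fer_models: callables frame -> (7,) probabilities."""
    temporal = np.stack([
        np.concatenate([model(frame) for model in fer_models])   # (35,) per frame
        for frame in frames
    ])                                                            # shape (20, 35)
    summed = temporal.sum(axis=0)
    general = summed / (np.linalg.norm(summed) + 1e-8)            # normalized sum
    return temporal, general
```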
  18. Feature fusion The features from the same source were normalized and fused, giving the following feature sizes: Face fused (4096 + 709 = 4805 features). Body fused (4096 + 11 = 4107 features). Emotion features (7 + 7 + 7 + 7 + 7 = 35 features). Audio features came only from OpenSMILE, so these were not fused (1582 features).
  19. Feature sequences The extracted features are grouped into sequences to feed the RNNs. ● Each video clip (a fragment of 1 to 3 seconds) is divided into 20 equally spaced samples (e.g., 20 face images; in a 60-frame video, frames 1, 4, 7, ..., 57, 60 are processed). ● The samples were extracted from the end to the beginning. ● This produces a matrix of [20][feature_size].
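A sketch of this sampling scheme with NumPy: 20 equally spaced frame indices per clip, walked from the end of the clip back to the beginning.

```python
# Pick 20 equally spaced frame indices, returned from the end to the beginning.
import numpy as np

def sample_indices(n_frames, n_samples=20):
    # equally spaced indices over the clip, e.g. 60 frames -> roughly every 3rd frame
    idx = np.linspace(0, n_frames - 1, n_samples).round().astype(int)
    return idx[::-1]   # sampled from the end to the beginning

# sample_indices(60) -> array([59, 56, ..., 3, 0])
```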
  20. Fusions Early fusion: For early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier. Late fusion: For late fusion, classifications are made on each modality and their decisions or predictions are later merged together. We used early fusion for our training cases.
  21. Early fusion
  22. Late fusion
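A minimal Keras sketch contrasting the two strategies on two of the modalities. The two-modality setup and the layer sizes are illustrative assumptions, not the exact architecture from the deck.

```python
# Early fusion: concatenate projected modality features before one classifier.
# Late fusion: one classifier per modality, predictions merged afterwards.
from tensorflow.keras.layers import Input, Dense, Concatenate, Average
from tensorflow.keras.models import Model

face_in = Input(shape=(4805,))     # fused face features
audio_in = Input(shape=(1582,))    # OpenSMILE functionals

# Early fusion
joint = Concatenate()([Dense(256, activation="relu")(face_in),
                       Dense(256, activation="relu")(audio_in)])
early_out = Dense(2, activation="softmax")(Dense(128, activation="relu")(joint))
early_model = Model([face_in, audio_in], early_out)

# Late fusion
face_pred = Dense(2, activation="softmax")(Dense(128, activation="relu")(face_in))
audio_pred = Dense(2, activation="softmax")(Dense(128, activation="relu")(audio_in))
late_model = Model([face_in, audio_in], Average()([face_pred, audio_pred]))
```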
  23. Quad Model The proposed model uses all the extracted features, feeds them into LSTMs (except the general audio features), and combines them via early fusion.
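A sketch of a quadmodal architecture along these lines: LSTM branches for the temporal streams, a dense branch for the clip-level audio features, and an early-fusion concatenation before the classifier. The layer sizes are assumptions; the deck does not give them.

```python
# Quadmodal early-fusion sketch: face/body/emotion sequences -> LSTMs, audio -> Dense.
from tensorflow.keras.layers import Input, Dense, LSTM, Concatenate
from tensorflow.keras.models import Model

face_seq = Input(shape=(20, 4805))   # fused face features, 20 timesteps
body_seq = Input(shape=(20, 4107))   # fused body features
emo_seq = Input(shape=(20, 35))      # FER predictions
audio_vec = Input(shape=(1582,))     # general OpenSMILE features (no LSTM)

branches = [LSTM(128)(face_seq), LSTM(128)(body_seq), LSTM(32)(emo_seq),
            Dense(128, activation="relu")(audio_vec)]
fused = Dense(128, activation="relu")(Concatenate()(branches))
output = Dense(2, activation="softmax")(fused)   # blooper vs. no blooper

quad_model = Model([face_seq, body_seq, emo_seq, audio_vec], output)
```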
  24. LSTM
  25. Evaluation ● The models were trained on an NVIDIA GTX 1080 Ti graphics card. ● Since there is no previous work in this field, we used the individual feature models as baselines. ● 300 epochs ● Optimizer: Adam ● Loss: MSE ● Learning rate: 0.001
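The training configuration from this slide, shown on a small placeholder model; the batch size and any callbacks are assumptions, since the deck does not state them.

```python
# Compile/fit setup: Adam, learning rate 0.001, MSE loss, 300 epochs.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

inp = Input(shape=(1582,))   # e.g. general audio features
model = Model(inp, Dense(2, activation="softmax")(Dense(128, activation="relu")(inp)))
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=300, batch_size=32)
```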
  26. Emotion features: Global & Temporal
      Model             acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Emotion Global    0.59     0.86       0.59      0.60      0.56     0.28
      Emotion Temporal  0.62     0.99       0.69      0.66      0.63     0.32
  27. Body Temporal Features: Handcrafted & Deep
      Model      acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Body Hand  0.63     0.92       0.54      0.72      0.59     0.27
      Body Deep  0.68     0.99       0.65      0.72      0.71     0.26
  28. Body fusion (handcrafted + deep features)
      Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Body Fus  0.66     0.98       0.66      0.74      0.69     0.22
  29. Face Temporal Features: Handcrafted & Deep
      Model      acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Face Hand  0.84     0.99       0.87      0.89      0.86     0.12
      Face Deep  0.89     1.00       0.81      0.92      0.83     0.12
  30. Face fusion (handcrafted + deep features)
      Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Face Fus  0.89     1.00       0.89      0.92      0.84     0.09
  31. Audio Features: Temporal & General
      Model           acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Audio Temporal  0.86     1.00       0.84      0.89      0.83     0.11
      Audio General   0.95     1.00       0.90      0.96      0.92     0.03
  32. Top 3: Face handcrafted + Face deep + Audio general
      Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Aud+Face  0.96     1.00       0.90      0.98      0.92     0.03
  33. All (Quadmodal): BodyTF + FaceTF + AudioG + EmoT
      Model  acc_val  acc_train  acc_test  f1_score  f1_test  loss
      All    1.00     1.00       0.90      1.00      0.90     0.01
  34. Confusion matrices of the Quadmodal model (Train, Validation, Test)
  35. Results
      Model       acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Emotion Gl  0.59     0.86       0.59      0.60      0.56     0.28
      Emotion Te  0.62     0.99       0.69      0.66      0.63     0.32
      Body Feat   0.63     0.92       0.54      0.72      0.59     0.27
      Body Fus    0.66     0.98       0.66      0.74      0.69     0.26
      Body Vis    0.68     0.99       0.65      0.72      0.71     0.22
      Face Feat   0.84     0.99       0.87      0.89      0.86     0.12
      Audio Te    0.86     1.00       0.84      0.89      0.83     0.11
      Face Vis    0.89     1.00       0.81      0.92      0.83     0.12
      Face Fus    0.89     1.00       0.89      0.92      0.84     0.09
      Audio       0.95     1.00       0.90      0.96      0.92     0.03
      Aud+Face    0.96     1.00       0.90      0.98      0.92     0.03
      Quadmodal   1.00     1.00       0.90      1.00      0.90     0.01
  36. Early vs. Late fusion
      Model            acc_val  acc_train  acc_test  f1_score  f1_test  loss
      Quadmodal Early  1.00     1.00       0.90      1.00      0.90     0.01
      Quadmodal Late   0.96     1.00       0.93      0.96      0.93     0.06
  37. But how good is a model with 100% train, 100% validation, and 90% test accuracy? Sometimes algorithmic research work ends after the computation of the performance metrics, but... Now that we have a model with good performance on small data, how can we test that it works for real-life applications? Full-length videos will be provided by users, so the first step is to find bloopers in a video. Localization techniques are needed.
  38. Video blooper localization techniques More challenges... ● There are no existing localization techniques for video bloopers. ● Temporal action localization techniques in untrimmed videos work mostly for image-only processing. ● Localization in multimodal settings is mostly limited to video indexing. ● There are no localization methods for mixed temporal and non-temporal features. ● The videos must be analyzed in small fragments. ● The analysis of multiple video fragments is costly. ● Taking different frames of the same video clip can give different results. ● The output should be a time range.
  39. Diagnosis of how predictions are distributed To test how the algorithm can find bloopers, I randomly inserted 6 clips (3 bloopers and 3 non-bloopers) into a 70-second video of the same person. I created two test videos. I analyzed 2-second fragments spaced 500 milliseconds apart and plotted the results. Then I compared my expectations vs. reality.
  40. Expectations vs. Reality: Expectation...
  41. Expectations vs. Reality: Reality...
  42. Defining an algorithm to find the bloopers I defined the concept of blooper_score as the predicted value of the blooper category. Instead of using a 0-to-1 scale, I used the discrete values 0, 1, and 2: 0 stands for blooper_score = 0, 1 for "almost 1" (intermediate values within a threshold range), and 2 for blooper_score = 1. The most important pattern I found was runs of contiguous high values.
  43. Adding neighbors To emphasize the values, I summed them in bins of neighboring elements.
  44. Calculating the sequences of top-3 values I defined a window size and calculated the percentage of elements that are among the top 3 values; windows above a threshold are added to a range (see the sketch below).
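A sketch of the localization heuristic described on slides 42-44. The thresholds, bin size, window size, and minimum fraction are assumptions; the deck does not give exact values.

```python
# Discretize blooper scores, sum neighbor bins, and keep windows dominated by
# the top-3 bin values, merging them into candidate (start_ms, end_ms) ranges.
import numpy as np

def localize(blooper_scores, step_ms=500, frag_ms=2000,
             low=0.2, high=0.8, bin_size=3, window=6, min_frac=0.5):
    scores = np.asarray(blooper_scores, dtype=float)
    # 1) Discretize each fragment's blooper_score into 0, 1 ("almost 1"), or 2.
    discrete = np.where(scores >= high, 2, np.where(scores >= low, 1, 0))
    # 2) Emphasize contiguous high values by summing bins of neighboring elements.
    binned = np.convolve(discrete, np.ones(bin_size, dtype=int), mode="same")
    # 3) Slide a window; keep windows where enough elements reach the top-3 bin values.
    top3 = np.sort(np.unique(binned))[-3:]
    good = [np.isin(binned[i:i + window], top3).mean() >= min_frac
            for i in range(len(binned) - window + 1)]
    # 4) Merge consecutive qualifying windows into time ranges.
    ranges, start = [], None
    for i, ok in enumerate(good + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            last_fragment = (i - 1) + window - 1          # last fragment covered
            ranges.append((start * step_ms, last_fragment * step_ms + frag_ms))
            start = None
    return ranges
```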
  45. Result of ranges It returned the 3 ranges that contained the bloopers of the video. Millisecond accuracy would be needed, but this approach is good enough to at least identify them.
  46. It also worked for the second video.
  47. But not everybody is familiar with the command line We now have a recognition model and a localization method, but the system is not user friendly. So there is another challenge: there are no automatic video editing interfaces on the web. So I developed an open-source web interface for automatic video editing.
  48. Web interface for automatic video editing http://www.carlostoxtli.com/AutomEditor/frontend/ The tool helps users analyze their videos and visualize their bloopers. For developers, it provides a simple and easy-to-integrate platform for testing their algorithms.
  49. Examples of the processed videos in the GUI
  50. Future work ● Explore one of the contributions in depth. ● Data augmentation methods for video bloopers ○ Generative video bloopers? ● Research on temporal action localization techniques in untrimmed videos for mixed spatio-temporal modalities. ● Detecting bloopers involving multiple people. ● Study how people interact with AVE interfaces (HCI).
  51. Conclusions ● Video blooper recognition benefits from multimodal techniques. ● Results from small data are not generalizable enough. ● Models for localization of mixed spatio-temporal multimodal features are needed to reduce time and processing load. ● The AutomEditor interface can ○ Help users edit their videos automatically online ○ Help developers test and publish their models to the public.
  52. Thanks http://www.carlostoxtli.com @ctoxtli
  53. Back up
  54. Link http://bit.ly/2V3U3aS
  55. Decision layers The activation functions used for each metric were: Emotion (categorical): Softmax; Valence (dimensional): hyperbolic tangent (tanh); Arousal (dimensional): Sigmoid.
  56. Sigmoid as activation function A sigmoid activation function turns an activation into a value between 0 and 1. It is useful for binary classification problems and is mostly used in the final output layer of such problems. Also, sigmoid activation leads to slow gradient descent because the slope is small for high and low values.
  57. Hyperbolic tangent as activation function A tanh activation function turns an activation into a value between -1 and +1. The outputs are normalized. The gradient is stronger for tanh than for sigmoid (its derivatives are steeper).
  58. Softmax as activation function The Softmax function is a wonderful activation function that turns numbers (logits) into probabilities that sum to one.
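For reference, the three decision-layer activations in plain NumPy:

```python
import numpy as np

def sigmoid(x):            # squashes to (0, 1); used for arousal
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # squashes to (-1, 1); used for valence
    return np.tanh(x)

def softmax(logits):       # turns logits into probabilities that sum to 1
    e = np.exp(logits - np.max(logits))
    return e / e.sum()
```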
  59. MSE as loss function for linear regression Linear regression uses Mean Squared Error as its loss function, which gives a convex graph, so we can complete the optimization by finding its vertex as the global minimum.
  60. SGD as Optimizer Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x_i (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation of the gradient calculated using the entire training data. Each update is now much faster to calculate than in batch gradient descent, and over many updates we will head in the same general direction.
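A worked example of one SGD update for linear regression under the MSE loss, using a single randomly chosen sample (x_i, y_i):

```python
# One SGD step on the squared error (w.x_i + b - y_i)^2 for a single sample.
import numpy as np

def sgd_step(w, b, x_i, y_i, lr=0.01):
    error = np.dot(w, x_i) + b - y_i      # prediction error on one sample
    grad_w = 2.0 * error * x_i            # d/dw of the squared error
    grad_b = 2.0 * error                  # d/db of the squared error
    return w - lr * grad_w, b - lr * grad_b
```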
  61. Layers: Early fusion (hidden layer), Early fusion, Fully connected, LSTM, Late fusion
  62. 1D Conv and Average Pooling 1D convolutional neural nets can be used for extracting local 1D patches (subsequences) from sequences and can identify local patterns within the window of convolution. A pattern learned at one position can also be recognized at a different position, making 1D conv nets translation invariant. Sometimes a sequence is so long that it cannot realistically be processed by RNNs. In such cases, 1D conv nets can be used as a preprocessing step to make the sequence smaller through downsampling by extracting higher-level features, which can then be passed on to the RNN as input.
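A Keras sketch of this pattern: a Conv1D plus average-pooling front end that shortens a long sequence before the RNN. The sizes are illustrative, not taken from the deck.

```python
# Conv1D extracts local patterns; AveragePooling1D downsamples before the LSTM.
from tensorflow.keras.layers import Input, Conv1D, AveragePooling1D, LSTM, Dense
from tensorflow.keras.models import Model

seq_in = Input(shape=(500, 128))                          # long feature sequence
x = Conv1D(64, kernel_size=5, activation="relu")(seq_in)  # local 1D patterns
x = AveragePooling1D(pool_size=4)(x)                      # downsample 4x
x = LSTM(64)(x)                                           # RNN on the shorter sequence
model = Model(seq_in, Dense(2, activation="softmax")(x))
```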
  63. Batch Normalization We normalize the input layer by adjusting and scaling the activations to speed up learning; the same is done for the values in the hidden layers, which change all the time during training.
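Batch normalization in NumPy form, to make the adjust-and-scale step concrete: standardize each feature over the batch, then rescale with a learnable gamma and shift with a learnable beta.

```python
# Inference-style batch norm over a batch x of shape (batch_size, n_features).
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```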
  64. VGG16
