
Multimodal emotion recognition at utterance level with spatio-temporal feature fusion by using face, body, audio, and text features.


This is a project in which we use state-of-the-art techniques to detect affect on videos by using multiple modalities. We used the OMG dataset.


  1. 1. Multimodal Affect Recognition at utterance-level with spatio-temporal feature fusion by using Face, Audio, Text [, and Body] features Carlos Toxtli
  2. 2. Index ● Basic concepts ● Architecture ● Experiments ● Results ● Conclusions ● Next steps
  3. 3. Paper Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology IFP, Beckman, University of Illinois at Urbana-Champaign Highest score in the “Visual + Audio + Text” category of the OMG Emotion Challenge 2018
  4. 4. Long-term (spatio-temporal) emotion recognition ● The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. ● Much previous work has focused on instantaneous emotion recognition. ● This work addresses long-term emotion recognition by integrating cues from multiple modalities. ● Since emotions normally change gradually under the same context, analyzing long-term dependency of emotions will stabilize the overall predictions.
  5. 5. Utterance level ● Spoken statement ● It is a continuous piece of speech beginning and ending with a clear pause. ● Utterances do not exist in written language ● This word does not exist in some languages.
  6. 6. Multimodal ● Humans perceive others’ emotional states by combining information across multiple modalities simultaneously. ● Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. ● This work uses multiple modalities including facial expression, audio and language. ● The paper describes a multi-modal neural architecture that integrates visual information over time using LSTMs, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips.
  7. 7. Affect (dimensional) vs Emotion (discrete) recognition ● Dimensional models aim to avoid the restrictiveness of discrete states and allow a more flexible definition of affective states as points in a multi-dimensional space spanned by concepts such as affect intensity and positivity. ● For affect recognition, the dimensional space is commonly operationalized as a regression task. ● The most commonly used dimensional model is Russell’s circumplex model, which consists of the two dimensions valence and arousal.
  8. 8. Affect and emotion
  9. 9. Database - OMG - One Minute Gradual-Emotion 10 hours of data, 497 videos, 6422 utterances. Annotations: Arousal: -1 (Calm) to +1 (Alert); Valence: -1 (Negative) to +1 (Positive); Emotions: "Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise"
  10. 10. Video example, What emotion is represented? Options: “Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise" Arousal? Valence? (numbers between -1 and 1) https://youtu.be/EWRTue-AeSo
  11. 11. Result: Arousal 0.3730994852, Valence 0.2109641637, Emotion: "Surprise"
  12. 12. Face features OpenFace (709 features): a facial behavior analysis tool that provides accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. We get points that represent the face. VGG16 FC6 (4096 features): the faces are cropped (224×224×3), aligned, and have their background zeroed out, then are passed through a pretrained VGG16 to take a 4096-dimensional feature vector from the FC6 layer.
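As a minimal sketch of the FC6 extraction step, the snippet below pulls a 4096-dimensional vector from a pretrained VGG16 in Keras (where the original FC6 layer is named "fc1"); the ImageNet weights and the helper name face_fc6_features are assumptions for illustration, not the authors' exact setup.

```python
# Sketch: extract a 4096-d FC6 feature vector from an aligned 224x224 face crop.
# Assumes TensorFlow/Keras with ImageNet weights (the exact pretrained weights used
# in the paper are not specified here).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

base = VGG16(weights="imagenet", include_top=True)
# In Keras the first fully connected layer (FC6 in the original VGG naming) is "fc1".
fc6_model = tf.keras.Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def face_fc6_features(face_crop):
    """face_crop: HxWx3 uint8 face image, already cropped/aligned with background zeroed."""
    x = tf.image.resize(face_crop.astype(np.float32), (224, 224)).numpy()
    x = preprocess_input(x[np.newaxis])           # add batch dimension, VGG preprocessing
    return fc6_model.predict(x, verbose=0)[0]     # shape (4096,)
```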
  13. 13. Audio features OpenSMILE (1582 features): the audio is extracted from the videos and processed by OpenSMILE, which extracts audio features such as loudness, pitch, jitter, etc.
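A hedged sketch of how such features could be extracted from the command line: openSMILE's SMILExtract takes a config (-C), an input wav (-I), and an output file (-O); the IS10 paralinguistics config shown here is the usual 1582-feature set, but its exact path differs between openSMILE versions.

```python
# Sketch: call the SMILExtract binary for one utterance's wav file.
# The config path is an assumption; adjust it to where your openSMILE install keeps it.
import subprocess

def extract_audio_features(wav_path, out_csv):
    subprocess.run(
        ["SMILExtract",
         "-C", "config/IS10_paraling.conf",   # 1582-feature paralinguistics set (assumed path)
         "-I", wav_path,
         "-O", out_csv],
        check=True)
```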
  14. 14. Text features Opinion Lexicon (6 features): based on the ratio of sentiment words (adjectives, adverbs, verbs, and nouns) that express positive or negative sentiments. Subjective Lexicon (4 features): they used the subjectivity lexicon from MPQA (Multi-Perspective Question Answering), which models sentiment by its type and intensity.
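A toy sketch of a lexicon-based ratio feature; the word sets below are placeholders standing in for the real Opinion Lexicon lists, and the two ratios shown only illustrate the idea, not the exact six features.

```python
# Sketch: ratio of positive/negative sentiment words in one utterance (placeholder lexicon).
POSITIVE_WORDS = {"good", "happy", "great"}    # stand-in for the Opinion Lexicon positives
NEGATIVE_WORDS = {"bad", "sad", "terrible"}    # stand-in for the Opinion Lexicon negatives

def lexicon_ratios(tokens):
    n = max(len(tokens), 1)
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return [pos / n, neg / n]                  # two illustrative ratio features
```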
  15. 15. Feature fusion The features from the same source were normalized and fused, yielding the following feature sizes: Face fused (4096 + 709 = 4805 features); Word fused (6 + 4 = 10 features); Audio features came only from OpenSMILE, so they were not fused (1582 features).
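A minimal sketch of this per-source normalization and concatenation, assuming each block is an array of per-utterance (or per-frame) feature rows:

```python
# Sketch: z-score normalize each feature block, then concatenate ("fuse") them.
import numpy as np

def normalize(feats, eps=1e-8):
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)

def fuse(*feature_blocks):
    # e.g. fuse(vgg_fc6, openface_points) -> (n_samples, 4096 + 709) = (n_samples, 4805)
    return np.concatenate([normalize(f) for f in feature_blocks], axis=1)
```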
  16. 16. Early fusion For early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier.
  17. 17. Early fusion
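A minimal Keras sketch of the early-fusion idea: the three modality feature vectors are concatenated into one joint space before the classifier. The layer sizes and the single emotion head are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch: early fusion of face, audio, and text features (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers

face_in  = tf.keras.Input(shape=(4805,), name="face_fused")
audio_in = tf.keras.Input(shape=(1582,), name="audio")
text_in  = tf.keras.Input(shape=(10,),   name="text_fused")

joint  = layers.Concatenate()([face_in, audio_in, text_in])   # shared joint feature space
hidden = layers.Dense(256, activation="relu")(joint)
emotion = layers.Dense(7, activation="softmax", name="emotion")(hidden)

early_fusion_model = tf.keras.Model([face_in, audio_in, text_in], emotion)
```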
  18. 18. Late fusion For late fusion, classifications are made on each modality and their decisions or predictions are later merged together.
  19. 19. Late fusion
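By contrast, a minimal sketch of late fusion: each modality produces its own class probabilities and the decisions are merged afterwards, here by simple averaging (one common choice; the authors' merging rule may differ).

```python
# Sketch: merge per-modality predictions by averaging their class probabilities.
import numpy as np

def late_fusion(per_modality_probs):
    """per_modality_probs: list of (n_classes,) probability vectors, one per modality."""
    return np.mean(per_modality_probs, axis=0)

# Usage (illustrative): merged = late_fusion([face_probs, audio_probs, text_probs])
```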
  20. 20. LSTM - Long Short-Term Memory An LSTM network is a recurrent neural network that models time- or sequence-dependent behaviour. This is done by feeding the output of a neural network layer at time t back into the input of the same layer at time t + 1.
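A minimal Keras sketch of such a recurrence over an utterance's frame features; the sequence length, layer width, and single arousal head are assumptions for illustration.

```python
# Sketch: an LSTM over per-frame fused face features, one prediction per utterance.
import tensorflow as tf
from tensorflow.keras import layers

frames_in = tf.keras.Input(shape=(None, 4805))     # (time steps, fused face features)
h = layers.LSTM(128)(frames_in)                    # the state at t feeds the step at t + 1
arousal = layers.Dense(1, activation="sigmoid")(h)
lstm_model = tf.keras.Model(frames_in, arousal)
```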
  21. 21. Metrics - Concordance Correlation Coefficient (CCC) CCC is an index of how well a new test or measurement (Y) reproduces a gold-standard test or measurement (X). It quantifies the agreement between these two measures of the same variable. Like a correlation, ρc ranges from -1 to 1, with perfect agreement at 1. It is computed from the means, variances, and correlation coefficient of the two variables: ρc = 2ρσxσy / (σx² + σy² + (μx - μy)²). For fine-tuning, they also used 1 - ρc as the loss function instead of MSE.
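A small sketch of the CCC formula above and of the 1 - CCC loss used for fine-tuning:

```python
# Sketch: Concordance Correlation Coefficient between predictions and gold labels.
import numpy as np

def ccc(x, y):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def ccc_loss(y_true, y_pred):
    return 1.0 - ccc(y_true, y_pred)    # used for fine-tuning instead of MSE
```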
  22. 22. Metrics - Accuracy and F1-score Accuracy: percentage of correct predictions from all predictions made F1-Score: conveys the balance between the precision and the recall
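For the categorical metrics, a scikit-learn sketch; the macro averaging for F1 is an assumption, since the slides do not state which averaging was used.

```python
# Sketch: accuracy and F1 on toy emotion-class predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 3]    # toy labels (indices into the 7 emotion classes)
y_pred = [0, 1, 1, 3]
acc = accuracy_score(y_true, y_pred)
f1  = f1_score(y_true, y_pred, average="macro")   # averaging choice is an assumption
```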
  23. 23. Limitations The dataset is designed to be downloaded from YouTube. Of the 497 videos, 111 were unavailable. I trained with this limited data, so my results differ from the ones that were reported.
  24. 24. Their results
  25. 25. Results
| | CCC Arousal | CCC Valence | Accuracy | F1-score |
| Reported in their paper | 0.400 | 0.353 | - | - |
| Contest evaluation | 0.359 | 0.276 | - | - |
| My local environment | 0.210 | 0.257 | 0.434 | 0.362 |
  26. 26. Mixed features
| | CCC Arousal (their value) | CCC Arousal (my machine) | CCC Valence (their value) | CCC Valence (my machine) | Accuracy (my value) | F1-score (my value) |
| Face Visual | 0.109 | 0.075 | 0.237 | 0.193 | 0.405 | 0.396 |
| Face Feature | 0.046 | 0.007 | 0.080 | 0.012 | 0.204 | 0.204 |
| Face Fusion | 0.175 | 0.113 | 0.261 | 0.149 | 0.381 | 0.383 |
| Audio Feature | 0.273 | 0.207 | 0.266 | 0.015 | 0.418 | 0.420 |
| Text Fusion | 0.137 | 0.107 | 0.259 | 0.037 | 0.259 | 0.259 |
  27. 27. Body features OpenPose (BODY_25) (11 features): the normalized angles between the joints. I did not use the calculated features because they were 25x224x224. VGG16 FC6 skeleton image (4096 features): I drew the skeleton on a black background, fed it to a VGG16, and extracted a feature vector from the FC6 layer.
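A minimal sketch of one such hand-crafted body feature: the normalized angle at a joint, computed from three OpenPose BODY_25 keypoints; which keypoint triples to use is an assumption left open here.

```python
# Sketch: normalized angle at keypoint b, given 2D keypoints a, b, c from OpenPose.
import numpy as np

def joint_angle(a, b, c):
    """Angle at b in [0, 1] (arccos normalized by pi)."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```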
  28. 28. Quad Model The proposed model adds body-gesture features, combining handcrafted and deep features into a fused layer, and evaluates them through an LSTM.
  29. 29. My experiments
| | CCC Arousal | CCC Valence | Accuracy | F1-score |
| Body Feature | 0.067 | 0.013 | 0.285 | 0.283 |
| Body Visual | 0.077 | 0.005 | 0.361 | 0.350 |
| Body Fusion | 0.002 | 0.049 | 0.136 | 0.191 |
| Trimodal + Body Feature | 0.267 | 0.283 | 0.185 | 0.274 |
| Trimodal + Body Visual | 0.006 | 0.244 | 0.411 | 0.407 |
| Trimodal + Body Fusion | 0.026 | 0.307 | 0.449 | 0.451 |
  30. 30. Experiments After running 112 experiments with combinations of the features, we found the best model for each metric. Hardware: NVIDIA GTX 1080 Ti.
  31. 31. Other experiments
| | CCC Arousal | CCC Valence | Accuracy | F1-score |
| Fusion_late Body_feature Audio_feature | 0.272 | 0.064 | 0.380 | 0.380 |
| Fusion_late Face_fusion Audio_feature Word_fusion Body_fusion face_fusion | 0.173 | 0.359 | 0.411 | 0.358 |
| Fusion_early Face_fusion Audio_feature Word_fusion body_fusion | 0.249 | 0.267 | 0.451 | 0.449 |
| Trimodal + Body Fusion | 0.026 | 0.307 | 0.449 | 0.451 |
  32. 32. Final results
| | CCC Arousal | CCC Valence | Accuracy | F1-score |
| Authors' approach (Trimodal) | 0.210 | 0.257 | 0.434 | 0.362 |
| My approach (Quadmodal) | 0.249 | 0.267 | 0.451 | 0.449 |
| Mixed models | 0.272 | 0.359 | 0.451 | 0.451 |
  33. 33. Conclusions ● Multimodal models outperform the baseline methods. ● The results show that cross-modal information benefits the estimation of long-term affective states. ● Early fusion performed better in general, but for some dimensional metrics late fusion performed better.
  34. 34. Next steps ● I’m planning to explore 3D convolutions instead of LSTMs, try ResNet instead of VGG16, and use different network models for each feature. ● UPDATE: these are the evaluations on the validation and test datasets.
| | CCC Arousal | CCC Valence | Accuracy | F1-score |
| Trimodal Val | 0.298 | 0.428 | 0.440 | 0.455 |
| Trimodal Test | 0.180 | 0.405 | 0.455 | 0.455 |
| Quadmodal Val | 0.340 | 0.454 | 0.445 | 0.453 |
| Quadmodal Test | 0.235 | 0.413 | 0.453 | 0.453 |
  35. 35. Thanks
  36. 36. Trimodal
  37. 37. Quadmodal architecture
  38. 38. LSTM
  39. 39. Decision layers The activation functions used for each output were: Emotion (categorical): softmax; Valence (dimensional): hyperbolic tangent (tanh); Arousal (dimensional): sigmoid.
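A Keras sketch of these three decision heads; the shared 256-dimensional trunk is an assumption for illustration.

```python
# Sketch: one output head per target, with the activations listed above.
import tensorflow as tf
from tensorflow.keras import layers

trunk = tf.keras.Input(shape=(256,))
emotion = layers.Dense(7, activation="softmax", name="emotion")(trunk)   # categorical
valence = layers.Dense(1, activation="tanh",    name="valence")(trunk)   # range [-1, 1]
arousal = layers.Dense(1, activation="sigmoid", name="arousal")(trunk)   # range [0, 1]
decision_heads = tf.keras.Model(trunk, [emotion, valence, arousal])
```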
  40. 40. Sigmoid as activation function A sigmoid activation function turns an activation into a value between 0 and 1. It is useful for binary classification problems and is mostly used in the final output layer of such problems. Also, sigmoid activation leads to slow gradient descent because the slope is small for high and low values.
  41. 41. Hyperbolic tangent as activation function A Tanh activation function turns an activation into a value between -1 and +1. The outputs are normalized. The gradient is stronger for tanh than sigmoid (derivatives are steeper)
  42. 42. SoftMax as activation function The softmax function is an activation function that turns numbers (logits) into probabilities that sum to one.
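The three activations from the last slides, written out as a small NumPy sketch:

```python
# Sketch: sigmoid, tanh, and softmax on raw scores (logits).
import numpy as np

def sigmoid(z):                   # (0, 1); slope vanishes for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # (-1, 1); steeper gradients than sigmoid
    return np.tanh(z)

def softmax(z):                   # probabilities over classes, summing to one
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()
```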
  43. 43. MSE as loss function for linear regression Linear regression uses Mean Squared Error as its loss function. MSE gives a convex loss surface, so the optimization can be completed by finding its vertex, the global minimum.
  44. 44. SGD as Optimizer Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x_i (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation to the gradient calculated using the entire training data. Each update is now much faster to calculate than in batch gradient descent, and over many updates, we will head in the same general direction
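A minimal sketch of one SGD update for linear regression under the squared-error loss from the previous slide; the learning rate and variable names are illustrative.

```python
# Sketch: a single stochastic gradient descent step on one sample (x_i, y_i).
import numpy as np

def sgd_step(w, b, x_i, y_i, lr=0.01):
    err = (np.dot(w, x_i) + b) - y_i     # prediction error on this one sample
    w = w - lr * 2 * err * x_i           # gradient of err**2 with respect to w
    b = b - lr * 2 * err                 # gradient of err**2 with respect to b
    return w, b
```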
  45. 45. Layers (diagram labels): early fusion hidden layer, early fusion, fully connected, LSTM, late fusion
  46. 46. 1DConv Average Pooling 1D convolutional neural nets can be used for extracting local 1D patches (subsequences) from sequences and can identify local patterns within the window of convolution. A pattern learnt at one position can also be recognized at a different position, making 1D conv nets translation invariant. Some sequences are so long that they cannot be realistically processed by RNNs. In such cases, 1D conv nets can be used as a pre-processing step to make the sequence smaller through downsampling, extracting higher-level features which can then be passed on to the RNN as input.
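A Keras sketch of this downsampling idea, with illustrative filter counts and window sizes:

```python
# Sketch: Conv1D + average pooling to shorten a long feature sequence before the LSTM.
import tensorflow as tf
from tensorflow.keras import layers

seq_in = tf.keras.Input(shape=(None, 1582))                      # long per-frame feature sequence
x = layers.Conv1D(64, kernel_size=5, activation="relu")(seq_in)  # local 1D patterns
x = layers.AveragePooling1D(pool_size=4)(x)                      # downsample: 4x shorter
x = layers.LSTM(64)(x)                                           # RNN on the shorter sequence
conv_lstm_model = tf.keras.Model(seq_in, x)
```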
  47. 47. Batch Normalization We normalize the input layer by adjusting and scaling the activations to speed up learning, and do the same for the values in the hidden layers, which keep changing during training.
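A NumPy sketch of the normalization described above, with a learnable scale and shift as in standard batch normalization:

```python
# Sketch: normalize a mini-batch of activations to zero mean and unit variance.
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    mu, var = h.mean(axis=0), h.var(axis=0)     # statistics over the batch
    h_hat = (h - mu) / np.sqrt(var + eps)       # normalized activations
    return gamma * h_hat + beta                 # learnable scale and shift
```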
  48. 48. VGG16
