The document describes research on multimodal affect recognition using facial expressions, audio, and text features. The researchers developed models that fuse features from these modalities using both early and late fusion approaches. Their best performing model was a quadmodal network that added body-gesture features to the face, audio, and text features. On the test set this model achieved a CCC of 0.235 for arousal and 0.413 for valence, along with an accuracy of 0.453 and an F1-score of 0.453, outperforming the trimodal baseline on both CCC metrics. The researchers conclude that multimodal models can improve affect recognition and propose further exploring different fusion approaches and network architectures.
Similar to Multimodal emotion recognition at utterance level with spatio-temporal feature fusion by using face, body, audio, and text features.
3. Paper
Multimodal Utterance-level Affect Analysis using Visual, Audio and Text
Features
Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi
Department of Electronic and Computer Engineering, Hong Kong University of
Science and Technology
IFP, Beckman, University of Illinois at Urbana-Champaign
Highest score in the “Visual + Audio + Text” category of the OMG Emotion
Challenge 2018
4. Long-term (spatio-temporal) emotion recognition
● The integration of information across multiple modalities and across time
is a promising way to enhance the emotion recognition performance of
affective systems.
● Much previous work has focused on instantaneous emotion recognition.
● This work addresses long-term emotion recognition by integrating cues
from multiple modalities.
● Since emotions normally change gradually under the same context,
analyzing long-term dependency of emotions will stabilize the overall
predictions.
5. Utterance level
● Spoken statement
● It is a continuous piece of speech beginning and ending with a clear
pause.
● Utterances do not exist in written language
● This word does not exist in some languages.
6. Multimodal
● Humans perceive others’ emotional states by combining information
across multiple modalities simultaneously.
● Intuitively, a multi-modal inference network should be able to leverage
information from each modality and their correlations to improve recognition
over that achievable by a single modality network.
● This work uses multiple modalities including facial expression, audio and
language.
● The paper describes a multi-modal neural architecture that integrates visual
information over time using LSTMs, and combines it with utterance level
audio and text cues to recognize human sentiment from multimodal clips.
7. Affect (dimensional) vs Emotion (discrete) recognition
● Dimensional models aim to avoid the restrictiveness of discrete states,
and allow more flexible definition of affective states as points in a
multi-dimensional space spanned by concepts such as affect intensity and
positivity.
● For affect recognition, the dimensional space is commonly operationalized as
a regression task.
● The most commonly used dimensional model is Russell’s circumplex model,
which consists of the two dimensions valence and arousal.
10. Database - OMG - One Minute Gradual-Emotion
10 hours of data
497 videos
6422 utterances
Annotations:
Arousal: -1 Calm to +1 Alert
Valence: -1 Negative to +1 Positive
Emotions: "Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise"
11. Video example, What emotion is represented?
Options: “Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise"
Arousal? Valence? (numbers between -1 and 1)
https://youtu.be/EWRTue-AeSo
13. Face features
OpenFace (709 features): a facial behavior analysis tool that provides
accurate facial landmark detection, head pose estimation, facial action unit
recognition, and eye-gaze estimation. It gives us points that represent the face.
VGG16 FC6 (4096 features): the faces are cropped (224×224×3), aligned, the
background is zeroed out, and they are passed through a pretrained VGG16 to
take a 4096-dimensional feature vector from the FC6 layer.
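A minimal sketch of the deep face feature step: extracting a 4096-dimensional FC6 vector from a preprocessed face crop with a pretrained VGG16 in Keras. The layer name "fc1" (Keras's name for VGG16's first fully connected layer, i.e. FC6) and the preprocessing details are assumptions of this sketch, not the authors' exact pipeline.

```python
# Hedged sketch: 4096-d "FC6" features from an aligned, background-masked
# 224x224 face crop using a pretrained VGG16 (Keras names this layer 'fc1').
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=True)
fc6_extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def face_fc6_features(face_crop):
    """face_crop: aligned face image of shape (224, 224, 3), background zeroed."""
    x = preprocess_input(face_crop.astype("float32")[None, ...])  # add batch dim
    return fc6_extractor.predict(x)[0]  # 4096-d feature vector
```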
14. Audio features
OpenSMILE (1582 features): the audio is extracted from the videos and
processed by OpenSMILE, which extracts audio features such as loudness,
pitch, jitter, etc.
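A minimal sketch of functional-level audio feature extraction, assuming the opensmile Python package. The ComParE_2016 set used here is only a stand-in: the slides' 1582-feature configuration likely corresponds to a different openSMILE config (an emobase2010/IS10-style set) that may only be available through the command-line tool.

```python
# Hedged sketch of openSMILE functional features per utterance.
# Feature set and file path are illustrative assumptions.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,      # stand-in feature set
    feature_level=opensmile.FeatureLevel.Functionals,   # one vector per utterance
)
features = smile.process_file("utterance.wav")  # hypothetical local file
print(features.shape)  # (1, n_features) DataFrame of functionals
```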
15. Text features
Opinion Lexicon (6 features): based on the ratio of sentiment words
(adjectives, adverbs, verbs and nouns) that express positive or negative
sentiments.
Subjective Lexicon (4 features): they used the subjective lexicon from MPQA
(Multi-Perspective Question Answering), which models sentiment by its type
and intensity.
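An illustrative sketch of lexicon-based sentiment ratio features. The file names, tokenization, and the three ratios below are assumptions for illustration, not the paper's exact six Opinion Lexicon features.

```python
# Hedged sketch of sentiment-word ratio features from an opinion lexicon.
def load_lexicon(path):
    # One word per line; lines starting with ';' are comments (an assumption).
    with open(path, encoding="latin-1") as f:
        return {w.strip().lower() for w in f if w.strip() and not w.startswith(";")}

positive = load_lexicon("positive-words.txt")  # hypothetical local path
negative = load_lexicon("negative-words.txt")  # hypothetical local path

def sentiment_ratios(tokens):
    """tokens: list of words in the utterance -> example ratio features."""
    n = max(len(tokens), 1)
    pos = sum(t.lower() in positive for t in tokens)
    neg = sum(t.lower() in negative for t in tokens)
    return [pos / n, neg / n, (pos - neg) / n]
```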
16. Feature fusion
The features from the same source were normalized and fused, giving the
following feature sizes:
Face fused (4096 + 709 = 4805 features)
Word fused (6 + 4 = 10 features)
Audio features came only from OpenSMILE, so these were not fused (1582 features)
17. Early fusion
For early fusion, features from different modalities are projected into the same
joint feature space before being fed into the classifier.
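A minimal sketch of early fusion under the dimensions listed above: normalized per-modality vectors are concatenated into one joint vector before the classifier. Z-score normalization and the utterance-level alignment of all three modalities are assumptions of this sketch.

```python
# Hedged sketch of early fusion by feature concatenation.
import numpy as np

def zscore(x, eps=1e-8):
    # Normalize each feature dimension across the dataset.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def early_fusion(face, word, audio):
    """face: (N, 4805), word: (N, 10), audio: (N, 1582) -> (N, 6397)."""
    return np.concatenate([zscore(face), zscore(word), zscore(audio)], axis=1)
```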
21. LSTM - Long Short-Term Memory
An LSTM network is a recurrent
neural network that models time- or
sequence-dependent behaviour.
This is performed by feeding the
output of a neural network layer
at time t back into the input of the
same network layer at time t + 1.
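A minimal Keras sketch of an LSTM summarizing a sequence of per-frame face features into one utterance-level vector; the input dimension (4805, from the fused face features) and the hidden size are assumptions.

```python
# Hedged sketch: an LSTM over a variable-length frame sequence,
# returning its last hidden state as the utterance representation.
from tensorflow.keras import layers, models

seq_in = layers.Input(shape=(None, 4805))  # (timesteps, fused face features)
utterance_vec = layers.LSTM(128)(seq_in)   # last hidden state summarizes the utterance
model = models.Model(seq_in, utterance_vec)
```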
22. Metrics - Concordance Correlation Coefficients
CCC is an index of how well a new test or measurement (Y) reproduces a gold
standard test or measurement (X). It quantifies the agreement between these
two measures of the same variable. Like a correlation, ρc ranges from -1 to 1,
with perfect agreement at 1.
It is computed from the means, variances, and the correlation coefficient of the two variables.
As a fine-tuning step, they also used 1 − CCC as the loss function instead of MSE.
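A small numpy sketch of the CCC computation; 1 − CCC can then serve as the regression loss, as mentioned above.

```python
# Concordance Correlation Coefficient between predictions x and labels y.
import numpy as np

def ccc(x, y):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    # 2*cov / (var_x + var_y + (mean_x - mean_y)^2)
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```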
23. Metrics - Accuracy and F1-score
Accuracy: percentage of correct predictions from all predictions made
F1-Score: conveys the balance between the precision and the recall
24. Limitations
The dataset was designed to be downloaded from YouTube.
Of the 497 videos, 111 were unavailable.
I trained with this limited data, so my results differ from the ones that were
reported.
26. Results
                          CCC Arousal   CCC Valence   Accuracy   F1-score
Reported in their paper      0.400         0.353          —          —
Contest evaluation           0.359         0.276          —          —
My local environment         0.210         0.257        0.434      0.362
27. Mixed features
                  CCC Arousal              CCC Valence              Accuracy   F1-score
                  Their value  My machine  Their value  My machine  My value   My value
Face Visual         0.109        0.075       0.237        0.193      0.405      0.396
Face Feature        0.046        0.007       0.080        0.012      0.204      0.204
Face Fusion         0.175        0.113       0.261        0.149      0.381      0.383
Audio Feature       0.273        0.207       0.266        0.015      0.418      0.420
Text Fusion         0.137        0.107       0.259        0.037      0.259      0.259
28. Body Features
OpenPose (BODY_25) (11 features): the normalized angles between the joints.
I did not use the features calculated by OpenPose directly because they were
25×224×224.
VGG16 FC6 skeleton image (4096 features): I drew the skeleton on a black
background, fed it into a VGG16, and extracted the feature vector from the
FC6 layer.
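An illustrative sketch of the handcrafted body features: angles between joints of a BODY_25 skeleton. The joint triples and the normalization below are assumptions, not the exact 11 angles used in the experiments.

```python
# Hedged sketch of joint-angle features from OpenPose BODY_25 keypoints.
import numpy as np

LIMB_TRIPLES = [(1, 2, 3), (2, 3, 4), (1, 5, 6), (5, 6, 7)]  # (parent, joint, child), illustrative

def joint_angle(a, b, c):
    """Angle at joint b formed by points a-b-c, normalized to [0, 1]."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def body_angle_features(keypoints):
    """keypoints: (25, 2) array of BODY_25 joint coordinates."""
    return np.array([joint_angle(keypoints[a], keypoints[b], keypoints[c])
                     for a, b, c in LIMB_TRIPLES])
```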
29. Quad Model
The proposed model adds body gesture features, combining handcrafted and
deep features into a fused layer, and evaluates them through an LSTM.
30. My experiments
                          CCC Arousal   CCC Valence   Accuracy   F1-score
Body Feature                 0.067         0.013        0.285      0.283
Body Visual                  0.077         0.005        0.361      0.350
Body Fusion                  0.002         0.049        0.136      0.191
Trimodal + Body Feature      0.267         0.283        0.185      0.274
Trimodal + Body Visual       0.006         0.244        0.411      0.407
Trimodal + Body Fusion       0.026         0.307        0.449      0.451
34. Conclusions
● Multimodal models outperform the baseline methods.
● The results show that cross-modal information benefits the estimation of
long-term affective states.
● Early fusion performed better in general, but for some dimensional metrics
late fusion performed better.
35. Next steps
● I’m planning to explore 3D convolutions instead of LSTMs, try ResNet instead
of VGG16, and different network models for each feature.
● UPDATE: These are the evaluations on the validation and test datasets.

                  CCC Arousal   CCC Valence   Accuracy   F1-score
Trimodal Val         0.298         0.428        0.440      0.455
Trimodal Test        0.180         0.405        0.455      0.455
Quadmodal Val        0.340         0.454        0.445      0.453
Quadmodal Test       0.235         0.413        0.453      0.453
40. Decision layers
The activation functions used for each output were:
Emotion (categorical): Softmax
Valence (dimensional): hyperbolic tangent function (tanh)
Arousal (dimensional): Sigmoid
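A minimal Keras sketch of the three decision heads with the activations listed above. The shared 128-dimensional input and the rescaling of the sigmoid arousal output to the annotation range are assumptions of this sketch.

```python
# Hedged sketch of the decision layers: one shared representation, three heads.
from tensorflow.keras import layers, models

shared = layers.Input(shape=(128,))  # assumed utterance-level representation
emotion = layers.Dense(7, activation="softmax", name="emotion")(shared)   # 7 categories
valence = layers.Dense(1, activation="tanh", name="valence")(shared)      # in [-1, 1]
arousal = layers.Dense(1, activation="sigmoid", name="arousal")(shared)   # in (0, 1);
# rescaling arousal to the [-1, 1] annotation range afterwards is an assumption
model = models.Model(shared, [emotion, valence, arousal])
```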
41. Sigmoid as activation function
A sigmoid activation function turns an
activation into a value between 0 and
1. It is useful for binary classification
problems and is mostly used in the
final output layer of such problems.
Also, sigmoid activation leads to slow
gradient descent because the slope is
small for high and low values.
42. Hyperbolic tangent as activation function
A Tanh activation function turns an
activation into a value between -1 and
+1. The outputs are normalized. The
gradient is stronger for tanh than sigmoid
(derivatives are steeper)
43. SoftMax as activation function
The Softmax function is an
activation function that turns
numbers (logits) into
probabilities that sum to one.
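For reference, a small numpy sketch of the three activation functions discussed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()                # probabilities summing to one
```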
44. MSE as loss function for linear regression
Linear regression uses Mean Squared
Error as its loss function, which gives a
convex graph, so the optimization can be
completed by finding its vertex as the
global minimum.
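For reference, the mean squared error over n predictions is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2$$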
45. SGD as Optimizer
Stochastic gradient descent (SGD)
computes the gradient for each
update using a single training data
point x_i (chosen at random). The
idea is that the gradient calculated
this way is a stochastic approximation
to the gradient calculated using the
entire training data. Each update is
now much faster to calculate than in
batch gradient descent, and over
many updates, we will head in the
same general direction
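In symbols, each SGD update with learning rate η uses a single randomly chosen example (x_i, y_i), as described above:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L\bigl(\theta_t;\, x_i, y_i\bigr)$$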
47. 1DConv Average Pooling
1D convolutional neural nets can be used to extract local 1D patches
(subsequences) from sequences and can identify local patterns within the window
of convolution. A pattern learnt at one position can also be recognized at a
different position, making 1D conv nets translation invariant. Some sequences
are so long that they cannot realistically be processed by RNNs. In such cases,
1D conv nets can be used as a pre-processing step to make the sequence smaller
through downsampling, extracting higher-level features which can then be
passed on to the RNN as input.
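A minimal Keras sketch of this downsampling idea: a Conv1D layer followed by average pooling shortens the sequence before the LSTM. Filter count, kernel size, pool size, and the input dimension are assumptions.

```python
# Hedged sketch: Conv1D + average pooling as downsampling before an RNN.
from tensorflow.keras import layers, models

seq_in = layers.Input(shape=(None, 1582))  # long feature sequence (assumed audio dim)
x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(seq_in)
x = layers.AveragePooling1D(pool_size=4)(x)  # sequence becomes 4x shorter
x = layers.LSTM(128)(x)                      # RNN runs on the downsampled sequence
model = models.Model(seq_in, x)
```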
48. Batch Normalization
We normalize the input layer by
adjusting and scaling the activations
to speed up learning; the same is done
for the values in the hidden layers,
which are changing all the time.
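In symbols, batch normalization standardizes each activation over a mini-batch B and then rescales it with the learned parameters γ and β:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta$$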