Emotion Detection from Speech
1. Introduction
Although emotion detection from speech is a relatively new field of research, it has many potential
applications. In human-computer or human-human interaction systems, emotion recognition systems
could provide users with improved services by being adaptive to their emotions. In virtual worlds,
emotion recognition could help simulate more realistic avatar interaction.

The body of work on detecting emotion in speech is quite limited. Currently, researchers are still
debating what features influence the recognition of emotion in speech. There is also considerable
uncertainty as to the best algorithm for classifying emotion, and which emotions to class together.

In this project, we attempt to address these issues. We use K-Means and Support Vector Machines
(SVMs) to classify opposing emotions. We separate the speech by speaker gender to investigate the
relationship between gender and emotional content of speech.

There are a variety of temporal and spectral features that can be extracted from human speech. We use
statistics relating to the pitch, Mel Frequency Cepstral Coefficients (MFCCs) and formants of speech as
inputs to classification algorithms. The emotion recognition accuracies of these experiments allow us to
explain which features carry the most emotional information and why, and to develop criteria for
classing emotions together. Using these techniques we are able to achieve high emotion recognition
accuracy.

2. Corpus of Emotional Speech Data
The data used for this project comes from the Linguistic Data Consortium’s study on Emotional
Prosody and Speech Transcripts [1]. The audio recordings and corresponding transcripts were collected
over an eight month period in 2000-2001 and are designed to support research in emotional prosody.

The recordings consist of professional actors reading a series of semantically neutral utterances (dates
and numbers) spanning fourteen distinct emotional categories, selected after Banse & Scherer's study of
vocal emotional expression in German [2]. There were 5 female speakers and 3 male speakers, all in
their mid-20s. The number of utterances belonging to each emotion category is shown in Table 1. The
recordings were made at a sampling rate of 22050 Hz and encoded in two-channel interleaved 16-bit
PCM, high-byte-first ("big-endian") format. They were then converted to single-channel recordings
by taking the average of both channels and removing the DC offset.
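
As an illustration of this preprocessing step, the sketch below (a minimal Python stand-in, not the authors' original tooling) averages the two channels and removes the DC offset, assuming the audio has already been stored as a standard two-channel WAV file:

```python
# Preprocessing sketch (assumed tooling, not the original scripts):
# average the two channels of a stereo recording and remove the DC offset.
import numpy as np
from scipy.io import wavfile

def to_mono_without_dc(path):
    rate, stereo = wavfile.read(path)              # expects a two-channel WAV file
    mono = stereo.astype(np.float64).mean(axis=1)  # average left and right channels
    mono -= mono.mean()                            # remove the DC offset
    return rate, mono
```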

    Neutral: 82        Disgust: 171       Panic: 141         Anxiety: 170
    Hot Anger: 138     Cold Anger: 155    Despair: 171       Sadness: 149
    Elation: 159       Happy: 177         Interest: 176      Boredom: 154
    Shame: 148         Pride: 150         Contempt: 181

    Table 1: Number of utterances belonging to each Emotion Category

3. Feature Extraction
Pitch and related features
Bänziger and Scherer argued that statistics related to pitch convey considerable information about
emotional state [3]. Yu et al. have shown that some statistics of the pitch carry information about
emotion in Mandarin speech [4].
For this project, pitch is extracted from the speech waveform using a modified version of the RAPT
algorithm for pitch tracking [5] implemented in the VOICEBOX toolbox [6]. Using a frame length of
50ms, the pitch for each frame was calculated and placed in a vector corresponding to that frame. If a
frame is unvoiced, the corresponding entry in the pitch vector was set to zero.

Figure 1 shows the variation in pitch for a female speaker uttering “Seventy one” in emotional states of
despair and elation. It is evident from this figure that the mean and variance of the pitch are higher when
“Seventy one” is uttered in elation rather than despair. In order to capture these and other characteristics,
the following statistics are calculated from the pitch and used in the pitch feature vector:
• Mean, Median, Variance, Maximum, Minimum (for the pitch vector and its derivative)
• Average energies of voiced and unvoiced speech
• Speaking rate (inverse of the average length of the voiced parts of the utterance)
Hence, the pitch feature vector is 13-dimensional.

[Figure 1: Variation in pitch (Hz) versus frame number for the two emotional states, elation and despair.]
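
A minimal sketch of assembling this 13-dimensional vector is given below. It is a Python approximation under stated assumptions: librosa's pYIN tracker and a simple voiced-run heuristic stand in for the RAPT/VOICEBOX pipeline used in the paper, so its output will not match the original features exactly.

```python
# Sketch of the 13-dimensional pitch feature vector (assumed substitutes for
# RAPT/VOICEBOX): 5 statistics of the voiced pitch, 5 of its derivative,
# voiced/unvoiced frame energies, and a speaking-rate estimate.
import numpy as np
import librosa

def pitch_features(y, sr):
    frame = int(0.050 * sr)                                   # 50 ms frames
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                 frame_length=frame, hop_length=frame)
    voiced = voiced.astype(bool)
    pitch = np.nan_to_num(f0)[voiced]                         # voiced pitch values
    if pitch.size == 0:
        pitch = np.zeros(1)
    d_pitch = np.diff(pitch) if pitch.size > 1 else np.zeros(1)
    stats = lambda v: [np.mean(v), np.median(v), np.var(v), np.max(v), np.min(v)]

    # frame energies, split into voiced and unvoiced parts
    frames = librosa.util.frame(y, frame_length=frame, hop_length=frame)
    energy = (frames ** 2).mean(axis=0)
    n = min(voiced.size, energy.size)
    v, e = voiced[:n], energy[:n]
    voiced_energy = e[v].mean() if v.any() else 0.0
    unvoiced_energy = e[~v].mean() if (~v).any() else 0.0

    # speaking rate: inverse of the average length (in frames) of the voiced runs
    edges = np.flatnonzero(np.diff(np.r_[0, v.astype(np.int8), 0]))
    run_lengths = (edges[1::2] - edges[0::2]) if edges.size else np.array([])
    speaking_rate = 1.0 / run_lengths.mean() if run_lengths.size else 0.0

    return np.array(stats(pitch) + stats(d_pitch)
                    + [voiced_energy, unvoiced_energy, speaking_rate])  # 13 values
```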

MFCC and related features
MFCCs are the most widely used spectral representation of speech in many applications, including
speech and speaker recognition. Kim et al. argued that statistics relating to MFCCs also carry emotional
information [7].

For each 25ms frame of speech, thirteen standard MFCC parameters are calculated by taking the
absolute value of the STFT, warping it to a Mel frequency scale, taking the DCT of the log-Mel-spectrum
and returning the first 13 components [8]. Figure 2 shows the variation in three MFCCs for a female
speaker uttering “Seventy one” in emotional states of despair and elation. It is evident from this figure
that the mean of the first coefficient is higher when “Seventy one” is uttered in elation rather than
despair, but is lower for the second and third coefficients. In order to capture these and other
characteristics, we extracted statistics based on the MFCCs. For each coefficient and its derivative we
calculated the mean, variance, maximum and minimum across all frames. We also calculated the mean,
variance, maximum and minimum of the per-coefficient means of the coefficients and their derivatives.
Each MFCC feature vector is 112-dimensional.

[Figure 2: Variation in the first, second and third MFCCs versus frame number for the two emotional states, elation and despair.]
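
The sketch below illustrates how this 112-dimensional vector can be assembled (13 coefficients × 4 statistics, the same for the derivatives, plus 4 statistics over the per-coefficient means of each). It assumes librosa's MFCC implementation in place of the rastamat MATLAB routines cited in [8].

```python
# Sketch of the 112-dimensional MFCC feature vector under the assumptions above.
import numpy as np
import librosa

def mfcc_features(y, sr):
    hop = int(0.025 * sr)                                    # 25 ms frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=hop, hop_length=hop, n_mels=40)
    d_mfcc = np.diff(mfcc, axis=1)                           # frame-to-frame derivatives

    def summarise(m):
        per_coeff = np.concatenate([m.mean(axis=1), m.var(axis=1),
                                    m.max(axis=1), m.min(axis=1)])   # 13 x 4 values
        means = m.mean(axis=1)
        across_means = np.array([means.mean(), means.var(),
                                 means.max(), means.min()])          # 4 values
        return np.concatenate([per_coeff, across_means])             # 56 values

    return np.concatenate([summarise(mfcc), summarise(d_mfcc)])      # 112 values
```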

Formants and related features
Tracking formants over time models changes in the shape of the vocal tract. Linear Predictive Coding
(LPC) is widely used to model formants in speech synthesis [9]. Prior work by Petrushin suggests that
formants carry information about emotional content [10]. The first three formants and their bandwidths
were estimated using LPC on 15ms frames of speech. For each of the three formants, their derivatives
and bandwidths, we calculated the mean, variance, maximum and minimum across all frames. We also
calculated the mean, variance, maximum and minimum of the per-formant means of the formant
frequencies, their derivatives and bandwidths. The formant feature vector is 48-dimensional.
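
A rough sketch of the per-frame formant estimation is shown below; the root-finding recipe and LPC order are standard choices assumed here, not necessarily the authors' exact implementation. The resulting formant and bandwidth tracks (and the formant derivatives) would then be summarised with the same mean/variance/maximum/minimum scheme as the MFCCs to give the 48-dimensional vector.

```python
# Sketch of LPC-based formant estimation on 15 ms frames (assumed recipe).
import numpy as np
import librosa

def frame_formants(frame, sr, order=12, n_formants=3):
    """First three formant frequencies and bandwidths of one frame via LPC roots."""
    if not np.any(frame):                                  # skip silent frames
        return np.full(n_formants, np.nan), np.full(n_formants, np.nan)
    a = librosa.lpc(frame, order=order)                    # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                      # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)             # resonance frequencies (Hz)
    bws = -0.5 * (sr / np.pi) * np.log(np.abs(roots))      # corresponding bandwidths
    idx = np.argsort(freqs)
    f, b = np.full(n_formants, np.nan), np.full(n_formants, np.nan)
    k = min(n_formants, idx.size)
    f[:k], b[:k] = freqs[idx][:k], bws[idx][:k]
    return f, b

def formant_tracks(y, sr):
    """Formant and bandwidth tracks over 15 ms frames; shape (n_frames, 3) each."""
    hop = int(0.015 * sr)
    frames = librosa.util.frame(y, frame_length=hop, hop_length=hop)
    pairs = [frame_formants(f.astype(np.float64), sr) for f in frames.T]
    return np.array([p[0] for p in pairs]), np.array([p[1] for p in pairs])
```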
4. Classification
We tried to differentiate between “opposing” emotional states. Six different “opposing” emotion pairs
were chosen: despair and elation, happy and sadness, interest and boredom, shame and pride, hot anger
and elation, and cold anger and sadness.

For each emotion pair, we formed data sets comprising emotional speech from all speakers, from only
male speakers, and from only female speakers, because the features are affected by the gender of the
speaker. For example, the pitch of male voices typically ranges from 80Hz to 200Hz, while the pitch of
female voices ranges from 150Hz to 350Hz [11]. This corresponds to a total of eighteen unique data sets.

For each data set, we formed inputs to our classification algorithms comprising feature vectors from:
Pitch only, MFCCs only, Formants only, Pitch & MFCCs, Pitch & Formants, MFCCs & Formants, and
Pitch, MFCCs & Formants. Hence, for each emotion pair, the classification algorithms were run on
twenty-one different sets of inputs (seven feature combinations for each of the three speaker groups).
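
For illustration, the seven feature-set inputs for a given data set could be formed by concatenating the per-utterance feature matrices (the array names below are assumptions, not code from the project):

```python
# Sketch: build the 7 feature-set combinations for one data set.
import numpy as np
from itertools import combinations

def feature_inputs(pitch, mfcc, formant):
    """Each argument: array of shape (n_utterances, feature_dim)."""
    base = {"Pitch": pitch, "MFCC": mfcc, "Formant": formant}
    inputs = {}
    for r in (1, 2, 3):
        for names in combinations(base, r):
            inputs[" & ".join(names)] = np.hstack([base[n] for n in names])
    return inputs   # 7 entries: the single sets plus all pairwise and triple combos
```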

K-Means Clustering
For each emotion pair, all input sets were clustered using K-Means clustering (k = 2) for all twelve
combinations of the parameters listed below:
Distance Measure Minimized: Squared Euclidean, L1 norm, Correlation, and Cosine (the Correlation
and Cosine distance measures used here are as defined in the MATLAB ‘kmeans’ function).
Initial Cluster Centroids: Random Centroids and User Defined Centroids (UDC). A UDC is the
centroid that minimizes the distance measure to the input feature vectors of one emotion in the emotion pair.
Maximum Number of Iterations: 1 (only when the initial cluster centroid is a UDC) and 100 (for both
Random and UDC centroids).

The error used to obtain the recognition accuracy is the average of the training errors obtained by 10-
fold cross validation, and is an estimate of the generalization error. The variance of the recognition
accuracy is the variance of these training errors. For each experiment, the highest recognition accuracy
achieved, its variance, and the input features and clustering parameters used are listed in Table 2.
   All Speakers
    Experiment         Features     Distance Measure     Centroid    Iterations   Recognition Accuracy   Variance
  despair-elation       MFCC            L1 norm           UDC           100             75.76%            1.74%
  happy-sadness         MFCC            L1 norm           UDC             1             77.91%           14.34%
 interest-boredom       Pitch           L1 norm           UDC           100             71.21%            2.48%
    shame-pride         MFCC            L1 norm           UDC             1             73.15%            3.23%
 hot anger-elation      MFCC            L1 norm           UDC             1             69.70%           10.75%
cold anger-sadness      MFCC            L1 norm           UDC             1             75.66%            3.35%
  Male Speakers
    Experiment         Features     Distance Measure     Centroid    Iterations   Recognition Accuracy   Variance
  despair-elation    MFCC & Pitch      Correlation        UDC             1             87.80%            0.14%
  happy-sadness         MFCC            L1 norm           UDC             1             88.80%            3.66%
 interest-boredom    MFCC & Pitch        Cosine          Random         100             81.20%            6.36%
    shame-pride      MFCC & Pitch      Correlation        UDC             1             74.24%           15.53%
 hot anger-elation      MFCC            L1 norm           UDC             1             65.89%           14.95%
cold anger-sadness      MFCC            L1 norm           UDC             1             88.43%            9.78%
Female Speakers
    Experiment         Features      Distance Measure     Centroid     Iterations Recognition Accuracy   Variance
  despair-elation       MFCC              L1 norm           UDC             1            80.42%           9.66%
  happy-sadness         MFCC              L1 norm           UDC             1            72.80%          15.24%
 interest-boredom       MFCC              L1 norm           UDC             1            70.62%          18.06%
    shame-pride         MFCC              L1 norm           UDC             1            81.18%          19.79%
 hot anger-elation      MFCC              L1 norm           UDC             1            77.16%           4.37%
cold anger-sadness      MFCC            Correlation         UDC             1            72.04%          15.00%
                         Table 2: Highest Recognition Accuracies using K-means Clustering
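
The following sketch illustrates this setup under stated assumptions: scikit-learn's KMeans (which supports only the squared Euclidean distance) stands in for MATLAB's kmeans, the UDC is taken to be the mean feature vector of one emotion in the pair, and, for simplicity, accuracy is scored on each held-out fold rather than on the training folds.

```python
# Sketch of K-Means with UDC initialization and a 10-fold accuracy estimate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

def kmeans_udc_accuracy(X, y, max_iter=100):
    """X: (n_utterances, n_features) inputs; y: 0/1 labels for one emotion pair."""
    accuracies = []
    for train, test in StratifiedKFold(n_splits=10).split(X, y):
        # UDC: one centroid per emotion, taken as that emotion's mean feature vector
        udc = np.vstack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
        km = KMeans(n_clusters=2, init=udc, n_init=1, max_iter=max_iter).fit(X[train])
        pred = km.predict(X[test])
        # with UDC init, cluster c starts at emotion c; scoring the better of the
        # two label assignments also covers random initialization
        acc = max(np.mean(pred == y[test]), np.mean((1 - pred) == y[test]))
        accuracies.append(acc)
    return np.mean(accuracies), np.var(accuracies)
```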
Support Vector Machines (SVMs)
A modified version of the 2-class SVM classifier in Schwaighofer’s SVM Toolbox [12] was used to
classify all input sets of each emotion pair. The two kernels used and their parameters are:
1. Linear Kernel, with parameter C (the upper bound on the coefficients αi) ranging from 0.1 to 100
in multiplicative steps of 10.
2. Radial Basis Function (RBF) Kernel, with parameter C (the upper bound on the coefficients αi)
ranging from 0.1 to 10 in multiplicative steps of 2.

The recognition accuracy and variance were calculated using the same technique as for K-means. For
each experiment, the highest recognition accuracy achieved, its variance, and the input features and
kernel parameters used are listed in Table 3.
            All Speakers
             Experiment           Features           Kernel          C    Recognition Accuracy    Variance
           despair-elation         MFCC            RBF Kernel       6.4         83.44%             5.52%
           happy-sadness           MFCC           Linear Kernel      1          72.50%            19.65%
          interest-boredom      MFCC + Pitch      Linear Kernel     10          65.15%            17.68%
             shame-pride           MFCC            RBF Kernel       1.6         76.55%             7.93%
          hot anger-elation     MFCC + Pitch      Linear Kernel      1          60.00%            19.79%
         cold anger-sadness        MFCC            RBF Kernel       6.4         86.00%             4.10%
           Male Speakers
             Experiment           Features           Kernel         C     Recognition Accuracy    Variance
           despair-elation      MFCC + Pitch      Linear Kernel      1          96.67%             6.57%
           happy-sadness           MFCC           Linear Kernel      1          99.17%            24.55%
          interest-boredom      MFCC + Pitch      Linear Kernel     10          96.15%            22.78%
             shame-pride        MFCC + Pitch      Linear Kernel    100          91.54%            24.59%
          hot anger-elation        MFCC           Linear Kernel    10           90.00%            16.22%
         cold anger-sadness        MFCC           Linear Kernel    100          96.67%            21.25%
         Female Speakers
             Experiment            Features            Kernel        C     Recognition Accuracy   Variance
           despair-elation      MFCC + Pitch       Linear Kernel      1           79.50%          14.03%
           happy-sadness             Pitch         Linear Kernel      1           62.00%          11.11%
          interest-boredom           Pitch         Linear Kernel    100           80.53%          11.85%
             shame-pride             Pitch         Linear Kernel      1           76.88%          23.61%
          hot anger-elation     MFCC + Pitch       Linear Kernel     10           88.75%          13.52%
         cold anger-sadness         MFCC           Linear Kernel     10           96.11%           7.15%
                              Table 3: Highest Recognition Accuracies using 2-Class SVMs
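
A sketch of this search is given below, with scikit-learn's SVC standing in for the MATLAB toolbox of [12] and the C grids taken from the description above; accuracy is again estimated by 10-fold cross validation.

```python
# Sketch of the SVM grid search over linear and RBF kernels (assumed substitute
# for the 2-class SVM toolbox used in the paper).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def best_svm_accuracy(X, y):
    grids = {"linear": 0.1 * 10.0 ** np.arange(4),   # 0.1, 1, 10, 100
             "rbf":    0.1 * 2.0 ** np.arange(7)}    # 0.1, 0.2, ..., 6.4
    best = None
    for kernel, Cs in grids.items():
        for C in Cs:
            scores = cross_val_score(SVC(kernel=kernel, C=C), X, y, cv=10)
            if best is None or scores.mean() > best[0]:
                best = (scores.mean(), scores.var(), kernel, C)
    return best   # (accuracy, variance, kernel, C) of the best configuration
```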

5. Discussion
The results of these experiments allow us to make the following observations.

Using the formant feature vector as an input to our classification algorithms always results in sub-optimal
recognition accuracy. We can infer that formant features do not carry much emotional
information. Since formants model the resonance frequencies (and hence the shape) of the vocal tract,
we postulate that different emotions do not significantly affect the vocal tract shape.

Using squared Euclidean distance as the distance measure for K-means always results in sub-optimal
recognition accuracy. This metric effectively places a large weight on the magnitude of each element in
the feature vector. Hence, an input feature that varies considerably between the two opposing emotions
may be discounted by this distance measure if its mean is smaller than that of other features.

Tables 2 and 3 indicate that recognition accuracy is higher when the emotion pairs are classified
separately for male and female speakers. We postulate two reasons for this behavior. First, using a
larger number of speakers (as in the all-speaker case) increases the variability of the features, thereby
hindering correct classification. Second, since MFCCs are also used for speaker recognition, we
hypothesize that these features carry information about the identity of the speaker. In addition to
emotional content, MFCCs and pitch also carry information about the gender of the speaker. This
additional information is unrelated to emotion and increases misclassification.

Tables 2 and 3 also suggest that the recognition rate for female speakers is lower than that for male
speakers when classifying the emotional states of elation, happiness and interest. The larger number of
female speakers than male speakers in our data set may contribute to this lower recognition accuracy.
Further investigation suggested that in excitable emotional states such as interest, elation and happiness,
the variance of the pitch and MFCCs increases significantly. However, the variance of the pitch and
MFCCs is naturally higher for female voices than for male voices. Hence, the emotion-related increase
in variance is masked by the natural variance of female voices, which could make the features less
effective at correctly classifying agitated emotions in female speakers.

Of all the methods implemented, SVMs with a linear kernel give the best results for single-gender
classification, especially for male speakers. This indicates that the feature space is almost linearly
separable. The best results using K-Means classification are usually obtained when the cluster centroids
are UDCs, which we take to indicate that unsupervised learning algorithms such as K-Means cannot pick
up all the information contained in the feature sets unless we add some bias, here through the choice of
initial centroids.

7. Conclusion & Future Work
Although it is impossible to accurately compare recognition accuracies from this study to other studies
because of the different data sets used, the methods implemented here are extremely promising. The
recognition accuracies obtained using SVMs with linear kernels for male speakers are higher than any
other study. Previous studies have neglected to separate out male and female speakers. This project
shows that there is significant benefit in doing so. Our methods are reasonably accurate at recognizing
emotions in female and all speakers. Our project shows that features derived from agitated emotions
such as happiness, elation and interest have similar properties, as do those from more subdued emotions
such as despair and sadness. Hence, ‘agitated’ and ‘subdued’ emotion class can encompass these
narrower emotions. This is especially useful for animating gestures of avatars in virtual worlds.

This project focused on 2-way classification. The performance of these methods should be evaluated on
multi-class classification (using multi-class SVMs and K-Means). In addition, the features could be fitted
to Gaussians and classified using Gaussian Mixture Models. The speakers used here uttered numbers and
dates in various emotions, so the words themselves carried no emotional information. In reality, word
choice can indicate emotion. MFCCs are widely used in speech recognition systems and, as shown here,
also carry emotional information, so existing speech recognition systems could be modified to detect
emotions as well. To further improve emotion recognition, we could combine the methods in this project
with methods similar to Naïve Bayes classification of the words spoken, in order to take advantage of the
emotional content of the words.

7. References
[1] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S28
[2] R. Banse and K. R. Scherer, “Acoustic profiles in vocal emotion expression”, Journal of Personality and Social Psychology, Vol. 70, 614-636, 1996.
[3] T. Bänziger and K. R. Scherer, “The role of intonation in emotional expression”, Speech Communication, Vol. 46, 252-267, 2005.
[4] F. Yu, E. Chang, Y. Xu and H. Shum, “Emotion detection from speech to enrich multimedia content”, Lecture Notes in Computer Science, Vol. 2195, 550-557, 2001.
[5] D. Talkin, “A Robust Algorithm for Pitch Tracking (RAPT)”, in Speech Coding & Synthesis, 1995.
[6] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[7] S. Kim, P. Georgiou, S. Lee and S. Narayanan, “Real-time emotion detection system using speech: Multi-modal fusion of different timescale features”, Proceedings of the IEEE Multimedia Signal Processing Workshop, Chania, Greece, 2007.
[8] http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/
[9] L. R. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition”, Upper Saddle River, NJ: Prentice-Hall, 1993.
[10] V. A. Petrushin, “Emotion Recognition in Speech Signal: Experimental Study, Development, and Application”, ICSLP-2000, Vol. 2, 222-225, 2000.
[11] L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals”, Englewood Cliffs, NJ: Prentice-Hall, 1978.
[12] http://ida.first.fraunhofer.de/~anton/software.html
