PROSODY ANALYSIS AND MODELING FOR EMOTIONAL SPEECH SYNTHESIS*

Dan-ning Jiang^1, Wei Zhang^2, Li-qin Shen^2, Lian-hong Cai^1
^1 Department of Computer Science & Technology, Tsinghua University, China
^2 IBM China Research Lab
jdn00@mails.tsinghua.edu.cn, {zhangzw,shenlq}@cn.ibm.com, clh-dcs@tsinghua.edu.cn
ABSTRACT

Current concatenative Text-to-Speech systems can synthesize varied emotions, but the subtlety and range of the results are limited because large amounts of emotional speech data are required. This paper studies a more flexible approach based on analyzing and modeling the emotional prosody features. Perceptual tests are first performed to investigate whether manipulating prosody features alone can attain the communicative purposes of emotions. Then, based on the positive results, the same corpus with sufficient prosody coverage is shared by different emotions in unit selection. Finally, an adaptation algorithm is proposed to predict the emotional prosody features. It models the prosodic variations caused by linguistic cues and emotion cues separately, and requires only a small amount of data. Experiments on Mandarin show that the adaptation algorithm can obtain appropriate emotional prosody features, and that at least several emotions can be synthesized without the use of a special emotional corpus.

1. INTRODUCTION

Current concatenative Text-to-Speech (TTS) systems can synthesize high-quality speech [1]. How to incorporate emotions into these neutral systems has therefore become a recent research focus. Previous works [2][3][4] have proved the synthesizers' capability to produce varied emotions. In these systems, a target emotion is synthesized by selecting appropriate units from its own large corpus. Emotional prosody features are controlled by rule-based methods [5] or data-driven methods. Since it is not easy to describe prosody characteristics comprehensively by rules, data-driven methods have been preferred in recent years, but they also require large amounts of training data. Unfortunately, it is quite difficult to record so much emotional speech data and to guarantee that the same emotion is recorded consistently. The approach is very costly and cannot flexibly synthesize more subtle or mixed emotions.

Considering that it is relatively easy to control prosody features in TTS systems, this paper studies a more flexible synthesis approach based on analyzing and modeling the emotional prosody features. Although related analyses suggested that manipulating prosody features alone is not sufficient to present intense emotions [6][7], it may still be feasible to attain the communicative purposes of emotions. Perceptual tests are first performed to validate this notion. Then, based on the positive results, the same corpus with sufficient prosody coverage is shared by different emotions in unit selection. Finally, an adaptation algorithm is proposed to predict the emotional prosody features. It models the prosodic variations caused by linguistic cues and emotion cues separately, and requires only a small amount of data. Experiments on Mandarin show that the adaptation algorithm can obtain appropriate emotional prosody features, and that several emotions (at least sadness and happiness) can be synthesized without the use of a special emotional corpus.

The paper is organized as follows. Section 2 presents perceptual tests that investigate the prosody's contribution to emotional communication, and discusses the results. Section 3 describes the prosody adaptation algorithm. In section 4, experiments on Mandarin are performed and evaluation results are shown. Conclusions are given in section 5.

2. PROSODY'S CONTRIBUTION IN EMOTIONAL COMMUNICATIONS

We presume that what matters more to emotional synthesis is to attain the communicative purposes of emotions, rather than to obtain intense emotions. As suggested in [8], emotions serve two communication functions. One is appropriateness, and the other is efficiency. The former means that the emotional expression should be appropriate for the verbal content and condition; for example, bad news should be presented unhappily. The latter means that the expression should deliver information about the speaker's attitude when the content does not show it. Thus, perceptual tests are performed to investigate whether prosody features alone can attain appropriateness and efficiency in communication.

To isolate the influence of prosody features, copy-synthesis technology is used. Pairs of utterances with the same content and different expressions (neutral and emotional) are recorded. Then the pitch, duration, and energy features of the emotional utterance are copied to the neutral counterpart through the FD-PSOLA algorithm, so the synthesis samples contain neutral segments and emotional prosody. All three types of expression are used as stimuli in the perceptual tests. To test the appropriateness of the above expressions, the text materials are emotional, and the subjects are asked whether the expression is appropriate for the content. In the efficiency test, the contents are

      *
          This work was done in IBM China Research Lab.
          It was also supported by China National Natural Science Foundation (60433030).




    0-7803-8874-7/05/$20.00 ©2005 IEEE                                      I - 281                                                      ICASSP 2005
ÂŹ                                                                                                                                                   ÂŹ
neutral, and the subjects are asked to guess the most likely condition according to the expression; for example, whether it delivers good news or bad news, or whether there is no emotional bias. Examples of the tests are shown in figure 1. Here, the examined emotion is sadness.

[Figure 1. Two examples of the perceptual tests. One is for the appropriateness test, and the other is for the efficiency test.]

The text materials are 8 sentences: 4 with negative contents for the appropriateness test, and 4 with neutral contents for the efficiency test. The subjects are 24 university students unfamiliar with TTS. All utterances are randomly ordered.

[Figure 2. Results of the appropriateness test. Bar chart (rates in %) of how often the neutral, sad, and synthesis expressions are judged appropriate, somewhat inappropriate, or very inappropriate.]

[Figure 3. Results of the efficiency test. Bar chart (rates in %) of how often the neutral, sad, and synthesis expressions are judged to deliver bad news, no information ("don't know"), or good news.]

Results of the two tests are shown in figure 2 and figure 3 respectively. From figure 2, it can be seen that neutral expressions are perceived as appropriate for the negative contents at a rate of only about 10%, and as very inappropriate most of the time. On the contrary, the sad and synthesis expressions are perceived as appropriate most of the time, and are only very occasionally judged as very inappropriate. Figure 3 shows that neutral expressions deliver almost no negative information, and sometimes even give subjects positive information. Sad expressions are associated with bad news at a high rate, over 80%. Although the synthesis expressions deliver negative information only slightly over half the time, they do not mislead the subjects with wrong information. The results indicate that although prosody features alone seem insufficient to present intense emotions, they are still very meaningful in emotional communication.

Other works [4][6][7] suggested that some emotions (e.g. anger) are more heavily influenced by segmental features than others, but they also agreed that prosody features are more important for a number of emotions. So it is reasonable to generalize the analysis results to several other emotions.

3. PROSODY ADAPTATION

Emotional prosody features encode information from at least two sources. One source is linguistics, and the other is emotion. Variations caused by the former are determined by linguistic contexts, such as position in the prosodic structure, syllable tone type (Chinese is a well-known tonal language), accent level, and so on. A large amount of training data is necessary to obtain a stable prediction of the linguistic component, typically several thousand sentences [1]. This drives up the amount of data required to train an emotional prosody model. However, the linguistic component is unrelated to emotions, so it does not have to be modeled repeatedly for each emotion.

This paper proposes an adaptation algorithm in which the linguistic and emotion components of the prosody features are modeled separately. The former is modeled only once, using a large amount of neutral speech data. The latter is estimated as the difference between the emotional and neutral prosody features, and is modeled with only a small amount of data for each target emotion. The two parts are then combined to predict the emotional prosody features.

[Figure 4. Paradigm of the emotional synthesis system using prosody adaptation.]

Figure 4 shows the paradigm of the emotional synthesis system using prosody adaptation. Here, the "difference prosody model" and "neutral prosody model" represent models of the emotional




component and the linguistic component respectively. A difference prosody model corresponding to the target emotion is first selected from the available ones. Then it is combined with the neutral prosody model to predict the emotional prosody features. In unit selection, based on the analysis in section 2, different target emotions share the same corpus with sufficient prosody coverage. Note that the natural relationship between prosody and segmental features is maintained in the corpus, which is important for the naturalness of the synthesis results [7].

3.1. Probability Based Prosody Model

In prosody adaptation, the difference and neutral prosody models are both built with the probability based prosody model [1]. It calculates the target and transition costs for unit selection in a probabilistic way. In the training stage, linguistic contexts are first clustered by decision trees based on the distribution of the concerned prosody features; contexts with similar prosody features are clustered into the same terminal leaf. Then, for each leaf, a GMM is trained to model the probabilistic distribution of the prosody features. In unit selection, the target and transition decision trees are first traversed according to the input contexts, and the associated GMMs are obtained. The target and transition costs are then given by the corresponding GMM probabilities. Finally, a dynamic programming algorithm finds the candidate sequence with the maximal combination of target and transition probabilities as the prediction result.

It can be seen that when there are sufficient candidates in the corpus, the predicted prosody features are determined by the input contexts. Thus the result is quite stable.

3.2. Adaptation Algorithm

Suppose $PF$ is the prosody feature vector of a synthesis unit. It contains both the linguistic component $PF^l$ and the emotion component $PF^e$:

$$PF = PF^l + PF^e \quad (1)$$

The adaptation algorithm is divided into three steps: predict $PF^l$ using a large amount of neutral speech data, model $PF^e$ for each target emotion, and finally combine them to predict $PF$.

(1) Predict $PF^l$ using a large amount of neutral speech data. The linguistic component $PF^l$ is predicted using the model described in section 3.1. The neutral speech corpus contains 5,000 sentences, which can guarantee stable prediction results in most cases. So in the following two steps, $PF^l$ is regarded as a constant rather than a distribution.

(2) Model $PF^e$ for each target emotion. For each target emotion, $PF^e$ is estimated as the difference between the emotional prosody feature $PF$ and the predicted linguistic component $PF^l$. The modeling method is nearly the same as that used to model $PF^l$, except that the prosody feature is $PF^e$. So, given the input contexts $CT$, the corresponding GMM can be obtained, as formula (2) shows:

$$P(PF^e \mid CT) = \sum_{k=1}^{M} w_k N(\mu_k, \Sigma_k) \quad (2)$$

where $M$ is the number of components in the GMM, and $w_k$, $\mu_k$, and $\Sigma_k$ are the weight, mean, and covariance of the $k$-th component respectively.

(3) Combine the two components to predict $PF$. Given the prediction of $PF^l$ and the probabilistic description of $PF^e$, $PF$ is modeled as

$$P(PF \mid CT) = \sum_{k=1}^{M} w_k N(\mu_k + PF^l, \Sigma_k) \quad (3)$$

4. EXPERIMENTS

4.1. Speech Materials

In the experiments, two emotions in Mandarin are implemented: sadness and happiness. All emotional speech data are recorded in the lab by asking a female speaker to read emotional paragraphs with a proper degree of emotion; one example of the "proper degree" is the way a newscaster reports someone's death with pity. The segmental features are not very different from those of neutral speech.

The sad paragraphs are reports of illness and death, while the happy paragraphs are descriptions of spring's beautiful views and a girl's imagination about her wedding. Note that the "happiness" is not a pure one; it mixes love and enthusiasm for life. In total, 216 sad sentences and 143 happy sentences are used, from which 10 sentences of each emotion are held out as test data, and the remaining sentences are used to train the emotional prosody models. These emotional data, together with another 500 neutral sentences, are put into the corpus for unit selection.

4.2. Emotional Prosody Characteristics

The neutral, emotional, and synthesis prosody characteristics are compared in figure 5 and figure 6. From figure 5, it can be seen that sadness is mainly characterized by a low pitch mean, a narrow pitch range, and a very flat pitch contour. The limited pitch variations merely realize the syllable tone shape, and there is little pitch declination in the intonation phrase. The synthesized sad pitch contour does show the above characteristics. Figure 6 shows that the happy prosody has a faster speaking rate and less pitch declination; the pitch range is also slightly narrower. The synthesized prosody shows similar characteristics.

4.3. ABX Tests

To evaluate whether the synthesized speech conveys the target emotion, we perform ABX tests, which are often used in studies on voice conversion. A, B, and X represent the neutral speech, emotional speech, and synthesized speech respectively. In each test, three utterances are played in the order A, B, X or B, A, X, and the subjects are asked to judge whether the emotion implied in X is similar to that in A or in B. The subjects are 20 university students.

Table 1. Results of the ABX tests.

                    Sadness    Happiness
Ave. Correct Rate   100.0%     82.5%

Table 1 shows the average rate at which the subjects judge the synthesized expression similar to the target emotion. The correct rate for sadness is very high, which may be because the sad pitch contour is very flat and easy to predict. The correct rate for happiness is somewhat lower, but still much higher than the random rate.
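The adaptation in formulas (1)-(3) can be sketched in a few lines. This is a toy reconstruction under stated assumptions, not the paper's implementation: the synthetic data, the one-dimensional prosody feature, and the use of scikit-learn's DecisionTreeRegressor and GaussianMixture are all illustrative choices. Contexts are clustered by a decision tree, one small GMM per leaf models $P(PF^e \mid CT)$, and a candidate $PF$ is scored after shifting every component mean by the predicted $PF^l$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Made-up stand-ins: CT holds two linguistic context features (e.g. tone
# type, phrase position); PF_l_train is the neutral (linguistic) prosody
# component; PF_e_train is the emotional difference, e.g. a lowered pitch.
CT = rng.integers(0, 4, size=(200, 2)).astype(float)
PF_l_train = CT @ np.array([0.5, 1.0]) + rng.normal(0.0, 0.1, 200)
PF_e_train = -1.0 + 0.2 * CT[:, 0] + rng.normal(0.0, 0.1, 200)

# Cluster contexts by the distribution of PF_e (decision-tree leaves),
# then fit one GMM per leaf: P(PF_e | CT), cf. formula (2).
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(CT, PF_e_train)
leaves = tree.apply(CT)
gmms = {
    leaf: GaussianMixture(n_components=2, random_state=0)
    .fit(PF_e_train[leaves == leaf].reshape(-1, 1))
    for leaf in np.unique(leaves)
}

def log_prob_pf(ct, pf_l, pf):
    """Log of formula (3): the leaf GMM over PF_e with every component
    mean shifted by the predicted linguistic component PF_l."""
    leaf = tree.apply(np.asarray(ct, dtype=float).reshape(1, -1))[0]
    g = gmms[leaf]
    means = g.means_.ravel() + pf_l
    variances = g.covariances_.ravel()
    densities = g.weights_ * np.exp(-0.5 * (pf - means) ** 2 / variances) \
        / np.sqrt(2.0 * np.pi * variances)
    return float(np.log(densities.sum()))

# A candidate carrying the expected emotional shift scores higher than
# one that only realizes the neutral prediction.
ct = np.array([2.0, 1.0])
pf_l = 2.0   # neutral prediction for this context
print(log_prob_pf(ct, pf_l, 1.4), log_prob_pf(ct, pf_l, 2.0))
```

Because the neutral model is trained once on the large corpus, only the per-leaf GMMs over the differences need to be re-estimated for each new emotion, which is what keeps the data requirement small.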




[Figure 5. Neutral, sad, and synthesis pitch contours of a sad sentence, which means "Mother's illness has broken out."]

[Figure 6. Neutral, happy, and synthesis pitch contours of a happy sentence, which means "You see those tall columns."]

5. CONCLUSIONS

This paper studies a flexible emotional synthesis approach using a concatenative synthesizer. Perceptual tests are first performed to investigate the prosody's contribution to emotional communication. The results show that manipulating prosody features alone can obtain meaningful results, at least for some emotions. Then, the same corpus with sufficient prosody coverage is shared by different emotions in unit selection. Finally, an adaptation algorithm is proposed to predict the emotional prosody features. In the experiments, two emotions (sadness and happiness) in Mandarin are implemented. Using no more than 200 sentences as training data, appropriate emotional prosody features can be predicted for each emotion, and the synthesized speech is judged as similar to the target emotion at a rate over 80%. The experimental results show that the adaptation algorithm is effective when only a small amount of training data is available, so prosody features of more subtle and mixed emotions can be obtained with fewer difficulties. The results also indicate that at least several emotions can be synthesized without the use of a special emotional corpus. However, since it is difficult to conclude that prosody features are important for all emotions, in future work it may be necessary to extend the corpus by adding some emotional segments.

6. ACKNOWLEDGEMENTS

This work was done in IBM China Research Lab. I thank the company for providing the foundation and environment for this work, as well as Xijun Ma, Qin Shi, Weibin Zhu, and Yi Liu for many valuable suggestions and help.

7. REFERENCES

[1] X.J. Ma, W. Zhang, W.B. Zhu, et al., "Probability based Prosody Model for Unit Selection," Proc. of ICASSP'04, Montreal, Canada, pp. 649-652, May 2004.
[2] E. Eide, "Preservation, Identification, and Use of Emotion in a Text-to-Speech System," Proc. of IEEE Workshop on Speech Synthesis, Santa Monica, Sep. 2002.
[3] A.W. Black, "Unit Selection and Emotional Speech," Proc. of EuroSpeech'03, pp. 1649-1652, Sep. 2003.
[4] M. Bulut, S.S. Narayanan, and A.K. Syrdal, "Expressive Speech Synthesis Using a Concatenative Synthesizer," Proc. of ICSLP'02, Denver, pp. 1265-1268, Sep. 2002.
[5] I.R. Murray, M.D. Edgington, D. Campion, et al., "Rule-Based Emotion Synthesis Using Concatenated Speech," Proc. of ISCA Workshop on Speech and Emotion, Belfast, Northern Ireland, pp. 173-177, 2000.
[6] K. Hirose, N. Minematsu, and H. Kawanami, "Analytical and Perceptual Study on the Role of Acoustic Features in Realizing Emotional Speech," Proc. of ICSLP'00, pp. 369-372, Oct. 2000.
[7] C. Gobl, E. Bennet, and A.N. Chasaide, "Expressive Synthesis: How Crucial is Voice Quality?" Proc. of IEEE Workshop on Speech Synthesis, Santa Monica, Sep. 2002.
[8] E. Eide, R. Bakis, W. Hamza, and J. Pitrelli, "Multilayered Extensions to the Speech Synthesis Markup Language for Describing Expressiveness," Proc. of EuroSpeech'03, pp. 1645-1648, Sep. 2003.
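As a footnote to the ABX evaluation of section 4.3, the scoring reduces to a simple tally of which reference each subject picks. A minimal sketch, where the judgment lists are invented for illustration and are not the paper's recorded data:

```python
from collections import Counter

def abx_correct_rate(judgments, target="B"):
    """Fraction of ABX answers naming the emotional reference (B) as the
    one X resembles; 'correct' when X is the emotional synthesis."""
    return Counter(judgments)[target] / len(judgments)

# Hypothetical answer sheets, one entry per subject and stimulus.
sad_judgments = ["B"] * 20                 # every subject picks B
happy_judgments = ["B"] * 17 + ["A"] * 3   # a few confusions

print(f"sadness: {abx_correct_rate(sad_judgments):.1%}, "
      f"happiness: {abx_correct_rate(happy_judgments):.1%}")
```

Averaging such per-stimulus rates over the 10 test sentences of each emotion yields the figures reported in Table 1.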




Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...Nguyen Thanh Tu Collection
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 

KĂŒrzlich hochgeladen (20)

2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...
TỔNG ÔN TáșŹP THI VÀO LỚP 10 MÔN TIáșŸNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGở Â...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 

200502

  • 1. « ÂŹ PROSODY ANALYSIS AND MODELING FOR EMOTIONAL SPEECH SYNTHESIS* Dan-ning Jiang1, Wei Zhang2, Li-qin Shen2, Lian-hong Cai1 1 Department of Computer Science & Technology, Tsinghua University, China 2 IBM China Research Lab jdn00@mails.tsinghua.edu.cn, {zhangzw,shenlq}@cn.ibm.com, clh-dcs@tsinghua.edu.cn ABSTRACT to attain the communication purposes of emotions. Perceptual tests are first performed to validate this notion. Then, based on Current concatenative Text-to-Speech systems can synthesize the positive results, the same corpus with sufficient prosody varied emotions, but the subtle and range of the results are coverage is shared by different emotions in unit selection. limited because large amount of emotional speech data are Finally, an adaptation algorithm is proposed to predict the required. This paper studies a more flexible approach based on emotional prosody features. It models the prosodic variations by analyzing and modeling the emotional prosody features. linguistic cues and emotion cues separately, and requires only a Perceptual tests are first performed to investigate whether just small amount of data. Experiments on Mandarin show that the manipulating prosody features can attain the communication adaptation algorithm can obtain appropriate emotional prosody purposes of emotions. Then, based on the positive results, the features, and several emotions (at least sadness and happiness) same corpus with sufficient prosody coverage is shared by can be synthesized without the use of special emotional corpus. different emotions in unit selection. Finally, an adaptation The paper is organized as follows. Section 2 presents algorithm is proposed to predict the emotional prosody features. perceptual tests to investigate the prosody’s contribution in It models the prosodic variations by linguistic cues and emotion emotional communications, and discusses the results. Section 3 cues separately, and requires only a small amount of data. 
illustrates the prosody adaptation algorithm. In section 4, Experiments on Mandarin show that the adaptation algorithm experiments on Mandarin are performed and evaluation results can obtain appropriate emotional prosody features, and at least are shown. Conclusions are finally given in section 5. several emotions can be synthesized without the use of special emotional corpus. 2. PROSODY’S CONTRIBUTION IN EMOTIONAL COMMUNICATIONS 1. INTRODUCTION We presume that what more important to emotional synthesis is to attain the communication purposes of emotions, rather than Current concatenative Text-to-Speech (TTS) systems can obtain intense emotions. As suggested in [8], there are two synthesize high-quality speech [1]. Therefore, how to incorporate communication functions in emotions. One is appropriateness, emotions in the neutral systems becomes recent research focus. and the other is efficiency. The former means that the emotional Previous works [2][3][4] have proved the synthesizers’ capability to expression should be appropriate for the verbal content and produce varied emotions. In these systems, a target emotion is condition, for example, bad news should be presented unhappily. synthesized by selecting appropriate units from its special large The latter means that the expression should deliver information corpus. Emotional prosody features are controlled by Rule-based about the speaker’s attitude when the content does not show it. methods [5] or data-driven methods. Since it is not easy to Thus, perceptual tests are performed to investigate whether describe the prosody characteristics comprehensively by rules, prosody features alone can attain the appropriateness and data-driven methods are preferred in these years, but they also efficiency in communications. require large amount of training data. 
Unfortunately, it is quite To isolate the influence of prosody features, the copy difficult to record so much emotional speech data and guarantee synthesis technology is used. Pairs of utterances with the same that the same emotion is recorded consistently. The approach is content and different expressions (neutral and emotional) are very costly and unable to synthesize more subtle and mixed recorded. Then pitch, duration, and energy features of the emotions flexibly. emotional utterance are copied to the neutral counterpart through Considering that it is relatively easy to control prosody FD-PSOLA algorithm. So there are neutral segments and features in TTS systems, the paper studies a more flexible emotional prosody in the synthesis samples. All these three types synthesis approach based on analyzing and modeling the of expressions are used as stimulus in perceptual tests. To test emotional prosody features. Although relevant analysis appropriateness of the above expressions, the text materials are suggested that just manipulating prosody features is not emotional, and the subjects are asked whether the expression is sufficient to present intense emotions [6][7], it still may be feasible appropriate for the content. In the efficiency test, the contents are * This work was done in IBM China Research Lab. It was also supported by China National Natural Science Foundation (60433030). 0-7803-8874-7/05/$20.00 ©2005 IEEE I - 281 ICASSP 2005
  • 2. ÂŹ ÂŹ neutral, and the subjects are asked to guess the most possible sometimes even give subjects positive information. Sad condition according to the expression, for example, whether it expressions are associated with the bad news at a high rate over delivers good news or bad news, or there is no emotional bias. 80%. Although synthesis expressions deliver negative Examples of the tests are shown in figure 1. Here, the examined information only slightly over a half of time, they do not mislead emotion is sadness. the subjects with wrong information. The results indicate that although prosody features alone seem not enough to present intense emotions, they are still very meaningful in emotional communications. Other works [4][6][7] suggested that some emotions (e.g. anger) are more heavily influenced by segmental features than others, but they also agreed that prosody features are more important for a number of emotions. So it is reasonable to generalize the analysis results to several other emotions. 3. PROSODY ADAPTATION Emotional prosody features encode information from at least two sources. One source is linguistics, and the other is emotion. Variations caused by the former are determined by linguistic Figure 1. Two examples of the perceptual tests. One is for the contexts, such as position in the prosodic structure, syllable tone appropriateness test, and the other is for the efficiency test. type (as Chinese is a well-known tonal language), accent level, and so on. A large amount of training data is necessary to obtain The text materials are 8 sentences, 4 with negative contents a stable prediction of the linguistic component, typically several for the appropriateness test, and 4 with neutral contents for the thousands of sentences [1]. This drives up the required data efficiency test. The subjects are 24 university students unfamiliar amount to train an emotional prosody model. However, the with TTS. All utterances are randomly ordered. 
linguistic component is unrelated to emotions, so it does not have to be contained repeatedly in each emotional prosody Appropriate Some Inappropriate Very Inappropriate model. The paper proposes an adaptation algorithm. In the algorithm, 100 the linguistic and emotion component of prosody features are 80 modeled separately. The former is modeled only once by using a 60 huge amount of neutral speech data. The latter is estimated as the 40 differences between the emotional and neutral prosody features, 20 and modeled with only a small amount of data for each target emotion. Then these two parts are combined together to predict 0 the emotional prosody features. Neutral Sad Synthesis Figure 2. Results of the appropriateness test. Bad News Don't Know Good News 100 80 60 40 20 0 Neutral Sad Synthesis Figure 3. Results of the efficiency test. Results of the two tests are shown in figure 2 and figure 3 respectively. From figure 2, it can be seen that neutral expressions are perceived as appropriate for the negative Figure 4. Paradigm of the emotional synthesis system using contents only at a rate about 10%, and as very inappropriate at prosody adaptation. most time. On the contrary, the sad and synthesis expressions are perceived as appropriate at most time, and very occasionally to Figure 4 shows paradigm of the emotional synthesis system be judged as very inappropriate. Figure 3 shows that neutral using prosody adaptation. Here, the “difference prosody model” expressions nearly deliver no negative information, and and “neutral prosody model” represent models of the emotional I - 282
  • 3. ÂŹ ÂŹ component and the linguistic component respectively. A (3) Combine the two components to predict PF . difference prosody model concerned with the target emotion is Having the prediction of PF l and the probabilistic first selected from several ones. Then it is combined with the neutral prosody model to predict emotional prosody features. In description of PF e , PF is modeled as, M unit selection, based on the analysis in section 2, different target P ( PF | CT ) wk N ( k PF l , k ) (3) emotions share the same corpus with sufficient prosody coverage. k 1 Note that the natural relationship between prosody and segmental features is maintained in the corpus, which is important to the naturalness of the synthesis results [7]. 4. EXPERIMENTS 3.1. Probability Based Prosody Model 4.1. Speech Materials In prosody adaptation, the difference and neutral prosody models are all built through the probability based prosody model [1]. It In experiments, two emotions in Mandarin are implemented. calculates the target and transition costs for unit selection in a One is sadness, and the other is happiness. All emotional speech probabilistic way. In the training stage, linguistic contexts are data are recorded in the lab, by asking a female speaker to read first clustered by decision trees based on the distribution of the emotional paragraphs with a proper degree of emotion. Here, one concerned prosody features; those with similar prosody features example of the “proper degree” is the way that a newscaster are clustered into the same terminal leaf. Then for each leaf, reports someone’s death with a pity. The segmental features are GMM is trained to model the probabilistic distribution of the not much different with those of neutral speech. prosody features. 
In unit selection, the target and transition The sad paragraphs are reports of illness and death, while the decision trees are first traversed according to the input contexts, happy paragraphs are descriptions of the spring’s beautiful views and the associated GMMs are obtained. Then the target and and a girl’s imagination about her wedding. Note that the transition costs are both given by the corresponding GMM “happiness” is not a pure one; it mixes the love and enthusiasm probabilities. Finally, a dynamic program algorithm is performed about life. We totally use 216 sad sentences and 143 happy to find out a candidate sequence with the maximal combination sentences, from which 10 sentences of each emotion are testing of the target and transition probabilities as the prediction result. data, and the remaining sentences are used to train the emotional It can be seen that when there are sufficient candidates in prosody models. These emotional data, together with other 500 corpus, the predicted prosody features are determined by the neutral sentences, are put into the corpus for unit selection. input contexts. Thus the result is quite stable. 4.2. Emotional Prosody Characteristics 3.2. Adaptation Algorithm The neutral, emotional, and synthesis prosody characteristics are Suppose PF is the prosody feature vector of a synthesis unit, compared in figure 5 and figure 6. From figure 5, it can be seen that sadness is mainly characterized by low pitch mean, narrow then it contains both the linguistic component PF l and the pitch range, and very flat pitch contour. The limited pitch emotion component PF e , variations merely realize the syllable tone shape, and there is PF PF l PF e (1) little pitch declination in the intonation phrase. The synthesis sad pitch contour does have above characteristics. 
Figure 6 shows The adaptation algorithm is divided into three steps: predict that the happy prosody has a faster speak rate and less pitch PF l using a huge amount of neutral speech data, model PF e declination. The pitch range is also slightly narrower. The for each target emotion, and finally combine them to predict PF . synthesis prosody also shows similar characteristics. (1) Predict PF l using a huge amount of neutral speech data. 4.3. ABX Tests The linguistic component PF l is predicted by using the To evaluate whether the synthesis speech conveys the target model described in section 3.1. The neutral speech corpus emotion, we perform ABX tests, which are often used in studies contains 5,000 sentences, which can guarantee the stability of on voice conversion. A, B and X represent the neutral speech, prediction results at most time. So in the following two steps, emotional speech, and synthesis speech respectively. In the test, PF l is regarded as a constant rather than a distribution. three utterances are played with the order of A, B, X or B, A, X. Subjects are asked to judge whether the emotion implied in X is (2) Model PF e for each target emotion. similar with that in A or B. There are 20 university students as For each target emotion, PF e is estimated as the difference the subjects. between the emotional prosody feature PF and the predicted linguistic component PF l . The modeling method is nearly the Table 1. Results of the ABX tests. same with that used to model PF l , except that the prosody Sadness Happiness Ave. Correct Rate 100.0% 82.5% feature is PF e . So, given the input contexts CT , the corresponding GMM can be obtained, as formula (2) shows, Table 1 shows the average rate that the subjects judge the M e synthesis expression similar with the target emotions. The P( PF | CT ) wk N ( k , k ) (2) k 1 correct rate of sadness is very high, which may be because that the sad pitch contour is very flat and easy to predict. 
The correct Where M is the number of components in the GMM, wk , rate of happiness is some lower, but still much higher than the k and k are the weight, mean and variation of the k-th random rate. component respectively. I - 283
  • 4. ÂŹ « 5. CONCLUSIONS [8] E. Eide, R. Bakis, W. Hamza, and J. Pitrelli, “Multilayered Extensions to the Speech Synthesis Markup Language for This paper studies a flexible emotional synthesis approach using Describing Expressiveness,” Proc. of EuroSpeech’03, pp. 1645- the concatenative synthesizer. Perceptual tests are first 1648, Sep. 2003. performed to investigate the prosody’s contribution in emotional communications. The analysis results show that just manipulating prosody features can obtain meaningful results at least for some emotions. Then, the same corpus with sufficient prosody coverage is shared by different emotions in unit selection. Finally, an adaptation algorithm is proposed to predict the emotional prosody features. In experiments, two emotions (sadness and happiness) in Mandarin are implemented. By using no more than 200 sentences as the training data, appropriate emotional prosody features can be predicted for each emotion, and the synthesis speech is judged as similar with the target emotions at a rate over 80%. The experiment results show that the adaptation algorithm is efficient when there are only small amount of training data. So, prosody features of more subtle and mixed emotions can be obtained with fewer difficulties. The results also indicate that at least several emotions can be synthesized without the use of special emotional corpus. However, since it is difficult to conclude that prosody features are important in all emotions, in further work, it may be necessary to extend the corpus by adding some emotional segments. 6. ACKNOWLEDGEMENTS Figure 5. Neutral, sad, and synthesis pitch contour of a sad sentence, which means “Mother’s illness has broken out.” This work was done in IBM China Research Lab. I thank the work foundation and environment provided by the company, as well many valuable suggestions and help from Xijun Ma, Qin Shi, Weibin Zhu, and Yi Liu. 7. REFERENCES [1] X.J. Ma, W. Zhang, W.B. 
Zhu, etc, “Probability based Prosody Model for Unit Selection,” Proc. of. ICASSP’04, Montreal, Canada, pp. 649-652, May 2004. [2] E.Eide, “Preservation, Identification, and Use of Emotion in a Text-to-Speech System,” Proc. of IEEE Workshop on Speech Synthesis, Santa Monica, Sep. 2002. [3] A.W. Black, “Unit Selection and Emotional Speech,” Proc. of EuroSpeech’03, pp. 1649-1652, Sep. 2003. [4] M. Bulut, S. S. Narayanon, and A. K. Syrdal, “Expressive Speech Synthesis Using a Concatenative Synthesizer,” Proc. of ICSLP’02, Denver, pp. 1265-1268, Sep. 2002. [5] I.R. Murray, M.D. Edgington, D. Campion, etc. “Rule-Based Emotion Synthesis Using Concatenated Speech,” Proc. of ISCA Workshop on Speech and Emotion, Belfast, North Ireland, pp. 173-177, 2000. [6] K. Hirose, N. Minematsu, and H. Kawanami, “Analytical and Perceptual Study on the Role of Acoustic Features in Realizing Emotional Speech,” Proc. of ICSLP’00, pp. 369-372, Oct. 2000. Figure 6. Neutral, happy, and synthesis pitch contour of a happy [7] C. Gobl, E. Bennet, and A.N. Chassaide, “Expressive sentence, which means “You see that tall columns. Synthesis: How Crucial is Voice Quality?” Proc. of IEEE Workshop on Speech Synthesis, Santa Monica, Sep. 2002. I - 284
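As a closing illustration, the Section 3.1 search (maximizing combined target and transition probabilities over candidate units via dynamic programming) can be sketched as follows. This is a deliberately simplified, hypothetical version: one-dimensional pitch features and single Gaussians stand in for the paper's per-leaf GMMs, and all function names and numbers are invented:

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian; stands in for the per-leaf GMMs."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def select_units(candidates, targets, trans_var=100.0, target_var=400.0):
    """Viterbi-style search over candidate pitch values.

    candidates[i]: pitch values available at position i.
    targets[i]: target-model mean for position i.
    Returns the candidate sequence maximizing the summed log target
    and log transition (pitch-continuity) probabilities.
    """
    n = len(candidates)
    # best[i][j] = (cumulative log prob, backpointer) for candidate j at i
    best = [[(log_gauss(c, targets[0], target_var), -1)
             for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            # transition model favors smooth pitch continuation
            score, back = max(
                (best[i - 1][j][0] + log_gauss(c - p, 0.0, trans_var), j)
                for j, p in enumerate(candidates[i - 1]))
            row.append((score + log_gauss(c, targets[i], target_var), back))
        best.append(row)
    # backtrack from the best final candidate
    j = max(range(len(best[-1])), key=lambda k: best[-1][k][0])
    seq = []
    for i in range(n - 1, -1, -1):
        seq.append(candidates[i][j])
        j = best[i][j][1]
    return seq[::-1]

cands = [[200.0, 240.0], [205.0, 260.0], [210.0, 300.0]]
print(select_units(cands, targets=[210.0, 215.0, 220.0]))
```

With these toy numbers the search prefers the smooth low sequence over candidates that sit closer to individual targets but would require large pitch jumps, which is exactly the trade-off the target/transition cost combination encodes.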