University of Minho
Engineering School


Developmentally inspired computational
framework for embodied speech imitation

Miguel Vaz
mvaz@dei.uminho.pt

Dep. Industrial Electronics          Honda Research Institute Europe
University of Minho                  Offenbach am Main
Portugal                             Germany

25th January, Guimarães
Long-term goal:
verbal interaction with ASIMO

    speech perception
    speech production
    meaning / language
Constraints and specificities of the ASIMO platform

no pre-defined vocabulary
    acquire speech in interaction
        imitation
        online learning
        unlabeled data
    minimize language assumptions
        no corpus for system's voice

system has child's voice
    synthesize child's voice
    address correspondence problem
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Speech: source-filter model of speech production

[figure: glottal airflow (source) passes through the vocal tract (filter
function) to produce the output from the lips; below, the source spectrum,
the filter's formant frequencies, and the resulting output spectrum]
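The source-filter pipeline on this slide can be sketched in a few lines of Python: an impulse-train glottal source filtered by two-pole resonators standing in for the formants. The formant frequencies and bandwidths are illustrative values, not parameters from the thesis.

```python
import math

def impulse_train(f0, fs, dur):
    """Glottal source approximation: one impulse per pitch period."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(dur * fs))]

def resonator(x, fc, bw, fs):
    """Two-pole IIR resonator: one formant of the vocal-tract filter."""
    r = math.exp(-math.pi * bw / fs)
    b = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    c = -r * r
    a = 1.0 - b - c          # normalize for unity gain at DC
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        y0 = a * s + b * y1 + c * y2
        y.append(y0)
        y1, y2 = y0, y1
    return y

fs = 16000
source = impulse_train(f0=120.0, fs=fs, dur=0.05)     # 800 samples
output = source
for fc, bw in [(700, 80), (1100, 90), (2600, 120)]:   # /a/-like formants
    output = resonator(output, fc, bw, fs)
```

Cascading the resonators imposes the formant structure on the flat source spectrum, which is exactly the filter-function step pictured on the slide.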
Spectral feature extraction with a gammatone filter bank

scheme: speech → Gammatone Filterbank → Envelope Extraction →
Harmonic Structure Elimination (driven by the estimated pitch)

[figure: spectrograms (0.1–8 kHz, ~0.5 s) of the example word "zurück"
at successive processing stages]
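A minimal Python sketch of the per-channel front end, covering only the filterbank and envelope-extraction stages (harmonic-structure elimination is omitted). It uses the standard 4th-order gammatone impulse response with an ERB-scaled bandwidth; the sampling rate, tap count, and smoothing window are arbitrary choices for the example.

```python
import math

def erb(f):
    """Equivalent rectangular bandwidth of a channel centred at f (Hz)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n_taps=128, order=4):
    """Truncated 4th-order gammatone impulse response centred at fc."""
    b = 1.019 * erb(fc)
    return [
        (i / fs) ** (order - 1)
        * math.exp(-2.0 * math.pi * b * i / fs)
        * math.cos(2.0 * math.pi * fc * i / fs)
        for i in range(n_taps)
    ]

def channel_envelope(x, fc, fs, win=32):
    """Band-pass one gammatone channel, then extract the envelope by
    half-wave rectification and a moving-average low-pass."""
    ir = gammatone_ir(fc, fs)
    y = [sum(ir[k] * x[i - k] for k in range(min(len(ir), i + 1)))
         for i in range(len(x))]
    rect = [max(v, 0.0) for v in y]
    return [sum(rect[max(0, i - win + 1): i + 1]) / win
            for i in range(len(rect))]

fs = 8000
tone = [math.sin(2.0 * math.pi * 500 * i / fs) for i in range(400)]
env = channel_envelope(tone, fc=500, fs=fs)       # on-frequency channel
env_off = channel_envelope(tone, fc=2000, fs=fs)  # off-frequency channel
```

For a 500 Hz tone the on-frequency channel carries a strong envelope while the 2 kHz channel stays near zero, which is what makes the filterbank output a usable spectral representation.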
VOCODER-like synthesis algorithm with a gammatone filter bank

hybrid architecture
    channel vocoder for frication
    harmonic model for voicing

[diagram: pitch and spectral vectors drive harmonic energy sampling,
gated by a voicing mask, to produce the voicing component; white noise
through the gammatone filter bank, gated by a frication mask, produces
the frication component; both are summed in synthesis]

good naturalness for high- and low-pitch voices
    good results in comparison to standard acoustic synthesis techniques
        tested against MCEP-based synthesis

good intelligibility
    tested with the Modified Rhyme Test for German
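The hybrid idea — harmonics sampled from a spectral envelope when voiced, shaped noise for frication — can be sketched per frame as below. This is a simplification of the slide's diagram: `toy_env` is a hypothetical spectral envelope, and the noise path is scaled by the envelope's mean level rather than shaped by an actual gammatone filter bank as in the system.

```python
import math, random

def synthesize_frame(f0, env, fs, n, voiced, fmax=4000):
    """One frame of the hybrid scheme: a sum of harmonics of f0 with
    amplitudes sampled from env(f) when voiced, spectrally scaled white
    noise otherwise."""
    if voiced:
        out = [0.0] * n
        k = 1
        while k * f0 < fmax:                     # harmonic energy sampling
            a = env(k * f0)
            for i in range(n):
                out[i] += a * math.sin(2.0 * math.pi * k * f0 * i / fs)
            k += 1
        return out
    # frication path: white noise at the envelope's average level
    rng = random.Random(0)                       # seeded for repeatability
    freqs = range(200, fmax, 200)
    level = sum(env(f) for f in freqs) / len(freqs)
    return [level * rng.gauss(0.0, 1.0) for _ in range(n)]

def toy_env(f):
    """Hypothetical spectral envelope, falling off with frequency."""
    return 1.0 / (1.0 + f / 1000.0)

voiced = synthesize_frame(120.0, toy_env, fs=8000, n=160, voiced=True)
noise = synthesize_frame(0.0, toy_env, fs=8000, n=160, voiced=False)
```

In the full system the voicing and frication masks decide, per time-frequency region, which of the two components contributes to the summed output.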
Example from copy synthesis
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Correspondence problem

    asimo   ?   Asimo
Correspondence problem in the literature

innate representations
    [Marean1992, Kuhl1996, Minematsu2009]
    labeled data in standard Voice Conversion systems

important information from feedback of parent / tutor
    imitation [Papousek1992, Girolametto1999]
    reward, stimulation
    distinctive maternal responses [Gros-Louis2006, Goldstein2003]

    mutual imitation games guide acquisition of vowels
        [Miura2007, Kanda2009]
    tutor imitation as reward signal in RL framework
        [Howard2007, Messum2007]
We use tutor's imitative feedback

cooperative tutor (always) imitates

probabilistic mapping
    tutor's voice → motor repertoire

[diagram: motor commands drive a vocal tract model; the tutor's imitative
response passes through a cochlear model into the sensorimotor model]

innate vocal repertoire
    vowels (primitives)
        8 vectors
        10-year-old boy
        TIDIGITS corpus
        formant-annotated

morphing to combine primitives
    assumption: intermediate states will sound "inbetween"

    p_c = p_i + ((c − c_i) / (c_j − c_i)) (p_j − p_i)
    q_c = q_i + ((c − c_i) / (c_j − c_i)) (q_j − q_i)

    for a morphing coordinate c between the coordinates c_i and c_j of
    neighbouring primitives (p_i, q_i) and (p_j, q_j)
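The morphing equation amounts to linear interpolation between the two primitives whose morphing coordinates bracket c. A sketch with hypothetical formant-vector primitives:

```python
def morph(primitives, coords, c):
    """Linearly interpolate between the two primitives whose morphing
    coordinates bracket c, as in the slide's p_c equation."""
    pairs = sorted(zip(coords, primitives))
    for (ci, pi), (cj, pj) in zip(pairs, pairs[1:]):
        if ci <= c <= cj:
            w = (c - ci) / (cj - ci)
            return [a + w * (b - a) for a, b in zip(pi, pj)]
    raise ValueError("c lies outside the primitive coordinate range")

# two formant-vector primitives (hypothetical values, in Hz)
p_a = [700.0, 1100.0, 2600.0]   # /a/-like
p_i = [300.0, 2300.0, 3000.0]   # /i/-like
mid = morph([p_a, p_i], [0.0, 1.0], 0.5)  # halfway between the two
```

At c = 0.5 each component lands halfway between the primitives, which is what the "intermediate states sound inbetween" assumption relies on.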
Training phase

vocal primitives m1, m2, m3

tutor imitative response to each primitive

build model of response to primitive

feature space:
    p1(t) = F1(t)
    p2(t) = F2(t) − F1(t)
    p3(t) = F3(t) − F1(t)
    p{4,5,6}(t) = log(S(C{1,2,3}(t), t))
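The feature space above maps directly to code. In this sketch `S_at_C` stands for the spectral amplitudes S(C_k(t), t) already sampled at the three channels, and the input values are hypothetical:

```python
import math

def features(F, S_at_C):
    """Frame-wise feature vector from the slide:
    F      = (F1, F2, F3) formant frequencies in Hz,
    S_at_C = spectral amplitudes sampled at channels C1..C3."""
    F1, F2, F3 = F
    p = [F1, F2 - F1, F3 - F1]          # p1..p3: F1 plus formant spacings
    p += [math.log(s) for s in S_at_C]  # p4..p6: log spectral amplitudes
    return p

vec = features((700.0, 1100.0, 2600.0), (1.0, 0.5, 0.25))
```

Using formant differences rather than raw F2 and F3 makes the vector less sensitive to a uniform shift of the whole formant pattern between the child-like system voice and the adult tutor voice.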
Imitation phase

tutor target utterance

class posterior probabilities — k-Nearest Neighbours:
    p(Cj|x) = Kj / K
    Kj — number of points of class Cj in a neighbourhood V(x)
    with K elements

population coding

spectral output:
    α = p(Cj1|x) / (p(Cj1|x) + p(Cj2|x))
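The kNN posterior and the population-coding weight α can be sketched as below; the one-dimensional toy data set is hypothetical:

```python
def knn_posteriors(x, data, K):
    """p(Cj|x) = Kj / K over the K nearest neighbours of x.
    `data` is a list of (feature_vector, class_label) pairs."""
    ranked = sorted(data,
                    key=lambda d: sum((a - b) ** 2 for a, b in zip(d[0], x)))
    top = [label for _, label in ranked[:K]]
    return {c: top.count(c) / K for c in set(top)}

def morph_alpha(post, c1, c2):
    """Population coding: morphing weight between the two best classes."""
    return post.get(c1, 0.0) / (post.get(c1, 0.0) + post.get(c2, 0.0))

# toy training set: three frames of class "a", three of class "i"
data = [([0.0], "a")] * 3 + [([1.0], "i")] * 3
post = knn_posteriors([0.2], data, K=5)   # 3 "a" + 2 "i" neighbours
alpha = morph_alpha(post, "a", "i")
```

The resulting α then drives the primitive morphing, so the spectral output interpolates between the two most probable primitives rather than snapping to the single best class.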
Imitation example

[figure: spectrogram of the tutor's target utterance (0.1–8 kHz, ~1 s);
classification trace p(Cj|x) over time; spectrogram of the morphed
primitives; spectrogram of the resulting imitation, with pitch + energy
taken from the target]
other examples

       adult    imitation

aia

aua

papa
Subjective evaluation of imitation

experiment
    how similar is the content of the two sounds?
        1 (different) ... 5 (same)
    24 test subjects

stimuli
    3 systems x 13 phonemes, pairs <human, imitated>
        O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI
    8 pairs <human, control>
        supervised activation

systems (primitive sets):
    S3: a, i, U
    S5: a, i, U, E, O
    S8: a, i, U, E, O, e, @, o
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Integration with an existing speech acquisition system (Azubi)

Goals:
    integrate with perceptual model
    make it more appropriate to use in real scenarios

[diagram: a phone model pool feeds a phone recognizer (phone LM,
phonotactic speech model → phone activities); score normalization and
model initialization produce training segments for a syllable model pool;
the syllable spotter (syllable LM, syllabic constraints) yields syllable
sequences; a word lexicon and word LM drive the word spotter, which
detects words for symbol grounding]
                                                                                                                    lexicon
                                                pool                               pool
                                                                          training
Azubi model [Brandl et al, 2008]                       phone
                                                        LM
                                                                         segments           syllable
                                                                                              LM
                                                                                                                                  word
                                                                                                                                  LM
                                                                                                        syllable
acquires speech                              phone
                                           recognizer
                                                              score
                                                           normalization
                                                                                 syllable
                                                                                 spotter
                                                                                                       sequence      word
                                                                                                                    spotter

    phones, syllables, words                                                                                       detect words
                                           phone                                      syllabic

    already used in interaction           activities     phonotactic                 constraints
                                                                                                                    symbol
                                                        speech model
    scenarios [Bolder et al, 2008, etc]                                                                            grounding




                                                                                                                            19
Integration with an existing speech
             acquisition system (Azubi)

Goals:

    integrate with perceptual model

    make it more appropriate to use in
                                               phone                             syllable
    real scenarios                             model
                                                                   model
                                                               initialization     model
                                                                                                                     word
                                                                                                                    lexicon
                                                pool                               pool
                                                                          training
Azubi model [Brandl et al, 2008]                       phone
                                                        LM
                                                                         segments           syllable
                                                                                              LM
                                                                                                                                  word
                                                                                                                                  LM
                                                                                                        syllable
acquires speech                              phone
                                           recognizer
                                                              score
                                                           normalization
                                                                                 syllable
                                                                                 spotter
                                                                                                       sequence      word
                                                                                                                    spotter

    phones, syllables, words                                                                                       detect words
                                           phone                                      syllabic

    already used in interaction           activities     phonotactic                 constraints
                                                                                                                    symbol
                                                        speech model
    scenarios [Bolder et al, 2008, etc]                                                                            grounding




                                                                                                                            19
Integration with an existing speech
             acquisition system (Azubi)

Goals:

    integrate with perceptual model

    make it more appropriate to use in
                                                         phone                             syllable
    real scenarios                                       model
                                                                             model
                                                                         initialization     model
                                                                                                                               word
                                                                                                                              lexicon
                                                          pool                               pool
                                                                                    training
Azubi model [Brandl et al, 2008]                                 phone
                                                                  LM
                                                                                   segments           syllable
                                                                                                        LM
                                                                                                                                            word
                                                                                                                                            LM
                                                                                                                  syllable
acquires speech                                        phone
                                                     recognizer
                                                                        score
                                                                     normalization
                                                                                           syllable
                                                                                           spotter
                                                                                                                 sequence      word
                                                                                                                              spotter

    phones, syllables, words                                                                                                 detect words
                                                     phone                                      syllabic

    already used in interaction                     activities     phonotactic                 constraints
                                                                                                                              symbol
                                                                  speech model
    scenarios [Bolder et al, 2008, etc]                                                                                      grounding




                                λp λp
                                 1  2     λp
                                           3   λp
                                                4     λp
                                                       5



                                                                                                                                      19
Integration with an existing speech
             acquisition system (Azubi)

Goals:
    integrate with the perceptual model
    make it more appropriate for use in real scenarios

Azubi model [Brandl et al, 2008]
    acquires speech: phones, syllables, words
    already used in interaction scenarios [Bolder et al, 2008, etc]

Correspondence model trained at the phone model level

[Diagram: the Azubi speech model: a phone recognizer (phone LM, phonotactic
constraints, phone activities) feeds score normalization, a syllable spotter
(segments, syllable LM, syllabic constraints) and a word spotter (syllable
sequences, word LM, word lexicon), leading to word detection and symbol
grounding; phone and syllable model pools are initialized and trained online.
The phone models λp1, ..., λp5 connect through the correspondence mapping to
production primitives, a synergistic encoder, activity contours, utterance
generation and the synthesizer.]

                                                                            19
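The recognizer-to-spotter cascade in the diagram can be caricatured as plain function composition. Everything below (the stage names, the CV-pair grouping, the toy lexicon) is a hypothetical stand-in for illustration, not the actual Azubi implementation:

```python
from typing import List, Dict

def phone_recognizer(frames: List[str]) -> List[str]:
    # toy: frames already carry phone labels; the real recognizer
    # scores phone HMMs under a phone LM and phonotactic constraints
    return frames

def syllable_spotter(phones: List[str]) -> List[str]:
    # toy syllabic constraint: group consecutive phones into CV pairs
    return ["".join(phones[i:i + 2]) for i in range(0, len(phones), 2)]

def word_spotter(syllables: List[str], lexicon: Dict[str, str]) -> List[str]:
    # detect known words as syllable sequences found in the lexicon
    return [lexicon[s] for s in syllables if s in lexicon]

def pipeline(frames: List[str], lexicon: Dict[str, str]) -> List[str]:
    return word_spotter(syllable_spotter(phone_recognizer(frames)), lexicon)

words = pipeline(["m", "a", "m", "a"], {"ma": "MA"})  # -> ["MA", "MA"]
```

The point is only the shape of the cascade: each stage narrows the hypothesis space of the previous one before words are grounded in symbols.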
Training phase: correspondence model

vocal primitive -> tutor imitation -> segmentation / classification ->
update of the probabilistic mapping

Segmentation of the tutor's imitation into phone models:

    [λp1, ..., λpn] = argmax over [λp] ∈ P of P([λp] | X_tutor)

Correspondence matrix update:

    Mij = P(λpi | mj, Dj)

    Cij = P(mj | λpi)
        = P(λpi | mj, Dj) P(mj) / P(λpi)
        ∝ Mij

[Diagram: motor primitives m1-m3 linked to phone models λp1, ..., λp5, with
links marked + and - as the mapping is updated.]

                                                                            20
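The matrix update above can be sketched in a few lines of numpy. The soft-count accumulation scheme and every name here are assumptions for illustration; the slide only fixes the Bayes relation between M and C:

```python
import numpy as np

def update_correspondence(M, counts, primitive_idx, phone_posteriors):
    """Accumulate evidence that motor primitive m_j elicited the observed
    phone-model posteriors, then renormalize M_ij = P(lambda_i | m_j, D_j)."""
    counts[:, primitive_idx] += phone_posteriors  # soft counts per phone model
    M[:, primitive_idx] = counts[:, primitive_idx] / counts[:, primitive_idx].sum()
    return M

def correspondence(M, prior_m=None):
    """C_ij = P(m_j | lambda_i) via Bayes; with a uniform prior over
    primitives each row of C is proportional to the row of M."""
    n_phones, n_prims = M.shape
    if prior_m is None:
        prior_m = np.full(n_prims, 1.0 / n_prims)
    joint = M * prior_m                            # P(lambda_i | m_j) P(m_j)
    return joint / joint.sum(axis=1, keepdims=True)  # normalize by P(lambda_i)

# 5 phone models and 3 motor primitives, as in the diagram
M = np.full((5, 3), 0.2)
counts = np.ones((5, 3))
# primitive m2 was produced; the tutor's imitation mostly matched lambda_3
M = update_correspondence(M, counts, 1, np.array([0.0, 0.1, 0.8, 0.1, 0.0]))
C = correspondence(M)
```

After the update, row 3 of C assigns its largest probability to primitive m2, which is exactly the "+" link in the diagram.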
Imitation phase

target tutor utterance -> segmentation:

    [λp1, ..., λpn] = argmax over [λp] ∈ P of P([λp] | X_tutor)

vocal primitives' posterior probabilities -> motor primitives m1-m3 ->
population coding -> gaussian activation contours -> spectral output

                                                                            21
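A minimal sketch of this decoding path, assuming a correspondence matrix C from the training phase and Gaussian bumps for the population code; the segment times, widths and all names are hypothetical:

```python
import numpy as np

def primitive_activations(C, phone_posteriors):
    """Map per-segment phone-model posteriors into motor-primitive
    activations through C_ij = P(m_j | lambda_i)."""
    return phone_posteriors @ C            # shape: (n_segments, n_primitives)

def activation_contours(activations, seg_centers, t, width=0.05):
    """Population coding: one Gaussian bump per decoded segment, centered
    on the segment and weighted by each primitive's activation."""
    contours = np.zeros((len(t), activations.shape[1]))
    for center, act in zip(seg_centers, activations):
        contours += act * np.exp(-0.5 * ((t - center)[:, None] / width) ** 2)
    return contours

# toy numbers: 5 phone models, 3 primitives, 2 decoded segments
C = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2]])
post = np.array([[0.9, 0.05, 0.05, 0.0, 0.0],   # segment 1: mostly lambda_1
                 [0.0, 0.05, 0.9, 0.05, 0.0]])  # segment 2: mostly lambda_3
acts = primitive_activations(C, post)
t = np.linspace(0.0, 1.0, 200)
contours = activation_contours(acts, seg_centers=[0.25, 0.75], t=t)
```

The contours would then drive the synthesizer to produce the spectral output; that last step depends on the vocoder and is omitted here.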
Experimental results

Correspondence matrix

phone models
    trained on "child-directed"-like speech, ca. 1 min

interaction
    15 imitations of each vocal primitive

[Figure: learned correspondence matrix; rows: vocal primitives, columns:
phone models]

                                                                            22
Imitation example: "mama"

[Figure: input spectrum of the tutor utterance, the corresponding population
coding, and the system's spectral output]

                                                                            23
Summary

A framework in which speech imitation becomes possible:

    speech synthesis technique for the child's voice
      channel vocoder combined with a gammatone filterbank
      evaluation

    addressing the correspondence problem
      probabilistic mapping between the tutor's voice and the system's motor space
      tutor feedback interpreted in
           feature space
           a perceptual space acquired without supervision

    integration into an online speech acquisition framework (Azubi)
        paves the way for use on the robot

                                                                            24
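The "channel vocoder combined with a gammatone filterbank" point can be illustrated with a crude analysis/resynthesis loop. This sketch substitutes ERB-spaced Butterworth bands for true gammatone filters and noise carriers for the original excitation; every parameter is an assumption, not the thesis implementation:

```python
import numpy as np
from scipy.signal import butter, lfilter

def erb_centers(n, fmin=100.0, fmax=6000.0):
    """Center frequencies spaced evenly on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def channel_vocoder(x, fs, n_channels=16):
    """Analyze per-band envelopes, then resynthesize by modulating
    band-limited noise carriers."""
    out = np.zeros_like(x)
    for fc in erb_centers(n_channels):
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25   # ~half-octave band
        b, a = butter(2, [lo, hi], btype='band', fs=fs)
        band = lfilter(b, a, x)
        # envelope extraction: rectify then low-pass at 50 Hz
        be, ae = butter(2, 50.0, btype='low', fs=fs)
        env = lfilter(be, ae, np.abs(band))
        carrier = lfilter(b, a, np.random.randn(len(x)))  # band-limited noise
        out += env * carrier
    return out

np.random.seed(0)
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # toy input signal
y = channel_vocoder(x, fs)
```

Swapping the Butterworth bands for a gammatone filterbank (and the noise carriers for the system's own excitation) recovers the shape of the synthesis technique summarized above.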
Publications

"Learning from a tutor: embodied speech acquisition and imitation learning"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China

"Speech imitation with a child’s voice: addressing the correspondence problem"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. SPECOM’2009, St Petersburg, Russia

"Linking Perception and Production: System Learns a Correspondence Between its Own
Voice and the Tutor's"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît,
GIPSA-lab, Grenoble, Université Stendhal, France

"Speech structure acquisition for interactive systems"
H. Brandl, M. Vaz, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît,
GIPSA-lab, Grenoble, Université Stendhal, France

"Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction
via Feature-based Resynthesis"
M. Heckmann, C. Glaeser, M. Vaz, T. Rodemann, F. Joublin, C. Goerick
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice,
France


                                                                                          25
Thank you




Dr. Estela Bicho
Dr. Frank Joublin
Dr. Wolfram Erlhagen

Colleagues @ Honda Research Institute
Colleagues @ DEI

Family
Friends




                                        26

Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

2010.01.25 - Developmentally inspired computational framework for embodied speech imitation
 (PhD presentation)

  • 1. University of Minho Engineering School Developmentally inspired computational framework for embodied speech imitation Miguel Vaz mvaz@dei.uminho.pt Dep. Industrial Electronics Honda Research Institute Europe University of Minho Offenbach am Main Portugal Germany 25th January, Guimarães
  • 4. Long-term goal: verbal interaction with ASIMO, covering speech perception, speech production, and meaning / language.
  • 11. Constraints and specificities of the ASIMO platform: no pre-defined vocabulary, so speech must be acquired in interaction (imitation, online learning, unlabeled data); language assumptions are minimized (no corpus exists for the system's voice). Consequences: the system has a child's voice, so a child's voice must be synthesized, and the correspondence problem must be addressed.
  • 15. Outline: synthesize a child's voice (vocoder using a gammatone filter bank); address the correspondence problem (sensorimotor model trained with the tutor's imitative feedback, in feature space and in perceptual space).
  • 16. Speech: source-filter model of speech production. The glottal airflow (source spectrum) passes through the vocal tract (filter function, with its formant frequencies) to give the output from the lips (output spectrum).
  • 17. Spectral feature extraction with a gammatone filter bank: gammatone filterbank, envelope extraction, and harmonic structure elimination, applied to the speech signal together with the pitch contour; example shown for the word "zurück".
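The gammatone analysis stage on this slide can be sketched as follows. This is a minimal illustration of a textbook 4th-order gammatone filter bank with ERB bandwidths (Glasberg and Moore), not the implementation used in the thesis; all function names are my own.

```python
import numpy as np

def gammatone_ir(fc, fs=16000, order=4, duration=0.05):
    """Impulse response of a gammatone filter centred at fc (Hz).

    Bandwidth follows the common ERB rule:
    ERB(fc) = 24.7 * (4.37 * fc / 1000 + 1).
    """
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb  # conventional bandwidth scaling for a 4th-order filter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def filterbank(signal, centre_freqs, fs=16000):
    """Analyse a signal with a bank of gammatone filters.

    Returns one band-passed channel per centre frequency; envelope
    extraction (e.g. rectification and smoothing) would follow."""
    return np.array([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])
```

A channel responds strongly only near its centre frequency, which is what makes the later formant-channel features meaningful.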
  • 21. VOCODER-like synthesis algorithm with a gammatone filter bank. Hybrid architecture: a channel vocoder for frication (white noise shaped by the gammatone filter bank through a frication mask) and a harmonic model for voicing (pitch and energy sampling of the spectral vectors through a voicing mask). Good naturalness for high- and low-pitch voices (good results in comparison to standard acoustic synthesis techniques; tested against MCEP-based synthesis) and good intelligibility (tested with the Modified Rhyme Test for German).
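The hybrid voiced/unvoiced idea can be illustrated with a heavily reduced sketch: voiced frames as a sum of harmonics sampled from a spectral envelope, unvoiced frames as spectrally shaped noise. There is no gammatone resynthesis and no phase continuity across frames here; `envelope` is a hypothetical per-frame function mapping frequency (Hz) to linear amplitude.

```python
import numpy as np

def synthesize(pitch, envelope, voicing, fs=16000, frame_len=160):
    """Toy hybrid synthesis: per frame, voiced segments are rendered as a
    sum of harmonics of the pitch whose amplitudes are sampled from the
    spectral envelope; unvoiced segments as envelope-shaped white noise."""
    rng = np.random.default_rng(0)
    out = []
    for f0, env, v in zip(pitch, envelope, voicing):
        t = np.arange(frame_len) / fs
        if v and f0 > 0:
            harmonics = np.arange(1, int((fs / 2) // f0) + 1) * f0
            frame = sum(env(h) * np.sin(2 * np.pi * h * t) for h in harmonics)
        else:
            noise = rng.standard_normal(frame_len)
            # shape the noise by the envelope in the frequency domain
            spec = np.fft.rfft(noise)
            freqs = np.fft.rfftfreq(frame_len, 1 / fs)
            frame = np.fft.irfft(spec * env(freqs), frame_len)
        out.append(frame)
    return np.concatenate(out)
```

The actual system applies voicing and frication masks to the same gammatone spectral vectors; this sketch only shows why the two excitation paths are separated.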
  • 23. Example from copy synthesis
  • 24. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 9
  • 25. Correspondence problem: the tutor's "asimo" vs. the system's own "Asimo".
  • 29. Correspondence problem in the literature. Innate representations: [Marean1992, Kuhl1996, Minematsu2009]; labeled data in standard Voice Conversion systems. Important information from the feedback of the parent / tutor: imitation [Papousek1992, Girolametto1999]; reward and stimulation; distinctive maternal responses [Gros-Louis2006, Goldstein2003]. Mutual imitation games guide the acquisition of vowels [Miura2007, Kanda2009]; tutor imitation as a reward signal in an RL framework [Howard2007, Messum2007].
  • 34. We use the tutor's imitative feedback: a cooperative tutor (always) imitates. A probabilistic mapping links the tutor's voice to the motor repertoire (vocal tract model and motor commands on the system's side, cochlear model and sensorimotor model receiving the tutor's imitative response). Innate vocal repertoire: vowels as primitives, 8 vectors, from a 10-year-old boy in the formant-annotated TIDIGITS corpus. Morphing combines primitives, under the assumption that intermediate states will sound "in between":

      p_c = p_i + (c - c_i) / (c_j - c_i) * (p_j - p_i)
      q_c = q_i + (c - c_i) / (c_j - c_i) * (q_j - q_i)
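The morphing rule for combining primitives is plain linear interpolation and can be sketched directly (an illustration, not the thesis code):

```python
import numpy as np

def morph(p_i, p_j, c, c_i, c_j):
    """Linear interpolation between two vocal primitives:
    p_c = p_i + (c - c_i) / (c_j - c_i) * (p_j - p_i)."""
    alpha = (c - c_i) / (c_j - c_i)
    return np.asarray(p_i) + alpha * (np.asarray(p_j) - np.asarray(p_i))
```

For example, morphing halfway between two primitive formant vectors yields their midpoint, which under the slide's assumption should sound "in between" the two vowels.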
  • 40. Training phase: the system produces a vocal primitive (motor commands m1, m2, m3), the tutor gives an imitative response, and a model of the response to each primitive is built in feature space:

      p1(t) = F1(t)
      p2(t) = F2(t) - F1(t)
      p3(t) = F3(t) - F1(t)
      p{4,5,6}(t) = log( S( C{1,2,3}(t), t) )
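The per-frame feature vector above can be sketched as follows; `S` here is an illustrative stand-in for the spectral energy at the gammatone channels C1..C3 tracking the formants, and the function name is my own.

```python
import numpy as np

def features(F, S):
    """Per-frame feature vector from the slide.

    F = (F1, F2, F3): formant frequencies in Hz.
    S: callable giving spectral energy at a frequency (stand-in for the
       gammatone channel energies C1..C3 of the thesis)."""
    F1, F2, F3 = F
    p = np.empty(6)
    p[0] = F1           # absolute first formant
    p[1] = F2 - F1      # formant spacings, which normalise some
    p[2] = F3 - F1      # speaker-dependent variation
    p[3:6] = np.log([S(F1), S(F2), S(F3)])  # log energies at the formants
    return p
```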
  • 45. Imitation phase: given a tutor target utterance, k-Nearest Neighbours yields class posterior probabilities p(Cj|x) = Kj / K, where Kj is the number of points of class Cj in a neighbourhood V(x) with K elements. Population coding then produces the spectral output, with morphing coefficient alpha = p(Cj1|x) / ( p(Cj1|x) + p(Cj2|x) ).
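The kNN posterior and the morphing coefficient on this slide can be sketched as (illustrative, with Euclidean distance assumed):

```python
import numpy as np

def knn_posteriors(x, data, labels, K, n_classes):
    """p(C_j | x) = K_j / K: the fraction of the K nearest training
    points carrying class label j."""
    d = np.linalg.norm(data - x, axis=1)
    nearest = labels[np.argsort(d)[:K]]
    return np.bincount(nearest, minlength=n_classes) / K

def morph_coefficient(post, j1, j2):
    """alpha weighting the two most active primitives:
    alpha = p(Cj1|x) / (p(Cj1|x) + p(Cj2|x))."""
    return post[j1] / (post[j1] + post[j2])
```

Renormalising over the two strongest classes gives the interpolation weight used to morph between the corresponding primitives.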
  • 49. Imitation example: spectrograms of the target utterance, the classification posteriors p(Cj|x) over time, the morphed primitives, and the resulting imitation (combined with the extracted pitch and energy).
  • 56. Other examples: adult utterance vs. imitation, for "aia", "aua", "papa".
  • 57. Subjective evaluation of imitation. Experiment: "how similar is the content of the two sounds?", rated from 1 (different) to 5 (same) by 24 test subjects. Stimuli: 3 systems x 13 phonemes as pairs <human, imitated> (O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI; S3: a, i, U; S5: a, i, U, E, O; S8: a, i, U, E, O, e, @, o), plus 8 pairs <human, control> with supervised activation.
  • 58. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 18
  • 64. Integration with an existing speech acquisition system (Azubi). Goals: integrate with the perceptual model; make it more appropriate for use in real scenarios. The Azubi model [Brandl et al, 2008] acquires speech (phones, syllables, words, via phone / syllable / word model pools, language models, recognizer and spotters, score normalization, phonotactic constraints and symbol grounding) and has already been used in interaction scenarios [Bolder et al, 2008, etc]. The correspondence model is trained at the phone model level: phone models λp1..λp5 are mapped to production primitives via primitive activity, synergistic activity contour mapping, an encoder, and the synthesizer for utterance generation.
  • 69. Training phase, correspondence model: the tutor's imitation of a vocal primitive is segmented and classified into phone models,

      [λp_1, ..., λp_n] = argmax_{[λp] in P} P([λp] | X_tutor)

  and the probabilistic mapping between phone models and primitives m1, m2, m3 is updated:

      M_ij = P(λp_i | m_j, D_j)
      C_ij = P(m_j | λp_i) = P(λp_i | m_j, D_j) P(m_j) / P(λp_i)
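The update of the correspondence mapping can be sketched with a simple co-occurrence counter that is normalised into conditional probabilities; this is a counting stand-in for the incremental Bayesian update on the slide, and the class and method names are my own.

```python
import numpy as np

class CorrespondenceModel:
    """Counts how often the tutor's imitation of primitive m_j is
    recognised as phone model lambda_i, then normalises the counts into
    the conditional probabilities used at imitation time."""
    def __init__(self, n_phones, n_primitives):
        self.counts = np.zeros((n_phones, n_primitives))

    def update(self, phone, primitive):
        # one (phone, primitive) co-occurrence observed during training
        self.counts[phone, primitive] += 1

    def p_primitive_given_phone(self):
        # C_ij = P(m_j | lambda_i): row-normalised counts
        totals = self.counts.sum(axis=1, keepdims=True)
        return np.divide(self.counts, totals,
                         out=np.zeros_like(self.counts), where=totals > 0)
```

Rows of the resulting matrix are the primitive posteriors looked up when a segmented phone sequence has to be turned into motor activity.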
  • 74. Imitation phase: the tutor's target utterance is segmented into phone models, [λp_1, ..., λp_n] = argmax_{[λp] in P} P([λp] | X_tutor); the vocal primitives' posterior probabilities are obtained through the correspondence mapping; and population coding with gaussian activation contours produces the spectral output.
  • 78. Experimental results: correspondence matrix between phone models and vocal primitives, learned in interaction from about 1 min of "child-directed"-like speech, with 15 imitations of each vocal primitive.
  • 82. Imitation example, "mama": input spectrum, population coding, spectral output.
  • 83. Summary. A framework that makes speech imitation possible: a speech synthesis technique for a child's voice (channel vocoder meets gammatone filterbank, with evaluation); the correspondence problem addressed via a probabilistic mapping between the tutor's voice and the system's motor space, with tutor feedback interpreted in feature space and in an unsupervisedly acquired perceptual space; and integration in an online speech acquisition framework (Azubi), paving the way for usage on the robot.
  • 84. Publications "Learning from a tutor: embodied speech acquisition and imitation learning" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China "Speech imitation with a child’s voice: addressing the correspondence problem" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. SPECOM’2009, St Petersburg, Russia "Linking Perception and Production: System Learns a Correspondence Between its Own Voice and the Tutor's" M.Vaz, H.Brandl, F.Joublin, C.Goerick, Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Speech structure acquisition for interactive systems" H.Brandl, M.Vaz, F.Joublin, C.Goerick Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction via Feature-based Resynthesis" M.Heckmann, C.Glaeser, M.Vaz, T.Rodemann, F.Joublin, C. Goerick Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice, France 25
  • 85. Thank you Dr. Estela Bicho Dr. Frank Joublin Dr. Wolfram Erlhagen Colleagues @ Honda Research Institute Colleagues @ DEI Family Friends 26

Editor's Notes

  1. There I had presented and evaluated a framework for synthesizing speech with a child's voice. The ultimate goal was to use the framework to learn speech through interaction with a tutor. In the end, I'd shown you the first steps
  2. Language assumptions: syllable structure, number of vowels in the vowel system, prosody. Traditional HMM synthesis approaches are not suitable.
  24. Explain the difficulties of working with a child's voice and motivate the need for the new technique. Articulatory synthesis: limited in voices and phoneme sets. The VOCODER has been shown to work well with good spectral representations. Speech is the physical result of air being expelled from the lungs and passing through the vocal tract. Source-filter model of speech production: a source signal (larynx, vocal tract constriction) is modulated by a vocal tract filter function; there are different ways of representing and deriving the vocal tract filter function.
  25. Focus on the architecture and properties; we tested for intelligibility and naturalness.
  30. Properties differ, but the meaning is the same.
  31. 1. Even if it were true, there is no known speech representation that would do the job. 2. Also mention the work of the group in Edinburgh, who morph spectra between an adult and a child speaker by maximizing the likelihood of a given sequence. Gros-Louis 2006: interactive, differentiated and proximate responses increase production of more advanced utterances. Goldstein 2003.
  34. Also mention the work of the group in Edinburgh, who morph spectra between an adult and a child speaker by maximizing the likelihood of a given sequence. M. Vaz, H. Brandl, F. Joublin, and C. Goerick, "Speech imitation with a child's voice: addressing the correspondence problem," accepted for 13th Int. Conf. on Speech and Computer (SPECOM), 2009.
  44. \begin{split} p_1(t) &= F_1(t) \\ p_2(t) &= F_2(t) - F_1(t) \\ p_3(t) &= F_3(t) - F_1(t) \\ p_{\{4,5,6\}}(t) &= \log( S( C_{\{1,2,3\}}(t), t) ) \end{split}
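The formant-based feature vector in the note above can be sketched in code. This is an illustrative stand-in, assuming `F1..F3` are formant frequencies and `spectrum_at_peaks` holds the spectral amplitudes `S(C_k(t), t)` sampled at the three formant peaks; the thesis system's actual feature extractor is not reproduced here.

```python
import numpy as np

def perceptual_features(F1, F2, F3, spectrum_at_peaks):
    """Build the 6-dimensional feature vector from the notes:
    p1 = F1, p2 = F2 - F1, p3 = F3 - F1, and p4..p6 the log
    spectral amplitudes at the three formant peaks.
    Names and signature are illustrative assumptions."""
    p = np.empty(6)
    p[0] = F1
    p[1] = F2 - F1
    p[2] = F3 - F1
    p[3:6] = np.log(spectrum_at_peaks)  # log amplitudes at the 3 peaks
    return p
```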
58. Why kNN? It makes no assumptions about the distribution of the elements of each class, which matters because the data are quite irregular. For a set of labels or vocal classes $C_j$ and an input feature vector $x$, we consider a neighbourhood $V$ of $x$ that contains exactly $K$ points. The posterior probability of class membership depends on the number of training points of class $C_j$ present in $V$, denoted $K_j$: \begin{equation} p( C_j \mid x ) = \frac{K_j}{K}, \qquad \alpha = \frac{p( C_{j_1} \mid x )}{ p( C_{j_1} \mid x ) + p( C_{j_2} \mid x )} \end{equation}
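The kNN posterior and the pairwise ratio from the note above can be sketched as follows. This is a minimal illustration under a plain Euclidean metric; the thesis system's feature space and distance are assumptions here, not taken from the source.

```python
import numpy as np

def knn_posteriors(train_X, train_y, x, K):
    """Estimate p(C_j | x) = K_j / K from the K nearest neighbours
    of x. Non-parametric, so it makes no assumption about how each
    vocal class is distributed -- the motivation given in the notes."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:K]]  # labels of the K closest points
    return {c: np.mean(nearest == c) for c in np.unique(train_y)}

def pairwise_alpha(post, c1, c2):
    """alpha = p(C_{j1}|x) / (p(C_{j1}|x) + p(C_{j2}|x))."""
    return post[c1] / (post[c1] + post[c2])
```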
69. S_c is always better: the system benefits from an extended vocal repertoire. Trends: canonical vowels. Generalization isn't working 100%: morphing might be introducing some distortions.
70. Language assumptions: syllable structure; number of vowels in the vowel system; prosody. Traditional HMM synthesis approaches are not suitable.
71. Add to scheme that the system gets the phone models after they have been \begin{split} C_{ij} &= P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j, D_j)\, P(m_j) }{ P(\lambda_i^p) } \\ M_{ij} &= P( \lambda_i^p \mid m_j, D_j ) \\ [\lambda^p_{1}, \dots, \lambda^p_{n}] &= \operatorname*{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} ) \end{split}
84. - Correspondence model has the form of a matrix, because the perceptual space is discrete. - From a given input, the Azubi model \begin{split} C_{ij} &= P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j, D_j)\, P(m_j) }{ P(\lambda_i^p) } \\ M_{ij} &= P( \lambda_i^p \mid m_j, D_j ) \\ [\lambda^p_{1}, \dots, \lambda^p_{n}] &= \operatorname*{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} ) \end{split} Add to scheme that the system gets the phone models after they have been
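Because the perceptual space is discrete, the correspondence model can be stored and queried as a plain matrix. The sketch below illustrates that idea with made-up numbers and a greedy per-phone lookup; it is a stand-in for, not the thesis implementation of, the sequence-level argmax over $P([\lambda^p] \mid X_{tutor})$.

```python
import numpy as np

# Correspondence model as a matrix: rows index phone models lambda_i^p,
# columns index motor primitives m_j, entries approximate P(m_j | lambda_i^p).
# The matrix form works because the perceptual space is discrete.
# These numbers are invented for illustration only.
C = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

def motor_sequence(phone_indices):
    """For each recognized phone-model index i, select the motor
    primitive j maximizing C[i, j] -- a greedy stand-in for the
    sequence-level maximization in the notes."""
    return [int(np.argmax(C[i])) for i in phone_indices]
```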
108. Over-representation: more models than vowels. 1. There are some phonemes for which there is only sparse activity. 2. Some phone models are never active. 3. Some are active all of the time. The whole subset is not covered (primitives are only vowels); different primitives show a stronger dispersion than others. Possible causes: a non-uniform imitative response of the tutor to the vocal primitive; limitations of synthesizing a phoneme with only one spectral vector; or the absence of any phone model fully representing the imitative response; i.e., issues of over- or under-representation.
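The "never active" / "active all of the time" diagnosis above amounts to counting model activations across imitation trials. A minimal sketch, with illustrative thresholds that are not taken from the thesis:

```python
from collections import Counter

def representation_report(activations, n_models):
    """Given a list of phone-model indices activated across trials,
    flag models that are never active (possible under-representation)
    and models active in nearly every trial (possible
    over-representation). The 0.9 cutoff is an assumed example value."""
    counts = Counter(activations)
    never_active = [m for m in range(n_models) if counts[m] == 0]
    always_active = [m for m in range(n_models)
                     if counts[m] >= 0.9 * len(activations)]
    return never_active, always_active
```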
112. Which conclusions here are OK? Retake the conclusions.