University of Minho
Engineering School


Developmentally inspired computational
framework for embodied speech imitation

Miguel Vaz
mvaz@dei.uminho.pt

Dep. Industrial Electronics          Honda Research Institute Europe
University of Minho                  Offenbach am Main
Portugal                             Germany

25th January, Guimarães
Long-term goal:
verbal interaction with ASIMO

    speech perception
    speech production
    meaning / language
Constraints and specificities of the ASIMO platform

no pre-defined vocabulary
    acquire speech in interaction
        imitation
        online learning
        unlabeled data
    minimize language assumptions
        no corpus for system's voice

system has child's voice
    synthesize child's voice
    address correspondence problem
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Speech: source-filter model of speech production

[figure: glottal airflow (source) passes through the vocal tract (filter
function) to produce the output from the lips; below, the source spectrum,
the filter's formant frequencies, and the resulting output spectrum]
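The source-filter pipeline on this slide can be sketched in a few lines of Python: an impulse-train glottal source filtered by two-pole resonators standing in for the formants. The formant frequencies and bandwidths are illustrative values, not parameters from the thesis.

```python
import math

def impulse_train(f0, fs, dur):
    """Glottal source approximation: one impulse per pitch period."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(dur * fs))]

def resonator(x, fc, bw, fs):
    """Two-pole IIR resonator: one formant of the vocal-tract filter."""
    r = math.exp(-math.pi * bw / fs)
    b = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    c = -r * r
    a = 1.0 - b - c          # normalize for unity gain at DC
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        y0 = a * s + b * y1 + c * y2
        y.append(y0)
        y1, y2 = y0, y1
    return y

fs = 16000
source = impulse_train(f0=120.0, fs=fs, dur=0.05)     # 800 samples
output = source
for fc, bw in [(700, 80), (1100, 90), (2600, 120)]:   # /a/-like formants
    output = resonator(output, fc, bw, fs)
```

Cascading the resonators imposes the formant structure on the flat source spectrum, which is exactly the filter-function step pictured on the slide.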
Spectral feature extraction with a gammatone filter bank

scheme: speech → Gammatone Filterbank → Envelope Extraction →
Harmonic Structure Elimination (driven by the estimated pitch)

[figure: spectrograms (0.1–8 kHz, ~0.5 s) of the example word "zurück"
at successive processing stages]
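A minimal Python sketch of the per-channel front end, covering only the filterbank and envelope-extraction stages (harmonic-structure elimination is omitted). It uses the standard 4th-order gammatone impulse response with an ERB-scaled bandwidth; the sampling rate, tap count, and smoothing window are arbitrary choices for the example.

```python
import math

def erb(f):
    """Equivalent rectangular bandwidth of a channel centred at f (Hz)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n_taps=128, order=4):
    """Truncated 4th-order gammatone impulse response centred at fc."""
    b = 1.019 * erb(fc)
    return [
        (i / fs) ** (order - 1)
        * math.exp(-2.0 * math.pi * b * i / fs)
        * math.cos(2.0 * math.pi * fc * i / fs)
        for i in range(n_taps)
    ]

def channel_envelope(x, fc, fs, win=32):
    """Band-pass one gammatone channel, then extract the envelope by
    half-wave rectification and a moving-average low-pass."""
    ir = gammatone_ir(fc, fs)
    y = [sum(ir[k] * x[i - k] for k in range(min(len(ir), i + 1)))
         for i in range(len(x))]
    rect = [max(v, 0.0) for v in y]
    return [sum(rect[max(0, i - win + 1): i + 1]) / win
            for i in range(len(rect))]

fs = 8000
tone = [math.sin(2.0 * math.pi * 500 * i / fs) for i in range(400)]
env = channel_envelope(tone, fc=500, fs=fs)       # on-frequency channel
env_off = channel_envelope(tone, fc=2000, fs=fs)  # off-frequency channel
```

For a 500 Hz tone the on-frequency channel carries a strong envelope while the 2 kHz channel stays near zero, which is what makes the filterbank output a usable spectral representation.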
VOCODER-like synthesis algorithm with a gammatone filter bank

hybrid architecture
    channel vocoder for frication
    harmonic model for voicing

[diagram: pitch and spectral vectors drive harmonic energy sampling,
gated by a voicing mask, to produce the voicing component; white noise
through the gammatone filter bank, gated by a frication mask, produces
the frication component; both are summed in synthesis]

good naturalness for high- and low-pitch voices
    good results in comparison to standard acoustic synthesis techniques
        tested against MCEP-based synthesis

good intelligibility
    tested with the Modified Rhyme Test for German
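The hybrid idea — harmonics sampled from a spectral envelope when voiced, shaped noise for frication — can be sketched per frame as below. This is a simplification of the slide's diagram: `toy_env` is a hypothetical spectral envelope, and the noise path is scaled by the envelope's mean level rather than shaped by an actual gammatone filter bank as in the system.

```python
import math, random

def synthesize_frame(f0, env, fs, n, voiced, fmax=4000):
    """One frame of the hybrid scheme: a sum of harmonics of f0 with
    amplitudes sampled from env(f) when voiced, spectrally scaled white
    noise otherwise."""
    if voiced:
        out = [0.0] * n
        k = 1
        while k * f0 < fmax:                     # harmonic energy sampling
            a = env(k * f0)
            for i in range(n):
                out[i] += a * math.sin(2.0 * math.pi * k * f0 * i / fs)
            k += 1
        return out
    # frication path: white noise at the envelope's average level
    rng = random.Random(0)                       # seeded for repeatability
    freqs = range(200, fmax, 200)
    level = sum(env(f) for f in freqs) / len(freqs)
    return [level * rng.gauss(0.0, 1.0) for _ in range(n)]

def toy_env(f):
    """Hypothetical spectral envelope, falling off with frequency."""
    return 1.0 / (1.0 + f / 1000.0)

voiced = synthesize_frame(120.0, toy_env, fs=8000, n=160, voiced=True)
noise = synthesize_frame(0.0, toy_env, fs=8000, n=160, voiced=False)
```

In the full system the voicing and frication masks decide, per time-frequency region, which of the two components contributes to the summed output.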
Example from copy synthesis
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Correspondence problem

    asimo   ?   Asimo
Correspondence problem in the literature

innate representations
    [Marean1992, Kuhl1996, Minematsu2009]
    labeled data in standard Voice Conversion systems

important information from feedback of parent / tutor
    imitation [Papousek1992, Girolametto1999]
    reward, stimulation
    distinctive maternal responses [Gros-Louis2006, Goldstein2003]

    mutual imitation games guide acquisition of vowels
        [Miura2007, Kanda2009]
    tutor imitation as reward signal in RL framework
        [Howard2007, Messum2007]
We use tutor's imitative feedback

cooperative tutor (always) imitates

probabilistic mapping
    tutor's voice → motor repertoire

[diagram: motor commands drive a vocal tract model; the tutor's imitative
response passes through a cochlear model into the sensorimotor model]

innate vocal repertoire
    vowels (primitives)
        8 vectors
        10-year-old boy
        TIDIGITS corpus
        formant-annotated

morphing to combine primitives
    assumption: intermediate states will sound "inbetween"

    p_c = p_i + ((c − c_i) / (c_j − c_i)) (p_j − p_i)
    q_c = q_i + ((c − c_i) / (c_j − c_i)) (q_j − q_i)

    for a morphing coordinate c between the coordinates c_i and c_j of
    neighbouring primitives (p_i, q_i) and (p_j, q_j)
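The morphing equation amounts to linear interpolation between the two primitives whose morphing coordinates bracket c. A sketch with hypothetical formant-vector primitives:

```python
def morph(primitives, coords, c):
    """Linearly interpolate between the two primitives whose morphing
    coordinates bracket c, as in the slide's p_c equation."""
    pairs = sorted(zip(coords, primitives))
    for (ci, pi), (cj, pj) in zip(pairs, pairs[1:]):
        if ci <= c <= cj:
            w = (c - ci) / (cj - ci)
            return [a + w * (b - a) for a, b in zip(pi, pj)]
    raise ValueError("c lies outside the primitive coordinate range")

# two formant-vector primitives (hypothetical values, in Hz)
p_a = [700.0, 1100.0, 2600.0]   # /a/-like
p_i = [300.0, 2300.0, 3000.0]   # /i/-like
mid = morph([p_a, p_i], [0.0, 1.0], 0.5)  # halfway between the two
```

At c = 0.5 each component lands halfway between the primitives, which is what the "intermediate states sound inbetween" assumption relies on.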
Training phase

vocal primitives m1, m2, m3

tutor imitative response to each primitive

build model of response to primitive

feature space:
    p1(t) = F1(t)
    p2(t) = F2(t) − F1(t)
    p3(t) = F3(t) − F1(t)
    p{4,5,6}(t) = log(S(C{1,2,3}(t), t))
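The feature space above maps directly to code. In this sketch `S_at_C` stands for the spectral amplitudes S(C_k(t), t) already sampled at the three channels, and the input values are hypothetical:

```python
import math

def features(F, S_at_C):
    """Frame-wise feature vector from the slide:
    F      = (F1, F2, F3) formant frequencies in Hz,
    S_at_C = spectral amplitudes sampled at channels C1..C3."""
    F1, F2, F3 = F
    p = [F1, F2 - F1, F3 - F1]          # p1..p3: F1 plus formant spacings
    p += [math.log(s) for s in S_at_C]  # p4..p6: log spectral amplitudes
    return p

vec = features((700.0, 1100.0, 2600.0), (1.0, 0.5, 0.25))
```

Using formant differences rather than raw F2 and F3 makes the vector less sensitive to a uniform shift of the whole formant pattern between the child-like system voice and the adult tutor voice.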
Imitation phase

tutor target utterance

class posterior probabilities — k-Nearest Neighbours:
    p(Cj|x) = Kj / K
    Kj — number of points of class Cj in a neighbourhood V(x)
    with K elements

population coding

spectral output:
    α = p(Cj1|x) / (p(Cj1|x) + p(Cj2|x))
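The kNN posterior and the population-coding weight α can be sketched as below; the one-dimensional toy data set is hypothetical:

```python
def knn_posteriors(x, data, K):
    """p(Cj|x) = Kj / K over the K nearest neighbours of x.
    `data` is a list of (feature_vector, class_label) pairs."""
    ranked = sorted(data,
                    key=lambda d: sum((a - b) ** 2 for a, b in zip(d[0], x)))
    top = [label for _, label in ranked[:K]]
    return {c: top.count(c) / K for c in set(top)}

def morph_alpha(post, c1, c2):
    """Population coding: morphing weight between the two best classes."""
    return post.get(c1, 0.0) / (post.get(c1, 0.0) + post.get(c2, 0.0))

# toy training set: three frames of class "a", three of class "i"
data = [([0.0], "a")] * 3 + [([1.0], "i")] * 3
post = knn_posteriors([0.2], data, K=5)   # 3 "a" + 2 "i" neighbours
alpha = morph_alpha(post, "a", "i")
```

The resulting α then drives the primitive morphing, so the spectral output interpolates between the two most probable primitives rather than snapping to the single best class.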
Imitation example

[figure: spectrogram of the tutor's target utterance (0.1–8 kHz, ~1 s);
classification trace p(Cj|x) over time; spectrogram of the morphed
primitives; spectrogram of the resulting imitation, with pitch + energy
taken from the target]
other examples

       adult    imitation

aia

aua

papa
Subjective evaluation of imitation

experiment
    how similar is the content of the two sounds?
        1 (different) ... 5 (same)
    24 test subjects

stimuli
    3 systems x 13 phonemes, pairs <human, imitated>
        O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI
    8 pairs <human, control>
        supervised activation

systems (primitive sets):
    S3: a, i, U
    S5: a, i, U, E, O
    S8: a, i, U, E, O, e, @, o
outline

synthesize child's voice
    vocoder using gammatone filter bank

address correspondence problem
    sensorimotor model trained with tutor imitative feedback
        feature space
        perceptual space
Integration with an existing speech acquisition system (Azubi)

Goals:
    integrate with perceptual model
    make it more appropriate to use in real scenarios

[diagram: a phone model pool feeds a phone recognizer (phone LM,
phonotactic speech model → phone activities); score normalization and
model initialization produce training segments for a syllable model pool;
the syllable spotter (syllable LM, syllabic constraints) yields syllable
sequences; a word lexicon and word LM drive the word spotter, which
detects words for symbol grounding]
                                                                                                                    lexicon
                                                pool                               pool
                                                                          training
Azubi model [Brandl et al, 2008]                       phone
                                                        LM
                                                                         segments           syllable
                                                                                              LM
                                                                                                                                  word
                                                                                                                                  LM
                                                                                                        syllable
acquires speech                              phone
                                           recognizer
                                                              score
                                                           normalization
                                                                                 syllable
                                                                                 spotter
                                                                                                       sequence      word
                                                                                                                    spotter

    phones, syllables, words                                                                                       detect words
                                           phone                                      syllabic

    already used in interaction           activities     phonotactic                 constraints
                                                                                                                    symbol
                                                        speech model
    scenarios [Bolder et al, 2008, etc]                                                                            grounding




                                                                                                                            19
Integration with an existing speech
             acquisition system (Azubi)

Goals:

    integrate with perceptual model

    make it more appropriate to use in
                                               phone                             syllable
    real scenarios                             model
                                                                   model
                                                               initialization     model
                                                                                                                     word
                                                                                                                    lexicon
                                                pool                               pool
                                                                          training
Azubi model [Brandl et al, 2008]                       phone
                                                        LM
                                                                         segments           syllable
                                                                                              LM
                                                                                                                                  word
                                                                                                                                  LM
                                                                                                        syllable
acquires speech                              phone
                                           recognizer
                                                              score
                                                           normalization
                                                                                 syllable
                                                                                 spotter
                                                                                                       sequence      word
                                                                                                                    spotter

    phones, syllables, words                                                                                       detect words
                                           phone                                      syllabic

    already used in interaction           activities     phonotactic                 constraints
                                                                                                                    symbol
                                                        speech model
    scenarios [Bolder et al, 2008, etc]                                                                            grounding




                                                                                                                            19
Integration with an existing speech
             acquisition system (Azubi)

Goals:

    integrate with perceptual model

    make it more appropriate to use in
                                                         phone                             syllable
    real scenarios                                       model
                                                                             model
                                                                         initialization     model
                                                                                                                               word
                                                                                                                              lexicon
                                                          pool                               pool
                                                                                    training
Azubi model [Brandl et al, 2008]                                 phone
                                                                  LM
                                                                                   segments           syllable
                                                                                                        LM
                                                                                                                                            word
                                                                                                                                            LM
                                                                                                                  syllable
acquires speech                                        phone
                                                     recognizer
                                                                        score
                                                                     normalization
                                                                                           syllable
                                                                                           spotter
                                                                                                                 sequence      word
                                                                                                                              spotter

    phones, syllables, words                                                                                                 detect words
                                                     phone                                      syllabic

    already used in interaction                     activities     phonotactic                 constraints
                                                                                                                              symbol
                                                                  speech model
    scenarios [Bolder et al, 2008, etc]                                                                                      grounding




                                λp λp
                                 1  2     λp
                                           3   λp
                                                4     λp
                                                       5



                                                                                                                                      19
Integration with an existing speech
             acquisition system (Azubi)

Goals:
    integrate with the perceptual model
    make it more appropriate for use in real scenarios

Azubi model [Brandl et al, 2008]
    acquires speech: phones, syllables, words
    already used in interaction scenarios [Bolder et al, 2008, etc]

Correspondence model trained at the phone model level

[Diagram: the Azubi speech model: a phone recognizer (phone LM, phonotactic
constraints, phone activities) feeds score normalization, a syllable spotter
(segments, syllable LM, syllabic constraints) and a word spotter (syllable
sequences, word LM, word lexicon), leading to word detection and symbol
grounding; phone and syllable model pools are initialized and trained online.
The phone models λp1, ..., λp5 connect through the correspondence mapping to
production primitives, a synergistic encoder, activity contours, utterance
generation and the synthesizer.]

                                                                            19
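The recognizer-to-spotter cascade in the diagram can be caricatured as plain function composition. Everything below (the stage names, the CV-pair grouping, the toy lexicon) is a hypothetical stand-in for illustration, not the actual Azubi implementation:

```python
from typing import List, Dict

def phone_recognizer(frames: List[str]) -> List[str]:
    # toy: frames already carry phone labels; the real recognizer
    # scores phone HMMs under a phone LM and phonotactic constraints
    return frames

def syllable_spotter(phones: List[str]) -> List[str]:
    # toy syllabic constraint: group consecutive phones into CV pairs
    return ["".join(phones[i:i + 2]) for i in range(0, len(phones), 2)]

def word_spotter(syllables: List[str], lexicon: Dict[str, str]) -> List[str]:
    # detect known words as syllable sequences found in the lexicon
    return [lexicon[s] for s in syllables if s in lexicon]

def pipeline(frames: List[str], lexicon: Dict[str, str]) -> List[str]:
    return word_spotter(syllable_spotter(phone_recognizer(frames)), lexicon)

words = pipeline(["m", "a", "m", "a"], {"ma": "MA"})  # -> ["MA", "MA"]
```

The point is only the shape of the cascade: each stage narrows the hypothesis space of the previous one before words are grounded in symbols.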
Training phase: correspondence model

vocal primitive -> tutor imitation -> segmentation / classification ->
update of the probabilistic mapping

Segmentation of the tutor's imitation into phone models:

    [λp1, ..., λpn] = argmax over [λp] ∈ P of P([λp] | X_tutor)

Correspondence matrix update:

    Mij = P(λpi | mj, Dj)

    Cij = P(mj | λpi)
        = P(λpi | mj, Dj) P(mj) / P(λpi)
        ∝ Mij

[Diagram: motor primitives m1-m3 linked to phone models λp1, ..., λp5, with
links marked + and - as the mapping is updated.]

                                                                            20
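The matrix update above can be sketched in a few lines of numpy. The soft-count accumulation scheme and every name here are assumptions for illustration; the slide only fixes the Bayes relation between M and C:

```python
import numpy as np

def update_correspondence(M, counts, primitive_idx, phone_posteriors):
    """Accumulate evidence that motor primitive m_j elicited the observed
    phone-model posteriors, then renormalize M_ij = P(lambda_i | m_j, D_j)."""
    counts[:, primitive_idx] += phone_posteriors  # soft counts per phone model
    M[:, primitive_idx] = counts[:, primitive_idx] / counts[:, primitive_idx].sum()
    return M

def correspondence(M, prior_m=None):
    """C_ij = P(m_j | lambda_i) via Bayes; with a uniform prior over
    primitives each row of C is proportional to the row of M."""
    n_phones, n_prims = M.shape
    if prior_m is None:
        prior_m = np.full(n_prims, 1.0 / n_prims)
    joint = M * prior_m                            # P(lambda_i | m_j) P(m_j)
    return joint / joint.sum(axis=1, keepdims=True)  # normalize by P(lambda_i)

# 5 phone models and 3 motor primitives, as in the diagram
M = np.full((5, 3), 0.2)
counts = np.ones((5, 3))
# primitive m2 was produced; the tutor's imitation mostly matched lambda_3
M = update_correspondence(M, counts, 1, np.array([0.0, 0.1, 0.8, 0.1, 0.0]))
C = correspondence(M)
```

After the update, row 3 of C assigns its largest probability to primitive m2, which is exactly the "+" link in the diagram.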
Imitation phase

target tutor utterance -> segmentation:

    [λp1, ..., λpn] = argmax over [λp] ∈ P of P([λp] | X_tutor)

vocal primitives' posterior probabilities -> motor primitives m1-m3 ->
population coding -> gaussian activation contours -> spectral output

                                                                            21
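A minimal sketch of this decoding path, assuming a correspondence matrix C from the training phase and Gaussian bumps for the population code; the segment times, widths and all names are hypothetical:

```python
import numpy as np

def primitive_activations(C, phone_posteriors):
    """Map per-segment phone-model posteriors into motor-primitive
    activations through C_ij = P(m_j | lambda_i)."""
    return phone_posteriors @ C            # shape: (n_segments, n_primitives)

def activation_contours(activations, seg_centers, t, width=0.05):
    """Population coding: one Gaussian bump per decoded segment, centered
    on the segment and weighted by each primitive's activation."""
    contours = np.zeros((len(t), activations.shape[1]))
    for center, act in zip(seg_centers, activations):
        contours += act * np.exp(-0.5 * ((t - center)[:, None] / width) ** 2)
    return contours

# toy numbers: 5 phone models, 3 primitives, 2 decoded segments
C = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2]])
post = np.array([[0.9, 0.05, 0.05, 0.0, 0.0],   # segment 1: mostly lambda_1
                 [0.0, 0.05, 0.9, 0.05, 0.0]])  # segment 2: mostly lambda_3
acts = primitive_activations(C, post)
t = np.linspace(0.0, 1.0, 200)
contours = activation_contours(acts, seg_centers=[0.25, 0.75], t=t)
```

The contours would then drive the synthesizer to produce the spectral output; that last step depends on the vocoder and is omitted here.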
Experimental results

Correspondence matrix

phone models
    trained on "child-directed"-like speech, ca. 1 min

interaction
    15 imitations of each vocal primitive

[Figure: learned correspondence matrix; rows: vocal primitives, columns:
phone models]

                                                                            22
Imitation example: "mama"

[Figure: input spectrum of the tutor utterance, the corresponding population
coding, and the system's spectral output]

                                                                            23
Summary

A framework in which speech imitation becomes possible:

    speech synthesis technique for the child's voice
      channel vocoder combined with a gammatone filterbank
      evaluation

    addressing the correspondence problem
      probabilistic mapping between the tutor's voice and the system's motor space
      tutor feedback interpreted in
           feature space
           a perceptual space acquired without supervision

    integration into an online speech acquisition framework (Azubi)
        paves the way for use on the robot

                                                                            24
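The "channel vocoder combined with a gammatone filterbank" point can be illustrated with a crude analysis/resynthesis loop. This sketch substitutes ERB-spaced Butterworth bands for true gammatone filters and noise carriers for the original excitation; every parameter is an assumption, not the thesis implementation:

```python
import numpy as np
from scipy.signal import butter, lfilter

def erb_centers(n, fmin=100.0, fmax=6000.0):
    """Center frequencies spaced evenly on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def channel_vocoder(x, fs, n_channels=16):
    """Analyze per-band envelopes, then resynthesize by modulating
    band-limited noise carriers."""
    out = np.zeros_like(x)
    for fc in erb_centers(n_channels):
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25   # ~half-octave band
        b, a = butter(2, [lo, hi], btype='band', fs=fs)
        band = lfilter(b, a, x)
        # envelope extraction: rectify then low-pass at 50 Hz
        be, ae = butter(2, 50.0, btype='low', fs=fs)
        env = lfilter(be, ae, np.abs(band))
        carrier = lfilter(b, a, np.random.randn(len(x)))  # band-limited noise
        out += env * carrier
    return out

np.random.seed(0)
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # toy input signal
y = channel_vocoder(x, fs)
```

Swapping the Butterworth bands for a gammatone filterbank (and the noise carriers for the system's own excitation) recovers the shape of the synthesis technique summarized above.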
Publications

"Learning from a tutor: embodied speech acquisition and imitation learning"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China

"Speech imitation with a child’s voice: addressing the correspondence problem"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. SPECOM’2009, St Petersburg, Russia

"Linking Perception and Production: System Learns a Correspondence Between its Own
Voice and the Tutor's"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît,
GIPSA-lab, Grenoble, Université Stendhal, France

"Speech structure acquisition for interactive systems"
H. Brandl, M. Vaz, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît,
GIPSA-lab, Grenoble, Université Stendhal, France

"Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction
via Feature-based Resynthesis"
M. Heckmann, C. Glaeser, M. Vaz, T. Rodemann, F. Joublin, C. Goerick
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice,
France


                                                                                          25
Thank you




Dr. Estela Bicho
Dr. Frank Joublin
Dr. Wolfram Erlhagen

Colleagues @ Honda Research Institute
Colleagues @ DEI

Family
Friends




                                        26

Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

2010.01.25 - Developmentally inspired computational framework for embodied speech imitation
 (PhD presentation)

  • 1. University of Minho Engineering School Developmentally inspired computational framework for embodied speech imitation Miguel Vaz mvaz@dei.uminho.pt Dep. Industrial Electronics Honda Research Institute Europe University of Minho Offenbach am Main Portugal Germany 25th January, Guimarães
  • 4. Long-term goal: verbal interaction with ASIMO, covering speech perception, speech production, and meaning / language.
  • 11. Constraints and specificities of the ASIMO platform: no pre-defined vocabulary, so speech must be acquired in interaction (imitation, online learning, unlabeled data); language assumptions are minimized (no corpus exists for the system's voice). Consequences: the system has a child's voice, so a child's voice must be synthesized, and the correspondence problem must be addressed.
  • 15. Outline: synthesize a child's voice (vocoder using a gammatone filter bank); address the correspondence problem (sensorimotor model trained with the tutor's imitative feedback, in feature space and in perceptual space).
  • 16. Speech: source-filter model of speech production. The glottal airflow (source spectrum) passes through the vocal tract (filter function, with its formant frequencies) to give the output from the lips (output spectrum).
  • 17. Spectral feature extraction with a gammatone filter bank: gammatone filterbank, envelope extraction, and harmonic structure elimination, applied to the speech signal together with the pitch contour; example shown for the word "zurück".
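The gammatone analysis stage on this slide can be sketched as follows. This is a minimal illustration of a textbook 4th-order gammatone filter bank with ERB bandwidths (Glasberg and Moore), not the implementation used in the thesis; all function names are my own.

```python
import numpy as np

def gammatone_ir(fc, fs=16000, order=4, duration=0.05):
    """Impulse response of a gammatone filter centred at fc (Hz).

    Bandwidth follows the common ERB rule:
    ERB(fc) = 24.7 * (4.37 * fc / 1000 + 1).
    """
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb  # conventional bandwidth scaling for a 4th-order filter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def filterbank(signal, centre_freqs, fs=16000):
    """Analyse a signal with a bank of gammatone filters.

    Returns one band-passed channel per centre frequency; envelope
    extraction (e.g. rectification and smoothing) would follow."""
    return np.array([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])
```

A channel responds strongly only near its centre frequency, which is what makes the later formant-channel features meaningful.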
  • 21. VOCODER-like synthesis algorithm with a gammatone filter bank. Hybrid architecture: a channel vocoder for frication (white noise shaped by the gammatone filter bank through a frication mask) and a harmonic model for voicing (pitch and energy sampling of the spectral vectors through a voicing mask). Good naturalness for high- and low-pitch voices (good results in comparison to standard acoustic synthesis techniques; tested against MCEP-based synthesis) and good intelligibility (tested with the Modified Rhyme Test for German).
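The hybrid voiced/unvoiced idea can be illustrated with a heavily reduced sketch: voiced frames as a sum of harmonics sampled from a spectral envelope, unvoiced frames as spectrally shaped noise. There is no gammatone resynthesis and no phase continuity across frames here; `envelope` is a hypothetical per-frame function mapping frequency (Hz) to linear amplitude.

```python
import numpy as np

def synthesize(pitch, envelope, voicing, fs=16000, frame_len=160):
    """Toy hybrid synthesis: per frame, voiced segments are rendered as a
    sum of harmonics of the pitch whose amplitudes are sampled from the
    spectral envelope; unvoiced segments as envelope-shaped white noise."""
    rng = np.random.default_rng(0)
    out = []
    for f0, env, v in zip(pitch, envelope, voicing):
        t = np.arange(frame_len) / fs
        if v and f0 > 0:
            harmonics = np.arange(1, int((fs / 2) // f0) + 1) * f0
            frame = sum(env(h) * np.sin(2 * np.pi * h * t) for h in harmonics)
        else:
            noise = rng.standard_normal(frame_len)
            # shape the noise by the envelope in the frequency domain
            spec = np.fft.rfft(noise)
            freqs = np.fft.rfftfreq(frame_len, 1 / fs)
            frame = np.fft.irfft(spec * env(freqs), frame_len)
        out.append(frame)
    return np.concatenate(out)
```

The actual system applies voicing and frication masks to the same gammatone spectral vectors; this sketch only shows why the two excitation paths are separated.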
  • 23. Example from copy synthesis
  • 24. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 9
  • 25. Correspondence problem: the tutor's "asimo" vs. the system's own "Asimo".
  • 29. Correspondence problem in the literature. Innate representations: [Marean1992, Kuhl1996, Minematsu2009]; labeled data in standard Voice Conversion systems. Important information from the feedback of the parent / tutor: imitation [Papousek1992, Girolametto1999]; reward and stimulation; distinctive maternal responses [Gros-Louis2006, Goldstein2003]. Mutual imitation games guide the acquisition of vowels [Miura2007, Kanda2009]; tutor imitation as a reward signal in an RL framework [Howard2007, Messum2007].
  • 34. We use the tutor's imitative feedback: a cooperative tutor (always) imitates. A probabilistic mapping links the tutor's voice to the motor repertoire (vocal tract model and motor commands on the system's side, cochlear model and sensorimotor model receiving the tutor's imitative response). Innate vocal repertoire: vowels as primitives, 8 vectors, from a 10-year-old boy in the formant-annotated TIDIGITS corpus. Morphing combines primitives, under the assumption that intermediate states will sound "in between":

      p_c = p_i + (c - c_i) / (c_j - c_i) * (p_j - p_i)
      q_c = q_i + (c - c_i) / (c_j - c_i) * (q_j - q_i)
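The morphing rule for combining primitives is plain linear interpolation and can be sketched directly (an illustration, not the thesis code):

```python
import numpy as np

def morph(p_i, p_j, c, c_i, c_j):
    """Linear interpolation between two vocal primitives:
    p_c = p_i + (c - c_i) / (c_j - c_i) * (p_j - p_i)."""
    alpha = (c - c_i) / (c_j - c_i)
    return np.asarray(p_i) + alpha * (np.asarray(p_j) - np.asarray(p_i))
```

For example, morphing halfway between two primitive formant vectors yields their midpoint, which under the slide's assumption should sound "in between" the two vowels.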
  • 40. Training phase: the system produces a vocal primitive (motor commands m1, m2, m3), the tutor gives an imitative response, and a model of the response to each primitive is built in feature space:

      p1(t) = F1(t)
      p2(t) = F2(t) - F1(t)
      p3(t) = F3(t) - F1(t)
      p{4,5,6}(t) = log( S( C{1,2,3}(t), t) )
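The per-frame feature vector above can be sketched as follows; `S` here is an illustrative stand-in for the spectral energy at the gammatone channels C1..C3 tracking the formants, and the function name is my own.

```python
import numpy as np

def features(F, S):
    """Per-frame feature vector from the slide.

    F = (F1, F2, F3): formant frequencies in Hz.
    S: callable giving spectral energy at a frequency (stand-in for the
       gammatone channel energies C1..C3 of the thesis)."""
    F1, F2, F3 = F
    p = np.empty(6)
    p[0] = F1           # absolute first formant
    p[1] = F2 - F1      # formant spacings, which normalise some
    p[2] = F3 - F1      # speaker-dependent variation
    p[3:6] = np.log([S(F1), S(F2), S(F3)])  # log energies at the formants
    return p
```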
  • 45. Imitation phase: given a tutor target utterance, k-Nearest Neighbours yields class posterior probabilities p(Cj|x) = Kj / K, where Kj is the number of points of class Cj in a neighbourhood V(x) with K elements. Population coding then produces the spectral output, with morphing coefficient alpha = p(Cj1|x) / ( p(Cj1|x) + p(Cj2|x) ).
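The kNN posterior and the morphing coefficient on this slide can be sketched as (illustrative, with Euclidean distance assumed):

```python
import numpy as np

def knn_posteriors(x, data, labels, K, n_classes):
    """p(C_j | x) = K_j / K: the fraction of the K nearest training
    points carrying class label j."""
    d = np.linalg.norm(data - x, axis=1)
    nearest = labels[np.argsort(d)[:K]]
    return np.bincount(nearest, minlength=n_classes) / K

def morph_coefficient(post, j1, j2):
    """alpha weighting the two most active primitives:
    alpha = p(Cj1|x) / (p(Cj1|x) + p(Cj2|x))."""
    return post[j1] / (post[j1] + post[j2])
```

Renormalising over the two strongest classes gives the interpolation weight used to morph between the corresponding primitives.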
  • 49. Imitation example: spectrograms of the target utterance, the classification posteriors p(Cj|x) over time, the morphed primitives, and the resulting imitation (combined with the extracted pitch and energy).
  • 56. Other examples: adult utterance vs. imitation, for "aia", "aua", "papa".
  • 57. Subjective evaluation of imitation. Experiment: "how similar is the content of the two sounds?", rated from 1 (different) to 5 (same) by 24 test subjects. Stimuli: 3 systems x 13 phonemes as pairs <human, imitated> (O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI; S3: a, i, U; S5: a, i, U, E, O; S8: a, i, U, E, O, e, @, o), plus 8 pairs <human, control> with supervised activation.
  • 58. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 18
  • 64. Integration with an existing speech acquisition system (Azubi). Goals: integrate with the perceptual model; make it more appropriate for use in real scenarios. The Azubi model [Brandl et al, 2008] acquires speech (phones, syllables, words, via phone / syllable / word model pools, language models, recognizer and spotters, score normalization, phonotactic constraints and symbol grounding) and has already been used in interaction scenarios [Bolder et al, 2008, etc]. The correspondence model is trained at the phone model level: phone models λp1..λp5 are mapped to production primitives via primitive activity, synergistic activity contour mapping, an encoder, and the synthesizer for utterance generation.
  • 69. Training phase, correspondence model: the tutor's imitation of a vocal primitive is segmented and classified into phone models,

      [λp_1, ..., λp_n] = argmax_{[λp] in P} P([λp] | X_tutor)

  and the probabilistic mapping between phone models and primitives m1, m2, m3 is updated:

      M_ij = P(λp_i | m_j, D_j)
      C_ij = P(m_j | λp_i) = P(λp_i | m_j, D_j) P(m_j) / P(λp_i)
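The update of the correspondence mapping can be sketched with a simple co-occurrence counter that is normalised into conditional probabilities; this is a counting stand-in for the incremental Bayesian update on the slide, and the class and method names are my own.

```python
import numpy as np

class CorrespondenceModel:
    """Counts how often the tutor's imitation of primitive m_j is
    recognised as phone model lambda_i, then normalises the counts into
    the conditional probabilities used at imitation time."""
    def __init__(self, n_phones, n_primitives):
        self.counts = np.zeros((n_phones, n_primitives))

    def update(self, phone, primitive):
        # one (phone, primitive) co-occurrence observed during training
        self.counts[phone, primitive] += 1

    def p_primitive_given_phone(self):
        # C_ij = P(m_j | lambda_i): row-normalised counts
        totals = self.counts.sum(axis=1, keepdims=True)
        return np.divide(self.counts, totals,
                         out=np.zeros_like(self.counts), where=totals > 0)
```

Rows of the resulting matrix are the primitive posteriors looked up when a segmented phone sequence has to be turned into motor activity.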
  • 74. Imitation phase: the tutor's target utterance is segmented into phone models, [λp_1, ..., λp_n] = argmax_{[λp] in P} P([λp] | X_tutor); the vocal primitives' posterior probabilities are obtained through the correspondence mapping; and population coding with gaussian activation contours produces the spectral output.
  • 78. Experimental results: correspondence matrix between phone models and vocal primitives, learned in interaction from about 1 min of "child-directed"-like speech, with 15 imitations of each vocal primitive.
  • 82. Imitation example, "mama": input spectrum, population coding, spectral output.
  • 83. Summary. A framework that makes speech imitation possible: a speech synthesis technique for a child's voice (channel vocoder meets gammatone filterbank, with evaluation); the correspondence problem addressed via a probabilistic mapping between the tutor's voice and the system's motor space, with tutor feedback interpreted in feature space and in an unsupervisedly acquired perceptual space; and integration in an online speech acquisition framework (Azubi), paving the way for usage on the robot.
  • 84. Publications "Learning from a tutor: embodied speech acquisition and imitation learning" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China "Speech imitation with a child’s voice: addressing the correspondence problem" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. SPECOM’2009, St Petersburg, Russia "Linking Perception and Production: System Learns a Correspondence Between its Own Voice and the Tutor's" M.Vaz, H.Brandl, F.Joublin, C.Goerick, Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Speech structure acquisition for interactive systems" H.Brandl, M.Vaz, F.Joublin, C.Goerick Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction via Feature-based Resynthesis" M.Heckmann, C.Glaeser, M.Vaz, T.Rodemann, F.Joublin, C. Goerick Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice, France 25
  • 85. Thank you Dr. Estela Bicho Dr. Frank Joublin Dr. Wolfram Erlhagen Colleagues @ Honda Research Institute Colleagues @ DEI Family Friends 26

Editor's Notes

  1. There I had presented and evaluated a framework for synthesizing speech with a child's voice. The ultimate goal was to use the framework to learn speech through interaction with a tutor. In the end, I'd shown you the first steps
  2. Language assumptions: syllable structure, number of vowels in the vowel system, prosody. Traditional HMM synthesis approaches are not suitable.
  24. Explain the difficulties of working with a child's voice and motivate the need for the new technique. Articulatory synthesis: limited in voices and phoneme sets. The VOCODER has been shown to work well with good spectral representations. Speech is the physical result of air being expelled from the lungs and passing through the vocal tract. Source-filter model of speech production: a source signal (larynx, vocal tract constriction) is modulated by a vocal tract filter function; there are different ways of representing and deriving the vocal tract filter function.
  25. Focus on the architecture and properties; we tested for intelligibility and naturalness.
  30. Properties differ, but the meaning is the same.
  31. 1. Even if it were true, there is no known speech representation that would do the job. 2. Also mention the work of the group in Edinburgh, who morph spectra between an adult and a child speaker by maximizing the likelihood of a given sequence. Gros-Louis 2006: interactive, differentiated and proximate responses increase production of more advanced utterances. Goldstein 2003.
  34. Also mention the work of the group in Edinburgh, who morph spectra between an adult and a child speaker by maximizing the likelihood of a given sequence. M. Vaz, H. Brandl, F. Joublin, and C. Goerick, "Speech imitation with a child's voice: addressing the correspondence problem," accepted for 13th Int. Conf. on Speech and Computer (SPECOM), 2009.
  44. \begin{split} p_1(t) &= F_1(t) \\ p_2(t) &= F_2(t) - F_1(t) \\ p_3(t) &= F_3(t) - F_1(t) \\ p_{\{4,5,6\}}(t) &= \log( S( C_{\{1,2,3\}}(t), t) ) \end{split}
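The formant-based feature vector in the note above can be sketched in code. This is an illustrative stand-in, assuming `F1..F3` are formant frequencies and `spectrum_at_peaks` holds the spectral amplitudes `S(C_k(t), t)` sampled at the three formant peaks; the thesis system's actual feature extractor is not reproduced here.

```python
import numpy as np

def perceptual_features(F1, F2, F3, spectrum_at_peaks):
    """Build the 6-dimensional feature vector from the notes:
    p1 = F1, p2 = F2 - F1, p3 = F3 - F1, and p4..p6 the log
    spectral amplitudes at the three formant peaks.
    Names and signature are illustrative assumptions."""
    p = np.empty(6)
    p[0] = F1
    p[1] = F2 - F1
    p[2] = F3 - F1
    p[3:6] = np.log(spectrum_at_peaks)  # log amplitudes at the 3 peaks
    return p
```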
58. Why kNN? It makes no assumptions about the distribution of the elements of each class, which matters because the data are quite irregular. For a set of labels or vocal classes $C_j$ and an input feature vector $x$, we consider a neighbourhood $V$ of $x$ that contains exactly $K$ points. The posterior probability of class membership depends on the number of training points of class $C_j$ present in $V$, denoted $K_j$: \begin{equation} p( C_j \mid x ) = \frac{K_j}{K}, \qquad \alpha = \frac{p( C_{j_1} \mid x )}{ p( C_{j_1} \mid x ) + p( C_{j_2} \mid x )} \end{equation}
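The kNN posterior and the pairwise ratio from the note above can be sketched as follows. This is a minimal illustration under a plain Euclidean metric; the thesis system's feature space and distance are assumptions here, not taken from the source.

```python
import numpy as np

def knn_posteriors(train_X, train_y, x, K):
    """Estimate p(C_j | x) = K_j / K from the K nearest neighbours
    of x. Non-parametric, so it makes no assumption about how each
    vocal class is distributed -- the motivation given in the notes."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:K]]  # labels of the K closest points
    return {c: np.mean(nearest == c) for c in np.unique(train_y)}

def pairwise_alpha(post, c1, c2):
    """alpha = p(C_{j1}|x) / (p(C_{j1}|x) + p(C_{j2}|x))."""
    return post[c1] / (post[c1] + post[c2])
```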
69. S_c is always better: the system benefits from an extended vocal repertoire. Trends: canonical vowels. Generalization isn't working 100%: morphing might be introducing some distortions.
70. Language assumptions: syllable structure; number of vowels in the vowel system; prosody. Traditional HMM synthesis approaches are not suitable.
71. Add to scheme that the system gets the phone models after they have been \begin{split} C_{ij} &= P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j, D_j)\, P(m_j) }{ P(\lambda_i^p) } \\ M_{ij} &= P( \lambda_i^p \mid m_j, D_j ) \\ [\lambda^p_{1}, \dots, \lambda^p_{n}] &= \operatorname*{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} ) \end{split}
84. - Correspondence model has the form of a matrix, because the perceptual space is discrete. - From a given input, the Azubi model \begin{split} C_{ij} &= P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j, D_j)\, P(m_j) }{ P(\lambda_i^p) } \\ M_{ij} &= P( \lambda_i^p \mid m_j, D_j ) \\ [\lambda^p_{1}, \dots, \lambda^p_{n}] &= \operatorname*{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} ) \end{split} Add to scheme that the system gets the phone models after they have been
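Because the perceptual space is discrete, the correspondence model can be stored and queried as a plain matrix. The sketch below illustrates that idea with made-up numbers and a greedy per-phone lookup; it is a stand-in for, not the thesis implementation of, the sequence-level argmax over $P([\lambda^p] \mid X_{tutor})$.

```python
import numpy as np

# Correspondence model as a matrix: rows index phone models lambda_i^p,
# columns index motor primitives m_j, entries approximate P(m_j | lambda_i^p).
# The matrix form works because the perceptual space is discrete.
# These numbers are invented for illustration only.
C = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

def motor_sequence(phone_indices):
    """For each recognized phone-model index i, select the motor
    primitive j maximizing C[i, j] -- a greedy stand-in for the
    sequence-level maximization in the notes."""
    return [int(np.argmax(C[i])) for i in phone_indices]
```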
108. Over-representation: more models than vowels. 1. There are some phonemes for which there is only sparse activity. 2. Some phone models are never active. 3. Some are active all of the time. The whole subset is not covered (primitives are only vowels); different primitives show a stronger dispersion than others. Possible causes: a non-uniform imitative response of the tutor to the vocal primitive; limitations of synthesizing a phoneme with only one spectral vector; or the absence of any phone model fully representing the imitative response; i.e., issues of over- or under-representation.
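The "never active" / "active all of the time" diagnosis above amounts to counting model activations across imitation trials. A minimal sketch, with illustrative thresholds that are not taken from the thesis:

```python
from collections import Counter

def representation_report(activations, n_models):
    """Given a list of phone-model indices activated across trials,
    flag models that are never active (possible under-representation)
    and models active in nearly every trial (possible
    over-representation). The 0.9 cutoff is an assumed example value."""
    counts = Counter(activations)
    never_active = [m for m in range(n_models) if counts[m] == 0]
    always_active = [m for m in range(n_models)
                     if counts[m] >= 0.9 * len(activations)]
    return never_active, always_active
```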
112. Which conclusions here are OK? Retake the conclusions.