This thesis is concerned with the autonomous acquisition of speech production skills by a robotic system.
The acquisition should occur in interaction with a human tutor, making few or no assumptions about the vocabulary and language of interaction.
A particular target embodiment of the acquisition framework presented in this thesis is the humanoid robot ASIMO.
Because of its size and its limited knowledge of the world, a child's voice is probably the most appropriate type of voice for such an interactive system.
This means, however, that the acoustic properties of the tutor's voice are very different from the system's.
Consequently, the system has to address the correspondence problem in speech.
For this, inspired by findings on the development of speech skills in infants, we propose an interaction scheme involving a cooperative tutor who provides imitative feedback for simple utterances of the system.
This allows the robot to learn a probabilistic correspondence model, which lets the system associate configurations of its own vocal tract with the acoustic properties of the tutor's voice.
Using this correspondence model, the system can project a target tutor utterance into its motor space, making an imitation possible.
We also integrated this interaction scheme into an embodied speech structure acquisition framework already used to teach and interact with the robot.
With this integration, we measure the tutor's response, and the utterances to be imitated, in a previously trained perceptual space.
This is not only biologically more plausible, but also paves the way for an embodiment in the humanoid robot.
We also developed a new speech synthesis algorithm, which operates in the acoustic domain and provides the system with a child-like voice.
Its architecture is a hybrid of a harmonic model and a channel vocoder, and it uses a gammatone filter bank to produce the spectral representations.
For the control of the speech synthesizer in the context of imitation learning, a synergistic coding scheme based on the concept of motor primitives was investigated.
1. University of Minho
Engineering School
Developmentally inspired computational
framework for embodied speech imitation
Miguel Vaz
mvaz@dei.uminho.pt
Dep. Industrial Electronics, University of Minho, Portugal
Honda Research Institute Europe, Offenbach am Main, Germany
25th January, Guimarães
7. Constraints and specificities of the ASIMO platform
acquire speech in interaction
imitation
online learning
no pre-defined vocabulary
unlabeled data
minimize language assumptions
no corpus for system's voice
synthesize child's voice
system has child's voice
address correspondence problem
15. outline
synthesize child's voice
vocoder using gammatone filter bank
address correspondence problem
sensorimotor model trained with tutor imitative feedback
feature space
perceptual space
16. Speech: source-filter model of speech production
[Figure: the glottal airflow (source spectrum) is shaped by the vocal tract filter function, whose formant frequencies determine the output spectrum at the lips.]
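To make the source-filter idea concrete, here is a minimal pure-Python sketch (illustrative only, not from the thesis): an impulse train stands in for the glottal pulses, and two second-order resonators play the role of vocal tract formants.

```python
import math

fs = 16000          # sampling rate (Hz)
f0 = 125            # glottal pulse rate (Hz)
n = fs // 4         # a quarter second of samples

# Source: impulse train standing in for the glottal airflow pulses.
source = [0.0] * n
for i in range(0, n, fs // f0):
    source[i] = 1.0

def resonator(x, f_c, bw, fs):
    """Filter x with a 2nd-order all-pole resonator (one formant)."""
    r = math.exp(-math.pi * bw / fs)
    a1 = -2 * r * math.cos(2 * math.pi * f_c / fs)
    a2 = r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        v = s - a1 * y1 - a2 * y2
        y.append(v)
        y1, y2 = v, y1
    return y

# Cascade two formants roughly matching an /a/-like vowel.
out = source
for f_c, bw in [(700, 90), (1200, 110)]:
    out = resonator(out, f_c, bw, fs)
```

The cascade of resonators is the "filter function" of the slide; changing the formant frequencies changes the vowel without touching the source.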
17. Spectral feature extraction with a gammatone filter bank
[Figure: scheme and example. The speech signal and its pitch pass through a chain of gammatone filterbank, envelope extraction, and harmonic structure elimination, yielding spectrogram-like representations (roughly 0.1-8 kHz) over time.]
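As an illustration of the idea on this slide (a toy sketch, not the thesis implementation), one gammatone channel can be applied to a signal, and its envelope extracted by half-wave rectification and smoothing; a channel at the tone frequency responds far more strongly than a distant one.

```python
import math

fs = 8000
n = int(fs * 0.2)
# Test signal: a 500 Hz tone.
x = [math.sin(2 * math.pi * 500 * i / fs) for i in range(n)]

def gammatone_ir(f_c, fs, order=4, length=0.025):
    """Impulse response of one gammatone channel centred at f_c."""
    b = 1.019 * (24.7 + 0.108 * f_c)        # ERB-based bandwidth
    ir = []
    for i in range(int(fs * length)):
        t = i / fs
        ir.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                  * math.cos(2 * math.pi * f_c * t))
    return ir

def envelope(x, f_c, fs):
    """Filter, half-wave rectify, and smooth: one spectral channel."""
    ir = gammatone_ir(f_c, fs)
    y = [sum(ir[k] * x[i - k] for k in range(min(i + 1, len(ir))))
         for i in range(len(x))]
    rect = [max(v, 0.0) for v in y]
    w = fs // 100                            # ~10 ms moving average
    return [sum(rect[max(0, i - w):i + 1]) / w for i in range(len(rect))]

on_band = envelope(x, 500, fs)    # channel at the tone frequency
off_band = envelope(x, 2000, fs)  # distant channel responds weakly
```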
19. VOCODER-like synthesis algorithm with a gammatone filter bank
hybrid architecture
channel vocoder for frication
harmonic model for voicing
good naturalness for high- and low-pitch voices
good results in comparison to standard acoustic synthesis techniques
tested against MCEP-based synthesis
good intelligibility
tested with Modified Rhyme Test for German
[Diagram: pitch, energy, and sampled spectral vectors drive harmonic voicing synthesis through a voicing mask; white noise passed through the gammatone filter bank and a frication mask produces the frication component; the two streams are combined at synthesis.]
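A much-simplified sketch of the hybrid idea (illustrative only; the envelope `env` and the single noise channel are toy stand-ins, not the thesis synthesizer): a harmonic model renders the voiced part by sampling a spectral envelope at multiples of the pitch, while scaled white noise stands in for one frication channel.

```python
import math
import random

fs = 8000
f0 = 200                    # child-like pitch (Hz)
n = int(fs * 0.1)

# Toy spectral envelope: amplitude as a function of frequency (Hz),
# one low-frequency (voiced) peak and one high-frequency (frication) bump.
def env(f):
    return (math.exp(-((f - 700) / 400.0) ** 2)
            + 0.3 * math.exp(-((f - 3000) / 800.0) ** 2))

# Voiced part: harmonic model, envelope sampled at multiples of f0.
voiced = [0.0] * n
k = 1
while k * f0 < fs / 2:
    a = env(k * f0)
    for i in range(n):
        voiced[i] += a * math.sin(2 * math.pi * k * f0 * i / fs)
    k += 1

# Frication part: channel-vocoder style, white noise scaled by the
# envelope in one high-frequency band (one channel shown).
random.seed(0)
noise = [random.uniform(-1, 1) * env(3000) for _ in range(n)]

mix = [v + u for v, u in zip(voiced, noise)]
```

Because the harmonic amplitudes are resampled from the envelope at every pitch, the same spectral vectors can be rendered at high or low f0, which is what makes the approach attractive for a child's voice.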
24. outline
synthesize child's voice
vocoder using gammatone filter bank
address correspondence problem
sensorimotor model trained with tutor imitative feedback
feature space
perceptual space
27. Correspondence problem in the literature
innate representations [Marean1992, Kuhl1996, Minematsu2009]
labeled data in standard Voice Conversion systems
important information from feedback of parent / tutor
imitation [Papousek1992, Girolametto1999]
reward, stimulation
distinctive maternal responses [Gros-Louis2006, Goldstein2003]
mutual imitation games guide acquisition of vowels [Miura2007, Kanda2009]
tutor imitation as reward signal in RL framework [Howard2007, Messum2007]
31. We use tutor's imitative feedback
cooperative tutor (always) imitates
probabilistic mapping between tutor's voice and motor repertoire
[Diagram: the tutor's imitative response is analysed by a cochlear model; a sensorimotor model maps it to motor commands for the vocal tract model.]
innate vocal repertoire
vowels (primitives): 8 vectors
10-year-old boy, TIDIGITS corpus, formant-annotated
morphing to combine primitives
[Diagram: primitive motor vectors p_0 … p_4 and corresponding targets q_0 … q_4 arranged along a control dimension c (c_1, c_2, c_3), with interpolated points p_c and q_c between neighbours, blended by S(α, c).]
p_c = p_i + (c − c_i)/(c_j − c_i) · (p_j − p_i)
assumption: q_c = q_i + (c − c_i)/(c_j − c_i) · (q_j − q_i)
intermediate states will sound "in between"
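The morphing assumption above amounts to linear interpolation between primitive vectors; a small illustrative sketch (the formant values are toy numbers, not the thesis primitives):

```python
def morph(p_i, p_j, c_i, c_j, c):
    """Linearly interpolate between two primitive vectors p_i and p_j,
    anchored at control values c_i < c < c_j."""
    w = (c - c_i) / (c_j - c_i)
    return [a + w * (b - a) for a, b in zip(p_i, p_j)]

# Two vowel primitives as toy formant vectors [F1, F2] in Hz.
p_a = [700.0, 1200.0]    # /a/-like
p_iy = [300.0, 2300.0]   # /i/-like
mid = morph(p_a, p_iy, 0.0, 1.0, 0.5)
# mid == [500.0, 1750.0]
```

The same interpolation weight applied in motor space (p) and in acoustic space (q) is exactly the "sound in between" assumption of the slide.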
43. Imitation phase
tutor target utterance → k-Nearest Neighbours posterior probabilities → population coding → spectral output
k-Nearest Neighbours posterior probabilities:
p(C_j | x) = K_j / K
(K_j: number of points of class C_j in a neighbourhood V(x) with K elements)
population coding:
α = p(C_j1 | x) / (p(C_j1 | x) + p(C_j2 | x))
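The kNN posterior and the population-coding blend α on this slide can be sketched directly (toy 2-D feature points and hypothetical class labels 'a' and 'i'):

```python
from collections import Counter

def knn_posteriors(train, x, k):
    """p(C_j | x) = K_j / K: the fraction of the K nearest training
    points (by squared distance) that carry label C_j."""
    ranked = sorted(train, key=lambda p: sum((a - b) ** 2
                                             for a, b in zip(p[0], x)))
    counts = Counter(label for _, label in ranked[:k])
    return {c: counts[c] / k for c in counts}

# Toy 2-D feature points for two vocal classes.
train = [((0.0, 0.0), 'a'), ((0.1, 0.0), 'a'), ((0.0, 0.1), 'a'),
         ((1.0, 1.0), 'i'), ((0.9, 1.0), 'i')]
post = knn_posteriors(train, (0.2, 0.2), k=4)

# Blend weight between the two best classes (population coding):
alpha = post['a'] / (post['a'] + post.get('i', 0.0))
```

Because kNN makes no assumption about the within-class distributions, it tolerates the irregular data these classes produce; α then weights the two competing primitives in the spectral output.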
57. Subjective evaluation of imitation experiment
how similar is the content of the two sounds? 1 (different) … 5 (same)
24 test subjects
stimuli: 3 systems × 13 phonemes, pairs <human, imitated>
phonemes: O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI
S3: a, i, U
S5: a, i, U, E, O
S8 (supervised activation): a, i, U, E, O, e, @, o
8 pairs <human, control>
58. outline
synthesize child's voice
vocoder using gammatone filter bank
address correspondence problem
sensorimotor model trained with tutor imitative feedback
feature space
perceptual space
59. Integration with an existing speech acquisition system (Azubi)
Goals:
integrate with perceptual model
make it more appropriate to use in real scenarios
Azubi model [Brandl et al, 2008]
acquires speech: phones, syllables, words
already used in interaction scenarios [Bolder et al, 2008, etc]
Correspondence model trained at the phone model level
[Diagram: phone, syllable, and word model pools (initialization, training, lexicon); phone, syllable, and word language models (LM); a phone recognizer with score normalization feeds a syllable spotter and a word spotter; syllabic and phonotactic constraints support word detection and symbol grounding; phone model activities λ^p_1 … λ^p_5 drive utterance generation via the correspondence mapping, primitive activity contours, a synergistic encoder, and the production synthesizer.]
80. Imitation example
"mama"
input spectrum → population coding → spectral output
83. Summary
a framework in which speech imitation becomes possible
speech synthesis technique to synthesize a child's voice
channel vocoder meets gammatone filterbank
evaluated for intelligibility and naturalness
addressing the correspondence problem
probabilistic mapping between the tutor's voice and the system's motor space
tutor feedback interpreted in
feature space
an unsupervisedly acquired perceptual space
integration in an online speech acquisition framework (Azubi)
paves the way for usage on the robot
84. Publications
"Learning from a tutor: embodied speech acquisition and imitation learning"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China
"Speech imitation with a child's voice: addressing the correspondence problem"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Proc. SPECOM'2009, St Petersburg, Russia
"Linking Perception and Production: System Learns a Correspondence Between its Own Voice and the Tutor's"
M. Vaz, H. Brandl, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît, GIPSA-lab, Université Stendhal, Grenoble, France
"Speech structure acquisition for interactive systems"
H. Brandl, M. Vaz, F. Joublin, C. Goerick
Speech and Face to Face Communication Workshop in memory of Christian Benoît, GIPSA-lab, Université Stendhal, Grenoble, France
"Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction via Feature-based Resynthesis"
M. Heckmann, C. Glaeser, M. Vaz, T. Rodemann, F. Joublin, C. Goerick
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice, France
85. Thank you
Dr. Estela Bicho
Dr. Frank Joublin
Dr. Wolfram Erlhagen
Colleagues @ Honda Research Institute
Colleagues @ DEI
Family
Friends
Editor's Notes
There I had presented and evaluated a framework for synthesizing speech with a child's voice.
The ultimate goal was to use the framework to learn speech through interaction with a tutor.
In the end, I'd shown you the first steps.
language assumptions:
syllable structure
number of vowels in the vowel system
prosodic
traditional HMM synthesis approaches not suitable
- explain the difficulties of working with a child's voice
- motivate the need for the new technique
- articulatory: limited on voices and phoneme sets
- VOCODER has been shown to work well with good spectral representations
- speech is the physical result of air being expelled from the lungs and passing through the vocal tract
- Source Filter Model of speech production
- source signal (larynx, vocal tract constriction) that is modulated by a Vocal Tract Filter Function
- different ways of representing and deriving the Vocal Tract Filter Function
focus on the architecture and properties
we tested for intelligibility and naturalness
properties different BUT meaning same
1. even if it were true, there is no known speech representation that would do the job
2.
also mention the work of the guys in Edinburgh, where they make spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence
Gros-Louis 2006 - interactive, differentiated and proximate responses increase production of more advanced utterances
Goldstein 2003 -
also mention the work of the guys in Edinburgh, where they make spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence
M. Vaz, H. Brandl, F. Joublin, and C. Goerick, "Speech imitation with a child's voice: addressing the correspondence problem," accepted for 13th Int. Conf. on Speech and Computer - SPECOM, 2009
\begin{split}
p_1(t) & = F_1(t) \\
p_2(t) & = F_2(t) - F_1(t) \\
p_3(t) & = F_3(t) - F_1(t) \\
p_{4,5,6}(t) & = \log( S( C_{1,2,3}(t), t) )
\end{split}
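These per-frame features can be computed as in this hypothetical sketch (the tracks F1-F3, the spectrum S, and the channel positions C are toy stand-ins, not the thesis extraction):

```python
import math

def feature_vector(F1, F2, F3, S, C, t):
    """Six-dimensional frame feature: p1..p3 from formant geometry,
    p4..p6 as log spectral amplitudes sampled at channels C1..C3."""
    p = [F1(t),
         F2(t) - F1(t),
         F3(t) - F1(t)]
    p += [math.log(S(c(t), t)) for c in C]
    return p

# Toy stand-ins for the formant tracks and the spectrum S(f, t).
F1 = lambda t: 700.0
F2 = lambda t: 1200.0
F3 = lambda t: 2500.0
S = lambda f, t: math.exp(-f / 1000.0) + 1e-6   # dummy spectrum
C = [lambda t: 500.0, lambda t: 1500.0, lambda t: 3000.0]
p = feature_vector(F1, F2, F3, S, C, 0.0)
```

Using formant differences (p2, p3) rather than raw F2, F3 makes the geometry less speaker-dependent, which fits the correspondence setting.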
why kNN?
no assumptions on the distribution of the elements of each class
important because the data is quite irregular
For a set of labels or vocal classes $C_j$ and an input feature vector $x$, we consider a neighbourhood $V$ of $x$ that contains exactly $K$ points.
The posterior probability of class membership depends on the number of training points of class $C_j$ present in $V$, denoted by $K_j$:
\begin{equation}
p( C_j | x ) = \frac{K_j}{K}
\end{equation}
\alpha = \frac{p( C_{j1} | x )}{ p( C_{j1} | x ) + p( C_{j2} | x )}
The $S_c$ condition consistently performs better, showing that the system benefits from an extended vocal repertoire.
Two trends can be observed: canonical vowels are reproduced reliably, whereas generalization does not work perfectly, possibly because the morphing introduces some distortions.
The approach relies on several assumptions about the language: its syllable structure, the number of vowels in its vowel system, and its prosodic characteristics.
Traditional HMM-based synthesis approaches are therefore not suitable in this setting.
The phone models are made available to the system only after they have been trained.
The correspondence model $C$ is then obtained from the learned likelihoods via Bayes' rule:
\begin{equation}
C_{ij} = P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j , D_j )\, P( m_j ) }{ P( \lambda_i^p ) }
\end{equation}
where the likelihood terms are collected in the matrix
\begin{equation}
M_{ij} = P( \lambda_i^p \mid m_j , D_j )
\end{equation}
For imitation, the tutor utterance $X_{\mathrm{tutor}}$ is decoded into the most probable sequence of perceptual phone models:
\begin{equation}
[ \lambda^p_{1}, \ldots , \lambda^p_{n} ] = \operatorname*{arg\,max}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{\mathrm{tutor}} )
\end{equation}
Because the perceptual space is discrete, the correspondence model takes the form of a matrix: from a given input, the learner (Azubi) model selects the most probable sequence of its perceptual phone models according to the arg max formulated above.
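The Bayes inversion that turns the likelihood matrix into the correspondence matrix can be sketched as follows. This is an illustrative fragment, not code from the thesis: `M[i, j]` stands for $P(\lambda_i^p \mid m_j, D_j)$, `prior_m[j]` for $P(m_j)$, and the evidence $P(\lambda_i^p)$ is obtained by marginalizing over the motor primitives:

```python
import numpy as np

def correspondence_matrix(M, prior_m):
    """C[i, j] = P(m_j | lambda_i^p) from likelihoods M and prior over m_j."""
    joint = M * prior_m[np.newaxis, :]           # P(lambda_i^p | m_j) P(m_j)
    evidence = joint.sum(axis=1, keepdims=True)  # P(lambda_i^p), by marginalization
    return joint / evidence                      # rows of C sum to one
```

Each row of the resulting matrix is a proper posterior distribution over the motor primitives for one perceptual phone model.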
There is an over-representation, in the sense that there are more phone models than vowels.
Three patterns emerge: (1) for some phonemes the activity is sparse; (2) some phone models are never active; and (3) some are active all of the time.
A whole subset of the phone models is not covered, since the primitives consist only of vowels.
Furthermore, some primitives show a stronger dispersion than others.
This can be attributed either to a non-uniform imitative response of the tutor to the vocal primitives, to the limitations of synthesizing a phoneme with a single spectral vector, to the absence of any phone model fully representing the imitative response, or to issues of over- or under-representation.
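The activity patterns described above can be made explicit with a small diagnostic. This is a hypothetical sketch, not part of the thesis framework: `A[t, j]` is assumed to be 1 when phone model `j` is active in frame `t`, and the sparsity threshold is an arbitrary choice for illustration:

```python
import numpy as np

def activity_report(A, sparse_thresh=0.05):
    """Flag phone models that are never, always, or only sparsely active."""
    rate = A.mean(axis=0)  # fraction of frames in which each model is active
    return {
        "never":  np.where(rate == 0.0)[0].tolist(),
        "always": np.where(rate == 1.0)[0].tolist(),
        "sparse": np.where((rate > 0.0) & (rate < sparse_thresh))[0].tolist(),
    }
```

Such a report directly separates the three observed cases: unused models, permanently active models, and models with only sparse activity.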
Also relevant is work from Edinburgh on spectral morphing between an adult and a child speaker, performed by maximizing the likelihood of a given sequence.

M. Vaz, H. Brandl, F. Joublin, and C. Goerick, ``Speech imitation with a child's voice: addressing the correspondence problem,'' in Proc. 13th Int. Conf. on Speech and Computer (SPECOM), 2009.