[Slide 1/51]
Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
[Slide 2/51]
Our Lab. at UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
[Slide 3/51]
TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
(Figure: TTS and VC, toward universal speech communication based on TTS/VC technologies)
[Slide 4/51]
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
[Slide 6/51]
Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN to learn the statistical relation between input and speech
(Diagram: Text-To-Speech (TTS) [Sagisaka+88] maps text to speech; Voice Conversion (VC) [Stylianou+98] maps input speech to output speech while keeping the content ("Hello"))
DNN: Deep Neural Network
[Slide 7/51]
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc. (losses sketched below)
(Diagram: the acoustic model (generator) maps input feats. x to synthetic speech ŷ; a reconstruction loss compares ŷ with natural speech y, and the discriminator provides an adversarial loss pushing ŷ toward "1: natural")
GAN: Generative Adversarial Network
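A minimal PyTorch sketch of this two-loss setup, assuming placeholder generator/discriminator modules rather than any of the cited systems:

```python
# Toy GAN-based acoustic model training: reconstruction loss + adversarial loss.
import torch
import torch.nn.functional as F

def generator_loss(gen, disc, x, y, w_adv=1.0):
    """x: input feats., y: natural speech feats.; gen/disc are nn.Modules."""
    y_hat = gen(x)                       # synthetic speech features
    loss_recon = F.l1_loss(y_hat, y)     # reconstruction loss vs. natural speech
    d_fake = disc(y_hat)                 # discriminator output in (0, 1)
    # adversarial loss: push the discriminator toward "1: natural" on y_hat
    loss_adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return loss_recon + w_adv * loss_adv

def discriminator_loss(disc, y, y_hat):
    """Train the discriminator to output 1 for natural and 0 for synthetic speech."""
    d_real, d_fake = disc(y), disc(y_hat.detach())
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
```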
[Slide 8/51]
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc.
(Diagram: the same pipeline, but the discriminator is replaced by a human listener whose perception judges the synthetic speech ŷ against natural speech y)
GAN: Generative Adversarial Network
Can we replace the GAN discriminator with a human listener?
[Slide 9/51]
Motivation of Human-In-The-Loop Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc.
– Mistakes can be corrected thru interaction betw. speaker & listener.
• cf. machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate an advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
(Image automatically generated by craiyon)
[Slide 10/51]
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
[Slide 11/51]
Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
(Diagram: in a discriminative task (e.g., automatic speaker verification: ASV), a DNN judges whether a speaker is accepted or rejected; in a generative task (e.g., TTS and VC), a DNN generates speech conditioned on the SE)
DNN: Deep Neural Network
[Slide 12/51]
Conventional Method: Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
(Diagram: a spkr. encoder maps speech params. to SEs (d-vectors [Variani+14]) and is trained by minimizing the cross-entropy of speaker classification against spkr. IDs; sketched below)
Distance metric in the SE space ≠ perceptual metric (i.e., speaker similarity)
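A minimal sketch of this speaker-classification-based training in PyTorch; the hidden-layer size is illustrative, not the cited papers' exact configuration (the SE dimensionality of 8 and the 153 speakers follow the experimental conditions later in this talk):

```python
# d-vector-style DSRL: train a classifier, keep the bottleneck activations as SEs.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=8, n_speakers=153):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim))            # SE = activation of this bottleneck
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):
        se = self.encoder(feats)                # speaker embedding
        logits = self.classifier(se)            # speaker-ID posterior
        return se, logits

# Training minimizes cross-entropy against speaker IDs; at inference time
# only `se` is used and the classification head is discarded.
criterion = nn.CrossEntropyLoss()
```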
[Slide 13/51]
Our Method: Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
(Diagram: crowdsourced listeners assign perceptual similarity scores to spkr. pairs; a DNN (spkr. encoder) extracts SEs from speech params., and the similarity derived from the SEs is trained with a loss L_SIM^(*) to predict the human scores; * ∈ {vector, matrix, graph} denotes the three variants, the first of which is sketched below)
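A hypothetical sketch in the spirit of the vector variant L_SIM^(vec), regressing the cosine similarity of two SEs to the human score; the three actual formulations are defined in our TASLP paper:

```python
# Perceptual-similarity-aware loss (illustrative vector-variant stand-in).
import torch
import torch.nn.functional as F

def similarity_loss(se_i, se_j, human_score):
    """se_i, se_j: (batch, emb_dim) SEs of a speaker pair;
    human_score: (batch,) crowdsourced similarity normalized to [0, 1]."""
    se_sim = 0.5 * (F.cosine_similarity(se_i, se_j) + 1.0)  # map [-1, 1] -> [0, 1]
    return F.mse_loss(se_sim, human_score)                  # match human perception
```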
[Slide 14/51]
Large-Scale Scoring of Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in JNAS corpus [Itou+99]
– 4,000↑ listeners scored the similarity of two speakers' voices.
➢ Histogram of the collected scores
(Figure: histogram of the collected similarity scores; instruction of the scoring: "To what degree do these two speakers' voices sound similar? (−3: dissimilar ~ +3: similar)")
[Slide 15/51]
Perceptual Speaker Similarity Matrix
➢ Similarity matrix S = [s_1, ⋯, s_i, ⋯, s_{N_s}]
– N_s: # of pre-stored (i.e., closed) speakers
– s_i = [s_{i,1}, ⋯, s_{i,j}, ⋯, s_{i,N_s}]^T: the i-th similarity score vector
• s_{i,j}: similarity of the i-th & j-th speakers (−v ≤ s_{i,j} ≤ v)
(Figure: (a) full score matrix (153 females) and (b) a sub-matrix of (a) (13 females), with scores color-coded from −3 to +3; assembly sketched below)
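As an aside, a NumPy sketch of how such a matrix could be assembled from the crowdsourced pair scores; the symmetrization and the diagonal choice here are assumptions for illustration, not the paper's specification:

```python
# Build a normalized similarity matrix S from crowdsourced pair ratings.
import numpy as np

def build_similarity_matrix(raw_scores, n_speakers):
    """raw_scores: dict mapping a pair (i, j) to a list of ratings in [-3, +3].
    Unobserved pairs stay NaN (later excluded via a binary mask in AL)."""
    S = np.full((n_speakers, n_speakers), np.nan)
    for (i, j), ratings in raw_scores.items():
        S[i, j] = S[j, i] = np.mean(ratings)   # symmetric by construction
    # One natural choice (not specified on the slides): a speaker is
    # maximally similar to themselves.
    np.fill_diagonal(S, 3.0)
    return (S + 3.0) / 6.0                     # normalize [-3, +3] -> [0, 1] for DSRL
```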
I'll present three algorithms to learn the similarity.
[Slide 25/51]
Experimental Conditions
– Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
– Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
– Speech parameters: 40-dimensional mel-cepstra, F0, aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
– DNNs: fully connected (for details, please see our paper)
– Dimensionality of SEs: 8
– AL setting: pool-based simulation (using binary masking to exclude unobserved scores)
– DSRL methods: conventional d-vectors [Variani+14]; ours: Prop. (vec), Prop. (mat), or Prop. (graph)
[Slide 26/51]
Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) were highly correlated with the human-derived sim. scores.
• → Our DSRL learns more interpretable SEs than d-vectors!
(Figure: scatter plots of human-derived (x-axis) vs. SE-derived (y-axis) similarity scores in [0, 1] for d-vec. and Prop. (vec / mat / graph), for seen-seen and seen-unseen speaker pairs)
[Slide 27/51]
Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate a new speaker identity by mixing two SEs
– We evaluated spkr. sim. between speech interpolated with α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} and the original speakers' speech (α = 0 or 1).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, 30 × 2 listeners in total, method-wise preference XAB test)
(Figure: preference score (0.0-1.0) vs. mixing coefficient α ∈ [0.0, 1.0]; A = speech mixed w/ α = 0, B = speech mixed w/ α = 1; the mixing itself is sketched below)
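The interpolation itself is a simple linear mixture of two SEs, e.g.:

```python
# SE interpolation for the XAB test above.
import numpy as np

def interpolate_se(se_a: np.ndarray, se_b: np.ndarray, alpha: float) -> np.ndarray:
    """alpha = 0 reproduces speaker A; alpha = 1 reproduces speaker B."""
    return (1.0 - alpha) * se_a + alpha * se_b

# The five conditions evaluated on this slide:
# [interpolate_se(se_a, se_b, a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```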
[Slide 28/51]
Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from a Partially Scored (PS) situation to reach the Fully Scored (FS) one
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker-pairs were newly annotated.
(Figure: AUC of similar speaker-pair detection vs. AL iterations, from the Partially Scored (PS) to the Fully Scored (FS) situation)
[Slide 29/51]
Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
[Slide 30/51]
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4pm in SGT / 5pm in JST)
[Slide 31/51]
Overview: Speaker Adaptation for Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of unseen speaker's voice with small amount of data
(Diagram: a multi-speaker TTS model synthesizes speech from text, with an SE as conditional input to control the speaker ID)
[Slide 32/51]
Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract SE of unseen speaker & input to TTS model
– Issue: cannot be used w/o the reference speech
• e.g., deceased person w/o any speech recordings
(Diagram: a FROZEN speaker encoder extracts the SE from ref. speech, and the SE conditions the multi-speaker TTS model; inference sketched below)
Can we find the target speaker's SE w/o using ref. speech?
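A minimal sketch of the TL-based inference above; `speaker_encoder` and `tts_model` stand in for the pretrained modules:

```python
# TL-based speaker adaptation at inference time (in the spirit of [Jia+18]).
import torch

@torch.no_grad()
def adapt_and_synthesize(speaker_encoder, tts_model, ref_speech, text):
    speaker_encoder.eval()              # FROZEN: no gradient updates
    se = speaker_encoder(ref_speech)    # SE of the unseen target speaker
    return tts_model(text, se)          # synthesize speech in the target voice
```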
[Slide 33/51]
Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on SE space
(Diagram of the SLS loop: candidate SEs are sampled along a line segment in the SE space; the multi-speaker TTS system synthesizes a waveform from the text for each candidate SE; the user selects one SE by listening; Bayesian optimization updates the line segment)
[Slide 36/51]
Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ SLS step 3: select one SE based on user's speech perception
(Diagram: SLS step 3 highlighted; the user listens to the candidate waveforms and selects one SE)
[Slide 37/51]
Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ SLS step 4: update line segment using Bayesian Optimization
(Diagram: SLS step 4 highlighted; the line segment in the SE space is updated via Bayesian optimization based on the user's selection)
[Slide 38/51]
Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ ... and the SLS steps are looped until the user obtains the desired voice
– Ref. speech & spkr. encoder are no longer needed in adaptation!
(Diagram: the full SLS loop repeats: synthesize candidate waveforms, let the user select an SE, and update the line segment by Bayesian optimization; a simplified sketch follows below)
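A heavily simplified sketch of the loop. Real SLS [Koyama+17] updates the line segment with a Gaussian-process-based Bayesian optimizer over the user's choices; a naive hill-climbing update stands in for it here:

```python
# Simplified human-in-the-loop line search over the SE space.
import numpy as np

def sls_search(synthesize, ask_user, p0, p1, n_iters=30, n_candidates=9):
    """p0, p1: endpoints of the initial line segment in SE space.
    synthesize(se) -> waveform; ask_user(waveforms) -> index of the pick."""
    for _ in range(n_iters):
        ts = np.linspace(0.0, 1.0, n_candidates)
        candidates = [(1 - t) * p0 + t * p1 for t in ts]   # SEs on the segment
        waveforms = [synthesize(se) for se in candidates]  # TTS for each candidate
        best = candidates[ask_user(waveforms)]             # user selects by listening
        # Stand-in update: re-center a shorter segment around the user's pick
        # along a fresh random direction (real SLS chooses this via BO).
        direction = np.random.randn(*best.shape)
        direction /= np.linalg.norm(direction)
        half_len = 0.25 * np.linalg.norm(p1 - p0)          # halve the segment length
        p0, p1 = best - half_len * direction, best + half_len * direction
    return best
```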
[Slide 39/51]
Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the D-dimensional hypercube [0, 1]^D. However, actual SEs are NOT distributed uniformly (e.g., the figure on the right).
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation
– 1) Use mean {male, female} speakers' SEs as initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of SEs in the training data
• → Search for more natural voice (but limit the search space)
– We empirically confirmed that these strategies significantly improved the naturalness of synthetic speech during the search (strategy 2 is sketched below).
(Figure: 2-D illustration of an SE space in which actual SEs occupy only part of the hypercube; the remainder is dead space)
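A sketch of strategy 2); the 5%-95% quantile level is an assumption for illustration:

```python
# Map the unit-hypercube search space into a quantile box of the training SEs,
# so SEs proposed by SLS avoid the dead space.
import numpy as np

def make_unit_to_se_mapper(train_ses, q_lo=0.05, q_hi=0.95):
    """train_ses: (n_speakers, emb_dim) SEs extracted from the training data."""
    lo = np.quantile(train_ses, q_lo, axis=0)   # per-dimension lower bound
    hi = np.quantile(train_ses, q_hi, axis=0)   # per-dimension upper bound
    def to_se(u):
        # u: point in [0, 1]^D proposed by SLS; affine map into the quantile box
        return lo + u * (hi - lo)
    return to_se
```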
[Slide 41/51]
Experimental Conditions
– Corpus for training speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
– TTS model: FastSpeech 2 [Ren+21]
– Corpus for TTS model: "parallel100" subset of the Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences/speaker)
– Data split: train 90 speakers (44 males, 46 females) / test 4 speakers (2 males, 2 females) / validation 6 speakers (3 males, 3 females)
– Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
[Slide 42/51]
Demonstration
➢ Interface for the SLS experiment
– Button to play the reference speaker's voice
• Simulating a situation where users have their desired voice in mind
– Slider to smoothly change the speaker ID across multiple speakers
[Slide 43/51]
Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values (sketched below).
(Diagram: for each ref. waveform, participants 1-8 run SLS; from the searched SEs, three are kept per target: SLS-best (lowest MAE), SLS-mean (closest to the mean MAE), and SLS-worst (highest MAE))
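A sketch of this MAE-based selection; `mel()` and `synthesize()` are placeholders, and the natural and synthetic mel-spectrograms are assumed to be time-aligned to the same shape:

```python
# Rank searched SEs by mel-spectrogram MAE against the reference speech.
import numpy as np

def select_ses(searched_ses, ref_speech, synthesize, mel):
    """searched_ses: list of SEs found by the 8 participants for one target."""
    ref_mel = mel(ref_speech)
    maes = np.array([np.abs(mel(synthesize(se)) - ref_mel).mean()
                     for se in searched_ses])
    return {"SLS-best":  searched_ses[maes.argmin()],                      # lowest MAE
            "SLS-mean":  searched_ses[np.abs(maes - maes.mean()).argmin()],# closest to mean
            "SLS-worst": searched_ses[maes.argmax()]}                      # highest MAE
```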
[Slide 48/51]
Subjective Evaluation (Similarity MOS)
(20 answers/listener, 50 × 4 listeners in total, target-speaker-wise MOS test)
We observed a tendency similar to that of the naturalness MOS results.
[Slide 50/51]
Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved performance comparable to the TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH 2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will present this work in the poster session.
[Slide 51/51]
Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Involving human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH 2022 w/ 8 lab members!
– Really looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!