©Yuki Saito, Aug. 18, 2022.
Towards Human-In-The-Loop
Speech Synthesis Technology
The University of Tokyo (UTokyo), Japan.
Online Research Talk Hosted by NUS, Singapore @ Zoom
Yuki Saito
Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
Our Lab. in UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
(Diagram: TTS and VC, toward universal speech communication based on TTS/VC technologies.)
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN to learn the statistical relation between input and speech
(Figure: Text-To-Speech (TTS) [Sagisaka+88] synthesizes speech from text; Voice Conversion (VC) [Stylianou+88] converts input speech into output speech while keeping the content, e.g., "Hello" spoken in another voice.)
DNN: Deep Neural Network
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20],
HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc...
(Figure: the acoustic model (generator) maps input feats. 𝒙 to synthetic speech ෝ𝒚 and is trained with a reconstruction loss against natural speech 𝒚 plus an adversarial loss from a discriminator that labels natural speech as "1: natural".)
GAN: Generative Adversarial Network
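To make the adversarial + reconstruction objective above concrete, here is a minimal PyTorch-style sketch of one training step. The module names (generator, discriminator), the L1 reconstruction loss, and the weight lambda_adv are illustrative assumptions rather than the exact configuration of [Saito+18]; the discriminator is assumed to output probabilities in (0, 1).

```python
import torch
import torch.nn.functional as F

def gan_tts_train_step(generator, discriminator, opt_g, opt_d, x, y, lambda_adv=1.0):
    """One adversarial training step for an acoustic model (illustrative sketch)."""
    # Discriminator update: natural speech -> 1, synthetic speech -> 0
    y_hat = generator(x).detach()
    d_real, d_fake = discriminator(y), discriminator(y_hat)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: reconstruction loss + adversarial loss (fool the discriminator)
    y_hat = generator(x)
    recon_loss = F.l1_loss(y_hat, y)
    d_fake = discriminator(y_hat)
    adv_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_loss = recon_loss + lambda_adv * adv_loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```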
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20],
HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc...
(Figure: the same setup, but the discriminator is replaced by a human listener whose perception judges the synthetic speech ෝ𝒚 generated by the acoustic model from input feats. 𝒙, alongside the reconstruction loss against natural speech 𝒚.)
GAN: Generative Adversarial Network
Can we replace the GAN discriminator with a human listener?
Motivation of Human-In-The-Loop
Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc...
– Mistakes can be corrected thru interaction betw. speaker & listener.
• c.f., Machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate an advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
The image was automatically generated by craiyon
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
(Figure: the same DNN-derived SEs are used for a discriminative task (e.g., automatic speaker verification: ASV) and a generative task (e.g., TTS and VC).)
DNN: Deep Neural Network
Conventional Method:
Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
(Figure: a spkr. encoder extracts d-vectors [Variani+14] from speech params. and is trained by minimizing the cross-entropy of spkr. classification against spkr. IDs. The distance metric in the SE space ≠ the perceptual metric, i.e., speaker similarity.)
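For reference, a minimal sketch of the speaker-classification-based DSRL above; the layer sizes and the class name SpeakerEncoder are illustrative and do not reproduce the exact d-vector recipe of [Variani+14].

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Fully-connected speaker encoder trained by speaker classification (sketch)."""
    def __init__(self, feat_dim=40, emb_dim=8, num_speakers=153):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, emb_dim))   # SE (d-vector-like)
        self.classify = nn.Linear(emb_dim, num_speakers)      # speaker ID logits

    def forward(self, speech_params):
        d = self.embed(speech_params)
        return d, self.classify(d)

# Training minimizes the cross-entropy against speaker IDs:
# d, logits = encoder(speech_params); loss = nn.functional.cross_entropy(logits, speaker_ids)
```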
Our Method:
Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
(Figure: listeners score the perceptual similarity of spkr. pairs; a DNN spkr. encoder maps speech params. to SEs, and the learned similarity is trained to predict the similarity scores with a loss 𝐿SIM(∗) (∗: vector, matrix, or graph).)
Large Scale Scoring of
Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in JNAS corpus [Itou+99]
– 4,000↑ listeners scored the similarity of two speakers' voices.
➢ Histogram of the collected scores (figure omitted)
Scoring instruction: "To what degree do these two speakers' voices sound similar?" (−3: dissimilar ~ +3: similar); e.g., different speaker pairs received +2, −3, and −2.
Perceptual Speaker Similarity Matrix
➢ Similarity matrix $\mathbf{S} = [\boldsymbol{s}_1, \cdots, \boldsymbol{s}_i, \cdots, \boldsymbol{s}_{N_\mathrm{s}}]$
– $N_\mathrm{s}$: # of pre-stored (i.e., closed) speakers
– $\boldsymbol{s}_i = [s_{i,1}, \cdots, s_{i,j}, \cdots, s_{i,N_\mathrm{s}}]^\top$: the $i$th similarity score vector
• $s_{i,j}$: similarity of the $i$th & $j$th speakers ($-v \leq s_{i,j} \leq v$)
(Figure: (a) full score matrix (153 females) and (b) sub-matrix of (a) (13 females), with the color scale ranging from −3 to +3.)
I'll present three algorithms to learn the similarity.
Algorithm 1: Similarity Vector Embedding
➢ Predict a vector of the matrix 𝐒 from speech parameters
$$L_\mathrm{SIM}^\mathrm{(vec)}(\boldsymbol{s}, \hat{\boldsymbol{s}}) = \frac{1}{N_\mathrm{s}} (\hat{\boldsymbol{s}} - \boldsymbol{s})^\top (\hat{\boldsymbol{s}} - \boldsymbol{s})$$
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; a similarity prediction branch outputs $\hat{\boldsymbol{s}}$, which is compared by $L_\mathrm{SIM}^\mathrm{(vec)}$ with the sim. score vector $\boldsymbol{s}$ taken from the sim. matrix $\mathbf{S}$.)
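A minimal sketch of the similarity-vector-embedding loss above; the linear prediction head that maps the SE to the predicted score vector is an assumed design, not necessarily the one in our paper.

```python
import torch
import torch.nn as nn

def sim_vector_loss(s_hat, s):
    """L_SIM(vec): squared error between predicted and scored similarity vectors,
    normalized by the number of pre-stored speakers N_s (and averaged over a batch)."""
    n_s = s.shape[-1]
    diff = s_hat - s
    return (diff * diff).sum(dim=-1).mean() / n_s

# Hypothetical similarity prediction head: 8-dim SE d -> 153-dim score vector s_hat
sim_head = nn.Linear(8, 153)
```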
Algorithm 2: Similarity Matrix Embedding
➢ Associate the Gram matrix of SEs with the matrix 𝐒
$$L_\mathrm{SIM}^\mathrm{(mat)}(\mathbf{D}, \mathbf{S}) = \frac{1}{Z_\mathrm{s}} \left\| \tilde{\mathbf{K}}_\mathbf{D} - \tilde{\mathbf{S}} \right\|_F^2$$
– $\mathbf{K}_\mathbf{D}$: Gram matrix of the SEs, computed with a kernel $k(\cdot, \cdot)$
– $Z_\mathrm{s}$: normalization coefficient ($\tilde{\mathbf{S}}$ represents the off-diagonal matrix of $\mathbf{S}$)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; their Gram matrix $\mathbf{K}_\mathbf{D}$ is matched to the sim. matrix $\mathbf{S}$.)
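A minimal sketch of the similarity-matrix-embedding loss above. The Gaussian kernel and the choice of Z_s as the number of off-diagonal entries are assumptions for illustration.

```python
import torch

def sim_matrix_loss(D, S):
    """L_SIM(mat): match the off-diagonal Gram matrix of the SEs D to the similarity
    matrix S (S is assumed to be normalized to [0, 1] as in the experimental conditions)."""
    K = torch.exp(-torch.cdist(D, D, p=2) ** 2)            # Gram matrix with a Gaussian kernel
    off_diag = 1.0 - torch.eye(D.shape[0], device=D.device)
    z_s = off_diag.sum()                                   # assumed normalization coefficient
    return (((K - S) * off_diag) ** 2).sum() / z_s
```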
Algorithm 3: Similarity Graph Embedding
➢ Learn the structure of speaker similarity graph from SE pairs
$$L_\mathrm{SIM}^\mathrm{(graph)}(\boldsymbol{d}_i, \boldsymbol{d}_j) = -a_{i,j} \log p_{i,j} - (1 - a_{i,j}) \log (1 - p_{i,j})$$
– $p_{i,j} = \exp\left(-\|\boldsymbol{d}_i - \boldsymbol{d}_j\|_2^2\right)$: edge probability (referring to [Li+18])
(Figure: a spkr. sim. graph is built from the sim. matrix $\mathbf{S}$; $a_{i,j}$ is the edge label (1: edge exists, 0: no edge), and edge prediction from pairs of SEs $\boldsymbol{d}$ extracted by the spkr. encoder is trained with $L_\mathrm{SIM}^\mathrm{(graph)}$.)
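A minimal sketch of the similarity-graph-embedding loss above; how the binary edge labels a_ij are obtained from the similarity matrix (e.g., by thresholding the scores) is an assumption here.

```python
import torch
import torch.nn.functional as F

def sim_graph_loss(d_i, d_j, a_ij):
    """L_SIM(graph): binary cross-entropy on edge existence for pairs of SEs."""
    # Edge probability p_ij = exp(-||d_i - d_j||^2), as on the slide (cf. [Li+18])
    p_ij = torch.exp(-((d_i - d_j) ** 2).sum(dim=-1))
    p_ij = p_ij.clamp(1e-7, 1.0 - 1e-7)   # numerical stability for the log terms
    return F.binary_cross_entropy(p_ij, a_ij)

# Example (assumed) edge labels: connect pairs whose normalized score exceeds a threshold
# a_ij = (s_ij > 0.5).float()
```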
Human-In-The-Loop Active Learning (AL) for
Perceptual-Similarity-Aware SEs
➢ Overall framework: iterate similarity scoring & SE learning
– Obtaining better SEs while reducing costs of scoring & learning
– Using partially observed similarity scores
(Figure: an AL loop over four stages: spkr. encoder training on scored spkr. pairs, score prediction for unscored spkr. pairs, query selection, and score annotation by listeners.)
Human-In-The-Loop Active Learning (AL) for
Perceptual-Similarity-Aware SEs
➢ AL step 1: train spkr. encoder using partially observed scores
(Figure: the AL loop with the spkr. encoder training stage highlighted; the encoder is trained with the vector, matrix, or graph embedding loss on scored spkr. pairs.)
Human-In-The-Loop Active Learning (AL) for
Perceptual-Similarity-Aware SEs
➢ AL step 2: predict similarity scores for unscored spkr. pairs
(Figure: the AL loop with the score prediction stage highlighted; predicted scores (e.g., +3, 0, −2) are assigned to unscored spkr. pairs.)
Human-In-The-Loop Active Learning (AL) for
Perceptual-Similarity-Aware SEs
➢ AL step 3: select unscored pairs to be scored next
– Query strategy: criterion to determine priority of scoring
– Query strategies: { Higher, Middle, Lower }-Similarity First (HSF, MSF, LSF)
(Figure: the AL loop with the query selection stage highlighted; unscored pairs are ranked by predicted similarity according to the query strategy.)
Human-In-The-Loop Active Learning (AL) for
Perceptual-Similarity-Aware SEs
➢ AL step 4: annotate similarity scores to selected spkr. pairs
– → return to AL step 1
(Figure: the AL loop with the score annotation stage highlighted; listeners annotate the selected pairs, e.g., a pair selected by MSF receives +1.)
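Putting the four AL steps together, a schematic sketch of the loop. The callables train_encoder, predict_scores, and ask_listeners are hypothetical stand-ins for the components above, and MSF is interpreted as selecting pairs whose predicted scores are closest to the middle of the −3 ~ +3 scale (i.e., closest to 0).

```python
def active_learning_loop(train_encoder, predict_scores, ask_listeners,
                         all_pairs, initial_scores, n_iterations=10, pairs_per_iter=43):
    """Human-in-the-loop AL for perceptual-similarity-aware SEs (schematic sketch)."""
    scored = dict(initial_scores)                         # partially observed scores
    unscored = [p for p in all_pairs if p not in scored]
    encoder = None
    for _ in range(n_iterations):
        encoder = train_encoder(scored)                   # AL step 1 (vec / mat / graph loss)
        predicted = predict_scores(encoder, unscored)     # AL step 2: {pair: predicted score}
        # AL step 3: Middle-Similarity-First (MSF) query strategy
        queries = sorted(predicted, key=lambda p: abs(predicted[p]))[:pairs_per_iter]
        scored.update(ask_listeners(queries))             # AL step 4: human annotation
        unscored = [p for p in unscored if p not in scored]
    return encoder
```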
➢ Experimental Evaluations
Experimental Conditions
– Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
– Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
– Speech parameters: 40-dimensional mel-cepstra, F0, aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
– DNNs: fully-connected (for details, please see our paper)
– Dimensionality of SEs: 8
– AL setting: pool-based simulation (using binary masking for excluding unobserved scores)
– DSRL methods: conventional: d-vectors [Variani+14]; ours: Prop. (vec), Prop. (mat), or Prop. (graph)
Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) highly correlated with the human-derived sim. scores.
• → Our DSRL can learn more interpretable SEs than d-vec!
(Figure: scatter plots of human-derived vs. SE-derived similarity scores (both in [0, 1]) for d-vec., Prop. (vec), Prop. (mat), and Prop. (graph), shown for Seen-Seen and Seen-Unseen speaker pairs.)
Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate new speaker identity by mixing two SEs
– We evaluated spkr. sim. between speech interpolated with 𝛼 ∈ {0.0, 0.25, 0.5, 0.75, 1.0} and the original speaker's speech (𝛼 = 0 or 1).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, total 30 × 2 listeners, method-wise preference XAB test)
(Figure: preference score (0.0 ~ 1.0) vs. mixing coefficient 𝛼 (0.0 ~ 1.0), comparing A (mixed w/ 𝛼 = 0) against B (mixed w/ 𝛼 = 1).)
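The interpolation above amounts to mixing two SEs with the coefficient 𝛼; a minimal numpy sketch, assuming a simple linear mixture (the function name is illustrative):

```python
import numpy as np

def interpolate_speaker_embeddings(d_a, d_b, alpha):
    """Mix two speaker embeddings with coefficient alpha in [0, 1]."""
    return (1.0 - alpha) * np.asarray(d_a) + alpha * np.asarray(d_b)

# e.g., sweep alpha over the values used in the evaluation:
# for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     d_mix = interpolate_speaker_embeddings(d_a, d_b, alpha)
#     # feed d_mix to the multi-speaker TTS/VC model as the speaker condition
```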
Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from PS to reach FS situation
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker-pairs were newly annotated.
(Figure: AUC of similar speaker-pair detection vs. AL iteration, from the Partially Scored (PS) situation toward the Fully Scored (FS) situation.)
Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4 p.m. in SGT & 5 p.m. in JST)
Overview: Speaker Adaptation for
Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of unseen speaker's voice with small amount of data
(Figure: a multi-speaker TTS model synthesizes speech from text, conditioned on an SE.)
Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract SE of unseen speaker & input to TTS model
– Issue: cannot be used w/o the reference speech
• e.g., deceased person w/o any speech recordings
(Figure: a FROZEN speaker encoder extracts the SE from ref. speech and feeds it to the multi-speaker TTS model.)
Can we find the target speaker's SE w/o using ref. speech?
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on SE space
(Figure: candidate SEs on a line segment in the SE space are fed, with text, to the multi-speaker TTS system to synthesize waveforms; the user selects one SE/waveform, and Bayesian optimization updates the line segment.)
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ SLS step 1: define line segment in SE space
(Figure: a line segment with candidate SEs in the SE space.)
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ SLS step 2: synthesize waveforms using candidate SEs
(Figure: the multi-speaker TTS system synthesizes waveforms from text using the candidate SEs.)
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ SLS step 3: select one SE based on user's speech perception
(Figure: the user listens to the synthesized waveforms and selects one SE.)
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ SLS step 4: update line segment using Bayesian Optimization
(Figure: Bayesian optimization updates the line segment in the SE space based on the user's selection.)
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ ... and loop the SLS steps until the user gets the desired outcome
– Ref. speech & spkr. encoder are no longer needed in adaptation!
(Figure: the full loop of synthesizing waveforms from candidate SEs, user selection, and Bayesian optimization of the line segment.)
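A schematic sketch of the whole loop. The line-segment update is the Bayesian optimization of SLS [Koyama+17], abstracted here as a caller-supplied update_line_segment; tts_synthesize and ask_user are likewise illustrative stand-ins.

```python
import numpy as np

def hitl_speaker_adaptation(tts_synthesize, ask_user, update_line_segment,
                            endpoint_a, endpoint_b, text,
                            n_iterations=30, n_candidates=10):
    """Human-in-the-loop speaker adaptation via sequential line search (schematic sketch)."""
    selected_se = None
    for _ in range(n_iterations):
        # SLS step 1: candidate SEs on the current line segment in the SE space
        alphas = np.linspace(0.0, 1.0, n_candidates)
        candidates = [(1 - a) * endpoint_a + a * endpoint_b for a in alphas]
        # SLS step 2: synthesize waveforms from the candidate SEs
        waveforms = [tts_synthesize(text, se) for se in candidates]
        # SLS step 3: the user selects the waveform closest to the desired voice
        idx = ask_user(waveforms)
        selected_se = candidates[idx]
        # SLS step 4: Bayesian optimization updates the line segment (SLS [Koyama+17])
        endpoint_a, endpoint_b = update_line_segment(candidates, idx)
    return selected_se
```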
Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the 𝐷-dimensional hypercube [0, 1]^𝐷; however, actual SEs are NOT distributed uniformly (e.g., right figure).
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation
– 1) Use mean {male, female} speakers' SEs as initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of SEs in the training data
• → Search for more natural voice (but limit the search space)
– We empirically confirmed that these strategies significantly
improved the naturalness of synthetic speech during search.
(Figure: SE distribution with a dead space where no training-data SEs lie.)
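A minimal numpy sketch of the two strategies above, assuming the training-data SEs are available as arrays; the quantile value is an illustrative parameter.

```python
import numpy as np

def init_line_endpoints(male_ses, female_ses):
    """Strategy 1: start the line segment at the mean male / mean female SEs."""
    return male_ses.mean(axis=0), female_ses.mean(axis=0)

def search_space_bounds(train_ses, quantile=0.05):
    """Strategy 2: restrict the search space to a quantile range of training-data SEs."""
    lower = np.quantile(train_ses, quantile, axis=0)
    upper = np.quantile(train_ses, 1.0 - quantile, axis=0)
    return lower, upper
```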
➢ Experimental Evaluations
Experimental Conditions
– Corpus for training speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
– TTS model: FastSpeech 2 [Ren+21]
– Corpus for TTS model: "parallel100" subset of Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences / speaker)
– Data split: train 90 speakers (44 males, 46 females), test 4 speakers (2 males, 2 females), validation 6 speakers (3 males, 3 females)
– Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
Demonstration
➢ Interface for SLS experiment
– Button to play the reference speaker's voice
• Simulating a situation where users have their desired voice in mind
– Slider to change multiple speakers' IDs smoothly
Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values:
• SLS-best: lowest MAE, SLS-mean: closest to mean MAE, SLS-worst: highest MAE
(Figure: for each ref. waveform, the SEs searched by participants 1-8 with SLS for our method are compared by MAE.)
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
Our methods achieve MOSs comparable to TL-based method!
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
SLS-worst tends to degrade the naturalness significantly.
Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
We observe similar tendency to the naturalness MOS results.
Speech Samples
(Table: speech samples of Ground-Truth, TL, Mean-Speaker, SLS-worst, SLS-mean, and SLS-best for jvs078 (male), jvs005 (male), jvs060 (female), and jvs010 (female).)
Other samples are available online! →
Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved comparable performance to TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will talk about this work in poster session.
Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Intervening human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH2022 w/ 8 lab members!
– Very much looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!