Multi-Task Adversarial Training Algorithm for
Multi-Speaker Neural Text-To-Speech
The University of Tokyo, Japan.
APSIPA ASC 2022 WedAM1-8-2 (SS04)
Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
Overview: Multi-Speaker Neural Text-To-Speech
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ Multi-speaker Neural TTS [Fan+15][Hojo+18]
– Single Deep Neural Network (DNN) to generate multiple speakers' voices
• Speaker embedding: conditional input to control speaker ID
➢ Voice cloning (e.g., [Arik+18])
– TTS of an unseen speaker's voice with a small amount of data
[Figure: Text-To-Speech (TTS). Text and spkr. emb. are input to the multi-spkr. neural TTS model, which outputs speech]
Research Outline
➢ Conventional algorithm: GAN*-based training
– High-quality TTS by adversarial training of discriminator & generator
– Poor generalization performance in voice cloning
• TTS model cannot observe unseen speakers' voices in training...
➢ Proposed algorithm: Multi-task adversarial training
– Primary task: GAN-based multi-speaker neural TTS training
• Objective: feature reconstruction loss + adversarial loss
– Secondary task: improving (pseudo) unseen speaker's TTS quality
• Objective: loss to generate realistic voices of unseen speakers
➢ Results: High-quality voice cloning by our algorithm!
*GAN: Generative Adversarial Network [Goodfellow+14]
Baseline 1: Multi-Speaker FastSpeech 2 (FS2) [Ren+21]
➢ Transfer-learning-based multi-speaker neural TTS [Jia+18]
– 1. Pretrain spkr. encoder w/ spkr. verification task (e.g., GE2E* loss)
– 2. Train FS2-based TTS model w/ pretrained spkr. encoder
*GE2E: Generalized End-2-End [Wan+18]
[Figure: FS2 w/ spkr. encoder. Labels: spkr. encoder extracts spkr. emb. from reference speech (fixed during TTS training); add variance information of speech]
Baseline 2: GANSpeech [Yang+21]
➢ Overview: TTS model (generator) vs. JCU* discriminator
– TTS model generates speech features from text & spkr. emb.
– JCU discriminator classifies synth. / nat. from two kinds of inputs
• Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb.
[Figure: TTS model 𝐺(⋅) generates synth. speech from text & spkr. emb.; JCU discriminator (𝐷S(⋅), 𝐷U(⋅), 𝐷C(⋅)) receives synth. or nat. speech, and both 𝐷U and 𝐷C output 0: synth. / 1: nat.]
*JCU: Joint Conditional & Unconditional [Zhang+18]
GANSpeech Algorithm: JCU Discriminator Update
➢ Objective: Discriminating synth. (0) / nat. (1) correctly
– 𝐷S extracts shared features of nat. / synth. speech
– 𝐷U learns general characteristics of speech
– 𝐷C captures spkr.-specific characteristics of speech
[Figure: JCU discriminator update. TTS model 𝐺(⋅) generates synth. speech from text & spkr. emb.; disc. loss is computed from the outputs of 𝐷U(⋅) and 𝐷C(⋅) w/ targets 0: synth. / 1: nat.]
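The discriminator objective above can be sketched in a few lines. This is only an illustration: the slide does not state the form of the loss, so a least-squares (MSE) criterion is assumed here, and the function names `mse_to_target` and `jcu_disc_loss` are hypothetical.

```python
def mse_to_target(scores, target):
    """Mean squared error between discriminator scores and a constant label."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def jcu_disc_loss(d_u_nat, d_c_nat, d_u_syn, d_c_syn):
    """Discriminator loss summed over both JCU branches.

    d_u_*: scores from the unconditional branch D_U (w/o spkr. emb.)
    d_c_*: scores from the conditional branch D_C (w/ spkr. emb.)
    Labels follow the slide: 1 for nat., 0 for synth.
    The least-squares form is an assumption, not taken from the slide.
    """
    loss_nat = mse_to_target(d_u_nat, 1.0) + mse_to_target(d_c_nat, 1.0)
    loss_syn = mse_to_target(d_u_syn, 0.0) + mse_to_target(d_c_syn, 0.0)
    return loss_nat + loss_syn


# Hypothetical scores for a mini-batch of two utterances.
perfect = jcu_disc_loss([1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0])
undecided = jcu_disc_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
print(perfect, undecided)
```

A discriminator that labels every sample correctly reaches zero loss, while one that outputs 0.5 everywhere is penalized on all four terms.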
GANSpeech Algorithm: TTS Model Update
➢ Objective: Generating speech & Deceiving JCU discriminator
– Speech reconst. loss makes TTS model generate speech features
• Phoneme duration, F0, energy, mel-spectrogram
– Adv. loss causes JCU discriminator to misclassify synth. as nat.
[Figure: TTS model update. Speech reconst. loss is computed on the output of 𝐺(⋅); adv. loss is computed from the outputs of 𝐷U(⋅) and 𝐷C(⋅) w/ target 1 for synth. speech]
➢ GAN = Distribution matching betw. nat. and synth. data
– High-quality TTS for seen spkrs. included in training corpus
– No guarantee to generalize TTS for unseen spkrs.
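A minimal sketch of the TTS model objective described on this slide, under stated assumptions: the per-feature error is taken to be a mean absolute error, the adversarial term is least-squares, and the weight `lambda_adv` and all function names are hypothetical rather than taken from GANSpeech.

```python
FEATURES = ("duration", "f0", "energy", "mel")  # features named on the slide


def mean_abs_err(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_to_target(scores, target):
    """Mean squared error between discriminator scores and a constant label."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def tts_update_loss(pred_feats, nat_feats, d_u_syn, d_c_syn, lambda_adv=1.0):
    """Speech reconst. loss plus adv. loss for the generator (TTS model)."""
    # Reconstruction: one error term per predicted speech feature.
    reconst = sum(mean_abs_err(pred_feats[k], nat_feats[k]) for k in FEATURES)
    # Adversarial: push both discriminator branches toward label 1 on synth. input.
    adv = mse_to_target(d_u_syn, 1.0) + mse_to_target(d_c_syn, 1.0)
    return reconst + lambda_adv * adv


# Toy features (hypothetical values); prediction equals the natural target.
nat = {"duration": [3.0, 5.0], "f0": [120.0, 130.0],
       "energy": [0.5, 0.7], "mel": [0.1, 0.2]}
loss_fooled = tts_update_loss(nat, nat, d_u_syn=[1.0], d_c_syn=[1.0])
loss_caught = tts_update_loss(nat, nat, d_u_syn=[0.0], d_c_syn=[0.0])
print(loss_fooled, loss_caught)
```

Both terms depend only on speakers present in the training corpus, which is why this objective by itself says nothing about unseen speakers.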
➢ Proposed Method: Multi-Speaker Neural TTS based on Multi-Task Adversarial Training
Overview of Proposed Method
➢ Motivation: Diversifying spkr. variation during training to
– Widen spkr. emb. distribution that TTS model can cover
– Improve robustness of TTS model towards unseen speakers
➢ Idea: Adversarially Constrained Autoencoder Interpolation (ACAI) [Berthelot+19]
– Architecture: Autoencoder w/ feature interpolation + Critic
• Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data)
• Autoencoder makes critic output 𝛼 = 0 for interpolated data
[Figure: autoencoder w/ feature interpolation w/ 𝛼, followed by critic]
➢ We introduce this idea to GAN-based multi-speaker TTS
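The two opposing ACAI objectives can be written down compactly. The sketch below is illustrative only: the mixing convention `(1 - alpha) * a + alpha * b` is an assumption chosen so that 𝛼 = 0 returns pure data, and the function names are hypothetical.

```python
def interpolate(feat_a, feat_b, alpha):
    """Mix two feature vectors; alpha = 0 returns feat_a unchanged (pure data)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(feat_a, feat_b)]


def critic_objective(predicted_alpha, true_alpha):
    """The critic is trained to recover the mixing coefficient."""
    return (predicted_alpha - true_alpha) ** 2


def autoencoder_regularizer(predicted_alpha):
    """The autoencoder is trained so that the critic outputs 0 on mixed data."""
    return predicted_alpha ** 2


pure = interpolate([1.0, 0.0], [0.0, 1.0], alpha=0.0)
mixed = interpolate([1.0, 0.0], [0.0, 1.0], alpha=0.25)
print(pure, mixed)

# If the critic guesses 0.25 on the mixed sample, it is exactly right,
# while the autoencoder is still penalized for being detectable.
print(critic_objective(0.25, 0.25), autoencoder_regularizer(0.25))
```

The critic's loss falls as it detects interpolation more accurately, whereas the autoencoder's regularizer falls only when interpolated outputs become indistinguishable from pure ones.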
Multi-Task Adversarial Training Algorithm
➢ Overview: GANSpeech + ACAI-derived regularization
– Encoder = spkr. encoder (fixed parameters during training)
– Decoder = TTS model (i.e., Generator in GAN)
– Critic 𝐶 = additional branch in Multi-Task (MT) discriminator
[Figure: spkr. enc. & interp. produce mixed/pure spkr. emb.; TTS model 𝐺(⋅) generates synth. speech from text & spkr. emb.; MT discriminator (𝐷S(⋅), 𝐷U(⋅), 𝐷C(⋅), 𝐶(⋅)) receives synth. or nat. speech; 𝐷U and 𝐷C output 0: synth. / 1: nat., and 𝐶 outputs 𝛼: mixed / 0: pure]
Proposed Algorithm: MT Discriminator Update
[Figure: MT discriminator update. Disc. loss is computed from the outputs of 𝐷U(⋅) and 𝐷C(⋅) w/ targets 0: synth. / 1: nat.; critic loss is computed from the output of 𝐶(⋅) w/ targets 𝛼: mixed / 0: pure]
➢ Objective: Discriminating synth./nat. & mixed/pure
– Synth. speech samples are generated from mixed / pure spkr. emb.
• Coefficient: 𝛼 ~ 𝑈(0.0, 0.5), spkr. pairs: shuffled within mini-batch
– Criterion for critic training: MSE betw. predicted / correct 𝛼
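The sampling and critic criterion above can be sketched as follows. The mixing convention `(1 - alpha) * a + alpha * b` and all names are assumptions for illustration; the embeddings are toy 2-dimensional vectors rather than the 256-dimensional ones used in the experiments.

```python
import random


def mix_speaker_embeddings(batch_embs, rng):
    """Pair each spkr. emb. with a shuffled partner and interpolate w/ alpha."""
    partners = batch_embs[:]
    rng.shuffle(partners)  # spkr. pairs: shuffled within the mini-batch
    # A shuffled partner can coincide with the original speaker; this sketch
    # does not treat that case specially.
    alphas = [rng.uniform(0.0, 0.5) for _ in batch_embs]  # alpha ~ U(0.0, 0.5)
    mixed = [
        [(1.0 - a) * x + a * y for x, y in zip(emb, partner)]
        for emb, partner, a in zip(batch_embs, partners, alphas)
    ]
    return mixed, alphas


def critic_mse(pred_alphas, true_alphas):
    """MSE between predicted and correct alpha (correct alpha is 0 for pure)."""
    return sum((p - t) ** 2 for p, t in zip(pred_alphas, true_alphas)) / len(true_alphas)


rng = random.Random(0)
batch = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
mixed, alphas = mix_speaker_embeddings(batch, rng)

# Critic targets: sampled alpha for mixed samples, 0 for pure samples.
targets = alphas + [0.0] * len(batch)
always_zero = [0.0] * len(targets)  # a critic that never detects mixing
print(critic_mse(always_zero, targets))
```

Because 𝛼 is capped at 0.5, each mixed embedding stays at least as close to its own speaker as to the partner.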
Proposed Algorithm: TTS Update
➢ Objective: GANSpeech objective + ACAI loss
– ACAI loss makes critic output 0 for synth. speech of mixed spkrs.
• Regularization on TTS for (pseudo) unseen spkrs.
– Computation time for inference does not change from GANSpeech
[Figure: TTS model update. Speech reconst. loss is computed on the output of 𝐺(⋅); adv. loss is computed from the outputs of 𝐷U(⋅) and 𝐷C(⋅) w/ target 1 for synth. speech; ACAI loss is computed from the output of 𝐶(⋅) w/ target 0 for synth. speech of mixed spkrs.]
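As a rough sketch, the TTS objective above adds one term to the GANSpeech loss. The weights `lambda_adv` and `lambda_acai` and the function names are hypothetical, and which samples each term is evaluated on is not specified in this sketch.

```python
def acai_loss(critic_out_mixed):
    """Mean squared critic output on synth. speech of mixed spkrs. (target 0)."""
    return sum(c ** 2 for c in critic_out_mixed) / len(critic_out_mixed)


def mt_tts_loss(reconst_loss, adv_loss, critic_out_mixed,
                lambda_adv=1.0, lambda_acai=1.0):
    """GANSpeech objective (reconst. + adv.) plus the ACAI regularizer."""
    return (reconst_loss
            + lambda_adv * adv_loss
            + lambda_acai * acai_loss(critic_out_mixed))


# Hypothetical loss values and critic outputs for two mixed-speaker samples.
total = mt_tts_loss(reconst_loss=0.5, adv_loss=0.25, critic_out_mixed=[0.5, 0.5])
print(total)
```

The critic is used only to compute this training loss, which is consistent with the inference cost staying the same as GANSpeech.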
➢ Experimental Evaluations
Experimental Conditions
Corpus (speaker encoder): CSJ [Maekawa03] (947 males & 470 females, 660h)
Corpus (TTS): "parallel100" subset of JVS [Takamichi+20] (49 males & 51 females, 22h, 100 sent./spkr.)
Feature dimensions: Mel-spectrogram: 80, Spkr. emb.: 256
Data split: Train/Validation/Test = 0.8/0.1/0.1; Seen spkrs.: 96, Unseen spkrs.: 4 (2 males & 2 females)
Vocoder (for 22,050 Hz): "generator_universal_model" of HiFi-GAN [Kong+20] (included in ming024's GitHub repository)
Compared methods:
– FS2: Multi-spkr. FastSpeech 2 [Ren+21]
– GAN: GANSpeech [Yang+21]
– Ours: Multi-task adv. training
Subjective Evaluation & Results
➢ Criterion: quality of synth. speech (Mean Opinion Score tests)
– Naturalness (MOS) & spkr. similarity (Degradation MOS)
➢ Results w/ 95% confidence intervals (50 listeners/test, 15 samples/listener)
          TTS for seen spkrs.          Voice cloning
Method    Naturalness   Similarity     Naturalness   Similarity
FS2       3.13±0.12     3.57±0.12      3.13±0.12     2.38±0.12
GAN       3.52±0.12     3.79±0.12      3.38±0.12     2.40±0.12
Ours      3.55±0.12     3.87±0.12      3.50±0.12     2.48±0.12
➢ GAN-based methods significantly improve quality of TTS for seen spkrs.
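One rough way to read the table is to check whether two 95% intervals overlap. The sketch below does this for the naturalness scores of seen speakers, copied from the table. Non-overlap is a conservative heuristic used here for illustration only; the statistical test actually used for the slides is not stated.

```python
# (mean, half-width of the 95% interval): naturalness, TTS for seen spkrs.
seen_naturalness = {
    "FS2": (3.13, 0.12),
    "GAN": (3.52, 0.12),
    "Ours": (3.55, 0.12),
}


def interval(mean, half_width):
    """Return the (lower, upper) bounds of a symmetric interval."""
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if the two intervals share at least one point."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


for name in ("GAN", "Ours"):
    overlap = intervals_overlap(seen_naturalness["FS2"], seen_naturalness[name])
    print(f"FS2 vs {name}: intervals overlap = {overlap}")
```

Under this reading, both GAN-based methods are separated from FS2, while GAN and Ours are not separated from each other for seen speakers.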
➢ Our MT algorithm overcomes degradation of quality in voice cloning (TTS for unseen spkrs.)
➢ There is still a large gap in spkr. similarity betw. TTS for seen spkrs. & voice cloning
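The size of the seen-to-cloning gap can be computed directly from the mean scores in the table. This is plain arithmetic on the reported means and ignores the confidence intervals; the dictionary layout and function name are illustrative.

```python
# Mean scores copied from the table: (naturalness, similarity).
scores = {
    "FS2": {"seen": (3.13, 3.57), "cloning": (3.13, 2.38)},
    "GAN": {"seen": (3.52, 3.79), "cloning": (3.38, 2.40)},
    "Ours": {"seen": (3.55, 3.87), "cloning": (3.50, 2.48)},
}


def seen_to_cloning_drop(row):
    """Return (naturalness drop, similarity drop) from seen spkrs. to cloning."""
    seen_nat, seen_sim = row["seen"]
    clone_nat, clone_sim = row["cloning"]
    return (round(seen_nat - clone_nat, 2), round(seen_sim - clone_sim, 2))


drops = {method: seen_to_cloning_drop(row) for method, row in scores.items()}
for method, (nat_drop, sim_drop) in drops.items():
    print(f"{method}: naturalness -{nat_drop:.2f}, similarity -{sim_drop:.2f}")
```

The naturalness drop shrinks from 0.14 (GAN) to 0.05 (Ours), while the similarity drop stays above 1 point for every method.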
Speech Samples (Voice Cloning)
[Audio samples: Ground-truth, FS2, GAN, and Ours for spkrs. jvs078 (male), jvs005 (male), jvs060 (female), and jvs010 (female)]
Other samples are available online!
Summary
➢ Purpose
– Improving performance of multi-spkr. neural TTS for voice cloning
➢ Proposed method
– Multi-task adversarial training (GANSpeech + ACAI regularization)
➢ Results of our method
– 1) improves naturalness & spkr. similarity over GANSpeech
– 2) has room for improvement for better spkr. similarity
➢ Future work
– Introducing sophisticated speaker generation framework [Stanton+22]
– Extending our method to multi-lingual TTS
Thank you for your attention!


nakai22apsipa_presentation.pdf

  • 1. Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-To-Speech The University of Tokyo, Japan. APSIPA ASC 2022 WedAM1-8-2 (SS04) Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
  • 2. /19 1 Overview: Multi-Speaker Neural Text-To-Speech ➢ Text-To-Speech (TTS) [Sagisaka+88] – Technology to artificially synthesize speech from given text ➢ Multi-speaker Neural TTS [Fan+15][Hojo+18] – Single Deep Neural Network (DNN) to generate multi-speakers' voices • Speaker embedding: conditional input to control speaker ID ➢ Voice cloning (e.g., [Arik+18]) – TTS of unseen speaker's voice with small amount of data Text-To-Speech (TTS) Text Speech Spkr. emb. Multi-spkr. neural TTS model
  • 3. /19 2 Research Outline ➢ Conventional algorithm: GAN*-based training – High-quality TTS by adversarial training of discriminator & generator – Poor generalization performance in voice cloning • TTS model cannot observe unseen speakers' voices in training... ➢ Proposed algorithm: Multi-task adversarial training – Primal task: GAN-based multi-speaker neural TTS training • Objective: feature reconstruction loss + adversarial loss – Secondary task: improving (pseudo) unseen speaker's TTS quality • Objective: loss to generate realistic voices of unseen speakers ➢ Results: High-quality voice cloning by our algorithm! *GAN: Generative Adversarial Network [Goodfellow+14]
  • 4. /19 3 Baseline 1: Multi-Speaker FastSpeech 2 (FS2) [Ren+21] Spkr. encoder ➢ Transfer-learning-based multi-speaker neural TTS [Jia+18] – 1. Pretrain spkr. encoder w/ spkr. verification task (e.g., GE2E* loss) – 2. Train FS2-based TTS model w/ pretrained spkr. encoder *GE2E: Generalized End-2-End [Wan+18] Extract spkr. emb. from reference speech (fixed during TTS training) Add variance information of speech
  • 5. /19 4 Baseline 2: GANSpeech [Yang+21] ➢ Overview: TTS model (generator) vs. JCU* discriminator – TTS model generates speech features from text & spkr. emb. – JCU discriminator classifies synth. / nat. from two kinds of inputs • Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb. TTS model 𝐺 ⋅ 𝐷S ⋅ 𝐷U ⋅ 𝐷C ⋅ or 0: Synth. 1: Nat. 0: Synth. 1: Nat. JCU discriminator Text Spkr. emb. Synth. Nat. *JCU: Joint Conditional & Unconditional [Zhang+18]
  • 6. /19 5 GANSpeech Algorithm: JCU Discriminator Update ➢ Objective: Discriminating synth. (0) / nat. (1) correctly – 𝐷S extracts shared features of nat. / synth. speech – 𝐷U learns general characteristic of speech – 𝐷C captures spkr.-specific characteristic of speech TTS model 𝐺 ⋅ 𝐷S ⋅ 𝐷U ⋅ 𝐷C ⋅ or 0: Synth. 1: Nat. 0: Synth. 1: Nat. JCU discriminator Text Spkr. emb. Synth. Nat. Disc. loss
  • 7. /19 6 GANSpeech Algorithm: TTS Model Update ➢ Objective: Generating speech & Deceiving JCU discriminator – Speech reconst. loss makes TTS model generate speech features • Phoneme duration, F0, energy, mel-spectrogram – Adv. loss causes JCU discriminator to misclassify synth. as nat. TTS model 𝐺 ⋅ 𝐷S ⋅ 𝐷U ⋅ 𝐷C ⋅ Speech reconst. loss 1: Synth. 1: Synth. JCU discriminator Text Spkr. emb. Synth. Nat. Adv. loss ➢ GAN = Distribution matching betw. nat. and synth. data – High-quality TTS for seen spkrs. included in training corpus – No guarantee to generalize TTS for unseen spkrs.
  • 8. /19 7 ➢ Proposed Method: Multi-Speaker Neural TTS based on Multi-Task Adversarial Training
  • 9. /19 8 Overview of Proposed Method ➢ Motivation: Diversifying spkr. variation during training to – Widen spkr. emb. distribution that TTS model can cover – Improve robustness of TTS model towards unseen speakers ➢ Idea: Adversarially Constrained Autoencoder Interpolation [Berthelot+19] – Architecture: Autoencoder w/ feature interpolation + Critic • Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data) • Autoencoder makes critic output 𝛼 = 0 for interpolated data Feature interpolation w/ 𝛼 We introduce this idea to GAN-based multi-speaker TTS
  • 10. /19 9 ➢ Overview: GANSpeech + ACAI-derived regularization – Encoder = spkr. encoder (fixed parameters during training) – Decoder = TTS model (i.e., Generator in GAN) – Critic 𝐶 = additional branch in Multi-Task (MT) discriminator Multi-Task Adversarial Training Algorithm TTS model 𝐺 ⋅ 𝐷S ⋅ 𝐷U ⋅ 𝐷C ⋅ or 0: Synth. 1: Nat. 0: Synth. 1: Nat. MT discriminator Text Mixed/Pure spkr. emb. Synth. Nat. 𝐶 ⋅ Spkr. enc. & interp. 𝛼: Mixed 0: Pure
  • 11. Proposed Algorithm: MT Discriminator Update
    ➢ Objective: discriminating synthetic / natural speech and mixed / pure speaker embeddings
      – Synthetic speech samples are generated from mixed or pure speaker embeddings
        • Coefficient: 𝛼 ~ 𝑈(0.0, 0.5); speaker pairs: shuffled within each mini-batch
      – Criterion for critic training: MSE between predicted and correct 𝛼
    [Figure: same layout as the algorithm overview. The discriminator loss is computed from 𝐷U(⋅) and 𝐷C(⋅) (0: synthetic, 1: natural), and the critic loss from 𝐶(⋅) (𝛼: mixed, 0: pure).]
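The mixing step and the critic criterion can be sketched in plain Python. Speaker embeddings are represented as lists of floats, each one is paired with a partner chosen by shuffling the mini-batch, and the pair is interpolated with 𝛼 ~ 𝑈(0.0, 0.5). The function names and the interpolation direction (weight 1 − 𝛼 on the original speaker, so that 𝛼 = 0 is a pure embedding) are assumptions for illustration.

```python
import random


def mix_speaker_embeddings(embeddings, rng):
    """Interpolate each embedding with a shuffled partner from the same mini-batch.

    Returns the mixed embeddings and the alpha used for each one.
    Note: the shuffle may pair a speaker with itself; this sketch does not handle that case.
    """
    partners = list(range(len(embeddings)))
    rng.shuffle(partners)
    mixed, alphas = [], []
    for i, j in enumerate(partners):
        alpha = rng.uniform(0.0, 0.5)  # alpha ~ U(0.0, 0.5)
        mixed.append([(1.0 - alpha) * a + alpha * b
                      for a, b in zip(embeddings[i], embeddings[j])])
        alphas.append(alpha)
    return mixed, alphas


def critic_loss(predicted_alphas, target_alphas):
    """MSE between the critic's predicted alpha and the correct alpha (0 for pure)."""
    errors = [(p - t) ** 2 for p, t in zip(predicted_alphas, target_alphas)]
    return sum(errors) / len(errors)


# Usage with a hypothetical mini-batch of three 2-dimensional embeddings.
rng = random.Random(0)
batch = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed, alphas = mix_speaker_embeddings(batch, rng)
```

Restricting 𝛼 to at most 0.5 keeps the original speaker dominant in every mixture.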
  • 12. Proposed Algorithm: TTS Update
    ➢ Objective: GANSpeech objective + ACAI loss
      – ACAI loss makes the critic output 0 for synthetic speech of mixed speakers
        • Regularization on TTS for (pseudo) unseen speakers
      – Computation time for inference is unchanged from GANSpeech
    [Figure: same layout as the MT discriminator update. The TTS model 𝐺(⋅) is trained with the speech reconstruction loss, the adversarial loss (target 1 for synthetic speech at 𝐷U(⋅) and 𝐷C(⋅)), and the ACAI loss (target 0 for mixed speaker embeddings at 𝐶(⋅)).]
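Written out, the TTS objective on this slide could take the following form. This is a sketch: the weights λ_adv and λ_ACAI, the squared-error form of the ACAI term, and the interpolation direction of the speaker embeddings s_i and s_j are assumptions, and the symbols t (text) and x̂_α (speech synthesized from a mixed embedding) are introduced here for illustration.

```latex
s_{\alpha} = (1 - \alpha)\, s_i + \alpha\, s_j, \qquad \hat{x}_{\alpha} = G(t, s_{\alpha})

\mathcal{L}_{G} = \mathcal{L}_{\mathrm{reconst}}
  + \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{adv}}
  + \lambda_{\mathrm{ACAI}}\, \mathbb{E}_{t, \alpha}\!\left[ C(\hat{x}_{\alpha})^2 \right]
```

The discriminator and the critic are used only during training, which is why the extra term adds no cost at inference.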
  • 14. Experimental Conditions
    Corpus (speaker encoder):  CSJ [Maekawa03] (947 males & 470 females, 660 h)
    Corpus (TTS):              "parallel100" subset of JVS [Takamichi+20] (49 males & 51 females, 22 h, 100 sentences/speaker)
    Feature dimensions:        Mel-spectrogram: 80, speaker embedding: 256
    Data split:                Train/Validation/Test = 0.8/0.1/0.1; seen speakers: 96, unseen speakers: 4 (2 males & 2 females)
    Vocoder (for 22,050 Hz):   "generator_universal_model" of HiFi-GAN [Kong+20] (included in ming024's GitHub repository)
    Compared methods:          FS2: multi-speaker FastSpeech 2 [Ren+21]; GAN: GANSpeech [Yang+21]; Ours: multi-task adversarial training
  • 15. Subjective Evaluation & Results
    ➢ Criterion: quality of synthetic speech (Mean Opinion Score tests)
      – Naturalness (MOS) & speaker similarity (Degradation MOS)
    ➢ Results with 95% confidence intervals (50 listeners/test, 15 samples/listener)
                TTS for seen speakers          Voice cloning
      Method    Naturalness    Similarity      Naturalness    Similarity
      FS2       3.13±0.12      3.57±0.12       3.13±0.12      2.38±0.12
      GAN       3.52±0.12      3.79±0.12       3.38±0.12      2.40±0.12
      Ours      3.55±0.12      3.87±0.12       3.50±0.12      2.48±0.12
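Each table entry is a mean score with a 95% interval half-width. The slides do not state how the intervals were computed; the sketch below assumes a normal approximation (mean ± 1.96 × standard error) and uses a hypothetical function name and hypothetical ratings.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean opinion score, half-width of the interval).

    Assumes a normal approximation: half-width = z * sample std / sqrt(n).
    """
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Usage with hypothetical 5-point ratings (not data from the experiment).
ratings = [1, 2, 3, 4, 5]
mos, hw = mos_with_ci(ratings)
print(f"{mos:.2f}±{hw:.2f}")
```

Under this approximation, two methods whose intervals do not overlap (for example FS2 and GAN on naturalness for seen speakers) differ by more than their combined half-widths.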
  • 16. Subjective Evaluation & Results (same table as slide 15)
    ➢ GAN-based methods significantly improve the quality of TTS for seen speakers.
  • 17. Subjective Evaluation & Results (same table as slide 15)
    ➢ Our MT algorithm overcomes the quality degradation in voice cloning (TTS for unseen speakers): its naturalness is 3.50 for voice cloning vs. 3.55 for seen speakers.
  • 18. Subjective Evaluation & Results (same table as slide 15)
    ➢ There is still a large gap in speaker similarity between TTS for seen speakers and voice cloning.
  • 19. Speech Samples (Voice Cloning)
    Methods: Ground-truth, FS2, GAN, Ours
    Speakers: jvs078 (male), jvs005 (male), jvs060 (female), jvs010 (female)
    Other samples are available online.
  • 20. Summary
    ➢ Purpose
      – Improving performance of multi-speaker neural TTS for voice cloning
    ➢ Proposed method
      – Multi-task adversarial training (GANSpeech + ACAI regularization)
    ➢ Results: our method
      – 1) improves naturalness & speaker similarity over GANSpeech
      – 2) has room for improvement in speaker similarity
    ➢ Future work
      – Introducing a sophisticated speaker generation framework [Stanton+22]
      – Extending our method to multi-lingual TTS
    Thank you for your attention!