ESPnet-TTS: Unified, Reproducible,
and Integratable Open Source
End-to-End Text-to-Speech Toolkit
Tomoki Hayashi (@kan-bayashi)1,2,
Ryuichi Yamamoto3, Katsuki Inoue4,
Takenori Yoshimura1,2, Shinji Watanabe5,
Tomoki Toda1, Kazuya Takeda1, Yu Zhang6, Xu Tan7
1Nagoya University, 2Human Dataware lab. Co., Ltd.,
3LINE Corp., 4Okayama University, 5Johns Hopkins University,
6Google AI, 7Microsoft Research
Background
p The era of End-to-End Text-to-Speech (E2E-TTS)
p Various advantages of E2E-TTS
n Require no language-dependent expert knowledge
n Require no alignment between text and speech
p More and more new research ideas
n Style control / Multi-speaker / Multi-lingual / etc...
ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT 2
[Figure: E2E-TTS pipeline: text ("Hello, world!") → Text2Mel → Mel2Wav neural networks → speech]
We definitely need to accelerate the research
and prepare a comparable baseline!
We introduce ESPnet-TTS,
a new open-source toolkit for E2E-TTS
What is ESPnet-TTS?
p Open-source E2E-TTS toolkit
n Apache 2.0 license / PyTorch as the main network engine
p Developed for the research community
n Easy to reproduce state-of-the-art models
n Can be used as a baseline for performance comparison
1. Support of various Text2Mel models
n Include autoregressive (AR), non-AR, and multi-spk models
2. Support of various Mel2Wav models
n Include both AR and the latest non-AR models
3. Unified and reproducible kaldi-style recipes
n Support 10+ recipes including En, Jp, Zh, and more
n Provide pretrained models of all recipes
n Integratable with ASR functions
ESPnet-TTS functions (an extension of ESPnet)
Supported Text2Mel models
[Figure: pipeline with the Text2Mel stage highlighted]
[Figure: supported Text2Mel architectures]
n Tacotron 2 [Shen+, 2018] (autoregressive):
input sequence → CNN+BLSTM encoder → attention → prenet + LSTM decoder → postnet → next output
n Transformer-TTS [Li+, 2018] (autoregressive):
input sequence → encoder prenet + positional encoding → Transformer encoder → Transformer decoder (with decoder prenet + positional encoding) → postnet → next output
n FastSpeech [Ren+, 2019] (non-autoregressive):
input sequence → embedding + positional encoding → Transformer encoder → duration predictor + length regulator → positional encoding → Transformer decoder → output sequence
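FastSpeech's length regulator simply expands each encoder state by its predicted duration so the decoder can run in parallel over frames. A minimal NumPy sketch of the idea (illustrative only, not ESPnet's implementation):

```python
import numpy as np

def length_regulator(encoder_states, durations):
    """Expand each encoder state by its predicted duration (in frames).

    encoder_states: (T_text, D) array of per-token hidden states
    durations:      (T_text,) integer frame counts from the duration predictor
    returns:        (sum(durations), D) frame-level sequence for the decoder
    """
    return np.repeat(encoder_states, durations, axis=0)

# 3 tokens with durations 2, 1, 3 -> 6 decoder frames
h = np.arange(6, dtype=np.float32).reshape(3, 2)
out = length_regulator(h, np.array([2, 1, 3]))
print(out.shape)  # (6, 2)
```

Because the output length is fixed by the durations, no autoregressive stop-token prediction is needed.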
Multi-speaker extension (1)
p Extension with a pretrained speaker embedding
n Use X-vectors [Snyder+, 2018] trained on the VoxCeleb corpus
[Figure: Tacotron 2 [Shen+, 2018] vs. multi-speaker Tacotron 2 [Jia+, 2018]: a pretrained X-vector extractor embeds the reference audio, and the resulting vector is added to or concatenated with the encoder outputs]
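The add/concat fusion of a fixed speaker embedding with the encoder outputs can be sketched as follows (NumPy, illustrative only; `condition_on_speaker` is a hypothetical helper, not an ESPnet API):

```python
import numpy as np

def condition_on_speaker(encoder_states, spk_embedding, mode="concat"):
    """Fuse an utterance-level speaker vector (e.g. an x-vector) with
    token-level encoder states.

    encoder_states: (T, D) hidden states
    spk_embedding:  (S,) speaker vector
    mode: "add" requires S == D and keeps shape (T, D);
          "concat" yields shape (T, D + S)
    """
    tiled = np.tile(spk_embedding, (encoder_states.shape[0], 1))
    if mode == "add":
        return encoder_states + tiled
    return np.concatenate([encoder_states, tiled], axis=1)

h = np.random.randn(5, 4).astype(np.float32)   # 5 encoder frames, dim 4
xvec = np.random.randn(4).astype(np.float32)   # pretend x-vector
print(condition_on_speaker(h, xvec, mode="add").shape)     # (5, 4)
print(condition_on_speaker(h, xvec, mode="concat").shape)  # (5, 8)
```

In practice a projection layer usually maps the concatenated features back to the decoder's expected dimension.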
Multi-speaker extension (2)
p Extension with a pretrained speaker embedding
n Apply the same idea to the other models: multi-speaker Transformer-TTS and multi-speaker FastSpeech (※ experimental)
[Figure: multi-speaker Transformer-TTS and multi-speaker FastSpeech, each conditioned on a pretrained X-vector extracted from reference audio (add / concat)]
Supported Mel2Wav models
[Figure: pipeline with the Mel2Wav stage highlighted]
[Figure: supported Mel2Wav architectures]
n WaveNet [Oord+, 2016] (autoregressive):
mel spectrogram → upsampling network; previous waveform samples → deep causal dilated CNN → posterior → sampling of the next waveform sample
(both Mixture of Logistics (MoL) and softmax outputs are supported)
n Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive):
noise sequence + upsampled mel spectrogram → deep dilated CNN → waveform sequence
n MelGAN [Kumar+, 2019] (non-autoregressive):
mel spectrogram → upsampling deep CNN → waveform sequence
Combinations of these GAN-based models are also supported.
For Parallel WaveGAN [Yamamoto+, 2020], please check Ryuichi's presentation at this ICASSP.
Other remarkable functions
p Dynamic batch size to maximize GPU utilization
n Change the batch size dynamically according to sequence length
p Gradient accumulation
n Effectively increase the batch size even on a single GPU
p Guided attention loss [Tachibana+, 2017]
n Constrain the attention weights to be diagonal
p Attention constraint decoding [Ping+, 2017]
n Decode long input sentences stably
p Forward attention [Zhang+, 2018]
n Attention mechanism with causal regularization
p CBHG [Wang+, 2017]
n Upsample the frequency resolution
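As an illustration, the guided attention loss penalizes attention weights that stray from a diagonal text-speech alignment. A minimal NumPy sketch following [Tachibana+, 2017] (not ESPnet's exact implementation):

```python
import numpy as np

def guided_attention_loss(att, g=0.2):
    """Guided attention loss: weight each attention cell by how far it is
    from the diagonal, then average.

    att: (T_out, T_in) attention weight matrix
    g:   width of the allowed diagonal band
    """
    T_out, T_in = att.shape
    n = np.arange(T_in) / T_in    # normalized input positions
    t = np.arange(T_out) / T_out  # normalized output positions
    # Penalty is ~0 on the diagonal and grows toward 1 away from it
    W = 1.0 - np.exp(-((n[None, :] - t[:, None]) ** 2) / (2 * g ** 2))
    return float(np.mean(att * W))

diagonal = np.eye(8)              # perfectly diagonal attention
uniform = np.full((8, 8), 1 / 8)  # diffuse attention
assert guided_attention_loss(diagonal) < guided_attention_loss(uniform)
```

The loss is added to the usual spectrogram reconstruction loss, which speeds up attention learning considerably for Tacotron-style models.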
ESPnet-TTS
recipes
ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT 14
Unified, reproducible recipe
p All-in-one Kaldi-style recipe
n Include all procedures needed to reproduce the results
n Share a unified design for both ASR and TTS recipes
The same data format for ASR and TTS recipes
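As a sketch of the Kaldi-style data format shared by the recipes, each file maps an utterance ID to a value (the utterance ID, path, and speaker ID below are hypothetical):

```python
import tempfile
from pathlib import Path

# Build a minimal Kaldi-style data directory
data = Path(tempfile.mkdtemp()) / "data" / "train"
data.mkdir(parents=True)

# wav.scp: "<utt-id> <path-to-audio>"
(data / "wav.scp").write_text("utt0001 /path/to/utt0001.wav\n")
# text: "<utt-id> <transcription>"
(data / "text").write_text("utt0001 HELLO WORLD\n")
# utt2spk: "<utt-id> <speaker-id>"
(data / "utt2spk").write_text("utt0001 spk01\n")

print(sorted(p.name for p in data.iterdir()))  # ['text', 'utt2spk', 'wav.scp']
```

Because ASR and TTS recipes read the same files, swapping which side is the input and which is the target is what makes the recipes interconvertible.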
ASR and TTS recipes can be converted to each other
Supported recipes
p Support 10+ recipes covering 10 languages

Corpus name | Lang | Recipe type
Arctic | En | Adaptation
Blizzard 2017 | En | Single
CSMSC | Zh | Single
JNAS | Jp | Multi
JVS | Jp | Adaptation
JSUT | Jp | Single
LibriTTS | En | Multi
LJSpeech | En | Single
M-AILABS | En, De, Fr, Es, Pl, Uk, Ru | Single
TWEB | En | Single
VAIS1000 | Vi | Single
We provide pretrained models of all recipes
Integration with ASR
p ASR-based evaluation for TTS
n Automatically detect deletions or repetitions of words
p Advanced recipes combining TTS with ASR
n ASR-TTS cycle-consistency training [Karthick+, 2019]
n Semi-supervised ASR-TTS training [Karita+, 2019]
n Non-parallel voice conversion
l Cascade ASR + TTS system
l VCC2020 baseline system (http://www.vc-challenge.org/)
We can combine TTS with ASR
for development and new research ideas
※Not merged yet
ESPnet-TTS
performance
Experimental condition
p Evaluation with the LJSpeech dataset
n #Training 12,600 / #validation 250 / #evaluation 250
p Comparison methods (input type, [attention type])
n Tacotron 2 (Char, Location)
n Tacotron 2 (Char, Forward)
n Transformer (Char)
n Transformer (Phoneme)
n FastSpeech (Char)※1
n FastSpeech (Phoneme)※1
p Comparison with other toolkits
n CSTR/Merlin: conventional TTS + WORLD [Morise+, 2016]
n NVIDIA/tacotron2: pretrained※2 Tacotron 2 + WaveGlow
n Mozilla/TTS: pretrained※2 Tacotron 2 + WaveRNN
※1 We did not use knowledge distillation.
※2 Data split is different; the evaluation samples might be included in the training data.
The same MoL-WaveNet trained with natural features is used for all methods.
Objective evaluation (CER)
p Character error rate (CER)
n ASR model: Transformer trained on LibriSpeech

Method | Sub [%] | Del [%] | Ins [%] | CER [%]
Tacotron 2 (Char, Forward) | 0.4 | 1.0 | 3.6※ | 5.0
Tacotron 2 (Char, Location) | 0.5 | 1.2 | 0.3 | 2.1
Transformer (Char) | 0.6 | 1.7 | 0.5 | 2.8
Transformer (Phoneme) | 0.5 | 1.8 | 0.5 | 2.8
FastSpeech (Char) | 0.3 | 0.9 | 0.3 | 1.6
FastSpeech (Phoneme) | 0.4 | 1.3 | 0.4 | 2.1
Groundtruth (Raw) | 0.3 | 0.7 | 0.3 | 1.3
※ Only one sample failed to stop generation
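The CER used above decomposes into substitutions, deletions, and insertions from an edit-distance alignment between the reference text and the ASR transcript. A minimal sketch of the metric itself:

```python
def cer(ref, hyp):
    """Character error rate via Levenshtein distance:
    (substitutions + deletions + insertions) / len(ref)."""
    r, h = list(ref), list(hyp)
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(round(cer("hello world", "helo world"), 3))  # 0.091 (1 deletion / 11 chars)
```

Running ASR over synthesized speech and scoring CER against the input text gives an automatic proxy for intelligibility failures such as skipped or repeated words.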
Tacotron 2 is more robust than Transformer-TTS
FastSpeech is the most robust
thanks to non-AR architecture
Objective evaluation (RTF)
p Real-time factor (RTF) of Char-based models
n Measure the speed of the Text2Mel part only
n GPU: Titan V / CPU: Xeon Gold 6154 3 GHz x 16 threads

Method | RTF on CPU | RTF on GPU
Tacotron 2 (Forward) | 0.216 ± 0.016 | 0.104 ± 0.006
Tacotron 2 (Location) | 0.225 ± 0.016 | 0.094 ± 0.009
Transformer | 0.851 ± 0.076 | 0.634 ± 0.025
FastSpeech | 0.015 ± 0.005 | 0.003 ± 0.004
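RTF is wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A minimal sketch of the measurement (the dummy synthesizer is hypothetical):

```python
import time

def real_time_factor(synthesis_fn, audio_seconds):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    synthesis_fn()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Dummy "synthesizer" that takes ~10 ms to produce 1 s of audio
rtf = real_time_factor(lambda: time.sleep(0.01), audio_seconds=1.0)
print(rtf < 1.0)  # True
```

In practice RTF is averaged over many utterances (hence the ± values in the table), since per-utterance timings vary with sequence length.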
Tacotron 2 is faster than Transformer-TTS
FastSpeech is much faster than real-time
thanks to non-AR architecture
p (For reference) RTF of non-AR Mel2Wav models

Method | RTF on CPU | RTF on GPU
Parallel WaveGAN | 0.734 | 0.016
MelGAN | 0.137 | 0.002
Subjective evaluation (MOS)
p Mean opinion score (MOS) on naturalness
n #subjects = 101 @ Amazon Mechanical Turk

Method | MOS (± 95% CI)
Tacotron 2 (Char, Forward) | 4.14 ± 0.06
Tacotron 2 (Char, Location) | 4.20 ± 0.06
Transformer (Char) | 4.17 ± 0.06
Transformer (Phoneme) | 4.25 ± 0.06
CSTR/Merlin | 2.69 ± 0.09
NVIDIA/tacotron2※ | 4.21 ± 0.06
Mozilla/TTS※ | 3.91 ± 0.07
Groundtruth (Raw) | 4.46 ± 0.05

Please check the samples via the QR code!
Tacotron 2 and Transformer-TTS have
almost the same performance
Our best model can achieve the performance
comparable to state-of-the-art
※ The evaluation samples might be included in training data.
Demonstration
p Demo notebooks on Google Colab
1. E2E-TTS real-time demonstration: https://bit.ly/2Vex0Iw
n Generate your favorite sentence in En, Jp, Zh!
2. E2E-TTS recipe tutorial: https://bit.ly/3bhv0ow
n Learn the TTS recipe flow online!
Closing
p Conclusion
n Introduced the open-source toolkit ESPnet-TTS
l Developed for the research community
l Makes E2E-TTS more user-friendly
l Accelerates research in this field
n Provides various Text2Mel and Mel2Wav models
n Provides reproducible recipes covering various languages
n Achieved performance comparable to the state of the art
We always welcome your feature requests and pull requests!

Weitere ähnliche Inhalte

Was ist angesagt?

“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
Edge AI and Vision Alliance
 

Was ist angesagt? (20)

"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P..."Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
 
The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems
 
Intro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning PresentationIntro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning Presentation
 
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
 
AUToSAR introduction
AUToSAR introductionAUToSAR introduction
AUToSAR introduction
 
ML-Ops: Philosophy, Best-Practices and Tools
ML-Ops:Philosophy, Best-Practices and ToolsML-Ops:Philosophy, Best-Practices and Tools
ML-Ops: Philosophy, Best-Practices and Tools
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
 
Onnx and onnx runtime
Onnx and onnx runtimeOnnx and onnx runtime
Onnx and onnx runtime
 
“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
“Autonomous Driving AI Workloads: Technology Trends and Optimization Strategi...
 
Webinar: Microcontroladores Infineon ARM: PSoC e Traveo II para aplicações au...
Webinar: Microcontroladores Infineon ARM: PSoC e Traveo II para aplicações au...Webinar: Microcontroladores Infineon ARM: PSoC e Traveo II para aplicações au...
Webinar: Microcontroladores Infineon ARM: PSoC e Traveo II para aplicações au...
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
 
System of systems modeling with Capella
System of systems modeling with CapellaSystem of systems modeling with Capella
System of systems modeling with Capella
 
Introduction to VP8
Introduction to VP8Introduction to VP8
Introduction to VP8
 
Ultrasound Sensing Technologies 2020
Ultrasound Sensing Technologies 2020Ultrasound Sensing Technologies 2020
Ultrasound Sensing Technologies 2020
 
Seq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) modelSeq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) model
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Word Sense Disambiguation and Induction
Word Sense Disambiguation and InductionWord Sense Disambiguation and Induction
Word Sense Disambiguation and Induction
 
Autosar Basics hand book_v1
Autosar Basics  hand book_v1Autosar Basics  hand book_v1
Autosar Basics hand book_v1
 
Simulink Stateflow workshop
 Simulink Stateflow workshop Simulink Stateflow workshop
Simulink Stateflow workshop
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 

Ähnlich wie ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

Ähnlich wie ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit (20)

Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
 
SP Study1018 Paper Reading
SP Study1018 Paper ReadingSP Study1018 Paper Reading
SP Study1018 Paper Reading
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation System
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
 
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
 
IRJET - Storytelling App for Children with Hearing Impairment using Natur...
IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...
IRJET - Storytelling App for Children with Hearing Impairment using Natur...
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
 
Lenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesisLenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesis
 
Utilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech TranslationUtilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech Translation
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
 
IRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text DetectionIRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text Detection
 
Recent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP ApproachesRecent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP Approaches
 

Mehr von Tomoki Hayashi

Mehr von Tomoki Hayashi (7)

複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
 
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
 
イベント区間検出統合型 BLSTM-HMMハイブリッドモデルによる 多重音響イベント検出
イベント区間検出統合型 BLSTM-HMMハイブリッドモデルによる 多重音響イベント検出イベント区間検出統合型 BLSTM-HMMハイブリッドモデルによる 多重音響イベント検出
イベント区間検出統合型 BLSTM-HMMハイブリッドモデルによる 多重音響イベント検出
 
形態素解析も辞書も言語モデルもいらないend-to-end音声認識
形態素解析も辞書も言語モデルもいらないend-to-end音声認識形態素解析も辞書も言語モデルもいらないend-to-end音声認識
形態素解析も辞書も言語モデルもいらないend-to-end音声認識
 
PRML 5章 PP.227-PP.247
PRML 5章 PP.227-PP.247PRML 5章 PP.227-PP.247
PRML 5章 PP.227-PP.247
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
 
Prediction of Exchange Rate Using Deep Neural Network
Prediction of Exchange Rate Using Deep Neural Network  Prediction of Exchange Rate Using Deep Neural Network
Prediction of Exchange Rate Using Deep Neural Network
 

Kürzlich hochgeladen

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 

Kürzlich hochgeladen (20)

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
n Include both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
n Support 10+ recipes covering En, Jp, Zh, and more
n Provide pretrained models for all recipes
n Integratable with ASR functions (extension of ESPnet)
ESPnet-TTS functions
Supported Text2Mel models
p Text2Mel is the first half of the pipeline: it converts the input text into a mel spectrogram, which Mel2Wav then converts to speech
p Autoregressive (AR) models
n Tacotron 2 [Shen+, 2018]: CNN+BLSTM encoder, attention, and LSTM decoder with prenet and postnet
n Transformer-TTS [Li+, 2018]: Transformer encoder and decoder with positional encoding, encoder/decoder prenets, and postnet
p Non-autoregressive model
n FastSpeech [Ren+, 2019]: Transformer encoder and decoder with a duration predictor and length regulator
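The length regulator is what makes FastSpeech non-autoregressive: each encoder state is simply repeated according to its predicted duration so the expanded sequence already has mel-frame length. A minimal NumPy sketch of that expansion (function and variable names are illustrative, not ESPnet's actual API):

```python
import numpy as np

def length_regulator(encoder_out, durations):
    """Expand encoder states along time according to per-token durations.

    encoder_out: (T_text, D) array of encoder hidden states.
    durations: (T_text,) integer number of mel frames per token.
    Returns a (sum(durations), D) array aligned with the mel frames.
    """
    return np.repeat(encoder_out, durations, axis=0)

# Toy example: 3 tokens with durations [2, 1, 3] -> 6 mel frames.
h = np.arange(6, dtype=float).reshape(3, 2)
expanded = length_regulator(h, np.array([2, 1, 3]))
```

Because the whole output length is known up front, the decoder can then generate all frames in parallel instead of one step at a time.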
Multi-speaker extension (1)
p Extension with a pretrained speaker embedding
n Use the x-vector [Snyder+, 2018] trained on the VoxCeleb corpus
n Multi-speaker Tacotron 2 [Jia+, 2018]: the x-vector extracted from a reference audio by the pretrained extractor is added to or concatenated with the encoder outputs of Tacotron 2 [Shen+, 2018]
Multi-speaker extension (2)
p Extension with a pretrained speaker embedding
n Apply the same idea to the other models
n Multi-speaker Transformer-TTS and multi-speaker FastSpeech (※EXPERIMENTAL): the pretrained x-vector from a reference audio is added or concatenated in the same way
Supported Mel2Wav models
p Mel2Wav is the second half of the pipeline: it converts the mel spectrogram into a waveform
p Autoregressive model
n WaveNet [Oord+, 2016]: deep causal dilated CNN with an upsampling network; both Mixture of Logistics (MoL) and Softmax outputs are supported
p Non-autoregressive models
n Parallel WaveGAN [Yamamoto+, 2020]: deep dilated CNN transforming a noise sequence, with an upsampling network
n MelGAN [Kumar+, 2019]: upsampling deep CNN
n Combinations of these GAN-based models are also supported
p For Parallel WaveGAN [Yamamoto+, 2020], please check Ryuichi's presentation at this ICASSP
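Both WaveNet and Parallel WaveGAN rely on stacks of dilated convolutions to cover a long audio context with few layers: each layer with kernel size k and dilation d adds (k-1)·d samples of receptive field. A small sketch of that arithmetic, using an exponentially growing WaveNet-style dilation schedule for illustration rather than the exact configurations used in the toolkit:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated convolutions."""
    return sum((kernel_size - 1) * d for d in dilations) + 1

# WaveNet-style schedule: dilations 1, 2, ..., 512, repeated over 3 blocks,
# with a causal kernel of size 2.
dilations = [2 ** i for i in range(10)] * 3
rf = receptive_field(2, dilations)
```

The exponential schedule is the point: 30 layers reach a receptive field of thousands of samples, which a non-dilated stack of the same depth could not.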
Other remarkable functions
p Dynamic batch size to maximize GPU utilization
n Change the batch size dynamically according to the sequence length
p Gradient accumulation
n Pseudo-increase the batch size even with a single GPU
p Guided attention loss [Tachibana+, 2017]
n Constrain the attention weights to be diagonal
p Attention constraint decoding [Ping+, 2017]
n Decode stably even with long input sentences
p Forward attention [Zhang+, 2018]
n Attention mechanism with causal regularization
p CBHG [Wang+, 2017]
n Upsample the frequency resolution
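Of these, the guided attention loss has a compact closed form: with normalized input position n/N and output position t/T, the penalty weight is W(n, t) = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), so attention mass far from the diagonal is penalized while the diagonal costs nothing. A minimal NumPy sketch (g = 0.2 follows the paper's default; names are illustrative):

```python
import numpy as np

def guided_attention_loss(att, g=0.2):
    """Penalize attention mass far from the diagonal [Tachibana+, 2017].

    att: (T_out, T_in) attention weight matrix (rows sum to 1).
    Returns the mean of att * W, where W soft-masks off-diagonal positions.
    """
    T_out, T_in = att.shape
    n = np.arange(T_in)[None, :] / T_in    # normalized input positions
    t = np.arange(T_out)[:, None] / T_out  # normalized output positions
    W = 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))
    return float((att * W).mean())
```

A perfectly diagonal alignment incurs zero loss, while a reversed (anti-diagonal) one is penalized, which is exactly the prior that text and speech proceed monotonically together.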
ESPnet-TTS recipes
Unified, reproducible recipes
p All-in-one Kaldi-style recipes
n Include all procedures needed to reproduce the results
n Have a unified design for both ASR and TTS recipes
n The same data format is used for ASR and TTS recipes
n ASR and TTS recipes can therefore be converted to each other
Supported recipes
p Support 10+ recipes covering 10 languages

Corpus name    Lang                        Recipe type
Arctic         En                          Adaptation
Blizzard 2017  En                          Single
CSMSC          Zh                          Single
JNAS           Jp                          Multi
JVS            Jp                          Adaptation
JSUT           Jp                          Single
LibriTTS       En                          Multi
LJSpeech       En                          Single
M-AILABS       En, De, Fr, Es, Pl, Uk, Ru  Single
TWEB           En                          Single
VAIS1000       Vi                          Single

p We provide pretrained models for all recipes
Integration with ASR
p ASR-based evaluation for TTS
n Automatically check for deletions or repetitions of words
p Advanced recipes combining TTS with ASR
n ASR-TTS cycle-consistency training [Karthick+, 2019]
n Semi-supervised ASR-TTS training [Karita+, 2019]
n Non-parallel voice conversion
l Cascaded ASR + TTS system
l VCC2020 baseline system (http://www.vc-challenge.org/)
p We can combine TTS with ASR for development and new research ideas
※Not merged yet
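The ASR-based evaluation boils down to decoding the synthesized speech with an ASR model and computing the character error rate against the input text: the edit distance (substitutions + deletions + insertions) divided by the reference length, so dropped or repeated words show up as errors. A self-contained sketch of that metric (the actual scoring tooling also reports the three error types separately; this computes only the total, and assumes a non-empty reference):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i                       # prev holds D[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,       # deletion
                                   d[j - 1] + 1,   # insertion
                                   prev + (r != h))  # substitution / match
    return d[-1]

def cer(ref, hyp):
    """Character error rate: total edits normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

Because the reference text is known exactly (it is the TTS input), this check is fully automatic and needs no human listeners.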
ESPnet-TTS performance
Experimental condition
p Evaluation with the LJSpeech dataset
n #Training 12,600 / #validation 250 / #evaluation 250
p Comparison methods (input type, [attention type])
n Tacotron 2 (Char, Forward) / Tacotron 2 (Char, Location)
n Transformer (Char) / Transformer (Phoneme)
n FastSpeech (Char)※1 / FastSpeech (Phoneme)※1
n The same MoL-WaveNet trained with natural features is used for all
p Comparison with other toolkits
n CSTR/Merlin: conventional TTS + WORLD [Morise+, 2016]
n NVIDIA/tacotron2: pretrained※2 Tacotron 2 + WaveGlow
n Mozilla/TTS: pretrained※2 Tacotron 2 + WaveRNN
※1 We did not use knowledge distillation
※2 Data split is different; the evaluation samples might be included in the training data
Objective evaluation (CER)
p Character error rate (CER)
n ASR model: Transformer trained on Librispeech

Method                       Sub [%]  Del [%]  Ins [%]  CER [%]
Tacotron 2 (Char, Forward)   0.4      1.0      3.6※     5.0
Tacotron 2 (Char, Location)  0.5      1.2      0.3      2.1
Transformer (Char)           0.6      1.7      0.5      2.8
Transformer (Phoneme)        0.5      1.8      0.5      2.8
FastSpeech (Char)            0.3      0.9      0.3      1.6
FastSpeech (Phoneme)         0.4      1.3      0.4      2.1
Groundtruth (Raw)            0.3      0.7      0.3      1.3

※Only one sample failed to stop the generation
p Tacotron 2 is more robust than Transformer-TTS
p FastSpeech is the most robust thanks to its non-AR architecture
Objective evaluation (RTF)
p Real-time factor (RTF) of Char-based models
n Calculated for the Text2Mel part only
n GPU: Titan V / CPU: Xeon Gold 6154 3 GHz x 16 threads

Method                 RTF on CPU     RTF on GPU
Tacotron 2 (Forward)   0.216 ± 0.016  0.104 ± 0.006
Tacotron 2 (Location)  0.225 ± 0.016  0.094 ± 0.009
Transformer            0.851 ± 0.076  0.634 ± 0.025
FastSpeech             0.015 ± 0.005  0.003 ± 0.004
p Tacotron 2 is faster than Transformer-TTS
p FastSpeech is much faster than real time thanks to its non-AR architecture
p (For reference) RTF of non-AR Mel2Wav models

Method            RTF on CPU  RTF on GPU
Parallel WaveGAN  0.734       0.016
MelGAN            0.137       0.002
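The RTF values above follow the usual definition: wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A trivial helper illustrating the computation (22 050 Hz is LJSpeech's sampling rate):

```python
def real_time_factor(synthesis_seconds, n_samples, sample_rate=22050):
    """RTF = wall-clock generation time / duration of the generated audio.

    RTF < 1.0 means the model runs faster than real time.
    """
    return synthesis_seconds / (n_samples / sample_rate)

# Example: 0.5 s to generate 2 s of audio at 22.05 kHz.
rtf = real_time_factor(0.5, 2 * 22050)
```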
Subjective evaluation (MOS)
p Mean opinion score (MOS) on naturalness
n #subjects = 101 @ Amazon Mechanical Turk

Method                       MOS (± 95% CI)
Tacotron 2 (Char, Forward)   4.14 ± 0.06
Tacotron 2 (Char, Location)  4.20 ± 0.06
Transformer (Char)           4.17 ± 0.06
Transformer (Phoneme)        4.25 ± 0.06
CSTR/Merlin                  2.69 ± 0.09
NVIDIA/tacotron2※            4.21 ± 0.06
Mozilla/TTS※                 3.91 ± 0.07
Groundtruth (Raw)            4.46 ± 0.05

p Please check the samples from the QR code!
p Tacotron 2 and Transformer-TTS have almost the same performance
p Our best model achieves performance comparable to the state of the art
※The evaluation samples might be included in the training data
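The ± values in the MOS table are 95% confidence intervals, which under the usual normal approximation are z·s/√n with z ≈ 1.96 and s the sample standard deviation of the ratings. A short sketch of that computation (an illustration of the standard formula, not the authors' actual scoring script):

```python
import math

def mos_with_ci(scores, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval.

    Returns (mean, half-width), so the result reads as mean ± half-width.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

mean, half = mos_with_ci([3, 4, 5, 4, 4])
```

With roughly a hundred raters per system, the ±0.06 intervals in the table are tight enough to separate the toolkit comparisons but not the top ESPnet-TTS models from one another.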
Demonstration
p Demo notebooks with Google Colab
1. E2E-TTS real-time demonstration: https://bit.ly/2Vex0Iw
n You can generate your favorite sentence in En, Jp, Zh!
2. E2E-TTS recipe tutorial: https://bit.ly/3bhv0ow
n You can learn the TTS recipe flow online!
Closing
p Conclusion
n Introduced the open-source toolkit ESPnet-TTS
l Developed for the research community
l Makes E2E-TTS more user-friendly
l Accelerates the research in this field
n Provides various Text2Mel and Mel2Wav models
n Provides reproducible recipes covering various languages
n Achieved performance comparable to the state of the art
p We always welcome your feature requests and pull requests!