Modern Text-to-Speech
Systems Review
Lenar Gabdrakhmanov
ML Engineer at Provectus
Table of contents
1. Introduction to TTS
2. Synthesized speech quality evaluation
3. WaveNet (2016-09-12)
4. Fast WaveNet (2016-11-29)
5. Deep Voice (2017-02-25)
6. Tacotron (2017-03-29)
7. Deep Voice 2 (2017-05-24)
8. TTS + ASR = Speech Chain (2017-07-16)
9. Deep Voice 3 (2017-10-20)
10. Tacotron 2 (2017-12-16)
Table of contents (2)
11. GST-Tacotron (2018-03-23)
12. Transfer Learning Tacotron (2018-06-12)
13. Transformer TTS (2018-09-19)
14. RUSLAN: Russian Spoken Language Corpus For Speech Synthesis
15. References
Introduction to Text-to-Speech
Two classical speech synthesis techniques:
1. Concatenative (unit selection);
2. Statistical parametric.
Introduction to Text-to-Speech
Synthesized speech quality evaluation
● Mean Opinion Score (MOS) is the most widely used subjective measure of
speech quality;
● Speech quality is rated on a 5-point scale:
○ 1 - bad quality;
○ 5 - excellent quality.
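The MOS values quoted throughout these slides come from averaging such listener ratings and reporting a confidence interval. A minimal sketch (the ratings below are made-up illustrations, not from any of the papers):

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = z * math.sqrt(var / n)
    return mean, ci

scores = [5, 4, 4, 5, 3, 4, 5, 4]   # hypothetical 1-5 listener ratings
mean, ci = mos_with_ci(scores)
print(f"MOS: {mean:.2f} +/- {ci:.2f}")
```

This is how figures such as "4.21 ± 0.081" in the papers below should be read: a mean rating plus an interval that shrinks with more listeners.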
WaveNet: A Generative Model For Raw Audio [1]
● Not a truly end-to-end system yet;
● Generates raw audio waveforms (≥ 16k samples per second!);
● Fully convolutional neural network;
● Every new sample is conditioned on all previous samples;
● A softmax layer models the conditional distribution over each individual audio sample;
● Extra linguistic features (e.g. phone identities, syllable stress, fundamental frequency F0) required;
● Can be extended to multi-speaker TTS;
● Achieves a 4.21 ± 0.081 MOS on US English and 4.08 ± 0.085 on Mandarin Chinese;
● Computationally expensive.
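Two of the bullets above are easy to make concrete: the softmax models a categorical distribution over 256 mu-law-companded amplitude classes, and the dilated causal convolutions give a large receptive field cheaply. A sketch (the 1, 2, 4, ..., 512 schedule repeated three times is a commonly cited WaveNet configuration, used here only for illustration):

```python
import math

def mu_law_encode(x, mu=255):
    """Compand a sample in [-1, 1] to one of mu+1 = 256 classes,
    the categorical targets of WaveNet's softmax output layer."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)  # quantize to {0, ..., 255}

def mu_law_decode(q, mu=255):
    """Invert the companding: class index back to an amplitude in [-1, 1]."""
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# dilation stack 1, 2, 4, ..., 512, repeated three times
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(dilations))  # 3070 samples of context per prediction
```

At 16k samples per second that receptive field covers well under a second of audio, which is why the paper also conditions on the linguistic features listed above.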
WaveNet: A Generative Model For Raw Audio
Samples:
1. “The Blue Lagoon is a 1980 American romance and adventure film directed
by Randal Kleiser” (parametric);
2. <same> (concatenative);
3. <same> (WaveNet).
Fast WaveNet Generation Algorithm [2]
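The original slides for this paper were figures only. The key idea is to stop recomputing the whole dilation tree for every new sample: each layer keeps a queue of its past activations, so generating one sample costs one operation per layer. A toy sketch (the weighted sum is a stand-in for the real two-tap dilated convolution, not the paper's actual computation):

```python
from collections import deque

class LayerCache:
    """Per-layer queue from the Fast WaveNet scheme: holds the activation
    from `dilation` steps ago so it never has to be recomputed."""
    def __init__(self, dilation):
        self.queue = deque([0.0] * dilation, maxlen=dilation)

    def step(self, x, w_cur=0.5, w_past=0.5):
        past = self.queue[0]      # activation from `dilation` steps back
        self.queue.append(x)      # oldest entry is evicted automatically
        return w_cur * x + w_past * past  # toy stand-in for the 2-tap conv

# toy autoregressive loop over a small dilation stack
caches = [LayerCache(d) for d in (1, 2, 4, 8)]
sample = 0.1
for _ in range(5):
    h = sample
    for c in caches:
        h = c.step(h)
    sample = h  # the real model would sample from its softmax output here
```

Per-sample cost is O(number of layers) rather than O(receptive field), which is what makes the later real-time systems feasible.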
Deep Voice: Real-time Neural Text-to-Speech [3]
● Not truly end-to-end system yet: “Deep Voice lays the groundwork for truly
end-to-end neural speech synthesis”;
● Consists of five blocks:
○ grapheme-to-phoneme conversion model (encoder - Bi-GRU with 1024 units x 3, decoder -
GRU with 1024 units x 3);
○ segmentation model for locating phoneme boundaries (Convs + GRU + Convs);
○ phoneme duration prediction model (FC-256 x 2, GRU with 128 units x 2, FC);
○ fundamental frequency prediction model (joint model with above);
○ audio synthesis model (variant of WaveNet).
● Faster-than-real-time inference (up to 400x faster on both CPU and GPU
compared to Fast WaveNet [2]);
● Achieves a 2.67 ± 0.37 MOS on US English.
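At inference the blocks above chain into a pipeline (the segmentation model is used only during training, to align phonemes with audio). A stub sketch of the data flow; the lexicon, frame counts, and return values are placeholders for the real model outputs:

```python
def grapheme_to_phoneme(text):
    # stand-in for the Bi-GRU seq2seq G2P model (tiny hypothetical lexicon)
    lexicon = {"hello": ["HH", "AH", "L", "OW"]}
    return [p for word in text.lower().split() for p in lexicon.get(word, ["<unk>"])]

def predict_durations(phonemes):
    # stand-in for the duration model: number of frames per phoneme
    return [5] * len(phonemes)

def predict_f0(durations):
    # stand-in for the F0 model: one pitch-contour value per frame
    return [120.0] * sum(durations)

def synthesize(phonemes, durations, f0):
    # stand-in for the WaveNet-variant vocoder; here it just counts frames
    return sum(durations)

phonemes = grapheme_to_phoneme("hello")
durations = predict_durations(phonemes)
f0 = predict_f0(durations)
n_frames = synthesize(phonemes, durations, f0)
```

The point of the sketch is the interface: each block consumes the previous block's prediction, so errors compound; this is the motivation for the end-to-end systems that follow.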
Deep Voice: Real-time Neural Text-to-Speech
Tacotron: Towards End-to-End Speech Synthesis [4]
● Fully end-to-end: given <text, audio> pairs, the model can be trained completely
from scratch with random initialization;
● Predicts linear- and mel-scale spectrograms;
● Achieves a 3.82 MOS on US English.
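Tacotron's targets are spectrograms on the mel scale, a fixed perceptual frequency warping. A sketch of where the mel band centers fall (80 bands and an 8 kHz cutoff are illustrative choices, not parameters from the paper):

```python
import math

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_band_centers(n_mels=80, fmin=0.0, fmax=8000.0):
    """Center frequencies of the triangular mel filterbank bands: evenly
    spaced on the mel scale, hence denser at low frequencies in Hz."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels + 1)) for i in range(1, n_mels + 1)]

centers = mel_band_centers()
```

The mel spectrogram discards fine spectral detail and phase, which is why Tacotron also predicts linear-scale spectrograms for its waveform reconstruction step.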
Tacotron: Towards End-to-End Speech Synthesis
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “He has read the whole thing.”;
3. “He reads books.”;
4. “The quick brown fox jumps over the lazy dog.”;
5. “Does the quick brown fox jump over the lazy dog?”.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
[5]
● Improved architecture based on Deep Voice [3];
● Can learn hundreds of unique voices from less than half an hour of data
per speaker;
● Voice model based on WaveNet [1] architecture;
● Achieves a 3.53 ± 0.12 MOS.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Samples:
1. “About half the people who are infected also lose weight.”;
2. <same>;
3. <same>.
Listening while Speaking: Speech Chain by Deep
Learning [6]
● Two parts: TTS model and ASR model;
● Single- and multi-speaker;
● Joint training:
○ Supervised step;
○ Unsupervised: unpaired text and speech.
● Both the TTS and ASR models are Tacotron-like [4].
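The training cycle can be sketched as follows. `Dummy` and its methods are hypothetical stand-ins for the Tacotron-like models and their losses; only the data flow reflects the paper:

```python
class Dummy:
    """Hypothetical stand-in for the Tacotron-like TTS and ASR models."""
    def loss(self, inp, target):
        return 1.0                      # placeholder training loss
    def generate(self, text):
        return "synthesized-audio"      # TTS inference
    def transcribe(self, audio):
        return "recognized-text"        # ASR inference

def speech_chain_epoch(paired, unpaired_text, unpaired_audio, tts, asr):
    losses = []
    for text, audio in paired:          # supervised step on <text, audio> pairs
        losses.append(tts.loss(text, audio) + asr.loss(audio, text))
    for text in unpaired_text:          # unsupervised: TTS creates targets for ASR
        losses.append(asr.loss(tts.generate(text), text))
    for audio in unpaired_audio:        # unsupervised: ASR creates targets for TTS
        losses.append(tts.loss(asr.transcribe(audio), audio))
    return sum(losses) / len(losses)

avg_loss = speech_chain_epoch([("hi", "wav")], ["hello"], ["clip"], Dummy(), Dummy())
```

Each model supervises the other on unpaired data, which is what lets the chain exploit text-only and audio-only corpora.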
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
● Fully-convolutional sequence-to-sequence attention-based model;
● Converts input text to spectrograms (or other acoustic parameters);
● Suitable for both single- and multi-speaker settings;
● Needs 10x less training time and converges after 500K iterations
(compared to Tacotron [4], which converges after 2M iterations);
● Novel attention mechanism to introduce monotonic alignment;
● MOS: 3.78 ± 0.30 (with WaveNet), same score for Tacotron [4] (with
Wavenet), 2.74 ± 0.35 for Deep Voice 2 [5] (with WaveNet).
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]
Samples:
1. … (trained for single speaker - 20 hours total);
2. … (trained for 108 speakers - 44 hours total);
3. <same>;
4. … (trained for 2484 speakers - 820 hours of ASR data total);
5. <same>.
Natural TTS Synthesis by Conditioning WaveNet on
Mel Spectrogram Predictions [8]
● Uses simpler building blocks, in contrast to original Tacotron [4];
● Maps input characters to mel-scale spectrogram;
● Modified WaveNet [1] synthesizes audio waveforms from spectrograms
directly (no need for linguistic, phoneme duration and other features);
● However, this WaveNet is pretrained separately;
● Achieves a 4.526 ± 0.066 MOS on US English.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [8]
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “Don't desert me here in the desert!”;
3. “He thought it was time to present the present.”;
4. “The buses aren't the problem, they actually provide a solution.”;
5. “The buses aren't the PROBLEM, they actually provide a SOLUTION.”.
Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis [9]
● Based on Tacotron [4] with slight changes;
● Learns embeddings for 10 style tokens;
● Reference encoder: stack of 2D Convs, GRU with 128 units;
● Style token layer: attention with 256-D token embeddings.
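The style token layer reduces to attention-weighted pooling over the token embeddings. A pure-Python sketch with two 2-D tokens for brevity (the paper uses 10 tokens with 256-D embeddings and multi-head attention rather than the plain dot products assumed here):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def style_embedding(ref, tokens):
    """Attention weights from similarities between the reference encoding
    and each learned style token; the style embedding is the weighted sum."""
    scores = [sum(r * t for r, t in zip(ref, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
```

At inference the weights can also be set by hand, which is what gives the explicit style control the title refers to.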
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis [9]
Samples:
1. “United Airlines five six three from Los Angeles to New Orleans has Landed”;
2. <same>;
3. <same>;
4. <same>;
5. <same>.
Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis [10]
● Consists of several independently trained components:
○ Speaker encoder network that generates speaker embedding vectors: 3-layer LSTM
with 768 units per layer, each followed by a 256-D projection;
○ Synthesis network - Tacotron 2 [8].
● MOS: 4.22 ± 0.06 and 3.28 ± 0.07 for seen and unseen speakers
respectively.
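The speaker encoder maps any number of utterances from one speaker to a single fixed vector: per-utterance embeddings are averaged and L2-normalized, and similarity between speakers is measured with cosine distance. A sketch with toy 2-D vectors (the real embeddings are 256-D):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, the comparison used in speaker verification."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def speaker_embedding(utterance_embeddings):
    """Speaker-level embedding: L2-normalized average of per-utterance
    encoder outputs."""
    dim = len(utterance_embeddings[0])
    n = len(utterance_embeddings)
    avg = [sum(u[i] for u in utterance_embeddings) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in avg))
    return [x / norm for x in avg]
```

The synthesis network simply conditions on this vector, which is why unseen speakers can be cloned from a short reference clip, at some cost in MOS (3.28 vs 4.22 above).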
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [10]
Samples:
1. … (reference audio, seen speaker);
2. … (synthesized audio, seen speaker);
3. … (ref., unseen);
4. … (synth., unseen);
5. … (ref., different language);
6. … (synth., different language).
Close to Human Quality TTS with Transformer [11]
● Based on the state-of-the-art NMT model “Transformer”;
● Architecture follows Tacotron 2 [8], but all RNN layers are replaced with
Transformer blocks;
● The Transformer allows encoder and decoder hidden states to be constructed
in parallel;
● Trains about 4.25 times faster than Tacotron 2;
● Achieves state-of-the-art performance: 4.39 ± 0.05 (proposed model), 4.38 ±
0.05 (Tacotron 2), 4.44 ± 0.05 (ground truth).
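The parallelism claim follows from the attention mechanism itself: every output position depends on all inputs at once, with no recurrent state to thread through time. A minimal scaled dot-product self-attention sketch (identity query/key/value projections are assumed for brevity; the real model learns these projections and uses multiple heads):

```python
import math

def self_attention(xs):
    """Scaled dot-product self-attention over a list of vectors. Each output
    row is computed independently of the others, so unlike an RNN all
    positions can be processed in parallel."""
    d = len(xs[0])
    outs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        outs.append([sum(wi / z * v[i] for wi, v in zip(w, xs)) for i in range(d)])
    return outs
```

The outer loop here is sequential only because this is scalar Python; on a GPU it is a single batched matrix product, which is the source of the ~4.25x training speedup.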
Close to Human Quality TTS with Transformer
Samples:
1. “Two to five inches of rain is possible by early Monday, resulting in some
flooding.”;
2. “Flooding is likely in some parishes of southern Louisiana.”;
3. “Defending champ tiger woods is one of eight Golfers within two strokes.”.
RUSLAN: Russian Spoken Language Corpus For
Speech Synthesis
● Authors: Rustem Garaev, Evgenii Razinkov, Lenar Gabdrakhmanov;
● The largest annotated single-speaker speech corpus for Russian: more than
31 hours of speech (<text, audio> pairs), 22,200 samples;
● Several improvements for Tacotron [4] were proposed:
○ All GRU layers replaced with Layer Normalized LSTM;
○ Fast Griffin-Lim algorithm used to synthesize waveforms from spectrograms.
● MOS: 4.05 for intelligibility and 3.78 for naturalness (vs 3.12 and 2.17 for
original model);
● Corpus site with recordings and synthesized speech:
https://ruslan-corpus.github.io/
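RUSLAN uses the fast Griffin-Lim variant; the plain algorithm it accelerates can be sketched as below. The STFT parameters are toy values, not the paper's, and the fast variant additionally adds a momentum term to the phase update:

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        s = i * hop
        out[s:s + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)  # overlap-add with window normalization

def griffin_lim(mag, n_iter=30, n_fft=256, hop=64):
    """Alternating projection: keep the target magnitudes, take the phase
    from the STFT of the current waveform estimate."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

# toy round trip: reconstruct a 440 Hz tone from its magnitude spectrogram
t = np.arange(2048) / 8000.0
tone = np.sin(2 * np.pi * 440.0 * t)
wave = griffin_lim(np.abs(stft(tone)))
```

Griffin-Lim trades fidelity for simplicity compared with a WaveNet vocoder, but it needs no training and runs fast, which fits a corpus-building project.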
RUSLAN: Russian Spoken Language Corpus For Speech Synthesis
Samples:
1. “Тринадцать лет назад я взялся за перо.” (“Thirteen years ago I took up the
pen.”, from dataset);
2. “Тема работы - разработка и реализация метода генерации русской речи
на основе текста.” (“The topic of this work is the development and
implementation of a method for generating Russian speech from text.”);
3. “Спасибо за внимание.” (“Thank you for your attention.”).
References
[1] Van Den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol
Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray
Kavukcuoglu. "WaveNet: A generative model for raw audio." In SSW, p. 125.
2016.
[2] Paine, Tom Le, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit
Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. "Fast
WaveNet generation algorithm." arXiv preprint arXiv:1611.09482 (2016).
References (2)
[3] Arik, Sercan O., Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew
Gibiansky, Yongguo Kang, Xian Li et al. "Deep voice: Real-time neural
text-to-speech." arXiv preprint arXiv:1702.07825 (2017).
[4] Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss,
Navdeep Jaitly, Zongheng Yang et al. "Tacotron: Towards end-to-end speech
synthesis." arXiv preprint arXiv:1703.10135 (2017).
[5] Arik, Sercan, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng,
Wei Ping, Jonathan Raiman, and Yanqi Zhou. "Deep voice 2: Multi-speaker
neural text-to-speech." arXiv preprint arXiv:1705.08947 (2017).
References (3)
[6] Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. "Listening while
speaking: Speech chain by deep learning." Automatic Speech Recognition
and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017.
[7] Ping, Wei, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan,
Sharan Narang, Jonathan Raiman, and John Miller. "Deep voice 3: Scaling
text-to-speech with convolutional sequence learning." (2018).
[8] Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep
Jaitly, Zongheng Yang, Zhifeng Chen et al. "Natural TTS synthesis by
conditioning WaveNet on mel spectrogram predictions." In 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 4779-4783. IEEE, 2018.
References (4)
[9] Wang, Yuxuan, et al. "Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis." arXiv preprint
arXiv:1803.09017 (2018).
[10] Jia, Ye, et al. "Transfer Learning from Speaker Verification to Multispeaker
Text-To-Speech Synthesis." arXiv preprint arXiv:1806.04558 (2018).
[11] Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., & Zhou, M. (2018). Close to Human
Quality TTS with Transformer. arXiv preprint arXiv:1809.08895.
Stay in Touch with Us
https://provectus.com/
Lenar Gabdrakhmanov:
@morelen17
RUSLAN Corpus:
https://ruslan-corpus.github.io/

Weitere ähnliche Inhalte

Was ist angesagt?

CS 6390 Project design report
CS 6390 Project design reportCS 6390 Project design report
CS 6390 Project design report
Raj Gupta
 
CS 6390 Project design report
CS 6390 Project design reportCS 6390 Project design report
CS 6390 Project design report
Abhishek Datta
 
Specifying and Implementing SNOW3G with Cryptol
Specifying and Implementing SNOW3G with CryptolSpecifying and Implementing SNOW3G with Cryptol
Specifying and Implementing SNOW3G with Cryptol
Ulisses Costa
 
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
ISSEL
 
Ad Hoc Probe
Ad Hoc ProbeAd Hoc Probe
Ad Hoc Probe
nutikumar
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal Wabbit
Shiladitya Sen
 

Was ist angesagt? (20)

CS 6390 Project design report
CS 6390 Project design reportCS 6390 Project design report
CS 6390 Project design report
 
CS 6390 Project design report
CS 6390 Project design reportCS 6390 Project design report
CS 6390 Project design report
 
Microservice Protocols of Interaction
Microservice Protocols of InteractionMicroservice Protocols of Interaction
Microservice Protocols of Interaction
 
QCon London 2015 Protocols of Interaction
QCon London 2015 Protocols of InteractionQCon London 2015 Protocols of Interaction
QCon London 2015 Protocols of Interaction
 
Specifying and Implementing SNOW3G with Cryptol
Specifying and Implementing SNOW3G with CryptolSpecifying and Implementing SNOW3G with Cryptol
Specifying and Implementing SNOW3G with Cryptol
 
Flower and celery
Flower and celeryFlower and celery
Flower and celery
 
Evaluation of scalability and bandwidth
Evaluation of scalability and bandwidthEvaluation of scalability and bandwidth
Evaluation of scalability and bandwidth
 
Reactive Programming Models for IoT
Reactive Programming Models for IoTReactive Programming Models for IoT
Reactive Programming Models for IoT
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015
 
Rnn & Lstm
Rnn & LstmRnn & Lstm
Rnn & Lstm
 
FEC & File Multicast
FEC & File MulticastFEC & File Multicast
FEC & File Multicast
 
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
Μεταπρογραµµατισµός κώδικα Python σε γλώσσα γραµµικού χρόνου για αυτόµατη επα...
 
Homomorphic encryption in_cloud
Homomorphic encryption in_cloudHomomorphic encryption in_cloud
Homomorphic encryption in_cloud
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
 
TecDoc
TecDocTecDoc
TecDoc
 
DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval TaskDCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
 
Deep speech
Deep speechDeep speech
Deep speech
 
Ad Hoc Probe
Ad Hoc ProbeAd Hoc Probe
Ad Hoc Probe
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal Wabbit
 
ECE 565 FInal Project
ECE 565 FInal ProjectECE 565 FInal Project
ECE 565 FInal Project
 

Ähnlich wie Lenar Gabdrakhmanov (Provectus): Speech synthesis

Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Lviv Startup Club
 
Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)
Red Over
 

Ähnlich wie Lenar Gabdrakhmanov (Provectus): Speech synthesis (20)

final ppt BATCH 3.pptx
final ppt BATCH 3.pptxfinal ppt BATCH 3.pptx
final ppt BATCH 3.pptx
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
 
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for VietnameseThe first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese
 
Transformer-based SE.pptx
Transformer-based SE.pptxTransformer-based SE.pptx
Transformer-based SE.pptx
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
Linguistic Passphrase Cracking
Linguistic Passphrase CrackingLinguistic Passphrase Cracking
Linguistic Passphrase Cracking
 
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
 
Acceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation SystemAcceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation System
 
Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
 
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
 
Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)
 
An expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicAn expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabic
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam search
 

Mehr von Provectus

AI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and BeyondAI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and Beyond
Provectus
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
Provectus
 

Mehr von Provectus (20)

Choosing the right IDP Solution
Choosing the right IDP SolutionChoosing the right IDP Solution
Choosing the right IDP Solution
 
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
 
Choosing the Right Document Processing Solution for Healthcare Organizations
Choosing the Right Document Processing Solution for Healthcare OrganizationsChoosing the Right Document Processing Solution for Healthcare Organizations
Choosing the Right Document Processing Solution for Healthcare Organizations
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
AI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and BeyondAI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and Beyond
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
 
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
 
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K..."Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
 
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ..."How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
 
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky..."Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
 
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2..."Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
 
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma..."Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
 
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ..."Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
 
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
 
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
 
"Integrate your front end apps with serverless backend in the cloud", Sebasti...
"Integrate your front end apps with serverless backend in the cloud", Sebasti..."Integrate your front end apps with serverless backend in the cloud", Sebasti...
"Integrate your front end apps with serverless backend in the cloud", Sebasti...
 
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
 
How to implement authorization in your backend with AWS IAM
How to implement authorization in your backend with AWS IAMHow to implement authorization in your backend with AWS IAM
How to implement authorization in your backend with AWS IAM
 

Kürzlich hochgeladen

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 

Kürzlich hochgeladen (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha

Lenar Gabdrakhmanov (Provectus): Speech synthesis

Deep Voice: Real-time Neural Text-to-Speech [3]
● Not truly end-to-end system yet: “Deep Voice lays the groundwork for truly end-to-end neural speech synthesis”;
● Consists of five blocks:
○ grapheme-to-phoneme conversion model (encoder - Bi-GRU with 1024 units x 3, decoder - GRU with 1024 units x 3);
○ segmentation model for locating phoneme boundaries (Convs + GRU + Convs);
○ phoneme duration prediction model (FC-256 x 2, GRU with 128 units x 2, FC);
○ fundamental frequency prediction model (jointly trained with the duration model);
○ audio synthesis model (variant of WaveNet).
● Faster than real-time inference (up to 400x faster on both CPU and GPU compared to Fast WaveNet [2]);
● Achieves a 2.67 ± 0.37 MOS on US English.

Deep Voice: Real-time Neural Text-to-Speech

Deep Voice: Real-time Neural Text-to-Speech
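Deep Voice's audio synthesis block, like WaveNet, is built from dilated causal convolutions: each output sample depends only on past samples, and the dilation spaces the filter taps apart so that stacked layers cover an exponentially growing receptive field. A minimal sketch of one such layer (function name and toy shapes are mine, not the paper's):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution.

    x: input signal, shape (T,)
    w: filter taps, shape (K,); tap k looks back k*dilation steps
    Causality: output at t depends only on x[t], x[t-d], x[t-2d], ...
    """
    T, K = len(x), len(w)
    pad = (K - 1) * dilation                    # left-pad: never see the future
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(T)
    for t in range(T):
        for k in range(K):
            y[t] += w[k] * xp[pad + t - k * dilation]
    return y
```

Stacking such layers with dilations 1, 2, 4, ..., 512 yields the large receptive field that WaveNet-style models need to cover thousands of audio samples.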
Tacotron: Towards End-to-End Speech Synthesis [4]
● Fully end-to-end: given <text, audio> pairs, the model can be trained completely from scratch with random initialization;
● Predicts linear- and mel-scale spectrograms;
● Achieves a 3.82 MOS on US English.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron: Towards End-to-End Speech Synthesis
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “He has read the whole thing.”;
3. “He reads books.”;
4. “The quick brown fox jumps over the lazy dog.”;
5. “Does the quick brown fox jump over the lazy dog?”.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech [5]
● Improved architecture based on Deep Voice [3];
● Can learn hundreds of unique voices from less than half an hour of data per speaker;
● Voice model based on the WaveNet [1] architecture;
● Achieves a 3.53 ± 0.12 MOS.

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Samples:
1. “About half the people who are infected also lose weight.”;
2. <same>;
3. <same>.
Listening while Speaking: Speech Chain by Deep Learning [6]
● Two parts: a TTS model and an ASR model;
● Single- and multi-speaker;
● Joint training:
○ Supervised step (paired data);
○ Unsupervised step (unpaired text and speech).
● Both TTS and ASR models are Tacotron-like [4].

Listening while Speaking: Speech Chain by Deep Learning [6]
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]
● Fully-convolutional sequence-to-sequence attention-based model;
● Converts input text to spectrograms (or other acoustic parameters);
● Suitable for both single- and multi-speaker synthesis;
● Needs 10x less training time and converges after 500K iterations (compared to Tacotron [4], which converges after 2M iterations);
● Novel attention mechanism to introduce monotonic alignment;
● MOS: 3.78 ± 0.30 (with WaveNet); same score for Tacotron [4] (with WaveNet), 2.74 ± 0.35 for Deep Voice 2 [5] (with WaveNet).

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [7]
Samples:
1. … (trained on a single speaker - 20 hours total);
2. … (trained on 108 speakers - 44 hours total);
3. <same>;
4. … (trained on 2484 speakers - 820 hours of ASR data total);
5. <same>.
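Deep Voice 3's actual attention mechanism (positional encodings plus an inference-time constraint) is more involved, but the core idea of monotonic alignment — the decoder may only attend at or ahead of where it attended last step — can be illustrated with a simple window mask (function name, window size, and this masking scheme are illustrative, not the paper's exact method):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def monotonic_window_attention(scores, prev_pos, window=3):
    """Mask one decoder step's attention scores so only encoder
    positions in [prev_pos, prev_pos + window) are eligible,
    forcing the text-to-speech alignment to move left-to-right.

    scores: raw attention scores over encoder steps, shape (T_enc,)
    Returns (weights, new_pos), where new_pos feeds the next step.
    """
    masked = np.full_like(scores, -np.inf)
    hi = min(prev_pos + window, len(scores))
    masked[prev_pos:hi] = scores[prev_pos:hi]
    weights = softmax(masked)
    return weights, int(np.argmax(weights))
```

Applied at every decoder step, this rules out the attention failures (skipped or repeated words) that free-form attention can produce.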
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [8]
● Uses simpler building blocks than the original Tacotron [4];
● Maps input characters to a mel-scale spectrogram;
● A modified WaveNet [1] synthesizes audio waveforms directly from spectrograms (no need for linguistic, phoneme duration, or other features);
● However, this WaveNet is pretrained separately;
● Achieves a 4.526 ± 0.066 MOS on US English.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [8]

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [8]
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “Don't desert me here in the desert!”;
3. “He thought it was time to present the present.”;
4. “The buses aren't the problem, they actually provide a solution.”;
5. “The buses aren't the PROBLEM, they actually provide a SOLUTION.”.
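The mel-scale spectrogram that Tacotron 2 predicts is just a linear-frequency magnitude spectrum compressed through a bank of triangular filters spaced evenly on the mel scale. A minimal numpy sketch of building such a filterbank (the paper's exact parameters differ; names are mine):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters mapping an (n_fft//2 + 1)-bin linear
    spectrum onto n_mels mel bands; shape (n_mels, n_fft//2 + 1).
    Multiply: mel_spec = fb @ linear_magnitude_spectrum."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):               # rising edge of the triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):               # falling edge
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

The log of the resulting mel spectrogram is the acoustic target the sequence-to-sequence network is trained to predict, and the input the vocoder WaveNet is conditioned on.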
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis [9]
● Based on Tacotron [4] with slight changes;
● Learns embeddings for 10 style tokens;
● Reference encoder: stack of 2D Convs, GRU with 128 units;
● Style token layer: attention with 256-D token embeddings.

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis [9]

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis [9]
Samples:
1. “United Airlines five six three from Los Angeles to New Orleans has Landed”;
2. <same>;
3. <same>;
4. <same>;
5. <same>.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [10]
● Consists of several independently trained components:
○ Speaker encoder network to generate speaker embedding vectors: LSTM with 768 units followed by a 256-D projection layer, x 3;
○ Synthesis network: Tacotron 2 [8].
● MOS: 4.22 ± 0.06 and 3.28 ± 0.07 for seen and unseen speakers respectively.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [10]

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [10]

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [10]
Samples:
1. … (reference audio, seen speaker);
2. … (synthesized audio, seen speaker);
3. … (ref., unseen);
4. … (synth., unseen);
5. … (ref., different language);
6. … (synth., different language).
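The speaker encoder maps any utterance to a fixed-dimensional embedding, and the synthesizer is conditioned on that vector; in the verification task the encoder is transferred from, embeddings are compared by cosine similarity after L2 normalization. A toy illustration of that comparison (function names and the dictionary-based lookup are mine):

```python
import numpy as np

def l2_normalize(v):
    """Speaker embeddings are typically L2-normalized to unit length."""
    return v / np.linalg.norm(v)

def closest_speaker(query, enrolled):
    """Return the enrolled speaker whose normalized embedding has the
    highest cosine similarity (dot product of unit vectors) to `query`.

    enrolled: dict mapping speaker name -> embedding vector
    """
    q = l2_normalize(query)
    scores = {name: float(np.dot(q, l2_normalize(e)))
              for name, e in enrolled.items()}
    return max(scores, key=scores.get)
```

In the TTS setting the embedding is not matched against enrolled speakers but fed directly to the synthesizer, which is what enables cloning voices unseen at training time.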
Close to Human Quality TTS with Transformer [11]
● Based on the state-of-the-art NMT model “Transformer”;
● Based on the Tacotron 2 [8] model, but with all RNN layers replaced by Transformer blocks;
● The Transformer allows encoder and decoder hidden states to be constructed in parallel;
● Trains about 4.25 times faster than Tacotron 2;
● Achieves state-of-the-art performance: 4.39 ± 0.05 (proposed model), 4.38 ± 0.05 (Tacotron 2), 4.44 ± 0.05 (ground truth).

Close to Human Quality TTS with Transformer

Close to Human Quality TTS with Transformer

Close to Human Quality TTS with Transformer
Samples:
1. “Two to five inches of rain is possible by early Monday, resulting in some flooding.”;
2. “Flooding is likely in some parishes of southern Louisiana.”;
3. “Defending champ tiger woods is one of eight Golfers within two strokes.”.
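The Transformer's core operation is scaled dot-product attention: every position attends to every other position in a single matrix multiply, which is what removes the step-by-step dependency of RNN layers and lets hidden states be built in parallel. A minimal numpy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (T_q, d_k) queries, K: (T_k, d_k) keys, V: (T_k, d_v) values.
    All query rows are processed at once - no recurrence over time.
    Returns (output of shape (T_q, d_v), attention weights).
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights
```

During training this parallelism is what yields the roughly 4.25x speedup over the recurrent Tacotron 2 decoder reported on the slide.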
RUSLAN: Russian Spoken Language Corpus For Speech Synthesis
● Authors: Rustem Garaev, Evgenii Razinkov, Lenar Gabdrakhmanov;
● Largest annotated single-speaker speech corpus in Russian - more than 31 hours of speech (<text, audio> pairs), 22200 samples;
● Several improvements to Tacotron [4] were proposed:
○ All GRU layers replaced with Layer Normalized LSTM;
○ Fast Griffin-Lim algorithm used for synthesizing waveforms from spectrograms.
● MOS: 4.05 for intelligibility and 3.78 for naturalness (vs 3.12 and 2.17 for the original model);
● Corpus site with recordings and synthesized speech: https://ruslan-corpus.github.io/

RUSLAN: Russian Spoken Language Corpus For Speech Synthesis

RUSLAN: Russian Spoken Language Corpus For Speech Synthesis
Samples:
1. “Тринадцать лет назад я взялся за перо.” (“Thirteen years ago, I took up the pen.”; from the dataset);
2. “Тема работы - разработка и реализация метода генерации русской речи на основе текста.” (“The topic of this work is the development and implementation of a method for generating Russian speech from text.”);
3. “Спасибо за внимание.” (“Thank you for your attention.”).
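Griffin-Lim recovers a waveform from a magnitude spectrogram by alternating between the time and frequency domains: keep the target magnitudes, re-estimate the phase, repeat. A sketch of the classic (non-accelerated) variant with a toy STFT — the Fast Griffin-Lim used for RUSLAN adds a momentum term on top of this, and the FFT/hop sizes here are illustrative:

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft + 1)[:-1]            # periodic Hann window
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    win = np.hanning(n_fft + 1)[:-1]
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):                # windowed overlap-add
        frame = np.fft.irfft(spec, n_fft)
        out[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iters=50, n_fft=256, hop=64):
    """Estimate a waveform whose STFT magnitude matches `mag` by
    iteratively replacing the magnitude and keeping the phase."""
    angles = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iters):
        x = istft(mag * angles, n_fft, hop)     # back to time domain
        S = stft(x, n_fft, hop)                 # and forward again
        angles = np.exp(1j * np.angle(S))       # keep only the new phase
    return istft(mag * angles, n_fft, hop)
```

Since the vocoder is iterative rather than learned, it is cheap, which is part of why the improved Tacotron trains and synthesizes without a WaveNet.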
References
[1] Van Den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. "WaveNet: A generative model for raw audio." In SSW, p. 125. 2016.
[2] Paine, Tom Le, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. "Fast wavenet generation algorithm." arXiv preprint arXiv:1611.09482 (2016).

References (2)
[3] Arik, Sercan O., Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li et al. "Deep voice: Real-time neural text-to-speech." arXiv preprint arXiv:1702.07825 (2017).
[4] Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
[5] Arik, Sercan, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. "Deep voice 2: Multi-speaker neural text-to-speech." arXiv preprint arXiv:1705.08947 (2017).

References (3)
[6] Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. "Listening while speaking: Speech chain by deep learning." In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
[7] Ping, Wei, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. "Deep voice 3: Scaling text-to-speech with convolutional sequence learning." (2018).
[8] Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783. IEEE, 2018.

References (4)
[9] Wang, Yuxuan, et al. "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." arXiv preprint arXiv:1803.09017 (2018).
[10] Jia, Ye, et al. "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis." arXiv preprint arXiv:1806.04558 (2018).
[11] Li, N., S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou. "Close to Human Quality TTS with Transformer." arXiv preprint arXiv:1809.08895 (2018).
Stay in Touch with Us
https://provectus.com/
Lenar Gabdrakhmanov: @morelen17
RUSLAN Corpus: https://ruslan-corpus.github.io/