APSIPA ASC 2021
Ding Ma, Wen-Chin Huang, Tomoki Toda: Investigation of text-to-speech-based synthetic parallel data for sequence-to-sequence non-parallel voice conversion, Dec. 2021
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
1. Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
Ding Ma, Wen-Chin Huang and Tomoki Toda
Graduate School of Informatics, Nagoya University, Nagoya, Japan
Paper ID: #1606 Presenter: Ding Ma
2. Introduction
•Voice conversion (VC)
• A methodology that aims to convert the speaker identity of speech from a source speaker to a target speaker while preserving the linguistic content.
• VC is expected to play a significant role in augmented human communication.
[Diagram: source speech → VC → target speech]
3. Introduction
•Sequence-to-sequence (seq2seq) modeling
• Seq2seq model: a model that takes a sequence of items and outputs another sequence of items; such models have emerged from the development of deep neural networks (DNNs).
• Can automatically determine the output phoneme durations.
• Captures long-term dependencies: prosody (F0 & duration), intonation, …
• Requires a large parallel speech corpus from the source and target speakers for training.
[Diagram: source speech → encoder → attention → decoder → target speech]
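The role of attention in the encoder-attention-decoder structure above can be sketched as follows. This is a minimal NumPy illustration with toy shapes and random features (not the actual VTN implementation); it shows that the number of decoder steps need not equal the number of encoder frames, which is what lets a seq2seq model determine output durations automatically.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: each decoder step attends over
    # all encoder frames, so the output length is decoupled from the
    # input length (no frame-wise alignment is imposed).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (T_out, T_in)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights         # context: (T_out, d)

# Toy example: 100 source frames mapped to 80 decoder steps.
rng = np.random.default_rng(0)
enc = rng.normal(size=(100, 16))    # encoder outputs (source speech)
dec_q = rng.normal(size=(80, 16))   # decoder queries (output sequence)
ctx, w = attention(dec_q, enc, enc)
```

The attention weights `w` play the role of a soft, learned time alignment between source and target sequences.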
4. Background
• Voice Conversion Challenge 2020 (VCC2020)
• Biennial event to compare the performance of different VC systems.
• 2 tasks: intra-lingual and semiparallel cases in Task 1 & cross-lingual case in Task 2.
• Parallel: same utterances
• Nonparallel: different utterances
• Semiparallel: parallel + nonparallel (can be regarded as a relaxation of the nonparallel case)
• Limited dataset: only 90 utterances in Task 1 / 70 utterances in Task 2
5. Background
• VTN: Voice Transformer Network, a sequence-to-sequence voice conversion model based on the Transformer with text-to-speech (TTS) pretraining [1].
• ➕ Reduces the required training data from 1 hour to 5 minutes (thanks to pretraining).
• ➖ Still needs parallel training data.
• How to tackle the issue of a semiparallel dataset?
• 「Synthetic speech method」
• We extend the VTN model by training TTS models to generate synthetic parallel data (SPD). (Semiparallel → Parallel)
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
6. Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
[Figure: (a) TTS training process using the semiparallel dataset; (b) SPD generation process using source synthetic data, target synthetic data, and external SPD.]
7. Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
• Four types of parallel data are available for training the VC model:
1. <source natural, target natural>
2. <SPD with source synthetic, target natural>
3. <source natural, SPD with target synthetic>
4. <external SPD with source synthetic, external SPD with
target synthetic>
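The four pair types above can be assembled mechanically once the two TTS models are fine-tuned. The sketch below uses hypothetical names (`build_training_pairs`, `tts_src`/`tts_tgt` callables standing in for the fine-tuned source- and target-speaker TTS models); it illustrates the bookkeeping, not the authors' actual pipeline.

```python
# Hypothetical sketch: assemble the four training-pair types from a
# semiparallel dataset plus external texts.
def build_training_pairs(parallel, src_only, tgt_only, external_texts,
                         tts_src, tts_tgt):
    pairs = []
    # 1. <source natural, target natural>: the truly parallel portion.
    pairs += [(s, t) for s, t in parallel]
    # 2. <SPD with source synthetic, target natural>: synthesize the
    #    missing source side for utterances only the target recorded.
    pairs += [(tts_src(text), wav) for text, wav in tgt_only]
    # 3. <source natural, SPD with target synthetic>: the mirror case.
    pairs += [(wav, tts_tgt(text)) for text, wav in src_only]
    # 4. <external SPD, external SPD>: both sides synthesized from
    #    external text data.
    pairs += [(tts_src(t), tts_tgt(t)) for t in external_texts]
    return pairs
```

Every semiparallel utterance thus ends up in exactly one pair, turning the semiparallel dataset into a fully parallel one.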
8. 「Synthetic Speech Method」
• There are still uncertainties about the effects and usage of SPD in seq2seq VC models. In this paper we try to address the following 3 questions:
• Q1: What are the feasibility and properties of using SPD?
• Q1-1: How does the quality of the data affect VC performance?
• Q1-2: Which kind of training pair is better?
- Source + target natural / source synthetic only / target synthetic only / natural + synthetic (mixed)
• Q2: How can this method benefit from a semiparallel setting?
• Fix the original training data, and vary the semiparallel ratio (0%/25%/50%/75%/100%)
• Q3: What are the influences of using external text data?
• Fix the original training data, and increase the external data (1k/2k/5k)
9. Datasets and Configuration
• Initial dataset: CMU ARCTIC database (containing 1132 parallel utterances recorded by English speakers at 16 kHz)
• Female: clb, slt
• Male: bdl, rms
• Development set and evaluation set: 100 utterances each
• External dataset: M-AILABS database
• English corpus: 15369 utterances, about 30 hours
• Implementation:
• TTS models: Pretrained Transformer-TTS architecture
• VC model: VTN (Transformer-based seq2seq VC model) [1]
• Vocoder: Parallel WaveGAN (PWG) neural vocoder [2]
• Objective evaluation: Transformer-based ASR engine trained on the LibriSpeech database [3]
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
[2] R. Yamamoto, E. Song, and J. M. Kim. “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.
[3] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884-5888, 2018.
10. Experiment and Evaluation
• Q1: What are the feasibility and properties of using SPD?
Five kinds of training pairs:
1. <source natural, target natural>
2. <source natural, target synthetic>
3. <source synthetic, target natural>
4. <source synthetic, target synthetic>
5. <source synthetic and source natural, target
natural and target synthetic>
11. Experiment and Evaluation
• MCD: Mel-Cepstral Distortion / CER: Character Error Rate / WER: Word Error Rate
• The objective evaluation results of Q1
Table I: Comparison results with different training pairs and data sizes. TTS-450, TTS-400, TTS-200 and TTS-80 denote the corresponding data size used for TTS fine-tuning (which also reflects TTS performance), as well as the data size for SPD generation and VC training.
• TTS performance is critical in terms of its impact on the VC results.
• The training dataset of <source synthetic, target natural> generally performs better than the other pairs using SPD.
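The MCD numbers in the tables follow the standard mel-cepstral distortion formula. A minimal sketch, assuming two already time-aligned mel-cepstrum sequences of equal length (alignment, e.g. by DTW, is outside this snippet):

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    # Mean frame-wise MCD in dB between two aligned mel-cepstrum
    # sequences of shape (frames, coefficients); the 0th coefficient
    # (energy) is excluded by convention.
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower MCD indicates converted speech spectrally closer to the target reference.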
12. Experiment and Evaluation
• Q2: How can this method benefit from a semiparallel setting?
• Training procedure with different semiparallel settings (e.g., datasize = 400).
• The parallel ratio (PR) represents the proportion of the natural parallel corpus, reflecting the semiparallel setting of each group.
• The respective TTS models of the source and target speakers are trained with a constant data size but a different semiparallel setting for each group.
• Two-part experiment: training dataset I retains all SPD, as shown in (a); training dataset II removes the natural-synthetic part of the semiparallel cases from training, as shown in (b).
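A minimal sketch of how a fixed-size corpus might be partitioned according to the parallel ratio. The helper name and utterance IDs are hypothetical, not the authors' actual data-preparation script; the non-parallel remainder is simply split evenly between the two speakers here.

```python
# Hypothetical sketch: split a fixed-size corpus by parallel ratio (PR)
# to simulate the semiparallel settings (e.g., datasize = 400).
def split_by_parallel_ratio(utt_ids, pr):
    n_parallel = round(len(utt_ids) * pr)
    parallel = utt_ids[:n_parallel]   # utterances both speakers recorded
    rest = utt_ids[n_parallel:]
    half = len(rest) // 2
    src_only = rest[:half]            # recorded by the source speaker only
    tgt_only = rest[half:]            # recorded by the target speaker only
    return parallel, src_only, tgt_only

ids = [f"arctic_{i:04d}" for i in range(400)]
par, src, tgt = split_by_parallel_ratio(ids, 0.25)  # PR = 25%
```

The total data size stays constant across groups; only the parallel/non-parallel proportion changes.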
13. Experiment and Evaluation
• The objective evaluation results of Q2
Table II: Experimental results under different semiparallel settings.
14. Experiment and Evaluation
• The objective evaluation results of Q2 (continued)
Table II: Experimental results under different semiparallel settings.
15. Experiment and Evaluation
• The objective evaluation results of Q3
Table III: Experimental results of adding external data with different data sizes. TTS-400 and TTS-200 denote the corresponding data size used for TTS fine-tuning.
16. Experiment and Evaluation
• The subjective evaluation (MOS) results of Q1, Q2 and Q3 on specific datasets.
Table IV: Results of the subjective evaluation for Q1, using the test set under the 450 and 80 data sizes, with 95% confidence intervals.
Table V: Results of the subjective evaluation for Q2, using the test sets, with 95% confidence intervals.
Table VI: Results of the subjective evaluation for Q3, using the test sets under the 400 and 200 data sizes, with 95% confidence intervals.
• The overall results are consistent with the findings of the objective evaluations.
17. Conclusions
• SPD is feasible for seq2seq non-parallel VC. The VC results using SPD are determined by the performance of the TTS models and the VC training data size. The VC results are also affected by how the SPD is used.
• When the dataset is semiparallel, the PR should be kept as large as possible. If the original data size is large, introducing SPD on either the target or the source side can achieve good VC results; making full use of all types of SPD to maximize the amount of data yields the greatest benefit. Conversely, when the original data size is small, well-performing TTS models are difficult to obtain, and training pairs with a negative impact, such as <source natural, target synthetic>, should be avoided.
• SPD generated from external text data can serve as data augmentation and improve parallel seq2seq VC performance to a certain extent (e.g., the natural-natural pair).
18. Future work
• Use more speakers and a larger amount of data to further investigate the benefits that seq2seq non-parallel VC can obtain from SPD.
• In terms of methodology, introduce VC models that can be trained directly on non-parallel data and compare their performance with the SPD-based approach to seq2seq VC, so as to further clarify the role of SPD.