SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Investigation of Text-to-speech based
Synthetic Parallel Data for Sequence-to-
sequence Non-parallel Voice Conversion
Ding Ma, Wen-Chin Huang and Tomoki Toda
Graduate School of Informatics, Nagoya University, Nagoya, Japan
Paper ID: #1606 Presenter: Ding Ma
Introduction
•Voice conversion (VC)
• The methodology that aims to convert the speaker identity
of speech from source speaker into target speaker while
preserving the linguistic information.
• VC is expected to play a significant role in augmented
human communication.
Source
speech
VC
Target
speech
2
Introduction
•Sequence-to-sequence (seq2seq) modeling
• Seq2Seq model: a model that takes a sequence of items and outputs another sequence of
items, have emerged from the development of deep neural networks (DNN).
• Can automatically determine the output phoneme duration.
• Capture long term dependencies: prosody (F0 & duration), intonation…
• Requires a large amount and parallel speech corpus from source and target
speakers for training.
Encoder Attention Decoder
Source
speech
Target
speech
3
Background
• Voice conversion challenge 2020(VCC2020)
• Bi-annual event to compare the performance of different VC systems.
• 2 tasks: Intra-lingual and semi-parallel case in Task 1 & cross lingual case in Task 2.
• Parallel: same utterances
• Nonparallel: different utterances
• Semiparallel: Parallel + Nonparallel situation
(can be regarded as the relaxation of non-
parallel case)
• Limited dataset: Only 90 corpus in T1/ 70
corpus in T2
4
Background
• VTN: Voice Transformer Network, which is the sequence-to-
sequence Voice Conversion Using Transformer with Text-to-
speech (TTS) pretraining.
• ➕ 1hr à 5 mins training data (Thanks to pretraining technology).
• ➖ still Needs parallel training data.
• How to tackle issue of semiparallel dataset?
• 「 Synthetic speech method 」
• We extended VTN model by training TTS models to generate synthetic
parallel data (SPD). (Semiparallel à Parallel)
[1]
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech,
pp. 4676-4680, 2020.
5
Background
• Generation process of a synthetic parallel data (SPD) from a
semiparallel dataset.
(a) TTS training process using the semiparallel dataset; and (b) SPD generation process using source synthetic data,
target synthetic data, and external SPD.
6
Background
•Generation process of a synthetic parallel data (SPD)
from a semiparallel dataset.
• Four types of parallel data available for training the VC
model in total:
1. <source natural, target natural>
2. <SPD with source synthetic, target natural>
3. <source natural, SPD with target synthetic>
4. <external SPD with source synthetic, external SPD with
target synthetic>
7
「Synthetic Speech Method」
• There are still uncertainties about the effects and usage of
SPD on seq2seq VC model. In this paper we try to address
the following 3 questions:
• Q1: What are the feasibility and properties of using SPD?
• Q1-1: How does quality of data affect VC performance?
• Q1-2: Which kind of the training pair is better?
Ø Source + target natural / source synthetic only / target synthetic only / natural+synthetic (mixed
situation)
• Q2: How can this method benefit from a semiparallel setting?
• Fix original training data, and set semiparallel ratio(0/25%/50%/75%/100%)
• Q3: What are the influences of using external text data?
• Fix original training data, increase external data (1k/2k/5k)
8
Datasets and Configuration
• Initial dataset : CMU ARCTIC database(containing parallel 1132 utterances
recorded by the English speakers in 16kHZ)
• Female: clb, slt
• Male: bdl, rms
• Development set and evaluation set: 100 utterances separately
• External dataset: M-AILABS database
• English corpus: 15369 utterances, 30 hours long
• Implementation:
• TTS models: Pretrained Transformer-TTS architecture
• VC model: VTN (Transformer-based seq2seq VC model) [1]
• Vocoder: Parallel WaveGAN (PWG) neural vocoder [2]
• Objective evaluation: Transformer-based ASR engine trained by LibriSpeech
database [3]
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
[2] R. Yamamoto, E. Song, and J. M. Kim. “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.
[3] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884-5888, 2018.
9
Experiment and Evaluation
• Q1: What are the feasibility and properties of using SPD?
Five kinds of training pairs:
1. <source natural, target natural>
2. <source natural, target synthetic>
3. <source synthetic, target natural>
4. <source synthetic, target synthetic>
5. <source synthetic and source natural, target
natural and target synthetic>
10
Experiment and Evaluation
• MCD: Mel Cepstrum Distortion / CER: Character Error Rate / WER: Word Error Rate
• The Objective evaluationresults of Q1
Table I: The comparison results with different training pair and datasize. TTS-450, TTS-400, TTS-200 and TTS-80 represent the homologous datasize of
TTS finetuning, which also reflect TTS performance, the datasize of SPD generation and VC training.
• The TTS performance is critical in terms of the impact of VC results.
• The training dataset of source synthetic - target natural generally performs better among the other pairs using
SPD. 11
Experiment and Evaluation
• Q2: How can this method benefit from a semiparallel setting?
• Training procedure with different semiparallel setting
(e.g., datasize=400).
• Parallel ratio (PR) is used to represent the proportion
of natural parallel corpus, so as to reflect the semi-
parallel setting under each group.
• The respective TTS models of source and target
speaker are trained in case of constant datasize but
different semiparallel setting for each group.
• Two parts experiment: Training dataset I retains all
SPD as shown in (a); training dataset II removes
natural-synthetic part of semiparallel cases for
training as shown in (b).
12
Experiment and Evaluation
• The Objective evaluationresults of Q2
Table II: Experimental results under different semiparallel setting.
13
Experiment and Evaluation
• The Objective evaluationresults of Q2
Table II: Experimental results under different semiparallel setting.
14
Experiment and Evaluation
• The Objective evaluation results of Q3
Table III: Experimental results of adding external data with different datasizes. TTS-400 and TTS-200 represent homologous datasize
of TTS finetuning.
15
Experiment and Evaluation
• The Subjective evaluation (MOS)results of Q1, Q2 and Q3 under specific datasets.
Table IV: Results of subjective evaluation using test set under 450 and 80 datasize with 95%
confidence intervals for Q1.
Table V: Results of subjective evaluation using test sets with 95%
confidence intervals for Q2.
Table VI: Results of subjective evaluation using test sets under 400 and 200 datasize with 95% confidence
intervals for Q3.
• The overall results are consistent with
the findings in the objective evaluations.
16
Conclusions
• SPD is feasible for seq2seq non-parallel VC. The VC results using
SPD are determined by the performance of TTS models and VC
training datasize. In addition, the VC result is also affected by the object of
using SPD.
• When the dataset is semiparallel, we should try to ensure the PR is
large enough. If the original datasize is large, the introduction of SPD
into target speaker or source speaker can both achieve ideal VC results.
Thus, the full use of all types of SPD to ensure amount of data, can
maximize the benefits. On the contrary, when the original datasize is
small, the well-performing TTS models are difficult to get. Introducing
training pair with negative impact such as source natural-target
synthetic should be avoided.
• SPD with external text data as data augmentation can improve parallel
seq2seq VC performance to a certain extent (e.g. natural-natural).
17
Future work
• Using more speakers and a larger amount of data to further investigate the
beneficial trend that seq2seq non-parallel VC can obtain from SPD.
• In terms of methodology, we can introduce the VC models which can directly
processing non-parallel data training to compare the performance with the
way of using SPD on seq2seq VC in the future research, so as to further
clarify the role of SPD.
18

Weitere ähnliche Inhalte

Was ist angesagt?

深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術NU_I_TODALAB
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用NU_I_TODALAB
 
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定Shinnosuke Takamichi
 
統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展Shinnosuke Takamichi
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...Deep Learning JP
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査Tomoki Hayashi
 
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析Shinnosuke Takamichi
 
Statistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingStatistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingNU_I_TODALAB
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告Yuki Saito
 
深層学習に基づく音響特徴量からの振幅スペクトログラム予測
深層学習に基づく音響特徴量からの振幅スペクトログラム予測深層学習に基づく音響特徴量からの振幅スペクトログラム予測
深層学習に基づく音響特徴量からの振幅スペクトログラム予測Kitamura Laboratory
 
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Pers...
Empirical Study Incorporating Linguistic Knowledge on Filled Pausesfor Pers...Empirical Study Incorporating Linguistic Knowledge on Filled Pausesfor Pers...
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Pers...Yuta Matsunaga
 
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech SynthesisDeep Learning JP
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現NU_I_TODALAB
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトNU_I_TODALAB
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)Yuki Saito
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法NU_I_TODALAB
 
PCFG構文解析法
PCFG構文解析法PCFG構文解析法
PCFG構文解析法Yusuke Oda
 
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパス
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパスJ-KAC:日本語オーディオブック・紙芝居朗読音声コーパス
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパスShinnosuke Takamichi
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析Junya Koguchi
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元NU_I_TODALAB
 

Was ist angesagt? (20)

深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用
 
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定
差分スペクトル法に基づく DNN 声質変換の計算量削減に向けたフィルタ推定
 
統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
 
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
 
Statistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingStatistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modeling
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告
 
深層学習に基づく音響特徴量からの振幅スペクトログラム予測
深層学習に基づく音響特徴量からの振幅スペクトログラム予測深層学習に基づく音響特徴量からの振幅スペクトログラム予測
深層学習に基づく音響特徴量からの振幅スペクトログラム予測
 
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Pers...
Empirical Study Incorporating Linguistic Knowledge on Filled Pausesfor Pers...Empirical Study Incorporating Linguistic Knowledge on Filled Pausesfor Pers...
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Pers...
 
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法
 
PCFG構文解析法
PCFG構文解析法PCFG構文解析法
PCFG構文解析法
 
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパス
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパスJ-KAC:日本語オーディオブック・紙芝居朗読音声コーパス
J-KAC:日本語オーディオブック・紙芝居朗読音声コーパス
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
 

Ähnlich wie Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...IRJET Journal
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEIRJET Journal
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNAVER Engineering
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti1
 
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET Journal
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET Journal
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...eSAT Publishing House
 
Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionTakato Hayashi
 
Mediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesMediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesXavier Anguera
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...ijnlc
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHIRJET Journal
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT IntroductionRIILP
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014Paris Open Source Summit
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015The BioTeam Inc.
 

Ähnlich wie Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion (20)

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
 
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation System
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...
 
Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognition
 
Mediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesMediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slides
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
 
Deep Domain
Deep DomainDeep Domain
Deep Domain
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 

Mehr von NU_I_TODALAB

異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例NU_I_TODALAB
 
距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知NU_I_TODALAB
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionNU_I_TODALAB
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識NU_I_TODALAB
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法NU_I_TODALAB
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークNU_I_TODALAB
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法NU_I_TODALAB
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice ConversionNU_I_TODALAB
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法NU_I_TODALAB
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換NU_I_TODALAB
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...NU_I_TODALAB
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調NU_I_TODALAB
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離NU_I_TODALAB
 

Mehr von NU_I_TODALAB (13)

異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例
 
距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head Decoderネットワーク
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice Conversion
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
 

Kürzlich hochgeladen

MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Kürzlich hochgeladen (20)

MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion

  • 1. Investigation of Text-to-speech based Synthetic Parallel Data for Sequence-to- sequence Non-parallel Voice Conversion Ding Ma, Wen-Chin Huang and Tomoki Toda Graduate School of Informatics, Nagoya University, Nagoya, Japan Paper ID: #1606 Presenter: Ding Ma
  • 2. Introduction •Voice conversion (VC) • The methodology that aims to convert the speaker identity of speech from source speaker into target speaker while preserving the linguistic information. • VC is expected to play a significant role in augmented human communication. Source speech VC Target speech 2
  • 3. Introduction •Sequence-to-sequence (seq2seq) modeling • Seq2Seq model: a model that takes a sequence of items and outputs another sequence of items, have emerged from the development of deep neural networks (DNN). • Can automatically determine the output phoneme duration. • Capture long term dependencies: prosody (F0 & duration), intonation… • Requires a large amount and parallel speech corpus from source and target speakers for training. Encoder Attention Decoder Source speech Target speech 3
  • 4. Background • Voice conversion challenge 2020(VCC2020) • Bi-annual event to compare the performance of different VC systems. • 2 tasks: Intra-lingual and semi-parallel case in Task 1 & cross lingual case in Task 2. • Parallel: same utterances • Nonparallel: different utterances • Semiparallel: Parallel + Nonparallel situation (can be regarded as the relaxation of non- parallel case) • Limited dataset: Only 90 corpus in T1/ 70 corpus in T2 4
  • 5. Background • VTN: Voice Transformer Network, which is the sequence-to- sequence Voice Conversion Using Transformer with Text-to- speech (TTS) pretraining. • ➕ 1hr à 5 mins training data (Thanks to pretraining technology). • ➖ still Needs parallel training data. • How to tackle issue of semiparallel dataset? • 「 Synthetic speech method 」 • We extended VTN model by training TTS models to generate synthetic parallel data (SPD). (Semiparallel à Parallel) [1] [1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020. 5
  • 6. Background • Generation process of a synthetic parallel data (SPD) from a semiparallel dataset. (a) TTS training process using the semiparallel dataset; and (b) SPD generation process using source synthetic data, target synthetic data, and external SPD. 6
  • 7. Background •Generation process of a synthetic parallel data (SPD) from a semiparallel dataset. • Four types of parallel data available for training the VC model in total: 1. <source natural, target natural> 2. <SPD with source synthetic, target natural> 3. <source natural, SPD with target synthetic> 4. <external SPD with source synthetic, external SPD with target synthetic> 7
  • 8. 「Synthetic Speech Method」 • There are still uncertainties about the effects and usage of SPD on seq2seq VC model. In this paper we try to address the following 3 questions: • Q1: What are the feasibility and properties of using SPD? • Q1-1: How does quality of data affect VC performance? • Q1-2: Which kind of the training pair is better? Ø Source + target natural / source synthetic only / target synthetic only / natural+synthetic (mixed situation) • Q2: How can this method benefit from a semiparallel setting? • Fix original training data, and set semiparallel ratio(0/25%/50%/75%/100%) • Q3: What are the influences of using external text data? • Fix original training data, increase external data (1k/2k/5k) 8
  • 9. Datasets and Configuration • Initial dataset : CMU ARCTIC database(containing parallel 1132 utterances recorded by the English speakers in 16kHZ) • Female: clb, slt • Male: bdl, rms • Development set and evaluation set: 100 utterances separately • External dataset: M-AILABS database • English corpus: 15369 utterances, 30 hours long • Implementation: • TTS models: Pretrained Transformer-TTS architecture • VC model: VTN (Transformer-based seq2seq VC model) [1] • Vocoder: Parallel WaveGAN (PWG) neural vocoder [2] • Objective evaluation: Transformer-based ASR engine trained by LibriSpeech database [3] [1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020. [2] R. Yamamoto, E. Song, and J. M. Kim. “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020. [3] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884-5888, 2018. 9
  • 10. Experiment and Evaluation • Q1: What are the feasibility and properties of using SPD? Five kinds of training pairs: 1. <source natural, target natural> 2. <source natural, target synthetic> 3. <source synthetic, target natural> 4. <source synthetic, target synthetic> 5. <source synthetic and source natural, target natural and target synthetic> 10
  • 11. Experiment and Evaluation • MCD: Mel Cepstrum Distortion / CER: Character Error Rate / WER: Word Error Rate • The Objective evaluationresults of Q1 Table I: The comparison results with different training pair and datasize. TTS-450, TTS-400, TTS-200 and TTS-80 represent the homologous datasize of TTS finetuning, which also reflect TTS performance, the datasize of SPD generation and VC training. • The TTS performance is critical in terms of the impact of VC results. • The training dataset of source synthetic - target natural generally performs better among the other pairs using SPD. 11
  • 12. Experiment and Evaluation • Q2: How can this method benefit from a semiparallel setting? • Training procedure with different semiparallel setting (e.g., datasize=400). • Parallel ratio (PR) is used to represent the proportion of natural parallel corpus, so as to reflect the semi- parallel setting under each group. • The respective TTS models of source and target speaker are trained in case of constant datasize but different semiparallel setting for each group. • Two parts experiment: Training dataset I retains all SPD as shown in (a); training dataset II removes natural-synthetic part of semiparallel cases for training as shown in (b). 12
  • 13. Experiment and Evaluation • The Objective evaluationresults of Q2 Table II: Experimental results under different semiparallel setting. 13
  • 14. Experiment and Evaluation • The Objective evaluationresults of Q2 Table II: Experimental results under different semiparallel setting. 14
  • 15. Experiment and Evaluation • The Objective evaluation results of Q3 Table III: Experimental results of adding external data with different datasizes. TTS-400 and TTS-200 represent homologous datasize of TTS finetuning. 15
  • 16. Experiment and Evaluation • The Subjective evaluation (MOS)results of Q1, Q2 and Q3 under specific datasets. Table IV: Results of subjective evaluation using test set under 450 and 80 datasize with 95% confidence intervals for Q1. Table V: Results of subjective evaluation using test sets with 95% confidence intervals for Q2. Table VI: Results of subjective evaluation using test sets under 400 and 200 datasize with 95% confidence intervals for Q3. • The overall results are consistent with the findings in the objective evaluations. 16
  • 17. Conclusions • SPD is feasible for seq2seq non-parallel VC. The VC results using SPD are determined by the performance of TTS models and VC training datasize. In addition, the VC result is also affected by the object of using SPD. • When the dataset is semiparallel, we should try to ensure the PR is large enough. If the original datasize is large, the introduction of SPD into target speaker or source speaker can both achieve ideal VC results. Thus, the full use of all types of SPD to ensure amount of data, can maximize the benefits. On the contrary, when the original datasize is small, the well-performing TTS models are difficult to get. Introducing training pair with negative impact such as source natural-target synthetic should be avoided. • SPD with external text data as data augmentation can improve parallel seq2seq VC performance to a certain extent (e.g. natural-natural). 17
  • 18. Future work • Using more speakers and a larger amount of data to further investigate the beneficial trend that seq2seq non-parallel VC can obtain from SPD. • In terms of methodology, we can introduce the VC models which can directly processing non-parallel data training to compare the performance with the way of using SPD on seq2seq VC in the future research, so as to further clarify the role of SPD. 18