Language-independent Speaker Anonymization
Approach using Self-supervised Pre-trained
Models
Xiaoxiao Miao1, Xin Wang1, Erica Cooper1,
Junichi Yamagishi1, Natalia Tomashenko2
1 National Institute of Informatics, Japan 2 LIA, University of Avignon, France
Odyssey 2022
Outline
• Introduction
• Motivation
• Language-independent SSL-based Speaker Anonymization System
• SSL-based Content Encoder
• ECAPA-TDNN Speaker Encoder
• HiFi-GAN
• Experimental Results and Analysis
• Conclusions and future work
Introduction
• Definition[1] from the VoicePrivacy Challenge (VPC) 2020:
• Suppress the speaker’s identity
• Preserve the linguistic content and other paralinguistic attributes
such as age, gender, and emotion, as well as the diversity of speech
• Allow downstream tasks: human communication, automated
processing, model training, etc.
• VPC2020 primary baseline (B1)[1]: an ASR + TTS pipeline
Step 1: disentangle speech into F0, bottleneck (BN) features, and an x-vector
Step 2: anonymize the x-vector
Step 3: synthesize anonymized speech using the source F0, the BN features, and the
anonymized x-vector
[1] N. Tomashenko, et al., “Introducing the VoicePrivacy Initiative,” INTERSPEECH 2020
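The anonymization step (Step 2) can be sketched in Python. This is only an illustration: cosine similarity stands in for the PLDA affinity used by the actual VPC baseline, and the function name, pool, and array sizes are hypothetical.

```python
import numpy as np

def anonymize_b1(f0, bn, xvector, pool, n_farthest=200, n_select=100):
    """Sketch of the B1 x-vector anonymization step: average a random
    subset of the external-pool speaker vectors that are farthest
    (here: least cosine-similar) from the source x-vector."""
    # similarity between the source x-vector and every pool vector
    sims = pool @ xvector / (np.linalg.norm(pool, axis=1) * np.linalg.norm(xvector))
    farthest = np.argsort(sims)[:n_farthest]            # least similar candidates
    chosen = np.random.choice(farthest, n_select, replace=False)
    pseudo_xvector = pool[chosen].mean(axis=0)          # anonymized identity
    # Step 3 would synthesize speech from (f0, bn, pseudo_xvector)
    return f0, bn, pseudo_xvector
```

The averaging over many distant speakers is what makes the resulting pseudo-speaker unlinkable to the original identity while staying on the speaker-embedding manifold.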
Motivation
• Can we create one speaker anonymization system (SAS) that can
anonymize speech from unseen languages?
• Directly using B1, the content is distorted because of the language-specific
ASR acoustic model
[Figure: proposed pipeline combining a self-supervised model (Wav2vec2, HuBERT, HuBERT-kmeans), an ECAPA speaker encoder, and HiFi-GAN]
• The SSL-based content encoder provides general context representations
• Updating the TDNN x-vector extractor to the ECAPA-TDNN speaker encoder
• Using HiFi-GAN as the speech waveform generation model instead
of the traditional TTS pipeline makes the system more efficient
SSL-based SAS --SSL
• HuBERT[2]
• Continuous features contain both context and speaker information,
so they are not suitable for speech disentanglement
• HuBERT-km[3]
• Applies the k-means algorithm over HuBERT continuous features
• Discrete features get rid of speaker identity attributes,
but inaccurate discrete features lead to incorrect pronunciations
• HuBERT-based soft content encoder[4]
• A trade-off between continuous and discrete features
[2] Wei-Ning Hsu, et al., “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” TASLP 2021
[3] Kushal Lakhotia, et al., “Generative spoken language modeling from raw audio,” TACL 2021
[4] Benjamin van Niekerk, et al., “A comparison of discrete and soft speech units for improved voice conversion,” ICASSP 2022
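The hard-vs-soft distinction above can be illustrated with a minimal sketch over random stand-in features. Note the real soft content encoder is a learned projection trained to predict the discrete units; the softmax over centroid distances below is only an analogy for "a distribution over units instead of one hard, possibly wrong, assignment."

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 768)).astype(np.float32)  # toy HuBERT frames (T, 768)

def kmeans_units(x, k=50, iters=5, seed=0):
    """Minimal k-means: returns centroids and per-frame discrete unit IDs."""
    r = np.random.default_rng(seed)
    cent = x[r.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - cent[None, :, :], axis=-1)  # (T, k)
        ids = dist.argmin(axis=1)
        for j in range(k):                     # update each non-empty centroid
            if (ids == j).any():
                cent[j] = x[ids == j].mean(axis=0)
    return cent, ids

centroids, discrete_units = kmeans_units(feats)   # HuBERT-km style hard units

# Soft-unit analogy: keep a distribution over clusters per frame rather
# than a single hard assignment.
dist = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
soft_units = np.exp(-dist) / np.exp(-dist).sum(axis=1, keepdims=True)
```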
SSL-based SAS --ECAPA
• ECAPA-TDNN[5] speaker encoder, with two input variants:
• F-ECAPA: 80-dim FBank features
• S-ECAPA: a 768-dim weighted average over all the SSL model’s hidden layers
[5] Brecht Desplanques, et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” INTERSPEECH 2020
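The S-ECAPA input can be sketched as follows (random stand-in hidden states; in the real model the layer weights are learned jointly with the speaker encoder, so the zero logits here just give the uniform starting point):

```python
import numpy as np

# Toy stand-in for the stacked hidden states of the SSL model:
# (num_layers, T frames, 768 dims)
hidden = np.random.default_rng(1).normal(size=(13, 200, 768))

# A softmax over learnable per-layer logits keeps the S-ECAPA input a
# convex combination of all hidden layers; zero logits = plain average.
layer_logits = np.zeros(13)
w = np.exp(layer_logits) / np.exp(layer_logits).sum()
s_ecapa_input = np.tensordot(w, hidden, axes=1)   # (200, 768)
```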
SSL-based SAS --HiFiGAN
• Frame-wise F0, context features, and the segmental-level speaker
embedding are up-sampled and concatenated
• They are then passed to HiFi-GAN[6], which
contains one generator and two discriminators:
• The multi-period discriminator (MPD)
captures the periodic patterns
• The multi-scale discriminator (MSD)
explores long-range and consecutive
interactions
• The SSL content encoder and the speaker encoder are frozen during training
[6] Jungil Kong, et al., “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” NeurIPS 2020
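The input assembly described above can be sketched in a few lines; the feature dimensions below are illustrative, not the system's actual configuration:

```python
import numpy as np

T = 200                                  # number of content frames
f0 = np.random.rand(T, 1)                # frame-wise F0
content = np.random.rand(T, 256)         # content features (dim illustrative)
spk = np.random.rand(192)                # one segmental-level speaker embedding

# The single speaker embedding is repeated ("up-sampled") to every frame,
# then the three streams are concatenated along the feature axis before
# being fed to the HiFi-GAN generator.
spk_frames = np.repeat(spk[None, :], T, axis=0)                      # (T, 192)
generator_input = np.concatenate([f0, content, spk_frames], axis=1)  # (T, 449)
```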
Experiment details
• Training set
• HuBERT-soft: LibriSpeech-100[9]
• HiFi-GAN: LibriTTS-100[10]
• Test set
• LibriSpeech[9], VCTK[11]
• Evaluation models, following the VPC evaluation plan and trained on LibriSpeech-360[9]:
• ASV_eval (Eng): TDNN
• ASR_eval (Eng): TDNN-F

Experimental Results and Analysis
• Privacy metric: EER, framed as a game between users and attackers
• Unprotected (OO): no anonymization
• Ignorant attacker (OA): users anonymize their speech; the attacker does not (weak attacker)
• Lazy-informed (AA): both users and attackers anonymize; the attacker knows the SAS
but not its specific parameters, and therefore obtains a different pseudo-speaker (strong attacker)
• Utility metric: WER
[9] Vassil Panayotov, et al., “Librispeech: an ASR corpus based on public domain audio books,” ICASSP 2015
[10] Heiga Zen, et al., “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019
[11] Christophe Veaux, et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019
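The EER used as the privacy metric above can be computed from ASV trial scores with a simple threshold sweep; the implementation and the score arrays below are illustrative:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the point where the false-rejection rate of
    target trials equals the false-acceptance rate of nontarget trials."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # sweep the decision threshold from low to high scores
    frr = np.cumsum(labels) / labels.sum()                  # rejected targets
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()    # accepted nontargets
    idx = np.argmin(np.abs(frr - far))                      # crossing point
    return (frr[idx] + far[idx]) / 2
```

A higher EER after anonymization means the attacker's ASV system can no longer separate same-speaker from different-speaker trials, i.e. better privacy.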
Experimental Results and Analysis
• English speaker anonymization

EER results on the Libri test set (%); unprotected (OO): Female = 8.66, Male = 1.24:

System                   OA-Female  OA-Male  AA-Female  AA-Male
VPC2020 B1                 47.26     52.12     32.12     36.75
S-ECAPA + HuBERT-soft      41.42     39.87     31.02     36.97
F-ECAPA + HuBERT-soft      41.24     42.54     29.74     33.18

WER results on the original and anonymized Libri test speech (%):

Original  VPC2020 B1  S-ECAPA + HuBERT-soft  F-ECAPA + HuBERT-soft
  4.15       6.73             4.70                   4.69

• The proposed SAS, evaluated on English data:
• Protects the speaker information almost as well as VPC B1 (comparable EER)
• Provides more reliable linguistic information (lower WER)
Experimental Results and Analysis
• Mandarin speaker anonymization
• Experiment details
• Test set sampled from AISHELL-3[12]: 10,120 enrollment-test trials
• ASV_eval (Mand): F-ECAPA trained on CN-Celeb 1&2[13]
• ASR_eval (Mand): a publicly available Transformer* trained on AISHELL-1[14]
[12] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li, “AISHELL-3: A Multi-Speaker Mandarin TTS Corpus,” INTERSPEECH 2021
[13] Lantian Li, et al., “CN-Celeb: multi-genre speaker recognition,” Speech Communication, 2022
[14] Hui Bu, et al., “Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline,” O-COCOSDA 2017
*SpeechBrain: https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1/ASR/transformer

(%)                  S-ECAPA + HuBERT-soft  F-ECAPA + HuBERT-soft
EER   OO                     2.04 (original speech)
      OA                    40.81                 37.58
      AA                    23.26                 22.98
CER   Original              10.36                 10.36
      Anonymized            21.18                 18.86

• The proposed SAS is simpler, contains no language-specific models, and can be
successfully applied to other languages
• CERs on the anonymized trials increased to about 20%, suggesting some degradation
of the speech content
Conclusions and Future Work
• Can we create one speaker anonymization system (SAS) that can
anonymize speech from unseen languages?
• Yes, we can use Language-independent SSL-based SAS.
• Limitations of the proposed SAS:
• There is still room to improve the performance of the SAS under
unseen conditions
• See our Interspeech2022 paper: https://arxiv.org/abs/2203.14834
• Audio samples and source code are available at
https://github.com/nii-yamagishilab/SSL-SAS
Thanks for listening
Q&A

Weitere ähnliche Inhalte

Ähnlich wie Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models

Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speechBilgin Aksoy
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...NU_I_TODALAB
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identificationIJCSEA Journal
 
Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for UniversitiesEducational Technology
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNIJCSEA Journal
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNIJCSEA Journal
 
Deep Learning for Automatic Speaker Recognition
Deep Learning for Automatic Speaker RecognitionDeep Learning for Automatic Speaker Recognition
Deep Learning for Automatic Speaker RecognitionSai Kiran Kadam
 
MediaEval 2016 - ININ Submission to Zero Cost ASR Task
MediaEval 2016 - ININ Submission to Zero Cost ASR TaskMediaEval 2016 - ININ Submission to Zero Cost ASR Task
MediaEval 2016 - ININ Submission to Zero Cost ASR Taskmultimediaeval
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesisNAVER Engineering
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Sebastian Ruder
 
Lip Reading.pptx
Lip Reading.pptxLip Reading.pptx
Lip Reading.pptxNivethaT15
 
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...csandit
 
Combined feature extraction techniques and naive bayes classifier for speech ...
Combined feature extraction techniques and naive bayes classifier for speech ...Combined feature extraction techniques and naive bayes classifier for speech ...
Combined feature extraction techniques and naive bayes classifier for speech ...csandit
 
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...cscpconf
 
Speech compression using voiced excited loosy predictive coding (lpc)
Speech compression using voiced excited loosy predictive coding (lpc)Speech compression using voiced excited loosy predictive coding (lpc)
Speech compression using voiced excited loosy predictive coding (lpc)Harshal Ladhe
 
final ppt BATCH 3.pptx
final ppt BATCH 3.pptxfinal ppt BATCH 3.pptx
final ppt BATCH 3.pptxMounika715343
 
Data integration in ENFIN using standards. The EnCore DAS service.
Data integration in ENFIN using standards. The EnCore DAS service.Data integration in ENFIN using standards. The EnCore DAS service.
Data integration in ENFIN using standards. The EnCore DAS service.Rafael C. Jimenez
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesYun-Nung (Vivian) Chen
 
Experimental analysis of optimal window length for independent low-rank matri...
Experimental analysis of optimal window length for independent low-rank matri...Experimental analysis of optimal window length for independent low-rank matri...
Experimental analysis of optimal window length for independent low-rank matri...Daichi Kitamura
 
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowListening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
 

Ähnlich wie Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models (20)

Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identification
 
Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for Universities
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANN
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANN
 
Deep Learning for Automatic Speaker Recognition
Deep Learning for Automatic Speaker RecognitionDeep Learning for Automatic Speaker Recognition
Deep Learning for Automatic Speaker Recognition
 
MediaEval 2016 - ININ Submission to Zero Cost ASR Task
MediaEval 2016 - ININ Submission to Zero Cost ASR TaskMediaEval 2016 - ININ Submission to Zero Cost ASR Task
MediaEval 2016 - ININ Submission to Zero Cost ASR Task
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesis
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 
Lip Reading.pptx
Lip Reading.pptxLip Reading.pptx
Lip Reading.pptx
 
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
 
Combined feature extraction techniques and naive bayes classifier for speech ...
Combined feature extraction techniques and naive bayes classifier for speech ...Combined feature extraction techniques and naive bayes classifier for speech ...
Combined feature extraction techniques and naive bayes classifier for speech ...
 
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...
 
Speech compression using voiced excited loosy predictive coding (lpc)
Speech compression using voiced excited loosy predictive coding (lpc)Speech compression using voiced excited loosy predictive coding (lpc)
Speech compression using voiced excited loosy predictive coding (lpc)
 
final ppt BATCH 3.pptx
final ppt BATCH 3.pptxfinal ppt BATCH 3.pptx
final ppt BATCH 3.pptx
 
Data integration in ENFIN using standards. The EnCore DAS service.
Data integration in ENFIN using standards. The EnCore DAS service.Data integration in ENFIN using standards. The EnCore DAS service.
Data integration in ENFIN using standards. The EnCore DAS service.
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course Lectures
 
Experimental analysis of optimal window length for independent low-rank matri...
Experimental analysis of optimal window length for independent low-rank matri...Experimental analysis of optimal window length for independent low-rank matri...
Experimental analysis of optimal window length for independent low-rank matri...
 
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowListening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
 

Mehr von Yamagishi Laboratory, National Institute of Informatics, Japan

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~Yamagishi Laboratory, National Institute of Informatics, Japan
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~Yamagishi Laboratory, National Institute of Informatics, Japan
 

Mehr von Yamagishi Laboratory, National Institute of Informatics, Japan (17)

DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
 
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
 
Generalization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction NetworksGeneralization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction Networks
 
Estimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasureEstimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasure
 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio SynthesisText-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
 
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Advancements in Neural Vocoders
 
Neural Waveform Modeling
Neural Waveform Modeling Neural Waveform Modeling
Neural Waveform Modeling
 
Neural source-filter waveform model
Neural source-filter waveform modelNeural source-filter waveform model
Neural source-filter waveform model
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
 
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models

• HuBERT-based Soft Content Encoder[4]
• Trade-off between continuous and discrete features
[2] Wei-Ning Hsu, et al., “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” TASLP 2021
[3] Kushal Lakhotia, et al., “Generative Spoken Language Modeling from Raw Audio,” TACL 2021
[4] Benjamin van Niekerk, et al., “A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion,” ICASSP 2022
4
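The contrast between discrete and soft units can be illustrated with a small numpy sketch (not the paper's code): hard assignment to k-means centroids gives HuBERT-km-style discrete units, while a temperature-controlled distribution over the same centroids stands in for the softer representation. The distance-based softmax and the `temperature` parameter are illustrative assumptions, not the trained soft encoder.

```python
import numpy as np

def quantize(frames, centroids):
    """Hard-assign each frame-level SSL feature to its nearest k-means
    centroid (HuBERT-km style discretization); returns discrete unit IDs."""
    # squared distances, shape (T, K): one row per frame, one column per centroid
    d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def soft_units(frames, centroids, temperature=0.1):
    """A 'soft' alternative: a distribution over units per frame instead of
    a hard assignment (illustrative stand-in for the learned soft encoder)."""
    d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d / temperature
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

The soft distribution keeps the same most-likely unit as the hard assignment but retains pronunciation detail in the remaining probability mass, which is the trade-off the slide describes.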
SSL-based SAS --ECAPA
• ECAPA-TDNN[5]
• F-ECAPA: 80-dim FBank input features
• S-ECAPA: 768-dim weighted average over all the hidden layers
[5] Brecht Desplanques, et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” INTERSPEECH 2020
5
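The weighted average that feeds S-ECAPA can be sketched as a softmax-weighted sum over per-layer hidden states. This is a hypothetical numpy stand-in: in the real system the layer weights are learned jointly with the speaker encoder, and the HuBERT Base parameters are updated during training.

```python
import numpy as np

def weighted_layer_average(hidden_states, layer_logits):
    """Combine per-layer SSL hidden states of shape (L, T, D) into a single
    (T, D) sequence using softmax-normalized layer weights."""
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                                 # (L,) weights summing to 1
    # contract the layer axis: sum_l w[l] * hidden_states[l]
    return np.tensordot(w, hidden_states, axes=1)   # (T, D)
```

With all logits equal this reduces to a plain mean over layers; training lets the speaker encoder emphasize whichever layers carry the most speaker information.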
SSL-based SAS --HiFi-GAN
• Frame-wise F0, context features, and the segmental-level speaker embedding are up-sampled and concatenated
• Then passed to HiFi-GAN[6], which contains one generator and two discriminators:
• Multi-period discriminator (MPD): captures the periodic patterns
• Multi-scale discriminator (MSD): explores long-range and consecutive interactions
(HuBERT and ECAPA are frozen during HiFi-GAN training)
[6] Jungil Kong, et al., “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” NeurIPS 2020
6
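The input assembly described on this slide can be sketched as follows. This is only a shape-level illustration: the hop size of 320 samples per frame and the exact up-sampling scheme are assumptions, not values taken from the paper.

```python
import numpy as np

def assemble_generator_input(f0, content, spk_emb, hop=320):
    """Sketch of the generator-input assembly: frame-wise F0 (T,) and content
    features (T, Dc) are repeated to a finer time resolution, the single
    utterance-level speaker embedding (Ds,) is broadcast over time, and all
    three streams are concatenated channel-wise."""
    T = len(f0)
    f0_up = np.repeat(f0[:, None], hop, axis=0)        # (T*hop, 1)
    content_up = np.repeat(content, hop, axis=0)       # (T*hop, Dc)
    spk_up = np.broadcast_to(spk_emb, (T * hop, spk_emb.shape[0]))
    return np.concatenate([f0_up, content_up, spk_up], axis=1)
```

Broadcasting the speaker embedding over every step is what lets a single anonymized pseudo-speaker vector recolor the whole utterance.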
Experimental Results and Analysis
• Experiment details
• Training sets: HuBERT-soft on LibriSpeech-100[9], HiFi-GAN on LibriTTS-100[10]
• Test sets: LibriSpeech[9], VCTK[11]
• ASV-eval (Eng): TDNN; ASR-eval (Eng): TDNN-F; both trained on LibriSpeech-360[9]
• Evaluation plan
• Privacy: EER, a game between users and attackers
• Unprotected (OO): no anonymization
• Ignorant attacker (OA): only users anonymize (weak attacker)
• Lazy-informed attacker (AA): users and attackers anonymize; attackers know the SAS but not the specific parameters, so they obtain different pseudo-speakers (strong attacker)
• Utility: WER
[9] Vassil Panayotov, et al., “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” ICASSP 2015
[10] Heiga Zen, et al., “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” arXiv preprint arXiv:1904.02882, 2019
[11] Christophe Veaux, et al., “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019
7
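The privacy metric in this plan is the EER of the attacker's ASV scores: the operating point where false acceptance and false rejection rates meet. A minimal threshold-sweep implementation (illustrative only; the VoicePrivacy toolkit ships its own scoring scripts) looks like:

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep a decision threshold over all observed scores and return the
    EER: the point where the false-acceptance rate (nontargets scored above
    threshold) and false-rejection rate (targets scored below) are closest."""
    best_far, best_frr, best_gap = 1.0, 0.0, float("inf")
    for thr in sorted(target_scores + nontarget_scores):
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_far, best_frr, best_gap = far, frr, abs(far - frr)
    return (best_far + best_frr) / 2
```

An EER near 50% means the attacker's ASV is guessing at chance, which is why higher EER in the OA and AA conditions indicates stronger anonymization.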
Experimental Results and Analysis
• English Speaker Anonymization

EER results on the Libri test set (%); original speech: Female OO = 8.66, Male OO = 1.24

System                  OA-Female  OA-Male  AA-Female  AA-Male
VPC2020 B1                  47.26    52.12      32.12    36.75
S-ECAPA + HuBERT-soft       41.42    39.87      31.02    36.97
F-ECAPA + HuBERT-soft       41.24    42.54      29.74    33.18

WER results on the original and anonymized Libri test speech (%)

Original  VPC2020 B1  S-ECAPA + HuBERT-soft  F-ECAPA + HuBERT-soft
    4.15        6.73                   4.70                   4.69

• The proposed SAS evaluated on English data:
• Protects the speaker information almost as well as VPC B1 (comparable EER)
• Provides more reliable linguistic information (lower WER)
8
Experimental Results and Analysis
• Mandarin Speaker Anonymization
• Experiment details
• Test set sampled from AISHELL-3[12]: 10120 enrollment-test trials
• ASV-eval (mand): F-ECAPA trained on CN-Celeb 1&2[13]
• ASR-eval (mand): publicly available Transformer* trained on AISHELL-1[14]

(%)                  S-ECAPA + HuBERT-soft  F-ECAPA + HuBERT-soft
EER  OO                               2.04
     OA                              40.81                  37.58
     AA                              23.26                  22.98
CER  Original                        10.36                  10.36
     Anonymized                      21.18                  18.86

• The proposed SAS is simpler, contains no language-specific models, and can be successfully adopted for other languages
• CERs on the anonymized trials increased to about 20%, suggesting degradation of the speech content
[12] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li, “AISHELL-3: A Multi-Speaker Mandarin TTS Corpus,” INTERSPEECH 2021
[13] Lantian Li, et al., “CN-Celeb: Multi-genre Speaker Recognition,” Speech Communication, 2022
[14] Hui Bu, et al., “AISHELL-1: An Open-source Mandarin Speech Corpus and a Speech Recognition Baseline,” O-COCOSDA 2017
*SpeechBrain: https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1/ASR/transformer
9
Conclusions and Future Work
• Can we create one speaker anonymization system (SAS) that can anonymize speech from unseen languages?
• Yes: with the language-independent SSL-based SAS.
• Limitation of the proposed SAS:
• There is still room to improve its performance under unseen conditions.
• See our Interspeech 2022 paper: https://arxiv.org/abs/2203.14834
• Audio samples and source code are available at https://github.com/nii-yamagishilab/SSL-SAS
10

Editor's notes

  1. To begin with, I would like to introduce the speaker anonymization task, then our motivation, next the details of our system, followed by the experimental results and analysis. Conclusions and future work will be presented in the final part.
  2. Thanks to VPC2020 for defining the SA task: suppress the individuality of speech while leaving other speaker attributes such as age and gender, and preserving the naturalness of speech. The VPC2020 organizers also provided common datasets, evaluation metrics (objective and subjective), and two main baselines: one is NN-based (the primary baseline), the other is signal-processing based. We take B1 as the baseline in this paper, as it is more effective. Specifically, there are three steps in the inference stage.
  3. For B1, let us listen to the English samples: the speaker identity is protected at an acceptable level and the content is preserved; we can still understand what the speaker is saying, so the system can effectively anonymize English speech. We wondered whether we can create one SAS that can anonymize speech from unseen languages. If we directly use B1 to anonymize Mandarin speech, we get these samples: obviously the content is distorted, so the system cannot be directly applied to another language. The reason is clear: the ASR AM of this system is language-dependent, trained on English data. To create a language-independent SAS, the language-dependent ASR AM can be replaced with an SSL-based content encoder, which can extract general context representations regardless of the language of the input speech. In addition to the SSL content encoder, there are two main differences: 1) since the speaker vector is important, we update the TDNN x-vector to the ECAPA-TDNN speaker encoder; 2) to make the system more efficient, we use a HiFi-GAN as the speech waveform generation model instead of the traditional TTS pipeline (an SS AM and NSF-based neural vocoder). I will introduce these three parts one by one.
  4. The first module is the SSL-based content encoder; we mainly focus on HuBERT. We also tested wav2vec 2.0 as the content encoder and observed similar results to those using HuBERT, so we only discuss HuBERT-based systems here. There are two common ways to use HuBERT. One is to directly use the continuous output features of the HuBERT model. However, it has been shown that the continuous representations contain both context and speaker information, which makes them unsuitable for speech disentanglement. Another line of work applied a k-means algorithm over HuBERT continuous representations to obtain discrete features, getting rid of speaker identity attributes; however, inaccurate discrete features are known to lead to incorrect pronunciation, and we confirm this issue in our experiments as well. To achieve a trade-off between continuous representations and discrete speech units, we implement the HuBERT-based soft content encoder to capture soft content features. We fine-tune the HuBERT Base model, replacing the final layer with a linear projection that predicts soft features by predicting a distribution over discrete speech units, taking the discrete speech units obtained from HuBERT-km as targets. In this way, we get a better trade-off between continuous and discrete features.
  5. The next part is the speaker encoder, ECAPA. We test two ECAPAs using different input features. The first one follows the original ECAPA, using the 80-dimensional FBank as the input feature. The second one uses SSL features: the weighted average embedding from all the hidden layers of a pre-trained HuBERT Base. The parameters of HuBERT Base were updated when training S-ECAPA.
  6. After obtaining the frame-wise F0, context, and segmental-level speaker embedding, we upsample and concatenate them and input them to HiFi-GAN to produce the speech; during training, HuBERT and ECAPA are fixed. HiFi-GAN consists of one generator and two discriminators: a multi-period discriminator that captures the periodic patterns of the speech, and a multi-scale discriminator that explores long-range and consecutive interactions.
  7. We evaluate two objective metrics: the EER metric to evaluate privacy and the WER metric to evaluate utility. For privacy, assume a game between users and attackers: users have their test trials and can use the SAS to anonymize their data beforehand, while attackers have their enrollment utterances. Three scenarios are considered. OO: no anonymization applied for either users or attackers. OA: only users anonymize their data; the attacker is unaware of this and uses the original enrollment speech. AA: both users and attackers anonymize the speech. For utility, we compare the WER before and after anonymization to measure how well the speech content is preserved. In summary, a good SAS should achieve a higher EER in the OA and AA conditions and a lower WER on anonymized speech. Here are the major settings; the ASV and ASR evaluation models are language-matched.
  8. Here are the English anonymization results. We list the libri-test set results: five systems are compared, on male and female speech in the OA and AA conditions. The top is the EER (the higher the better), then the WER (the lower the better); at the bottom is one female audio sample. Compared with B1, the soft-content SAS achieves comparable EER and lower WER; we can hear the sample. As for the different content encoders, if we directly use the continuous features, the content features still carry speaker information.
  9. The final part is the Mandarin SA. Besides OO, OA, and AA, we also evaluate the OR scenario, where the test trials are resynthesized; a good SAS should achieve low EER in OR and OO, and the OR EER should be as low as the OO EER.
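The soft content encoder objective described in note 4, a linear projection predicting a distribution over HuBERT-km discrete units trained with cross-entropy, can be sketched in numpy. The projection weights and shapes here are hypothetical, and the real encoder is fine-tuned end to end rather than evaluated with fixed weights.

```python
import numpy as np

def soft_encoder_loss(hidden, proj_w, proj_b, targets):
    """Cross-entropy of a linear projection over K discrete units against the
    per-frame HuBERT-km unit labels: hidden (T, D), proj_w (D, K), proj_b (K,),
    targets (T,) integer unit IDs."""
    logits = hidden @ proj_w + proj_b                    # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative log-likelihood of the correct unit, averaged over frames
    return -log_probs[np.arange(len(targets)), targets].mean()
```

At inference time the softmax outputs (rather than the argmax units) are kept as the soft content features, which is what preserves pronunciation detail that hard discretization discards.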