Presenter: Dr. Xiaoxiao Miao, NII
Paper: https://arxiv.org/abs/2202.13097
Speaker anonymization aims to protect the privacy of speakers while preserving the spoken linguistic information in speech. Current mainstream neural-network speaker anonymization systems are complicated, containing an F0 extractor, a speaker encoder, an automatic speech recognition acoustic model (ASR AM), a speech synthesis acoustic model, and a speech waveform generation model. Moreover, because the ASR AM is language-dependent and trained on English data, it is hard to adapt the system to another language. In this paper, we propose a simpler self-supervised learning (SSL)-based method for language-independent speaker anonymization without any explicit language-dependent model, so it can be easily applied to other languages. Extensive experiments were conducted on the VoicePrivacy Challenge 2020 datasets in English and the AISHELL-3 dataset in Mandarin to demonstrate the effectiveness of the proposed SSL-based language-independent speaker anonymization method.
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models
1. Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models
Xiaoxiao Miao1, Xin Wang1, Erica Cooper1, Junichi Yamagishi1, Natalia Tomashenko2
1 National Institute of Informatics, Japan 2 LIA, University of Avignon, France
Odyssey 2022
2. Outline
• Introduction
• Motivation
• Language-independent SSL-based Speaker Anonymization System
• SSL-based Content Encoder
• ECAPA-TDNN Speaker Encoder
• HiFi-GAN
• Experimental Results and Analysis
• Conclusions and future work
3. Introduction
• Definition [1] from the VoicePrivacy Challenge (VPC) 2020:
• Suppress the speaker’s identity
• Preserve the linguistic content and other paralinguistic attributes such as age, gender, emotion, and the diversity of speech
• Allow downstream tasks: human communication, automated processing, model training, etc.
• VPC2020 primary baseline (B1) [1]: ASR + TTS
Step 1: disentangle speech into F0, BN features, and an x-vector
Step 2: anonymize the x-vector
Step 3: synthesize anonymized speech from the source F0, BN features, and the anonymized x-vector
[1] N. Tomashenko, et al., “Introducing the VoicePrivacy Initiative,” INTERSPEECH 2020
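As a concrete illustration of Step 2, here is a hedged Python sketch of the common VPC-style pseudo-speaker selection, which averages a random subset of the pool x-vectors farthest from the source speaker; the cosine distance and the parameters n_far and n_avg are illustrative assumptions (B1 itself uses PLDA-based distances over an external x-vector pool).

```python
import numpy as np

def anonymize_xvector(src, pool, n_far=200, n_avg=100, rng=None):
    """Hypothetical pseudo-speaker selection: average n_avg x-vectors
    randomly drawn from the n_far pool vectors least similar to src."""
    rng = rng or np.random.default_rng()
    # Cosine similarity between the source x-vector and every pool vector.
    sims = pool @ src / (np.linalg.norm(pool, axis=1) * np.linalg.norm(src))
    far_idx = np.argsort(sims)[:n_far]              # least similar speakers
    chosen = rng.choice(far_idx, size=n_avg, replace=False)
    return pool[chosen].mean(axis=0)                # anonymized x-vector
```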
4. Motivation
• Can we create one speaker anonymization system (SAS) that can anonymize speech from unseen languages?
• Directly using B1, the content is distorted because of the language-specific ASR AM
[Figure: proposed system diagram combining self-supervised models (Wav2vec2, HuBERT, HuBERT-kmeans), an ECAPA speaker encoder, and HiFi-GAN]
• The SSL-based content encoder provides general context representations
• The TDNN x-vector extractor is updated to the ECAPA-TDNN speaker encoder
• A HiFi-GAN replaces the traditional TTS pipeline as the speech waveform generation model, making the system more efficient
5. SSL-based SAS -- SSL Content Encoder
• HuBERT [2]
• Continuous features contain both context and speaker information, so they are not suitable for speech disentanglement
• HuBERT-km [3]
• Applies a k-means algorithm over HuBERT continuous features
• Discrete features get rid of speaker identity attributes, but inaccurate discrete features lead to incorrect pronunciations
• HuBERT-based soft content encoder [4]
• Trade-off between continuous and discrete features
[2] Wei-Ning Hsu, et al., “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” TASLP 2021
[3] Kushal Lakhotia, et al., “Generative spoken language modeling from raw audio,” TACL 2021
[4] Benjamin van Niekerk, et al., “A comparison of discrete and soft speech units for improved voice conversion,” ICASSP 2022
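To make the soft content encoder concrete, here is a minimal sketch assuming torchaudio's HUBERT_BASE pipeline; the class name, the number of units, and the training snippet are illustrative, following the idea in [4] (predict a distribution over discrete units) rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torchaudio

class SoftContentEncoder(nn.Module):
    """HuBERT Base with its final layer swapped for a linear head that
    predicts a distribution over K discrete units (the HuBERT-km clusters)."""
    def __init__(self, n_units: int = 100):
        super().__init__()
        self.hubert = torchaudio.pipelines.HUBERT_BASE.get_model()
        self.head = nn.Linear(768, n_units)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # extract_features returns the per-layer hidden states.
        layers, _ = self.hubert.extract_features(wav)
        return self.head(layers[-1])            # soft unit logits, (B, T, K)

# Fine-tuning target: discrete unit IDs from k-means over HuBERT features;
# cross-entropy against these IDs yields the "soft" content features.
encoder = SoftContentEncoder()
logits = encoder(torch.randn(1, 16000))          # 1 s of 16 kHz audio
targets = torch.randint(0, 100, (1, logits.size(1)))  # placeholder unit IDs
loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), targets)
```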
6. SSL-based SAS -- ECAPA
• ECAPA-TDNN [5]
• F-ECAPA: input is 80-dim FBank features
• S-ECAPA: input is the 768-dim weighted average of all HuBERT hidden layers
[5] Brecht Desplanques, et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” INTERSPEECH 2020
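A minimal sketch of the S-ECAPA front end, assuming SpeechBrain's ECAPA_TDNN and torchaudio's HuBERT Base as stand-ins; the learnable softmax layer weights implement the weighted average over the 12 hidden layers, while everything else is illustrative.

```python
import torch
import torch.nn as nn
import torchaudio
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

class SECAPA(nn.Module):
    """Illustrative S-ECAPA: HuBERT hidden layers -> learnable weighted
    average -> ECAPA-TDNN speaker embedding. HuBERT is also trainable."""
    def __init__(self, n_layers: int = 12):
        super().__init__()
        self.hubert = torchaudio.pipelines.HUBERT_BASE.get_model()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.ecapa = ECAPA_TDNN(input_size=768, lin_neurons=192)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        layers, _ = self.hubert.extract_features(wav)  # n_layers x (B, T, 768)
        w = torch.softmax(self.layer_weights, dim=0)
        feats = sum(wi * hi for wi, hi in zip(w, layers))
        return self.ecapa(feats)                       # (B, 1, 192) embedding
```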
7. SSL-based SAS -- HiFi-GAN
• Frame-wise F0, context features, and the segmental-level speaker embedding are upsampled and concatenated
• They are then passed to HiFi-GAN [6], which contains one generator and two discriminators:
• The multi-period discriminator (MPD) captures the periodic patterns
• The multi-scale discriminator (MSD) explores long-range and consecutive interactions
• HuBERT and ECAPA are frozen while training HiFi-GAN
[6] Jungil Kong, et al., “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” NeurIPS 2020
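A simple sketch of this conditioning step; the tensor shapes and upsampling modes are assumptions, and the point is only that the segment-level speaker embedding is repeated at every frame before concatenation.

```python
import torch
import torch.nn.functional as F

def build_generator_input(f0, content, spk_emb, target_len):
    """f0: (B, T_f0, 1), content: (B, T_c, D_c), spk_emb: (B, D_s).
    Upsample frame-level streams to a common length and append the
    segment-level speaker embedding at every frame."""
    f0 = F.interpolate(f0.transpose(1, 2), size=target_len,
                       mode="nearest").transpose(1, 2)
    content = F.interpolate(content.transpose(1, 2), size=target_len,
                            mode="linear", align_corners=False).transpose(1, 2)
    spk = spk_emb.unsqueeze(1).expand(-1, target_len, -1)
    return torch.cat([f0, content, spk], dim=-1)   # (B, T, 1 + D_c + D_s)
```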
8. Experimental Results and Analysis
• Experiment details
• Training sets: HuBERT-soft: LibriSpeech-100 [9]; HiFi-GAN: LibriTTS-100 [10]
• Test sets: LibriSpeech [9], VCTK [11]
• ASV_eval^Eng: TDNN; ASR_eval^Eng: TDNN-F (both trained on LibriSpeech-360 [9])
• Evaluation follows the VPC evaluation plan
• Privacy metric: EER, framed as a game between users and attackers
• Unprotected (OO): no anonymization
• Ignorant attacker (OA, a weak attacker): only users anonymize
• Lazy-informed attacker (AA, a strong attacker): users and attackers both anonymize; attackers know the SAS but not its specific parameters, so they obtain different pseudo-speakers
• Utility metric: WER
[9] Vassil Panayotov, et al., “Librispeech: An ASR corpus based on public domain audio books,” ICASSP 2015
[10] Heiga Zen, et al., “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019
[11] Christophe Veaux, et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019
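For reference, a minimal sketch of how the EER can be computed from ASV trial scores (illustrative; the challenge provides its own scoring scripts):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """EER is the operating point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)          # sweep thresholds in score order
    labels = labels[order]
    # FRR: fraction of targets at or below the threshold;
    # FAR: fraction of non-targets above it.
    frr = np.cumsum(labels) / labels.sum()
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```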
9. Experimental Results and Analysis
• English Speaker Anonymization

EER results on the Libri test set (%); unprotected (OO) EERs: Female = 8.66%, Male = 1.24%:

| System | OA-EER-Female | OA-EER-Male | AA-EER-Female | AA-EER-Male |
|---|---|---|---|---|
| VPC2020 B1 | 47.26 | 52.12 | 32.12 | 36.75 |
| S-ECAPA+HuBERT-soft | 41.42 | 39.87 | 31.02 | 36.97 |
| F-ECAPA+HuBERT-soft | 41.24 | 42.54 | 29.74 | 33.18 |

WER results of the original and anonymized Libri test speech (%):

| Original | VPC2020 B1 | S-ECAPA+HuBERT-soft | F-ECAPA+HuBERT-soft |
|---|---|---|---|
| 4.15 | 6.73 | 4.70 | 4.69 |

• The proposed SAS evaluated on English data:
• Protects the speaker information almost as well as VPC B1 (comparable EER)
• Provides more reliable linguistic information (lower WER)
10. Experimental Results and Analysis
• Mandarin Speaker Anonymization
• Experiment details
• Test set sampled from AISHELL-3 [12]: 10,120 enrollment-test trials
• ASV_eval^Mand: F-ECAPA trained on CN-Celeb 1&2 [13]
• ASR_eval^Mand: publicly available Transformer* trained on AISHELL-1 [14]

| (%) | S-ECAPA + HuBERT-soft | F-ECAPA + HuBERT-soft |
|---|---|---|
| EER, OO | 2.04 | 2.04 |
| EER, OA | 40.81 | 37.58 |
| EER, AA | 23.26 | 22.98 |
| CER, original | 10.36 | 10.36 |
| CER, anonymized | 21.18 | 18.86 |

• The proposed SAS is simpler, contains no language-specific models, and can be successfully applied to other languages
• CERs on the anonymized trials increased to about 20%, suggesting some degradation of the speech content
[12] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li, “AISHELL-3: A Multi-Speaker Mandarin TTS Corpus,” INTERSPEECH 2021
[13] Lantian Li, et al., “CN-Celeb: Multi-genre speaker recognition,” Speech Communication, 2022
[14] Hui Bu, et al., “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” O-COCOSDA 2017
*SpeechBrain: https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1/ASR/transformer
11. Conclusions and Future Work
• Can we create one speaker anonymization system (SAS) that can
anonymize speech from unseen languages?
• Yes, we can use the language-independent SSL-based SAS.
• Limitation of the proposed SAS:
• There is still room to improve the performance of the SAS under unseen conditions.
• See our Interspeech 2022 paper: https://arxiv.org/abs/2203.14834
• Audio samples and source code are available at
https://github.com/nii-yamagishilab/SSL-SAS
To begin with, I would like to give an introduction to the speaker anonymization task.
Then comes our motivation, followed by the details of our system and the experimental results and analysis.
Conclusions and future work will be presented in the final part.
Thanks to VPC2020 for giving the definition of the SA task: suppress the individuality of speech, leave other speaker attributes such as age and gender unchanged, and preserve the naturalness of speech.
The VPC2020 organizers also provided common datasets, evaluation metrics (objective and subjective), and two main baselines.
One is NN-based (the primary baseline, B1); the other is signal-processing based. We take B1 as the baseline in this paper, as it is more effective.
Specifically, there are three steps in the inference stage,
For B1, let us listen to the English samples: the speaker identity is protected at an acceptable level and the content is preserved; we can still understand what he is talking about.
So the system can effectively anonymize English speech data.
But we wondered whether we can create one SAS that can anonymize speech from unseen languages. If we directly use B1 to anonymize Mandarin speech, we get the following:
Obviously, the content was distorted; the system cannot be directly applied to another language.
The reason is clear: the ASR AM of this system is language-dependent, trained on English data.
To create a language-independent SAS, the language-dependent ASR AM can be replaced with an SSL-based content encoder, which extracts general context representations regardless of the language of the input speech.
In addition to the SSL content encoder, there are two more main differences:
2) The speaker vector is important; we update the TDNN x-vector extractor to the ECAPA-TDNN speaker encoder.
3) To make the system more efficient, we use a HiFi-GAN as the speech waveform generation model instead of the traditional TTS pipeline (a speech-synthesis AM and an NSF-based neural vocoder).
I will introduce these three parts one by one.
The first module is the SSL-based content encoder; we mainly focus on HuBERT. We also tested wav2vec 2.0 as the content encoder and observed similar results to those using HuBERT, so we only talk about HuBERT-based systems here.
There are two common ways to use HuBERT.
One is directly using the continuous output features of the HuBERT model.
However, it has been shown that the continuous representations contain both context and speaker information, making them unsuitable for speech disentanglement.
Another line of work applies a k-means algorithm over the HuBERT continuous representations to obtain discrete features, getting rid of speaker identity attributes.
However, inaccurate discrete features are known to lead to incorrect pronunciations, and we confirm this issue in our experiments later as well.
To strike a trade-off between continuous representations and discrete speech units, we implement the HuBERT-based soft content encoder to capture soft content features.
We fine-tune the HuBERT Base model, replacing the final layer with a linear projection that predicts soft features, taking the discrete speech units obtained from HuBERT-km as targets.
By predicting a distribution over discrete speech units, we get a better trade-off between continuous and discrete features.
The next part is the speaker encoder, ECAPA. We test two ECAPAs using different input features.
The first one follows the original ECAPA, using 80-dimensional FBank as the input feature.
The second one uses SSL features: a weighted average embedding from all the hidden layers of a pre-trained HuBERT Base. The parameters of HuBERT Base were updated when training S-ECAPA.
After getting the frame-wise F0, the context features, and the segmental-level speaker embedding, we upsample and concatenate them and feed them into HiFi-GAN to produce the speech; during this training, HuBERT and ECAPA are frozen.
HiFi-GAN consists of one generator and two discriminators: the multi-period discriminator, which captures the periodic patterns of speech, and the multi-scale discriminator, which explores long-range and consecutive interactions.
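To illustrate the period-wise idea behind the MPD, here is a hedged sketch; the periods follow the HiFi-GAN paper, but the convolution stack is heavily simplified and the names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    """Reshapes the waveform into (T/p, p) so 2D convolutions can compare
    samples that are exactly one period apart (simplified conv stack)."""
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 1, (3, 1), padding=(1, 0)),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, 1, T)
        b, c, t = wav.shape
        pad = (self.period - t % self.period) % self.period
        wav = F.pad(wav, (0, pad), mode="reflect")  # make T divisible by p
        wav = wav.view(b, c, -1, self.period)       # (B, 1, T/p, p)
        return self.convs(wav)                      # real/fake score map

# HiFi-GAN's MPD is an ensemble over prime periods.
mpd = nn.ModuleList(PeriodDiscriminator(p) for p in (2, 3, 5, 7, 11))
```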
We evaluate two objective metrics: the EER, to evaluate privacy, and the WER, to evaluate utility.
For privacy, assume a game between users and attackers: users have their test trials and can use the SAS to anonymize their data beforehand, while attackers have their enrollment utterances. Three scenarios are considered.
In OO, no anonymization is applied by either users or attackers. In OA, only users anonymize their data; the attacker is unaware of this and uses the original enrollment speech. In AA, both users and attackers anonymize the speech.
For utility, we compare the WER before and after anonymization to measure how well the speech content is preserved.
In summary, a good SAS should achieve a higher EER in the OA and AA conditions and a lower WER on anonymized speech.
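As a reminder of what the WER metric measures, here is a minimal Levenshtein-based sketch (illustrative; the actual evaluation uses the VPC ASR scoring pipeline):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance between word sequences, normalized
    by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```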
Here are the major settings; the ASV and ASR evaluation models are language-matched.
Here are the English anonymization results.
We list the Libri test set results; five systems are compared, with male and female speech under the OA and AA conditions.
The top shows the EER (higher is better), then the WER (lower is better), and at the bottom is one female audio sample.
Compared with B1, the soft-content SAS achieves a comparable EER and a lower WER; we can hear the sample.
Regarding the different content encoders, directly using the continuous features yields a lower EER, since the content features still carry speaker information.
The final part is the Mandarin SA. Besides OO, OA, and AA, we also evaluate the OR scenario, where the test trials are resynthesized without anonymization; a good SAS should achieve a low EER in both OR and OO, with OR as low as OO.