5. ASR in Real World Scenarios
[Figure: the source signal is degraded by reverberation from surface reflections, additive noise from other sound sources, and channel distortion.]
3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 5
7. Is Work on Speech Separation Needed?
• End-to-end ASR system sufficient?
• Current ASR techniques require a huge amount of training data covering various
conditions to train well
• Speech separation can be used as advanced front-end
• Speech separation criterion can be used as regularization to aid and speed up
training of ASR systems
• More applications than ASR
• Hearing aids
• Cochlear implants
• Noise reduction for mobile communication
• Audio information retrieval
• Using microphone array sufficient?
• A microphone array alone is not sufficient, e.g., when the sources come from the same direction
• Many recordings are still collected with single microphone
9. Problem Definition
• Source speech streams: x_s(t), s = 1, …, S
• Mixed speech: y(t) = Σ_s x_s(t)
• STFT domain: Y(t, f) = Σ_s X_s(t, f)
• Estimate a mask M̂_s(t, f) for each stream
• Reconstruct with the mask: X̂_s(t, f) = M̂_s(t, f) |Y(t, f)|
• Ill-posed problem (# constraints < # free parameters):
• There are an infinite number of possible X_s(t, f) combinations that lead to
the same Y(t, f)
•Solution:
• Learn from training set to look for hidden regularities (complicated soft
constraints)
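The mask-based pipeline defined above can be sketched with NumPy. This is a toy illustration: random complex arrays stand in for real STFTs, and the ideal amplitude mask is used in place of a learned model's output. The mixing and reconstruction steps follow the definitions on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, S = 100, 257, 2  # frames, frequency bins, number of speakers

# Complex STFTs of the individual source streams X_s(t, f).
X = rng.standard_normal((S, T, F)) + 1j * rng.standard_normal((S, T, F))

# Mixing is additive in the complex STFT domain: Y(t, f) = sum_s X_s(t, f).
Y = X.sum(axis=0)

# A separation model would estimate one mask per stream; here the ideal
# amplitude mask |X_s| / |Y| stands in for the model output.
eps = 1e-8
M = np.abs(X) / (np.abs(Y) + eps)

# Reconstruct each stream: magnitude from the mask, phase from the mixture.
X_hat = M * np.abs(Y) * np.exp(1j * np.angle(Y))

print(X_hat.shape)  # (2, 100, 257): S separated streams
```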
10. Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
• Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
• Learn a set of non-negative bases during training
• Estimate mixing factors during evaluation
• Model-based approaches such as factorial GMM-HMM
• Model the interaction between the target and competing speech signals and
their temporal dynamics
• Spatial filtering with a microphone array
• Beamforming: Extract target sound from a specific spatial direction
• Independent component analysis: Find a demixing matrix from multiple
mixtures of sound sources
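The NMF approach above can be sketched with standard multiplicative updates (a minimal illustration on random non-negative "spectrograms", not the exact recipe from any one paper): non-negative bases are learned per speaker during training, then held fixed at evaluation time while only the mixing factors are estimated for the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-9

def nmf(V, rank, iters=200, W=None):
    """Factor V ≈ W @ H with multiplicative updates (Frobenius loss).
    If W is given it is kept fixed (the evaluation phase: only the
    mixing factors H are estimated)."""
    fixed_W = W is not None
    if W is None:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if not fixed_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# "Training": learn non-negative spectral bases for each speaker alone.
V1 = rng.random((257, 50))   # toy magnitude spectrogram, speaker 1
V2 = rng.random((257, 50))   # toy magnitude spectrogram, speaker 2
W1, _ = nmf(V1, rank=10)
W2, _ = nmf(V2, rank=10)

# "Evaluation": keep the concatenated bases fixed, estimate activations
# for the mixture, then reconstruct each speaker from its own bases.
Vmix = V1[:, :20] + V2[:, :20]
W = np.concatenate([W1, W2], axis=1)
_, H = nmf(Vmix, rank=20, W=W)
S1_hat = W1 @ H[:10]
S2_hat = W2 @ H[10:]
```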
11. Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): M̂_s(t, f) = |X_s(t, f)| / |Y(t, f)|
• Minimize mask estimation error (two problems)
• In silence segments X_s(t, f) = 0 and Y(t, f) = 0 → M̂_s(t, f) is not well defined
• A smaller error on the masks may not lead to a smaller error on the magnitudes (which
is what we care about)
• Minimize magnitude estimation error (used in this study)
• The magnitude is still estimated through masks: this often leads to better performance,
especially when the training set is small
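The two criteria above can be contrasted in a short sketch (toy magnitudes; the noisy mask `M_hat` stands in for a model's estimate). The mask loss compares the estimate against the IAM directly, while the magnitude loss treats the mask as an intermediate and penalizes the reconstructed magnitude:

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 100, 257

X_mag = np.abs(rng.standard_normal((T, F)))          # |X_s(t, f)|, target
X_mag[:10] = 0.0                                     # a silent source segment
Y_mag = X_mag + np.abs(rng.standard_normal((T, F)))  # toy mixture, |Y| >= |X_s|

# Ideal amplitude mask; it would be undefined wherever Y == 0
# (here Y > 0 everywhere, so no special handling is needed).
IAM = X_mag / Y_mag

# Toy model estimate: the IAM plus noise, clipped to stay non-negative.
M_hat = np.clip(IAM + 0.05 * rng.standard_normal((T, F)), 0, None)

# Criterion 1: mask estimation error (requires the IAM as a target).
mask_loss = np.mean((M_hat - IAM) ** 2)

# Criterion 2: magnitude estimation error (used in this study): the loss
# is taken on the masked mixture magnitude against the source magnitude.
mag_loss = np.mean((M_hat * Y_mag - X_mag) ** 2)
```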
21. Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
• Derived from WSJ0 corpus
• 2- and 3-speaker mixtures (artificially generated)
• 30h training set, 10h validation set, 5h test set
• Mixed at SIRs between 0 dB and 5 dB.
• Danish-2mix and 3-mix
• Derived from a Danish corpus
• 2- or 3-speaker mixtures (artificially generated)
• 10k, 1k, 1k+1k utterances in training, validation, and test sets
• Mixed at 0dB
• WSJ0-2mix-other
• Same as WSJ0-2mix but mixed at 0dB
22. Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT features; Output: 257 × S streams
• Segment-based (PIT-S): Each segment is independent, no tracing
• DNN: 3 hidden layers each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer
to belong to the same speaker
• LSTM: 3 LSTM layers each with 1792 units
• BLSTM: 3 BLSTM layers each with 896 units
• Test Conditions
• Closed condition (CC): seen speakers
• Open condition (OC): unseen speakers
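The PIT-S DNN shape described above (257-dim input, 3 hidden ReLU layers of 1024 units, S × 257 output streams) can be sketched as a NumPy forward pass. The random weights and the sigmoid output nonlinearity are stand-in assumptions; the slide only specifies the layer sizes and activations of the hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D_in, H = 2, 257, 1024   # speakers, STFT dim, hidden width

def layer(d_in, d_out):
    # Random stand-in weights; a real model would train these.
    return rng.standard_normal((d_in, d_out)) * 0.01, np.zeros(d_out)

Ws = [layer(D_in, H), layer(H, H), layer(H, H), layer(H, S * D_in)]

def forward(x):
    """x: (T, 257) STFT features -> (S, T, 257) mask streams."""
    h = x
    for W, b in Ws[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = Ws[-1]
    out = 1.0 / (1.0 + np.exp(-(h @ W + b)))     # sigmoid masks in [0, 1]
    return out.reshape(x.shape[0], S, D_in).transpose(1, 0, 2)

masks = forward(rng.standard_normal((100, 257)))
print(masks.shape)  # (2, 100, 257)
```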
36. Conclusion
• PIT solves the label permutation problem
• PIT is effective for speech separation without knowing the number of
speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can easily be integrated and combined
with other techniques
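The permutation-invariant criterion at the heart of these conclusions is simple to state: for each utterance, take the loss under the best assignment of output streams to reference speakers. A minimal NumPy sketch (brute-force over permutations, which is practical for the 2- and 3-speaker settings used here):

```python
import itertools
import numpy as np

def pit_mse(est, ref):
    """Permutation-invariant MSE.
    est, ref: (S, T, F) arrays of estimated / reference magnitudes.
    Returns the loss under the best stream-to-speaker assignment,
    and the assignment itself."""
    S = est.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        loss = np.mean((est[list(perm)] - ref) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 50, 257))
est = ref[[1, 0]] + 0.01 * rng.standard_normal((2, 50, 257))  # swapped streams

loss, perm = pit_mse(est, ref)
print(perm)  # (1, 0): PIT recovers the swapped assignment
```

Because the loss is minimized over assignments, the model is never penalized for emitting the speakers in a different order than the labels, which is exactly the label permutation problem PIT solves.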
Three views of the problem:
• Classification view (supervised approach)
• Segmentation view (deep clustering)
• Separation view (PIT)
PIT is an important ingredient in the final solution to the cocktail party problem.