5. ASR in Real World Scenarios
[Figure: the source signal is degraded by reverberation from surface reflections, additive noise from other sound sources, and channel distortion.]
3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 5
7. Is Work on Speech Separation Needed?
• End-to-end ASR system sufficient?
• Current ASR techniques require a huge amount of training data covering various
conditions to train well
• Speech separation can be used as advanced front-end
• Speech separation criterion can be used as regularization to aid and speed up
training of ASR systems
• More applications than ASR
• Hearing aids
• Cochlear implants
• Noise reduction for mobile communication
• Audio information retrieval
• Using microphone array sufficient?
• A microphone array alone is not sufficient, e.g., when the sources come from the same direction
• Many recordings are still collected with single microphone
9. Problem Definition
• Source speech streams: x_s(t), s = 1, …, S
• Mixed speech: y(t) = Σ_s x_s(t)
• STFT domain: Y(t, f) = Σ_s X_s(t, f)
• Estimate a mask M̂_s(t, f) for each stream
• Reconstruct with the mask: X̂_s(t, f) = M̂_s(t, f) |Y(t, f)|
• Ill-posed problem (# constraints < # free parameters):
• There are an infinite number of possible X_s(t, f) combinations that lead to
the same Y(t, f)
•Solution:
• Learn from training set to look for hidden regularities (complicated soft
constraints)
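The mask-based pipeline defined above can be sketched with NumPy. This is a toy illustration: random complex arrays stand in for real STFTs, and the ideal amplitude mask is used in place of a learned model's output. The mixing and reconstruction steps follow the definitions on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, S = 100, 257, 2  # frames, frequency bins, number of speakers

# Complex STFTs of the individual source streams X_s(t, f).
X = rng.standard_normal((S, T, F)) + 1j * rng.standard_normal((S, T, F))

# Mixing is additive in the complex STFT domain: Y(t, f) = sum_s X_s(t, f).
Y = X.sum(axis=0)

# A separation model would estimate one mask per stream; here the ideal
# amplitude mask |X_s| / |Y| stands in for the model output.
eps = 1e-8
M = np.abs(X) / (np.abs(Y) + eps)

# Reconstruct each stream: magnitude from the mask, phase from the mixture.
X_hat = M * np.abs(Y) * np.exp(1j * np.angle(Y))

print(X_hat.shape)  # (2, 100, 257): S separated streams
```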
10. Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
• Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
• Learn a set of non-negative bases during training
• Estimate mixing factors during evaluation
• Model-based approaches such as factorial GMM-HMM
• Model the interaction between the target and competing speech signals and
their temporal dynamics
• Spatial filtering with a microphone array
• Beamforming: Extract target sound from a specific spatial direction
• Independent component analysis: Find a demixing matrix from multiple
mixtures of sound sources
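The NMF approach above can be sketched with standard multiplicative updates (a minimal illustration on random non-negative "spectrograms", not the exact recipe from any one paper): non-negative bases are learned per speaker during training, then held fixed at evaluation time while only the mixing factors are estimated for the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-9

def nmf(V, rank, iters=200, W=None):
    """Factor V ≈ W @ H with multiplicative updates (Frobenius loss).
    If W is given it is kept fixed (the evaluation phase: only the
    mixing factors H are estimated)."""
    fixed_W = W is not None
    if W is None:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if not fixed_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# "Training": learn non-negative spectral bases for each speaker alone.
V1 = rng.random((257, 50))   # toy magnitude spectrogram, speaker 1
V2 = rng.random((257, 50))   # toy magnitude spectrogram, speaker 2
W1, _ = nmf(V1, rank=10)
W2, _ = nmf(V2, rank=10)

# "Evaluation": keep the concatenated bases fixed, estimate activations
# for the mixture, then reconstruct each speaker from its own bases.
Vmix = V1[:, :20] + V2[:, :20]
W = np.concatenate([W1, W2], axis=1)
_, H = nmf(Vmix, rank=20, W=W)
S1_hat = W1 @ H[:10]
S2_hat = W2 @ H[10:]
```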
11. Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): M̂_s(t, f) = |X_s(t, f)| / |Y(t, f)|
• Minimize mask estimation error (two problems)
• In silence segments X_s(t, f) = 0 and Y(t, f) = 0 → M̂_s(t, f) is not well defined
• A smaller error on the masks may not lead to a smaller error on the magnitudes (which
is what we care about)
• Minimize magnitude estimation error (used in this study)
• The magnitude is still estimated through masks: this often leads to better performance,
especially when the training set is small
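The two criteria above can be contrasted in a short sketch (toy magnitudes; the noisy mask `M_hat` stands in for a model's estimate). The mask loss compares the estimate against the IAM directly, while the magnitude loss treats the mask as an intermediate and penalizes the reconstructed magnitude:

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 100, 257

X_mag = np.abs(rng.standard_normal((T, F)))          # |X_s(t, f)|, target
X_mag[:10] = 0.0                                     # a silent source segment
Y_mag = X_mag + np.abs(rng.standard_normal((T, F)))  # toy mixture, |Y| >= |X_s|

# Ideal amplitude mask; it would be undefined wherever Y == 0
# (here Y > 0 everywhere, so no special handling is needed).
IAM = X_mag / Y_mag

# Toy model estimate: the IAM plus noise, clipped to stay non-negative.
M_hat = np.clip(IAM + 0.05 * rng.standard_normal((T, F)), 0, None)

# Criterion 1: mask estimation error (requires the IAM as a target).
mask_loss = np.mean((M_hat - IAM) ** 2)

# Criterion 2: magnitude estimation error (used in this study): the loss
# is taken on the masked mixture magnitude against the source magnitude.
mag_loss = np.mean((M_hat * Y_mag - X_mag) ** 2)
```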
21. Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
• Derived from WSJ0 corpus
• 2- and 3-speaker mixtures (artificially generated)
• 30h training set, 10h validation set, 5h test set
• Mixed at SIRs between 0 dB and 5 dB.
• Danish-2mix and 3-mix
• Derived from a Danish corpus
• 2- or 3-speaker mixtures (artificially generated)
• 10k, 1k, 1k+1k utterances in training, validation, and test sets
• Mixed at 0dB
• WSJ0-2mix-other
• Same as WSJ0-2mix but mixed at 0dB
22. Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT features; Output: 257 × S streams
• Segment-based (PIT-S): Each segment is independent, no tracing
• DNN: 3 hidden layers each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer
to belong to the same speaker
• LSTM: 3 LSTM layers each with 1792 units
• BLSTM: 3 BLSTM layers each with 896 units
• Test Conditions
• Closed condition (CC): seen speakers
• Open condition (OC): unseen speakers
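The PIT-S DNN shape described above (257-dim input, 3 hidden ReLU layers of 1024 units, S × 257 output streams) can be sketched as a NumPy forward pass. The random weights and the sigmoid output nonlinearity are stand-in assumptions; the slide only specifies the layer sizes and activations of the hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D_in, H = 2, 257, 1024   # speakers, STFT dim, hidden width

def layer(d_in, d_out):
    # Random stand-in weights; a real model would train these.
    return rng.standard_normal((d_in, d_out)) * 0.01, np.zeros(d_out)

Ws = [layer(D_in, H), layer(H, H), layer(H, H), layer(H, S * D_in)]

def forward(x):
    """x: (T, 257) STFT features -> (S, T, 257) mask streams."""
    h = x
    for W, b in Ws[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = Ws[-1]
    out = 1.0 / (1.0 + np.exp(-(h @ W + b)))     # sigmoid masks in [0, 1]
    return out.reshape(x.shape[0], S, D_in).transpose(1, 0, 2)

masks = forward(rng.standard_normal((100, 257)))
print(masks.shape)  # (2, 100, 257)
```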
36. Conclusion
• PIT solves the label permutation problem
• PIT is effective for speech separation without knowing the number of
speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can easily be integrated and combined
with other techniques
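The permutation-invariant criterion at the heart of these conclusions is simple to state: for each utterance, take the loss under the best assignment of output streams to reference speakers. A minimal NumPy sketch (brute-force over permutations, which is practical for the 2- and 3-speaker settings used here):

```python
import itertools
import numpy as np

def pit_mse(est, ref):
    """Permutation-invariant MSE.
    est, ref: (S, T, F) arrays of estimated / reference magnitudes.
    Returns the loss under the best stream-to-speaker assignment,
    and the assignment itself."""
    S = est.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        loss = np.mean((est[list(perm)] - ref) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 50, 257))
est = ref[[1, 0]] + 0.01 * rng.standard_normal((2, 50, 257))  # swapped streams

loss, perm = pit_mse(est, ref)
print(perm)  # (1, 0): PIT recovers the swapped assignment
```

Because the loss is minimized over assignments, the model is never penalized for emitting the speakers in a different order than the labels, which is exactly the label permutation problem PIT solves.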
Three views of the problem:
• Classification view (supervised approach)
• Segmentation view (deep clustering)
• Separation view (PIT)
PIT is an important ingredient in the final solution to the cocktail party problem.