1) The document proposes a DNN-based method to solve the permutation problem in frequency-domain independent component analysis (FDICA) for audio source separation.
2) Conventional permutation solvers sometimes fail to correctly align the separated signal components across frequencies. The proposed method trains a DNN on simulated permutation data to learn how to align components.
3) In experiments separating reverberant speech mixtures, the proposed DNN-based method improved the signal-to-distortion ratio by about 8 dB, outperforming other techniques and approaching the upper limit of performance.
DNN-based permutation solver for frequency-domain independent component analysis in two-source mixture case
1. DNN-based permutation solver for frequency-domain independent component analysis in two-source mixture case
Shuhei Yamaji and Daichi Kitamura
National Institute of Technology, Kagawa College, Japan
12th Asia-Pacific Signal and Information Processing Association (APSIPA)
2. Introduction
• About audio source separation
• Applications of audio source separation
  - Speech recognition
  - Noise canceling
  - Voice command devices, etc.
[Figure: audio source separation splits a mixture of two overlapping utterances ("Hello…" and "Nice to meet you...") into the individual utterances]
3. Blind Source Separation
• Independent component analysis (ICA) [Comon, 1994]
  - Assumes independence between source signals
  - Estimates the demixing matrix without knowing the mixing matrix
• Actual audio mixing in a reverberant environment
  - Convolution with room impulse responses between sources and microphones
  - Extend ICA to the frequency domain
[Figure: source signals → mixture signals → estimated signals]
4. Frequency-Domain ICA
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
  - Apply ICA in each frequency bin
[Figure: spectrogram (frequency bin vs. time frame) with a separate ICA (ICA1, ICA2, ICA3, …) applied to each frequency bin; the frequency-wise demixing matrix is the inverse of the frequency-wise mixing matrix]
5. Permutation Problem in FDICA
• Permutation problem in frequency-domain ICA
  - The order of the separated signals differs from one frequency bin to another
  - Separated components must be aligned along the frequency axis
[Figure: Source 1 and Source 2 are observed as Observed 1 and Observed 2; FDICA over all frequency components yields non-aligned signals, which the permutation solver turns into Estimated signal 1 and Estimated signal 2]
6. Conventional Permutation Solvers
• Popular permutation solvers
  - Based on temporal structures
    • FDICA + correlation-based alignment between adjacent frequencies [Murata+, 2001]
  - Based on direction of arrival (DOA)
    • Frequency-domain ICA + DOA alignment [Saruwatari+, 2006]
  - Based on a relative correlation among frequencies
    • Independent vector analysis (IVA) [Hiroe, 2006], [Kim+, 2006]
  - Based on a low-rank modeling of each source
    • Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
[Figure: the frequency components of the non-aligned signals are sorted over time]
7. Motivation of Proposed Method
• Problems of conventional permutation solvers
  - The correlation-based method sometimes fails to align components
  - Even in IVA and ILRMA, the block permutation problem arises
• Proposed method: DNN-based permutation solver
  - The permutation problem can be simulated by shuffling the frequency components of source signals
  - Training data for the DNN are easy to produce
[Figure: a DNN converts non-aligned signals into separated signals]
8. Proposed method: DNN input and label
• Input and label
  - Extract two short-time activations, one at the reference frequency and one at another frequency, from the separated signal
  - The DNN predicts whether the permutation of the two input frequencies is correct (correct = 0, incorrect = 1)
[Figure: DNN inputs (reference and another frequency) for the correct permutation case and the incorrect permutation case]
10. Proposed method: DNN predictions in subband frequency bins
• Apply the DNN in a subband (local time-frequency area)
  - Subband: the reference (center) frequency ± several frequencies
• Take a majority decision along time frames
  - to determine the subband permutation vector
[Figure: input vectors and DNN outputs for each frequency in the subband (0: same sound source, 1: different sound source), which form the subband permutation vector]
11. Proposed method: construct a fullband permutation vector
• Alignment among subbands
  - When the subband slides along the frequency axis, the reference (center) frequency component changes
    • The meanings of the "0 (same)" and "1 (different)" labels are not shared among subbands
  - The orders of source components in all subbands must be aligned after the DNN prediction in all subbands
12. Proposed method: construct a fullband permutation vector
• Objective
  - Estimate the "fullband permutation vector" that maps the two sources to "0" and "1"
• Step 1
  - The subband permutation vector of the lowest-frequency subband is simply copied to the corresponding frequency bins of the fullband permutation vector
[Figure: the subband permutation vector (1, 1, 0, 1, 0) of the lowest subband is set in the fullband permutation vector; axes: time, frequency]
13. Proposed method: construct a fullband permutation vector
• Step 2
  - Slide the subband along the frequency axis
  - Obtain the subband permutation vector of the current subband and its binary complement vector
  - The similarity between the subband and fullband permutation vectors is measured by the mean squared error (MSE)
  - Store in memory whichever of the two vectors minimizes the MSE
  - Update the fullband permutation vector by taking a majority decision
[Figure: (1) similarity comparison between the fullband permutation vector and the subband vector or its complement, (2) set the selected vector, (3) majority decision; axes: time, frequency]
14. Proposed method: construct a fullband permutation vector
• Step 3
  - Iterate step 2 up to the highest-frequency subband
  - Replace the components based on the fullband permutation vector
  - Obtain permutation-aligned estimated signals
[Figure: a majority decision over the stored subband vectors gives the fullband permutation vector, which is used to replace the frequency components; axes: time, frequency]
15. Experimental conditions
Training speech signals:
  Dry sources: JVS corpus [Takamichi+, 2019] (Japanese speech)
  Mixture: dry sources convolved with RWCP impulse responses [Nakamura+, 2000]
  Permutation: apply FDICA and randomly shuffle the components
Test speech signals: speech signals obtained from the SiSEC2011 UND task [Araki+, 2012]
FFT length: 8192 (512 ms, Hamming window)
Shift length: 2048
Evaluation metric: average improvement of signal-to-distortion ratio (SDR)
[Figure: impulse responses used in the experiment, with their reverberation time]
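The analysis parameters in the table above imply a few further quantities. The short calculation below derives them; the sampling rate is not stated on the slide and is only inferred here from "8192 samples = 512 ms", so treat it as an assumption.

```python
# Values taken from the table of experimental conditions.
fft_length = 8192      # samples per analysis window
window_ms = 512        # window duration stated on the slide
shift_length = 2048    # hop size in samples

# Inferred quantities (not stated on the slide).
sampling_rate = fft_length * 1000 // window_ms    # samples per second
shift_ms = shift_length * 1000 / sampling_rate    # hop size in milliseconds
overlap_ratio = 1 - shift_length / fft_length     # overlap between windows
n_frequency_bins = fft_length // 2 + 1            # non-negative frequency bins

print(sampling_rate, shift_ms, overlap_ratio, n_frequency_bins)
```

Under this inference the sampling rate is 16 kHz, adjacent windows overlap by 75 %, and a permutation decision is needed for each of the 4097 frequency bins.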
16. Results
• Findings
  - The proposed method achieves an SDR improvement of about 8 dB
  - ILRMA achieves an SDR improvement of about 4 dB
  - The proposed method is close to the upper-limit performance
[Figure: bar chart of SDR improvement [dB] (axis from 0 to 12; higher is better) for FDICA with ideal permutation solver (IPS, reference score), ILRMA (2 bases), ILRMA (3 bases), ILRMA (4 bases), and FDICA with DNN-based permutation solver (proposed)]
17. Conclusion
• In this paper
  - We proposed a new DNN-based permutation solver for determined audio source separation using FDICA
  - An SDR improvement of about 8 dB was achieved in experiments with a highly reverberant speech mixture signal
• Future work
  - The proposed method suffers from a combinatorial explosion for three or more separated signals
Thank you for your attention!
Hello everyone, I'm Shuhei Yamaji from the National Institute of Technology, Kagawa College, Japan.
In this presentation, we talk about a DNN-based permutation solver for frequency-domain independent component analysis in the two-source mixture case.
This presentation deals with audio source separation, which is a technique to separate a mixture signal into individual audio sources.
This technology can be used in many audio applications, such as speech recognition, noise canceling, voice command devices, and so on.
A popular approach to audio source separation is independent component analysis, ICA for short.
ICA assumes independence between sources and estimates the demixing matrix W without knowing the mixing matrix A.
This is represented in this figure.
The source signals, s1 and s2, are mixed by A and then observed as x1 and x2.
If W is the inverse matrix of A, W can separate the sources in x as y1 and y2.
Of course, we don't know the mixing matrix A, so ICA estimates W using the statistical independence between sources.
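As a minimal numeric sketch of this relation, the following assumes a known 2x2 mixing matrix A (the values are made up for illustration) and shows that W equal to the inverse of A recovers the sources. In practice A is unknown and ICA has to estimate W from the observations alone; this example only illustrates what a perfect W would do.

```python
def invert_2x2(m):
    """Return the inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_2x2(m, signals):
    """Multiply a 2x2 matrix with a two-channel signal (list of two lists)."""
    ch1, ch2 = signals
    out1 = [m[0][0] * p + m[0][1] * q for p, q in zip(ch1, ch2)]
    out2 = [m[1][0] * p + m[1][1] * q for p, q in zip(ch1, ch2)]
    return [out1, out2]


# Hypothetical sources s1, s2 and mixing matrix A (illustrative values only).
s = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.5, -1.0, 2.0]]
A = [[1.0, 0.6], [0.4, 1.0]]

x = apply_2x2(A, s)   # observed mixtures x1, x2
W = invert_2x2(A)     # ideal demixing matrix (unknown in practice)
y = apply_2x2(W, x)   # separated signals y1, y2, equal to s up to rounding
```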
In an actual situation, audio signals are mixed with room reverberation as a convolutive mixture, and simple ICA cannot separate the sources in that situation.
To solve this problem, frequency-domain ICA, FDICA for short, was proposed.
This figure represents the mixture signals in the time-frequency domain, which are obtained by the short-time Fourier transform.
In FDICA, simple ICA is applied to each frequency bin, as in this figure.
Therefore, the demixing matrix W must be estimated in each frequency bin to achieve source separation.
However, since ICA cannot determine the order of the separated signals, the output components of FDICA are not aligned, as shown here, and we have to reorder the separated red and blue components along the frequency axis.
This is the so-called permutation problem.
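The ambiguity can be reproduced with a toy example. The sketch below uses made-up complex mixing matrices for three frequency bins and ideal demixing matrices; swapping the two rows of W in one bin, which is equally valid from the viewpoint of independence, swaps the outputs in that bin only. All numbers and names are illustrative assumptions.

```python
def inv2(m):
    """Inverse of a 2x2 (possibly complex) matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


# Hypothetical frequency-wise mixing matrices A[f] for three frequency bins.
A = [
    [[1.0 + 0.0j, 0.5 - 0.2j], [0.3 + 0.1j, 1.0 + 0.0j]],
    [[1.0 + 0.0j, 0.2 + 0.4j], [0.6 - 0.3j, 1.0 + 0.0j]],
    [[1.0 + 0.0j, -0.4 + 0.1j], [0.1 + 0.5j, 1.0 + 0.0j]],
]
# Toy source spectrogram S[f][t] = [s1, s2] with two time frames per bin.
S = [
    [[1 + 1j, 2 - 1j], [0.5j, -1 + 0j]],
    [[2 + 0j, 1j], [-1 - 1j, 3 + 0j]],
    [[0.5 + 0j, -2j], [1 - 2j, 1 + 1j]],
]

# Frequency-wise mixing: X[f][t] = A[f] S[f][t].
X = [[matvec(A[f], s_ft) for s_ft in S[f]] for f in range(len(A))]

# Ideal frequency-wise demixing matrices.
W = [inv2(A_f) for A_f in A]
# ICA cannot fix the row order of W[f]; emulate that ambiguity in bin 1.
W[1] = [W[1][1], W[1][0]]

# Frequency-wise demixing: Y[f][t] = W[f] X[f][t].
Y = [[matvec(W[f], x_ft) for x_ft in X[f]] for f in range(len(W))]
# Bins 0 and 2 come out as [s1, s2]; bin 1 comes out as [s2, s1].
```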
Thus, a permutation solver must be applied after FDICA as post-processing.
In this presentation, we aim to solve the permutation problem over all frequency bins using a new, data-driven approach.
A major approach to solving the permutation problem is based on the temporal structures of the separated components.
We can reorder the components based on the correlation values between adjacent frequencies.
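A simplified sketch of this idea is given below. It is not the exact algorithm of the cited work: it assumes two sources, uses the Pearson correlation of amplitude envelopes, and compares each bin only with the previously aligned bin. The function names and toy envelopes are hypothetical.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def align_adjacent(env1, env2):
    """Align two sets of per-frequency amplitude envelopes bin by bin.

    env1[f] and env2[f] are the time envelopes of the two separated signals
    in frequency bin f. Returns the aligned envelopes and a 0/1 swap flag
    for each bin (1 means the bin was swapped).
    """
    out1, out2, flags = [env1[0]], [env2[0]], [0]
    for f in range(1, len(env1)):
        # Total correlation with the previous aligned bin, with and without swapping.
        keep = pearson(out1[-1], env1[f]) + pearson(out2[-1], env2[f])
        swap = pearson(out1[-1], env2[f]) + pearson(out2[-1], env1[f])
        if swap > keep:
            out1.append(env2[f])
            out2.append(env1[f])
            flags.append(1)
        else:
            out1.append(env1[f])
            out2.append(env2[f])
            flags.append(0)
    return out1, out2, flags


# Toy envelopes: source a and source b are active at different times.
a = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
env1 = [a, [2 * v for v in b], a]   # bin 1 is permuted
env2 = [b, [2 * v for v in a], b]
aligned1, aligned2, flags = align_adjacent(env1, env2)
```

Because each decision depends only on the neighboring bin, one wrong decision in such a scheme can propagate to the bins that follow it.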
When the positions of the microphones are known, the directions of arrival of the sources can also be utilized for solving the permutation problem.
In recent years, algorithms that avoid the permutation problem have been proposed.
For example, both independent vector analysis, IVA, and independent low-rank matrix analysis, ILRMA, estimate the frequency-wise demixing matrices while avoiding the permutation problem.
ILRMA is a state-of-the-art algorithm for blind audio source separation.
OK, let's talk about our proposed method.
This slide explains our motivation.
The conventional correlation-based permutation solver sometimes fails to align components correctly.
Even in IVA or ILRMA, the components are sometimes misaligned in blocks, as in this figure, which is called the block permutation problem.
To achieve a stable and accurate permutation solver, we propose in this presentation a DNN-based permutation solver, for which the training data can easily be obtained.
This is because the permutation problem can be simulated by randomly shuffling the frequency components of source signals.
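The shuffling can be sketched as follows. The function name, the nested-list spectrogram layout, and the toy values are assumptions made for illustration; the returned flags are the ground truth from which training labels can be derived.

```python
import random


def shuffle_frequency_components(src1, src2, rng):
    """Randomly swap the two sources in each frequency bin.

    src1, src2: spectrograms as lists of frequency rows (each row is a list
    of time frames). Returns the shuffled pair and a 0/1 swap flag per bin.
    """
    out1, out2, flags = [], [], []
    for row1, row2 in zip(src1, src2):
        swap = rng.randint(0, 1)  # 1 means this bin is swapped
        flags.append(swap)
        out1.append(row2 if swap else row1)
        out2.append(row1 if swap else row2)
    return out1, out2, flags


rng = random.Random(0)  # fixed seed so the example is repeatable
src1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
src2 = [[-1, -2, -3], [-4, -5, -6], [-7, -8, -9], [-10, -11, -12]]
mixed1, mixed2, flags = shuffle_frequency_components(src1, src2, rng)
```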
In this slide, we explain the input vector for the proposed DNN model.
In our DNN model, we first extract two short-time activations, one at the reference frequency and one at another frequency, from the separated signal.
These activations are concatenated into a single vector like this and input to the DNN.
Then, the DNN predicts whether the permutation of the two input frequencies is correct, where "zero" means that the current permutation is correct and "one" means that it is inverted.
In the left-side figure, the reference frequency is red and blue, and the other frequency is also red and blue.
So, the current permutation is correct, and its label should be zero.
In the right-side figure, the reference frequency is red and blue, but the other frequency is blue and red.
Therefore, the current permutation is wrong, and its label should be one.
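One way to picture the construction of a training pair is sketched below. The ordering of the four concatenated segments and the segment length are assumptions, since the talk does not specify them; the label follows the rule above (0 when the reference and the other frequency share the same order, 1 otherwise).

```python
def make_example(sig1, sig2, flags, ref_bin, other_bin, start, length):
    """Build one (input vector, label) pair.

    sig1, sig2: activations of the two separated signals, indexed [bin][frame].
    flags: 0/1 swap flag per bin, known only for simulated training data.
    """
    stop = start + length
    vector = (
        sig1[ref_bin][start:stop] + sig2[ref_bin][start:stop]
        + sig1[other_bin][start:stop] + sig2[other_bin][start:stop]
    )
    label = int(flags[ref_bin] != flags[other_bin])
    return vector, label


# Toy activations of the "red" and "blue" sources in three bins, four frames.
red = [[0.9, 0.1, 0.8, 0.2], [0.7, 0.0, 0.9, 0.1], [0.8, 0.2, 0.7, 0.1]]
blue = [[0.1, 0.6, 0.0, 0.5], [0.2, 0.8, 0.1, 0.7], [0.0, 0.9, 0.1, 0.6]]
flags = [0, 1, 0]  # bin 1 is permuted
sig1 = [blue[f] if flags[f] else red[f] for f in range(3)]
sig2 = [red[f] if flags[f] else blue[f] for f in range(3)]

x_ok, y_ok = make_example(sig1, sig2, flags, ref_bin=0, other_bin=2, start=0, length=3)
x_ng, y_ng = make_example(sig1, sig2, flags, ref_bin=0, other_bin=1, start=0, length=3)
```

Here `y_ok` is 0 because bins 0 and 2 have the same order, and `y_ng` is 1 because bin 1 is inverted relative to bin 0.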
This figure depicts the architecture of the DNN used in the proposed permutation solver.
This DNN model has six fully connected hidden layers, and its structure is very simple.
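A forward pass through such a network can be sketched in plain Python. Only the number of hidden layers (six) comes from the talk; the layer width, the ReLU and sigmoid activations, the input size, and the random untrained weights are all assumptions for illustration.

```python
import math
import random


def init_mlp(sizes, rng):
    """Create a list of (weights, biases) for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers


def forward(layers, x):
    """ReLU hidden layers followed by a single sigmoid output unit."""
    for i, (w, b) in enumerate(layers):
        z = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i == len(layers) - 1:
            x = [1.0 / (1.0 + math.exp(-v)) for v in z]  # output layer
        else:
            x = [max(0.0, v) for v in z]                 # hidden layer
    return x[0]


rng = random.Random(0)
input_dim = 64        # hypothetical: 4 activations x 16 frames
hidden = [32] * 6     # six hidden layers; the width 32 is an assumption
layers = init_mlp([input_dim] + hidden + [1], rng)

# Output near 1 would mean "inverted", near 0 "correct" (weights are untrained here).
p_inverted = forward(layers, [rng.random() for _ in range(input_dim)])
```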
Hereafter, we consider the process in a short-time subband, where the subband consists of the reference frequency plus or minus several frequencies.
In the proposed method, we perform the DNN-based permutation prediction for all the combinations of the reference frequency and another frequency, where the reference frequency is fixed to the center of the subband.
In this figure, the reference frequency is fixed to f3.
The other frequency is chosen from f1 to f5, and all the combinations are input to the DNN like this.
Thus, we obtain these DNN outputs.
Since the correct permutation does not depend on time, we slide this short-time subband along the time axis and collect the DNN outputs, as in this figure.
Finally, we take a majority decision over the collected DNN outputs and obtain a subband permutation vector.
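The majority decision can be sketched as follows. The data layout (one list of binary DNN decisions per time segment) and the tie-breaking rule (ties give 0) are assumptions; the toy decisions are invented so that the result matches the example vector on the slides.

```python
def subband_permutation_vector(dnn_outputs):
    """Majority vote along time for each frequency in the subband.

    dnn_outputs[t][k] is the binary DNN decision (0 or 1) at time segment t
    for the k-th frequency of the subband.
    """
    n_segments = len(dnn_outputs)
    n_freqs = len(dnn_outputs[0])
    votes = [sum(segment[k] for segment in dnn_outputs) for k in range(n_freqs)]
    return [1 if 2 * v > n_segments else 0 for v in votes]


# Decisions for f1..f5 (reference f3 in the center) over three time segments.
outputs = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]
subband_vector = subband_permutation_vector(outputs)
print(subband_vector)  # [1, 1, 0, 1, 0]
```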
After the estimation of the subband permutation vector, we slide the subband along the frequency axis, as in this figure.
However, since the reference frequency is always set to the center frequency of the subband, the meanings of the labels "zero" and "one" are not shared among subbands.
This is because the DNN outputs indicate only whether the components at the reference frequency and another frequency are the same or different.
For this reason, even if the subband components are aligned by the subband permutation vector, the order of the sources could differ among the subbands, as in this figure.
To solve this problem, it is necessary to unify the labels across all the subband vectors so that, for example, 0 indicates the red source and 1 indicates the blue source in all the subbands.
This label unification can be achieved by the following three steps.
The objective of the following steps is to estimate a fullband permutation vector, which maps the red and blue sources to "zero" and "one," respectively.
In the first step, as shown in this figure, the subband permutation vector of the lowest subband is simply copied to the corresponding frequency bins of the fullband permutation vector.
In step 2, we slide the subband from the previous one and obtain the subband permutation vector in that subband.
We also calculate the binary complement of the subband permutation vector, like this.
These two vectors are compared with the corresponding part of the fullband vector using the mean squared error; the vector that minimizes the error is then selected and stored in memory.
The fullband permutation vector is updated by taking a majority decision using the vectors stored in memory.
By repeating step 2, the complete fullband vector can be obtained.
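The three steps can be sketched as below. Several details are assumptions that the talk does not specify: the subband is taken to slide by one bin at a time, the error is computed only over bins that already hold a value, and ties in the majority decision give 0.

```python
def majority(votes):
    """Return 1 if more than half of the votes are 1, else 0 (ties give 0)."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def mse_to_fullband(candidate, fullband, start):
    """MSE between a candidate and the already-filled fullband entries."""
    pairs = [
        (c, fullband[start + k])
        for k, c in enumerate(candidate)
        if fullband[start + k] is not None
    ]
    return sum((c - f) ** 2 for c, f in pairs) / len(pairs)


def build_fullband_vector(subband_vectors, n_bins):
    """Merge subband permutation vectors into one fullband vector.

    subband_vectors[i] is assumed to cover bins i .. i + width - 1.
    """
    memory = [[] for _ in range(n_bins)]
    # Step 1: copy the lowest subband into the fullband vector.
    for k, v in enumerate(subband_vectors[0]):
        memory[k].append(v)
    fullband = [majority(m) if m else None for m in memory]
    # Steps 2 and 3: slide upward, pick the vector or its complement, vote.
    for start in range(1, len(subband_vectors)):
        vec = subband_vectors[start]
        comp = [1 - v for v in vec]
        err_vec = mse_to_fullband(vec, fullband, start)
        err_comp = mse_to_fullband(comp, fullband, start)
        best = vec if err_vec <= err_comp else comp
        for k, v in enumerate(best):
            memory[start + k].append(v)
        fullband = [majority(m) if m else None for m in memory]
    return fullband


# Three subbands of width 5 over 7 bins; each is relative to its own center bin.
subbands = [[1, 0, 0, 1, 0], [1, 1, 0, 1, 0], [0, 1, 0, 1, 1]]
fullband = build_fullband_vector(subbands, n_bins=7)
print(fullband)  # [1, 0, 0, 1, 0, 1, 1]
```

In this toy case the second subband disagrees with the fullband vector in every overlapping bin, so its complement is selected, whereas the third subband is used as it is.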
Finally, the permutation problem can be solved by replacing the frequency-wise source components based on the estimated fullband vector.
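This replacement step can be sketched as follows, assuming that an entry of 1 in the fullband vector means "swap the two separated signals in this bin". The string placeholders stand in for the red (r) and blue (b) components of each bin and are purely illustrative.

```python
def apply_permutation(sig1, sig2, fullband):
    """Swap the two separated signals in every bin whose entry is 1."""
    out1 = [row2 if p else row1 for row1, row2, p in zip(sig1, sig2, fullband)]
    out2 = [row1 if p else row2 for row1, row2, p in zip(sig1, sig2, fullband)]
    return out1, out2


# Non-aligned FDICA outputs: bins 1 and 3 are permuted.
sig1 = [["r0"], ["b1"], ["r2"], ["b3"]]
sig2 = [["b0"], ["r1"], ["b2"], ["r3"]]
fullband = [0, 1, 0, 1]

aligned1, aligned2 = apply_permutation(sig1, sig2, fullband)
print(aligned1)  # [['r0'], ['r1'], ['r2'], ['r3']]
print(aligned2)  # [['b0'], ['b1'], ['b2'], ['b3']]
```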
Let's move on to the experiments.
This table shows the conditions.
In this experiment, as the training dataset, we used the JVS corpus, which is a Japanese speech dataset, as dry sources and mixed them using impulse responses.
The permutation problem is simulated by randomly shuffling the frequency-wise components of the sources.
The test speech dataset is obtained from the SiSEC UND task.
The bottom figure shows the impulse responses used in this experiment, where the reverberation time is 470 ms.
Here is the result of the experiment.
The vertical axis shows the average SDR improvement, which indicates the accuracy of the source separation.
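To make the quantity concrete, the sketch below computes an SDR improvement with a deliberately simplified definition: reference energy divided by error energy, in decibels. This is an assumption for illustration and not necessarily the definition used in the experiment, where a standard evaluation toolkit may decompose the error differently. The signals are toy values.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR: reference energy over error energy, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)


reference = [1.0, 0.0, -1.0, 0.0]   # clean source
mixture = [1.5, 0.5, -0.5, 0.5]     # observed signal before separation
separated = [1.1, 0.1, -0.9, 0.1]   # estimate after separation

sdr_in = sdr_db(reference, mixture)
sdr_out = sdr_db(reference, separated)
improvement = sdr_out - sdr_in      # SDR improvement in dB
```

With these toy values the input SDR is about 3 dB and the output SDR is about 17 dB, so the improvement is about 14 dB.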
The leftmost one is FDICA with an ideal permutation solver; namely, the permutation is perfectly solved by using the completely separated source signals.
So, this is an upper-bound score for the FDICA-based methods.
ILRMA is the state-of-the-art blind source separation method.
Since the reverberation time is long in this experiment, the performance of ILRMA is not so high.
The rightmost one is our proposed method, where the DNN-based permutation solver is applied after FDICA.
The proposed method achieves an improvement of about 8 dB in SDR, which is close to the upper limit.
This is the conclusion.
Thank you for your attention.