Audio Engineering Society
Convention Paper
Presented at the 120th Convention
2006 May 20–23 Paris, France
This convention paper has been reproduced from the author’s advance manuscript, without editing, corrections, or
consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be
obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd
Street, New York, New York
10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is
not permitted without direct permission from the Journal of the Audio Engineering Society.
Demixing Commercial Music Productions via
Human-Assisted Time-Frequency Masking
MarC Vinyes^1, Jordi Bonada^1, Alex Loscos^1
^1 Pompeu Fabra University, Audiovisual Institute, Music Technology Group, Barcelona, 08003, Spain
Correspondence should be addressed to MarC (mvinyes@iua.upf.edu)
ABSTRACT
Audio Blind Separation in real commercial music recordings is still an open problem. In the last few years
some techniques have provided interesting results. This article presents a human-assisted selection of the
DFT coefficients for the Time-Frequency Masking demixing technique. The DFT coefficients are grouped by
adjacent pan, inter-channel phase difference, magnitude and magnitude-variance with a real-time interactive
graphical interface. Results prove an implementation of such technique can be used to demix tracks from
nowadays commercial songs.
Sample sounds can be found at http://www.iua.upf.es/~mvinyes/abs/demos.
1. INTRODUCTION
1.1. What do we mean by Audio Blind Separation?
We understand ABS (Audio Blind Separation) as
extracting from an input audio signal, without addi-
tional information, a set of audio signals whose mix
is perceived similarly^1 to the original audio signal^2.
^1 We assume that when comparing a pair of sounds where one is a moderately equalized and compressed version of the other, their perceptual similarity will be very high.
^2 Note that we don't stick to the Audio Blind Source Separation definition, in which the original signal is to be exactly equal to the sum of the extracted signals. Instead we suggest a perceptual interpretation of this equality, which is more general.
This problem has infinite solutions; however, a human being would only find a reduced set of those solutions meaningful. These are the ones that we
would like to extract.
1.2. Demixing tracks
Commercial music is often produced using a set of
recorded mono or stereo audio tracks which are af-
terwards mixed instantaneously. Using an analog
mixer or a digital audio workstation, audio tracks are
usually processed separately in groups with specific
pan, equalization, reverb and other digital or ana-
log effects. With this procedure, the sound engineer
often seeks to favor their perception as different auditory streams (see [1] for a better understanding of this term). For this reason, when we listen to their mix, we perceive these audio tracks separately most of the time and, consequently, we find them meaningful.
On the other hand, if we were able to extract audio
signals that are perceived similarly to these audio
tracks, they would be a solution of the ABS problem
because their remix would also be perceived similarly
to the original mixture. Therefore, in this article,
our solution of the ABS problem pursues the extrac-
tion of audio signals which are perceived similarly to
the audio tracks used to produce the mix.
1.3. Notation
We are considering stereo songs as inputs. We use L
to label variables related to the left channel and R
for the right channel. We will refer to the two channels of the mixture as $out^L[k]$, $out^R[k]$; the stereo tracks whose instantaneous mix produces the mixture will be labeled $in_i^L[k]$, $in_i^R[k]$; and $s_i^L[k]$, $s_i^R[k]$ will denote the extracted stereo sounds. Hence,

$$\begin{pmatrix} out^L[k] \\ out^R[k] \end{pmatrix} = \begin{pmatrix} \sum_i in_i^L[k] \\ \sum_i in_i^R[k] \end{pmatrix}$$

And we pursue,

$$\begin{pmatrix} s_i^L[k] \\ s_i^R[k] \end{pmatrix} \approx \begin{pmatrix} in_i^L[k] \\ in_i^R[k] \end{pmatrix}$$
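For concreteness, the instantaneous mixing model can be written as a short numpy sketch (our illustration, with hypothetical random tracks standing in for real recordings):

```python
import numpy as np

# Hypothetical data: n stereo tracks in_i^L[k], in_i^R[k] of equal length.
rng = np.random.default_rng(0)
n_tracks, n_samples = 3, 44100
in_L = rng.standard_normal((n_tracks, n_samples))
in_R = rng.standard_normal((n_tracks, n_samples))

# Instantaneous mix: each output channel is the sample-wise sum of its tracks.
out_L = in_L.sum(axis=0)   # out^L[k] = sum_i in_i^L[k]
out_R = in_R.sum(axis=0)   # out^R[k] = sum_i in_i^R[k]
```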
1.4. Algorithm overview
Our algorithm uses TFM (Time-Frequency Mask-
ing) as a mechanism to generate candidate solutions
of the ABS problem from the input data (section
2). Some criteria are developed to choose only the ones that are truly meaningful solutions of the ABS
problem. As we discussed previously, signals that
are likely to match the tracks used in the produc-
tion of the song will often be meaningful solutions
of the ABS problem, so a set of mathematical char-
acterizations of the sound of a track are presented in
section 3 and used in section 4 to choose between the
candidate solutions generated by TFM. Finally two
additional selection procedures not based on these
characterizations are presented.
2. GENERATION OF CANDIDATE SOLUTIONS OF THE ABS PROBLEM
We generate candidate solutions of the ABS problem
as follows:
1. First we take a set of P overlapped time frames
of size N from the mixture:
$$\begin{matrix}
out^L[0] & \cdots & out^L[N-1] \\
out^R[0] & \cdots & out^R[N-1] \\
& \vdots & \\
out^L[(P-1)\cdot M] & \cdots & out^L[(P-1)\cdot M+N-1] \\
out^R[(P-1)\cdot M] & \cdots & out^R[(P-1)\cdot M+N-1]
\end{matrix}$$
Frames will have an overlap of (N−M) samples.
2. Each frame is windowed in order to avoid spectral leakage, and the DFT (Discrete Fourier Transform) is computed. Because the input signal is real, the DFT frame has Hermitian symmetry; therefore all the information is stored in the first N/2+1 coefficients. We will refer to their values as:
$$\begin{matrix}
DFT_0(out^L)[0] & \cdots & DFT_0(out^L)[N/2] \\
DFT_0(out^R)[0] & \cdots & DFT_0(out^R)[N/2] \\
& \vdots & \\
DFT_{P-1}(out^L)[0] & \cdots & DFT_{P-1}(out^L)[N/2] \\
DFT_{P-1}(out^R)[0] & \cdots & DFT_{P-1}(out^R)[N/2]
\end{matrix}$$
3. Next, the DFT frames of $s_i^R[k]$, $s_i^L[k]$ are built by keeping some of the obtained DFT coefficients and setting the others to 0. In other words, we apply a binary mask to the DFT coefficients. Let $p \in 0..P-1$,

$$DFT_p(s_i^L)[f] = \begin{cases} 0 \\ DFT_p(out^L)[f] \end{cases} \qquad DFT_p(s_i^R)[f] = \begin{cases} 0 \\ DFT_p(out^R)[f] \end{cases}$$
This is the step where different parameters can
be chosen to generate different sounds. For each
frame and each DFT coefficient, we can choose
whether to set it to zero or keep its value. Con-
sequently, a huge family of candidate solutions
can be generated: up to $2^{(N/2+1)\cdot P}$ different sounds.
4. The IDFT (Inverse Discrete Fourier Transform) of these frames is performed after filling their coefficients from N/2 + 1 to N − 1 with values that force the Hermitian symmetry (the IDFT output must be a real signal). Next, we multiply it by the inverse of the window values used before computing the DFT.
5. Finally we overlap and add those frames to obtain $s_i^R[k]$, $s_i^L[k]$. The overlap is performed using a triangular window placed around the center and padded with zeros when M < N/2. Two examples are displayed in figures 1(a), 1(b); a code sketch of the whole procedure follows the figure.
Fig. 1: Overlap and add with a triangular window with different values of M. (a) M=N/2, (b) M=N/4.
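As a minimal single-channel sketch of steps 1-5, assuming numpy, a Hann analysis window, and a hypothetical mask function (the paper's interface-driven masks would replace `keep_strong`):

```python
import numpy as np

def tfm_channel(x, mask_fn, N=4096, M=1024):
    """Steps 1-5 for one channel: frame, window, DFT, mask, IDFT, overlap-add."""
    analysis = np.hanning(N)
    # Triangular synthesis window centered in the frame, zero-padded for M < N/2.
    pad = (N - 2 * M) // 2
    synthesis = np.concatenate([np.zeros(pad), np.bartlett(2 * M), np.zeros(pad)])
    y = np.zeros(len(x))
    for start in range(0, len(x) - N + 1, M):
        frame = x[start:start + N] * analysis              # step 2: window
        spec = np.fft.rfft(frame)                          # first N/2+1 coefficients
        spec = np.where(mask_fn(spec), spec, 0.0)          # step 3: binary mask
        frame = np.fft.irfft(spec, N)                      # step 4: Hermitian IDFT
        frame /= np.where(analysis > 1e-3, analysis, 1.0)  # undo the analysis window
        y[start:start + N] += frame * synthesis            # step 5: overlap-add
    return y

# Example mask (hypothetical): keep only the strongest half of the bins per frame.
keep_strong = lambda spec: np.abs(spec) >= np.median(np.abs(spec))
```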
Some of the articles ([2]) that introduced the idea of
applying a binary mask to the DFT coefficients re-
ferred to the derived processing mechanism as Time-
Frequency Masking. We will continue to use this
term.
We don’t know how many meaningful solutions of
the ABS problem are included in the space of can-
didate solutions produced by TFM. Moreover, this
could vary among different input data. However
some experiments reveal that for many real com-
mercial music productions at least some meaningful
solutions are included. In this article we present
some examples of songs where our algorithm finds
meaningful solutions (see section 5). Additionally,
we have built a web site ([3]) whose forum collects re-
ports of successful audio blind separation using this
technique. For speech signals, ([2]) shows that the
mixed voices are usually recovered.
In fact, if the original tracks had non-overlapping nonzero DFT coefficients, we would be able to generate them perfectly with TFM of the mixture (assuming loss-less DFT-IDFT). Unfortunately, most music sounds do not satisfy this hypothesis. However, it seems that a track is perceived similarly when some of its DFT coefficients are replaced by the corresponding DFT coefficients of the mix, where more than one track may overlap.
explained by the experience that moderate equaliza-
tion and compression (that may occur in frequency
bands where two tracks overlap) don’t alter signifi-
cantly our perception of a sound.
On the other hand, we can think of cases where
tracks will be hard to extract. In particular, when
two tracks contain two performances of the same in-
strument playing the same music (often some vo-
cals and some guitars are recorded twice), then both
tracks will undoubtedly overlap in frequency. How-
ever, in such cases, we are not usually able to per-
ceptually distinguish both tracks either, so we will
only pursue the extraction of their mixture.
3. MATHEMATICAL CHARACTERIZATIONS OF THE SOUND OF A TRACK
3.1. Pan
The fact that some mono^3 tracks are mirrored in the two stereo channels when they are panned in the mixing process is helpful to decide whether a sound may correspond to a track or not.
Let $in_i[k]$ be the original mono tracks of one mixture; mixers mirror them in two channels as follows:

$$\begin{pmatrix} in_i^L[k] \\ in_i^R[k] \end{pmatrix} = \begin{pmatrix} \alpha_i^L \cdot in_i[k] \\ \alpha_i^R \cdot in_i[k] \end{pmatrix}$$

Therefore, the mixture can be obtained with the following expression,

$$\begin{pmatrix} out^L[k] \\ out^R[k] \end{pmatrix} = \begin{pmatrix} \alpha_1^L & \dots & \alpha_n^L \\ \alpha_1^R & \dots & \alpha_n^R \end{pmatrix} \cdot \begin{pmatrix} in_1[k] \\ \vdots \\ in_n[k] \end{pmatrix}$$
^3 Note that when a stereo reverberation effect is applied to one mono track, the output of the process is stereo, so these tracks are not considered here.
Additionally, we found that the majority of analog and digital mixers use the following pan law, where $x \in [0,1]$ is the value of the analog or digital pan knob:

$$\begin{cases} \alpha_i^L = \cos(x \cdot \pi/2) = \sqrt{\dfrac{1}{1+(\alpha_i^R/\alpha_i^L)^2}} \\[1ex] \alpha_i^R = \sin(x \cdot \pi/2) = \sqrt{\dfrac{(\alpha_i^R/\alpha_i^L)^2}{1+(\alpha_i^R/\alpha_i^L)^2}} \\[1ex] x = \arctan\left(\dfrac{\alpha_i^R}{\alpha_i^L}\right) \cdot 2/\pi \end{cases} \qquad (1)$$
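A sketch of how the pan law can be used in code, assuming numpy (`estimated_pan` is our hypothetical helper name, not from the paper):

```python
import numpy as np

def estimated_pan(dft_L, dft_R, eps=1e-12):
    """Map the per-bin ratio alpha_R/alpha_L to the pan-knob value x in [0, 1],
    using the last line of (1): x = arctan(alpha_R/alpha_L) * 2/pi."""
    ratio = np.abs(dft_R) / (np.abs(dft_L) + eps)   # ratio in [0, inf)
    return np.arctan(ratio) * 2.0 / np.pi           # bounded pan estimate
```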
Hence if the extracted sound $s_i^L[k]$, $s_i^R[k]$ is one of the original stereo tracks, it will verify (assuming $in_i[k] \neq 0$):

$$\frac{s_i^R[k]}{s_i^L[k]} = \frac{in_i^R[k]}{in_i^L[k]} = \frac{\alpha_i^R}{\alpha_i^L} = \text{constant}$$
Moreover, the DFT coefficients of both channels will still verify this expression because the DFT is a linear transformation. Consequently, our requirement for the extracted stereo sounds $s_i^L[k]$, $s_i^R[k]$ may be extended as follows to their DFT coefficients:

$$\frac{DFT_p(s_i^R)[f]}{DFT_p(s_i^L)[f]} = \text{constant} \quad \forall f \in 0 \dots N/2 \qquad (2)$$

if $DFT_p(s_i^R)[f] \neq 0$ or $DFT_p(s_i^L)[f] \neq 0$.
It is worthwhile to mention that this mechanism is particularly good because it allows the discrimination between mono tracks that have been panned with different $\alpha_i^R/\alpha_i^L$ ratios. On the other side of the coin, there are many kinds of tracks for which this characterization is not valid:
• Stereo tracks
• Mono tracks with artificial stereo reverberation
• Mono tracks with “automated” pan
3.2. Inter-channel Phase Difference
Another consequence of mirroring mono tracks in
the two stereo channels is that their DFT phase spec-
trum will be the same for both channels.
Hence if the extracted sounds are the original tracks,
they must verify:
$$|\mathrm{Arg}(DFT_p(s_i^L)[f]) - \mathrm{Arg}(DFT_p(s_i^R)[f])| = 0 \quad \forall f \in 0 \dots N/2$$
And mono tracks with artificial stereo reverberation
or stereo tracks may be generally characterized by
the opposite case:
$$|\mathrm{Arg}(DFT_p(s_i^L)[f]) - \mathrm{Arg}(DFT_p(s_i^R)[f])| > 0 \quad \forall f \in 0 \dots N/2$$
In this case, it may happen that the track not only
contains DFT coefficients with different phase but
also some with the same phase; in that case some
kind of dereverberation is performed.
IPD (Inter-channel Phase Difference) is good because all mono tracks are well characterized (using either of the two complementary formulas); however, it will only allow us to distinguish between mono non-reverberated and mono artificially-reverberated/stereo tracks.
On the other hand, the zero phase difference can be
used as a prerequisite of the pan characterization
because the latter presupposes that the same sound
is mirrored in both channels.
4. SOLUTION SELECTION
In this section we will describe how to select the
DFT coefficients that aren’t set to zero in step 3
of the candidate solutions generation process. This
mechanism should allow us to output one sound that
is a meaningful solution of the ABS problem out of
all the possible sounds generated with TFM.
We designed it to be built into a real-time application which could receive feedback from the user. That is why we named it "Human-Assisted TFM". In order to set the parameters of our algorithm, we use an interface similar to the one presented by the authors of [4]. The user is able to set a few parameters that determine which DFT coefficients are masked, in order to obtain sounds that are perceived similarly to the original audio tracks of the song.
The selection process is performed by applying independent processing layers that we will call "Time-Frequency Filters". In each one, a different Time-Frequency binary mask is set following a different approach. The overall algorithm applies a Time-Frequency mask that is the union of the previous masks in each frame.
First, two TFFs (Time-Frequency Filters) based on the mathematical characterizations of the sound of a track are presented. Then, we suggest two auxiliary
post-processing TFFs that may help to discriminate
between some specific sounds.
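As a sketch of how the layering can be combined, assuming numpy and treating each mask as the set of coefficients a layer zeroes, so that the union of the zeroed sets corresponds to AND-ing the per-layer keep masks:

```python
import numpy as np

def apply_tff_layers(spec, layers):
    """spec: complex DFT frame; layers: functions returning a boolean keep mask."""
    keep = np.ones(spec.shape, dtype=bool)
    for layer in layers:
        keep &= layer(spec)          # a bin survives only if every layer keeps it
    return np.where(keep, spec, 0.0)
```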
4.1. Pan TFF
It is obvious that when two tracks overlap in the
same DFT coefficient of frequency f, we can’t expect
the ratio:
$$\frac{DFT_p(out^R)[f]}{DFT_p(out^L)[f]} \qquad (3)$$
to be any of the ratios of the two tracks. Moreover, the overlapping coefficients may change in time. Hence, the DFT coefficients can't be assigned to different tracks directly using expression (2).
However, when dealing with non-reverberated speech signals, such overlap is not very significant, and the values of (3) only vary slightly around a single value. Therefore it might make sense to define a maximum likelihood estimator of the value of (3) for each track in order to select them. Although that was the approach of [2] with speech signals, in commercial music production signals the overlap is much more significant and, at the moment, some researchers select ranges of values where (3) may vary, without there being a clear peak. [5] defines a Gaussian window, and the authors of [4] manually select a range of pan.
Our Pan TFF is based on the manual selection approach of [4], which is improved by adding a mapping of the values of (3) to their corresponding estimated pan (we use the expressions set in (1)). In this way their values are displayed with the same mapping the sound engineer used to define the pan. Additionally, the resulting interval [0, 1] of possible values is bounded, while expression (3) took values in [0, ∞). Next, we assist the user of our application with a visual representation of the energy of the DFT coefficients at each pan (see subsection 4.4). The range of pans that the DFT coefficients should have is then selected. If a DFT coefficient has an estimated pan outside this range, it is set to 0. Otherwise it is kept as it is.
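A minimal sketch of the Pan TFF, assuming numpy and the `estimated_pan` helper sketched in section 3.1 (range bounds are user-chosen):

```python
import numpy as np

def pan_tff(dft_L, dft_R, pan_lo, pan_hi):
    """Keep only the bins whose estimated pan lies in the user-selected range."""
    pan = estimated_pan(dft_L, dft_R)            # per-bin pan in [0, 1]
    keep = (pan >= pan_lo) & (pan <= pan_hi)
    return np.where(keep, dft_L, 0.0), np.where(keep, dft_R, 0.0)
```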
4.2. Inter-channel Phase Difference TFF
For the same reasons stated above, it is not possible to clearly distinguish between exactly zero and non-zero IPD (Inter-channel Phase Difference), and the characterization of section 3.2 should be adapted to work with overlapping DFT coefficients. Therefore, we define a TFF where a threshold is set to delimit both situations. We assist the user of our application with a visual representation of the energy of the DFT coefficients at each possible IPD value (between −π and π), and the threshold is defined by the user looking at this graph.
This graph might also be used to decide whether the pan discrimination is going to work. If the DFT coefficients have the same phase (all the energy is accumulated around the zero IPD value), then it makes sense to suppose that they come from mono tracks mirrored in both channels with a constant ratio; otherwise some artificial reverberation may have been added and the previous criterion may be useless.
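A minimal sketch of the IPD TFF, assuming numpy; the threshold is the user-chosen value described above:

```python
import numpy as np

def ipd_tff(dft_L, dft_R, threshold, keep_mono=True):
    """Split bins by inter-channel phase difference around zero."""
    ipd = np.angle(dft_L * np.conj(dft_R))   # Arg(L) - Arg(R), wrapped to (-pi, pi]
    near_zero = np.abs(ipd) <= threshold     # mono-like bins
    keep = near_zero if keep_mono else ~near_zero
    return np.where(keep, dft_L, 0.0), np.where(keep, dft_R, 0.0)
```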
4.3. Magnitude and Magnitude-Variance TFFs
Those filters are not based on characterizations of
the sound of a track. Hence we believe that they
should be used to post-process the results obtained
with the previously introduced TFFs.
The Magnitude TFF first normalizes all the DFT coefficients' magnitudes using their maximum value in the current frame. A set of magnitudes between 0 and 1 is obtained. Next, ranges within those values are selected in a graph that represents their accumulated energy across multiple frames. We found this criterion useful to distinguish between percussive sounds with a large and flat frequency spectrum (e.g. snares and crashes) and harmonic instruments whose spectrum has peaks at the frequencies of their components (e.g. voice or guitars).
The Magnitude-Variance TFF computes, for each DFT coefficient of each channel, its magnitude variance along time. Then all variances are normalized by their maximum value and their energy is displayed in a graph. A range of these variances is also selected manually. This TFF is useful to discriminate between brief and steady attacks, or sounds with different magnitude variation along time. Note that these two procedures lead to Time-Frequency masks which may be different for each stereo channel. An application of these TFFs is included in example 3 of section 5.
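Minimal sketches of both filters, assuming numpy; the selected ranges come from the user interface:

```python
import numpy as np

def magnitude_tff(spec, lo, hi):
    """Keep bins whose frame-normalized magnitude lies in [lo, hi]."""
    mag = np.abs(spec)
    mag = mag / (mag.max() + 1e-12)              # normalize to [0, 1] per frame
    return np.where((mag >= lo) & (mag <= hi), spec, 0.0)

def magnitude_variance_tff(spec_block, lo, hi):
    """spec_block: (frames, bins) array for one channel; mask by magnitude variance."""
    var = np.var(np.abs(spec_block), axis=0)     # per-bin variance along time
    var = var / (var.max() + 1e-12)              # normalize by the maximum
    return np.where((var >= lo) & (var <= hi), spec_block, 0.0)
```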
4.4. Visual representations
All graphs are built in exactly the same way. For each DFT coefficient we compute one attribute (Pan, IPD, Magnitude or Magnitude-Variance). Then this attribute is mapped into a closed interval (generally [0,1], or [−π,π] for IPD). This interval is divided into a finite number of small bins, and for each bin we add the squared-modulus contributions of the DFT coefficients whose attribute value falls in it. Finally, we average the values among several frames.
The values of the graph are normalized by their max-
imum value along some frames in order to achieve
an optimum vertical resolution without rescaling the
graph too frequently. Two examples are displayed in
figures 2(a) and 2(b).
Fig. 2: Pan and IPD energy distribution graphs. (a) Pan, (b) IPD.
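The graph construction can be sketched as a weighted histogram, assuming numpy (`lo`, `hi` is the attribute interval, e.g. [0,1] for pan):

```python
import numpy as np

def attribute_energy_graph(spec_frames, attr_frames, lo, hi, n_bins=300):
    """Accumulate |DFT|^2 into bins of the attribute value, averaged over frames."""
    hist = np.zeros(n_bins)
    for spec, attr in zip(spec_frames, attr_frames):
        idx = ((attr - lo) / (hi - lo) * n_bins).astype(int)
        idx = np.clip(idx, 0, n_bins - 1)
        np.add.at(hist, idx, np.abs(spec) ** 2)  # add energy to each attribute bin
    hist /= len(spec_frames)                     # average among the frames
    return hist / (hist.max() + 1e-12)           # normalize the vertical scale
```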
5. EXPERIMENTS
5.1. Parameters
In our experiments we tested several values of N, and we found that values above 8192 didn't improve the perceived quality of the output sound. M is set to N/4 for better quality of the sound transitions. Finally, we selected the Blackman-Harris −92 dB window in order to minimize the side lobes, which make the DFT frequency coefficients merge and distort the estimation of their pan, IPD or magnitude.
The DFT is performed with the Fast Fourier Transform algorithm, and the graphs are computed with a horizontal resolution of 300 bins, an average over 20 frames, and normalization along 40 frames.
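Collected in code, assuming scipy's 4-term Blackman-Harris window stands in for the −92 dB variant the authors mention:

```python
import numpy as np
from scipy.signal import get_window

N = 8192                                  # frame size; larger N gave no audible gain
M = N // 4                                # hop size, for smoother transitions
window = get_window("blackmanharris", N)  # low side lobes reduce bin merging
N_BINS, AVG_FRAMES, NORM_FRAMES = 300, 20, 40   # graph resolution / averaging
```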
Next we will discuss some examples. Although some waveforms and spectrograms are drawn, the reader is encouraged to download and listen to their audio at http://www.iua.upf.es/~mvinyes/abs/demos, where an evaluation version of our algorithm implementation is also available.
5.2. Example 1
We first tested our algorithm over a self-produced
song whose tracks were available. The song was
produced with synthetic sampled drums in one
mono track and 3 stereo real guitar tracks (acous-
tic rhythm guitar, feedback-delayed acoustic gui-
tar pickings and electric distortion guitar). The
feedback-delayed acoustic guitar and the electric dis-
tortion guitar are panned to opposite sides and both
drums and the rhythm guitar are panned to the cen-
ter.
Using pan discrimination we were able to extract
the original tracks of the guitars panned at both
sides and one track consisting of the mixture of the
drums and the rhythm guitar (because they shared
the same pan). In figures 4(a) and 4(b) the orig-
inal waveforms and spectrograms of the feedback-
delayed acoustic guitar and its corresponding ex-
tracted sound are displayed together.
Note that the recovered track has a similar wave-
form although the amplitude varies. However when
we listen to them, we perceive them to be highly sim-
ilar. This example supports our decision of pursu-
ing tracks that are perceived similarly to the original
ones.
5.3. Example 2: Help (The Beatles)
Next we present an extraction of the vocals of the
popular song Help (The Beatles). In this example
Fig. 3: S03: Mix waveform and spectrogram
Fig. 4: S03: Original vs recovered guitar track. (a) Waveforms (original and recovered), (b) Spectrograms.
Fig. 5: Help: Mix and extracted vocals. (a) Spectrograms, (b) Waveforms (mix; vocals via panning discrimination; vocals via panning+IPD discrimination).
The IPD TFF filters out much of the residual drum noise that remains when the pan TFF is used. We guess that the drums are discriminated by IPD because they were recorded on stereo tracks, while the vocals were recorded using a mono mic.
5.4. Example 3: Memorial (Explosions in the Sky)
This is an example of snare and guitar extraction. In this song, only the guitars seem to be highly reverberated, so an IPD TFF helps to discriminate between them and the other instruments. We select a margin around the center (zero phase) to begin the extraction of the snare, and we use the complementary range to extract the guitars. Next, in both cases we use the Magnitude TFF. In order to isolate the
guitars we remove the DFT coefficients with small
magnitude and we do the opposite for the snare.
Finally the Magnitude-Variance TFF is applied to
reduce the noise produced by the guitar pickings in
our attempt to isolate the snare (those noises have
greater magnitude variance than the sound of the
snare).
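Using the TFF sketches from section 4, the snare chain just described might look like the following (our illustration; the threshold and range values are hypothetical):

```python
# Hypothetical per-frame chain for the snare (dft_L, dft_R: one frame's spectra).
dft_L, dft_R = ipd_tff(dft_L, dft_R, threshold=0.3)   # margin around zero phase
dft_L = magnitude_tff(dft_L, lo=0.0, hi=0.2)          # keep small magnitudes
dft_R = magnitude_tff(dft_R, lo=0.0, hi=0.2)
# The Magnitude-Variance TFF would then run over a block of such frames
# (see magnitude_variance_tff above) to suppress the guitar-picking noise.
```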
Fig. 6: Memorial: Extracted snare and guitars. (a) Snare spectrograms, (b) Guitar spectrograms.
6. CONCLUSIONS
Our work can be included in the group of techniques
that obtain the solution of the ABS problem ex-
tracting sets of the input Discrete Fourier Trans-
form (DFT) coefficients ([2],[4],[5],[6]). In particular
we design a human-assisted mechanism as in [4] to
select those sets.
A graphic interface is developed to select them us-
ing their inter-channel magnitude ratio and inter-
channel phase difference, with some post processing
involving the magnitude of the DFT coefficients and
their magnitude variance. The processing is divided into several independent layers, called TFFs, that set a Time-Frequency binary mask in multiple steps with different criteria. We claim that this is a more flexible way to select the Time-Frequency Mask than the ones developed in the previously cited articles.
Another contribution is to consider the inter-channel phase difference instead of the estimated time delay introduced in [2]. [7] points out ambiguities in the estimated time delay at frequencies above 900 Hz and suggests a method to resolve them in mixtures of acoustically stereo-recorded sound sources. However, in commercial music productions, IPD seems to make more sense and may help to distinguish between mono tracks and stereo tracks. Moreover, its energy distribution graph may be useful to determine whether the pan discrimination criterion is going to work, and to embed TFM with pan selection in automatic ABS systems.
We have shown that, in some cases, it is possible to
extract from real commercial music productions the
original tracks that were used in their mixing. It is
often the case that vocals and other music instru-
ments are recorded in different audio tracks. Con-
sequently, our algorithm can be useful in karaoke systems to remove the voice from some songs, and it can be applied as a DJ remixing tool or for isolating instruments in a mixture. In [3], visitors are encouraged to download our software and post successful audio separations in order to evaluate its real-world performance.
7. FUTURE WORK
This method trusts TFM as a successful way to ob-
tain all the meaningful solutions of the ABS prob-
lem. Future work may consist of evaluating experi-
mentally the limits of this technique independently
of the chosen TFFs.
Additionally we have only presented 4 TFFs (Pan,
IPD, Magnitude and Magnitude-Variance), so more
ways to set the TFM mask may be explored.
Finally, we have noticed that low frequencies have large magnitudes that significantly alter the graphs even if they are not really important to our ears. Perceptual weighting might help to make the graphs better represent what we really hear and ease the selection of meaningful sounds.
8. REFERENCES
[1] Albert S. Bregman. Auditory Scene Analysis. MIT Press, 1990.

[2] Özgür Yılmaz and Scott Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 2003.

[3] MarC Vinyes, Alex Loscos, and Jordi Bonada. http://www.iua.upf.edu/mtg/audioscanner.

[4] Dan Barry, Bob Lawlor, and Eugene Coyle. Sound source separation: Azimuth discrimination and resynthesis. Proc. of the 7th Int. Conference on Digital Audio Effects (DAFX 04), 2004.

[5] Carlos Avendano. Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2004.

[6] Michael A. Casey and Alex Westner. Separation of mixed audio sources by independent subspace analysis. International Computer Music Conference (ICMC), 2000.

[7] Harald Viste and Gianpaolo Evangelista. On the use of spatial cues to improve binaural source separation. Proc. of the 6th Int. Conference on Digital Audio Effects (DAFX-03), 2003.

Weitere ähnliche Inhalte

Was ist angesagt?

Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...
Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...
Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...ijsrd.com
 
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...François Rigaud
 
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...François Rigaud
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63Prateek Omer
 
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...CSCJournals
 
Filtering in frequency domain
Filtering in frequency domainFiltering in frequency domain
Filtering in frequency domainGowriLatha1
 
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...François Rigaud
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio filesMinh Anh Nguyen
 
Image Denoising Using Wavelet
Image Denoising Using WaveletImage Denoising Using Wavelet
Image Denoising Using WaveletAsim Qureshi
 
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...CSCJournals
 
Digital signal processing techniques for lti fiber
Digital signal processing techniques for lti fiberDigital signal processing techniques for lti fiber
Digital signal processing techniques for lti fibereSAT Publishing House
 
(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning
(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning
(2011) Rigaud, David, Daudet - A Parametric Model of Piano TuningFrançois Rigaud
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio filesMinh Anh Nguyen
 

Was ist angesagt? (18)

Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...
Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...
Reversible Digital Watermarking of Audio Wav Signal Using Additive Interpolat...
 
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...
(2012) Rigaud, David, Daudet - Piano Sound Analysis Using Non-negative Matrix...
 
Microphone arrays
Microphone arraysMicrophone arrays
Microphone arrays
 
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63
 
Lect5 v2
Lect5 v2Lect5 v2
Lect5 v2
 
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...
Noisy Speech Enhancement Using Soft Thresholding on Selected Intrinsic Mode F...
 
Filtering in frequency domain
Filtering in frequency domainFiltering in frequency domain
Filtering in frequency domain
 
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio files
 
Dcp project
Dcp projectDcp project
Dcp project
 
Image Denoising Using Wavelet
Image Denoising Using WaveletImage Denoising Using Wavelet
Image Denoising Using Wavelet
 
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...
Real-time DSP Implementation of Audio Crosstalk Cancellation using Mixed Unif...
 
Digital signal processing techniques for lti fiber
Digital signal processing techniques for lti fiberDigital signal processing techniques for lti fiber
Digital signal processing techniques for lti fiber
 
(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning
(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning
(2011) Rigaud, David, Daudet - A Parametric Model of Piano Tuning
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio files
 
H0814247
H0814247H0814247
H0814247
 

Andere mochten auch

Cylindrical rubber fender
Cylindrical rubber fenderCylindrical rubber fender
Cylindrical rubber fenderbeast0004
 
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)Auckland Quantified Self - Extreme Productivity Hacking (with EEG)
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)Jay Best
 
Cell fender
Cell fender Cell fender
Cell fender beast0004
 
Ръководство за употреба - Philips Lumea SC2008
Ръководство за употреба - Philips Lumea SC2008Ръководство за употреба - Philips Lumea SC2008
Ръководство за употреба - Philips Lumea SC2008Philips - България
 
My recharge power point june 2014
My recharge power point june 2014My recharge power point june 2014
My recharge power point june 2014Md Solanki
 
Leveling, Instruments of Leveling, Bearings (Surveying, ECE)
Leveling, Instruments of Leveling, Bearings (Surveying, ECE) Leveling, Instruments of Leveling, Bearings (Surveying, ECE)
Leveling, Instruments of Leveling, Bearings (Surveying, ECE) Kaushal Mehta
 
Ppt of properties of steam
Ppt of properties of steamPpt of properties of steam
Ppt of properties of steamKaushal Mehta
 
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...Kaushal Mehta
 

Andere mochten auch (11)

Cylindrical rubber fender
Cylindrical rubber fenderCylindrical rubber fender
Cylindrical rubber fender
 
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)Auckland Quantified Self - Extreme Productivity Hacking (with EEG)
Auckland Quantified Self - Extreme Productivity Hacking (with EEG)
 
Cell fender
Cell fender Cell fender
Cell fender
 
Rizalmmmmm
RizalmmmmmRizalmmmmm
Rizalmmmmm
 
Ръководство за употреба - Philips Lumea SC2008
Ръководство за употреба - Philips Lumea SC2008Ръководство за употреба - Philips Lumea SC2008
Ръководство за употреба - Philips Lumea SC2008
 
My recharge power point june 2014
My recharge power point june 2014My recharge power point june 2014
My recharge power point june 2014
 
Ct aggregates
Ct aggregatesCt aggregates
Ct aggregates
 
Array ppt
Array pptArray ppt
Array ppt
 
Leveling, Instruments of Leveling, Bearings (Surveying, ECE)
Leveling, Instruments of Leveling, Bearings (Surveying, ECE) Leveling, Instruments of Leveling, Bearings (Surveying, ECE)
Leveling, Instruments of Leveling, Bearings (Surveying, ECE)
 
Ppt of properties of steam
Ppt of properties of steamPpt of properties of steam
Ppt of properties of steam
 
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...
ICT, Basic of Computer, Hardware, Various parts of computer hardware, What is...
 

Ähnlich wie Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking

(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...François Rigaud
 
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...csandit
 
Paper on Modeling and Subtractive synthesis of virtual violin
Paper on Modeling and Subtractive synthesis of virtual violinPaper on Modeling and Subtractive synthesis of virtual violin
Paper on Modeling and Subtractive synthesis of virtual violinVinoth Punniyamoorthy
 
Audio Morphing for Percussive Sound Generation
Audio Morphing for Percussive Sound GenerationAudio Morphing for Percussive Sound Generation
Audio Morphing for Percussive Sound Generationa3labdsp
 
Audio Art Authentication and Classification with Wavelet Statistics
Audio Art Authentication and Classification with Wavelet StatisticsAudio Art Authentication and Classification with Wavelet Statistics
Audio Art Authentication and Classification with Wavelet StatisticsWaqas Tariq
 
Handling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesHandling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesMatthieu Hodgkinson
 
Project_report_BSS
Project_report_BSSProject_report_BSS
Project_report_BSSKamal Bhagat
 
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...CSCJournals
 
Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료Jeong Choi
 
A Novel Method for Silence Removal in Sounds Produced by Percussive Instruments
A Novel Method for Silence Removal in Sounds Produced by Percussive InstrumentsA Novel Method for Silence Removal in Sounds Produced by Percussive Instruments
A Novel Method for Silence Removal in Sounds Produced by Percussive InstrumentsIJMTST Journal
 
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral SubtractionA New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral SubtractionCSCJournals
 
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION optljjournal
 
An Advanced Implementation of a Digital Artificial Reverberator
An Advanced Implementation of a Digital Artificial ReverberatorAn Advanced Implementation of a Digital Artificial Reverberator
An Advanced Implementation of a Digital Artificial Reverberatora3labdsp
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Vibration source identification caused by bearing faults based on SVD-EMD-ICA
Vibration source identification caused by bearing faults based on SVD-EMD-ICAVibration source identification caused by bearing faults based on SVD-EMD-ICA
Vibration source identification caused by bearing faults based on SVD-EMD-ICAIJRES Journal
 
Sound Source Localization with microphone arrays
Sound Source Localization with microphone arraysSound Source Localization with microphone arrays
Sound Source Localization with microphone arraysRamin Anushiravani
 

Ähnlich wie Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking (20)

FK_icassp_2014
FK_icassp_2014FK_icassp_2014
FK_icassp_2014
 
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
 
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...
DETECTING AND LOCATING PLAGIARISM OF MUSIC MELODIES BY PATH EXPLORATION OVER ...
 
Paper on Modeling and Subtractive synthesis of virtual violin
Paper on Modeling and Subtractive synthesis of virtual violinPaper on Modeling and Subtractive synthesis of virtual violin
Paper on Modeling and Subtractive synthesis of virtual violin
 
Audio Morphing for Percussive Sound Generation
Audio Morphing for Percussive Sound GenerationAudio Morphing for Percussive Sound Generation
Audio Morphing for Percussive Sound Generation
 
Final presentation
Final presentationFinal presentation
Final presentation
 
Audio Art Authentication and Classification with Wavelet Statistics
Audio Art Authentication and Classification with Wavelet StatisticsAudio Art Authentication and Classification with Wavelet Statistics
Audio Art Authentication and Classification with Wavelet Statistics
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
Handling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesHandling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive Trajectories
 
Project_report_BSS
Project_report_BSSProject_report_BSS
Project_report_BSS
 
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료
 
A Novel Method for Silence Removal in Sounds Produced by Percussive Instruments
A Novel Method for Silence Removal in Sounds Produced by Percussive InstrumentsA Novel Method for Silence Removal in Sounds Produced by Percussive Instruments
A Novel Method for Silence Removal in Sounds Produced by Percussive Instruments
 
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral SubtractionA New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
 
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITION
 
An Advanced Implementation of a Digital Artificial Reverberator
An Advanced Implementation of a Digital Artificial ReverberatorAn Advanced Implementation of a Digital Artificial Reverberator
An Advanced Implementation of a Digital Artificial Reverberator
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Vibration source identification caused by bearing faults based on SVD-EMD-ICA
Vibration source identification caused by bearing faults based on SVD-EMD-ICAVibration source identification caused by bearing faults based on SVD-EMD-ICA
Vibration source identification caused by bearing faults based on SVD-EMD-ICA
 
Sound Source Localization with microphone arrays
Sound Source Localization with microphone arraysSound Source Localization with microphone arrays
Sound Source Localization with microphone arrays
 

Kürzlich hochgeladen

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking

  • 1. Audio Engineering Society Convention Paper Presented at the 120th Convention 2006 May 20–23 Paris, France This convention paper has been reproduced from the author’s advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society. Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking MarC Vinyes1 , Jordi Bonada1 , Alex Loscos1 1 Pompeu Fabra University, Audiovisual Institute, Music Technology Group, Barcelona, 08003, Spain Correspondence should be addressed to MarC (mvinyes@iua.upf.edu) ABSTRACT Audio Blind Separation in real commercial music recordings is still an open problem. In the last few years some techniques have provided interesting results. This article presents a human-assisted selection of the DFT coefficients for the Time-Frequency Masking demixing technique. The DFT coefficients are grouped by adjacent pan, inter-channel phase difference, magnitude and magnitude-variance with a real-time interactive graphical interface. Results prove an implementation of such technique can be used to demix tracks from nowadays commercial songs. Sample sounds can be found at http://www.iua.upf.es/~mvinyes/abs/demos. 1. INTRODUCTION 1.1. What do we mean by Audio Blind Separa- tion? We understand ABS (Audio Blind Separation) as extracting from an input audio signal, without addi- tional information, a set of audio signals whose mix is perceived similarly1 to the original audio signal2 . 1We assume that when comparing a pair of sounds where one is a moderately equalized and compressed version of the other, their perceptual similarity will be very high. 2Note that we don’t stick to the Audio Blind Source Sepa- ration definition, in which the original signal is to be exactly equal to the sum of the extracted signals. Instead we suggest a perceptual interpretation of this equality, which is more This problem has infinite solutions, however, a hu- man being would only find a reduced set of those solutions meaningful. These are the ones that we would like to extract. 1.2. Demixing tracks Commercial music is often produced using a set of recorded mono or stereo audio tracks which are af- terwards mixed instantaneously. Using an analog mixer or a digital audio workstation, audio tracks are usually processed separately in groups with specific pan, equalization, reverb and other digital or ana- log effects. With this procedure, the sound engineer general.
tion of audio signals which are perceived similarly to the audio tracks used to produce the mix.

1.3. Notation

We consider stereo songs as inputs. We use L to label variables related to the left channel and R for the right channel. We refer to the two channels of the mixture as $out^L[k]$, $out^R[k]$; the stereo tracks whose instantaneous mix produces the mixture are labeled $in_i^L[k]$, $in_i^R[k]$; and $s_i^L[k]$, $s_i^R[k]$ denote the extracted stereo sounds. Hence,

$$\begin{pmatrix} out^L[k] \\ out^R[k] \end{pmatrix} = \begin{pmatrix} \sum_i in_i^L[k] \\ \sum_i in_i^R[k] \end{pmatrix}$$

and we pursue

$$\begin{pmatrix} s_i^L[k] \\ s_i^R[k] \end{pmatrix} \approx \begin{pmatrix} in_i^L[k] \\ in_i^R[k] \end{pmatrix}$$

1.4. Algorithm overview

Our algorithm uses TFM (Time-Frequency Masking) as a mechanism to generate candidate solutions of the ABS problem from the input data (section 2). Some criteria are then developed to choose only the candidates that are genuinely meaningful solutions of the ABS problem. As we discussed previously, signals that are likely to match the tracks used in the production of the song will often be meaningful solutions of the ABS problem, so a set of mathematical characterizations of the sound of a track is presented in section 3 and used in section 4 to choose between the candidate solutions generated by TFM. Finally, two additional selection procedures not based on these characterizations are presented.

2. GENERATION OF CANDIDATE SOLUTIONS OF THE ABS PROBLEM

We generate candidate solutions of the ABS problem as follows (a code sketch of the complete pipeline closes this section):

1. First we take a set of P overlapped time frames of size N from the mixture:

$$\begin{pmatrix} out^L[0] & \cdots & out^L[N-1] \\ out^R[0] & \cdots & out^R[N-1] \end{pmatrix}, \;\dots,\; \begin{pmatrix} out^L[(P-1)M] & \cdots & out^L[(P-1)M+N-1] \\ out^R[(P-1)M] & \cdots & out^R[(P-1)M+N-1] \end{pmatrix}$$

Consecutive frames overlap by (N − M) samples.

2. Each frame is windowed in order to avoid spectral leakage, and its DFT (Discrete Fourier Transform) is computed. Because the input signal is real, each DFT frame has hermitian symmetry, so all the information is stored in the first N/2 + 1 coefficients. We refer to their values as $DFT_p(out^L)[f]$ and $DFT_p(out^R)[f]$, with $p \in 0 \dots P-1$ and $f \in 0 \dots N/2$.

3. Next, the DFT frames of $s_i^L[k]$, $s_i^R[k]$ are built by keeping some of the obtained DFT coefficients and setting the others to 0. In other words, we apply a binary mask to the DFT coefficients: for $p \in 0 \dots P-1$,

$$DFT_p(s_i^L)[f] \in \left\{\, 0,\; DFT_p(out^L)[f] \,\right\}, \qquad DFT_p(s_i^R)[f] \in \left\{\, 0,\; DFT_p(out^R)[f] \,\right\}$$

This is the step where different parameters can be chosen to generate different sounds. For each frame and each DFT coefficient, we can choose whether to set it to zero or keep its value. Consequently, a huge family of candidate solutions can be generated: up to $2^{(N/2+1) \cdot P}$ different sounds.
4. The IDFT (Inverse Discrete Fourier Transform) of these frames is performed after filling their coefficients from N/2 + 1 to N − 1 with values that force the hermitian symmetry (the IDFT output must be a real signal). Next, we multiply the result by the inverse of the window values used before computing the DFT.

5. Finally, we overlap and add those frames to obtain $s_i^L[k]$, $s_i^R[k]$. The overlap is performed using a triangular window placed around the center of the frame and padded with zeros when M < N/2. Two examples are displayed in figures 1(a) and 1(b).

Fig. 1: Overlap and add with a triangular window for different values of M: (a) M = N/2, (b) M = N/4.

Some of the articles ([2]) that introduced the idea of applying a binary mask to the DFT coefficients referred to the derived processing mechanism as Time-Frequency Masking. We will continue to use this term.

We don't know how many meaningful solutions of the ABS problem are included in the space of candidate solutions produced by TFM; moreover, this could vary among different input data. However, some experiments reveal that, for many real commercial music productions, at least some meaningful solutions are included. In this article we present some examples of songs where our algorithm finds meaningful solutions (see section 5). Additionally, we have built a web site ([3]) whose forum collects reports of successful audio blind separation using this technique. For speech signals, [2] shows that the mixed voices are usually recovered.

In fact, if the original tracks had non-overlapping nonzero DFT coefficients, we would be able to regenerate them perfectly with TFM of the mixture (assuming a loss-less DFT-IDFT). Unfortunately, most music sounds don't verify this hypothesis. However, it seems that a track is perceived similarly when some of its DFT coefficients are replaced by the corresponding DFT coefficients of the mix, where more than one track may overlap. This may be explained by the experience that moderate equalization and compression (which may occur in frequency bands where two tracks overlap) don't significantly alter our perception of a sound.

On the other hand, we can think of cases where tracks will be hard to extract. In particular, when two tracks contain two performances of the same instrument playing the same music (often some vocals and some guitars are recorded twice), both tracks will undoubtedly overlap in frequency. However, in such cases we are usually not able to perceptually distinguish the two tracks either, so we will only pursue the extraction of their mixture.
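As a concrete illustration of steps 1–5, here is a minimal sketch of the analysis-masking-resynthesis pipeline for one channel, written with numpy. It is not our exact implementation: the analysis window (Hann here, Blackman-Harris −92 dB in our experiments, see section 5.1) and the function name are illustrative, and the binary mask is taken as given, since its construction is the subject of sections 3 and 4.

```python
import numpy as np

def tfm_resynthesize(out_ch, mask, N=8192, M=2048):
    """Resynthesize one channel of the mixture after binary Time-Frequency
    Masking.

    out_ch: one channel of the mixture, len(out_ch) >= (P - 1) * M + N
    mask:   boolean array of shape (P, N // 2 + 1); True keeps a coefficient
    """
    P = mask.shape[0]
    win = np.hanning(N)  # stand-in analysis window (section 5.1 uses Blackman-Harris -92 dB)

    # Triangular synthesis window of width 2M centered in the frame and
    # zero-padded when M < N/2; copies shifted by M samples sum exactly to one.
    tri = np.zeros(N)
    start = (N - 2 * M) // 2
    tri[start:start + 2 * M] = np.concatenate([np.arange(M), M - np.arange(M)]) / M

    s = np.zeros((P - 1) * M + N)
    for p in range(P):
        frame = out_ch[p * M:p * M + N] * win   # step 2: windowing
        spec = np.fft.rfft(frame)               # first N/2 + 1 DFT coefficients
        spec[~mask[p]] = 0.0                    # step 3: binary mask
        resynth = np.fft.irfft(spec, n=N)       # step 4: real IDFT
        safe = win > 1e-8                       # undo the analysis window where non-zero
        resynth[safe] /= win[safe]
        resynth[~safe] = 0.0
        s[p * M:p * M + N] += resynth * tri     # step 5: triangular overlap-add
    return s
```

Running this once per channel, with a shared mask or with per-channel masks as in subsection 4.3, yields one extracted stereo sound $s_i^L[k]$, $s_i^R[k]$.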
3. MATHEMATICAL CHARACTERIZATIONS OF THE SOUND OF A TRACK

3.1. Pan

The fact that some mono tracks¹ are mirrored in the two stereo channels when they are panned in the mixing process is helpful to decide whether a sound may correspond to a track or not. Let $in_i[k]$ be the original mono tracks of one mixture; mixers mirror them in two channels as follows:

$$\begin{pmatrix} in_i^L[k] \\ in_i^R[k] \end{pmatrix} = \begin{pmatrix} \alpha_i^L \cdot in_i[k] \\ \alpha_i^R \cdot in_i[k] \end{pmatrix}$$

Therefore, the mixture can be obtained with the following expression:

$$\begin{pmatrix} out^L[k] \\ out^R[k] \end{pmatrix} = \begin{pmatrix} \alpha_1^L & \dots & \alpha_n^L \\ \alpha_1^R & \dots & \alpha_n^R \end{pmatrix} \cdot \begin{pmatrix} in_1[k] \\ \vdots \\ in_n[k] \end{pmatrix}$$

¹Note that when a stereo reverberation effect is applied to one mono track, the output of the process is stereo, so these tracks are not considered here.

Additionally, we found that the majority of analog and digital mixers use the following pan law, where $x \in [0, 1]$ is the value of the analog or digital pan knob:

$$\alpha_i^L = \cos(x \cdot \pi/2) = \frac{1}{\sqrt{1 + (\alpha_i^R/\alpha_i^L)^2}}, \qquad \alpha_i^R = \sin(x \cdot \pi/2) = \frac{\alpha_i^R/\alpha_i^L}{\sqrt{1 + (\alpha_i^R/\alpha_i^L)^2}}, \qquad x = \arctan\!\left(\frac{\alpha_i^R}{\alpha_i^L}\right) \cdot \frac{2}{\pi} \quad (1)$$

Hence, if the extracted sound $s_i^L[k]$, $s_i^R[k]$ is one of the original stereo tracks, it will verify (assuming $in_i[k] \neq 0$):

$$\frac{s_i^R[k]}{s_i^L[k]} = \frac{in_i^R[k]}{in_i^L[k]} = \frac{\alpha_i^R}{\alpha_i^L} = \mathrm{constant}$$

Moreover, the DFT coefficients of both channels will still verify this expression because the DFT is a linear transformation. Consequently, our requirement for the extracted stereo sounds $s_i^L[k]$, $s_i^R[k]$ may be extended to their DFT coefficients as follows:

$$\frac{DFT_p(s_i^R)[f]}{DFT_p(s_i^L)[f]} = \mathrm{constant} \quad \forall f \in 0 \dots N/2 \;\text{ such that }\; DFT_p(s_i^R)[f] \neq 0 \text{ or } DFT_p(s_i^L)[f] \neq 0 \quad (2)$$

It is worth mentioning that this mechanism is particularly good because it allows discrimination between mono tracks that have been panned with different $\alpha_i^R/\alpha_i^L$ ratios (a code sketch of the pan mapping of Eq. (1) closes this subsection). On the other side of the coin, there are many kinds of tracks for which this characterization is not valid:

• Stereo tracks
• Mono tracks with artificial stereo reverberation
• Mono tracks with "automated" pan
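To make the mapping of Eq. (1) concrete, the following is a small sketch (numpy; the function name is ours) that estimates the pan-knob value x of each DFT coefficient from the inter-channel magnitude ratio. It anticipates the Pan TFF of subsection 4.1, where a user-selected range of x decides which coefficients are kept.

```python
import numpy as np

def estimated_pan(dft_L, dft_R, eps=1e-12):
    """Per-coefficient pan estimate x in [0, 1] via Eq. (1):
    x = arctan(|DFT_R| / |DFT_L|) * 2 / pi.
    Maps the unbounded ratio in [0, inf) to the bounded pan-knob range."""
    ratio = np.abs(dft_R) / (np.abs(dft_L) + eps)  # estimate of alpha_R / alpha_L
    return np.arctan(ratio) * 2.0 / np.pi
```

With this convention, x = 0 corresponds to a hard-left coefficient, x = 1 to hard-right and x = 0.5 to the center, matching the display a sound engineer would expect from a pan knob.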
3.2. Inter-channel Phase Difference

Another consequence of mirroring mono tracks in the two stereo channels is that their DFT phase spectrum will be the same for both channels. Hence, if the extracted sounds are the original tracks, they must verify:

$$\left| Arg(DFT_p(s_i^L)[f]) - Arg(DFT_p(s_i^R)[f]) \right| = 0 \quad \forall f \in 0 \dots N/2$$

And mono tracks with artificial stereo reverberation, or stereo tracks, may generally be characterized by the opposite case:

$$\left| Arg(DFT_p(s_i^L)[f]) - Arg(DFT_p(s_i^R)[f]) \right| > 0 \quad \forall f \in 0 \dots N/2$$

In this case, it may happen that the track contains not only DFT coefficients with different phase but also some with the same phase; in that case some kind of dereverberation is performed.

IPD (Inter-channel Phase Difference) is a good criterion because all mono tracks are well characterized (using either of the two complementary formulas); however, it only allows us to distinguish between mono non-reverberated tracks and mono artificially-reverberated or stereo tracks.

On the other hand, the zero phase difference can be used as a prerequisite of the pan characterization, because the latter presupposes that the same sound is mirrored in both channels.

4. SOLUTION SELECTION

In this section we describe how to select the DFT coefficients that aren't set to zero in step 3 of the candidate solution generation process. This mechanism should allow us to output one sound that is a meaningful solution of the ABS problem out of all the possible sounds generated with TFM. We designed it to be built into a real-time application which can receive feedback from the user; that is why we named it "Human-Assisted TFM". In order to set the parameters of our algorithm, we use an interface similar to the one presented by the authors of [4]. The user is able to set a few parameters that determine which DFT coefficients are masked, so as to obtain sounds that are perceived similarly to the original audio tracks of the song.

The selection process is performed by applying independent processing layers that we call "Time-Frequency Filters". In each one, a different Time-Frequency binary mask is set following a different approach. The overall algorithm applies, in each frame, a Time-Frequency mask that is the union of the masks of the individual layers.

First, two TFFs (Time-Frequency Filters) based on the mathematical characterizations of the sound of a track are presented (a code sketch of both follows subsection 4.2). Then, we suggest two auxiliary post-processing TFFs that may help to discriminate between some specific sounds.

4.1. Pan TFF

It is obvious that when two tracks overlap in the same DFT coefficient of frequency f, we can't expect the ratio

$$\frac{DFT_p(out^R)[f]}{DFT_p(out^L)[f]} \quad (3)$$

to be any of the ratios of the two tracks. Moreover, the overlapping coefficients may change in time. Hence, the DFT coefficients can't be assigned to different tracks directly using expression (2).

However, when dealing with non-reverberated speech signals, such overlap is not very significant, and the values of (3) only vary slightly around a value. Therefore it might make sense to define a maximum likelihood estimator of the value of (3) for each track in order to select them. Although that was the approach of [2] with speech signals, in commercial music production signals the overlap is much more significant and, at the moment, some researchers select ranges of values where (3) may vary without having a clear peak: [5] defines a Gaussian window, and the authors of [4] manually select a range of pan.

Our Pan TFF is based on the manual selection approach of [4], improved by adding a mapping of the values of (3) to their corresponding estimated pan (we use the expressions in (1)). In this way, the values are displayed with the same mapping the sound engineer used to define the pan. Additionally, the resulting interval [0, 1] of possible values is bounded, while expression (3) took values in [0, ∞). Next, we assist the user of our application with a visual representation of the energy of the DFT coefficients at each pan value (see subsection 4.4). The range of pans that the DFT coefficients should have is then selected. If a DFT coefficient has an estimated pan out of this range, it is set to 0; otherwise it is kept as it is.

4.2. Inter-channel Phase Difference TFF

For the same reasons stated above, it is not possible to clearly distinguish between exactly zero and non-zero IPD, and the characterization of section 3.2 should be adapted to work with overlapping DFT coefficients. Therefore, we define a TFF where a threshold is set to separate both situations. We assist the user of our application with a visual representation of the energy of the DFT coefficients at each possible IPD value (between −π and π), and the threshold is defined by the user looking at this graph.

This graph might also be used to decide whether the pan discrimination is going to work. If the DFT coefficients have the same phase (all the energy is accumulated around the zero IPD value), then it makes sense to suppose that they were mono tracks mirrored in both channels with a constant ratio; otherwise, some artificial reverberation may have been added and the previous criterion may be useless.
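A hedged sketch of these two characterization-based filters, in the same numpy style as the pipeline sketch of section 2 (the function names and the simple rectangular selection ranges are ours; in the actual application the ranges are picked interactively on the energy graphs of subsection 4.4):

```python
import numpy as np

def pan_tff(dft_L, dft_R, pan_lo, pan_hi, eps=1e-12):
    """Pan TFF (4.1): keep coefficients whose estimated pan, computed
    from Eq. (1), falls inside the user-selected range [pan_lo, pan_hi]."""
    pan = np.arctan(np.abs(dft_R) / (np.abs(dft_L) + eps)) * 2.0 / np.pi
    return (pan >= pan_lo) & (pan <= pan_hi)

def ipd_tff(dft_L, dft_R, threshold, keep_below=True):
    """IPD TFF (4.2): keep coefficients whose inter-channel phase
    difference is below the threshold (mono, mirrored tracks) or,
    with keep_below=False, above it (stereo/reverberated tracks)."""
    ipd = np.angle(dft_L * np.conj(dft_R))  # wrapped phase difference in (-pi, pi]
    below = np.abs(ipd) <= threshold
    return below if keep_below else ~below
```

Chaining layers then amounts to combining their boolean masks frame by frame, e.g. `mask = pan_tff(L, R, 0.4, 0.6) & ipd_tff(L, R, 0.1)`: a coefficient survives only if every filter keeps it, i.e. the zeroed sets of the layers are united, before resynthesis with the sketch of section 2.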
4.3. Magnitude and Magnitude-Variance TFFs

These filters are not based on characterizations of the sound of a track; hence, we believe they should be used to post-process the results obtained with the previously introduced TFFs.

The Magnitude TFF first normalizes all the DFT coefficients' magnitudes by their maximum value in the current frame, obtaining a set of magnitudes between 0 and 1. Next, ranges within those values are selected in a graph that represents their accumulated energy across multiple frames. We found this criterion useful to distinguish between percussive sounds with a large and flat frequency spectrum (e.g. snares and crashes) and harmonic instruments whose spectrum has peaks at the frequency components (e.g. voice or guitars).

The Magnitude-Variance TFF computes, for each DFT coefficient of each channel, its magnitude variance along time. All variances are then normalized by their maximum value and their energy is displayed in a graph. A range of these variances is also selected manually. This TFF is useful to discriminate between brief and steady attacks, or between sounds with different magnitude variation along time.

Note that these two procedures lead to Time-Frequency masks which may be different for each stereo channel (a sketch of both filters follows). An application of these TFFs is included in example 3 of section 5.
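A sketch of both post-processing filters in the same numpy style; the per-frame normalization and the variance along the frame axis are our reading of the text, and the names are illustrative:

```python
import numpy as np

def magnitude_tff(dft_frame, mag_lo, mag_hi, eps=1e-12):
    """Magnitude TFF (one channel, one frame): normalize magnitudes by the
    frame maximum, then keep coefficients inside the selected range."""
    mag = np.abs(dft_frame)
    mag_norm = mag / (mag.max() + eps)          # values in [0, 1]
    return (mag_norm >= mag_lo) & (mag_norm <= mag_hi)

def magnitude_variance_tff(dft_frames, var_lo, var_hi, eps=1e-12):
    """Magnitude-Variance TFF (one channel): for each DFT coefficient,
    compute its magnitude variance along time (axis 0 = frames),
    normalize by the maximum variance, and keep the selected range.
    Returns one mask per frequency bin, applied to every frame."""
    var = np.var(np.abs(dft_frames), axis=0)    # variance along time, per bin
    var_norm = var / (var.max() + eps)
    return (var_norm >= var_lo) & (var_norm <= var_hi)
```

Because each channel is filtered on its own spectra, the resulting masks may differ between left and right, as noted above.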
4.4. Visual representations

All graphs are built in the same way. For each DFT coefficient we compute one attribute (Pan, IPD, Magnitude or Magnitude-Variance). This attribute is mapped into a closed interval (generally [0, 1], or [−π, π] for IPD). The interval is divided into a finite number of small bins, and for each bin we add the squared-modulus contributions of the DFT coefficients whose attribute value falls in it. Finally, we average the values over several frames.

The values of the graph are normalized by their maximum value along some frames in order to achieve an optimal vertical resolution without rescaling the graph too frequently. Two examples are displayed in figures 2(a) and 2(b).

Fig. 2: Pan and IPD energy distribution graphs: (a) Pan, (b) IPD.
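The binning just described is essentially an energy-weighted histogram; a minimal sketch follows (numpy; the 300 bins and 20-frame averaging match the parameter choices reported in section 5.1, everything else is illustrative):

```python
import numpy as np

def attribute_energy_graph(attr, dft_L, dft_R, lo, hi, n_bins=300):
    """Energy distribution of one attribute over recent frames.

    attr:         array of shape (F, N//2 + 1): the attribute per coefficient
                  (pan, IPD, magnitude or magnitude variance) for F frames
    dft_L, dft_R: DFT frames of the two channels, same shape as attr
    lo, hi:       closed interval the attribute is mapped into
    """
    energy = np.abs(dft_L) ** 2 + np.abs(dft_R) ** 2   # squared-modulus contributions
    hist, _ = np.histogram(attr.ravel(), bins=n_bins, range=(lo, hi),
                           weights=energy.ravel())     # energy-weighted binning
    hist /= attr.shape[0]                              # average over the F frames
    return hist / (hist.max() + 1e-12)                 # normalize for display
```

The normalization here uses only the frames passed in; the application keeps a running maximum over the last 40 frames, as reported in section 5.1, so the display is not rescaled too often.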
5. EXPERIMENTS

5.1. Parameters

In our experiments we tested several values of N and found that values above 8192 didn't improve the perceived quality of the output sound. M is set to N/4 for better quality of the sound transitions. Finally, we selected the Blackman-Harris −92 dB window in order to minimize the side-lobes, which make the DFT frequency coefficients merge and change the estimation of their pan, IPD or Magnitude. The DFT is performed with the Fast Fourier Transform algorithm, and the graphs are computed with a horizontal resolution of 300 bins, an average of 20 frames, and a normalization along 40 frames.

Next we discuss some examples. Although some waveforms and spectrograms are drawn, the reader is encouraged to download and listen to their audio at http://www.iua.upf.es/~mvinyes/abs/demos, where an evaluation version of our algorithm implementation is also available.

5.2. Example 1

We first tested our algorithm on a self-produced song whose tracks were available. The song was produced with synthetic sampled drums in one mono track and 3 stereo real guitar tracks (acoustic rhythm guitar, feedback-delayed acoustic guitar pickings and electric distortion guitar). The feedback-delayed acoustic guitar and the electric distortion guitar are panned to opposite sides, and both the drums and the rhythm guitar are panned to the center.

Using pan discrimination we were able to extract the original tracks of the guitars panned to both sides, and one track consisting of the mixture of the drums and the rhythm guitar (because they shared the same pan). In figures 4(a) and 4(b) the original waveforms and spectrograms of the feedback-delayed acoustic guitar and its corresponding extracted sound are displayed together.

Fig. 3: S03: Mix waveform and spectrogram.

Fig. 4: S03: Original vs recovered guitar track: (a) waveforms, (b) spectrograms.

Note that the recovered track has a similar waveform although the amplitude varies. However, when we listen to them, we perceive them to be highly similar. This example supports our decision to pursue tracks that are perceived similarly to the original ones.

5.3. Example 2: Help (The Beatles)

Next we present an extraction of the vocals of the popular song Help (The Beatles). In this example the IPD TFF improves the separation when it is placed before the pan TFF.

Fig. 5: Help: Mix and extracted vocals: (a) spectrograms, (b) waveforms (mix; vocals with panning discrimination; vocals with panning+IPD discrimination).

The IPD TFF filters out much of the residual drum noise that remains present when the pan TFF alone is used. We guess that the drums are discriminated by IPD because they were recorded in stereo tracks, while the vocals were recorded using a mono mic.
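Putting the pieces together, a usage sketch of this vocal extraction under the assumptions of the earlier code sketches (their function definitions, plus out_L and out_R holding the two channels of the mixture; the concrete threshold and range values are invented for illustration, since in the application they are picked by eye on the energy graphs):

```python
import numpy as np

# Analysis: stack the windowed DFT frames of both channels (sketch).
N, M = 8192, 2048
P = (len(out_L) - N) // M + 1
frames_L = np.stack([np.fft.rfft(out_L[p * M:p * M + N] * np.hanning(N)) for p in range(P)])
frames_R = np.stack([np.fft.rfft(out_R[p * M:p * M + N] * np.hanning(N)) for p in range(P)])

# IPD TFF first (near-zero IPD keeps mono, mirrored tracks), then the
# Pan TFF around the center, where the vocals sit; hypothetical values.
keep = ipd_tff(frames_L, frames_R, threshold=0.15)
keep &= pan_tff(frames_L, frames_R, pan_lo=0.45, pan_hi=0.55)

vocals_L = tfm_resynthesize(out_L, keep, N, M)
vocals_R = tfm_resynthesize(out_R, keep, N, M)
```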
5.4. Example 3: Memorial (Explosions in the Sky)

This is an example of snare and guitar extraction. In this song, only the guitars seem to be highly reverberated, so an IPD TFF helps to discriminate between them and the other instruments. We therefore select a margin around the center (zero phase) to begin the extraction of the snare, and we use its complementary range to extract the guitars. Next, in both cases we use the Magnitude TFF: in order to isolate the guitars we remove the DFT coefficients with small magnitude, and we do the opposite for the snare. Finally, the Magnitude-Variance TFF is applied to reduce the noise produced by the guitar pickings in our attempt to isolate the snare (those noises have a greater magnitude variance than the sound of the snare).

Fig. 6: Memorial: Extracted snare and guitars: (a) snare spectrograms, (b) guitars spectrograms.

6. CONCLUSIONS

Our work can be included in the group of techniques that obtain the solution of the ABS problem by extracting sets of the input Discrete Fourier Transform (DFT) coefficients ([2],[4],[5],[6]). In particular, we design a human-assisted mechanism, as in [4], to select those sets.

A graphic interface is developed to select them using their inter-channel magnitude ratio and inter-channel phase difference, with some post-processing involving the magnitude of the DFT coefficients and their magnitude variance. The processing is divided into several independent layers, called TFFs, that set a Time-Frequency binary mask in multiple steps with different criteria. We claim that this is a more flexible way to select the Time-Frequency mask than the ones developed in the previously cited articles.

Another contribution is to consider the inter-channel phase difference instead of the estimated time delay introduced in [2]. [7] points out ambiguities in the estimated time delay for frequencies over 900 Hz and suggests a method to resolve them in mixtures of acoustically stereo-recorded sound sources. However, in commercial music productions IPD seems to make more sense, and it may help to distinguish between mono tracks and stereo tracks. Moreover, its energy distribution graph may be useful to determine whether the pan discrimination criterion is going to work, and to embed TFM with pan selection in automatic ABS systems.

We have shown that, in some cases, it is possible to extract from real commercial music productions the original tracks that were used in their mixing. It is often the case that vocals and other music instruments are recorded in different audio tracks. Consequently, our algorithm can be useful in karaoke systems to remove the voice of some songs, and it can be applied as a DJ remixing tool or for isolating instruments in a mixture. In [3], visitors are encouraged to download our software and post successful audio separations in order to evaluate its real-world performance.

7. FUTURE WORK

This method trusts TFM as a successful way to obtain all the meaningful solutions of the ABS problem. Future work may consist of evaluating experimentally the limits of this technique independently of the chosen TFFs.

Additionally, we have only presented 4 TFFs (Pan, IPD, Magnitude and Magnitude-Variance), so more ways to set the TFM mask may be explored.

Finally, we have noticed that low frequencies have big magnitudes that alter the graph significantly even if they are not really important to our ears. Perceptual weighting might help to make the graphs better represent what we really hear and ease the selection of meaningful sounds.
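As one possible direction for such perceptual weighting, the histogram contributions of subsection 4.4 could be scaled by a loudness-inspired curve before binning. The A-weighting choice below is our assumption, not something we have evaluated:

```python
import numpy as np

def a_weighting_gain(freqs_hz):
    """Linear-scale A-weighting gains for the given frequencies (Hz);
    one candidate perceptual weighting for the energy graphs."""
    f2 = np.asarray(freqs_hz, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return ra / 0.7943  # gain is ~1 at 1 kHz
```

Multiplying each coefficient's squared modulus by the gain of its bin frequency would de-emphasize the dominant low-frequency energy in the graphs.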
8. REFERENCES

[1] Albert S. Bregman. Auditory Scene Analysis. MIT Press, 1990.

[2] Özgür Yılmaz and Scott Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 2003.

[3] MarC Vinyes, Alex Loscos, and Jordi Bonada. http://www.iua.upf.edu/mtg/audioscanner.

[4] Dan Barry, Bob Lawlor, and Eugene Coyle. Sound source separation: Azimuth discrimination and resynthesis. Proc. of the 7th Int. Conference on Digital Audio Effects (DAFX-04), 2004.

[5] Carlos Avendano. Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2004.

[6] Michael A. Casey and Alex Westner. Separation of mixed audio sources by independent subspace analysis. International Computer Music Conference (ICMC), 2000.

[7] Harald Viste and Gianpaolo Evangelista. On the use of spatial cues to improve binaural source separation. Proc. of the 6th Int. Conference on Digital Audio Effects (DAFX-03), 2003.