In this presentation I focus on three projects I have been working on during the last year. The first one is a novel pattern matching algorithm based on the well-known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence without prior knowledge of their start-end points. I have applied the algorithm to the task of acoustic matching, for which I will show some preliminary results. Then I will explain a second DTW-based algorithm, this one able to perform an online alignment of two musical pieces. One of the pieces can be input live or retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding duplicate video segments within a large database. I will explain our multimodal approach, based on audio-visual change-based features.
3. Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm
Xavier Anguera, Robert Macrae and Nuria Oliver
Telefonica Research, Barcelona, Spain
4. Proposed challenge
• Given one or several audio signals we want to find and align recurring acoustic patterns.
5. Proposed challenge
• We could use the ASR/phonetic output and search for symbol repetitions
PROS:
– It is easy to apply; the ASR takes care of any time warping
CONS:
– ASR is language dependent and requires training
– We introduce additional sources of error (acoustic conditions, OOVs)
– It can be very slow and not embeddable
• Automatic motif discovery directly in the speech signal
– Training-free, language independent and resilient to some noises
[Diagram: symbolic route (ASR/phonetization → symbols → symbol alignment) vs. direct acoustic alignment; both output alignment locations and scores]
6. Areas of application
• Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
• Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010)
• Acoustic summarization (Muscariello, 2009)
• Musical structure analysis (Müller, 2007)
• Server-less mobile voice search (Anguera, 2010)
7. Automatic motif discovery
• The goal is to avoid going to text and therefore be more robust to errors
• There is a good deal of applicable work in this area:
– Biomedicine, in matching DNA sequences (converting the speech signals into symbol strings)
– Directly from real-valued multidimensional samples using DTW-like algorithms
• Müller'07, Muscariello'09, Park'05, Zweig'10
• Most need to compute the whole cost matrix a priori
8. Dynamic Time Warping (DTW)
• The DTW algorithm allows the computation of the optimal alignment between two time series X^U, X^V ∈ Φ^D:
X^U = (u_1, ..., u_m, ..., u_M)
X^V = (v_1, ..., v_n, ..., v_N)
Image by Daniel Lemire
9. Dynamic Time Warping (II)
• The optimal alignment can be found in O(MN) complexity using dynamic programming.
• We need to define a cost function between any two elements in the series and build a distance matrix:
d : Φ^D × Φ^D → ℝ_{≥0}
where usually d(m, n) = ‖u_m − v_n‖ (Euclidean distance)
• Warping function: F = c(1), ..., c(K), where c(k) = (i(k), j(k))
Image by Tsanko Dyustabanov
10. Warping constraints
For speech signals, some constraints are usually applied to the warping function F:
– Monotonicity: i(k−1) ≤ i(k), j(k−1) ≤ j(k)
– Continuity (i.e. local constraints): i(k) − i(k−1) ≤ 1, j(k) − j(k−1) ≤ 1
These lead to the recursion over predecessors (m−1, n), (m, n−1) and (m−1, n−1):
D(m, n) = min{D(m−1, n), D(m, n−1), D(m−1, n−1)} + d(u_m, v_n)
Sakoe, H. and Chiba, S. (1978), "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
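The recursion above can be sketched in a few lines. This is a minimal pure-Python illustration of the Sakoe-Chiba recursion (function and variable names are my own, not from the slides); each cell adds the local distance to the best of its three predecessors, and the boundary condition forces a full start-to-end alignment:

```python
def dtw(U, V, d=lambda a, b: abs(a - b)):
    """Accumulated DTW cost between two 1-D series U and V."""
    M, N = len(U), len(V)
    INF = float("inf")
    # D[m][n] = best accumulated cost up to (m, n), with 1-based padding
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            D[m][n] = d(U[m - 1], V[n - 1]) + min(
                D[m - 1][n],      # step in U only
                D[m][n - 1],      # step in V only
                D[m - 1][n - 1],  # diagonal step
            )
    return D[M][N]  # boundary condition: align from start to end
```

With a multidimensional series, `d` would be the Euclidean distance between feature frames instead of a scalar difference.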
11. Warping constraints (II)
– Boundary condition: i(1) = 1, j(1) = 1, i(K) = M, j(K) = N
i.e. DTW needs prior knowledge of the start-end alignment points.
– Global constraints
Image from Keogh and Ratanamahatana
16. DTW main problem
• The boundary condition constrains the time series to be aligned from start to end
– We need a modification to DTW that allows common pattern discovery in reference and query signals regardless of the sequences' other content
17. Alternative proposals
• Meinard Müller's path extraction for music [1]
– Needs to pre-compute the complete cost matrix.
• Alex Park's Segmental DTW [2]
– Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
• Armando Muscariello's word discovery algorithm [3]
– Searches for patterns locally; does not check all possible starting points.
[1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007.
[2] A. Park et al., "Towards unsupervised pattern discovery in speech," in Proc. ASRU'05, Puerto Rico, 2005.
[3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH'09, 2009.
18. Unbounded-DTW Algorithm
• U-DTW is a modification to DTW that is fast and accurate in finding recurring patterns
• We call it unbounded because:
– The start-end positions of both segments are not constrained
– Multiple matching segments can be found with a single pass of the algorithm
– It minimizes the computational cost of comparing two multidimensional time series
19. U-DTW cost function and matching length
• Given two sequences to be matched, U = (u_1, u_2, ..., u_M) and V = (v_1, v_2, ..., v_N), we use the inner product similarity:
s(m, n) = cos θ = ⟨u_m, v_n⟩ / (‖u_m‖ ‖v_n‖)
Values range in [−1, 1]; the higher, the closer.
• We look for matching sequences with a minimum length L_min (set at 400 ms in our experiments)
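The frame-pair similarity above is the standard cosine similarity between two feature vectors (e.g. two MFCC frames); a plain-Python sketch:

```python
import math

def cosine_sim(u, v):
    """s(m, n) = <u_m, v_n> / (|u_m| |v_n|), in [-1, 1]; higher = closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```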
20. U-DTW global/local constraints
• No global constraints are applied, in order to allow matching of any segment between both sequences
• Local constraints are set to allow warping up to 2x, with predecessors (m−1, n−1), (m−1, n−2) and (m−2, n−1):
D(m, n) = max{D(m−1, n−1), D(m−1, n−2), D(m−2, n−1)} + s(u_m, v_n)
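The accumulation step above differs from standard DTW in two ways: it maximizes similarity rather than minimizing cost, and its predecessors permit up to 2x warping. A small sketch of one cell update (the exact predecessor set follows the path labels on the slide and should be treated as an assumption of this sketch; names are mine):

```python
def udtw_step(D, m, n, s_mn):
    """One U-DTW cell update. D maps (m, n) -> accumulated similarity
    for cells reached by some live path; returns the new accumulated
    similarity at (m, n), or None if no path reaches it."""
    NEG = float("-inf")
    best = max(
        D.get((m - 1, n - 1), NEG),  # no warping
        D.get((m - 1, n - 2), NEG),  # V advances twice (2x warp)
        D.get((m - 2, n - 1), NEG),  # U advances twice (2x warp)
    )
    if best == NEG:
        return None  # no live predecessor path reaches (m, n)
    return best + s_mn  # maximize accumulated similarity
```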
21. U-DTW computational savings
• Computational savings are achieved thanks to:
1. We sample the distance/similarity matrix at certain possible matching start points (setting synchronization points)
2. Dynamic programming is done forward, pruning out low-similarity paths
22. Synchronization points
• Only certain (m, n) positions are analyzed in the matrix for possible matching segments
– Selected so as not to lose any matching segment
– They optimize the computational cost
• Two methods are followed: horizontal and vertical bands
[Diagram: synchronization points placed along horizontal bands (spacing τ_h, 2τ_h) and diagonal bands at π/4 (spacing τ_d) over the U × V matrix]
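As a rough illustration of the banding idea, synchronization points can be generated by walking the matrix along horizontal bands every τ_h rows and along 45-degree diagonals every τ_d columns, instead of testing every cell. This is a hypothetical sketch; the parameter names and the exact band geometry are assumptions, not the paper's definition:

```python
def sync_points(M, N, tau_h=10, tau_d=10):
    """Candidate start cells (m, n) of an M x N similarity matrix,
    sampled on horizontal and diagonal bands."""
    sps = set()
    for m in range(0, M, tau_h):        # horizontal bands every tau_h rows
        for n in range(N):
            sps.add((m, n))
    for start in range(0, N, tau_d):    # diagonal bands at 45 degrees
        m, n = 0, start
        while m < M and n < N:
            sps.add((m, n))
            m, n = m + 1, n + 1
    return sps
```

Only these sampled cells seed forward paths, so the number of similarity computations scales with the band density rather than with M·N.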
24. Forward dynamic programming
• For each position (m, n), 3 possible forward paths are considered: (m+1, n+2), (m+1, n+1) and (m+2, n+1)
• The forward path is extended to (m', n') iff:
– Its normalized global similarity is above a pruning threshold:
S(m', n') = (D(m, n) + s(m', n')) / (M(m, n) + 1) ≥ Thr_prun
– S(m', n') is greater than that of any previous path through that location
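The two extension conditions can be sketched as a single test per candidate step. Here `D` holds accumulated similarity, `M_len` the path length in steps, and `best_at` the best normalized score seen per cell; all names are assumptions of this sketch:

```python
def extend_path(D, M_len, best_at, m, n, m2, n2, s, thr):
    """Try to extend the path at (m, n) to (m2, n2) with frame
    similarity s; returns True if extended, False if pruned."""
    S = (D[(m, n)] + s) / (M_len[(m, n)] + 1)  # length-normalized similarity
    # prune if below threshold, or dominated by a previous path here
    if S < thr or S <= best_at.get((m2, n2), float("-inf")):
        return False
    D[(m2, n2)] = D[(m, n)] + s
    M_len[(m2, n2)] = M_len[(m, n)] + 1
    best_at[(m2, n2)] = S
    return True
```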
27. Backward path algorithm
• When a possible matching segment is found in the forward path, the same is done backwards, starting from the originating SP position, with backward steps (m−2, n−1), (m−1, n−1) and (m−1, n−2).
The same procedure is followed as in the forward path.
31. Experimental setup
• We asked 23 people to record 47 words from 6 categories (Monuments, Family, Events, Cities, People, Nature), 5 iterations each:
X_{U,V}[n, i], i = 1...5, n = 1...47
• Simple energy-based trimming eliminates non-speech regions
• We simulate acoustic context by attaching different start-end audio sequences to X_{U,V}.
32. Experimental setup (II)
• Signals are parameterized with 10 MFCCs every 10 ms
• Each word X_U is compared to all words X_V from the same speaker (234 comparisons) and the closest one is retrieved:
argmin_{m,j} D(X_U[n, i], X_V[m, j]) | (n, i) ≠ (m, j)
We get a hit if m = n, a miss otherwise
• Tests were performed on an Ubuntu Linux PC @2.4GHz.
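The retrieval rule above amounts to a nearest-neighbor search that excludes the query itself. A small sketch, where `score` stands in for the matching similarity (higher = better) and all names are assumptions:

```python
def retrieve(n, i, words, score):
    """Return (best matching (word, iteration) key, hit?) for query
    word n, iteration i. `words` maps (word, iteration) -> sequence."""
    query = words[(n, i)]
    best, best_key = float("-inf"), None
    for key, seq in words.items():
        if key == (n, i):
            continue  # never match a word against itself
        s = score(query, seq)
        if s > best:
            best, best_key = s, key
    # a hit when the retrieved word index equals the query's (m == n)
    return best_key, best_key[0] == n
```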
33. Comparing systems
• Standard DTW
– Compare the sequences without any added acoustic context (i.e. with prior knowledge of start-end points)
• Segmental DTW (Park and Glass, 2005)
– Minimum segment length of 500 ms
– Band size of 70 ms, 50% overlap
– Used 2 distances: Euclidean and 1 − inner product
34. Performance evaluation
Used metrics:
– Accuracy: percentage of words correctly matched (X_U and X_V are different iterations of the same word).
Acc = (Σ correct matches / all matches) · 100
– Average processing time per sequence pair (X_U, X_V), excluding parameterization:
Time = Σ time(D(X_U[n, i], X_V[m, j])) / # matches
– Average ratio of frame-pair distances computed within each sequence-pair cost matrix:
Ratio = (Σ computed(d(X_U[n, i], X_V[m, j])) / MN) · 100
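The accuracy metric is a simple hit rate over all retrievals; a minimal sketch (names assumed):

```python
def accuracy(results):
    """results: list of (retrieved_word_index, true_word_index) pairs.
    Returns the percentage of retrievals where the indices match."""
    hits = sum(1 for m, n in results if m == n)
    return 100.0 * hits / len(results)
```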
37. Conclusions and future work
• We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
• We show it is faster and more accurate than existing alternatives
• We are starting to test the algorithm for unrestricted audio summarization
39. People enjoy listening to their favorite music everywhere… …at home, …on the go, …or at a party with friends
40. Users increasingly have a personal mp3 music collection… …but it usually contains 'only' music. What if you could watch the video clip of any of your songs while listening to it?
41. You could go to sites like YouTube… …but the audio quality is much worse than in your mp3… What if you could listen to your high-quality mp3 music while watching the video clips?
42. MuViSync: Music and Video Synchronization system
MuViSync synchronizes audio and video from two different sources and plays them together in-sync.
[Diagram: video clip (streaming or local) + personal music → MuViSync]
43. Application scenarios
• Watch your favorite music on TV
– Personal music synchronization with video clips, either local or streamed
• Watch your music on your iPhone
– Personal music synchronization by streaming the video to the iPhone
• Identify and watch any music
– Combined with songID technology, either at home or on the go.
44. MuViSync application
• We have developed a prototype application for Windows/Mac, and soon for iPhone.
45. Alignment algorithm requirements
• Perform an alignment between the mp3 music and the video's audio track
• Initially only partial knowledge is available from both sources (live recording or buffering)
• Alignment has to be done online and in real time
• Emphasis is needed on user satisfaction when playing the video.
46. Application testbed
• We use 320 music videos (YouTube) + their corresponding mp3 files
• A supervised ground-truth alignment was performed using offline DTW and checking for consistency
• Audio is processed every 100 ms (200 ms window) and chroma features are extracted
47. MuViSync online alignment algorithm
1. Initial path discovery
– Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
2. Real-time online alignment
– An incremental alignment is computed
3. Alignment post-processing to ensure a smooth playback of the aligned video.
[Diagram: audio and video feature extraction feeding 1) initial path discovery and 2) real-time alignment, over times t_a and t_v]
48. Initial path discovery (online mp3 playback + video buffering)
[Diagram: sync request; audio from the mp3 file vs. audio available from the video, up to the video buffering end]
49. Initial path discovery
• A segment of the audio and the buffered video are checked for alignment using forward DTW
• The global similarity D(m, n) at each location (m, n) is normalized by the length of the optimum path to that location
• At each step, all paths with D'(m, n) < D_ave(*, n) are pruned.
• The initial alignment is selected when only one path survives or the sync time is reached.
50. Initial path discovery
[Diagram: alignment buffer (about 1 s) between the audio being played from the mp3 and the audio available from the video]
53. Initial path discovery
[Diagram: surviving alignment path between the audio being played from the mp3 and the audio available from the video]
54. Real-time online alignment
• Starting from the initial alignment, we iteratively compute:
1. The locally optimum forward path for L steps, p_1…p_L, using a) local constraints (no dynamic programming)
2. A backward (standard) DTW from p_L to p_1, using b) local constraints
3. Add the initial L/2 steps to the final path, and restart 1) from p_{L/2}, until the playback ends
58. Real-time online alignment
[Diagram: step 3) moves the new starting point p_1 forward along the alignment between the audio being played from the mp3 and the audio available from the video]
59. Alignment post-processing
• Alignment estimates every 100 ms are not enough to drive 25/30 fps video
• An interpolation of the points + averaging over 5 seconds gives the projection estimate for the current playback
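One way to realize this smoothing, sketched here under the assumption of a local least-squares linear fit (the slides only say "interpolation + averaging over 5 seconds", so the exact method is an assumption):

```python
def project(points, t_now, window=5.0):
    """points: list of (audio_time, video_time) alignment estimates.
    Fit a line over the last `window` seconds of estimates and project
    the video position for playback time t_now. Assumes at least one
    estimate falls inside the window."""
    recent = [(a, v) for a, v in points if t_now - a <= window]
    n = len(recent)
    ma = sum(a for a, _ in recent) / n
    mv = sum(v for _, v in recent) / n
    var = sum((a - ma) ** 2 for a, _ in recent)
    # least-squares slope; fall back to a 1:1 playback rate
    slope = sum((a - ma) * (v - mv) for a, v in recent) / var if var else 1.0
    return mv + slope * (t_now - ma)
```

The fitted line both interpolates between the sparse 100 ms estimates and averages out their jitter, so the video clock advances smoothly frame to frame.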
60. Experiments
• We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency.
• Accuracy is computed as the % of songs with average error below a given number of ms.
[Figure: average accuracy @100 ms for different video buffer lengths]
65. …after 40 minutes… watching many of the videos returned, you notice that many are similar, i.e. near duplicates:
27% on average in YouTube [Wu et al., 2007]
12% on average in YouTube [Anguera et al., 2009]
66. Near duplicate (NDVC) definition
• Identical or approximately identical videos that differ in some feature:
– file formats, encoding parameters
– photometric variations (color, lighting changes)
– overlays (caption, logo, audio commentary)
– editing operations (frames added/removed)
– semantic similarity
NDVCs are videos that are "essentially the same"
67. Near duplicates (NDVC) vs. video copies
• These two concepts are not clearly discriminated in the literature.
• Video copy: an exact video segment, with some transformations applied to it
• Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, …)
In our research we approach video copy detection
69. Use scenarios: copyright law enforcement
Detection of copyright-infringing videos in online video sharing sites.
In a recent study we found that, on average, 12% of search results in YouTube are copies of the same video.
70. Use scenarios: video forensics for illegal activities
Discover illegal content hidden within other videos.
Currently, police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.
71. Use scenarios: database management
Video excerpts are used several times. Database management/optimization, and helping in searches over historic contents.
73. Use scenarios: information overload reduction
Improved (more diverse) video search results by clustering all video duplicates.
[Figure: "George Bush" search results before and after clustering]
74. Steps in video duplicate detection
1. Indexing of the reference videos
A. Obtain features representing the video
B. Store these features in a scalable manner
2. Search for queries within the reference set
[Diagram: OFFLINE — reference videos → feature extraction → feature indexing → database; ONLINE — query video → feature extraction → search for video duplicates]
75. Ways to approach near-duplicate video detection
• Local features
– Extracted from selected frames in the videos
– Focus on local characteristics within those frames
• Global features
– Extracted from selected frames or from the whole video
– Focus on overall characteristics
76. Local features
• They come from previous knowledge on image copy detection/near-duplicate detection
• Steps:
– Keyframes are first extracted from the videos at regular intervals or by detecting shots
– Local features are obtained for these keyframes:
• SIFT
• SURF
• HARRIS
• …
77. Global features
• Features are extracted either from the whole video or from keyframes, by looking at the overall image (not at particular points). In our work we extract them from the whole video.
78. Multimodal video copy detection
• Most works use only video/image information
– They prefer local features for their robustness
• We introduce audio information by combining global features from both the audio and video tracks
• We are also experimenting with fusing local features with global features (work in progress)
79. Multimodal global features
• We use features based on the changes in the data → more robust to transformations
• Video:
– Hue + saturation interframe change
– Lightest and darkest centroid interframe distance
• Audio:
– Bayesian Information Criterion (BIC) between adjacent segments
– Cross-BIC between adjacent segments
– Kullback-Leibler divergence (KL2) between adjacent segments
81. Hue+saturation interframe change
2. For each two consecutive frames, compute their HS histogram and compute their intersection as:
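The slide's formula is not reproduced in this transcript; a standard histogram-intersection sketch (an assumption made here for illustration, with normalized histograms) would be:

```python
def hist_intersection(h1, h2):
    """Intersection of two HS histograms normalized to sum to 1.
    Result is in [0, 1]; a low value signals a strong interframe
    change (e.g. a shot cut)."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Applied to every consecutive frame pair, this yields one global feature value per frame transition.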
82. Lightest and darkest centroid interframe distance
1. Find the lightest and darkest regions in each frame and obtain their centroids
83. Lightest and darkest centroid interframe distance
We compute the Euclidean distance between the centroids of each two adjacent frames, obtaining two global feature streams
84. Acoustic features
• Compute some acoustic distance between adjacent acoustic segments
[Diagram: segment A and segment B modeled by GMM A, GMM B and a joint GMM A+B]
85. Acoustic features (II)
• Likelihood-based metrics:
– Bayesian Information Criterion (BIC)
– Cross-BIC
• Model distance metrics:
– Kullback-Leibler divergence (KL2)
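As an illustration of the KL2 feature, here is the symmetric Kullback-Leibler divergence between two single 1-D Gaussians fitted to adjacent segments. The slides model segments with GMMs; reducing each segment to one univariate Gaussian is an assumption made purely to keep the sketch short:

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two 1-D Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def kl2(mu_a, var_a, mu_b, var_b):
    """Symmetric KL divergence (KL2) between adjacent segments A and B."""
    return (kl_gauss(mu_a, var_a, mu_b, var_b)
            + kl_gauss(mu_b, var_b, mu_a, var_a))
```

Computed over every pair of adjacent segments, KL2 produces one more change-based global feature stream, alongside BIC and cross-BIC.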
87. Search for full copies
• For each video-query pair we compute the correlation of each feature pair
[Diagram: reference FFT × query FFT → IFFT → find peaks → possible copy]
• We then find the positions with high similarity (peaks).
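The matching step amounts to cross-correlating the query feature stream against the reference stream and taking the lag with the highest score. The system does this efficiently via FFT/IFFT as in the diagram; a direct O(N·M) correlation is shown here only for clarity:

```python
def best_lag(ref, query):
    """Slide the query over the reference and return (lag, score) of
    the highest cross-correlation; a strong peak suggests a copy."""
    best, best_pos = float("-inf"), 0
    for lag in range(len(ref) - len(query) + 1):
        score = sum(q * ref[lag + k] for k, q in enumerate(query))
        if score > best:
            best, best_pos = score, lag
    return best_pos, best
```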
88. Multimodal fusion
• When multiple modalities are available, fusion is performed on the correlations
89. Output score
• The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak
• Automatic weights are obtained via
90. Finding subsegments of the query
• The previously described algorithm assumes the whole query matches a portion of the reference videos
• To avoid this restriction, a modification of the algorithm first splits the query into overlapping 20 s segments
• By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
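The splitting step above can be sketched as cutting the query feature stream into overlapping fixed-length windows (20 s in the slides); the hop size is an assumption of this sketch:

```python
def split_query(frames, seg_len, hop):
    """Cut a query feature stream into overlapping segments of
    seg_len frames, advancing hop frames at a time. Returns a list of
    (start_frame, segment) pairs."""
    segments = []
    for start in range(0, max(1, len(frames) - seg_len + 1), hop):
        segments.append((start, frames[start:start + seg_len]))
    return segments
```

Each segment is then correlated against the reference independently, and the per-segment peaks are accumulated to recover the dominant delay and the matched subsegment.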
91. Algorithm performance evaluation
• To test the algorithm we used the MUSCLE-VCD database:
– Over 100 hours of reference videos from the SoundVision group (Netherlands)
– 2 test sets:
• ST1: 15 query videos where the whole query is considered
• ST2: 3 videos with 21 segments appearing in the reference database
http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html