Multimodal pattern matching algorithms and applications

Xavier Anguera
Telefonica Research
Outline

•  Introduction
•  Partial sequence matching
   –  U-DTW algorithm
•  Music/video online synchronization
   –  MuViSync prototype
•  Video copy detection
Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm

Xavier Anguera, Robert Macrae and Nuria Oliver
Telefonica Research, Barcelona, Spain
Proposed challenge

•  Given one or several audio signals, we want to find and align recurring acoustic patterns.
Proposed challenge

•  We could use the ASR/phonetic output and search for symbol repetitions
   PROS:
   –  It is easy to apply; the ASR takes care of any time warping
   CONS:
   –  ASR is language dependent and requires training
   –  We introduce additional sources of error (acoustic conditions, OOVs)
   –  It can be very slow and not embeddable
•  Automatic motif discovery directly in the speech signal
   –  Training-free, language independent and resilient to some noise

(Figure: the two approaches side by side. ASR/phonetization produces a symbolic representation whose symbols are then aligned; direct acoustic alignment instead outputs alignment locations and scores.)
Areas of application

•  Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
•  Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010)
•  Acoustic summarization (Muscariello, 2009)
•  Musical structure analysis (Müller, 2007)
•  Server-less mobile voice search (Anguera, 2010)
Automatic motif discovery

•  The goal is to avoid going to text and therefore be more robust to errors
•  There is a good deal of applicable work in this area:
   –  Biomedicine, in matching DNA sequences (converting the speech signals into symbol strings)
   –  Directly from real-valued multidimensional samples using DTW-like algorithms
      •  Müller'07, Muscariello'09, Park'05, Zweig'10
      •  Most need to compute the whole cost matrix a priori
Dynamic Time Warping - DTW

•  The DTW algorithm allows the computation of the optimal alignment between two time series X_U, X_V ∈ Φ^D:

   X_U = (u_1, ..., u_m, ..., u_M)
   X_V = (v_1, ..., v_n, ..., v_N)

   (Image by Daniel Lemire)
Dynamic Time Warping (II)

•  The optimal alignment can be found with O(MN) complexity using dynamic programming.
•  We need to define a cost function between any two elements in the series and build a distance matrix:

   d: Φ^D × Φ^D → ℝ_{≥0}

   where usually d(m, n) = ||u_m − v_n|| (Euclidean distance)

   (Image by Tsanko Dyustabanov)


Warping function: F = c(1), ..., c(K), where c(k) = (i(k), j(k))
Warping constraints

For speech signals some constraints are usually applied to the warping function F:
   –  Monotonicity:  i(k−1) ≤ i(k),  j(k−1) ≤ j(k)
   –  Continuity (i.e. local constraints):  i(k) − i(k−1) ≤ 1,  j(k) − j(k−1) ≤ 1,
      giving the recurrence  D(m, n) = d(u_m, v_n) + min{ D(m−1, n), D(m, n−1), D(m−1, n−1) }
      (a minimal code sketch follows below)

Sakoe, H. and Chiba, S. (1978), "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
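A minimal runnable sketch of this standard recurrence, assuming NumPy arrays of feature vectors (illustrative only; this is not the implementation evaluated later in the talk):

```python
import numpy as np

def dtw_cost(U, V):
    """Standard DTW between U (M x D) and V (N x D) with Euclidean local cost,
    following D(m,n) = d(u_m, v_n) + min(D(m-1,n), D(m,n-1), D(m-1,n-1))."""
    M, N = len(U), len(V)
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)  # local cost matrix
    D = np.full((M, N), np.inf)
    D[0, 0] = d[0, 0]
    for m in range(M):
        for n in range(N):
            if m == 0 and n == 0:
                continue
            prev = min(D[m - 1, n] if m > 0 else np.inf,
                       D[m, n - 1] if n > 0 else np.inf,
                       D[m - 1, n - 1] if m > 0 and n > 0 else np.inf)
            D[m, n] = d[m, n] + prev
    return D[-1, -1]  # cost of the optimal full start-to-end alignment
```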
  
Warping constraints (II)

   –  Boundary condition:  i(1) = 1,  j(1) = 1,  i(K) = M,  j(K) = N
      i.e. DTW needs prior knowledge of the start-end alignment points.
   –  Global constraints

   (Image from Keogh and Ratanamahatana)
DTW Dynamic Programming

(Animation slides illustrating the dynamic programming matrix being filled.)
DTW main problem

•  The boundary condition constrains the time series to be aligned from start to end
   –  We need a modification of DTW to allow common pattern discovery in reference and query signals regardless of the sequences' other content
Alternative proposals

•  Meinard Müller's path extraction for music [1]
   –  Needs to pre-compute the complete cost matrix.
•  Alex Park's Segmental DTW [2]
   –  Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
•  Armando Muscariello's word discovery algorithm [3]
   –  Searches for patterns locally; does not check all possible starting points.
[1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007.
[2] A. Park et al., "Towards unsupervised pattern discovery in speech," in Proc. ASRU'05, Puerto Rico, 2005.
[3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH'09, 2009.
Unbounded-DTW Algorithm

•  U-DTW is a modification of DTW that is fast and accurate in finding recurring patterns
•  We call it unbounded because:
   –  The start-end positions of both segments are not constrained
   –  Multiple matching segments can be found with a single pass of the algorithm
   –  It minimizes the computational cost of comparing two multidimensional time series
U-DTW cost function and matching length

•  Given two sequences to be matched, U = (u_1, u_2, ..., u_M) and V = (v_1, v_2, ..., v_N), we use the inner-product similarity

   s(m, n) = cos θ = <u_m, v_n> / (||u_m|| ||v_n||)

   Values range in [−1, 1]; the higher, the closer.
•  We look for matching sequences with a minimum length Lmin (set to 400 ms in our experiments)
U-DTW global/local constraints

•  No global constraints are applied, in order to allow matching of any segment among both sequences
•  Local constraints are set to allow warping of up to 2X (sketched below):

   D(m, n) = s(u_m, v_n) + max{ D(m−2, n−1), D(m−1, n−1), D(m−1, n−2) }

   i.e. a cell (m, n) can be reached from (m−2, n−1), (m−1, n−1) or (m−1, n−2)
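An illustrative sketch of the similarity and of one accumulation step under these local constraints (helper names are mine; the predecessor set follows the local-constraint figure on the slide):

```python
import numpy as np

def cosine_similarity(u, v):
    """Inner-product similarity s(m,n) = <u_m, v_n> / (||u_m|| ||v_n||), in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def udtw_accumulate(D, s, m, n):
    """One cell of the U-DTW accumulation:
    D(m,n) = s(m,n) + max(D(m-2,n-1), D(m-1,n-1), D(m-1,n-2)),
    i.e. local steps that allow warping of up to 2X."""
    def get(i, j):
        # cells outside the matrix (or never reached) count as -inf
        return D[i, j] if i >= 0 and j >= 0 else -np.inf
    best_prev = max(get(m - 2, n - 1), get(m - 1, n - 1), get(m - 1, n - 2))
    # a path that starts at this cell (a synchronization point) has no predecessor
    return s[m, n] + (best_prev if np.isfinite(best_prev) else 0.0)
```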
  
U-DTW computational savings

•  Computational savings are achieved thanks to:
   1.  We sample the distance/similarity matrix at certain possible matching start points (setting synchronization points)
   2.  Dynamic programming is done forward, pruning out low-similarity paths
Synchronization points

•  Only certain (m, n) positions in the matrix are analyzed for possible matching segments
   –  Selected so as not to lose any matching segment
   –  Optimize the computational cost
•  Two methods are followed: horizontal and vertical bands

   (Figure: the two band layouts over the U×V similarity matrix, parameterized by λ, τ_h / 2τ_h and τ_d at π/4 around the synchronization points (m, n).)
U-DTW Dynamic Programming
Forward dynamic programming

•  For each position (m, n), 3 possible forward paths are considered: (m+1, n+1), (m+1, n+2) and (m+2, n+1)
•  The path is extended forward iff:
   –  Its normalized global similarity is above a pruning threshold:

      S(m', n') = ( D(m, n) + s(m', n') ) / ( M(m, n) + 1 ) ≥ Thr_prun

   –  S(m', n') is greater than that of any previous path at that location

   (A small sketch of this test follows below.)
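A small sketch of the extension test, assuming bookkeeping arrays for the accumulated similarity, the path length M(m, n) and the best normalized score per cell (all names are mine, not from the paper):

```python
def try_extend(D, M_path, S_best, src, dst, s_dst, thr_prun):
    """Forward extension test of U-DTW.

    D[m, n]      accumulated similarity of the best path reaching (m, n)
    M_path[m, n] number of cells traversed by that path, M(m, n) in the slide
    S_best[m, n] best normalized similarity seen so far at (m, n)
    src, dst     current cell (m, n) and candidate forward cell (m', n')
    s_dst        frame-pair similarity s(m', n')
    """
    m, n = src
    m2, n2 = dst
    S_new = (D[m, n] + s_dst) / (M_path[m, n] + 1)   # normalized global similarity
    if S_new >= thr_prun and S_new > S_best[m2, n2]:
        D[m2, n2] = D[m, n] + s_dst
        M_path[m2, n2] = M_path[m, n] + 1
        S_best[m2, n2] = S_new
        return True    # the path survives and is extended to (m', n')
    return False       # the path is pruned
```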
U-DTW Dynamic Programming (animation slides)
Backward path algorithm

•  When a possible matching segment is found in the forward path, the same is done backwards, starting from the originating SP position.

   (Local backward steps from (m, n): (m−2, n−1), (m−1, n−1), (m−1, n−2).)

The same procedure is followed as in the forward path
U-DTW Dynamic Programming (animation slides)
Computational savings example

(Figure: similarity matrix for two utterances of the word "Barcelona".)
Experimental setup

•  We asked 23 people to record 47 words from 6 categories (Monuments, Family, Events, Cities, People, Nature), 5 iterations each:

   X_{U,V}[n, i],  i = 1...5,  n = 1...47

•  Simple energy-based trimming eliminates non-speech regions
•  We simulate acoustic context by attaching different start-end audio sequences to X_{U,V}.
Experimental setup (II)

•  Signals are parameterized with 10 MFCCs every 10 ms
•  Each word X_U is compared to all words X_V from the same speaker (234 comparisons) and the closest one is retrieved:

   argmin_{m,j} D(X_U[n, i], X_V[m, j])  such that  (n, i) ≠ (m, j)

   We get a hit if m = n, a miss otherwise
•  Tests were performed on an Ubuntu Linux PC @ 2.4 GHz.
Comparing systems

•  Standard DTW
   –  Compares the sequences without any added acoustic context (i.e. with prior knowledge of the start-end points)
•  Segmental DTW (Park and Glass, 2005)
   –  Minimum segment length of 500 ms
   –  Band size of 70 ms, 50% overlap
   –  Used 2 distances: Euclidean and 1 − inner product
Performance evaluation

Metrics used:
   –  Accuracy: percentage of words correctly matched (X_U and X_V are different iterations of the same word):

      Acc = ( Σ correct matches / all matches ) · 100

   –  Average processing time per sequence pair (X_U, X_V), excluding parameterization:

      Time = Σ time( D(X_U[n, i], X_V[m, j]) ) / #matches

   –  Average ratio of frame-pair distances computed within each sequence-pair cost matrix:

      Ratio = Σ computed( d(X_U[n, i], X_V[m, j]) ) / MN
Results

Algorithm                        Accuracy   Avg. time   Ratio
Segmental DTW w/ Eucl.           80.61%     82.7 ms     1
Segmental DTW w/ inner prod.     74.62%     86.7 ms     1
U-DTW horiz. bands               89.53%     10.6 ms     0.51
U-DTW diag. bands                89.34%     9.0 ms      0.42
Standard DTW                     95.42%     0.6 ms      1
Effect of the cutout threshold
Conclusions and future work

•  We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
•  We show it is faster and more accurate than existing alternatives
•  We are starting to test the algorithm for unrestricted audio summarization
MuViSync
Audiovisual Music Synchronization

Xavier Anguera, Robert Macrae and Nuria Oliver
People enjoy listening to their favorite music everywhere...

...at home, ...on the go, ...or at a party with friends
Users increasingly have a personal mp3 music collection...

...but it usually contains 'only' music.

What if you could watch the video clip of any of your songs while listening to it?
You could go to sites like YouTube...

...but the audio quality is much worse than in your mp3...
What if you could listen to your high-quality mp3 music while watching the video clips?
MuViSync:
Music and Video Synchronization system

(Diagram: video clip, streamed or local, plus the personal music collection feeding MuViSync.)

MuViSync synchronizes audio and video from two different sources and plays them together in sync
Application scenarios

•  Watch your favorite music on TV
   –  Personal music synchronization with video clips, either local or streamed
•  Watch your music on your iPhone
   –  Personal music synchronization by streaming the video to the iPhone
•  Identify and watch any music
   –  Combined with songID technology, either at home or on the go.
MuViSync application

•  We have developed a prototype application for Windows/Mac, and soon for iPhone.
Alignment algorithm requirements

•  Perform an alignment between the mp3 music and the video's audio track
•  Initially only partial knowledge is available from both sources (live recording or buffering)
•  Alignment has to be done online and in real time
•  Emphasis is needed on user satisfaction when playing the video.
Application testbed

•  We use 320 music videos (YouTube) + their corresponding mp3 files
•  A supervised ground-truth alignment was performed using offline DTW and checked for consistency
•  Audio is processed every 100 ms (200 ms window) and chroma features are extracted
MuViSync online alignment algorithm

1.  Initial path discovery
    –  Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
2.  Real-time online alignment
    –  An incremental alignment is computed
3.  Alignment post-processing to ensure smooth playback of the aligned video.

(Diagram: audio + feature extraction → 1) initial path discovery → (t_a, t_v); feature extraction → 2) real-time alignment → alignment.)
Initial path discovery
(online mp3 playback + video buffering)

(Figure: timelines of the audio from the mp3 file and the audio available from the video, marking the sync request and the end of video buffering.)
Initial path discovery

•  A segment of the audio and the buffered video are checked for alignment using forward-DTW
•  The global similarity D(m, n) at each location (m, n) is normalized by the length of the optimum path to that location
•  At each step, all paths with D'(m, n) < D_ave(*, n) are pruned.
•  The initial alignment is selected when only one path survives or the sync time is reached.
Initial path discovery

(Animation: alignment paths grow in an alignment buffer of about 1 s between the audio being played from the mp3 and the audio available from the video.)
Real-time online alignment

•  Starting from the initial alignment we iteratively compute (see the sketch after this list):
   1.  A locally optimum forward path for L steps, p_1...p_L, using a) local constraints (no dynamic programming)
   2.  A backward (standard) DTW from p_L to p_1 using b) local constraints
   3.  Add the initial L/2 steps to the final path, and restart 1) from p_{L/2}, until the playback ends
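An illustrative sketch of one iteration of this loop, assuming a frame-similarity callback sim(a, v) over the chroma features; the particular step sets chosen for the a) and b) local constraints are my assumptions, since the slides do not spell them out:

```python
import numpy as np

def online_alignment_step(sim, anchor, L=8):
    """One iteration: greedy forward path, backward DTW refinement, commit half."""
    # 1) locally optimal forward path of L steps (no dynamic programming),
    #    with assumed local steps (1,1), (1,2), (2,1)
    a, v = anchor
    path = [anchor]
    for _ in range(L):
        a, v = max([(a + 1, v + 1), (a + 1, v + 2), (a + 2, v + 1)],
                   key=lambda p: sim(*p))
        path.append((a, v))

    # 2) standard DTW backwards from p_L to p_1 over the visited box
    (a0, v0), (aL, vL) = path[0], path[-1]
    M, N = aL - a0 + 1, vL - v0 + 1
    D = np.full((M, N), -np.inf)
    D[0, 0] = sim(a0, v0)
    for i in range(M):
        for j in range(N):
            if (i, j) == (0, 0):
                continue
            prev = max(D[i - 1, j] if i > 0 else -np.inf,
                       D[i, j - 1] if j > 0 else -np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else -np.inf)
            D[i, j] = sim(a0 + i, v0 + j) + prev
    refined, (i, j) = [(aL, vL)], (M - 1, N - 1)
    while (i, j) != (0, 0):
        i, j = max([(x, y) for x, y in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                    if x >= 0 and y >= 0], key=lambda c: D[c])
        refined.append((a0 + i, v0 + j))
    refined.reverse()

    # 3) commit only the first half of the refined path; restart from its midpoint
    keep = refined[: len(refined) // 2 + 1]
    return keep, keep[-1]
```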
  
Real-time online alignment

(Animation over the mp3/video-audio similarity plane: 1) forward locally best path with L = 8, from p_1 to p_L; 2) standard DTW backwards; 3) the new starting point is moved forward.)
  
Alignment post-processing

•  Alignment estimates every 100 ms are not enough to drive 25/30 fps video
•  Interpolating the points and averaging over 5 seconds gives the projection estimate for the current playback (a sketch follows below)
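A rough sketch of such a projection, assuming the alignment estimates are stored as paired (audio time, video time) samples; a least-squares line over the last 5 s stands in for the interpolation and averaging described above:

```python
import numpy as np

def project_video_time(audio_times, video_times, t_now, window=5.0):
    """Smooth video-position estimate for the current audio playback time t_now,
    usable for driving 25/30 fps video from ~100 ms alignment estimates."""
    a = np.asarray(audio_times, dtype=float)
    v = np.asarray(video_times, dtype=float)
    recent = a >= t_now - window            # keep the last `window` seconds
    slope, offset = np.polyfit(a[recent], v[recent], deg=1)
    return slope * t_now + offset
```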
  
Experiments

•  We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency.
•  Accuracy is computed as the % of songs with average error below a given threshold (in ms).

(Figure: average accuracy @100 ms for different video buffer lengths.)
Experiments	
  
Video Duplicate Detection

Xavier Anguera and Pere Obrador
Let's say you're looking for the Bush attack video...

...and you get 11,100 results.

...after 40 minutes of watching many of the videos returned, you notice that many are similar, i.e. near duplicates:
   –  27% on average in YouTube [Wu et al., 2007]
   –  12% on average in YouTube [Anguera et al., 2009]
Near duplicate (NDVC) definition

•  Identical or approximately identical videos that differ in some feature:
   –  file formats, encoding parameters
   –  photometric variations (color, lighting changes)
   –  overlays (caption, logo, audio commentary)
   –  editing operations (frames added/removed)
   –  semantic similarity

NDVC are videos that are "essentially the same"
Near duplicates (NDVC) vs. video copies

•  These two concepts are not clearly discriminated in the literature.
•  Video copy: an exact video segment, with some transformations applied to it
•  Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, ...)

In our research we address video copy detection
Examples of video copies
Use scenarios: copyright law enforcement

•  Detection of copyright-infringing videos in online video sharing sites

In a recent study we found that on average 12% of search results in YouTube are copies of the same video
Use scenarios: video forensics for illegal activities

•  Discover illegal content hidden within other videos

Currently police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.
Use scenarios: database management

•  Video excerpts used several times

Database management/optimization and helping in searches over historic content
Use scenarios: advertisement detection and management

•  Advertisement detection/identification
•  Programming analysis
Use scenarios: information overload reduction

•  Improved (more diverse) video search results by clustering all video duplicates.

(Figure: "George Bush" search results before and after clustering.)
Steps in video duplicate detection

1.  Indexing of the reference videos
    A.  Obtain features representing the video
    B.  Store these features in a scalable manner
2.  Search for queries within the reference set

(Diagram. Offline: reference videos → feature extraction → reference indexing → features database. Online: query video → feature extraction → search for duplicates.)
Ways to approach near-duplicate video detection

•  Local features
   –  Extracted from selected frames in the videos
   –  Focus on local characteristics within those frames
•  Global features
   –  Extracted from selected frames or from the whole video
   –  Focus on overall characteristics
Local features

•  Inherited from previous work on image copy / near-duplicate detection
•  Steps:
   –  Keyframes are first extracted from the videos, at regular intervals or by detecting shots
   –  Local features are obtained for these keyframes:
      •  SIFT
      •  SURF
      •  HARRIS
      •  ...
Global features

•  Features are extracted either from the whole video or from keyframes, by looking at the overall image (not at particular points).

In our work we extract them from the whole video
Multimodal video copy detection

•  Most works use only video/image information
   –  They prefer local features for their robustness
•  We introduce audio information by combining global features from both the audio and video tracks
•  We are also experimenting with fusing local features with global features (work in progress)
Multimodal global features

•  We use features based on the changes in the data → more robust to transformations
•  Video:
   –  Hue + saturation interframe change
   –  Lightest and darkest centroid interframe distance
•  Audio:
   –  Bayesian information criterion (BIC) between adjacent segments
   –  Cross-BIC between adjacent segments
   –  Kullback-Leibler divergence (KL2) between adjacent segments
Hue+Saturation interframe change

1.  Transform the colorspace from RGB to HSV (Hue + Saturation + Value)
Hue+Saturation interframe change

2.  For each two consecutive frames, compute their HS histograms and compute their intersection (a sketch follows below).
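The intersection formula itself appeared as an image on the slide; the sketch below uses the standard histogram-intersection form over 2-D Hue/Saturation histograms (OpenCV for the color conversion; the bin counts are assumptions):

```python
import cv2
import numpy as np

def hs_histogram(frame_bgr, bins=(16, 16)):
    """L1-normalized 2-D Hue/Saturation histogram of one frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    return hist / (hist.sum() + 1e-12)

def hs_interframe_change(prev_frame, frame):
    """One sample of the global feature stream: 1 minus the histogram
    intersection of two consecutive frames."""
    h1, h2 = hs_histogram(prev_frame), hs_histogram(frame)
    return 1.0 - float(np.minimum(h1, h2).sum())
```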
  
Lightest and darkest centroid interframe distance

1.  Find the lightest and darkest regions in each frame and obtain their centroids
Lightest and darkest centroid interframe distance

2.  We compute the Euclidean distance between the centroids of each two adjacent frames, obtaining two global feature streams (sketched below)
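A sketch of these two streams; the rule for picking the "lightest" and "darkest" regions (here, the top and bottom 5% of pixel intensities) is an assumption, as the slides do not give it:

```python
import numpy as np

def light_dark_centroids(gray, frac=0.05):
    """Centroids (x, y) of the lightest and darkest `frac` of pixels in a frame."""
    flat = gray.ravel()
    n = max(1, int(frac * flat.size))
    light_thr = np.partition(flat, -n)[-n]
    dark_thr = np.partition(flat, n - 1)[n - 1]
    ys, xs = np.nonzero(gray >= light_thr)
    light = np.array([xs.mean(), ys.mean()])
    ys, xs = np.nonzero(gray <= dark_thr)
    dark = np.array([xs.mean(), ys.mean()])
    return light, dark

def centroid_interframe_distance(prev_gray, gray):
    """Euclidean distances of the light/dark centroids between adjacent frames,
    yielding one sample of each of the two global feature streams."""
    l0, d0 = light_dark_centroids(prev_gray)
    l1, d1 = light_dark_centroids(gray)
    return float(np.linalg.norm(l1 - l0)), float(np.linalg.norm(d1 - d0))
```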
  
Acoustic features

•  Compute some acoustic distance between adjacent acoustic segments

(Diagram: segment A and segment B are each modeled by a GMM (GMM A, GMM B), and their union by GMM A+B.)
Acoustic features (II)

•  Likelihood-based metrics:
   –  Bayesian Information Criterion (BIC)
   –  Cross-BIC
•  Model distance metrics:
   –  Kullback-Leibler divergence (KL2)

(An illustrative ΔBIC sketch follows below.)
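The slides model each segment, and their union, with GMMs; as an illustrative stand-in, here is the common single full-covariance Gaussian form of the ΔBIC between two adjacent feature segments:

```python
import numpy as np

def delta_bic(seg_a, seg_b, lam=1.0):
    """ΔBIC between adjacent segments (frames x dims): one Gaussian per segment
    versus a single Gaussian for their union, penalized by the extra parameters."""
    joint = np.vstack([seg_a, seg_b])
    n_a, n_b, n = len(seg_a), len(seg_b), len(joint)
    d = joint.shape[1]

    def logdet_cov(x):
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)  # small regularization
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(joint)
            - 0.5 * n_a * logdet_cov(seg_a)
            - 0.5 * n_b * logdet_cov(seg_b)
            - penalty)
```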
  
Acoustic features (III)

•  For example: the Bayesian Information Criterion (BIC) output:

(Figure: example BIC output stream.)
Search for full copies

•  For each video-query pair we compute the correlation of each feature pair:

   (Diagram: reference → FFT, possible copy → FFT; multiply, IFFT, find peaks.)

•  We then find the positions with high similarity (peaks). (See the sketch below.)
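A sketch of this FFT-based correlation search for one global feature stream (mean removal, zero-padding and peak picking are my choices, not spelled out on the slide):

```python
import numpy as np

def correlation_peaks(ref_feat, query_feat, num_peaks=3):
    """Cross-correlate a reference stream with a possible-copy stream via
    FFT -> multiply by the conjugate -> IFFT, and return the lags with the
    highest similarity (candidate alignment delays)."""
    ref = np.asarray(ref_feat, dtype=float)
    qry = np.asarray(query_feat, dtype=float)
    n = len(ref) + len(qry) - 1
    nfft = 1 << (n - 1).bit_length()                 # next power of two
    R = np.fft.rfft(ref - ref.mean(), nfft)
    Q = np.fft.rfft(qry - qry.mean(), nfft)
    corr = np.fft.irfft(R * np.conj(Q), nfft)[:n]
    lags = np.argsort(corr)[::-1][:num_peaks]
    return lags, corr[lags]
```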
  
Multimodal fusion

•  When multiple modalities are available, fusion is performed on the correlations
Output score

•  The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak
•  Automatic weights are obtained via ...
Finding subsegments of the query

•  The previously described algorithm assumes the whole query matches a portion of the reference videos
•  To avoid this restriction, a modification of the algorithm first splits the query into overlapping 20 s segments
•  By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
Algorithm performance evaluation

•  To test the algorithm we used the MUSCLE-VCD database:
   –  Over 100 hours of reference videos from the SoundVision group (Netherlands)
   –  2 test sets
      •  ST1: 15 query videos where the whole query is considered
      •  ST2: 3 videos with 21 segments appearing in the reference database

http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html
MUSCLE-VCD transformation examples
Evaluation metrics

•  We use the same metrics as in the MUSCLE-VCD benchmark tests
Evaluation metrics (II)

•  We also use the more standard precision and recall metrics
Evaluation results
Evaluation results histogram for ST1
YouTube reranking application

•  We downloaded all videos returned when searching for the top 20 most viewed and 20 most visited videos
YouTube reranking application

•  We applied multimodal copy detection and grouped all near duplicates
YouTube reranking test

•  Results show how some videos have multiple clear copies that can boost their ranking once clustered
Thanks for your attention

xanguera@tid.es
www.xavieranguera.com
LinkedIn: http://es.linkedin.com/in/xanguera
Twitter: http://twitter.com/xanguera
Website: http://www.xavieranguera.com/

TeamStation AI System Report LATAM IT Salaries 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Multimodal pattern matching algorithms and applications

  • 1. Multimodal pattern matching algorithms and applications. Xavier Anguera, Telefonica Research
  • 2. Outline • Introduction • Partial sequence matching – U-DTW algorithm • Music/video online synchronization – MuViSync prototype • Video copy detection
  • 3. Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm. Xavier Anguera, Robert Macrae and Nuria Oliver, Telefonica Research, Barcelona, Spain
  • 4. Proposed challenge • Given one or several audio signals, we want to find and align recurring acoustic patterns.
  • 5. Proposed challenge • We could use the ASR/phonetic output and search for symbol repetitions. PROS: – It is easy to apply; the ASR takes care of any time warping. CONS: – ASR is language dependent and requires training – We introduce additional sources of error (acoustic conditions, OOVs) – It can be very slow and not embeddable • Automatic motif discovery directly in the speech signal – Training-free, language independent and resilient to some noise [diagram: a symbolic route (ASR/phonetization followed by symbol alignment) vs. direct acoustic alignment; both output alignment locations and scores]
  • 6. Areas of application • Improve ASR by disambiguation over several repetitions (Park and Glass, 2005) • Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010) • Acoustic summarization (Muscariello, 2009) • Musical structure analysis (Müller, 2007) • Server-less mobile voice search (Anguera, 2010)
  • 7. Automatic motif discovery • The goal is to avoid going to text and therefore be more robust to errors • There is a good deal of applicable work in this area: – Biomedicine, for matching DNA sequences (converting the speech signals into symbol strings) – Directly from real-valued multidimensional samples using DTW-like algorithms • Müller'07, Muscariello'09, Park'05, Zweig'10 • Most need to compute the whole cost matrix a priori
  • 8. Dynamic Time Warping (DTW) • The DTW algorithm allows the computation of the optimal alignment between two time series X_U, X_V ∈ Φ^D: X_U = (u_1, ..., u_m, ..., u_M), X_V = (v_1, ..., v_n, ..., v_N) [image by Daniel Lemire]
  • 9. Dynamic Time Warping (II) • The optimal alignment can be found in O(MN) complexity using dynamic programming. • We need to define a cost function between any two elements in the series and build a distance matrix: d: Φ^D × Φ^D → ℝ≥0, where usually d(m,n) = ||u_m − v_n|| (Euclidean distance). Warping function: F = c(1), ..., c(K), where c(k) = (i(k), j(k)) [image by Tsanko Dyustabanov]
  • 10. Warping constraints • For speech signals, some constraints are usually applied to the warping function F: – Monotonicity: i(k−1) ≤ i(k), j(k−1) ≤ j(k) – Continuity (i.e. local constraints): i(k) − i(k−1) ≤ 1, j(k) − j(k−1) ≤ 1, which leads to the recursion D(m,n) = min{ D(m−1,n), D(m,n−1), D(m−1,n−1) } + d(u_m, v_n) [diagram: cell (m,n) and its predecessors (m−1,n), (m,n−1), (m−1,n−1)] Sakoe, H. and Chiba, S. (1978), "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
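
A minimal sketch of this constrained DTW recursion (plain NumPy, Euclidean local cost, and the three local predecessors above); the array names and toy sequences are illustrative, not taken from the talk:

```python
import numpy as np

def dtw(U, V):
    """Classic DTW between two feature sequences U (M x D) and V (N x D).
    Returns the accumulated cost of the optimal start-to-end alignment."""
    M, N = len(U), len(V)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            d = np.linalg.norm(U[m - 1] - V[n - 1])   # local Euclidean cost
            # monotonic, continuous local constraints
            D[m, n] = d + min(D[m - 1, n], D[m, n - 1], D[m - 1, n - 1])
    return D[M, N]

# toy example with 2-dimensional features
U = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
V = np.array([[0.1, 0.9], [0.9, 1.1], [1.0, 1.0], [2.1, 0.4]])
print(dtw(U, V))
```
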
  • 11. Warping constraints (II) – Boundary condition: i(1) = 1, j(1) = 1, i(K) = M, j(K) = N, i.e. DTW needs prior knowledge of the start-end alignment points. – Global constraints [image from Keogh and Ratanamahatana]
  • 16. DTW main problem • The boundary condition constrains the time series to be aligned from start to end – We need a modification of DTW to allow common pattern discovery in reference and query signals regardless of the sequences' other content
  • 17. Alternative proposals • Meinard Müller's path extraction for music – Needs to pre-compute the complete cost matrix. • Alex Park's segmental DTW – Needs to pre-compute the complete cost matrix, and is very computationally expensive afterwards. • Armando Muscariello's word discovery algorithm – Searches for patterns locally, does not check all possible starting points. [1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007. [2] A. Park et al., "Towards unsupervised pattern discovery in speech," in Proc. ASRU'05, Puerto Rico, 2005. [3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH'09, 2009.
  • 18. Unbounded-DTW algorithm • U-DTW is a modification of DTW that is fast and accurate in finding recurring patterns • We call it unbounded because: – The start-end positions of both segments are not constrained – Multiple matching segments can be found with a single pass of the algorithm – It minimizes the computational cost of comparing two multidimensional time series
  • 19. U-DTW cost function and matching length • Given two sequences to be matched, U = (u_1, u_2, ..., u_M) and V = (v_1, v_2, ..., v_N), we use the inner-product (cosine) similarity s(m,n) = cos θ = <u_m, v_n> / (||u_m|| ||v_n||). Values range in [−1, 1]; the higher, the closer. • We look for matching sequences with a minimum length Lmin (set to 400 ms in our experiments)
  • 20. U-DTW global/local constraints • No global constraints are applied, in order to allow matching of any segment among both sequences • Local constraints are set to allow warping of up to 2x: D(m,n) = max{ D(m−1,n−2), D(m−1,n−1), D(m−2,n−1) } + s(u_m, v_n) [diagram: cell (m,n) and its predecessors (m−2,n−1), (m−1,n−1), (m−1,n−2)]
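
A hedged sketch of the local similarity and one accumulation step with the 2x-warping constraints described above (the names `sim` and `udtw_step` are illustrative; the real U-DTW also handles synchronization points and pruning, covered in the following slides):

```python
import numpy as np

def sim(u, v):
    """Cosine (normalized inner-product) similarity between two frames."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def udtw_step(D, U, V, m, n):
    """One accumulation step of the similarity matrix D (assumes m >= 2, n >= 2):
    best predecessor among the 2x-warping moves (m-1,n-2), (m-1,n-1), (m-2,n-1),
    plus the local similarity at (m,n)."""
    best_prev = max(D[m - 1, n - 2], D[m - 1, n - 1], D[m - 2, n - 1])
    return best_prev + sim(U[m], V[n])
```
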
  • 21. U-DTW computational savings • Computational savings are achieved thanks to: 1. We sample the distance/similarity matrix at certain possible matching start points (setting synchronization points) 2. Dynamic programming is done forward, pruning out low-similarity paths
  • 22. Synchronization points • Only certain (m,n) positions in the matrix are analyzed for possible matching segments – Selected so as not to lose any matching segment – Optimized for computational cost • Two methods are followed: horizontal and diagonal bands [diagram: U vs. V matrix with horizontal bands spaced τh and 2τh, and diagonal bands spaced τd at π/4, both of length λ]
  • 24. Forward dynamic programming • For each position (m,n), 3 possible forward paths are considered: (m+1, n+2), (m+1, n+1), (m+2, n+1) • The forward path is extended iff: – Its normalized global similarity is above a pruning threshold: S(m',n') = (D(m,n) + s(m',n')) / (M(m,n) + 1) ≥ Thr_prun – S(m',n') is greater than that of any previous path at that location
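
A small illustrative check for the pruning rule above; `M_len` plays the role of the path-length counter M(m,n), and both the names and the threshold value are assumptions made for the sketch:

```python
THR_PRUN = 0.55   # assumed pruning threshold, not the value used in the paper

def extend_path(D_mn, M_len, s_next, best_seen):
    """Decide whether to extend a forward path to the next cell.

    D_mn      : accumulated similarity at the current cell (m, n)
    M_len     : number of steps in the path so far, i.e. M(m, n)
    s_next    : local similarity s(m', n') of the candidate next cell
    best_seen : best normalized score already recorded at (m', n')
    """
    S_next = (D_mn + s_next) / (M_len + 1)          # normalized global similarity
    return S_next >= THR_PRUN and S_next > best_seen
```
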
  • 27. Backward path algorithm • When a possible matching segment is found in the forward path, the same is done backwards, starting from the originating SP position [diagram: cell (m,n) and its backward predecessors (m−2,n−1), (m−1,n−1), (m−1,n−2)]. The same procedure is followed as in the forward path.
  • 30. Computational savings example [example: matching two utterances of the word "Barcelona"]
  • 31. Experimental setup • We asked 23 people to record 47 words from 6 categories (Monuments, Family, Events, Cities, People, Nature), 5 iterations each: X_{U,V}[n,i], i = 1...5, n = 1...47 • Simple energy-based trimming eliminates non-speech regions • We simulate acoustic context by attaching different start-end audio sequences to X_{U,V}.
  • 32. Experimental setup (II) • Signals are parameterized with 10 MFCCs every 10 ms • Each word X_U is compared to all words X_V from the same speaker (234 comparisons) and the closest one is retrieved: argmin_{m,j} D(X_U[n,i], X_V[m,j]) with (n,i) ≠ (m,j). We get a hit if m = n, a miss otherwise • Tests were performed on an Ubuntu Linux PC @ 2.4 GHz.
  • 33. Comparison systems • Standard DTW – Compares the sequences without any added acoustic context (i.e. with prior knowledge of the start-end points) • Segmental DTW (Park and Glass, 2005) – Minimum segment length of 500 ms – Band size of 70 ms, 50% overlap – Two distances used: Euclidean and 1 − inner product
  • 34. Performance evaluation • Metrics used: – Accuracy: percentage of words correctly matched (X_U and X_V are different iterations of the same word): Acc = (Σ correct matches / all matches) · 100 – Average processing time per sequence pair (X_U, X_V), excluding parameterization: Time = Σ time(D(X_U[n,i], X_V[m,j])) / #matches – Average ratio of frame-pair distances computed within each sequence-pair cost matrix: Ratio = (Σ computed(d(X_U[n,i], X_V[m,j])) / MN) · 100
  • 35. Results
      Algorithm                     | Accuracy | Avg. time | Ratio
      Segmental DTW w/ Eucl.        | 80.61%   | 82.7 ms   | 1
      Segmental DTW w/ inner prod.  | 74.62%   | 86.7 ms   | 1
      U-DTW horiz. bands            | 89.53%   | 10.6 ms   | 0.51
      U-DTW diag. bands             | 89.34%   | 9.0 ms    | 0.42
      Standard DTW                  | 95.42%   | 0.6 ms    | 1
  • 36. Effect  of  the  Cutout  Threshold  
  • 37. Conclusions and future work • We propose a novel algorithm, U-DTW, for unconstrained pattern discovery in speech • We show it is faster and more accurate than existing alternatives • We are starting to test the algorithm for unrestricted audio summarization
  • 38. MuViSync: Audio-Visual Music Synchronization. Xavier Anguera, Robert Macrae and Nuria Oliver
  • 39. People enjoy listening to their favorite music everywhere… …at home, … …on the go, … …or at a party with friends
  • 40. Users increasingly have a personal mp3 music collection… …but it usually contains 'only' music. What if you could watch the video clip of any of your songs while listening to it?
  • 41. You could go to sites like YouTube… …but the audio quality is much worse than in your mp3… What if you could listen to your high-quality mp3 music while watching the video clips?
  • 42. MuViSync: a Music and Video Synchronization system. MuViSync synchronizes audio and video from two different sources (a streamed video clip and local personal music) and plays them together in sync
  • 43. Application scenarios • Watch your favorite music on TV – Personal music synchronization with video clips, either local or streamed • Watch your music on your iPhone – Personal music synchronization by streaming the video to the iPhone • Identify and watch any music – Combined with songID technology, either at home or on the go.
  • 44. MuViSync application • We have developed a prototype application for Windows/Mac, and soon for iPhone.
  • 45. Alignment algorithm requirements • Perform an alignment between the mp3 music and the video's audio track • Initially only partial knowledge is available from both sources (live recording or buffering) • Alignment has to be done online and in real time • Emphasis is needed on user satisfaction when playing the video.
  • 46. Application testbed • We use 320 music videos (YouTube) + their corresponding mp3 files • A supervised ground-truth alignment was performed using offline DTW and checked for consistency • Audio is processed every 100 ms (200 ms window) and chroma features are extracted
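
As an illustration of this kind of front-end (not the exact feature extractor used in MuViSync), chroma vectors at a 100 ms hop with a 200 ms window could be computed with librosa roughly as follows; the sampling rate and function name are choices made for the sketch:

```python
import librosa

def chroma_every_100ms(path, sr=22050):
    """12-bin chroma features with ~200 ms analysis windows and ~100 ms hop."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    n_fft = int(0.2 * sr)        # ~200 ms analysis window
    hop = int(0.1 * sr)          # ~100 ms hop
    # shape: (12, n_frames), one chroma vector per 100 ms
    return librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
```
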
  • 47. MuViSync online alignment algorithm 1. Initial path discovery – Both signals (audio and video) are buffered, features are extracted and an initial alignment is found 2. Real-time online alignment – An incremental alignment is computed 3. Alignment post-processing to ensure smooth playback of the aligned video. [diagram: audio and video feature extraction feeding 1) initial path discovery, which outputs (t_a, t_v) to 2) real-time alignment]
  • 48. Initial path discovery (online mp3 playback + video buffering) [diagram: sync request, audio from the mp3 file, video buffering end, audio available from the video]
  • 49. Initial path discovery • A segment of the audio and the buffered video are checked for alignment using forward-DTW • The global similarity D(m,n) at each location (m,n) is normalized by the length of the optimum path to that location • At each step, all paths with D'(m,n) < D_ave(*,n) are pruned. • The initial alignment is selected when only one path survives or the sync time is reached.
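
A minimal sketch of that per-step pruning rule, assuming `paths` is a list of live alignment hypotheses, each carrying its accumulated similarity and path length (the data layout is invented for illustration):

```python
def prune_paths(paths):
    """Keep only hypotheses whose length-normalized similarity is at least
    the average over all live hypotheses at the current step.

    Each path is a dict with keys 'D' (accumulated similarity) and
    'length' (number of steps taken so far).
    """
    normalized = [p["D"] / p["length"] for p in paths]
    average = sum(normalized) / len(normalized)
    return [p for p, s in zip(paths, normalized) if s >= average]
```
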
  • 50. Initial path discovery [diagram: audio being played from the mp3 vs. audio available from the video, with an alignment buffer of about 1 s]
  • 51. Initial path discovery [diagram: audio being played from the mp3 vs. audio available from the video]
  • 52. Initial path discovery [diagram: audio being played from the mp3 vs. audio available from the video]
  • 53. Initial path discovery [diagram: audio being played from the mp3 vs. audio available from the video]
  • 54. Real-time online alignment • Starting from the initial alignment we iteratively compute: 1. The locally optimal forward path for L steps, p_1…p_L, using local constraints a) (no dynamic programming) 2. A backward (standard) DTW pass from p_L to p_1 using local constraints b) 3. The initial L/2 steps are added to the final path, and step 1) restarts from p_{L/2} until the playback ends
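
A rough structural sketch of that forward/backward loop; the helpers `greedy_forward` and `backward_dtw` are placeholders for the two steps described above, so this illustrates the control flow rather than the MuViSync implementation:

```python
def online_alignment(start, L, greedy_forward, backward_dtw, playback_active):
    """Iteratively extend the alignment path while playback continues.

    greedy_forward(point, L) -> list of L locally-best forward points
    backward_dtw(points)     -> the same points refined by a backward DTW pass
    playback_active()        -> False once the song/video ends
    """
    path = [start]
    anchor = start
    while playback_active():
        forward = greedy_forward(anchor, L)    # step 1: local forward search
        refined = backward_dtw(forward)        # step 2: backward DTW refinement
        path.extend(refined[: L // 2])         # step 3: commit the first L/2 steps
        anchor = refined[L // 2 - 1]           # restart from p_{L/2}
    return path
```
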
  • 55. Real-time online alignment [diagram: audio being played from the mp3 vs. audio available from the video]
  • 56. Real-time online alignment [diagram: 1) forward locally-best path with L = 8, from p_1 to p_L]
  • 57. Real-time online alignment [diagram: 2) standard DTW backwards from p_L to p_1]
  • 58. Real-time online alignment [diagram: 3) the new starting point is moved forward]
  • 59. Alignment post-processing • Alignment estimates every 100 ms are not enough to drive 25/30 fps video • Interpolation of the points + averaging over 5 seconds gives the projection estimate for the current playback
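
For illustration, the smoothing step could look like the following NumPy sketch (the window length, frame rate and names are assumptions): given (audio time, video time) estimates every 100 ms, interpolate to the frame rate and average over a 5 s window.

```python
import numpy as np

def smooth_mapping(audio_t, video_t, fps=25.0, win_s=5.0):
    """Interpolate 100 ms alignment estimates to the video frame rate and
    smooth them with a moving average over `win_s` seconds."""
    frame_t = np.arange(audio_t[0], audio_t[-1], 1.0 / fps)
    mapped = np.interp(frame_t, audio_t, video_t)     # per-frame video time
    win = max(1, int(win_s * fps))
    kernel = np.ones(win) / win
    smoothed = np.convolve(mapped, kernel, mode="same")
    return frame_t, smoothed
```
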
  • 60. Experiments • We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency. • Accuracy is computed as the % of songs with average error below a given threshold (in ms). [plot: average accuracy @100 ms for different video buffer lengths]
  • 62. Video duplicate detection. Xavier Anguera and Pere Obrador
  • 63. Let's say you're looking for the Bush attack video…
  • 64. …and  you   get  11,100   results.  
  • 65. …after 40 minutes... of watching many of the videos returned, you notice that many are similar, i.e. near duplicates: 27% on average on YouTube [Wu et al., 2007], 12% on average on YouTube [Anguera et al., 2009]
  • 66. Near duplicate (NDVC) definition • Identical or approximately identical videos that differ in some feature: – file formats, encoding parameters – photometric variations (color, lighting changes) – overlays (caption, logo, audio commentary) – editing operations (frames added/removed) – semantic similarity. NDVCs are videos that are "essentially the same"
  • 67. Near duplicates (NDVC) vs. video copies • These two concepts are not clearly separated in the bibliography. • Video copy: an exact video segment, with some transformations applied to it • Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, …). In our research we address video copy detection
  • 68. Examples  of  video  copies  
  • 69. Use scenarios: copyright law enforcement. Detection of copyright-infringing videos on online video sharing sites. In a recent study we found that, on average, 12% of search results on YouTube are copies of the same video
  • 70. Use scenarios: video forensics for illegal activities. Discover illegal content hidden within other videos. Currently, police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.
  • 71. Use scenarios: database management. Video excerpts used several times; database management/optimization and help in searches over historic contents
  • 72. Use scenarios: advertisement detection and management. Advertisement detection/identification; programming analysis
  • 73. Use scenarios: information overload reduction. Improved (more diverse) video search results by clustering all video duplicates [example: "George Bush" query results before and after clustering]
  • 74. Steps in video duplicate detection 1. Indexing of the reference videos A. Obtain features representing the video B. Store these features in a scalable manner 2. Search for queries within the reference set [diagram: OFFLINE path, reference videos → feature extraction → reference indexing → feature database; ONLINE path, query video → feature extraction → search for duplicates]
  • 75. Ways to approach near-duplicate video detection • Local features – Extracted from selected frames in the videos – Focus on local characteristics within those frames • Global features – Extracted from selected frames or from the whole video – Focus on overall characteristics
  • 76. Local features • They come from previous work on image copy / near-duplicate detection • Steps: – Keyframes are first extracted from the videos at regular intervals or by detecting shots – Local features are obtained for these keyframes: • SIFT • SURF • HARRIS • …
  • 77. Global features • Features are extracted either from the whole video or from keyframes by looking at the overall image (not at particular points). In our work we extract them from the whole video
  • 78. Multimodal video copy detection • Most works use only video/image information – They prefer local features for their robustness • We introduce audio information by combining global features from both the audio and video tracks • We are also experimenting with fusing local features and global features (work in progress)
  • 79. Multimodal global features • We use features based on changes in the data -> more robust to transformations • Video: – Hue + saturation interframe change – Lightest and darkest centroid interframe distance • Audio: – Bayesian information criterion (BIC) between adjacent segments – Cross-BIC between adjacent segments – Kullback-Leibler divergence (KL2) between adjacent segments
  • 80. Hue + saturation interframe change 1. Transform the colorspace from RGB to HSV (Hue + Saturation + Value)
  • 81. Hue + saturation interframe change 2. For every two consecutive frames, compute their HS histograms and compute their intersection.
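
A possible implementation of this per-frame-pair feature using OpenCV; the bin counts and the normalization are choices made for the sketch, not values from the talk:

```python
import cv2
import numpy as np

def hs_change(frame_a, frame_b, h_bins=30, s_bins=32):
    """Histogram intersection of the Hue-Saturation histograms of two
    consecutive frames; returns a value in [0, 1] (1 = identical histograms)."""
    hists = []
    for frame in (frame_a, frame_b):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins],
                            [0, 180, 0, 256])
        hists.append(hist / hist.sum())                   # normalize to sum 1
    return float(np.minimum(hists[0], hists[1]).sum())    # histogram intersection
```
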
  • 82. Lightest and darkest centroid interframe distance 1. Find the lightest and darkest regions in each frame and obtain their centroids
  • 83. Lightest and darkest centroid interframe distance 2. We compute the Euclidean distance between the corresponding centroids of every two adjacent frames, obtaining two global feature streams
  • 84. Acoustic features • Compute an acoustic distance between adjacent acoustic segments [diagram: segment A and segment B modeled by GMM A, GMM B and a joint GMM A+B]
  • 85. Acoustic features (II) • Likelihood-based metrics: – Bayesian Information Criterion (BIC) – Cross-BIC • Model-distance metrics: – Kullback-Leibler divergence (KL2)
  • 86. Acoustic features (III) • For example, the Bayesian Information Criterion (BIC) output: [plot of the BIC feature stream over time]
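
As an illustration of the likelihood-based metrics, a delta-BIC between two adjacent segments could be sketched as follows; using single full-covariance Gaussians instead of the GMMs from the slides, and a default penalty weight lam=1.0, are simplifications made for the sketch:

```python
import numpy as np

def delta_bic(seg_a, seg_b, lam=1.0):
    """Delta-BIC between two adjacent feature segments (frames x dims).
    Positive values suggest the segments are better modeled separately,
    i.e. an acoustic change between them. Assumes each segment has
    (well) more frames than feature dimensions."""
    both = np.vstack([seg_a, seg_b])
    n_a, n_b, n = len(seg_a), len(seg_b), len(both)
    d = both.shape[1]

    def logdet(x):
        sign, val = np.linalg.slogdet(np.cov(x, rowvar=False))
        return val

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(both) - n_a * logdet(seg_a)
                  - n_b * logdet(seg_b)) - lam * penalty
```
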
  • 87. Search for full copies • For each video-query pair we compute the correlation of each feature pair [diagram: the FFTs of the reference and of the possible copy are multiplied, then an IFFT and peak picking follow] • We then find the positions with high similarity (peaks).
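
A minimal NumPy sketch of that FFT-based cross-correlation between one reference feature stream and one query feature stream; the zero-padding and the simple argmax peak picker are choices of the sketch:

```python
import numpy as np

def xcorr_fft(ref, query):
    """Cross-correlation of two 1-D feature streams via the FFT.
    Returns the correlation and the index of its strongest peak; indices
    beyond len(ref) correspond to negative offsets (circular indexing)."""
    n = len(ref) + len(query) - 1
    R = np.fft.rfft(ref, n)
    Q = np.fft.rfft(query, n)
    corr = np.fft.irfft(R * np.conj(Q), n)
    lag = int(np.argmax(corr))          # candidate alignment offset (in frames)
    return corr, lag
```
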
  • 88. Multimodal fusion • When multiple modalities are available, fusion is performed on the correlations
  • 89. Output score • The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak • Automatic weights are obtained via …
  • 90. Finding subsegments of the query • The algorithm described so far assumes the whole query matches a portion of the reference videos • To avoid this restriction, a modification of the algorithm first splits the query into overlapping 20 s segments • By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
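
A rough sketch of that segment-wise search, reusing the illustrative `xcorr_fft` helper above; the 50% overlap and the simple voting over peak lags are assumptions of the sketch:

```python
import numpy as np

def segment_delays(ref, query, seg_len, hop_len):
    """Split the query feature stream into overlapping segments (seg_len frames
    long, hop_len frames apart, e.g. 20 s segments with 50% overlap), correlate
    each against the reference and vote over the per-segment peak lags."""
    lags = []
    for start in range(0, max(1, len(query) - seg_len + 1), hop_len):
        piece = query[start:start + seg_len]
        _, lag = xcorr_fft(ref, piece)           # helper sketched above
        lags.append(lag - start)                 # delay of the query origin w.r.t. the reference
    values, counts = np.unique(lags, return_counts=True)
    return int(values[np.argmax(counts)]), lags  # most-voted delay + all votes
```
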
  • 91. Algorithm performance evaluation • To test the algorithm we used the MUSCLE-VCD database: – Over 100 hours of reference videos from the SoundVision group (Netherlands) – 2 test sets: • ST1: 15 query videos where the whole query is considered • ST2: 3 videos with 21 segments appearing in the reference database. http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html
  • 93. Evaluation metrics • We use the same metrics as in the MUSCLE-VCD benchmark tests
  • 94. Evaluation metrics (II) • We also use the more standard precision and recall metrics
  • 97. YouTube reranking application • We downloaded all videos returned by searches for the top 20 most viewed and 20 most visited videos
  • 98. YouTube reranking application • We applied multimodal copy detection and grouped all near duplicates
  • 99. YouTube reranking test • Results show how some videos have multiple clear copies that can boost their ranking once clustered
  • 100. Thanks for your attention. xanguera@tid.es, www.xavieranguera.com, LinkedIn: http://es.linkedin.com/in/xanguera, Twitter: http://twitter.com/xanguera, Website: http://www.xavieranguera.com/

Editor's Notes

  1. Left: scatter plot of the total video and mp3 lengths. Right: time difference of the alignment between audio and video at 30s
  2. Technically speaking these videos are called Near Duplicate Videos, or NDVC.