ALSIP, Dec. 1 2011


Kernel-based similarity search
 in massive graph databases
     with wavelet trees
         Yasuo Tabei and Koji Tsuda
       JST ERATO Minato Project,
 National Institute of Advanced Industrial
         Science and Technology
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph similarity search
• Similarity search for 25 million molecular
  graphs
 ✓ Find all graphs whose similarity to the query is at least 1 − ε
 ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)

• Use a data structure called the “Wavelet
  Tree” (SODA, 2003)
 ✓ Self-index of an integer array
 ✓ Enable fast array operations
 ‣ e.g., range minimum query, range intersection
Range intersection on array
•   Array A of length N, 1 ≤ A_i ≤ M

                    i   j       k ℓ
        A       1 3 6 8 2 5 7 1 2 7 4 5


• Range intersection: rint(A, [i,j], [k,ℓ])
    ✓ Find common elements of A[i,j] and A[k,ℓ]

• The naive method is to concatenate and sort
    Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8 (common element: 2)

• Use wavelet tree and solve the problem faster
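The naive method above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `rint_naive` and the 1-based inclusive interval convention are assumptions, and the example intervals [3,5] and [9,10] are inferred from the concatenation 6,8,2,2,7 shown on the slide.

```python
def rint_naive(A, i, j, k, l):
    """Common elements of A[i..j] and A[k..l] (1-based, inclusive)."""
    # Concatenate both ranges, tagging each value with the range it came from.
    tagged = [(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]]
    tagged.sort()
    common = []
    # After sorting, a value present in both ranges has an adjacent pair
    # with equal values and different tags.
    for (v1, t1), (v2, t2) in zip(tagged, tagged[1:]):
        if v1 == v2 and t1 != t2 and (not common or common[-1] != v1):
            common.append(v1)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
result = rint_naive(A, 3, 5, 9, 10)
```

The cost is dominated by the sort, so it grows with the total length of the two ranges, which is what the wavelet tree is meant to avoid.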
Tree of subarrays:
Lower half=left, Higher half=right
            [1,8]
                1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                             [5,8]
            1 3 2 1 2 4                 6 8 5 7 7 5

  [1,2]                   [3,4] [5,6]                 [7,8]
     1 2 1 2          3 4          6 5 5        8 7 7


   1 1       2 2      3      4     5 5     6    7 7      8
Record whether each element is
 in the lower half (0) or the higher half (1)
            [1,8]
                  0 0 1 1 0 1 1 0 1 0 0 1

    [1,4]                                             [5,8]
            0 1 0 0 0 1                 0 1 0 1 1 0


  [1,2]                   [3,4] [5,6]               [7,8]
        0 1 0 1           0 1      1 0 0        1 0 0


    1         2         3   4       5    6      7     8
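As a rough sketch of how these bit arrays could be produced, the following Python snippet splits each node's values at the midpoint of its value range. The function name `build_bits` and the dictionary keyed by value range are illustrative assumptions; a practical implementation would store the bits compactly rather than as Python lists.

```python
def build_bits(values, lo, hi, out=None):
    """Return {(lo, hi): bit list} for every internal node of the tree."""
    if out is None:
        out = {}
    if lo == hi or not values:
        return out  # leaves and empty nodes store no bits
    mid = (lo + hi) // 2
    # 0 = value belongs to the lower half [lo, mid], 1 = higher half [mid+1, hi]
    out[(lo, hi)] = [0 if v <= mid else 1 for v in values]
    build_bits([v for v in values if v <= mid], lo, mid, out)
    build_bits([v for v in values if v > mid], mid + 1, hi, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
bits = build_bits(A, 1, 8)
```

Each element keeps its relative order when it moves to a child, which is what later allows positions to be mapped between levels with rank.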
Index each bit array
      with a rank dictionary
• Using a rank dictionary, the rank operation can be
  performed in O(1) time
  ✓ rank_c(B, i): return the number of occurrences of c ∈ {0, 1} in B[1,i]

• Several methods are known, e.g., rank9sel (Vigna, 08)
• Example) B=0110011100
                    i 1 2 3 4 5 6 7 8 9 10
                    B 0 1 1 0 0 1 1 1 0 0
 rank_1(B, 8) = 5     (five 1s in B[1,8] = 01100111)
 rank_0(B, 5) = 3     (three 0s in B[1,5] = 01100)
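The rank example above can be reproduced with a simplified stand-in for a rank dictionary. The class name `RankDict` is an assumption, and the prefix-sum table used here is not the rank9 structure: it answers rank in O(1) but spends a full integer per position, whereas succinct rank dictionaries add only a small fraction of extra bits.

```python
class RankDict:
    """Prefix-sum rank dictionary over a 0/1 list (positions are 1-based)."""

    def __init__(self, bits):
        # prefix[i] = number of 1s in bits[1..i]
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)

    def rank1(self, i):
        return self.prefix[i]

    def rank0(self, i):
        # Every position that is not a 1 is a 0.
        return i - self.prefix[i]


B = [int(c) for c in "0110011100"]
rd = RankDict(B)
r1 = rd.rank1(8)
r0 = rd.rank0(5)
```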
O(1)-division of an interval
• Using the rank operation, the division of an
  interval can be done in constant time
    ✓ rank0 for left child and rank1 for right child

•   Naive = time linear in the total number of elements


          [1,8]
           Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                   rank0                       rank1
      [1,4]                      [5,8]
       Aleft   1 3 2 1 2 4        Aright 6 8 5 7 7 5
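The interval division can be illustrated as below. This is a sketch under assumptions: the function name `split_interval` is made up for illustration, intervals are 1-based and inclusive, and the prefix table is rebuilt on every call only to keep the example self-contained (a real index would precompute it once so that each division takes constant time).

```python
from itertools import accumulate


def split_interval(bits, s, e):
    """Map interval [s, e] of a node to the matching intervals of its children."""
    ones = [0] + list(accumulate(bits))  # ones[i] = rank1(bits, i)

    def rank1(i):
        return ones[i]

    def rank0(i):
        return i - ones[i]

    left = (rank0(s - 1) + 1, rank0(e))   # positions inside the left child
    right = (rank1(s - 1) + 1, rank1(e))  # positions inside the right child
    return left, right


# Root bit array of the example tree (Aroot = 1 3 6 8 2 5 7 1 7 2 4 5).
root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
left, right = split_interval(root_bits, 3, 5)
```

For Aroot[3,5] = 6 8 2 this gives position 3 of Aleft (the value 2) and positions 1 to 2 of Aright (the values 6 8). An interval whose start exceeds its end is empty.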
Fast computation of range
   intersection by pruning
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [7,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
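The pruned traversal can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: the names `Node`, `child_ranges` and `rint` are made up here, the tree uses pointers and Python lists instead of succinct bit arrays, and intervals are 1-based and inclusive.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node covering the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi and values:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in values]
            self.ones = [0] + list(accumulate(bits))  # ones[i] = rank1 up to i
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)


def child_ranges(node, s, e):
    """Divide [s, e] into the intervals of the left and right child."""
    ones = node.ones
    left = (s - ones[s - 1], e - ones[e])  # rank0(s-1)+1, rank0(e)
    right = (ones[s - 1] + 1, ones[e])     # rank1(s-1)+1, rank1(e)
    return left, right


def rint(node, r1, r2):
    """Values occurring in both intervals r1 and r2 of the node's array."""
    if r1[0] > r1[1] or r2[0] > r2[1]:
        return []  # one interval is empty: prune this subtree
    if node.lo == node.hi:
        return [node.lo]  # leaf reached with both intervals non-empty
    l1, g1 = child_ranges(node, *r1)
    l2, g2 = child_ranges(node, *r2)
    return rint(node.left, l1, l2) + rint(node.right, g1, g2)


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
common = rint(root, (3, 5), (9, 10))
```

A subtree is visited only while both intervals still contain elements, so branches that cannot contribute a common value are cut off near the root.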
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph Similarity Search
• Bag-of-words representation of graph
   ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
     Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)


                      W=(A,D,E,H)


• Cosine similarity query
 ✓ Find all graphs W whose cosine similarity (kernel) to
   the query Q is at least 1 − ε
Weisfeiler-Lehman Procedure (NIPS,09)
•   Convert a graph into a set of words (bag-of-words)
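A minimal sketch of the Weisfeiler-Lehman relabeling, assuming an undirected graph given as an adjacency dictionary with string node labels. The function name `wl_words` and the textual word format are illustrative assumptions; practical implementations usually compress each new label to an integer instead of keeping the growing strings.

```python
def wl_words(adj, labels, iterations=1):
    """Return the set of words produced by WL relabeling.

    adj: node -> list of neighbour nodes; labels: node -> initial label.
    """
    words = set(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in adj
        }
        words.update(current.values())
    return words


# Hypothetical three-node chain C - C - O.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
words = wl_words(adj, labels, iterations=1)
```

Sorting the neighbour labels makes the word independent of the order in which neighbours are listed, so isomorphic graphs yield the same words.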
Semi-conjunctive query
• Cosine similarity query can be relaxed to
  the following form
           W s.t. |W ∩ Q| ≥ k
  ✓ Find all graphs W which share at least k words
    with the query Q

• No false negatives
• False positives can easily be filtered out by
  cosine calculations
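The filter-then-verify idea can be sketched as below. The slide does not state the value of k; the choice k = ⌈(1 − ε)²|Q|⌉ used here is an assumption that follows from |W| ≥ |W ∩ Q| and therefore produces no false negatives. The names `cosine` and `search`, the small tolerance, and the toy database are illustrative only; W and Q are treated as non-empty sets and 0 ≤ ε < 1.

```python
import math


def cosine(W, Q):
    """Cosine similarity between two non-empty sets of words."""
    return len(W & Q) / math.sqrt(len(W) * len(Q))


def search(db, Q, eps):
    """Graph ids whose cosine similarity to Q is at least 1 - eps."""
    # Relaxed condition: share at least k words with the query.
    # The tolerance guards against floating-point rounding pushing k up.
    k = math.ceil((1 - eps) ** 2 * len(Q) - 1e-9)
    candidates = [gid for gid, W in db.items() if len(W & Q) >= k]
    # Remove false positives with the exact cosine calculation.
    return [gid for gid in candidates if cosine(db[gid], Q) >= 1 - eps]


# Hypothetical database: graph id -> set of words.
db = {
    "g1": {"A", "D", "E", "H"},
    "g2": {"A", "D", "X"},
    "g3": {"X", "Y"},
}
Q = {"A", "D", "E", "H"}
hits = search(db, Q, eps=0.3)
```

Here g2 passes the relaxed test with two shared words but is rejected by the cosine check, which is the false-positive filtering mentioned above.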
Inverted index, Array, Wavelet Tree
[Figure: inverted index rows concatenated into the array Aroot = 1 3 6 8 2 5 7 1 2 7 4 5, which is indexed by a wavelet tree]

• Inverted index is built from
  graph database
• Concatenate all rows to make
  an array
• Index the array with wavelet
  tree
• Semi-conjunctive query =
  Extension of range intersection
    ✓ Find graph ids which appear at
      least k times in given intervals
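The data layout can be illustrated with a small sketch. It builds the concatenated array and answers the semi-conjunctive query by scanning the query's intervals directly, which is the baseline that the wavelet tree replaces with a pruned traversal. The function names, the word ordering and the toy database are assumptions made for illustration.

```python
from collections import Counter


def build_array(db):
    """db: graph id -> set of words. Returns (array, word -> (start, end))."""
    inverted = {}
    for gid in sorted(db):
        for w in db[gid]:
            inverted.setdefault(w, []).append(gid)
    array, intervals = [], {}
    for w in sorted(inverted):
        start = len(array) + 1          # 1-based start of this word's row
        array.extend(inverted[w])       # row = ids of graphs containing w
        intervals[w] = (start, len(array))
    return array, intervals


def semi_conjunctive(array, intervals, Q, k):
    """Graph ids that appear at least k times in the intervals of Q's words."""
    counts = Counter()
    for w in Q:
        if w in intervals:
            s, e = intervals[w]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


# Hypothetical database of three graphs.
db = {1: {"A", "D"}, 2: {"A", "E"}, 3: {"D", "E", "H"}}
array, intervals = build_array(db)
hits = semi_conjunctive(array, intervals, {"D", "E", "H"}, k=2)
```

A graph id occurs once per shared word, so counting occurrences across the query's intervals gives |W ∩ Q| for every graph touched.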
Pruning search space
• Find all graphs W in the database whose cosine
    to a query Q is at least a threshold 1 − ε
       $$W \ \text{s.t.}\ K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}} \ge 1 - \epsilon$$
    ✓ W,Q: bag-of-words of graphs
• The above condition can be relaxed as follows
    If $K_N(W, Q) \ge 1 - \epsilon$, then
       $$(1 - \epsilon)^2 |Q| \le |W| \le \frac{|Q|}{(1 - \epsilon)^2}$$
    ✓ Can be used for pruning search space
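The size bounds on |W| can be checked with a short derivation. It is a sketch that treats W and Q as non-empty sets and assumes 0 ≤ ε < 1; the symbol c is introduced here only as shorthand for the number of shared words.

```latex
\begin{align*}
c &= |W \cap Q|, \qquad c \le |W|, \quad c \le |Q| \\
K_N(W,Q) \ge 1-\epsilon
  &\iff c \ge (1-\epsilon)\sqrt{|W|\,|Q|} \\
c \le |W| &\implies |W| \ge (1-\epsilon)\sqrt{|W|\,|Q|}
  \implies |W| \ge (1-\epsilon)^2 |Q| \\
c \le |Q| &\implies |Q| \ge (1-\epsilon)\sqrt{|W|\,|Q|}
  \implies |W| \le \frac{|Q|}{(1-\epsilon)^2}
\end{align*}
```

Graphs whose word count falls outside this range cannot reach the threshold, so they can be skipped without computing the cosine.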
Complexity
• Time per query: O(τm)
 •   τ: the number of traversed nodes
 •   m: the number of words in the query's bag-of-words

• Memory: (1+α)N log n + M log N
 •   N: the number of all words in the database
 •   M: Maximum integer in the array
 •   n: the number of graphs
 •   α: overhead for rank dictionary (α=0.6)

• Inverted index takes N log n bits
• About 60% overhead compared to the inverted index!
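The memory formula can be evaluated numerically as a sanity check. The values of N, n and M below are hypothetical and chosen only for illustration (the slides do not report them), the logarithms are assumed to be base 2 since the result is in bits, and the function names are made up.

```python
import math


def gwt_bits(N, n, M, alpha=0.6):
    """Memory of the wavelet tree index in bits: (1 + alpha) N log n + M log N."""
    return (1 + alpha) * N * math.log2(n) + M * math.log2(N)


def inverted_index_bits(N, n):
    """Memory of the plain inverted index in bits: N log n."""
    return N * math.log2(n)


# Hypothetical sizes: 10^9 words in total, 25 million graphs, 10^6 as M.
N, n, M = 10**9, 25 * 10**6, 10**6
ratio = gwt_bits(N, n, M) / inverted_index_bits(N, n)
```

When M log N is small relative to N log n, the ratio stays close to 1 + α = 1.6, which matches the stated overhead of about 60%.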
Outline
• Overview
• A data-structure
  ✓ Wavelet Tree
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Experiments

• 25 million chemical compounds from PubChem
  database
• Evaluate search time and memory usage
• Cosine threshold ε=0.3,0.35,0.4
• Compare our method gWT to
 ✓   Inverted index (concatenate all intervals and sort)
 ✓   Sequential scan (Compute similarity one by one)
Search time
[Figure: search time comparison; values shown: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec]
Memory usage
               20GB
Construction time
                    7h
Summary
• Efficient similarity search method for
    massive graph databases
• Solves semi-conjunctive queries efficiently
• Built on the wavelet tree
• Uses the Weisfeiler-Lehman procedure to
    represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software
    http://code.google.com/p/gwt

  • 3. Graph similarity search • Similarity search for 25 million molecular graphs ✓ Find all graphs whose similarity to the query is at least $1 - \epsilon$ ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009) • Use a data structure called “Wavelet Tree” (SODA, 2003) ✓ Self-index of an integer array ✓ Enables fast array operations ‣ e.g., range minimum query, range intersection
  • 4. Range intersection on array • Array A of length N, $1 \le A_i \le M$ • Example: A = 1 3 6 8 2 5 7 1 2 7 4 5, with positions i, j, k, $\ell$ marked • Range intersection: rint(A, [i,j], [k,$\ell$]) ✓ Find common elements of A[i,j] and A[k,$\ell$] • The naive method is to concatenate and sort Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8 • Use wavelet tree and solve the problem faster
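The naive concatenate-and-sort method can be sketched as follows. This is only an illustration: the function name `rint_naive` is hypothetical, and the positions i=3, j=5, k=9, ℓ=10 are an assumption inferred from the slide's example (they are the ranges that yield 6,8,2 and 2,7).

```python
def rint_naive(a, i, j, k, l):
    """Common values of a[i..j] and a[k..l] (1-based, inclusive) by concatenate-and-sort."""
    # Tag every element with the range it came from, then sort by value.
    tagged = sorted([(v, 0) for v in a[i - 1:j]] + [(v, 1) for v in a[k - 1:l]])
    common = []
    idx = 0
    while idx < len(tagged):
        value = tagged[idx][0]
        sources = set()
        # Equal values are adjacent after sorting; collect which ranges they came from.
        while idx < len(tagged) and tagged[idx][0] == value:
            sources.add(tagged[idx][1])
            idx += 1
        if sources == {0, 1}:
            common.append(value)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 3, 5, 9, 10))  # the two ranges hold 6,8,2 and 2,7
```

The cost is dominated by the sort over both ranges, which is what the wavelet tree is meant to avoid.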
  • 5. Tree of subarrays: Lower half=left, Higher half=right • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8
  • 6. Remember whether each element is in the lower half (0) or the higher half (1) • [1,8]: 0 0 1 1 0 1 1 0 1 0 0 1 • [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0 • [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0 • Leaves: 1 2 3 4 5 6 7 8
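A minimal sketch of how the per-node bit arrays could be derived from the example array. The helper name `build_bits` and the dictionary keyed by value range are assumptions made for illustration; a real wavelet tree stores the bits compactly rather than as Python lists.

```python
def build_bits(seq, lo, hi, out=None):
    """For each value range [lo, hi], store one bit per element: 0 = lower half, 1 = higher half."""
    if out is None:
        out = {}
    if lo == hi or not seq:
        return out  # leaves and empty nodes need no bit array
    mid = (lo + hi) // 2
    out[(lo, hi)] = [0 if v <= mid else 1 for v in seq]
    # Elements keep their relative order when they move to a child.
    build_bits([v for v in seq if v <= mid], lo, mid, out)
    build_bits([v for v in seq if v > mid], mid + 1, hi, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
for value_range, bits in build_bits(A, 1, 8).items():
    print(value_range, bits)
```

Only the bit arrays are kept; the subarrays themselves can be recovered from them and are not stored.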
  • 7. Index each bit array with a rank dictionary • Using a rank dictionary, the rank operation can be performed in O(1) time ✓ $\mathrm{rank}_c(B, i)$: return the number of occurrences of $c \in \{0, 1\}$ in B[1,i] • Several methods known: rank9sel (Vigna, 08) • Example) B = 0110011100, positions i = 1..10: $\mathrm{rank}_1(B, 8) = 5$, $\mathrm{rank}_0(B, 5) = 3$
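The rank operation can be illustrated with a plain prefix-count table. Note that this is not rank9sel: it answers queries in O(1) but spends a full integer per bit, whereas succinct rank dictionaries get the same query time with a small space overhead. The class name `RankDict` is hypothetical.

```python
class RankDict:
    """Prefix-count rank dictionary: O(1) queries, deliberately not space-efficient."""

    def __init__(self, bits):
        # ones[i] = number of 1s in bits[1..i] (1-based), ones[0] = 0
        self.ones = [0]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        return self.ones[i]

    def rank0(self, i):
        return i - self.ones[i]


B = RankDict([int(c) for c in "0110011100"])
print(B.rank1(8), B.rank0(5))  # prints: 5 3
```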
  • 8. O(1)-division of an interval • Using the rank operation, an interval can be divided between the two children in constant time ✓ rank0 for left child and rank1 for right child • Naive = time linear in the total number of elements • Example: [1,8] Aroot = 1 3 6 8 2 5 7 1 7 2 4 5; rank0 gives [1,4] Aleft = 1 3 2 1 2 4; rank1 gives [5,8] Aright = 6 8 5 7 7 5
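As a sketch of the interval division: positions [s, e] at a node map to [rank0(s-1)+1, rank0(e)] in the left child and [rank1(s-1)+1, rank1(e)] in the right child. The function name `split_interval` is hypothetical, and the prefix table is rebuilt on each call only to keep the example self-contained; in practice it would be built once per node.

```python
from itertools import accumulate


def split_interval(bits, s, e):
    """Map positions [s, e] (1-based, inclusive) of a node to its two children."""
    ones = [0] + list(accumulate(bits))             # ones[i] = rank1(bits, i)
    left = (s - 1 - ones[s - 1] + 1, e - ones[e])   # rank0(s-1)+1 .. rank0(e)
    right = (ones[s - 1] + 1, ones[e])              # rank1(s-1)+1 .. rank1(e)
    return left, right


root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
# Aroot[3..5] = 6 8 2: the 2 lands at Aleft[3], and 6 8 land at Aright[1..2].
print(split_interval(root_bits, 3, 5))  # prints: ((3, 3), (1, 2))
```

An interval whose start exceeds its end is empty, which is the condition a search can use to stop descending.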
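  • 9. Fast computation of range intersection by pruning • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8 • Figure labels: “Pruned” (subtrees that are skipped), “solution!!” (leaf that is reported)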
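The pruned traversal can be sketched as below, under the usual formulation: both intervals are pushed down the tree together, a subtree is skipped as soon as either interval becomes empty, and a leaf reached with both intervals non-empty is a common value. The names `Node` and `rint` are illustrative, prefix counts stand in for a real rank dictionary, and the query ranges [3,5] and [9,10] are assumed from the earlier example.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node over the value range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi and seq:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in seq]
            self.ones = [0] + list(accumulate(bits))  # prefix counts of 1s
            self.left = Node([v for v in seq if v <= mid], lo, mid)
            self.right = Node([v for v in seq if v > mid], mid + 1, hi)

    def split(self, s, e):
        """Map positions [s, e] of this node to the left and right children."""
        r1s, r1e = self.ones[s - 1], self.ones[e]
        return (s - 1 - r1s + 1, e - r1e), (r1s + 1, r1e)


def rint(node, r1, r2, out):
    """Append to out every value that occurs in both position ranges r1 and r2."""
    if node is None or r1[0] > r1[1] or r2[0] > r2[1]:
        return  # prune: at least one interval is empty below this node
    if node.lo == node.hi:
        out.append(node.lo)  # leaf reached with both intervals non-empty
        return
    (l1, h1), (l2, h2) = node.split(*r1), node.split(*r2)
    rint(node.left, l1, l2, out)
    rint(node.right, h1, h2, out)


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
result = []
rint(root, (3, 5), (9, 10), result)  # ranges hold 6,8,2 and 7,2
print(result)  # prints: [2]
```

Results come out in increasing value order because the left (lower-half) child is always visited first.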
  • 10. Outline • Overview • Wavelet Tree ✓ Problem = Range intersection on array • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  • 11. Graph Similarity Search • Bag-of-words representation of graph ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima (ICDM, 2009), Wang et al. (EDBT, 2009) ✓ Example: W = (A, D, E, H) • Cosine similarity query ✓ Find all graphs W whose cosine similarity (kernel) to the query Q is at least $1 - \epsilon$
  • 12. Weisfeiler-Lehman Procedure (NIPS,09) • Convert a graph into a set of words (bag-of-words)
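A rough sketch of the idea behind the Weisfeiler-Lehman procedure: each node's label is repeatedly replaced by its own label combined with the sorted labels of its neighbours, and every label seen along the way becomes a word. The function name, the string encoding of labels, and the `iterations` parameter are assumptions for illustration; practical implementations typically compress or hash the labels, and the slide does not specify these details.

```python
def wl_bag_of_words(adj, labels, iterations=2):
    """Return the set of Weisfeiler-Lehman labels ("words") of a node-labeled graph.

    adj:    dict mapping each node to a list of its neighbours
    labels: dict mapping each node to its initial label (a string)
    """
    current = dict(labels)
    words = set(current.values())
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in adj
        }
        words.update(current.values())
    return words


# Hypothetical path graph A - D - E
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "A", 1: "D", 2: "E"}
print(sorted(wl_bag_of_words(adj, labels, iterations=1)))
```

Because neighbour labels are sorted before being joined, the resulting words do not depend on how the nodes are numbered.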
  • 13. Semi-conjunctive query • Cosine similarity query can be relaxed to the following form: W s.t. $|W \cap Q| \ge k$ ✓ Find all graphs W which share at least k words with the query Q • No false negatives • False positives can easily be filtered out by cosine calculations
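The two-stage logic (relaxed query first, exact cosine filter second) might look like this on plain Python sets. The linear scan, the function names, and the toy database are assumptions for illustration only; `k` is passed in by the caller because this slide does not state how it is chosen.

```python
from math import sqrt


def cosine(w, q):
    """Cosine similarity of two non-empty word sets (binary bag-of-words)."""
    return len(w & q) / sqrt(len(w) * len(q))


def search(database, q, k, eps):
    """Return ids of graphs whose cosine similarity to q is at least 1 - eps."""
    # Stage 1 (semi-conjunctive query): keep graphs sharing at least k words with q.
    candidates = [gid for gid, w in database.items() if len(w & q) >= k]
    # Stage 2: filter out false positives with the exact cosine.
    return [gid for gid in candidates if cosine(database[gid], q) >= 1 - eps]


# Hypothetical toy database of word sets
database = {
    1: {"A", "D", "E", "H"},
    2: {"A", "D", "X", "Y"},
    3: {"B", "C"},
}
print(search(database, {"A", "D", "E"}, k=2, eps=0.3))  # prints: [1]
```

Graph 2 passes the first stage (two shared words) but is rejected by the cosine check, which is the false-positive case the slide mentions.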
  • 14. Inverted index, Array, Wavelet Tree • Inverted index is built from graph database • Concatenate all rows to make an array • Index the array with wavelet tree: Aroot = 1 3 6 8 2 5 7 1 2 7 4 5 • Semi-conjunctive query = Extension of range intersection ✓ Find graph ids which appear at least k times in given intervals
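The data layout described here can be sketched as follows: one inverted-index row of graph ids per word, all rows concatenated into a single array, and one interval per word. The counting step below scans the intervals directly, which corresponds to the inverted-index baseline rather than the wavelet tree traversal. Function names and the toy database are hypothetical.

```python
from collections import Counter


def build_array(database):
    """Concatenate inverted-index rows into one array and record each word's interval."""
    rows = {}
    for gid in sorted(database):
        for word in database[gid]:
            rows.setdefault(word, []).append(gid)
    array, intervals = [], {}
    for word in sorted(rows):
        start = len(array) + 1              # 1-based, inclusive
        array.extend(rows[word])
        intervals[word] = (start, len(array))
    return array, intervals


def semi_conjunctive(array, intervals, query, k):
    """Graph ids that appear at least k times in the intervals of the query words."""
    counts = Counter()
    for word in query:
        if word in intervals:
            s, e = intervals[word]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


database = {1: {"A", "D", "E", "H"}, 2: {"A", "D"}, 3: {"E"}}
array, intervals = build_array(database)
print(array, intervals)
print(semi_conjunctive(array, intervals, {"A", "D", "E"}, k=2))  # prints: [1, 2]
```

Each graph id occurs at most once per row, so the number of times it appears across the query intervals equals the number of words it shares with the query.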
  • 15. Pruning search space • Find all graphs W in the database whose cosine to a query Q is at least a threshold $1 - \epsilon$: W s.t. $K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}} \ge 1 - \epsilon$ ✓ W, Q: bag-of-words of graphs • The above condition can be relaxed as follows • If $K_N(W, Q) \ge 1 - \epsilon$, then $(1 - \epsilon)^2 |Q| \le |W| \le \frac{|Q|}{(1 - \epsilon)^2}$ ✓ Can be used for pruning search space
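The size bound can be checked with a short derivation, assuming $0 \le \epsilon < 1$ and using only $|W \cap Q| \le \min(|W|, |Q|)$. The last line is a consequence worked out here rather than something stated on the slides: it gives one admissible value of k for the semi-conjunctive relaxation, namely $k = (1-\epsilon)^2 |Q|$.

```latex
\begin{align*}
K_N(W,Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}}
  &\le \frac{\min(|W|,|Q|)}{\sqrt{|W|\,|Q|}}
   = \sqrt{\frac{\min(|W|,|Q|)}{\max(|W|,|Q|)}} \\
K_N(W,Q) \ge 1-\epsilon
  &\implies \frac{\min(|W|,|Q|)}{\max(|W|,|Q|)} \ge (1-\epsilon)^2
   \implies (1-\epsilon)^2\,|Q| \le |W| \le \frac{|Q|}{(1-\epsilon)^2} \\
|W \cap Q| \ge (1-\epsilon)\sqrt{|W|\,|Q|}
  &\ge (1-\epsilon)\sqrt{(1-\epsilon)^2\,|Q|^2}
   = (1-\epsilon)^2\,|Q|
\end{align*}
```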
  • 16. Complexity • Time per query: O(τm) • τ: the number of traversed nodes • m: the number of words in the query's bag-of-words • Memory: (1+α) N log n + M log N bits • N: the number of all words in the database • M: maximum integer in the array • n: the number of graphs • α: overhead for rank dictionary (α=0.6) • Inverted index takes N log n bits • About 60% overhead compared to inverted index!
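To make the memory formula concrete, it can be evaluated directly. In the usage line below, only n = 25 million comes from the slides; the values of N and M are hypothetical placeholders chosen for illustration, not figures from the experiments.

```python
from math import log2


def memory_bits(N, n, M, alpha=0.6):
    """Return (wavelet tree bits, inverted index bits) from the formulas on the slide."""
    wavelet = (1 + alpha) * N * log2(n) + M * log2(N)
    inverted = N * log2(n)
    return wavelet, inverted


# n is from the slides; N and M are hypothetical values.
wavelet, inverted = memory_bits(N=10**9, n=25_000_000, M=10**6)
print(f"wavelet tree:   {wavelet / 8 / 2**30:.2f} GiB")
print(f"inverted index: {inverted / 8 / 2**30:.2f} GiB")
print(f"ratio:          {wavelet / inverted:.3f}")
```

When M log N is small relative to N log n, the ratio approaches 1 + α, which is where the roughly 60% overhead comes from.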
  • 17. Outline • Overview • A data-structure ✓ Wavelet Tree • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  • 18. Experiments • 25 million chemical compounds from PubChem database • Evaluate search time and memory usage • Cosine threshold ε=0.3,0.35,0.4 • Compare our method gWT to ✓ Inverted index (concatenate all intervals and sort) ✓ Sequential scan (Compute similarity one by one)
  • 19. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
  • 20. Memory usage 20GB
  • 22. Summary • Efficient similarity search method for massive graph databases • Solve semi-conjunctive query efficiently • Built on Wavelet Tree • Use Weisfeiler-Lehman procedure to represent graphs as bag-of-words • Applicable to 25 million graphs • Software • http://code.google.com/p/gwt