12th Workshop on Algorithms in Bioinformatics,
Ljubljana, Slovenia

Succinct Multibit Tree:
Compact Representation of Multibit Trees
by Succinct Data Structures
in Chemical Fingerprint Searches

Yasuo Tabei
JST ERATO Minato Project
Chemical fingerprint search
•  Space-efficient data structures to index 30 million
   chemical fingerprints, e.g., W=(1,5,7,10)
•  Find all fingerprints similar to a query (similarity ≥ ε)
   –  Similarity = Jaccard (Tanimoto): J(W,W’)=|W∩W’|/|W∪W’|
•  Multibit tree (Kristensen et al., WABI09)
   –  Data structure enabling fast similarity searches
   –  Memory-inefficiency of pointer-based representation
•  Succinct data structures (Jacobson, 1989)
   –  Space efficient and enabling fast operations
→ Present succinct representation of multibit tree
Outline	
•  Multibit Tree
•  Succinct Data Structures
   –  Rank/Select dictionary
   –  Succinct ordered tree: LOUDS
•  Succinct Multibit Trees
   –  Compact representation of multibit trees
   –  Compact representation of fingerprint databases
      1.  Variable-length array
      2.  Succinct Trie

•  Experiments
Multibit Tree (MT) (Kristensen et al., 09)
•  Multiple decision trees built on fingerprints
   clustered with respect to cardinality
   (i)   Fingerprint database:
         W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4)
   (ii)  Cluster into bins w.r.t. cardinality, e.g.,
         cardinality 1: W6=(1), W32=(2), W42=(4), W50=(8)
         cardinality 2: W3=(1,3), W9=(2,4), W12=(1,4)
         cardinality 3: W3=(2,5,6), W9=(1,3,6), W12=(4,6,7), W15=(2,3,5), W18=(4,6,8)
   (iii) Build a decision tree for each bin
[Figure: the fingerprint database, the bins, and one decision tree per bin with the bin's fingerprints at the leaves]
Similarity search of a query fingerprint Q
•  If the Jaccard similarity satisfies $J(W_i, Q) \ge \epsilon$, two constraints are
   satisfied:
      1.  Cardinality constraint
          $\epsilon |Q| \le |W_i| \le \frac{1}{\epsilon} |Q|$
      2.  Upper bound of Jaccard similarity
          $\epsilon \le J(W_i, Q) \le \frac{\min(|W_i| - N_0,\ |Q| - N_1)}{|W_i| + |Q| - \min(|W_i| - N_0,\ |Q| - N_1)}$
 - N0: The number of elements contained in Wi and not in Q
 - N1: The number of elements contained in Q and not in Wi
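The two constraints can be checked with a few lines of code. The following is a minimal sketch, not the talk's implementation: the function names are hypothetical, and N0 and N1 are passed in as plain integers (in a multibit tree they would be derived from the bits known to be shared by all fingerprints under a node).

```python
def jaccard(w, q):
    """Jaccard (Tanimoto) similarity of two fingerprints given as sets of bit positions."""
    w, q = set(w), set(q)
    return len(w & q) / len(w | q)


def passes_cardinality(w_size, q_size, eps):
    """Cardinality constraint: eps*|Q| <= |W| <= |Q|/eps."""
    return eps * q_size <= w_size <= q_size / eps


def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W, Q) from the counts n0 (in W, not in Q) and n1 (in Q, not in W).

    The intersection can be at most min(|W| - n0, |Q| - n1), and the Jaccard
    similarity grows with the intersection size.
    """
    m = min(w_size - n0, q_size - n1)
    return m / (w_size + q_size - m)


W = (1, 5, 7, 10)   # example fingerprint from the introduction
Q = (1, 5, 7, 9)    # hypothetical query

print(jaccard(W, Q))                                  # 0.6
print(passes_cardinality(len(W), len(Q), 0.9))        # True
print(jaccard_upper_bound(len(W), len(Q), 0, 0))      # 1.0 (nothing known yet)
print(jaccard_upper_bound(len(W), len(Q), 1, 1))      # 0.6 (tight once all mismatches are known)
```

The bound only tightens as more mismatching elements become known, which is what allows a subtree to be pruned as soon as the bound drops below ε.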
Similarity search of a query fingerprint Q
•  Step 1: Find candidate solutions I1 satisfying the cardinality constraint
•  Step 2: Find candidate solutions I2 satisfying the upper bound
   (subtrees that satisfy the bound are searched, the others are pruned)
•  Step 3: Calculate similarities to remove false positives from the candidates
[Figure: the decision trees of the bins at each step, with searched and pruned subtrees marked]
Drawbacks	

•  Pointer-based representation of multibit trees
   needs a large amount of memory
                 bits
 - Kc: number of fingerprints in bin c
 - C: total number of bins
   –  Log(.) factor is too large!
•  Need to store original fingerprint databases in
   memory to filter out false positives
Rank/select dictionary (RRR, 2002):
Foundation of various succinct data structures
•  Enables the rank/select operations on a bit string B in
   O(1)-time
      -  Rankc(B,i): returns the number of occurrences of c∈{0,1} in B[1…i]
      -  Selectc(B,i): returns the position of the i-th occurrence of c∈{0,1}
•  Efficient rank/select dictionary (Navarro and Providel, 2012)
Ex) B=0110011100
      i  1 2 3 4 5 6 7 8 9 10
      B  0 1 1 0 0 1 1 1 0 0
    Rank1(B,8)=5
    Select1(B,3)=6
•  Memory: n + o(n) bits
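To make the two operations concrete, here is a naive sketch that reproduces the example. It scans the bit string in linear time, so it only illustrates the semantics (1-based positions, as on the slide); it is not the O(1)-time dictionary cited above, and the function names are chosen for illustration.

```python
def rank(bits, c, i):
    """Number of occurrences of character c in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)


def select(bits, c, k):
    """1-based position of the k-th occurrence of c, or 0 if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == k:
                return pos
    return 0


B = "0110011100"
print(rank(B, "1", 8))    # 5
print(select(B, "1", 3))  # 6
```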
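Level-order Unary Degree Sequence (LOUDS) (Jacobson, 1989)
•  Represents an ordered tree as a bit string
   of length 2n+1 (n: number of nodes)
•  Construction
1)  Traverse the tree in a breadth-first manner, starting from a super root
    whose only child is the root
2)  For each visited node with k children, generate k 1s followed by a 0
Ex) Tree: node 1 has children 2, 3; node 2 has children 4, 5; node 3 has children 6, 7
       Node  S   1    2    3    4  5  6  7
       B     10  110  110  110  0  0  0  0
    (S: super root, B = 101101101100000)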
Properties of LOUDS
   Ex) Tree: node 1 has children 2, 3; node 2 has children 4, 5; node 3 has children 6, 7
       B:              1 0 1 1 0 1 1 0 1 1 0 0 0 0 0
       node (1 bits):  1   2 3   4 5   6 7
       node (0 bits):          1     2     3 4 5 6 7

•  For a tree consisting of n nodes, there are n 1s
   and n+1 0s on bit string B
•  Each 1 and 0 except the first 0 on B corresponds
   to a tree node one-by-one
•  Positions of the parent and children for a tree
   node on B can be calculated by combining the
   rank/select operations in O(1)-time.
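The construction can be sketched in a few lines. This is an illustrative version only: the tree is assumed to be given as a dictionary from a node to the ordered list of its children, and the function name `louds_bits` is hypothetical.

```python
from collections import deque


def louds_bits(children, root):
    """Build the LOUDS bit string of an ordered tree.

    children: dict mapping a node to the ordered list of its children
              (leaves may be missing from the dict).
    """
    bits = ["10"]                 # super root, whose single child is the root
    queue = deque([root])
    while queue:                  # breadth-first traversal
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")   # k 1s followed by a 0
        queue.extend(kids)
    return "".join(bits)


tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
B = louds_bits(tree, 1)
print(B)                          # 101101101100000
print(len(B), B.count("1"))       # 15 7  (2n+1 bits and n 1s for n = 7)
```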
O(1)-time operations on a tree
•  Parent/child operations for i such that B[i]=1
     –  First child: p = select0(B, rank1(B,i)) + 1 (the node is a leaf if B[p]=0)
     –  Next child: i+1 for position i of the previous child (while B[i+1]=1)
     –  Parent: p = select1(B, rank0(B,i))

Ex) Calculating the first child for i = 4 (node 3)
      i  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      B  1 0 1 1 0 1 1 0 1 1  0  0  0  0  0
    rank1(B,4) = 3
    select0(B,3) = 8, so the first child is at p = 8 + 1 = 9 (node 6)
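The navigation formulas can be tried out on the example bit string. The sketch below uses naive linear-time `rank` and `select` helpers for readability (an assumption for illustration; an O(1)-time rank/select dictionary would take their place), with 1-based positions as on the slide.

```python
def rank(bits, c, i):
    """Number of occurrences of c in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)


def select(bits, c, k):
    """1-based position of the k-th occurrence of c, or 0 if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == k:
                return pos
    return 0


def first_child(bits, i):
    """Position of the first child of the node at position i, or None for a leaf."""
    p = select(bits, "0", rank(bits, "1", i)) + 1
    return p if p <= len(bits) and bits[p - 1] == "1" else None


def next_sibling(bits, i):
    """Position of the next sibling of the node at position i, or None."""
    return i + 1 if i < len(bits) and bits[i] == "1" else None


def parent(bits, i):
    """Position of the parent of the node at position i."""
    return select(bits, "1", rank(bits, "0", i))


B = "101101101100000"
print(first_child(B, 4))    # 9   (node 3 -> node 6)
print(next_sibling(B, 9))   # 10  (node 6 -> node 7)
print(parent(B, 9))         # 4   (node 6 -> node 3)
print(first_child(B, 6))    # None (node 4 is a leaf)
```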
Succinct Multibit Trees (SMT)	

•  Consist of compact representations of multibit
   trees and fingerprint databases
•  Represent multibit trees by LOUDS
  –  $O\left(8 \sum_{c=1}^{C} |K_c| - 4C + MC\right)$ bits, not including the log factor
  –  Fast similarity searches
•  Two compact representations of fingerprint
   databases
  –  Variable-length array (VLA)
  –  Succinct trie (TRIE)
Succinct representation of multibit trees (SMT)
•  Basic idea is to represent MT by LOUDS
  –  MT consists of multiple binary decision trees.
•  Bc: LOUDS representation of a decision tree
•  Lc: bit string indicating whether Bc[i] is a leaf or not
•  IDs: Array containing fingerprint identifiers
[Figure: an MT decision tree with internal nodes 1, 2, 3 and leaves 4, 5, 6, 7 holding W3, W4, W1, W2, next to its SMT representation]
Access to node auxiliaries and fingerprint identifiers in O(1)-time
•  Access to node auxiliaries $M_v^1$, $M_v^0$ for calculating
   upper bounds
   –  v = rank1(Bc, p) for a given position p
   –  Each 1 bit in Bc corresponds to a node v
•  Identifiers for calculating Jaccard similarities, for p such that Lc[p] = 1
   –  IDs[rank1(Lc, p)] for a given position p
   –  Each 1 bit in Lc corresponds to an index on IDs
[Figure: the decision tree with internal nodes 1, 2, 3 and leaves 4, 5, 6, 7 holding W3, W4, W1, W2]
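A small sketch of both lookups follows. The concrete strings are assumptions reconstructed for illustration: Bc is the LOUDS string of a 7-node binary tree, Lc marks the positions of its four leaves, and the leaves are taken to hold W3, W4, W1, W2 from left to right. `rank1` is a naive linear-time stand-in for an O(1)-time rank dictionary.

```python
def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


# Assumed example: tree 1 -> (2, 3), 2 -> (4, 5), 3 -> (6, 7).
Bc = "101101101100000"     # LOUDS bits of the decision tree
Lc = "000001101100000"     # 1 where the 1 bit in Bc belongs to a leaf
IDs = [3, 4, 1, 2]         # fingerprint identifiers of the leaves, left to right


def node_number(p):
    """Node v whose 1 bit sits at position p of Bc; v indexes the node auxiliaries."""
    return rank1(Bc, p)


def fingerprint_id(p):
    """Identifier stored at the leaf at position p, or None for an internal node."""
    if Lc[p - 1] != "1":
        return None
    return IDs[rank1(Lc, p) - 1]   # IDs is 1-based on the slide, 0-based here


print(node_number(9))       # 6
print(fingerprint_id(9))    # 1  (leaf 6 holds W1)
print(fingerprint_id(4))    # None (node 3 is internal)
```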
Variable-length array for compactly representing fingerprints
•  Standard array consists of bit strings of fixed length
   –  Space-inefficient for storing small values
   Ex) Array in which each element is represented as 8 bits (32 bits in total)
        Integer     2         1         3         4
        Bit string  00000010  00000001  00000011  00000100

•  Variable-length array = bit strings of different lengths
   Ex) The same integers stored in 8 bits in total
        Integer     2   1  3   4
        Bit string  10  1  11  100
   –  Space-efficient
   –  Random access is impossible without knowing where each element starts
Representation of variable-length array
•  Use two bit strings to represent an array A:
   -  R: bit string whose k-th substring corresponds to the
      bit string representation of A[k]
   -  P: bit string whose k-th substring consists of
      $(b_k - 1)$ 0s followed by a 1, where $b_k = \lfloor \log_2 A[k] \rfloor + 1$
      is the number of bits of A[k]
Recovering A[k] from variable-length array
[Figure: bit strings R and P with the start position s and end position e marked for k=3]
•  A[k] is recovered by three steps:
  1.  Start position s: If k=1 s=1, else s = select1(P,k-1) + 1
  2.  End position e: e = select1(P,k)
  3.  Conversion: Convert substring R[s,e] to an integer
•  O(1)-time
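The encoding and the three recovery steps can be sketched as follows, using the integers 2, 1, 3, 4 from the earlier example. The helper names are hypothetical, values are assumed to be positive integers, and `select1` is a naive linear scan standing in for an O(1)-time select.

```python
def select1(bits, k):
    """1-based position of the k-th 1 in bits, or 0 if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    return 0


def encode(values):
    """Build R (concatenated binary codes) and P (end-of-element markers)."""
    r_parts, p_parts = [], []
    for v in values:
        code = format(v, "b")                       # binary code without leading zeros
        r_parts.append(code)
        p_parts.append("0" * (len(code) - 1) + "1")  # a 1 marks the last bit of the element
    return "".join(r_parts), "".join(p_parts)


def get(R, P, k):
    """Recover A[k] (1-based) from R and P."""
    s = 1 if k == 1 else select1(P, k - 1) + 1   # step 1: start position
    e = select1(P, k)                            # step 2: end position
    return int(R[s - 1:e], 2)                    # step 3: convert R[s, e] to an integer


R, P = encode([2, 1, 3, 4])
print(R)             # 10111100
print(P)             # 01101001
print(get(R, P, 3))  # 3
```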
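Trie
•  Used to store an associative array
   –  Keys are usually strings
•  Applicable to fingerprints considered as strings
   Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
[Figure: the trie built from W1, W2, W3, W4, with a root labeled 0 and 12 nodes numbered in level order; each fingerprint is a path from the root whose node labels are its elements]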
Difficulty
•  The alphabet size tends to be small for typical trie
   applications, e.g., DNA (4), English (26)
•  Difficulty: the alphabet (word) size of fingerprints is not always
   small, e.g., PubChem fingerprints have 881 dimensions
   –  Memory usage is dominated by labels
•  Compute the difference between each node label
   and its parent node label
Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
    Build the trie, then replace each label by its difference from the parent label
    (e.g., the path 2, 3, 7, 8 below the root 0 becomes 2, 1, 4, 1),
    then represent the result as a succinct trie by LOUDS
[Figure: the trie with original labels, the trie with difference labels, and an arrow to "Succinct Trie by LOUDS"]
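Along a single path, taking differences to the parent label is the same as gap-encoding the sorted fingerprint, with the root label 0 as the starting point. The sketch below (hypothetical function names) shows the encoding and its inverse; small gaps are what make the variable-length array effective for the labels.

```python
from itertools import accumulate


def to_differences(fingerprint):
    """Replace each element of a sorted fingerprint by its gap to the previous one."""
    prev = 0                      # the trie root carries label 0
    diffs = []
    for x in fingerprint:
        diffs.append(x - prev)
        prev = x
    return diffs


def from_differences(diffs):
    """Recover the original elements as prefix sums of the gaps."""
    return list(accumulate(diffs))


print(to_differences((2, 3, 7, 8)))      # [2, 1, 4, 1]
print(from_differences([2, 1, 4, 1]))    # [2, 3, 7, 8]
```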
Succinct Trie (TRIE)	
•  Three components:
  –  T: LOUDS representation of trie
  –  D: Variable-length array containing node labels
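  –  Idconv: Array containing fingerprint identifiers
Ex) Succinct trie for W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
   Node ids  -   1    2    3   4    5   6   7  8   9  10  11  12
   LBS T     10  110  110  10  110  10  10  0  10  0  10  0   0
   Words D       0    1    2   1    2   1   1  3   2  4   3   1
   idconv: links each fingerprint to the node where it ends,
           e.g., W1 to node 7, W2 to node 12, W3 to node 11, W4 to node 9
[Figure: the trie with difference labels and node ids next to its succinct representation]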
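The components T and D can be derived from the four example fingerprints with a short program. This is a sketch under assumptions, not the talk's implementation: it builds an ordinary pointer-based trie first, stores D as a plain list rather than a variable-length array, and returns the fingerprint-to-node correspondence as a dictionary.

```python
from collections import deque


def build_succinct_trie(fingerprints):
    """Return (T, D, ends) for a list of sorted fingerprints.

    T:    LOUDS bit string of the trie (root labeled 0, preceded by a super root)
    D:    difference between each node label and its parent label, in level order
    ends: dict from fingerprint index (1-based) to the id of the node where it ends
    """
    root = {"label": 0, "children": {}, "ids": []}
    for idx, fp in enumerate(fingerprints, start=1):
        node = root
        for x in fp:
            node = node["children"].setdefault(
                x, {"label": x, "children": {}, "ids": []})
        node["ids"].append(idx)

    T, D, ends = ["10"], [], {}
    queue = deque([(root, 0)])            # (node, label of its parent)
    node_id = 0
    while queue:                          # level-order traversal
        node, parent_label = queue.popleft()
        node_id += 1
        kids = [node["children"][k] for k in sorted(node["children"])]
        T.append("1" * len(kids) + "0")
        D.append(node["label"] - parent_label)
        for idx in node["ids"]:
            ends[idx] = node_id
        queue.extend((kid, node["label"]) for kid in kids)
    return "".join(T), D, ends


W = [(1, 2, 3), (2, 3, 7, 8), (1, 2, 5, 8), (1, 3, 5)]
T, D, ends = build_succinct_trie(W)
print(T)
print(D)     # [0, 1, 2, 1, 2, 1, 1, 3, 2, 4, 3, 1]
print(ends)  # {1: 7, 2: 12, 3: 11, 4: 9}
```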
Experiments

•  30 million chemical fingerprints from the
   PubChem database
•  Evaluate search time and memory
•  Compared succinct multibit tree (SMT) to
   pointer-based multibit tree (MT)
•  Compared variable-length array (VLA) and
   succinct trie (TRIE) to the raw
   representation (RAW) of fingerprint databases.
Memory usage of multibit trees
[Figure: memory (MB) versus number of fingerprints (0 to 3.0e+07) for SMT and MT. At 30 million fingerprints: MT 6GB, SMT 847MB.]
Memory usage of representations of fingerprint databases
[Figure: memory (MB) versus number of fingerprints (0 to 3.0e+07) for TRIE, VLA and RAW. At 30 million fingerprints: RAW 16GB, VLA 3.2GB, TRIE 1.3GB.]
Search time and memory on 30 million fingerprints (ε=0.98) #answers: 10
[Figure: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW. Annotated points: SMT+TRIE 0.021 sec at 2GB, SMT+VLA 0.014 sec at 4GB, MT+RAW 0.006 sec at 22GB.]
Search time and memory on 30 million fingerprints (ε=0.9) #answers: 1,440
[Figure: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW. Annotated points: SMT+TRIE 1.7 sec at 2GB, SMT+VLA 0.58 sec at 4GB, MT+RAW 0.3 sec at 22GB.]
Summary	

•  Succinct Multibit Trees (SMT)
•  Compactly represent multibit trees and
   fingerprints by succinct data structures
•  Represent multibit trees by LOUDS
•  Represent fingerprints by variable-length array and
   succinct trie
•  Enables us to index 30 million fingerprints in 2GB
   by SMT+TRIE and in 4GB by SMT+VLA
•  Search time remains practically fast
Succinct Data Structures	
•  Space-efficient data structures enabling fast
   operations
•  Pointer-based representations of ordered trees
   consume a large amount of memory
  –  O(nlogn) bits for the number n of nodes
  –  logn factor is too large for large-scale data
•  Represent ordered trees as bit strings of length 2n
   + 1 and enable O(1)-time operations
  –  Ex) 0100100101000
•  Various succinct data structures
  –  sets(Raman,2002), sequences(Ferragina,2001),
     trees(Jacobson,1989), graphs(Turan,1989)
[Figure: bit array B divided into large blocks and small blocks]
•  Divide the bit array B into large blocks of length $l = \log^2 n$
   RL = ranks of large blocks
•  Divide each large block into small blocks of length $s = (\log n)/2$
   Rs = ranks of small blocks relative to the large block
•  rank1(B,i) = RL[⌊i/l⌋] + Rs[⌊i/s⌋] + (remaining rank)
   Time: O(1)
   Memory: n + o(n) bits
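The two-level directory can be sketched as below. This is an illustration under simplifying assumptions: the block lengths are small fixed constants instead of log² n and (log n)/2, the bit string is a Python `str`, and the remaining rank inside a small block is counted directly (a real implementation would use a lookup table or a popcount instruction to keep it O(1)).

```python
class RankDirectory:
    """Two-level rank directory over a bit string given as a str of '0'/'1'."""

    def __init__(self, bits, small=4, smalls_per_large=4):
        self.bits = bits
        self.s = small                        # small block length
        self.l = small * smalls_per_large     # large block length (multiple of s)
        self.RL = []                          # 1s before each large block
        self.Rs = []                          # 1s before each small block, within its large block
        total = 0
        within_large = 0
        # The +1 adds a final entry so that rank1(len(bits)) has a block to look up.
        for start in range(0, len(bits) + 1, self.s):
            if start % self.l == 0:
                self.RL.append(total)
                within_large = 0
            self.Rs.append(within_large)
            ones = bits[start:start + self.s].count("1")
            total += ones
            within_large += ones

    def rank1(self, i):
        """Number of 1s in bits[1..i] (1-based, inclusive), for 0 <= i <= len(bits)."""
        j = i // self.s                                   # small block index
        remaining = self.bits[j * self.s:i].count("1")    # remaining rank in that block
        return self.RL[i // self.l] + self.Rs[j] + remaining


rd = RankDirectory("0110011100")
print(rd.rank1(8))   # 5
```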
Recovering A[k] from variable-length array
Ex) k=3
  1.  s = select1(P,2)+1 = 4
  2.  e = select1(P,3) = 7
  3.  Convert R[4,7]=1000 to the integer 8

  • 2. Chemical fingerprint search •  Space-efficient data structures to index 30 million chemical fingerprints, e.g., W=(1,5,7,10) •  Find all fingerprints similar to a query (similarity ≥ ε) –  Similarity = Jaccard (Tanimoto): J(W,W') = |W∩W'| / |W∪W'| •  Multibit tree (Kristensen et al., WABI09) –  Data structure enabling fast similarity searches –  Memory-inefficiency of pointer-based representation •  Succinct data structures (Jacobson, 1989) –  Space efficient and enabling fast operations •  Goal: present a succinct representation of the multibit tree
  • 3. Outline •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 4. Outline •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 5. Multibit Tree (MT) (Kristensen et al., 09) •  Multiple decision trees built on fingerprints clustered with respect to cardinality [Figure: (i) fingerprint database, e.g., W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1); (ii) cluster into bins w.r.t. cardinality; (iii) build decision trees]
  • 6. Similarity search of a query fingerprint Q •  If Jaccard similarity $J(W_i, Q) \geq \epsilon$, two constraints are satisfied: 1.  Cardinality constraint: $\epsilon |Q| \leq |W_i| \leq \epsilon^{-1} |Q|$ 2.  Upper bound of Jaccard similarity: $J(W_i, Q) \leq \frac{\min(|W_i| - N_0,\ |Q| - N_1)}{|W_i| + |Q| - \min(|W_i| - N_0,\ |Q| - N_1)}$ - $N_0$: the number of elements contained in $W_i$ and not in $Q$ - $N_1$: the number of elements contained in $Q$ and not in $W_i$
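The two constraints on this slide can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and fingerprints are assumed to be given as collections of integer bit positions.

```python
import math

def jaccard(w, q):
    """Jaccard (Tanimoto) similarity of two fingerprints given as index collections."""
    w, q = set(w), set(q)
    return len(w & q) / len(w | q)

def cardinality_range(q_size, eps):
    """Smallest and largest |W| that can still satisfy J(W, Q) >= eps."""
    return math.ceil(eps * q_size), math.floor(q_size / eps)

def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W, Q) given counts n0 (in W, not in Q) and n1 (in Q, not in W)."""
    m = min(w_size - n0, q_size - n1)   # largest possible intersection size
    return m / (w_size + q_size - m)    # the union is at least |W| + |Q| - m

W, Q = (1, 2, 5, 8), (1, 2, 3)
print(jaccard(W, Q))                    # exact similarity
print(jaccard_upper_bound(4, 3, 1, 0))  # bound with only part of the mismatches known
```

When n0 and n1 count all mismatches, the bound equals the exact similarity; with partial counts it can only be larger, which is why it is safe to use for pruning.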
  • 7. Similarity search of a query fingerprint Q •  Step 1: Find candidate solutions I1 satisfying cardinality constraints •  Step 2: Find candidate solutions I2 satisfying upper bounds •  Step 3: Calculate similarities to remove false positives [Figure: bins outside the cardinality range are pruned; the decision trees of the remaining bins are searched]
  • 8. Drawbacks •  Pointer-based representation of multibit trees needs a large amount of memory [size expression in bits not recoverable from the transcript] - K_c: number of fingerprints in bin c - C: total number of bins –  The log(.) factor is too large! •  Need to store original fingerprint databases in memory to filter out false positives
  • 9. Outline •  Multibit Tree •  Succinct Data Structures –  Rank/select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 10. Rank/select dictionary (RRR, 2002): foundation of various succinct data structures •  Enables the rank/select operations on bit string B in O(1)-time -  rankc(B,i): return the number of occurrences of c∈{0,1} in B[1…i] -  selectc(B,i): return the position of the i-th occurrence of c∈{0,1} in B •  Efficient rank/select dictionary (Navarro and Providel, 2012) Ex) B=0110011100 (positions i=1,…,10): rank1(B,8)=5, select1(B,3)=6 •  Memory: n + o(n) bits
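To make the two operations concrete, here is a naive Python sketch that reproduces the example on this slide. It runs in linear time and only illustrates the semantics; it is not the O(1)-time dictionary the slide cites. Positions are 1-based, as on the slide, and the function names are illustrative.

```python
def rank(bits, c, i):
    """Number of occurrences of character c in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)

def select(bits, c, i):
    """1-based position of the i-th occurrence of c, or 0 if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    return 0

B = "0110011100"
print(rank(B, "1", 8))    # number of 1s among the first 8 bits
print(select(B, "1", 3))  # position of the third 1
```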
  • 11. Level-order Unary Degree Sequence (LOUDS) (Jacobson, 1989) •  Represents an ordered tree as a bit string of length 2n+1 (n: number of nodes) •  Construction 1)  Traverse the tree in a breadth-first manner 2)  For each node with k children, in the order visited, generate k 1s followed by a 0, starting with 10 for a super root S whose only child is the root Ex) the 7-node tree with root 1 and leaves 4, 5, 6, 7 gives B = 101101101100000
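The construction can be sketched in a few lines of Python. The tree is assumed to be given as a hypothetical `children` dictionary mapping each internal node to its ordered list of children; the shape below is the one implied by the bit string on the slide.

```python
from collections import deque

def louds(children, root):
    """Build the LOUDS bit string of an ordered tree by breadth-first traversal."""
    bits = ["10"]                  # super root whose single child is the real root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")  # k ones for k children, then a zero
        queue.extend(kids)
    return "".join(bits)

tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}    # nodes 4..7 are leaves
B = louds(tree, 1)
print(B, len(B))
```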
  • 12. Properties of LOUDS (example: B = 101101101100000) •  For a tree consisting of n nodes, there are n 1s and n+1 0s on bit string B •  Each 1, and each 0 except the first 0, on B corresponds one-to-one to a tree node •  Positions of the parent and children for a tree node on B can be calculated by combining the rank/select operations in O(1)-time.
  • 13. O(1)-time operations on a tree •  Parent/child operations for position i such that B[i]=1 –  First child: p = select0(B, rank1(B,i)) + 1 –  Next child: i+1 for position i of the first child –  Parent: p = select1(B, rank0(B,i)) Ex) Calculating the first child for i=4 on B = 101101101100000: rank1(B,4)=3, select0(B,3)=8, so p = 8+1 = 9
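The navigation formulas can be checked with a small Python sketch on the example bit string. The rank and select helpers here are naive linear-time stand-ins for an O(1)-time dictionary, and returning `None` for a missing child, sibling, or parent is a convention chosen for this sketch.

```python
def rank(bits, c, i):
    """Occurrences of c in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)

def select(bits, c, k):
    """1-based position of the k-th occurrence of c, or 0 if absent."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b == c
        if b == c and seen == k:
            return pos
    return 0

def first_child(B, i):
    p = select(B, "0", rank(B, "1", i)) + 1
    return p if p <= len(B) and B[p - 1] == "1" else None

def next_child(B, i):
    # B[i] with a 0-based index is the bit at 1-based position i+1
    return i + 1 if i < len(B) and B[i] == "1" else None

def parent(B, i):
    r = rank(B, "0", i)
    return select(B, "1", r) if r > 0 else None  # None: the parent is the super root

B = "101101101100000"
print(first_child(B, 4), next_child(B, 9), parent(B, 9))
```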
  • 14. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 15. Succinct Multibit Trees (SMT) •  Consist of compact representations of multibit trees and fingerprint databases •  Represent multibit trees by LOUDS –  $O(8 \sum_{c=1}^{C} |K_c| - 4C + MC)$ bits, not including the log factor –  Fast similarity searches •  Two compact representations of fingerprint databases –  Variable-length array (VLA) –  Succinct trie (TRIE)
  • 16. Succinct representation of multibit trees (SMT) •  Basic idea is to represent MT by LOUDS –  MT consists of multiple binary decision trees. •  Bc: LOUDS representation of a decision tree •  Lc: bit string indicating whether Bc[i] is a leaf or not •  IDs: Array containing fingerprint identifiers [Figure: a multibit tree (MT) with nodes 1 to 7 and leaf fingerprints W3, W4, W1, W2, next to its SMT encoding]
  • 17. Access to node auxiliaries and fingerprint identifiers in O(1)-time •  Access to node auxiliaries $M_v^1$, $M_v^0$ for calculating upper bounds –  v = rank1(Bc, p) for a given position p –  Each 1 bit in Bc corresponds to a node v •  Identifiers for calculating Jaccard similarities, when Lc[p] = 1 –  IDs[rank1(Lc, p)] for a given position p –  Each 1 bit in Lc corresponds to an index on IDs [Figure: tree with nodes 1 to 7 and leaf fingerprints W3, W4, W1, W2]
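A minimal Python sketch of these two lookups follows. The concrete layout is an assumption made for illustration: a complete binary tree with nodes 1 to 7 whose leaves 4, 5, 6, 7 hold W3, W4, W1, W2, with Lc marking the positions of the leaves in Bc. The helper names are hypothetical.

```python
def rank1(bits, p):
    """Number of 1s in bits[1..p] (1-based, inclusive); naive stand-in for O(1) rank."""
    return bits[:p].count("1")

Bc = "101101101100000"            # LOUDS bit string of the assumed decision tree
Lc = "000001101100000"            # assumed: 1 where the node at that position is a leaf
IDS = ["W3", "W4", "W1", "W2"]    # fingerprint identifiers in leaf order

def node_number(p):
    """Node v whose auxiliaries are looked up for position p of Bc."""
    return rank1(Bc, p)

def fingerprint_id(p):
    """Identifier stored at position p, or None if p is not a leaf."""
    if Lc[p - 1] != "1":
        return None
    return IDS[rank1(Lc, p) - 1]  # IDS is 0-based in Python, hence the -1

print(node_number(9), fingerprint_id(9))
```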
  • 18. Variable-length array for compactly representing fingerprints •  Standard array consists of bit strings of fixed length –  Space-inefficient for storing small values Ex) Array with 8 bits per element: integers 2, 1, 3, 4 are stored as 00000010 00000001 00000011 00000100 (32 bits) •  Variable-length array = bit strings of different lengths Ex) Integers 2, 1, 3, 4 are stored as 10 1 11 100 (8 bits) –  Space-efficient –  Random access is impossible without extra information on element boundaries
  • 19. Representation of variable-length array •  Use two bit strings to represent an array A: -  R: bit string whose k-th substring is the binary representation of A[k] -  P: bit string whose k-th substring consists of $\lfloor \log_2 A[k] \rfloor$ 0s followed by a 1, so it has the same length as the k-th substring of R and its 1 marks where A[k] ends
  • 20. Recovering A[k] from variable-length array •  A[k] is recovered by three steps: 1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1 2.  End position e: e = select1(P,k) 3.  Conversion: convert substring R[s,e] to an integer •  O(1)-time
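The encoding into R and P and the three recovery steps can be sketched in Python as below. The function names are illustrative, values are assumed to be positive integers, and `select1` is a naive linear-time stand-in for an O(1)-time select.

```python
def encode(values):
    """Build R (concatenated binary codes) and P (a 1 marking the end of each code)."""
    r_parts, p_parts = [], []
    for v in values:               # assumes every v >= 1
        code = bin(v)[2:]
        r_parts.append(code)
        p_parts.append("0" * (len(code) - 1) + "1")
    return "".join(r_parts), "".join(p_parts)

def select1(bits, k):
    """1-based position of the k-th 1 in bits."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k ones")

def get(R, P, k):
    """Recover A[k] (1-based) from R and P."""
    s = 1 if k == 1 else select1(P, k - 1) + 1  # step 1: start position
    e = select1(P, k)                           # step 2: end position
    return int(R[s - 1:e], 2)                   # step 3: convert R[s..e]

R, P = encode([2, 1, 3, 4])
print(R, P, [get(R, P, k) for k in range(1, 5)])
```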
  • 21. Trie •  Used to store an associative array –  Keys are usually strings •  Applicable to fingerprints considered as strings Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5) [Figure: trie built from W1 to W4, with one node per fingerprint element]
  • 22. Difficulty •  The alphabet size tends to be small for typical trie applications, e.g., DNA (4), English (26) •  Difficulty: the alphabet size of fingerprints is not always small, e.g., PubChem, 881 dimensions –  Memory usage is dominated by labels •  Compute the difference between each node label and its parent node label Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5) [Figure: trie built from W1 to W4 and the same trie with difference labels, each represented as a succinct trie by LOUDS]
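The label-difference idea can be sketched with a plain dictionary-based trie in Python. This is an illustration, not the succinct LOUDS layout; it assumes the root carries label 0 and that no fingerprint is a prefix of another, and the function names are hypothetical.

```python
def build_diff_trie(fingerprints):
    """Trie over sorted fingerprints where each edge stores label minus parent label."""
    root = {}
    for fp in fingerprints:
        node, prev = root, 0           # the root is treated as label 0
        for label in sorted(fp):
            node = node.setdefault(label - prev, {})
            prev = label
    return root

def decode(node, prev=0, prefix=()):
    """Recover the original fingerprints by summing differences along each path."""
    if not node:
        yield prefix
    for diff, child in node.items():
        yield from decode(child, prev + diff, prefix + (prev + diff,))

FPS = [(1, 2, 3), (2, 3, 7, 8), (1, 2, 5, 8), (1, 3, 5)]
trie = build_diff_trie(FPS)
print(trie)
print(sorted(decode(trie)))
```

The stored differences are never larger than the original labels (here the largest stored value is 4 instead of 8), which is what makes a variable-length encoding of the labels pay off.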
  • 23. Succinct Trie (TRIE) •  Three components: –  T: LOUDS representation of trie –  D: Variable-length array containing node labels –  idconv: Array containing fingerprint identifiers [Figure: example trie and its succinct representation, with LOUDS bit string T = 10 110 110 10 110 10 10 0 10 0 10 0 0, the label differences D, and the idconv array indexed by fingerprints W1 to W5]
  • 24. Outline •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 25. Experiments •  30 million chemical fingerprints from the PubChem database •  Evaluate search time and memory •  Compared succinct multibit tree (SMT) to pointer-based multibit tree (MT) •  Compared variable-length array (VLA) and succinct trie (TRIE) to the raw representation (RAW) of fingerprint databases.
  • 26. Memory usage of multibit trees [Plot: memory (MB) vs. number of fingerprints, up to 3.0e+07, for SMT and MT; at 30 million fingerprints MT needs about 6GB and SMT 847MB]
  • 27. Memory usage of representations of fingerprint databases [Plot: memory (MB) vs. number of fingerprints, up to 3.0e+07, for TRIE, VLA, and RAW; at 30 million fingerprints RAW needs 16GB, VLA 3.2GB, and TRIE 1.3GB]
  • 28. Search time and memory on 30 million fingerprints (ε=0.98), #answers: 10 [Plot: search time (sec) vs. memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, MT+RAW; annotated search times 0.021 (SMT+TRIE), 0.014, and 0.006 sec; annotated memory 2GB, 4GB, and 22GB]
  • 29. Search time and memory on 30 million fingerprints (ε=0.9), #answers: 1,440 [Plot: search time (sec) vs. memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, MT+RAW; annotated search times 1.7 (SMT+TRIE), 0.58, and 0.3 sec; annotated memory 2GB, 4GB, and 22GB]
  • 30. Summary •  Succinct Multibit Trees (SMT) •  Compactly represent multibit trees and fingerprints by succinct data structures •  Represent multibit trees by LOUDS •  Represent fingerprints by variable-length array and succinct trie •  Index 30 million fingerprints in 2GB with SMT+TRIE and in 4GB with SMT+VLA •  Search time remains practically fast
  • 31. Succinct Data Structures •  Space-efficient data structures enabling fast operations •  Pointer-based representations of ordered trees consume a large amount of memory –  O(n log n) bits for the number n of nodes –  The log n factor is too large for large-scale data •  Represent ordered trees as bit strings of length 2n + 1 while enabling O(1)-time operations –  Ex) 0100100101000 •  Various succinct data structures –  sets (Raman, 2002), sequences (Ferragina, 2001), trees (Jacobson, 1989), graphs (Turan, 1989)
  • 32. B l  Divide the bit array B into large blocks of length =log2n RL=Ranks of large blocks l  Divide each large block to small blocks of length s=(logn)/2 Rs=Ranks of small blocks relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n + o(n) bits
  • 33. Recovering A[k] from variable-length array •  A[k] is recovered by three steps: 1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1 2.  End position e: e = select1(P,k) 3.  Conversion: convert substring R[s,e] to an integer •  O(1)-time Ex) k=3: 1.  s = select1(P,2) + 1 = 4 2.  e = select1(P,3) = 7 3.  Convert R[4,7] = 1000 to the integer 8