1. 12th Workshop on Algorithms in Bioinformatics,
Ljubljana, Slovenia
Succinct Multibit Tree:
Compact Representation of Multibit Trees
by Succinct Data Structures
in Chemical Fingerprint Searches
Yasuo Tabei
JST ERATO Minato Project
2. Chemical fingerprint search
• Space-efficient data structures to index 30 million
chemical fingerprints, e.g., W=(1,5,7,10)
• Find all fingerprints similar to a query (≧ε)
– Similarity = Jaccard (Tanimoto) (J(W,W’)=|W∩W’|/|W∪W’|)
• Multibit tree (Kristensen et al.,WABI09)
– Data structure enabling fast similarity searches
– Memory-inefficiency of pointer-based representation
• Succinct data structures (Jacobson, 1989)
– Space efficient and enabling fast operations
• Present a succinct representation of the multibit tree
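As a point of reference for the search problem stated on this slide, here is a minimal sketch of the Jaccard (Tanimoto) similarity and a brute-force search. It assumes a fingerprint is given as the collection of positions of its 1 bits, as in W=(1,5,7,10); the names `jaccard` and `linear_search` are illustrative and not taken from the talk.

```python
def jaccard(w, q):
    """Jaccard (Tanimoto) similarity |W∩Q| / |W∪Q| of two fingerprints."""
    w, q = set(w), set(q)
    union = len(w | q)
    return len(w & q) / union if union else 0.0


def linear_search(database, query, eps):
    """Return the indices of all fingerprints whose similarity to query is >= eps."""
    return [i for i, w in enumerate(database) if jaccard(w, query) >= eps]


database = [(1, 5, 7, 10), (1, 5, 7), (2, 3)]
hits = linear_search(database, (1, 5, 7, 10), 0.7)
```

This baseline scans every fingerprint; the multibit tree exists to avoid exactly that scan.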
3. Outline
• Multibit Tree
• Succinct Data Structures
– Rank/Select dictionary
– Succinct ordered tree: LOUDS
• Succinct Multibit Trees
– Compact representation of multibit trees
– Compact representation of fingerprint databases
1. Variable-length array
2. Succinct Trie
• Experiments
5. Multibit Tree (MT) (Kristensen et al., 09)
• Multiple decision trees built on fingerprints clustered with respect to cardinality
[Figure: (i) fingerprint database, e.g., W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4); (ii) cluster into bins w.r.t. cardinality; (iii) build decision trees]
6. Similarity search of a query fingerprint Q
• If Jaccard similarity $J(W_i, Q) \ge \epsilon$, two constraints are satisfied:
1. Cardinality constraint
$$\epsilon |Q| \le |W_i| \le \frac{1}{\epsilon} |Q|$$
2. Upper bound of Jaccard similarity
$$J(W_i, Q) \le \frac{\min(|W_i| - N_0,\ |Q| - N_1)}{|W_i| + |Q| - \min(|W_i| - N_0,\ |Q| - N_1)}$$
- N0: The number of elements contained in Wi and not in Q
- N1: The number of elements contained in Q and not in Wi
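The two pruning constraints on this slide can be sketched as follows. This is an illustration under the slide's definitions of N0 and N1; the function names are hypothetical, and rounding the cardinality bounds to integers is an assumption made here because cardinalities are whole numbers.

```python
import math


def cardinality_range(q_size, eps):
    """Smallest and largest |W_i| that can still reach similarity eps with a query of size q_size."""
    return math.ceil(eps * q_size), math.floor(q_size / eps)


def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W_i, Q).

    n0: number of elements contained in W_i and not in Q
    n1: number of elements contained in Q and not in W_i
    The intersection can hold at most min(|W_i| - n0, |Q| - n1) elements.
    """
    m = min(w_size - n0, q_size - n1)
    return m / (w_size + q_size - m)


lo, hi = cardinality_range(10, 0.5)
bound = jaccard_upper_bound(5, 4, 1, 0)
```

A bin whose cardinality falls outside the range, or a subtree whose bound is below ε, can be skipped without looking at its fingerprints.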
8. Drawbacks
• Pointer-based representation of multibit trees
needs a large amount of memory
bits
- Kc: number of fingerprints in bin c
- C: total number of bins
– Log(.) factor is too large!
• Need to store original fingerprint databases in
memory to filter out false positives
10. Rank/select dictionary (RRR, 2002)
: Foundation of various succinct data structures
• Enables the rank/select operations on bit string B in O(1)-time
- Rankc(B,i): return the number of c∈{0,1} in B[1…i]
- Selectc(B,i): return the position of i-th occurrence of c∈{0,1}
• Efficient rank/select dictionary (Navarro and Providel, 2012)
Ex) B=0110011100
i 1 2 3 4 5 6 7 8 9 10
B 0 1 1 0 0 1 1 1 0 0
Rank1(B,8)=5
Select1(B,3)=6
Memory: n + o(n) bits
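To make the semantics of the two operations concrete, here is a naive reference sketch that reproduces the example on this slide. It runs in linear time and is not the O(1)-time dictionary the slide refers to; positions are 1-based as on the slide, and the bit string is held as a Python `str` purely for readability.

```python
def rank(bits, c, i):
    """Number of occurrences of the character c ('0' or '1') in B[1..i]."""
    return bits[:i].count(c)


def select(bits, c, i):
    """1-based position of the i-th occurrence of c in B."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("B contains fewer than i occurrences of c")


B = "0110011100"
r = rank(B, "1", 8)    # 5, as on the slide
s = select(B, "1", 3)  # 6, as on the slide
```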
11. Level-order Unary Degree Sequence
(LOUDS) (Jacobson, 1989)
• Represents an ordered tree as a bit string
of length 2n+1 (n: number of nodes)
• Construction
1) Traverse the tree in a breadth-first manner
2) For each node of degree k, in that traversal order,
generate k 1s followed by a 0
Ex) Tree with super root S: S → 1; 1 → 2, 3; 2 → 4, 5; 3 → 6, 7
B: 101101101100000
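The construction can be sketched in a few lines. This is an illustrative version that assumes the tree is given as a dictionary mapping each node to its ordered list of children; the function name `louds` and that input format are choices made here, not part of the talk.

```python
from collections import deque


def louds(children, root):
    """Return the LOUDS bit string of an ordered tree.

    children: dict mapping a node to the ordered list of its children
              (leaves may be absent from the dict)
    """
    bits = ["10"]                 # super root with the real root as its only child
    queue = deque([root])
    while queue:                  # breadth-first traversal
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")   # k 1s followed by a 0
        queue.extend(kids)
    return "".join(bits)


tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
B = louds(tree, 1)   # "101101101100000", the bit string on the slide
```

For the 7-node example the result has 2·7+1 = 15 bits.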
12. Properties of LOUDS
Ex) Tree: 1 → 2, 3; 2 → 4, 5; 3 → 6, 7
B: 101101101100000
(the i-th 1 and the (i+1)-th 0 on B both correspond to node i)
• For a tree consisting of n nodes, there are n 1s
and n+1 0s on bit string B
• Each 1, and each 0 except the first 0, on B corresponds
one-to-one to a tree node
• Positions of the parent and children for a tree
node on B can be calculated by combining the
rank/select operations in O(1)-time.
13. O(1)-time operations on a tree
• Parent/child operations for i such that B[i]=1
– First child:p=select0(B,rank1(B,i))+1
– Next child:i+1 for position i of the first child
– Parent :p=select1(B,rank0(B,i))
Ex) Calculating the first child for i = 4
Tree: 1 → 2, 3; 2 → 4, 5; 3 → 6, 7
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
B 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0
i=4 (node 3)
rank1(B,4)=3
select0(B,3)=8
p=8+1=9 (node 6)
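The three navigation formulas can be checked with a small sketch on the example bit string. The rank/select helpers below are naive linear-time stand-ins (an assumption for readability, not the O(1)-time dictionary), and returning `None` for a missing child, sibling or parent is a convention chosen here.

```python
def rank(bits, c, i):
    """Number of occurrences of c in B[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)


def select(bits, c, i):
    """1-based position of the i-th occurrence of c in B."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("not enough occurrences")


def first_child(B, i):
    """Position of the first child of the node at position i (B[i] = 1), or None for a leaf."""
    p = select(B, "0", rank(B, "1", i)) + 1
    return p if B[p - 1] == "1" else None


def next_child(B, i):
    """Position of the next sibling of the node at position i, or None if it is the last child."""
    return i + 1 if i < len(B) and B[i] == "1" else None


def parent(B, i):
    """Position of the parent of the node at position i, or None for the root."""
    zeros = rank(B, "0", i)
    return select(B, "1", zeros) if zeros > 0 else None


B = "101101101100000"
p = first_child(B, 4)   # rank1(B,4)=3, select0(B,3)=8, so p=9
```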
15. Succinct Multibit Trees (SMT)
• Consist of compact representations of multibit
trees and fingerprint databases
• Represent multibit trees by LOUDS
– $O\left(8 \sum_{c=1}^{C} |K_c| - 4C + MC\right)$ bits not including log factor
– Fast similarity searches
• Two compact representations of fingerprint
databases
– Variable-length array (VLA)
– Succinct trie (TRIE)
16. Succinct representation of
multibit trees (SMT)
• Basic idea is to represent MT by LOUDS
– MT consists of multiple binary decision trees.
• Bc: LOUDS representation of a decision tree
• Lc: bit string indicating whether Bc[i] is a leaf or not
• IDs: Array containing fingerprint identifiers
[Figure: a multibit tree (MT) with nodes 1–7 whose leaves 4, 5, 6, 7 hold W3, W4, W1, W2, shown next to its SMT representation]
17. Access to node auxiliaries and
fingerprint identifiers in O(1)-time
• Access to node auxiliaries $M_v^1$, $M_v^0$ for calculating
upper bounds
– v = rank1 (Bc , p) for a given position p
– Each 1 bit in Bc corresponds to a node v
• Identifiers for calculating Jaccard similarities
– IDs[rank1 (Lc , p)] for a given position p with Lc[p] = 1
– Each 1 bit in Lc corresponds to an index on IDs
[Figure: tree with nodes 1–7 whose leaves 4, 5, 6, 7 hold W3, W4, W1, W2]
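The two lookups can be sketched on the 7-node example. The concrete contents of `Bc`, `Lc` and `IDs` below are an assumed layout for illustration (leaves 4, 5, 6, 7 holding W3, W4, W1, W2, with Lc aligned position by position with Bc); the helper names are hypothetical and `rank1` is a naive linear-time stand-in.

```python
def rank1(bits, i):
    """Number of 1s in B[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def node_number(Bc, p):
    """Node number v for a position p with Bc[p] = 1; v indexes the node auxiliaries."""
    return rank1(Bc, p)


def fingerprint_id(Lc, IDs, p):
    """Fingerprint identifier stored at the leaf whose position is p (Lc[p] = 1)."""
    if Lc[p - 1] != "1":
        raise ValueError("position p is not a leaf")
    return IDs[rank1(Lc, p) - 1]   # IDs is 1-based on the slide, 0-based here


Bc = "101101101100000"           # LOUDS of the example decision tree
Lc = "000001101100000"           # 1 where the node at that position is a leaf (assumed layout)
IDs = ["W3", "W4", "W1", "W2"]   # identifiers in leaf order (assumed layout)

v = node_number(Bc, 9)               # node 6
w = fingerprint_id(Lc, IDs, 6)       # leaf at position 6 is node 4
```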
18. Variable-length array for compactly
representing fingerprints
• Standard array consists of bit strings of fixed-length
– Space-inefficient for storing small values
Ex) Array in which each element is represented as 8 bits (32 bits in total)
Integer    2        1        3        4
Bit string 00000010 00000001 00000011 00000100
• Variable-length array = bit strings of different lengths
Ex) 8 bits in total
Integer    2  1 3  4
Bit string 10 1 11 100
– Space-efficient
– Random access is impossible
19. Representation of variable-length array
• Use two bit strings to represent an array A:
- R: bit string whose k-th substring corresponds to the
bit string representation of A[k]
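- P: bit string whose k-th substring consists of
$\lfloor \log_2 A[k] \rfloor$ 0s followed by a 1, so it has the same length as the k-th substring of R and marks where that substring ends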
20. Recovering A[k] from variable-length array
• A[k] is recovered by three steps:
1. Start position s: If k=1 s=1, else s = select1(P,k-1) + 1
2. End position e: e = select1(P,k)
3. Conversion: Convert substring R[s,e] to an integer
• O(1)-time
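The three recovery steps can be sketched together with the encoding into R and P. This is a minimal illustration assuming positive integers and bit strings held as Python `str`; `select1` is a naive linear-time stand-in, so the sketch shows the logic rather than the O(1)-time bound. The example array is hypothetical.

```python
def encode(values):
    """Build R (concatenated binary representations) and P (end markers)."""
    R, P = [], []
    for v in values:
        b = format(v, "b")                 # binary representation without leading zeros
        R.append(b)
        P.append("0" * (len(b) - 1) + "1")  # 0s, then a 1 at the last bit of the element
    return "".join(R), "".join(P)


def select1(bits, i):
    """1-based position of the i-th 1 in bits."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == i:
                return pos
    raise ValueError("not enough 1s")


def get(R, P, k):
    """Recover A[k] (k is 1-based)."""
    s = 1 if k == 1 else select1(P, k - 1) + 1   # 1. start position
    e = select1(P, k)                            # 2. end position
    return int(R[s - 1:e], 2)                    # 3. convert R[s, e] to an integer


R, P = encode([2, 1, 8, 4])   # hypothetical array
third = get(R, P, 3)          # s=4, e=7, R[4,7]=1000
```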
21. Trie
• Used to store an associative array
– keys are usually strings
• Applicable to fingerprints considered as strings
Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
[Figure: trie built from W1 to W4, with the root labeled 0 and each other node labeled with a fingerprint element]
22. Difficulty
• The alphabet size tends to be small for typical trie
applications, e.g., DNA(4), English(26)
• Difficulty: the alphabet of fingerprints is not always
small, e.g., 881 dimensions for PubChem
– Memory usage is dominated by labels
• Compute the difference between each node label and
its parent's label
Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
[Figure: build trie → compute differences → succinct trie by LOUDS]
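The label-difference idea can be sketched with an ordinary pointer-based trie. This is only an illustration of how the differences arise and how a fingerprint is restored from them; it does not include the LOUDS encoding, and the dictionary-based node layout and function names are assumptions made here.

```python
def build_delta_trie(fingerprints):
    """Insert sorted fingerprints into a trie; each node stores label - parent label."""
    root = {"label": 0, "delta": 0, "children": {}}
    for fp in fingerprints:
        node = root
        for x in sorted(fp):
            if x not in node["children"]:
                node["children"][x] = {
                    "label": x,
                    "delta": x - node["label"],   # difference to the parent's label
                    "children": {},
                }
            node = node["children"][x]
    return root


def path_deltas(root, fp):
    """Differences stored along the path that spells out fingerprint fp."""
    node, out = root, []
    for x in sorted(fp):
        node = node["children"][x]
        out.append(node["delta"])
    return out


def restore(deltas):
    """Recover the original labels by summing the differences from the root."""
    total, labels = 0, []
    for d in deltas:
        total += d
        labels.append(total)
    return labels


def count_nodes(node):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(c) for c in node["children"].values())


trie = build_delta_trie([(1, 2, 3), (2, 3, 7, 8), (1, 2, 5, 8), (1, 3, 5)])
deltas = path_deltas(trie, (1, 2, 5, 8))
```

Because fingerprint elements are stored in increasing order, every difference is positive and typically much smaller than the label itself, which is what makes a variable-length encoding of the labels pay off.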
25. Experiments
• 30 million chemical fingerprints from
PubChem database
• Evaluate search time and memory
• Compared succinct multibit tree (SMT) to
pointer-based multibit tree (MT)
• Compared variable-length array (VLA) and
succinct trie (TRIE) to the raw (RAW)
representation of fingerprint databases.
27. Memory usage of representations of
fingerprint databases
[Figure: memory (MB) versus # of fingerprints (0 to 3.0e+07) for TRIE, VLA and RAW; annotations on the plot: 16GB, 3.2GB, 1.3GB]
28. Search time and memory on 30 million
fingerprints (ε=0.98) #answers:10
[Figure: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW; annotated search times: 0.021, 0.014, 0.006 sec; annotated memory: 2GB, 4GB, 22GB]
29. Search time and memory on 30 million
fingerprints (ε=0.9) #answers:1,440
[Figure: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW; annotated search times: 1.7, 0.58, 0.3 sec; annotated memory: 2GB, 4GB, 22GB]
30. Summary
• Succinct Multibit Trees (SMT)
• Compactly represent multibit trees and
fingerprints by succinct data structures
• Represent multibit trees by LOUDS
• Represent fingerprints by variable-length array and
succinct trie
• Enables us to index 30 million fingerprints in 2GB
by SMT+TRIE and in 4GB by SMT+VLA
• Search time remains practically fast
31. Succinct Data Structures
• Space-efficient data structures enabling fast
operations
• Pointer-based representations of ordered trees
consume a large amount of memory
– O(nlogn) bits for the number n of nodes
– logn factor is too large for large-scale data
• Represent ordered trees as bit strings of length 2n + 1
and enable O(1)-time operations
– Ex) 0100100101000
• Various succinct data structures
– sets(Raman,2002), sequences(Ferragina,2001),
trees(Jacobson,1989), graphs(Turan,1989)
32. B
• Divide the bit array B into large blocks of length $l = \log^2 n$
RL = ranks of large blocks
• Divide each large block into small blocks of length $s = (\log n)/2$
Rs = ranks of small blocks relative to the large block
rank1(B,i) = RL[⌊i/l⌋] + Rs[⌊i/s⌋] + (remaining rank)
Time:O(1)
Memory: n + o(n) bits
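The two-level directory can be sketched as below. It is a simplified illustration: the block lengths are rounded so that the large block length is a multiple of the small one (an assumption made here to keep indexing simple), and the remaining rank inside a small block is counted directly, whereas an O(1)-time implementation would use a lookup table or a popcount instruction. The class name is hypothetical.

```python
import math


class RankDirectory:
    """Two-level rank index over a bit string given as a str of '0'/'1'."""

    def __init__(self, bits):
        self.bits = bits
        log_n = max(2, int(math.log2(max(len(bits), 2))))
        self.s = log_n // 2                                  # small block length, about (log n)/2
        self.l = self.s * max(1, log_n * log_n // self.s)    # large block length, about log^2 n
        self.RL = []   # number of 1s before each large block
        self.Rs = []   # number of 1s before each small block, relative to its large block
        total = rel = 0
        for start in range(0, len(bits) + 1, self.s):
            if start % self.l == 0:      # a new large block begins here
                self.RL.append(total)
                rel = 0
            self.Rs.append(rel)
            ones = bits[start:start + self.s].count("1")
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive)."""
        i = max(0, min(i, len(self.bits)))
        small = i // self.s
        remaining = self.bits[small * self.s:i].count("1")   # inside one small block
        return self.RL[i // self.l] + self.Rs[small] + remaining


rd = RankDirectory("0110011100")
r = rd.rank1(8)   # 5, matching Rank1(B,8) in the rank/select example
```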
33. Recovering A[k] from variable-length array
• A[k] is recovered by three steps:
1. Start position s: If k=1 s=1, else s = select1(P,k-1) + 1
2. End position e: e = select1(P,k)
3. Conversion: Convert substring R[s,e] to an integer
• O(1)-time
Ex) k=3
1. s = select1(P,2)+1 = 4
2. e = select1(P,3) = 7
3. Convert R[4,7]=1000 to the integer 8