1. B-Tree Lexicon, Min-Heaps
Kira Radinsky
Min-Heap slides are courtesy of Aya Soffer and David Carmel,
IBM Haifa Research Lab
2. 2 November 2010 236621 Search Engine Technology 2
The Lexicon as a B-Tree
• B-Tree: a balanced tree that is optimized for disk I/O, holding key/value
pairs
• Branching is defined by a min-degree parameter t, t > 1
– t is chosen according to the size of a disk block
• Any internal node other than the root has at least t and at most 2t
children; the root has either no children, or at least two and at most 2t
children
• Any internal node with k children also stores k-1 keys which serve as
separator values: separator j is larger than the keys of subtree j and
smaller than the keys of subtree j+1
• Leaf nodes, like all nodes, store at most 2t-1 key/value pairs
– When not the root, store at least t-1 key/value pairs
• Lookup, insertion and deletion operations on a B-Tree take time linear in its
height, which is logarithmic (base t) in the number of keys
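The logarithmic height bound can be made explicit with the standard counting argument. In the sketch below, n (number of keys) and h (height, with the root at depth 0) are symbols introduced here for illustration; the tree is assumed to be non-empty.

```latex
% Fewest keys a B-Tree of height h and min-degree t can hold:
% the root has at least 1 key, depth d >= 1 has at least 2 t^{d-1} nodes,
% and each of those nodes has at least t-1 keys.
n \;\ge\; 1 + (t-1)\sum_{d=1}^{h} 2\,t^{d-1}
  \;=\; 1 + 2\,(t^{h} - 1)
  \;=\; 2t^{h} - 1
\quad\Longrightarrow\quad
h \;\le\; \log_t \frac{n+1}{2}
```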
B-Tree Lexicon - Example
• t=2
• Each key is associated with a value that contains a DF and
a pointer to the postings list (dashed line)
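[Figure: B-Tree with t=2. Root keys: gets, more (numbers shown: 1, 2). Leaves, left to right: and, as, bad (3, 1, 2); good, is, it (2, 1, 2); the, ugly (1, 2).]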
B-Tree Lookup
Looking up the value associated with key x:
1. current_node ← root
2. Let k1<k2<…<km be the keys of current_node
3. if x ∈ {k1,k2,…,km}, we’re done: return the associated value
4. else, if current_node is a leaf node, return null
5. else, let j be the smallest index s.t. x<kj (j = m+1 if x>km);
– current_node ← j’th subtree, and goto 2
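The lookup steps above can be sketched in Python roughly as follows. The `BTreeNode` class, the `btree_lookup` function and the placeholder values are assumptions made for illustration, and the nodes live in memory, whereas a real lexicon would read each node from a disk block. The example tree reuses the keys of the t=2 example.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class BTreeNode:
    # Hypothetical in-memory node; a real lexicon would load it from a disk block.
    keys: List[str]                    # sorted keys k1 < k2 < ... < km
    values: List[Any]                  # values[i] is the value stored with keys[i]
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf


def btree_lookup(root: BTreeNode, x: str) -> Optional[Any]:
    """Return the value associated with key x, or None if x is absent."""
    node = root
    while True:
        # First position whose key is >= x (binary search within the node).
        j = bisect_left(node.keys, x)
        if j < len(node.keys) and node.keys[j] == x:
            return node.values[j]      # step 3: key found in this node
        if not node.children:
            return None                # step 4: reached a leaf without finding x
        node = node.children[j]        # step 5: descend into the j'th subtree


def _leaf(*keys: str) -> BTreeNode:
    # Placeholder values; in the lexicon each value would hold a DF and a postings pointer.
    return BTreeNode(list(keys), [f"value-of-{k}" for k in keys])


lexicon = BTreeNode(
    keys=["gets", "more"],
    values=["value-of-gets", "value-of-more"],
    children=[_leaf("and", "as", "bad"), _leaf("good", "is", "it"), _leaf("the", "ugly")],
)
```

Using binary search inside a node is a design choice of this sketch; a linear scan over the at most 2t-1 keys matches the steps equally well.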
Top-r Document Selection
Problem definition: Given a set A of scored documents,
select the r documents with the highest scores in A and
return them in decreasing relevance order
• Naïve method: sort the set A by score
– If |A|=M, time complexity is O(M logM)
• Better approach: since typically r<<M, selecting the r
top scores can be done in O(M+r log M) time using a
heap:
1. Heapify the set of M scores (about 2M comparisons) so that the
top score is at the root
2. Repeatedly extract the heap’s root (r times), each time fixing
the heap in O(logM)
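A minimal sketch of this heapify-then-extract approach, using Python's standard `heapq` module. Since `heapq` implements a min-heap, scores are negated so that the top score sits at the root. The function name and the `(score, doc_id)` pair representation are assumptions for illustration.

```python
import heapq
from typing import List, Tuple


def top_r_by_full_heap(scored: List[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs in decreasing score order."""
    # Negate scores: heapq keeps the smallest item at the root.
    heap = [(-score, doc) for score, doc in scored]
    heapq.heapify(heap)                       # step 1: O(M)
    result = []
    for _ in range(min(r, len(heap))):        # step 2: r extractions, O(log M) each
        neg_score, doc = heapq.heappop(heap)
        result.append((-neg_score, doc))
    return result
```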
The Heap Data Structure - Reminder
• A binary heap is a (mostly full) binary tree with values
stored at all leaves and internal nodes, and an ordering
rule that requires values to be non-decreasing
(alternatively, non-increasing) along each path from a leaf
to the root
– Largest/smallest value is at the root
• Heap implemented in an Array:
– Root at index 1
– For node at index i, left child is at index 2i and right child at index
2i+1
– Thus the parent of the node at index i is at index ⌊i/2⌋ (integer division)
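The index arithmetic can be sketched as below. To keep the root at index 1, slot 0 of the list is left unused; the helper names and the `is_max_heap` check are assumptions made for illustration.

```python
from typing import List, Optional


def left(i: int) -> int:
    return 2 * i


def right(i: int) -> int:
    return 2 * i + 1


def parent(i: int) -> int:
    return i // 2          # floor of i/2


def is_max_heap(a: List[Optional[int]]) -> bool:
    """a[0] is an unused placeholder; a[1:] holds the heap, root at index 1."""
    # Every non-root node must be no larger than its parent.
    return all(a[parent(i)] >= a[i] for i in range(2, len(a)))
```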
Extracting the Top Element
• Remove the largest item r times
• Each time:
– Remove the largest item – the root of the heap
– Replace it with the last element of the heap
– Sift the new root down until restoring order
• Example
– Remove item 23 from the root
– Last item in array 5 (at location 10) replaces it
– Reinstate heap order: in the worst case 5 is sifted all the way
back down the tree, so the number of swaps is bounded
by log(size of heap)
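A sketch of the extraction step on a 1-based array max-heap. The function name is an assumption, and the sample heap is made up for illustration: only the values 23 (root), 17 (its larger child) and 5 (last item, at location 10) come from the example.

```python
from typing import List, Optional


def extract_max(a: List[Optional[int]]) -> int:
    """Remove and return the root of a max-heap stored in a[1:] (a[0] is unused)."""
    top = a[1]
    last = a.pop()                 # take the last element of the heap
    n = len(a) - 1                 # number of items left in the heap
    if n >= 1:
        a[1] = last                # the last element replaces the root
        i = 1
        while 2 * i <= n:          # sift down while node i has at least one child
            c = 2 * i
            if c + 1 <= n and a[c + 1] > a[c]:
                c += 1             # pick the larger of the two children
            if a[i] >= a[c]:
                break              # heap order restored
            a[i], a[c] = a[c], a[i]
            i = c
    return top


# Made-up heap of 10 items with 23 at the root and 5 at location 10.
heap = [None, 23, 17, 14, 15, 13, 8, 4, 2, 9, 5]
largest = extract_max(heap)
```

Each loop iteration moves one level down the tree, which is where the log(size of heap) bound on the number of swaps comes from.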
Heap Example (cont.)
To restore order at the top level
of the tree, item 17, the larger of
the 2 children of the root, must be
swapped with 5.
This limits the order violation to
the left sub-tree.
[Figure: the heap after 5 replaces the root, before it is swapped with its larger child, 17]
The process is repeated until heap order is restored
Top-r Selection Using a Min-Heap
• The selection problem can be solved by a heap that stores
the smallest item at the root: min-heap
• A min-heap of r items is held instead of a max-heap of M,
so memory use drops from O(M) to O(r)
• Process the M scores, storing in the min-heap the r largest
values seen so far
– First r values are heapified in O(r) comparisons
– Replace the smallest value in the min-heap (the rth largest)
whenever a larger value is found
• Sort the r highest values in descending order and return
the corresponding documents – O(r log r)
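The procedure above can be sketched with Python's standard `heapq` module, which is itself a min-heap. The function name and the `(score, doc_id)` pair representation are assumptions for illustration.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def top_r_min_heap(scored: Iterable[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs in decreasing score order."""
    it = iter(scored)
    heap = list(islice(it, r))     # the first r values
    heapq.heapify(heap)            # O(r); the smallest of them is at heap[0]
    for item in it:                # the remaining M - r values
        # The root is the r-th largest seen so far; replace it only if beaten.
        if heap and item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)   # pop the root and push item, O(log r)
    return sorted(heap, reverse=True)       # final O(r log r) sort
```

The standard library also offers `heapq.nlargest(r, scored)`, which returns the same kind of result and can serve as a cross-check.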
Min-Heap Processing - Illustration
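[Figure: the score sequence split into a processed part and an unprocessed part; a min-heap holds the r largest items processed so far, and the smallest value is discarded when a larger one arrives]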
Top-r Selection Using a Min-Heap:
Complexity Analysis
• Worst case: the scores are already in increasing order
– Each of the M-r last values is inserted into the heap
– Furthermore, it percolates to the bottom of the heap
– Complexity is O( (M-r)*log(r) )
• Average case – the scores arrive in a permutation of size
M chosen uniformly at random
– The expected number of times one of the M-r last values is
inserted into the heap is ~ r*ln(M/r)
– Each insertion costs O(log(r))
– Complexity is O( M + r*log(r)*log(M/r) ), where the M term accounts
for comparing every score with the heap's root
• Proof on the board
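One standard way to obtain the r*ln(M/r) estimate is sketched below; it is not necessarily the proof given in the lecture. It assumes distinct scores, and H_k denotes the k-th harmonic number (a symbol introduced here). For i > r, the i-th score enters the heap exactly when it is among the r largest of the first i scores, which under a uniformly random order happens with probability r/i.

```latex
\mathbb{E}[\#\text{insertions}]
  \;=\; \sum_{i=r+1}^{M} \frac{r}{i}
  \;=\; r\,(H_M - H_r)
  \;\approx\; r \ln\frac{M}{r}
```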