Slide 1
Text Indexing
Class: Algorithmic Methods of Data Mining
Program: M.Sc. in Data Science
University: Sapienza University of Rome
Semester: Fall 2015
Lecturer: Carlos Castillo http://chato.cl/
Sources:
● Gonzalo Navarro: “Indexing and Searching.” Chapter 9 in Modern Information Retrieval, 2nd Edition, 2011. [slides]
● Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: “Introduction to Information Retrieval”, 2008. [link]
Slide 3
Search by keywords
● Given a set of keywords
● Find documents containing all keywords
● Each keyword may be in millions of documents
● Hundreds of queries per second
Slide 4
Indexing the documents helps
● For an Information Retrieval system that uses an index,
efficiency means:
– Indexing time: Time needed to build the index
– Indexing space: Space used during the generation of the index
– Index storage: Space required to store the index
– Query latency: Time interval between the arrival of the query
and the generation of the answer
– Query throughput: Average number of queries processed per
second
● We assume a static or semi-static collection
Slide 5
Inverted index
● The index we have so far:
– Given a document ID
– Return the words in the document
● The index we want:
– Given a word
– Return the IDs of documents containing that word
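This word → document-IDs mapping can be sketched in a few lines of Python (the document contents and IDs below are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "global warming", 2: "global climate", 3: "climate change"}
index = build_inverted_index(docs)
# index["global"] == [1, 2]; index["climate"] == [2, 3]
```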
Slide 10
Full inverted index
(single document, character level)
● Allows us to answer phrase and proximity
queries, e.g. “theory * practice” or “difference
between theory and practice”
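A minimal sketch of a positional index for one document, answering exact phrase queries (word-level positions rather than the character-level addressing of the slide, to keep the example short):

```python
from collections import defaultdict

def build_positional_index(text):
    """Word -> sorted list of its positions (word offsets) in one document."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):
        index[word].append(pos)
    return index

def phrase_occurs(index, phrase):
    """True if the words of `phrase` appear consecutively in the document."""
    words = phrase.split()
    if any(w not in index for w in words):
        return False
    # Keep only start positions consistent with every word's offset.
    starts = set(index[words[0]])
    for offset, word in enumerate(words[1:], start=1):
        starts &= {p - offset for p in index[word]}
    return bool(starts)

idx = build_positional_index("the difference between theory and practice")
# phrase_occurs(idx, "theory and practice") -> True
```

Proximity queries such as “theory * practice” follow the same idea, but allow a bounded distance between positions instead of requiring consecutive offsets.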
Slide 15
Try it
d1: “global warming”
d2: “global climate”
d3: “climate change”
d4: “warm climate”
d5: “global village”
Build an inverted index with word
addressing for these documents
Consider “warm” and “warming” as a
single term “warm”
Verify: the third posting list (with terms in alphabetical order) has 3 documents
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_01_answer.txt
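One way to check your answer programmatically (simplified to document addressing only; the word positions asked for in the exercise are omitted for brevity):

```python
from collections import defaultdict

docs = {"d1": "global warming", "d2": "global climate", "d3": "climate change",
        "d4": "warm climate", "d5": "global village"}

index = defaultdict(list)
for doc_id, text in docs.items():
    for word in text.split():
        word = "warm" if word == "warming" else word  # merge "warming" into "warm"
        index[word].append(doc_id)

terms = sorted(index)   # change, climate, global, village, warm
# terms[2] == "global", whose posting list has 3 documents: d1, d2, d5
```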
Slide 16
Searching time
● Assuming the vocabulary fits in main memory and the
query has m terms, looking up the terms is O(m)
● The time is dominated by merging the lists of
the words
● Merging is fast if lists are sorted
– At most n1 + n2 comparisons where n1 and n2 are
the sizes of the posting lists
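The standard merge of two sorted posting lists can be sketched as follows; each loop iteration does one comparison and advances at least one pointer, giving at most n1 + n2 steps:

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists by merging."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# intersect([1, 3, 12], [1, 9, 12]) -> [1, 12]
```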
Slide 17
Example
● Documents containing “syria”
– 1, 3, 12, 15, 19, 20, 34, 90, 96
● Documents containing “russia”
– 1, 9, 10, 18, 19, 24, 35, 90, 101
What should we do if one of the posting lists is very small
compared to the other?
What should we do if there are more than 2 posting lists?
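One common answer to the second question (following Manning et al.): intersect the lists from shortest to longest, so the intermediate result, which can only shrink, stays small. (For the first question, with one list much smaller than the other, binary-searching each element of the small list in the large one can beat a full merge.) A sketch of the multi-list case:

```python
def intersect(p1, p2):
    """Pairwise merge of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_many(lists):
    """Process lists from shortest to longest so intermediate results stay small."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
        if not result:          # early exit: intersection already empty
            break
    return result
```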
Slide 18
Skip lists in indexing
● “Skips” are special shortcuts in the list
● Useful to avoid certain comparisons
● A good strategy is √n evenly-spaced skips for a list of size n
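With posting lists stored as arrays, √n-spaced skip pointers can be emulated during the merge: when one list's current element is smaller, try jumping ahead by the skip length instead of advancing one position at a time. A sketch:

```python
import math

def intersect_with_skips(p1, p2):
    """Merge two sorted posting lists using sqrt(len)-spaced skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip only if it does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```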
Slide 19
Compressing inverted indexes
● Documents containing “robot”
– 1, 3, 12, 15, 19, 20, 24
● Since the list is sorted in ascending order, it can be encoded
as (smaller) gaps
– 1, +2, +9, +3, +4, +1, +4
● Gaps are small for frequent words and large for
infrequent words
● Thus, compression can be obtained by encoding
small values with shorter codes
Slide 23
Elias-γ coding
Number (decimal) Binary (16 bits) Unary Elias-γ
1 0000000000000001 0 0
2 0000000000000010 10 100
3 0000000000000011 110 101
4 0000000000000100 1110 11000
5 0000000000000101 11110 11001
6 0000000000000110 111110 11010
7 0000000000000111 1111110 11011
8 0000000000001000 11111110 1110000
9 0000000000001001 111111110 1110001
10 0000000000001010 1111111110 1110010
In practice, indexing with this coding uses about 1/5 of the
space in TREC-3 (a collection of about 1GB of text)
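An encoder for the convention used in the table above (k ones and a terminating zero, then the k low-order bits of the number, with k = ⌊log₂ n⌋) can be sketched as:

```python
def elias_gamma(n):
    """Elias-gamma code as in the table: unary(k) then k remainder bits."""
    assert n >= 1
    k = n.bit_length() - 1                  # floor(log2 n)
    unary = "1" * k + "0"                   # length prefix
    rest = format(n - (1 << k), "0%db" % k) if k else ""
    return unary + rest

# elias_gamma(1) -> "0"; elias_gamma(4) -> "11000"; elias_gamma(10) -> "1110010"
```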
Slide 24
Try it
Encode the list 1, 5, 14 using:
● Standard binary coding (8 bits)
● Gap encoding in binary (8 bits)
● Gap encoding in unary
● Gap encoding in Elias-γ coding
Which one is shorter?
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_02_answer.txt