Slide 1
Text Indexing
Class: Algorithmic Methods of Data Mining
Program: M.Sc. in Data Science
University: Sapienza University of Rome
Semester: Fall 2015
Lecturer: Carlos Castillo http://chato.cl/
Sources:
● Gonzalo Navarro: “Indexing and Searching.” Chapter 9 in Modern Information Retrieval, 2nd Edition, 2011. [slides]
● Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: “Introduction to Information Retrieval”, 2008. [link]
Slide 3
Search by keywords
● Given a set of keywords
● Find documents containing all keywords
● Each keyword may be in millions of documents
● Hundreds of queries per second
Slide 4
Indexing the documents helps
● For an Information Retrieval system that uses an index,
efficiency means:
– Indexing time: Time needed to build the index
– Indexing space: Space used during the generation of the index
– Index storage: Space required to store the index
– Query latency: Time interval between the arrival of the query
and the generation of the answer
– Query throughput: Average number of queries processed per
second
● We assume a static or semi-static collection
Slide 5
Inverted index
● The index we have so far:
– Given a document ID
– Return the words in the document
● The index we want:
– Given a word
– Return the IDs of documents containing that word
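This word → document-IDs mapping can be sketched in a few lines of Python (the document contents and IDs below are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "global warming", 2: "global climate", 3: "climate change"}
index = build_inverted_index(docs)
# index["global"] == [1, 2]; index["climate"] == [2, 3]
```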
Slide 10
Full inverted index
(single document, character level)
● Allows us to answer phrase and proximity
queries, e.g. “theory * practice” or “difference
between theory and practice”
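A minimal sketch of a positional index for one document, answering exact phrase queries (word-level positions rather than the character-level addressing of the slide, to keep the example short):

```python
from collections import defaultdict

def build_positional_index(text):
    """Word -> sorted list of its positions (word offsets) in one document."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):
        index[word].append(pos)
    return index

def phrase_occurs(index, phrase):
    """True if the words of `phrase` appear consecutively in the document."""
    words = phrase.split()
    if any(w not in index for w in words):
        return False
    # Keep only start positions consistent with every word's offset.
    starts = set(index[words[0]])
    for offset, word in enumerate(words[1:], start=1):
        starts &= {p - offset for p in index[word]}
    return bool(starts)

idx = build_positional_index("the difference between theory and practice")
# phrase_occurs(idx, "theory and practice") -> True
```

Proximity queries such as “theory * practice” follow the same idea, but allow a bounded distance between positions instead of requiring consecutive offsets.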
Slide 15
Try it
d1: “global warming”
d2: “global climate”
d3: “climate change”
d4: “warm climate”
d5: “global village”
Build an inverted index with word
addressing for these documents
Consider “warm” and “warming” as a
single term “warm”
Verify: the third posting list (with terms in alphabetical order) has 3 documents
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_01_answer.txt
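One way to check your answer programmatically (simplified to document addressing only; the word positions asked for in the exercise are omitted for brevity):

```python
from collections import defaultdict

docs = {"d1": "global warming", "d2": "global climate", "d3": "climate change",
        "d4": "warm climate", "d5": "global village"}

index = defaultdict(list)
for doc_id, text in docs.items():
    for word in text.split():
        word = "warm" if word == "warming" else word  # merge "warming" into "warm"
        index[word].append(doc_id)

terms = sorted(index)   # change, climate, global, village, warm
# terms[2] == "global", whose posting list has 3 documents: d1, d2, d5
```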
Slide 16
Searching time
● Assuming the vocabulary fits in main memory and the
query has m terms, looking up the terms is O(m)
● The time is dominated by merging the lists of
the words
● Merging is fast if lists are sorted
– At most n1 + n2 comparisons where n1 and n2 are
the sizes of the posting lists
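The standard merge of two sorted posting lists can be sketched as follows; each loop iteration does one comparison and advances at least one pointer, giving at most n1 + n2 steps:

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists by merging."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# intersect([1, 3, 12], [1, 9, 12]) -> [1, 12]
```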
Slide 17
Example
● Documents containing “syria”
– 1, 3, 12, 15, 19, 20, 34, 90, 96
● Documents containing “russia”
– 1, 9, 10, 18, 19, 24, 35, 90, 101
What should we do if one of the posting lists is very small
compared to the other?
What should we do if there are more than 2 posting lists?
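One common answer to the second question (following Manning et al.): intersect the lists from shortest to longest, so the intermediate result, which can only shrink, stays small. (For the first question, with one list much smaller than the other, binary-searching each element of the small list in the large one can beat a full merge.) A sketch of the multi-list case:

```python
def intersect(p1, p2):
    """Pairwise merge of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_many(lists):
    """Process lists from shortest to longest so intermediate results stay small."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
        if not result:          # early exit: intersection already empty
            break
    return result
```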
Slide 18
Skip lists in indexing
● “Skips” are special shortcuts in the list
● Useful to avoid certain comparisons
● A good strategy is √n evenly-spaced skips for a list of size n
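With posting lists stored as arrays, √n-spaced skip pointers can be emulated during the merge: when one list's current element is smaller, try jumping ahead by the skip length instead of advancing one position at a time. A sketch:

```python
import math

def intersect_with_skips(p1, p2):
    """Merge two sorted posting lists using sqrt(len)-spaced skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip only if it does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```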
Slide 19
Compressing inverted indexes
● Documents containing “robot”
– 1, 3, 12, 15, 19, 20, 24
● Since the list is sorted in ascending order, it can be encoded
as (smaller) gaps
– 1, +2, +9, +3, +4, +1, +4
● Gaps are small for frequent words and large for
infrequent words
● Thus, compression can be obtained by encoding
small values with shorter codes
Slide 23
Elias-γ coding
Number (decimal) Binary (16 bits) Unary Elias-γ
1 0000000000000001 0 0
2 0000000000000010 10 100
3 0000000000000011 110 101
4 0000000000000100 1110 11000
5 0000000000000101 11110 11001
6 0000000000000110 111110 11010
7 0000000000000111 1111110 11011
8 0000000000001000 11111110 1110000
9 0000000000001001 111111110 1110001
10 0000000000001010 1111111110 1110010
In practice, indexing with this coding uses about 1/5 of the
space in TREC-3 (a collection of about 1GB of text)
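An encoder for the convention used in the table above (k ones and a terminating zero, then the k low-order bits of the number, with k = ⌊log₂ n⌋) can be sketched as:

```python
def elias_gamma(n):
    """Elias-gamma code as in the table: unary(k) then k remainder bits."""
    assert n >= 1
    k = n.bit_length() - 1                  # floor(log2 n)
    unary = "1" * k + "0"                   # length prefix
    rest = format(n - (1 << k), "0%db" % k) if k else ""
    return unary + rest

# elias_gamma(1) -> "0"; elias_gamma(4) -> "11000"; elias_gamma(10) -> "1110010"
```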
Slide 24
Try it
Encode the list 1, 5, 14 using:
● Standard binary coding (8 bits)
● Gap encoding in binary (8 bits)
● Gap encoding in unary
● Gap encoding in Elias-γ coding
Which one is shorter?
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_02_answer.txt