Inverted index
 Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in
order to speed up the searching task.
 Structure of an inverted file:
◦ Vocabulary: the set of all distinct words in the text
◦ Occurrences: for each word of the vocabulary, a list holding all the necessary information (text positions, frequency, documents where the word appears, etc.)
 An inverted file index is a list of the terms that appear in the document collection (called a lexicon or vocabulary). For each term in the lexicon, it stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
 The granularity of an index determines how accurately it represents the location of a word
◦ A coarse-grained index requires less storage but more query processing to eliminate false matches
◦ A word-level index enables queries involving adjacency and proximity, but has higher space requirements
[Figure: the vocabulary lists the indexed terms, each with its number of occurrences; the posting file holds the occurrences lists. The vocabulary could be a tree-like structure.]
 Text (the numbers are the 1-based character positions at which the words start):
1 6 12 16 18 26 30 37 41 46 55 59 67 71
That house has a garden. The garden has many flowers. The flowers are beautiful
 Inverted file:
Vocabulary    Occurrences
beautiful     71
flowers       46, 59
garden        18, 30
house         6
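As a minimal sketch of the example above (not code from the slides), the following Python builds a word-level inverted file for the sample sentence. The function name `build_positional_index` and the stopword set are assumptions made for illustration. Positions are 1-based character offsets computed from the string exactly as written, so punctuation and spaces count as characters.

```python
import re
from collections import defaultdict

# Hypothetical stopword list: the example indexes only the four content words.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_positional_index(text):
    """Map each non-stopword to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in STOPWORDS:
            index[word].append(match.start() + 1)  # 0-based -> 1-based
    # Return the vocabulary in lexicographical order.
    return dict(sorted(index.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_positional_index(text)
for word, positions in index.items():
    print(word, positions)
```

Storing full positions is what makes adjacency and proximity queries possible; a coarser index would keep only a document or block number per occurrence.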
 The prior example supports Boolean queries.
 For ranked retrieval, the index also needs the document frequency and the term frequency.
Vocabulary entry    Posting file entry
k, d_k              doc_1, f_1k, doc_2, f_2k, …
d_k : document frequency of term k
doc_i : i-th document that contains term k
f_ik : term frequency of term k in document i
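The entry layout above can be sketched in Python as follows. This is an illustration under assumptions, not the slides' implementation: the function name `build_frequency_index`, the toy documents, and the use of already tokenized input are all hypothetical.

```python
from collections import Counter, defaultdict

def build_frequency_index(docs):
    """docs: dict doc_id -> list of tokens.

    Returns term -> (d_k, [(doc_id, f_ik), ...]), where d_k is the document
    frequency and f_ik the term frequency, with postings sorted by doc_id.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the term frequency of every term in this document.
        for term, freq in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, freq))
    return {term: (len(plist), plist) for term, plist in sorted(postings.items())}

# Toy collection, invented for this example.
docs = {
    1: ["house", "garden"],
    2: ["garden", "flowers", "garden"],
    3: ["flowers", "beautiful"],
}
index = build_frequency_index(docs)
d_k, plist = index["garden"]
print(d_k, plist)
```

The document frequency is simply the length of the posting list, so it could be recomputed on demand; storing it in the vocabulary entry avoids reading the list when only d_k is needed.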
 The space required for the vocabulary is rather small. According to Heaps' law the vocabulary grows as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
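To make the growth rate concrete, here is a small sketch that evaluates Heaps' law in the form V = k · n^β. The constant `k` and the specific values used below are illustrative assumptions, not measurements from any collection.

```python
def heaps_vocabulary_size(n, k, beta):
    """Estimate the vocabulary size of a text of size n: V = k * n**beta."""
    return k * n ** beta

# k = 10 and beta = 0.5 are illustrative values only.
small = heaps_vocabulary_size(1_000_000, k=10, beta=0.5)
large = heaps_vocabulary_size(4_000_000, k=10, beta=0.5)
print(small, large, large / small)
```

With β = 0.5, quadrupling the text only doubles the estimated vocabulary, which is why the vocabulary stays small relative to the occurrences.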
 On the other hand, the occurrences demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
 To reduce space requirements, a technique
called block addressing is used
 The text is divided into blocks
 The occurrences point to the blocks where the word appears
 Advantages:
◦ pointers are smaller, because there are fewer blocks than text positions
◦ all the occurrences of a word inside a single block are collapsed to one reference
 Disadvantages:
◦ if exact positions are required, an online search over the qualifying blocks is needed
 Text (divided into four blocks, Block 1 to Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful
 Inverted file:
Vocabulary    Occurrences (block numbers)
beautiful     4
flowers       3
garden        2
house         1
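A minimal sketch of block addressing for this example follows. Two things are assumptions made for illustration: the block boundaries (four words per block, which reproduces the block numbers shown) and the stopword set. Real systems would more likely use fixed-size blocks measured in bytes.

```python
import re
from collections import defaultdict

# Hypothetical stopword list: the example indexes only the four content words.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_block_index(text, words_per_block):
    """Map each non-stopword to the sorted list of 1-based blocks containing it."""
    index = defaultdict(list)
    tokens = re.findall(r"[A-Za-z]+", text)
    for i, token in enumerate(tokens):
        block = i // words_per_block + 1  # 1-based block number
        word = token.lower()
        if word in STOPWORDS:
            continue
        # Repeated occurrences inside one block collapse to a single reference.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(sorted(index.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
block_index = build_block_index(text, words_per_block=4)
for word, blocks in block_index.items():
    print(word, blocks)
```

Both occurrences of "garden" fall in the second block, so the list holds one reference instead of two; finding the exact positions would require scanning that block's text.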
 How big are inverted files?
◦ Index size as a percentage of the original collection size
 in each pair, the left value is for an index that removes stopwords and the right value is for an index that includes them
 Block addressing requires the text to be available in order to locate terms within a block.
Index                    Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
Addressing words         45%    73%                36%    64%                   35%    63%
Addressing 64K blocks    27%    41%                18%    32%                   5%     9%
Addressing 256 blocks    18%    25%                1.7%   2.4%                  0.5%   0.7%
 The search algorithm on an inverted index follows three steps:
1. Vocabulary search: the words present in the query are located in the vocabulary
2. Retrieval of occurrences: the lists of occurrences of all query words found are retrieved
3. Manipulation of occurrences: the occurrences are processed to solve the query
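The three steps can be sketched for a conjunctive (AND) query as follows. The function name `search_and` and the document-level toy index are assumptions for illustration; phrase or proximity queries would need positional lists and a different third step.

```python
def search_and(vocabulary, query_words):
    """Return the sorted ids of documents that contain every query word.

    vocabulary: dict word -> sorted list of document ids.
    """
    if not query_words:
        return []
    # 1. Vocabulary search: locate each query word in the vocabulary.
    found = [w for w in query_words if w in vocabulary]
    if len(found) < len(query_words):
        return []  # a missing word makes a conjunctive query empty
    # 2. Retrieval of occurrences: fetch the list of each word found.
    lists = [vocabulary[w] for w in found]
    # 3. Manipulation of occurrences: intersect, starting with the shortest list.
    lists.sort(key=len)
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
    return sorted(result)

# Toy document-level index, invented for this example.
vocabulary = {"garden": [1, 2], "flowers": [2, 3], "house": [1]}
print(search_and(vocabulary, ["garden", "flowers"]))
```

Starting from the shortest list keeps the intermediate result small, a common heuristic for intersections.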
 Searching an inverted file starts with the vocabulary
◦ the vocabulary is stored in a separate file
 Structures used to store the vocabulary include
◦ Hashing: O(1) lookup, does not support range queries
◦ Tries: O(c) lookup, c = length of the word
◦ B-trees: O(log v) lookup, v = size of the vocabulary
 An alternative is simply storing the words in lexicographical order and using binary search
◦ cheaper in space and very competitive, with O(log v) cost
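A sketch of the sorted-vocabulary alternative, using binary search from Python's standard `bisect` module. The helper names `lookup` and `prefix_range` are hypothetical. The prefix search illustrates the kind of range query that a sorted structure supports and plain hashing does not.

```python
from bisect import bisect_left

def lookup(sorted_words, word):
    """Binary search: index of word in sorted_words, or -1 if absent. O(log v)."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return -1

def prefix_range(sorted_words, prefix):
    """All words starting with prefix, found as one contiguous range."""
    lo = bisect_left(sorted_words, prefix)
    hi = lo
    while hi < len(sorted_words) and sorted_words[hi].startswith(prefix):
        hi += 1
    return sorted_words[lo:hi]

words = ["beautiful", "flowers", "garden", "house"]
print(lookup(words, "garden"), lookup(words, "tree"), prefix_range(words, "f"))
```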
 The whole vocabulary is kept in a suitable data structure that stores, for each word, a list of its occurrences
 Each word of each text in the corpus is read and searched for in the vocabulary
 If it is not found, it is added to the vocabulary with an empty list of occurrences
 In either case, the new position is added to the end of the word's list of occurrences
 Once the text is exhausted the vocabulary is
written to disk with the list of occurrences.
 Two files are created:
◦ in the first file, each list of word occurrences is
stored contiguously
◦ in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
 The overall process is O(n) worst-case time
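The two-file layout can be sketched as below, with byte strings standing in for the files on disk. The fixed 4-byte integer encoding, the function names `write_index` and `read_list`, and the toy data are assumptions for illustration; real indexes typically compress the lists.

```python
import struct

def write_index(index):
    """index: dict word -> list of positions.

    Returns (postings, vocabulary): postings is the content of the first file,
    with every list stored contiguously; vocabulary is the content of the
    second file, a lexicographically sorted list of (word, byte_offset, count).
    """
    postings = bytearray()
    vocabulary = []
    for word in sorted(index):
        positions = index[word]
        # The pointer is the byte offset at which this word's list starts.
        vocabulary.append((word, len(postings), len(positions)))
        postings += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(postings), vocabulary

def read_list(postings, entry):
    """Follow a vocabulary entry's pointer and decode its occurrence list."""
    _, offset, count = entry
    return list(struct.unpack_from(f"<{count}I", postings, offset))

# Toy in-memory index, invented for this example.
toy_index = {"word": [25, 33, 58], "index": [1, 40], "text": [12]}
postings, vocabulary = write_index(toy_index)
print(vocabulary)
print(read_list(postings, vocabulary[2]))
```

Only the small vocabulary needs to stay in memory at search time; each list is read from the postings file with a single seek.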
 An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index I_i obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
 Once the text is exhausted, a number of partial indices I_i exist on disk
 The partial indices are merged to obtain the final index
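[Figure: binary merging of eight partial indices; the number in parentheses is the order in which each merge is performed]
level 3 (final index):  I 1...8 (7)
level 2:                I 1...4 (3)   I 5...8 (6)
level 1:                I 1...2 (1)   I 3...4 (2)   I 5...6 (4)   I 7...8 (5)
initial dumps:          I 1   I 2   I 3   I 4   I 5   I 6   I 7   I 8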
 The total time to generate the partial indices is O(n)
 The number of partial indices is O(n/M), where M is the amount of main memory available
 Merging the O(n/M) partial indices requires log2(n/M) merging levels
 Each level takes O(n) time, so the total cost of this algorithm is O(n log(n/M))
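The dump-and-merge scheme can be sketched in memory as follows. Several things are assumptions for illustration: `max_postings` stands in for the memory limit M, Python dictionaries stand in for the partial indices on disk, and the function names are hypothetical. A real implementation would stream sorted runs from disk rather than hold them in memory.

```python
from collections import defaultdict

def build_partial_indices(tokens, max_postings):
    """Index tokens in memory and dump a partial index whenever the budget of
    max_postings entries (a stand-in for the memory limit M) is used up."""
    partials, current, used = [], defaultdict(list), 0
    for position, word in enumerate(tokens, start=1):
        current[word].append(position)
        used += 1
        if used == max_postings:
            partials.append(dict(current))  # "write to disk"
            current, used = defaultdict(list), 0  # "erase main memory"
    if used:
        partials.append(dict(current))
    return partials

def merge_two(left, right):
    """Merge two partial indices. Positions in left precede those in right,
    so concatenating the lists keeps them sorted."""
    merged = {word: list(plist) for word, plist in left.items()}
    for word, plist in right.items():
        merged.setdefault(word, []).extend(plist)
    return merged

def merge_all(partials):
    """Merge adjacent pairs level by level; returns (final_index, levels)."""
    levels = 0
    while len(partials) > 1:
        merged_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                merged_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                merged_level.append(partials[i])  # odd one out moves up as is
        partials = merged_level
        levels += 1
    return (partials[0] if partials else {}), levels

tokens = "a b a c b a d a".split()
partials = build_partial_indices(tokens, max_postings=1)
final_index, levels = merge_all(partials)
print(len(partials), levels, final_index)
```

With eight initial dumps the loop runs three levels, matching the log2(n/M) merging levels stated above.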
 Inverted files are used to index text
 The indices are appropriate when the
text collection is large and semi-static
 If the text collection is volatile, online searching is the only option
 Some techniques combine online and
indexed searching
 Vocabulary List
◦ Text preprocessing modules
 lexical analysis, stemming, stopwords
 Occurrences of Vocabulary Terms
◦ Inverted index creation
 term frequency in documents, document frequency
 Retrieval and Ranking Algorithm
 Query and Ranking Interfaces
 Browsing/Visualization Interface
