SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
1
Text Indexing
Class Algorithmic Methods of Data Mining
Program M. Sc. Data Science
University Sapienza University of Rome
Semester Fall 2015
Lecturer Carlos Castillo http://chato.cl/
Sources:
● Gonzalo Navarro: “Indexing and Searching.” Chapter 9 in
Modern Information Retrieval, 2nd
Edition. 2011. [slides]
● Christopher D. Manning, Prabhakar Raghavan & Hinrich
Schütze: “Introduction to Information Retrieval”. 2008 [link]
2
Index by document ID
BBC20151015001
CNN20150809002
AFP20130917001
RTE20151019001
TVE20140914001
BBC20151015001
CNN20150809002
AFP20130917001
RTE20151019001
TVE20140914001
File 2 Doc 2
File 1 Doc 1
File 1 Doc 2
File 1 Doc 3
File 2 Doc 1
Document identifiers Physical locations
3
Search by keywords
● Given a set of keywords
● Find documents containing all keywords
● Each keyword may be in millions of documents
● Hundreds of queries per second
4
Indexing the documents helps
● For an Information Retrieval system that uses an index,
efficiency means:
– Indexing time: Time needed to build the index
– Indexing space: Space used during the generation of the index
– Index storage: Space required to store the index
– Query latency: Time interval between the arrival of the query
and the generation of the answer
– Query throughput: Average number of queries processed per
second
● We assume a static or semi-static collection
5
Inverted index
● The index we have so far:
– Given a document ID
– Return the words in the document
● The index we want:
– Given a word
– Return the IDs of documents containing that word
6
Term-document matrix
Doc frequency Term frequencies
Space inefficient: why?
7
How large is the vocabulary?
In English:
Why it is not bounded?
8
Inverted index
9
Inverted index (vocabulary)
What are the alternatives for storing the
vocabulary?
What are the trade-offs involved?
10
Full inverted index
(single document, character level)
● Allows us to answer phrase and proximity
queries, e.g. “theory * practice” or “difference
between theory and practice”
11
Full inverted index
(multiple documents, word-level)
12
Space usage of an index
● Vocabulary requires
● Occurrences require
● Address documents or words?
● Address blocks is an intermediary solution
13
Phrase search
● How do you do a phrase search with:
– Addressing document
– Addressing words
– Addressing blocks
14
Estimated sizes of indices
15
Try it
d1: “global warming”
d2: “global climate”
d3: “climate change”
d4: “warm climate”
d5: “global village”
Build an inverted index with word
addressing for these documents
Consider “warm” and “warming” as a
single term “warm”
Verify: third posting list has 3 docs
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_01_answer.txt
16
Searching time
● Assuming the vocabulary fits on main memory,
and m terms in the query, this is O(m)
● The time is dominated by merging the lists of
the words
● Merging is fast if lists are sorted
– At most n1 + n2 comparisons where n1 and n2 are
the sizes of the posting lists
17
Example
● Documents containing “syria”
– 1, 3, 12, 15, 19, 20, 34, 90, 96
● Documents containing “russia”
– 1, 9, 10, 18, 19, 24, 35, 90, 101
What should we do if one of the posting lists is very small
compared to the other?
What should we do if there are more than 2 posting lists?
18
Skip lists in indexing
● “Skips” are special shortcuts in the list
● Useful to avoid certain comparisons
● Good strategy is skips for list of size
19
Compressing inverted indexes
● Documents containing “robot”
– 1, 3, 12, 15, 19, 20, 24
● Sorted in ascending order, could encode as (smaller)
gaps
– 1, +2, +9, +3, +4, +1, +4
● Gaps are small for frequent words and large for
infrequent words
● Thus, compression can be obtained by encoding
small values with shorter codes
20
Binary coding
Number (decimal) Binary (16 bits) Unary
1 0000000000000001 0
2 0000000000000010 10
3 0000000000000011 110
4 0000000000000100 1110
5 0000000000000101 11110
6 0000000000000110 111110
7 0000000000000111 1111110
8 0000000000001000 11111110
9 0000000000001001 111111110
10 0000000000001010 1111111110
16 bits allows to encode gaps of 64K docids
21
Unary coding
Number (decimal) Binary (16 bits) Unary
1 0000000000000001 0
2 0000000000000010 10
3 0000000000000011 110
4 0000000000000100 1110
5 0000000000000101 11110
6 0000000000000110 111110
7 0000000000000111 1111110
8 0000000000001000 11111110
9 0000000000001001 111111110
10 0000000000001010 1111111110
For small gaps this saves a lot of space
22
Elias-γ coding
● Unary code for
● Binary code of length for
● Example
23
Elias-γ coding
Number (decimal) Binary (16 bits) Unary Elias-γ
1 0000000000000001 0 0
2 0000000000000010 10 100
3 0000000000000011 110 101
4 0000000000000100 1110 11000
5 0000000000000101 11110 11001
6 0000000000000110 111110 11010
7 0000000000000111 1111110 11011
8 0000000000001000 11111110 1110000
9 0000000000001001 111111110 1110001
10 0000000000001010 1111111110 1110010
In practice, indexing with this coding uses about 1/5 of the
space in TREC-3 (a collection of about 1GB of text)
24
Try it
Encode the list 1, 5, 14 using:
● Standard binary coding (8 bits)
● Gap encoding in binary (8 bits)
● Gap encoding in unary
● Gap encoding in gamma coding
Which one is shorter?
http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_02_answer.txt

Weitere ähnliche Inhalte

Was ist angesagt?

WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionAlexey Grigorev
 
Adressing modes of 8086
Adressing modes of 8086Adressing modes of 8086
Adressing modes of 8086KANNANKR12
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netStephen Lorello
 
45 Days C Language Training In Ambala! Batra Computer Centre
45 Days C Language Training In Ambala! Batra Computer Centre45 Days C Language Training In Ambala! Batra Computer Centre
45 Days C Language Training In Ambala! Batra Computer Centrejatin batra
 
Catalog enrichment: importing Dewey Decimal Classification from external sour...
Catalog enrichment: importing Dewey Decimal Classification from external sour...Catalog enrichment: importing Dewey Decimal Classification from external sour...
Catalog enrichment: importing Dewey Decimal Classification from external sour...Stefano Bargioni
 
Versioned Triple Pattern Fragments
Versioned Triple Pattern FragmentsVersioned Triple Pattern Fragments
Versioned Triple Pattern FragmentsRuben Taelman
 
Automate your PDF factsheets with xlwings Reports
Automate your PDF factsheets with xlwings ReportsAutomate your PDF factsheets with xlwings Reports
Automate your PDF factsheets with xlwings Reportsxlwings
 
Why go ?
Why go ?Why go ?
Why go ?Mailjet
 
Addressing modes of 8086
Addressing modes of 8086Addressing modes of 8086
Addressing modes of 8086Dr. AISHWARYA N
 
C Language Training in Ambala ! Batra Computer Centre
C Language Training in Ambala ! Batra Computer CentreC Language Training in Ambala ! Batra Computer Centre
C Language Training in Ambala ! Batra Computer Centrejatin batra
 
Addressing Modes Of 8086
Addressing Modes Of 8086Addressing Modes Of 8086
Addressing Modes Of 8086Ikhlas Rahman
 
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017Alexey Grigorev
 
Maurer Presentation - WARCnet Spring Meeting 2021
Maurer Presentation - WARCnet Spring Meeting 2021Maurer Presentation - WARCnet Spring Meeting 2021
Maurer Presentation - WARCnet Spring Meeting 2021WARCnet
 
Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Julien PLU
 

Was ist angesagt? (20)

Linked list
Linked listLinked list
Linked list
 
WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism Detection
 
Adressing modes of 8086
Adressing modes of 8086Adressing modes of 8086
Adressing modes of 8086
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
 
45 Days C Language Training In Ambala! Batra Computer Centre
45 Days C Language Training In Ambala! Batra Computer Centre45 Days C Language Training In Ambala! Batra Computer Centre
45 Days C Language Training In Ambala! Batra Computer Centre
 
Hadoop @ eBuddy
Hadoop @ eBuddyHadoop @ eBuddy
Hadoop @ eBuddy
 
Catalog enrichment: importing Dewey Decimal Classification from external sour...
Catalog enrichment: importing Dewey Decimal Classification from external sour...Catalog enrichment: importing Dewey Decimal Classification from external sour...
Catalog enrichment: importing Dewey Decimal Classification from external sour...
 
Learning R - Handling NetCDF files
Learning R - Handling NetCDF filesLearning R - Handling NetCDF files
Learning R - Handling NetCDF files
 
Versioned Triple Pattern Fragments
Versioned Triple Pattern FragmentsVersioned Triple Pattern Fragments
Versioned Triple Pattern Fragments
 
8086 addressing modes
8086 addressing modes8086 addressing modes
8086 addressing modes
 
Experiment no 4
Experiment no 4Experiment no 4
Experiment no 4
 
Automate your PDF factsheets with xlwings Reports
Automate your PDF factsheets with xlwings ReportsAutomate your PDF factsheets with xlwings Reports
Automate your PDF factsheets with xlwings Reports
 
Why go ?
Why go ?Why go ?
Why go ?
 
Addressing modes
Addressing modesAddressing modes
Addressing modes
 
Addressing modes of 8086
Addressing modes of 8086Addressing modes of 8086
Addressing modes of 8086
 
C Language Training in Ambala ! Batra Computer Centre
C Language Training in Ambala ! Batra Computer CentreC Language Training in Ambala ! Batra Computer Centre
C Language Training in Ambala ! Batra Computer Centre
 
Addressing Modes Of 8086
Addressing Modes Of 8086Addressing Modes Of 8086
Addressing Modes Of 8086
 
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
 
Maurer Presentation - WARCnet Spring Meeting 2021
Maurer Presentation - WARCnet Spring Meeting 2021Maurer Presentation - WARCnet Spring Meeting 2021
Maurer Presentation - WARCnet Spring Meeting 2021
 
Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?
 

Andere mochten auch

An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted indexweedge
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web searchVictor de Boer
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search enginesunyil96
 
Advanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdvanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdrian Nuta
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinxAdrian Nuta
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Information seeking
Information seekingInformation seeking
Information seekingJohan Koren
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLNguyen Van Vuong
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneJosiane Gamgo
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalidKhalid Mahmood
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityStéphane Gamard
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHPMike Lively
 

Andere mochten auch (20)

An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 
Inverted index
Inverted indexInverted index
Inverted index
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web search
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
Advanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdvanced fulltext search with Sphinx
Advanced fulltext search with Sphinx
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinx
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Information seeking
Information seekingInformation seeking
Information seeking
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQL
 
Solr
SolrSolr
Solr
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalid
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Index types
Index typesIndex types
Index types
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHP
 

Ähnlich wie Text Indexing / Inverted Indices

Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineGan Keng Hoon
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...Dimitris Kontokostas
 
Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications Victor de Boer
 
Digital History Seminar
Digital History SeminarDigital History Seminar
Digital History SeminarCDesenclos
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalrchbeir
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)Vladimir Alexiev, PhD, PMP
 
#kbdata: Exploring potential impact of technology limitations on DH research
#kbdata: Exploring potential impact of technology limitations on DH research#kbdata: Exploring potential impact of technology limitations on DH research
#kbdata: Exploring potential impact of technology limitations on DH researchJacco van Ossenbruggen
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classificationshakimov
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)Konstantinos Zagoris
 
The Dendro research data management platform: Applying ontologies to long-ter...
The Dendro research data management platform: Applying ontologies to long-ter...The Dendro research data management platform: Applying ontologies to long-ter...
The Dendro research data management platform: Applying ontologies to long-ter...João Rocha da Silva
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchersDirk Roorda
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Project
 
Introducing RDA: June 2013
Introducing RDA: June 2013Introducing RDA: June 2013
Introducing RDA: June 2013ALATechSource
 

Ähnlich wie Text Indexing / Inverted Indices (20)

Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
 
Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications
 
Digital History Seminar
Digital History SeminarDigital History Seminar
Digital History Seminar
 
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
 
Multilingual document analysis
Multilingual document analysisMultilingual document analysis
Multilingual document analysis
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
 
#kbdata: Exploring potential impact of technology limitations on DH research
#kbdata: Exploring potential impact of technology limitations on DH research#kbdata: Exploring potential impact of technology limitations on DH research
#kbdata: Exploring potential impact of technology limitations on DH research
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
The Dendro research data management platform: Applying ontologies to long-ter...
The Dendro research data management platform: Applying ontologies to long-ter...The Dendro research data management platform: Applying ontologies to long-ter...
The Dendro research data management platform: Applying ontologies to long-ter...
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Text Mining
Text MiningText Mining
Text Mining
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
 
Introducing RDA: June 2013
Introducing RDA: June 2013Introducing RDA: June 2013
Introducing RDA: June 2013
 

Mehr von Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social MediaCarlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Carlos Castillo (ChaTo)
 

Mehr von Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 

Kürzlich hochgeladen

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Kürzlich hochgeladen (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Text Indexing / Inverted Indices

  • 1. 1 Text Indexing Class Algorithmic Methods of Data Mining Program M. Sc. Data Science University Sapienza University of Rome Semester Fall 2015 Lecturer Carlos Castillo http://chato.cl/ Sources: ● Gonzalo Navarro: “Indexing and Searching.” Chapter 9 in Modern Information Retrieval, 2nd Edition. 2011. [slides] ● Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: “Introduction to Information Retrieval”. 2008 [link]
  • 2. 2 Index by document ID BBC20151015001 CNN20150809002 AFP20130917001 RTE20151019001 TVE20140914001 BBC20151015001 CNN20150809002 AFP20130917001 RTE20151019001 TVE20140914001 File 2 Doc 2 File 1 Doc 1 File 1 Doc 2 File 1 Doc 3 File 2 Doc 1 Document identifiers Physical locations
  • 3. 3 Search by keywords ● Given a set of keywords ● Find documents containing all keywords ● Each keyword may be in millions of documents ● Hundreds of queries per second
  • 4. 4 Indexing the documents helps ● For an Information Retrieval system that uses an index, efficiency means: – Indexing time: Time needed to build the index – Indexing space: Space used during the generation of the index – Index storage: Space required to store the index – Query latency: Time interval between the arrival of the query and the generation of the answer – Query throughput: Average number of queries processed per second ● We assume a static or semi-static collection
  • 5. 5 Inverted index ● The index we have so far: – Given a document ID – Return the words in the document ● The index we want: – Given a word – Return the IDs of documents containing that word
  • 6. 6 Term-document matrix Doc frequency Term frequencies Space inefficient: why?
  • 7. 7 How large is the vocabulary? In English: Why it is not bounded?
  • 9. 9 Inverted index (vocabulary) What are the alternatives for storing the vocabulary? What are the trade-offs involved?
  • 10. 10 Full inverted index (single document, character level) ● Allows us to answer phrase and proximity queries, e.g. “theory * practice” or “difference between theory and practice”
  • 11. 11 Full inverted index (multiple documents, word-level)
  • 12. 12 Space usage of an index ● Vocabulary requires ● Occurrences require ● Address documents or words? ● Address blocks is an intermediary solution
  • 13. 13 Phrase search ● How do you do a phrase search with: – Addressing document – Addressing words – Addressing blocks
  • 15. 15 Try it d1: “global warming” d2: “global climate” d3: “climate change” d4: “warm climate” d5: “global village” Build an inverted index with word addressing for these documents Consider “warm” and “warming” as a single term “warm” Verify: third posting list has 3 docs http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_01_answer.txt
  • 16. 16 Searching time ● Assuming the vocabulary fits on main memory, and m terms in the query, this is O(m) ● The time is dominated by merging the lists of the words ● Merging is fast if lists are sorted – At most n1 + n2 comparisons where n1 and n2 are the sizes of the posting lists
  • 17. 17 Example ● Documents containing “syria” – 1, 3, 12, 15, 19, 20, 34, 90, 96 ● Documents containing “russia” – 1, 9, 10, 18, 19, 24, 35, 90, 101 What should we do if one of the posting lists is very small compared to the other? What should we do if there are more than 2 posting lists?
  • 18. 18 Skip lists in indexing ● “Skips” are special shortcuts in the list ● Useful to avoid certain comparisons ● Good strategy is skips for list of size
  • 19. 19 Compressing inverted indexes ● Documents containing “robot” – 1, 3, 12, 15, 19, 20, 24 ● Sorted in ascending order, could encode as (smaller) gaps – 1, +2, +9, +3, +4, +1, +4 ● Gaps are small for frequent words and large for infrequent words ● Thus, compression can be obtained by encoding small values with shorter codes
  • 20. 20 Binary coding Number (decimal) Binary (16 bits) Unary 1 0000000000000001 0 2 0000000000000010 10 3 0000000000000011 110 4 0000000000000100 1110 5 0000000000000101 11110 6 0000000000000110 111110 7 0000000000000111 1111110 8 0000000000001000 11111110 9 0000000000001001 111111110 10 0000000000001010 1111111110 16 bits allows to encode gaps of 64K docids
  • 21. 21 Unary coding Number (decimal) Binary (16 bits) Unary 1 0000000000000001 0 2 0000000000000010 10 3 0000000000000011 110 4 0000000000000100 1110 5 0000000000000101 11110 6 0000000000000110 111110 7 0000000000000111 1111110 8 0000000000001000 11111110 9 0000000000001001 111111110 10 0000000000001010 1111111110 For small gaps this saves a lot of space
  • 22. 22 Elias-γ coding ● Unary code for ● Binary code of length for ● Example
  • 23. 23 Elias-γ coding Number (decimal) Binary (16 bits) Unary Elias-γ 1 0000000000000001 0 0 2 0000000000000010 10 100 3 0000000000000011 110 101 4 0000000000000100 1110 11000 5 0000000000000101 11110 11001 6 0000000000000110 111110 11010 7 0000000000000111 1111110 11011 8 0000000000001000 11111110 1110000 9 0000000000001001 111111110 1110001 10 0000000000001010 1111111110 1110010 In practice, indexing with this coding uses about 1/5 of the space in TREC-3 (a collection of about 1GB of text)
  • 24. 24 Try it Encode the list 1, 5, 14 using: ● Standard binary coding (8 bits) ● Gap encoding in binary (8 bits) ● Gap encoding in unary ● Gap encoding in gamma coding Which one is shorter? http://chato.cl/2015/data_analysis/exercise-answers/text-indexing_exercise_02_answer.txt