Boolean and Vector Space
Retrieval Models

1
Retrieval Models
• A retrieval model specifies the details
of:
– Document representation
– Query representation
– Retrieval function

• Determines a notion of relevance.
• Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
2
Classes of Retrieval Models
• Boolean models (set theoretic)
– Extended Boolean

• Vector space models
(statistical/algebraic)
– Generalized VS
– Latent Semantic Indexing

• Probabilistic models
3
Common Preprocessing Steps
• Strip unwanted characters/markup (e.g. HTML
tags, punctuation, numbers, etc.).
• Break into tokens (keywords) on whitespace.
• Stem tokens to “root” words
– computational → comput

• Remove common stopwords (e.g. a, the, it, etc.).
• Detect common phrases (possibly using a domain
specific dictionary).
• Build inverted index (keyword → list of docs
containing it).
4
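The steps above can be sketched in Python; the stopword list and the regex "stemmer" are illustrative stand-ins (a real system would use a proper stemmer such as Porter's):

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "of", "to", "and"}  # illustrative subset

def preprocess(text):
    # Strip punctuation/markup characters, lowercase, split on whitespace
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    # Crude suffix stripping as a stand-in for a real stemmer
    stems = [re.sub(r"(ational|ation|s)$", "", t) for t in tokens]
    return [t for t in stems if t and t not in STOPWORDS]

def build_inverted_index(docs):
    index = defaultdict(set)  # keyword -> set of doc ids containing it
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

docs = {1: "Computational text retrieval.", 2: "The text database."}
index = build_inverted_index(docs)
print(sorted(index["text"]))  # → [1, 2]  ("text" occurs in both docs)
```

Note that "computational" is reduced to "comput", as on the slide, and the stopword "the" never reaches the index.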
Statistical Models
• A document is typically represented by a bag of
words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the
same element.
• User specifies a set of desired terms with optional
weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms:
Q = < database; text; information >
– No Boolean conditions specified in the query.
5
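A bag of words and a weighted query map directly onto standard Python containers (the terms and weights below are the slide's example):

```python
from collections import Counter

# A document as a bag of words: unordered terms with their frequencies
doc_tokens = "text database text retrieval database text".split()
bag = Counter(doc_tokens)
print(bag["text"])      # → 3
print(bag["database"])  # → 2

# A weighted query, as on the slide; weights are user-supplied
query = {"database": 0.5, "text": 0.8, "information": 0.2}
```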
Statistical Retrieval
• Retrieval based on similarity between query
and documents.
• Output documents are ranked according to
similarity to query.
• Similarity based on occurrence frequencies of
keywords in query and document.
• Automatic relevance feedback can be
supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.

6
Issues for Vector Space Model
• How to determine important words in a
document?
– Word sense?
– Word n-grams (and phrases, idioms, …) → terms

• How to determine the degree of importance of a
term within a document and within the entire
collection?
• How to determine the degree of similarity between
a document and the query?
• In the case of the web, what is a collection and
what are the effects of links, formatting
information, etc.?
7
The Vector-Space Model
• Assume t distinct terms remain after
preprocessing; call them index terms or the
vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|

• Each term, i, in a document or query, j, is given a
real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
8
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted as vectors in the three-dimensional
term space with axes T1, T2, T3.]

• Is D1 or D2 more similar to Q?
• How to measure the degree of
similarity? Distance? Angle?
Projection?
9
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist in
the document.
     T1   T2   …   Tt
D1   w11  w21  …   wt1
D2   w12  w22  …   wt2
 :    :    :        :
Dn   w1n  w2n  …   wtn
10
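For the small running example (D1 and D2 from the previous slide), the term-document matrix can be held as a dict of weight rows; the variable names are illustrative:

```python
# Rows = documents, columns = terms T1..T3, entries = weights
terms = ["T1", "T2", "T3"]
W = {
    "D1": [2, 3, 5],
    "D2": [3, 7, 1],
}
# Weight of term T2 in document D2 (zero would mean "not significant/absent"):
print(W["D2"][terms.index("T2")])  # → 7
```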
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}

11
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are
less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N/dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
12
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij · idfi = tfij · log2(N/dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work
well.
13
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8

14
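A short script reproduces the numbers above, with tf normalized by the most frequent term in the document and idf computed as log2(N/df):

```python
import math

N = 10_000  # total documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}        # term frequencies in the document
df    = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection

max_f = max(freqs.values())  # frequency of the most common term (A)
for term, f in freqs.items():
    tf = f / max_f
    idf = math.log2(N / df[term])
    print(f"{term}: tf={tf:.2f}  idf={idf:.1f}  tf-idf={tf * idf:.1f}")
# → A: tf=1.00  idf=7.6  tf-idf=7.6
# → B: tf=0.67  idf=2.9  tf-idf=2.0
# → C: tf=0.33  idf=5.3  tf-idf=1.8
```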
Query Vector
• Query vector is typically treated as a
document and also tf-idf weighted.
• Alternative is for the user to supply weights
for the given query terms.

15
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.

16
Similarity Measure - Inner Product
• Similarity between vectors for the document dj and query q can
be computed as the vector inner product (a.k.a. dot product):

sim(dj, q) = dj • q = Σi=1..t wij · wiq

where wij is the weight of term i in document j and wiq is the weight of
term i in the query.

• For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
17
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of
unique terms.
• Measures how many terms matched but not how
many terms are not matched.

18
Inner Product -- Examples
Binary:
Vocabulary (t = 7): retrieval, database, architecture, computer,
text, management, information

– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1

Size of vector = size of vocabulary = 7
0 means corresponding term not found in
document or query

sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3
Q = 0T1 + 0T2 + 2T3

D2 = 3T1 + 7T2 + 1T3

sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2

19
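Both examples can be checked with a few lines of Python:

```python
def inner_product(d, q):
    # Sum of the products of weights of matched terms
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example (vocabulary size 7)
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # → 3  (number of matched query terms)

# Weighted example
D1, D2 = [2, 3, 5], [3, 7, 1]
Qw = [0, 0, 2]
print(inner_product(D1, Qw))  # → 10
print(inner_product(D2, Qw))  # → 2
```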
Cosine Similarity Measure
• Cosine similarity measures the cosine of
the angle between two vectors.
• Inner product normalized by the vector lengths:

CosSim(dj, q) = (dj • q) / (|dj| · |q|)
             = Σi=1..t (wij · wiq) / (√Σi=1..t wij² · √Σi=1..t wiq²)

[Figure: D1, D2, and Q plotted in term space; θ1 is the angle between
D1 and Q, θ2 the angle between D2 and Q.]

D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times
better using the inner product.
20
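The cosine values above can be verified in Python; cos_sim is a hypothetical helper implementing the formula on this slide:

```python
import math

def cos_sim(d, q):
    # Inner product divided by the product of the vector lengths
    dot = sum(wd * wq for wd, wq in zip(d, q))
    len_d = math.sqrt(sum(w * w for w in d))
    len_q = math.sqrt(sum(w * w for w in q))
    return dot / (len_d * len_q)

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # → 0.81
print(round(cos_sim(D2, Q), 2))  # → 0.13
```

Unlike the raw inner product, these scores are bounded (in [0, 1] for non-negative weights), so long documents are not automatically favored.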
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.

21
