SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Lecture 08
Information Retrieval
The Vector Space Model for Scoring
Introduction
 The representation of a set of documents as vectors in a common
vector space is known as the vector space model and is
fundamental to a host of information retrieval operations ranging
from scoring documents on a query, document classification and
document clustering.
 We first develop the basic ideas underlying vector space scoring; a
pivotal step in this development is the view of queries as vectors
in the same vector space as the document collection.
The Vector Space Model for Scoring
Dot Products
 We denote by 𝑽(𝒅)the vector derived from document 𝒅, with one
component in the vector for each dictionary term.
 Unless otherwise specified, you may assume that the components are
computed using the 𝒕𝒇 − 𝒊𝒅𝒇 weighting scheme, although the
particular weighting scheme is immaterial to the discussion that
follows.
 The set of documents in a collection then may be viewed as a set of
vectors in a vector space, in which there is one axis for each term.
 So we have a |V|- dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply
this to a web search engine
 These are very sparse vectors – most entries are zero
The Vector Space Model for Scoring
Dot Products
 How do we quantify the similarity between two documents in this
vector space?
 A first attempt might consider the magnitude of the vector difference
between two document vectors.
 This measure suffers from a drawback: two documents with very
similar content can have a significant vector difference simply
because one is much longer than the other. Thus the relative
distributions of terms may be identical in the two documents, but the
absolute term frequencies of one may be far larger.
𝑑1
𝑑2
𝑑1
𝑑2
The Vector Space Model for Scoring
Dot Products
 To compensate for the effect of document length, the standard way
of quantifying the similarity between two documents 𝒅𝟏 and 𝒅𝟐 is
to compute the cosine similarity of their vector representations:
 where the numerator represents the dot product (also known as
the inner product) of the vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 . The dot product
𝒙 . 𝒚 of two vectors is defined as:
 while the denominator is the product of their Euclidean lengths.
The Vector Space Model for Scoring
Dot Products
 Let 𝑽 𝒅 denote the document vector for 𝒅, with M components
𝑽 𝟏 𝒅 . . . 𝑽 𝑴 𝒅 . The Euclidean length of 𝒅 is defined to be:
 The effect of the denominator of is thus to length-normalize the
vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 to unit vectors 𝒗(𝒅 𝟏) =
𝑽 𝒅 𝟏
|𝑽(𝒅 𝟏) |
and 𝒗(𝒅 𝟐) =
𝑽 𝒅 𝟐
|𝑽(𝒅 𝟐) |
 We can then rewrite:
as
The Vector Space Model for Scoring
Dot Products
 The effect of the denominator of is thus to length-normalize the
vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 to unit vectors 𝒗(𝒅 𝟏) =
𝑽 𝒅 𝟏
|𝑽(𝒅 𝟏) |
and 𝒗(𝒅 𝟐) =
𝑽 𝒅 𝟐
|𝑽(𝒅 𝟐) |
 Example: for Doc1:
(𝟐𝟕) 𝟐+(𝟑) 𝟐+(𝟎) 𝟐+(𝟏𝟒) 𝟐  𝟗𝟑𝟒  30.56
 27/30.56, 3/30.56, 0/30.56, 14/30.56
The Vector Space Model for Scoring
Cosine Similarity
The Vector Space Model for Scoring
Cosine Similarity
 Example:
The Vector Space Model for Scoring
Cosine Similarity
 Example:
The Vector Space Model for Scoring
Queries as Vectors
 There is a far more compelling reason to represent
documents as vectors: we can also view a query as a vector.
 So, we represent queries as vectors in the space.
 Rank documents according to their proximity to the query in this
space.
 Consider the Query q= jealous gossip
term Query
affection 0
jealous 1
gossip 1
wuthering 0
The Vector Space Model for Scoring
Queries as Vectors
 Consider the Query q= jealous gossip
Log frequency weighting After length
normalization
 The key idea now: to assign to each document 𝒅 a score equal to
the dot product
𝑽 𝒒 . 𝑽(𝒅)
term Query
affection 0
jealous 1
gossip 1
wuthering 0
term Query
affection 0
jealous 0.70
gossip 0.70
wuthering 0
The Vector Space Model for Scoring
Queries as Vectors
After length normalization
 Recall: We do this because we want to get away from the youʼre-
either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant
documents
term Query
affection 0
jealous 0.70
gossip 0.70
wuthering 0
The Vector Space Model for Scoring
Queries as Vectors
 To summarize, by viewing a query as a “bag of words”, we are able
to treat it as a very short document.
 As a consequence, we can use the cosine similarity between the
query vector and a document vector as a measure of the score of
the document for that query.
 The resulting scores can then be used to select the top-scoring
documents for a query. Thus we have:
The Vector Space Model for Scoring
Queries as Vectors
The Vector Space Model for Scoring
Computing Vector Scores
 In a typical setting we have a collection of documents each
represented by a vector, a free text query represented by a vector,
and a positive integer K.
 We seek the K documents of the collection with the highest vector
space scores on the given query.
 We now initiate the study of determining the K documents with the
highest vector space scores for a query.
 Typically, we seek these K top documents in ordered by
decreasing score; for instance many search engines use K = 10 to
retrieve and rank-order the first page of the ten best results.
The Vector Space Model for Scoring
Computing Vector Scores
 The array Length holds the lengths (normalization factors) for each
of the N documents, whereas the array Scores holds the scores for
each of the documents. When the scores are finally computed in
Step 9, all that remains in Step 10 is to pick off the K documents
with the highest scores
The Vector Space Model for Scoring
Computing Vector Scores
 The outermost loop beginning Step 3 repeats the updating of Scores,
iterating over each query term t in turn.
 In Step 5 we calculate the weight in the query vector for term t.
 Steps 6-8 update the score of each document by adding in the contribution
from term t.
 This process of adding in contributions one query term at a time is
sometimes known as term-at-a-time scoring or accumulation, and the N
elements of the array Scores are therefore known as accumulators.
The Vector Space Model for Scoring
Computing Vector Scores
 It would appear necessary to store, with each postings entry, the weight
𝒘𝒇 𝒕,𝒅 of term t in document d (we have thus far used either tf or tf-idf for
this weight, but leave open the possibility of other functions to be
developed in later sections).
 In fact this is wasteful, since storing this weight may require a floating
point number.
 Two ideas help alleviate this space problem:
 First, if we are using inverse document frequency, we need not precompute
𝒊𝒅𝒇 𝒕 ; it suffices to store N/𝒅𝒇 𝒕 ; at the head of the postings for t.
 Second, we store the term frequency 𝒕𝒇 𝒕,𝒅 for each postings entry.
 Finally, Step 12 extracts the top K scores – this requires a priority queue
data structure, often implemented using a heap. Such a heap takes no
more than 2N comparisons to construct, following which each of the K top

Weitere ähnliche Inhalte

Was ist angesagt?

Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositoriesfeiwin
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrievalmghgk
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Similarity Measurement Preliminary Results
Similarity  Measurement  Preliminary ResultsSimilarity  Measurement  Preliminary Results
Similarity Measurement Preliminary Resultsxiaojuzheng
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsShubhangi Tandon
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델guesta34d441
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
Latent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro TripathyLatent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro TripathyAuro Tripathy
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalssbd6985
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
 

Was ist angesagt? (19)

Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositories
 
Ir models
Ir modelsIr models
Ir models
 
Text clustering
Text clusteringText clustering
Text clustering
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Similarity Measurement Preliminary Results
Similarity  Measurement  Preliminary ResultsSimilarity  Measurement  Preliminary Results
Similarity Measurement Preliminary Results
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델
 
G04124041046
G04124041046G04124041046
G04124041046
 
Latent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro TripathyLatent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro Tripathy
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
 

Andere mochten auch

similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Initial Configuration of Router
Initial Configuration of RouterInitial Configuration of Router
Initial Configuration of RouterKishore Kumar
 
Teacher management system guide
Teacher management system guideTeacher management system guide
Teacher management system guidenicolasmunozvera
 
Computer networking short_questions_and_answers
Computer networking short_questions_and_answersComputer networking short_questions_and_answers
Computer networking short_questions_and_answersTarun Thakur
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information RetrievalDishant Ailawadi
 
Pass4sure 640-864 Questions Answers
Pass4sure 640-864 Questions AnswersPass4sure 640-864 Questions Answers
Pass4sure 640-864 Questions AnswersRoxycodone Online
 
MikroTik Basic Training Class - Online Moduls - English
 MikroTik Basic Training Class - Online Moduls - English MikroTik Basic Training Class - Online Moduls - English
MikroTik Basic Training Class - Online Moduls - EnglishAdhie Lesmana
 
De-Risk Data Center Projects With Cisco Services
De-Risk Data Center Projects With Cisco ServicesDe-Risk Data Center Projects With Cisco Services
De-Risk Data Center Projects With Cisco ServicesCisco Canada
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space modeldalal404
 
Cisco router command configuration overview
Cisco router command configuration overviewCisco router command configuration overview
Cisco router command configuration overview3Anetwork com
 
Day 25 cisco ios router configuration
Day 25 cisco ios router configurationDay 25 cisco ios router configuration
Day 25 cisco ios router configurationCYBERINTELLIGENTS
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...University of Minnesota, Duluth
 
Router configuration
Router configurationRouter configuration
Router configuration97148881557
 
Day 5.3 configuration of router
Day 5.3 configuration of routerDay 5.3 configuration of router
Day 5.3 configuration of routerCYBERINTELLIGENTS
 

Andere mochten auch (19)

similarity measure
similarity measure similarity measure
similarity measure
 
Initial Configuration of Router
Initial Configuration of RouterInitial Configuration of Router
Initial Configuration of Router
 
Teacher management system guide
Teacher management system guideTeacher management system guide
Teacher management system guide
 
Computer networking short_questions_and_answers
Computer networking short_questions_and_answersComputer networking short_questions_and_answers
Computer networking short_questions_and_answers
 
E s switch_v6_ch01
E s switch_v6_ch01E s switch_v6_ch01
E s switch_v6_ch01
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
 
Pass4sure 640-864 Questions Answers
Pass4sure 640-864 Questions AnswersPass4sure 640-864 Questions Answers
Pass4sure 640-864 Questions Answers
 
MikroTik Basic Training Class - Online Moduls - English
 MikroTik Basic Training Class - Online Moduls - English MikroTik Basic Training Class - Online Moduls - English
MikroTik Basic Training Class - Online Moduls - English
 
College Network
College NetworkCollege Network
College Network
 
De-Risk Data Center Projects With Cisco Services
De-Risk Data Center Projects With Cisco ServicesDe-Risk Data Center Projects With Cisco Services
De-Risk Data Center Projects With Cisco Services
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
 
Day 11 eigrp
Day 11 eigrpDay 11 eigrp
Day 11 eigrp
 
Cisco router command configuration overview
Cisco router command configuration overviewCisco router command configuration overview
Cisco router command configuration overview
 
Lesson 1 slideshow
Lesson 1 slideshowLesson 1 slideshow
Lesson 1 slideshow
 
Day 25 cisco ios router configuration
Day 25 cisco ios router configurationDay 25 cisco ios router configuration
Day 25 cisco ios router configuration
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
 
Router configuration
Router configurationRouter configuration
Router configuration
 
Day 5.3 configuration of router
Day 5.3 configuration of routerDay 5.3 configuration of router
Day 5.3 configuration of router
 
10 More Quotes for Entrepreneurs
10 More Quotes for Entrepreneurs10 More Quotes for Entrepreneurs
10 More Quotes for Entrepreneurs
 

Ähnlich wie Ir 08

A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDataminingTools Inc
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDatamining Tools
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptxthenmozhip8
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalSudarsun Santhiappan
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)Abhimanyu Dwivedi
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional VerificationSai Kiran Kadam
 
Hierarchical topics in texts generated by a stream
Hierarchical topics in texts generated by a streamHierarchical topics in texts generated by a stream
Hierarchical topics in texts generated by a streamkevig
 

Ähnlich wie Ir 08 (20)

TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
 
Words in space
Words in spaceWords in space
Words in space
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
L0261075078
L0261075078L0261075078
L0261075078
 
L0261075078
L0261075078L0261075078
L0261075078
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Efficient projections
Efficient projectionsEfficient projections
Efficient projections
 
Efficient projections
Efficient projectionsEfficient projections
Efficient projections
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional Verification
 
Mp2420852090
Mp2420852090Mp2420852090
Mp2420852090
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Hierarchical topics in texts generated by a stream
Hierarchical topics in texts generated by a streamHierarchical topics in texts generated by a stream
Hierarchical topics in texts generated by a stream
 

Mehr von Mohammed Romi

Ai 02 intelligent_agents(1)
Ai 02 intelligent_agents(1)Ai 02 intelligent_agents(1)
Ai 02 intelligent_agents(1)Mohammed Romi
 
Ai 03 solving_problems_by_searching
Ai 03 solving_problems_by_searchingAi 03 solving_problems_by_searching
Ai 03 solving_problems_by_searchingMohammed Romi
 
Ch19 network layer-logical add
Ch19 network layer-logical addCh19 network layer-logical add
Ch19 network layer-logical addMohammed Romi
 
Chapter02 graphics-programming
Chapter02 graphics-programmingChapter02 graphics-programming
Chapter02 graphics-programmingMohammed Romi
 
Ian Sommerville, Software Engineering, 9th Edition Ch 4
Ian Sommerville,  Software Engineering, 9th Edition Ch 4Ian Sommerville,  Software Engineering, 9th Edition Ch 4
Ian Sommerville, Software Engineering, 9th Edition Ch 4Mohammed Romi
 
Ian Sommerville, Software Engineering, 9th Edition Ch2
Ian Sommerville,  Software Engineering, 9th Edition Ch2Ian Sommerville,  Software Engineering, 9th Edition Ch2
Ian Sommerville, Software Engineering, 9th Edition Ch2Mohammed Romi
 
Ian Sommerville, Software Engineering, 9th Edition Ch1
Ian Sommerville,  Software Engineering, 9th Edition Ch1Ian Sommerville,  Software Engineering, 9th Edition Ch1
Ian Sommerville, Software Engineering, 9th Edition Ch1Mohammed Romi
 
Ian Sommerville, Software Engineering, 9th Edition Ch 23
Ian Sommerville,  Software Engineering, 9th Edition Ch 23Ian Sommerville,  Software Engineering, 9th Edition Ch 23
Ian Sommerville, Software Engineering, 9th Edition Ch 23Mohammed Romi
 
Ian Sommerville, Software Engineering, 9th EditionCh 8
Ian Sommerville,  Software Engineering, 9th EditionCh 8Ian Sommerville,  Software Engineering, 9th EditionCh 8
Ian Sommerville, Software Engineering, 9th EditionCh 8Mohammed Romi
 
Ch 4 software engineering
Ch 4 software engineeringCh 4 software engineering
Ch 4 software engineeringMohammed Romi
 

Mehr von Mohammed Romi (20)

Ai 02 intelligent_agents(1)
Ai 02 intelligent_agents(1)Ai 02 intelligent_agents(1)
Ai 02 intelligent_agents(1)
 
Ai 01 introduction
Ai 01 introductionAi 01 introduction
Ai 01 introduction
 
Ai 03 solving_problems_by_searching
Ai 03 solving_problems_by_searchingAi 03 solving_problems_by_searching
Ai 03 solving_problems_by_searching
 
Ch2020
Ch2020Ch2020
Ch2020
 
Swiching
SwichingSwiching
Swiching
 
Ch8
Ch8Ch8
Ch8
 
Ch19 network layer-logical add
Ch19 network layer-logical addCh19 network layer-logical add
Ch19 network layer-logical add
 
Ch12
Ch12Ch12
Ch12
 
Ir 01
Ir   01Ir   01
Ir 01
 
Angel6 e05
Angel6 e05Angel6 e05
Angel6 e05
 
Chapter02 graphics-programming
Chapter02 graphics-programmingChapter02 graphics-programming
Chapter02 graphics-programming
 
Swe notes
Swe notesSwe notes
Swe notes
 
Ian Sommerville, Software Engineering, 9th Edition Ch 4
Ian Sommerville,  Software Engineering, 9th Edition Ch 4Ian Sommerville,  Software Engineering, 9th Edition Ch 4
Ian Sommerville, Software Engineering, 9th Edition Ch 4
 
Ian Sommerville, Software Engineering, 9th Edition Ch2
Ian Sommerville,  Software Engineering, 9th Edition Ch2Ian Sommerville,  Software Engineering, 9th Edition Ch2
Ian Sommerville, Software Engineering, 9th Edition Ch2
 
Ian Sommerville, Software Engineering, 9th Edition Ch1
Ian Sommerville,  Software Engineering, 9th Edition Ch1Ian Sommerville,  Software Engineering, 9th Edition Ch1
Ian Sommerville, Software Engineering, 9th Edition Ch1
 
Ian Sommerville, Software Engineering, 9th Edition Ch 23
Ian Sommerville,  Software Engineering, 9th Edition Ch 23Ian Sommerville,  Software Engineering, 9th Edition Ch 23
Ian Sommerville, Software Engineering, 9th Edition Ch 23
 
Ian Sommerville, Software Engineering, 9th EditionCh 8
Ian Sommerville,  Software Engineering, 9th EditionCh 8Ian Sommerville,  Software Engineering, 9th EditionCh 8
Ian Sommerville, Software Engineering, 9th EditionCh 8
 
Ch7
Ch7Ch7
Ch7
 
Ch 6
Ch 6Ch 6
Ch 6
 
Ch 4 software engineering
Ch 4 software engineeringCh 4 software engineering
Ch 4 software engineering
 

Kürzlich hochgeladen

Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 

Kürzlich hochgeladen (20)

Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 

Ir 08

  • 2. The Vector Space Model for Scoring Introduction  The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations ranging from scoring documents on a query, document classification and document clustering.  We first develop the basic ideas underlying vector space scoring; a pivotal step in this development is the view of queries as vectors in the same vector space as the document collection.
  • 3. The Vector Space Model for Scoring Dot Products  We denote by 𝑽(𝒅)the vector derived from document 𝒅, with one component in the vector for each dictionary term.  Unless otherwise specified, you may assume that the components are computed using the 𝒕𝒇 − 𝒊𝒅𝒇 weighting scheme, although the particular weighting scheme is immaterial to the discussion that follows.  The set of documents in a collection then may be viewed as a set of vectors in a vector space, in which there is one axis for each term.  So we have a |V|- dimensional vector space  Terms are axes of the space  Documents are points or vectors in this space  Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine  These are very sparse vectors – most entries are zero
  • 4. The Vector Space Model for Scoring Dot Products  How do we quantify the similarity between two documents in this vector space?  A first attempt might consider the magnitude of the vector difference between two document vectors.  This measure suffers from a drawback: two documents with very similar content can have a significant vector difference simply because one is much longer than the other. Thus the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger. 𝑑1 𝑑2 𝑑1 𝑑2
  • 5. The Vector Space Model for Scoring Dot Products  To compensate for the effect of document length, the standard way of quantifying the similarity between two documents 𝒅𝟏 and 𝒅𝟐 is to compute the cosine similarity of their vector representations:  where the numerator represents the dot product (also known as the inner product) of the vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 . The dot product 𝒙 . 𝒚 of two vectors is defined as:  while the denominator is the product of their Euclidean lengths.
  • 6. The Vector Space Model for Scoring Dot Products  Let 𝑽 𝒅 denote the document vector for 𝒅, with M components 𝑽 𝟏 𝒅 . . . 𝑽 𝑴 𝒅 . The Euclidean length of 𝒅 is defined to be:  The effect of the denominator of is thus to length-normalize the vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 to unit vectors 𝒗(𝒅 𝟏) = 𝑽 𝒅 𝟏 |𝑽(𝒅 𝟏) | and 𝒗(𝒅 𝟐) = 𝑽 𝒅 𝟐 |𝑽(𝒅 𝟐) |  We can then rewrite: as
  • 7. The Vector Space Model for Scoring Dot Products  The effect of the denominator of is thus to length-normalize the vectors 𝑽(𝒅 𝟏) and 𝑽 𝒅 𝟐 to unit vectors 𝒗(𝒅 𝟏) = 𝑽 𝒅 𝟏 |𝑽(𝒅 𝟏) | and 𝒗(𝒅 𝟐) = 𝑽 𝒅 𝟐 |𝑽(𝒅 𝟐) |  Example: for Doc1: (𝟐𝟕) 𝟐+(𝟑) 𝟐+(𝟎) 𝟐+(𝟏𝟒) 𝟐  𝟗𝟑𝟒  30.56  27/30.56, 3/30.56, 0/30.56, 14/30.56
  • 8. The Vector Space Model for Scoring Cosine Similarity
  • 9. The Vector Space Model for Scoring Cosine Similarity  Example:
  • 10. The Vector Space Model for Scoring Cosine Similarity  Example:
  • 11. The Vector Space Model for Scoring Queries as Vectors  There is a far more compelling reason to represent documents as vectors: we can also view a query as a vector.  So, we represent queries as vectors in the space.  Rank documents according to their proximity to the query in this space.  Consider the Query q= jealous gossip term Query affection 0 jealous 1 gossip 1 wuthering 0
  • 12. The Vector Space Model for Scoring Queries as Vectors  Consider the Query q= jealous gossip Log frequency weighting After length normalization  The key idea now: to assign to each document 𝒅 a score equal to the dot product 𝑽 𝒒 . 𝑽(𝒅) term Query affection 0 jealous 1 gossip 1 wuthering 0 term Query affection 0 jealous 0.70 gossip 0.70 wuthering 0
  • 13. The Vector Space Model for Scoring Queries as Vectors After length normalization  Recall: We do this because we want to get away from the youʼre- either-in-or-out Boolean model.  Instead: rank more relevant documents higher than less relevant documents term Query affection 0 jealous 0.70 gossip 0.70 wuthering 0
  • 14. The Vector Space Model for Scoring Queries as Vectors  To summarize, by viewing a query as a “bag of words”, we are able to treat it as a very short document.  As a consequence, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query.  The resulting scores can then be used to select the top-scoring documents for a query. Thus we have:
  • 15. The Vector Space Model for Scoring Queries as Vectors
  • 16. The Vector Space Model for Scoring Computing Vector Scores  In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K.  We seek the K documents of the collection with the highest vector space scores on the given query.  We now initiate the study of determining the K documents with the highest vector space scores for a query.  Typically, we seek these K top documents in ordered by decreasing score; for instance many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
  • 17. The Vector Space Model for Scoring Computing Vector Scores  The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the K documents with the highest scores
  • 18. The Vector Space Model for Scoring Computing Vector Scores  The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term t in turn.  In Step 5 we calculate the weight in the query vector for term t.  Steps 6-8 update the score of each document by adding in the contribution from term t.  This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
  • 19. The Vector Space Model for Scoring Computing Vector Scores  It would appear necessary to store, with each postings entry, the weight 𝒘𝒇 𝒕,𝒅 of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in later sections).  In fact this is wasteful, since storing this weight may require a floating point number.  Two ideas help alleviate this space problem:  First, if we are using inverse document frequency, we need not precompute 𝒊𝒅𝒇 𝒕 ; it suffices to store N/𝒅𝒇 𝒕 ; at the head of the postings for t.  Second, we store the term frequency 𝒕𝒇 𝒕,𝒅 for each postings entry.  Finally, Step 12 extracts the top K scores – this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, following which each of the K top