Indexing Vector Spaces 
Graphs Search Engines 
Piero Molino 
XY Lab
Relevance for 
New Publishing 
Search Engines can be considered publishers 
They select, filter, choose and sometimes create the 
content they give as result to queries 
From the user's point of view, the search process is a
black box, like magic
Understanding it increases awareness of the
risks it carries and could help in bypassing or hacking it
Information Retrieval 
Given a collection of unstructured information (a set of
texts)
Given an information need (a query)
Objective: find the items of the collection that answer
the information need
Representation 
Simple/naive assumption 
A text can be represented by the words it contains 
Bag-of-words model
Bag of Words
"Me and John and Mary attend this lab"
→
The counts build up one word at a time:
Me:1
Me:1 and:1
Me:1 and:1 John:1
Me:1 and:2 John:1
Me:1 and:2 John:1 Mary:1
Me:1 and:2 John:1 Mary:1 attend:1
Me:1 and:2 John:1 Mary:1 attend:1 this:1
Me:1 and:2 John:1 Mary:1 attend:1 this:1 lab:1
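The counting above can be sketched in a few lines of Python. This is a minimal illustration, assuming words are simply separated by whitespace; the function name `bag_of_words` is hypothetical and not part of the slides.

```python
from collections import Counter

def bag_of_words(text):
    """Count how many times each whitespace-separated word occurs in the text."""
    return Counter(text.split())

bag = bag_of_words("Me and John and Mary attend this lab")
for word, count in bag.items():
    print(f"{word}:{count}")
```

Word order is discarded: only the words and their counts survive, which is exactly the simplification the bag-of-words model makes.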
Inverted index 
To access the documents, we build an index, just like
humans do
Inverted index: maps each word to the documents that contain it
Like the subject index at the back of a book ("indice analitico" in Italian)
Example 
doc1:"I like football" 
doc2:"John likes football" 
doc3:"John likes basketball" 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3
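A possible way to build this index in Python is sketched below. It assumes whitespace tokenization and no normalization (so "like" and "likes" stay distinct, as in the example); `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

docs = {
    "doc1": "I like football",
    "doc2": "John likes football",
    "doc3": "John likes basketball",
}

def build_inverted_index(docs):
    """Map each word to the list of documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # dict.fromkeys removes repeated words while keeping their order
        for word in dict.fromkeys(text.split()):
            index[word].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
for word, postings in index.items():
    print(word, "→", " ".join(postings))
```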
Query 
Information need represented with the same bag-of-words 
model 
Who likes basketball? → Who:1 likes:1 basketball:1
Retrieval 
Who:1 likes:1 basketball:1 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3 
Result: [doc2, doc3]
(without any order... or is there one?)
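The lookup can be sketched as follows: every query word is looked up in the inverted index and the document lists are merged. This is only an illustration; stripping the question mark and the name `retrieve` are assumptions, and the result is sorted alphabetically just to make the output stable.

```python
# Inverted index from the example
index = {
    "I": ["doc1"],
    "like": ["doc1"],
    "football": ["doc1", "doc2"],
    "John": ["doc2", "doc3"],
    "likes": ["doc2", "doc3"],
    "basketball": ["doc3"],
}

def retrieve(query, index):
    """Return every document that contains at least one query word."""
    words = query.replace("?", "").split()
    found = set()
    for word in words:
        # Words missing from the index (here "Who") contribute nothing
        found.update(index.get(word, []))
    return sorted(found)

result = retrieve("Who likes basketball?", index)
print(result)
```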
Boolean Model 
likes → doc2 doc3 
basketball → doc3 
doc3 contains both "likes" and "basketball", 2/3 of the
query terms; doc2 contains only 1/3
Boolean Model: represents only the presence or absence of
query words and ranks documents according to how many
query words they contain
Result: [doc3, doc2]
(in this precise order)
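The ranking described here, counting how many query words each document contains, might look like this in Python. The function name `rank_by_matches` is hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

index = {
    "I": ["doc1"],
    "like": ["doc1"],
    "football": ["doc1", "doc2"],
    "John": ["doc2", "doc3"],
    "likes": ["doc2", "doc3"],
    "basketball": ["doc3"],
}

def rank_by_matches(query_words, index):
    """Order documents by the number of query words they contain."""
    matches = Counter()
    for word in query_words:
        for doc_id in index.get(word, []):
            matches[doc_id] += 1
    # Most matches first; alphabetical order breaks ties
    return sorted(matches, key=lambda doc_id: (-matches[doc_id], doc_id))

ranked = rank_by_matches(["Who", "likes", "basketball"], index)
print(ranked)
```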
Vector Space Model 
Represents documents and words as vectors or 
points in a (multidimensional) geometrical space 
We build a term × document matrix
             doc1  doc2  doc3
I             1     0     0
like          1     0     0
football      1     1     0
John          0     1     1
likes         0     1     1
basketball    0     0     1
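As a rough sketch, the same matrix can be computed from the three example documents. The vocabulary order is copied from the slide, and plain word counts are used as cell values.

```python
docs = {
    "doc1": "I like football",
    "doc2": "John likes football",
    "doc3": "John likes basketball",
}
vocabulary = ["I", "like", "football", "John", "likes", "basketball"]

# One row per term, one column per document (doc1, doc2, doc3)
matrix = {
    term: [text.split().count(term) for text in docs.values()]
    for term in vocabulary
}

for term, row in matrix.items():
    print(f"{term:<12}", *row)
```

Each column is the vector of a document and each row is the vector of a word.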
Graphical Representation
[Figure: doc1, doc2 and doc3 plotted as points in a plane whose axes are "football" and "likes", each running from 0 to 1]
Tf-idf 
The cells of the matrix contain the frequency of a 
word in a document (term frequency tf) 
This value can be counterbalanced by the number 
of documents that contain the word (inverse 
document frequency idf) 
The final weight is called tf-idf
Tf-idf Example 1 
"Romeo" is found 700 times in the document "Romeo
and Juliet" and also in 7 other plays
The tf-idf of "Romeo" in the document "Romeo and
Juliet" is 700 × (1/7) = 100
Tf-idf Example 2 
"and" is found 1000 times in the document "Romeo 
and Juliet" and also in 10000 other plays 
The tf-idf of "and" in the document "Romeo and Juliet" 
is 1000 × (1/10000) = 0.1
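The two examples use a deliberately simplified weight: the term frequency divided by the number of documents containing the word. A small sketch of that arithmetic follows; the function name `simple_tf_idf` is hypothetical, and many real systems take a logarithm of the inverse document frequency instead of using it directly.

```python
import math

def simple_tf_idf(term_frequency, documents_with_term):
    """Simplified weight from the examples: tf multiplied by 1 / document count."""
    return term_frequency / documents_with_term

romeo = simple_tf_idf(700, 7)        # "Romeo" in "Romeo and Juliet"
and_word = simple_tf_idf(1000, 10000)  # "and" in "Romeo and Juliet"

print(romeo, and_word)
print(math.isclose(romeo / and_word, 1000))  # "Romeo" weighs about 1000 times more
```

Even though "and" occurs more often in the play, its weight is far lower because it appears almost everywhere.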
Similarity 
Queries are like an additional column in the matrix
We can compute a similarity between documents and
queries
Similarity
[Figure: doc1, doc2, doc3 and the query plotted from the origin 0, compared by Euclidean distance]
Similarity
[Figure: doc1, doc2, doc3 and the query plotted from the origin 0, with an angle θ marked]
Vector length normalization
Projection of one normalized vector over another = cos(θ)
Cosine Similarity 
The cosine of the angle between two vectors is a
measure of how similar the two vectors are
As the vectors represent documents and queries,
the cosine measures how similar a document is
to the query
Ranking with 
Cosine Similarity 
Shows how query and documents are correlated
1 = maximum correlation, 0 = no correlation, -1 =
negative correlation (with non-negative term weights the
value stays between 0 and 1)
The documents obtained from the inverted index are
ranked according to their cosine similarity with respect to the
query
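A minimal sketch of cosine ranking on the running example is given below. It assumes raw word counts as vector components (not tf-idf) and the vocabulary order I, like, football, John, likes, basketball; "Who" is not in the vocabulary, so it does not appear in the query vector.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns of the term x document matrix
vectors = {
    "doc1": [1, 1, 1, 0, 0, 0],
    "doc2": [0, 0, 1, 1, 1, 0],
    "doc3": [0, 0, 0, 1, 1, 1],
}
query = [0, 0, 0, 0, 1, 1]  # "likes" and "basketball"

candidates = ["doc2", "doc3"]  # documents returned by the inverted index
ranked = sorted(candidates, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranked)
```

Here doc3 shares two words with the query and doc2 only one, so doc3 forms the smaller angle with the query and is ranked first.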
Ranking on the web 
On the web, documents (webpages) are connected to
each other by hyperlinks
Is there a way to exploit this topology? 
PageRank algorithm (and several others)
The web as a graph
[Figure: a graph of seven webpages (nodes 1 to 7) connected by hyperlinks (edges)]
Matrix Representation
Rows and columns are the webpages 1 to 7; only the non-zero entries of each row are listed
1: 1/2 1/2
2: 1/2 1/2
3: 1
4: 1/3 1/3 1/3
5: 1/2 1/2
6: 1/2 1/2
7: 1
Simulating Random Walks 
e = random vector; each value can be interpreted as how important
a node is in the network
E.g. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ...
A = the matrix
Repeating e = e · A is like simulating a random walk
on the graph
The process converges to a vector e where each value
represents the importance of a node in the network
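The repeated multiplication can be sketched as below. The 4-page graph is hypothetical (it is not the 7-page graph of the slides), the starting vector is uniform rather than random so that the run is reproducible, and no damping factor is used; the walk converges here because this particular graph allows it, which is not guaranteed for every graph.

```python
# Hypothetical graph: each page maps to the pages it links to
links = {1: [2, 3], 2: [3], 3: [1], 4: [1, 3]}
pages = sorted(links)

# A[i][j] = probability of moving from page i to page j (each row sums to 1)
A = {i: {j: 0.0 for j in pages} for i in pages}
for i, targets in links.items():
    for j in targets:
        A[i][j] = 1.0 / len(targets)

def step(e):
    """One multiplication e = e · A."""
    return {j: sum(e[i] * A[i][j] for i in pages) for j in pages}

e = {p: 1.0 / len(pages) for p in pages}  # uniform starting vector
for _ in range(100):
    e = step(e)

for page in pages:
    print(page, round(e[page], 3))
```

Pages 1 and 3 end up with an importance of about 0.4, page 2 with about 0.2, and page 4, which no other page links to, with 0.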
Random Walks
[Figure sequence, 5 frames: the web graph of pages 1 to 7 connected by hyperlinks, with one page marked as active in each frame]
Importance - Authority
[Figure: the web graph of pages 1 to 7 connected by hyperlinks, with each page annotated with its importance (its value in e)]
Ranking with PageRank 
I → doc1 
like → doc1 
football → doc1 doc2 
John → doc2 doc3 
likes → doc2 doc3 
basketball → doc3 
e 
doc1 → 0.3 
doc2 → 0.1 
doc3 → 0.5 
The inverted index gives doc2 and doc3 
Using the importance in e for ranking 
Result: [doc3, doc2] (ordered list)
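In code, this step amounts to sorting the candidate documents by their value in e. A minimal sketch, using the values shown above:

```python
# Importance vector e from the example
importance = {"doc1": 0.3, "doc2": 0.1, "doc3": 0.5}

# Documents returned by the inverted index for the query
candidates = ["doc2", "doc3"]

# Highest importance first
ranked = sorted(candidates, key=lambda doc_id: importance[doc_id], reverse=True)
print(ranked)
```

Note that doc1 has a higher importance than doc2 but is not returned, because the inverted index did not select it for this query.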
Real World Ranking 
Real-world search engines exploit Vector-Space-Model-like
approaches, PageRank-like approaches
and several others
They balance all the different factors by observing which
webpages you click on after issuing a query and using
those clicks as examples for a machine learning algorithm
Machine Learning 
A machine learning algorithm works like a baby
You give it examples of what a dog is and what
a cat is
It learns to look at the pointy ears, the nose and
other aspects of the animals
When it sees a new animal it says "cat" or "dog"
depending on its experience
Learning to Rank 
The different indicators of how good a webpage is for
a query (cosine similarity, PageRank, etc.) are like
the nose, the tail and the ears of the animal
Instead of cat and dog, the algorithm classifies results as
relevant or non-relevant
When you issue a query and click on a result, it is
marked as relevant; otherwise it is considered non-relevant
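One possible way to picture this is a tiny classifier trained on click data. The sketch below uses logistic regression, which is only one of many possible learning algorithms and is not named in the slides; the training examples, the page names and the two features (cosine similarity, PageRank-style importance) are all hypothetical values made up for illustration.

```python
import math

# Hypothetical examples: (cosine similarity, importance) -> clicked (1) or not (0)
examples = [
    ((0.9, 0.5), 1), ((0.8, 0.3), 1), ((0.7, 0.6), 1),
    ((0.2, 0.4), 0), ((0.1, 0.1), 0), ((0.3, 0.2), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relevance(features, weights, bias):
    """Estimated probability that a result with these indicators is relevant."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(examples, epochs=2000, learning_rate=0.5):
    """Fit the weights with stochastic gradient descent on the click examples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, clicked in examples:
            error = clicked - relevance(features, weights, bias)
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

weights, bias = train(examples)

# Hypothetical new results for a query, ranked by estimated relevance
candidates = {"pageA": (0.85, 0.4), "pageB": (0.15, 0.25)}
ranked = sorted(candidates, key=lambda p: relevance(candidates[p], weights, bias), reverse=True)
print(ranked)
```

The learned weights play the role of the "balance" between indicators: the algorithm decides how much each one counts by looking at which results were clicked.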
Issues 
Among the different indicators there are a lot of 
contextual ones 
Where you issue the query from, who you are, what 
you visited before, what result you clicked on 
before, ... 
This makes the results different for each person, in each place
and at each moment in time
The filter bubble 
A lot of diverse content is filtered from you 
The search engine shows you what it thinks you will
be interested in, based on all these contextual factors
Lack of transparency of the search process 
A way out of the bubble? Maybe favoring serendipity
Thank you 
for your attention
