Anatomy of a search engine
• Not much is known about AltaVista (AV), Lycos, Yahoo,
  etc.
• But Google and Clever (to some extent) have been
  published
• Design criteria
• Differences
• Architecture
• Data structures
Requirements
• Basic IR concepts:
  – Recall: what % of relevant docs are retrieved
  – Precision: what % of docs retrieved are relevant
• Quantity:
  – handle hundreds of thousands of queries/sec
• Quality:
  – High precision (not achieved by present engines)
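The two IR measures above can be sketched as follows. This is a minimal illustration, assuming the retrieved and the relevant documents are given as collections of document IDs; the function name and the sample IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = share of retrieved docs that are relevant
    recall    = share of relevant docs that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs returned, 3 docs are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

On the Web, recall is hard to measure (nobody knows the full set of relevant pages), which is one reason the slide puts the emphasis on precision.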
Page rank
• Idea: a page is important when it is referred
  to a lot, or referred to from an important
  page
• PR is used to prioritize results; it works well even
  when the search is just on page titles
PR details
• Pages T1,…,Tn point to page A; C(A) is the link
  fan-out of A (the number of links going out of A)
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
d = damping factor = 0.85
Model of a random walk on the Web:
PR(p) = probability that a “random” user will visit p
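The formula above can be evaluated by simple iteration. The sketch below assumes the Web graph is given as a dictionary from each page to the pages it links to; the function name, the iteration count, and the three-page graph are illustrative assumptions, not taken from the slides.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to, so
    C(T) is len(links[T]).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # any positive start value works
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            incoming = sum(
                pr[t] / len(links[t])
                for t in pages
                if a in links.get(t, ())
            )
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page Web: A -> B, C;  B -> C;  C -> A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

With this form of the formula the scores sum to the number of pages (when every page has at least one outgoing link) rather than to 1; dividing by the number of pages gives the visit probability of the random-walk model.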
Other features and terms
• Anchor text is associated with the page it
  links to
• Some markup aspects are used
Google architecture
• URL server sends lists of URLs to be fetched
  to the crawlers
• StoreServer compresses and stores pages
• Indexer extracts words, their position, size,
  and capitalization
• Anchors file contains links and their text
• Sorter generates the inverted index
• Searcher uses the Lexicon, the inverted index (II),
  and PR
Some details
• Barrels store words (as wordIDs); if a doc
  contains a word, the doc's ID and the wordID
  are stored with the hit list of this word in the doc
• Lexicon points to the inverted barrels; each word
  points to docIDs and hits
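A toy version of these structures may help. The sketch below is an assumption-laden simplification: one in-memory dictionary stands in for all barrels, a hit is just a word position (the real hit lists also carry size and capitalization), and the function name and sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Build a lexicon and an inverted index from {docID: text}.

    lexicon:  word   -> wordID
    inverted: wordID -> {docID: hit list}, a hit being a word position
    """
    lexicon = {}
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            # New words get the next free wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            inverted[word_id].setdefault(doc_id, []).append(position)
    return lexicon, inverted


# Hypothetical two-document collection.
lexicon, inverted = build_index({1: "web search engine", 2: "search the web web"})
web_id = lexicon["web"]
print(web_id, inverted[web_id])
```

Looking a word up is then two steps: the lexicon gives the wordID, and the wordID gives the docIDs with their hit lists.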
Operation
• Crawling
• Searching
• Ranking
Crawling and indexing
• Parsing into anchors and words – must be robust to
  errors (flex + stack)
• Indexing in parallel – hashing into barrels
  using the lexicon – the problem of sharing new
  words among the parallel indexers
Searching
1 Parse the query
2 Convert words into wordIDs
3 Identify a barrel for each word
4 Scan the doclists until a doc that matches all the
  search words is found
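The four steps can be sketched as below. This assumes each word's doclist is a list of docIDs sorted in ascending order and that a plain dictionary plays the role of the barrels; all names and the sample data are hypothetical.

```python
def matching_docs(doclists):
    """Yield docIDs that occur in every sorted doclist, scanning them in step."""
    if not doclists or any(not dl for dl in doclists):
        return
    cursors = [0] * len(doclists)
    while True:
        current = [dl[c] for dl, c in zip(doclists, cursors)]
        target = max(current)
        if all(doc == target for doc in current):
            yield target  # every list is positioned on the same doc
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are behind the largest docID.
            cursors = [c + (dl[c] < target) for dl, c in zip(doclists, cursors)]
        if any(c >= len(dl) for dl, c in zip(doclists, cursors)):
            return


def search(query, lexicon, barrels):
    words = query.lower().split()                    # 1 parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                    # unknown word: no match
    word_ids = [lexicon[w] for w in words]           # 2 words -> wordIDs
    doclists = [barrels[w] for w in word_ids]        # 3 find each doclist
    return list(matching_docs(doclists))             # 4 scan for matches


# Hypothetical lexicon and doclists (wordID -> sorted docIDs).
lexicon = {"web": 0, "search": 1, "engine": 2}
barrels = {0: [1, 2, 5], 1: [2, 3, 5], 2: [5, 8]}
print(search("web search", lexicon, barrels))
```

Because `matching_docs` is a generator, a caller can stop after the first few matches instead of scanning the doclists to the end.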
Ranking
• For a single word, identify the doc's hit list and the
  type of each hit, count the number of hits of each type,
  and vector-multiply the counts by the type weights
• Combine with PR
• For multiple words, take proximity into
  account
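A sketch of the single-word case follows. The hit types, the weights, the cap on counts, and the way the IR score is mixed with PR are all assumptions made for illustration; the slides do not give the actual values or the combining function. Proximity for multi-word queries is not covered here.

```python
from collections import Counter

# Hypothetical type weights: a hit in a title counts more than plain text.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}
COUNT_CAP = 3  # assumed cap so that many repetitions stop adding score


def ir_score(hit_types):
    """Dot product of per-type hit counts with the type weights."""
    counts = Counter(hit_types)
    return sum(
        weight * min(counts[hit_type], COUNT_CAP)
        for hit_type, weight in TYPE_WEIGHTS.items()
    )


def final_score(hit_types, pagerank, alpha=0.5):
    """Assumed linear mix of the IR score and PR."""
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank


# One title hit and two plain-text hits for the query word in this doc.
print(ir_score(["title", "plain", "plain"]))
print(final_score(["title"], pagerank=2.0))
```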
Going further
• Google will not return any IBM pages for
  the query ‘mainframes’
• Many pages that point to the IBM page use the
  term ‘mainframe’, so this page should be
  returned
• Clever ranks authority pages and hub pages.
  Authorities are pages with high PR. Hubs are
  pages that point to authorities, e.g. my friend’s
  page with a list of links to on-line CD stores. Hubs
  may not be chosen by PR alone
• Clever/HITS (Hyperlink-Induced Topic Search)
  starts with an initial set of pages and hubs
Mathematically speaking…
• Let $x_p$ be the authority weight of page p and $y_q$ the hub weight
  of page q; q→p denotes that q links to p
     $x_p = \sum_{q \to p} y_q$         $y_p = \sum_{p \to q} x_q$
• Let A be the adjacency matrix: $A_{i,j} = 1$ if there is a
  link from i to j, 0 otherwise
$x \leftarrow A^T y$ and $y \leftarrow A x$
$x \leftarrow A^T A x$, and we can iterate that further,
  working with powers of $A^T A$
This sequence of powers (normalized at each step) converges to the
  principal eigenvector of $A^T A$
This means that the result does not depend on
  the initial weights
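The iteration can be sketched in a few lines. This is a minimal illustration, assuming the base set is given as a dictionary from each page to the pages it links to; the function name, the iteration count, and the tiny graph are hypothetical, and weights are normalized to unit length after every step so they stay bounded.

```python
import math


def hits(links, iterations=50):
    """Return (authority, hub) weights after repeated x <- A^T y, y <- A x."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x_p = sum of y_q over all q that link to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # y_p = sum of x_q over all q that p links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        for weights in (auth, hub):
            norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub


# Hypothetical base set: three hub pages pointing at two authorities.
auth, hub = hits({"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]})
print({p: round(w, 3) for p, w in auth.items()})
print({p: round(w, 3) for p, w in hub.items()})
```

In this graph a1 is linked from all three hubs and a2 from two, so a1 ends up with the larger authority weight, and h3 (which points only to a1) with the smallest hub weight.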
• Remove ‘local’ links (“back to the main
  page”)
• Drift: transfer of main authority to, e.g.,
  topics of hobbies
• Hijacking: if several pages from the same
  site occur in the base set, they may take
  over a topic
• Remedied by partial content indexing (anchors)
  and by dividing a page into pagelets (contiguous
  sequences of links)
• Hubs are good when learning about a topic,
  less so when seeking specific info.
Other engines
• AltaVista and Lycos probably use simple
  selection methods
• Excite seems to use many
  properties of the pages
• See “What is a tall poppy among Web pages?”, 7th Int’l
  WWW Conf.
