The Anatomy of a Large-Scale
Hypertextual Web Search Engine
           Lawrence Page & Sergey Brin




          Presented By : Girish Malkarnenkar
             Email: girish@cs.utexas.edu
INF384H / CS395T Concepts of Information Retrieval and
    Web Search (Fall 2011) - (12th September 2011)
Motivation behind Google
• Rapid growth in:
   – Amount of information on the web
   – Number of new inexperienced web users
Motivation behind Google
• Human-maintained indices like Yahoo! were
  subjective, expensive to build & maintain,
  slow to improve and did not cover all topics.
• Automated search engines relying on simple
  keyword matching returned low-quality
  results.
• Advertisers attempted to mislead automated
  search engines.
How bad were things in 1997?
• “Junk results” washed out any relevant search
  results.
• Only one of the top 4 commercial search
  engines at the time could find itself (return its
  own page in the top 10 results for its own name)!
• There was a desperate need for a search
  engine that could cope with the
  ever-increasing information flow and still
  return relevant information.
Challenges in scaling with the web!
• In 1994, one of the first web search engines,
  the World Wide Web Worm (WWWW), indexed
  around 10^5 pages.
• By November 1997, the top engines claimed
  to index up to 10^8 web documents!
• In 1994, the WWWW handled about 1500
  queries per day.
• By November 1997, Altavista handled
  around 20 million queries per day!
Challenges in scalability


•   Fast crawling technology
•   Storage Space
•   Efficient indexing system
•   Fast handling of queries
Google’s design goals
• Aim for very high precision in results, since
  most users look only at the first few tens of
  results.

• Precision is important even at the expense of
  recall (i.e. the total number of relevant
  documents the system is able to return).
The irony of it all…
• In this paper, the authors criticized the
  commercialization of academic search engines,
  as it caused search engine technology to
  remain largely a black art.
• They also stated their aim of making
  Google an open academic environment for
  researchers working on large-scale web data.
• In the appendix, they also blasted
  advertising-funded search engines for being
  “inherently biased”.
System features of Google
• PageRank
   – A Top 10 IEEE ICDM data mining algorithm
   – Tries to incorporate ideas from the
     academic community (publishing and citations)

• Anchor Text
   – <a href="http://www.com">ANCHOR TEXT</a>
PageRank!




It isn't the only factor that Google uses to rank pages, but it is an important one.
Why does PageRank use links?
• Links represent citations
• The quantity of links pointing to a website
  indicates how popular it is
• The quality of links to a website also helps in
  computing its rank
• Link structure was largely unused before Larry
  Page proposed it to his thesis advisor
• The idea is based on academic citation literature,
  which counts citations (backlinks) to a given
  page.
How does PageRank work?


• Counts links from all pages, but not equally
• Normalizes by the number of links on a page
Simplified PageRank algorithm
• Assume four web pages: A, B, C and D. Let each page
  begin with an estimated PageRank of 0.25.

[Figure: link graph of the four pages A, B, C and D]

• L(A) is defined as the number of links going out of page
  A. The PageRank of a page A is given as follows:
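
The formula on this slide was an image and is not in the text. A common way to write the simplified rule, assuming for illustration that B, C and D all link to A, is:

```latex
PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}
```

As a worked case under assumed link counts (not taken from the slide): if L(B) = 2, L(C) = 1 and L(D) = 3, and every page starts at 0.25, then PR(A) = 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.458.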
PageRank algorithm including damping factor
 Assume page A has pages B, C, D, ... which point
 to it. The parameter d is a damping factor which
 can be set between 0 and 1. It is usually set to
 0.85. The PageRank of a page A is given as
 follows:
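
The formula on this slide was also an image. Written in the slide's notation (pages B, C, D, ... pointing to A, and L(X) as the number of links going out of X), the form given in the Brin and Page paper would read:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)
```

A frequently used variant replaces (1 - d) with (1 - d)/N, where N is the total number of pages, so that the PageRanks sum to 1. Which of the two the slide showed is not recoverable from the text.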
Intuitive Justification

• Consider a "random surfer" who is given a web page at random and
  keeps clicking on links, never hitting "back", but eventually gets
  bored and starts on another random page.
   – The probability that the random surfer visits a page is its
      PageRank.
   – The damping factor d is the probability that, at each page, the
      "random surfer" keeps following links; with probability 1 − d
      the surfer gets bored and requests another random page.

• A page can have a high PageRank
   – If there are many pages that point to it
   – Or if a few pages that point to it themselves have a high
     PageRank.
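
The random-surfer picture can be sketched as a simple iteration. This is a minimal illustration, not Google's implementation: the four-page graph, the function name `pagerank` and the fixed iteration count are all assumptions, and the update uses the (1 - d) form of the formula from the paper.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeat PR(p) = (1 - d) + d * sum(PR(q) / L(q)) over pages q linking to p."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}  # assumed starting value for every page
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for q, outgoing in links.items():
            if not outgoing:
                continue  # a page without links passes on nothing in this sketch
            share = d * rank[q] / len(outgoing)  # normalize by links on page q
            for p in outgoing:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
```

In this made-up graph, C is linked from three pages and ends up with the highest rank, while D has no incoming links and stays at the base value 1 - d.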
Anchor Text
•   <A href="http://www.yahoo.com/">Yahoo!</A>
The text of a hyperlink (anchor text) is
associated with the page that the link is on,
and it is also associated with the page the link
points to.

Why?
   – Anchors often provide more accurate descriptions of
     web pages than the pages themselves.
   – Anchors may exist for documents which cannot be
     indexed by a text-based search engine, such as images,
     programs, and databases.
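
A rough sketch of how anchor text could be credited to the page a link points to, using only Python's standard library. The class and function names and the example page are hypothetical; the slides do not describe how Google actually stores this.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None
        self.text_parts = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative URLs against the page the link is on.
                self.current_href = urljoin(self.base_url, href)
                self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.text_parts).strip()
            self.anchors.append((self.current_href, text))
            self.current_href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts that other pages use for it."""
    anchor_index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        for target, text in collector.anchors:
            anchor_index[target].append(text)
    return anchor_index


# Hypothetical page containing the link from the slide
pages = {"http://example.com/": '<p>Try <A href="http://www.yahoo.com/">Yahoo!</A></p>'}
anchor_index = index_anchor_text(pages)
```

With an index like this, a query for "Yahoo!" can match http://www.yahoo.com/ even if that page was never crawled or contains no indexable text.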
Other Features

• Google has location information for all hits
  (uses proximity in search)
• Google keeps track of some visual
  presentation details such as font size of words.
• Words in a larger or bolder font are weighted
  higher than other words.
• Full raw HTML of pages is available in a
  repository
Google Architecture
Implemented in C and C++ on Solaris and Linux
Google Architecture
• URL Server: keeps track of URLs that have been crawled and
  that still need to be crawled.
• Crawlers: multiple crawlers run in parallel. Each crawler keeps
  its own DNS lookup cache and ~300 connections open at once.
• Store Server: compresses and stores web pages.
• Repository: contains the full HTML of every web page. Each
  document is prefixed by docID, length, and URL.
• Indexer: uncompresses and parses documents. Stores link
  information in the anchors file.
• Anchors file: stores each link and the text surrounding the link.
• URL Resolver: converts relative URLs into absolute URLs.
Google Architecture
• URL Resolver: maps absolute URLs into docIDs stored in the Doc
  Index. Stores anchor text in “barrels”. Generates a database of
  links (pairs of docIDs).
• Indexer: parses documents and distributes hit lists into “barrels”.
• Barrels: partially sorted forward indexes, sorted by docID. Each
  barrel stores hit lists for a given range of wordIDs.
• Lexicon: in-memory hash table that maps words to wordIDs.
  Contains a pointer to the doclist in the barrel that the wordID
  falls into.
• Sorter: creates the inverted index, from which the document list
  containing docIDs and hit lists can be retrieved given a wordID.
• Doc Index: docID-keyed index where each entry includes info such
  as a pointer to the doc in the repository, checksum, statistics,
  status, etc. Also contains URL info if the doc has been crawled.
  If not, it just contains the URL.
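
The step from a forward index (sorted by docID) to an inverted index (looked up by wordID) can be illustrated with a toy example. The docIDs, wordIDs, hit positions and the function name `invert` are made up for illustration; the real barrels use compact on-disk encodings that are not shown here.

```python
from collections import defaultdict

# Toy forward index: docID -> {wordID: [positions of hits in the document]}
forward_index = {
    1: {10: [0, 7], 42: [3]},
    2: {42: [1, 5]},
    3: {10: [2]},
}


def invert(forward):
    """Regroup the forward index so each wordID maps to its (docID, hits) list."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):  # keep each doclist ordered by docID
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


inverted_index = invert(forward_index)
```

Given a wordID, `inverted_index[word_id]` then returns the list of documents containing that word together with their hit lists, which is what the query side needs.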
Single Word Query Ranking
• Hitlist is retrieved for single word
• Each hit can be one of several types: title, anchor,
  URL, large font, small font, etc.
• Each hit type is assigned its own weight
• Type-weights make up vector of weights
• Number of hits of each type is counted to form
  count-weight vector
• Dot product of type-weight and count-weight vectors
  is used to compute IR score
• IR score is combined with PageRank to compute final
  rank
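
The single-word scoring steps above can be sketched as follows. All numbers and names are assumptions: the slides give no actual type-weights, the capped `count_weight` is only a crude stand-in for the paper's description of count-weights that grow at first and then taper off, and the way the IR score is combined with PageRank is not specified in the source.

```python
# Assumed type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow with the hit count at first, then stop growing once `cap` is reached."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of the type-weight vector and the count-weight vector."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


def final_rank(ir, pagerank, alpha=0.5):
    """Assumed linear blend of IR score and PageRank, for illustration only."""
    return alpha * ir + (1 - alpha) * pagerank


# Hypothetical document: the query word appears once in the title, three times in plain text
score = ir_score({"title": 1, "plain": 3})
```

The cap means that repeating a word hundreds of times in plain text cannot outscore a single title hit in this sketch.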
Multi-word Query Ranking
• Similar to single-word ranking except now must
  analyze proximity of words in a document
• Hits occurring closer together are weighted higher
  than those farther apart
• Each proximity relation is classified into 1 of 10 bins
  ranging from a “phrase match” to “not even close”
• Each type and proximity pair has a type-prox weight
• Counts converted into count-weights
• Take dot product of count-weights and type-prox
  weights to compute the IR score
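
One way to picture the proximity classification is to map the distance between two hits to one of 10 bins. The bin boundaries and function names below are invented for illustration; the slides only say that the bins range from a “phrase match” to “not even close”.

```python
# Assumed upper distance limits for bins 0..8; anything farther falls into bin 9.
PROXIMITY_BIN_LIMITS = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def proximity_bin(pos_a, pos_b):
    """Return 0 for adjacent words ("phrase match") up to 9 ("not even close")."""
    distance = abs(pos_a - pos_b)
    for bin_index, limit in enumerate(PROXIMITY_BIN_LIMITS):
        if distance <= limit:
            return bin_index
    return len(PROXIMITY_BIN_LIMITS)


def closest_bin(hits_a, hits_b):
    """Best (smallest) proximity bin over all pairs of hits of two query words."""
    return min(proximity_bin(a, b) for a in hits_a for b in hits_b)
```

For example, `closest_bin([0, 50], [53, 400])` picks the pair at positions 50 and 53, which are 3 words apart and land in bin 2 under these assumed limits.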
The Past: Original Page # 1




When Larry Page and Sergey Brin began work on their search engine, it
wasn’t originally called Google. They called it BackRub (a reference to the
algorithm, which used backlinks to rank pages) and only changed the name a
year into development. And yes, the hand in the logo was Larry Page’s, scanned.
The Past: Original Page # 2




The original Google webpage (in 1997)
The Present
The Future?


“The ultimate search engine would
understand exactly what you mean and give
back exactly what you want.”

- Larry Page
References…
• Brin, Page. The Anatomy of a Large-Scale
  Hypertextual Web Search Engine.
• www.cs.uvm.edu/~xwu/kdd
• http://www.ics.uci.edu/~scott/google.htm
Thank you!

Google Paper

  • 1. The Anatomy of a Large-Scale Hypertextual Web Search Engine Lawrence Page & Sergey Brin Presented By : Girish Malkarnenkar Email: girish@cs.utexas.edu INF384H / CS395T Concepts of Information Retrieval and Web Search (Fall 2011) - (12th September 2011)
  • 2. Motivation behind Google • Rapid growth in Amount of information on the web Number of new inexperienced web users
  • 3. Motivation behind Google • Usage of human maintained indices like Yahoo! which were subjective, expensive to build & maintain, slow to improve and did not cover all topics. • Automated search engines relying on simple keyword matching returned low quality results. • Attempts by advertisers to mislead automated search engines
  • 4. How bad were things in 1997? • “Junk results” washed out any relevant search results. • Only one of the top 4 commercial search engines at the time could find itself (in the top 10 results)! • There was a desperate need for a search engine that could cope up with the ever- increasing information flow and still return relevant information.
  • 5. Challenges in scaling with the web! • In 1994, the 1st web search engine, the WWWW indexed around 105 pages. • By November 1997, the top engines indexed 108 web documents! • In 1994, the WWWW handled 1500 queries per day. • By November 1997, Altavista handled around 20 million queries per day!
  • 6. Challenges in scalability • Fast crawling technology • Storage Space • Efficient indexing system • Fast handling of queries
  • 7. Google’s design goals • Aiming for very high precision in results since most users look only at the first few 10s of results. • Precision is important even at the expense of recall (i.e. the total number of relevant documents returned)
  • 8. The irony of it all… • In this paper, the authors had criticized the commercialization of academic search engine as it caused search engine technology to remain a black art. • They had also stated their aims of making Google an open academic environment for researchers working on large scale web data. • In the appendix, they had also blasted advertising funded search engines for being “inherently biased”
  • 9. System features of Google • PageRank • A Top 10 IEEE ICDM data mining algorithm • Tries to incorporate ideas from academic community (publishing and citations) • Anchor Text • <a href=http://www.com> ANCHOR TEXT </a>
  • 10. PageRank! It isn't the only factor that Google uses to rank pages, but it is an important one.
  • 11. Why does PageRank use links? • Links represent citations • Quantity of links to a website makes the website more popular • Quality of links to a website also helps in computing rank • Link structure was largely unused before Larry Page proposed it to his thesis advisor • Idea based on academic citation literature, which counted citations or backlinks to a given page.
  • 12. How does PageRank work? • Counts links from all pages, but not equally. • Normalizes by the number of links on a page.
  • 13. Simplified PageRank algorithm • Assume four web pages: A, B, C and D. Let each page begin with an estimated PageRank of 0.25. [Figure: link graphs among pages A, B, C and D] • L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
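A minimal worked form of the simplified rule, assuming (hypothetically, since the link diagram is not reproduced in this text) that pages B, C and D each link to A. Every linking page passes on its current PageRank divided by its number of outgoing links. The symbol B_u, the set of pages linking to u, is introduced here only for illustration.

```latex
PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}
\qquad \text{and, in general,} \qquad
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

For example, if B, C and D each start at 0.25 and each has exactly one outgoing link (to A), then PR(A) = 0.25 + 0.25 + 0.25 = 0.75 after one step.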
  • 14. PageRank algorithm including damping factor Assume page A has pages B, C, D ..., which point to it. The parameter d is a damping factor which can be set between 0 and 1. Usually set d to 0.85. The PageRank of a page A is given as follows:
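The damped computation can be sketched iteratively. This is an illustrative sketch, not the authors' implementation: it assumes the normalized variant PR(A) = (1 - d)/N + d (PR(B)/L(B) + PR(C)/L(C) + ...), in which ranks sum to 1 (the paper itself writes the first term as (1 - d)). The link graph, the function name `pagerank`, and the handling of pages without outgoing links are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    Assumed variant: PR(p) = (1 - d) / N + d * sum(PR(q) / L(q)).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # e.g. 0.25 each for four pages
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for q, outgoing in links.items():
            if not outgoing:
                # Page with no outgoing links: spread its rank evenly
                # over all pages (an assumption of this sketch).
                for p in pages:
                    new_rank[p] += d * rank[q] / n
                continue
            share = d * rank[q] / len(outgoing)  # damped PR(q) / L(q)
            for p in outgoing:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph for four pages; not the one drawn on the slide.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
ranks = pagerank(example_links)
```

In this hypothetical graph nothing links to D, so D keeps only the baseline (1 - d)/N, while C, which collects links from three pages, ends up ranked above B.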
  • 15. Intuitive Justification • Imagine a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. – The probability that the random surfer visits a page is its PageRank. – The d damping factor is the probability at each page the "random surfer" will get bored and request another random page. • A page can have a high PageRank – If there are many pages that point to it – Or if there are some pages that point to it, and have a high PageRank.
  • 16. Anchor Text • <A href="http://www.yahoo.com/">Yahoo!</A> • The text of a hyperlink (anchor text) is associated with the page that the link is on, and it is also associated with the page the link points to. • Why? – Anchors often provide more accurate descriptions of web pages than the pages themselves. – Anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
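Associating anchor text with the target page can be sketched as below. This is only an illustration using Python's standard `html.parser`; the class `AnchorCollector`, the function `index_anchor_text`, and the source URL `http://example.org/` are hypothetical names chosen for the example, not part of Google's system.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":      # the parser lowercases tag and attribute names
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """pages maps a source URL to its HTML; returns target URL -> anchor texts."""
    by_target = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for target, text in collector.anchors:
            by_target[target].append(text)
    return dict(by_target)


pages = {"http://example.org/": '<A href="http://www.yahoo.com/">Yahoo!</A>'}
anchor_index = index_anchor_text(pages)
```

The word "Yahoo!" is filed under the page the link points to, so that page can be found by this text even if it was never crawled or contains no indexable text itself.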
  • 17. Other Features • It has location information for all hits (uses proximity in search) • Google keeps track of some visual presentation details such as font size of words. • Words in a larger or bolder font are weighted higher than other words. • Full raw HTML of pages is available in a repository
  • 18. Google Architecture Implemented in C and C++ on Solaris and Linux
  • 19. Google Architecture • Crawlers: Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once. • URL Server: Keeps track of URLs that have been crawled and that need to be crawled. • Store Server: Compresses and stores web pages. • Anchors: Stores each link and the text surrounding the link. • URL Resolver: Converts relative URLs into absolute URLs. • Indexer: Uncompresses and parses documents. Stores link information in the anchors file. • Repository: Contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.
  • 20. Google Architecture • URL Resolver: Maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs). • Indexer: Parses documents and distributes hit lists into “barrels”. • Barrels: Partially sorted forward indexes sorted by docID. Each barrel stores hitlists for a given range of wordIDs. • Lexicon: In-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel that the wordID falls into. • Sorter: Creates the inverted index, whereby a document list containing docIDs and hitlists can be retrieved given a wordID. • Doc Index: DocID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it contains just the URL.
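The step from a forward index (keyed by docID) to an inverted index (keyed by wordID) can be sketched as follows. This is a toy in-memory illustration under assumed data shapes; the names `build_inverted_index`, `lexicon` and `forward_index`, and the sample IDs and positions, are hypothetical, and the real system works on compressed on-disk barrels.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Re-key a forward index by wordID.

    forward_index: docID -> {wordID: [hit positions]}
    returns:       wordID -> [(docID, [hit positions]), ...] ordered by docID
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Toy lexicon: word -> wordID (the slide describes an in-memory hash table).
lexicon = {"web": 0, "search": 1}

# Toy forward index: docID -> {wordID: positions of the word in the document}.
forward_index = {1: {0: [3, 17]}, 2: {0: [5], 1: [6]}}

inverted_index = build_inverted_index(forward_index)
```

Looking up `inverted_index[lexicon["web"]]` then yields every document containing "web" together with its hit list, which is the retrieval step the query ranking relies on.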
  • 21. Single Word Query Ranking • Hitlist is retrieved for single word • Each hit can be one of several types: title, anchor, URL, large font, small font, etc. • Each hit type is assigned its own weight • Type-weights make up vector of weights • Number of hits of each type is counted to form count-weight vector • Dot product of type-weight and count-weight vectors is used to compute IR score • IR score is combined with PageRank to compute final rank
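The scoring steps above can be sketched as a dot product. All numbers here are hypothetical: the type weights, the cap used to turn counts into count-weights, and the linear blend with PageRank in `final_rank` are assumptions for illustration only, since the slides do not give the actual weights or the combining function.

```python
from collections import Counter

# Hypothetical type-weight vector; the real weights are not given in the slides.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "plain": 1.0}

# Assumed cap so that many repeated hits stop adding to the score.
COUNT_CAP = 8


def count_weight(count):
    """Convert a raw hit count into a count-weight (linear, then capped)."""
    return float(min(count, COUNT_CAP))


def ir_score(hit_types):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in counts.items())


def final_rank(ir, pagerank, alpha=0.5):
    """Blend IR score and PageRank; this linear mix is purely an assumption."""
    return alpha * ir + (1.0 - alpha) * pagerank


score = ir_score(["title", "plain", "plain", "anchor"])
```

With these assumed weights, one title hit, one anchor hit and two plain hits score 10 + 8 + 2 = 20, while twenty plain hits score only 8 because of the cap.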
  • 22. Multi-word Query Ranking • Similar to single-word ranking, except that the proximity of words in a document must now be analyzed • Hits occurring closer together are weighted higher than those farther apart • Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close” • Each type and proximity pair has a type-prox weight • Counts are converted into count-weights • Take the dot product of count-weights and type-prox weights to compute the IR score
  • 23. The Past: Original Page # 1 When Larry Page and Sergey Brin began work on their search engine, it wasn’t originally called Google. They called it Backrub (a reference to the algorithm, which used backlinks to rank pages), only changing the name a year into development. And yes, the hand in the logo was Larry Page’s, scanned.
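Classifying proximity into 10 bins can be sketched as below. The slides only state that there are 10 bins from "phrase match" to "not even close"; the doubling thresholds and the function names `proximity_bin` and `closest_distance` are assumptions made for this example.

```python
def proximity_bin(distance, num_bins=10):
    """Map a word distance (>= 1) to a bin index.

    Bin 0 is a phrase match (adjacent words); bin num_bins - 1 is
    "not even close". Thresholds double per bin, which is an assumption.
    """
    bin_index = 0
    threshold = 1
    while distance > threshold and bin_index < num_bins - 1:
        threshold *= 2
        bin_index += 1
    return bin_index


def closest_distance(positions_a, positions_b):
    """Smallest distance between any hit of word A and any hit of word B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Hypothetical hit positions of two query words within one document.
distance = closest_distance([3, 40], [4, 90])
bin_for_pair = proximity_bin(distance)
```

Here the two words appear next to each other (positions 3 and 4), so the pair falls into bin 0; a type-prox weight would then be looked up for each (hit type, bin) pair in the same way type weights are looked up for single-word queries.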
  • 24. The Past: Original Page # 2 The original Google webpage (in 1997)
  • 26. The Future? “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page
  • 27. References… • Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. • www.cs.uvm.edu/~xwu/kdd • http://www.ics.uci.edu/~scott/google.htm