Sitemaps: Above and Beyond
the Crawl of Duty
Sitemaps! Sitemaps!
Uri Schonfeld (Google and UCLA)
Narayanan Shivakumar (Google)
Copyright Uri Schonfeld, shuri.org April
2009
What are we going to talk about?
• The sitemaps protocol:
– Not introduced in this paper
– Friendly web servers publishing URL lists
• Popular and growing in popularity
• First large scale study over real data:
• How it is used by users
• Its Impact
– First look at how it can be used by search engines
– Lots of future work to get excited over
• Let’s start with:
– Underlying problem that sitemaps addresses
Dream of the Perfect Crawl
1. Users have high expectations:
• Coverage: Every page should be findable
• Freshness: Latest event, viral video, ...
• Deep Web: Ajax, Flash, Silverlight, ...
2. Search engines dream of the perfect crawl:
• Everything the users want
• ...but efficient:
– No 404s
– No duplicates
3. Sitemaps to the rescue...
Sitemaps
1. Basic idea: the web server
a) Puts a URL list, a sitemaps file, on its site
b) Includes new and changed content
c) Lets the search engines know
2. The URL list may include:
• URLs
• Last modification time
• Expected change frequency
• Priority
3. Letting the search engines know:
a) "Ping" search engines that the sitemaps file has changed
b) Alternatively, list the sitemaps in the robots.txt file (April 2007)
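The robots.txt route can be illustrated with a minimal Python sketch. It assumes the file announces each sitemaps file on a line of the form `Sitemap: <url>`; the function name `sitemap_urls_from_robots` and the sample file contents are hypothetical, chosen only for illustration.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the sitemap URLs announced on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # The field name is matched case-insensitively; the URL is kept as written.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt contents for illustration only.
example_robots = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""

print(sitemap_urls_from_robots(example_robots))
```

A crawler could run this on each fetched robots.txt and queue the returned URLs for sitemap fetching, which needs no ping from the web server.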
Sitemaps: This is how it looks
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns=
"http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
...
</url>
</urlset>
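As a rough illustration of how a consumer might read such a file, the following Python sketch parses a sitemap with the standard library's `xml.etree.ElementTree`. The function name `parse_sitemap` and the one-entry `SAMPLE` document are assumptions made for this example; only the namespace and element names come from the slide.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap shown on the slide.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A complete one-entry sitemap, built from the slide's example entry.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
"""


def parse_sitemap(xml_bytes: bytes) -> list[dict]:
    """Return one dict per <url> entry, holding only the fields that are present."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{field}", default=None, namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


for entry in parse_sitemap(SAMPLE):
    print(entry)
```

Optional fields are simply absent from the returned dict, which mirrors the fact that many published sitemaps omit some of the metadata.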
Related Work
1. 1999: "Santa Fe Convention"
a) Led to OAI-PMH
b) "...e-print servers to expose metadata for the papers it held"
c) Coalition for Networked Information, Digital Library Federation, Open Archives Initiative (OAI), Herbert Van de Sompel, Carl Lagoze
2. 2000: Crawler-Friendly Web Servers: Brandman, Cho, Garcia-Molina and Shivakumar
a) Export list of URLs and changed content
3. 2005/6: Sitemaps:
a) Introduced in 2005 by Google
b) In 2006 Microsoft, Yahoo and Google announced joint support
Our Main Contributions
1. First Study of Sitemaps over real world
data:
a) How it is used
b) Its impact
2. Define metrics to evaluate Sitemaps feeds.
3. Explore:
a) The challenges of using Sitemaps together
with Discovery Crawl
b) Define a preliminary algorithm combining
the two crawls.
Inside Google
1. Sitemaps & Discovery
2. Sitemaps:
a) Sitemaps are fetched:
• After they are pinged.
• At several frequencies.
b) URLs discovered through Sitemaps are fed to the crawling pipeline.
c) Some sources are fed directly for instant crawling.
3. Discovery:
a) New URLs and URLs of changed content are fed back to the
pipeline
4. Pipeline
How Are Sitemaps Used?
1. Approximately 35M websites publish Sitemaps, giving us metadata for several billion URLs.
2. Metadata:
1. 61% include a priority field.
2. 58% of URLs include a last modification date.
3. 7% include a change frequency field.
3. Formats breakdown:
a) XML Sitemap: 76.76%
b) URL list: 3.42%
c) Atom: 1.61%
d) RSS: 0.11%
e) Unknown: 17.51%
4. Sitemaps discovery via robots.txt announced April 2007
Sitemaps
Case Studies
Sitemaps Use Case Studies
1. Looked at three different sites:
a) Amazon: Large.
b) CNN: Dynamic.
c) Pubmedcentral.nih.gov: Archival.
2. Amazon:
a) Huge.
b) Service Oriented Architecture:
• Hard to list valid URLs when content changes
• Research Opportunity: Auto Generation of Sitemaps
c) 20M URLs published in:
• 10,000 sitemaps files.
• Each file: 20,000-50,000 URLs.
• Log based.
d) Efficiency: URLs crawled vs unique pages
• Discovery 63%, Sitemaps 86%.
Case Study: CNN
1. Very Dynamic:
a) Many new URLs added daily
2. Sitemaps:
a) News: 200-400 URLs
b) Weekly: 2500-3000 URLs
c) Monthly: 5000-10000 URLs
d) The lists don't overlap but complement each other
e) Additional Sitemaps index of hub pages
Case Study
Pubmedcentral.nih.gov
1. Archival domain:
a) Content is added and hardly ever changes.
b) Oldest journal published in 1809.
2. Thus, can be exhaustive.
3. Sitemap files:
a) 50+ sitemaps files.
b) 30,000 URLs in each.
c) Last modification inaccurate (unlike CNN and Amazon).
Pubmedcentral.nih.gov (cont'd)
1. URL breakdown
a) Discovery and Sitemaps 3 million
b) Sitemaps only 1.7 million
c) 1 million due to duplicates
2. Manually examined 3000 sample URLs from the
missing ~300,000
a) 8% errors
b) 10% redirects
c) 11% other duplicate content
d) 51% judgment call needed (should crawl or
not)
Pubmedcentral
CNN: New URLs Seen Over Time
Evaluating Sitemaps
Evaluating Sitemaps
1. Coverage and Freshness
2. How should we judge usefulness?
3. How far does a URL get in our pipeline:
1. Seen
2. Crawled
3. Unique
4. Indexed
5. Results
6. Clicked
4. UniqueCoverage = UniqueSitemaps(D) / Unique(D)
5. IndexCoverage = IndexedSitemaps(D) / Indexed(D)
6. PageRankCoverage = RankMassSitemaps(D) / RankMass(D)
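The three ratios above can be sketched in Python. This is a minimal illustration that assumes each quantity is available as a set of URLs (or a URL-to-score mapping for rank mass) for a domain D, and that the "Sitemaps" numerator means the subset of those URLs that also appeared in the domain's Sitemaps. The function names and toy data are hypothetical.

```python
def coverage(sitemaps_urls: set[str], domain_urls: set[str]) -> float:
    """Share of a domain's URLs (unique or indexed) that Sitemaps also supplied.

    Passing the unique pages of D gives UniqueCoverage; passing the indexed
    pages of D gives IndexCoverage.
    """
    if not domain_urls:
        return 0.0
    return len(sitemaps_urls & domain_urls) / len(domain_urls)


def pagerank_coverage(sitemaps_urls: set[str], rank: dict[str, float]) -> float:
    """Share of the domain's total rank mass carried by URLs listed in Sitemaps."""
    total = sum(rank.values())
    if total == 0:
        return 0.0
    return sum(score for url, score in rank.items() if url in sitemaps_urls) / total


# Toy data for illustration only.
unique_d = {"/a", "/b", "/c", "/d"}
in_sitemaps = {"/a", "/b", "/c"}
rank_d = {"/a": 0.5, "/b": 0.25, "/c": 0.125, "/d": 0.125}

print(coverage(in_sitemaps, unique_d))
print(pagerank_coverage({"/a", "/d"}, rank_d))
```

Keeping one generic `coverage` function for both the unique and the indexed stage reflects that the two metrics differ only in which pipeline stage supplies the sets.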
Coverage
Coverage vs UniqueCoverage
UniqueCoverage vs Domain Size
• 46% of domains have above 50% UniqueCoverage.
• 12% of domains have 90% UniqueCoverage.
While PageRank Coverage…
Bang for the Buck…
Pings and Freshness
First Seen by Sitemaps
• Ping: 12.7%
• Non-Ping: 80.3%
First Seen by Discovery
• Ping: 1.5%
• Non-Ping: 5.5%
• 14.2% Discovered through pings.
• But who saw first is independent.
• Doesn't reflect the potential.
Research Opportunity: Detect and ping policy
• Of URLs seen by both Sitemaps and Discovery:
– 78% seen first by Sitemaps
– 22% seen first by Discovery
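The four first-seen percentages on this slide can be cross-checked with a few lines of Python. This is only an arithmetic sketch over the numbers quoted on the slide; the dictionary layout and variable names are assumptions for illustration.

```python
# Share of URLs by which feed saw them first and whether a ping was involved
# (percentages copied from the slide).
first_seen = {
    ("sitemaps", "ping"): 12.7,
    ("sitemaps", "non-ping"): 80.3,
    ("discovery", "ping"): 1.5,
    ("discovery", "non-ping"): 5.5,
}

# Total share of URLs associated with pings, regardless of who saw them first.
pinged = sum(v for (_, kind), v in first_seen.items() if kind == "ping")

# Total share first seen by each feed.
by_sitemaps = sum(v for (source, _), v in first_seen.items() if source == "sitemaps")
by_discovery = sum(v for (source, _), v in first_seen.items() if source == "discovery")

print(round(pinged, 1), round(by_sitemaps, 1), round(by_discovery, 1))
```

The ping total reproduces the 14.2% quoted on the slide, and the four cells together account for 100% of the URLs considered.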
Doing Both:
Sitemaps and Discovery
1. New URLs and refresh: we'll talk about new URLs.
2. You can't fetch it all ⇒ per-site quota.
3. What to fetch?
4. Crawl uses some ranking.
5. What should the ranking be for Sitemaps URLs?
6. How to balance between them?
Ranking URLs in Sitemaps
1. Priority:
a) Full authority to the webmaster.
b) Is not available all the time.
2. PageRank:
a) Proven effective.
b) Not available for truly new pages.
c) Webmasters have no say at all.
3. PriorityRank:
a) Modify the graph to take both into account.
b) Add the sitemaps file as a page implicitly linked to from the root.
c) Links from Sitemaps are weighted by priority, if available.
d) Calculate PageRank over this modified graph.
e) Hybrid of the two previous methods.
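The PriorityRank steps above might be sketched in Python as follows. This is a toy power-iteration PageRank over a small in-memory graph, not the production method. The virtual node name `__sitemap__`, the damping factor, the iteration count, the even spreading of rank from pages without outlinks, and the fallback weight of 0.5 for URLs without a priority (the Sitemaps protocol default) are all assumptions made for this illustration.

```python
from typing import Dict, List, Optional

SITEMAP_NODE = "__sitemap__"  # hypothetical virtual page standing in for the sitemaps file
DEFAULT_PRIORITY = 0.5        # assumed weight when a URL carries no priority


def priority_rank(
    links: Dict[str, List[str]],
    root: str,
    sitemap_priorities: Dict[str, Optional[float]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """PageRank over the link graph extended with a priority-weighted sitemap node."""
    # Ordinary links all get weight 1.
    edges = {page: {t: 1.0 for t in targets} for page, targets in links.items()}
    # The root implicitly links to the sitemap node.
    edges.setdefault(root, {})[SITEMAP_NODE] = 1.0
    # The sitemap node links to every listed URL, weighted by its priority.
    edges[SITEMAP_NODE] = {
        url: DEFAULT_PRIORITY if p is None else p
        for url, p in sitemap_priorities.items()
    }
    nodes = set(edges)
    for targets in list(edges.values()):
        nodes.update(targets)
    for node in nodes:
        edges.setdefault(node, {})

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src, targets in edges.items():
            total = sum(targets.values())
            if total > 0:
                # Split the source's rank across outlinks in proportion to weight.
                for dst, weight in targets.items():
                    new_rank[dst] += damping * rank[src] * weight / total
            else:
                # Pages without outlinks spread their rank evenly.
                for node in nodes:
                    new_rank[node] += damping * rank[src] / n
        rank = new_rank
    del rank[SITEMAP_NODE]  # the virtual node is not a crawlable page
    return rank


# Toy site: two linked pages plus new URLs known only from the sitemap.
toy_links = {"/": ["/a"], "/a": ["/"]}
toy_priorities = {"/new-hi": 1.0, "/new-default": None, "/new-lo": 0.1}
ranks = priority_rank(toy_links, "/", toy_priorities)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this toy graph the new URLs have no inbound links from ordinary pages, so plain PageRank could not tell them apart, while the weighted sitemap links order them by the webmaster's stated priority.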
Balancing the Crawl:
Algorithm Simplified
1. kD = kS = 1/2
2. for epoch in 0..infinity do
a) Fetch:
• Top kD * Quota from Discovery
• Top kS * Quota from Sitemaps
b) Measure derivative of the utility (IndexCoverage)
c) Adjust kD and kS
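A runnable Python sketch of this loop is given below, under several assumptions that are not on the slide: the derivative of the utility is approximated by the utility gained per URL fetched in the last epoch, the shares move by a fixed `step` toward the feed with the higher rate, each feed keeps at least `min_share` of the quota, and the loop runs for a finite number of epochs. The `fetch` callback and its constant yields are hypothetical stand-ins for the real crawl and for IndexCoverage measurement.

```python
from typing import Callable, List, Tuple


def balance_crawl(
    fetch: Callable[[str, int], float],
    quota: int,
    epochs: int,
    step: float = 0.05,
    min_share: float = 0.1,
) -> List[Tuple[float, float]]:
    """Split a per-site quota between Discovery and Sitemaps, epoch by epoch.

    fetch(source, n) crawls the top n URLs of that source and returns the
    utility gained. Returns the (kD, kS) pair chosen after each epoch.
    """
    k_d = k_s = 0.5
    history = []
    for _ in range(epochs):
        n_d = round(k_d * quota)
        n_s = quota - n_d
        gain_d = fetch("discovery", n_d)
        gain_s = fetch("sitemaps", n_s)
        # Approximate the derivative of the utility by gain per fetched URL.
        rate_d = gain_d / n_d if n_d else 0.0
        rate_s = gain_s / n_s if n_s else 0.0
        # Shift quota toward the feed with the higher marginal utility.
        if rate_s > rate_d:
            k_s = min(1.0 - min_share, k_s + step)
        elif rate_d > rate_s:
            k_s = max(min_share, k_s - step)
        k_d = 1.0 - k_s
        history.append((k_d, k_s))
    return history


def toy_fetch(source: str, n: int) -> float:
    """Hypothetical feed yields: each Sitemaps URL is worth more than a Discovery URL."""
    return n * (0.8 if source == "sitemaps" else 0.6)


history = balance_crawl(toy_fetch, quota=100, epochs=20)
print(history[0], history[-1])
```

Keeping a minimum share for each feed means the crawler continues to sample both sources, so it can notice if their relative usefulness changes in a later epoch.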
Conclusion and Future Work
1. Large scale study, real data
2. You cannot stop Discovery… yet.
3. Presented metrics for freshness and coverage.
4. Sitemaps evaluated for coverage and freshness.
5. Presented Algorithm to combine Sitemaps & Discovery
6. To Be Done
1. Good news: tons of future work
2. Duplicates not solved on web-server side either.
3. Better Pings.
4. Ranking Sitemaps URLs can be a challenge.
Acks
We wish to thank many Googlers!
thank...
Dennis Geels, Ori Gershony, Laramie, Madhu, Thomal, Alkis,
Peter Dickman, Arup, Charlie, Nish, Rosemary, Ralph, Nikhil.
The End
Thank You!
Sitemaps Above and Beyond Crawl Duty

  • 1. Sitemaps: Above and Beyond the Crawl of Duty Sitemaps! Sitemaps! Uri Schonfeld (Google and UCLA) Narayanan Shivakumar (Google) Copyright Uri Schonfeld, shuri.org April 2009
  • 2. What are we going to talk about? • The sitemaps protocol: – Not introduced in this paper – Friendly web servers publishing URL lists • Popular and growing in popularity • First large scale study over real data: • How it is used by users • Its Impact – First look at how it can be used by search engines – Lots of future work to get excited over • Let’s start with: – Underlying problem that sitemaps addresses Copyright Uri Schonfeld, shuri.org April 2009
  • 3. Dream of the Perfect Crawl 1.Users Have High Expectations: • Coverage: Every page should be findable • Freshness: Latest event, viral video,... • Deep Web: ajax, flash, silverlight,.... 1.Search Engines Dream of the perfect crawl: • Everything the users want • …but efficient: – No 404s – No duplicates 1.Sitemaps to the rescue... Copyright Uri Schonfeld, shuri.org April 2009
  • 4. Sitemaps 1. Basic idea: The web server 1.Puts a URL list, a sitemaps file, on its site 2.Includes new and changed content 3.Lets the search engines know 2. The URL list may also include:  URLs  Last Modification Time  Expected Change Frequency  Priority 1. Let the search engine know: 1."Ping" search engines that their sitemaps file has changed 2.Alternatively include sitemaps in robots.txt file (April 2007) Copyright Uri Schonfeld, shuri.org April 2009
  • 5. Sitemaps: This is how it looks <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns= "http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://www.example.com/</loc> <lastmod>2005-01-01</lastmod> <changefreq>monthly</changefreq> <priority>0.8</priority> </url> <url> ... <url> </urlset> Copyright Uri Schonfeld, shuri.org April 2009
  • 6. Related Work 1. 1999: "Santa Fe Convention" 1.Lead to OAI-PMH 2."...e-print servers to expose metadata for the papers it held" 3.Coalition for Networked Information, Digital Library Federation, Open Archives Initiative (OAI), Herbert Van de Sompel, Carl Lagoze 2. 2000: Crawler Friendly Web Servers: Brandman, Cho, Garcia- Molina and Shivakumar 1.Export list of URLs and changed content 3. 2005/6: Sitemaps: 1.Introduced in 2005 by Google 2.2006 Microsoft, Yahoo and Google announced joint support Copyright Uri Schonfeld, shuri.org April 2009
  • 7. Our Main Contributions 1. First Study of Sitemaps over real world data: a) How it is used b) It’s impact 2. Define metrics to evaluate Sitemaps feeds. 3. Explore: a) The challenges of using Sitemaps together with Discovery Crawl b) Define a preliminary algorithm combining the two crawls.Copyright Uri Schonfeld, shuri.org April 2009
  • 8. Inside Google 1. Sitemaps & Discovery 2. Sitemaps: a) Sitemaps are fetched: • After they are pinged. • Several frequencies. a) Sitemaps discovered URLs are fed to the crawling pipeline. b) Some sources are fed directly for instant crawling. 3. Discovery: a) New URLs and URLs of changed content are fed back to the pipeline 4. Pipeline Copyright Uri Schonfeld, shuri.org April 2009
  • 9. How Sitemaps Is Used? 1. Approximately 35M websites publish Sitemaps, and give us metadata for several billions of URLs. 2. Metadata: 1. 61% include a priority field. 2. 58% of URLs include a lastmodification date 3. 7% include a change frequency field 3. Formats Breakdown: a) XML Sitemap 76.76 b) Url List 3.42 c) Atom 1.61 d) RSS 0.11 e) Unknown 17.51 4. Robots.txt announced April 2007 Copyright Uri Schonfeld, shuri.org April 2009
  • 10. Sitemaps Case Studies Copyright Uri Schonfeld, shuri.org April 2009
  • 11. Sitemaps Use Case Studies 1. Looked at three different sites: a) Amazon: Large. b) CNN: Dynamic. c) Pubmedcentral.nih.gov: Archival. 2. Amazon: a) Huge. b) Service Oriented Architecture: • Hard to list valid URLs, when content changes • Research Opportunity: Auto Generation of Sitemaps a) 20M URLs published in: • 10,000 sitemaps files. • Each file: 20,000-50,000 URLs. • Log based. a) Efficiency: URLs crawled vs unique pages • Discovery 63%, Sitemaps 86%.Copyright Uri Schonfeld, shuri.org April 2009
  • 12. Case Study: CNN 1. Very Dynamic: a) Many new URLs added daily 2. Sitemaps: a) News: 200-400 URLs b) Weekly: 2500-3000 URLs c) Monthly: 5000-10000 URLs d) The lists don't overlap but complement each other e) Additional Sitemaps index of hub pages
  • 13. Case Study: Pubmedcentral.nih.gov 1. Archival domain: a) Content is added and hardly ever changed. b) Oldest journal published in 1809. 2. Thus, Sitemaps can be exhaustive. 3. Sitemap files: a) 50+ sitemaps files. b) 30,000 URLs in each. c) Last modification inaccurate (unlike CNN and Amazon).
  • 14. Pubmedcentral.nih.gov (cont.) 1. URL breakdown: a) Discovery and Sitemaps 3 million b) Sitemaps only 1.7 million c) 1 million due to duplicates 2. Manually examined 3000 sample URLs from the missing ~300,000: a) 8% errors b) 10% redirects c) 11% other duplicate content d) 51% judgment call needed (should crawl or not)
  • 16. CNN: New URLs Seen Over Time
  • 17. Evaluating Sitemaps
  • 18. Evaluating Sitemaps 1. Coverage and Freshness 2. How should we judge usefulness? 3. How far does a URL get in our pipeline: 1. Seen 2. Crawled 3. Unique 4. Indexed 5. Results 6. Clicked 4. UniqueCoverage = UniqueSitemaps(D) / Unique(D) 5. IndexCoverage = IndexedSitemaps(D) / Indexed(D) 6. PageRankCoverage = RankMassSitemaps(D) / RankMass(D)
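The three coverage ratios defined on this slide can be sketched for a single domain D as follows. The `Page` record, its field names, and the toy domain are assumptions made for illustration. In particular, the sketch assumes RankMass(D) sums PageRank over all known URLs of the domain, which the slide does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One known URL of domain D (hypothetical record layout)."""
    url: str
    unique: bool       # survived duplicate elimination
    indexed: bool      # made it into the index
    rank: float        # PageRank mass of the URL
    in_sitemaps: bool  # listed in the domain's Sitemaps


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def coverage_metrics(pages):
    """Compute UniqueCoverage, IndexCoverage and PageRankCoverage for D."""
    unique = [p for p in pages if p.unique]
    indexed = [p for p in pages if p.indexed]
    return {
        "UniqueCoverage": _ratio(sum(p.in_sitemaps for p in unique), len(unique)),
        "IndexCoverage": _ratio(sum(p.in_sitemaps for p in indexed), len(indexed)),
        "PageRankCoverage": _ratio(
            sum(p.rank for p in pages if p.in_sitemaps),
            sum(p.rank for p in pages),
        ),
    }


# Toy domain with four known URLs.
DOMAIN = [
    Page("/a", unique=True, indexed=True, rank=0.5, in_sitemaps=True),
    Page("/b", unique=True, indexed=True, rank=0.25, in_sitemaps=False),
    Page("/c", unique=True, indexed=False, rank=0.125, in_sitemaps=True),
    Page("/d", unique=False, indexed=False, rank=0.125, in_sitemaps=True),
]
metrics = coverage_metrics(DOMAIN)
```

In this toy domain Sitemaps lists two of the three unique pages, one of the two indexed pages, and URLs holding three quarters of the rank mass, so the three metrics can disagree about how useful the same feed is.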
  • 19. Coverage
  • 20. Coverage vs UniqueCoverage
  • 21. UniqueCoverage vs Domain Size • 46% of domains have above 50% UniqueCoverage. • 12% of domains have above 90% UniqueCoverage.
  • 22. While PageRank Coverage…
  • 23. Bang for the Buck…
  • 24. Pings and Freshness First Seen by Sitemaps: • Ping: 12.7% • Non-Ping: 80.3% First Seen by Discovery: • Ping: 1.5% • Non-Ping: 5.5% • 14.2% of URLs were discovered through pings. • But which crawl saw a URL first is independent of whether it was pinged. • Doesn't reflect the potential. Research Opportunity: Detect and ping policy • Of URLs seen by both Sitemaps and Discovery: – 78% seen first by Sitemaps – 22% seen first by Discovery
  • 25. Doing Both: Sitemaps and Discovery 1. New URLs and Refresh: we’ll talk about new URLs. 2. You can't fetch it all ⇒ per-site quota. 3. What to fetch? 4. Crawl uses some ranking. 5. What should the ranking for Sitemaps URLs be? 6. How to balance between them?
  • 26. Ranking URLs in Sitemaps 1. Priority: 1. Full authority to the webmaster. 2. Is not available all the time. 2. PageRank: 1. Proven effective. 2. Not available for the truly new pages. 3. Webmasters don't have a say at all. 3. PriorityRank: 1. Modify graph to take both into account 2. Add sitemaps as a page implicitly linked to from the root. 3. Links from Sitemaps are weighted by priority if available 4. Calculate PageRank over this modified graph. 5. Hybrid of the two previous methods.
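The PriorityRank steps above can be sketched as a weighted PageRank over a modified link graph. This is an illustrative reading of the slide, not the authors' implementation. The function names, the `__sitemaps__` node label, the damping factor 0.85, the weight 1.0 on the root-to-Sitemaps link, and the fallback priority 0.5 (the Sitemaps protocol default) are all assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    `graph` maps node -> {target: link_weight}. Nodes with no outgoing
    weight spread their rank evenly over all nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph.get(v, {})
            total = sum(out.values())
            if total == 0:
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
            else:
                for u, weight in out.items():
                    new[u] += damping * rank[v] * weight / total
        rank = new
    return rank


def priority_rank(link_graph, root, sitemap_priorities, default_priority=0.5):
    """PageRank over the link graph plus an implicit Sitemaps page.

    `sitemap_priorities` maps URL -> priority, or None if not provided.
    """
    sitemap_node = "__sitemaps__"  # hypothetical label for the added page
    graph = {v: dict(targets) for v, targets in link_graph.items()}
    # The root implicitly links to the Sitemaps page.
    graph.setdefault(root, {})[sitemap_node] = 1.0
    # Links from the Sitemaps page are weighted by priority when available.
    graph[sitemap_node] = {
        url: (p if p is not None else default_priority)
        for url, p in sitemap_priorities.items()
    }
    scores = pagerank(graph)
    scores.pop(sitemap_node)
    return scores


# Two linked pages plus two brand-new URLs known only from Sitemaps.
links = {"/": {"/a": 1.0}, "/a": {"/": 1.0}}
scores = priority_rank(links, "/", {"/new-hi": 0.9, "/new-lo": 0.1})
```

Here both new URLs receive a score even though no page links to them yet, and the one the webmaster marked with the higher priority ranks above the other, which is the hybrid behaviour the slide describes.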
  • 27. Balancing the Crawl: Algorithm Simplified 1. kD = kS = 1/2 2. for epoch in 0..infinity do 1. Fetch: 1. Top kD * Quota from Discovery 2. Top kS * Quota from Sitemaps 2. Measure derivative of the utility (IndexCoverage) 3. Adjust kD and kS
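A minimal sketch of the balancing loop is given below, under several assumptions the slide leaves open. The derivative of the utility is approximated by the utility gained per fetched URL in the last epoch. The shares move by a fixed `step` toward the better source. A `min_share` floor keeps both crawls running. The names `balance_crawl` and `utility` are hypothetical.

```python
def balance_crawl(discovery, sitemaps, quota, utility, epochs,
                  step=0.1, min_share=0.1):
    """Split a per-site quota between Discovery and Sitemaps URLs.

    `discovery` and `sitemaps` are URL lists sorted best-first by rank.
    `utility(urls)` returns the utility gained by fetching `urls`
    (for example, how many of them end up indexed).
    Returns the (kD, kS) shares after each epoch.
    """
    k_d = k_s = 0.5
    d_pos = s_pos = 0
    history = []
    for _ in range(epochs):
        # Fetch the top kD * quota from Discovery and the rest from Sitemaps.
        n_d = round(k_d * quota)
        n_s = quota - n_d
        batch_d = discovery[d_pos:d_pos + n_d]
        batch_s = sitemaps[s_pos:s_pos + n_s]
        d_pos += n_d
        s_pos += n_s
        # Utility per fetched URL stands in for the derivative of the utility.
        gain_d = utility(batch_d) / len(batch_d) if batch_d else 0.0
        gain_s = utility(batch_s) / len(batch_s) if batch_s else 0.0
        # Shift quota toward the source with the higher marginal gain.
        if gain_s > gain_d:
            k_s = min(1.0 - min_share, k_s + step)
        elif gain_d > gain_s:
            k_s = max(min_share, k_s - step)
        k_d = 1.0 - k_s
        history.append((k_d, k_s))
    return history


# Toy run: only Sitemaps URLs end up indexed, so kS should grow.
discovery_urls = ["d%d" % i for i in range(100)]
sitemaps_urls = ["s%d" % i for i in range(100)]
indexed = set(sitemaps_urls)
history = balance_crawl(
    discovery_urls, sitemaps_urls, quota=10,
    utility=lambda urls: sum(u in indexed for u in urls), epochs=3,
)
```

The floor on each share reflects the deck's point that Discovery cannot simply be switched off; a real system would presumably use a smoother adjustment than this fixed step.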
  • 28. Conclusion and Future Work 1. Large-scale study, real data 2. You cannot stop Discovery… yet. 3. Presented metrics for freshness and coverage. 4. Sitemaps evaluated for coverage and freshness. 5. Presented an algorithm to combine Sitemaps & Discovery 6. To Be Done 1. Good news: tons of future work 2. Duplicates not solved on web-server side either. 3. Better Pings. 4. Ranking Sitemaps URLs can be a challenge.
  • 29. Acks We wish to thank many Googlers: Dennis Geels, Ori Gershony, Laramie, Madhu, Thomal, Alkis, Peter Dickman, Arup, Charlie, Nish, Rosemary, Ralph, Nikhil.
  • 30. The End Thank You!

Editor's Notes

  1. More Bursty than CNN Seems
  2. Very dynamic. Search engine adjusts Discovery rate.
  3. Crawled in 2008. am*.com: 500 million URLs.
  4. Crawled in 2008. am*.com: 500 million URLs. Duplicates in Sitemaps and Discovery mostly similar.
  5. 46% of domains have >50% UniqueCoverage; 12% have >90% UniqueCoverage.
  6. Most domains are above the diagonal: a higher percentage of URLs in the index is achieved with fewer unique pages, so the Sitemaps crawl attains a higher utility.