SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1
Graph Structure in the Web
Revisited
Robert Meusel, Sebastiano Vigna,
Oliver Lehmberg, Christian Bizer
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 2
Textbook Knowledge about the Web Graph
Broder et al.: Graph structure in the Web. WWW2000.
used two AltaVista crawls (200 million pages, 1.5 billion links)
Results
Power Laws Bow-Tie
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 3
This talk will:
1. Show that the textbook knowledge might
be wrong or dependent on crawling process.
2. Provide you with a large recent Web graph
to do further research.
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 4
Outline
1. Public Web Crawls
2. The Web Data Commons Hyperlink Graph
3. Analysis of the Graph
1. In-degree & Out-degree Distributions
2. Node Centrality
3. Strong Components
4. Bow Tie
5. Reachability and Average Shortest Path
4. Conclusion
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 5
Public Web Crawls
1. AltaVista Crawl distributed by Yahoo! WebScope 2002
• Size: 1.4 billion pages
• Problem: Largest strongly connected component 4%
2. ClueWeb 2009
• Size: 1 billion pages
• Problem: Largest strongly connected component 3%
3. ClueWeb 2012
• Size: 733 million pages
• Largest strongly connected component 76%
• Problem: Only English pages
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 6
The Common Crawl
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 7
The Common Crawl Foundation
Regularly publishes Web crawls on Amazon S3.
Five crawls available so far:
Crawling Strategy (Spring 2012)
• breadth-first visiting strategy
• at least 71 million seeds from previous crawls and from Wikipedia
Date # Pages
2010 2.5 billion
Spring 2012 3.5 billion
Spring 2013 2.0 billion
Winter 2013 2.0 billion
Spring 2014 2.5 billion
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 8
Web Data Commons – Hyperlink Graph
extracted from the Spring 2012 version of the Common Crawl
size
3.5 billion nodes
128 billion arcs
pages originate from 43 million pay-level domains (PLDs)
• 240 million PLDs were registered in 2012 * (18%)
world-wide coverage
* http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 9
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
4 aggregation levels:
Extraction code is published under Apache License
• Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
Graph #Nodes #Arcs Size (zipped)
Page graph 3.56 billion 128.73 billion 376 GB
Subdomain graph 101 million 2,043 million 10 GB
1st level subdomain graph 95 million 1,937 million 9.5 GB
PLD graph 43 million 623 million 3.1 GB
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 10
Analysis of the Graph
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 11
In-Degree Distribution
Broder et al. (2000)
Power law with exponent 2.1
WDC Hyperlink Graph (2012)
Best power law exponent 2.24
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 12
In-Degree Distribution
Power law
fitted using
plfit-tool.
Maximum
likelihood
fitting.
Starting
degree:
1129
Best power
law exponent:
2.24
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 13
Goodness of Fit Test
Method
• Clauset et al.:
Power-Law Distributions in Empirical Data. SIAM Review 2009.
• p-value < 0.1  power law not a plausible hypothesis
Goodness of fit result
• p-value = 0
Conclusions:
• in-degree does not follow power law
• in-degree has non-fat heavy-tailed distribution
• maybe log-normal?
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 14
Out-Degree Distribution
Broder et al.:
Power law
exponent 2.78
WDC:
Best power law
exponent 2.77
p-value = 0
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 15
Node Centrality
http://wwwranking.webdatacommons.org
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 16
Average Degree
Broder et al. 2000: 7.5
WDC 2012: 36.8
 Factor 4.9 larger
Possible explanation: HTML templates of CMS
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 17
Strongly Connected Components
Calculated using WebGraph framework on a machine with 1 TB RAM.
Largest SCC
Broder: 27.7%
WDC: 51.3 %
 Factor 1.8 larger
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 18
The Bow-Tie Structure of Broder et al. 2000
Balanced size of IN and OUT: 21%
Size of LSCC: 27%
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 19
The Bow-Tie Structure of WDC Hyperlinkgraph 2012
IN much larger than OUT: 31% vs. 6%
LSCC much larger: 51%
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 20
Zhu et al. WWW2008
The Chinese web looks like a tea-pot.
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 21
Reachability and Average Shortest Path
Broder et al. 2000
Pairs of pages connected by
path: 25%
Average shortest path: 16.12
WDC Webgraph 2012
Pairs of pages connected by
path: 48%
Average shortest path: 12.84
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 22
Conclusions
1. Web has become more dense and more connected
• Average degree has grown significantly in last 13 years (factor 5)
• Connectivity between pairs of pages has doubled
2. Macroscopic structure
• There is large SCC of growing size.
• The shape of the bow-tie seems to depend on the crawl
3. In- and out-degree distributions do not follow power laws.
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 23
Questions?
Advertisement
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 24

Weitere ähnliche Inhalte

Was ist angesagt?

The Basics of MongoDB
The Basics of MongoDBThe Basics of MongoDB
The Basics of MongoDBvaluebound
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptxSurya937648
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Basics of MongoDB
Basics of MongoDB Basics of MongoDB
Basics of MongoDB Habilelabs
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Elastic Search
Elastic SearchElastic Search
Elastic SearchNavule Rao
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentationHyphen Call
 
MongoDB Atlas
MongoDB AtlasMongoDB Atlas
MongoDB AtlasMongoDB
 
Intro To MongoDB
Intro To MongoDBIntro To MongoDB
Intro To MongoDBAlex Sharp
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Willy Lulciuc
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBNodeXperts
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architectureBishal Khanal
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDBMongoDB
 

Was ist angesagt? (20)

The Basics of MongoDB
The Basics of MongoDBThe Basics of MongoDB
The Basics of MongoDB
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptx
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Mongodb
MongodbMongodb
Mongodb
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Basics of MongoDB
Basics of MongoDB Basics of MongoDB
Basics of MongoDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentation
 
MongoDB Atlas
MongoDB AtlasMongoDB Atlas
MongoDB Atlas
 
Intro To MongoDB
Intro To MongoDBIntro To MongoDB
Intro To MongoDB
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDB
 

Andere mochten auch

Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Webdailyye
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Kira
 
Wwsss intro2016-final
Wwsss intro2016-finalWwsss intro2016-final
Wwsss intro2016-finalSteffen Staab
 
Web Science Framework and InterDataNet
Web Science Framework and InterDataNetWeb Science Framework and InterDataNet
Web Science Framework and InterDataNetmaria chiara pettenati
 
Graphs, Edges & Nodes - Untangling the Social Web
Graphs, Edges & Nodes - Untangling the Social WebGraphs, Edges & Nodes - Untangling the Social Web
Graphs, Edges & Nodes - Untangling the Social WebJoël Perras
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataMarko Rodriguez
 

Andere mochten auch (7)

Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Web
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Wwsss intro2016-final
Wwsss intro2016-finalWwsss intro2016-final
Wwsss intro2016-final
 
Web Science Framework and InterDataNet
Web Science Framework and InterDataNetWeb Science Framework and InterDataNet
Web Science Framework and InterDataNet
 
Graphs, Edges & Nodes - Untangling the Social Web
Graphs, Edges & Nodes - Untangling the Social WebGraphs, Edges & Nodes - Untangling the Social Web
Graphs, Edges & Nodes - Untangling the Social Web
 
Intro to Web Science (Fall 2013)
Intro to Web Science (Fall 2013)Intro to Web Science (Fall 2013)
Intro to Web Science (Fall 2013)
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 

Ähnlich wie Graph Structure in the Web - Revisited. WWW2014 Web Science Track

The Graph Structure of the Web - Aggregated by Pay-Level Domain
The Graph Structure of the Web - Aggregated by Pay-Level DomainThe Graph Structure of the Web - Aggregated by Pay-Level Domain
The Graph Structure of the Web - Aggregated by Pay-Level Domainoli-unima
 
Creating knowledge out of interlinked data
Creating knowledge out of interlinked dataCreating knowledge out of interlinked data
Creating knowledge out of interlinked dataSören Auer
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningIJERA Editor
 
Preparation of Web Mapping Application of Balephi-B Hydropower Project
Preparation of Web Mapping Application of Balephi-B Hydropower ProjectPreparation of Web Mapping Application of Balephi-B Hydropower Project
Preparation of Web Mapping Application of Balephi-B Hydropower ProjectBiplov Bhandari
 
TR14-05_Martindell.pdf
TR14-05_Martindell.pdfTR14-05_Martindell.pdf
TR14-05_Martindell.pdfTomTom149267
 
Restructuring a Web Application, Using Spring and Hibernate
Restructuring a Web Application, Using Spring and HibernateRestructuring a Web Application, Using Spring and Hibernate
Restructuring a Web Application, Using Spring and Hibernategustavoeliano
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Deepak semantic web_iitd
Deepak semantic web_iitdDeepak semantic web_iitd
Deepak semantic web_iitdDeepak Shevani
 
An End User Perspective on Implementing Oracle in the Engineering Environment
An End User Perspective on Implementing Oracle in the Engineering EnvironmentAn End User Perspective on Implementing Oracle in the Engineering Environment
An End User Perspective on Implementing Oracle in the Engineering Environmentjeffhobbs
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1Geoffrey Fox
 
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHP
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHPWEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHP
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHPzend
 
GeoMapFish User-Group - March 2021
GeoMapFish User-Group - March 2021GeoMapFish User-Group - March 2021
GeoMapFish User-Group - March 2021remyguillaume
 
11.concept for a web map implementation with faster query response
11.concept for a web map implementation with faster query response11.concept for a web map implementation with faster query response
11.concept for a web map implementation with faster query responseAlexander Decker
 

Ähnlich wie Graph Structure in the Web - Revisited. WWW2014 Web Science Track (20)

The Graph Structure of the Web - Aggregated by Pay-Level Domain
The Graph Structure of the Web - Aggregated by Pay-Level DomainThe Graph Structure of the Web - Aggregated by Pay-Level Domain
The Graph Structure of the Web - Aggregated by Pay-Level Domain
 
Creating knowledge out of interlinked data
Creating knowledge out of interlinked dataCreating knowledge out of interlinked data
Creating knowledge out of interlinked data
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
 
Preparation of Web Mapping Application of Balephi-B Hydropower Project
Preparation of Web Mapping Application of Balephi-B Hydropower ProjectPreparation of Web Mapping Application of Balephi-B Hydropower Project
Preparation of Web Mapping Application of Balephi-B Hydropower Project
 
TR14-05_Martindell.pdf
TR14-05_Martindell.pdfTR14-05_Martindell.pdf
TR14-05_Martindell.pdf
 
Df25632640
Df25632640Df25632640
Df25632640
 
WEB2.0 And CLOUD
WEB2.0 And CLOUDWEB2.0 And CLOUD
WEB2.0 And CLOUD
 
Restructuring a Web Application, Using Spring and Hibernate
Restructuring a Web Application, Using Spring and HibernateRestructuring a Web Application, Using Spring and Hibernate
Restructuring a Web Application, Using Spring and Hibernate
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Deepak semantic web_iitd
Deepak semantic web_iitdDeepak semantic web_iitd
Deepak semantic web_iitd
 
HoLIS GIS Update
HoLIS GIS UpdateHoLIS GIS Update
HoLIS GIS Update
 
An End User Perspective on Implementing Oracle in the Engineering Environment
An End User Perspective on Implementing Oracle in the Engineering EnvironmentAn End User Perspective on Implementing Oracle in the Engineering Environment
An End User Perspective on Implementing Oracle in the Engineering Environment
 
LOD2 webinar series: Virtuoso by OpenLink Software
LOD2 webinar series: Virtuoso by OpenLink SoftwareLOD2 webinar series: Virtuoso by OpenLink Software
LOD2 webinar series: Virtuoso by OpenLink Software
 
The Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of Leipzig
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1
 
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHP
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHPWEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHP
WEB 2.0: BUILDING RICH INTERNET APPLICATIONS WITH PHP
 
Hackference
HackferenceHackference
Hackference
 
Webware Webinar
Webware WebinarWebware Webinar
Webware Webinar
 
GeoMapFish User-Group - March 2021
GeoMapFish User-Group - March 2021GeoMapFish User-Group - March 2021
GeoMapFish User-Group - March 2021
 
11.concept for a web map implementation with faster query response
11.concept for a web map implementation with faster query response11.concept for a web map implementation with faster query response
11.concept for a web map implementation with faster query response
 

Mehr von Chris Bizer

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?Chris Bizer
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesChris Bizer
 
Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingChris Bizer
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebChris Bizer
 
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Chris Bizer
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Chris Bizer
 
Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Chris Bizer
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesChris Bizer
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsChris Bizer
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million WebsitesChris Bizer
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsChris Bizer
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Chris Bizer
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataChris Bizer
 

Mehr von Chris Bizer (14)

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning Techniques
 
Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
 
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
 
Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and Applications
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million Websites
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical Domains
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications.
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of Data
 

Kürzlich hochgeladen

Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONrouseeyyy
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 

Kürzlich hochgeladen (20)

Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 

Graph Structure in the Web - Revisited. WWW2014 Web Science Track

  • 1. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1 Graph Structure in the Web Revisited Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer
  • 2. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 2 Textbook Knowledge about the Web Graph Broder et al.: Graph structure in the Web. WWW2000. used two AltaVista crawls (200 million pages, 1.5 billion links) Results Power Laws Bow-Tie
  • 3. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 3 This talk will: 1. Show that the textbook knowledge might be wrong or dependent on crawling process. 2. Provide you with a large recent Web graph to do further research.
  • 4. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 4 Outline 1. Public Web Crawls 2. The Web Data Commons Hyperlink Graph 3. Analysis of the Graph 1. In-degree & Out-degree Distributions 2. Node Centrality 3. Strong Components 4. Bow Tie 5. Reachability and Average Shortest Path 4. Conclusion
  • 5. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 5 Public Web Crawls 1. AltaVista Crawl distributed by Yahoo! WebScope 2002 • Size: 1.4 billion pages • Problem: Largest strongly connected component 4% 2. ClueWeb 2009 • Size: 1 billion pages • Problem: Largest strongly connected component 3% 3. ClueWeb 2012 • Size: 733 million pages • Largest strongly connected component 76% • Problem: Only English pages
  • 6. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 6 The Common Crawl
  • 7. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 7 The Common Crawl Foundation Regularly publishes Web crawls on Amazon S3. Five crawls available so far: Crawling Strategy (Spring 2012) • breadth-first visiting strategy • at least 71 million seeds from previous crawls and from Wikipedia Date # Pages 2010 2.5 billion Spring 2012 3.5 billion Spring 2013 2.0 billion Winter 2013 2.0 billion Spring 2014 2.5 billion
  • 8. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 8 Web Data Commons – Hyperlink Graph extracted from the Spring 2012 version of the Common Crawl size 3.5 billion nodes 128 billion arcs pages originate from 43 million pay-level domains (PLDs) • 240 million PLDs were registered in 2012 * (18%) world-wide coverage * http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
  • 9. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 9 Downloading the WDC Hyperlink Graph http://webdatacommons.org/hyperlinkgraph/ 4 aggregation levels: Extraction code is published under Apache License • Extraction costs per run: ~ 200 US$ in Amazon EC2 fees Graph #Nodes #Arcs Size (zipped) Page graph 3.56 billion 128.73 billion 376 GB Subdomain graph 101 million 2,043 million 10 GB 1st level subdomain graph 95 million 1,937 million 9.5 GB PLD graph 43 million 623 million 3.1 GB
  • 10. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 10 Analysis of the Graph
  • 11. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 11 In-Degree Distribution Broder et al. (2000) Power law with exponent 2.1 WDC Hyperlink Graph (2012) Best power law exponent 2.24
  • 12. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 12 In-Degree Distribution Power law fitted using plfit-tool. Maximum likelihood fitting. Starting degree: 1129 Best power law exponent: 2.24
  • 13. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 13 Goodness of Fit Test Method • Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009. • p-value < 0.1  power law not a plausible hypothesis Goodness of fit result • p-value = 0 Conclusions: • in-degree does not follow power law • in-degree has non-fat heavy-tailed distribution • maybe log-normal?
  • 14. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 14 Out-Degree Distribution Broder et al.: Power law exponent 2.78 WDC: Best power law exponent 2.77 p-value = 0
  • 15. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 15 Node Centrality http://wwwranking.webdatacommons.org
  • 16. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 16 Average Degree Broder et al. 2000: 7.5 WDC 2012: 36.8  Factor 4.9 larger Possible explanation: HTML templates of CMS
  • 17. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 17 Strongly Connected Components Calculated using WebGraph framework on a machine with 1 TB RAM. Largest SCC Broder: 27.7% WDC: 51.3 %  Factor 1.8 larger
  • 18. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 18 The Bow-Tie Structure of Broder et al. 2000 Balanced size of IN and OUT: 21% Size of LSCC: 27%
  • 19. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 19 The Bow-Tie Structure of WDC Hyperlinkgraph 2012 IN much larger than OUT: 31% vs. 6% LSCC much larger: 51%
  • 20. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 20 Zhu et al. WWW2008 The Chinese web looks like a tea-pot.
  • 21. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 21 Reachability and Average Shortest Path Broder et al. 2000 Pairs of pages connected by path: 25% Average shortest path: 16.12 WDC Webgraph 2012 Pairs of pages connected by path: 48% Average shortest path: 12.84
  • 22. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 22 Conclusions 1. Web has become more dense and more connected • Average degree has grown significantly in last 13 years (factor 5) • Connectivity between pairs of pages has doubled 2. Macroscopic structure • There is large SCC of growing size. • The shape of the bow-tie seems to depend on the crawl 3. In- and out-degree distributions do not follow power laws.
  • 23. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 23 Questions? Advertisement WebDataCommons.org also offers: 1. Corpus of 17 billion RDFa, Microdata, Microformats statements 2. Corpus of 147 million relational HTML tables
  • 24. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 24