Graph Structure in the Web - Revisited. WWW2014 Web Science Track

Chris Bizer, Professor at University of Mannheim
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1
Graph Structure in the Web
Revisited
Robert Meusel, Sebastiano Vigna,
Oliver Lehmberg, Christian Bizer
Textbook Knowledge about the Web Graph
Broder et al.: Graph structure in the Web. WWW2000.
used two AltaVista crawls (200 million pages, 1.5 billion links)
Results: Power Laws, Bow-Tie
This talk will:
1. Show that the textbook knowledge might be wrong or dependent on the crawling process.
2. Provide you with a large recent Web graph to do further research.
Outline
1. Public Web Crawls
2. The Web Data Commons Hyperlink Graph
3. Analysis of the Graph
1. In-degree & Out-degree Distributions
2. Node Centrality
3. Strong Components
4. Bow Tie
5. Reachability and Average Shortest Path
4. Conclusion
Public Web Crawls
1. AltaVista Crawl distributed by Yahoo! WebScope 2002
• Size: 1.4 billion pages
• Problem: Largest strongly connected component 4%
2. ClueWeb 2009
• Size: 1 billion pages
• Problem: Largest strongly connected component 3%
3. ClueWeb 2012
• Size: 733 million pages
• Largest strongly connected component 76%
• Problem: Only English pages
The Common Crawl
The Common Crawl Foundation
Regularly publishes Web crawls on Amazon S3.
Five crawls available so far:
Date          # Pages
2010          2.5 billion
Spring 2012   3.5 billion
Spring 2013   2.0 billion
Winter 2013   2.0 billion
Spring 2014   2.5 billion
Crawling Strategy (Spring 2012)
• breadth-first visiting strategy
• at least 71 million seeds from previous crawls and from Wikipedia
Web Data Commons – Hyperlink Graph
extracted from the Spring 2012 version of the Common Crawl
size
3.5 billion nodes
128 billion arcs
pages originate from 43 million pay-level domains (PLDs)
• about 18% of the 240 million PLDs registered in 2012 *
world-wide coverage
* http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
Extraction code is published under Apache License
• Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
4 aggregation levels:
Graph                       #Nodes         #Arcs            Size (zipped)
Page graph                  3.56 billion   128.73 billion   376 GB
Subdomain graph             101 million    2,043 million    10 GB
1st level subdomain graph   95 million     1,937 million    9.5 GB
PLD graph                   43 million     623 million      3.1 GB
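The aggregation levels can be illustrated with a minimal sketch that collapses page-level arcs into host-level arcs. This is not the WDC extraction code: the function names and toy URLs are hypothetical, and dropping arcs inside one host and merging parallel arcs are assumptions made for the example. Aggregating to PLDs would additionally need a public suffix list, which is left out here.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def aggregate_by_host(page_arcs):
    """Collapse page-level arcs into a set of host-level arcs.

    Assumption for this sketch: arcs between pages on the same host are
    dropped, and parallel arcs between the same pair of hosts are merged.
    """
    host_arcs = set()
    for src, dst in page_arcs:
        s, d = host_of(src), host_of(dst)
        if s and d and s != d:
            host_arcs.add((s, d))
    return host_arcs


# Toy page graph (hypothetical URLs).
page_arcs = [
    ("http://a.example.org/1", "http://a.example.org/2"),
    ("http://a.example.org/1", "http://b.example.com/x"),
    ("http://a.example.org/2", "http://b.example.com/y"),
    ("http://b.example.com/x", "http://www.example.net/"),
]
host_arcs = aggregate_by_host(page_arcs)
print(sorted(host_arcs))
```

Four page arcs shrink to two host arcs here, which mirrors why the aggregated graphs in the table are so much smaller than the page graph.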
Analysis of the Graph
In-Degree Distribution
Broder et al. (2000)
Power law with exponent 2.1
WDC Hyperlink Graph (2012)
Best power law exponent 2.24
In-Degree Distribution
Power law fitted using the plfit tool (maximum likelihood fitting).
Starting degree: 1129
Best power law exponent: 2.24
Goodness of Fit Test
Method
• Clauset et al.:
Power-Law Distributions in Empirical Data. SIAM Review 2009.
• p-value < 0.1 → power law not a plausible hypothesis
Goodness of fit result
• p-value = 0
Conclusions:
• in-degree does not follow power law
• in-degree has non-fat heavy-tailed distribution
• maybe log-normal?
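As a rough illustration of maximum likelihood fitting, the sketch below estimates a power law exponent for the tail above a given starting value. It uses the continuous estimator alpha = 1 + n / sum(ln(x_i / x_min)) described by Clauset et al., which is only an approximation for integer degrees, and it is not the plfit tool itself. The function names and the synthetic sample are assumptions; the goodness-of-fit test that produces the p-value is not reproduced.

```python
import math
import random


def powerlaw_alpha_mle(values, xmin):
    """Continuous maximum likelihood estimate of the exponent for values >= xmin."""
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum, len(tail)


def sample_powerlaw(n, alpha, xmin, rng):
    """Draw n synthetic values from a continuous power law via inverse transform."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic data only: the exponent and starting value are borrowed from the
# slide so that the estimator can be checked against a known answer.
rng = random.Random(42)
samples = sample_powerlaw(20000, 2.24, 1129.0, rng)
alpha_hat, n_tail = powerlaw_alpha_mle(samples, 1129.0)
print(round(alpha_hat, 2), n_tail)
```

Recovering the exponent from data that really is a power law says nothing about whether real in-degrees follow one. That question is what the goodness-of-fit test above answers.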
Out-Degree Distribution
Broder et al.: Power law exponent 2.78
WDC: Best power law exponent 2.77
p-value = 0
Node Centrality
http://wwwranking.webdatacommons.org
Average Degree
Broder et al. 2000: 7.5
WDC 2012: 36.8
→ Factor 4.9 larger
Possible explanation: HTML templates of CMS
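For readers who want to check the arithmetic: the average degree of a directed graph is the number of arcs divided by the number of nodes. The sketch below computes it for a toy edge list (hypothetical data) and recomputes the growth factor from the two averages quoted on the slide.

```python
def average_degree(num_arcs: int, num_nodes: int) -> float:
    """Average number of arcs per node in a directed graph."""
    return num_arcs / num_nodes


# Toy edge list (hypothetical): 5 arcs over 4 nodes.
arcs = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
nodes = {n for arc in arcs for n in arc}
toy_avg = average_degree(len(arcs), len(nodes))

# Growth factor between the two averages quoted on the slide.
growth = 36.8 / 7.5
print(toy_avg, round(growth, 1))
```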
Strongly Connected Components
Calculated using WebGraph framework on a machine with 1 TB RAM.
Largest SCC
Broder: 27.7%
WDC: 51.3%
→ Factor 1.8 larger
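The share of nodes in the largest strongly connected component can be sketched on a toy graph with Kosaraju's two-pass algorithm. This is an in-memory illustration with hypothetical names and data, not the WebGraph framework used for the talk, and it would not scale to billions of nodes.

```python
def largest_scc_fraction(num_nodes, arcs):
    """Fraction of nodes in the largest strongly connected component (Kosaraju)."""
    fwd = [[] for _ in range(num_nodes)]
    rev = [[] for _ in range(num_nodes)]
    for u, v in arcs:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    visited = [False] * num_nodes
    order = []
    for start in range(num_nodes):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            advanced = False
            for nxt in successors:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, iter(fwd[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order;
    # each tree found here is one strongly connected component.
    assigned = [False] * num_nodes
    largest = 0
    for start in reversed(order):
        if assigned[start]:
            continue
        assigned[start] = True
        size = 0
        stack = [start]
        while stack:
            node = stack.pop()
            size += 1
            for prev in rev[node]:
                if not assigned[prev]:
                    assigned[prev] = True
                    stack.append(prev)
        largest = max(largest, size)
    return largest / num_nodes


# Toy graph: cycle 0 -> 1 -> 2 -> 0, node 3 links in, node 4 is linked to,
# node 5 is isolated.
toy_arcs = [(0, 1), (1, 2), (2, 0), (3, 0), (2, 4)]
lscc_share = largest_scc_fraction(6, toy_arcs)
print(lscc_share)
```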
The Bow-Tie Structure of Broder et al. 2000
Balanced size of IN and OUT: 21%
Size of LSCC: 27%
The Bow-Tie Structure of WDC Hyperlinkgraph 2012
IN much larger than OUT: 31% vs. 6%
LSCC much larger: 51%
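The bow-tie components can be derived from two breadth-first searches, assuming one node of the largest strongly connected component is already known. Nodes reachable from that node and able to reach it form the LSCC, nodes only reachable from it form OUT, and nodes that can only reach it form IN. The names and toy graph below are hypothetical, and tendrils, tubes and disconnected nodes are lumped together as OTHER.

```python
from collections import deque


def reachable(adj, start):
    """Set of nodes reachable from start (including start) via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(nodes, arcs, core_node):
    """Split nodes into LSCC, IN, OUT and OTHER relative to core_node's component."""
    fwd, rev = {}, {}
    for u, v in arcs:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)
    forward = reachable(fwd, core_node)   # core plus everything it can reach
    backward = reachable(rev, core_node)  # core plus everything that reaches it
    core = forward & backward
    out_part = forward - core
    in_part = backward - core
    other = set(nodes) - core - in_part - out_part
    return {"LSCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}


# Toy graph: cycle 0 -> 1 -> 2 -> 0, node 3 links in, node 4 is linked to,
# node 5 is isolated. Node 0 is assumed to lie in the largest SCC.
toy_arcs = [(0, 1), (1, 2), (2, 0), (3, 0), (2, 4)]
parts = bow_tie(range(6), toy_arcs, core_node=0)
shares = {name: len(members) / 6 for name, members in parts.items()}
print(shares)
```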
Zhu et al. WWW2008
The Chinese web looks like a tea-pot.
Reachability and Average Shortest Path
Broder et al. 2000
Pairs of pages connected by a path: 25%
Average shortest path: 16.12
WDC Webgraph 2012
Pairs of pages connected by a path: 48%
Average shortest path: 12.84
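Both measures can be computed exactly on a small graph by running a breadth-first search from every node, as in the sketch below. The names and toy graph are hypothetical, and the exhaustive approach is an assumption chosen for clarity: it is quadratic in the worst case and would not be practical for a graph with billions of nodes.

```python
from collections import deque


def path_statistics(num_nodes, arcs):
    """Return (fraction of ordered pairs joined by a directed path,
    average shortest-path length over those connected pairs)."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in arcs:
        adj[u].append(v)

    connected_pairs = 0
    total_distance = 0
    for source in range(num_nodes):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        connected_pairs += len(dist) - 1  # exclude the source itself
        total_distance += sum(dist.values())

    ordered_pairs = num_nodes * (num_nodes - 1)
    fraction = connected_pairs / ordered_pairs
    average = total_distance / connected_pairs if connected_pairs else float("nan")
    return fraction, average


# Toy graph: cycle 0 -> 1 -> 2 -> 0, node 3 links in, node 4 is linked to,
# node 5 is isolated.
toy_arcs = [(0, 1), (1, 2), (2, 0), (3, 0), (2, 4)]
fraction, average = path_statistics(6, toy_arcs)
print(fraction, average)
```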
Conclusions
1. The Web has become denser and more connected
• Average degree has grown significantly in last 13 years (factor 5)
• Connectivity between pairs of pages has doubled
2. Macroscopic structure
• There is a large SCC of growing size.
• The shape of the bow-tie seems to depend on the crawl
3. In- and out-degree distributions do not follow power laws.
Questions?
Advertisement
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables