Slide 1
International Internet Preservation Consortium
General Assembly 2014, Paris
Mining a Large Web Corpus
Robert Meusel
Christian Bizer
Slide 2
The Common Crawl
Slide 3
Hyperlink Graphs
Knowledge about the structure of the Web can be used to
improve crawling strategies, to help SEO experts or to
understand social phenomena.
Slide 4
HTML-embedded Data on the Web
Several million websites semantically mark up the content of
their HTML pages.
Markup Syntaxes
• Microformats
• RDFa
• Microdata
Data snippets
within info boxes
Slide 5
Relational HTML Tables
HTML tables offer semi-structured data which can be used to
build or extend knowledge bases such as DBpedia.
• Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• In a corpus of 14B raw tables, 154M are "good" relations (1.1%)
Slide 6
The Web Data Commons Project
• Has developed an Amazon-based framework for extracting data
from large web crawls
• Capable of running on any cloud infrastructure
• Has applied this framework to the Common Crawl data
• Adaptable to other crawls
• Results and framework are publicly available
• http://webdatacommons.org
Goal: Offer an easy-to-use, cost-efficient, distributed
extraction framework for large web crawls, as well as
datasets extracted from the crawls.
Slide 7
Extraction Framework
[Architecture diagram: a Master, an AWS SQS queue, several AWS EC2 instances, and AWS S3. The legend distinguishes automated from manual steps.]
1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & Upload
6: Collect results
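The six steps above can be sketched as a queue-driven worker pool. This is a minimal illustration, not the WDC implementation (which the slides describe as written in Java on AWS): the queue, the threads, and the dictionary are in-memory stand-ins for SQS, EC2 instances, and S3, and the function name `run_extraction` and the file names are hypothetical.

```python
import queue
import threading


def run_extraction(file_refs, extract, num_workers=3):
    """Simulate steps 1-6 with in-memory stand-ins for SQS, EC2 and S3."""
    # Step 1: the master fills the queue with file references.
    tasks = queue.Queue()
    for ref in file_refs:
        tasks.put(ref)

    results = {}  # stand-in for the output location on S3
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Step 3: request the next file reference from the queue.
                ref = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker stops
            # Step 4 would download the file; here the reference is used directly.
            output = extract(ref)
            # Step 5: store ("upload") the extraction output.
            with lock:
                results[ref] = output

    # Step 2: launch the worker instances.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Step 6: collect the results.
    return results


collected = run_extraction(["a.warc", "b.warc", "c.warc"], extract=str.upper)
print(sorted(collected.items()))
```

Because each worker pulls its next file reference from the shared queue, adding workers needs no coordination beyond the queue itself, which is one plausible reading of why only the queue and the storage are shared in the diagram.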
Slide 8
Extraction Worker
[Diagram: the WDC Extractor downloads a .(w)arc file from AWS S3, runs it through its Filter and Worker components, and uploads the output file to AWS S3.]
Worker:
• Written in Java
• Processes one page at a time
• Independent of other files and workers
Filter:
• Reduces runtime
• Mime-type filter
• Regex detection of content or meta-information
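The filter idea described on this slide can be sketched as a cheap pre-check that runs before the expensive extraction. The snippet below is a hedged illustration only: the allowed MIME types, the regular expression, and the names `passes_filter` and `process_file` are assumptions made for this example, not the settings or API of the WDC framework.

```python
import re

# Assumed filter settings, chosen for illustration only.
ALLOWED_MIME_TYPES = {"text/html", "application/xhtml+xml"}
MARKUP_HINT = re.compile(
    r'itemscope|itemprop=|property=|typeof=|class="vcard"', re.IGNORECASE
)


def passes_filter(mime_type, body):
    """Cheap pre-check deciding whether a page is worth parsing at all."""
    base_type = mime_type.split(";")[0].strip().lower()
    if base_type not in ALLOWED_MIME_TYPES:
        return False
    return MARKUP_HINT.search(body) is not None


def process_file(pages, extract):
    """Handle one file: each page is processed on its own, one at a time."""
    return [extract(body) for mime_type, body in pages if passes_filter(mime_type, body)]


pages = [
    ("text/html; charset=utf-8", '<div itemscope itemtype="http://schema.org/Person">Ann</div>'),
    ("text/html", "<p>No markup here</p>"),
    ("image/png", "itemscope"),
]
kept = process_file(pages, extract=lambda body: "extracted")
print(len(kept))
```

Only the first page survives: the second has the right MIME type but no markup hint, and the third is rejected by the MIME-type check before the regex is ever evaluated.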
Slide 9
Web Data Commons – Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault-tolerant and cheap
• 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
• Only the worker has to be adapted
• The worker is a single process method that handles one file at a time
• Scaling is automated by the framework
• Access the open source code:
• https://www.assembla.com/code/commondata/
Alternative: a Hadoop version, which can run on any Hadoop
cluster without Amazon Web Services.
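To put the quoted cost into perspective, the figures from this slide can be turned into unit costs. This is simple arithmetic on the rounded numbers given above, so the results are approximate; the variable names are chosen for this example.

```python
cost_usd = 300           # total cost quoted on the slide
input_tb = 44            # size of the processed crawl data in TB
statements_billion = 17  # extracted RDF statements, in billions

usd_per_tb = cost_usd / input_tb
usd_per_billion_statements = cost_usd / statements_billion

print(f"{usd_per_tb:.2f} USD per TB")                                      # 6.82 USD per TB
print(f"{usd_per_billion_statements:.2f} USD per billion RDF statements")  # 17.65 USD per billion RDF statements
```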
Slide 10
Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables
Slide 11
Hyperlink Graph
• Extracted from the Common Crawl 2012 dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/
Slide 12
Hyperlink Graph
Discovery of evolutions in the global structure of the World
Wide Web.
• Degrees do not follow a power law
• Detection of spam pages
• Further insights:
• WWW’14: Graph Structure in the Web – Revisited (Meusel et al.)
• WebSci’14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
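A statement about whether degrees follow a power law starts from a degree distribution. The sketch below shows how an in-degree distribution could be computed from an edge list; it is a toy illustration on five made-up edges, not the method used in the cited papers, and the function name `in_degree_distribution` is an assumption.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree to the number of nodes that have it."""
    in_degree = Counter(target for _source, target in edges)
    nodes = {node for edge in edges for node in edge}
    # Nodes that are never a link target have in-degree 0.
    return Counter(in_degree.get(node, 0) for node in nodes)


# Hypothetical toy graph: (source page, target page)
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
dist = in_degree_distribution(edges)
print(sorted(dist.items()))  # [(0, 1), (1, 2), (3, 1)]
```

Under a power law, plotting this distribution on log-log axes would give roughly a straight line, so a clear deviation from a line is the kind of evidence such a claim rests on.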
Slide 13
Hyperlink Graph
Discovery of important and interesting sites using different
popularity rankings or website categorization libraries
Websites connected by at least ½ Million Links
Slide 14
HTML-embedded Data
More and more websites semantically
mark up the content of their HTML pages.
Markup Syntaxes
RDFa
Microformats
Microdata
Slide 15
Websites containing Structured Data (2013)
1.8 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13.9%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26.3%).
Web Data Commons - Microformat, Microdata, RDFa Corpus
• 17 billion RDF triples from Common Crawl 2013
• Next release will be in winter 2014
http://webdatacommons.org/structureddata/
Slide 16
Top Classes Microdata (2013)
• schema = Schema.org
• dv = Google’s Rich Snippet Vocabulary
Slide 17
HTML Tables
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
In a corpus of 14B raw tables, 154M are “good” relations (1.1%) (Cafarella et al., 2008).
Classification Precision: 70-80%
Slide 18
WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables
• selected out of 11.2 B raw tables (1.3%)
• download includes the HTML pages of the tables (1 TB zipped)
• Table Statistics
• Heterogeneity: Very high.

           | Min | Max    | Average | Median
Attributes | 2   | 2,368  | 3.49    | 3
Data Rows  | 1   | 70,068 | 12.41   | 6

http://webdatacommons.org/webtables/
Slide 19
WDC - Web Tables Corpus
• Attribute Statistics
28,000,000 different attribute labels

Attribute    | #Tables
name         | 4,600,000
price        | 3,700,000
date         | 2,700,000
artist       | 2,100,000
location     | 1,200,000
year         | 1,000,000
manufacturer | 375,000
country      | 340,000
isbn         | 99,000
area         | 95,000
population   | 86,000

• Subject Attribute Values
1.74 billion rows
253,000,000 different subject labels

Value            | #Rows
usa              | 135,000
germany          | 91,000
greece           | 42,000
new york         | 59,000
london           | 37,000
athens           | 11,000
david beckham    | 3,000
ronaldinho       | 1,200
oliver kahn      | 710
twist shout      | 2,000
yellow submarine | 1,400
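Attribute statistics of this kind amount to counting, for every header label, the number of tables it appears in. A minimal sketch follows; the normalization (trimming and lowercasing) is an assumption suggested by the lowercase labels in the listing above, and the function name and sample tables are made up for illustration.

```python
from collections import Counter


def count_attribute_labels(tables):
    """Count in how many tables each normalized header label occurs."""
    counts = Counter()
    for header in tables:
        # A label counts once per table, even if the header repeats it.
        labels = {label.strip().lower() for label in header if label.strip()}
        counts.update(labels)
    return counts


# Hypothetical header rows of three tables
tables = [
    ["Name", "Price"],
    ["name", "Date", "Price "],
    ["Artist", "Year", "artist"],
]
counts = count_attribute_labels(tables)
print(counts["name"], counts["price"], counts["artist"])  # 2 2 1
print(len(counts))  # number of different attribute labels: 5
```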
Slide 20
Conclusion
Three factors are necessary to work with web-scale data:
Availability of crawls
• Thanks to Common Crawl, this data is available
Availability of cheap, easy-to-use infrastructures
• Such as Amazon or other on-demand cloud services
Easy-to-adopt, scalable extraction frameworks
• The Web Data Commons framework, or standard tools like Pig
• Costs have to be evaluated per task, but the WDC framework has turned
out to be cheaper
Slide 21
Questions
• Please visit our website: www.webdatacommons.org
• Data and framework are available as a free download
• Web Data Commons is supported by: [sponsor logos not preserved]
Mining a Large Web Corpus

  • 1. Slide 1 Mining a Large Web Corpus. Robert Meusel, Christian Bizer. International Internet Preservation Consortium, General Assembly 2014, Paris.
  • 3. Slide 3 Hyperlink Graphs: Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts, or to understand social phenomena.
  • 4. Slide 4 HTML-embedded Data on the Web: Several million websites semantically mark up the content of their HTML pages. Markup syntaxes: Microformats, RDFa, Microdata. Figure: data snippets within info boxes.
  • 5. Slide 5 Relational HTML Tables: HTML tables contain semi-structured data that can be used to build up or extend knowledge bases such as DBpedia. In a corpus of 14B raw tables, 154M are "good" relations (1.1%). Source: Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
  • 6. Slide 6 The Web Data Commons Project: Has developed an Amazon-based framework for extracting data from large web crawls (capable of running on any cloud infrastructure). Has applied this framework to the Common Crawl data (adaptable to other crawls). Results and framework are publicly available: http://webdatacommons.org. Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.
  • 7. Slide 7 Extraction Framework (diagram). Components: Master, AWS SQS, AWS EC2 instances, AWS S3. Steps: 1: Fill queue; 2: Launch instances; 3: Request file reference; 4: Download file; 5: Extract & upload; 6: Collect results. Legend: each step is marked as automated or manual.
  • 8. Slide 8 Extraction Worker (diagram): the worker downloads a .(w)arc file from AWS S3, passes it through the filter and the WDC Extractor, and uploads the output file to AWS S3. Worker: written in Java; processes one page at a time; independent of other files and workers. Filter: reduces runtime; MIME-type filter; regex detection of content or meta-information.
  • 9. Slide 9 Web Data Commons – Extraction Framework: Written in Java. Mainly tailored for Amazon Web Services. Fault tolerant and cheap: 300 USD to extract 17 billion RDF statements from 44 TB. Easily customizable: only the worker has to be adapted; the worker is a single process method that processes one file at a time; scaling is automated by the framework. Open source code: https://www.assembla.com/code/commondata/ Alternative: Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
  • 10. Slide 10 Extracted Datasets: Hyperlink Graph; HTML-embedded Data; Relational HTML Tables.
  • 11. Slide 11 Hyperlink Graph: Extracted from the Common Crawl 2012 dataset. Over 3.5 billion pages connected by over 128 billion links. Graph files: 386 GB. http://webdatacommons.org/hyperlinkgraph/ http://wwwranking.webdatacommons.org/
  • 12. Slide 12 Hyperlink Graph: Degrees do not follow a power law. Detection of spam pages. Further insights: WWW'14: Graph Structure in the Web – Revisited (Meusel et al.); WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.). Discovery of evolutions in the global structure of the World Wide Web.
  • 13. Slide 13 Hyperlink Graph: Discovery of important and interesting sites using different popularity rankings or website categorization libraries. Figure: websites connected by at least ½ million links.
  • 14. Slide 14 HTML-embedded Data: More and more websites semantically mark up the content of their HTML pages. Markup syntaxes: RDFa, Microformats, Microdata.
  • 15. Slide 15 Websites containing Structured Data (2013): 1.8 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%). 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%). Web Data Commons - Microformat, Microdata, RDFa Corpus: 17 billion RDF triples from Common Crawl 2013; the next release will be in winter 2014. http://webdatacommons.org/structureddata/
  • 16. Slide 16 Top Classes Microdata (2013). Prefixes used in the chart: schema = Schema.org; dv = Google's Rich Snippet Vocabulary.
  • 17. Slide 17 HTML Tables: In a corpus of 14B raw tables, 154M are "good" relations (1.1%). Cafarella (2008). Classification precision: 70-80%. References: Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008; Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
  • 18. Slide 18 WDC - Web Tables Corpus: Large corpus of relational Web tables for public download. Extracted from Common Crawl 2012 (3.3 billion pages). 147 million relational tables, selected out of 11.2B raw tables (1.3%). The download includes the HTML pages of the tables (1 TB zipped). Table statistics (heterogeneity: very high): Attributes: min 2, max 2,368, average 3.49, median 3. Data rows: min 1, max 70,068, average 12.41, median 6. http://webdatacommons.org/webtables/
  • 19. Slide 19 WDC - Web Tables Corpus. Attribute statistics: 28,000,000 different attribute labels. Attribute (#tables): name 4,600,000; price 3,700,000; date 2,700,000; artist 2,100,000; location 1,200,000; year 1,000,000; manufacturer 375,000; country 340,000; isbn 99,000; area 95,000; population 86,000. Subject attribute values: 1.74 billion rows; 253,000,000 different subject labels. Value (#rows): usa 135,000; germany 91,000; greece 42,000; new york 59,000; london 37,000; athens 11,000; david beckham 3,000; ronaldinho 1,200; oliver kahn 710; twist shout 2,000; yellow submarine 1,400.
  • 20. Slide 20 Conclusion: Three factors are necessary to work with web-scale data. (1) Availability of crawls: thanks to Common Crawl, this data is available. (2) Availability of cheap, easy-to-use infrastructures, like Amazon or other on-demand cloud services. (3) Easy-to-adopt, scalable extraction frameworks: the Web Data Commons framework, or standard tools like Pig. Costs have to be evaluated on a per-task basis, but the WDC framework has turned out to be cheaper.
  • 21. Slide 21 Questions: Please visit our website: www.webdatacommons.org. Data and framework are available as a free download. Web Data Commons is supported by: (sponsor logos shown on the slide)
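
The queue-based workflow from Slides 7 to 9 (fill queue, launch workers, request a file reference, download, extract and upload, collect results) can be illustrated with a small sketch. It is not the WDC framework itself, which is written in Java and runs on AWS: this is a Python approximation in which a `queue.Queue` stands in for AWS SQS and plain dictionaries stand in for AWS S3. All function names, the page record layout (`url`, `mime`, `html`), and the regex used as a filter are assumptions made for illustration.

```python
import queue
import re

# Assumption: only HTML pages are worth parsing (the "MIME-type filter").
ALLOWED_MIME_TYPES = {"text/html"}
# Assumption: a cheap regex hint for Microdata or RDFa attributes,
# used to skip pages before running the more expensive extractor.
MARKUP_HINT = re.compile(r"\b(itemscope|itemtype|typeof|property)\b", re.IGNORECASE)


def passes_filter(page):
    """Filter step: check the MIME type first, then the regex hint."""
    if page["mime"] not in ALLOWED_MIME_TYPES:
        return False
    return MARKUP_HINT.search(page["html"]) is not None


def extract(page):
    """Placeholder extractor: returns the itemtype values found on the page."""
    return re.findall(r'itemtype="([^"]+)"', page["html"])


def process_file(pages):
    """Worker logic for one file: pages are handled one at a time."""
    results = []
    for page in pages:
        if passes_filter(page):
            results.append({"url": page["url"], "types": extract(page)})
    return results


def run_worker(file_queue, input_store, output_store):
    """Steps 3 to 5: request a file reference, download, extract, upload."""
    while True:
        try:
            file_ref = file_queue.get_nowait()         # step 3: request file reference
        except queue.Empty:
            return                                     # queue drained, worker stops
        pages = input_store[file_ref]                  # step 4: download file
        output_store[file_ref] = process_file(pages)   # step 5: extract & upload


def run_extraction(input_store, num_workers=2):
    """Master logic: fill the queue, launch workers, collect results."""
    file_queue = queue.Queue()
    for file_ref in input_store:                       # step 1: fill queue
        file_queue.put(file_ref)
    output_store = {}
    # Step 2: launch workers. They run one after another here for simplicity;
    # in a real deployment they would run in parallel on separate machines.
    for _ in range(num_workers):
        run_worker(file_queue, input_store, output_store)
    # Step 6: collect results in a stable order.
    return [r for ref in sorted(output_store) for r in output_store[ref]]
```

Calling `run_extraction` with a dictionary that maps file references to lists of page records returns one result entry per page that passed the filter. Because each file is processed independently of all others, adapting the sketch to a different extraction task would only require changing `passes_filter` and `extract`, which mirrors the slide's point that only the worker has to be adapted.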