The document describes a series of datasets created by parsing HTML pages to extract structured data in the form of Microdata, RDFa, and Microformats. It provides an overview of the datasets from 2010, 2012, and 2013, which together contain over 30 billion RDF quads; the 2013 dataset alone covers over 1.7 million domains that use at least one markup language. The datasets are hosted online and provide insights into the usage of different vocabularies and markup languages, as well as opportunities for applying and analyzing large-scale structured web data.
The Web Data Commons Microdata, RDFa, and Microformat Dataset Series @ ISWC2014
1. The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer
2. 2
HTML-embedded Structured Data on the Web
More and more websites semantically mark up the content of their HTML pages.
RDFa
Microdata
Microformats
The WebDataCommons Microdata, RDFa, and Microformats Dataset Series
3. 3
Dataset Creation
Common Crawl Foundation Corpora of 2010, 2012 and 2013
• Snapshot of popular pages of the Web
• New crawls become available continuously
Parsing the HTML pages using Apache Any23
• Using a distributed framework on 100 parallel EC2 instances
Example output of Any23:
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
…
The framework is easy to adapt and is publicly available at:
http://webdatacommons.org/framework/
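As a rough illustration of how the extracted statements can be consumed, the sketch below parses single N-Quads lines with a regular expression. It is not part of the WebDataCommons framework: the names `parse_quad` and `literal_value` are hypothetical, and the pattern is an assumption that only covers simple term shapes (IRIs, blank nodes, and literals with an optional language tag or datatype), one statement per line.

```python
import re

# One N-Quads term: IRI, blank node, or literal with optional language tag / datatype.
# Simplified on purpose; this is not a complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match.

    The graph element is None when the line is a plain triple.
    """
    match = QUAD.match(line)
    return match.groups() if match else None


def literal_value(term):
    """Return the lexical form of a literal term with \\uXXXX escapes decoded."""
    body = re.match(r'"((?:[^"\\]|\\.)*)"', term).group(1)
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  body)


if __name__ == "__main__":
    # The page URL in the fourth position is a made-up example.
    sample = ('_:node1 <http://schema.org/Product/name> '
              '"Predator Instinct FG Fu\\u00DFballschuh"@de '
              '<http://example.com/shoe> .')
    subject, predicate, obj, graph = parse_quad(sample)
    print(predicate, literal_value(obj), graph)
```

For anything beyond exploration, a full RDF parser is the safer choice, since real crawl data contains escapes and malformed lines that this pattern does not handle.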
4. 4
Dataset Series Overview
Series contains three datasets from 2010, 2012 and 2013
Altogether over 30 billion RDF quads
Each dataset is further split into subsets, each containing the quads extracted for one particular markup language
5. 5
Overview of 2013 dataset
Over 1.7 million domains using at least one markup language
Over 17 billion quads with over 4 billion records (typed entities)
hCard still most dominant among domains
Microdata contains the largest number of quads
6. 6
Divergence in Class and Property Usage in 2013
A small number of classes and properties is used by a large number of domains
RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains
Microdata: 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains
Classes and properties used by only one domain are mostly typos
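The split between classes used by a single domain and classes shared across domains can be derived directly from the quads. The following is a minimal sketch, not the authors' analysis code. It assumes the quads are available as `(subject, predicate, object, graph)` tuples whose fourth element is the page URL, and it approximates a domain by the URL's host name, which is a simplification. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def domains_per_class(quads):
    """Map each class IRI to the set of hosts whose pages use it."""
    usage = defaultdict(set)
    for subject, predicate, obj, graph in quads:
        # Only rdf:type statements with a known source page are counted.
        if predicate == RDF_TYPE and graph:
            host = urlparse(graph.strip("<>")).netloc.lower()
            usage[obj].add(host)
    return usage


def shared_classes(quads, min_domains=2):
    """Return the classes used by at least `min_domains` different hosts."""
    return {cls for cls, hosts in domains_per_class(quads).items()
            if len(hosts) >= min_domains}
```

Classes that fall below the threshold are candidates for the typo check mentioned above, for example by comparing them against the official vocabulary terms.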
7. 7
RDFa Insights 2013
Usage of various vocabularies to describe information:
• Strong presence of the Open Graph Protocol (e.g. Facebook)
• FOAF and SIOC (blog software such as Drupal)
Largest topics covered are:
• Articles and Documents (Blogs and News portals)
• Products, Reviews and Ratings
• Organizations
8. 8
Microdata Insights 2013 and 2012
Clear increase in deployment compared to 2012
Still two vocabularies deployed: data-vocabulary and schema.org
Largest topical areas:
• Postal Addresses and Locations
• Products, Offers and Ratings
• Organizations and Persons
• Articles and Blogs
• Breadcrumb
9. 9
Focus on Schema.org/Product
One of the largest publicly available product collections
Almost 100 million records described with name, offer and image
34 million records contain a further description
11% of all product records include a brand
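Coverage figures such as the share of product records with a brand can be recomputed from the quads. The sketch below is one possible way to do so under stated assumptions: quads are `(subject, predicate, object, graph)` tuples, a record is identified by the pair of page URL and subject (blank node labels are only unique within a page), and property IRIs follow the `http://schema.org/Product/...` pattern. The function name `property_coverage` is hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def property_coverage(quads, class_iri, property_iri):
    """Return the fraction of records of `class_iri` that carry `property_iri`."""
    typed = set()       # records declared as instances of the class
    with_prop = set()   # records that have the property at least once
    for subject, predicate, obj, graph in quads:
        record = (graph, subject)
        if predicate == RDF_TYPE and obj == class_iri:
            typed.add(record)
        elif predicate == property_iri:
            with_prop.add(record)
    if not typed:
        return 0.0
    return len(typed & with_prop) / len(typed)
```

A call such as `property_coverage(quads, "<http://schema.org/Product>", "<http://schema.org/Product/brand>")` would then give the brand coverage for whatever sample of quads is passed in.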
10. 10
Microformats Insights 2013
Most dominant vocabulary is hCard
Still a very solid deployment
Topics are:
• Persons & Organizations
• Events
• Products and reviews
• Recipes
11. 11
Opportunities & Challenges
Opportunities
Vast amounts of free data, created by people all over the world
Large topical coverage, from broad areas (such as products) to niche ones (such as recipes)
Highly up-to-date information, as popular pages potentially update their content frequently
Challenges
Data quality assessment, as the data is created by both experts and novices
Further information extraction, as a flat schema and a rather low number of properties are used
Identity resolution, as the data hardly contains any identifiers
12. 12
Possible Application Domains
Enriching existing knowledge bases
• E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies to add missing information and extend entity knowledge
• As shown by Lehmberg et al., winners of the Semantic Web Challenge (Big Data Track) 2014, this data can be used as an additional source (among others) to gather and return broader search results
Design and adaptation of algorithms and methods to handle the characteristics of such web data
• Training of data extraction methods to gather data that is not marked up within the HTML pages
• Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions
Starting point for further data discovery
• The datasets can be used as starting points for further data crawling, as in most cases not all pages from a domain are included
13. 13
Thank you! Questions? Feedback?
Data and more statistics can be found at:
http://webdatacommons.org/structureddata/index.html
More interesting datasets and analyses can be found on the WebDataCommons website:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets was supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.