The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer
The Web Data Commons Microdata, RDFa, and Microformat Dataset Series @ ISWC2014


  1. The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
     Robert Meusel, Petar Petrovski, and Christian Bizer
  2. HTML-embedded Structured Data on the Web
     More and more websites semantically mark up the content of their HTML pages.
     Markup formats: RDFa, Microdata, Microformats
  3. Dataset Creation
     ▪ Common Crawl Foundation corpora of 2010, 2012 and 2013
       • Snapshot of popular pages of the Web
       • New crawls are continuously made available
     ▪ Parsing the HTML pages using Apache Any23
       • Using a distributed framework on 100 parallel EC2 instances
     Example of extracted statements:
       _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
       _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
       _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
       _:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
       _:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
       …
     The framework is easy to adapt and is publicly available at: http://webdatacommons.org/framework/
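The statements on the Dataset Creation slide can be read back with a few lines of code. The following is a minimal sketch, not the WebDataCommons or Any23 implementation: it assumes each line is a quad whose fourth element is the URL of the source page (the `http://example.com/shoe` value is hypothetical), and it handles only the simple line shapes shown on the slide. The names `QUAD`, `unescape` and `parse_quads` are illustrative.

```python
import re

# One statement per line: subject, predicate, object, graph (assumed to be
# the source page URL), followed by a terminating dot.
QUAD = re.compile(
    r'^(?P<s>_:\S+|<[^>]*>)\s+'
    r'(?P<p><[^>]*>)\s+'
    r'(?P<o>_:\S+|<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'
    r'(?P<g><[^>]*>)\s*\.\s*$'
)

def unescape(text):
    """Decode \\uXXXX escapes such as \\u20AC (euro sign) in a literal."""
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)), text)

def parse_quads(text):
    """Return (subject, predicate, object, graph) tuples for matching lines."""
    quads = []
    for line in text.splitlines():
        m = QUAD.match(line.strip())
        if m:  # lines that do not fit the simple pattern are skipped
            quads.append((m['s'], m['p'], unescape(m['o']), m['g']))
    return quads

# Hypothetical sample in the style of the slide; the page URL is made up.
SAMPLE = "\n".join([
    r'_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/shoe> .',
    r'_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de <http://example.com/shoe> .',
    r'_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de <http://example.com/shoe> .',
])

for quad in parse_quads(SAMPLE):
    print(quad)
```

A regular expression is enough for a sketch like this; for real data a full N-Quads parser is the safer choice, since literals can contain escapes and datatypes that this pattern only partly covers.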
  4. Dataset Series Overview
     ▪ Series contains three datasets from 2010, 2012 and 2013
     ▪ Altogether over 30 billion RDF quads
     ▪ Each dataset is further split into subsets, each containing the quads extracted for one markup language
  5. Overview of 2013 dataset
     ▪ Over 1.7 million domains using at least one markup language
     ▪ Over 17 billion quads with over 4 billion records (typed entities)
     ▪ hCard still most dominant among domains
     ▪ Microdata contains the largest number of quads
  6. Divergence in Class and Property Usage in 2013
     ▪ A small number of classes and properties is used by a large number of domains
     ▪ RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains
     ▪ Microdata: 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains
     ▪ Classes and properties used by only one domain are mostly typos
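The "used by at least two different domains" figures above can be illustrated by counting, for each class, how many distinct domains declare it. This is a minimal sketch under stated assumptions, not the authors' analysis code: the input tuples, the example hosts and the function name `domains_per_class` are hypothetical, and the URL host is used as a stand-in for the domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def domains_per_class(quads):
    """Map each class IRI to the number of distinct hosts that use it."""
    hosts = defaultdict(set)
    for _subject, predicate, obj, page_url in quads:
        if predicate == RDF_TYPE:
            # Assumption: the host of the page URL approximates the domain.
            hosts[obj].add(urlparse(page_url).netloc)
    return {cls: len(found) for cls, found in hosts.items()}

# Hypothetical input: two shops use schema.org/Product, one has a typo.
quads = [
    ("_:a", RDF_TYPE, "http://schema.org/Product", "http://shop-a.example/p1"),
    ("_:b", RDF_TYPE, "http://schema.org/Product", "http://shop-b.example/p9"),
    ("_:c", RDF_TYPE, "http://schema.org/Prodcut", "http://shop-a.example/p2"),
]

counts = domains_per_class(quads)
shared = sorted(cls for cls, n in counts.items() if n >= 2)
print(shared)
```

In this toy input the misspelled class appears on a single host only, which mirrors the slide's observation that single-domain classes are mostly typos.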
  7. RDFa Insights 2013
     ▪ Usage of various vocabularies to describe information:
       • Strong presence of the Open Graph Protocol (e.g. Facebook)
       • FOAF and SIOC (blog software such as Drupal)
     ▪ Largest topics covered are:
       • Articles and Documents (Blogs and News portals)
       • Products, Reviews and Ratings
       • Organizations
  8. Microdata Insights 2013 and 2012
     ▪ Clear increase in usage compared to 2012
     ▪ Still two vocabularies deployed: data-vocabulary and schema.org
     ▪ Largest topical areas:
       • Postal Addresses and Locations
       • Products, Offers and Ratings
       • Organizations and Persons
       • Articles and Blogs
       • Breadcrumb
  9. Focus on Schema.org/Product
     ▪ One of the largest publicly available product collections
     ▪ Almost 100 million records described with name, offer and image
     ▪ 34 million records contain a further description
     ▪ 11% of all product records include a brand
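A share such as "11% of all product records include a brand" is a property-coverage ratio: records carrying the property divided by all records. The sketch below shows one way to compute it, assuming the input is already restricted to product records; the property IRIs follow the `http://schema.org/Product/name` pattern seen earlier in the deck, but the `brand` IRI, the sample data and the function name `property_coverage` are assumptions for illustration.

```python
from collections import defaultdict

def property_coverage(triples, prop):
    """Fraction of records (subjects) that have at least one value for prop."""
    props_by_record = defaultdict(set)
    for subject, predicate, _obj in triples:
        props_by_record[subject].add(predicate)
    if not props_by_record:
        return 0.0
    with_prop = sum(prop in props for props in props_by_record.values())
    return with_prop / len(props_by_record)

NAME = "http://schema.org/Product/name"
BRAND = "http://schema.org/Product/brand"  # assumed IRI, by analogy with NAME

# Hypothetical product records: four products, one of them with a brand.
triples = [
    ("_:p1", NAME, "Shoe"),
    ("_:p1", BRAND, "Acme"),
    ("_:p2", NAME, "Ball"),
    ("_:p3", NAME, "Jersey"),
    ("_:p4", NAME, "Socks"),
]

coverage = property_coverage(triples, BRAND)
print(f"{coverage:.0%} of product records include a brand")
```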
  10. Microformats Insights 2013
      ▪ Most dominant vocabulary is hCard
      ▪ Still a very solid deployment
      ▪ Topics are:
        • Persons & Organizations
        • Events
        • Products and reviews
        • Recipes
  11. Opportunities & Challenges
      Opportunities
      ▪ Vast amounts of free data, created by people all over the world
      ▪ Large topical coverage, from broad areas (such as products) to niche ones (such as recipes)
      ▪ Highly up-to-date information, as popular pages potentially update their content frequently
      Challenges
      ▪ Data quality assessment, as the data is created by experts and rookies alike
      ▪ Further information extraction, as a flat schema and a rather low number of properties are used
      ▪ Identity resolution, as the data hardly contains any identifiers
  12. Possible Application Domains
      ▪ Enriching existing knowledge bases
        • E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies to add missing information and extend entity knowledge
        • As shown by Lehmberg et al., winner of the Semantic Web Challenge (Big Data Track) 2014, this data can be used as an additional source (besides others) to gather and return wider search results
      ▪ Design and adaptation of algorithms and methods to face the characteristics of such web data
        • Training of data extraction methods to gather data that is not marked up within the HTML pages
        • Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions
      ▪ Starting point for further data discovery
        • The dataset can be used as a starting point for further data crawling, as in most cases not all pages from a domain are included
  13. Thank you! Questions? Feedback?
      Data and more statistics can be found at: http://webdatacommons.org/structureddata/index.html
      More interesting datasets and analyses can be found on the WebDataCommons website: http://webdatacommons.org/index.html
      Acknowledgement: The extraction and analysis of the datasets were supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.
