A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

  1. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
     Robert Meusel, Christian Bizer, and Heiko Paulheim
  2. Motivation - LOD Cloud with 1,000 data providers
     A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015
  3. Motivation - schema.org Microdata (MD) with 700k data providers
  4. Microdata in a Nutshell
     • Adding structured information to web pages
       • By marking up contents and entities
     • Arbitrary vocabularies are possible
       • In practice, only schema.org is deployed on a large scale
       • Plus its historical predecessor: data-vocabulary.org
     • Similar to RDFa

     <div itemscope itemtype="http://schema.org/PostalAddress">
       <span itemprop="name">Data and Web Science Group</span>
       <span itemprop="addressLocality">Mannheim</span>,
       <span itemprop="postalCode">68131</span>
       <span itemprop="addressCountry">Germany</span>
     </div>
  5. Schema.org in a Nutshell
     • Vocabulary for marking up entities on web pages
       • 675 classes and 965 properties (as of May 2015, release 2.0)
     • Promoted and consumed by major search engine companies
       • Google, Bing, Yahoo!, and Yandex
       • Google Rich Snippets
     • Community-driven evolution and development
  6. Schema.org in a Nutshell - Coverage
     • Schema.org has incorporated some popular vocabularies, such as:
       • GoodRelations (2012)
       • W3C BibExtend (2014)
       • MusicBrainz vocabulary (2015)
       • Automotive Ontology (2015)
  7. Microdata with Schema.org in HTML Pages
     HTML pages directly embed markup languages to annotate items using different vocabularies.

     Plain HTML:
       <html> … <body> …
       <div id="main-section" class="performance left" data-sku="M17242_580">
         <h1> Predator Instinct FG Fußballschuh </h1>
         <div>
           <meta content="EUR">
           <span data-sale-price="219.95">219,95</span>
       … </body> </html>

     The same HTML with schema.org Microdata:
       <html> … <body> …
       <div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
         <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
         <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
           <meta itemprop="priceCurrency" content="EUR">
           <span itemprop="price" data-sale-price="219.95">219,95</span>
       … </body> </html>

     Extracted RDF (the nested Offer is a separate node):
       _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
       _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
       _:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
       _:node2 <http://schema.org/Offer/price> "219,95"@de .
       _:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
       …
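The step from annotated HTML to structured items can be sketched in a few lines of Python. This is only an illustration built on the standard-library `html.parser`, not the extraction pipeline used for the study. The class name `MicrodataParser`, the helper `extract_items`, and the dictionary layout are hypothetical. The sketch assumes well-formed HTML and takes a property value from the `content` attribute if present, otherwise from the element text, which simplifies the Microdata rules.

```python
from html.parser import HTMLParser

# Elements that never have an end tag (subset, enough for this sketch).
VOID = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataParser(HTMLParser):
    """Collects Microdata items as {"type": ..., "properties": {...}} dicts."""

    def __init__(self):
        super().__init__()
        self.items = []       # every item found, in document order
        self.item_stack = []  # (item, depth) for currently open itemscopes
        self.prop_stack = []  # (name, item, depth, text_parts) for open text properties
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_void = tag in VOID
        if not is_void:
            self.depth += 1
        parent = self.item_stack[-1][0] if self.item_stack else None
        prop = attrs.get("itemprop")
        if "itemscope" in attrs:
            # A new item; if it also carries itemprop, it is a value of the parent item.
            item = {"type": attrs.get("itemtype"), "properties": {}}
            self.items.append(item)
            if prop and parent is not None:
                parent["properties"].setdefault(prop, []).append(item)
            if not is_void:
                self.item_stack.append((item, self.depth))
        elif prop and parent is not None:
            if "content" in attrs:
                parent["properties"].setdefault(prop, []).append(attrs["content"])
            elif not is_void:
                self.prop_stack.append((prop, parent, self.depth, []))

    def handle_data(self, data):
        # Text belongs to every property element that is still open.
        for _, _, _, parts in self.prop_stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.prop_stack and self.prop_stack[-1][2] == self.depth:
            prop, item, _, parts = self.prop_stack.pop()
            item["properties"].setdefault(prop, []).append("".join(parts).strip())
        if self.item_stack and self.item_stack[-1][1] == self.depth:
            self.item_stack.pop()
        self.depth -= 1


def extract_items(html):
    """Returns all Microdata items (nested ones included) found in html."""
    parser = MicrodataParser()
    parser.feed(html)
    return parser.items
```

Applied to a closed version of the product snippet from the slide, `extract_items` yields two items: a Product whose `offers` property points to a separate Offer item. That is the same two-node structure the RDF triples express.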
  8. Wrap-Up
     • Semantic annotations are used by more and more websites
     • Entities on websites become machine-readable and machine-understandable
     • schema.org together with Microdata is a success story
       • Promoted by search engine companies
       • Deployed by over 17% of all websites [1] (over 700k data providers)
     • Usage is more compliant with the schema than, e.g., LOD [2]
     [1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
     [2] Meusel and Paulheim, ESWC 2015
  9. Digging for Reasons
     • Microdata is deployed more often than LOD and is often more schema-compliant, although there are millions of uncontrolled providers with different skill sets
     • But why? Some hypotheses:
       • Availability of documentation
       • Tool support
       • Business incentive
       • Schema flexibility
     • Can we confirm or reject these by looking at the data?
  10. A Diachronic Perspective
      • Versions of schema.org are archived over time
        • Plus: there are several crawl releases per year
        • i.e., we can look at change over time
      • Looking at both the schema and the deployed data, we may observe
        • Adoption rates of schema changes
        • Data-first changes to the schema
        • Convergence or divergence of deployed data
  11. A Diachronic Perspective
      • Three releases of the WDC Microdata corpus [1]
        • 2012, 2013, and 2014
      • Versions of schema.org that were valid
        • At the beginning of the crawl
        • At the end of the crawl
      [1] http://webdatacommons.org/structureddata
  12. Top-down Adoption
      • How fast are changes in the schema adopted?
        • New classes/properties
        • Deprecations
        • Domain/range changes
      • Measuring adoption: challenges
        • Different crawls
        • Overall growth of deployed schema.org
      • Measure: normalized usage increase (nui) from crawl i to crawl j: [formula shown on the slide]
        • nui(s) > 1.05: usage of schema element s has increased significantly
        • nui(s) < 0.95: usage of schema element s has decreased significantly
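The transcript does not carry the nui formula itself, so the following is a sketch under a stated assumption: nui compares the share of data providers using element s among all schema.org-deploying providers in crawl j with the same share in crawl i. This normalization is one plausible reading that addresses the two challenges named above (different crawls, overall growth). The function and parameter names are hypothetical, and the numbers in the usage note are made up.

```python
def nui(providers_with_s_i, providers_total_i, providers_with_s_j, providers_total_j):
    """Assumed definition: ratio of the usage share of s in crawl j to its share in crawl i."""
    if providers_with_s_i == 0:
        raise ValueError("s is unused in crawl i, so the ratio is undefined")
    share_i = providers_with_s_i / providers_total_i
    share_j = providers_with_s_j / providers_total_j
    return share_j / share_i


def classify(value, low=0.95, high=1.05):
    """Applies the thresholds from the slide to a nui value."""
    if value > high:
        return "increased significantly"
    if value < low:
        return "decreased significantly"
    return "no significant change"
```

With made-up counts, an element used by 50,000 of 200,000 providers in crawl i and by 60,000 of 400,000 in crawl j grows in absolute terms, yet `nui(50_000, 200_000, 60_000, 400_000)` is roughly 0.6, which `classify` reports as a significant decrease. Under this reading, an element can gain providers and still lose relative weight.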
  13. Top-down Adoption
      • Adoption of new classes and properties
        • Almost half of all introduced classes are never used!
        • Similar for new properties
      • Reasons
        • Bulk addition of vocabularies
          • Not every term is equally needed
          • e.g., medical vocabulary
        • Blind spot of our approach
          • Some terms are mainly for e-mail markup
          • e.g., Actions
      Slide callout: SURPRISE!
  14. Top-down Adoption
      • Main domains of positive adoption
        • Metadata for web content (schema.org/WebSite has the highest nui)
        • Broadcasting (e.g., TV episodes)
        • Questions & Answers
        • Postal addresses
      • Classes featured in Google Rich Snippets
        • Still growing at a high level (tens of thousands of data providers)
        • But nui(s) < 0.95
      Slide annotations: Yellow Pages; Search Engine Listings; Collaboration with BBC and EBU; Influence of CMS adoption; Q&A pages, such as Stack Overflow
  15. Top-down Adoption
      • Adoption of domain/range changes
        • Again: rather low overall adoption
      • Adopted well for
        • Products (height, width, itemCondition, …): Search Engine Listings
        • Broadcasting domain (episode, actor, ...): Collaboration with BBC and EBU
  16. Top-down Adoption
      • Adoption of deprecations
        • Works well (29 out of 32 have a significantly low nui)
      • Exceptions
        • s:map (← s:hasMap)
        • s:maps (← s:hasMap)
      Slide annotation: For Google Maps (lots of outdated tutorials)
  17. Bottom-up Evolution
      • Martin Luther (1483-1546)
        • Started the Protestant church
        • A success story, too (like schema.org)
        • (i.e., 800 million adopters worldwide)
      • Famous quote:
        • “Man muss […] dem gemeinen Mann aufs Maul schauen”
        • (roughly: “You have to listen to the way the common man really speaks.”)
      Disclaimer: I do not speak for the Protestant church.
  18. Bottom-up Evolution
      • Are new features in the schema first used “unofficially”?
        • New classes/properties
        • Domain/range changes
      • Instrument for measurement: ROC curves
        • True positives plotted against false positives
        • tp: elements used before
        • fp: elements not used before
        • Ranking by number of pay-level domains (#PLDs)
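How such a ROC curve is traced from a ranking can be sketched as follows. The slide does not spell out the exact setup, so the sketch assumes that the ranked candidates are terms found in the crawl without being defined in the schema at that time, sorted by #PLDs in descending order, and that a candidate counts as positive if a later schema release introduces it. The function names are hypothetical.

```python
def roc_points(ranked_labels):
    """ranked_labels: booleans in ranking order (best-ranked first).

    Returns the (false positive rate, true positive rate) points obtained
    by cutting the ranking after each position.
    """
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative candidate")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points; 0.5 corresponds to a random ranking."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An area well above 0.5 would mean that widely deployed unofficial terms are more likely to enter the schema later, which is the bottom-up influence the slide asks about.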
  19. Bottom-up Evolution
      • Some mild influences are observable
        • Stronger for domain/range changes
          • Especially range changes
        • Weaker for new classes/properties
      [Figure: ROC curves for classes, properties, domains, and ranges; panels 2012→2013, 2013→2014, 2012→2014]
  20. Bottom-up Evolution
      • Extension mechanism
        • Allows for user-defined classes/properties
        • Those become subclasses implicitly
        • Examples: s:Product/ElectronicProduct, s:price/reducedPrice
      • Analysis over time
        • No measurable impact on standard evolution
        • “Unofficial” use is more likely than use of the extension mechanism
  21. Overall Convergence
      • Measuring convergence
        • i.e., homogeneity of descriptions of classes
        • Example: two instances of s:LocalBusiness
      [Figure: two s:LocalBusiness nodes, each linked to an s:PostalAddress node; one with “Birmingham” and “Main Street 24”, the other with “Liverpool” and “Church Street 1”]
  22. Overall Convergence
      • Recap
        • RDF from Microdata is a set of trees
        • i.e., we can enumerate all paths to leaf nodes (omitting literals)
      • Example:
        [Figure: s:LocalBusiness node _:1 linked to s:PostalAddress node _:2 with “Liverpool” and “Church Street 1”]
        rdf:type-s:LocalBusiness,
        s:address-rdf:type-s:PostalAddress,
        s:address-s:addressLocality,
        s:address-s:streetAddress
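The path enumeration can be sketched as a short recursive function. The nested-dictionary representation of an item and the name `enumerate_paths` are assumptions made for this illustration. The path notation (segments joined by `-`, literals omitted) follows the example on the slide.

```python
def enumerate_paths(node, prefix=""):
    """Lists all paths from the root of an item tree to its leaves, omitting literal values.

    node: {"type": <class name>, "properties": {<property>: [<literal or nested node>, ...]}}
    """
    paths = [f"{prefix}rdf:type-{node['type']}"]
    for prop, values in node["properties"].items():
        for value in values:
            if isinstance(value, dict):
                # Nested item: continue the path below this property.
                paths.extend(enumerate_paths(value, f"{prefix}{prop}-"))
            else:
                # Literal: the path ends at the property, the value itself is dropped.
                paths.append(f"{prefix}{prop}")
    return paths


# The s:LocalBusiness instance from the slide, in the assumed representation.
liverpool_business = {
    "type": "s:LocalBusiness",
    "properties": {
        "s:address": [{
            "type": "s:PostalAddress",
            "properties": {
                "s:addressLocality": ["Liverpool"],
                "s:streetAddress": ["Church Street 1"],
            },
        }],
    },
}

for path in enumerate_paths(liverpool_business):
    print(path)
```

For this instance the function produces the four paths listed on the slide. The Birmingham instance yields the same four paths because only the literals differ, and literals are omitted.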
  23. Overall Convergence
      • Using all paths, we can compute the entropy for each class as: [formula shown on the slide]
      • A low entropy indicates high homogeneity
      • We normalize by both the maximum entropy and the total number of paths
        • i.e., we use the normalized entropy rate as a measure of homogeneity
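Since the entropy formula is not part of the transcript, the following is only a sketch of one way to turn path sets into a homogeneity score, not the authors' exact measure. It assumes that each instance of a class is reduced to its set of paths, that Shannon entropy is taken over the distribution of these sets, and that the maximum entropy is the case where every instance is described differently. The additional normalization by the total number of paths mentioned on the slide is not reproduced, because its exact form is not given. The function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(path_sets):
    """path_sets: one collection of paths per instance of a class.

    Returns a value in [0, 1]: 0 if all instances share the same paths,
    1 if every instance uses a different set of paths (assumed normalization).
    """
    n = len(path_sets)
    if n < 2:
        return 0.0
    counts = Counter(frozenset(paths) for paths in path_sets)
    # Shannon entropy of the distribution of distinct path sets.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)
```

Under these assumptions, the two s:LocalBusiness instances from the earlier example share one path set and score 0.0, the most homogeneous case. Two instances with different path sets score 1.0.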
  24. Overall Convergence
      • Observations
        • Overall entropy decreases over time
      • Classes with high convergence rates
        • WebSite, Blog, …: influence of CMS adoption
        • Hotel, Restaurant, …: yellow pages
        • Product, Offer, …: Google Rich Snippets
        • Rating, Review: all of the above
  25. Key Adoption Drivers
      • Search Engine Optimization
        • Website providers want to rank high in Google
        • Direct business incentive!
      • Tool adoption
        • Major CMSs use schema.org
      • Standard Agility
        • schema.org: 25 revisions in the last three years
        • cf. FOAF: six revisions in the last eight years
  26. Summary
      • Adoption can be observed both ways, top-down and bottom-up
      • Homogeneity of the deployed schema increases over time
      • The described empirical, data-driven study reveals valuable insights into how and why schema.org is a success story
      • The observed key drivers and obstacles can also help to understand and analyze the adoption of other standards, e.g., LOD
      • More fine-grained insights might be revealed by extending the analysis corpus to the mailing list archive and issue tracker
  27. Thank you! Questions? Feedback?
      Raw data can be found on the WebDataCommons website: http://webdatacommons.org/structureddata/
      More interesting datasets and analyses: http://webdatacommons.org/index.html
      Acknowledgement: The extraction and analysis of the datasets were supported by an AWS in Education Grant.