A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) and bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.

  1. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
     Robert Meusel, Christian Bizer, and Heiko Paulheim
  2. Motivation - LOD Cloud with 1,000 data providers
     (A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015)
  3. Motivation - schema.org Microdata (MD) with 700k data providers
  4. Microdata in a Nutshell
     • Adding structured information to web pages
       - By marking up contents and entities
     • Arbitrary vocabularies are possible
       - In practice, only schema.org is deployed on a large scale
       - Plus its historical predecessor: data-vocabulary.org
     • Similar to RDFa
     Example:
       <div itemscope itemtype="http://schema.org/PostalAddress">
         <span itemprop="name">Data and Web Science Group</span>
         <span itemprop="addressLocality">Mannheim</span>,
         <span itemprop="postalCode">68131</span>
         <span itemprop="addressCountry">Germany</span>
       </div>
  5. Schema.org in a Nutshell
     • Vocabulary for marking up entities on web pages
       - 675 classes and 965 properties (as of May 2015, release 2.0)
     • Promoted and consumed by major search engine companies
       - Google, Bing, Yahoo!, and Yandex
       - Google Rich Snippets
     • Community-driven evolution and development
  6. Schema.org in a Nutshell - Coverage
     • Schema.org has incorporated some popular vocabularies, such as:
       - Good Relations (2012)
       - W3C BibExtend (2014)
       - MusicBrainz vocabulary (2015)
       - Automotive Ontology (2015)
  7. Microdata with Schema.org in HTML Pages
     Plain HTML:
       <html>
       …
       <body>
       …
       <div id="main-section" class="performance left" data-sku="M17242_580">
       <h1> Predator Instinct FG Fußballschuh </h1>
       <div>
       <meta content="EUR">
       <span data-sale-price="219.95">219,95</span>
       …
       </body>
       </html>
     HTML pages embed markup languages directly to annotate items using different vocabularies:
       <html>
       …
       <body>
       …
       <div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
       <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
       <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
       <meta itemprop="priceCurrency" content="EUR">
       <span itemprop="price" data-sale-price="219.95">219,95</span>
       …
       </body>
       </html>
     Extracted RDF triples:
       _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
       _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
       _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
       _:node1 <http://schema.org/Offer/price> "219,95"@de .
       _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
       …
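The step from annotated HTML to triples can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the extraction pipeline used for the study: the class name `MicrodataSketch`, the helper `extract`, and the closed-off sample page are assumptions made for the example. It handles only `itemscope`, `itemtype`, `itemprop`, and `content`, expects properly nested tags, ignores language tags, and assigns each nested item its own blank node.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the open-tag stack.
VOID = {"meta", "link", "img", "br", "hr", "input"}

# Hypothetical, closed-off version of the product page from the slide.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
  </div>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples from simple Microdata."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.open_tags = []  # per open element: node id if it opened an item, else None
        self.items = []      # stack of (node id, item type) for open items
        self.pending = None  # (subject, predicate) waiting for its text value
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("itemprop")
        opened = None
        if "itemscope" in a:
            self.counter += 1
            node = f"_:node{self.counter}"
            itype = a.get("itemtype", "")
            self.triples.append((node, "rdf:type", itype))
            if prop and self.items:
                # Link the nested item to its parent item.
                parent, ptype = self.items[-1]
                self.triples.append((parent, f"{ptype}/{prop}", node))
            self.items.append((node, itype))
            opened = node
        elif prop and self.items:
            subj, stype = self.items[-1]
            if "content" in a:
                self.triples.append((subj, f"{stype}/{prop}", a["content"]))
            elif tag not in VOID:
                self.pending = (subj, f"{stype}/{prop}")
        if tag not in VOID:
            self.open_tags.append(opened)

    def handle_endtag(self, tag):
        if tag in VOID or not self.open_tags:
            return
        if self.open_tags.pop() is not None:
            self.items.pop()  # the element that opened the item is closed

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((*self.pending, data.strip()))
            self.pending = None


def extract(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


for triple in extract(SAMPLE):
    print(triple)
```

Predicates are built as item type plus property name, mirroring the `http://schema.org/Product/name` form shown in the triples above.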
  8. Wrap-Up
     • Semantic annotations are used by more and more websites
     • Entities on websites become machine-readable and machine-understandable
     • schema.org together with Microdata is a success story
       - Promoted by search engine companies
       - Deployed by over 17% of all websites [1] (over 700k data providers)
     • Usage is more compliant with the schema than, e.g., LOD [2]
     [1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
     [2] Meusel and Paulheim, ESWC 2015
  9. Digging for Reasons
     • So, Microdata is deployed more often and is often more schema-compliant, although there are millions of uncontrolled providers with different skill sets
     • But why? Some hypotheses…
       - Availability of documentation
       - Tool support
       - Business incentive
       - Schema flexibility
     • Can we confirm or reject those by looking at the data?
  10. A Diachronic Perspective
      • Versions of schema.org are archived over time
        - Plus: there are several crawl releases per year
        - i.e., we can look at change over time
      • If we look at both schema and deployed data, we may observe
        - Adoption rates of schema changes
        - Data-first changes to the schema
        - Convergence or divergence of deployed data
  11. A Diachronic Perspective
      • Three releases of the WDC Microdata corpus [1]
        - 2012, 2013, and 2014
      • Versions of schema.org that were valid
        - At the beginning of the crawl
        - At the end of the crawl
      [1] http://webdatacommons.org/structureddata
  12. Top-down Adoption
      • How fast are changes in the schema adopted?
        - New classes/properties
        - Deprecations
        - Domain/range changes
      • Measuring adoption: challenges
        - Different crawls
        - Overall growth of deployed schema.org
      • Measure: normalized usage increase (nui) from i to j [formula not preserved in this transcript]
        - nui(s) > 1.05: usage of schema element s has increased significantly
        - nui(s) < 0.95: usage of schema element s has decreased significantly
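The nui formula itself is missing from the transcript. One plausible reading, assumed here and not confirmed by the slide text, is the ratio between the share of data providers using element s at time j and the same share at time i, which would compensate for both crawl size and the overall growth of schema.org deployment. The function names and all counts below are hypothetical.

```python
def nui(users_i, providers_i, users_j, providers_j):
    """Assumed definition: relative change in the share of providers using s.

    users_*     -- number of data providers using schema element s in a crawl
    providers_* -- number of data providers deploying schema.org in that crawl
    """
    share_i = users_i / providers_i
    share_j = users_j / providers_j
    if share_i == 0:
        raise ValueError("element unused at time i; the ratio is undefined")
    return share_j / share_i


def classify(value, upper=1.05, lower=0.95):
    """Apply the significance thresholds stated on the slide."""
    if value > upper:
        return "increased significantly"
    if value < lower:
        return "decreased significantly"
    return "no significant change"


# Hypothetical counts: the element triples in absolute use while the
# corpus doubles, so its share grows by roughly a factor of 1.5.
example = nui(users_i=1_000, providers_i=100_000,
              users_j=3_000, providers_j=200_000)
print(round(example, 2), classify(example))
```

Under this reading, an element can grow in absolute numbers and still have nui(s) < 0.95 if the corpus as a whole grows faster.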
  13. Top-down Adoption
      • Adoption of new classes and properties
        - Almost half of all introduced classes are never used! (slide callout: SURPRISE!)
        - Similar for new properties
      • Reasons
        - Bulk addition of vocabularies: not every term is equally needed (e.g., medical vocabulary)
        - Blind spot of our approach: some terms are mainly for e-mail markup (e.g., Actions)
  14. Top-down Adoption
      • Main domains of positive adoption
        - Metadata for web content (schema.org/Website has the highest nui)
        - Broadcasting (e.g., TV episodes)
        - Questions & Answers
        - Postal addresses
      • Classes featured in Google Rich Snippets
        - Still growing at a high level (tens of thousands of data providers)
        - But nui(s) < 0.95
      Slide annotations: Yellow Pages; Search Engine Listings; Collaboration with BBC and EBU; Influence of CMS adoption; Q&A pages, such as Stack Overflow
  15. Top-down Adoption
      • Adoption of domain/range changes
        - Again: rather low overall adoption
      • Adopted well for
        - Products (height, width, itemCondition, …)
        - Broadcasting domain (episode, actor, ...)
      Slide annotations: Search Engine Listings; Collaboration with BBC and EBU
  16. Top-down Adoption
      • Adoption of deprecations
        - Works well (29 out of 32 have a significantly low nui)
      • Exceptions
        - s:map (← s:hasMap)
        - s:maps (← s:hasMap)
      Slide annotation: For Google Maps (lots of outdated tutorials)
  17. Bottom-up Evolution
      • Martin Luther (1483-1546)
        - Started the Protestant church
        - A success story, too (like schema.org)
        - (i.e., 800 million adopters worldwide)
      • Famous quote:
        - “Man muss […] dem gemeinen Mann aufs Maul schauen”
        - (roughly: “You have to listen to the way the common man really speaks.”)
      Disclaimer: I do not speak for the Protestant church.
  18. Bottom-up Evolution
      • Are new features in the schema first used “unofficially”?
        - New classes/properties
        - Domain/range changes
      • Instrument for measurement: ROC curves
        - True positives mapped against false positives
        - tp: elements used before
        - fp: elements not used before
        - Ranking by #PLDs
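The ROC construction can be sketched as follows. The setup is one interpretation of the slide, not a confirmed description of the authors' procedure: candidate terms are ranked by the number of pay-level domains (PLDs) that use them in an earlier crawl, and a term counts as a positive if it is part of a later schema release. The term names and counts are invented for illustration, and ties in the ranking are ignored.

```python
def roc_points(ranked_labels):
    """ROC curve for a ranking; labels are True for positives, best rank first."""
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical candidates: term -> (number of PLDs using it, adopted later?)
candidates = {
    "termA": (900, True),
    "termB": (700, True),
    "termC": (400, False),
    "termD": (250, True),
    "termE": (90, False),
    "termF": (10, False),
}
ranking = sorted(candidates.values(), key=lambda item: item[0], reverse=True)
labels = [adopted for _, adopted in ranking]
print(round(auc(roc_points(labels)), 3))
```

An area near 0.5 would mean prior unofficial use says little about later adoption into the schema; values closer to 1 would indicate a bottom-up influence.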
  19. Bottom-up Evolution
      • There are some mild influences observable
        - Stronger for domain/range changes, especially range changes
        - Weaker for new classes/properties
      [Figure: ROC curves for classes, properties, domains, and ranges; panels 2012→2013, 2013→2014, and 2012→2014]
  20. Bottom-up Evolution
      • Extension mechanism
        - Allows for user-defined classes/properties
        - Those become subclasses implicitly
        - Examples: s:Product/ElectronicProduct, s:price/reducedPrice
      • Analysis over time
        - No measurable impact on standard evolution
        - “Unofficial” use is likelier than use of the extension mechanism
  21. Overall Convergence
      • Measuring convergence
        - i.e., homogeneity of descriptions of classes
        - Example: two instances of s:LocalBusiness
      [Figure: two graphs, each with a node _:1 of type s:LocalBusiness linked to a node _:2 of type s:PostalAddress; literals “Birmingham”, “Main Street 24” in the first and “Liverpool”, “Church Street 1” in the second]
  22. Overall Convergence
      • Recap
        - RDF from Microdata is a set of trees
        - i.e., we can enumerate all paths to leaf nodes (omitting literals)
      • Example:
        [Figure: node _:1 of type s:LocalBusiness linked to node _:2 of type s:PostalAddress, with literals “Liverpool” and “Church Street 1”]
        Paths:
          rdf:type-s:LocalBusiness,
          s:address-rdf:type-s:PostalAddress,
          s:address-s:addressLocality,
          s:address-s:streetAddress
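The path enumeration can be sketched with a small recursive function. The nested-dictionary representation of an item (keys `type` and `props`) and the function name `enumerate_paths` are assumptions made for this example; the property names and the resulting paths are the ones listed on the slide.

```python
def enumerate_paths(item, prefix=()):
    """List all root-to-leaf paths of a Microdata item tree, omitting literals."""
    # The class of each item is kept as the end of an rdf:type path.
    paths = [prefix + ("rdf:type", item["type"])]
    for prop, values in item["props"].items():
        for value in values:
            if isinstance(value, dict):
                # Nested item: descend and extend the path with the property.
                paths.extend(enumerate_paths(value, prefix + (prop,)))
            else:
                # Literal value: the path ends at the property itself.
                paths.append(prefix + (prop,))
    return paths


# The Liverpool instance from the slide, in the assumed representation.
business = {
    "type": "s:LocalBusiness",
    "props": {
        "s:address": [{
            "type": "s:PostalAddress",
            "props": {
                "s:addressLocality": ["Liverpool"],
                "s:streetAddress": ["Church Street 1"],
            },
        }],
    },
}

for path in enumerate_paths(business):
    print("-".join(path))
```

Because literals are dropped, the Birmingham and Liverpool instances produce the same set of paths, which is what makes them count as homogeneous descriptions.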
  23. Overall Convergence
      • Using all paths, we can compute the entropy for each class [formula not preserved in this transcript]
      • A low entropy indicates high homogeneity
      • We normalize both by maximum entropy and by the total number of paths
        - i.e., we use the normalized entropy rate as a measure of homogeneity
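Since the entropy formula is missing from the transcript, the following is only a sketch of the general idea, using plain Shannon entropy over the distinct path sets observed for a class and dividing by the maximum possible entropy. It is an assumption, not the authors' exact measure; in particular, the additional normalization by the total number of paths is left out because its form is not given here. The function name and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def normalized_entropy(descriptions):
    """Shannon entropy of the path-set distribution, scaled to [0, 1].

    descriptions -- one collection of paths per instance of a class.
    0 means all instances use the same paths; 1 means all differ.
    """
    counts = Counter(frozenset(paths) for paths in descriptions)
    n = sum(counts.values())
    if n < 2:
        return 0.0
    entropy = sum((c / n) * log2(n / c) for c in counts.values())
    return entropy / log2(n)  # log2(n) is reached when every instance differs


address_only = {"rdf:type-s:LocalBusiness", "s:address-s:addressLocality"}
with_phone = address_only | {"s:telephone"}

homogeneous = [address_only] * 4
mixed = [address_only, address_only, address_only, with_phone]

print(normalized_entropy(homogeneous))
print(round(normalized_entropy(mixed), 3))
```

Tracking such a value per class across the 2012, 2013, and 2014 corpora would show whether descriptions converge (value falls) or diverge (value rises).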
  24. Overall Convergence
      • Observations
        - Overall entropy decreases over time
      • Classes with high convergence rates
        - WebSite, Blog, …
        - Hotel, Restaurant, …
        - Product, Offer, …
        - Rating, Review
      Slide annotations: Influence of CMS adoption; Yellow pages; Google Rich Snippets; ...all of the above
  25. Key Adoption Drivers
      • Search Engine Optimization
        - Website providers want to rank high in Google
        - Direct business incentive!
      • Tool adoption
        - Major CMSs use schema.org
      • Standard agility
        - schema.org: 25 revisions in the last three years
        - cf. FOAF: six revisions in the last eight years
  26. Summary
      • Both top-down adoption and bottom-up evolution can be observed
      • Homogeneity of the deployed schema increases over time
      • The described empirical, data-driven study reveals valuable insights into how and why schema.org is a success story
      • The observed key drivers and obstacles can also help to understand and analyze the adoption of other standards, e.g., LOD
      • More fine-grained insights might be revealed by extending the analysis corpus to the mailing list archive and issue tracker
  27. Thank you! Questions? Feedback?
      Raw data can be found on the website of WebDataCommons: http://webdatacommons.org/structureddata/
      More interesting datasets and analyses: http://webdatacommons.org/index.html
      Acknowledgement: The extraction and analysis of the datasets were supported by an AWS in Education Grant.
