Adoption of the Linked Data Best Practices in Different Topical Domains

Slides from the presentation of the following paper:

Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 245-260, Riva del Garda, Italy, October 2014.

Paper URL:

http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/SchmachtenbergBizerPaulheim-AdoptionOfLinkedDataBestPractices.pdf

Abstract:

The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among others, we find that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.

Adoption of the Linked Data Best Practices in Different Topical Domains

  1. Adoption of the Linked Data Best Practices in Different Topical Domains
     Max Schmachtenberg, Christian Bizer, Heiko Paulheim
     Schmachtenberg, Bizer, Paulheim: Adoption of the Linked Data Best Practices, 23.10.2014
  2. The Linked Data Best Practices
     Central idea of Linked Data: Ease data discovery and integration by complying with a set of best practices.
     1. Linking Best Practices
        • Set RDF links pointing at instances in other data sources.
     2. Vocabulary Best Practices
        • Reuse terms from widely-used vocabularies.
        • Make definitions of proprietary terms dereferenceable.
        • Link vocabulary terms to terms in other vocabularies.
     3. Metadata Best Practices
        • Publish machine-readable provenance and licensing metadata.
        • Publish metadata about alternative access methods (SPARQL, dumps).
  3. State of the LOD Cloud Report - 2011
     • http://lod-cloud.net/state/
     • Based on information provided by dataset publishers via the datahub.io catalog
  4. LOD Cloud - 2011
     Consists of 295 datasets.
  5. Outline
     Goal: Update the State of the LOD Cloud report and LOD Cloud itself to 2014.
     1. Methodology
     2. Adoption of the Linking Best Practices
     3. Adoption of the Vocabulary Best Practices
     4. Adoption of the Metadata Best Practices
     5. Conclusions (in Relation to Schema.org)
  6. 1. Methodology: Crawl of the Linked Data Web
     • Crawler: LDSpider, Crawl Date: April 2014
     • Seeds: 560,000 seed URIs from
       1. Example URIs in datahub.io catalog
       2. URIs from BTC2012 dataset
       3. URIs from datasets advertised on public-lod@w3.org mailing list
     • Crawled Data Corpus
       • 900,000 documents containing
       • 8,038,000 resources
       • 1014 datasets
       • 77 datasets prevent crawling via robots.txt
     • [Figure: distribution by dataset. Red line: documents. Blue line: resources.]
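The corpus statistics above mention that 77 datasets prevent crawling via robots.txt. As a minimal sketch of how a crawler could honor such rules, the following uses Python's standard `urllib.robotparser` on inline robots.txt text. The user agent name `ldspider`, the example.org URL, and both robots.txt files are assumptions made for illustration; this is not LDSpider's actual implementation or configuration.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt files, not taken from any crawled dataset.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

blocked = may_crawl(BLOCK_ALL, "ldspider", "http://example.org/resource/Berlin")
allowed = may_crawl(BLOCK_PRIVATE, "ldspider", "http://example.org/resource/Berlin")
```

A dataset serving the first file would count among those that prevent crawling; one serving the second would still expose its `/resource/` URIs.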
  7. Categorization by Topical Domain
     • Used categorization from datahub.io for existing datasets.
     • Manually categorized remaining datasets.
     • Added new category Social Networking.
     • Growth without new category Social Networking: 94 %
     • LODstats (http://stats.lod2.eu/) discovered a similar number of datasets: 1048
  8. 2. Adoption of the Linking Best Practices
     Data publishers should set RDF links because:
     1. Discoverability depends on being linked.
     2. RDF links ease data integration.
  9. Degrees
     • 56% of all datasets set RDF links pointing to other datasets.
       • The remaining 44% are either only the target of RDF links from other datasets or are isolated.
     • Datasets with Top In- and Outdegrees: [table]
     • Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
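The degree figures on this slide are dataset-level counts derived from instance-level RDF links. A rough sketch of that aggregation follows, under the loud assumption that a dataset can be identified by the host part of its URIs (the paper's actual dataset assignment may differ). All hosts and triples are invented; only the predicate names come from the slide.

```python
from collections import defaultdict
from urllib.parse import urlparse


def dataset_of(uri: str) -> str:
    """Assumption: a dataset is identified by the host part of its URIs."""
    return urlparse(uri).netloc


def dataset_links(triples):
    """Collapse instance-level RDF links into a set of (source, target) dataset pairs."""
    links = set()
    for subject, _predicate, obj in triples:
        source, target = dataset_of(subject), dataset_of(obj)
        # Skip literals (no host) and links that stay inside one dataset.
        if source and target and source != target:
            links.add((source, target))
    return links


def degrees(links):
    """Return (outdegree, indegree) dictionaries counted over dataset pairs."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for source, target in links:
        out_deg[source] += 1
        in_deg[target] += 1
    return dict(out_deg), dict(in_deg)


# Toy triples with made-up hosts.
TRIPLES = [
    ("http://a.example/r/1", "owl:sameAs", "http://b.example/r/9"),
    ("http://a.example/r/2", "rdfs:seeAlso", "http://c.example/r/3"),
    ("http://b.example/r/9", "owl:sameAs", "http://c.example/r/3"),
    ("http://c.example/r/3", "foaf:knows", "http://c.example/r/4"),
]
ALL_DATASETS = {"a.example", "b.example", "c.example", "d.example"}

links = dataset_links(TRIPLES)
out_deg, in_deg = degrees(links)
# Share of datasets that set at least one outgoing RDF link.
share_linking = len(out_deg) / len(ALL_DATASETS)
```

In this toy corpus, `c.example` is only a link target and `d.example` is isolated, which mirrors the two cases the slide lists for datasets that set no links.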
  10. “Crawlable” LOD Cloud 2014
      [Figure: LOD cloud diagram]
  11. Degree Distributions
      • Dotted line: Social Networking (status.net, etc.)
      • Solid line: Cross-Domain datasets (DBpedia, etc.)
      • Largest Strongly Connected Component: 36% (377 datasets)
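The largest strongly connected component is the biggest group of datasets in which every dataset can reach every other by following RDF links. The slides do not say how it was computed; the sketch below uses Kosaraju's algorithm as one standard option, on an invented five-dataset graph.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: return a list of sets, one per strongly connected component."""
    forward, backward = defaultdict(list), defaultdict(list)
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, finish_order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                finish_order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = {start}, [start]
        while todo:
            node = todo.pop()
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Toy dataset graph (names invented): A, B, C link in a cycle; D and E hang off it.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]

components = strongly_connected_components(NODES, EDGES)
largest = max(components, key=len)
share = len(largest) / len(NODES)
```

Here the cycle A, B, C forms the largest component; D and E are reachable from it but do not link back, so each is a component of its own.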
  12. Conclusion concerning Linking Best Practices
      • Some datasets put a lot of effort into linking.
      • Many datasets only link to a small number of other datasets or do not set RDF links at all.
      • Similar situation as in 2011.
  13. 3. Adoption of the Vocabulary Best Practices
      Goal: Help applications understand the data by
      1. Reusing terms from widely-used vocabularies.
      2. Making definitions of proprietary terms dereferenceable.
      3. Linking vocabulary terms to terms in other vocabularies.
  14. Widely-Used and Proprietary Vocabularies
      • Strong agreement on some vocabularies.
      • Proprietary vocabularies are used in addition to common ones, as data is often very specific.
      • [Tables: Widely-Used Vocabularies; Proprietary Vocabularies]
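One way to separate widely-used from proprietary vocabularies is to group term URIs by namespace and count how many datasets use each namespace. The sketch below assumes two things that the slides do not state: that a namespace is everything up to the last `#` or `/` of a term URI, and that "proprietary" means used by exactly one dataset. Dataset names and the `a.example` vocabulary are invented.

```python
from collections import defaultdict


def namespace_of(term_uri: str) -> str:
    """Heuristic: the vocabulary namespace is everything up to the last '#' or '/'."""
    cut = max(term_uri.rfind("#"), term_uri.rfind("/"))
    return term_uri[: cut + 1]


def vocabulary_usage(terms_by_dataset):
    """Map each vocabulary namespace to the set of datasets using a term from it."""
    usage = defaultdict(set)
    for dataset, terms in terms_by_dataset.items():
        for term in terms:
            usage[namespace_of(term)].add(dataset)
    return dict(usage)


# Toy input: which term URIs each (invented) dataset uses.
TERMS = {
    "dataset-a": ["http://xmlns.com/foaf/0.1/name", "http://a.example/vocab#shelfMark"],
    "dataset-b": ["http://xmlns.com/foaf/0.1/knows", "http://purl.org/dc/terms/title"],
    "dataset-c": ["http://purl.org/dc/terms/creator"],
}

usage = vocabulary_usage(TERMS)
# Assumption: a vocabulary counts as proprietary if exactly one dataset uses it.
proprietary = sorted(ns for ns, users in usage.items() if len(users) == 1)
```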
  15. Dereferenceability of Term URIs and Vocabulary Linking
      • 28% of the proprietary vocabularies provide dereferenceable URIs.
      • 21% set RDF links to other vocabularies (8% in 2011)
        • Popular linking predicates: rdfs:range, rdfs:subClassOf
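A dereferenceability check can be sketched as an HTTP lookup of the term URI that asks for RDF. The criterion used here (final status 200 with an RDF media type) is an assumption, not necessarily the test applied in the study, and the list of media types is illustrative. The `fetch` parameter is a hypothetical injection point so the logic can be exercised without network access.

```python
import urllib.request

RDF_MEDIA_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
)


def http_fetch(uri, timeout=10):
    """Fetch uri asking for RDF; urlopen follows redirects such as 303 See Other."""
    request = urllib.request.Request(uri, headers={"Accept": ", ".join(RDF_MEDIA_TYPES)})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type", "")


def is_dereferenceable(term_uri, fetch=http_fetch):
    """Return True if fetch(term_uri) yields status 200 with an RDF media type."""
    try:
        status, content_type = fetch(term_uri)
    except OSError:  # covers urllib's URLError and HTTPError
        return False
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type in RDF_MEDIA_TYPES


def fake_fetch(uri):
    """Offline stand-in for http_fetch with invented responses."""
    responses = {
        "http://a.example/vocab#shelfMark": (200, "text/turtle; charset=utf-8"),
        "http://b.example/vocab/Thing": (200, "text/html"),
    }
    if uri not in responses:
        raise OSError("unreachable")
    return responses[uri]


rdf_term = is_dereferenceable("http://a.example/vocab#shelfMark", fake_fetch)
html_only = is_dereferenceable("http://b.example/vocab/Thing", fake_fetch)
dead_link = is_dereferenceable("http://c.example/vocab/Gone", fake_fetch)
```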
  16. 4. Adoption of the Metadata Best Practices
      1. Publish machine-readable provenance information.
      2. Publish machine-readable licensing information.
      3. Publish metadata about alternative access methods (SPARQL endpoints, RDF dumps).
  17. Provenance and Licensing Metadata
      • 37% of the datasets provide provenance information
        • Dublin Core is used more than W3C Prov
      • 10% provide machine-readable licensing information
        • Most used predicates: dc:license, cc:license
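Checking a dataset for machine-readable licensing information amounts to looking for triples that use a license predicate. In the sketch below, `cc:license` is expanded to the Creative Commons namespace, and `dc:license` is assumed to expand to the DCMI Terms namespace; the slide does not spell out either prefix. The datasets and triples are invented.

```python
LICENSE_PREDICATES = {
    "http://purl.org/dc/terms/license",       # assumed expansion of dc:license
    "http://creativecommons.org/ns#license",  # cc:license
}


def has_license_metadata(triples) -> bool:
    """Return True if any triple uses one of the license predicates."""
    return any(predicate in LICENSE_PREDICATES for _s, predicate, _o in triples)


def share_with_license(datasets) -> float:
    """Fraction of datasets that carry at least one license triple."""
    licensed = [name for name, triples in datasets.items() if has_license_metadata(triples)]
    return len(licensed) / len(datasets)


# Toy corpus: four invented datasets, one of which states a license.
DATASETS = {
    "dataset-a": [
        ("http://a.example/data", "http://purl.org/dc/terms/license",
         "http://creativecommons.org/licenses/by/4.0/"),
    ],
    "dataset-b": [("http://b.example/r/1", "http://xmlns.com/foaf/0.1/name", "Alice")],
    "dataset-c": [],
    "dataset-d": [("http://d.example/r/7", "http://purl.org/dc/terms/creator", "Bob")],
}

license_share = share_with_license(DATASETS)
```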
  18. Dataset Level Metadata (VoID)
      • 15% of the datasets publish VoID descriptions.
      • Via these descriptions, it is possible to discover SPARQL endpoints and dumps for about 10% of the data sources.
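VoID advertises alternative access methods through the properties `void:sparqlEndpoint` and `void:dataDump`. As a minimal sketch, assuming the VoID description has already been parsed into (subject, predicate, object) tuples, discovery reduces to filtering on those two predicates. The example.org URIs are invented.

```python
VOID = "http://rdfs.org/ns/void#"


def access_methods(triples):
    """Collect SPARQL endpoints and data dumps advertised in a VoID description."""
    endpoints = sorted(o for _s, p, o in triples if p == VOID + "sparqlEndpoint")
    dumps = sorted(o for _s, p, o in triples if p == VOID + "dataDump")
    return {"sparql_endpoints": endpoints, "data_dumps": dumps}


# Toy VoID description for an invented dataset.
VOID_TRIPLES = [
    ("http://example.org/void#ds", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     VOID + "Dataset"),
    ("http://example.org/void#ds", VOID + "sparqlEndpoint", "http://example.org/sparql"),
    ("http://example.org/void#ds", VOID + "dataDump", "http://example.org/dumps/all.nt.gz"),
]

methods = access_methods(VOID_TRIPLES)
```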
  19. Conclusion concerning Metadata Best Practices
      • Applications cannot rely on availability of metadata, as only a small fraction of all data sources publishes such data.
      • The Government and Library domains are positive exceptions.
      • Similarly low numbers as in 2011.
  20. “Full” LOD Cloud Diagram
      570 datasets
      • 374 datahub.io
      • 196 our crawl
      http://lod-cloud.net/
  21. Growth of the “Full” LOD Cloud Diagram
      • 2011: 295 datasets
      • 2014: 570 datasets (+ 93 %)
      http://lod-cloud.net/
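The growth figure can be checked from the dataset counts on the two “Full” LOD Cloud slides. The helper name `growth_percent` is introduced here only for illustration.

```python
def growth_percent(before: int, after: int) -> float:
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100


# 374 datasets from datahub.io plus 196 from the crawl.
full_cloud_2014 = 374 + 196
growth = growth_percent(295, full_cloud_2014)
```

Rounded to a whole number, `growth` matches the + 93 % stated on the slide.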
  22. Comparison of Linked Data and Schema.org
      Schema.org
      1. does not expect data publishers to set data links.
      2. relies on marking up data in HTML pages.
      3. has strong application pull from Google, Microsoft, Yahoo!
  23. Adoption
      • WebDataCommons, 2013*: 463,000 websites (PLDs) provide Microdata annotations.
      • Google, 2014**: 5 million websites provide Schema.org data.
      • Orders of magnitude more Schema.org data sources.
      * WebDataCommons extracts Microdata, RDFa, Microformat data from the CommonCrawl (2.2 billion HTML pages from 12.8 million PLDs).
      ** Guha in LDOW2014 Keynote
  24. Schema.org Topical Focus
      Different topics compared to Linked Data.
  25. Class / Property Distribution
      Microdata 2012
      • Only a small set of classes / properties is actually used.
      • Less variety compared to Linked Data.
  26. Shallowness of the Schema.org Data
      schema:Product: Product Names
      • AppleMacBook Air MC968/A 11.6-Inch Laptop
      • Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7
      schema:JobPosting: JobPostings
      • More specific properties like skills are hardly used.
      • 57% of all hiringOrganizations are strings, not instances.
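The hiringOrganization statistic distinguishes plain text values from structured instances. A small sketch of such a measurement follows, assuming each extracted value arrives either as a Python string (plain text) or as a dict (a nested item with its own type and properties). This representation and all sample values are invented for illustration.

```python
def string_share(values) -> float:
    """Fraction of property values that are plain strings rather than nested items."""
    values = list(values)
    strings = [value for value in values if isinstance(value, str)]
    return len(strings) / len(values)


# Toy hiringOrganization values: two strings and two nested items.
HIRING_ORGANIZATIONS = [
    "ACME Corp.",
    {"type": "schema:Organization", "name": "ACME Corporation"},
    "Example Staffing Ltd",
    {"type": "schema:Organization", "name": "Example Staffing"},
]

share_of_strings = string_share(HIRING_ORGANIZATIONS)
```

String values such as these have to be parsed and matched by name before two postings can be attributed to the same organization, which is the shallowness the slide points at.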
  27. Conclusion
      • Linked Data: ~ 1,000 sources. Schema.org: > 460,000 sources.
      • Linked Data: covers wider range of specific topics (government, libraries, science). Schema.org: topics focused on search engines (products, organizations).
      • Linked Data: contains more complex data structures. Schema.org: very simple and shallow data structures.
      • Linked Data: partial ontology agreement. Schema.org: strong ontology agreement.
      • Linked Data: identity resolution eased by RDF links. Schema.org: identity resolution often requires value parsing.
  28. Thank you.
      References
      • Report: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
      • Catalog: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/
      Acknowledgement
      • This work was supported by [sponsor logos]