Slides from the presentation of the following paper:
Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 245-260, Riva del Garda, Italy, October 2014.
Paper URL:
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/SchmachtenbergBizerPaulheim-AdoptionOfLinkedDataBestPractices.pdf
Abstract:
The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among others, we find that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.
Adoption of the Linked Data Best Practices in Different Topical Domains
1. Max Schmachtenberg
Christian Bizer
Heiko Paulheim
Adoption of the Linked Data Best Practices
in Different Topical Domains
Schmachtenberg, Bizer, Paulheim: Adoption of the Linked Data Best Practices, 23.10.2014 Slide 1
2. The Linked Data Best Practices
Central idea of Linked Data: Ease data discovery and
integration by complying with a set of best practices.
1. Linking Best Practices
• Set RDF links pointing at instances in other data sources.
2. Vocabulary Best Practices
• Reuse terms from widely-used vocabularies.
• Make definitions of proprietary terms dereferencable.
• Link vocabulary terms to terms in other vocabularies.
3. Metadata Best Practices
• Publish machine-readable provenance and licensing metadata.
• Publish metadata about alternative access methods (SPARQL, dumps)
3. State of the LOD Cloud Report - 2011
http://lod-cloud.net/state/
Based on information
provided by dataset
publishers via the
datahub.io catalog
4. LOD Cloud - 2011
Consists of
295 datasets.
5. Outline
Goal: Update the State of the LOD Cloud report
and LOD Cloud itself to 2014.
1. Methodology
2. Adoption of the Linking Best Practices
3. Adoption of the Vocabulary Best Practices
4. Adoption of the Metadata Best Practices
5. Conclusions (in Relation to Schema.org)
6. 1. Methodology: Crawl of the Linked Data Web
Crawler: LDSpider, Crawl Date: April 2014
Seeds: 560,000 seed URIs from
1. Example URIs in datahub.io catalog
2. URIs from BTC2012 dataset
3. URIs from datasets advertised on public-lod@w3.org mailing list
Crawled Data Corpus
• 900,000 documents containing
• 8,038,000 resources
• 1014 datasets
• 77 datasets prevent
crawling via robots.txt
• Distribution by dataset
• Red line: documents
• Blue line: resources
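The robots.txt exclusion mentioned above can be illustrated with a minimal sketch. It uses Python's standard urllib.robotparser and is not taken from LDSpider; the user-agent name "ldspider", the example URLs, and the two robots.txt bodies are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, url: str, agent: str = "ldspider") -> bool:
    """Return True if the given robots.txt body allows `agent` to fetch `url`.

    The agent name is a hypothetical placeholder, not the crawler's real one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt bodies: one blocks everything, one blocks one path.
blocks_everything = "User-agent: *\nDisallow: /\n"
blocks_private_only = "User-agent: *\nDisallow: /private/\n"

print(may_crawl(blocks_everything, "http://example.org/resource/1"))
print(may_crawl(blocks_private_only, "http://example.org/resource/1"))
```

A dataset whose robots.txt looks like the first example would fall into the group that a polite crawler cannot retrieve.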
7. Categorization by Topical Domain
Used categorization from datahub.io for existing datasets.
Manually categorized remaining datasets.
Added new category Social Networking
Growth without new category Social Networking: 94 %
LODstats (http://stats.lod2.eu/) discovered similar number of datasets: 1048
8. 2. Adoption of the Linking Best Practices
Data publishers should set RDF links because:
1. Discoverability depends on being linked.
2. RDF links ease data integration.
9. Degrees
56% of all datasets set RDF links pointing to other datasets.
• The remaining 44% are either only the target of RDF links from other
datasets or are isolated.
Datasets with Top In- and Outdegrees:
Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
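As a rough sketch of how such degree figures can be derived, the snippet below counts, for each dataset, how many distinct other datasets it links to (outdegree) and is linked from (indegree). The dataset names and links are made-up toy data, and the function names are hypothetical; the actual analysis pipeline is not described on the slides.

```python
from collections import defaultdict


def dataset_degrees(datasets, links):
    """Compute (outdegree, indegree) per dataset from (source, target) pairs.

    Duplicate links and self-links are ignored, so each degree counts
    distinct other datasets.
    """
    targets = defaultdict(set)
    sources = defaultdict(set)
    for source, target in links:
        if source != target:
            targets[source].add(target)
            sources[target].add(source)
    outdegree = {d: len(targets[d]) for d in datasets}
    indegree = {d: len(sources[d]) for d in datasets}
    return outdegree, indegree


def share_setting_links(datasets, links):
    """Fraction of datasets that link to at least one other dataset."""
    outdegree, _ = dataset_degrees(datasets, links)
    return sum(1 for d in datasets if outdegree[d] > 0) / len(datasets)


# Toy data: A and B set links, C is only a link target, D is isolated.
toy_datasets = ["A", "B", "C", "D"]
toy_links = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "B")]

outdeg, indeg = dataset_degrees(toy_datasets, toy_links)
print(outdeg, indeg, share_setting_links(toy_datasets, toy_links))
```

In this toy example half of the datasets set links; C corresponds to the "only the target of RDF links" case and D to the "isolated" case.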
10. “Crawlable” LOD Cloud 2014
11. Degree Distributions
Dotted line: Social Networking (status.net, etc.)
Solid line: Cross-Domain datasets (DBpedia, etc.)
Largest Strongly Connected Component: 36% (377 datasets)
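For readers unfamiliar with the term, a strongly connected component is a set of datasets in which every dataset can reach every other one by following RDF links in their direction. The sketch below finds such components with Kosaraju's two-pass algorithm on a made-up toy graph; it is an illustration of the concept, not the code used for the paper.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Return the strongly connected components of a directed graph."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
        reverse[target].append(source)

    # Pass 1: iterative DFS, recording nodes in order of finishing time.
    visited = set()
    order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    assigned = set()
    components = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        todo = [start]
        while todo:
            node = todo.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Toy graph: A, B, C link in a cycle; D and E are only reachable one way.
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]

largest = max(strongly_connected_components(toy_nodes, toy_edges), key=len)
print(sorted(largest), len(largest) / len(toy_nodes))
```

Here the cycle A, B, C forms the largest component, covering three of the five toy datasets.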
12. Conclusion concerning Linking Best Practices
Some datasets put a lot of effort into linking.
Many datasets only link to a small number of other datasets
or do not set RDF links at all.
Similar situation as in 2011.
13. 3. Adoption of the Vocabulary Best Practices
Goal: Help applications understand the data by
1. Reusing terms from widely-used vocabularies.
2. Making definitions of proprietary terms
dereferencable.
3. Linking vocabulary terms to terms in other
vocabularies.
14. Widely-Used and Proprietary Vocabularies
Strong agreement on some vocabularies.
Proprietary vocabularies are used in
addition to common ones,
as data is often very specific
Widely-Used Vocabularies
Proprietary Vocabularies
15. Dereferencability of Term URIs and Vocabulary Linking
28% of the proprietary vocabularies provide dereferencable URIs.
21% set RDF links to other vocabularies (8% in 2011)
• Popular linking predicates: rdfs:range, rdfs:subClassOf
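One possible way to test whether a term URI is dereferenceable is to request it with an RDF-oriented Accept header and check whether an RDF media type comes back. The sketch below uses only Python's standard library; the chosen media types, the timeout, and the function names are assumptions, and the slides do not state which exact criterion the authors applied.

```python
import urllib.request

# Assumed set of RDF media types; a real check might accept more.
RDF_TYPES = {"text/turtle", "application/rdf+xml"}
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_request(term_uri: str) -> urllib.request.Request:
    """Build a GET request that asks the server for an RDF representation."""
    return urllib.request.Request(term_uri, headers={"Accept": RDF_ACCEPT})


def is_dereferenceable(term_uri: str, timeout: float = 10.0) -> bool:
    """Return True if the URI resolves (after redirects) to an RDF document."""
    try:
        with urllib.request.urlopen(build_request(term_uri), timeout=timeout) as response:
            content_type = response.headers.get_content_type()
    except (OSError, ValueError):
        # Covers HTTP errors, unreachable hosts, timeouts, and malformed URIs.
        return False
    return content_type in RDF_TYPES


# Hypothetical term URI; nothing is fetched until is_dereferenceable is called.
request = build_request("http://example.org/vocab#Term")
print(request.get_header("Accept"))
```

Calling is_dereferenceable on each term of a proprietary vocabulary would give a per-vocabulary figure comparable in spirit to the percentage above.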
16. 4. Adoption of the Metadata Best Practices
1. Publish machine-readable provenance information.
2. Publish machine-readable licensing information.
3. Publish metadata about alternative access methods
(SPARQL endpoints, RDF dumps)
17. Provenance and Licensing Metadata
37% of the datasets provide provenance information
• Dublin Core is used more than W3C Prov
10% provide machine-readable licensing information
• Most used predicates dc:license, cc:license
18. Dataset Level Metadata (VoID)
15% of the datasets publish VoID descriptions.
Via these descriptions, it is possible to discover SPARQL
endpoints and dumps for about 10% of the data sources.
19. Conclusion concerning Metadata Best Practices
Applications cannot rely on availability of metadata,
as only a small fraction of all data sources publishes such data.
The Government and Library domains are positive exceptions.
Similarly low numbers as in 2011.
20. “Full” LOD Cloud Diagram
570 datasets
374 datahub.io
196 our crawl
http://lod-cloud.net/
21. Growth of the “Full” LOD Cloud Diagram
2011: 295 datasets
2014: 570 datasets (+ 93 %)
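The growth figure follows directly from the two dataset counts. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def growth_percent(old: int, new: int) -> float:
    """Relative growth from `old` to `new`, in percent."""
    return (new - old) / old * 100


# Dataset counts of the "Full" LOD Cloud diagram in 2011 and 2014.
growth = growth_percent(295, 570)
print(round(growth))  # 93
```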
http://lod-cloud.net/
22. Comparison of Linked Data and Schema.org
Schema.org
1. does not expect data publishers to set data links.
2. relies on marking up data in HTML pages.
3. has strong application pull from Google, Microsoft, Yahoo!
23. Adoption
WebDataCommons, 2013*:
463,000 websites (PLDs) provide Microdata annotations.
Google, 2014**:
5 million websites provide Schema.org data.
Orders of magnitude more Schema.org data sources.
* WebDataCommons extracts Microdata, RDFa, Microformat data
from the CommonCrawl (2.2 billion HTML pages from 12.8 million PLDs).
** Guha in LDOW2014 Keynote
24. Schema.org Topical Focus
Different topics
compared to
Linked Data.
25. Class / Property Distribution
Microdata 2012
Only a small set of classes / properties is actually used.
Less variety compared to Linked Data.
26. Shallowness of the Schema.org Data
schema:Product schema:JobPosting
Product Names
• AppleMacBook Air MC968/A 11.6-Inch Laptop
• Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7
JobPostings
• More specific properties like skills are hardly used.
• 57% of all hiringOrganizations are strings not instances.
27. Conclusion
Linked Data | Schema.org
~ 1,000 sources | > 460,000 sources
covers wider range of specific topics (government, libraries, science) | topics focused on search engines (products, organizations)
contains more complex data structures | very simple and shallow data structures
partial ontology agreement | strong ontology agreement
identity resolution eased by RDF links | identity resolution often requires value parsing
28. Thank you.
References
Report
http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
Catalog
http://linkeddatacatalog.dws.informatik.uni-mannheim.de/
Acknowledgement
This work was supported by