Adoption of the Linked Data Best Practices in Different Topical Domains

Max Schmachtenberg
Christian Bizer
Heiko Paulheim
Schmachtenberg, Bizer, Paulheim: Adoption of the Linked Data Best Practices, 23.10.2014 Slide 1

The Linked Data Best Practices
Central idea of Linked Data: Ease data discovery and
integration by complying with a set of best practices.
1. Linking Best Practices
• Set RDF links pointing at instances in other data sources.
2. Vocabulary Best Practices
• Reuse terms from widely-used vocabularies.
• Make definitions of proprietary terms dereferenceable.
• Link vocabulary terms to terms in other vocabularies.
3. Metadata Best Practices
• Publish machine-readable provenance and licensing metadata.
• Publish metadata about alternative access methods (SPARQL, dumps).

State of the LOD Cloud Report - 2011
▪ http://lod-cloud.net/state/
▪ Based on information provided by dataset publishers via the datahub.io catalog

LOD Cloud - 2011
Consists of 295 datasets.

Outline
Goal: Update the State of the LOD Cloud report
and LOD Cloud itself to 2014.
1. Methodology
2. Adoption of the Linking Best Practices
3. Adoption of the Vocabulary Best Practices
4. Adoption of the Metadata Best Practices
5. Conclusions (in Relation to Schema.org)

1. Methodology: Crawl of the Linked Data Web
▪ Crawler: LDSpider, Crawl Date: April 2014
▪ Seeds: 560,000 seed URIs from
1. Example URIs in datahub.io catalog
2. URIs from BTC2012 dataset
3. URIs from datasets advertised on public-lod@w3.org mailing list
▪ Crawled Data Corpus
• 900,000 documents containing 8,038,000 resources
• 1014 datasets
• 77 datasets prevent crawling via robots.txt
• Figure: distribution by dataset (red line: documents, blue line: resources)
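
The robots.txt exclusion noted on this slide can be illustrated with Python's standard urllib.robotparser module. This is a minimal sketch and not the LDSpider implementation; the robots.txt contents, the crawler name, and the example.org URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a dataset that blocks all crawlers.
BLOCKING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that allows everything (empty Disallow rule).
PERMISSIVE_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "example-crawler") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked = is_crawlable(BLOCKING_ROBOTS_TXT, "http://example.org/resource/1")
allowed = is_crawlable(PERMISSIVE_ROBOTS_TXT, "http://example.org/resource/1")
```

A polite crawler would run such a check once per host before requesting any seed URI from it.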

Categorization by Topical Domain
▪ Used the categorization from datahub.io for existing datasets.
▪ Manually categorized the remaining datasets.
▪ Added the new category Social Networking.
▪ Growth without the new category Social Networking: 94%
▪ LODstats (http://stats.lod2.eu/) discovered a similar number of datasets: 1048

2. Adoption of the Linking Best Practices
Data publishers should set RDF links because:
1. Discoverability depends on being linked.
2. RDF links ease data integration.

Degrees
▪ 56% of all datasets set RDF links pointing to other datasets.
• The remaining 44% are either only the target of RDF links from other datasets or are isolated.
▪ Datasets with Top In- and Outdegrees:
▪ Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
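
In- and outdegrees at the dataset level could be computed as sketched below. The dataset names and links are hypothetical toy data, and counting each linked dataset once (regardless of how many individual RDF links exist) is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical dataset-level links: (source dataset, target dataset).
links = [
    ("dataset-a", "dataset-b"),
    ("dataset-a", "dataset-c"),
    ("dataset-b", "dataset-c"),
    ("dataset-a", "dataset-b"),  # repeated link, counted once
]
datasets = ["dataset-a", "dataset-b", "dataset-c", "dataset-d"]


def degrees(datasets, links):
    """Return (outdegree, indegree) dicts counting distinct linked datasets."""
    out_targets = defaultdict(set)
    in_sources = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore links that stay inside one dataset
            out_targets[source].add(target)
            in_sources[target].add(source)
    outdegree = {d: len(out_targets[d]) for d in datasets}
    indegree = {d: len(in_sources[d]) for d in datasets}
    return outdegree, indegree


outdegree, indegree = degrees(datasets, links)

# Share of datasets that set at least one link to another dataset.
linking_share = sum(1 for d in datasets if outdegree[d] > 0) / len(datasets)
```

In this toy graph, dataset-c is only a link target and dataset-d is isolated, which mirrors the two cases named for the non-linking datasets.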

“Crawlable” LOD Cloud 2014

Degree Distributions
▪ Dotted line: Social Networking (status.net, etc.)
▪ Solid line: Cross-Domain datasets (DBpedia, etc.)
▪ Largest Strongly Connected Component: 36% (377 datasets)
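
A strongly connected component is a set of datasets in which every dataset can reach every other one by following links. The sketch below finds the largest such component with Kosaraju's algorithm; the slides do not say which algorithm was used, and the node names and edges are hypothetical. Every edge endpoint is assumed to appear in the node list.

```python
def largest_scc(nodes, edges):
    """Return the largest strongly connected component (Kosaraju's algorithm)."""
    forward = {n: [] for n in nodes}
    backward = {n: [] for n in nodes}
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next((n for n in neighbours if n not in visited), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                visited.add(nxt)
                stack.append((nxt, iter(forward[nxt])))

    # Pass 2: walk the reversed graph in decreasing completion order.
    assigned, best = set(), set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = {start}, [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical link graph: a, b and c link in a cycle; d and e are attached.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]

component = largest_scc(nodes, edges)
share = len(component) / len(nodes)
```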

Conclusion concerning Linking Best Practices
▪ Some datasets put a lot of effort into linking.
▪ Many datasets only link to a small number of other datasets or do not set RDF links at all.
▪ Situation similar to 2011.

3. Adoption of the Vocabulary Best Practices
Goal: Help applications understand the data by
1. Reusing terms from widely-used vocabularies.
2. Making definitions of proprietary terms dereferenceable.
3. Linking vocabulary terms to terms in other vocabularies.

Widely-Used and Proprietary Vocabularies
▪ Strong agreement on some vocabularies.
▪ Proprietary vocabularies are used in addition to common ones, as data is often very specific.
▪ Figures: Widely-Used Vocabularies; Proprietary Vocabularies
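
One way to count how many datasets use a vocabulary is to map every term URI to its namespace and tally namespaces per dataset. The sketch below assumes the namespace ends at the last '#' or, failing that, the last '/'; this heuristic and the example.org vocabulary are assumptions, not necessarily the method behind the slides.

```python
from collections import Counter


def namespace(term_uri: str) -> str:
    """Cut a term URI after the last '#' if present, otherwise after the last '/'."""
    cut = term_uri.rfind("#")
    if cut == -1:
        cut = term_uri.rfind("/")
    return term_uri[: cut + 1]


# Term URIs used by two hypothetical datasets.
terms_by_dataset = {
    "dataset-a": [
        "http://xmlns.com/foaf/0.1/name",
        "http://xmlns.com/foaf/0.1/knows",
    ],
    "dataset-b": [
        "http://xmlns.com/foaf/0.1/Person",
        "http://example.org/vocab#localTerm",
    ],
}

# Count in how many datasets each vocabulary namespace appears.
usage = Counter()
for terms in terms_by_dataset.values():
    usage.update({namespace(t) for t in terms})
```

A namespace used by many datasets (here FOAF) would count as widely used, while one used by a single dataset (here the example.org vocabulary) would be a candidate for a proprietary vocabulary.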

Dereferenceability of Term URIs and Vocabulary Linking
▪ 28% of the proprietary vocabularies provide dereferenceable URIs.
▪ 21% set RDF links to other vocabularies (8% in 2011).
• Popular linking predicates: rdfs:range, rdfs:subClassOf
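
A term URI is commonly treated as dereferenceable when an HTTP request for it returns an RDF description. The sketch below shows one possible check using content negotiation; the slides do not describe the exact test that was applied, so the accepted media types, the timeout, and the function names are assumptions.

```python
import urllib.request

# RDF media types requested via content negotiation (an assumed selection).
RDF_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
)


def build_request(term_uri: str) -> urllib.request.Request:
    """Build a GET request that asks the server for an RDF representation."""
    return urllib.request.Request(term_uri, headers={"Accept": ", ".join(RDF_TYPES)})


def looks_like_rdf(content_type: str) -> bool:
    """Return True if a Content-Type header value names an RDF media type."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in RDF_TYPES


def is_dereferenceable(term_uri: str, timeout: float = 10.0) -> bool:
    """Request the term URI and report whether an RDF document came back."""
    try:
        with urllib.request.urlopen(build_request(term_uri), timeout=timeout) as resp:
            return looks_like_rdf(resp.headers.get("Content-Type", ""))
    except (OSError, ValueError):
        # Network errors, HTTP error statuses and malformed URIs all count as failures.
        return False


request = build_request("http://example.org/vocab#Term")
```

Calling is_dereferenceable requires network access; the two helper functions can be exercised offline.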

4. Adoption of the Metadata Best Practices
1. Publish machine-readable provenance information.
2. Publish machine-readable licensing information.
3. Publish metadata about alternative access methods (SPARQL endpoints, RDF dumps).

Provenance and Licensing Metadata
▪ 37% of the datasets provide provenance information
• Dublin Core is used more than W3C Prov
▪ 10% provide machine-readable licensing information
• Most used predicates: dc:license, cc:license

Dataset Level Metadata (VoID)
▪ 15% of the datasets publish VoID descriptions.
▪ Via these descriptions, it is possible to discover SPARQL endpoints and dumps for about 10% of the data sources.
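
Discovery of access methods from a VoID description can be sketched as a scan for the void:sparqlEndpoint and void:dataDump properties. The triples below are hypothetical and are represented as plain Python tuples, which is an assumption of this sketch; a real application would first parse the RDF with a suitable library.

```python
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# VoID properties that advertise alternative access methods.
ACCESS_PREDICATES = {
    VOID + "sparqlEndpoint": "sparql_endpoints",
    VOID + "dataDump": "dumps",
}


def access_methods(triples):
    """Collect SPARQL endpoint and dump URLs from (subject, predicate, object) triples."""
    found = {"sparql_endpoints": [], "dumps": []}
    for _subject, predicate, obj in triples:
        key = ACCESS_PREDICATES.get(predicate)
        if key is not None:
            found[key].append(obj)
    return found


# Hypothetical VoID description of one dataset.
void_description = [
    ("http://example.org/dataset", RDF_TYPE, VOID + "Dataset"),
    ("http://example.org/dataset", VOID + "sparqlEndpoint", "http://example.org/sparql"),
    ("http://example.org/dataset", VOID + "dataDump", "http://example.org/dump.nt.gz"),
]

methods = access_methods(void_description)
```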

Conclusion concerning Metadata Best Practices
▪ Applications cannot rely on the availability of metadata, as only a small fraction of all data sources publishes such data.
▪ The Government and Library domains are positive exceptions.
▪ Similarly low numbers as in 2011.

“Full” LOD Cloud Diagram
570 datasets
 374 datahub.io
 196 our crawl
http://lod-cloud.net/

Growth of the “Full” LOD Cloud Diagram
 2011: 295 datasets
 2014: 570 datasets (+ 93 %)
http://lod-cloud.net/
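
The growth figure follows directly from the two dataset counts on this slide. A minimal sketch of the arithmetic (the function name is only illustrative):

```python
def growth_percent(old: int, new: int) -> float:
    """Relative growth from `old` to `new`, in percent."""
    return (new - old) / old * 100


# 295 datasets in 2011, 570 datasets in 2014.
growth = growth_percent(295, 570)
rounded = round(growth)
```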

Comparison of Linked Data and Schema.org
Schema.org
1. does not expect data publishers to set data links.
2. relies on marking up data in HTML pages.
3. has strong application pull from Google, Microsoft, and Yahoo!

Adoption
WebDataCommons, 2013*:
463,000 websites (PLDs) provide Microdata annotations.
Google, 2014**:
5 million websites provide Schema.org data.
▪ Orders of magnitude more Schema.org data sources.
* WebDataCommons extracts Microdata, RDFa, and Microformat data from the CommonCrawl (2.2 billion HTML pages from 12.8 million PLDs).
** Guha in LDOW2014 Keynote

Schema.org Topical Focus
Different topics compared to Linked Data.

Class / Property Distribution
Microdata 2012
▪ Only a small set of classes / properties is actually used.
▪ Less variety compared to Linked Data.

Shallowness of the Schema.org Data
schema:Product: Product Names
• AppleMacBook Air MC968/A 11.6-Inch Laptop
• Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7
schema:JobPosting: JobPostings
• More specific properties like skills are hardly used.
• 57% of all hiringOrganizations are strings, not instances.
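
The distinction between string values and instances can be made concrete with a small sketch. It assumes that extracted Schema.org items are available as Python dictionaries in which a nested dictionary stands for an instance and a plain string for a literal; this representation and the sample job postings are hypothetical.

```python
def string_share(items, prop):
    """Share of `prop` values that are plain strings rather than nested items."""
    values = [item[prop] for item in items if prop in item]
    if not values:
        return 0.0
    strings = sum(1 for value in values if isinstance(value, str))
    return strings / len(values)


# Hypothetical extracted schema:JobPosting items.
job_postings = [
    {"@type": "JobPosting", "title": "Data Engineer",
     "hiringOrganization": "Example Corp"},
    {"@type": "JobPosting", "title": "Editor",
     "hiringOrganization": {"@type": "Organization", "name": "Example Press"}},
    {"@type": "JobPosting", "title": "Analyst"},  # property not provided
]

share = string_share(job_postings, "hiringOrganization")
```

A string such as "Example Corp" has to be parsed and matched before it can be resolved to an organization, whereas a nested item already carries its own type and properties.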

Conclusion
Linked Data
• ~ 1,000 sources
• covers wider range of specific topics (government, libraries, science)
• contains more complex data structures
• partial ontology agreement
• identity resolution eased by RDF links
Schema.org
• > 460,000 sources
• topics focused on search engines (products, organizations)
• very simple and shallow data structures
• strong ontology agreement
• identity resolution often requires value parsing

Thank you.
References
▪ Report: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
▪ Catalog: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/
Acknowledgement
▪ This work was supported by
