Linked Open Data Fundamentals for Libraries, Archives and Museums
1. Linked Open Data Fundamentals
For Libraries, Archives & Museums
Trevor Thornton
Senior Applications Developer, NYPL Labs
New York Public Library
2. Workshop Topics
• What Linked Open Data is
• Potential benefits of Linked Open Data for
libraries, archives and museums
• Overview of technical concepts
• Licenses for open data (legal issues)
• Tour of relevant Linked Open Data sources
(element sets, controlled vocabularies, published
data sets)
• General considerations for implementation
3. Linked Open Data (LOD)
Data
For libraries, archives and museums, this includes any type of digital
information that describes resources or aids in their discovery (metadata).
It also includes data produced through original research (scientific/statistical
data, geospatial data, etc.)
Linked Data
Data published on the Web in accordance with principles designed to
facilitate linkages between resources
Linked Open Data
Linked data that is freely usable, reusable, and redistributable — subject, at
most, to attribution and ‘share alike’ requirements
4. The value of our data
• Our data is a crucial tool in serving our
missions to collect, preserve and provide
access to resources
• We are dedicated to standards of quality and
accuracy in the data we create
• The creation and management of data
represents a significant investment on the part
of cultural heritage institutions
5. Benefits of Linked Open Data
• Puts information on the web, where people are
looking for it
• People can use your data in new ways, opening
opportunities for scholarship and innovation
• Expands discoverability of your collections
• Allows for continuous improvement of
your data by linking it to a growing pool
of other data
6. The emerging data commons
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
7. A very brief
history of
linked data
Starring
Tim Berners-Lee
Photo: Paul Clarke
8. 1990 (more or less)
Tim Berners-Lee invents the World Wide Web to
publish hypertext documents on the Internet.
It includes 3 essential technologies:
URI (Uniform Resource Identifier)
HTTP (Hypertext Transfer Protocol)
HTML (Hypertext Markup Language)
9. 2001
Tim Berners-Lee proposes ‘The Semantic Web’
in an article in Scientific American
“The Semantic Web is not a separate Web but an extension of the
current one, in which information is given well-defined
meaning, better enabling computers and people to work in
cooperation…
In the near future, these developments will usher in significant new
functionality as machines become much better able to process and
‘understand’ the data that they merely display at present.”
10. 2006
In a document discussing design issues for the
Semantic Web, Berners-Lee introduces linked
data as a crucial component:
“The Semantic Web isn't just about putting data on the web. It is
about making links, so that a person or machine can explore the
web of data. With linked data, when you have some of it, you
can find other, related, data.”
He outlines 4 basic principles…
11. The Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up
those names.
3. When someone looks up a URI, provide
useful information, using the standards
(RDF, SPARQL).
4. Include links to other URIs so that they can
discover more things.
13. URI
(Uniform Resource Identifier)
Globally unique identifier for a resource on a
computer or a network.
HTTP URIs identify resources on the Web.
http://www.yourdomain.org/something
14. URI vs. URL
URLs (Uniform Resource Locators) are a subset
of URIs that, in addition to identifying a
resource, provide a means of locating it.
A URI does not necessarily point to a document.
A URL does.
A URI can identify a real-world object.
15. HTTP
(Hypertext Transfer Protocol)
The foundation of data communication for the Web
Client/User agent (e.g. web browser) → HTTP request → Web server
Web server → HTTP response → Client/User agent
16. RDF
Resource Description Framework
A framework for describing Web resources.
A Web resource is anything that can be retrieved
or identified on the WWW via a URI.
RDF descriptions are based on simple
subject-predicate-object
expressions called “triples”.
17. The RDF Triple
subject → predicate → object
Subject - the resource being described
Predicate - a property of that resource
Object - the value of the property
Subject and predicate are defined using URIs.
Object can either be a URI or a ‘literal’ (text, number, date, etc.)
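The structure of a triple can be sketched as a small Python data type. This is only an illustration: the class names `URI`, `Literal` and `Triple` and the `example.org` resource are assumptions made for this sketch, not the API of any RDF library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    """A resource identified by a URI."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as text, a number or a date."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI                  # the resource being described
    predicate: URI                # a property of that resource
    object: Union[URI, Literal]   # the value of the property


# Hypothetical example resource, for illustration only.
t = Triple(
    subject=URI("http://example.org/book/1"),
    predicate=URI("http://purl.org/dc/terms/title"),
    object=Literal("Ulysses"),
)
print(t.subject.value, t.predicate.value, t.object.value)
```

Keeping `URI` and `Literal` as separate types makes the rule explicit: only the object position may hold a literal.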
19. A basic triple
Subject: http://www.worldcat.org/oclc/746309573
Predicate: creator (http://purl.org/dc/terms/creator)
Object: James Joyce (http://viaf.org/viaf/44300643)
20. Another basic triple
Subject: http://www.worldcat.org/oclc/746309573
Predicate: subject (http://purl.org/dc/terms/subject)
Object: Dublin, Ireland (http://dbpedia.org/resource/Dublin)
21. One more basic triple
Subject: http://www.worldcat.org/oclc/746309573
Predicate: date created (http://purl.org/dc/terms/created)
Object: 1918/1922 (a literal)
22. RDF data as a graph
Subject: http://www.worldcat.org/oclc/746309573
• creator (http://purl.org/dc/terms/creator) → James Joyce (http://viaf.org/viaf/44300643)
• subject (http://purl.org/dc/terms/subject) → Dublin, Ireland (http://dbpedia.org/resource/Dublin)
• date created (http://purl.org/dc/terms/created) → 1918/1922
23. RDF serialization formats
‘Serialization’ = to record one or more
RDF graphs in a machine-readable file.
There are 2 basic options:
RDF in a standalone text file:
• RDF/XML
• N3 (Notation 3)
• Turtle (Terse RDF Triple Language)
• N-Triples
RDF embedded in HTML
• RDFa (RDF in attributes)
24. Basic triples in N-Triples
<http://www.worldcat.org/oclc/746309573> <http://purl.org/dc/terms/creator> <http://viaf.org/viaf/44300643> .
<http://www.worldcat.org/oclc/746309573> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Dublin> .
<http://www.worldcat.org/oclc/746309573> <http://purl.org/dc/terms/created> "1918/1922" .
N-Triples is the most basic expression of RDF.
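Because N-Triples is so simple, writing it out takes only a few lines of code. The following is a minimal sketch rather than a complete serializer: the function name `to_ntriples` is made up for this example, and it assumes that any object starting with "http://" or "https://" is a URI and everything else is a plain literal.

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines.

    Subjects and predicates are URI strings. An object is treated as a URI
    if it starts with "http://" or "https://" (a simplification made for
    this sketch); anything else is written as a quoted literal.
    """
    lines = []
    for s, p, o in triples:
        if o.startswith(("http://", "https://")):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


RECORD = "http://www.worldcat.org/oclc/746309573"
DCTERMS = "http://purl.org/dc/terms/"

triples = [
    (RECORD, DCTERMS + "creator", "http://viaf.org/viaf/44300643"),
    (RECORD, DCTERMS + "subject", "http://dbpedia.org/resource/Dublin"),
    (RECORD, DCTERMS + "created", "1918/1922"),
]
print(to_ntriples(triples))
```

Each statement ends up on its own line, with URIs in angle brackets and the literal in double quotes.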
25. Basic triples in N3/Turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://www.worldcat.org/oclc/746309573>
    dcterms:creator <http://viaf.org/viaf/44300643> ;
    dcterms:subject <http://dbpedia.org/resource/Dublin> ;
    dcterms:created "1918/1922" .
Statements about the same resource are
grouped together.
Property URIs are shortened using prefixes.
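The two ideas on this slide, grouping by subject and shortening with prefixes, can be sketched in Python. The helper names `shorten` and `to_turtle` are hypothetical, and the sketch assumes each object is already written as a Turtle term (a URI in angle brackets or a quoted literal); it covers only a small part of the Turtle syntax.

```python
from collections import defaultdict


def shorten(uri, prefixes):
    """Return prefix:local if a prefix namespace matches, else <uri>."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"


def to_turtle(triples, prefixes):
    """Group (subject, predicate, object_term) tuples by subject.

    object_term must already be a Turtle term: "<uri>" or a quoted literal.
    """
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))

    out = [f"@prefix {name}: <{ns}> ." for name, ns in prefixes.items()]
    for s, pairs in grouped.items():
        body = " ;\n".join(f"    {shorten(p, prefixes)} {o}" for p, o in pairs)
        out.append(f"<{s}>\n{body} .")
    return "\n".join(out)


RECORD = "http://www.worldcat.org/oclc/746309573"
PREFIXES = {"dcterms": "http://purl.org/dc/terms/"}

triples = [
    (RECORD, "http://purl.org/dc/terms/creator", "<http://viaf.org/viaf/44300643>"),
    (RECORD, "http://purl.org/dc/terms/subject", "<http://dbpedia.org/resource/Dublin>"),
    (RECORD, "http://purl.org/dc/terms/created", '"1918/1922"'),
]
result = to_turtle(triples, PREFIXES)
print(result)
```

The subject URI is written once, and each property appears in its short `dcterms:` form.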
27. RDFa (RDF in Attributes)
RDFa allows RDF data to be embedded
within HTML content.
Rendered HTML:
Ulysses is a novel by the Irish author James Joyce.
HTML code:
<div about="http://www.worldcat.org/oclc/746309573"
prefix="dcterms: http://purl.org/dc/terms/">
Ulysses is a novel by the Irish author
<span property="dcterms:creator"
resource="http://viaf.org/viaf/44300643">James Joyce</span>
</div>
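To show how a machine might read this markup, here is a rough sketch using Python's standard `html.parser` module. The class name `SimpleRDFaParser` is made up for this example, and it handles only the `about`, `prefix`, `property` and `resource` attributes as used on this slide; real RDFa processing has many more rules.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Collect triples from 'about', 'property' and 'resource' attributes.

    A simplification: assumes every start tag has a matching end tag.
    """

    def __init__(self):
        super().__init__()
        self.subjects = []   # stack: the current subject for each open element
        self.prefixes = {}
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "prefix" in attrs:
            # "dcterms: http://purl.org/dc/terms/" -> {"dcterms": "http://..."}
            parts = attrs["prefix"].split()
            for name, ns in zip(parts[0::2], parts[1::2]):
                self.prefixes[name.rstrip(":")] = ns
        inherited = self.subjects[-1] if self.subjects else None
        subject = attrs.get("about", inherited)
        if subject and "property" in attrs and "resource" in attrs:
            prefix, sep, local = attrs["property"].partition(":")
            namespace = self.prefixes.get(prefix)
            if sep and namespace:
                self.triples.append((subject, namespace + local, attrs["resource"]))
        self.subjects.append(subject)

    def handle_endtag(self, tag):
        if self.subjects:
            self.subjects.pop()


HTML = """
<div about="http://www.worldcat.org/oclc/746309573"
     prefix="dcterms: http://purl.org/dc/terms/">
  Ulysses is a novel by the Irish author
  <span property="dcterms:creator"
        resource="http://viaf.org/viaf/44300643">James Joyce</span>
</div>
"""

parser = SimpleRDFaParser()
parser.feed(HTML)
print(parser.triples)
```

The single triple recovered is the same creator statement shown earlier, now carried inside a human-readable page.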
28. RDF Ontologies
Ontologies/vocabularies define categories of things
and the relationships that they can have
to each other.
Ontologies provide the semantics that allow data
to be interpreted by machines.
Rules of inference – what can be assumed to be
true based on what is asserted by a triple.
29. RDFS (RDF Schema)
A basic vocabulary for ontology development.
RDFS defines RDF classes and properties.
Class – a category of resources; a resource in
such a category is said to be an instance of the
class
Property – a relation between a subject
resource and an object resource in a triple.
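A rule of inference can be illustrated with the RDFS subclass relationship: if a resource is an instance of a class, it is also an instance of every superclass. The sketch below applies only that one rule. The `rdf:type` and `rdfs:subClassOf` URIs are the standard ones; the `example.org` classes and the function name `infer_types` are assumptions for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace


def infer_types(triples):
    """Apply one RDFS rule until nothing new is produced:
    if X is an instance of C, and C is a subclass of D,
    then X is also an instance of D.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in known if p == RDFS_SUBCLASS]
        typed = [(s, o) for s, p, o in known if p == RDF_TYPE]
        for resource, cls in typed:
            for child, parent in subclass:
                new = (resource, RDF_TYPE, parent)
                if cls == child and new not in known:
                    known.add(new)
                    changed = True
    return known


asserted = [
    (EX + "Novel", RDFS_SUBCLASS, EX + "Book"),
    (EX + "Book", RDFS_SUBCLASS, EX + "CreativeWork"),
    (EX + "ulysses", RDF_TYPE, EX + "Novel"),
]
inferred = infer_types(asserted) - set(asserted)
for triple in sorted(inferred):
    print(triple)
```

Only one type was asserted for the resource, but two more follow from the class definitions. An inference engine does this kind of work at much larger scale.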
30. OWL
(Web Ontology Language)
Provides an extended set of properties used in
ontology/vocabulary definitions
(used in conjunction with RDFS)
• Equivalence/disjunction
• Advanced property definitions
• Restrictions and Cardinality
31. SKOS
(Simple Knowledge Organization System)
Set of vocabularies created to support the use of
thesauri, classification schemes, subject heading
systems and taxonomies in RDF
• Concept schemes
(names, topics, geographic terms, etc.)
• Preferred/alternate labels
• Broader/narrower concepts
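As a rough illustration, the labels and broader/narrower links can be stored as triples and followed in code. The SKOS property URIs below are the standard ones; the concept URIs under `example.org` and the helper names are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/concepts/"  # hypothetical concept scheme

triples = [
    (EX + "dublin", SKOS + "prefLabel", "Dublin (Ireland)"),
    (EX + "dublin", SKOS + "altLabel", "Baile Átha Cliath"),
    (EX + "dublin", SKOS + "broader", EX + "ireland"),
    (EX + "ireland", SKOS + "prefLabel", "Ireland"),
    (EX + "ireland", SKOS + "broader", EX + "europe"),
    (EX + "europe", SKOS + "prefLabel", "Europe"),
]


def pref_label(concept):
    """Return the preferred label of a concept, or None if it has none."""
    for s, p, o in triples:
        if s == concept and p == SKOS + "prefLabel":
            return o
    return None


def broader_chain(concept):
    """Follow skos:broader links upward and return the labels found."""
    chain = []
    seen = {concept}
    current = concept
    while True:
        parents = [o for s, p, o in triples
                   if s == current and p == SKOS + "broader" and o not in seen]
        if not parents:
            return chain
        current = parents[0]   # this sketch follows only the first parent
        seen.add(current)
        chain.append(pref_label(current))


print(broader_chain(EX + "dublin"))
```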
32. Triplestore
A database for storing RDF data.
Often a triplestore is part of a suite of
applications that might include:
• Triplestore
• Inference engine – provides the ‘intelligence’
required to interpret data based on RDFS/OWL
ontologies
• Query engine – supports access to data based on
user-supplied queries
33. SPARQL
(SPARQL Protocol and RDF Query Language)
• The primary query language for RDF data
(analogous to SQL for relational databases)
• SPARQL endpoint – Web service that provides
direct access to RDF datastores via SPARQL
queries
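The core idea behind a SPARQL query is pattern matching: a triple with some positions replaced by variables. The Python sketch below imitates that for a single pattern over an in-memory list. It is not SPARQL and not a query engine; the function name `match` is an assumption for this example.

```python
def match(pattern, triples):
    """Return variable bindings for one triple pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break          # same variable bound to two different values
                binding[term] = value
            elif term != value:
                break              # fixed term does not match
        else:
            results.append(binding)
    return results


RECORD = "http://www.worldcat.org/oclc/746309573"
DCTERMS = "http://purl.org/dc/terms/"

triples = [
    (RECORD, DCTERMS + "creator", "http://viaf.org/viaf/44300643"),
    (RECORD, DCTERMS + "subject", "http://dbpedia.org/resource/Dublin"),
]

# Roughly what this SPARQL query asks:
#   SELECT ?work WHERE {
#     ?work <http://purl.org/dc/terms/creator> <http://viaf.org/viaf/44300643>
#   }
works = match(("?work", DCTERMS + "creator", "http://viaf.org/viaf/44300643"), triples)
print(works)
```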
34. Publishing Linked Data
Establish URIs for your resources
• Within a domain that you control (yourlibrary.org)
• Consult with your IT staff on strategies for
formulating URIs, for example:
Subdomain (data.yourlibrary.org/something)
Reserve a path within your domain,
(yourdomain.org/data/something)
35. Publishing Linked Data
Decide what happens when users (human or
machine) try to access your URIs via the Web
1. Nothing (Not recommended)
2. Something – User is provided with information about the resource
• URI directs to an RDF file: good for machines, not for humans
• URI directs to an HTML representation of the resource: good for humans, useless for machines (not recommended)
• URI directs to an HTML representation of the resource with RDFa embedded: good for humans, OK for machines
• URI directs to either an RDF file or an HTML representation, based on what the user prefers (content negotiation)
36. HTTP Content Negotiation
Client/User agent (e.g. web browser) → HTTP request → Web server
Web server → HTTP response → Client/User agent
HTTP Request
• Resource URI (+ method)
• Headers (information about the requestor)
• Message body (optional)
HTTP Response
• Status code
• Headers (information about the response)
• Message body (optional)
37. HTTP ‘Accept’ Header
Part of the HTTP request that specifies what
types of data the client can accept
• Web browsers
HTML, JPEG, GIF, text, or other formats that browser can
display – unsupported formats are either displayed as text or
prompt user to download file
• Semantic web applications
RDF/XML, N3, Turtle, or other RDF serialization
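A server can use the Accept header to decide which representation to return. The following sketch parses the header, including the optional `q` preference weights, and picks the first type the server can supply. The function names, the list of available formats and the example header values are assumptions for illustration; wildcard handling is reduced to `*/*` only.

```python
def parse_accept(header):
    """Return media types from an Accept header, most preferred first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:            # q=0 means "not acceptable"
            entries.append((quality, media_type))
    # sorted() is stable, so equal-quality types keep their header order
    return [m for q, m in sorted(entries, key=lambda e: -e[0])]


def choose(header, available):
    """Pick the first acceptable media type that the server can provide."""
    for media_type in parse_accept(header):
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
    return None


AVAILABLE = ["text/html", "text/turtle", "application/rdf+xml"]
BROWSER = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
RDF_CLIENT = "text/turtle, application/rdf+xml;q=0.8"

print(choose(BROWSER, AVAILABLE))
print(choose(RDF_CLIENT, AVAILABLE))
```

With these inputs the browser is matched to the HTML representation and the RDF-aware client to Turtle.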
38. HTTP Status Codes
Part of the HTTP response that classifies the
nature of the response
1xx : Informational
2xx : Success
Example: 200 OK
3xx : Redirection
Examples: 301 Moved Permanently, 303 See Other
Response will include ‘Location’ header with URI for new resource
4xx : Error
Example: 404 Not Found
42. HTTP Content Negotiation
via 303 Redirect
Web browser ↔ Web server (running some kind of content negotiation service)
1. HTTP request (browser to server)
URI: http://example.org/something
Accept: HTML, JPEG, GIF, etc.
2. HTTP response (server to browser)
Status: 303 See Other
Location: http://example.org/something.html
3. HTTP request (browser to server)
URI: http://example.org/something.html
Accept: HTML, JPEG, GIF, etc.
4. HTTP response (server to browser)
Status: 200 OK
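The exchange above can be mimicked with a small Python function that stands in for the web server. This is a sketch under stated assumptions: the extension mapping, the function name `respond` and the `406 Not Acceptable` fallback are choices made for this example, and `q` weights in the Accept header are ignored for brevity.

```python
# Hypothetical mapping from media type to file extension.
REPRESENTATIONS = {
    "text/html": ".html",
    "text/turtle": ".ttl",
    "application/rdf+xml": ".rdf",
}
BASE = "http://example.org"


def respond(path, accept):
    """Return (status, headers) for a request, mimicking the flow above."""
    for extension in REPRESENTATIONS.values():
        if path.endswith(extension):
            # A concrete document was requested: serve it directly.
            return "200 OK", {}
    # A resource URI was requested: redirect to a matching document.
    for media_type in [m.split(";")[0].strip() for m in accept.split(",")]:
        if media_type in REPRESENTATIONS:
            location = BASE + path + REPRESENTATIONS[media_type]
            return "303 See Other", {"Location": location}
    return "406 Not Acceptable", {}


browser_accept = "text/html,image/jpeg,image/gif"
first = respond("/something", browser_accept)
second = respond("/something.html", browser_accept)
print(first)
print(second)
print(respond("/something", "text/turtle"))
```

The same resource URI leads a browser to the HTML page and an RDF client to the Turtle file, which is the point of content negotiation.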
43. Trust
The rapid growth of the Web is attributable in
large part to the fact that it allows anyone to say
anything about anything (provable
facts, subjective opinions, blatant lies and
everything in between)
This is also true of the linked data web.
Libraries, archives and museums are expected
to provide ‘factual’, objective data and depend
on trusted sources.
44. Linked data attribution
A growing concern in the linked data community is
the need to include attribution with data in order to
determine whether or not it can/should be trusted.
• RDF reification – allows source attribution to be associated
with an RDF triple
• Named graphs – Extension of RDF that allows attribution and
other metadata to be associated with RDF descriptions
• Quad stores – Similar to triplestores but with an additional
element that connects the triple with its source
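The quad idea can be sketched by adding a fourth element to each statement that names its source. The graph URIs under `example.org`, the `TRUSTED` set and the function names are hypothetical; they only illustrate how a source can be used to filter data.

```python
from collections import defaultdict

# Each quad is (subject, predicate, object, graph). The graph URIs are
# hypothetical names for two sources.
quads = [
    ("http://www.worldcat.org/oclc/746309573",
     "http://purl.org/dc/terms/creator",
     "http://viaf.org/viaf/44300643",
     "http://example.org/graphs/library-catalog"),
    ("http://www.worldcat.org/oclc/746309573",
     "http://purl.org/dc/terms/subject",
     "http://dbpedia.org/resource/Dublin",
     "http://example.org/graphs/crowd-sourced"),
]

TRUSTED = {"http://example.org/graphs/library-catalog"}


def by_source(quads):
    """Group triples by the graph (source) they came from."""
    graphs = defaultdict(list)
    for s, p, o, g in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)


def trusted_triples(quads, trusted):
    """Keep only the triples whose source is in the trusted set."""
    return [(s, p, o) for s, p, o, g in quads if g in trusted]


print(sorted(by_source(quads)))
print(trusted_triples(quads, TRUSTED))
```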
46. Linked Open Data
Data
For libraries, archives and museums, this includes any type of digital
information that describes resources or aids in their discovery (metadata).
Also includes data produced through original research (scientific/statistical
data, geospatial data, etc.)
Linked Data
Data published on the Web in accordance with principles designed to
facilitate linkages between resources
Linked Open Data
Linked data that is freely usable, reusable, and redistributable — subject, at
most, to attribution and ‘share alike’ requirements
47. Open data licensing
Licensing your data is not the same as licensing
your assets. Typically, the permitted uses of data are
much less restricted.
You can often provide free, open use of your
data even if use of your assets is
completely restricted.
TALK TO YOUR LEGAL DEPARTMENT FIRST.
48. Open data licensing
Creative Commons (CC) is a nonprofit organization that enables the
sharing and use of creativity and knowledge
through free legal tools.
CC provides an alternative to standard
“all rights reserved” copyright.
49. Creative Commons Licenses
Three-Layer Design:
LEGAL CODE
The actual license as a legal
document (accessible on the Web)
COMMONS DEED
The human-readable version
of the license
MACHINE-READABLE CODE
Allows license info to be
expressed in RDF
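As a small illustration of the machine-readable layer, a license can be attached to a dataset with a single triple. The sketch uses the `license` property from the Creative Commons namespace and the CC0 URI; the dataset URI and the function name `license_statement` are hypothetical.

```python
CC_LICENSE = "http://creativecommons.org/ns#license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def license_statement(dataset_uri, license_uri):
    """Return one N-Triples line stating the license of a dataset."""
    return f"<{dataset_uri}> <{CC_LICENSE}> <{license_uri}> ."


# Hypothetical dataset URI, for illustration only.
line = license_statement("http://data.yourlibrary.org/catalog", CC0)
print(line)
```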
50. Creative Commons Licenses
CC licenses allow creators to specify a
combination of 4 restrictions on use
• Attribution: Any use must give credit to the creator
• Non-Commercial: Only non-commercial uses are permitted
• Share Alike: Any use must be made available under the same terms as the original
• No Derivative Works: The original may only be used in whole and unchanged
Licenses specify that any restrictions may be waived with
permission of the rights holder.
51. Creative Commons Licenses
OPEN DATA (:
Attribution (CC BY)
Allows distribution and reuse in any way as long as you get credit
Attribution-ShareAlike (CC BY-SA)
Allows distribution and reuse in any way as long as you get credit and
derivative works are released under the same license
Attribution-NoDerivs (CC BY-ND)
Requires that the original is used unchanged and in whole, with credit to you
NOT OPEN DATA ):
Attribution-NonCommercial (CC BY-NC)
Allows distribution and reuse in any way, for non-commercial purposes only, as long as
you get credit
Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)
Allows distribution and reuse in any way, for non-commercial purposes only, as long as
you get credit and derivative works are released under the same license
Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)
Only permits use as-is, for non-commercial purposes, and with credit to you – the most
restrictive CC license available
52. CC0 (‘CC Zero’)
Allows creators to waive all rights to work
and to place it as completely as possible
into the public domain.
• Laws vary from jurisdiction to jurisdiction as to what rights are
automatically granted and how and when they expire or may
be voluntarily relinquished
• Ambiguity with regard to rights can limit creative re-use
• CC0 is designed to make it as clear as is legally possible that
any use of your content is allowed
• Quickly becoming the preferred license for open data
AGAIN, TALK TO YOUR LEGAL DEPARTMENT FIRST!
55. Bibliographic Ontology
bibliontology.com
An extensive vocabulary of terms for describing
bibliographic resources
56. FOAF (Friend of a Friend)
foaf-project.org
Provides a vocabulary for describing people and their
relationships to each other and the things they create
57. LC Linked Data Service
id.loc.gov
Library of Congress authorities as linked data (Name Authority
File, Subject Headings, Thesaurus of Graphic Materials, etc.)
58. Virtual International Authority File
viaf.org
Links names from multiple authority files to create cluster
records representing the entities identified
59. GeoNames
geonames.org
Aggregates geographic data from a wide variety of sources
and makes it available as LOD
60. New York Times
data.nytimes.com
150 years of subjects from New York Times articles –
data source for Times Topics pages
62. DBpedia
dbpedia.org
Crowd-sourced community effort to extract structured
information from Wikipedia and to make it available on the Web
63. Freebase
freebase.com
A large collaborative knowledge base consisting of metadata
composed mainly by its community members (owned by Google)
64. Google Knowledge Graph
Google uses data from Freebase and other sources
to provide related information based on search queries
65. Schema.org
A set of vocabularies developed by Google, Bing (Microsoft)
and Yahoo! for adding semantic data to web pages
66. OCLC WorldCat
oclc.org/worldcat
Earlier this year, OCLC added linked data to records in
WorldCat, using Schema.org vocabularies and proposed
extensions
for library data
68. Start small
Linked Open Data is not an
‘all or nothing’ proposition
Start by publishing data about
specific collections or items of
special interest
Consider incorporating Linked Open Data
into online exhibitions or special projects
69. Engage the linked data
community
Let people know what you’re up to, and
ask for feedback – you will get it.
70. Be creative
In addition to publishing data about
your own collections, think about how you
can incorporate data from other sources
into your projects
Consider collaborations with
other institutions
71. Utilize your internal resources
Cataloging/Metadata
Curators/Subject Matter Experts
IT Staff
Legal Department