As part of MuseWeb 2023 in Washington, DC, this presentation walks through the basics of Linked Data, and then discusses the six levels of implementation of Linked Data, using the Getty's work as examples.
1. So…how does anyone do this stuff, for real?
MW 2023, Washington DC
Linked Data on a Budget
David Newbury
Assistant Director, Software and User Experience,
Getty Digital
2. Hi! I’m David.
I lead the software and user experience teams at Getty.
Getty is a big museum/research hub in Los Angeles. We do lots of things with
data.
All of the actual work here was done by my fabulously talented team. I just talk.
Introduction
4. Linked Data is another name for the
Semantic Web, a good idea by Tim
Berners-Lee, whose previous good idea
turned out to be very good.
What is Linked Data: The Standard Story
5. There are three main concepts in Linked Data:
1. Data is represented as a graph.
2. Meaning is determined by ontologies.
3. IDs are dereferenceable URLs.
What is Linked Data: The Standard Story
6. A Graph is a way to represent data.
Think of a fact.
What is Linked Data: Data as a Graph
Favorite Drink Coffee
David
7. A Graph is a way to represent data.
Think of a fact.
Think of another fact.
What is Linked Data: Data as a Graph
Favorite Drink Coffee
Favorite Drink Beer
David
John
8. A Graph is a way to represent data.
Think of a fact.
Think of another fact.
And another.
What is Linked Data: Data as a Graph
Favorite Drink Coffee
Favorite Drink Beer
Favorite Drink Chai
David
John
Betsy
9. You could imagine these as a table of data:
What is Linked Data: Data as a Graph
Person  Fav Drink
David   Coffee
John    Beer
Betsy   Chai
10. You could imagine these as a table of data:
…and add other information about
the people involved.
What is Linked Data: Data as a Graph
Person  Fav Drink  Hometown
David   Coffee     Pittsburgh
John    Beer       Boston
Betsy   Chai       Pittsburgh
11. This does get duplicative, though,
if you want to add additional
information about a different
column.
What is Linked Data: Data as a Graph
Person  Fav Drink  Hometown    State
David   Coffee     Pittsburgh  PA
John    Beer       Boston      MA
Betsy   Chai       Pittsburgh  PA
12. You can solve this with a relational
database…
What is Linked Data: Data as a Graph
Person  Fav Drink  Place ID
David   Coffee     1
John    Beer       2
Betsy   Chai       1

Place ID  Hometown    State
1         Pittsburgh  PA
2         Boston      MA
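The two tables above can be sketched with Python's built-in sqlite3 module. The table and column names (`person`, `place`, `place_id`) are assumptions chosen for illustration; the slides only show the data.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE place (
        place_id INTEGER PRIMARY KEY,
        hometown TEXT,
        state    TEXT
    );
    CREATE TABLE person (
        name      TEXT,
        fav_drink TEXT,
        place_id  INTEGER REFERENCES place(place_id)
    );
    INSERT INTO place VALUES (1, 'Pittsburgh', 'PA'), (2, 'Boston', 'MA');
    INSERT INTO person VALUES
        ('David', 'Coffee', 1), ('John', 'Beer', 2), ('Betsy', 'Chai', 1);
""")

# The state is stored once per place and joined back in when needed.
rows = conn.execute("""
    SELECT person.name, person.fav_drink, place.hometown, place.state
    FROM person
    JOIN place ON person.place_id = place.place_id
    ORDER BY person.name
""").fetchall()

for row in rows:
    print(row)
```

Pittsburgh's state now lives in exactly one row, which is the duplication the previous slide complained about.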
13. You can solve this with a relational
database…
…or with a graph.
What is Linked Data: Data as a Graph
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
John --Fav Drink--> Beer
John --Hometown--> Boston
Pittsburgh --State--> PA
Boston --State--> MA
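A minimal sketch of the same graph as subject, predicate, object triples. The tuple layout, the predicate spellings, and the `facts_about` helper are illustrative assumptions, not something the presentation prescribes.

```python
# Each fact is one (subject, predicate, object) triple.
triples = [
    ("David", "fav_drink", "Coffee"),
    ("John", "fav_drink", "Beer"),
    ("Betsy", "fav_drink", "Chai"),
    ("David", "hometown", "Pittsburgh"),
    ("Betsy", "hometown", "Pittsburgh"),
    ("John", "hometown", "Boston"),
    ("Pittsburgh", "state", "PA"),
    ("Boston", "state", "MA"),
]


def facts_about(subject):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


print(facts_about("David"))
print(facts_about("Pittsburgh"))
```

Adding a new kind of fact means appending another triple; no schema change is needed, which is the flexibility the next slides describe.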
14. Tables are great for lots of data
about “a thing”, with a limited
number of kinds of things with
consistent links between things.
What is Linked Data: Data as a Graph
15. Graphs are great when the number
of kinds of things and number of
links between them is high and
inconsistent.
What is Linked Data: Data as a Graph
16. Another problem is meaning:
Words are great, but they require a
shared understanding of what’s
being described.
What is Linked Data: Data as an Ontology
David State PA
David State Solid
David State Confused
17. Linked Data uses ontologies to
include, as data, context and
definition around the terms used to
define how things are connected.
What is Linked Data: Data as an Ontology
David State PA        (State defined as: Geographical region within a country)
David State Solid     (State defined as: Distinct form of matter)
David State Confused  (State defined as: Emotional or mental condition)
18. It also assumes that each of
these concepts is represented
by a unique identifier, which lets
people—and computers—be
unambiguous.
What is Linked Data: Data as an Ontology
geo_state     label: State   defined as: Geographical region within a country
matter_state  label: State   defined as: Distinct form of matter
mental_state  label: State   defined as: Emotional or mental condition
19. By making these identifiers into
URLs, they can be made globally
unique—and can also carry with
them the identity of the
concept’s creator.
What is Linked Data: Data as an Ontology
getty.edu/geo_state     defined as: Geographical region within a country
getty.edu/matter_state  defined as: Distinct form of matter
getty.edu/mental_state  defined as: Emotional or mental condition
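A short sketch of how URL identifiers remove the ambiguity between the three meanings of "State". The identifiers are the illustrative ones from the slides (not necessarily resolvable URLs), and the dictionary layout and `explain` helper are assumptions.

```python
# Ontology terms keyed by identifier; all three share the label "State".
ontology = {
    "getty.edu/geo_state": {
        "label": "State",
        "definition": "Geographical region within a country",
    },
    "getty.edu/matter_state": {
        "label": "State",
        "definition": "Distinct form of matter",
    },
    "getty.edu/mental_state": {
        "label": "State",
        "definition": "Emotional or mental condition",
    },
}

# The predicate is an identifier, not the ambiguous word "State".
statements = [
    ("David", "getty.edu/geo_state", "PA"),
    ("David", "getty.edu/matter_state", "Solid"),
    ("David", "getty.edu/mental_state", "Confused"),
]


def explain(statement):
    """Spell out a statement using the definition attached to its predicate."""
    subject, predicate, value = statement
    term = ontology[predicate]
    return f"{subject}: {term['label']} ({term['definition']}) = {value}"


for statement in statements:
    print(explain(statement))
```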
20. And it also means that if you
dereference that URL, you can
provide access to the data!
What is Linked Data: Dereferenceable Data
getty.edu/matter_state  defined as: Distinct form of matter
21. It also means that the
information can come from
outside of our own ecosystem.
What is Linked Data: Dereferenceable Data
getty.edu/matter_state  defined as: Distinct form of matter
getty.edu/matter_state  Spanish label: materia
getty.edu/matter_state  same as: wikidata.org/Q35758
25. Relational databases are optimized
for performance and data locality.
If you keep all the information
about a person in one place—it’s
very fast to pull it back.
Annoyances: Performance
Person  Fav Drink  Place ID
David   Coffee     1
John    Beer       2
Betsy   Chai       1

Place ID  Hometown    State
1         Pittsburgh  PA
2         Boston      MA
26. It’s also easy to understand
“What is a person” from the
perspective of the application:
It’s the information in the
“Person” table.
Annoyances: Concept Boundaries
Person  Fav Drink  Place ID
David   Coffee     1
John    Beer       2
Betsy   Chai       1
27. It also makes it easy to include metadata
about the record.
Annoyances: Metadata
Person  Fav Drink  Place ID  Updated
David   Coffee     1         2022-01-05
John    Beer       2         1970-01-01
Betsy   Chai       1         2023-04-01
28. This idea of a “record” is a construct—
remember, these are just facts,
organized into a table.
But we’re trained to think about data as
collections of grouped facts, relevant
within a specific context.
Annoyances: Record Boundaries
Person  Fav Drink  Place ID  Updated
David   Coffee     1         2022-01-05
John    Beer       2         1970-01-01
Betsy   Chai       1         2023-04-01
29. Graphs don’t provide clear
boundaries the same way—they
don’t have the concept of a record.
Each triple is a stand-alone
record—and often collecting all
the information you want requires
many hops across the graph.
Annoyances: Graph Structure
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
Pittsburgh --State--> PA
30. Graphs are optimized for querying:
Defining a query-specific context
that includes a set of facts based on
novel criteria of interest, and
returning that subset of
information.
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
Pittsburgh --State--> PA
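A rough sketch of what "defining a query-specific context" can look like over a list of triples. The `match` and `people_from_state` helpers are hypothetical; a real triplestore would express this as a SPARQL query rather than Python loops.

```python
triples = [
    ("David", "fav_drink", "Coffee"),
    ("David", "hometown", "Pittsburgh"),
    ("Betsy", "fav_drink", "Chai"),
    ("Betsy", "hometown", "Pittsburgh"),
    ("John", "fav_drink", "Beer"),
    ("John", "hometown", "Boston"),
    ("Pittsburgh", "state", "PA"),
    ("Boston", "state", "MA"),
]


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def people_from_state(state):
    """Two hops: find places in the state, then people whose hometown is one."""
    towns = {subject for subject, _, _ in match(p="state", o=state)}
    return sorted(s for s, _, o in match(p="hometown") if o in towns)


print(people_from_state("PA"))
```

Nothing in the data says "people from Pennsylvania" is a record; the question itself defines which facts belong together, at the cost of hopping across the graph.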
31. “What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
is just as easy to ask as
“What is the tombstone data about Irises?”
Annoyances: Queries
32. “What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
is absurdly difficult to ask compared to
“What is the tombstone data about Irises?”
Annoyances: Queries
33. Doing so moves the burden of defining the
relevant context to the user of the data, not
the creator of the data.
This is great for research, but not so great
for ease of use.
Annoyances: Queries
34. We have never asked:
“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
…but we ask
“What is the tombstone data about Irises?”
Several thousand times a day.
Annoyances: Queries
37. Dereferenceability could solve
this…but it requires network
requests.
So many requests.
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q35758
38. Dereferenceability could solve
this…but it requires network
requests.
So many requests.
…when do you stop?
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q106458883
wikidata.org/Q106458883 --Spanish label--> división administrativa de primer nivel en varios países
39. Dereferenceability could solve
this…but it requires network
requests.
So many requests.
…when do you stop?
…and can you rely on other
systems?
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q106458883
wikidata.org/Q106458883 --Spanish label--> división administrativa de primer nivel en varios países
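One way to picture the "when do you stop?" problem is a crawler with an explicit hop limit. This sketch replaces real network requests with an in-memory dictionary; `FAKE_WEB`, the `example.org` entry, and the `dereference` function are all hypothetical.

```python
# Stand-in for the network: identifier -> the document it would return.
FAKE_WEB = {
    "getty.edu/geo_state": {
        "label": "State",
        "same_as": ["wikidata.org/Q106458883"],
    },
    "wikidata.org/Q106458883": {
        "label_es": "división administrativa de primer nivel en varios países",
        "same_as": ["example.org/another_term"],  # hypothetical further link
    },
    "example.org/another_term": {"label": "hypothetical", "same_as": []},
}


def dereference(url, max_depth, fetched=None):
    """Follow same_as links from url, going at most max_depth hops away."""
    if fetched is None:
        fetched = {}
    if max_depth < 0 or url in fetched:
        return fetched
    document = FAKE_WEB.get(url)
    if document is None:
        # The other system is unavailable or unknown: skip it and carry on.
        return fetched
    fetched[url] = document
    for linked in document.get("same_as", []):
        dereference(linked, max_depth - 1, fetched)
    return fetched


for depth in (0, 1, 2):
    print(depth, "hops ->", len(dereference("getty.edu/geo_state", depth)), "requests")
```

Every extra hop is another request to a system you do not control, so the depth limit is a policy decision rather than a technical one.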
40. Linked Data is annoying.
None of these are theoretical concerns about Linked Data.
They’re just practical concerns when you try and build something on top of it.
Part 2: Summary
42. Getty has been doing Linked Data since 2014,
starting with the Getty Vocabularies.
It’s a collection of concepts, people, and places
deeply relevant to the study of art and
architecture.
Getty’s Linked Data: Getty Vocabularies
43. Since then, we’ve moved most of our major
systems to use Linked Data—including our
archives…
Getty’s Linked Data: Archival Records
44. Since then, we’ve moved most of our major
systems to use Linked Data—including our
archives…
… and our museum collection.
Getty’s Linked Data: Archival Records
45. We’ve also built a complex, powerful
infrastructure to support doing this across
our application landscape.
It’s been fun. We’ve learned a lot.
Getty’s Linked Data: APIs
46. A Hard-won lesson:
No application that we’ve built required Linked Data.
Getty’s Linked Data: What we learned
47. A Hard-won lesson:
No application that we’ve built required Linked Data.
Which, if you think about it, makes sense. Each application has
a specific, known context with clear record boundaries.
Getty’s Linked Data: What we learned
48. Why keep doing it?
The value is in the ecosystem—when we present information in multiple contexts.
It’s also in the community—allowing our data to be used beyond our
organization’s boundaries.
Getty’s Linked Data: What we learned
49. Why should YOU do it?
Because what makes cultural data interesting is not contained within the walls of
any one institution.
It’s shared across our entire, world-wide community. We should work together.
That’s the reason—not any particular data structure or ontology.
Getty’s Linked Data: What we learned
51. You don’t need to do what we’ve done.
Enabling connections across silos and organizations doesn’t mean that you need a
triplestore with Linked.Art data provided via JSON-LD documents reconciled to
ULAN and Wikidata, queryable via SPARQL and ElasticSearch, with
cross-references via Web Annotations, associated with IIIF Manifests.
Linking Data: The Six Levels
52. You don’t need to do what we’ve done.
Enabling connections across silos and organizations doesn’t mean that you need a
triplestore with Linked.Art data provided via JSON-LD documents reconciled to
ULAN and Wikidata, queryable via SPARQL and ElasticSearch, with
cross-references via Web Annotations, associated with IIIF Manifests.
That would just be showing off.
Linking Data: The Six Levels
53. You just need to make it easy for people to
understand what you have done.
There are, in our experience, six levels of Linked Data that build on one another,
but all provide value—both within an organization and across the community.
Linking Data: The Six Levels
54. #1: Authority
Provide a consistent way to identify both entities and the institution providing
information in your data.
Level 1: Authority
55. Give everything an identifier.
Other people can’t talk about your data without a way to unambiguously refer to
the record that they’re talking about.
URLs as IDs are great for this: they’re unique, and they let others know
who produced the data.
Level 1: Authority
56. Give everything a human-friendly identifier.
https://data.getty.edu/research/collections/object/97e8fd22-92a4-4831-aa63-33255c1aaefe
This is not friendly.
Level 1: Authority
57. Give everything a human-friendly identifier.
This is friendly:
https://data.getty.edu/archives/AK3098
Level 1: Authority
58. Identifiers are for other PEOPLE to use.
Identifiers are most commonly used by machines—but most of the effort around
identifiers is done by humans typing them.
Optimize for people, not for machines.
Level 1: Authority
59. Identifiers Identify Documents.
You have the best sense of what “relevant context” might be. It’s wonderful to
provide query capabilities—but you should determine what information is usually
relevant for a given identifier.
Make easy things easy, and hard things possible.
Level 1: Authority
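A small sketch of minting short, typeable identifiers under one authority. `BASE_URL`, the `mint_identifier` function, and the 12-character limit are assumptions made for illustration; this is not a description of how Getty generates its URLs.

```python
import re

# Hypothetical authority; the host tells readers who published the data.
BASE_URL = "https://data.example.org"

# Assumption: a local ID is "friendly" if it is short and alphanumeric.
FRIENDLY_ID = re.compile(r"[A-Za-z0-9]{1,12}")


def mint_identifier(collection, local_id):
    """Build a URL identifier, refusing IDs a person could not easily type."""
    if not FRIENDLY_ID.fullmatch(local_id):
        raise ValueError(f"not a human-friendly identifier: {local_id!r}")
    return f"{BASE_URL}/{collection}/{local_id}"


print(mint_identifier("archives", "AK3098"))

try:
    mint_identifier("object", "97e8fd22-92a4-4831-aa63-33255c1aaefe")
except ValueError as error:
    print(error)
```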
61. Reference, even if you can’t link.
Give people a sense of how your data might be connected to others by adding a
pointer to a shared, common point of reference.
Level 2: Reconciliation
62. Reference, even if you can’t link.
The Getty Vocabularies are great for this. So is Wikidata. So is VIAF. Doesn’t
matter—just give us a way to confirm that what we’re thinking is what you’re
thinking.
Level 2: Reconciliation
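As a sketch of why a shared reference is enough, two institutions can describe the same person differently and still be matched. The record layout, the `example.org` and `example.net` identifiers, and `same_entity` are hypothetical; the Wikidata entity Q5582 is Vincent van Gogh.

```python
VAN_GOGH = "http://www.wikidata.org/entity/Q5582"

# Two hypothetical institutions with different local IDs and name forms.
ours = {
    "id": "https://data.example.org/people/van-gogh",
    "name": "Vincent van Gogh",
    "references": {VAN_GOGH},
}
theirs = {
    "id": "https://other.example.net/agents/42",
    "name": "Gogh, Vincent van",
    "references": {VAN_GOGH},
}
unreconciled = {
    "id": "https://other.example.net/agents/43",
    "name": "V. van Gogh",
    "references": set(),
}


def same_entity(a, b):
    """Treat two records as the same thing if they share any reference."""
    return bool(a["references"] & b["references"])


print(same_entity(ours, theirs))
print(same_entity(ours, unreconciled))
```

The third record may well be the same person, but without a shared reference nobody can confirm it without asking.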
64. If this is all you can do, you’ve done enough.
Almost all the value of linked data is present at this point. If you publish data,
provide identifiers, and you include links to others—you’ve done linked data.
Everything after this is extra credit.
Level 2: Reconciliation
66. Links go both ways.
It’s valuable to know that a given artwork is mentioned in a book—but it’s just as
valuable to know that a book mentions an artwork!
Level 3: Bidirectional Linking
67. Sync is hard.
We’ve learned that trying to keep this in sync within systems is hard. Most of our
applications are not designed to deal with information outside of their own sphere
of control.
Instead, we maintain these references outside systems of record, and look them
up when needed for presentation.
Level 3: Bidirectional Linking
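A minimal sketch of keeping references outside the systems of record and looking them up in either direction. The `LinkIndex` class and the identifiers are hypothetical, not Getty's implementation.

```python
from collections import defaultdict


class LinkIndex:
    """Stores each link once and answers lookups from either end."""

    def __init__(self):
        self._forward = defaultdict(set)
        self._backward = defaultdict(set)

    def add(self, source, target):
        # Record both directions so neither source system has to.
        self._forward[source].add(target)
        self._backward[target].add(source)

    def links_from(self, source):
        return sorted(self._forward.get(source, ()))

    def links_to(self, target):
        return sorted(self._backward.get(target, ()))


index = LinkIndex()
index.add("book/catalogue-1", "artwork/irises")
index.add("book/catalogue-2", "artwork/irises")

print(index.links_from("book/catalogue-1"))  # the book mentions an artwork
print(index.links_to("artwork/irises"))      # the artwork is mentioned in books
```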
68. Links are often surprising!
Publishing bidirectional crosswalks between linked things creates
networks of information—and helps people discover unexpected relationships.
Level 3: Bidirectional Linking
70. This is where you start doing things for other people.
The previous levels are about what you do in your data, often for yourself—
but now, you’re doing things explicitly to help other people do things with your
data.
Level 4: Aggregation
71. The best place for data is where people are looking for it.
Often, that’s not with you.
Share your data with other people, and let them point back to you.
Level 4: Aggregation
72. But: Change Discovery.
If other people are using your data, they’re going to cache it.
They don’t trust you.
Level 4: Aggregation
73. Change Discovery.
If other people are using your data, they’re going to cache it.
I don’t trust you.
Level 4: Aggregation
74. Change Discovery.
If other people are using your data, they’re going to cache it.
I don’t trust you.
Please don’t trust my systems.
Level 4: Aggregation
76. Change Discovery.
It doesn’t matter what the change is—just letting someone know to look for
changes provides most of the value.
Recaching everything is hard, but pulling just the changed records is easy.
Level 4: Aggregation
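A sketch of a consumer that refreshes only what changed since its last sync. The feed layout, the identifiers, and the `changed_since` and `refresh` helpers are assumptions; published change feeds (for example, ones based on Activity Streams) carry more detail than this.

```python
from datetime import date

# Hypothetical change feed: which record changed, and when.
change_feed = [
    {"id": "https://data.example.org/object/1", "updated": date(2022, 1, 5)},
    {"id": "https://data.example.org/object/2", "updated": date(1970, 1, 1)},
    {"id": "https://data.example.org/object/3", "updated": date(2023, 4, 1)},
]


def changed_since(feed, last_sync):
    """List identifiers whose records changed after the last sync."""
    return [entry["id"] for entry in feed if entry["updated"] > last_sync]


def refresh(cache, feed, last_sync, fetch):
    """Re-fetch only the changed records and update the local cache."""
    changed = changed_since(feed, last_sync)
    for identifier in changed:
        cache[identifier] = fetch(identifier)
    return changed


cache = {}
refetched = refresh(
    cache,
    change_feed,
    last_sync=date(2022, 6, 1),
    fetch=lambda identifier: {"id": identifier},  # stand-in for a real request
)
print(refetched)
```

The feed does not say what changed, only that something did, which matches the point that knowing where to look provides most of the value.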
78. Data Standards matter now.
Up to this point in the process, I haven’t mentioned anything about linked.art, or
CIDOC-CRM, or Schema.org, or SKOS-XL.
They don’t matter until you want to create an automatically interoperable
application.
Level 5: Interoperability
79. Data Standards have other value, of course.
Standards are great for consistency and ensuring quality—
and for letting other people write the documentation.
Level 5: Interoperability
80. Externally, they’re for robots.
The external value of standards means that I can write code that consumes your
data without needing to talk to you—or even know you exist.
This is why Schema.org is so widely used—Google doesn’t know I exist, but they
can still extract my event data and share it.
Level 5: Interoperability
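To illustrate "for robots", here is a sketch of a consumer reading a Schema.org `Event` without knowing anything about its publisher. The event details are invented, and `extract_event` is a hypothetical, deliberately strict reader; real JSON-LD consumers handle many more ways of writing the context.

```python
import json

# What a publisher might embed in a page (event details are made up).
published = json.dumps({
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Hypothetical Gallery Talk",
    "startDate": "2023-04-05T18:00:00-04:00",
    "location": {"@type": "Place", "name": "Hypothetical Museum"},
})


def extract_event(document):
    """Pull out the fields a consumer needs, relying only on the standard."""
    data = json.loads(document)
    if data.get("@context") != "https://schema.org" or data.get("@type") != "Event":
        return None
    return {
        "name": data["name"],
        "starts": data["startDate"],
        "where": data["location"]["name"],
    }


print(extract_event(published))
```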
81. IIIF is our community’s shining example of this.
A standard widely-enough used that there are multiple applications that can be
used across the field to show other people’s data in yet other people’s applications.
Level 5: Interoperability
82. Linked.art is just beginning to demonstrate this.
We’re on the cusp of having enough data at this level for it to be worth
building applications for artwork. Stay tuned!
Level 5: Interoperability
83. This is Level 5.
A reminder here. This is my penultimate level of linked data.
You don’t need to start here, and you don’t need to get to here to provide value.
Level 5: Interoperability
84. #6: Reuse
Allowing one institution to import information from another while maintaining the
provenance of the data.
Level 6: Reuse
85. We haven’t gotten here.
The final goal here would be if I could use your data in my application—and have it
still be your data.
This is the dream.
I’m still dreaming about this.
Level 6: Reuse
86. We will get here.
An ecosystem of shared, reusable, linked data will open potential beyond what we
can do at any organization—even Getty.
But it can’t be done without others. Without you.
Level 6: Reuse
87. Start Small.
Each of these levels provides value.
Decide what you can do—and do that—it’s enough, and it helps us build the
community.
Level 6: Reuse
88. Work Together—and complain!
The only way we’ll know what works (and, more importantly, what doesn’t) is if we
hear from others that things don’t work!
Linked Data is not valuable outside of a community—and if it’s not working for the
community, it’s not working.
We’re making mistakes—let us know when, so we can learn—and we can share.
Level 6: Reuse