Linked Data on a Budget

So…how does anyone do this stuff, for real?
MW 2023, Washington DC
David Newbury
Assistant Director, Software and User Experience,
Getty Digital

Hi! I’m David.
I lead the software and user experience teams at Getty.
Getty is a big museum/research hub in Los Angeles. We do lots of things with
data.
All of the actual work here was done by my fabulously talented team. I just talk.
2
Introduction

Part 1:
Linked Data is amazing!
3

● Linked Data is another name for the
Semantic Web, a good idea by Tim
Berners-Lee, whose previous good idea
turned out to be very good.
4
What is Linked Data: The Standard Story

There are three main concepts in Linked Data:
1. Data is represented as a graph.
2. Meaning is determined by ontologies.
3. IDs are dereferenceable URLs.
5
What is Linked Data: The Standard Story

A Graph is a way to represent data.
Think of a fact.
6
What is Linked Data: Data as a Graph
David --Favorite Drink--> Coffee

Think of another fact.
7
David --Favorite Drink--> Coffee
John --Favorite Drink--> Beer

And another.
8
David --Favorite Drink--> Coffee
John --Favorite Drink--> Beer
Betsy --Favorite Drink--> Chai

You could imagine these as a table of data:
9
Person   Fav Drink
David    Coffee
John     Beer
Betsy    Chai

…and add other information about
the people involved.
10
Person   Fav Drink   Hometown
David    Coffee      Pittsburgh
John     Beer        Boston
Betsy    Chai        Pittsburgh

This does get duplicative, though,
if you want to add additional
information about a different
column.
11
Person   Fav Drink   Hometown     State
David    Coffee      Pittsburgh   PA
John     Beer        Boston       MA
Betsy    Chai        Pittsburgh   PA

You can solve this with a relational
database…
12
Person   Fav Drink   Place ID
David    Coffee      1
John     Beer        2
Betsy    Chai        1
Place ID   Hometown     State
1          Pittsburgh   PA
2          Boston       MA

…or with a graph.
13
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
John --Fav Drink--> Beer
John --Hometown--> Boston
Pittsburgh --State--> PA
Boston --State--> MA
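
The same graph can be sketched in a few lines of Python as a set of (subject, predicate, object) triples. This is only an illustration of the idea, not how any particular triplestore works; the helper name `objects` is made up for this example.

```python
# Each fact is one (subject, predicate, object) triple.
TRIPLES = {
    ("David", "Fav Drink", "Coffee"),
    ("John", "Fav Drink", "Beer"),
    ("Betsy", "Fav Drink", "Chai"),
    ("David", "Hometown", "Pittsburgh"),
    ("John", "Hometown", "Boston"),
    ("Betsy", "Hometown", "Pittsburgh"),
    ("Pittsburgh", "State", "PA"),
    ("Boston", "State", "MA"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# The state is recorded once per place, however many people point at that place.
print(objects("David", "Fav Drink"))
print(objects("Pittsburgh", "State"))
```

Adding a new kind of fact means adding a triple, not a column.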

Tables are great for lots of data
about “a thing”, with a limited
number of kinds of things with
consistent links between things.
14

Graphs are great when the number
of kinds of things and number of
links between them is high and
inconsistent.
15

Another problem is meaning:
Words are great, but they require a
shared understanding of what’s
being described.
16
What is Linked Data: Data as an Ontology
David State PA
David State Solid
David State Confused

Linked Data uses ontologies to
include, as data, context and
definition around the terms used to
define how things are connected.
17
David --State--> PA        (State defined as: Geographical region within a country)
David --State--> Solid     (State defined as: Distinct form of matter)
David --State--> Confused  (State defined as: Emotional or mental condition)

It also assumes that each of
these concepts is represented
by a unique identifier, which lets
people—and computers—be
unambiguous.
18
geo_state      label: State   defined as: Geographical region within a country
matter_state   label: State   defined as: Distinct form of matter
mental_state   label: State   defined as: Emotional or mental condition

By making these identifiers into
URLs, they can be made globally
unique—and can also carry with
them the identity of the
concept’s creator.
19
getty.edu/geo_state      defined as: Geographical region within a country
getty.edu/matter_state   defined as: Distinct form of matter
getty.edu/mental_state   defined as: Emotional or mental condition
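
As a rough sketch of why this helps, here is the same idea in Python: one ambiguous word, three unambiguous identifiers. The three identifiers and definitions come from the slide; the dictionary layout and the function name `identifiers_for` are assumptions made for this example.

```python
# Identifier -> what the term means. The layout is an assumption for this sketch.
TERMS = {
    "getty.edu/geo_state": {
        "label": "State",
        "defined_as": "Geographical region within a country",
    },
    "getty.edu/matter_state": {
        "label": "State",
        "defined_as": "Distinct form of matter",
    },
    "getty.edu/mental_state": {
        "label": "State",
        "defined_as": "Emotional or mental condition",
    },
}


def identifiers_for(label, terms=TERMS):
    """Return every identifier that carries the given human-readable label."""
    return sorted(url for url, term in terms.items() if term["label"] == label)


# The label alone cannot tell these apart; the identifier can.
for url in identifiers_for("State"):
    print(url, "->", TERMS[url]["defined_as"])
```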

And it also means that if you
dereference that URL, you can
provide access to the data!
20
What is Linked Data: Dereferenceable Data
defined as: Distinct form of matter

It also means that the
information can come from
outside of our own ecosystem.
21
What is Linked Data: Dereferenceable Data
defined as: Distinct form of matter
spanish label: materia
same as: wikidata.org/Q35758

Linked Data is Amazing!
22
Part 1: Summary

Linked Data is Amazing!
But…
23
Part 1: Summary

Part 2:
Linked Data is annoying.
24

Relational databases are optimized
for performance and data locality.
If you keep all the information
about a person in one place—it’s
very fast to pull it back.
25
Annoyances: Performance
Person   Fav Drink   Place ID
David    Coffee      1
John     Beer        2
Betsy    Chai        1
Place ID   Hometown     State
1          Pittsburgh   PA
2          Boston       MA

It’s also easy to understand
“What is a person” from the
perspective of the application:
It’s the information in the
“Person” table.
26
Annoyances: Concept Boundaries
Person   Fav Drink   Place ID
David    Coffee      1
John     Beer        2
Betsy    Chai        1

It also makes it easy to include metadata
about the record.
27
Annoyances: Metadata
Person   Fav Drink   Place ID   Updated
David    Coffee      1          2022-01-05
John     Beer        2          1970-01-01
Betsy    Chai        1          2023-04-01

This idea of a “record” is a construct—
remember, these are just facts,
organized into a table.
But we’re trained to think about data as
collections of grouped facts, relevant
within a specific context.
28
Annoyances: Record Boundaries
Person   Fav Drink   Place ID   Updated
David    Coffee      1          2022-01-05
John     Beer        2          1970-01-01
Betsy    Chai        1          2023-04-01

Graphs don’t provide clear
boundaries the same way—they
don’t have the concept of a record.
Each triple is a stand-alone
record—and often collecting all
the information you want requires
many hops across the graph.
29
Annoyances: Graph Structure
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
Pittsburgh --State--> PA
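
To make "many hops" concrete, here is a small Python sketch. Finding David's state takes two hops, because nothing in the graph says "this is David's record". The function name `follow` and the list-of-predicates path are inventions for this example.

```python
TRIPLES = [
    ("David", "Fav Drink", "Coffee"),
    ("David", "Hometown", "Pittsburgh"),
    ("Betsy", "Fav Drink", "Chai"),
    ("Betsy", "Hometown", "Pittsburgh"),
    ("Pittsburgh", "State", "PA"),
]


def follow(start, path, triples=TRIPLES):
    """Walk `path` (a list of predicates) from `start`, one hop per predicate."""
    current = {start}
    for predicate in path:
        # Each hop is another scan of the graph (or, on the web, more requests).
        current = {o for s, p, o in triples if s in current and p == predicate}
    return current


print(follow("David", ["Fav Drink"]))          # one hop
print(follow("David", ["Hometown", "State"]))  # two hops
```

In the relational version, the application decides up front which of these hops belong to a "person". In the graph, whoever asks has to decide.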

Graphs are optimized for querying:
Defining a query-specific context
that includes a set of facts based on
novel criteria of interest, and
returning that subset of
information.
30
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Betsy --Fav Drink--> Chai
Betsy --Hometown--> Pittsburgh
Pittsburgh --State--> PA

“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
is just as easy to ask as
“What is the tombstone data about Irises?”
31
Annoyances: Queries

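“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
is just as absurdly difficult to ask as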
“What is the tombstone data about Irises?”
32
Annoyances: Queries

Doing so moves the burden of defining the
relevant context to the user of the data, not
the creator of the data.
This is great for research, but not so great
for ease of use.
33
Annoyances: Queries

We have never asked:
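What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?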
…but we ask
What is the tombstone data about Irises?
Several thousand times a day.
34
Annoyances: Queries

Dereferenceability could solve
this…but it requires network
requests.
35
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh

Dereferenceability could solve
this…but it requires network
requests.
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA

Dereferenceability could solve
this…but it requires network
requests.
So many requests.
37
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q35758

Dereferenceability could solve
this…but it requires network
requests.
So many requests.
…when do you stop?
38
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q106458883
wikidata.org/Q106458883 --spanish label--> división administrativa de primer nivel en varios países

Dereferenceability could solve
this…but it requires network
requests.
So many requests.
…when do you stop?
…and can you rely on other
systems?
39
Annoyances: Queries
David --Fav Drink--> Coffee
David --Hometown--> Pittsburgh
Pittsburgh --geo_state--> PA
geo_state --label--> State
geo_state --defined as--> Geographical region within a country
geo_state --same as--> wikidata.org/Q106458883
wikidata.org/Q106458883 --spanish label--> división administrativa de primer nivel en varios países
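
A toy Python model of the "when do you stop?" problem: following links from one identifier, with a depth limit, and counting how many fetches that costs. Nothing here touches the network. `FAKE_WEB` is an in-memory stand-in, and URLs such as getty.edu/David are hypothetical, made up to mirror the diagram.

```python
# In-memory stand-in for the web: URL -> the URLs its document links to.
FAKE_WEB = {
    "getty.edu/David": ["getty.edu/Coffee", "getty.edu/Pittsburgh"],
    "getty.edu/Coffee": [],
    "getty.edu/Pittsburgh": ["getty.edu/PA"],
    "getty.edu/PA": ["getty.edu/geo_state"],
    "getty.edu/geo_state": ["wikidata.org/Q106458883"],
    "wikidata.org/Q106458883": [],
}


def crawl(start, max_depth, web=FAKE_WEB):
    """Breadth-first dereference from `start`; return the URLs fetched, in order."""
    fetched = []
    seen = {start}
    frontier = [start]
    for _depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            fetched.append(url)  # one simulated network request
            for link in web.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


for depth in range(5):
    print(depth, len(crawl("getty.edu/David", depth)))
```

Every extra level of context costs more requests, and the last one in this toy example depends on a system that somebody else runs.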

Linked Data is annoying.
None of these are theoretical concerns about Linked Data.
They’re just practical concerns when you try and build something on top of it.
40
Part 2: Summary

Part 3:
Getty builds stuff on linked data.
41

Getty has been doing Linked Data since 2014,
starting with the Getty Vocabularies.
It’s a collection of concepts, people, and places
deeply relevant to the study of art and
architecture.
42
Getty’s Linked Data: Getty Vocabularies

Since then, we’ve moved most of our major
systems to use Linked Data—including our
archives…
43
Getty’s Linked Data: Archival Records

… and our museum collection.
44
Getty’s Linked Data: Archival Records

We’ve also built a complex, powerful
infrastructure to support doing this across
our application landscape.
It’s been fun. We’ve learned a lot.
45
Getty’s Linked Data: APIs

A Hard-won lesson:
No application that we’ve built required Linked Data.
46
Getty’s Linked Data: What we learned

Which, if you think about it, makes sense. Each application has
a specific, known context with clear record boundaries.
47

Why keep doing it?
The value is in the ecosystem—when we present information in multiple contexts.
It’s also in the community—allowing our data to be used beyond our
organization’s boundaries.
48

Why should YOU do it?
Because what makes cultural data interesting is not contained within the walls of
any one institution.
It’s shared across our entire, world-wide community. We should work together.
That’s the reason—not any particular data structure or ontology.
49

Part 4:
So…what can YOU do?
50

You don’t need to do what we’ve done.
Enabling connections across silos and organizations doesn’t mean that you need a
triplestore with Linked.Art data provided via JSON-LD documents reconciled to
ULAN and Wikidata, queryable via SPARQL and ElasticSearch, with
cross-references via Web Annotations, associated with IIIF Manifests.
51
Linking Data: The Six Levels

That would just be showing off.
52

You just need to make it easy for people to
understand what you have done.
There are, in our experience, six levels of Linked Data that build on one another,
but all provide value—both within an organization and across the community.
53

#1: Authority
Provide a consistent way to identify both entities and the institution providing
information in your data.
54
Level 1: Authority

Give everything an identifier.
Other people can’t talk about your data without a way to unambiguously refer to
the record that they’re talking about.
URLs as IDs are great for this: they’re unique, and they let others know
who produced the data.
55
Level 1: Authority

Give everything a human-friendly identifier.
https://data.getty.edu/research/collections/object/97e8fd22-92a4-4831-aa63-33255c1aaefe
This is not friendly.
56
Level 1: Authority

Give everything a human-friendly identifier.
This is friendly:
https://data.getty.edu/archives/AK3098
57
Level 1: Authority

Identifiers are for other PEOPLE to use.
Identifiers are most commonly used by machines—but most of the effort around
identifiers is done by humans typing them.
Optimize for people, not for machines.
58
Level 1: Authority

Identifiers Identify Documents.
You have the best sense of what “relevant context” might be. It’s wonderful to
provide query capabilities—but you should determine what information is usually
relevant for a given identifier.
Make easy things easy, and hard things possible.
59
Level 1: Authority

#2: Reconciliation
Use authorities and thesauri to disambiguate between similar real world entities.
60
Level 2: Reconciliation

Reference, even if you can’t link.
Give people a sense of how your data might be connected to others by adding a
pointer to a shared, common point of reference.
61

Reference, even if you can’t link.
The Getty Vocabularies are great for this. So is Wikidata. So is VIAF. It doesn’t
matter which: just give us a way to confirm that what we’re thinking is what you’re
thinking.
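
Here is a small Python sketch of what that confirmation looks like in practice. The two institutions spell the artist's name differently, but both point at the same shared reference. All records, names, and URLs below are hypothetical placeholders (example.org), not real authority identifiers.

```python
# Hypothetical records from different institutions.
RECORD_A = {
    "id": "https://museum-a.example.org/object/1",
    "artist": "Jane Smith",
    "artist_refs": {"https://authority.example.org/agent/42"},
}
RECORD_B = {
    "id": "https://museum-b.example.org/item/abc",
    "artist": "Smith, Jane",
    "artist_refs": {
        "https://authority.example.org/agent/42",
        "https://other-authority.example.org/person/7",
    },
}
RECORD_C = {
    "id": "https://museum-c.example.org/work/9",
    "artist": "Jane Smith",
    "artist_refs": set(),  # same spelling, but no published reference
}


def same_artist(a, b):
    """True only if the two records share at least one published reference."""
    return bool(a["artist_refs"] & b["artist_refs"])


print(same_artist(RECORD_A, RECORD_B))  # names differ, references agree
print(same_artist(RECORD_A, RECORD_C))  # names agree, nothing to confirm with
```

Matching on the name string would get both of these cases wrong.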
62

Publish that reference.
It only works, though, if you let people KNOW.
63

If this is all you can do, you’ve done enough.
Almost all the value of linked data is present at this point. If you publish data,
provide identifiers, and include links to others, you’ve done linked data.
Everything after this is extra credit.
64

#3: Bidirectional Linking
Establish and publish connections between systems or institutions.
65
Level 3: Bidirectional Linking

Links go both ways.
It’s valuable to know that a given artwork is mentioned in a book—but it’s just as
valuable to know that a book mentions an artwork!
66

Sync is hard.
We’ve learned that trying to keep this in sync within systems is hard. Most of our
applications are not designed to deal with information outside of their own sphere
of control.
Instead, we maintain these references outside systems of record, and look them
up when needed for presentation.
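
A minimal Python sketch of that approach, assuming the links live in one list outside both systems and are indexed in both directions when needed. The identifiers and the function name `build_crosswalk` are hypothetical.

```python
from collections import defaultdict

# Links are stored once, outside either system of record.
LINKS = [
    ("book/1", "artwork/7"),
    ("book/1", "artwork/9"),
    ("book/2", "artwork/7"),
]


def build_crosswalk(links):
    """Index each (book, artwork) link in both directions."""
    mentions = defaultdict(list)      # book -> artworks it mentions
    mentioned_in = defaultdict(list)  # artwork -> books that mention it
    for book, artwork in links:
        mentions[book].append(artwork)
        mentioned_in[artwork].append(book)
    return mentions, mentioned_in


mentions, mentioned_in = build_crosswalk(LINKS)
print(mentions["book/1"])
print(mentioned_in["artwork/7"])
```

Neither the library system nor the collection system has to know about the other; both directions come from the same list.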
67

Links are often surprising!
Publishing bidirectional crosswalks between linked things creates
networks of information—and helps people discover unexpected relationships.
68

#4: Aggregation
Enhance discovery by providing search and access to information across
collections.
69
Level 4: Aggregation

This is where you start doing things for other people.
The previous levels are about what you do in your data, often for yourself—
but now, you’re doing things explicitly to help other people do things with your
data.
70

The best place for data is where people are looking for it.
Often, that’s not with you.
Share your data with other people, and let them point back to you.
71

But: Change Discovery.
If other people are using your data, they’re going to cache it.
They don’t trust you.
72

Change Discovery.
I don’t trust you.
73

Change Discovery.
I don’t trust you.
Please don’t trust my systems.
74

Change Discovery.
Cache our data: We’ll let you know if the data changes.
75

Change Discovery.
It doesn’t matter what the change is—just letting someone know to look for
changes provides most of the value.
Recaching everything is hard, but pulling just the changed records is easy.
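
As a rough Python sketch: if the publisher exposes even a bare list of "this identifier changed on this date", a consumer can refresh only what changed since its last sync. The feed layout and the names `CHANGES` and `changed_since` are assumptions for this example, not a description of any particular specification.

```python
from datetime import date

# A hypothetical change feed: (identifier, date of last change).
CHANGES = [
    ("object/1", date(2022, 1, 5)),
    ("object/2", date(1970, 1, 1)),
    ("object/3", date(2023, 4, 1)),
]


def changed_since(last_sync, changes=CHANGES):
    """Return identifiers changed after `last_sync`; only these need re-fetching."""
    return [identifier for identifier, changed in changes if changed > last_sync]


# The consumer last synced in mid-2022, so only one record needs refreshing.
print(changed_since(date(2022, 6, 1)))
```

Note that the feed says nothing about what changed, only that something did.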
76

#5: Interoperability
Develop interfaces that present information from many sources in a single way.
77
Level 5: Interoperability

Data Standards matter now.
Up to this point in the process, I haven’t mentioned anything about linked.art, or
CIDOC-CRM, or Schema.org, or SKOS-XL.
They don’t matter until you want to create an automatically interoperable
application.
78

Data Standards have other value, of course.
Standards are great for consistency and ensuring quality—
and for letting other people write the documentation.
79

Externally, they’re for robots.
The external value of standards means that I can write code that consumes your
data without needing to talk to you—or even know you exist.
This is why Schema.org is so widely used—Google doesn’t know I exist, but they
can still extract my event data and share it.
80

IIIF is our community’s shining example of this.
A standard used widely enough that there are multiple applications that can be
used across the field to show other people’s data in yet other people’s applications.
81

Linked.art is just beginning to demonstrate this.
We’re on the precipice of having enough data at this level for it to be worth
building applications for artwork. Stay tuned!
82

This is Level 5.
A reminder here. This is my penultimate level of linked data.
You don’t need to start here, and you don’t need to get to here to provide value.
83

#6: Reuse
Allowing one institution to import information from another while maintaining the
provenance of the data.
84
Level 6: Reuse

We haven’t gotten here.
The final goal here would be if I could use your data in my application and have it
still be your data.
This is the dream.
I’m still dreaming about this.
85
Level 6: Reuse

We will get here.
An ecosystem of shared, reusable, linked data will open potential beyond what we
can do at any organization—even Getty.
But it can’t be done without others. Without you.
86
Level 6: Reuse

Start Small.
Each of these levels provides value.
Decide what you can do—and do that—it’s enough, and it helps us build the
community.
87
Level 6: Reuse

Work Together—and complain!
The only way we’ll know what works (and, more importantly, what doesn’t) is if we
hear from others that things don’t work!
Linked Data is not valuable outside of a community—and if it’s not working for the
community, it’s not working.
We’re making mistakes—let us know when, so we can learn—and we can share.
88
Level 6: Reuse

Thank you! Complaints go here:
dnewbury@getty.edu
89
