1. (1)
Standardizing for Open Data
Ivan Herman, W3C
Open Data Week, Marseille, France, June 26, 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/
2. (2)
Data is everywhere on the Web!
• Public, private, behind enterprise firewalls
• Ranges from informal to highly curated
• Ranges from machine readable to human readable
  • HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
• Expressed in diverse models
  • tree, graph, table, …
• Serialized in many ways
  • XML, CSV, RDF, PDF, HTML tables, microdata, …
8. (8)
W3C’s standardization focus was, traditionally, on Web-scale integration of data
• Some basic principles:
  • use of URIs everywhere (to uniquely identify things)
  • relate resources to one another (to connect things on the Web)
  • discover new relationships through inferences
• This is what the Semantic Web technologies are all about
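The three principles above can be sketched in plain Python. This is an illustration only, not how Semantic Web tools are implemented: the `example.org` URIs, the shortened `rdf:type` and `rdfs:subClassOf` names, and the `infer_types` helper are all assumptions made for the example.

```python
# Hypothetical URIs, used purely for illustration.
EX = "http://example.org/"
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Principles 1 and 2: things are named by URIs and related to
# one another through (subject, predicate, object) statements.
triples = {
    (EX + "marseille", TYPE, EX + "City"),
    (EX + "City", SUBCLASS, EX + "Place"),
    (EX + "Place", SUBCLASS, EX + "Thing"),
    (EX + "marseille", EX + "locatedIn", EX + "france"),
}


def infer_types(graph):
    """Return the graph plus the type statements implied by subclass links.

    Principle 3: apply one rule until nothing new appears:
    (x type C) and (C subClassOf D)  =>  (x type D)
    """
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            if p != TYPE:
                continue
            for (c, p2, d) in inferred:
                if p2 == SUBCLASS and c == o and (s, TYPE, d) not in inferred:
                    new.add((s, TYPE, d))
        if new:
            inferred |= new
            changed = True
    return inferred


closed = infer_types(triples)
```

After the call, `closed` also states that the hypothetical `marseille` resource is a `Place` and a `Thing`, although neither statement was published explicitly.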
9. (9)
We have a number of standards
[Diagram of the standards: URI, RDF 1.1, SPARQL 1.1, JSON-LD, Turtle, RDFa, RDF/XML]
• RDF: data model, links, basic assertions; different serializations
• SPARQL: querying data
A fairly stable set of technologies by now!
10. (10)
We have a number of standards
[Diagram of the standards: URI, RDF 1.1, RDFS 1.1, SPARQL 1.1, OWL 2, RDB2RDF, JSON-LD, Turtle, RDFa, RDF/XML]
• RDF: data model, links, basic assertions; different serializations
• SPARQL: querying data
• RDFS: simple vocabularies
• OWL: complex vocabularies, ontologies
• RDB2RDF: databases to RDF
A fairly stable set of technologies by now!
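To give a feel for what "querying data" means on a graph, here is a toy matcher that mimics the idea behind a SPARQL basic graph pattern: terms starting with `?` are variables, and all patterns must match with consistent bindings. It is a sketch under simplifying assumptions, not a SPARQL engine; the `ex:` names and the `match` and `query` helpers are hypothetical, and a literal that happens to start with `?` would be misread as a variable.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Returns the extended binding, or None if the match fails.
    """
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# Made-up sample graph.
graph = [
    ("ex:alice", "ex:worksFor", "ex:w3c"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:worksFor", "ex:odi"),
    ("ex:bob", "ex:name", "Bob"),
]

# Roughly: SELECT ?p ?n WHERE { ?p ex:worksFor ex:w3c . ?p ex:name ?n }
results = query(graph, [("?p", "ex:worksFor", "ex:w3c"), ("?p", "ex:name", "?n")])
```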
12. (12)
Integration is done in different ways
• Very roughly:
  • data is accessed directly as RDF and turned into something useful
    • relies on data being “preprocessed” and published as RDF
  • data is collected from different sources, integrated internally
    • using, say, a triple store
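The second approach can be sketched as follows, with an in-memory Python set standing in for a triple store. Everything here is an assumption for illustration: the `BASE` URI scheme, the `tt:` and `geo:` prefixes, the `to_triples` and `describe` helpers, and the sample values, which are made up.

```python
BASE = "http://example.org/station/"  # hypothetical URI scheme


def to_triples(records, key, prefix):
    """Turn records from one source into triples about shared URIs."""
    triples = set()
    for record in records:
        subject = BASE + record[key]
        for field, value in record.items():
            if field != key:
                triples.add((subject, prefix + field, value))
    return triples


def describe(store, subject):
    """Collect everything the store says about one subject."""
    return {p: o for (s, p, o) in store if s == subject}


# Two independent sources that use different field names for the same thing.
timetable = [{"id": "st-1", "lines": "M1,M2"}]
geodata = [{"station": "st-1", "lat": "43.30", "lon": "5.38"}]

# Integration is a set union once both sources agree on the URIs.
store = to_triples(timetable, "id", "tt:") | to_triples(geodata, "station", "geo:")
merged = describe(store, BASE + "st-1")
```

The design point is that the sources never had to agree on a schema, only on how the station is identified.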
15. (15)
However…
• There is a price to pay: a relatively heavy ecosystem
  • many developers shy away from using RDF and related tools
• Not all applications need this!
  • data may be used directly, no need for integration concerns
  • the emphasis may be on easy production and manipulation of data with simple tools
16. (16)
Typical situation on the Web
• Data published in CSV, JSON, XML
• An application uses only 1-2 datasets; integration done by direct programming is straightforward
  • e.g., in a Web Application
• Data is often very large, direct manipulation is more efficient
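A minimal sketch of such "integration by direct programming", using only the Python standard library. The column names, the JSON layout, and all values are made up for the example; the point is only that two small datasets can be combined without any RDF tooling.

```python
import csv
import io
import json

# Made-up sample data standing in for two published files.
CSV_TEXT = """region,prescriptions
north,120
south,80
north,30
"""
JSON_TEXT = '{"north": {"population": 1000}, "south": {"population": 400}}'


def totals_by_region(csv_text):
    """Sum the prescriptions column per region."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        region = row["region"]
        totals[region] = totals.get(region, 0) + int(row["prescriptions"])
    return totals


def per_capita(csv_text, json_text):
    """Join the CSV totals with the JSON population figures by region name."""
    population = json.loads(json_text)
    return {
        region: total / population[region]["population"]
        for region, total in totals_by_region(csv_text).items()
    }


rates = per_capita(CSV_TEXT, JSON_TEXT)
```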
17. (17)
Non-RDF Data
• In some settings that data can be converted into RDF
• But, in many cases, it is not done
  • e.g., CSV data is way too big
  • RDF tooling may not be adequate for the task at hand
  • integration is not a major issue
19. (19)
What that application does…
• Gets the data published by NHS
• Processes the data (e.g., through Hadoop)
• Integrates the result of the analysis with geographical data
I.e., the raw data is used without integration
20. (20)
The reality of data on the Web…
• It is still a fairly messy space out there :-(
  • many different formats are used
  • data is difficult to find
  • published data are messy, erroneous
  • tools are complex, unfinished…
21. (21)
How do developers perceive this?
‘When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as "unachievable" and "disruptive."’
-- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation
22. (22)
One may look at the problem through different goggles
• Two alternatives come to the fore:
  1. provide tools, environments, etc., to help outsiders to publish Linked Data (in RDF) easily
    • a typical example is the Datalift project
  2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead
25. (25)
Open Data on the Web Workshop
• Had a successful workshop in London, in April:
  • around 100 participants
  • coming from different horizons: publishers and users of Linked Data, CSV, PDF, …
26. (26)
We also talked to our “stakeholders”
• Member organizations and companies
• Open Data Institute, Open Knowledge Foundation, Schema.org
• …
27. (27)
Some takeaways
• The Semantic Web community needs stability of the technology
  • do not add yet another technology block :-)
  • existing technologies should be maintained
28. (28)
Some takeaways
• Look at the more general space, too
  • importance of metadata
  • deal with non-RDF data formats
  • best practices are necessary to raise the quality of published data
30. (30)
Metadata is of major importance
• Metadata describes the characteristics of the dataset
  • structure, datatypes used
  • access rights, licenses
  • provenance, authorship
  • etc.
• Vocabularies are also key for Linked Data
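As an illustration of why such metadata matters, the sketch below describes a dataset with a small Python dictionary and uses the declared structure to check incoming rows. The key names (`title`, `license`, `publisher`, `columns`, `datatype`), the `CASTS` table, and the `check_row` helper are hypothetical and do not follow any particular standard vocabulary.

```python
# Hypothetical metadata record for a made-up dataset.
dataset_metadata = {
    "title": "Example prescriptions dataset",
    "license": "http://example.org/licenses/open",
    "publisher": "Example Agency",
    "columns": [
        {"name": "region", "datatype": "string"},
        {"name": "prescriptions", "datatype": "integer"},
    ],
}

# How each declared datatype is checked in this sketch.
CASTS = {"string": str, "integer": int, "number": float}


def check_row(row, metadata):
    """Return a list of problems found in one row (empty if the row is fine)."""
    problems = []
    for column in metadata["columns"]:
        name = column["name"]
        if name not in row:
            problems.append(f"missing column: {name}")
            continue
        try:
            CASTS[column["datatype"]](row[name])
        except ValueError:
            problems.append(f"{name}: not a valid {column['datatype']}")
    return problems


good = check_row({"region": "north", "prescriptions": "12"}, dataset_metadata)
bad = check_row({"region": "north", "prescriptions": "many"}, dataset_metadata)
```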
31. (31)
Vocabulary Management Action
• Standard vocabularies are necessary to describe data
  • there are already some initiatives: W3C’s Data Cube, Data Catalog, PROV, schema.org, DCMI, …
• At the moment, it is a fairly chaotic world…
  • many, possibly overlapping vocabularies
  • difficult to locate the one that is needed
  • vocabularies may not be properly managed, maintained, versioned, or provided with persistence…
32. (32)
W3C’s plan:
• Provide a space whereby
  • communities can develop
  • host vocabularies at W3C if requested
  • annotate vocabularies with a proper set of metadata terms
  • establish a vocabulary directory
• The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/
33.
34. (34)
CSV on the Web
• Planned work areas:
  • metadata vocabulary to describe CSV data
    • structure, reference to access rights, annotations, etc.
  • methods to find the metadata
    • part of an HTTP header, special rows and columns, packaging formats…
  • mapping content to RDF, JSON, XML
• Possibly at a later phase:
  • API standards to access CSV data
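The "mapping content to RDF, JSON" work area might look roughly like the sketch below, where a metadata record drives the conversion of CSV rows into typed JSON and into N-Triples-style statements. This is an assumption-laden illustration, not the planned W3C design: the metadata keys (`about_url`, `columns`, `datatype`, `property`), the `example.org` URIs, the helper names, and the sample rows are all invented here.

```python
import csv
import io
import json

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Hypothetical metadata describing the CSV below.
metadata = {
    "about_url": "http://example.org/stop/{id}",
    "columns": {
        "id": {"datatype": "string"},
        "name": {"datatype": "string", "property": "http://example.org/vocab#name"},
        "lines": {"datatype": "integer", "property": "http://example.org/vocab#lineCount"},
    },
}

# Made-up sample data.
CSV_TEXT = "id,name,lines\ns1,Stop A,1\ns2,Stop B,2\n"


def typed_rows(csv_text, meta):
    """Yield each CSV row as a dict, casting cells to their declared datatype."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            name: int(cell) if meta["columns"][name]["datatype"] == "integer" else cell
            for name, cell in row.items()
        }


def to_json(csv_text, meta):
    """Map the CSV content to a JSON array of typed objects."""
    return json.dumps(list(typed_rows(csv_text, meta)))


def to_ntriples(csv_text, meta):
    """Map the CSV content to N-Triples-style lines, one per mapped cell."""
    lines = []
    for row in typed_rows(csv_text, meta):
        subject = meta["about_url"].format(id=row["id"])
        for name, value in row.items():
            prop = meta["columns"][name].get("property")
            if prop is None:
                continue  # column has no RDF mapping (e.g. the row identifier)
            if isinstance(value, int):
                obj = f'"{value}"^^<{XSD_INTEGER}>'
            else:
                # Minimal escaping only; a real serializer handles more cases.
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                obj = f'"{escaped}"'
            lines.append(f"<{subject}> <{prop}> {obj} .")
    return lines


as_json = json.loads(to_json(CSV_TEXT, metadata))
as_triples = to_ntriples(CSV_TEXT, metadata)
```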
35.
36. (36)
Open Data Best Practices
• Document best practices for data publishers
  • management of persistence, versioning, URI design
  • use of core vocabularies (provenance, access control, ownership, annotations, …)
  • business models
• Specialized Metadata vocabularies
  • quality description (quality of the data, update frequencies, correction policies, etc.)
  • description of data access APIs
• …
37. (37)
Summary
• Data on the Web has many different facets
• We have concentrated on the integration aspects in the past years
• We have to take a more general view, look at other types of data published on the Web
38. (38)
In future…
• We should look at other formats, not only CSV
  • MARC, GIS, ABIF, …
• Better outreach to data publishing communities and organizations
  • WF, RDA, ODI, OKFN, …