Small Data: Bridging the Gap Between Generic and Specific Repositories
1. Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories
April 11, 2013
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
2. There are many efforts to enhance data storing and sharing...
• Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)
• Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever, etc.)
• Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)
• Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)
• You can make a living out of this ;-)! (and many of us do…)
3. …but this is what scientists do:
Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of this, and writes a paper. End of story.
4. Why save research data?
A. Data Preservation:
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Data Use:
– Use results obtained by others
– Do better science!
– Improve interdisciplinary work
5. Where the data goes now:
> 50 M papers; 2 M scientists; 2 M papers/year
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories:
– PDB: 88.3 k
– PetDB: 1.5 k
– SedDB: 0.6 k
– MiRB: 25 k
– TAIR: 72.1 k
• Some data (8%?) is stored in large, generic data repositories:
– Dryad: 7,631 files
– Dataverse: 0.6 M
– Datacite: 1.5 M
• The majority of data (90%?) is stored on local hard drives
6. So this needs to happen: INCREASE DATA PRESERVATION
> 50 M papers; 2 M scientists; 2 M papers/year
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories:
– PDB: 88.3 k
– PetDB: 1.5 k
– SedDB: 0.6 k
– MiRB: 25 k
– TAIR: 72.1 k
• Some data (8%?) is stored in large, generic data repositories:
– Dryad: 7,631 files
– Dataverse: 0.6 M
– Datacite: 1.5 M
• The majority of data (90%?) is stored on local hard drives
7. Data Preservation Issues:
Objection: “Our lab notebooks are all on paper – it’s how we do things”
Response: Graft tools closely on scientists’ daily practice
Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks
8. Data Preservation Issues:
Objection: “I need to see a direct benefit of any effort I put in.”
Response: Create tools to allow better insight into one’s own and others’ results.
Example: ‘PI-Dashboard’: allow immediate access to and analysis of shared data: new science!
9. Data Use Issues:
Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”
Response: Create social networking context; allow data owner to provide granular access control.
Example:
• In Urban Lab app, data stored by researcher name.
• PI decides who gets to see which data
• Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
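The access model in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Urban Lab app's actual design: the class names `Dataset` and `Lab`, the method names, and the sample names (`pi_smith`, `student_a`, `western_blot_01`, and so on) are all hypothetical, and the rule that the PI and the producing researcher can always see a dataset is an assumption added here.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One dataset, filed under the researcher who produced it."""
    name: str
    researcher: str


@dataclass
class Lab:
    """A lab whose PI decides who may see which dataset (hypothetical model)."""
    pi: str
    datasets: dict = field(default_factory=dict)  # dataset name -> Dataset
    grants: dict = field(default_factory=dict)    # dataset name -> set of viewers

    def add_dataset(self, dataset: Dataset) -> None:
        self.datasets[dataset.name] = dataset
        self.grants.setdefault(dataset.name, set())

    def grant(self, granted_by: str, dataset_name: str, viewer: str) -> None:
        # Only the PI may change access, following the second bullet above.
        if granted_by != self.pi:
            raise PermissionError("only the PI can grant access")
        if dataset_name not in self.datasets:
            raise KeyError(dataset_name)
        self.grants[dataset_name].add(viewer)

    def can_view(self, viewer: str, dataset_name: str) -> bool:
        dataset = self.datasets[dataset_name]
        # Assumption: the PI and the producing researcher always have access.
        if viewer in (self.pi, dataset.researcher):
            return True
        return viewer in self.grants[dataset_name]

    def visible_to(self, viewer: str) -> list:
        """Names of all datasets this viewer may see, sorted."""
        return sorted(n for n in self.datasets if self.can_view(viewer, n))


lab = Lab(pi="pi_smith")
lab.add_dataset(Dataset("western_blot_01", researcher="student_a"))
lab.add_dataset(Dataset("western_blot_02", researcher="student_b"))
lab.grant("pi_smith", "western_blot_01", "collaborator_x")
print(lab.visible_to("collaborator_x"))
```

Access is granted per dataset rather than per lab, which is what "granular" is taken to mean here; a real system would also need revocation and an audit trail.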
10. Data Use Issues:
• Objection: “I am afraid other people might scoop my discoveries”
• Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)
• Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.
The 2013 International Data Rescue Award in the Geosciences
Organised by IEDA and Elsevier Research Data Services
http://researchdata.elsevier.com/datachallenge
11. Data Preservation and Annotation:
Fine, I’ll do it – but where the hell do I put it?
• Domain of study: Domain-Specific Data Repository
• Collaborators: Local Data Repository
• Funding Agency: Generic Data Repository
• University: Institutional Data Repository
AND THEY ALL WANT DIFFERENT METADATA!!!!
12. Comparing Repository Types:
Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements.
Institutional repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators
Side labels: ‘Habit, Ease, Privacy, Control’ and ‘Effort, Reuse, Credit, Compliance’, with an arrow marked MORE ANNOTATION
13. Conclusions for data annotation:
“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have”
Deep Thoughts by Jack Handey
• Let’s use the data standards we already have – and agree on using the same ones
• Work with existing data repositories in a field to come to a lowest common denominator of metadata
• Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
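One way to picture a "lowest common denominator of metadata" is as the set of fields that every repository in a field already asks for. The sketch below is only an illustration: the repository labels and field names are made up and do not come from any real schema, and the function names `common_fields` and `extra_fields` are hypothetical.

```python
# Hypothetical metadata fields per repository; not taken from any real schema.
repository_fields = {
    "domain_specific": {"title", "creator", "date", "sample_id", "instrument", "method"},
    "institutional": {"title", "creator", "date", "department", "grant_number"},
    "generic": {"title", "creator", "date", "license", "keywords"},
}


def common_fields(fields_by_repo):
    """Return the fields that every repository asks for."""
    field_sets = list(fields_by_repo.values())
    if not field_sets:
        return set()
    return set.intersection(*field_sets)


def extra_fields(fields_by_repo):
    """Return, per repository, the fields it asks for beyond the shared core."""
    core = common_fields(fields_by_repo)
    return {repo: fields - core for repo, fields in fields_by_repo.items()}


core = common_fields(repository_fields)
extras = extra_fields(repository_fields)
print(sorted(core))  # prints ['creator', 'date', 'title']
```

Under this reading, a scientist would enter the shared core once, and each repository's extra fields only when depositing there, which is one interpretation of "add as little as you have to, as few times as you can."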
14. Summary:
• Data Preservation:
– Tailor tools to fit scientists’ workflow – follow the experiment!
– We are creating repositories of shared experiments: Enable demonstrably better science!
• Data Use:
– Allow owner full control over who sees which data - create social networking context
– Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges
• How annotation can help reuse:
– Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.
15. Questions?
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
16. Elsevier Research Data Services Goals:
1. Increase Data Preservation: Help increase the amount and quality of data preserved and shared
2. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability
3. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.
17. Guiding Principles of RDS:
• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas.
18. “But aren’t you guys in it for the money?”
• Yes, we are, like most businesses…
• Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’
• The OA debate focuses on three issues:
– IPR and Access issues. E.g. BY-NC-SA? Github? ..?
– Opaque business models. E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
– Lack of perceived added value. We offer a service: only use it if it’s any good!