Driving Behavioral Change for Information Management through Data-Driven Gree...
DataONE_cobb_hubbub2012_20120924_v05
1. DataONE:
An
interoperable
data
repositories
case
study
John
W.
Cobb
R&D
Staff
and
DataONE
Leadership
Team
Member
Oak
Ridge
Na;onal
Laboratory
HUBbub
2012
,
the
HUBzero
conference
Indianapolis,
IN
24
September
2012
2. Acknowledgment:
• Authorship:
This
talk
represents
work
of
the
en;re
DataONE
extended
team.
• It
especially
draws
upon
slide
material
from
• Bill
Michener,
UNM
(esp.
recent
DataONE
AHM
Sept.
18,
2012)
• Amber
Budden
–
DataONE
Ass.
Dir.
For
CE
• DataONE
is
an
NSF
supported
project
(OCI-‐0830944)
2
3. Hubs
and
data
repositories
• A
personal
view
(apologies
for
a
possibly
mis-‐informed
speaker)
• HUB-‐roots
(history
and
pre-‐history)
• PUNCH:
web
portal
for
running
tools
(DOI:
10.1109/40.846308)
• -‐>
NanoHUB:
Applica;on
orchestra;on
environment
• +
RAPPTURE:
Rapid
Applica;on
por;ng
and
development
• +
Framelesss
VNC
windows
–
seamless
hosted
environment
on
clients!
• +
Rich
collabora;ve
environment
and
rich
user
experience
!!
(“wishlist”)
• Repurpose:
Hubzero
-‐>
hubs
explode
(ex.
NEESHub
a
cri;cal
advantage
for
largest
research
award
in
Purdue
history)
• Now
(and
recent
past)
turn
to
Hub+Data
Integra;on.
Some
successes
already
• Opportunity:
Richer
interac;ons
between
HUB’s
and
mul;ple
data
repositories
• Perhaps
for
example:
Enable
mul;-‐project
collabora;on
within
PURR?
• Or:
Integrate
NEES
DB’s
with
SCEC
simula;ons
and
IRIS
waveforms?
3
4. Mul;ple
data
repository
access?
• HUB
+
Database
exists
• HUB
+
external
data
repository
access
use
case.
• But
…..
What
if?
• Access
mul;ple
(possibly
external)
repositories
from
within
a
HUB
environment?
• Access
mul;ple
external
repositories
with
similar
data?
Say
aggregate
all
data
from
state
hydrologists?
C.f.
driNET
hip://drinet.hubzero.org
• Integrate
disparate
data
sets
for
new
and
novel
analysis.
Recall
Noshir
Contractor’s
comments
this
morning:
teaming
and
interdisciplinary
work
has
increased
impact
(Wuchty,
Jones,
Uzzi)
• Enable
reproducible
analysis
and
synthesis
via
a
automated
workflow
to
create
synthe;c
data
products
• Programma;c
access
• More
integra;on
(more
than
just
raw
search
terms
a
la
Google)
• …
• What
do
you
want
to
discover
today?
(to
paraphrase
Microsol)
4
5. DataONE
mo;va;on
• DataONE
is
a
project
to
address
Plan
these
issues
• Build
(assemble/aggregate)
Analyze
Collect
data
repository
interoperability
• Advance
state
of
the
prac;ce
data
lifecycle
management
• Planning
Integrate
Assure
• Deposi;on
• Metadata
genera;on
• Seman;c
integra;on
Discover
Describe
• Workflow
and
provenance
• Analysis
Preserve
• Synthesis
• Focus
on
a
broad
science
area
• Deploy
a
working
CI
and
grow
it
• DataONE
–
Data
Observa;on
Network
Earth
5
7. Pressing
issues
for
the
digital
data
lifecycle
Plan
Analyze
Collect
Integrate
Assure
Discover
Describe
Preserve
7
8. Mul;ple
data
sources
–
mutually
reinforcing
Increasing
Process
Knowledge
Decreasing
Spa;al
Coverage
Intensive
science
sites
and
experiments
Extensive
science
sites
Volunteer
&
educa;on
networks
Remote
sensing
Adapted
from
CENR-‐OSTP
8
9. Scaiered
data
sources
“finding
the
needle
in
the
haystack”
Data
are
massively
dispersed
• Ecological
field
sta;ons
and
research
centers
(100s)
• Natural
history
museums
and
biocollec;on
facili;es
(100s)
• Agency
data
collec;ons
(100s
to
1000s)
• Individual
scien;sts
(1000s
to
10,000s
to
100,000s)
9
11. Preserva;on:
Poor
data
prac;ce
“data
entropy”
Time
of
publica?on
Specific
details
General
details
Re?rement
or
Informa?on
Content
career
change
Accident
Death
Time
(Michener
et
al.
1997)
11
12. Preserva;on:
Data
longevity
Resource
Study Resource Type
Half-life
Rumsey (2002) Legal Citations 1.4 years
Harter and Kim (1996) Scholarly Article Citations 1.5 years
Koehler (1999 and 2002) Random Web Pages 2.0 years
Spinellis (2003) Computer Science 4.0 years
Citations
Markwell and Brooks Biological Science 4.6 years
(2002) Education Resources
Nelson and Allen (2002) Digital Library Object 24.5 years
Koehler,
W.
(2004)
Informa(on
Research
9(2):
174.
12
13. The Long Tail of Orphan Data
The
Ultra-‐violet
Most of the bytes
divergence
are at the high end,
Specialized repositories
but most of the
(e.g. GenBank, PDB)
datasets are at the
Volume
low end – Jim Gray
The
Infrared
Orphan data
Catastrophe
(B. Heidorn)
Rank frequency of datatype
13
14. Data
deluge
and
interoperability
“the
flood
of
increasingly
heterogeneous
data”
Data
are
heterogeneous
• Syntax
• (format)
• Schema
• (model)
• Seman;cs
• (meaning)
Jones
et
al.
2007
14
15. Metadata
universe
(mul;-‐verse)
• There
are
a
mul;tude
of
metadata
standards
• Discipline
and
sub-‐discipline
specific
• Each
with
different
terms
and
context
Source:
Jenn
Riley,
Indiana
U.
Digital
Librarian
hip://www.dlib.indiana.edu/~jenlrile/metadatamap/
Via
John
Kunze,
Cal.
Dig.
Lib
15
16. Each
dot
is
its
own
standard
!
“…billions
and
billions
of
worlds
…”
–
Carl
Sagan
16
17. DataONE
CI
architectural
Elements
• Hard-‐core
cyberinfrastructure
(CI)
• CI
Member
Node
(MN)
data
repositories
• Coordina;ng
Node
(CN)
global
metadata
repo’s
Investigator Toolkit
Simple,
but
powerful
REST
API/SPI
for
Web Interface Analysis, Visualization Data Management
•
universal
access
Client Libraries
• Inves;gator
toolkit
(ITK)
solware
tools
Java Python Command Line
to
allow
access
to
the
data
repository
collec;ve
via
familiar
access
idioms
Member Nodes Coordinating Nodes
• Cultural
and
wetware
issues
Service Interfaces
Resolution Discovery
Educa;onal
Materials
Service Interfaces
• Replication Registration
• Best
prac;ces
Bridge to non-DataONE Coordination Layer
• Workshops
and
tutorials
Member Node services Identifiers Catalog
• Surveys
and
assessments
Preservation Monitor
Data Repository
• Scien;st,
policymaker,
ci;zen
Object Store Index
engagement
• Collabora;on,
governance,
and
sustainability
hip://mule1.dataone.org/ArchitectureDocs-‐current/
17
19. Key
Cyberinfrastructure
Elements
• Unique
iden;fiers
• Search
and
deliver
• Replica;on
• Federated
iden;ty
Usable
by
People
and
their
Agents
19
20. Suppor;ng
the
data
lifecycle
ORC
Node
UCSB
Node
UNM
Node
1. Deposi;on/acquisi;on/ingest
}
2. Cura;on
and
metadata
management
The
data
3. Protec;on,
including
privacy
lifecycle
4. Discovery,
access,
use,
and
dissemina;on
5. Interoperability,
standards,
and
integra;on
6. Evalua;on,
analysis,
and
visualiza;on" 20
21. DataONE
Supports
Data
Preserva;on
Three
major
components
for
a
Member
Nodes
flexible,
scalable,
sustainable
• diverse
ins;tu;ons
Coordina?ng
Nodes
network
• serve
local
community
• retain
complete
metadata
Inves?gator
Toolkit
• provide
resources
for
catalog
managing
their
data
• indexing
for
search
• retain
copies
of
data
• network-‐wide
services
• ensure
content
availability
(preserva;on)
• replica;on
services
21
22. DataONE
sa;sfies
arch
requirements
• Enables
integra;on
of
mul;ple
geographically
diverse
and
metadata
diverse
repositories
• Presents
collec;ve
search
results
across
mul;ple
repository
• Provides
a
unified
API/SPI
for
search
and
programma;c
interface
hip://mule1.dataone.org/ArchitectureDocs-‐current/
• DataONE
content
has
unique
iden;fiers
(DOI’s)
for
referencable/citable
data
objects
• Supports
both
large
datasets
and
the
long-‐tails
22
23. DataONE
spurs
innova;on
• Enables
new
analysis
and
synthesis
efforts
by
integra;ng
tasks
across
repositories
• Provides
means
for
data
replica;on
and
basis
for
repositories
to
build
“data
wills”
or
“data
trust”
plans
• Provides
a
plavorm
to
develop
advanced
interoperable
workflow
tools
and
seman;c
integra;on
tools
23
25. DataONE:
Suppor;ng
Scien;fic
Data
Preserva;on,
Discovery,
and
Innova;on
Current
Member
Nodes:
Coming
Soon:
Current
Tools:
Tools
Coming
Soon:
Queensland
University
of
Technology
25
29. Plans
per
template
(as
of
June
2012)
400
Approximate
number
of
plans
per
template
350
339
300
287
Templates
of
greatest
interest
to
the
DataONE
community
in
red;
250
2,302
unique
users
to
date
197
200
159
150
133
133
124
101
100
71
65
60
46
37
36
50
34
17
15
6
0
hip://dmptool.org
29
30. ✔ Check
for
best
prac;ces
✔ Create
metadata
✔ Connect
to
ONEShare
Data
&
Metadata
(EML)
30
37. Inves;gator
Toolkit
Support
Plan
DMP-Tool
Analyze
Collect
Kepler
Integrate
Assure
Discover
Describe
Preserve
37
38. Explora;on,
Visualiza;on,
and
Analysis
Diverse
bird
observa;ons
and
Model
results
environmental
data
from
300,00
loca;ons
in
the
US
Occurrence
of
Indigo
Bun?ng
(2008)
integrated
and
analyzed
using
High
Performance
Compu;ng
Resources
Land
Cover
Jan
Apr
Jun
Sep
Dec
Meteorology
• Examine
paierns
of
migra;on
MODIS
–
Spa;o-‐Temporal
Exploratory
• Infer
how
climate
Remote
Model
iden;fies
factors
change
may
affect
affec;ng
paierns
of
migra;on
sensing
data
bird
migra;on
38
39. Public
Par?cipa?on
in
Scien?fic
Research
Conference:
4-‐5
August
2012
in
Portland,
Oregon
USA
prior
to
Ecological
Society
of
America
mee;ng
(6-‐10
Aug.):
hip://www.birds.cornell.edu/citscitoolkit/conference/2012
39
40. User
Assessments
Scien;sts:
BL
Scien;sts:
FU
Library
Policies:
BL
Library
Policies:
FU
Librarians:
BL
Librarians:
FU
Policy
Makers:
BL
Policy
Makers:
FU
Educators:
BL
Educators:
FU
Year
1
Year
2
Year
3
Year
4
Year
5
40
41. What
standard
do
you
currently
use?
676
266
95 95 96 97
12 21 26
DIF DwC DC EML FGDC Open ISO My Lab none
GIS
Metadata
language
41
42. Many
are
interested
in
sharing
data
Willing
to
share
data
across
a
broad
81%
group
of
researchers
Willing
to
place
at
least
some
of
my
78%
data
into
a
central
data
repository
with
no
restric;ons
Appropriate
to
create
new
datasets
76%
from
shared
data
Willing
to
place
all
of
my
data
into
a
41%
central
data
repository
with
no
restric;ons
0%
20%
40%
60%
80%
100%
Percent
agree
42
43. Modeler
Scientist
Manager
Resource
Ecological
Data Librarians
Data
Service
User
Matrix
Investigator
ToolKit
Data
Management
Planning
Best
Practices
Tools
Database
Training
Curricula
43
47. DataONE:
Next
steps
• Member
node
growth
• Number
of
member
nodes
• Increase
the
number
and
size
of
data
sets
• Sustainably
• In
terms
of
resource
needs
form
MN’s
• In
terms
of
resource
demands
on
DataONE
• New
Inves;gator
toolkit
tools
(strategically)
• An
increasing
number
of
science
use
cases
with
more
breakthrough
science
• Also,
re-‐purposing
DataONE
CI
outside
of
Bio/Eco/
Env
areas
in
strategic
collabora;ve
partnerships
47
48. Ack:
DataONE
Team
and
Sponsors
• Amber
Budden,
Roger
Dahl,
Rebecca
Koskela,
Bill
• Ewa
Deelman
Michener,
Robert
Nahf,
Skye
Roseboom,
Mark
Servilla
• Deborah
McGuinness
•
Dave
Vieglais
• Suzie
Allard,
Nick
Dexter,
Kimberly
Douglass,
• Jeff
Horsburgh
Carol
Tenopir,
Robert
Waltz,
Bruce
Wilson
• John
Cobb,
Bob
Cook,
Ranjeet
Devarakonda,
• Robert
Sandusky
Giri
Palanismy,
Line
Pouchard
• Patricia
Cruse,
John
Kunze
• Bertram
Ludaescher
• Sky
Bristol,
Mike
Frame,
Richard
Huffine,
Viv
• Peter
Honeyman
Hutchison,
Jeff
Moriseie,
Jake
Weltzin,
Lisa
Zolly
• Stephanie
Hampton,
Chris
Jones,
Mai
• Cliff
Duke
Jones,
Ben
Leinfelder,
Andrew
Pippin
• Paul
Allen,
Rick
Bonney,
Steve
Kelling
• Carole
Goble
• Ryan
Scherle,
Todd
Vision
• Donald
Hobern
• Randy
Butler
• David
DeRoure
LEON LEVY
FOUNDATION
48
49. Ques;ons?
Contact
Points
John
W.
Cobb,
Ph.D.
Oak
Ridge
John
W.
Cobb,
Ph.D.
Oak
Ridge
Na;onal
Lab
cobbjw@ornl.gov
865.576.5439
hip://www.dataone.org/
hip://docs.dataone.org
49