NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an era of data-intensive science

Data Observation Network for Earth (DataONE): Supporting Scientific Data Preservation, Discovery, and Innovation

Bill Michener

Professor and DataONE Project Director
University of New Mexico

24 September 2012

National Information Standards Organization

Research and Data Life Cycle Integration

[Figure: the research life cycle (Ideas, Proposal writing, Research, Publication) linked to the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)]

Three Key Challenges
[Figure: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) with the label "Innovation"]

1. Data Preservation and Planning


The Long Tail of Orphan Data

“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray

[Figure: Volume plotted against rank frequency of datatype; specialized repositories (e.g. GenBank, PDB) at the high-volume end, orphan data in the long tail (B. Heidorn)]

Planning?

Metadata standard?
Data repository?

DataONE and the DMPTool Support Data Preservation

Three major components for a flexible, scalable, sustainable network

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data

Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services

Investigator Toolkit

Dryad (>3,000 data products)
• Coordinated submission of articles and underlying data
• Handshaking with specialized repositories
• Promotion of reuse and incentives for deposit

Knowledge Network for Biocomplexity
(20,000+ data packages)
Data Types
• Ecological
• Environmental
• Demographic
• Social/Legal/Economic

Contributors
• Individual investigators
• Field stations and networks
• Government agencies
• Non-profit partnerships
• Synthesis centers

[Figure: Data Sizes, in percent by size class in MB (<1, 1-10, 10-200, >200)]

✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare

Data & Metadata (EML)

Data Management Planning Tool


2. Data Discovery


The DataONE Federation


Member Node Functional Tiers

Tier 1: Read only, public content
ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed()

Tier 2: Read only, with access control
isAuthorized(), setAccessPolicy()

Tier 3: Read/Write using client tools
create(), update(), delete()

Tier 4: Able to operate as a replication target
replicate(), getReplica()

http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
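The tier list above can be read as a capability check. The following is a minimal sketch, not DataONE code: the method names are transcribed from the slide, while `TIER_METHODS`, `highest_tier`, and the assumption that each tier also requires every lower tier are illustrative choices made here.

```python
# Tier-to-method mapping transcribed from the slide.
# Assumption (not stated on the slide): tiers are cumulative, so a node
# at tier N must also support every method of tiers 1..N-1.
TIER_METHODS = {
    1: {"ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"},
    2: {"isAuthorized", "setAccessPolicy"},
    3: {"create", "update", "delete"},
    4: {"replicate", "getReplica"},
}


def highest_tier(supported_methods):
    """Return the highest tier fully covered by supported_methods (0 if none).

    Hypothetical helper: it walks the tiers in order and stops at the
    first tier with a missing method.
    """
    supported = set(supported_methods)
    tier = 0
    for level in sorted(TIER_METHODS):
        if TIER_METHODS[level] <= supported:
            tier = level
        else:
            break
    return tier


if __name__ == "__main__":
    read_only_node = TIER_METHODS[1] | TIER_METHODS[2]
    print(highest_tier(read_only_node))
```

Under the cumulative assumption, a node that implements create() but lacks setAccessPolicy() would still be reported as Tier 1.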


ORNL DAAC as a DataONE Member Node

[Figure: ORNL DAAC as a Member Node; labels: NASA collectors, DAAC Users (UWG), Investigator Toolkit, DataONE Users]

1. Ontology-based discovery

[Figure: search results, with callouts]
• Concepts acquire context: biomass as Material or biomass as Energy
• Additional search terms
• Super-classes may have different properties

1. NCBO ontology repository instance
2. Populated with ontologies (e.g., the NASA-JPL Semantic Web for Earth and Environmental Terminology)
3. Queried ontologies and returned results using REST services

Approach 2: Enrich MN Metadata
                            DAAC     DRYAD    KNB
Number of Documents         978      1,729    24,249
Total Number of Keywords    7,294    8,266    254,525
Average Keywords/Document   7.46     4.78     10.49

[Figure: bar chart of average keywords per document for DAAC, DRYAD, and KNB; axis from 0 to 12]
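The averages in the table follow from dividing total keywords by number of documents. As a quick check, the sketch below recomputes them from the counts on the slide; the names `STATS` and `average_keywords` are illustrative only.

```python
# (documents, total keywords, average reported on the slide) per Member Node.
STATS = {
    "DAAC": (978, 7294, 7.46),
    "DRYAD": (1729, 8266, 4.78),
    "KNB": (24249, 254525, 10.49),
}


def average_keywords(documents, keywords):
    """Average number of keywords per document."""
    return keywords / documents


if __name__ == "__main__":
    for node, (docs, kws, reported) in STATS.items():
        avg = average_keywords(docs, kws)
        print(f"{node}: {avg:.4f} keywords/document (slide reports {reported})")
```

The recomputed values agree with the reported ones to within 0.01.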

Actual Keywords
1. canopy characteristics
2. field investigation
3. vegetation index
4. leaf characteristics
5. Satellite
6. land cover
7. leaf area meter
8. Reflectance
9. steel measuring tape
10. vegetative cover
11. plant characteristics
12. albedo

Suggested Keywords
[1] field investigation
[2] analysis
[3] land cover
[4] computational model
[5] reflectance
[6] vegetative cover
[7] biomass
[8] primary production
[9] steel measuring tape
[10] weigh balance
[11] precipitation amount
[12] canopy characteristics
[13] leaf characteristics
[14] water vapor
[15] quadrat sample frame
[16] rain gauge
[17] surface air temperature
[18] air temperature
[19] meteorological station
[20] human observer
[21] vegetation index
[22] soil core device
[23] plant characteristics
[24] surface wind
[25] albedo
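The slide does not say how the suggested keywords were generated, so the sketch below only compares the two lists as shown. It is an illustration under stated assumptions: the function `compare_keywords` and the case-insensitive matching rule are choices made here, not part of the DataONE work.

```python
ACTUAL = [
    "canopy characteristics", "field investigation", "vegetation index",
    "leaf characteristics", "Satellite", "land cover", "leaf area meter",
    "Reflectance", "steel measuring tape", "vegetative cover",
    "plant characteristics", "albedo",
]

SUGGESTED = [
    "field investigation", "analysis", "land cover", "computational model",
    "reflectance", "vegetative cover", "biomass", "primary production",
    "steel measuring tape", "weigh balance", "precipitation amount",
    "canopy characteristics", "leaf characteristics", "water vapor",
    "quadrat sample frame", "rain gauge", "surface air temperature",
    "air temperature", "meteorological station", "human observer",
    "vegetation index", "soil core device", "plant characteristics",
    "surface wind", "albedo",
]


def normalize(keyword):
    # Assumption: keywords match if they differ only in case or padding.
    return keyword.strip().lower()


def compare_keywords(actual, suggested):
    """Split keywords into confirmed, newly suggested, and unmatched sets."""
    a = {normalize(k) for k in actual}
    s = {normalize(k) for k in suggested}
    return {
        "confirmed": sorted(a & s),   # in both lists
        "new": sorted(s - a),         # suggested only: candidate enrichment
        "unmatched": sorted(a - s),   # actual only
    }


if __name__ == "__main__":
    result = compare_keywords(ACTUAL, SUGGESTED)
    for group, keywords in result.items():
        print(f"{group} ({len(keywords)}): {', '.join(keywords)}")
```

The "new" group is the part that would enrich the existing metadata record.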

3. Innovation

The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data-intensive research
• new tools, techniques, and ways of working

“Data Intensive Science” and the “80:20 Rule”
[Figure, adapted from CENR-OSTP: tiers of observation, with axes "Increasing Process Knowledge" and "Decreasing Spatial Coverage"]
• Intensive science sites and experiments
• Extensive science sites
• Volunteer & education networks
• Remote sensing

Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon, USA, prior to the Ecological Society of America meeting (6-10 Aug.):
http://www.birds.cornell.edu/citscitoolkit/conference/2012

Investigator Toolkit Support

[Figure: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with Investigator Toolkit tools, including DMP-Tool and Kepler]

Exploration, Visualization, and Analysis

Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing Resources

[Figure: Land Cover, Meteorology, and MODIS remote sensing data feed a Spatio-Temporal Exploratory Model; model results map the occurrence of Indigo Bunting (2008) for Jan, Apr, Jun, Sep, and Dec]

Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
• Examine patterns of migration
• Infer how climate change may affect bird migration

Taverna, MyExperiment


Provenance Browser


DataONE: Supporting Scientific Data
Current Member Nodes:

Coming Soon:
Current Tools:

Tools Coming Soon: Queensland University of Technology


Deployment Targets – Y5

2009 2010 2011 2012 2013 2014
Y1 Y2 Y3 Y4 Y5

Metadata Objects 100k (130k) 400k 1M
Datasets 90k (120k) 180k 360k
Uptime 99.0 (100) 99.9 99.9
Metadata Schemas 8 (4) 8 8
Member Nodes 10 (8) 20 40
MN Countries 3 (2) 5 10
Coordinating Nodes 3 (3) 4 5
CN Countries 1 (1) 1 2
ITK Tools 8 (4) 10 12


Community Engagement


User Assessments

Scientists: BL Scientists: FU

Library Policies: BL Library Policies: FU

Librarians: BL Librarians: FU

Policy Makers: BL Policy Makers: FU

Educators: BL Educators: FU

Year 1 Year 2 Year 3 Year 4 Year 5


Community Engagement


Best Practices and Software Tools


June 3-21, 2013
University of New Mexico

Internships
2009 – 4 interns, 2010 – 4 interns
2011 – 8 interns, 2012 – 6 interns

https://notebooks.dataone.org/summer2012/


DataONE: Supporting Scientific Data


DataONE Team and Sponsors
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Dave Vieglais
• Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Patricia Cruse, John Kunze
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
• Paul Allen, Rick Bonney, Steve Kelling
• Ryan Scherle, Todd Vision
• Randy Butler

• Ewa Deelman
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Bertram Ludaescher
• Peter Honeyman
• Cliff Duke
• Carole Goble
• Donald Hobern
• David DeRoure

LEON LEVY FOUNDATION
