Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data-intensive science, presents current solutions, and lays out a roadmap to a future where new information technologies significantly increase the pace of scientific discovery and innovation.
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an era of data-intensive science
1. Data Observation Network for Earth (DataONE): Supporting Scientific Data Preservation, Discovery, and Innovation
Bill Michener
Professor and DataONE Project Director
University of New Mexico
24 September 2012
National Information Standards Organization
3. Research and Data Life Cycle Integration
[Diagram: a research cycle (Ideas, Proposal writing, Research, Publication) shown alongside the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)]
4. Three Key Challenges
[Diagram: the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) with the label "Innovation"]
6. The Long Tail of Orphan Data
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
[Figure: Volume plotted against rank frequency of datatype, with labels for specialized repositories (e.g. GenBank, PDB) and orphan data (B. Heidorn)]
7. Planning ?
Metadata standard?
Data repository?
8. DataONE and the DMPTool Support Data Preservation
Three major components for a flexible, scalable, sustainable network:
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
9. Dryad (>3,000 data products)
Coordinated submission of articles and underlying data
Handshaking with specialized repositories
Promotion of reuse and incentives for deposit
10. Knowledge Network for Biocomplexity (20,000+ data packages)
Data Types
• Ecological
• Environmental
• Demographic
• Social/Legal/Economic
Contributors
• Individual investigators
• Field stations and networks
• Government agencies
• Non-profit partnerships
• Synthesis centers
[Chart: Data Sizes (%), axis from 0 to 60, for the size classes <1, 1-10, 10-200, and >200 MB]
11. ✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare
Data & Metadata (EML)
25. 1. Ontology-based discovery search results
[Screenshot callouts: Concepts acquire context: biomass as Material or biomass as Energy. Additional search terms. Super-classes may have different properties.]
1. NCBO ontology repository instance
2. Populated with ontologies (e.g., the NASA-JPL Semantic Web for Earth and Environmental Terminology)
3. Queried ontologies and returned results using REST services
26. Approach 2: Enrich MN Metadata
                            DAAC     DRYAD    KNB
Number of Documents         978      1,729    24,249
Total Number of Keywords    7,294    8,266    254,525
Average Keywords/Document   7.46     4.78     10.49
[Bar chart: DAAC, DRYAD, and KNB on an axis from 0 to 12]
Actual Keywords
1. canopy characteristics
2. field investigation
3. vegetation index
4. leaf characteristics
5. Satellite
6. land cover
7. leaf area meter
8. Reflectance
9. steel measuring tape
10. vegetative cover
11. plant characteristics
12. albedo
Suggested Keywords
[1] field investigation
[2] analysis
[3] land cover
[4] computational model
[5] reflectance
[6] vegetative cover
[7] biomass
[8] primary production
[9] steel measuring tape
[10] weigh balance
[11] precipitation amount
[12] canopy characteristics
[13] leaf characteristics
[14] water vapor
[15] quadrat sample frame
[16] rain gauge
[17] surface air temperature
[18] air temperature
[19] meteorological station
[20] human observer
[21] vegetation index
[22] soil core device
[23] plant characteristics
[24] surface wind
[25] albedo
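The numbers and keyword lists on this slide can be checked with a few lines of Python. The sketch below is not the keyword-suggestion method behind the slide, which the transcript does not describe; it only recomputes the average keywords per document from the table's counts and measures how the suggested list overlaps the actual list. All function and variable names are illustrative assumptions.

```python
# Counts copied from the slide's table: (number of documents, total keywords).
corpus_stats = {
    "DAAC": (978, 7294),
    "DRYAD": (1729, 8266),
    "KNB": (24249, 254525),
}


def average_keywords(documents: int, keywords: int) -> float:
    """Mean number of keywords per metadata document."""
    return keywords / documents


# Keyword lists copied from the slide.
actual = [
    "canopy characteristics", "field investigation", "vegetation index",
    "leaf characteristics", "Satellite", "land cover", "leaf area meter",
    "Reflectance", "steel measuring tape", "vegetative cover",
    "plant characteristics", "albedo",
]
suggested = [
    "field investigation", "analysis", "land cover", "computational model",
    "reflectance", "vegetative cover", "biomass", "primary production",
    "steel measuring tape", "weigh balance", "precipitation amount",
    "canopy characteristics", "leaf characteristics", "water vapor",
    "quadrat sample frame", "rain gauge", "surface air temperature",
    "air temperature", "meteorological station", "human observer",
    "vegetation index", "soil core device", "plant characteristics",
    "surface wind", "albedo",
]


def normalize(term: str) -> str:
    """Compare keywords case-insensitively and ignore stray whitespace."""
    return term.strip().lower()


actual_set = {normalize(t) for t in actual}
suggested_set = {normalize(t) for t in suggested}

already_present = actual_set & suggested_set   # suggestions the record already had
newly_suggested = suggested_set - actual_set   # terms that would enrich the record
not_suggested = actual_set - suggested_set     # actual keywords that were not suggested

if __name__ == "__main__":
    for name, (docs, kws) in corpus_stats.items():
        print(f"{name}: {average_keywords(docs, kws):.2f} keywords/document")
    print(len(already_present), "actual keywords also appear in the suggestions")
    print(len(newly_suggested), "suggested keywords are new")
    print("not suggested:", sorted(not_suggested))
```

For the lists on this slide, 10 of the 12 actual keywords reappear among the suggestions, 15 suggested terms are new, and "satellite" and "leaf area meter" are the two actual keywords that were not suggested.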
27. 3. Innovation
The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data-intensive research
• new tools, techniques, and ways of working
28. “Data Intensive Science” and the “80:20 Rule”
[Figure, adapted from CENR-OSTP. Tiers, top to bottom: intensive science sites and experiments; extensive science sites; volunteer & education networks; remote sensing. Axis labels: Increasing Process Knowledge; Decreasing Spatial Coverage]
29. Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon, USA, prior to the Ecological Society of America meeting (6-10 Aug.):
http://www.birds.cornell.edu/citscitoolkit/conference/2012
30. Investigator Toolkit Support
[Diagram: the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with Investigator Toolkit tools, including DMP-Tool and Kepler]
31. Exploration, Visualization, and Analysis
Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing Resources
Data layers: Land Cover; Meteorology; MODIS – Remote sensing data
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Model results: Occurrence of Indigo Bunting (2008) [panels labeled Jan, Apr, Jun, Sep, Dec]
• Examine patterns of migration
• Infer how climate change may affect bird migration
34. DataONE: Supporting Scientific Data
Preservation, Discovery, and Innovation
Current Member Nodes:
Coming Soon:
Current Tools:
Tools Coming Soon: Queensland University of Technology
37. User Assessments
Scientists: BL Scientists: FU
Library Policies: BL Library Policies: FU
Librarians: BL Librarians: FU
Policy Makers: BL Policy Makers: FU
Educators: BL Educators: FU
Year 1 Year 2 Year 3 Year 4 Year 5
44. DataONE Team and Sponsors
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Dave Vieglais
• Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Patricia Cruse, John Kunze
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
• Paul Allen, Rick Bonney, Steve Kelling
• Ryan Scherle, Todd Vision
• Randy Butler
• Ewa Deelman
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Bertram Ludaescher
• Peter Honeyman
• Cliff Duke
• Carole Goble
• Donald Hobern
• David DeRoure
Leon Levy Foundation
Editor's Notes
Networking and the interconnectedness of information: defining the relationships between components increases the value and utility of those items. The internet provides connectivity between systems, and a good deal of infrastructure has been built on this rapidly evolving, now pervasive fabric. The design of most internet-based infrastructure, though, is very ephemeral, and thus is not suitable for preservation of information or, more importantly, of the relationships between elements.
URLs are often used as identifiers, but they have a significant problem: their resolution, that is, finding the location where the content identified by the URL may be retrieved, is entirely dependent on the persistent availability of the service endpoint referenced by the URL. A change in any component in the resolution chain results in failure, and thus negates the utility of the URL. [Diagram of URL resolution process]
The semantic web, whose goal is interconnectedness between information, is entirely dependent on effective identifier resolution.
Preservation of content. Access to content. Creating communities of agents able to access and manipulate information. Generating new content and relationships between content, and discovering new associations. Being completely open about activity is one goal; the generation of new content, mining of existing information, and access to processing resources may, however, be best done with some privacy. There are always some activities best not to perform in full public view.
The DataONE project is building infrastructure that addresses these concerns.
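The identifier-resolution problem in this note can be illustrated with a small sketch. It is a hypothetical, in-memory model, not DataONE's resolver or any real identifier service: the class name, the `pid:example-1` identifier, and the `example.org` URLs are all made up for illustration. The point it shows is the indirection: a persistent identifier maps to one or more locations, so losing one endpoint does not break resolution the way a bare URL would.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ResolutionError(LookupError):
    """Raised when no registered location for an identifier is reachable."""


@dataclass
class IdentifierResolver:
    """Hypothetical resolver: persistent identifier -> known locations."""

    locations: dict[str, list[str]] = field(default_factory=dict)
    reachable: set[str] = field(default_factory=set)

    def register(self, identifier: str, url: str) -> None:
        """Record one more location where the identified content lives."""
        self.locations.setdefault(identifier, []).append(url)
        self.reachable.add(url)

    def take_offline(self, url: str) -> None:
        """Simulate a service endpoint disappearing."""
        self.reachable.discard(url)

    def resolve(self, identifier: str) -> str:
        """Return the first location that is still reachable."""
        for url in self.locations.get(identifier, []):
            if url in self.reachable:
                return url
        raise ResolutionError(identifier)


# Made-up identifier and URLs, for illustration only.
resolver = IdentifierResolver()
resolver.register("pid:example-1", "https://repo-a.example.org/data/1")
resolver.register("pid:example-1", "https://repo-b.example.org/data/1")

# The first endpoint goes away; the identifier still resolves to the copy.
resolver.take_offline("https://repo-a.example.org/data/1")
surviving_location = resolver.resolve("pid:example-1")
```

A bare URL corresponds to the case with a single registered location: once that endpoint is offline, `resolve` raises `ResolutionError`, which is the failure mode the note describes.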
In fact, many researchers find the new requirement to be quite confusing. Here are just a few examples of the questions that they are asking.
There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components:
Member Nodes, which include a diverse array of data centers and repositories that are associated with national and international agencies and research networks, universities, libraries, etc.
Coordinating Nodes, which support data replication across Member Nodes (i.e., data centers) as well as network-wide services like 24/7 access to metadata at the CNs, indexing, and rapid search and discovery.
An Investigator Toolkit that includes tools that are widely used by scientists. The tools are coupled with the DataONE resources so that it is, for example, possible to seamlessly and transparently access data at Member Nodes through the tool of your choice.
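The division of labor between Member Nodes and Coordinating Nodes can be sketched as a toy model. This is an assumption-laden illustration, not the DataONE software or its API: the class names, the `ingest` and `search` methods, the node names, and the replication rule (copy to the next available nodes until a target count is reached) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Toy Member Node: holds copies of data objects."""

    name: str
    objects: dict[str, bytes] = field(default_factory=dict)


@dataclass
class CoordinatingNode:
    """Toy Coordinating Node: metadata catalog, search, and replication."""

    member_nodes: list[MemberNode]
    catalog: dict[str, dict] = field(default_factory=dict)       # pid -> metadata
    holders: dict[str, list[str]] = field(default_factory=dict)  # pid -> node names

    def ingest(self, origin: MemberNode, pid: str, data: bytes,
               metadata: dict, copies: int = 2) -> None:
        """Store data at the origin node, catalog it, and replicate it."""
        origin.objects[pid] = data
        self.catalog[pid] = metadata
        self.holders[pid] = [origin.name]
        for node in self.member_nodes:
            if len(self.holders[pid]) >= copies:
                break
            if node is not origin:
                node.objects[pid] = data
                self.holders[pid].append(node.name)

    def search(self, keyword: str) -> list[str]:
        """Return identifiers whose cataloged keywords include the term."""
        wanted = keyword.lower()
        return sorted(
            pid for pid, md in self.catalog.items()
            if wanted in (k.lower() for k in md.get("keywords", []))
        )


# Hypothetical nodes and a made-up data object.
nodes = [MemberNode("mn-a"), MemberNode("mn-b"), MemberNode("mn-c")]
cn = CoordinatingNode(member_nodes=nodes)
cn.ingest(nodes[0], "pid:example-1", b"site,biomass\nA,12.5\n",
          {"keywords": ["Biomass", "land cover"]}, copies=2)
hits = cn.search("biomass")
```

In this sketch the data live only on Member Nodes, while the Coordinating Node keeps the metadata and the record of which nodes hold each copy, which mirrors the split described in the note.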
Content: Data supporting peer-reviewed articles in basic and applied bioscience. Currently, 2.4 Gb data from ~400 articles and 50 journals.
Platform: Customized DSpace repository.
Metadata and data standards: Dublin Core Application Profile. Data file format determined by depositor and journal policy. Some curation and migration of file formats.
Availability: Open Data (Creative Commons Zero), with time-limited embargoes.
Identifier scheme: DataCite DOI.
Usage: ~3000 annual downloads.
Governance and sustainability: Jointly managed by a consortium of partner journals. Project funding from NSF (since 2008) and JISC (starting 2010).
Institutional home: National Evolutionary Synthesis Center, British Library (pending).
As one example, DataONE is part of a consortium that is developing a Data Management Planning Online Tool. The tool “walks” scientists through the process of developing a concise, but comprehensive data management plan that could enable good stewardship of data and meet requirements of sponsors and home institutions.
First, one logs in and selects the research sponsor and solicitation number.
The five steps are located on the left sidebar and include information about the data, metadata (or documentation about the data), policies for access and re-use, and plans for archiving and preserving the data. In this example, the Univ. of Virginia offers suggested text for archiving and preserving the data that can be pasted into the plan.
There are many opportunities for collaboration with DataONE and there are many benefits to doing so; the next few slides highlight the benefits and opportunities for research scientists, Member Nodes, and funding agencies. This map highlights many of the international partners that have expressed interest in establishing Member Nodes, many of which are active members of the DataONE Users Group.
NASA Collectors: field investigators who collect data from NASA-funded projects and deposit those data in the ORNL DAAC.
DAAC Users: those who search and download data from the ORNL DAAC.
Member Node Crescent: the software stack that enables the MN functionality for the ORNL DAAC. This crescent software is developed and installed by D1 staff, making use of the characteristics of the DAAC system and metadata.
DAAC users can obtain data directly from the ORNL DAAC, as they did before. D1 users will access metadata from the CN and will acquire ORNL DAAC data from the DAAC indirectly via the Member Node. The data and documentation downloads are recorded by the DAAC; the D1 user sees the DAAC’s citation to the downloaded data set.
Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
How else do we know what the community needs? The Scientific Exploration, Visualization and Analysis working group is another example that you heard about earlier. In summary, by running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science, challenges that were communicated to, and met by, the DataONE core CI team and developers. Another mechanism to understand community needs is to conduct extensive surveys of stakeholders….