This series of presentations was given at the EarthCube Data Facilities End-User Workshop held January 15-17, 2014 in Washington, DC. This workshop provided a forum to discuss the unique requirements and challenges associated with developing the communication, collaboration, interoperability, and governance structures that will be required to build EarthCube in conjunction with existing and emerging NSF/GEO facilities.
This panel and discussion, specifically, outlined and explained several current concepts in data sharing and interoperability, featuring presentations by:
Paul Morin (UMN): Polar Cyberinfrastructure
Don Middleton (UCAR): Atmospheric/Climate
Kerstin Lehnert (LDEO): Domain Repositories & Physical Samples
David Schindel (CBOL, GRBio): Biological Perspective & Collections
Hank Loescher (NEON): Observation Networks
Daniel Fuka (Virginia Tech) and Ruth Duerr (NSIDC): Brokering
Ilya Zaslavsky (UCSD): Cross-Domain Interoperability
12. Data System Interoperability and Standards for UCAR/NCAR and Collaborative Activities
August 13, 2013
Data Facility Workshop; Arlington, VA.
Don Middleton (on behalf of many others)
University Corporation for Atmospheric Research
U.S. National Center for Atmospheric Research
Computational and Information Systems Laboratory
Boulder, Colorado, USA
13. Data Cyberinfrastructure for “Big
Head” and “Long Tail” Scientific
Research
Research Data Archive
Mauna Loa Solar Observatory
High Altitude Observatory (HAO)
Field Project Archive
Earth Observing Lab (EOL)
Earth System Grid
Community Data Portal
ACADIS Arctic Gateway
Computational and Information Systems Laboratory (CISL) and Earth System Laboratory (NESL)
NCAR Wyoming Supercomputing Center, Cheyenne. Disk, archive, and computational resources.
ACADIS is a joint venture of NCAR EOL & CISL, the National Snow and Ice Data Center, and UCAR Unidata
UCAR Unidata
netCDF, THREDDS, TDS, LDM, IDV, Rosetta
These systems federate in various ways among themselves, across organizations such as ACADIS, and with external programs such as GCMD, the UN/WMO WIS, ESGF, TIGGE, and others.
14. Automated
Modeling and
Observation
Systems
Federation with Other Systems
(GCMD, WMO, ADE)
Data Users and
Publishers (SelfPub)
RESTful Pub
Services
ACADIS Gateway
Identity
Management
(OpenID, SAML)
Discovery Services
(Apache SOLR)
Publishing Services
Metadata and Database Services
Catalog Harvester
(OpenSearch, DIF, THREDDS)
Metrics
Data
Services, Access
Control
OAI-PMH
Repository
(DC, DIF, ISO)
Core Technology
•Spring Framework
•Hibernate
•Liquibase
•Apache SOLR
•OpenID4Java/OpenSAML
•OAI-PMH, OpenSearch
•ActiveMQ
•FreeMarker
•Java NetCDF Library
•DOIs via EZID/DataCite
BagIt (from the LoC)
EOL ACADIS
Collections
(via
THREDDS)
NSIDC Arctic
Collections
(via
Brokering)
Future
Federated
Collections
NWSC GLADE
ACADIS Arctic
Collections
RDB
HPSS
ACADIS is sponsored by NSF/GEO/PLR
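The OAI-PMH repository in the diagram above exposes metadata for harvesting by other systems. As a minimal sketch of what a harvester's request looks like, the snippet below builds OAI-PMH ListRecords URLs. The endpoint is a hypothetical placeholder (not an actual ACADIS address); only the oai_dc prefix is guaranteed by the protocol, and the prefixes a repository uses for DIF or ISO records are repository-specific.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only; not a real ACADIS URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    'oai_dc' (Dublin Core) must be supported by every OAI-PMH repository;
    other prefixes should be discovered with the ListMetadataFormats verb.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url, token):
    """Build the follow-up request for a partial list.

    When a resumptionToken is used, it is the only argument besides the verb.
    """
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


first_page = list_records_url(BASE_URL, from_date="2013-01-01")
next_page = resume_url(BASE_URL, "token-123")  # token value is made up
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, parse the returned XML, and keep requesting with the returned resumptionToken until the repository sends an empty token.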
15. The Chronopolis Data Preservation Network
• A consortium of UCSD Libraries, SDSC, Univ. of Maryland, and NCAR
• Using LoC BagIt for deposits
• Based on iRODS and ACE (Audit Control Environment)
• TRAC-certified (i.e. ISO 16363)
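To make the BagIt deposit idea concrete, here is a minimal sketch of a bag: a bagit.txt declaration, a data/ payload directory, and a checksum manifest that lets the receiving archive verify the deposit. It uses only the Python standard library, assumes the 0.97 version string of the BagIt specification, and uses a made-up payload file; it is an illustration of the packaging format, not the Chronopolis ingest workflow.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag: bagit.txt, data/ payload, and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration: version string assumed to be 0.97 for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical payload file, for illustration only.
    bag = make_bag(Path(tmp) / "example_bag", {"obs.csv": b"time,temp\n0,271.3\n"})
    ok_before = verify_bag(bag)
    (bag / "data" / "obs.csv").write_bytes(b"tampered")  # simulate corruption
    ok_after = verify_bag(bag)

print(ok_before, ok_after)
```

The same manifest check is what allows periodic fixity audits after deposit: any change to a payload file shows up as a checksum mismatch.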
18. courtesy of:
Lesley Wyborn, Geoscience Australia
(talk at the IGSN workshop at IGC 2012)
EarthCube Data Facilities Workshop
19.
Access to the physical samples is needed to verify
& reproduce published observations.
Access to sample metadata is needed for proper
interpretation and re-use of sample-based data.
Access to both is needed to facilitate sharing of
samples for use & re-use.
▪ Samples are often expensive to collect (drilling, remote
locations).
▪ Many samples are unique and irreplaceable.
▪ Re-analysis augments utility of existing data.
20. Geochemistry
Structural Geology and Tectonics
Experimental Stratigraphy
Critical Zone Community
Envisioning a Digital Crust
Cyberinfrastructure for Paleogeoscience
Petrology and Geochemistry
Inland Waters
Deep Seafloor Processes and Dynamics
Coral Reef Systems Science
Geochronology
Rock Deformation and Mineral Physics Research
21.
“Global Access to Global Collections: establish repositories for all
physical samples and the biological, geochemical and physical
measurements made from those samples.” (Paleogeoscience)
“Poor and uneven access and management of sample
collections, incomplete sample tracking and linking of samples to
analyses in the literature and databases, discoverability of existing
samples” (Petrology & Geochem)
“Most geological terrains of interest do not have sufficient or even
sample density through space and time.” (Petrology & Geochem)
“Central archive of experimental samples with integrated
workflows, database templates, and community-wide DOI system
for samples” (Mineral Physics & Rock Deformation)
23. Infrastructure and resources for preservation and
access of physical samples
Tools for repositories to efficiently manage and
improve online access to their collections.
Online registry for discovery, access, and
preservation of sample data & metadata
Best practices & standards
for sample curation and sample sharing
for sample data & data exchange
Funding strategies, business models
24.
A multi-institutional initiative to build a
“Digital Environment for Sample Curation”
to advance access and re-use of physical samples
to support and simplify the work of curators
to advance best practices, standards, & policies for
sample curation, distribution, attribution, and
citation
25.
Physical collection facilities
NSF-funded repositories:
LDEO, OSU, SIO, LacCore, WHOI, USPRR, UT
Austin, ARF, and growing
State Surveys (AASG), USGS
Industry
Data facilities & systems:
IGSN/SESAR, IMLGS, USGIN
Computer & Information Science: RENCI, UT
Austin
Biocollection informatics: iPlant, iDigBio
26. Curators (Admin GUI)
Samplers (User GUI)
Public (Admin GUI)
DESC (data, tools, services)
Data Systems
IGSN Registry
Publications
27.
28. US Interagency Working Group
on Scientific Collections
(IWGSC)
• Covers all scientific disciplines
• Created under White House S&T
Council, reports to Life Sciences
Subcommittee
• ~10 participating Departments/Agencies
• USDA and Smithsonian Co-chairs
2009 recommendations
included:
• Increase impact and
improve management of
collections
• Clarify and standardize
management and
budgeting for collections
• Create an online
clearinghouse of
information on Federal
scientific collections
• Covers all scientific disciplines
• Created under OECD Global
Science Forum
• Independent project, no legal
status
• National and Institutional
memberships
• Governance by Executive Board
• Secretariat Office at Smithsonian
SciColl Priorities:
• Develop first cross-disciplinary registry of object-based scientific collections
(GRSciColl)
• Promote interdisciplinary
research utilizing scientific
collections
29. Global Registry of Scientific
Collections (GRSciColl)
GRSciColl
Disease banks
Veterinary samples
Human medical
samples
Human artefacts
And more, what else?
Standards repositories
Air, water, soil samples
Rocks, sediment and
ice cores
Extraterrestrial
samples
SciColl and IWGSC ask:
How can we connect collections
across disciplines?
Fossils and microfossils
Microbes in BRCs
Living material in
genebanks, culture
collections
Plants and animals in
zoos, botanical
gardens, aquariums
Plants and animals in
museums, herbaria
30. Structure of GRSciColl
Institutional Collection Table
• Institution ID
• Collection ID
• Collection Name
• Collection Discipline
• Content Type(s)
• Primary Contact
Personal Collection Table
• Institution ID = “Personal”
• Collection ID
• Collection Name
• Collection Discipline
• Content Type(s)
• Primary Contact
Institution Table
• Institution ID
• Institution Name
• Institution Discipline(s)
• Primary Contact
Contacts Table
• Contact Name
• Primary Institution
• Primary Collection
• Additional Inst/Coll
SciColl and IWGSC ask:
What terms constitute the common
vocabularies of discipline and content type?
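The four tables listed on this slide can be sketched as simple record types. The field names below follow the slide; the Python types, the example records, and the find_collections helper are assumptions added for illustration and are not the actual GRSciColl schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sentinel Institution ID used for entries in the Personal Collection Table.
PERSONAL = "Personal"


@dataclass
class Institution:
    institution_id: str
    name: str
    disciplines: List[str]
    primary_contact: str


@dataclass
class Collection:
    institution_id: str  # an Institution ID, or PERSONAL for personal collections
    collection_id: str
    name: str
    discipline: str
    content_types: List[str]
    primary_contact: str

    @property
    def is_personal(self) -> bool:
        return self.institution_id == PERSONAL


@dataclass
class Contact:
    name: str
    primary_institution: Optional[str] = None
    primary_collection: Optional[str] = None
    additional: List[str] = field(default_factory=list)


def find_collections(collections, discipline=None, content_type=None):
    """Filter collections by discipline and/or content type (hypothetical helper)."""
    hits = []
    for c in collections:
        if discipline is not None and c.discipline != discipline:
            continue
        if content_type is not None and content_type not in c.content_types:
            continue
        hits.append(c)
    return hits


# Made-up records, for illustration only.
collections = [
    Collection("INST-1", "COLL-1", "Example Core Repository", "Geology",
               ["Rocks", "Ice cores"], "A. Curator"),
    Collection(PERSONAL, "COLL-2", "Example Field Samples", "Geology",
               ["Rocks"], "B. Researcher"),
    Collection("INST-2", "COLL-3", "Example Herbarium", "Botany",
               ["Plants"], "C. Curator"),
]

rock_holders = find_collections(collections, content_type="Rocks")
print([c.collection_id for c in rock_holders])
```

A cross-disciplinary search like this only works if "discipline" and "content type" draw on shared vocabularies, which is the open question the slide raises.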
33. Get Specific Data
Many respondents appeared to desire more specific details and expressed
an interest in data communicated that can be readily used in their work.
35. Data as a National Resource
NSF Director Suresh's emphasis on:
• “Era of Observations”
• “Era of Data and Information”
March 2012: White House $200M “Big
Data” initiative:
• NSF
• NIH
• DOE
• DOD
• DARPA
• USGS
36. The President’s Council of Advisors on Science
and Technology (PCAST)
The PCAST report (2011) urges that even
as the government deals with our
nation's economic challenges, it must:
“…address the threats to both the
environmental and the economic aspects
of well-being that derive from the
accelerating degradation of the
environmental capital – the Nation's
ecosystems and the biodiversity they
contain”.
PCAST New Directions…..
37. Global Themes – Global Observations
Increasing importance on designing new x-discipline data
structures to support policy/decision-making
Societal Benefit Areas (SBAs)
Agriculture
Biodiversity
Climate
Disasters
Ecosystems
Energy
Health
Water
Weather
Essential Climate Variables (ECVs)
Essential Biodiversity Variables (EBVs)
Essential Carbon Variables (ECVs)
Aligned with OSTP (NEO, US-GEO) NSF/EU Strategic Planning
Aligned with GEO, GEO-BON, GCOS, Diversitas, WMO, WCRP, etc…
Aligned with Suresh, S., 2012. Research funding: Global challenges need global solutions,
Nature, 490, 337-338, doi:10.1038/490337a
38. Why Interoperability?
• The rapid pace of large-scale environmental global changes
underscores the value of accessible long-term data sets.
• Natural, managed, and socioeconomic systems are subject to
complex interacting stresses that play out over extended periods of
time and space.
• An era of large-scale, interdisciplinary science fueled by large
data sets.
• Data Interoperability enhances the value of current scientific
efforts and investment.
• Interoperability is needed to forecast future conditions for basic
understanding, and for future planning, policy, and societal benefit.
• Currently, there is no accepted approach to make large datasets
interoperable
• Provides new leadership opportunities for Scientists globally
39. Interoperability Philosophy - scientific utility
1. Linking Science Questions and Hypotheses and Requirements
• Mapping Questions to 'what must be done'
• 'how' data can/will be used jointly
• Defining Joint Science Scope
• Defines interfaces and Functionality
2. Traceability of Measurements
• Use of Recognized Standards
• Traceability to Recognized Standards, or First Principles
• Known and managed signal:noise
• Managing QA/QC
• Uncertainty budgets (ISO traceable)
3. Algorithms/Procedures
• What is the algorithm or procedural process to create a data product?
• Provides “consistent and compatible” data
• Managed through intercomparisons
• What are their relative uncertainties?
4. Informatics
• Standards - Data Formats
• Standards - Metadata formats
• Persistent Identifiers / Open-source / Policies
• Discovery tools / Dissemination / Discovery
• Ontologies, semantics and controlled vocabularies
• Archival and Curation Activities
• Provenance
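As a small worked illustration of the "uncertainty budgets" and "signal:noise" items under Traceability of Measurements, the sketch below combines independent standard uncertainties in quadrature (root sum of squares), which is the usual combination rule for uncorrelated components. The component names and values are made up, and the coverage factor k = 2 is a common convention rather than something stated in the slides.

```python
import math


def combined_standard_uncertainty(components):
    """Combine independent (uncorrelated) standard uncertainties in quadrature."""
    return math.sqrt(sum(u ** 2 for u in components.values()))


# Hypothetical budget; all values are in the same units as the measurement.
budget = {
    "calibration_reference": 0.03,
    "sensor_resolution": 0.04,
}

u_c = combined_standard_uncertainty(budget)  # combined standard uncertainty
expanded = 2 * u_c                           # expanded uncertainty, assuming k = 2

signal = 1.5                                 # made-up measured change of interest
signal_to_noise = signal / u_c

print(round(u_c, 3), round(expanded, 3), round(signal_to_noise, 1))
```

Two observatories that publish budgets of this kind, traceable to the same standards, give a data user a basis for deciding whether a difference between their measurements is real or within the combined uncertainty.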
40. Interoperability Philosophy - scientific utility
The degree to which Observatories are truly interoperable is the
degree to which these four elements are adopted by collaborative
facilities
Signal:noise and uncertainty estimates must also be known in order for
data to have broader, global utility and prognostic capability
(ecological forecasting)
Provides the frame for individual approaches and creativity, spans
organizational and programmatic maturity
Is a framework by which all parties can engage (including the policy and social dimensions)
Facilitates establishing a Baseline/infrastructure with scientific creativity
Real work, real tasks can be defined
This Interoperability Framework is currently being implemented as part of
a joint EU FP7 and US NSF Project called CoopEUS (www.coopeus.eu)
42. The National Ecological Observatory Network is a project sponsored by the National
Science Foundation and managed under cooperative agreement by NEON Inc.
46. NEON Interactions – Other Organizations
The Type of Interaction and Efficacy is Dependent on the
Organizational Development of the other Institution
• Balancing Scientific Creativity vs. Baseline Infrastructure
• Level of System Engineering Maturity
• Base Capacity - Critical Mass
• Cultural Sensitivity
47.
48. BCube: A Broker Framework
for Next Generation Geoscience
Siri Jodha - PI
49. Brokering Framework Principles
• A broker connects information resources by
mediating interactions between those
resources without requiring the maintainers of
those resources to adapt their existing systems
EPOS Workshop, Erice 2013
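The brokering principle above can be sketched in a few lines: each resource keeps its own native interface and record layout, and the broker holds the translation logic. Everything in the sketch (the two catalogs, their record layouts, the Broker class, and the common record fields) is hypothetical and only illustrates the mediation pattern; it is not the BCube implementation.

```python
from typing import Callable, Dict, List

# Two hypothetical resources with different native record layouts.
# Neither is modified to work with the broker.


def search_catalog_a(keyword: str) -> List[dict]:
    records = [{"Title": "Greenland 1 km DEM", "Format": "binary"}]
    return [r for r in records if keyword.lower() in r["Title"].lower()]


def search_catalog_b(keyword: str) -> List[tuple]:
    # Format value below is made up for the example.
    rows = [("greenland-melt", "Greenland Ice Sheet Melt Characteristics", "netCDF")]
    return [row for row in rows if keyword.lower() in row[1].lower()]


class Broker:
    """Mediates searches across resources and translates results to a common form."""

    def __init__(self) -> None:
        self._sources = []

    def register(self, name: str, search: Callable, to_common: Callable) -> None:
        # The adapter (to_common) lives in the broker, not in the resource.
        self._sources.append((name, search, to_common))

    def search(self, keyword: str) -> List[Dict[str, str]]:
        results = []
        for name, search, to_common in self._sources:
            for native in search(keyword):
                record = to_common(native)
                record["source"] = name
                results.append(record)
        return results


broker = Broker()
broker.register("catalog_a", search_catalog_a,
                lambda r: {"title": r["Title"], "format": r["Format"]})
broker.register("catalog_b", search_catalog_b,
                lambda row: {"title": row[1], "format": row[2]})

hits = broker.search("greenland")
for hit in hits:
    print(hit)
```

Adding a third resource means registering one more adapter with the broker; the existing catalogs and their maintainers are not involved.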
51. What if....
A scientist could find data and services that matched
their interests as easily as subscribing to the news?
Greenland 1 km DEM has been published
A Digital Elevation Model (DEM)
of Greenland acquired by A.
Researcher is available in binary
format at a 1 KM grid spacing in a
polar stereographic projection ...
more
myData News.org
Greenland Ice Sheet Melt Characteristics Data updated
Greenland Ice Sheet Melt Characteristics now available via OpenSearch API
Preparing Data for Ingest, presented 10/27/09 by R. Duerr
LID590DCL Foundations of Data Curation
52. What if....
Scientists could advertise AND INDEX their data
so other scientists could find it AND REFERENCE
IT, as simply as...
1 - Filling out a web form
2 - Saving it to your website
3 - Adding its link to your site
53. BCube Broker
• Service Bus Mediator
• Scientific Field to Field Translator
• Crawling, Advertising, (and Indexing)
• http://nsidc.org/bcube
• http://rd-alliance.org
54.
55. Domain data repositories and
cross-disciplinary data
integration
governance issues
technical issues
ILYA ZASLAVSKY
AND THE EARTHCUBE CINERGI PROJECT (NSF ICER-1343816)
59. Short questionnaire
Potential added value by a cross-domain system
Integration with cross-domain search
Key characteristics for CINERGI
Function (Importance rated 1 = Unimportant to 7 = Essential, or NA / DK)
• Making metadata from your facility available for search using standard metadata, via standard APIs
• Tracking demand for and cross-domain usage of your resources
• Identifying issues related to data and metadata quality and completeness
• Tracking search hits that become searches for resources managed by your data facility
• Connecting owners of relevant datasets to your facility for potential longer-term data management
• Connecting data from your facility with people, publications, models, and projects
• Identifying communities using data, tools, and models from your facility
• Validating published metadata and service signatures from your facility
• Finding and reporting to you resources that appear as duplicates across multiple registries
Comments
Editor's notes
For example: being able to position ourselves globally. NEO Task Force Assessment Working Group (AWG); first National assessment by July 1, 2012; “Societal Benefit Areas” (SBAs) are the organizing construct; 12 SBA Teams + 1 Reference Measurements Team