2. DataONE Preservation in a Nutshell*
1. Keep the bits safe
• Replicate the data and metadata
• Do local security and media refresh
2. Protect their form and meaning
• Know what you have, and know your rights
• Know when to migrate and emulate
3. Safeguard the guardians
• Organizational and network sustainability
* DataONE Preservation Strategy, PWG
workshop, Chicago, December 5-6, 2010
3. DataONE Metadata WG Goals
1. Build an e-dictionary to look up metadata terms
and to publish your own terms
2. Develop community focusing on data curation,
citation, and discovery for DataONE
3. Develop a community to sustain it
4. Agreeing on terms: a totally different take
• Traditional metadata standards are controlled by panels of experts
• Change by committee is ugly, costly, and slow
• Example: Dublin Core, 15 cross-domain terms
• 5 years to agree, highly divergent local
use, change relegated to external ontologies
10. Metadata Vision
Instead, create one dictionary
• Crowd sourced plus lightly supervised canon
• Anyone can look up terms
• Any part of “metadata speech”
• Anyone can propose and refine their terms
• Strong terms rise, weak terms decline
Greenberg, J., Murillo, A. and Kunze, J. (in press). Ontological
Empowerment: Sustainability via Ownership. In K. LeBarre and J. Tennis (Eds.), Advances
in Classification Research, 23rd Annual ASIS SIG/CR Workshop, 26 October 2012,
Baltimore, MD.
13. What we did
• Met
• Laughed, Talked, Cried, Hugged
• Conquered
14. Use cases
Six solid cases, e.g.,
• Sally Scientist is about to enter column headers
for observational data on pikas in the alpine, for
data to go into Dryad
• Doug Data wants to use Sally’s observations and
needs to look up the definition of one of her
column headers
16. Work packages in the next 2 years
Move from pre-proof-of-concept to Beta
• Software development
• Assessment (e.g., students)
• Moderation protocols – community elders
• Establish community identity and rhythm
• Not completely flat, not completely crowd-sourced
Editor's notes
We’re a sort of cluster group, which really consists of two parts: a preservation subgroup and a metadata subgroup. They are different, and I’ll spend one slide on Preservation and the rest on the exciting work in Metadata that’s just starting up.
If we had just one slide on Preservation, this pretty much summarizes the whole story. To meet the objective of “easy, secure, and persistent storage of data”, DataONE adopts a simple 3-tiered approach.

Retaining the actual bits that comprise the data is paramount, as all other preservation and access questions are moot if the bits are lost. A cornerstone of this tier is replication. We attempt to make our replicas “de-correlated”, in the sense that we hold the copies in places where they are unlikely to be subject to the same power failure, same earthquake, same funding loss, etc. Coordinating Nodes (CNs) hold a copy of all science metadata, so that we always know what DataONE has. An extra copy of each Member Node's (MN's) data is held by each of two other MNs.

Damage or corruption in those copies is detected by periodically re-computing checksums (e.g., SHA-256 digests) for randomly selected datasets and comparing them with checksums securely stored at the CNs; any bit-level change can be corrected by copying from an unchanged copy. This kind of “pop quiz” cannot be cheated by simply reporting back a previously computed checksum, as it’s the actual MN replica data that’s requested. The check samples only a subset of the data, because exhaustively checking the amount of content that DataONE anticipates holding is not feasible: it would effectively keep the MNs and CNs busy all the time.

Local Information Technology (IT) standards at the MNs are important, and there will be more about this in a later slide. MN guidelines also call for the common-sense and usual practice of periodic “media refresh”, which is the copying of data from old physical recording devices to new physical recording devices to avoid errors due to media degradation and vendor de-support.

Assuming the bits are kept safe, one also has to be able to make sense of them into the future, so protecting their form, meaning, and behavior is critical.
This we accomplish first by fully knowing the form and structure of the data, in other words, by collecting accurate characterization metadata. Sources of this metadata include scientists, MN curators, and the output from automated characterization tools such as JHOVE. We also encourage use of widely supported formats. Finally, we will use standardized format names from the Unified Digital Format Registry (UDFR), which enables automated notification of obsolescence through services such as AONS (Automated Obsolescence Notification System) and Plato (Planets Preservation Planning Tool). I’ll note that both JHOVE and UDFR are maintained by the California Digital Library, which is a DataONE partner.

Migration and emulation are sub-strategies that DataONE will use in the event that formats become obsolete. At some time in the future, one may expect that available contemporary hardware and software will be unable to render or otherwise use bits saved in some formats. Migration is used to convert from older to newer formats; all converted content is subject to “before” and “after” characterization to ensure semantic invariance. Emulation effectively preserves older computing environments in order to retain the experience of rendering older formats; once considered a specialized intervention, emulation has become a more viable technique with recent developments in consumer and enterprise server virtualization solutions.

Ultimately, having the bits and their meaning is useless if we don’t also have the legal right (a) to hold the data, (b) to make copies and derivatives in performance of preservation management (such as replication and migration), and (c) to transfer those same rights to a successor archive. Just as important is to know specifically who owns the original data and whether those rights have been granted.
As a start we strongly encourage providers to assign “Creative Commons Zero” (CC0) licenses to all contributed data, which facilitates preservation while still permitting an attribution requirement.

Of course the DataONE organization and network itself needs to be preserved. No network, no MNs, no data. This topic has considerable cross-over with what the Governance and Sustainability working group is doing, and I’ll say more about it in a subsequent slide.
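The checksum “pop quiz” described in the preservation notes above can be sketched roughly as follows. This is a minimal illustration, not DataONE's actual implementation: the function names (`audit_replicas`, `repair`), the in-memory dictionaries standing in for CN checksum records and MN replica stores, and the node names are all assumptions made for the example.

```python
import hashlib
import random


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(trusted_checksums, replicas, sample_size, rng=None):
    """Spot-check a random sample of datasets across all replica holders.

    trusted_checksums: dict of dataset id -> hex digest (hypothetical
        stand-in for the checksums stored securely at the CNs).
    replicas: dict of node name -> dict of dataset id -> bytes
        (hypothetical stand-in for the copies held by MNs).
    Returns a list of (dataset_id, node_name) pairs whose recomputed
    digest does not match, or whose copy is missing.
    """
    rng = rng or random.Random()
    ids = sorted(trusted_checksums)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    damaged = []
    for dataset_id in sample:
        expected = trusted_checksums[dataset_id]
        for node, store in replicas.items():
            data = store.get(dataset_id)
            # The digest is recomputed from the replica bytes themselves,
            # so a node cannot pass by reporting an old checksum.
            if data is None or sha256_hex(data) != expected:
                damaged.append((dataset_id, node))
    return damaged


def repair(trusted_checksums, replicas, damaged):
    """Overwrite each damaged copy with a copy that still verifies."""
    for dataset_id, bad_node in damaged:
        expected = trusted_checksums[dataset_id]
        for node, store in replicas.items():
            good = store.get(dataset_id)
            if node != bad_node and good is not None \
                    and sha256_hex(good) == expected:
                replicas[bad_node][dataset_id] = good
                break
```

Calling `audit_replicas` with a small `sample_size` on a schedule reflects the sampling trade-off in the notes: each run touches only a few datasets, so the nodes are not kept busy all the time, while repeated runs gradually cover the holdings.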
Goals:
• Develop and implement a sustainable, effective metadata registry framework.
• Identify a core, foundational, yet flexible set of metadata properties (elements, attributes, and other sub-vocabularies) supporting basic curation and interoperability. This work will explore bridges with the Dublin Core Metadata Initiative (DCMI) and the DataCite consortium.
• Survey and assess metadata generation approaches (automatic, semi-automatic, derived, manual) and models to support the above stated goals.

Purpose: to assist DataONE in recording and maintaining via metadata (as structured, named information elements) sufficient, sustainable functional information about data sets to support discovery, life-cycle management, citation, and general interoperation. Interoperation is a core value for any federation of autonomous nodes such as DataONE, and has separate consequences for every working group; for the MWG, general interoperation is meant to address data discovery across nodes and disciplines, as well as data re-use within the earth sciences (to the extent that this can be generalized).

Scope: While metadata is a vast subject comprising, in principle, every piece of structured data bearing any relationship to any other piece of data, the MWG focuses on expressing technical and scientific metadata (DataONE’s “system” and “science” metadata). This emphasis combines the main metadata requirements from the core cyberinfrastructure team (CCIT) with relevant sources of minimal metadata requirements. Because the CCIT is best qualified to focus on technical metadata, the MWG will give priority to metadata that supports data preservation, curation, citation, and discovery in general. Of special interest will be the publication of spreadsheet data and data papers.
• Traditional metadata standards are controlled by panels of experts, e.g., FGDC, EML, Darwin Core
• Change by committee is ugly, costly, and slow
• Example: perhaps the most widely used cross-domain vocabulary is Dublin Core, with 15 cross-domain terms
• Agreed on in 5 years, with lots of local divergence
• “I love the 15, but my domain needs these 2 terms. How do we add them?” A: Make your own ontology!
• Multiply by 200 domains and the result is 200 ontologies, 200 panels, 200 islands of non-interoperation
Something between crowd-sourcing and an exclusive club. Learn from Wikipedia, Internet RFCs, and the American Heritage Dictionary.
• First meeting. Re-affirmed our vision.
• Met with Semantics WG and Provenance WG
• Got scared, got over it, because this is hard hard hard.
• Pre-proof of concept
• Aggressive plan in next 2 months to develop a 0.1 prototype using
  - either Drupal or StackOverflow
  - Sally Scientist