Pamwg 2012ahm


John Kunze

Combined slides for PAMWG at DataONE All Hands meeting.

DataONE Preservation
and Metadata Working Group

September 2012
DataONE All Hands Meeting

DataONE Preservation in a Nutshell*
1. Keep the bits safe
• Replicate the data and metadata
• Do local security and media refresh
2. Protect their form and meaning
• Know what you have, and know your rights
• Know when to migrate and emulate
3. Safeguard the guardians
• Organizational and network sustainability

* DataONE Preservation Strategy, PWG
workshop, Chicago, December 5-6, 2010

DataONE Metadata WG Goals
1. Build an e-dictionary to look up metadata terms
and to publish your own terms
2. Develop community focusing on data curation,
citation, and discovery for DataONE
3. Develop a community to sustain it

Agreeing on terms: a totally different take

• Traditional metadata standards are controlled
• Change by committee is ugly, costly, and slow
• Example: Dublin Core, 15 cross-domain terms
• 5 years to agree, highly divergent local
use, change relegated to external ontologies


Metadata Vision
Instead, create one dictionary
• Crowd sourced plus lightly supervised canon
• Anyone can look up terms
• Any part of “metadata speech”
• Anyone can propose and refine their terms
• Strong terms rise, weak terms decline
Greenberg, J., Murillo, A. and Kunze, J. (in press). Ontological
Empowerment: Sustainability via Ownership. In K. LeBarre and J. Tennis, Advances
in Classification Research, 23rd Annual ASIS SIG/CR Workshop, 26 October 2012,
Baltimore, MD.


What we did
• Met
• Laughed, Talked, Cried, Hugged
• Conquered


Use cases
Six solid cases, e.g.,
• Sally Scientist is about to enter column headers
for observational data on pikas in the alpine for
data to go into Dryad
• Doug Data wants to use Sally’s observations and
needs to look up the definition of one of her
column headers
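
The two use cases, together with the "strong terms rise, weak terms decline" idea from the Metadata Vision slide, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the slides do not specify a data model or a ranking mechanism, so the class `TermDictionary`, its methods, and the simple vote counter are hypothetical and are not the working group's design.

```python
from dataclasses import dataclass


@dataclass
class Definition:
    """One proposed definition of a term (hypothetical data model)."""
    author: str
    text: str
    score: int = 0  # assumed ranking signal: net community votes


class TermDictionary:
    """Toy crowd-sourced dictionary: anyone proposes, anyone looks up."""

    def __init__(self):
        self._entries = {}  # normalized term -> list of Definition

    def propose(self, term, author, text):
        # Anyone can propose a definition; terms are matched case-insensitively.
        definition = Definition(author, text)
        self._entries.setdefault(term.lower(), []).append(definition)
        return definition

    def vote(self, definition, delta):
        # Assumed mechanism for terms rising or declining: a net vote count.
        definition.score += delta

    def look_up(self, term):
        # Return all definitions of the term, highest-scoring first.
        found = self._entries.get(term.lower(), [])
        return sorted(found, key=lambda d: d.score, reverse=True)


# Sally publishes the meaning of one of her column headers.
yamz = TermDictionary()
sally = yamz.propose("elevation_m", "Sally Scientist",
                     "Elevation of the observation site above sea level, in meters")
other = yamz.propose("elevation_m", "Someone Else", "Height")
yamz.vote(sally, +2)
yamz.vote(other, -1)

# Doug looks up the header and sees the strongest definition first.
best = yamz.look_up("elevation_m")[0]
print(best.author, "-", best.text)
```

A net vote count is only one possible way for terms to rise or decline; the notes describe the intended model as something between pure crowd-sourcing and an exclusive club, which a real system would need moderation rules to capture.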


Work packages in the next 2 years
Move from pre-proof-of-concept to Beta
• Software development
• Assessment (e.g., students)
• Moderation protocols – community elders
• Establish community identity and rhythm
• Not completely flat, not completely crowd-sourced

Slides 5-9: The Metadata Universe (Jenn Riley, IU)
Slide 15: Mockup

Editor's notes

We’re a sort of cluster group, which really consists of two parts: a preservation subgroup and a metadata subgroup. They are different, and I’ll spend one slide on Preservation and the rest on the exciting work in Metadata that’s just starting up.
If we had just one slide on Preservation, this pretty much summarizes the whole story. To meet the objective of “easy, secure, and persistent storage of data”, DataONE adopts a simple 3-tiered approach.
Retaining the actual bits that comprise the data is paramount, as all other preservation and access questions are moot if the bits are lost. A cornerstone of this tier is replication. We attempt to make our replicas “de-correlated”, in the sense that we hold the copies in places where they are unlikely to be subject to the same power failure, same earthquake, same funding loss, etc. Coordinating Nodes (CNs) hold a copy of all science metadata, so that we always know what DataONE has. An extra copy of each Member Node’s (MN’s) data is held by each of two other MNs.
Damage or corruption in those copies is detected by periodically re-computing checksums (e.g., SHA-256 digests) for randomly selected datasets and comparing them with checksums securely stored at the CNs; any bit-level change can be corrected by copying from an unchanged copy. This kind of “pop quiz” cannot be cheated by simply reporting back a previously computed checksum, as it’s the actual MN replica data that’s requested. The audit samples only a subset of the data because exhaustively checking the amount of content that DataONE anticipates holding is not feasible: it would effectively keep the MNs and CNs busy all the time.
Local Information Technology (IT) standards at the MNs are important, and there will be more about this in a later slide. MN guidelines also call for the common-sense and usual practice of periodic “media refresh”, which is the copying of data from old physical recording devices to new physical recording devices to avoid errors due to media degradation and vendor de-support.
Assuming the bits are kept safe, one also has to be able to make sense of them into the future, so protecting their form, meaning, and behavior is critical. This we accomplish first by fully knowing the form and structure of the data, in other words, by collecting accurate characterization metadata. Sources of this metadata include scientists, MN curators, and the output from automated characterization tools such as JHOVE. We also encourage use of widely supported formats. Finally, we will use standardized format names from the Unified Digital Format Registry (UDFR), which enables automated notification of obsolescence through services such as AONS (Automated Obsolescence Notification System) and Plato (Planets Preservation Planning Tool). I’ll note that both JHOVE and UDFR are maintained by the California Digital Library, which is a DataONE partner.
Migration and emulation are sub-strategies that DataONE will use in the event that formats become obsolete. At some time in the future, one may expect that available contemporary hardware and software will be unable to render or otherwise use bits saved in some formats. Migration is used to convert from older to newer formats; all converted content is subject to “before” and “after” characterization to ensure semantic invariance. Emulation effectively preserves older computing environments in order to retain the experience of rendering older formats; once considered a specialized intervention, emulation has become a more viable technique with recent developments in consumer and enterprise server virtualization solutions.
Ultimately, having the bits and their meaning is useless if we don’t also have the legal right (a) to hold the data, (b) to make copies and derivatives in performance of preservation management (such as replication and migration), and (c) to transfer those same rights to a successor archive. Just as important is to know specifically who owns the original data and whether those rights have been granted. As a start we strongly encourage providers to assign “Creative Commons Zero” (CC0) licenses to all contributed data, which facilitates preservation while still permitting an attribution requirement.
Of course the DataONE organization and network itself needs to be preserved. No network, no MNs, no data. This topic has considerable cross-over with what the Governance and Sustainability working group is doing, and I’ll say more about it in a subsequent slide.
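
The checksum “pop quiz” described in the preservation notes can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DataONE’s implementation: the function names, the in-memory dictionaries standing in for MN replicas and for the checksums held at the CNs, and the node names are all hypothetical. Only `hashlib` and `random` from the standard library are used.

```python
import hashlib
import random


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_sample(replicas, trusted_checksums, sample_size, rng=None):
    """Audit a random sample of datasets and repair corrupted replicas.

    replicas: dict mapping dataset id -> dict of node name -> replica bytes
              (hypothetical stand-in for data held at the MNs).
    trusted_checksums: dict mapping dataset id -> expected hex digest
              (hypothetical stand-in for checksums stored at the CNs).
    Returns a list of (dataset id, node name) pairs that were repaired.
    """
    rng = rng or random.Random()
    dataset_ids = sorted(replicas)
    chosen = rng.sample(dataset_ids, min(sample_size, len(dataset_ids)))
    repaired = []
    for dataset_id in chosen:
        copies = replicas[dataset_id]
        expected = trusted_checksums[dataset_id]
        # Recompute each digest from the replica bytes themselves, so a node
        # cannot pass by reporting back a previously computed checksum.
        good = [node for node, data in copies.items()
                if sha256_hex(data) == expected]
        bad = [node for node in copies if node not in good]
        if good:
            # Correct any changed copy by copying from an unchanged one.
            for node in bad:
                copies[node] = copies[good[0]]
                repaired.append((dataset_id, node))
        # If no copy matches, nothing can be repaired from this replica set.
    return repaired


# One dataset held at three nodes; the copy at "mn-c" has a flipped character.
replicas = {
    "ds1": {"mn-a": b"pika counts", "mn-b": b"pika counts", "mn-c": b"pika c0unts"},
}
trusted = {"ds1": sha256_hex(b"pika counts")}
print(audit_sample(replicas, trusted, sample_size=1))
```

Sampling keeps the audit cheap at the cost of detecting damage only eventually; the sample size and audit frequency are tuning choices that the notes leave open.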
Goals:
• Develop and implement a sustainable, effective metadata registry framework.
• Identify a core, foundational, yet flexible set of metadata properties (elements, attributes, and other sub-vocabularies) supporting basic curation and interoperability. This work will explore bridges with the Dublin Core Metadata Initiative (DCMI) and the DataCite consortium.
• Survey and assess metadata generation approaches (automatic, semi-automatic, derived, manual) and models to support the above stated goals.
Purpose: to assist DataONE in recording and maintaining via metadata (as structured, named information elements) sufficient, sustainable functional information about data sets to support discovery, life-cycle management, citation, and general interoperation. Interoperation is a core value for any federation of autonomous nodes such as DataONE, and has separate consequences for every working group; for the MWG, general interoperation is meant to address data discovery across nodes and disciplines, as well as data re-use within the earth sciences (to the extent that this can be generalized).
Scope: While metadata is a vast subject comprising, in principle, every piece of structured data bearing any relationship to any other piece of data, the MWG focuses on expressing technical and scientific metadata (DataONE’s “system” and “science” metadata). This emphasis combines the main metadata requirements from the core cyberinfrastructure team (CCIT) with relevant sources of minimal metadata requirements. Because the CCIT is best qualified to focus on technical metadata, the MWG will give priority to metadata that supports data preservation, curation, citation, and discovery in general. Of special interest will be the publication of spreadsheet data and data papers.
• Traditional metadata standards are controlled by panels of experts, e.g., FGDC, EML, Darwin Core
• Change by committee is ugly, costly, and slow
• Example: perhaps the most widely used cross-domain vocabulary is Dublin Core, with 15 cross-domain terms
• Agreed on in 5 years, lots of local divergence
• “I love the 15, but my domain needs these 2 terms. How do we add them?” A: Make your own ontology!
• Multiply by 200 domains and the result is 200 ontologies, 200 panels, 200 islands of non-interoperation
• Something between crowd-sourcing and an exclusive club
• Learn from Wikipedia, Internet RFCs, and the American Heritage Dictionary
Greenberg, J., Murillo, A. and Kunze, J. (in press). Ontological Empowerment: Sustainability via Ownership. In K. LeBarre and J. Tennis, Advances in Classification Research, 23rd Annual ASIS SIG/CR Workshop, 26 October 2012, Baltimore, MD.
• First meeting. Re-affirmed our vision.
• Met with Semantics WG and Provenance WG
• Got scared, got over it, because this is hard hard hard.
• Pre-proof of concept
• Aggressive plan in the next 2 months to develop a 0.1 prototype using
  - either Drupal or StackOverflow
  - Sally Scientist