What You Cite is What You Get: Verifiable Addressing of Immutable, Self-Describing Research Data

•

1 gefällt mir•107 views

The requirements of FAIR and Reproducible Science are driving new PID requirements around the citation of exact, verifiable versions of research outputs. For example, versioned DOIs have been implemented on various popular platforms (Figshare, Zenodo, F1000). There is also renewed interest in the idea of data packaging as a practical means of bundling data and metadata in a way that can be easily cited and transmitted as a single payload. Similarly, several groups have identified the need to directly and unambiguously reference the exact content of a data payload in PID metadata: — for example, the Freya project is looking at how best to allow “direct access to content associated with a DOI” (see https://github.com/datacite/freya/issues/2), — RDA is tackling similar issues in the PID Kernel Information Working Group (https://www.rd-alliance.org/groups/pid-kernel-information-wg), — and the Software Heritage made ‘verifiability’ a key property of their PID design (https://hal.archives-ouvertes.fr/hal-01865790). This all suggests an appetite within the research community for greater consensus around linking PIDs to versioned (and therefore immutable), self-describing content. Beyond the scholarly communication ecosystem, a number of projects, notably the Interplanetary FileSystem (IPFS) and DAT, use content-addressed, peer-to-peer networks to bake these properties right into the fabric of the network, in a way that has potentially interesting implications for how we share reproducible research outputs. This talk will provide an overview of these issues, explore some of the emerging solutions in the area, as well as the challenges inherent in new approaches.

Wissenschaft

What you cite is what you get?
Verifiable addressing of immutable, self-describing
research data
Eoghan Ó Carragáin
Persistent Identifiers in Research
ETH Zurich
2019-09-13

Can I Use It? “FAIR Digital Objects”
Turning FAIR into Reality 10.2777/1524

image: @OA_RHUL
RO-Crate
BDBags
Can I Use It? Portable, self-describing “Data Packages”

“The demand for reproducibility of research results is
growing. [There is need] to reference the exact version
of the data that was used to underpin the research
findings, and/or was used to generate higher level
products. ”
Can I Trust it? Citation and versioning of Research Data

“Link rot” and “content drift”
https://doi.org/10.1371/journal.pone.0167475
https://doi.org/10.1371/journal.pone.0115253

http://example.com/file.dat
file://tmp/file.dat

http://example.com/file.dat
file://tmp/file.dat file://tmp/file.dat

Web references are mutable & “content
negotiable” by design
http://example.com/file.dat
file://tmp/file.dat file://tmp/file.dat

http://example.com/file.dat
http://repo.org/file.dat
http://archive.net/file.dat
DOI redirection/multi-resolution
http://doi.org/10.1234/abcde

Can I find it? Persistent Identifiers and link rot

Arnap & Hutchinson 2006 10.1145/1179509.1179514
Can I trust it? Persistent Identifiers and content drift
doi: 10.1000/182
2004 (Version 4) 2019 (Version > 6)

“persistence is purely a matter of service and is neither
inherent in an object nor conferred on it by a particular
naming syntax. The best that an identifier can do is to
lead users to the services that support robust reference.”
- J. Kunze et al. “The ARK Identifier Scheme”
https://tools.ietf.org/html/draft-kunze-ark-18

Can I trust it? DOI Versioning
Any C.U.D. operation on files triggers a new
version.

Can I trust it? Allowing content verification with checksums

https://hal.archives-ouvertes.fr/hal-01865790
Can I trust it? Content-Addressing for research data?
Content-addressed PID scheme:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
“Identifiers for Digital Objects” rather than “Digital Identifiers of Objects”

“it is really about the ability to trust your data. I guarantee you, if you put your data in Git, you can
trust the fact that five years later after it was converted from your hard disk to your DVD to whatever
new technology and you copied it along, you can verify that the data you get back out is the exact
same as the data you put in.”
- Linus Torvalds
“if you cannot guarantee that what I put in […] comes out exactly the same, your system is not worth using”
Can I trust it? Content-addressed and distributed systems

http://site.com/data/pids.pdf
ipfs://zdj7WjqNrjReTcEveRh/pids.pdf
Image: @protocollabs
LOCATION-ADDRESSING
CONTENT-ADDRESSING

slide:@flyingzumwalt
http://example.com/file.dat
http://repo.org/file.dat
http://archive.org/file.dat
Replicas compete to be THE LOCATION

image:@flyingzumwalt
ipfs://zdj7WjqNrjReTcEveRh
gcsXsJvSGwLxJ7js1R7ZCz
NaQSKuTh

Immutability != permanent/persistent availability
● Who coordinates ‘nodes of last resort’ (c.f. LOCKSS, Keepers Registry)?
● Persistent availability of large amounts of research data = Collective
Action Problem (see: 10.5334/kula.7)
How long is long enough?
● Inevitability of hash collisions (at some stage)?
● For the scholarly record, you still need an indirection layer, to be able to
update citations to point at new hashes (sound familiar?)
● Indeed, we need an indirection layer which allows upgrading between
technology stacks and protocols (sound familiar?)
Great, so let’s just use IPFS…?
“persistence is purely a matter of service”

title
…
…
contentURL
checksum
“Kernel record/metadata"
Can I trust it? Content-address + Indirection
vDOI

Can I trust it? What’s the right recipe?
title
…
…
contentURL
checksum
“Kernel record/metadata"

Empfohlen

17:15 Kolloquium – Donnerstag, 27. Februar 2020 – Das Büro darf nicht nur Mit...ETH-Bibliothek

ETH Zurich's DOI DeskETH-Bibliothek

10 YearsDOI Desk at ETH ZurichETH-Bibliothek

OriginStamp: Trusted Time Stamping via the Bitcoin BlockchainETH-Bibliothek

Tracking Citations to Research Software via PIDsETH-Bibliothek

Persistent Identifiers for Scientific Data at CSCSETH-Bibliothek

Building Open Research Infrastructure with PIDsETH-Bibliothek

DataCite and its Members: Connecting Research and Identifying KnowledgeETH-Bibliothek

Empfohlen

17:15 Kolloquium – Donnerstag, 27. Februar 2020 – Das Büro darf nicht nur Mit...ETH-Bibliothek

ETH Zurich's DOI DeskETH-Bibliothek

10 YearsDOI Desk at ETH ZurichETH-Bibliothek

OriginStamp: Trusted Time Stamping via the Bitcoin BlockchainETH-Bibliothek

Tracking Citations to Research Software via PIDsETH-Bibliothek

Persistent Identifiers for Scientific Data at CSCSETH-Bibliothek

Building Open Research Infrastructure with PIDsETH-Bibliothek

DataCite and its Members: Connecting Research and Identifying KnowledgeETH-Bibliothek

Bilder online recherchieren – Tipps und TricksETH-Bibliothek

Transkribus. Eine Forschungsplattform für die automatisierte Digitalisierung,...ETH-Bibliothek

Herausforderungen im Datenmanagement von MetadatenETH-Bibliothek

Gamification und Game Design: Theorie und Praxis jenseits der Heilsversprechu...ETH-Bibliothek

Data Management in Research –WhyandHow?ETH-Bibliothek

Openness, exchange, FAIR DATA – oh brave new world that has such vision! (Dr....ETH-Bibliothek

CitizenScience - Freiwillige lokalisieren Bilder im virtuellen GlobusETH-Bibliothek

FORUM - Das Bottom-up Gremium der ETH-BibliothekETH-Bibliothek

Digitaler Zugang zu Lesespuren - Das Projekt „Thomas Mann Nachlassbibliothek“...ETH-Bibliothek

„Ex meis libris“ - Die Provenienzdatenbank der ETH-Bibliothek ETH-Bibliothek

Wenn Algorithmen Zeitschriften lesen - Vom Mehrwert automatisierter Textanrei...ETH-Bibliothek

Die Research Collection der ETH Zürich - Ein Repositorium für Publikationen u...ETH-Bibliothek

The ETH Zurich DOI Desk ETH-Bibliothek

Texte visualisieren und lesenETH-Bibliothek

Netzwerk MetadatenmanagementETH-Bibliothek

Intellectual Property is Common PropertyETH-Bibliothek

Open access requirements in SNSF and EU projectsETH-Bibliothek

Open Access PublishingETH-Bibliothek

Research partnerships, user participation, extended outreach – some of ETH L...ETH-Bibliothek

Data Management Plans for SNSF GrantsETH-Bibliothek

TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani

Genome sequencing,shotgun sequencing.pptxSilpa

Weitere ähnliche Inhalte

Mehr von ETH-Bibliothek

Bilder online recherchieren – Tipps und TricksETH-Bibliothek

Transkribus. Eine Forschungsplattform für die automatisierte Digitalisierung,...ETH-Bibliothek

Herausforderungen im Datenmanagement von MetadatenETH-Bibliothek

Gamification und Game Design: Theorie und Praxis jenseits der Heilsversprechu...ETH-Bibliothek

Data Management in Research –WhyandHow?ETH-Bibliothek

Openness, exchange, FAIR DATA – oh brave new world that has such vision! (Dr....ETH-Bibliothek

CitizenScience - Freiwillige lokalisieren Bilder im virtuellen GlobusETH-Bibliothek

FORUM - Das Bottom-up Gremium der ETH-BibliothekETH-Bibliothek

Digitaler Zugang zu Lesespuren - Das Projekt „Thomas Mann Nachlassbibliothek“...ETH-Bibliothek

„Ex meis libris“ - Die Provenienzdatenbank der ETH-Bibliothek ETH-Bibliothek

Wenn Algorithmen Zeitschriften lesen - Vom Mehrwert automatisierter Textanrei...ETH-Bibliothek

Die Research Collection der ETH Zürich - Ein Repositorium für Publikationen u...ETH-Bibliothek

The ETH Zurich DOI Desk ETH-Bibliothek

Texte visualisieren und lesenETH-Bibliothek

Netzwerk MetadatenmanagementETH-Bibliothek

Intellectual Property is Common PropertyETH-Bibliothek

Open access requirements in SNSF and EU projectsETH-Bibliothek

Open Access PublishingETH-Bibliothek

Research partnerships, user participation, extended outreach – some of ETH L...ETH-Bibliothek

Data Management Plans for SNSF GrantsETH-Bibliothek

Mehr von ETH-Bibliothek (20)

Bilder online recherchieren – Tipps und Tricks

Transkribus. Eine Forschungsplattform für die automatisierte Digitalisierung,...

Herausforderungen im Datenmanagement von Metadaten

Gamification und Game Design: Theorie und Praxis jenseits der Heilsversprechu...

Data Management in Research –WhyandHow?

Openness, exchange, FAIR DATA – oh brave new world that has such vision! (Dr....

CitizenScience - Freiwillige lokalisieren Bilder im virtuellen Globus

FORUM - Das Bottom-up Gremium der ETH-Bibliothek

Digitaler Zugang zu Lesespuren - Das Projekt „Thomas Mann Nachlassbibliothek“...

„Ex meis libris“ - Die Provenienzdatenbank der ETH-Bibliothek

Wenn Algorithmen Zeitschriften lesen - Vom Mehrwert automatisierter Textanrei...

Die Research Collection der ETH Zürich - Ein Repositorium für Publikationen u...

The ETH Zurich DOI Desk

Texte visualisieren und lesen

Netzwerk Metadatenmanagement

Intellectual Property is Common Property

Open access requirements in SNSF and EU projects

Open Access Publishing

Research partnerships, user participation, extended outreach – some of ETH L...

Data Management Plans for SNSF Grants

Kürzlich hochgeladen

TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani

Genome sequencing,shotgun sequencing.pptxSilpa

Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar

Module for Grade 9 for Asynchronous/Distance learninglevieagacer

Cyanide resistant respiration pathway.pptxSilpa

CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE

Human genetics..........................pptxSilpa

Site Acceptance Test .Poonam Aher Patil

Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli

Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav

Phenolics: types, biosynthesis and functions.Silpa

Reboulia: features, anatomy, morphology etc.Silpa

FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson

Clean In Place(CIP).pptx .Poonam Aher Patil

PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEGoa Call Girls High Profile Escorts

GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani

CYTOGENETIC MAP................ ppt.pptxSilpa

Porella : features, morphology, anatomy, reproduction etc.Silpa

Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay

Kürzlich hochgeladen (20)

TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings

Genome sequencing,shotgun sequencing.pptx

Role of AI in seed science Predictive modelling and Beyond.pptx

Module for Grade 9 for Asynchronous/Distance learning

Cyanide resistant respiration pathway.pptx

CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA

Human genetics..........................pptx

Site Acceptance Test .

Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx

Chemistry 5th semester paper 1st Notes.pdf

Phenolics: types, biosynthesis and functions.

Reboulia: features, anatomy, morphology etc.

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry

Clean In Place(CIP).pptx .

PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE

GBSN - Microbiology (Unit 3)Defense Mechanism of the body

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds

CYTOGENETIC MAP................ ppt.pptx

Porella : features, morphology, anatomy, reproduction etc.

Grade 7 - Lesson 1 - Microscope and Its Functions

What You Cite is What You Get: Verifiable Addressing of Immutable, Self-Describing Research Data

1. What you cite is what you get? Verifiable addressing of immutable, self-describing research data Eoghan Ó Carragáin Persistent Identifiers in Research ETH Zurich 2019-09-13

3. Can I Use It? “FAIR Digital Objects” Turning FAIR into Reality 10.2777/1524

4. image: @OA_RHUL RO-Crate BDBags Can I Use It? Portable, self-describing “Data Packages”

5. “The demand for reproducibility of research results is growing. [There is need] to reference the exact version of the data that was used to underpin the research findings, and/or was used to generate higher level products. ” Can I Trust it? Citation and versioning of Research Data

6. “Link rot” and “content drift” https://doi.org/10.1371/journal.pone.0167475 https://doi.org/10.1371/journal.pone.0115253

7. http://example.com/file.dat file://tmp/file.dat

8. http://example.com/file.dat file://tmp/file.dat file://tmp/file.dat

9. Web references are mutable & “content negotiable” by design http://example.com/file.dat file://tmp/file.dat file://tmp/file.dat

10. http://example.com/file.dat http://repo.org/file.dat http://archive.net/file.dat DOI redirection/multi-resolution http://doi.org/10.1234/abcde

11. Can I find it? Persistent Identifiers and link rot

12. Arnap & Hutchinson 2006 10.1145/1179509.1179514 Can I trust it? Persistent Identifiers and content drift doi: 10.1000/182 2004 (Version 4) 2019 (Version > 6)

13. “persistence is purely a matter of service and is neither inherent in an object nor conferred on it by a particular naming syntax. The best that an identifier can do is to lead users to the services that support robust reference.” - J. Kunze et al. “The ARK Identifier Scheme” https://tools.ietf.org/html/draft-kunze-ark-18

14. “The demand for reproducibility of research results is growing. [There is need] to reference the exact version of the data that was used to underpin the research findings, and/or was used to generate higher level products. ” Can I Trust it? Citation and versioning of Research Data

15. Can I trust it? DOI Versioning Any C.U.D. operation on files triggers a new version.

16. Can I trust it? Allowing content verification with checksums

17. https://hal.archives-ouvertes.fr/hal-01865790 Can I trust it? Content-Addressing for research data? Content-addressed PID scheme: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 “Identifiers for Digital Objects” rather than “Digital Identifiers of Objects”

18. “it is really about the ability to trust your data. I guarantee you, if you put your data in Git, you can trust the fact that five years later after it was converted from your hard disk to your DVD to whatever new technology and you copied it along, you can verify that the data you get back out is the exact same as the data you put in.” - Linus Torvalds “if you cannot guarantee that what I put in […] comes out exactly the same, your system is not worth using” Can I trust it? Content-addressed and distributed systems

19. http://site.com/data/pids.pdf ipfs://zdj7WjqNrjReTcEveRh/pids.pdf Image: @protocollabs LOCATION-ADDRESSING CONTENT-ADDRESSING

20. slide:@flyingzumwalt http://example.com/file.dat http://repo.org/file.dat http://archive.org/file.dat Replicas compete to be THE LOCATION

21. image:@flyingzumwalt ipfs://zdj7WjqNrjReTcEveRh gcsXsJvSGwLxJ7js1R7ZCz NaQSKuTh

22. Immutability != permanent/persistent availability ● Who coordinates ‘nodes of last resort’ (c.f. LOCKSS, Keepers Registry)? ● Persistent availability of large amounts of research data = Collective Action Problem (see: 10.5334/kula.7) How long is long enough? ● Inevitability of hash collisions (at some stage)? ● For the scholarly record, you still need an indirection layer, to be able to update citations to point at new hashes (sound familiar?) ● Indeed, we need an indirection layer which allows upgrading between technology stacks and protocols (sound familiar?) Great, so let’s just use IPFS…? “persistence is purely a matter of service”

23. title … … contentURL checksum “Kernel record/metadata" Can I trust it? Content-address + Indirection vDOI

24. Can I trust it? What’s the right recipe? title … … contentURL checksum “Kernel record/metadata"

25. Answers?