1. A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
Tobias Kuhn
http://www.tkuhn.org
@txkuhn
Department of Computer Science, VU University Amsterdam
Open Science for an Open Society Workshop
2016 Conference on Complex Systems
Amsterdam, Netherlands
20 September 2016
2. Increasing Importance of Scientific Data
https://www.google.com/trends/explore#q=%22data%20science%22
Tobias Kuhn, Department of Computer Science, VU University Amsterdam Decentralized Data Publishing 2 / 15
3. Scientific Data as Supplemental Material
...
http://www.nature.com/ni/journal/v16/n10/full/ni.3267.html#supplementary-information
4. Scientific Data in Open Repositories
5. We Need Better Data Publishing!
Published data should be:
• Verifiable (Is this really the data I am looking for?)
• Immutable (Can I be sure that it hasn’t been modified?)
• Permanent (Will it be available in 1, 5, 20 years from now?)
• Reliable (Can it be efficiently retrieved whenever needed?)
• Granular (Can I refer to individual data entries?)
• Semantic (Can it be automatically interpreted?)
• Linked (Does it use established identifiers and ontologies?)
• Trustworthy (Can I trust the source?)
6. Requirement: Automated Low-Level Operations
We need automated low-level operations to publish and retrieve data
entries and datasets:
publish <dataset-identifier>
get <dataset-identifier>
(like HTTP POST/GET but verifiable, immutable, permanent, reliable, ...)
Approach: Linked Data + Cryptography + Decentralization
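The two operations can be sketched with a toy content-addressed store. This is a minimal illustration, assuming SHA-256 hex digests as identifiers; the ContentStore class and its in-memory dictionary are hypothetical stand-ins, not the actual server software.

```python
import hashlib


def identifier_for(content: bytes) -> str:
    """Derive the identifier from the content itself (SHA-256, hex-encoded)."""
    return hashlib.sha256(content).hexdigest()


class ContentStore:
    """Hypothetical in-memory stand-in for a publishing service."""

    def __init__(self) -> None:
        self._entries = {}  # identifier -> content

    def publish(self, content: bytes) -> str:
        key = identifier_for(content)
        # Publishing the same content twice is a no-op; an existing entry
        # is never replaced, so published content stays immutable.
        self._entries.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        content = self._entries[key]
        # The client recomputes the hash, so it need not trust the store.
        if identifier_for(content) != key:
            raise ValueError("content does not match its identifier")
        return content


store = ContentStore()
key = store.publish(b"example data entry")
retrieved = store.get(key)
```

Because the identifier is computed from the content, anyone holding the identifier can check what they received, which is what plain HTTP POST/GET does not offer.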
7. Nanopublications: Linked Data Containers for Provenance-Aware Semantic Publishing
[Figure: a nanopublication, made up of an assertion, its provenance, and publication info]
http://nanopub.org / @nanopub_org
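The three-part structure can be modeled as a small immutable record. This is only a sketch: real nanopublications are RDF named graphs, while here each part is a tuple of plain-string triples, and all example.org and ex: names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopublication:
    uri: str
    assertion: tuple         # the claim itself
    provenance: tuple        # where the assertion comes from
    publication_info: tuple  # metadata about the nanopublication as a whole

    def is_complete(self) -> bool:
        """A nanopublication needs all three parts to be non-empty."""
        return all((self.assertion, self.provenance, self.publication_info))


# Hypothetical example; each triple is (subject, predicate, object).
example = Nanopublication(
    uri="http://example.org/np1",
    assertion=(("ex:malaria", "ex:isTransmittedBy", "ex:mosquito"),),
    provenance=(
        ("http://example.org/np1#assertion", "prov:wasDerivedFrom", "ex:study42"),
    ),
    publication_info=(
        ("http://example.org/np1", "dcterms:creator", "ex:alice"),
    ),
)
```

Freezing the dataclass mirrors the idea that a published nanopublication is not edited afterwards.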
8. Trusty URIs: Cryptographic Hash Values for Verifiable and Immutable Web Identifiers
Nanopublications with Trusty URIs are verifiable, immutable, and permanent.
http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70.trig
http://trustyuri.net/
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
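The idea of embedding a hash in an identifier can be sketched as follows. This is a simplification and not the Trusty URI algorithm itself: it hashes raw bytes with SHA-256 instead of normalizing RDF content, uses a placeholder module tag "XX", and ignores file extensions. All function names are illustrative.

```python
import base64
import hashlib

HASH_LENGTH = 43  # a 256-bit hash is 43 characters in unpadded Base64


def content_hash(content: bytes) -> str:
    """SHA-256 of the content, encoded as URL-safe Base64 without padding."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_hashed_uri(base_uri: str, content: bytes, module: str = "XX") -> str:
    """Append a module tag and the content hash to a base URI."""
    return f"{base_uri}.{module}{content_hash(content)}"


def verify(uri: str, content: bytes) -> bool:
    """Check that the hash at the end of the URI matches the content."""
    return uri[-HASH_LENGTH:] == content_hash(content)


uri = make_hashed_uri("http://example.org/r1", b"some data")
```

Any change to the content changes the hash, so a URI built this way can only ever refer to one exact byte sequence.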
9. Decentralized and Reliable Publishing with a Nanopublication Server Network
[Figure: nanopublication server network; nanopublications with Trusty URIs pass through publication, propagation/archiving, and retrieval]
http://npmonitor.inn.ac
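Why replication across servers makes retrieval reliable can be shown with a small simulation. This is a sketch under stated assumptions: the Server class, pull_from, and retrieve are hypothetical, servers are in-memory objects, and the real server protocol is not modeled.

```python
import hashlib


def key_of(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


class Server:
    """Hypothetical in-memory server holding hash-identified entries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.store = {}  # key -> content

    def publish(self, content: bytes) -> str:
        key = key_of(content)
        self.store[key] = content
        return key

    def pull_from(self, peer: "Server") -> None:
        """Archive every entry the peer has and this server lacks."""
        for key, content in peer.store.items():
            self.store.setdefault(key, content)


def retrieve(key: str, servers: list) -> bytes:
    """Ask servers in turn; accept only content that matches the key."""
    for server in servers:
        if not server.online:
            continue
        content = server.store.get(key)
        if content is not None and key_of(content) == key:
            return content
    raise LookupError(f"no reachable server holds {key}")


first, second = Server("first"), Server("second")
key = first.publish(b"a nanopublication")
second.pull_from(first)   # propagation / archiving
first.online = False      # the original server goes down
content = retrieve(key, [first, second])
```

Since the client checks the hash itself, it can accept a copy from any server, including one it has no reason to trust.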
10. Defining Datasets with Nanopublication Indexes
(which are themselves Nanopublications)
[Figure: example index structures (a) to (f); relation labels: appends, has sub-index, has element]
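Resolving a dataset from its index can be sketched as a recursive traversal over the three relations: has element, has sub-index, and appends. The Index class, the resolve function, and the ex: identifiers are illustrative assumptions, not the format used by the actual tools.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Index:
    uri: str
    elements: list = field(default_factory=list)     # member nanopublication URIs
    sub_indexes: list = field(default_factory=list)  # URIs of nested indexes
    appends: Optional[str] = None                    # URI of the index this one extends


def resolve(uri: str, indexes: dict, seen: Optional[set] = None) -> set:
    """Collect all nanopublication URIs reachable from the given index."""
    seen = set() if seen is None else seen
    if uri in seen:  # an index shared by several parents is visited only once
        return set()
    seen.add(uri)
    index = indexes[uri]
    members = set(index.elements)
    if index.appends is not None:
        members |= resolve(index.appends, indexes, seen)
    for sub in index.sub_indexes:
        members |= resolve(sub, indexes, seen)
    return members


# Hypothetical indexes: i2 appends i1, and i3 has i2 as a sub-index.
indexes = {
    "ex:i1": Index("ex:i1", elements=["ex:np1", "ex:np2"]),
    "ex:i2": Index("ex:i2", elements=["ex:np3"], appends="ex:i1"),
    "ex:i3": Index("ex:i3", elements=["ex:np4"], sub_indexes=["ex:i2"]),
}
dataset = resolve("ex:i3", indexes)
```

Because indexes are themselves published entries, a whole dataset can be referenced by the single identifier of its top-level index.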
11. Nanopublication Server Network is Efficient and Scalable
Our servers deliver nanopublications about 100 times faster than a
triple store does (and need far fewer resources):
[Figure: response time in seconds (axis ticks 0.1 to 100) against time from start of test in seconds (0 to 300) and number of clients accessing the service in parallel (0 to 100), for a Virtuoso triple store with SPARQL endpoint and for a nanopublication server]
13. Reliable Low-Level Publish/Retrieve Operations!
Operation to publish data:
$ np publish nanopubs.trig
156026 nanopubs published at http://np.inn.ac/
which can also be used to publish dataset definitions (indexes):
$ np publish index.trig
157 nanopubs published at http://np.inn.ac/
Operation to retrieve data entries:
$ np get http://np.inn.ac/RA7Kmmugi8OuCirfe5WKchnJhC3FuhQD
and to retrieve entire datasets:
$ np get -c http://np.inn.ac/RAY_lQruuagCYtAcKAPptkY7EpITw
https://github.com/Nanopublication/nanopub-java
14. Future Work
• Improve server protocol
• Develop services on top of the server network
• Establish best practices for versioning, retractions, reviews, etc.
• Connect it all to the scientific publishing workflow
15. Thank you for your attention!
Questions?
Further information:
• Paper on the approach:
https://peerj.com/articles/cs-78/
• Nanopublications: http://nanopub.org
• Trusty URIs: http://trustyuri.net
• Nanopublication Server Network: http://npmonitor.inn.ac
16. Multi-Layer Architecture
applications (analyze/use data)
advanced services (query/analyze data)
core services (find data)
decentralized server network (provide data)