SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Provenance as a Building Block for an Open
Science Infrastructure
Andreas Schreiber
German Aerospace Center (DLR)
Cologne/Berlin, Germany
ISGC 2018, Taipei, Taiwan
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 1
Topics
• Reproducibility
• Provenance and PROV
• Storing provenance
• Gathering provenance
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 2
Reproducibility
Reproducibility in (data) science is based on
• Open Source Software
• Code Reviews
• Code Repositories
• Publications with code
• Container (Docker etc.)
• Workflows
• (Electronic) laboratory notebooks
• Open data formats
• Data management
• Metadata and Provenance
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 3
Provenance
Basics
• Provenance refers to the source of
information and the process that led to its
existence
• Where did I get this file?
• How did it come to exist?
• Provenance information is critical to users
trying to understand where a particular
data file came from
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 4
Other and related terms
• Traceability
• Lineage
• Logging
• Monitoring
Provenance Information
Capture, archive, and distribute provenance information, for example
• The source of all externally supplied data files
• The source of the algorithms used to transform the data within the system
• The Algorithm design documents
• A complete description of the processing environment
• A complete description of the processing framework
• A record of each job’s execution
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 5
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 6
Data Science Workflows
More Formal Definition of Provenance
Provenance is
information about entities, activities, and people
involved in
producing a piece of data or thing,
which can be used to form
assessments about its quality, reliability or trustworthiness.
PROV W3C Working Group
https://www.w3.org/TR/prov-overview
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 7
W3C Specification „PROV“
• PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data
model to RDF
• PROV-DM, the PROV data model for provenance
• PROV-N, a notation for provenance aimed at human consumption
• PROV-CONSTRAINTS, a set of constraints applying to the PROV data model
• PROV-XML, an XML schema for the PROV data model
• PROV-AQ, mechanisms for accessing and querying provenance
• PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs
• PROV-DC provides a mapping between PROV-O and Dublin Core Terms
• PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model
• PROV-LINKS introduces a mechanism to link across bundles
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 8
PROV Elements
Entities
• Physical, digital, conceptual, or other kinds of things
• For example, documents, web sites, graphics, or data sets
Activities
• Activities generate new entities or
make use of existing entities
• Activities could be actions or processes
Agents
• Agents takes a role in an activity and have
the responsibility for the activity
• For example, persons, pieces of software,
or organizations
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 9
Activity
Entity
Agent
PROV Relations
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 10
Activity
Entity
Agent
wasGeneratedBy
used
wasDerivedFrom
wasAttributedTo
wasAssociatedWith
Baking a Cake
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 11
100 g
butter
bake
2
eggs
100 g
sugar
100 g
flour
cake
used
used
used
used
wasGeneratedBy
wasDerivedFrom
Textual Representations Visualizations
PROV Notations and Representations
• Formats: PROV-N, JSON, Turtle, XML, …
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 12
document
prefix userdata http://software.dlr.de/qs/userdata/
. . .
wasDerivedFrom(userdata:weights, userdata:WeightReport.csv,
wasDerivedFrom(qs:graphic/weights, userdata:weights,
wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -)
used(python_method:read_csv, library:pandas, -)
used(python_method:matplotlib_plot, userdata:weights, -)
used(python_method:matplotlib_plot, library:matplotlib, -)
used(python_method:read_csv, userdata:WeightReport.csv, -)
wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com)
agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"])
entity(library:pandas, [library:version="0.17.1"])
entity(userdata:WeightReport.csv)
entity(userdata:weights)
. . .
endDocument
Storing and Retrieving Provenance
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 13
Provenance Architecture
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 14
Recording of Data Processing
Information
Application
Data (Results)
Provenance
Store
Storing and Retrieving Provenance
Some Storage Technologies
• Relational databases and SQL
• XML and Xpath
• RDF and SPARQL
• Graph databases and Gremlin/Cypher
Services
• REST APIs
• PROVSTORE
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 15
ProvStore
University of Southampton
• RESTful web service
• storage and access of
provenance documents
• Public and private
documents
• Conversion to various
text formats
• Simple visualizations
• APIs
• Python
• jQuery
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 16
https://provenance.ecs.soton.ac.uk/store/
Graphs
Provenance is a Directed Acyclic Graph (DAG)
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 17
A
B
E
F
G
D
C
Graph Databases
Naturally, graph databases are a good
technology for storing (Provenance) graphs
Many graph databases are available
• Neo4j
• Titan
• ArangoDB
• ...
Query languages
• Cypher
• Gremlin (TinkerPop)
• GraphQL
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 18
Neo4j
• Open-Source
• Implemented in Java
• Stores property graphs
(key-value-based, directed)
http://neo4j.com
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 19
Storing Provenance in Graph Database
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 20
Graph database Neo4j
MATCH (e:Entity)-[*]-(u:Agent) RETURN u
Trusted Provenance: Storing Provenance in a Blockchain
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 21
PROV2BIGCHAINDB
https://github.com/DLR-SC/prov2bigchaindb
Blockchain
Combination of multiple techniques
• Peer-to-peer network
• Public/Private key signing
• Time-stamping
• Proof-of-Work
• Merkle-Trees
Proposed solutions to
• The double-spending problem
• The byzantine generals problem
• Tamper-resistant distributed database
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 22
Blockchain Transactions
• Linked by hash of current and preceding
transactions
• Bitcoin: Transfers amount of BTC
• Public/private key signing
• All transactions are broadcasted across
the network
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 23
Document-based Storage of PROV Documents
• Only one user/address on the blockchain
• Provenance is stored as one valid document
Pros
• Less complex
• Ownership restricted to one participant
• Easy to query
• Less costly, if less data is added
Cons
• Single point of failure
• Less tamper-resistant
• No chaining of transactions
• Huge amount of data in transactions
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 24
Role-based Storage of PROV Documents
• Every agents is a blockchain user/address
• Generates transactions for its entities and
activities
• Relations modeled with references to other
transactions
Pros
• Close to typical process structures
• Implicit ownership and responsibility
Cons
• Agent needs to know relevant transactions
for references
• Difficult to query, if no ownership transfer is
used
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 25
Graph-based Storage of PROV Documents
• All PROV relations are modeled as ownership transfer
• All agents, activity and entities are actual blockchain
user/addresses
Pros
• Mapping close to PROV model
• Small amount of data per transaction
• Strong tamper-resistant due to:
• Multiple owner
• Large amount of transactions
Cons
• Complex implementation
• High costs due to many transactions
• Very slow in querying, if traversal of
transactions is needed
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 26
Test Setup
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 27
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 28
Performance Comparison
Gathering Provenance
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 29
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 30
Data &
Metadata
Workflows
Algorithms / Scripts Machine Learning
Data
Management
PROV
Provenance
Store
</>
Software
Development
Gather or Generate Provenance
Depends on your application (tools, languages, etc.)
• Generation at run-time, compile-time, or retrospectively
Runtime
• Instrumentation of the application
• Cumbersome from software engineering perspective
• Combined with logging or with aspect-oriented approaches
Compile time
• Based on static code analysis (dependency analysis, program slicing, etc.)
Retrospectively
• Reconstructed from files or filesystem metadata
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 31
Tools and Libraries for Generating Provenance
Libraries for Python
• PROVPY
• PROVNEO4J
Other Tools
• NOWORKFLOW
• GIT2PROV
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 32
Python Library ProvPy (PROV)
https://github.com/trungdong/prov
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 33
from prov.model import ProvDocument
# Create a new provenance document
d1 = ProvDocument()
# Entity: now:employment-article-v1.html
e1 = d1.entity('now:employment-article-v1.html')
# Agent: nowpeople:Bob
d1.agent('nowpeople:Bob')
# Attributing the article to the agent
d1.wasAttributedTo(e1, 'nowpeople:Bob')
d1.entity('govftp:oesm11st.zip',
{'prov:label': 'employment-stats-2011',
'prov:type': 'void:Dataset'})
d1.wasDerivedFrom('now:employment-article-v1.html',
'govftp:oesm11st.zip')
# Adding an activity
d1.activity('is:writeArticle')
d1.used('is:writeArticle', 'govftp:oesm11st.zip')
d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
Python Library ProvPy (PROV)
https://github.com/trungdong/prov
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 34
PROVNEO4J – Storing PROV Documents in Neo4j
https://github.com/DLR-SC/provneo4j
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 35
import provneo4j.api
provneo4j_api = provneo4j.api.Api(
base_url="http://localhost:7474/db/data",
username="neo4j", password="python")
provneo4j_api.document.create(prov_doc, name=”MyProv”)
PROVNEO4J – Storing PROV Documents in Neo4j
https://github.com/DLR-SC/provneo4j
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 36
Provenance Instrumentation of TENSORFLOW
Provenance of TENSORFLOW workflows
• Tensor  PROV Entity
• Operations  PROV Activity
Example: MNIST with 400 training iterations
• 64581 database nodes
• 33549 Entities
• 31032 Activities
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 37
Provenance Instrumentation
of TENSORFLOW
Example Query
• Shortest paths from all tensors
in 400. iteration to init operation
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 38
MATCH path=allShortestPaths((root)<-[*]-(n))
WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400"
RETURN path
NOWORKFLOW – Provenance of Scripts
https://github.com/gems-uff/noworkflow
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 39
Project
experiment.py
p12.dat
p13.dat
precipitation.py
p14.dat
out.png
$ now run -e Tracker experiment.py
GIT2PROV
http://git2prov.org
• Generate PROV documents
from git repositories
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 40
GIT2PROV Example Output
https://provenance.ecs.soton.ac.uk/store/documents/116377/
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 41
Provenance Visualization
Visualization of Provenance is an ongoing research topic
• Especially, for non-experts (“Provenance for people”)
• Example: PROV COMICS
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 42
Key Messages and Summary
Recording the Provenance of science workflows is important
• to understand where data came from
• to reproduce data processing steps or whole workflows
Use a standard for Provenance
• W3C standard PROV
• Mapping to (graph) databases, allows easy querying
• A standard allow interoperability and comparison
• Storing in blockchains for increasing trust
Recording Provenance is not hard
• APIs and tools available
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 43
Activity
Entity
Agent
wasGeneratedBy
used
wasDerivedFrom
wasAttributedTo
wasAssociatedWith
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 44
Thank You!
Questions?
Andreas.Schreiber@dlr.de
www.DLR.de/sc | @onyame

Weitere ähnliche Inhalte

Was ist angesagt?

Massively Scalable Computational Finance with SciDB
 Massively Scalable Computational Finance with SciDB Massively Scalable Computational Finance with SciDB
Massively Scalable Computational Finance with SciDBParadigm4Inc
 
CCCB Germline Variant Analysis on Cloud Platform
CCCB Germline Variant Analysis on Cloud PlatformCCCB Germline Variant Analysis on Cloud Platform
CCCB Germline Variant Analysis on Cloud PlatformYaoyu Wang
 
Real-time Data Analytics mit Elasticsearch
Real-time Data Analytics mit ElasticsearchReal-time Data Analytics mit Elasticsearch
Real-time Data Analytics mit Elasticsearchinovex GmbH
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSAlasdair Gray
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
TPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data cloudsTPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data cloudsRim Moussa
 
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهمعرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهWeb Standards School
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
iRODS UGM 2018 Fair data management and DISQOVERability
iRODS UGM 2018 Fair data management and DISQOVERabilityiRODS UGM 2018 Fair data management and DISQOVERability
iRODS UGM 2018 Fair data management and DISQOVERabilityMaarten Coonen
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Time Series Analytics for Big Fast Data
Time Series Analytics for Big Fast DataTime Series Analytics for Big Fast Data
Time Series Analytics for Big Fast DataSouth West Data Meetup
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
The Elastic Stack as a SIEM
The Elastic Stack as a SIEMThe Elastic Stack as a SIEM
The Elastic Stack as a SIEMJohn Hubbard
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scaleMark Schroering
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 

Was ist angesagt? (20)

Massively Scalable Computational Finance with SciDB
 Massively Scalable Computational Finance with SciDB Massively Scalable Computational Finance with SciDB
Massively Scalable Computational Finance with SciDB
 
CCCB Germline Variant Analysis on Cloud Platform
CCCB Germline Variant Analysis on Cloud PlatformCCCB Germline Variant Analysis on Cloud Platform
CCCB Germline Variant Analysis on Cloud Platform
 
Real-time Data Analytics mit Elasticsearch
Real-time Data Analytics mit ElasticsearchReal-time Data Analytics mit Elasticsearch
Real-time Data Analytics mit Elasticsearch
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLS
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Data democratised
Data democratisedData democratised
Data democratised
 
TPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data cloudsTPC-H analytics' scenarios and performances on Hadoop data clouds
TPC-H analytics' scenarios and performances on Hadoop data clouds
 
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهمعرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
iRODS UGM 2018 Fair data management and DISQOVERability
iRODS UGM 2018 Fair data management and DISQOVERabilityiRODS UGM 2018 Fair data management and DISQOVERability
iRODS UGM 2018 Fair data management and DISQOVERability
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
 
Time Series Analytics for Big Fast Data
Time Series Analytics for Big Fast DataTime Series Analytics for Big Fast Data
Time Series Analytics for Big Fast Data
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
The Elastic Stack as a SIEM
The Elastic Stack as a SIEMThe Elastic Stack as a SIEM
The Elastic Stack as a SIEM
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scale
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 

Ähnlich wie Provenance as a building block for an open science infrastructure

Provenance for Reproducible Data Science
Provenance for Reproducible Data ScienceProvenance for Reproducible Data Science
Provenance for Reproducible Data ScienceAndreas Schreiber
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingGuido Schmutz
 
Reproducible Science with Python
Reproducible Science with PythonReproducible Science with Python
Reproducible Science with PythonAndreas Schreiber
 
Neo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to GraphsNeo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to GraphsNeo4j
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Jenny Mitcham
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...confluent
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksOpenAIRE
 
A Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self DataA Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self DataAndreas Schreiber
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23Dan Boutin
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Modelling research output expressions : metadata schema modelling of publicat...
Modelling research output expressions : metadata schema modelling of publicat...Modelling research output expressions : metadata schema modelling of publicat...
Modelling research output expressions : metadata schema modelling of publicat...CILIP MDG
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsAnita de Waard
 
Showcasing research data tools - Jisc Digifest 2016
Showcasing research data tools - Jisc Digifest 2016Showcasing research data tools - Jisc Digifest 2016
Showcasing research data tools - Jisc Digifest 2016Jisc
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...Michele Pasin
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformVMware Tanzu
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 

Ähnlich wie Provenance as a building block for an open science infrastructure (20)

Provenance for Reproducible Data Science
Provenance for Reproducible Data ScienceProvenance for Reproducible Data Science
Provenance for Reproducible Data Science
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
 
Reproducible Science with Python
Reproducible Science with PythonReproducible Science with Python
Reproducible Science with Python
 
Neo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to GraphsNeo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to Graphs
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly works
 
A Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self DataA Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self Data
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Modelling research output expressions : metadata schema modelling of publicat...
Modelling research output expressions : metadata schema modelling of publicat...Modelling research output expressions : metadata schema modelling of publicat...
Modelling research output expressions : metadata schema modelling of publicat...
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data Commons
 
Blockchains and databases a new era in distributed computing
Blockchains and databases a new era in distributed computingBlockchains and databases a new era in distributed computing
Blockchains and databases a new era in distributed computing
 
Showcasing research data tools - Jisc Digifest 2016
Showcasing research data tools - Jisc Digifest 2016Showcasing research data tools - Jisc Digifest 2016
Showcasing research data tools - Jisc Digifest 2016
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data Platform
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 

Mehr von Andreas Schreiber

Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...Andreas Schreiber
 
Visualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented RealityVisualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented RealityAndreas Schreiber
 
Raising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace CenterRaising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace CenterAndreas Schreiber
 
Open Source Licensing for Rocket Scientists
Open Source Licensing for Rocket ScientistsOpen Source Licensing for Rocket Scientists
Open Source Licensing for Rocket ScientistsAndreas Schreiber
 
Interactive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality HeadsetsInteractive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality HeadsetsAndreas Schreiber
 
Visualizing Provenance using Comics
Visualizing Provenance using ComicsVisualizing Provenance using Comics
Visualizing Provenance using ComicsAndreas Schreiber
 
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-VerletzungenNachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-VerletzungenAndreas Schreiber
 
Tracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The RestTracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The RestAndreas Schreiber
 
High Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris DataHigh Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris DataAndreas Schreiber
 
Bericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & ExpositionBericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & ExpositionAndreas Schreiber
 
Telemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermannTelemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermannAndreas Schreiber
 
Quantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-SensorenQuantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-SensorenAndreas Schreiber
 
Example Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanionExample Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanionAndreas Schreiber
 
Beispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiterBeispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiterAndreas Schreiber
 
Informatik für die Welt von Morgen
Informatik für die Welt von MorgenInformatik für die Welt von Morgen
Informatik für die Welt von MorgenAndreas Schreiber
 
Python for High Performance and Scientific Computing
Python for High Performance and Scientific ComputingPython for High Performance and Scientific Computing
Python for High Performance and Scientific ComputingAndreas Schreiber
 

Mehr von Andreas Schreiber (20)

Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
 
Visualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented RealityVisualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented Reality
 
Raising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace CenterRaising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace Center
 
Open Source Licensing for Rocket Scientists
Open Source Licensing for Rocket ScientistsOpen Source Licensing for Rocket Scientists
Open Source Licensing for Rocket Scientists
 
Interactive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality HeadsetsInteractive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality Headsets
 
Visualizing Provenance using Comics
Visualizing Provenance using ComicsVisualizing Provenance using Comics
Visualizing Provenance using Comics
 
Quantified Self Comics
Quantified Self ComicsQuantified Self Comics
Quantified Self Comics
 
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-VerletzungenNachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
 
Python at Warp Speed
Python at Warp SpeedPython at Warp Speed
Python at Warp Speed
 
Open Source im DLR
Open Source im DLROpen Source im DLR
Open Source im DLR
 
Tracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The RestTracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The Rest
 
High Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris DataHigh Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris Data
 
Bericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & ExpositionBericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & Exposition
 
Telemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermannTelemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermann
 
Big Python
Big PythonBig Python
Big Python
 
Quantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-SensorenQuantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-Sensoren
 
Example Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanionExample Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanion
 
Beispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiterBeispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiter
 
Informatik für die Welt von Morgen
Informatik für die Welt von MorgenInformatik für die Welt von Morgen
Informatik für die Welt von Morgen
 
Python for High Performance and Scientific Computing
Python for High Performance and Scientific ComputingPython for High Performance and Scientific Computing
Python for High Performance and Scientific Computing
 

Kürzlich hochgeladen

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 

Kürzlich hochgeladen (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 

Provenance as a building block for an open science infrastructure

  • 1. Provenance as a Building Block for an Open Science Infrastructure Andreas Schreiber German Aerospace Center (DLR) Cologne/Berlin, Germany ISGC 2018, Taipei, Taiwan > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 1
  • 2. Topics • Reproducibility • Provenance and PROV • Storing provenance • Gathering provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 2
  • 3. Reproducibility Reproducibility in (data) science is based on • Open Source Software • Code Reviews • Code Repositories • Publications with code • Container (Docker etc.) • Workflows • (Electronic) laboratory notebooks • Open data formats • Data management • Metadata and Provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 3
  • 4. Provenance Basics • Provenance refers to the source of information and the process that led to its existence • Where did I get this file? • How did it come to exist? • Provenance information is critical to users trying to understand where a particular data file came from > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 4 Other and related terms • Traceability • Lineage • Logging • Monitoring
  • 5. Provenance Information Capture, archive, and distribute provenance information, for example • The source of all externally supplied data files • The source of the algorithms used to transform the data within the system • The Algorithm design documents • A complete description of the processing environment • A complete description of the processing framework • A record of each job’s execution > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 5
  • 6. > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 6 Data Science Workflows
  • 7. More Formal Definition of Provenance Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV W3C Working Group https://www.w3.org/TR/prov-overview > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 7
  • 8. W3C Specification „PROV“ • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data model to RDF • PROV-DM, the PROV data model for provenance • PROV-N, a notation for provenance aimed at human consumption • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model • PROV-XML, an XML schema for the PROV data model • PROV-AQ, mechanisms for accessing and querying provenance • PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs • PROV-DC provides a mapping between PROV-O and Dublin Core Terms • PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model • PROV-LINKS introduces a mechanism to link across bundles > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 8
  • 9. PROV Elements Entities • Physical, digital, conceptual, or other kinds of things • For example, documents, web sites, graphics, or data sets Activities • Activities generate new entities or make use of existing entities • Activities could be actions or processes Agents • Agents takes a role in an activity and have the responsibility for the activity • For example, persons, pieces of software, or organizations > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 9 Activity Entity Agent
  • 10. PROV Relations > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 10 Activity Entity Agent wasGeneratedBy used wasDerivedFrom wasAttributedTo wasAssociatedWith
  • 11. Baking a Cake > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 11 100 g butter bake 2 eggs 100 g sugar 100 g flour cake used used used used wasGeneratedBy wasDerivedFrom
  • 12. Textual Representations Visualizations PROV Notations and Representations • Formats: PROV-N, JSON, Turtle, XML, … > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 12 document prefix userdata http://software.dlr.de/qs/userdata/ . . . wasDerivedFrom(userdata:weights, userdata:WeightReport.csv, wasDerivedFrom(qs:graphic/weights, userdata:weights, wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -) used(python_method:read_csv, library:pandas, -) used(python_method:matplotlib_plot, userdata:weights, -) used(python_method:matplotlib_plot, library:matplotlib, -) used(python_method:read_csv, userdata:WeightReport.csv, -) wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com) agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"]) entity(library:pandas, [library:version="0.17.1"]) entity(userdata:WeightReport.csv) entity(userdata:weights) . . . endDocument
  • 13. Storing and Retrieving Provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 13
  • 14. Provenance Architecture > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 14 Recording of Data Processing Information Application Data (Results) Provenance Store
  • 15. Storing and Retrieving Provenance Some Storage Technologies • Relational databases and SQL • XML and Xpath • RDF and SPARQL • Graph databases and Gremlin/Cypher Services • REST APIs • PROVSTORE > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 15
  • 16. ProvStore University of Southampton • RESTful web service • storage and access of provenance documents • Public and private documents • Conversion to various text formats • Simple visualizations • APIs • Python • jQuery > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 16 https://provenance.ecs.soton.ac.uk/store/
  • 17. Graphs Provenance is a Directed Acyclic Graph (DAG) > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 17 A B E F G D C
  • 18. Graph Databases Naturally, graph databases are a good technology for storing (Provenance) graphs Many graph databases are available • Neo4j • Titan • ArangoDB • ... Query languages • Cypher • Gremlin (TinkerPop) • GraphQL > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 18
  • 19. Neo4j • Open-Source • Implemented in Java • Stores property graphs (key-value-based, directed) http://neo4j.com > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 19
  • 20. Storing Provenance in Graph Database > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 20 Graph database Neo4j MATCH (e:Entity)-[*]-(u:Agent) RETURN u
  • 21. Trusted Provenance: Storing Provenance in a Blockchain > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 21 PROV2BIGCHAINDB https://github.com/DLR-SC/prov2bigchaindb
  • 22. Blockchain Combination of multiple techniques • Peer-to-peer network • Public/Private key signing • Time-stamping • Proof-of-Work • Merkle-Trees Proposed solutions to • The double-spending problem • The byzantine generals problem • Tamper-resistant distributed database > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 22
  • 23. Blockchain Transactions • Linked by hash of current and preceding transactions • Bitcoin: Transfers amount of BTC • Public/private key signing • All transactions are broadcasted across the network > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 23
  • 24. Document-based Storage of PROV Documents • Only one user/address on the blockchain • Provenance is stored as one valid document Pros • Less complex • Ownership restricted to one participant • Easy to query • Less costly, if less data is added Cons • Single point of failure • Less tamper-resistant • No chaining of transactions • Huge amount of data in transactions > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 24
  • 25. Role-based Storage of PROV Documents • Every agents is a blockchain user/address • Generates transactions for its entities and activities • Relations modeled with references to other transactions Pros • Close to typical process structures • Implicit ownership and responsibility Cons • Agent needs to know relevant transactions for references • Difficult to query, if no ownership transfer is used > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 25
  • 26. Graph-based Storage of PROV Documents • All PROV relations are modeled as ownership transfer • All agents, activity and entities are actual blockchain user/addresses Pros • Mapping close to PROV model • Small amount of data per transaction • Strong tamper-resistant due to: • Multiple owner • Large amount of transactions Cons • Complex implementation • High costs due to many transactions • Very slow in querying, if traversal of transactions is needed > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 26
  • 27. Test Setup > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 27
  • 28. > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 28 Performance Comparison
  • 29. Gathering Provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 29
  • 30. > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 30 Data & Metadata Workflows Algorithms / Scripts Machine Learning Data Management PROV Provenance Store </> Software Development
  • 31. Gather or Generate Provenance Depends on your application (tools, languages, etc.) • Generation at run-time, compile-time, or retrospectively Runtime • Instrumentation of the application • Cumbersome from software engineering perspective • Combined with logging or with aspect-oriented approaches Compile time • Based on static code analysis (dependency analysis, program slicing, etc.) Retrospectively • Reconstructed from files or filesystem metadata > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 31
  • 32. Tools and Libraries for Generating Provenance Libraries for Python • PROVPY • PROVNEO4J Other Tools • NOWORKFLOW • GIT2PROV > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 32
  • 33. Python Library ProvPy (PROV) https://github.com/trungdong/prov > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 33 from prov.model import ProvDocument # Create a new provenance document d1 = ProvDocument() # Entity: now:employment-article-v1.html e1 = d1.entity('now:employment-article-v1.html') # Agent: nowpeople:Bob d1.agent('nowpeople:Bob') # Attributing the article to the agent d1.wasAttributedTo(e1, 'nowpeople:Bob') d1.entity('govftp:oesm11st.zip', {'prov:label': 'employment-stats-2011', 'prov:type': 'void:Dataset'}) d1.wasDerivedFrom('now:employment-article-v1.html', 'govftp:oesm11st.zip') # Adding an activity d1.activity('is:writeArticle') d1.used('is:writeArticle', 'govftp:oesm11st.zip') d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
  • 34. Python Library ProvPy (PROV) https://github.com/trungdong/prov > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 34
  • 35. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 35 import provneo4j.api provneo4j_api = provneo4j.api.Api( base_url="http://localhost:7474/db/data", username="neo4j", password="python") provneo4j_api.document.create(prov_doc, name=”MyProv”)
  • 36. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 36
  • 37. Provenance Instrumentation of TENSORFLOW Provenance of TENSORFLOW workflows • Tensor  PROV Entity • Operations  PROV Activity Example: MNIST with 400 training iterations • 64581 database nodes • 33549 Entities • 31032 Activities > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 37
  • 38. Provenance Instrumentation of TENSORFLOW Example Query • Shortest paths from all tensors in 400. iteration to init operation > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 38 MATCH path=allShortestPaths((root)<-[*]-(n)) WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400" RETURN path
  • 39. NOWORKFLOW – Provenance of Scripts https://github.com/gems-uff/noworkflow > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 39 Project experiment.py p12.dat p13.dat precipitation.py p14.dat out.png $ now run -e Tracker experiment.py
  • 40. GIT2PROV http://git2prov.org • Generate PROV documents from git repositories > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 40
  • 41. GIT2PROV Example Output https://provenance.ecs.soton.ac.uk/store/documents/116377/ > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 41
  • 42. Provenance Visualization Visualization of Provenance is an ongoing research topic • Especially, for non-experts (“Provenance for people”) • Example: PROV COMICS > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 42
  • 43. Key Messages and Summary Recording the Provenance of science workflows is important • to understand where data came from • to reproduce data processing steps or whole workflows Use a standard for Provenance • W3C standard PROV • Mapping to (graph) databases, allows easy querying • A standard allow interoperability and comparison • Storing in blockchains for increasing trust Recording Provenance is not hard • APIs and tools available > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 43 Activity Entity Agent wasGeneratedBy used wasDerivedFrom wasAttributedTo wasAssociatedWith
  • 44. > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 44 Thank You! Questions? Andreas.Schreiber@dlr.de www.DLR.de/sc | @onyame