Provenance at the Dagstuhl seminar on Semantic Data Management, April 2012

Paolo Missier, José Manuel Gómez-Pérez, Satya Sahoo

SWPM’12, June 2012

Dagstuhl repost @ SWPM 12 - P.Missier



previously at Dagstuhl...




                                        Much provenance, not much semantics

                                        - final report to be published soon

                                        Interim Seminar wiki




The provenance day @Dagstuhl
                                        Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
                                        Session 1. Provenance in semantic data management
                                         ■ Tutorial: Provenance, some useful concepts (Paul, 20 minutes)
                                         ■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
                                         ■ Presentations from other attendees.
                                            ■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
                                            ■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition

                                        Session 2. Presentations
                                         ■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
                                         ■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
                                         ■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min).

                                        Session 3. Working groups and wrap-up
                                         ■ Objective: obtain roadmaps about typical problems on provenance
                                         ■ Working groups
                                            ■ Frank van Harmelen: Provenance and scalability
                                            ■ Paolo Missier: Provenance-specific benchmarks and corpora
                                            ■ José Manuel Gómez-Pérez: Novel usages of provenance information
                                            ■ Norbert Fuhr: Provenance and uncertainty




WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
                                        • Data integration
                                          – assisted analysis, exploration along different dimensions of quality
                                          – SmartCities, OpenStreetMap
                                        • Analytics in social networks
                                          – detect cool members in social networks
                                        • Provenance diff (hard in general)
                                        • Billing / Privacy
                                          – emerging pay-per-query models
                                        • Credit, attribution, citation and licensing
                                        • Result reproducibility (e.g., Executable Paper Challenge)
                                        • Determining the quality of a report generated by third parties for an
                                          organisation (e.g., a government report)




WG: creating provenance-specific benchmarks
                                        • Another one of the spontaneous Working Group activities at
                                          Dagstuhl
                                        • Not strictly “semantic”
                                          – but PROV-RDF is one of the expected encodings
                                        • Led by Satya Sahoo and Paolo Missier
                                        • A community initiative


                                          Goal:

                                          To collect a corpus of reference provenance traces
                                          from multiple contributors
                                          from multiple domains
                                          and make it available as a community resource




Collecting reference provenance datasets
                                          Why:
                                        • to better understand actual usages of provenance
                                        • for analysing properties of provenance graphs
                                          – patterns in graphs
                                        • to create a level playing field for performance comparison
                                          – storage, compression methods
                                          – query models, query processing
                                             • SPARQL
                                             • Datalog
                                             • Graph query languages
                                        • to test algorithms that prove interesting hypotheses
                                          – “prov(D) contains valid indicators for quality(D)”

                                          How:
                                        • By collecting submissions from the community
                                        • By generating synthetic provenance
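
As an illustration of the second option, here is a minimal sketch of a synthetic trace generator. It is not the working group's method: the function name, the edge representation and the parameters are all assumptions made for this example. Each step is an activity that uses some existing entities and generates one new entity, so the resulting graph is acyclic by construction.

```python
import random


def synthetic_trace(n_steps, max_inputs=2, seed=0):
    """Generate a synthetic trace as a list of (relation, subject, object) edges.

    Hypothetical model: step i is an activity a<i> that uses up to
    `max_inputs` already existing entities and generates entity e<i>.
    """
    rng = random.Random(seed)      # seeded so that a trace can be regenerated
    entities = ["e0"]              # a single initial input entity
    edges = []
    for i in range(1, n_steps + 1):
        activity = f"a{i}"
        k = rng.randint(1, min(max_inputs, len(entities)))
        for used in rng.sample(entities, k):
            edges.append(("used", activity, used))
        new_entity = f"e{i}"
        edges.append(("wasGeneratedBy", new_entity, activity))
        entities.append(new_entity)
    return edges


for edge in synthetic_trace(n_steps=5, seed=42):
    print(edge)
```

Varying `n_steps` and `max_inputs` gives a crude handle on graph size and fan-in, which is the kind of scaling factor a benchmark would need to control.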

What: submissions
                                            Submission:
                                              - a collection of traces
                                              - a collection of queries
                                            hopefully from a variety of different domains


                                          Interesting properties of each trace:
                                        • Graph structure -- regularity, recognizable patterns
                                        • Graph size
                                        • Scaling factors
                                        • What it is to be used for

                                          Submission:
                                        • Diversity of structure and size within the family
                                        • Numerosity of traces
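
The size and structure properties above could be computed per trace along the following lines. This is only a sketch: the edge-list representation, the function name and the choice of statistics are assumptions, and traces are assumed to be acyclic.

```python
from collections import defaultdict


def trace_stats(edges):
    """Return node count, edge count and longest dependency chain of a trace.

    `edges` is a list of (source, target) pairs meaning "source depends on
    target". The trace is assumed to be acyclic.
    """
    deps = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        deps[src].append(dst)
        nodes.update((src, dst))

    depth_cache = {}

    def depth(node):
        # Number of nodes on the longest chain starting at `node`.
        if node not in depth_cache:
            depth_cache[node] = 1 + max(
                (depth(d) for d in deps.get(node, [])), default=0
            )
        return depth_cache[node]

    longest = max((depth(n) for n in nodes), default=1) - 1
    return {"nodes": len(nodes), "edges": len(edges), "longest_chain": longest}


# A hypothetical family of two small traces.
family = {
    "t1": [("e2", "a1"), ("a1", "e1")],
    "t2": [("e3", "a2"), ("a2", "e2"), ("e2", "a1"), ("a1", "e1"), ("a2", "e0")],
}
for name, edges in family.items():
    print(name, trace_stats(edges))
```

Comparing these figures across the traces in a submission gives a first, rough view of the diversity of size within the family.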



What: Traces format
                                        • The PROV assumptions:
                                          – uptake: PROV will be successful (!)
                                          – interoperability: PROV will be sufficiently expressive to provide interoperability


                                        • Thus, expecting PROV encoding for submissions seems
                                          reasonable

                                        • Advantages:
                                          – tools are being built to parse, visualize, validate, analyse PROV-compliant traces
                                          – multiple encodings available
                                              • especially good if RDF is your thing
                                        • Issues:
                                          – Conversion: existing traces are not natively PROV
                                          – is there a need to dereference data at the end of URIs?
                                          – licensing: multiple tiers? specific to each dataset?




What: Queries
                                        • Hypothesis: Some queries are generic, in the sense that they apply across
                                          multiple collections of traces
                                          Single trace queries:
                                        • Reachability queries over data and activity dependencies
                                           – backwards (diagnosis)
                                           – forwards (impact analysis)
                                        • “chains of responsibility” (delegation)
                                          Aggregation queries:
                                        • production/usages of data, activities across traces
                                           – assumes uniformity within a collection
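
The single-trace reachability queries listed above can be sketched as a breadth-first traversal of the dependency graph. The edge representation (pairs of dependent and dependency) and all names in the example are assumptions made for illustration.

```python
from collections import defaultdict, deque


def reachable(edges, start, direction="backward"):
    """Return the nodes reachable from `start` along dependency edges.

    `edges` holds (dependent, dependency) pairs. "backward" walks towards
    what `start` depends on (diagnosis); "forward" walks towards what
    depends on `start` (impact analysis).
    """
    adj = defaultdict(set)
    for dependent, dependency in edges:
        if direction == "backward":
            adj[dependent].add(dependency)
        else:
            adj[dependency].add(dependent)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical trace: raw data is cleaned, then plotted and summarised.
edges = [
    ("report", "plot"), ("plot", "cleaned"),
    ("cleaned", "clean"), ("clean", "raw"),
    ("summary", "summarise"), ("summarise", "cleaned"),
]
print(sorted(reachable(edges, "report", "backward")))
print(sorted(reachable(edges, "raw", "forward")))
```

The same traversal serves both query types; only the direction in which edges are followed changes.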

                                        • Do graph mining problems apply? Do they have interesting interpretations?
                                           – e.g. subgraph discovery

                                        • Feature extraction for learning, mining

                                        • Pairwise trace comparison:
                                           – “earliest divergence” queries between pairs of "nearly isomorphic" traces
                                           – differencing (complex)
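
An "earliest divergence" query can be sketched under a strong simplifying assumption: that the two traces are already aligned as ordered lists of steps, each step recorded as a pair of activity name and output fingerprint. Both the representation and the names are hypothetical. Comparing general provenance graphs is harder, as the "differencing (complex)" item notes.

```python
from itertools import zip_longest


def earliest_divergence(trace_a, trace_b):
    """Return the index of the first step at which two aligned traces differ.

    Returns None when the traces are identical. If one trace is a strict
    prefix of the other, the length of the shorter trace is returned.
    """
    for i, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return i
    return None


# Two hypothetical runs of the same three-step process.
run1 = [("align", "h1"), ("filter", "h2"), ("plot", "h3")]
run2 = [("align", "h1"), ("filter", "h9"), ("plot", "h7")]
print(earliest_divergence(run1, run2))
```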
A provenance repository
                                        • If traces are submitted in one of the PROV standard encodings,
                                          then the provenance repository (P-rep) can provide validation services upon admission

                                        • PROV is expected to support the following encodings:
                                          –   PROV-N -- the technology-neutral notation
                                          –   RDF -- the main official encoding
                                          –   XML -- unofficial XSD available
                                          –   JSON -- unofficial
                                          –   (Datalog? -- even more unofficial but syntactically very close to PROV-N)


                                          Available validations:
                                        • Syntax:
                                          – PROV-N syntax
                                          – XML schema validation
                                        • Consistency:
                                          – validation wrt PROV-constraints

                                          [Diagram: PROV-N is converted by N 2 JSON, N 2 RDF and N 2 XML into PROV-JSON, PROV-RDF and PROV-XML]



Low-hanging fruit
                                        • Wikipedia history pages
                                          – dumps freely available
                                          – or, through the Wikipedia REST API
                                        • OpenStreetMap history pages
                                          – very similar structure


                                        • ...any other?
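
To show why page histories are an easy source of traces, here is a minimal sketch that turns a list of revision records into provenance statements. The record fields (`revid`, `parentid`, `user`) and the sample data are assumptions made for this example; a real history export would need to be mapped onto them. The relation names `wasDerivedFrom` and `wasAttributedTo` are taken from PROV.

```python
def revisions_to_prov(page, revisions):
    """Map revision records to (relation, subject, object) statements.

    Each revision becomes an entity attributed to its editor and derived
    from its parent revision, when it has one.
    """
    statements = []
    for rev in revisions:
        entity = f"{page}/rev/{rev['revid']}"
        statements.append(("wasAttributedTo", entity, rev["user"]))
        if rev.get("parentid"):        # a missing or zero parent marks the first revision
            parent = f"{page}/rev/{rev['parentid']}"
            statements.append(("wasDerivedFrom", entity, parent))
    return statements


# Hand-written sample history, not real data.
history = [
    {"revid": 101, "parentid": 0, "user": "alice"},
    {"revid": 102, "parentid": 101, "user": "bob"},
]
for statement in revisions_to_prov("Provenance", history):
    print(statement)
```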




Can we learn from similar initiatives?
                                        • Well-established repositories for testing Machine Learning methods
                                          – the UCI Machine Learning Repository
                                          – the KDD Cup datasets
                                          – ... and more


                                        • “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
                                          –   DBpedia, UniProt -- large but no representative query workload
                                          –   YAGO: Wikipedia <-> WordNet, 8 queries
                                          –   Barton Library, 7 queries
                                          –   Linked Sensor Dataset, no queries
                                          –   TPC-H as RDF
                                          –   Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
                                          –   Lehigh University Benchmark (LUBM), 14 queries
                                          –   SP2Bench (DBLP) 12 queries

                                          – Original approach:
                                             • Turn every dataset into a benchmark
                                             • by editing the dataset to enforce measures of
                                                 – Coverage and Coherence
WG: Provenance and uncertainty (Norbert Fuhr)
                                        • Uncertainty in the data
                                            – Sensor data, Customer reviews
                                        • Issues
                                            – Reliability (“is this the original painting?”)
                                            – Authenticity
                                        • Sources of uncertain provenance
                                            –   Information extraction / NLP methods
                                            –   Human errors
                                            –   Inferences
                                            –   Instruments
                                        • Challenges
                                            – We need a data model for uncertainty in provenance
                                               • probabilistic dependency relations
                                            – Explanation of the derivation of uncertain results
                                        • Limitations
                                            – Hard rules vs soft rules
                                            – Knowledge acquisition process of those rules
                                            – provenance incompleteness vs uncertainty
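
One very simple reading of "probabilistic dependency relations" is to attach a probability to each dependency edge and combine them. The sketch below does this under the assumption that edges are independent, which a real data model would probably not grant. The function names and the numbers are illustrative only.

```python
import math


def chain_confidence(edge_probs):
    """Probability that every dependency along one derivation chain holds,
    assuming the edges are independent."""
    return math.prod(edge_probs)


def any_derivation(chain_probs):
    """Probability that at least one of several alternative derivations
    holds (noisy-or), assuming the derivations are independent."""
    return 1.0 - math.prod(1.0 - p for p in chain_probs)


# Hypothetical example: a result derived through a two-edge chain, or
# alternatively through a single less reliable edge.
via_sensor = chain_confidence([0.9, 0.8])
via_review = chain_confidence([0.5])
print(round(via_sensor, 2), round(any_derivation([via_sensor, via_review]), 2))
```

Keeping the per-edge probabilities alongside the combined value is one way to support the "explanation of the derivation" challenge: the weakest edge on a chain can be reported with the result.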
13
                                        •

Weitere ähnliche Inhalte

Andere mochten auch

Invited talk @ DCC09 workshop
Invited talk @ DCC09 workshopInvited talk @ DCC09 workshop
Invited talk @ DCC09 workshopPaolo Missier
 
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paolo Missier
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paolo Missier
 
Ipaw12 datalog paper talk
Ipaw12 datalog paper talkIpaw12 datalog paper talk
Ipaw12 datalog paper talkPaolo Missier
 
ProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsPaolo Missier
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBTPaolo Missier
 
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...Paolo Missier
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...Paolo Missier
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralPaolo Missier
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07Paolo Missier
 

Andere mochten auch (10)

Invited talk @ DCC09 workshop
Invited talk @ DCC09 workshopInvited talk @ DCC09 workshop
Invited talk @ DCC09 workshop
 
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
 
Ipaw12 datalog paper talk
Ipaw12 datalog paper talkIpaw12 datalog paper talk
Ipaw12 datalog paper talk
 
ProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphs
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07
 

Ähnlich wie SWPM12 report on the dagstuhl seminar on Semantic Data Management

Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013CS, NcState
 
Discovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaDiscovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaMatthew Lease
 
Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncnisohq
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumRobert Sanderson
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Languagebutest
 
OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objectsseanb
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Nolan Nichols
 
myExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesmyExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesDavid De Roure
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningXavier Llorà
 
Benchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - ParticipationBenchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - Participationjdbess
 
The Economics of Data Sharing
The Economics of Data SharingThe Economics of Data Sharing
The Economics of Data SharingAnita de Waard
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles ParkerBigMine
 
Stream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and BeyondStream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and BeyondEmanuele Della Valle
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingUniversity of Arizona
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Dataaba-sah
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiVijay Susheedran C G
 

Ähnlich wie SWPM12 report on the dagstuhl seminar on Semantic Data Management (20)

Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
Discovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaDiscovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social Media
 
Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
 
Resource Sync - Introduction
Resource Sync - IntroductionResource Sync - Introduction
Resource Sync - Introduction
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
 
OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objects
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
 
myExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesmyExperiment and the Rise of Social Machines
myExperiment and the Rise of Social Machines
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Benchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - ParticipationBenchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - Participation
 
The Economics of Data Sharing
The Economics of Data SharingThe Economics of Data Sharing
The Economics of Data Sharing
 
West coastrollout
West coastrolloutWest coastrollout
West coastrollout
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 
Stream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and BeyondStream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and Beyond
 
Michener Plenary PPSR2012
Michener Plenary PPSR2012Michener Plenary PPSR2012
Michener Plenary PPSR2012
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data Sharing
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 

Mehr von Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
SWPM12 report on the Dagstuhl seminar on Semantic Data Management

  • 1. Provenance at the Dagstuhl seminar on Semantic Data Management, April 2012
    Paolo Missier, José Manuel Gómez-Pérez, Satya Sahoo
    SWPM’12, June 2012
    Dagstuhl repost @ SWPM 12 - P.Missier
  • 2. Previously at Dagstuhl...
    Much provenance, not much semantics
    – final report to be published soon
    – Interim Seminar wiki
  • 3. The provenance day @Dagstuhl
    Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
    Session 1. Provenance in semantic data management
      ■ Tutorial: Provenance: some useful concepts (Paul, 20 minutes)
      ■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
      ■ Presentations from other attendees
        ■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
        ■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition
    Session 2. Presentations
      ■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
      ■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
      ■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min)
    Session 3. Working groups and wrap-up
      ■ Objective: obtain roadmaps about typical problems on provenance
      ■ Working groups
        ■ Frank Van Harmelen: Provenance and scalability
        ■ Paolo Missier: Provenance-specific benchmarks and corpora
        ■ José Manuel Gómez-Pérez: Novel usages of provenance information
        ■ Norbert Fuhr: Provenance and uncertainty
  • 4. WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
    • Data integration
      – assisted analysis, exploration along different dimensions of quality
      – SmartCities, OpenStreetMap
    • Analytics in social networks
      – detect cool members in social networks
    • Provenance diff (hard in general)
    • Billing / Privacy
      – emerging pay-per-query models
    • Credit, attribution, citation and licensing
    • Result reproducibility (e.g., Executable Paper Challenge)
    • Determining the quality of a report that third parties have generated for an organisation (e.g., a Government report)
  • 5. WG: creating provenance-specific benchmarks
    • Another one of the spontaneous Working Group activities at Dagstuhl
    • Not strictly “semantic”
      – but PROV-RDF is one of the expected encodings
    • Led by Satya Sahoo and Paolo Missier (PM)
    • A community initiative
    Goal: to collect a corpus of reference provenance traces from multiple contributors and multiple domains, and make it available as a community resource
  • 6. Collecting reference provenance datasets
    Why:
    • to better understand actual usages of provenance
    • for analysing properties of provenance graphs
      – patterns in graphs
    • to create a level playing field for performance comparison
      – storage, compression methods
      – query models, query processing
        • SPARQL
        • Datalog
        • Graph query languages
    • to test algorithms that prove interesting hypotheses
      – “prov(D) contains valid indicators for quality(D)”
    How:
    • By collecting submissions from the community
    • By generating synthetic provenance
  • 7. What: submissions
    Submission:
    - a collection of traces
    - a collection of queries
    hopefully from a variety of different domains
    • Interesting properties of each trace:
      • Graph structure -- regularity, recognizable patterns
      • Graph size
      • Scaling factors
      • What it is to be used for
    Submission:
    • Diversity of structure and size within the family
    • Number of traces
  • 8. What: Traces format
    • The PROV assumptions:
      – uptake: PROV will be successful (!)
      – interoperability: PROV will be sufficiently expressive to provide interoperability
    • Thus, expecting PROV encoding for submissions seems reasonable
    • Advantages:
      – tools are being built to parse, visualize, validate, analyse PROV-compliant traces
      – multiple encodings available
        • especially good if RDF is your thing
    • Issues:
      – conversion: existing traces are not natively PROV
      – is there a need to dereference data at the end of URIs?
      – licensing: multiple tiers? specific to each dataset?
  • 9. What: Queries
    • Hypothesis: some queries are generic, in the sense that they apply across multiple collections of traces
    Single trace queries:
    • Reachability queries over data and activity dependencies
      – backwards (diagnosis)
      – forwards (impact analysis)
    • “chains of responsibility” (delegation)
    Aggregation queries:
    • production/usages of data, activities across traces
      – assumes uniformity within a collection
    • Do graph mining problems apply? Do they have interesting interpretations?
      – e.g. subgraph discovery
    • Feature extraction for learning, mining
    • Pairwise trace comparison:
      – “earliest divergence” queries between pairs of "nearly isomorphic" traces
      – differencing (complex)
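The backward and forward reachability queries listed on this slide can be illustrated with a minimal Python sketch. It assumes a trace is available as a plain list of (effect, cause) dependency pairs; the function name `reachable` and the toy trace are hypothetical and are not taken from the talk or from any PROV tool.

```python
from collections import defaultdict, deque


def reachable(edges, start, direction="backward"):
    """Return every node reachable from `start` in a dependency graph.

    edges: iterable of (effect, cause) pairs, read as "effect depends on cause".
    direction="backward" follows effect -> cause (diagnosis);
    direction="forward" follows cause -> effect (impact analysis).
    """
    adjacency = defaultdict(set)
    for effect, cause in edges:
        if direction == "backward":
            adjacency[effect].add(cause)
        else:
            adjacency[cause].add(effect)

    seen = set()
    queue = deque([start])
    while queue:  # plain breadth-first traversal
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


# Hypothetical toy trace: data items and activities are just strings here.
EDGES = [
    ("report", "plot_activity"),
    ("plot_activity", "table"),
    ("table", "clean_activity"),
    ("clean_activity", "raw"),
    ("log", "clean_activity"),
]

diagnosis = reachable(EDGES, "report", "backward")  # what did the report depend on?
impact = reachable(EDGES, "raw", "forward")         # what is affected if raw changes?
```

The same traversal serves both query types; only the edge orientation changes, which is one reason such queries are plausible candidates for the "generic" queries of the hypothesis.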
  • 10. A provenance repository
    • If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission
    • PROV is expected to support the following encodings:
      – PROV-N -- the technology-neutral notation
      – RDF -- the main official encoding
      – XML -- unofficial XSD available
      – JSON -- unofficial
      – (Datalog? -- even more unofficial but syntactically very close to PROV-N)
    Available validations:
    • Syntax:
      – PROV-N syntax
      – XML schema validation
    • Consistency:
      – validation wrt PROV-constraints
    [Diagram labels: PROV-N; N2JSON, N2RDF, N2XML; PROV-JSON, PROV-RDF, PROV-XML]
  • 11. Low-hanging fruits
    • Wikipedia history pages
      – dumps freely available
      – or, through the Wikipedia REST API
    • OpenStreetMap history pages
      – very similar structure
    • ...any other?
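As a rough illustration of why page histories are low-hanging fruit, the sketch below turns a revision history into PROV-N-style statements. The input records, the `wiki:` and `user:` prefixes and the function name are all assumptions made for this example; nothing here calls the Wikipedia API, and the output has not been validated against the PROV-N grammar.

```python
def revisions_to_provn(revisions):
    """Map a revision history (oldest first) to PROV-N-style statements.

    revisions: list of dicts with hypothetical keys 'id' and 'user'.
    Each revision becomes an entity attributed to its editor and
    derived from the revision that precedes it.
    """
    statements = []
    seen_users = set()
    previous = None
    for rev in revisions:
        entity = f"wiki:rev_{rev['id']}"
        agent = f"user:{rev['user']}"
        if rev["user"] not in seen_users:  # declare each editor once
            seen_users.add(rev["user"])
            statements.append(f"agent({agent})")
        statements.append(f"entity({entity})")
        statements.append(f"wasAttributedTo({entity}, {agent})")
        if previous is not None:
            statements.append(f"wasDerivedFrom({entity}, {previous})")
        previous = entity
    return statements


# Hypothetical history of a single page.
HISTORY = [
    {"id": 101, "user": "alice"},
    {"id": 102, "user": "bob"},
    {"id": 103, "user": "alice"},
]

PROVN = revisions_to_provn(HISTORY)
print("\n".join(PROVN))
```

Because every revision has exactly one predecessor, the resulting trace is a simple chain, which is the "very similar structure" one would also expect from OpenStreetMap histories.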
  • 12. Can we learn from similar initiatives?
    • Well-established repositories for testing Machine Learning methods
      – the UCI Machine Learning repository
      – the KDD Cup datasets
      – ... and more
    • “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
      – DBpedia, UniProt -- large but no representative query workload
      – YAGO: Wikipedia <-> WordNet, 8 queries
      – Barton Library, 7 queries
      – Linked Sensor Dataset, no queries
      – TPC-H as RDF
      – Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
      – Lehigh University Benchmark (LUBM), 14 queries
      – SP2Bench (DBLP), 12 queries
      – Original approach:
        • Turn every dataset into a benchmark
        • by editing the dataset to enforce measures of
          – Coverage and Coherence
  • 13. WG: Provenance and uncertainty (Norbert Fuhr)
    • Uncertainty in the data
      – Sensor data, Customer reviews
    • Issues
      – Reliability (“is this the original painting?”)
      – Authenticity
    • Sources of uncertain provenance
      – Information extraction / NLP methods
      – Human errors
      – Inferences
      – Instruments
    • Challenges
      – We need a data model for uncertainty in provenance
        • probabilistic dependency relations
      – Explanation of the derivation of uncertain results
    • Limitations
      – Hard rules vs soft rules
      – Knowledge acquisition process of those rules
      – provenance incompleteness vs uncertainty
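One very simple reading of "probabilistic dependency relations" is sketched below: each derivation step carries a probability, and the confidence in a whole lineage chain is the product of its steps. This is an illustrative assumption only (it treats the steps as independent and handles single-parent chains, not general graphs); it is not the data model the working group proposed, and all names and numbers are hypothetical.

```python
import math


def lineage_confidence(deps, item, source):
    """Confidence that `item` traces back to `source` along a derivation chain.

    deps: dict mapping an item to (parent, probability), where probability is
    the assumed confidence in that single derivation step.
    Returns 0.0 if `source` is not on the chain. Steps are assumed independent,
    so the chain confidence is the product of the step probabilities.
    """
    confidence = 1.0
    visited = set()
    while item != source:
        if item in visited or item not in deps:  # cycle or dead end
            return 0.0
        visited.add(item)
        item, probability = deps[item]
        confidence *= probability
    return confidence


# Hypothetical chain: a summary built from a fact extracted by an NLP tool.
DEPS = {
    "summary": ("extracted_fact", 0.9),
    "extracted_fact": ("web_page", 0.8),
}

CONFIDENCE = lineage_confidence(DEPS, "summary", "web_page")
```

Even this toy version shows where the listed limitations bite: a missing entry in `deps` (incompleteness) and a low step probability (uncertainty) both reduce the result, yet they mean different things.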
