24th International World Wide Web Conference, 21st May 2015, Florence, Italy
Executing Provenance-Enabled
Queries over Web Data
Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2)
(1) eXascale Infolab, University of Fribourg, Switzerland
(2) Elsevier Labs, Amsterdam, Netherlands
Research Question
How can RDF databases
efficiently support
provenance-enabled
queries?
Outline
➢ Motivation
➢ Provenance-Enabled Queries
➢ Query Execution Strategies
➢ Results
Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
Which pieces of data were combined to produce
the result?
Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to store and track
provenance data (WWW 2014)
➢ Capability to tailor queries with
provenance information (WWW
2015)
Querying Linked Data
use data following a provenance specification
Provenance-Enabled Query
A Workload Query is a query producing results a user is
interested in. These results are referred to as workload
query results.
A Provenance Query is a query that selects a set of data
from which the workload query results should originate.
A Provenance-Enabled Query is a pair consisting of a
Workload Query and a Provenance Query, producing results
a user is interested in (as specified by the Workload Query)
and originating only from data pre-selected by the
Provenance Query.
Provenance-Enabled Query: Example
SELECT ?t WHERE {
?a <type> <article> .
?a <tag> <Obama> .
?a <title> ?t . }
➢ ensure that the articles come from sources attributed to the government
SELECT ?ctx WHERE {
?ctx prov:wasAttributedTo <government> . }
➢ ensure that the data used to produce the answer was associated with a
“SeniorEditor” and a “Manager”
SELECT ?ctx WHERE {
?ctx prov:wasGeneratedBy <articleProd> .
<articleProd> prov:wasAssociatedWith ?ed .
?ed rdf:type <SeniorEditor> .
<articleProd> prov:wasAssociatedWith ?m .
?m rdf:type <Manager> . }
Executing Provenance-Enabled Queries
A workload query and a provenance query are given as input to
a triplestore, which produces results for both queries and
then combines them to obtain the final results.
TripleProv: Query Execution Pipeline
input: provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those which were derived
from data specified by the provenance query
Physical Storage Models
A molecule collocates objects related
to a given subject; it is composed of a
subject and a series of predicates and
objects related to that subject.
Extended for provenance data, a
molecule collocates the context values
with the predicate-object pairs.
This avoids duplicating the
same context value, while at the same
time collocating all data about a given
subject in one structure.
Figure: the RDF molecule, the basic data unit
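The molecule idea can be illustrated with a minimal Python sketch. It is not TripleProv's actual storage layout, which the slides do not detail; the function name `build_molecules` and the sample quads are hypothetical. The sketch groups quads by subject and then by context value, so that each context value is stored once per molecule rather than once per triple.

```python
from collections import defaultdict

def build_molecules(quads):
    """Group (subject, predicate, object, context) quads into molecules.

    Illustrative layout only: subject -> context -> [(predicate, object)],
    so a context value is stored once per molecule, not once per triple.
    """
    molecules = defaultdict(lambda: defaultdict(list))
    for s, p, o, ctx in quads:
        molecules[s][ctx].append((p, o))
    # Convert to plain dicts so lookups of unknown keys do not create entries.
    return {s: dict(by_ctx) for s, by_ctx in molecules.items()}

# Hypothetical sample data, loosely following the article example.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title A", "ctx2"),
    ("a2", "type", "article", "ctx3"),
]
molecules = build_molecules(quads)
```

All data about subject `a1` ends up in one structure, with `ctx1` appearing once even though two of its triples share that context.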
Query Execution Strategies
1. Post-Filtering
2. Query Rewriting
3. Full Materialization
4. Pre-Filtering
5. Partial Materialization
Post-Filtering
➢ the baseline strategy
➢ executes both the workload and the provenance query independently.
➢ the provenance and workload queries can be executed in any order
➢ the results from the provenance query are used to filter a posteriori
the results of the workload query based on their provenance
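Post-filtering can be sketched in a few lines of Python. This is an illustration under assumptions, not the TripleProv code: it assumes each workload result carries the set of context values of the triples that produced it, and it keeps a result only if all of those contexts were returned by the provenance query. The name `post_filter` and the sample data are hypothetical.

```python
def post_filter(workload_results, allowed_contexts):
    """Keep workload results whose every contributing context is allowed."""
    allowed = set(allowed_contexts)
    return [
        value
        for value, contexts in workload_results
        if set(contexts) <= allowed
    ]

# Hypothetical workload results: (value, contexts of the contributing triples).
workload_results = [
    ("Title A", {"ctx1"}),
    ("Title B", {"ctx1", "ctx9"}),
    ("Title C", {"ctx2"}),
]
# Hypothetical context values returned by the provenance query.
provenance_results = ["ctx1", "ctx2"]

filtered = post_filter(workload_results, provenance_results)
```

Because the filter runs only after both queries have finished, the workload query still touches all the data, which is why this strategy serves as the baseline.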
Query Rewriting
1. execute the provenance query
2. rewrite the query plan; add provenance constraints
3. return restricted results
➢ efficient from the provenance query
execution side
➢ can be suboptimal from the
workload query execution side
A triplestore can implement this strategy in
two ways: either by modifying
the query execution process, or by
rewriting the workload queries
to include constraints on the
named graphs.
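The second option, rewriting the workload query to constrain named graphs, might look like the following Python sketch. It assumes the context values are named-graph IRIs and uses standard SPARQL 1.1 `GRAPH` and `VALUES` clauses; the helper `rewrite_with_provenance` and the `http://example.org/` IRIs are hypothetical, and a real rewriter would operate on a parsed query rather than on strings.

```python
def rewrite_with_provenance(select_vars, triple_patterns, context_iris):
    """Return a SPARQL query whose patterns only match in the given named graphs."""
    values = " ".join(f"<{iri}>" for iri in context_iris)
    lines = [f"SELECT {' '.join(select_vars)} WHERE {{"]
    for i, pattern in enumerate(triple_patterns):
        # One graph variable per pattern: each triple may come from a different
        # allowed context, but never from a context outside the list.
        lines.append(f"  VALUES ?g{i} {{ {values} }}")
        lines.append(f"  GRAPH ?g{i} {{ {pattern} }}")
    lines.append("}")
    return "\n".join(lines)

rewritten = rewrite_with_provenance(
    ["?t"],
    ["?a <type> <article> .", "?a <tag> <Obama> .", "?a <title> ?t ."],
    ["http://example.org/ctx1", "http://example.org/ctx2"],
)
print(rewritten)
```

The list of context IRIs grows with the provenance query result, which hints at why this approach can be suboptimal on the workload query side.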
Full Materialization
We implemented basic view mechanisms in TripleProv. These
mechanisms allow us to project, materialize, and use as a
secondary structure the portions of the molecules that follow
the provenance specification.
Full Materialization
1. execute the provenance query
2. materialize data for the provenance query
3. execute workload queries on the materialized view
➢ This strategy will outperform all other
strategies when executing the workload
queries, since they are executed as is on
the relevant subset of the data.
➢ Materializing all potential tuples based on
the provenance query can be expensive,
both in terms of storage space and latency.
➢ Implementing this strategy requires either
manually materializing the relevant tuples and
modifying the workload queries accordingly, or
using a triplestore that supports materialized
views.
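As a rough illustration of full materialization, the Python sketch below copies every tuple whose context was selected by the provenance query into a separate view and then runs an unmodified toy workload query on that view. The in-memory lists stand in for the store; `materialize_view`, `titles_tagged`, and the sample quads are hypothetical names and data, not part of TripleProv.

```python
def materialize_view(quads, allowed_contexts):
    """Copy the triples whose context is allowed into a separate view."""
    allowed = set(allowed_contexts)
    return [(s, p, o) for s, p, o, ctx in quads if ctx in allowed]

def titles_tagged(triples, tag):
    """Toy workload query: titles of articles carrying the given tag."""
    tagged = {s for s, p, o in triples if p == "tag" and o == tag}
    articles = {s for s, p, o in triples if p == "type" and o == "article"}
    wanted = tagged & articles
    return sorted(o for s, p, o in triples if p == "title" and s in wanted)

# Hypothetical sample data.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title A", "ctx1"),
    ("a2", "type", "article", "ctx9"),
    ("a2", "tag", "Obama", "ctx9"),
    ("a2", "title", "Title B", "ctx9"),
]
view = materialize_view(quads, {"ctx1"})
titles = titles_tagged(view, "Obama")
```

The workload query itself needs no provenance logic, but the view duplicates data, which reflects the storage and latency cost noted above.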
Pre-Filtering
➢ A dedicated provenance index
collocates, for each context
value, the ids (or hashes) of all
tuples belonging to this context.
➢ The index is created upfront
when the data is loaded.
Pre-Filtering
1. execute the provenance query
2. execute workload queries, including early filtering with the provenance index
➢ The provenance index is
looked up during query
execution to keep only the molecules
that are compatible with the
provenance specification.
➢ This strategy requires creating
a new index structure in the
system and modifying both the
loading and the query execution
processes.
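A provenance index of this kind can be sketched in Python as a map from each context value to the ids of its tuples, built once at load time and consulted at query time. The sketch uses list positions as tuple ids and is only an approximation of the idea; `build_provenance_index`, `scan_with_prefilter`, and the sample quads are hypothetical.

```python
from collections import defaultdict

def build_provenance_index(quads):
    """At load time: map each context value to the ids of its tuples."""
    index = defaultdict(set)
    for tuple_id, (_s, _p, _o, ctx) in enumerate(quads):
        index[ctx].add(tuple_id)
    return dict(index)

def scan_with_prefilter(quads, index, allowed_contexts):
    """At query time: look up the index and visit only the matching tuples."""
    allowed_ids = set()
    for ctx in allowed_contexts:
        allowed_ids |= index.get(ctx, set())
    return [quads[i][:3] for i in sorted(allowed_ids)]

# Hypothetical sample data.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a2", "type", "article", "ctx9"),
    ("a1", "title", "Title A", "ctx1"),
    ("a3", "type", "article", "ctx2"),
]
index = build_provenance_index(quads)
visible = scan_with_prefilter(quads, index, ["ctx1", "ctx2"])
```

Unlike full materialization, no data is copied; the cost moves to building and maintaining the index during loading.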
Partial Materialization
➢ This strategy introduces a trade-off between the performance of
the provenance query and that of the workload queries.
➢ While executing the provenance query, the system builds a
temporary structure maintaining the ids of all molecules
belonging to the context values returned by the provenance
query.
Partial Materialization
1. execute the provenance query and partially materialize molecules
2. execute workload queries, including early filtering based on the pre-materialized set of molecules
➢ The system dynamically (and efficiently)
looks up all molecules and can filter
them out early if they do not
appear in the temporary structure.
➢ Query processing operations can be
executed faster on a reduced number of
elements.
➢ The implementation of this strategy
requires the introduction of an additional
data structure at the core of the system,
and the adjustment of the query
execution process in order to use it.
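The two steps of partial materialization can be approximated in Python as follows. The sketch assumes molecules are stored as a mapping from molecule id to context value to predicate-object pairs, which is an illustrative layout rather than TripleProv's; the temporary structure is a plain set of molecule ids, and all function names and data are hypothetical.

```python
def collect_molecule_ids(molecules, allowed_contexts):
    """While running the provenance query: ids of molecules touching an allowed context."""
    allowed = set(allowed_contexts)
    return {mid for mid, by_ctx in molecules.items() if allowed & set(by_ctx)}

def scan_molecules(molecules, molecule_ids, allowed_contexts):
    """During workload execution: skip molecules absent from the temporary set."""
    allowed = set(allowed_contexts)
    for mid, by_ctx in molecules.items():
        if mid not in molecule_ids:
            continue  # early filter: this molecule cannot contribute
        for ctx, pairs in by_ctx.items():
            if ctx in allowed:
                for p, o in pairs:
                    yield (mid, p, o)

# Hypothetical molecules: id -> context value -> [(predicate, object)].
molecules = {
    "a1": {
        "ctx1": [("type", "article"), ("title", "Title A")],
        "ctx9": [("tag", "Obama")],
    },
    "a2": {"ctx9": [("type", "article")]},
}
allowed_contexts = {"ctx1"}
molecule_ids = collect_molecule_ids(molecules, allowed_contexts)
results = list(scan_molecules(molecules, molecule_ids, allowed_contexts))
```

Only ids are kept in the temporary structure, so the materialization step is much lighter than copying the tuples themselves.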
Experiments
What is the most efficient query
execution strategy for provenance-
enabled queries?
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked open data
cloud
○ Web Data Commons (WDC): RDFa, Microdata extracted from
common crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC), so that
the provenance queries do not modify the result sets of the workload
queries
Workloads
➢ Queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large RDF
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries of various kinds for WDC
http://exascale.info/provqueries
Results for BTC
➢ Full Materialization: 44x faster
than the vanilla version of the
system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization
executes a provenance query and
materializes data 475 times faster
than Full Materialization
➢ Query Rewriting and Post-Filtering
strategies perform
significantly slower
Results for Representative Scenario
➢ original BTC dataset
➢ no added triples
➢ output changes due to
provenance specification
➢ performance gains are higher
for all provenance-aware
strategies in this more
realistic scenario
smaller number of context values from the provenance query
smaller number of relevant molecules to inspect
Data Analysis
➢ How many context values refer
to how many triples? How
selective are they?
➢ 6'819'826 unique context values
in the BTC dataset.
➢ The majority of the context
values are highly selective.
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
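The two averages can be computed as in the Python sketch below. It assumes, for illustration only, that a molecule corresponds to a distinct subject within a context; the slides do not state how molecules were counted, and the function name and sample quads are hypothetical.

```python
def average_selectivity(quads):
    """Return (avg triples per context, avg distinct subjects per context)."""
    triples_per_ctx = {}
    subjects_per_ctx = {}
    for s, _p, _o, ctx in quads:
        triples_per_ctx[ctx] = triples_per_ctx.get(ctx, 0) + 1
        subjects_per_ctx.setdefault(ctx, set()).add(s)
    n = len(triples_per_ctx)
    if n == 0:
        return 0.0, 0.0  # no context values, nothing to average
    avg_triples = sum(triples_per_ctx.values()) / n
    avg_molecules = sum(len(v) for v in subjects_per_ctx.values()) / n
    return avg_triples, avg_molecules

# Hypothetical sample: ctx1 holds 2 triples of one subject,
# ctx2 holds 2 triples spread over two subjects.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title A", "ctx2"),
    ("a2", "type", "article", "ctx2"),
]
avg_triples, avg_molecules = average_selectivity(quads)
```

Low averages such as those reported above mean that a provenance query returning few context values narrows the data to inspect considerably.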
Conclusions
➢ Querying provenance data does not necessarily
introduce a performance overhead.
➢ Queries tailored with provenance data can be executed
faster.
➢ Provenance information is highly selective.
➢ Partial Materialization represents the best trade-off for
provenance-enabled queries, but it introduces a
materialization cost and is not trivial to implement.
Summary
➢ provenance-enabled queries: tailoring queries with
provenance information
➢ five provenance-aware query execution strategies
➢ TripleProv: an efficient triplestore that can store, track,
and query provenance
➢ experimental evaluation and data analysis
★ http://exascale.info/provqueries
★ http://exascale.info/tripleprov
❖ email: marcin@exascale.info
❖ twitter: @mwylot

Executing Provenance-Enabled Queries over Web Data

  • 1. 24th International World Wide Web Conference, 21st May 2015, Florence, Italy Executing Provenance-Enabled Queries over Web Data Marcin Wylot1 , Philippe Cudré-Mauroux1 , and Paul Groth2 1) eXascale Infolab, University of Fribourg, Switzerland 2) Elsevier Labs, Amsterdam, Netherlands
  • 2. Research Question How RDF databases can efficiently support provenance-enabled queries? 2
  • 3. Outline ➢ Motivation ➢ Provenance-Enabled Queries ➢ Query Execution Strategies ➢ Results 3
  • 4. Data Provenance “Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.” Which pieces of data were combined to produce the result? 4
  • 5. Data Integration ➢ Integrated and summarized data ➢ Trust, transparency, and cost ➢ Capability to store and track provenance data (WWW 2014) ➢ Capability to tailor queries with provenance information (WWW 2015) 5
  • 6. Querying Linked Data use data following a provenance specification 6
  • 7. Provenance-Enabled Query A Workload Query is a query producing results a user is interested in. These results are referred to as workload query results. A Provenance Query is a query that selects a set of data from which the workload query results should originate. A Provenance-Enabled Query is a pair consisting of a Workload Query and a Provenance Query, producing results a user is interested in (as specified by the Workload Query) and originating only from data pre-selected by the Provenance Query. 7
  • 8. Provenance-Enabled Query: Example SELECT ?t WHERE { ?a <type> <article> . ?a <tag> <Obama> . ?a <title> ?t . } ➢ ensure that the articles come from sources attributed to the government SELECT ?ctx WHERE { ?ctx prov:wasAttributedTo <government> . } ➢ ensure that the data used to produce the answer was associated a “SeniorEditor” and a “Manager” SELECT ?ctx WHERE { ?ctx prov:wasGeneratedBy <articleProd>. <articleProd> prov:wasAssociatedWith ?ed . ?ed rdf:type <SeniorEdior> . <articleProd> prov:wasAssociatedWith ?m . ?m rdf:type <Manager> . } 8
  • 9. Executing Provenance-Enabled Queries A workload and a provenance query are given as input to a triplestore, which produces results for both queries and then combine them to obtain the final results. 9
  • 10. TripleProv: Query Execution Pipeline input: provenance-enable query ➢ execute the provenance query ➢ optionally pre-materializing or co-locating data ➢ optionally rewrite the workload queries ➢ execute the workload queries ➢ output: the workload query results, restricted to those which were derived from data specified by the provenance query 10
  • 11. Physical Storage Models A molecule collocates objects related to a given subject; it is composed of a subject, and a series of predicate and object related to that subject. Extended for provenance data a molecule collocates the context values with the predicate-object pairs. This avoids the duplication of the same context value, while at the same time collocating all data about a given subject in one structure. 11 RDF molecule basic data unit
  • 12. Query Execution Strategies 1. Post-Filtering 2. Query Rewriting 3. Full Materialization 4. Pre-Filtering 5. Partial Materialization 12
  • 13. Post-Filtering ➢ the baseline strategy ➢ executes both the workload and the provenance query independently. ➢ the provenance and workload queries can be executed in any order ➢ the results from the provenance query are used to filter a posteriori the results of the workload query based on their provenance 13
  • 14. Query Rewriting 14 execute the provenance query rewrite the query plan; add provenance constraints return restricted results ➢ efficient from the provenance query execution side ➢ can be suboptimal from the workload query execution side It can be implemented in two ways by the triplestores, either by modifying the query execution process, or by rewriting the workload queries in order to include constraints on the named graphs.
  • 15. Full Materialization We implemented basic view mechanisms in TripleProv. These mechanisms allow us to project, materialize, and use as a secondary structure the portions of the molecules that follow the provenance specification.
  • 16. Full Materialization Steps: execute the provenance query; materialize the data for the provenance query; execute the workload queries on the materialized view. ➢ This strategy will outperform all other strategies when executing the workload queries, since they are executed as is on the relevant subset of the data. ➢ Materializing all potential tuples based on the provenance query can be expensive, both in terms of storage space and latency. ➢ Implementing this strategy requires either manually materializing the relevant tuples and modifying the workload queries accordingly, or using a triplestore that supports materialized views.
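A minimal sketch of the idea, assuming the data fits in a Python list of quads: the view is a copy of the tuples whose context was selected, and the workload query then runs on that copy without any provenance check. `materialize_view`, `titles_tagged`, and the sample data are hypothetical.

```python
def materialize_view(quads, allowed_contexts):
    """Copy the triples whose context was selected by the provenance query."""
    allowed = set(allowed_contexts)
    return [(s, p, o) for s, p, o, ctx in quads if ctx in allowed]


def titles_tagged(triples, tag):
    """Workload query, run unchanged on whatever triples it is given."""
    tagged = {s for s, p, o in triples if p == "tag" and o == tag}
    return sorted(o for s, p, o in triples if p == "title" and s in tagged)


# Hypothetical sample data.
quads = [
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title One", "ctx1"),
    ("a2", "tag", "Obama", "ctx3"),
    ("a2", "title", "Title Two", "ctx3"),
]
view = materialize_view(quads, ["ctx1"])
titles = titles_tagged(view, "Obama")
```

The copy in `materialize_view` is where the storage and latency cost mentioned on the slide shows up; the workload function itself needs no provenance logic.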
  • 17. Pre-Filtering ➢ A dedicated provenance index collocates, for each context value, the ids (or hashes) of all tuples belonging to that context. ➢ The index is created up front, when the data is loaded.
  • 18. Pre-Filtering Steps: execute the provenance query; execute the workload queries, including early filtering with the provenance index. ➢ The provenance index is looked up during query execution to keep only the molecules that are compatible with the provenance specification. ➢ This strategy requires creating a new index structure in the system and modifying both the loading and the query execution processes.
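The provenance index can be sketched as a map from context value to tuple ids, built once at load time and consulted during the scan. The sketch assumes list positions serve as tuple ids; `build_provenance_index` and `scan_with_prefilter` are hypothetical names.

```python
from collections import defaultdict


def build_provenance_index(quads):
    """Built once at load time: context value -> ids of tuples in that context."""
    index = defaultdict(set)
    for tuple_id, (_s, _p, _o, ctx) in enumerate(quads):
        index[ctx].add(tuple_id)
    return index


def scan_with_prefilter(quads, index, allowed_contexts):
    """Yield only tuples whose id the index lists under an allowed context."""
    allowed_ids = set()
    for ctx in allowed_contexts:
        allowed_ids |= index.get(ctx, set())
    for tuple_id, (s, p, o, _ctx) in enumerate(quads):
        if tuple_id in allowed_ids:
            yield (s, p, o)


# Hypothetical sample data; list position doubles as the tuple id.
quads = [
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title One", "ctx1"),
    ("a2", "title", "Title Two", "ctx3"),
]
index = build_provenance_index(quads)
kept = list(scan_with_prefilter(quads, index, ["ctx1"]))
```

Unlike a materialized view, the index covers every context value in the dataset and is independent of any particular provenance query.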
  • 19. Partial Materialization ➢ This strategy introduces a trade-off between the performance of the provenance query and that of the workload queries. ➢ While executing the provenance query, the system builds a temporary structure maintaining the ids of all molecules belonging to the context values returned by the provenance query.
  • 20. Partial Materialization Steps: execute the provenance query and partially materialize molecules; execute the workload queries, including early filtering based on the pre-materialized set of molecules. ➢ The system dynamically (and efficiently) looks up all molecules and can filter them out early if they do not appear in the temporary structure. ➢ Query processing operations can be executed faster on a reduced number of elements. ➢ Implementing this strategy requires introducing an additional data structure at the core of the system and adjusting the query execution process to use it.
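A rough sketch of the temporary structure, assuming molecules are stored as a dictionary keyed by subject and that the provenance query is a simple attribution lookup. All names here (`run_provenance_query`, `run_workload`, `attributed_to`) are hypothetical.

```python
def run_provenance_query(molecules, attributed_to, agent):
    """Return the allowed contexts plus the ids of molecules that use them."""
    contexts = {ctx for ctx, a in attributed_to.items() if a == agent}
    # Temporary structure: ids of molecules touching at least one allowed context.
    molecule_ids = {subject for subject, by_ctx in molecules.items()
                    if not contexts.isdisjoint(by_ctx)}
    return contexts, molecule_ids


def run_workload(molecules, contexts, molecule_ids, predicate):
    """Collect objects of `predicate`, skipping irrelevant molecules early."""
    results = []
    for subject, by_ctx in molecules.items():
        if subject not in molecule_ids:
            continue  # early filtering: this molecule cannot contribute
        for ctx, pairs in by_ctx.items():
            if ctx in contexts:
                results.extend(o for p, o in pairs if p == predicate)
    return results


# Hypothetical sample data: subject -> context -> predicate-object pairs.
molecules = {
    "a1": {"ctx1": [("title", "Title One")]},
    "a2": {"ctx3": [("title", "Title Two")]},
}
attributed_to = {"ctx1": "government", "ctx3": "blog"}

contexts, molecule_ids = run_provenance_query(molecules, attributed_to, "government")
titles = run_workload(molecules, contexts, molecule_ids, "title")
```

Only molecule ids are kept, not copies of the data, which is what distinguishes this sketch from a fully materialized view.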
  • 21. Experiments What is the most efficient query execution strategy for provenance-enabled queries?
  • 22. Datasets ➢ Two collections of RDF data gathered from the Web ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl ➢ Typical collections gathered from multiple sources ➢ Sampled subsets of ~40 million triples each; ~10GB each ➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that the provenance queries do not modify the result sets of the workload queries
  • 23. Workloads ➢ Queries defined for BTC ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009. ➢ Two additional queries with UNION and OPTIONAL clauses ➢ Seven new queries for WDC ➢ http://exascale.info/provqueries
  • 24. Results for BTC ➢ Full Materialization: 44x faster than the vanilla version of the system ➢ Partial Materialization: 35x faster ➢ Pre-Filtering: 23x faster ➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization ➢ The Query Rewriting and Post-Filtering strategies perform significantly slower
  • 25. Results for Representative Scenario ➢ original BTC dataset ➢ no added triples ➢ output changes due to the provenance specification ➢ all provenance-aware strategies show higher performance gains in this more realistic scenario ➢ a smaller number of context values from the provenance query ➢ a smaller number of relevant molecules to inspect
  • 26. Data Analysis ➢ How many context values refer to how many triples? How selective is it? ➢ 6'819'826 unique context values in the BTC dataset. ➢ The majority of the context values are highly selective. ➢ average selectivity ○ 5.8 triples per context value ○ 2.3 molecules per context value
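The two averages reported on this slide can be computed from a list of quads along the following lines. The sketch assumes one molecule per subject, so "molecules per context value" is approximated by counting distinct subjects per context; `context_selectivity` and the sample quads are hypothetical.

```python
from collections import defaultdict


def context_selectivity(quads):
    """Return (avg triples per context value, avg molecules per context value)."""
    triples = defaultdict(int)
    subjects = defaultdict(set)
    for s, _p, _o, ctx in quads:
        triples[ctx] += 1
        subjects[ctx].add(s)  # assumption: one molecule per subject
    n = len(triples)
    avg_triples = sum(triples.values()) / n
    avg_molecules = sum(len(v) for v in subjects.values()) / n
    return avg_triples, avg_molecules


# Hypothetical sample data: two context values, four triples, two subjects.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title One", "ctx2"),
    ("a2", "type", "article", "ctx2"),
]
avg_triples, avg_molecules = context_selectivity(quads)
```

Low averages mean that a provenance query returning a handful of context values narrows the workload down to a small number of triples and molecules.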
  • 27. Conclusions ➢ Querying provenance data does not necessarily introduce a performance overhead. ➢ Queries tailored with provenance data can be executed faster. ➢ Provenance information is highly selective. ➢ Partial Materialization represents the best trade-off for provenance-enabled queries, but it introduces a materialization cost and is not trivial to implement.
  • 28. Summary ➢ provenance-enabled queries: tailoring queries with provenance information ➢ five provenance-aware query execution strategies ➢ TripleProv: an efficient triplestore that can store, track, and query provenance ➢ experimental evaluation and data analysis ★ http://exascale.info/provqueries ★ http://exascale.info/tripleprov ❖ email: marcin@exascale.info ❖ twitter: @mwylot