SlideShare ist ein Scribd-Unternehmen logo
1 von 52
Strategies for Processing and
Explaining Distributed Queries on
Linked Data
Rakebul Hasan
Wimmics
Inria Sophia Antipolis
Research Theme
• Distribute Query Processing
– Optimization techniques for querying Linked Data

• Distributed Query Explanation
– Query plan explanation
• How query solving works

– Query result explanation
• Why
• Where

1
DISTRIBUTE QUERY PROCESSING

2
Querying Linked Data
• Bottom-up strategies
– Discover sources during query processing by
following links between sources

• Top-down strategies
– Sources are known

3
Querying Linked Data
FedBench Query CD5: Find the director and the
genre of movies directed by Italians
SELECT ?film ?director ?genre WHERE {
?film dbpedia-owl:director ?director.
?director dbpedia-owl:nationalitydbpedia:Italy .
?xowl:sameAs ?film .
?xlinkedMDB:genre ?genre
}

4
Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies
directed by Italians

SELECT ?film ?director ?genre WHERE {
SERVICE <http://dbpedia.org/sparql>{
?film dbpedia-owl:director ?director.
?director dbpedia-owl:nationalitydbpedia:Italy .
?xowl:sameAs ?film .
}
SERVICE <http://data.linkedmdb.org/sparql> {
?xlinkedMDB:genre ?genre
}
}

SPARQL 1.1 SERVICE clause

Need knowledge of which part of
the query should be solved by which endpoint

5
• Top-down approaches
– Data warehousing approach
– Virtual integration approach

6
Querying Linked Data
• Data warehousing approach
– Collect the data in a central triple store
– Process queries on that triple store

• Disadvantages
• Expensive preprocessing (data collection + integration)
and maintenance
• Data not up to date

7
Querying Linked Data
• Virtual integration approach
– A query federation middleware
• Parse and split into subqueries
• Source selection for subqueries
• Evaluate the subqueries against corresponding sources
directly
• Merge the results

– Advantages
• no preprocessing and maintenance
• up to date data
8
Running Example
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient

drugbank

http://www4.wiwiss.fu-berlin.de/drugbank/sparql

SELECT ?drug ?title WHERE {
?drug drugbank:drugCategorydrugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrugrdf:typekegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrugpurl:title ?title .
}

KEGG

http://cu.bioportal.bio2rdf.org/sparql

DBpedia

http://data.linkedmdb.org/sparql

9
Parsing and Source Selecting
drugbank

Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
http://www4.wiwiss.fu-berlin.de/drugbank/sparql

?drug drugbank:drugCategorydrugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id

SELECT ?drug ?title WHERE {
?drug drugbank:drugCategorydrugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrugrdf:typekegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrugpurl:title ?title .
}

KEGG

http://cu.bioportal.bio2rdf.org/sparql

?keggDrugrdf:typekegg:Drug
?keggDrug bio2rdf:xRef ?id
Send ask queries to all the sources
ASK {?drug drugbank:drugCategorydrugbank-category:micronutrient }
ASK {?drug drugbank:casRegistryNumber ?id }
ASK {?keggDrugrdf:typekegg:Drug }
ASK {?keggDrug bio2rdf:xRef ?id }
ASK {?keggDrugpurl:title ?title }

DBpedia

http://data.linkedmdb.org/sparql

?keggDrugpurl:title ?title

10
Evaluating subqueries
• Two options
– All triple patterns are individually evaluated
– Nested loop join (NLJ): evaluate iteratively pattern
by pattern

11
Evaluating subqueries
• Example of NLJ
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategorydrugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrugrdf:typekegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrugpurl:title ?title .
}

12
Optimization techniques
• Source Selection
– Indexing: characterization of RDF graphs
– Statistics based catalogue
– Caching

13
Optimization techniques
• Exclusive grouping
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategorydrugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrugrdf:typekegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrugpurl:title ?title .
}

14
Optimization techniques
• Bound join
– Effective for NLJ
– Mappings of variable values from the
intermediate result to the next subquery
– SPARQL 1.1 VALUES clause

15
Optimization techniques
• Hash join
– Hash table for intermediate mappings

16
Optimization techniques
• FILTER optimization
– Evaluating the subqueries with corresponding
FILTERS
– Reduces the number of intermediate results

17
Optimization techniques
• Parallelization
– Effective for individual subquery evaluation
approach

18
Optimization techniques
• Join order
– Selectivity estimation: join order heuristic based
on the number of bound and free variables [1]
– Statistics: based on cardinalities of the triple
patterns [2]

[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation.
In Proceedings of the 17th international conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data.
In Proceedings of the 19th international conference on World wide web (WWW '10)

19
Selectivity Estimation
• Ideal for the Linked Data scenario
– No need for statistics about the underlying data
– Estimations can be wrong however
BGP

Number of results

Selectivity
estimation

?v1 p1 ?v2
?v1 p2 ?v3

5

2

?v4 p3 c1
?v4 p4 ?v1

10

1

20
Query Performance Prediction
• Learn query performance from already
executed queries
• Statistics about the underlying data not
required
• Applications
– Query optimization: join order
– Workload management/scheduling

21
Regression
Learn a mapping function
f(X) = y
X = feature vector, vector representation of SPARQL query
y = performance metric (e.g. latency, number of results)

Support vector machine with nu-SVR
k-nearest neighbors regression

22
Feature Extraction
• How can we represent SPARQL queries as
vectors?

23
SPARQL
• SPARQL algebra features
• Graph pattern features

24
SPARQL Algebra

http://www.w3.org/TR/sparql11-query/#sparqlQuery

25
SPARQL Algebra Features

26
Graph Pattern Features

27
Graph Pattern Features
• Clustering Training Queries
– K-mediods clustering algorithm with
approximated edit distance as distance function
• Selects data points as cluster centers
• Arbitrary distance function

28
Graph Pattern Features
• Graph Edit Distance
– Minimum amount of distortion needed to
transform one graph to another

– Compute similarity by inversing distance

29
Graph Pattern Features
• Graph Edit Distance
– Usually computed using A* search
• Exponential running time

– Bipartite matching based approximated graph edit
distance with
• Previous research shows accurate results with
classification problems

30
Experiment Setup
• Triple store and hardware
– Jena TDB 1.0.0
– 16 GB memory
– 2.53 GHz CPU
– 48 GB system RAM
– Linux 2.6.32 operating system

31
Experiment Setup
• Datasets
– Training, validation, test datasets generated from
25 DBPSB query templates
• 1260 training queries
• 420 validation queries
• 420 test queries

– RDF data: DBpedia 3.5.1 with 100% scaling factor
from DBPSB framework

32
Prediction Models
• Support Vector Machine (SVM) with the nuSVR kernel for regression
• k-nearest neighbors (k-NN) regression
– Euclidean distance as the distance function
– k-dimensional tree (k-d tree) data structure to
compute the nearest neighbors

33
Evaluation Measures
• Coefficient of determination

• Root mean squared error (RMSE)

34
Predicting Query Execution Time
• SPARQL algebra features

35
Predicting Query Execution Time
• SPARQL algebra and graph pattern features

36
What’s Next
• Apply QPP in join order optimization

• Benchmarking
– FedBench: a benchmarking framework for
federated query processing

• Systematic generation of training queries
– Bootstrapping
– Refining training queries from query logs
37
Statistical Analysis of Query Logs
• Approach to Systematic generation of training queries

Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)

38
Summary
• Distributed query processing
• Optimization techniques
• Query performance prediction
– Join order optimization

39
DISTRIBUTED QUERY EXPLANATION

40
Query Explanation
• Query plan explanation
• Query result explanation
• Motivation
– Understanding
– Transparency
– Trust

41
Query Explanation in the Semantic
Web
Query plan explanation

Query result explanation

Jena
Sesame
Virtuoso

*

BigOWLIM

+

* Explanation of SPARQL to SQL translation
+ a debugging feature on the query engine side
42
Distributed Query Explanation in the
Semantic Web
Query plan explanation

Query result explanation

FedX
DARQ
SemWIQ
ADERIS
Anapsid

43
ADERIS Query Plan Explanation

44
Related Work
• Database: Why, how, where provenance
• Semantic Web:
– Inference explanation
• Provenance
• Generating justifications

45
Distributed Query Explanation
What We Provide
• Query plan explanation – Work in progress
– Prior to query execution
• Includes predicted performance metrics

– After query execution with performance metrics

• Query result explanation
– Why provenance
– Where provenance

46
Query Result Explanation

(RDF model, Query, Result)
Result Explainer
Plug-in

Can be any RDF model (RDF graph, SPARQL endpoint)

We generate result explanations by querying this model for why provenance

47
Why Provenance
• Triples in virtual model from which the result
is derived
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .

48
Where Provenance
• Keep the provenance of sources in the virtual model
fedqld:source1 {
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .

Where the triples in this
graph come from

<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
}
fedqld:prov {
fedqld:source1 void:sparqlEndpoint<http://localhost:3030/books/query> .
fedqld:source2 void:sparqlEndpoint<http://localhost:3031/person/query> .
}

49
What’s next
• Explanation user interfaces
• Evaluating the impacts of our explanations

50
References
•
•
•

Andreas Schwarte, Peter Haase, KatjaHoose, Ralf Schenkel, and Michael Schmidt. Fedx: A federation layer for
distributed query processing on linked open data. In ESWC, 2011
Shinji Kikuchi, Shelly Sachdeva, SubhashBhalla, Steven Lynden, Isao Kojima, AkiyoshiMatono, and Yusuke Tanimura.
Adaptive integration of distributed semantic web data. Databases in Networked Information Systems
Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a
benchmark suite for federated semantic data query processing. In Proceedings of the 10th international
conference on The semantic web - Volume Part I (ISWC'11)

51

Weitere ähnliche Inhalte

Ähnlich wie Strategies for Processing and Explaining Distributed Queries on Linked Data

(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections
BIOVIA
 

Ähnlich wie Strategies for Processing and Explaining Distributed Queries on Linked Data (20)

Product Recommendation System​ By Using Collaborative Filtering and Network B...
Product Recommendation System​ By Using Collaborative Filtering and Network B...Product Recommendation System​ By Using Collaborative Filtering and Network B...
Product Recommendation System​ By Using Collaborative Filtering and Network B...
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
 
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
 Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F... Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
 
Extended LargeRDFBench
Extended LargeRDFBenchExtended LargeRDFBench
Extended LargeRDFBench
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
The Search and Hyperlinking Task at MediaEval 2014
The Search and Hyperlinking Task at MediaEval 2014The Search and Hyperlinking Task at MediaEval 2014
The Search and Hyperlinking Task at MediaEval 2014
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
A Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionA Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance Prediction
 
The Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data Integration
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
 
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...
 
DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016
 
Relational Database Design Bootcamp
Relational Database Design BootcampRelational Database Design Bootcamp
Relational Database Design Bootcamp
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflow
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
moraes-a2017ictir
moraes-a2017ictirmoraes-a2017ictir
moraes-a2017ictir
 

Strategies for Processing and Explaining Distributed Queries on Linked Data

  • 1. Strategies for Processing and Explaining Distributed Queries on Linked Data Rakebul Hasan Wimmics Inria Sophia Antipolis
  • 2. Research Theme • Distribute Query Processing – Optimization techniques for querying Linked Data • Distributed Query Explanation – Query plan explanation • How query solving works – Query result explanation • Why • Where 1
  • 4. Querying Linked Data • Bottom-up strategies – Discover sources during query processing by following links between sources • Top-down strategies – Sources are known 3
  • 5. Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { ?film dbpedia-owl:director ?director. ?director dbpedia-owl:nationalitydbpedia:Italy . ?xowl:sameAs ?film . ?xlinkedMDB:genre ?genre } 4
  • 6. Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { SERVICE <http://dbpedia.org/sparql>{ ?film dbpedia-owl:director ?director. ?director dbpedia-owl:nationalitydbpedia:Italy . ?xowl:sameAs ?film . } SERVICE <http://data.linkedmdb.org/sparql> { ?xlinkedMDB:genre ?genre } } SPARQL 1.1 SERVICE clause Need knowledge of which part of the query should be solved by which endpoint 5
  • 7. • Top-down approaches – Data warehousing approach – Virtual integration approach 6
  • 8. Querying Linked Data • Data warehousing approach – Collect the data in a central triple store – Process queries on that triple store • Disadvantages • Expensive preprocessing (data collection + integration) and maintenance • Data not up to date 7
  • 9. Querying Linked Data • Virtual integration approach – A query federation middleware • Parse and split into subqueries • Source selection for subqueries • Evaluate the subqueries against corresponding sources directly • Merge the results – Advantages • no preprocessing and maintenance • up to date data 8
  • 10. Running Example Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient drugbank http://www4.wiwiss.fu-berlin.de/drugbank/sparql SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql DBpedia http://data.linkedmdb.org/sparql 9
  • 11. Parsing and Source Selecting drugbank Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient http://www4.wiwiss.fu-berlin.de/drugbank/sparql ?drug drugbank:drugCategorydrugbank-category:micronutrient ?drug drugbank:casRegistryNumber ?id SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql ?keggDrugrdf:typekegg:Drug ?keggDrug bio2rdf:xRef ?id Send ask queries to all the sources ASK {?drug drugbank:drugCategorydrugbank-category:micronutrient } ASK {?drug drugbank:casRegistryNumber ?id } ASK {?keggDrugrdf:typekegg:Drug } ASK {?keggDrug bio2rdf:xRef ?id } ASK {?keggDrugpurl:title ?title } DBpedia http://data.linkedmdb.org/sparql ?keggDrugpurl:title ?title 10
  • 12. Evaluating subqueries • Two options – All triple patterns are individually evaluated – Nested loop join (NLJ): evaluate iteratively pattern by pattern 11
  • 13. Evaluating subqueries • Example of NLJ SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } 12
  • 14. Optimization techniques • Source Selection – Indexing: characterization of RDF graphs – Statistics based catalogue – Caching 13
  • 15. Optimization techniques • Exclusive grouping SELECT ?drug ?title WHERE { ?drug drugbank:drugCategorydrugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrugrdf:typekegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrugpurl:title ?title . } 14
  • 16. Optimization techniques • Bound join – Effective for NLJ – Mappings of variable values from the intermediate result to the next subquery – SPARQL 1.1 VALUES clause 15
  • 17. Optimization techniques • Hash join – Hash table for intermediate mappings 16
  • 18. Optimization techniques • FILTER optimization – Evaluating the subqueries with corresponding FILTERS – Reduces the number of intermediate results 17
  • 19. Optimization techniques • Parallelization – Effective for individual subquery evaluation approach 18
  • 20. Optimization techniques • Join order – Selectivity estimation: join order heuristic based on the number of bound and free variables [1] – Statistics: based on cardinalities of the triple patterns [2] [1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th international conference on World Wide Web (WWW '08) [2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th international conference on World wide web (WWW '10) 19
  • 21. Selectivity Estimation • Ideal for the Linked Data scenario – No need for statistics about the underlying data – Estimations can be wrong however BGP Number of results Selectivity estimation ?v1 p1 ?v2 ?v1 p2 ?v3 5 2 ?v4 p3 c1 ?v4 p4 ?v1 10 1 20
  • 22. Query Performance Prediction • Learn query performance from already executed queries • Statistics about the underlying data not required • Applications – Query optimization: join order – Workload management/scheduling 21
  • 23. Regression Learn a mapping function f(X) = y X = feature vector, vector representation of SPARQL query y = performance metric (e.g. latency, number of results) Support vector machine with nu-SVR k-nearest neighbors regression 22
  • 24. Feature Extraction • How can we represent SPARQL queries as vectors? 23
  • 25. SPARQL • SPARQL algebra features • Graph pattern features 24
  • 29. Graph Pattern Features • Clustering Training Queries – K-mediods clustering algorithm with approximated edit distance as distance function • Selects data points as cluster centers • Arbitrary distance function 28
  • 30. Graph Pattern Features • Graph Edit Distance – Minimum amount of distortion needed to transform one graph to another – Compute similarity by inversing distance 29
  • 31. Graph Pattern Features • Graph Edit Distance – Usually computed using A* search • Exponential running time – Bipartite matching based approximated graph edit distance with • Previous research shows accurate results with classification problems 30
  • 32. Experiment Setup • Triple store and hardware – Jena TDB 1.0.0 – 16 GB memory – 2.53 GHz CPU – 48 GB system RAM – Linux 2.6.32 operating system 31
  • 33. Experiment Setup • Datasets – Training, validation, test datasets generated from 25 DBPSB query templates • 1260 training queries • 420 validation queries • 420 test queries – RDF data: DBpedia 3.5.1 with 100% scaling factor from DBPSB framework 32
  • 34. Prediction Models • Support Vector Machine (SVM) with the nuSVR kernel for regression • k-nearest neighbors (k-NN) regression – Euclidean distance as the distance function – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors 33
  • 35. Evaluation Measures • Coefficient of determination • Root mean squared error (RMSE) 34
  • 36. Predicting Query Execution Time • SPARQL algebra features 35
  • 37. Predicting Query Execution Time • SPARQL algebra and graph pattern features 36
  • 38. What’s Next • Apply QPP in join order optimization • Benchmarking – FedBench: a benchmarking framework for federated query processing • Systematic generation of training queries – Bootstrapping – Refining training queries from query logs 37
  • 39. Statistical Analysis of Query Logs • Approach to Systematic generation of training queries Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011) 38
  • 40. Summary • Distributed query processing • Optimization techniques • Query performance prediction – Join order optimization 39
  • 42. Query Explanation • Query plan explanation • Query result explanation • Motivation – Understanding – Transparency – Trust 41
  • 43. Query Explanation in the Semantic Web Query plan explanation Query result explanation Jena Sesame Virtuoso * BigOWLIM + * Explanation of SPARQL to SQL translation + a debugging feature on the query engine side 42
  • 44. Distributed Query Explanation in the Semantic Web Query plan explanation Query result explanation FedX DARQ SemWIQ ADERIS Anapsid 43
  • 45. ADERIS Query Plan Explanation 44
  • 46. Related Work • Database: Why, how, where provenance • Semantic Web: – Inference explanation • Provenance • Generating justifications 45
  • 47. Distributed Query Explanation What We Provide • Query plan explanation – Work in progress – Prior to query execution • Includes predicted performance metrics – After query execution with performance metrics • Query result explanation – Why provenance – Where provenance 46
  • 48. Query Result Explanation (RDF model, Query, Result) Result Explainer Plug-in Can be any RDF model (RDF graph, SPARQL endpoint) We generate result explanations by querying this model for why provenance 47
  • 49. Why Provenance • Triples in virtual model from which the result is derived <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" . 48
  • 50. Where Provenance • Keep the provenance of sources in the virtual model fedqld:source1 { <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . } fedqld:source2 { <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . Where the triples in this graph come from <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" . } fedqld:prov { fedqld:source1 void:sparqlEndpoint<http://localhost:3030/books/query> . fedqld:source2 void:sparqlEndpoint<http://localhost:3031/person/query> . } 49
  • 51. What’s next • Explanation user interfaces • Evaluating the impacts of our explanations 50
  • 52. References • • • Andreas Schwarte, Peter Haase, KatjaHoose, Ralf Schenkel, and Michael Schmidt. Fedx: A federation layer for distributed query processing on linked open data. In ESWC, 2011 Shinji Kikuchi, Shelly Sachdeva, SubhashBhalla, Steven Lynden, Isao Kojima, AkiyoshiMatono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. Databases in Networked Information Systems Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a benchmark suite for federated semantic data query processing. In Proceedings of the 10th international conference on The semantic web - Volume Part I (ISWC'11) 51

Hinweis der Redaktion

  1. Overview of distributed query processing, extracting interesting features from sparql queries for machine learning algorithms, explaining sparql queries
  2. The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another.The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
  3. RMSD represents the sample standard deviation of the differences between predicted values and observed valuesprediction errors when computed out-of-sample
  4. As the figures show, predictions for queries from templates 15, 22, 24, and 25 have large errors. Execution times of the queries from template 15 range from 1 ms to 21 ms, template 22 range from 1 ms to 22 ms, and template 25 range from 1 ms to 24 ms. For this reason, many data points in range of 1 ms to 24 ms are distant from the perfect prediction line in Figure~\ref{fig:all_predictions}(a). Execution times of the queries from template 24 range from 86 ms to 110 ms. Many data points at this interval are at even more distance from the perfect prediction line than the interval 1 ms to 24 ms. This explains the highest error for queries belonging to template 25.
  5. Refining training queries from query logs by considering the statistically significant characteristics Bootstrapping: Starting with a initial set of properties, resources and literals and than generate training queries by permutations and combinations of the statistically significant features Simplifying the pattern features: join features, triple pattern features, pattern graph features to represent query patterns
  6. our goal is not to find the minimal parts of axioms in a justification which are required to hold an entailment.
  7. triples in the virtual model for the query result are causes