Strategies for Processing and Explaining Distributed Queries on Linked Data
1. Strategies for Processing and
Explaining Distributed Queries on
Linked Data
Rakebul Hasan
Wimmics
Inria Sophia Antipolis
2. Research Theme
• Distributed Query Processing
– Optimization techniques for querying Linked Data
• Distributed Query Explanation
– Query plan explanation
• How query solving works
– Query result explanation
• Why
• Where
4. Querying Linked Data
• Bottom-up strategies
– Discover sources during query processing by
following links between sources
• Top-down strategies
– Sources are known
5. Querying Linked Data
FedBench Query CD5: Find the director and the
genre of movies directed by Italians
SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre
}
6. Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies
directed by Italians
SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre
  }
}
SPARQL 1.1 SERVICE clause: requires knowing which
part of the query should be solved by which endpoint
8. Querying Linked Data
• Data warehousing approach
– Collect the data in a central triple store
– Process queries on that triple store
• Disadvantages
– Expensive preprocessing (data collection + integration)
and maintenance
– Data not up to date
9. Querying Linked Data
• Virtual integration approach
– A query federation middleware
• Parse and split into subqueries
• Source selection for subqueries
• Evaluate the subqueries against corresponding sources
directly
• Merge the results
– Advantages
• No preprocessing or maintenance
• Up-to-date data
10. Running Example
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
drugbank
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
DBpedia
http://dbpedia.org/sparql
11. Parsing and Source Selecting
drugbank
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
?drug drugbank:drugCategory drugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
Send ASK queries to all the sources:
ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
ASK { ?drug drugbank:casRegistryNumber ?id }
ASK { ?keggDrug rdf:type kegg:Drug }
ASK { ?keggDrug bio2rdf:xRef ?id }
ASK { ?keggDrug purl:title ?title }
DBpedia
http://dbpedia.org/sparql
?keggDrug purl:title ?title
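The ASK-based source selection above can be sketched as follows; the per-source predicate sets simulate the endpoints' ASK answers and are illustrative, not actual endpoint contents.

```python
# Sketch of ASK-based source selection. A real federation engine sends
# SPARQL ASK queries to each endpoint; here each endpoint is simulated
# by the set of predicates it can answer (illustrative data).
SOURCES = {
    "drugbank": {"drugbank:drugCategory", "drugbank:casRegistryNumber"},
    "kegg": {"rdf:type", "bio2rdf:xRef", "purl:title"},
    "dbpedia": {"dbpedia-owl:director", "owl:sameAs"},
}

# Triple patterns of query LS6 as (subject, predicate, object) tuples.
PATTERNS = [
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]

def ask(source, pattern):
    """Simulate ASK { pattern } against a source: true if the source
    holds triples using the pattern's predicate."""
    return pattern[1] in SOURCES[source]

def select_sources(patterns):
    """Map each triple pattern to the sources whose ASK answer is true."""
    return {p: [s for s in SOURCES if ask(s, p)] for p in patterns}

selection = select_sources(PATTERNS)
```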
12. Evaluating subqueries
• Two options
– All triple patterns are individually evaluated
– Nested loop join (NLJ): evaluate the patterns
iteratively, one by one
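A minimal sketch of the NLJ option, assuming solution mappings are plain dicts; the drug identifiers and CAS numbers below are made up for illustration.

```python
# Nested loop join over SPARQL solution mappings represented as dicts;
# the drugs and CAS numbers are made up for illustration.
def nested_loop_join(left, right_evaluator):
    """For each mapping produced by the previous subquery, evaluate the
    next triple pattern with its bindings substituted in, and merge the
    compatible results. One remote call per left mapping."""
    out = []
    for mu in left:
        for nu in right_evaluator(mu):
            merged = dict(mu)
            merged.update(nu)
            out.append(merged)
    return out

cas_numbers = {"d1": "59-30-3", "d2": "68-26-8"}  # toy "endpoint" data

def eval_cas(mu):
    """Evaluate '?drug drugbank:casRegistryNumber ?id' with ?drug bound."""
    drug = mu["drug"]
    return [{"id": cas_numbers[drug]}] if drug in cas_numbers else []

left = [{"drug": "d1"}, {"drug": "d2"}, {"drug": "d3"}]
joined = nested_loop_join(left, eval_cas)
```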
13. Evaluating subqueries
• Example of NLJ
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
16. Optimization techniques
• Bound join
– Effective for NLJ
– Pass variable bindings from the intermediate
result into the next subquery in one batch
– SPARQL 1.1 VALUES clause
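A sketch of how a bound join can batch intermediate bindings into a single SPARQL 1.1 VALUES clause; the helper name and the example IRIs are hypothetical.

```python
def bound_join_query(template, var, values):
    """Rewrite a subquery so that a single request carries a whole
    batch of intermediate bindings via a SPARQL 1.1 VALUES clause,
    instead of one request per binding as in a plain nested loop join."""
    block = "VALUES ?%s { %s }" % (var, " ".join(values))
    return "SELECT * WHERE { %s %s }" % (block, template)

# Bindings for ?drug from the previous subquery (hypothetical IRIs).
q = bound_join_query("?drug drugbank:casRegistryNumber ?id .",
                     "drug",
                     ["<http://ex.org/d1>", "<http://ex.org/d2>"])
```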
18. Optimization techniques
• FILTER optimization
– Evaluate the subqueries together with their
corresponding FILTERs
– Reduces the number of intermediate results
20. Optimization techniques
• Join order
– Selectivity estimation: join order heuristic based
on the number of bound and free variables [1]
– Statistics: based on cardinalities of the triple
patterns [2]
[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation.
In Proceedings of the 17th international conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data.
In Proceedings of the 19th international conference on World wide web (WWW '10)
21. Selectivity Estimation
• Ideal for the Linked Data scenario
– No need for statistics about the underlying data
– Estimations can be wrong however
BGP                        Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3    5                   2
?v4 p3 c1 . ?v4 p4 ?v1     10                  1
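The variable-counting heuristic of [1] can be sketched roughly as follows: patterns with fewer free variables are assumed more selective and are evaluated first. This is a simplification of the original heuristic, which also weighs which positions are bound; the pattern data is illustrative.

```python
# Variable-counting join order heuristic (simplified from [1]): rank
# triple patterns by their number of free variables and evaluate the
# most selective (fewest variables) first.
def selectivity_rank(pattern):
    """Lower rank = assumed more selective = evaluated earlier."""
    return sum(1 for term in pattern if term.startswith("?"))

bgp = [
    ("?v1", "p1", "?v2"),  # two free variables
    ("?v4", "p3", "c1"),   # one free variable: estimated most selective
    ("?v1", "p2", "?v3"),
]
ordered = sorted(bgp, key=selectivity_rank)
```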
22. Query Performance Prediction
• Learn query performance from already
executed queries
• Statistics about the underlying data not
required
• Applications
– Query optimization: join order
– Workload management/scheduling
23. Regression
Learn a mapping function
f(X) = y
X = feature vector (a vector representation of the SPARQL query)
y = performance metric (e.g. latency, number of results)
Support vector machine with nu-SVR
k-nearest neighbors regression
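A toy version of the k-NN regressor (plain Euclidean distance, no k-d tree); the feature vectors and latencies are made up for illustration.

```python
import math

# Toy k-nearest-neighbour regressor over query feature vectors.
def knn_predict(train, x, k=2):
    """Average the metric of the k training queries whose feature
    vectors are closest to x. 'train' is a list of (features, metric)."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    nearest = sorted(train, key=lambda fy: dist(fy[0], x))[:k]
    return sum(y for _, y in nearest) / k

# Made-up training set: (feature vector, observed latency in ms).
train = [((1.0, 2.0), 10.0), ((1.0, 2.5), 12.0), ((9.0, 9.0), 100.0)]
pred = knn_predict(train, (1.0, 2.2), k=2)
```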
29. Graph Pattern Features
• Clustering Training Queries
– k-medoids clustering algorithm with
approximated edit distance as the distance function
• Selects data points as cluster centers
• Allows an arbitrary distance function
30. Graph Pattern Features
• Graph Edit Distance
– Minimum amount of distortion needed to
transform one graph to another
– Compute similarity by inverting the distance
31. Graph Pattern Features
• Graph Edit Distance
– Usually computed using A* search
• Exponential running time
– Bipartite-matching-based approximated graph
edit distance
• Previous research shows accurate results on
classification problems
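To make the definition concrete, here is a brute-force graph edit distance for tiny labeled graphs, searching over node assignments with unit edit costs. Real systems use A* or the bipartite-matching approximation mentioned above; this exponential sketch only illustrates the definition.

```python
from itertools import permutations

def ged(g1, g2):
    """Graph edit distance between two small labeled directed graphs
    given as (nodes, edges), computed by brute force over node
    assignments with unit costs for node/edge insertion, deletion,
    and substitution. Exponential; a sketch of the definition only."""
    (n1, e1), (n2, e2) = g1, g2
    if len(n1) > len(n2):                 # map the smaller graph into the larger
        n1, e1, n2, e2 = n2, e2, n1, e1
    best = float("inf")
    for perm in permutations(n2, len(n1)):
        m = dict(zip(n1, perm))
        # substituted node labels + nodes of the larger graph left uncovered
        node_cost = sum(a != b for a, b in m.items()) + (len(n2) - len(n1))
        mapped = {(m[u], m[v]) for u, v in e1}
        edge_cost = len(mapped ^ set(e2))  # edges missing on either side
        best = min(best, node_cost + edge_cost)
    return best

g1 = (["a", "b"], [("a", "b")])
g2 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
```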
32. Experiment Setup
• Triple store and hardware
– Jena TDB 1.0.0
– 16 GB memory
– 2.53 GHz CPU
– 48 GB system RAM
– Linux 2.6.32 operating system
33. Experiment Setup
• Datasets
– Training, validation, test datasets generated from
25 DBPSB query templates
• 1260 training queries
• 420 validation queries
• 420 test queries
– RDF data: DBpedia 3.5.1 with 100% scaling factor
from DBPSB framework
34. Prediction Models
• Support Vector Machine (SVM) with nu-SVR for regression
• k-nearest neighbors (k-NN) regression
– Euclidean distance as the distance function
– k-dimensional tree (k-d tree) data structure to
compute the nearest neighbors
38. What’s Next
• Apply QPP in join order optimization
• Benchmarking
– FedBench: a benchmarking framework for
federated query processing
• Systematic generation of training queries
– Bootstrapping
– Refining training queries from query logs
39. Statistical Analysis of Query Logs
• An approach to systematic generation of training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
42. Query Explanation
• Query plan explanation
• Query result explanation
• Motivation
– Understanding
– Transparency
– Trust
43. Query Explanation in the Semantic
Web
Query plan explanation
Query result explanation
Jena
Sesame
Virtuoso *
BigOWLIM +
* Explanation of SPARQL-to-SQL translation
+ A debugging feature on the query engine side
44. Distributed Query Explanation in the
Semantic Web
Query plan explanation
Query result explanation
FedX
DARQ
SemWIQ
ADERIS
ANAPSID
46. Related Work
• Database: Why, how, where provenance
• Semantic Web:
– Inference explanation
• Provenance
• Generating justifications
47. Distributed Query Explanation
What We Provide
• Query plan explanation – Work in progress
– Prior to query execution
• Includes predicted performance metrics
– After query execution with performance metrics
• Query result explanation
– Why provenance
– Where provenance
48. Query Result Explanation
Input: (RDF model, Query, Result) → Result Explainer Plug-in
The RDF model can be any RDF model (an RDF graph, a SPARQL endpoint).
We generate result explanations by querying this model for why provenance.
49. Why Provenance
• Triples in the virtual model from which the
result is derived
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
50. Where Provenance
• Keep the provenance of sources in the virtual model
fedqld:source1 {
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
}
# records where the triples in each graph come from
fedqld:prov {
fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}
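Given such named graphs, where provenance reduces to looking up which graphs contain a result triple. The sketch below mirrors the fedqld: example; the triples are abbreviated with illustrative prefixes.

```python
# Where provenance over a virtual model stored as named graphs: each
# source's triples live in their own graph, and a provenance mapping
# records each graph's endpoint (names mirror the fedqld: example).
graphs = {
    "fedqld:source1": [("ex:book8", "dc:creator", "ex:Alice")],
    "fedqld:source2": [("ex:Alice", "foaf:name", '"Alice"')],
}
endpoints = {
    "fedqld:source1": "http://localhost:3030/books/query",
    "fedqld:source2": "http://localhost:3031/person/query",
}

def where_provenance(triple):
    """Return the endpoint of every named graph that contains the triple."""
    return [endpoints[g] for g, triples in graphs.items() if triple in triples]
```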
52. References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: a federation layer for distributed query processing on linked open data. In ESWC, 2011
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. In Databases in Networked Information Systems (DNIS)
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. FedBench: a benchmark suite for federated semantic data query processing. In ISWC, 2011
Editor's notes
Overview of distributed query processing; extracting interesting features from SPARQL queries for machine learning algorithms; explaining SPARQL queries
The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another. The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
RMSD represents the sample standard deviation of the differences between predicted values and observed values (prediction errors) when computed out-of-sample
As the figures show, predictions for queries from templates 15, 22, 24, and 25 have large errors. Execution times of the queries from template 15 range from 1 ms to 21 ms, template 22 from 1 ms to 22 ms, and template 25 from 1 ms to 24 ms. For this reason, many data points in the range of 1 ms to 24 ms are distant from the perfect prediction line in the all-predictions figure. Execution times of the queries from template 24 range from 86 ms to 110 ms. Many data points in this interval are even more distant from the perfect prediction line than those in the 1 ms to 24 ms interval. This explains the highest error for queries belonging to template 24.
Refining training queries from query logs by considering the statistically significant characteristics. Bootstrapping: starting with an initial set of properties, resources, and literals, and then generating training queries by permutations and combinations of the statistically significant features. Simplifying the pattern features: join features, triple pattern features, and pattern graph features to represent query patterns
Our goal is not to find the minimal parts of axioms in a justification that are required to hold an entailment.
The triples in the virtual model from which the query result is derived are the causes.