Strategies for Processing and Explaining Distributed Queries on Linked Data
1. Strategies for Processing and
Explaining Distributed Queries on
Linked Data
Rakebul Hasan
Wimmics
Inria Sophia Antipolis
2. Research Theme
• Distributed Query Processing
– Optimization techniques for querying Linked Data
• Distributed Query Explanation
– Query plan explanation
• How query solving works
– Query result explanation
• Why
• Where
4. Querying Linked Data
• Bottom-up strategies
– Discover sources during query processing by
following links between sources
• Top-down strategies
– Sources are known
5. Querying Linked Data
FedBench Query CD5: Find the director and the
genre of movies directed by Italians
SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre
}
6. Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies
directed by Italians
SELECT ?film ?director ?genre WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?film dbpedia-owl:director ?director .
    ?director dbpedia-owl:nationality dbpedia:Italy .
    ?x owl:sameAs ?film .
  }
  SERVICE <http://data.linkedmdb.org/sparql> {
    ?x linkedMDB:genre ?genre
  }
}
SPARQL 1.1 SERVICE clause: requires knowing which
part of the query should be solved by which endpoint
8. Querying Linked Data
• Data warehousing approach
– Collect the data in a central triple store
– Process queries on that triple store
• Disadvantages
– Expensive preprocessing (data collection + integration)
and maintenance
– Data not up to date
9. Querying Linked Data
• Virtual integration approach
– A query federation middleware
• Parse and split into subqueries
• Source selection for subqueries
• Evaluate the subqueries against corresponding sources
directly
• Merge the results
– Advantages
• No preprocessing or maintenance
• Up-to-date data
10. Running Example
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
drugbank
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
DBpedia
http://dbpedia.org/sparql
11. Parsing and Source Selecting
drugbank
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
?drug drugbank:drugCategory drugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
Send ASK queries to all the sources:
ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
ASK { ?drug drugbank:casRegistryNumber ?id }
ASK { ?keggDrug rdf:type kegg:Drug }
ASK { ?keggDrug bio2rdf:xRef ?id }
ASK { ?keggDrug purl:title ?title }
DBpedia
http://dbpedia.org/sparql
?keggDrug purl:title ?title
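The ASK-based source selection above can be sketched as follows; the per-source predicate sets simulate the endpoints' ASK answers and are illustrative, not actual endpoint contents.

```python
# Sketch of ASK-based source selection. A real federation engine sends
# SPARQL ASK queries to each endpoint; here each endpoint is simulated
# by the set of predicates it can answer (illustrative data).
SOURCES = {
    "drugbank": {"drugbank:drugCategory", "drugbank:casRegistryNumber"},
    "kegg": {"rdf:type", "bio2rdf:xRef", "purl:title"},
    "dbpedia": {"dbpedia-owl:director", "owl:sameAs"},
}

# Triple patterns of query LS6 as (subject, predicate, object) tuples.
PATTERNS = [
    ("?drug", "drugbank:drugCategory", "drugbank-category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]

def ask(source, pattern):
    """Simulate ASK { pattern } against a source: true if the source
    holds triples using the pattern's predicate."""
    return pattern[1] in SOURCES[source]

def select_sources(patterns):
    """Map each triple pattern to the sources whose ASK answer is true."""
    return {p: [s for s in SOURCES if ask(s, p)] for p in patterns}

selection = select_sources(PATTERNS)
```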
12. Evaluating subqueries
• Two options
– All triple patterns are individually evaluated
– Nested loop join (NLJ): evaluate the patterns
iteratively, one by one
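A minimal sketch of the NLJ option, assuming solution mappings are plain dicts; the drug identifiers and CAS numbers below are made up for illustration.

```python
# Nested loop join over SPARQL solution mappings represented as dicts;
# the drugs and CAS numbers are made up for illustration.
def nested_loop_join(left, right_evaluator):
    """For each mapping produced by the previous subquery, evaluate the
    next triple pattern with its bindings substituted in, and merge the
    compatible results. One remote call per left mapping."""
    out = []
    for mu in left:
        for nu in right_evaluator(mu):
            merged = dict(mu)
            merged.update(nu)
            out.append(merged)
    return out

cas_numbers = {"d1": "59-30-3", "d2": "68-26-8"}  # toy "endpoint" data

def eval_cas(mu):
    """Evaluate '?drug drugbank:casRegistryNumber ?id' with ?drug bound."""
    drug = mu["drug"]
    return [{"id": cas_numbers[drug]}] if drug in cas_numbers else []

left = [{"drug": "d1"}, {"drug": "d2"}, {"drug": "d3"}]
joined = nested_loop_join(left, eval_cas)
```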
13. Evaluating subqueries
• Example of NLJ
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
16. Optimization techniques
• Bound join
– Effective for NLJ
– Pass variable bindings from the intermediate
result into the next subquery in one batch
– SPARQL 1.1 VALUES clause
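A sketch of how a bound join can batch intermediate bindings into a single SPARQL 1.1 VALUES clause; the helper name and the example IRIs are hypothetical.

```python
def bound_join_query(template, var, values):
    """Rewrite a subquery so that a single request carries a whole
    batch of intermediate bindings via a SPARQL 1.1 VALUES clause,
    instead of one request per binding as in a plain nested loop join."""
    block = "VALUES ?%s { %s }" % (var, " ".join(values))
    return "SELECT * WHERE { %s %s }" % (block, template)

# Bindings for ?drug from the previous subquery (hypothetical IRIs).
q = bound_join_query("?drug drugbank:casRegistryNumber ?id .",
                     "drug",
                     ["<http://ex.org/d1>", "<http://ex.org/d2>"])
```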
18. Optimization techniques
• FILTER optimization
– Evaluate the subqueries together with their
corresponding FILTERs
– Reduces the number of intermediate results
20. Optimization techniques
• Join order
– Selectivity estimation: join order heuristic based
on the number of bound and free variables [1]
– Statistics: based on cardinalities of the triple
patterns [2]
[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation.
In Proceedings of the 17th international conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data.
In Proceedings of the 19th international conference on World wide web (WWW '10)
21. Selectivity Estimation
• Ideal for the Linked Data scenario
– No need for statistics about the underlying data
– Estimations can be wrong however
BGP                        Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3    5                   2
?v4 p3 c1 . ?v4 p4 ?v1     10                  1
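The variable-counting heuristic of [1] can be sketched roughly as follows: patterns with fewer free variables are assumed more selective and are evaluated first. This is a simplification of the original heuristic, which also weighs which positions are bound; the pattern data is illustrative.

```python
# Variable-counting join order heuristic (simplified from [1]): rank
# triple patterns by their number of free variables and evaluate the
# most selective (fewest variables) first.
def selectivity_rank(pattern):
    """Lower rank = assumed more selective = evaluated earlier."""
    return sum(1 for term in pattern if term.startswith("?"))

bgp = [
    ("?v1", "p1", "?v2"),  # two free variables
    ("?v4", "p3", "c1"),   # one free variable: estimated most selective
    ("?v1", "p2", "?v3"),
]
ordered = sorted(bgp, key=selectivity_rank)
```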
22. Query Performance Prediction
• Learn query performance from already
executed queries
• Statistics about the underlying data not
required
• Applications
– Query optimization: join order
– Workload management/scheduling
23. Regression
Learn a mapping function
f(X) = y
X = feature vector (a vector representation of the SPARQL query)
y = performance metric (e.g. latency, number of results)
Support vector machine with nu-SVR
k-nearest neighbors regression
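A toy version of the k-NN regressor (plain Euclidean distance, no k-d tree); the feature vectors and latencies are made up for illustration.

```python
import math

# Toy k-nearest-neighbour regressor over query feature vectors.
def knn_predict(train, x, k=2):
    """Average the metric of the k training queries whose feature
    vectors are closest to x. 'train' is a list of (features, metric)."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    nearest = sorted(train, key=lambda fy: dist(fy[0], x))[:k]
    return sum(y for _, y in nearest) / k

# Made-up training set: (feature vector, observed latency in ms).
train = [((1.0, 2.0), 10.0), ((1.0, 2.5), 12.0), ((9.0, 9.0), 100.0)]
pred = knn_predict(train, (1.0, 2.2), k=2)
```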
29. Graph Pattern Features
• Clustering Training Queries
– k-medoids clustering algorithm with
approximated edit distance as the distance function
• Selects data points as cluster centers
• Allows an arbitrary distance function
30. Graph Pattern Features
• Graph Edit Distance
– Minimum amount of distortion needed to
transform one graph to another
– Compute similarity by inverting the distance
31. Graph Pattern Features
• Graph Edit Distance
– Usually computed using A* search
• Exponential running time
– Bipartite-matching-based approximated graph
edit distance
• Previous research shows accurate results on
classification problems
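To make the definition concrete, here is a brute-force graph edit distance for tiny labeled graphs, searching over node assignments with unit edit costs. Real systems use A* or the bipartite-matching approximation mentioned above; this exponential sketch only illustrates the definition.

```python
from itertools import permutations

def ged(g1, g2):
    """Graph edit distance between two small labeled directed graphs
    given as (nodes, edges), computed by brute force over node
    assignments with unit costs for node/edge insertion, deletion,
    and substitution. Exponential; a sketch of the definition only."""
    (n1, e1), (n2, e2) = g1, g2
    if len(n1) > len(n2):                 # map the smaller graph into the larger
        n1, e1, n2, e2 = n2, e2, n1, e1
    best = float("inf")
    for perm in permutations(n2, len(n1)):
        m = dict(zip(n1, perm))
        # substituted node labels + nodes of the larger graph left uncovered
        node_cost = sum(a != b for a, b in m.items()) + (len(n2) - len(n1))
        mapped = {(m[u], m[v]) for u, v in e1}
        edge_cost = len(mapped ^ set(e2))  # edges missing on either side
        best = min(best, node_cost + edge_cost)
    return best

g1 = (["a", "b"], [("a", "b")])
g2 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
```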
32. Experiment Setup
• Triple store and hardware
– Jena TDB 1.0.0
– 16 GB memory
– 2.53 GHz CPU
– 48 GB system RAM
– Linux 2.6.32 operating system
33. Experiment Setup
• Datasets
– Training, validation, test datasets generated from
25 DBPSB query templates
• 1260 training queries
• 420 validation queries
• 420 test queries
– RDF data: DBpedia 3.5.1 with 100% scaling factor
from DBPSB framework
34. Prediction Models
• Support Vector Machine (SVM) with nu-SVR for regression
• k-nearest neighbors (k-NN) regression
– Euclidean distance as the distance function
– k-dimensional tree (k-d tree) data structure to
compute the nearest neighbors
38. What’s Next
• Apply QPP in join order optimization
• Benchmarking
– FedBench: a benchmarking framework for
federated query processing
• Systematic generation of training queries
– Bootstrapping
– Refining training queries from query logs
39. Statistical Analysis of Query Logs
• An approach to systematic generation of training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
42. Query Explanation
• Query plan explanation
• Query result explanation
• Motivation
– Understanding
– Transparency
– Trust
43. Query Explanation in the Semantic
Web
Query plan explanation
Query result explanation
Jena
Sesame
Virtuoso *
BigOWLIM +
* Explanation of SPARQL-to-SQL translation
+ A debugging feature on the query engine side
44. Distributed Query Explanation in the
Semantic Web
Query plan explanation
Query result explanation
FedX
DARQ
SemWIQ
ADERIS
ANAPSID
46. Related Work
• Database: Why, how, where provenance
• Semantic Web:
– Inference explanation
• Provenance
• Generating justifications
47. Distributed Query Explanation
What We Provide
• Query plan explanation – Work in progress
– Prior to query execution
• Includes predicted performance metrics
– After query execution with performance metrics
• Query result explanation
– Why provenance
– Where provenance
48. Query Result Explanation
Input: (RDF model, Query, Result) → Result Explainer Plug-in
The RDF model can be any RDF model (an RDF graph, a SPARQL endpoint).
We generate result explanations by querying this model for why provenance.
49. Why Provenance
• Triples in the virtual model from which the
result is derived
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
50. Where Provenance
• Keep the provenance of sources in the virtual model
fedqld:source1 {
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
}
# records where the triples in each graph come from
fedqld:prov {
fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}
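Given such named graphs, where provenance reduces to looking up which graphs contain a result triple. The sketch below mirrors the fedqld: example; the triples are abbreviated with illustrative prefixes.

```python
# Where provenance over a virtual model stored as named graphs: each
# source's triples live in their own graph, and a provenance mapping
# records each graph's endpoint (names mirror the fedqld: example).
graphs = {
    "fedqld:source1": [("ex:book8", "dc:creator", "ex:Alice")],
    "fedqld:source2": [("ex:Alice", "foaf:name", '"Alice"')],
}
endpoints = {
    "fedqld:source1": "http://localhost:3030/books/query",
    "fedqld:source2": "http://localhost:3031/person/query",
}

def where_provenance(triple):
    """Return the endpoint of every named graph that contains the triple."""
    return [endpoints[g] for g, triples in graphs.items() if triple in triples]
```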
52. References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: a federation layer for distributed query processing on linked open data. In ESWC, 2011
• Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. In Databases in Networked Information Systems (DNIS)
• Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. FedBench: a benchmark suite for federated semantic data query processing. In ISWC, 2011
Editor's notes
Overview of distributed query processing; extracting interesting features from SPARQL queries for machine learning algorithms; explaining SPARQL queries
The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another. The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
RMSD represents the sample standard deviation of the differences between predicted values and observed values (prediction errors) when computed out-of-sample
As the figures show, predictions for queries from templates 15, 22, 24, and 25 have large errors. Execution times of the queries from template 15 range from 1 ms to 21 ms, template 22 from 1 ms to 22 ms, and template 25 from 1 ms to 24 ms. For this reason, many data points in the range of 1 ms to 24 ms are distant from the perfect prediction line in the all-predictions figure. Execution times of the queries from template 24 range from 86 ms to 110 ms. Many data points in this interval are even more distant from the perfect prediction line than those in the 1 ms to 24 ms interval. This explains the highest error for queries belonging to template 24.
Refining training queries from query logs by considering the statistically significant characteristics. Bootstrapping: starting with an initial set of properties, resources, and literals, and then generating training queries by permutations and combinations of the statistically significant features. Simplifying the pattern features: join features, triple pattern features, and pattern graph features to represent query patterns
Our goal is not to find the minimal parts of axioms in a justification that are required to hold an entailment.
The triples in the virtual model from which the query result is derived are the causes.