The talk was given at the 15th International Conference on Extending Database Technology (EDBT 2012) on March 29, 2012 in Berlin, Germany.
Abstract:
Query optimization in RDF stores is a challenging problem, as SPARQL queries typically contain many more joins than equivalent relational plans and hence lead to a large join order search space. In such cases, cost-based query optimization is often not possible. One practical reason is that statistics are typically missing in web-scale settings such as the Linked Open Datasets (LOD). The more profound reason is that, due to the absence of schematic structure in RDF, join-hit ratio estimation requires complicated forms of correlated join statistics, and currently there are no methods to identify the relevant correlations beforehand. For this reason, good heuristics are essential in SPARQL query optimization, even when they are only partially combined with cost-based statistics (i.e., hybrid query optimization). In this paper we describe a set of useful heuristics for SPARQL query optimizers. We present these in the context of a new Heuristic SPARQL Planner (HSP) that exploits the syntactic and structural variations of the triple patterns in a SPARQL query in order to choose an execution plan without needing any cost model. For this, we define the variable graph and show a reduction of the SPARQL query optimization problem to the maximum weight independent set problem. We implemented our planner on top of the MonetDB open source column-store and evaluated its effectiveness against the state-of-the-art RDF-3X engine; we also compared the plan quality with a relational (SQL) equivalent of the benchmarks.
Heuristic based Query Optimisation for SPARQL
1. Heuristic-Based Optimization for SPARQL Queries
Lefteris Sidirourgos
P. Tsialiamanis (1,2), Irini Fundulaki (1), V. Christophides (1,2), L. Sidirourgos (3), P. Boncz (3)
(1) Institute of Computer Science - FORTH
(2) University of Crete, Greece
(3) CWI, The Netherlands
EDBT 2012
Berlin, Germany
2. Web of Data
Knowledge Bases (Wikipedia, US Census Bureau, CIA World Factbook) and Social Networks
Government Sites (data.gov.uk, www.data.gov)
Entertainment, Music, News Sources (BBC), ...
Scientific Data Sources (Biology, Physics, Astronomy, Geography, ...)
Main Challenge: Efficient Management and Usage of large (huge) volumes of Semantic Web Data
3. Querying Linked Data: Where does existing Database Technology fit?
Existing Relational Stores are problematic for querying the Web of Data
  absence of schema and constraints for RDF data leads to query plans that are hard to optimize: a query plan contains a large number of self joins over a large triple table
  relational optimizers compute the cost of scans but not the join hit ratio(s): the subject-subject and subject-object join hit ratios will have the same estimate since correlations between triple components are not considered
  correlated cost estimation over many selections and self joins: the rule and not the exception for RDF query processing
4. Contributions
Heuristic-based SPARQL Planner (HSP)
  heuristic-based optimization of SPARQL join queries over flawed cost-based optimization
  Set of heuristics based on the structural and syntactic query characteristics to reduce the size of intermediate join results
  Bushy query plans that maximize the number of merge joins
Reduce the problem of query planning to the problem of finding the maximum weight independent set for a SPARQL query
  novel representation of a SPARQL query as a variable graph
Translation of the HSP bushy plans directly into MonetDB's Physical Algebra (MAL)
7. RDF in a nutshell
RDF is the W3C standard for representing information in the Web
RDF Data Model is based on node and edge labeled directed graphs
[Figure: a Subject node (U) connected by a Predicate edge (U) to an Object node (U ∪ L)]
U = set of URIs
L = set of Literals
(s, p, o) ∈ U × U × (U ∪ L) is called an RDF triple
A set of RDF triples is an RDF graph
8. SPARQL Queries in a Nutshell
U = set of URIs, L = set of Literals, V = set of Variables
A SPARQL triple pattern is an element of the set (V ∪ U) × (V ∪ U) × (V ∪ U ∪ L)
Intuitively, a triple pattern denotes the triples in an RDF graph that are of a specific form.
(?j, rdf:type, ?y): matches all triples with predicate rdf:type
SPARQL graph patterns are formulated using the join, union and optional operators between triple patterns
10. SPARQL Queries in a Nutshell
A SPARQL query is of the form
select ?u1, ?u2, ... where Q
with ?u1, ?u2, ... variables and Q a SPARQL graph pattern
We focus on SPARQL join queries Q = {tp0, ..., tpk} where tp0, ..., tpk are triple patterns. A join is defined by multiple occurrences of the same variable.
(?j, rdf:type, ?y) . (sp2b:Journal1/1940, ?p, ?j)
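A minimal sketch of how a join query of this shape could be evaluated naively, to make the "join = repeated variable" idea concrete. The helper names (`is_var`, `match`, `evaluate`) and the tiny `ex:` graph are hypothetical and only for illustration; this nested-loop evaluation is not the planner described in the talk.

```python
# Illustrative only: names and data are assumptions, not taken from the talk.

def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")

def match(pattern, triple):
    """Return the variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if bindings.get(p_term, t_term) != t_term:
                return None  # the same variable would need two different values
            bindings[p_term] = t_term
        elif p_term != t_term:
            return None  # a constant in the pattern does not match
    return bindings

def evaluate(patterns, graph):
    """Naive nested-loop evaluation of a join query {tp0, ..., tpk}."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for sol in solutions:
            # Substitute variables that earlier patterns have already bound.
            bound = tuple(sol.get(term, term) for term in pattern)
            for triple in graph:
                b = match(bound, triple)
                if b is not None:
                    next_solutions.append({**sol, **b})
        solutions = next_solutions
    return solutions

graph = [
    ("ex:a", "ex:cites", "ex:b"),
    ("ex:b", "rdf:type", "ex:Article"),
    ("ex:c", "rdf:type", "ex:Journal"),
]
query = [("?j", "rdf:type", "?y"), ("ex:a", "?p", "?j")]
solutions = evaluate(query, graph)
```

Here `?j` occurs in both triple patterns, so only bindings that agree on `?j` survive the second pattern.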
11. Storing RDF Triples
The RDF graph is stored in a triple table (ternary relation) that contains triples of the form (subject, property, object)
       subject (s)              predicate (p)     object (o)
t1:    sp2b:Journal1/1940       rdf:type          sp2b:Journal
t2:    sp2b:Inproceeding17      rdf:type          sp2b:Inproceedings
t3:    sp2b:Proceeding1/1954    dcterms:issued    "1954"
t4:    sp2b:Journal1/1952       dc:title          "Journal 1 (1952)"
t5:    sp2b:Journal1/1941       rdf:type          sp2b:Journal
t6:    sp2b:Article9            rdf:type          sp2b:Article
t7:    sp2b:Inproceeding40      dc:terms          "1950"
t8:    sp2b:Inproceeding40      rdf:type          sp2b:Inproceedings
t9:    sp2b:Journal1/1941       dc:title          "Journal 1 (1941)"
t10:   sp2b:Journal1/1942       rdf:type          sp2b:Journal
t11:   sp2b:Journal1/1940       dc:title          "Journal 1 (1940)"
t12:   sp2b:Inproceeding40      foaf:homepage     http://www.dielectrics.tld
t13:   sp2b:Journal1/1940       dcterms:issued    "1940"
12. Dictionary & Ordered Relations
To avoid processing long strings we map URIs and Literals to unique identifiers: a dictionary (binary relation) stores the mapping
Dictionary
Oid   Value                    Oid   Value                 Oid   Value
001   sp2b:Journal1/1940       009   sp2b:Journal          017   www.dielectrics.tld
002   sp2b:Inproceeding17      010   sp2b:Inproceedings    018   "1940"
003   sp2b:Proceeding1/1954    011   "1954"                019   rdf:type
004   sp2b:Journal1/1952       012   "Journal 1 (1952)"    020   dcterms:issued
To support merge joins for all possible join patterns we propose six ordered relations that store all ordering combinations on the subject (s), property (p) and object (o) components of a triple: spo, sop, ops, osp, pos, pso.
Triples are lexicographically sorted by the appropriate collation order
Triples table (spo)          Triples table (pso)
      s    p    o                  p    s    o
t1    001  019  009          t1    019  001  009
t13   001  020  018          t2    019  002  010
t2    002  019  010          t5    019  005  009
t3    003  020  011          t6    019  006  013
t5    005  019  009          t8    019  007  010
t6    006  019  013          t10   019  008  009
t8    007  019  010          t13   020  001  018
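The dictionary and the six ordered relations can be sketched in a few lines. This is only an illustration under simple assumptions: oids are handed out in order of first appearance (so they differ from the oids on the slide), and each ordered relation is a sorted Python list rather than a MonetDB table.

```python
from itertools import permutations

# A few triples from the example table (strings stand in for URIs and literals).
triples = [
    ("sp2b:Journal1/1940", "rdf:type", "sp2b:Journal"),
    ("sp2b:Inproceeding17", "rdf:type", "sp2b:Inproceedings"),
    ("sp2b:Journal1/1940", "dcterms:issued", '"1940"'),
]

# Dictionary: value -> integer oid. The numbering is arbitrary (an assumption).
dictionary = {}

def oid(value):
    """Return the oid of a value, assigning the next free one on first use."""
    return dictionary.setdefault(value, len(dictionary) + 1)

encoded = [tuple(oid(v) for v in t) for t in triples]

# Six ordered relations: one per permutation of the s, p, o columns,
# each sorted lexicographically on its own column order.
COLUMN = {"s": 0, "p": 1, "o": 2}
ordered = {}
for perm in permutations("spo"):
    name = "".join(perm)
    ordered[name] = sorted(tuple(t[COLUMN[c]] for c in perm) for t in encoded)
```

With this layout, `ordered["pso"]` holds all triples of one property contiguously and sorted on subject, which is what a merge join on the subject of a fixed-property pattern needs.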
13. Query Planning: Motivating Example
Problems to solve:
1. Multiple ordered relations to evaluate one triple pattern
2. Multiple join orderings and join algorithms per join variable
17. Query Planning: Motivating Example
1. Multiple ordered relations to evaluate one triple pattern
select ?y
where { (tp0) . (tp1) . (tp2) . (tp3) . (tp4) }
tp0 = (s0 p0 ?x), tp1 = (?y p1 ?x), tp2 = (?y p2 ?z), tp3 = (s1 p3 ?z), tp4 = (s2 p4 ?z)
tp0: (s0 p0 ?x)
1. pso: σ property=p0, subject=s0 (pso)
the selection on subject is done on a very large triple table; the results are ordered on the object
2. spo: σ property=p0, subject=s0 (spo)
the results are ordered on the object
3. pos: σ property=p0, subject=s0 (pos)
the results are not returned ordered on object
4. ...
18. Query Planning: Motivating Example
2. Multiple join orderings and join algorithms per join variable
select ?y
where { (tp0) . (tp1) . (tp2) . (tp3) . (tp4) }
tp0 = (s0 p0 ?x), tp1 = (?y p1 ?x), tp2 = (?y p2 ?z), tp3 = (s1 p3 ?z), tp4 = (s2 p4 ?z)
the merge joins are performed on variables ?y and ?z
[Figure: bushy query plan; mj = merge join, hj = hash join]
π ?y
  hj ?z
    hj ?x
      σ subject=s0, property=p0 (spo)  [tp0]
      mj ?y
        σ property=p1 (pso)  [tp1]
        σ property=p2 (pso)  [tp2]
    mj ?z
      σ subject=s1, property=p3 (spo)  [tp3]
      σ subject=s2, property=p4 (spo)  [tp4]
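The reason the plan cares about ordered relations is that a merge join only works when both inputs arrive sorted on the join variable. A minimal sketch of such a join over lists of `(key, payload)` pairs follows; the function name and data layout are assumptions for illustration, not MonetDB's implementation.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are both sorted on key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with every right row of the run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left, right)
```

Both inputs are scanned once, front to back, which is why the planner prefers access paths that already deliver the join variable in sorted order.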
20. Heuristics
We use a set of heuristics to determine
  the ordered relation on which to evaluate a triple pattern
  the join ordering
22. Heuristics
H1: Triple Pattern Order
Triple patterns are ordered from the most to the least selective, i.e., from the one likely to produce the fewest intermediate results to the one likely to produce the most
(s, p, o) ≺ (s, ?, o) ≺ (?, p, o) ≺ (s, p, ?) ≺ (?, ?, o) ≺ (s, ?, ?) ≺ (?, p, ?) ≺ (?, ?, ?) *
* except predicate rdf:type
H2: Join Patterns
Join patterns are ordered from the most to the least selective one
p ⋈ o ≺ s ⋈ p ≺ s ⋈ o ≺ o ⋈ o ≺ s ⋈ s ≺ p ⋈ p
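One way to read H1 and H2 is as fixed rank tables that a planner can look up without any statistics. The sketch below encodes the two orderings exactly as listed on the slide; the function names and the string encoding of bound positions are assumptions, and the rdf:type exception to H1 is deliberately not modelled.

```python
def is_var(term):
    return term.startswith("?")

# H1: which positions of a triple pattern are bound (constants), most selective first.
# "so" stands for (s, ?, o), "" for (?, ?, ?), and so on.
H1_ORDER = ["spo", "so", "po", "sp", "o", "s", "p", ""]

def h1_rank(pattern):
    """Lower rank = more selective under H1 (rdf:type exception not modelled)."""
    bound = "".join(c for c, term in zip("spo", pattern) if not is_var(term))
    return H1_ORDER.index(bound)

# H2: positions of the shared variable in the two joined patterns, most selective first.
H2_ORDER = [{"p", "o"}, {"s", "p"}, {"s", "o"}, {"o"}, {"s"}, {"p"}]

def h2_rank(pos_a, pos_b):
    """Lower rank = more selective join pattern under H2."""
    return H2_ORDER.index({pos_a, pos_b})

patterns = [("?x", "p1", "?y"), ("s0", "p0", "?x"), ("?y", "?p", "o1")]
ranked = sorted(patterns, key=h1_rank)
```

For example, `h2_rank("s", "s")` ranks a subject-subject join, and `sorted(..., key=h1_rank)` places the pattern with both subject and property bound ahead of the others.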
23. Heuristics
H3: Number of Literals/URIs
A set of triple patterns with more literals is more selective than a set of triple patterns with more URIs
H4: Triple Patterns with Literals at the object position
A triple pattern with a literal in the object position is more selective than one with a URI in the object position
Heuristics H3 and H4 are special cases of H1 but can be used separately
H5: Triple Patterns with least number of Projections
Related to Tuple Reconstruction in Column-Store DBMS
26. Query Planning
Objective: produce query plans with the maximum number of merge joins
Solution: Reduce Query Planning to the Maximum Weight Independent Set Problem
An independent set is a set of nodes no two of which share an edge
Finding independent sets is a problem complementary to finding cliques in a graph
[Figure: two example graphs. Left graph, nodes n1, n2, n3: independent sets = { {n1}, {n2}, {n3} }. Right graph, nodes n1 to n5: independent sets = { {n1, n3}, {n1, n5}, {n4, n3}, {n4, n5}, {n2, n5} }]
Equivalent to finding the largest groups of variables that can be merge-joined
30. Variable Graph
Model a SPARQL join query as a Variable Graph where
  nodes in the graph are the query variables
  an edge exists between two nodes if they belong to the same triple pattern
  the weight of each node is the number of joins it participates in
select ?y
where { (tp0) . (tp1) . (tp2) . (tp3) . (tp4) . (tp5) }
tp0 = (?p ?ss ?c1), tp1 = (?p ?dd ?c2), tp2 = (?c1 p1 o1)
tp3 = (?c1 p2 ?x), tp4 = (?c2 p3 o2), tp5 = (?c2 p4 ?y)
[Figure: variable graph with nodes ?c2 (weight 2), ?p (weight 1), ?c1 (weight 2)]
MWIS: {{?c2, ?c1}}
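The example can be reproduced with a small brute-force sketch. Two interpretations are assumed here because they match the figure: only variables that occur in at least two triple patterns become nodes, and a node's weight is its number of occurrences minus one. The function names are hypothetical, and exhaustive search is used only because a query has a handful of join variables; it is not presented as the algorithm used in HSP.

```python
from itertools import combinations

def is_var(term):
    return term.startswith("?")

def variable_graph(patterns):
    """Return (weights, edges) over the join variables of a query."""
    count = {}
    for tp in patterns:
        for v in {t for t in tp if is_var(t)}:
            count[v] = count.get(v, 0) + 1
    # Assumption: weight = number of joins = occurrences - 1; non-join variables are dropped.
    weights = {v: n - 1 for v, n in count.items() if n > 1}
    edges = set()
    for tp in patterns:
        shared = sorted({t for t in tp if t in weights})
        edges.update(combinations(shared, 2))  # variables in the same pattern are adjacent
    return weights, edges

def max_weight_independent_sets(weights, edges):
    """Enumerate all node subsets and keep the independent ones of maximum total weight."""
    best, best_weight = [], -1
    nodes = sorted(weights)
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            if any(a in subset and b in subset for a, b in edges):
                continue  # two chosen nodes share an edge
            w = sum(weights[v] for v in subset)
            if w > best_weight:
                best, best_weight = [set(subset)], w
            elif w == best_weight:
                best.append(set(subset))
    return best

query = [
    ("?p", "?ss", "?c1"), ("?p", "?dd", "?c2"), ("?c1", "p1", "o1"),
    ("?c1", "p2", "?x"), ("?c2", "p3", "o2"), ("?c2", "p4", "?y"),
]
weights, edges = variable_graph(query)
mwis = max_weight_independent_sets(weights, edges)
```

On this query the sketch yields weights 2, 1, 2 for `?c1`, `?p`, `?c2` and a single maximum weight independent set containing `?c1` and `?c2`, in line with the slide.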
31. Query Planning: Assigning Access Paths to Triple Patterns
Require: SPARQL join query Q
Ensure: Mapping M that assigns triple patterns to ordered relations
while (there exists a non-empty set T of triple patterns that do not have a merge-joined variable) do
    Construct the variable graph from set T
    I ← the set of maximum weight independent sets
    repeat
        apply Heuristics 3, 4, 2, 5 in that order to I to eliminate independent sets
    until either |I| = 1 or all heuristics have been applied
    Choose one set randomly, and remove the triple patterns that do not have a variable in the independent set
end while
for all merge-joined variables do
    assign ordered relations to those triple patterns such that merge-joined variables are returned sorted
end for
assign the ordered relation for remaining triple patterns (hash joins)
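The last step, choosing an ordered relation so that the merge-joined variable comes back sorted, can be illustrated with a small sketch. The rule used here is an assumption made for illustration: bound columns come first (in s, p, o order), then the merge-join variable, then any remaining variables. The talk does not spell out how ties between bound columns are broken, and the function name is hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def ordered_relation(pattern, merge_var):
    """Pick a column order: bound columns, then the merge-join variable, then the rest."""
    cols = list(zip("spo", pattern))
    bound = [c for c, t in cols if not is_var(t)]
    merge = [c for c, t in cols if t == merge_var]
    rest = [c for c, t in cols if is_var(t) and t != merge_var]
    return "".join(bound + merge + rest)

# Triple patterns of the motivating example with the variable each is merge-joined on.
assignments = {
    "tp1": ordered_relation(("?y", "p1", "?x"), "?y"),
    "tp2": ordered_relation(("?y", "p2", "?z"), "?y"),
    "tp3": ordered_relation(("s1", "p3", "?z"), "?z"),
    "tp4": ordered_relation(("s2", "p4", "?z"), "?z"),
}
```

After the selections on the bound columns, the first unbound column of the chosen relation is the merge variable, so its values are produced in sorted order.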
34. Evaluation
Experiments
  Compared the quality of the plans produced by our heuristic-based SPARQL Planner (HSP) with those produced by the cost-based dynamic programming planner (CDP) of RDF-3X (Neumann et al.)
  Compared the execution time of HSP plans, translated into MonetDB's physical algebra (MAL) and into SQL queries and evaluated over MonetDB, with the execution time of CDP plans evaluated over RDF-3X
  Measured the planning time for CDP and HSP
Datasets
  Synthetic (SP2Bench) and Real (YAGO) Datasets
35. Characteristics of Query Workload
Queries with different numbers of triple patterns
Queries with different structural characteristics (kinds of joins)
  star and chain-shaped queries with join variables in different positions (Heuristics 1, 2)
Triple patterns with different syntactic characteristics
  number of variables and constants found in different positions (Heuristics 1, 3, 4)
37. Query Plans
Comparing Plan Quality
1. HSP produces plans with the same number of merge and hash joins as CDP
2. Triple patterns are evaluated on the same ordered relation (HSP), index (CDP)
3. Plans differ on join ordering and the types of joins applied to the join variables
Comparing Plan Cost
4. The cost of plans using the cost function of RDF-3X/CDP does not differ
5. The planning time for CDP and HSP does not differ and is on the order of 0.1 ms.
41.
       SP1   SP2a   SP2b   SP3a   SP3b   SP3c   SP4a          SP4b
HSP    32    873    830    487    100    105    354+953,381   264+953,381
CDP    32    31     54     487    100    105    354+953,381   299+858,461

       Y1           Y2            Y3            Y4
HSP    12+300,054   1+303,579     329+302,577   327+763,749
CDP    7+300,023    1.5+301,614   328+302,577   326+763,603
1. SP4a contains small chain joins, Y3 contains small star joins with syntactically dissimilar triple patterns
2. SP2a, SP2b: same merge join variables, large star queries with syntactically similar triple patterns, intermediate results of different sizes
3. SP4b, Y1, Y2, Y4: different merge join variables, intermediate results of similar size
HSP heuristics proved effective in choosing a near-optimal plan for queries whose triple patterns exhibit syntactic dissimilarities, which trigger the application of all HSP heuristics
42. Query Plan Execution Times
Queries with different plans
  RDF-3X/CDP outperforms MonetDB/HSP when HSP chooses the join ordering randomly (SP2a, SP2b)
  MonetDB/HSP performs better than RDF-3X/CDP when intermediate results are of the same order of magnitude (SP4b, Y1, Y2, Y4)
Queries with the same plan
  MonetDB/HSP performs better than RDF-3X/CDP (SP3a, SP3b, SP3c, SP4a, SP5, SP6, Y3)
MonetDB/HSP always performs better than MonetDB/SQL
               SP1     SP2a       SP2b       SP3a    SP3b    SP3c
MonetDB/HSP    19.52   3,267.01   1,035.12   80.92   8.74    12.55
RDF-3X/CDP     0.25    355.50     1,000.75   85.14   11.95   13.97
MonetDB/SQL    11.92   3,561      1,103      82.91   9.61    14.81

               SP4a       SP4b       SP5    SP6
MonetDB/HSP    3,602.09   1,766.29   0.06   0.43
RDF-3X/CDP     3,634.60   2,781.75   0.10   22.85
MonetDB/SQL    XXX        1,909.13   0.09   0.48

               Y1      Y2     Y3       Y4
MonetDB/HSP    6.04    8.65   25.69    2.32
RDF-3X/CDP     15.75   9.95   81.20    90.45
MonetDB/SQL    7.69    9.07   538.65   1,113
43. Conclusions
RDF-specific heuristics for determining triple pattern selectivities
Reduced Query Planning to the Maximum Weight Independent Set problem
Heuristics-based SPARQL planner (HSP) capable of choosing near-optimal execution plans without the use of statistics
Implemented HSP plans on top of MonetDB
Experimentally evaluated HSP using synthetically generated and real RDF datasets
44. Future Work
Extend HSP to support full SPARQL
Integrate proposed heuristics with a traditional cost-based optimizer towards a hybrid solution
Apply the approach to a distributed environment
Cope with different RDF storage schemas (vertical partitioning, hybrid)
Experiment with a large set of synthetically generated queries
47.
Comparing Plan Cost
CDP (RDF-3X) cost function considers the estimation of intermediate results
cost_mergejoin(lc, rc) = (lc + rc) / 100,000
cost_hashjoin(lc, rc) = 300,000 + lc / 100 + rc / 10
where lc and rc are the cardinalities of the two join input relations, with lc being the smaller input
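As a quick worked example, the two cost functions can be written down directly, assuming they read as cost_mergejoin(lc, rc) = (lc + rc) / 100,000 and cost_hashjoin(lc, rc) = 300,000 + lc / 100 + rc / 10. The input cardinalities below are made up for illustration.

```python
def cost_mergejoin(lc, rc):
    """Merge join cost: linear in the combined input size."""
    return (lc + rc) / 100_000

def cost_hashjoin(lc, rc):
    """Hash join cost: fixed start-up cost plus per-tuple terms; lc is the smaller input."""
    return 300_000 + lc / 100 + rc / 10

# Hypothetical cardinalities, only to show the orders of magnitude.
merge_cost = cost_mergejoin(50_000, 150_000)
hash_cost = cost_hashjoin(1_000, 10_000)
```

The constant 300,000 in the hash join formula dominates for inputs of moderate size, which is consistent with plans being steered towards merge joins wherever sorted inputs are available.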
48. Constructing MAL Plans
translate each operator (selection, projection, join) to the appropriate MAL statements
[Figure: MAL plan for tp2, σ property=p1, object=o1 (ops). Operators shown: bind(ops, object), uselect(o1), bind(ops, property), semijoin, uselect(p1), bind(ops, subject), leftjoin; output ?c1]