Heuristic-
    Based
Optimization
for SPARQL
   Queries

  Lefteris
Sidirourgos

               Heuristic-Based Optimization for SPARQL
                                Queries

                 P. Tsialiamanis1,2 , Irini Fundulaki1 , V. Christophides1,2
                               L. Sidirourgos3 , P. Boncz3

                              Institute of Computer Science - FORTH
                                     University of Crete, Greece
                                       CWI, The Netherlands


                                        EDBT 2012
                                      Berlin, Germany
               Web of Data




                  Knowledge Bases (Wikipedia, US Census Bureau, CIA
                  World Factbook) and Social Networks
                  Government Sites (data.gov.uk, www.data.gov)
                  Entertainment, Music, News Sources (BBC), ...
                  Scientific Data Sources (Biology, Physics, Astronomy, Geography, ...)




                Main Challenge: Efficient Management and Usage
                of large (huge) volumes of Semantic Web Data
               Querying Linked Data: Where does existing
               Database Technology fit?



               Existing Relational Stores are problematic for querying the
               Web of Data

                   the absence of schema and constraints for RDF data leads to query
                   plans that are hard to optimize → a query plan contains a large
                   number of self joins over a large triple table
                   relational optimizers compute the cost of scans but not the join
                   hit ratio(s): the subject-subject and subject-object join hit ratios
                   get the same estimate, since correlations between triple
                   components are not considered
                   correlated cost estimation over many selections and self joins is
                   the rule and not the exception for RDF query processing
               Contributions
                      Heuristic-based SPARQL Planner (HSP)
                  heuristic-based optimization of SPARQL
                  join queries instead of flawed cost-based
                  optimization

                  Set of heuristics based on the structural and syntactic query
                  characteristics to reduce the size of intermediate join results
                  Bushy query plans that maximize the number of merge joins
                  Reduce the problem of query planning to the problem of finding
                  the maximum weight independent set for a SPARQL query

                      novel representation of a SPARQL query as a variable
                      graph
                  Translation of the HSP bushy plans directly into MonetDB's
                  Physical Algebra (MAL)
               RDF in a nutshell

                      RDF is the W3C standard for representing information in the
                      Web
                      RDF Data Model is based on node- and edge-labeled directed
                      graphs

                      [Figure: an RDF triple drawn as a directed edge labeled
                      Predicate (from U) going from a Subject node (from U) to
                      an Object node (from U or L)]

                      U = set of URIs
                      L = set of Literals

                          (s, p, o) ∈ U × U × (U ∪ L) is called an RDF triple
                                A set of RDF triples is an RDF graph
               SPARQL Queries in a Nutshell

                    U= set of URIs, L= set of Literals, V= set of Variables

                  A SPARQL triple pattern is an element of the set

                               (V ∪ U) × (V ∪ U) × (V ∪ U ∪ L)

                  Intuitively, a triple pattern denotes the triples in an RDF graph
                  that are of a specific form.
                       (?j, rdf:type, ?y ): matches all triples with predicate rdf:type

                  SPARQL graph patterns are formulated using the join, union
                  and optional operators between triple patterns
               SPARQL Queries in a Nutshell

                  SPARQL query is of the form
                                    select ?u1 , ?u2 , . . . where Q
                  with ?u1 , ?u2 , . . . variables and Q a SPARQL graph pattern

                  We focus on SPARQL join queries Q = {tp0, ..., tpk} where
                  tp0, tp1, ... are triple patterns. A join is defined by multiple
                  occurrences of the same variable.


                       ( ?j , rdf:type, ?y ) . (sp2b:Journal1/1940, ?p, ?j )
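
A minimal Python sketch of how the join variables of such a query could be detected. It assumes triple patterns are 3-tuples of strings and that variables start with `?`; the helper name `join_variables` is hypothetical and not taken from the talk, and a variable is counted once per triple pattern.

```python
from collections import Counter


def join_variables(patterns):
    """Return the variables that occur in at least two triple patterns."""
    counts = Counter()
    for tp in patterns:
        # Count each variable once per triple pattern.
        for var in {term for term in tp if term.startswith("?")}:
            counts[var] += 1
    return {var for var, n in counts.items() if n >= 2}


# The two-pattern join query from the slide.
query = [
    ("?j", "rdf:type", "?y"),
    ("sp2b:Journal1/1940", "?p", "?j"),
]
print(join_variables(query))
```

Here `?j` is the only variable shared by both patterns, so it is the only join variable.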
               Storing RDF Triples



                    The RDF graph is stored in a triple table (ternary relation) that
                    contains triples of the form (subject, property, object)

                         subject(s)              predicate(p)     object(o)
                 t1 :    sp2b:Journal1/1940      rdf:type         sp2b:Journal
                 t2 :    sp2b:Inproceeding17     rdf:type         sp2b:Inproceedings
                 t3 :    sp2b:Proceeding1/1954   dcterms:issued   "1954"
                 t4 :    sp2b:Journal1/1952      dc:title         "Journal 1 (1952)"
                 t5 :    sp2b:Journal1/1941      rdf:type         sp2b:Journal
                 t6 :    sp2b:Article9           rdf:type         sp2b:Article
                 t7 :    sp2b:Inproceeding40     dc:terms         "1950"
                 t8 :    sp2b:Inproceeding40     rdf:type         sp2b:Inproceedings
                 t9 :    sp2b:Journal1/1941      dc:title         "Journal 1 (1941)"
                 t10 :   sp2b:Journal1/1942      rdf:type         sp2b:Journal
                 t11 :   sp2b:Journal1/1940      dc:title         "Journal 1 (1940)"
                 t12 :   sp2b:Inproceeding40     foaf:homepage    http://www.dielectrics.tld
                 t13 :   sp2b:Journal1/1940      dcterms:issued   "1940"
               Dictionary & Ordered Relations

                  To avoid processing long strings we map URIs and Literals to
                  unique identifiers: a dictionary (binary relation) stores the
                  mapping

                                                Dictionary
                  Oid   Value                   Oid   Value                 Oid   Value
                  001   sp2b:Journal1/1940      009   sp2b:Journal          017   www.dielectrics.tld
                  002   sp2b:Inproceeding17     010   sp2b:Inproceedings    018   "1940"
                  003   sp2b:Proceeding1/1954   011   "1954"                019   rdf:type
                  004   sp2b:Journal1/1952      012   "Journal 1 (1952)"    020   dcterms:issued


                  To support merge joins for all possible join patterns we propose
                  six ordered relations that store all ordering combinations of the
                  subject (s), property (p) and object (o) components of a triple:
                  spo, sop, ops, osp, pos, pso.
                  In each relation, triples are sorted lexicographically according
                  to that relation's component order
                         Triples table (spo)                 Triples table (pso)
                               s     p     o                       p     s     o
                         t1    001   019   009               t1    019   001   009
                         t13   001   020   018               t2    019   002   010
                         t2    002   019   010               t5    019   005   009
                         t3    003   020   011               t6    019   006   013
                         t5    005   019   009               t8    019   007   010
                         t6    006   019   013               t10   019   008   009
                         t8    007   019   010               t13   020   001   018
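
The dictionary encoding and the six ordered relations can be illustrated with a short Python sketch. It is only an in-memory illustration under stated assumptions: OIDs are handed out in order of first appearance (so they differ from the OIDs shown on the slide), and the function names `build_dictionary` and `ordered_relations` are hypothetical.

```python
from itertools import permutations


def build_dictionary(triples):
    """Assign consecutive integer OIDs to terms in order of first appearance."""
    oid = {}
    for triple in triples:
        for term in triple:
            oid.setdefault(term, len(oid) + 1)
    return oid


def ordered_relations(triples, oid):
    """Build the six relations spo, sop, pso, pos, osp, ops as sorted lists."""
    encoded = [tuple(oid[term] for term in triple) for triple in triples]
    index = {"s": 0, "p": 1, "o": 2}
    relations = {}
    for perm in permutations("spo"):
        # Reorder the columns of every triple, then sort lexicographically.
        rows = [tuple(t[index[c]] for c in perm) for t in encoded]
        relations["".join(perm)] = sorted(rows)
    return relations


triples = [
    ("sp2b:Journal1/1940", "rdf:type", "sp2b:Journal"),
    ("sp2b:Inproceeding17", "rdf:type", "sp2b:Inproceedings"),
    ("sp2b:Journal1/1940", "dcterms:issued", '"1940"'),
]
oid = build_dictionary(triples)
rels = ordered_relations(triples, oid)
print(rels["spo"])
print(rels["pso"])
```

Because every relation is sorted on its own column order, a selection on a leading column returns rows already sorted on the next column, which is what makes merge joins possible without an extra sort.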
               Query Planning: Motivating Example

               Problems to solve:

                1. Multiple ordered relations to evaluate one triple pattern
                2. Multiple join orderings and join algorithms per join variable
               Query Planning: Motivating Example

                1. Multiple ordered relations to evaluate one triple pattern

               select ?y
               where { (s0 p0 ?x) . (?y p1 ?x) . (?y p2 ?z) . (s1 p3 ?z) . (s2 p4 ?z) }
                          tp0          tp1          tp2          tp3          tp4

                tp0: (s0 p0 ?x)
                1. pso: σ_{property=p0, subject=s0}(pso)
                           the selection on subject is done on a very large triple table; the
                           results are ordered on the object
                2. spo: σ_{property=p0, subject=s0}(spo)
                           the results are ordered on the object
                3. pos: σ_{property=p0, subject=s0}(pos)
                           the results are not returned ordered on object
                4. ...
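
As a rough illustration of the choice above, the following Python sketch lists the ordered relations in which all constants of a triple pattern form a prefix of the sort order, together with the component on which the selected rows come back sorted. The function name `prefix_orderings` and the tuple encoding of patterns are assumptions made for this example, not part of the talk.

```python
from itertools import permutations


def prefix_orderings(pattern):
    """Map each ordering whose leading columns are exactly the constants of
    `pattern` to the component on which the selected rows are sorted."""
    components = dict(zip("spo", pattern))
    bound = {c for c, term in components.items() if not term.startswith("?")}
    k = len(bound)
    result = {}
    for perm in permutations("spo"):
        if set(perm[:k]) == bound:
            # Rows matching the constant prefix are sorted on the next column.
            result["".join(perm)] = perm[k] if k < 3 else None
    return result


print(prefix_orderings(("s0", "p0", "?x")))   # tp0
print(prefix_orderings(("?y", "p1", "?x")))   # tp1
```

For tp0 this yields spo and pso, both returning results ordered on the object, in line with options 1 and 2 above.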
               Query Planning: Motivating Example

                 2. Multiple join orderings and join algorithms per join variable

               select ?y
               where { (s0 p0 ?x) . (?y p1 ?x) . (?y p2 ?z) . (s1 p3 ?z) . (s2 p4 ?z) }
                          tp0          tp1          tp2          tp3          tp4
                  the merge joins are performed on variables ?y and ?z

                  Plan (mj = merge join, hj = hash join):
                  π_?y( hj_?z( hj_?x( [tp0], mj_?y([tp1], [tp2]) ), mj_?z([tp3], [tp4]) ) )

                  [tp0]  σ_{subject=s0, property=p0}(spo)
                  [tp1]  σ_{property=p1}(pso)
                  [tp2]  σ_{property=p2}(pso)
                  [tp3]  σ_{subject=s1, property=p3}(spo)
                  [tp4]  σ_{subject=s2, property=p4}(spo)
               Query Planning: Motivating Example

                 2. Multiple join orderings and join algorithms per join variable

                  merge joins are performed on variables ?z and ?x
               select ?y
               where { (s0 p0 ?x) . (?y p1 ?x) . (?y p2 ?z) . (s1 p3 ?z) . (s2 p4 ?z) }
                          tp0          tp1          tp2          tp3          tp4

                  Plan (mj = merge join, hj = hash join):
                  π_?y( hj_?y( mj_?x([tp0], [tp1]), mj_?z([tp2], mj_?z([tp3], [tp4])) ) )

                  [tp0]  σ_{subject=s0, property=p0}(spo)
                  [tp1]  σ_{property=p1}(pos)
                  [tp2]  σ_{property=p2}(pos)
                  [tp3]  σ_{subject=s1, property=p3}(spo)
                  [tp4]  σ_{subject=s2, property=p4}(spo)
               Heuristics

                 We use a set of heuristics to determine
                    the ordered relation on which to evaluate a
                    triple pattern
                    the join ordering
               Heuristics

                H1: Triple Pattern Order
               Triple patterns are ordered from the most to the least selective, i.e.,
               from the one likely to produce the fewest intermediate results to the
               one likely to produce the most

                    (s, p, o) ≺ (s, ?, o) ≺ (?, p, o) ≺ (s, p, ?)
                       ≺ (?, ?, o) ≺ (s, ?, ?) ≺ (?, p, ?) ≺ (?, ?, ?)∗
                                     ∗ except predicate rdf:type

                H2: Join Patterns
               Join patterns are ordered from the most to the least selective one

                    p ⋈ o ≺ s ⋈ p ≺ s ⋈ o ≺ o ⋈ o ≺ s ⋈ s ≺ p ⋈ p
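
The H1 ordering can be sketched in Python as a lookup on which positions of a triple pattern hold constants. This is only an illustration: the names `H1_ORDER`, `h1_rank` and `order_by_h1` are hypothetical, variables are assumed to start with `?`, and the rdf:type exception noted above is deliberately not handled.

```python
# Shapes listed from most to least selective; True marks a constant
# in the (subject, predicate, object) position.
H1_ORDER = [
    (True, True, True),     # (s, p, o)
    (True, False, True),    # (s, ?, o)
    (False, True, True),    # (?, p, o)
    (True, True, False),    # (s, p, ?)
    (False, False, True),   # (?, ?, o)
    (True, False, False),   # (s, ?, ?)
    (False, True, False),   # (?, p, ?)
    (False, False, False),  # (?, ?, ?)
]


def h1_rank(pattern):
    """Position of the pattern's shape in the H1 ordering (0 = most selective)."""
    shape = tuple(not term.startswith("?") for term in pattern)
    return H1_ORDER.index(shape)


def order_by_h1(patterns):
    """Sort triple patterns from the most to the least selective shape."""
    return sorted(patterns, key=h1_rank)


patterns = [("?y", "p1", "?x"), ("s0", "p0", "?x"), ("?c", "p", "o1")]
print(order_by_h1(patterns))
```

Python's sort is stable, so patterns with the same shape keep their original relative order; breaking such ties is left to the other heuristics.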
               Heuristics

                H3: Number of Literals/URIs

               A set of triple patterns with more literals is more selective than a set
               of triple patterns with more URIs
                H4: Triple Patterns with Literals at the object position
               A triple pattern with a literal in the object position is more selective
               than one with a URI in the object position

                 Heuristics H3 and H4 are special cases of H1 but can be used
                 separately

                H5: Triple Patterns with the least number of Projections
               Related to Tuple Reconstruction in Column-Store DBMS
               Query Planning

               Objective: produce query plans with the maximum
               number of merge joins
               Solution: Reduce Query Planning to the Maximum
               Weight Independent Set Problem
                  An independent set is a set of nodes no two of which share an
                  edge
                  Finding independent sets is complementary to finding cliques:
                  an independent set of a graph is a clique of its complement graph

                  [Figure: two example graphs, one on nodes n1, n2, n3 and one on
                  nodes n1, n2, n3, n4, n5]
                  left graph:  independent sets = { {n1}, {n2}, {n3} }
                  right graph: independent sets = { {n1, n3}, {n1, n5}, {n4, n3},
                                                    {n4, n5}, {n2, n5} }

                  Equivalent to finding the largest groups of variables
                  that can be merge-joined
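
For small graphs the maximum weight independent sets can be enumerated by brute force, as in the Python sketch below. It is not the planner's actual procedure: the function name is hypothetical, the approach is exponential in the number of nodes, and the example graph (a triangle on n1, n2, n3 with unit weights) is an assumption about the left-hand figure, whose edges are not recoverable from the text.

```python
from itertools import combinations


def max_weight_independent_sets(weights, edges):
    """Return (best_weight, list of all independent sets with that weight)."""
    nodes = sorted(weights)
    edge_set = {frozenset(edge) for edge in edges}
    best_weight, best_sets = 0, []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Skip subsets in which two nodes share an edge.
            if any(frozenset(pair) in edge_set
                   for pair in combinations(subset, 2)):
                continue
            weight = sum(weights[n] for n in subset)
            if weight > best_weight:
                best_weight, best_sets = weight, [set(subset)]
            elif weight == best_weight:
                best_sets.append(set(subset))
    return best_weight, best_sets


# Assumed example: a triangle with unit weights.
weights = {"n1": 1, "n2": 1, "n3": 1}
edges = [("n1", "n2"), ("n2", "n3"), ("n1", "n3")]
best, sets = max_weight_independent_sets(weights, edges)
print(best, sets)
```

The problem is NP-hard in general, but the graphs involved here have one node per join variable, so exhaustive search is a reasonable choice for a sketch.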
               Variable Graph

                  Model a SPARQL join query as a Variable Graph where
                      nodes in the graph are the query variables
                      an edge exists between two nodes if they belong to the same
                      triple pattern
                      the weight of each node is the number of joins it participates in

                      select ?y
                      where { (?p ?ss ?c1) . (?p ?dd ?c2) . (?c1 p1 o1) .
                                  tp0            tp1            tp2
                              (?c1 p2 ?x) . (?c2 p3 o2) . (?c2 p4 ?y) }
                                  tp3           tp4           tp5

                      Variable graph (node weights in parentheses):
                                  ?c2 (2) ── ?p (1) ── ?c1 (2)

                                    MWIS: {{?c2, ?c1}}
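
A small Python sketch of how this variable graph could be derived from the query. It assumes that only variables occurring in at least two triple patterns become nodes and that a node's weight is its number of occurrences minus one; both readings are inferred from the weights (2), (1), (2) shown above, and the function name `variable_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def variable_graph(patterns):
    """Return (weights, edges) of the variable graph of a join query."""
    occurrences = Counter()
    for tp in patterns:
        for var in {term for term in tp if term.startswith("?")}:
            occurrences[var] += 1
    # Assumed: a variable in n >= 2 patterns takes part in n - 1 joins.
    weights = {v: n - 1 for v, n in occurrences.items() if n >= 2}
    edges = set()
    for tp in patterns:
        shared = sorted({term for term in tp if term in weights})
        for a, b in combinations(shared, 2):
            edges.add((a, b))
    return weights, edges


query = [
    ("?p", "?ss", "?c1"), ("?p", "?dd", "?c2"), ("?c1", "p1", "o1"),
    ("?c1", "p2", "?x"), ("?c2", "p3", "o2"), ("?c2", "p4", "?y"),
]
weights, edges = variable_graph(query)
print(weights)
print(edges)
```

With these weights, {?c1, ?c2} is independent (the two variables never share a triple pattern) and has total weight 4, which matches the MWIS on the slide.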
               Query Planning: Assigning Access Paths to
               Triple Patterns

               Require: SPARQL join query Q
               Ensure: Mapping M that assigns triple patterns to ordered relations
                  while there exists a non-empty set T of triple patterns that do not
                        have a merge-joined variable do
                     Construct the variable graph from set T
                     I ← the set of maximum weight independent sets
                     repeat
                        apply Heuristics 3, 4, 2, 5 in that order to I
                        to eliminate independent sets
                     until either |I| = 1 or all heuristics have been applied
                     Choose one set randomly, and remove the triple patterns
                     that do not have a variable in the independent set
                  end while
                  for all merge-joined variables do
                     assign ordered relations to those triple patterns such that
                     merge-joined variables are returned sorted
                  end for
                  assign the ordered relation for remaining triple patterns (hash joins)
               Assigning Access Paths to Triple Patterns:
               Example

                    Query Q
                      (?p ?ss ?c1) . (?p ?dd ?c2) . (?c1 p1 o1) .
                          tp0            tp1            tp2
                      (?c1 p2 ?x) . (?c2 p3 o2) . (?c2 p4 ?y)
                          tp3           tp4           tp5

                    Variable Graph (node weights in parentheses)

                                  ?c2 (2)     ?p (1)     ?c1 (2)

                    Single maximum weight independent set: MWIS = {{?c2, ?c1}}
                    Mapping M

                     M(tp1, (osp, ?c2))   M(tp4, (ops, ?c2))   M(tp5, (pso, ?c2))
                     M(tp0, (osp, ?c1))   M(tp2, (ops, ?c1))   M(tp3, (pso, ?c1))
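
One way to reproduce this mapping is sketched below in Python: constants are placed first, then the merge-joined variable, then any remaining variables. The order used among several constants (object, subject, predicate) and among leftover variables (subject, predicate, object) is an assumption chosen so that the sketch matches the mapping M above; the talk does not state these tie-breaking rules, and the name `access_path` is hypothetical.

```python
CONSTANT_ORDER = "osp"  # assumed tie-break among constant components
VARIABLE_ORDER = "spo"  # assumed tie-break among remaining variables


def access_path(pattern, merge_var):
    """Pick an ordered relation that returns `merge_var` sorted for `pattern`."""
    components = dict(zip("spo", pattern))
    constants = [c for c in CONSTANT_ORDER
                 if not components[c].startswith("?")]
    merge = [c for c in "spo" if components[c] == merge_var]
    if len(merge) != 1:
        raise ValueError("merge variable must occur exactly once in the pattern")
    rest = [c for c in VARIABLE_ORDER
            if c not in constants and c not in merge]
    # Constants first, so the selection is a prefix range and the matching
    # rows come back sorted on the merge variable.
    return "".join(constants + merge + rest)


query = {
    "tp0": (("?p", "?ss", "?c1"), "?c1"),
    "tp1": (("?p", "?dd", "?c2"), "?c2"),
    "tp2": (("?c1", "p1", "o1"), "?c1"),
    "tp3": (("?c1", "p2", "?x"), "?c1"),
    "tp4": (("?c2", "p3", "o2"), "?c2"),
    "tp5": (("?c2", "p4", "?y"), "?c2"),
}
mapping = {name: access_path(tp, var) for name, (tp, var) in query.items()}
print(mapping)
```

Under these assumptions the sketch assigns osp to tp0 and tp1, ops to tp2 and tp4, and pso to tp3 and tp5.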
               Constructing (Bushy) Logical Plans

               select ?y
               where { (?p ?ss ?c1) . (?p ?dd ?c2) . (?c1 p1 o1) .
                           tp0            tp1            tp2
                       (?c1 p2 ?x) . (?c2 p3 o2) . (?c2 p4 ?y) }
                           tp3           tp4           tp5

                   M(tp0, (osp, ?c1))   M(tp2, (ops, ?c1))   M(tp3, (pso, ?c1))
                   M(tp1, (osp, ?c2))   M(tp4, (ops, ?c2))   M(tp5, (pso, ?c2))

                   Plan (mj = merge join, hj = hash join):
                   π_?p( hj_?p( mj_?c1( mj_?c1([tp3], [tp2]), [tp0] ),
                                mj_?c2( mj_?c2([tp5], [tp4]), [tp1] ) ) )

                   [tp0]  scan(osp)
                   [tp1]  scan(osp)
                   [tp2]  σ_{property=p1, object=o1}(ops)
                   [tp3]  σ_{property=p2}(pso)
                   [tp4]  σ_{property=p3, object=o2}(ops)
                   [tp5]  σ_{property=p4}(pso)
               Evaluation

               Experiments

                  Compared the quality of the plans produced by our
                  heuristic-based SPARQL Planner (HSP) with those produced
                  by the cost-based dynamic programming planner (CDP) of
                  RDF-3X (Neumann et al.)
                  Compared the execution time of HSP plans, translated into
                      MonetDB's physical algebra (MAL) and
                      SQL queries
                  and evaluated over MonetDB, with the execution time of CDP
                  plans evaluated over RDF-3X
                  Measured the planning time for CDP and HSP

               Datasets
                  Synthetic (SP2Bench) and Real (YAGO) Datasets
               Characteristics of Query Workload

                  Queries with different numbers of triple patterns
                  Queries with different structural characteristics (kinds of joins)
                    star- and chain-shaped queries with join variables in different
                                   positions (Heuristics 1, 2)

                  Triple Patterns with different syntactic characteristics
                   number of variables and constants found in different positions
                                      (Heuristics 1, 3, 4)
               Query Plans

               Comparing Plan Quality
                1. HSP produces plans with the same number of merge
                   and hash joins as CDP
                2. Triple patterns are evaluated on the same ordered
                   relation (HSP), index (CDP)
                3. Plans differ on join ordering and the types of
                   joins applied to the join variables

               Comparing Plan Cost
                4. The cost of plans using the cost function of
                   RDF-3X/CDP does not differ
                5. The planning time for CDP and HSP does not
                   differ and is on the order of 0.1 ms.

                     SP1   SP2a   SP2b   SP3a   SP3b   SP3c   SP4a          SP4b
               HSP   32    873    830    487    100    105    354+953,381   264+953,381
               CDP   32    31     54     487    100    105    354+953,381   299+858,461

                     Y1           Y2            Y3            Y4
               HSP   12+300,054   1+303,579     329+302,577   327+763,749
               CDP   7+300,023    1.5+301,614   328+302,577   326+763,603

               1. SP4a contains small chain joins, Y3 contains small star joins
                  with syntactically dissimilar triple patterns
               2. SP2a, SP2b: same merge join variables, large star queries with
                  syntactically similar triple patterns, intermediate results of
                  different sizes
               3. SP4b, Y1, Y2, Y4: different merge join variables, intermediate
                  results of similar size



                HSP heuristics proved effective in choosing a
                near-optimal plan for queries whose triple
                patterns exhibit syntactic dissimilarities, which
                trigger the application of all HSP heuristics

               Query Plan Execution Times

               Queries with different plans
                  RDF-3X/CDP outperforms MonetDB/HSP when HSP chooses
                  the join ordering randomly (SP2a, SP2b)
                  MonetDB/HSP performs better than RDF-3X/CDP when
                  intermediate results are of the same order of magnitude (SP4b,
                  Y1, Y2, Y4)
               Queries with the same plan
                  MonetDB/HSP performs better than RDF-3X/CDP (SP3a,
                  SP3b, SP3c, SP4a, SP5, SP6, Y3)
                  MonetDB/HSP performs better than MonetDB/SQL on every
                  query except SP1
                                   SP1       SP2a       SP2b      SP3a      SP3b    SP3c
                   MonetDB/HSP    19.52    3,267.01   1,035.12    80.92      8.74   12.55
                   RDF-3X/CDP      0.25      355.50   1,000.75    85.14     11.95   13.97
                   MonetDB/SQL    11.92       3,561      1,103    82.91      9.61   14.81
                                               SP4a      SP4b      SP5       SP6
                           MonetDB/HSP     3,602.09   1,766.29    0.06      0.43
                           RDF-3X/CDP      3,634.60   2,781.75    0.10     22.85
                           MonetDB/SQL         XXX    1,909.13    0.09      0.48
                                                Y1     Y2         Y3       Y4
                             MonetDB/HSP       6.04   8.65      25.69      2.32
                             RDF-3X/CDP       15.75   9.95      81.20     90.45
                             MonetDB/SQL       7.69   9.07     538.65     1,113
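
The comparison between MonetDB/HSP and RDF-3X/CDP can be re-derived from the table with a short sketch. It assumes the figures are execution times for which lower is better (the slide states no unit); the dictionary layout and the `wins` helper are illustrative, and the MonetDB/SQL row is left out for brevity.

```python
# Execution times transcribed from the table above (unit not stated on the slide).
times = {
    "MonetDB/HSP": {
        "SP1": 19.52, "SP2a": 3267.01, "SP2b": 1035.12, "SP3a": 80.92,
        "SP3b": 8.74, "SP3c": 12.55, "SP4a": 3602.09, "SP4b": 1766.29,
        "SP5": 0.06, "SP6": 0.43, "Y1": 6.04, "Y2": 8.65, "Y3": 25.69, "Y4": 2.32,
    },
    "RDF-3X/CDP": {
        "SP1": 0.25, "SP2a": 355.50, "SP2b": 1000.75, "SP3a": 85.14,
        "SP3b": 11.95, "SP3c": 13.97, "SP4a": 3634.60, "SP4b": 2781.75,
        "SP5": 0.10, "SP6": 22.85, "Y1": 15.75, "Y2": 9.95, "Y3": 81.20, "Y4": 90.45,
    },
}

def wins(a, b):
    """Queries on which system `a` is strictly faster than system `b`."""
    return [q for q in times[a] if times[a][q] < times[b][q]]

print("CDP faster:", wins("RDF-3X/CDP", "MonetDB/HSP"))
print("HSP faster:", wins("MonetDB/HSP", "RDF-3X/CDP"))
```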

               Conclusions

                  RDF-specific heuristics for determining triple pattern selectivities
                  Reduced Query Planning to the Maximum Weight Independent
                  Set problem
                  Heuristics-based SPARQL planner (HSP) capable of choosing
                  near-optimal execution plans without the use of statistics
                  Implemented HSP plans on top of MonetDB
                  Experimentally evaluated HSP using synthetically generated and
                  real RDF datasets
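
For readers unfamiliar with the Maximum Weight Independent Set problem named above, the following is a minimal brute-force sketch. The example graph, its node weights, and the function name are hypothetical; how HSP builds its variable graph, assigns weights, and solves the problem is not shown on this slide, and the exhaustive search here is only practical for very small graphs.

```python
from itertools import combinations

def max_weight_independent_set(weights, edges):
    """Exhaustively find the set of pairwise non-adjacent nodes with the largest total weight."""
    nodes = sorted(weights)
    adjacent = {frozenset(edge) for edge in edges}
    best, best_weight = (), 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Skip subsets that contain two adjacent nodes.
            if any(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
                continue
            total = sum(weights[n] for n in subset)
            if total > best_weight:
                best, best_weight = subset, total
    return set(best), best_weight

# Hypothetical graph over three query variables
weights = {"?x": 3, "?y": 2, "?z": 2}
edges = [("?x", "?y"), ("?y", "?z")]
print(max_weight_independent_set(weights, edges))
```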

               Future Work

                  Extend HSP to support full SPARQL
                  Integrate proposed heuristics with a traditional cost-based
                  optimizer towards a hybrid solution
                  Apply the approach to a distributed environment
                  Cope with different RDF storage schemas (vertical partitioning,
                  hybrid)
                  Experiment with a large set of synthetically generated queries

                       subject(s)              predicate(p)     object(o)

               t1 :    sp2b:Journal1/1940      rdf:type         sp2b:Journal
               t2 :    sp2b:Inproceeding17     rdf:type         sp2b:Inproceedings
               t3 :    sp2b:Proceeding1/1954   dcterms:issued   "1954"
               t4 :    sp2b:Journal1/1952      dc:title         "Journal 1 (1952)"
               t5 :    sp2b:Journal1/1941      rdf:type         sp2b:Journal
               t6 :    sp2b:Article9           rdf:type         sp2b:Article
               t7 :    sp2b:Inproceeding40     dc:terms         "1950"
               t8 :    sp2b:Inproceeding40     rdf:type         sp2b:Inproceedings
               t9 :    sp2b:Journal1/1941      dc:title         "Journal 1 (1941)"
               t10 :   sp2b:Journal1/1942      rdf:type         sp2b:Journal
               t11 :   sp2b:Journal1/1940      dc:title         "Journal 1 (1940)"
               t12 :   sp2b:Inproceeding40     foaf:homepage    http://www.dielectrics.tld
               t13 :   sp2b:Journal1/1940      dcterms:issued   "1940"

                       (s, p, o) ∈ U × U × (U ∪ L) is called an RDF triple

                            A set of RDF triples is called an RDF graph
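
The definition above can be mirrored in a small data-structure sketch. The `URI`, `Literal`, and `Triple` classes and the `is_rdf_triple` check are illustrative assumptions rather than any RDF library's API; the sample triples are t1, t11, and t13 from the table.

```python
from dataclasses import dataclass
from typing import NamedTuple, Union

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class Literal:
    value: str

class Triple(NamedTuple):
    s: URI
    p: URI
    o: Union[URI, Literal]

def is_rdf_triple(t):
    """Check (s, p, o) in U x U x (U union L): only the object may be a literal."""
    return (
        isinstance(t.s, URI)
        and isinstance(t.p, URI)
        and isinstance(t.o, (URI, Literal))
    )

# An RDF graph is a set of RDF triples.
graph = {
    Triple(URI("sp2b:Journal1/1940"), URI("rdf:type"), URI("sp2b:Journal")),
    Triple(URI("sp2b:Journal1/1940"), URI("dc:title"), Literal("Journal 1 (1940)")),
    Triple(URI("sp2b:Journal1/1940"), URI("dcterms:issued"), Literal("1940")),
}
print(all(is_rdf_triple(t) for t in graph))
```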
               Comparing Plan Cost
                   CDP (RDF-3X) cost function considers the estimation of
                   intermediate results

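               \[ \mathrm{cost}_{\mathrm{mergejoin}}(\mathit{lc}, \mathit{rc}) = \frac{\mathit{lc} + \mathit{rc}}{100{,}000} \]

               \[ \mathrm{cost}_{\mathrm{hashjoin}}(\mathit{lc}, \mathit{rc}) = 300{,}000 + \frac{\mathit{lc}}{100} + \frac{\mathit{rc}}{10} \]

               where lc and rc are the cardinalities of the two join input relations,
               with lc being the smaller input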
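
As a worked illustration, the two formulas can be written as a short sketch. It reads the merge join cost as (lc + rc) / 100,000 and the hash join cost as 300,000 + lc/100 + rc/10; the function names and the sample cardinalities are assumptions, and this is not RDF-3X's actual code.

```python
def merge_join_cost(lc, rc):
    """Merge join cost: proportional to the sum of the input cardinalities."""
    return (lc + rc) / 100_000

def hash_join_cost(lc, rc):
    """Hash join cost: fixed overhead of 300,000 plus per-tuple terms.

    The smaller input is treated as lc, as stated on the slide.
    """
    lc, rc = min(lc, rc), max(lc, rc)
    return 300_000 + lc / 100 + rc / 10

# Hypothetical input cardinalities
print(merge_join_cost(60_000, 40_000))  # 1.0
print(hash_join_cost(100, 1_000))       # 300101.0
```

The fixed term in the hash join cost is consistent with the plan-cost tables, where the second component of each cost never falls below 300,000.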

               Constructing MAL Plans

                  translate each operator (selection, projection, join) to the
                  appropriate MAL statements

                  σ_{property=p1, object=o1}(ops)   [tp2]

                  bind(ops, object)
                         |
                    uselect(o1)     bind(ops, property)
                            \         /
                             semijoin
                                |
                           uselect(p1)     bind(ops, subject)
                                   \         /
                                    leftjoin
                                       |
                                      ?c1

Heuristic based Query Optimisation for SPARQL

  • 1. Heuristic- Based Optimization for SPARQL Queries Lefteris Sidirourgos Heuristic-Based Optimization for SPARQL Queries P. Tsialiamanis1,2 , Irini Fundulaki1 , V. Christophides1,2 L. Sidirourgos3 , P. Boncz3 Institute of Computer Science - FORTH University of Crete, Greece CWI, The Netherlands EDBT 2012 Berlin, Germany
  • 2. Web of Data. Knowledge Bases (Wikipedia, US Census Bureau, CIA World Factbook) and Social Networks; Government Sites (data.gov.uk, www.data.gov); Entertainment, Music, News Sources (BBC), ...; Scientific Data Sources (Biology, Physics, Astronomy, Geography, ...). Main Challenge: efficient management and usage of large (huge) volumes of Semantic Web data.
  • 3. Querying Linked Data: Where does existing Database Technology fit? Existing relational stores are problematic for querying the Web of Data. (a) The absence of schema and constraints for RDF data leads to query plans that are hard to optimize → a query plan contains a large number of self joins over a large triple table. (b) Relational optimizers compute the cost of scans but not the join hit ratio(s): the subject-subject and subject-object join hit ratios will have the same estimate, since correlations between triple components are not considered. (c) Correlated cost estimation over many selections and self joins is the rule, not the exception, for RDF query processing.
  • 4. Contributions. Heuristic-based SPARQL Planner (HSP): favors heuristic-based optimization of SPARQL join queries over flawed cost-based optimization. A set of heuristics based on the structural and syntactic query characteristics to reduce the size of intermediate join results. Bushy query plans that maximize the number of merge joins. Reduction of the problem of query planning to the problem of finding the maximum weight independent set for a SPARQL query, using a novel representation of a SPARQL query as a variable graph. Translation of the HSP bushy plans directly into MonetDB's physical algebra (MAL).
  • 7. RDF in a nutshell. RDF is the W3C standard for representing information in the Web. The RDF data model is based on node- and edge-labeled directed graphs. U = set of URIs, L = set of Literals. [Figure: a triple drawn as an edge labeled Predicate (U) from a Subject node (U) to an Object node (U or L).] (s, p, o) ∈ U × U × (U ∪ L) is called an RDF triple. A set of RDF triples is an RDF graph.
  • 8. SPARQL Queries in a Nutshell. U = set of URIs, L = set of Literals, V = set of Variables. A SPARQL triple pattern is an element of the set (V ∪ U) × (V ∪ U) × (V ∪ U ∪ L). Intuitively, a triple pattern denotes the triples in an RDF graph that are of a specific form. Example: (?j, rdf:type, ?y) matches all triples with predicate rdf:type. SPARQL graph patterns are formulated using the join, union and optional operators between triple patterns.
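The idea that a triple pattern denotes the triples of a specific form can be illustrated with a minimal sketch. It assumes terms are plain strings and that a leading `?` marks a variable; the helper names `match` and `evaluate` are hypothetical and not taken from the slides.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    # Assumption: variables are written with a leading '?', as on the slides.
    return term.startswith("?")


def match(pattern: Triple, triple: Triple) -> Optional[Dict[str, str]]:
    """Return the variable bindings if `triple` has the form of `pattern`, else None."""
    binding: Dict[str, str] = {}
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable that occurs twice must bind to the same value both times.
            if binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return binding


def evaluate(pattern: Triple, graph: List[Triple]) -> List[Dict[str, str]]:
    """Collect the bindings of every triple in the graph that matches the pattern."""
    return [b for t in graph if (b := match(pattern, t)) is not None]


GRAPH = [
    ("sp2b:Journal1/1940", "rdf:type", "sp2b:Journal"),
    ("sp2b:Journal1/1940", "dc:title", "Journal 1 (1940)"),
    ("sp2b:Article9", "rdf:type", "sp2b:Article"),
]

# (?j, rdf:type, ?y) selects the two rdf:type triples.
bindings = evaluate(("?j", "rdf:type", "?y"), GRAPH)
```

A linear scan like this is only meant to show the semantics; the ordered relations discussed later in the deck exist precisely to avoid it.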
  • 10. SPARQL Queries in a Nutshell. A SPARQL query is of the form select ?u1, ?u2, ... where Q, with ?u1, ?u2, ... variables and Q a SPARQL graph pattern. We focus on SPARQL join queries Q = {tp0, ..., tpk}, where tp0, tp1, ... are triple patterns. A join is defined by multiple occurrences of the same variable, e.g. (?j, rdf:type, ?y) . (sp2b:Journal1/1940, ?p, ?j) joins on ?j.
  • 11. Storing RDF Triples. The RDF graph is stored in a triple table (ternary relation) that contains triples of the form (subject, property, object).
             subject (s)              predicate (p)     object (o)
       t1    sp2b:Journal1/1940       rdf:type          sp2b:Journal
       t2    sp2b:Inproceeding17      rdf:type          sp2b:Inproceedings
       t3    sp2b:Proceeding1/1954    dcterms:issued    "1954"
       t4    sp2b:Journal1/1952       dc:title          "Journal 1 (1952)"
       t5    sp2b:Journal1/1941       rdf:type          sp2b:Journal
       t6    sp2b:Article9            rdf:type          sp2b:Article
       t7    sp2b:Inproceeding40      dc:terms          "1950"
       t8    sp2b:Inproceeding40      rdf:type          sp2b:Inproceedings
       t9    sp2b:Journal1/1941       dc:title          "Journal 1 (1941)"
       t10   sp2b:Journal1/1942       rdf:type          sp2b:Journal
       t11   sp2b:Journal1/1940       dc:title          "Journal 1 (1940)"
       t12   sp2b:Inproceeding40      foaf:homepage     http://www.dielectrics.tld
       t13   sp2b:Journal1/1940       dcterms:issued    "1940"
  • 12. Dictionary & Ordered Relations. To avoid processing long strings we map URIs and Literals to unique identifiers: a dictionary (binary relation) stores the mapping.
       Dictionary
       Oid   Value                    Oid   Value                 Oid   Value
       001   sp2b:Journal1/1940       009   sp2b:Journal          017   www.dielectrics.tld
       002   sp2b:Inproceeding17      010   sp2b:Inproceedings    018   "1940"
       003   sp2b:Proceeding1/1954    011   "1954"                019   rdf:type
       004   sp2b:Journal1/1952       012   "Journal 1 (1952)"    020   dcterms:issued
       To support merge joins for all possible join patterns we propose six ordered relations that store all ordering combinations on the subject (s), property (p) and object (o) components of a triple: spo, sop, ops, osp, pos, pso. Triples are lexicographically sorted by the appropriate collation order.
       Triples table (spo)          Triples table (pso)
             s     p     o                p     s     o
       t1    001   019   009        t1    019   001   009
       t13   001   020   018        t2    019   002   010
       t2    002   019   010        t5    019   005   009
       t3    003   020   011        t6    019   006   013
       t5    005   019   009        t8    019   007   010
       t6    006   019   013        t10   019   008   009
       t8    007   019   010        t13   020   001   018
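Dictionary encoding and the six ordered relations can be sketched in a few lines. This is an illustration under assumptions: identifiers are assigned in order of first appearance (so they differ from the Oids on the slide), each ordered relation is a sorted Python list, and the function names are hypothetical.

```python
from itertools import permutations


def build_dictionary(triples):
    """Map every distinct URI or literal to an integer identifier (first-seen order)."""
    oid_of = {}
    for triple in triples:
        for term in triple:
            oid_of.setdefault(term, len(oid_of) + 1)
    return oid_of


def ordered_relations(encoded):
    """Materialize the six orderings spo, sop, pso, pos, osp, ops, each sorted lexicographically."""
    position = {"s": 0, "p": 1, "o": 2}
    relations = {}
    for order in permutations("spo"):
        columns = [position[c] for c in order]
        relations["".join(order)] = sorted(
            tuple(t[i] for i in columns) for t in encoded
        )
    return relations


TRIPLES = [
    ("sp2b:Journal1/1940", "rdf:type", "sp2b:Journal"),
    ("sp2b:Journal1/1940", "dcterms:issued", "1940"),
    ("sp2b:Inproceeding17", "rdf:type", "sp2b:Inproceedings"),
]

oids = build_dictionary(TRIPLES)
encoded = [tuple(oids[term] for term in triple) for triple in TRIPLES]
relations = ordered_relations(encoded)
```

Because every relation is sorted on its leading components, a selection on a prefix returns its results already ordered on the next component, which is what makes merge joins possible without an extra sort.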
  • 13. Query Planning: Motivating Example. Problems to solve: (1) multiple ordered relations to evaluate one triple pattern; (2) multiple join orderings and join algorithms per join variable.
  • 17. Query Planning: Motivating Example. Problem 1: multiple ordered relations to evaluate one triple pattern. Query: select ?y where { s0 p0 ?x (tp0) . ?y p1 ?x (tp1) . ?y p2 ?z (tp2) . s1 p3 ?z (tp3) . s2 p4 ?z (tp4) }. Options for tp0 = (s0 p0 ?x): (1) pso: σ_{property=p0, subject=s0}(pso), where the selection on subject is done on a very large triple table; the results are ordered on the object. (2) spo: σ_{property=p0, subject=s0}(spo); the results are ordered on the object. (3) pos: σ_{property=p0, subject=s0}(pos); the results are not returned ordered on object. (4) ...
  • 18. Query Planning: Motivating Example. Problem 2: multiple join orderings and join algorithms per join variable. Query: select ?y where { s0 p0 ?x (tp0) . ?y p1 ?x (tp1) . ?y p2 ?z (tp2) . s1 p3 ?z (tp3) . s2 p4 ?z (tp4) }. In this plan the merge joins are performed on variables ?y and ?z. [Plan figure: projection π ?y at the root; hash joins (hj) on ?z and ?x; merge joins (mj) on ?y and ?z. Leaves: [tp0] σ_{subject=s0, property=p0}(spo); [tp1] σ_{property=p1}(pso); [tp2] σ_{property=p2}(pso); [tp3] σ_{subject=s1, property=p3}(spo); [tp4] σ_{subject=s2, property=p4}(spo).]
  • 19. Query Planning: Motivating Example. Problem 2: multiple join orderings and join algorithms per join variable. Query: select ?y where { s0 p0 ?x (tp0) . ?y p1 ?x (tp1) . ?y p2 ?z (tp2) . s1 p3 ?z (tp3) . s2 p4 ?z (tp4) }. In this alternative plan the merge joins are performed on variables ?z and ?x. [Plan figure: projection π ?y at the root; one hash join (hj) on ?y; merge joins (mj) on ?z, ?x and ?z. Leaves: [tp0] σ_{subject=s0, property=p0}(spo); [tp1] σ_{property=p1}(pos); [tp2] σ_{property=p2}(pos); [tp3] σ_{subject=s1, property=p3}(spo); [tp4] σ_{subject=s2, property=p4}(spo).]
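Both plans depend on inputs that arrive sorted on the join variable, since that is what allows a merge join instead of a hash join. The following is a generic textbook sort-merge join over two sorted lists of (key, payload) pairs. It is a sketch for orientation only and does not reproduce MonetDB's or RDF-3X's operators.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with every right row of the run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


# Two inputs sorted on the join key, e.g. two selections returned in ?z order.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left_rows, right_rows)
```

Each input is read once and no hash table is built, which is why the planner tries to maximize the number of joins that can be evaluated this way.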
  • 20. Heuristics. We use a set of heuristics to (a) determine the ordered relation on which to evaluate a triple pattern and (b) determine the join ordering.
  • 22. Heuristics. H1: Triple Pattern Order. Triple patterns are ordered from the most to the least selective, i.e., from the one that is likely to produce fewer intermediate results to the one that is likely to produce more: (s, p, o), (s, ?, o), (?, p, o), (s, p, ?), (?, ?, o), (s, ?, ?), (?, p, ?), (?, ?, ?) (*). (*) except predicate rdf:type. H2: Join Patterns. Join patterns are ordered from the most to the least selective: p⋈o, s⋈p, s⋈o, o⋈o, s⋈s, p⋈p.
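The H1 ordering can be read as a lookup table keyed on which positions of a triple pattern are bound. The sketch below assumes variables start with `?` and ignores the rdf:type exception, whose exact handling the slide does not spell out; `shape`, `h1_rank` and `order_by_h1` are hypothetical names.

```python
# H1 order from the slide, most selective first; '?' marks an unbound position.
H1_ORDER = ["spo", "s?o", "?po", "sp?", "??o", "s??", "?p?", "???"]


def shape(pattern):
    """Describe a triple pattern by its bound positions, e.g. ('s0', 'p0', '?x') -> 'sp?'."""
    return "".join(
        "?" if term.startswith("?") else position
        for position, term in zip("spo", pattern)
    )


def h1_rank(pattern):
    """Lower rank means more selective according to H1."""
    return H1_ORDER.index(shape(pattern))


def order_by_h1(patterns):
    """Sort triple patterns from most to least selective (stable for ties)."""
    return sorted(patterns, key=h1_rank)


# The five triple patterns of the motivating example query.
QUERY = [
    ("s0", "p0", "?x"),  # tp0
    ("?y", "p1", "?x"),  # tp1
    ("?y", "p2", "?z"),  # tp2
    ("s1", "p3", "?z"),  # tp3
    ("s2", "p4", "?z"),  # tp4
]
ranked = order_by_h1(QUERY)
```

No statistics are consulted: the ranking depends only on the syntactic form of each pattern, which is the point of a heuristic planner.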
  • 23. Heuristics. H3: Number of Literals/URIs. A set of triple patterns with more literals is more selective than a set of triple patterns with more URIs. H4: Triple Patterns with Literals at the object position. A triple pattern with a literal in the object position is more selective than one with a URI in the object position. Heuristics H3 and H4 are special cases of H1 but can be used separately. H5: Triple Patterns with the least number of Projections. Related to tuple reconstruction in column-store DBMSs.
  • 26. Query Planning. Objective: produce query plans with the maximum number of merge joins. Solution: reduce query planning to the Maximum Weight Independent Set problem. An independent set is a set of nodes no two of which share an edge. Finding independent sets is a problem complementary to finding cliques in a graph. [Figure: two example graphs. Graph on nodes n1, n2, n3: independent sets = {{n1}, {n2}, {n3}}. Graph on nodes n1, ..., n5: independent sets = {{n1, n3}, {n1, n5}, {n4, n3}, {n4, n5}, {n2, n5}}.] This is equivalent to finding the largest groups of variables that can be merge-joined.
  • 30. Variable Graph. Model a SPARQL join query as a Variable Graph where: nodes in the graph are the query variables; an edge exists between two nodes if they belong to the same triple pattern; the weight of each node is the number of joins it participates in. Query: select ?y where { ?p ?ss ?c1 (tp0) . ?p ?dd ?c2 (tp1) . ?c1 p1 o1 (tp2) . ?c1 p2 ?x (tp3) . ?c2 p3 o2 (tp4) . ?c2 p4 ?y (tp5) }. Variable graph: ?c2 (weight 2), ?p (weight 1), ?c1 (weight 2), with ?p adjacent to both ?c1 and ?c2. MWIS: {{?c2, ?c1}}.
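The variable graph and its maximum weight independent sets can be computed by brute force for small queries. This sketch makes two assumptions beyond the slide: only variables that occur in more than one triple pattern become nodes (matching the three nodes of the example graph), and a node's weight is its number of occurrences minus one. The function names are hypothetical, and the exhaustive search is exponential in the number of join variables.

```python
from itertools import combinations


def variable_graph(patterns):
    """Return (weights, edges) of the variable graph of a SPARQL join query."""
    occurrences = {}
    for tp in patterns:
        for var in {term for term in tp if term.startswith("?")}:
            occurrences[var] = occurrences.get(var, 0) + 1
    # Assumption: weight = number of joins = occurrences - 1; non-join variables are dropped.
    weights = {var: n - 1 for var, n in occurrences.items() if n > 1}
    edges = set()
    for tp in patterns:
        join_vars = sorted({term for term in tp if term in weights})
        edges.update(combinations(join_vars, 2))  # edges stored as sorted pairs
    return weights, edges


def max_weight_independent_sets(weights, edges):
    """Enumerate all node subsets and keep the independent ones of maximum total weight."""
    nodes = sorted(weights)
    best, best_weight = [], -1
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if any(pair in edges for pair in combinations(subset, 2)):
                continue  # two chosen variables share a triple pattern
            weight = sum(weights[v] for v in subset)
            if weight > best_weight:
                best, best_weight = [set(subset)], weight
            elif weight == best_weight:
                best.append(set(subset))
    return best


QUERY = [
    ("?p", "?ss", "?c1"),  # tp0
    ("?p", "?dd", "?c2"),  # tp1
    ("?c1", "p1", "o1"),   # tp2
    ("?c1", "p2", "?x"),   # tp3
    ("?c2", "p3", "o2"),   # tp4
    ("?c2", "p4", "?y"),   # tp5
]
weights, edges = variable_graph(QUERY)
mwis = max_weight_independent_sets(weights, edges)
```

For the example query this yields the single set {?c1, ?c2}, so all triple patterns containing ?c1 or ?c2 can be merge-joined and only ?p needs a hash join.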
  • 31. Query Planning: Assigning Access Paths to Triple Patterns.
       Require: SPARQL join query Q
       Ensure: mapping M that assigns triple patterns to ordered relations
       while there exists a non-empty set T of triple patterns that do not have a merge-joined variable do
           construct the variable graph from set T
           I ← the set of maximum weight independent sets
           repeat
               apply Heuristics 3, 4, 2, 5, in that order, to I to eliminate independent sets
           until either |I| = 1 or all heuristics have been applied
           choose one set randomly, and remove the triple patterns that do not have a variable in the independent set
       end while
       for all merge-joined variables do
           assign ordered relations to those triple patterns such that merge-joined variables are returned sorted
       assign the ordered relation for the remaining triple patterns (hash joins)
  • 32. Assigning Access Paths to Triple Patterns: Example. Query Q: ?p ?ss ?c1 (tp0) . ?p ?dd ?c2 (tp1) . ?c1 p1 o1 (tp2) . ?c1 p2 ?x (tp3) . ?c2 p3 o2 (tp4) . ?c2 p4 ?y (tp5). Variable graph: ?c2 (weight 2), ?p (weight 1), ?c1 (weight 2). Single maximum weight independent set: MWIS = {{?c2, ?c1}}. Mapping M: M(tp1, (osp, ?c2)), M(tp4, (ops, ?c2)), M(tp5, (pso, ?c2)), M(tp0, (osp, ?c1)), M(tp2, (ops, ?c1)), M(tp3, (pso, ?c1)).
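One way to reproduce the mapping M of this example is to place constants first, then the merge-joined variable, then the remaining positions. The priority orders used below (constants in the order o, s, p; leftover positions in the order s, p, o) are assumptions chosen so that the sketch agrees with the slide's mapping; the slides do not state the rule, and `access_path` is a hypothetical name.

```python
CONSTANT_PRIORITY = "osp"  # assumption: bound object first, then subject, then property
FREE_PRIORITY = "spo"      # assumption: tie-break order for the remaining positions


def access_path(pattern, merge_var):
    """Pick an ordered relation that returns `merge_var` sorted for this triple pattern."""
    roles = dict(zip("spo", pattern))
    constants = [c for c in CONSTANT_PRIORITY if not roles[c].startswith("?")]
    merge = [c for c in "spo" if roles[c] == merge_var]
    if len(merge) != 1:
        raise ValueError("merge variable must occur exactly once in the pattern")
    rest = [c for c in FREE_PRIORITY if c not in constants and c not in merge]
    # Selections on the constant prefix leave the results sorted on the merge variable.
    return "".join(constants + merge + rest)


QUERY = {
    "tp0": ("?p", "?ss", "?c1"),
    "tp1": ("?p", "?dd", "?c2"),
    "tp2": ("?c1", "p1", "o1"),
    "tp3": ("?c1", "p2", "?x"),
    "tp4": ("?c2", "p3", "o2"),
    "tp5": ("?c2", "p4", "?y"),
}
MERGE_VAR = {"tp0": "?c1", "tp2": "?c1", "tp3": "?c1",
             "tp1": "?c2", "tp4": "?c2", "tp5": "?c2"}

mapping = {tp: access_path(QUERY[tp], MERGE_VAR[tp]) for tp in QUERY}
```

Under these assumptions the result is osp for tp0 and tp1, ops for tp2 and tp4, and pso for tp3 and tp5, which is the mapping shown on the slide.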
  • 33. Constructing (Bushy) Logical Plans. Query: select ?y where { ?p ?ss ?c1 (tp0) . ?p ?dd ?c2 (tp1) . ?c1 p1 o1 (tp2) . ?c1 p2 ?x (tp3) . ?c2 p3 o2 (tp4) . ?c2 p4 ?y (tp5) }. Mapping: M(tp0, (osp, ?c1)), M(tp2, (ops, ?c1)), M(tp3, (pso, ?c1)), M(tp1, (osp, ?c2)), M(tp4, (ops, ?c2)), M(tp5, (pso, ?c2)). [Plan figure: bushy plan with projection π ?p at the root and a hash join (hj) on ?p combining two subtrees, one built from merge joins (mj) on ?c1 and one from merge joins on ?c2. Leaves: [tp0] scan(osp); [tp1] scan(osp); [tp3] σ_{property=p2}(pso); [tp2] σ_{property=p1, object=o1}(ops); [tp5] σ_{property=p4}(pso); [tp4] σ_{property=p3, object=o2}(ops).]
  • 34. Evaluation. Experiments: (a) compared the quality of the plans produced by our heuristic-based SPARQL Planner (HSP) with those produced by the cost-based dynamic programming planner (CDP) of RDF-3X (Neumann et al.); (b) compared the execution time of HSP plans translated into MonetDB's physical algebra (MAL), and of SQL queries evaluated over MonetDB, with the execution time of CDP plans evaluated over RDF-3X; (c) measured the planning time for CDP and HSP. Datasets: synthetic (SP2Bench) and real (YAGO) datasets.
  • 35. Characteristics of Query Workload. Queries with different numbers of triple patterns. Queries with different structural characteristics (kinds of joins): star- and chain-shaped queries with join variables in different positions (Heuristics 1, 2). Triple patterns with different syntactic characteristics: number of variables and constants found in different positions (Heuristics 1, 3, 4).
  • 37. Query Plans. Comparing Plan Quality: (1) HSP produces plans with the same number of merge and hash joins as CDP. (2) Triple patterns are evaluated on the same ordered relation (HSP) or index (CDP). (3) Plans differ in join ordering and in the types of joins applied to the join variables. Comparing Plan Cost: (4) The cost of plans, computed with the cost function of RDF-3X/CDP, does not differ. (5) The planning time for CDP and HSP does not differ and is in the order of 0.1 ms.
  • 41. Plan costs for HSP and CDP.
             SP1   SP2a   SP2b   SP3a   SP3b   SP3c   SP4a          SP4b
       HSP   32    873    830    487    100    105    354+953,381   264+953,381
       CDP   32    31     54     487    100    105    354+953,381   299+858,461

             Y1           Y2            Y3            Y4
       HSP   12+300,054   1+303,579     329+302,577   327+763,749
       CDP   7+300,023    1.5+301,614   328+302,577   326+763,603
       (1) SP4a contains small chain joins; Y3 contains small star joins with syntactically dissimilar triple patterns. (2) SP2a, SP2b: same merge join variables, large star queries with syntactically similar triple patterns, intermediate results of different sizes. (3) SP4b, Y1, Y2, Y4: different merge join variables, intermediate results of similar size. HSP heuristics proved effective in choosing a near-optimal plan for queries whose triple patterns exhibit syntactic dissimilarities, which cause all HSP heuristics to be applied.
  • 42. Query Plan Execution Times. Queries with different plans: RDF-3X/CDP outperforms MonetDB/HSP when HSP chooses the join ordering randomly (SP2a, SP2b); MonetDB/HSP performs better than RDF-3X/CDP when intermediate results are of the same order of magnitude (SP4b, Y1, Y2, Y4). Queries with the same plan: MonetDB/HSP performs better than RDF-3X/CDP (SP3a, SP3b, SP3c, SP4a, SP5, SP6, Y3). MonetDB/HSP always performs better than MonetDB/SQL.
                     SP1     SP2a       SP2b       SP3a    SP3b    SP3c
       MonetDB/HSP   19.52   3,267.01   1,035.12   80.92   8.74    12.55
       RDF-3X/CDP    0.25    355.50     1,000.75   85.14   11.95   13.97
       MonetDB/SQL   11.92   3,561      1,103      82.91   9.61    14.81

                     SP4a       SP4b       SP5    SP6
       MonetDB/HSP   3,602.09   1,766.29   0.06   0.43
       RDF-3X/CDP    3,634.60   2,781.75   0.10   22.85
       MonetDB/SQL   XXX        1,909.13   0.09   0.48

                     Y1      Y2     Y3       Y4
       MonetDB/HSP   6.04    8.65   25.69    2.32
       RDF-3X/CDP    15.75   9.95   81.20    90.45
       MonetDB/SQL   7.69    9.07   538.65   1,113
  • 43. Conclusions. RDF-specific heuristics for determining triple pattern selectivities. Reduced query planning to the Maximum Weight Independent Set problem. Heuristics-based SPARQL planner (HSP) capable of choosing near-optimal execution plans without the use of statistics. Implemented HSP plans on top of MonetDB. Experimentally evaluated HSP using synthetically generated and real RDF datasets.
  • 44. Future Work. Extend HSP to support full SPARQL. Integrate the proposed heuristics with a traditional cost-based optimizer towards a hybrid solution. Apply the approach to a distributed environment. Cope with different RDF storage schemas (vertical partitioning, hybrid). Experiment with a large set of synthetically generated queries.
  • 47. Comparing Plan Cost. The CDP (RDF-3X) cost function considers the estimation of intermediate results: $\mathrm{cost}_{\mathrm{mergejoin}}(lc, rc) = \frac{lc + rc}{100{,}000}$ and $\mathrm{cost}_{\mathrm{hashjoin}}(lc, rc) = 300{,}000 + \frac{lc}{100} + \frac{rc}{10}$, where lc and rc are the cardinalities of the two join input relations, with lc being the smaller input.
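Assuming the two cost functions read (lc + rc) / 100,000 for a merge join and 300,000 + lc/100 + rc/10 for a hash join, they can be written down directly. The function names are hypothetical; swapping the arguments so that lc is the smaller input is an assumption that follows the slide's wording.

```python
def cost_mergejoin(lc: float, rc: float) -> float:
    """Merge join cost: linear in the combined input cardinality."""
    return (lc + rc) / 100_000


def cost_hashjoin(lc: float, rc: float) -> float:
    """Hash join cost: a large constant start-up term plus per-tuple costs."""
    # Assumption: lc denotes the smaller input, so normalize the argument order.
    lc, rc = min(lc, rc), max(lc, rc)
    return 300_000 + lc / 100 + rc / 10


# Joining two inputs of 1,000 and 100 tuples with either algorithm.
merge_cost = cost_mergejoin(1_000, 100)
hash_cost = cost_hashjoin(1_000, 100)
```

The constant 300,000 dominates for all but very large inputs, which matches the "+300,..." components in the plan costs reported for plans containing a hash join and explains why maximizing merge joins is the planner's objective.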
  • 48. Constructing MAL Plans. Translate each operator (selection, projection, join) to the appropriate MAL statements. [Figure: translation of [tp2], σ_{property=p1, object=o1}(ops), into MAL: bind(ops, object) and uselect(o1); bind(ops, property) and uselect(p1); semijoin; bind(ops, subject); leftjoin, producing ?c1.]