Query-Load aware partitioning of RDF data

Query-Load aware partitioning
of RDF data
Luis Galárraga
Saarbrücken, July 4th 2011


Outline

● Motivation & background
● Fragmentation in databases
● Observations & goals
● Proposed methodology
● Preliminary results



Motivation
● Increasing interest in semantic
representations for knowledge.
– Increasing number of data providers (e.g., the
Linked Data initiative)
– Semantic Web: “Web of knowledge”
– Growing data sources (e.g., Wikipedia)
● Need for efficient query processing
– Centralized solutions might become infeasible
as data steadily grows.
– Taking advantage of parallelism can help
improve performance.

Data keeps growing

[Figure: DBpedia dataset size growth, size in GB by date (10/2006 to 04/2012)]
● DBpedia 3.6: 3,500,000 resources, 0.5 billion facts (http://dbpedia.org)
● Semantic Web Challenge 2011: 2 billion triples, 20 GB dataset (http://challenge.semanticweb.org)



RDF and triple stores
● Resource Description Framework is a
language to represent knowledge about
resources (things).
– Resources are identified by URIs
<http://www.mpii.de/yago/resource/John_Doe>

● It uses statements or triples
PREFIX yago: <http://www.mpii.de/yago/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

yago:John_Doe  foaf:name  "John Doe"
Subject        Predicate  Object

RDF and triple stores

● Data in a triple store can be seen as a data
graph or a huge 3-column relation.

Subject              Predicate   Object
yago:John_Doe        foaf:name   "John Doe"
yago:John_Doe        foaf:knows  yago:Max_Mustermann
yago:Max_Mustermann  foaf:name   "Max Mustermann"

[Figure: the same three triples drawn as a data graph]

● Existing solutions like Jena or Sesame use
some variation of the 3-column relation.

How to query RDF?

● Use of data graph abstraction.
– SQL was designed for relational databases
● SPARQL defines queries as subgraph
patterns to be matched within the data graph.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX yago: <http://www.mpii.de/yago/resource/>
SELECT ?name
WHERE {
?person a foaf:Person .
?person foaf:knows yago:Max_Mustermann .
?person foaf:name ?name .
}

[Figure: data graph in which yago:John_Doe and yago:Max_Mustermann are both of type foaf:Person, each has a foaf:name, and yago:John_Doe foaf:knows yago:Max_Mustermann]
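
A minimal sketch of how the subgraph pattern of the query could be matched against a small triple set, written in Python for illustration. The `a foaf:Person` triples are read off the slide's figure, and the helper names (`is_var`, `match_pattern`, `evaluate`) and the nested-loop join are assumptions, not the way any particular SPARQL engine works.

```python
# Triples from the slides, stored as (subject, predicate, object) tuples.
TRIPLES = [
    ("yago:John_Doe", "a", "foaf:Person"),
    ("yago:John_Doe", "foaf:name", "John Doe"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:Max_Mustermann", "a", "foaf:Person"),
    ("yago:Max_Mustermann", "foaf:name", "Max Mustermann"),
]


def is_var(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def evaluate(patterns, triples):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


QUERY = [
    ("?person", "a", "foaf:Person"),
    ("?person", "foaf:knows", "yago:Max_Mustermann"),
    ("?person", "foaf:name", "?name"),
]
names = [b["?name"] for b in evaluate(QUERY, TRIPLES)]
```

Each triple pattern is a join on the shared variable `?person`, which is why the query load is described later as join oriented.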



Fragmentation in databases
● Why? To exploit the processing power of
multiple nodes by decomposing operations
into parallel sub-operations.
● In relational databases:
– Horizontal fragmentation [Dimovski, 2010]
– Vertical fragmentation [Hoffer, 1975]
– Workload driven [Curino, 2010]
● It has to be combined with an allocation
strategy (assignment of fragments to hosts)

Horizontal & vertical fragmentation
yago:John_Doe        foaf:name   "John Doe"
yago:Max_Mustermann  foaf:name   "Max Mustermann"
yago:John_Doe        foaf:knows  yago:Max_Mustermann
yago:John_Doe        foaf:mbox   "jdoe@wherever.com"
yago:Juan_Perez      foaf:mbox   "jprz@wherever.com"

Horizontal or tuple-based fragmentation

Subject              Predicate   Object
yago:John_Doe        foaf:name   "John Doe"
yago:Max_Mustermann  foaf:name   "Max Mustermann"

yago:John_Doe        foaf:knows  yago:Max_Mustermann

yago:John_Doe        foaf:mbox   "jdoe@wherever.com"
yago:Juan_Perez      foaf:mbox   "jprz@wherever.com"

Vertical or column-based fragmentation

Subject              Predicate
yago:John_Doe        foaf:name
yago:Max_Mustermann  foaf:name
yago:John_Doe        foaf:knows
yago:John_Doe        foaf:mbox
yago:Juan_Perez      foaf:mbox

Subject              Object
yago:John_Doe        "John Doe"
yago:Max_Mustermann  "Max Mustermann"
yago:John_Doe        yago:Max_Mustermann
yago:John_Doe        "jdoe@wherever.com"
yago:Juan_Perez      "jprz@wherever.com"
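
The two fragmentation styles can be sketched in a few lines of Python. This is only an illustration: grouping by predicate is one possible horizontal criterion chosen here as an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

TRIPLES = [
    ("yago:John_Doe", "foaf:name", "John Doe"),
    ("yago:Max_Mustermann", "foaf:name", "Max Mustermann"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", "jdoe@wherever.com"),
    ("yago:Juan_Perez", "foaf:mbox", "jprz@wherever.com"),
]


def horizontal_fragments(triples, key):
    """Group whole tuples by the value of `key(triple)`."""
    fragments = defaultdict(list)
    for triple in triples:
        fragments[key(triple)].append(triple)
    return dict(fragments)


def vertical_fragments(triples):
    """Project the relation onto (subject, predicate) and (subject, object)."""
    subject_predicate = [(s, p) for s, p, _ in triples]
    subject_object = [(s, o) for s, _, o in triples]
    return subject_predicate, subject_object


# Assumed criterion for the horizontal case: one fragment per predicate.
by_predicate = horizontal_fragments(TRIPLES, key=lambda t: t[1])
sp, so = vertical_fragments(TRIPLES)
```

Horizontal fragments keep every triple intact, whereas the vertical projections each hold only two of the three columns.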


Workload-driven fragmentation
● Model relationships between tuples as a graph.
– One node per tuple. Two nodes share an edge if
they are required by the same transaction.
● Partition the graph

● Try to keep
transactions as
local as possible
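
As a rough sketch of the idea, the co-access graph and the cost of a candidate partitioning could be computed as follows. The workload, the tuple ids and the function names are hypothetical, and counting co-occurrences as edge weights is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(transactions):
    """Weight of edge (a, b) = number of transactions touching both tuples."""
    edges = Counter()
    for tuple_ids in transactions:
        for a, b in combinations(sorted(set(tuple_ids)), 2):
            edges[(a, b)] += 1
    return edges


def cut_weight(edges, assignment):
    """Total weight of edges whose endpoints land in different partitions."""
    return sum(w for (a, b), w in edges.items() if assignment[a] != assignment[b])


# Hypothetical workload: each transaction lists the tuple ids it reads.
transactions = [[1, 2], [1, 2], [2, 3], [4, 5]]
edges = co_access_graph(transactions)

# Two candidate assignments of tuples to partitions 0 and 1.
good = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
bad = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0}
```

A partitioning with a lower cut weight keeps more transactions local, which is the goal stated on the slide.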



Observations
● RDF query load
– Updates and insertions are rare
– Join oriented
● Data graph
– Subjects are more selective than objects, which
are more selective than predicates.
– Constants are unstable for fragmentation.
● Distributed Query Processing
– Communication costs dominate distributed
transactions

Goals
● Fragment RDF dataset based on a workload
to guarantee:
● Small latency
– Limit communication costs by maximizing
local transactions but keeping parallelism
● High throughput
● Scalability
● Load balancing
– Allocate fragments such that hosts get
approximately the same load.


Proposed methodology
Partitioning phase
Determine a complete and non-redundant
fragmentation of the triple store using a minimal
set of predicates extracted from the query load.

Allocation phase
Assign fragments to hosts to guarantee load
balancing


Normalizing the query load

● Extract independent sub-queries.
– We still want independent sub-queries to run in
parallel.
● Normalize triple patterns:
– Turn infrequent URIs or literals into variables.
– Capture patterns of access
– Not applicable to data types with a reduced
value space (e.g., xsd:boolean = {true, false})

Normalizing the query load
SELECT ?name
WHERE {
?x foaf:name ?name .
?x foaf:mbox "alice@wherever.com"   # infrequent literal
}
SELECT ?name
WHERE {
?x foaf:mbox "bob@wherever.com"     # infrequent literal
}

After normalization, the infrequent literals become a variable:

SELECT ?name
WHERE {
?x foaf:mbox ?mbox
}
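
The normalization step could be sketched as below. This is an illustration under several assumptions: only constants in object position are considered, the frequency threshold `min_freq` and the fresh variable names are hypothetical, the two `foaf:Person` queries are invented to show a frequent constant, and the exception for data types with a reduced value space is not handled.

```python
from collections import Counter


def is_var(term):
    return term.startswith("?")


def normalize(queries, min_freq):
    """Turn object constants seen fewer than `min_freq` times into variables."""
    counts = Counter(o for q in queries for _, _, o in q if not is_var(o))
    normalized = []
    for query in queries:
        patterns = []
        for i, (s, p, o) in enumerate(query):
            if not is_var(o) and counts[o] < min_freq:
                o = f"?c{i}"  # fresh variable; the naming scheme is hypothetical
            patterns.append((s, p, o))
        normalized.append(patterns)
    return normalized


queries = [
    [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", '"alice@wherever.com"')],
    [("?x", "foaf:mbox", '"bob@wherever.com"')],
    [("?x", "a", "foaf:Person")],  # hypothetical queries with a frequent constant
    [("?y", "a", "foaf:Person")],
]
result = normalize(queries, min_freq=2)
```

Raising or lowering `min_freq` changes how many constants survive, and therefore how specific the extracted access patterns are.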


Extracting predicates
PREFIX yago: <http://www.mpii.de/yago/..>
SELECT ?name
WHERE {
A ?x foaf:name ?name .
B ?x foaf:mbox ?mbox
}
SELECT ?name
WHERE {
A ?z foaf:name ?name .
B ?z foaf:mbox ?mbox .
C ?z foaf:knows yago:John_Doe .
}

P1: Predicate = foaf:name from A
P2: Predicate = foaf:mbox from B
P3: Predicate = foaf:knows from C
P4: Object = yago:John_Doe from C

● Remember where the predicates come from.
● Store join relationships between patterns
among the queries:

Global Query Graph
Nodes:
A: ?x foaf:name ?name (Freq: 2)
B: ?x foaf:mbox ?mbox (Freq: 2)
C: ?x foaf:knows yago:John_Doe (Freq: 1)
Edges (weight):
A-B: 2
A-C: 1
B-C: 1
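
A small Python sketch of how the simple predicates and the global query graph could be derived from the two example queries. The function names are hypothetical, and collapsing every variable to `?` so that `?x` and `?z` patterns compare equal is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations

POSITIONS = ("Subject", "Predicate", "Object")


def simple_predicates(pattern):
    """Constants of a triple pattern as (position, value) pairs."""
    return [(pos, t) for pos, t in zip(POSITIONS, pattern) if not t.startswith("?")]


def canonical(pattern):
    """Collapse variable names so that ?x and ?z patterns compare equal."""
    return tuple("?" if t.startswith("?") else t for t in pattern)


def shares_variable(p, q):
    return any(t.startswith("?") and t in q for t in p)


def global_query_graph(queries):
    """Pattern frequencies (nodes) and join counts between patterns (edges)."""
    freq, edges = Counter(), Counter()
    for query in queries:
        for pattern in query:
            freq[canonical(pattern)] += 1
        for p, q in combinations(query, 2):
            if shares_variable(p, q):
                edges[frozenset((canonical(p), canonical(q)))] += 1
    return freq, edges


QUERIES = [
    [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")],
    [("?z", "foaf:name", "?name"), ("?z", "foaf:mbox", "?mbox"),
     ("?z", "foaf:knows", "yago:John_Doe")],
]
A = ("?", "foaf:name", "?")
B = ("?", "foaf:mbox", "?")
C = ("?", "foaf:knows", "yago:John_Doe")
freq, edges = global_query_graph(QUERIES)
```

With these inputs the node frequencies and edge weights agree with the global query graph of the example: A and B occur twice and join twice, while C occurs once and joins once with each of them.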

Minterms & Fragments
● Conjunctive expressions over a set of
predicates, e.g.:

P1: Predicate = foaf:name
P2: Predicate = foaf:mbox

Minterm 00 = ~P2 ^ ~P1
Minterm 01 = ~P2 ^ P1
Minterm 10 = P2 ^ ~P1
Minterm 11 = P2 ^ P1
● A minterm defines a fragment.
– Set of triples satisfying the logical function
● The set of all possible minterms determines a
non-redundant and complete fragmentation.
– But we want a minimal set of predicates.
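
To make the definition concrete, the following Python sketch enumerates all minterms over P1 and P2 and assigns each example triple to the fragment of the minterm it satisfies. The fragments are keyed by the truth values of (P1, P2) rather than by a bit string, and the function names are hypothetical.

```python
from itertools import product

TRIPLES = [
    ("yago:John_Doe", "foaf:name", "John Doe"),
    ("yago:Max_Mustermann", "foaf:name", "Max Mustermann"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", "jdoe@wherever.com"),
    ("yago:Juan_Perez", "foaf:mbox", "jprz@wherever.com"),
]

# Simple predicates as (name, test) pairs; P1 and P2 follow the slides.
PREDICATES = [
    ("P1", lambda t: t[1] == "foaf:name"),
    ("P2", lambda t: t[1] == "foaf:mbox"),
]


def minterm_of(triple, predicates):
    """Truth value of every simple predicate for one triple."""
    return tuple(test(triple) for _, test in predicates)


def fragment(triples, predicates):
    """Map every possible minterm to the triples that satisfy it."""
    fragments = {bits: [] for bits in product([False, True], repeat=len(predicates))}
    for triple in triples:
        fragments[minterm_of(triple, predicates)].append(triple)
    return fragments


fragments = fragment(TRIPLES, PREDICATES)
```

Every triple satisfies exactly one minterm, so the fragments are complete and non-redundant. The fragment for (True, True) stays empty because a triple has a single predicate.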

Optimal Horizontal Fragmentation
● A predicate is redundant if the
fragmentation is insensitive to its presence
or absence.
● Start with an empty set
● For every extracted predicate:
– Add it to the set and fragment the
database building the minterms
– If the predicate is redundant, ignore it.
– If it is not redundant, check whether it made
previously added predicates redundant.
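
The loop above could be sketched as follows. This is an interpretation, not the author's implementation: "insensitive to its presence or absence" is read here as "the set of non-empty fragments does not change", and all names are hypothetical.

```python
def partition(items, predicates):
    """Non-empty fragments induced by the minterms of `predicates`."""
    groups = {}
    for item in items:
        key = tuple(test(item) for test in predicates)
        groups.setdefault(key, set()).add(item)
    return {frozenset(g) for g in groups.values()}


def minimal_predicates(items, candidates):
    chosen = []
    for candidate in candidates:
        if partition(items, chosen + [candidate]) == partition(items, chosen):
            continue  # redundant: the fragmentation does not change
        chosen.append(candidate)
        # Drop earlier predicates that the new one made redundant.
        for old in chosen[:-1]:
            rest = [p for p in chosen if p is not old]
            if partition(items, rest) == partition(items, chosen):
                chosen = rest
    return chosen


TRIPLES = [
    ("yago:John_Doe", "foaf:name", "John Doe"),
    ("yago:Max_Mustermann", "foaf:name", "Max Mustermann"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", "jdoe@wherever.com"),
    ("yago:Juan_Perez", "foaf:mbox", "jprz@wherever.com"),
]
is_name = lambda t: t[1] == "foaf:name"
is_mbox = lambda t: t[1] == "foaf:mbox"
is_name_again = lambda t: t[1] == "foaf:name"  # duplicate, hence redundant

result = minimal_predicates(TRIPLES, [is_name, is_name_again, is_mbox])
```

Here `is_name_again` is skipped because it produces the same fragments as `is_name`. Each candidate triggers at most one pass over the predicates chosen so far, which gives a quadratic number of redundancy checks in the number of predicates.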


Optimal Horizontal Partitioning
P1: Predicate = foaf:name
P2: Predicate = foaf:mbox

Minterm 00: Predicate != foaf:mbox AND Predicate != foaf:name
Minterm 01: Predicate != foaf:mbox AND Predicate = foaf:name
Minterm 10: Predicate = foaf:mbox AND Predicate != foaf:name
Minterm 11: Predicate = foaf:mbox AND Predicate = foaf:name

● The algorithm is O(n^2) in the number of
predicates.
● Even though there is an exponential
number of minterms, many will not be
satisfiable.


Simplified minterms:
Minterm 01: Predicate = foaf:name
Minterm 10: Predicate = foaf:mbox

Allocating the fragments
● Fragments have access frequencies derived
from their provenance and might join in the
query load.

Minterm 01: Predicate = foaf:name from A
Minterm 10: Predicate = foaf:mbox from B

[Figure: global query graph with nodes A: ?x foaf:name ?name (Freq: 2), B: ?x foaf:mbox ?mbox (Freq: 2) and C: ?x foaf:knows yago:John_Doe (Freq: 1)]

● Allocate fragments to hosts so that:
– They are on the same host if they can join in the
query load.
– Hosts receive approximately the same load.


Allocating the fragments
● Sort fragments in descending order of load
● For every fragment, calculate the benefit of
assigning it to every host.

\[ TL = \sum_{g} F_g \times S_g \qquad UH = \frac{TL}{n} \]

\[ \mathrm{benefit}(f, H) = \frac{UH}{UH + CL_H} \times \sum_{g \in H} \left[ E_{f,g} + 1 \right] \]

$S_g$ = Size of fragment g; $F_g$ = Frequency of access of fragment g
$n$ = Number of hosts
$\mathrm{benefit}(f, H)$ = Benefit of assigning fragment f to host H
$CL_H$ = Current load of host H (from fragments assigned so far)
$E_{f,g}$ = Weight between fragments f and g in the global query load graph

● Assign it to the most beneficial host
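
A hedged Python sketch of the greedy allocation. It assumes the benefit reads UH / (UH + CL_H) × Σ_{g∈H} [E(f, g) + 1], with TL = Σ F_g × S_g and UH = TL / n. The slide does not say how an empty host is scored, so the sketch assumes a sum of 1 in that case. The fragment sizes are hypothetical, and the edge weights are those of the global query graph carried over to the fragments each pattern is relevant to.

```python
def allocate(fragments, edges, n_hosts):
    """fragments: {name: (frequency, size)}; edges: {frozenset({f, g}): weight}."""
    load = {f: freq * size for f, (freq, size) in fragments.items()}
    uniform = sum(load.values()) / n_hosts  # UH = TL / n
    hosts = [[] for _ in range(n_hosts)]
    current = [0.0] * n_hosts  # CL_H for every host

    def benefit(f, h):
        affinity = sum(edges.get(frozenset((f, g)), 0) + 1 for g in hosts[h])
        if not hosts[h]:
            affinity = 1  # ASSUMPTION: keeps empty hosts attractive
        return uniform / (uniform + current[h]) * affinity

    # Visit fragments from the heaviest to the lightest load.
    for f in sorted(load, key=load.get, reverse=True):
        best = max(range(n_hosts), key=lambda h: benefit(f, h))
        hosts[best].append(f)
        current[best] += load[f]
    return hosts


# Hypothetical frequencies and sizes for the three example fragments.
fragments = {"01": (2, 10), "10": (2, 10), "00": (1, 300)}
edges = {
    frozenset(("01", "10")): 2,
    frozenset(("01", "00")): 1,
    frozenset(("10", "00")): 1,
}
placement = allocate(fragments, edges, n_hosts=2)
```

With these made-up numbers the large fragment 00 ends up alone on one host, while the two fragments that join most often, 01 and 10, share the other.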


Evaluating query complexity

● Local query graph
PREFIX yago: <http://www.mpii.de/yago/..>
SELECT ?name
WHERE{
?z foaf:name ?name .
?z foaf:mbox ?mbox .
?z foaf:knows yago:John_Doe .
}

[Figure: local query graph with one node per triple pattern (?z foaf:name ?name, ?z foaf:mbox ?mbox, ?z foaf:knows yago:John_Doe)]


Evaluating the fragmentation
● Distributed query graph for a query Q
obtained from global query graph +
fragments definition + Q query graph

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX yago: <http://www.mpii.de/yago/..>
SELECT ?name
WHERE {
A ?z foaf:name ?name .
B ?z foaf:mbox ?mbox .
C ?z foaf:knows yago:John_Doe .
}

Host 1
– Fragment 10: Predicate = foaf:mbox (relevant to B)
– Fragment 01: Predicate = foaf:name (relevant to A)
Host 2
– Fragment 00: Predicate != foaf:mbox AND Predicate != foaf:name (relevant to C)

[Figure: distributed query graph; the legend distinguishes local edges from remote edges]

Preliminary results

● Metrics to evaluate query complexity:
– Number of edges in local query graph
– Number of remote edges in the distributed
query graph
● Metrics to evaluate fragmentation quality for a
query
– Number of local edges in the distributed query
graph
– Number of hosts required to answer the query


Preliminary results
Number of hosts: 5

Run  Dataset         Description                                 File size (MB)  # triples  # sub-queries  # predicates
1    Subset DBpedia  DBpedia foaf information (names and dates)  136             1745624    10             9
2    Subset DBpedia  DBpedia foaf information (names and dates)  136             1745624    10             10
3    YAGO Core       YAGO Core dataset                           2662.4          26227687   9              21
4    YAGO sample     RDF-3x YAGO dump sample                     3276.8          35238246   19             35

[Figure: Edge count in local query graphs vs. number of contacted hosts, per run of the algorithm. Series: hosts contacted, edges in local query graph (averages); x-axis: runs 1 to 4.]

[Figure: Local edges vs. remote edges in the distributed query graph, per run of the algorithm. Series: average local edges, average remote edges; x-axis: runs 1 to 4.]


Conclusions

● Use of standard techniques from relational
databases
● Method independent of the actual storage
implementation.
– Huge 3-column table abstraction
● It can be easily extended to support
redundancy.
● Applicable to evolving query loads
– By changing the level of constants
normalization

Future work
● Evaluate quality of partitioning
– Using real execution costs: this needs a
distributed index + query planner +
distributed cost model
– Against other approaches (e.g., fragmentation
by predicate)
● Evaluate greedy allocation algorithm
– Against optimal solution, round robin, etc.
● Use of estimates for fragment sizes
– So far extracted via queries.

Thanks for your attention

