Query-load aware partitioning of RDF datasets using standard fragmentation techniques from relational databases, aiming to give insight into the advantages of a proper fragmentation scheme for efficient query processing in large semantic databases.
Query-Load aware partitioning of RDF data
1. Query-Load aware partitioning
of RDF data
Luis Galárraga
Saarbrücken, July 4th 2011
July 4th, 2011 Query load aware partitioning of RDF data 1/37
2. Outline
● Motivation & background
● Fragmentation in databases
● Observations & goals
● Proposed methodology
● Preliminary results
5. Motivation
● Increasing interest in semantic
representations for knowledge.
– Increasing number of data providers (e.g.
Linked Data initiative)
– Semantic Web: “Web of knowledge”
– Growing data sources (e.g. Wikipedia)
● Need for efficient query processing
– Centralized solutions might become infeasible
as data steadily grows.
– Taking advantage of parallelism can help
improve performance.
6. Data keeps growing
[Figure: DBpedia dataset size growth. Size in GB (0 to 40) plotted against date (10/2006 to 04/2012).]
● DBpedia 3.6: 3.5 million resources, 0.5 billion facts (http://dbpedia.org)
● Semantic Web Challenge 2011: 2 billion triples, 20 GB dataset (http://challenge.semanticweb.org)
8. RDF and triple stores
● The Resource Description Framework (RDF) is a
language to represent knowledge about
resources (things).
– Resources are identified by URIs
<http://www.mpii.de/yago/resource/John_Doe>
● It uses statements or triples
PREFIX yago: <http://www.mpii.de/yago/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
yago:John_Doe   foaf:name     "John Doe"
(Subject)       (Predicate)   (Object)
9. RDF and triple stores
● Data in a triple store can be seen as a data
graph or a huge three-column relation.
Subject               Predicate    Object
yago:John_Doe         foaf:name    "John Doe"
yago:John_Doe         foaf:knows   yago:Max_Mustermann
yago:Max_Mustermann   foaf:name    "Max Mustermann"
[Figure: the same three triples drawn as a data graph, with yago:John_Doe and yago:Max_Mustermann as nodes linked by foaf:knows, each with a foaf:name edge to its literal.]
● Existing solutions like Jena or Sesame use
some variation of the three-column relation.
10. How to query RDF?
● Use of data graph abstraction.
– SQL designed for relational databases
● SPARQL defines queries as subgraph
patterns to be matched within the data graph.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX yago: <http://www.mpii.de/yago/resource/>
SELECT ?name
WHERE {
  ?person a foaf:Person .
  ?person foaf:knows yago:Max_Mustermann .
  ?person foaf:name ?name .
}
[Figure: data graph with nodes yago:John_Doe, yago:Max_Mustermann, foaf:Person, "John Doe" and "Max Mustermann", connected by a, foaf:knows and foaf:name edges; the query pattern matches the subgraph around yago:John_Doe.]
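To make the idea of matching a subgraph pattern concrete, here is a minimal sketch in plain Python. It is not how a SPARQL engine works internally; it only evaluates a list of triple patterns against a small in-memory list of triples. The function names (match_pattern, evaluate) and the sample data are assumptions made for illustration.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must map to the same value everywhere in the query.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match the triple component exactly.
            return None
    return extended


def evaluate(patterns, triples):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Illustrative data graph (hypothetical content, prefixed names kept as strings).
triples = [
    ("yago:John_Doe", "a", "foaf:Person"),
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:Max_Mustermann", "a", "foaf:Person"),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
]
query = [
    ("?person", "a", "foaf:Person"),
    ("?person", "foaf:knows", "yago:Max_Mustermann"),
    ("?person", "foaf:name", "?name"),
]
names = [b["?name"] for b in evaluate(query, triples)]
```

Each pattern that shares ?person with another one acts as a join, which is why RDF query loads are join oriented.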
12. Fragmentation in databases
● Why? To exploit the processing power of
multiple nodes by decomposing operations
into parallel sub-operations.
● In relational databases:
– Horizontal fragmentation [Dimovski, 2010]
– Vertical fragmentation [Hoffer, 1975]
– Workload driven [Curino, 2010]
● It has to be combined with an allocation
strategy (assignment of fragments to hosts).
14. Workload-driven fragmentation
● Relationships between tuples as a graph.
– A node per tuple. They share an edge if they
are required by the same transaction.
● Partition the graph
● Try to keep
transactions as
local as possible
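The tuple graph described above can be sketched as follows. This is only an illustration under simple assumptions: transactions are given as lists of tuple identifiers, edge weights count how many transactions touch both tuples, and cut_weight (a hypothetical helper) scores a given assignment instead of computing a partition. An actual system would hand the graph to a graph partitioner.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(transactions):
    """Build weighted edges between tuples accessed by the same transaction."""
    edges = Counter()
    for accessed in transactions:
        for a, b in combinations(sorted(set(accessed)), 2):
            edges[(a, b)] += 1
    return edges


def cut_weight(edges, assignment):
    """Total weight of edges whose endpoints live in different partitions."""
    return sum(
        weight
        for (a, b), weight in edges.items()
        if assignment[a] != assignment[b]
    )


transactions = [["t1", "t2"], ["t2", "t3"], ["t1", "t2", "t3"]]
edges = co_access_graph(transactions)
# Keeping t1 and t2 together cuts the two edges that touch t3.
cost = cut_weight(edges, {"t1": 0, "t2": 0, "t3": 1})
```

A lower cut weight means fewer transactions span partitions, which is the sense in which transactions stay local.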
16. Observations
● RDF query load
– Updates and insertions are rare
– Join oriented
● Data graph
– Subjects more selective than objects which are
more selective than predicates.
– Constants unstable for fragmentation.
● Distributed Query Processing
– Communication costs dominate distributed
transactions
17. Goals
● Fragment RDF dataset based on a workload
to guarantee:
● Small latency
– Limit communication costs by maximizing
local transactions but keeping parallelism
● High throughput
● Scalability
● Load balancing
– Allocate fragments such that hosts get
approximately the same load.
19. Proposed methodology
Partitioning phase
Determine a complete and non-redundant
fragmentation of the triple store using a minimal
set of predicates extracted from the query load.
Allocation phase
Assign fragments to hosts to guarantee load
balancing
20. Normalizing the query load
● Extract independent sub-queries.
– We still want independent subqueries to run in
parallel.
● Normalize triple patterns:
– Turn infrequent URIs or literals into variables.
– Capture patterns of access
– Not applicable to data types with a reduced
value space (e.g. xsd:boolean = {true, false})
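The normalization of triple patterns could look roughly like the sketch below. The frequency threshold min_freq, the keep set for values that should never be generalized (for instance booleans), and the ?cN naming of fresh variables are all assumptions for illustration; the slides do not specify them.

```python
from collections import Counter


def normalize(queries, min_freq, keep=frozenset()):
    """Replace constants seen fewer than `min_freq` times with fresh variables."""
    counts = Counter(
        term
        for query in queries
        for pattern in query
        for term in pattern
        if not term.startswith("?")
    )
    normalized = []
    for query in queries:
        fresh = {}  # one fresh variable per infrequent constant, per query
        new_query = []
        for pattern in query:
            new_pattern = []
            for term in pattern:
                if term.startswith("?") or term in keep or counts[term] >= min_freq:
                    new_pattern.append(term)
                else:
                    # Assumes no existing variable is already named ?c0, ?c1, ...
                    new_pattern.append(fresh.setdefault(term, f"?c{len(fresh)}"))
            new_query.append(tuple(new_pattern))
        normalized.append(new_query)
    return normalized


queries = [
    [("?x", "foaf:knows", "yago:John_Doe")],
    [("?y", "foaf:knows", "yago:Jane_Roe")],
    [("?z", "foaf:knows", "yago:John_Doe")],
]
result = normalize(queries, min_freq=2)
```

Here yago:Jane_Roe (a made-up resource) occurs once and becomes a variable, while yago:John_Doe occurs twice and stays a constant, so the access pattern is kept without tying the fragmentation to a rare value.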
22. Extracting predicates
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX yago: <http://www.mpii.de/yago/..>
Query 1:
SELECT ?name
WHERE {
  A: ?x foaf:name ?name .
  B: ?x foaf:mbox ?mbox
}
Query 2:
SELECT ?name
WHERE {
  A: ?z foaf:name ?name .
  B: ?z foaf:mbox ?mbox .
  C: ?z foaf:knows yago:John_Doe .
}
Extracted predicates:
P1: Predicate = foaf:name from A
P2: Predicate = foaf:mbox from B
P3: Predicate = foaf:knows from C
P4: Object = yago:John_Doe from C
● Remember where the
predicates come from.
● Store join relationships
between patterns
among the queries:
Global Query Graph
Nodes:
A: ?x foaf:name ?name             Freq: 2
B: ?x foaf:mbox ?mbox             Freq: 2
C: ?x foaf:knows yago:John_Doe    Freq: 1
Edges (join weight):
A – B: 2
A – C: 1
B – C: 1
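A global query graph of this kind can be derived mechanically from the normalized queries. The sketch below is one possible way to do it, assuming that two patterns join when they share a variable within the same query and that patterns are identified up to variable renaming; canonical and global_query_graph are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def variables(pattern):
    return {term for term in pattern if term.startswith("?")}


def canonical(pattern):
    """Rename variables by position so ?x and ?z patterns compare equal."""
    names = {}
    return tuple(
        names.setdefault(term, f"?v{len(names)}") if term.startswith("?") else term
        for term in pattern
    )


def global_query_graph(queries):
    """Return (pattern frequencies, join-edge weights) over a query load."""
    freq, joins = Counter(), Counter()
    for query in queries:
        for pattern in query:
            freq[canonical(pattern)] += 1
        for p, r in combinations(query, 2):
            if variables(p) & variables(r):
                joins[frozenset((canonical(p), canonical(r)))] += 1
    return freq, joins


q1 = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]
q2 = [
    ("?z", "foaf:name", "?name"),
    ("?z", "foaf:mbox", "?mbox"),
    ("?z", "foaf:knows", "yago:John_Doe"),
]
freq, joins = global_query_graph([q1, q2])
A = canonical(q2[0])
B = canonical(q2[1])
C = canonical(q2[2])
```

For the two example queries this yields frequency 2 for A and B, frequency 1 for C, and join weights of 2 for A–B and 1 for the edges involving C.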
23. Minterms & Fragments
● Conjunctive expressions over a set of
predicates, e.g.:
P1: Predicate = foaf:name
P2: Predicate = foaf:mbox
Minterm 00 = ~P1 ^ ~P2
Minterm 01 = ~P1 ^ P2
Minterm 10 = P1 ^ ~P2
Minterm 11 = P1 ^ P2
● A minterm defines a fragment.
– Set of triples satisfying the logical function
● The set of all possible minterms determines a
non-redundant and complete fragmentation.
– But we want a minimal set of predicates.
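As a small illustration of why the minterms give a complete and non-redundant fragmentation, the sketch below assigns every triple to the minterm whose truth values it satisfies. Predicates are modeled as plain Python functions over (subject, predicate, object) tuples, fragments are keyed by the tuple of truth values (P1, P2), and the sample triples are made up.

```python
from itertools import product


def minterm_fragments(triples, predicates):
    """Map each truth-value combination (minterm) to the triples satisfying it."""
    fragments = {
        bits: [] for bits in product((False, True), repeat=len(predicates))
    }
    for triple in triples:
        bits = tuple(predicate(triple) for predicate in predicates)
        fragments[bits].append(triple)
    return fragments


p1 = lambda t: t[1] == "foaf:name"   # P1: Predicate = foaf:name
p2 = lambda t: t[1] == "foaf:mbox"   # P2: Predicate = foaf:mbox

triples = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "foaf:mbox", "mailto:john@example.org"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
]
fragments = minterm_fragments(triples, [p1, p2])
```

Every triple evaluates to exactly one combination of truth values, so it lands in exactly one fragment. The fragment for (True, True) stays empty because a triple has a single predicate.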
24. Optimal Horizontal Fragmentation
● A predicate is redundant if the
fragmentation is insensitive to its presence
or absence.
● Start with an empty set
● For every extracted predicate:
– Add it to the set and fragment the
database building the minterms
– If the predicate is redundant, ignore it.
– If it is not redundant, check whether it made
previously added predicates redundant.
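The incremental selection above might be sketched as follows. The redundancy test used here is one reading of "insensitive to its presence or absence": a predicate is treated as redundant when removing it leaves the partition of the triples unchanged. That interpretation, the function names, and the in-memory fragmentation are assumptions; the sketch performs a quadratic number of redundancy checks in the number of predicates.

```python
def fragmentation(triples, predicates):
    """Partition triples by the truth values of all predicates (their minterm)."""
    groups = {}
    for triple in triples:
        signature = tuple(test(triple) for _, test in predicates)
        groups.setdefault(signature, set()).add(triple)
    return {frozenset(group) for group in groups.values()}


def is_redundant(triples, predicates, index):
    """True if dropping predicates[index] leaves the fragmentation unchanged."""
    without = predicates[:index] + predicates[index + 1:]
    return fragmentation(triples, predicates) == fragmentation(triples, without)


def minimal_predicates(triples, candidates):
    chosen = []
    for candidate in candidates:
        trial = chosen + [candidate]
        if is_redundant(triples, trial, len(trial) - 1):
            continue  # the new predicate does not change the fragments
        chosen = trial
        i = 0
        while i < len(chosen) - 1:  # re-check previously added predicates
            if is_redundant(triples, chosen, i):
                del chosen[i]
            else:
                i += 1
    return [name for name, _ in chosen]


triples = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "foaf:mbox", "mailto:john@example.org"),
    ("yago:Max_Mustermann", "foaf:knows", "yago:John_Doe"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
]
candidates = [
    ("P1", lambda t: t[1] == "foaf:name"),
    ("P2", lambda t: t[1] == "foaf:mbox"),
    ("P3", lambda t: t[1] == "foaf:knows"),
    ("P4", lambda t: t[2] == "yago:John_Doe"),
]
selected = minimal_predicates(triples, candidates)
```

On this toy data P3 is skipped, because once P1 and P2 are present the foaf:knows triples already form their own fragment, while P4 is kept because it splits that fragment further.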
25. Optimal Horizontal Partitioning
P1: Predicate = foaf:name
P2: Predicate = foaf:mbox
Minterm 00: Predicate != foaf:mbox AND Predicate != foaf:name
Minterm 01: Predicate != foaf:mbox AND Predicate = foaf:name
Minterm 10: Predicate = foaf:mbox AND Predicate != foaf:name
Minterm 11: Predicate = foaf:mbox AND Predicate = foaf:name
● The algorithm is O(n^2) in the number of
predicates.
● Even though there is an exponential
number of minterms, many will not be
satisfiable.
27. Optimal Horizontal Partitioning
P1: Predicate = foaf:name
P2: Predicate = foaf:mbox
Minterm 11 is unsatisfiable (a triple has exactly one predicate), so it is
dropped and the remaining minterms simplify to:
Minterm 00: Predicate != foaf:mbox AND Predicate != foaf:name
Minterm 01: Predicate = foaf:name
Minterm 10: Predicate = foaf:mbox
● The algorithm is O(n^2) in the number of
predicates.
● Even though there is an exponential
number of minterms, many will not be
satisfiable.
28. Allocating the fragments
● Fragments have access frequencies derived
from their provenance and might join in the
query load.
Minterm 01: Predicate = foaf:name from A
Minterm 10: Predicate = foaf:mbox from B
[Figure: global query graph. Nodes A: ?x foaf:name ?name (Freq: 2), B: ?x foaf:mbox ?mbox (Freq: 2), C: ?x foaf:knows yago:John_Doe (Freq: 1). Edge weights: A – B: 2, A – C: 1, B – C: 1.]
● Allocate fragments to hosts so that:
– They are in the same host if they can join in the
query load.
– Hosts receive approximately the same load.
29. Allocating the fragments
● Sort fragments in descending order of load
● For every fragment, calculate the benefit of
assigning it to every host.
$$TL = \sum_{g} F_g \times S_g \qquad UH = \frac{TL}{n}$$
$$\mathrm{benefit}(f, H) = \frac{UH}{UH + CL_H} \times \sum_{g \in H} \left[ E_{f,g} + 1 \right]$$
S_g = size of fragment g; F_g = frequency of access of fragment g
n = number of hosts
TL = total load over all fragments; UH = load per host under a uniform distribution
benefit(f, H) = benefit of assigning fragment f to host H
CL_H = current load of host H (from fragments assigned so far)
E_{f,g} = weight between fragments f and g in the global query load graph
● Assign it to the most beneficial host
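A greedy allocation in the spirit of these steps can be sketched as below. The benefit used here is an assumed variant, UH / (UH + CL_H) × (1 + Σ_{g∈H} E_{f,g}), with the constant 1 placed outside the sum so that an empty host still has a positive benefit. The fragment sizes and the name allocate are made up for illustration, and ties go to the lowest-numbered host.

```python
def allocate(fragments, edges, n_hosts):
    """Greedily assign fragments to hosts.

    fragments: {name: (size, access_frequency)}
    edges:     {frozenset({f, g}): join weight in the global query graph}
    """
    load = {f: size * freq for f, (size, freq) in fragments.items()}
    uniform = sum(load.values()) / n_hosts      # UH = TL / n
    hosts = [[] for _ in range(n_hosts)]
    current = [0.0] * n_hosts                   # CL_H per host

    def benefit(f, h):
        joins = sum(edges.get(frozenset((f, g)), 0) for g in hosts[h])
        # Assumed variant: the +1 sits outside the sum (see lead-in).
        return uniform / (uniform + current[h]) * (1 + joins)

    # Heaviest fragments first; Python's sort is stable for equal loads.
    for f in sorted(load, key=load.get, reverse=True):
        best = max(range(n_hosts), key=lambda h: benefit(f, h))
        hosts[best].append(f)
        current[best] += load[f]
    return hosts


fragments = {
    "f01_name": (10, 2),    # hypothetical sizes, frequencies from A, B, C
    "f10_mbox": (10, 2),
    "f00_rest": (100, 1),
}
edges = {
    frozenset(("f01_name", "f10_mbox")): 2,
    frozenset(("f01_name", "f00_rest")): 1,
    frozenset(("f10_mbox", "f00_rest")): 1,
}
hosts = allocate(fragments, edges, n_hosts=2)
```

With these numbers the large remainder fragment ends up alone on one host and the two strongly joined fragments share the other, trading one remote join for a more even load.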
32. Evaluating the fragmentation
● Distributed query graph for a query Q
obtained from global query graph +
fragments definition + Q query graph
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX yago: <http://www.mpii.de/yago/..>
SELECT ?name
WHERE {
  A: ?z foaf:name ?name .
  B: ?z foaf:mbox ?mbox .
  C: ?z foaf:knows yago:John_Doe .
}
Host 1:
– Fragment 10: Predicate = foaf:mbox (relevant to B)
– Fragment 01: Predicate = foaf:name (relevant to A)
Host 2:
– Fragment 00: Predicate != foaf:mbox AND Predicate != foaf:name (relevant to C)
[Figure: distributed query graph. An edge between fragments on the same host is a local edge; an edge between fragments on different hosts is a remote edge.]
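Counting local and remote edges for a query can be sketched as follows, under the simplifying assumption that each triple pattern is relevant to exactly one fragment. The function name classify_edges and the dictionary layout are hypothetical.

```python
def classify_edges(query_edges, pattern_fragment, fragment_host):
    """Return (local edges, remote edges, hosts contacted) for one query.

    query_edges:      pairs of joined patterns in the query graph
    pattern_fragment: fragment relevant to each pattern (assumed unique)
    fragment_host:    host each fragment was allocated to
    """
    local = remote = 0
    # Every pattern contributes its host, even if it joins with nothing.
    hosts = {fragment_host[frag] for frag in pattern_fragment.values()}
    for p, r in query_edges:
        host_p = fragment_host[pattern_fragment[p]]
        host_r = fragment_host[pattern_fragment[r]]
        if host_p == host_r:
            local += 1
        else:
            remote += 1
    return local, remote, len(hosts)


query_edges = [("A", "B"), ("A", "C"), ("B", "C")]
pattern_fragment = {"A": "01", "B": "10", "C": "00"}
fragment_host = {"01": 1, "10": 1, "00": 2}
local, remote, contacted = classify_edges(
    query_edges, pattern_fragment, fragment_host
)
```

For the example allocation the join between A and B stays local, while both joins with C cross hosts.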
33. Preliminary results
● Metrics to evaluate query complexity:
– Number of edges in local query graph
– Number of remote edges in the distributed
query graph
● Metrics to evaluate fragmentation quality for a
query
– Number of local edges in the distributed query
graph
– Number of hosts required to answer the query
34. Preliminary results
Number of hosts: 5
Run  Dataset         Dataset description                          File size (MB)  # triples  # sub-queries  # predicates
1    Subset DBpedia  DBpedia foaf information (names and dates)   136             1745624    10             9
2    Subset DBpedia  DBpedia foaf information (names and dates)   136             1745624    10             10
3    YAGO Core       YAGO Core dataset                            2662.4          26227687   9              21
4    YAGO sample     RDF-3x YAGO dump sample                      3276.8          35238246   19             35
[Figure: Edge count in local query graphs vs. number of contacted hosts, per run of the algorithm. Series: hosts contacted, edges in local query graph. Y axis: average; X axis: runs 1 to 4.]
[Figure: Local edges vs. remote edges in the distributed query graph, per run of the algorithm. Series: average local edges, average remote edges. X axis: runs 1 to 4.]
35. Conclusions
● Use of standard techniques from relational
databases
● Method independent from actual storage
implementation.
– Huge three-column table abstraction
● It can be easily extended to support
redundancy.
● Applicable to evolving query loads
– By changing the level of constants
normalization
36. Future work
● Evaluate quality of partitioning
– Using real execution costs: need of a
distributed index + query planner +
distributed cost model
– Against other approaches (e.g. fragmentation
by predicate)
● Evaluate greedy allocation algorithm
– Against optimal solution, round robin, etc.
● Use of estimates for fragment sizes
– So far extracted via queries.
37. Thanks for your attention