1. Federated SPARQL Query Processing
Over the Web of Data
Muhammad Saleem
Tutorial at ISWC 2015, Bethlehem, USA
Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig,
Germany, 11/10/2015
2. Agenda
• SPARQL Query Federation Approaches
• SPARQL Query Federation Optimization
– Source Selection
– Data Integration Options
– Join Order Selection
– Join Order Optimization
– Join Implementations
• Performance Metrics and Discussion
4. SPARQL Endpoint Federation Approaches
• Most commonly used approaches
• Make use of SPARQL endpoint URLs
• Fast query execution
• RDF data needs to be exposed via SPARQL
endpoints
• E.g., HiBISCuS, FedX, SPLENDID, ANAPSID, LHD,
TopFed, QUETSAL, etc.
5. Linked Data Federation Approaches
• Data need not be exposed via SPARQL endpoints
• Use URI lookups at runtime
• Data should follow Linked Data principles
• Slower than SPARQL endpoint federation approaches
• E.g., LDQPS, SIHJoin, WoDQA, etc.
6. Linked Data Fragments Federation
• Federation over Linked Data Fragments
• Explained in detail in an upcoming session
7. Query federation on top of Distributed Hash Tables
• Uses DHT indexing to federate SPARQL queries
• Space efficient
• Cannot deal with the whole LOD cloud
• E.g., ATLAS
8. Hybrid
• Federation over SPARQL endpoints and Linked
Data
• Can potentially deal with the whole LOD cloud
• E.g., ADERIS-Hybrid (of SEF+LDF)
16. Types of Source Selection
• Index-free
– Using SPARQL ASK queries
– No index maintenance required
– Potentially ensures result set completeness
– SPARQL ASK queries can be expensive
– Can use a cache to store recent SPARQL ASK query results
– E.g., FedX
• Index-only
– Make use only of an index/data summaries
– Fast source selection, but less efficient: tends to overestimate the relevant data sources
– Result set completeness is not ensured
– E.g., DARQ, LHD
• Hybrid
– Make use of an index plus SPARQL ASK queries
– Most efficient
– Result set completeness is not ensured
– Can use a cache to store recent SPARQL ASK query results
– E.g., HiBISCuS, ANAPSID, SPLENDID
17. Index-free Source Selection
Input: SPARQL query Q, set of all data sources D
Output: Map M from each triple pattern to its relevant data sources
M = {};
for each triple pattern ti in SPARQL query Q
    Ri = {}; // set of relevant data sources for triple pattern ti
    for each data source dj in D
        if SPARQL ASK(dj, ti) = true
            Ri = Ri ∪ {dj};
        end if
    end for
    M = M ∪ {(ti, Ri)};
end for
return M
What is the total number of SPARQL ASK requests used?
Total number of triple patterns × total number of data sources
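The loop above can be sketched in Python as follows. This is a minimal illustration, not any engine's implementation: the `ask` callback stands in for a remote SPARQL ASK request, and `stub_ask` simply replays the per-pattern results reported for the FedBench LD3 example in this tutorial (S1 to S9 are the nine sources of that example).

```python
def index_free_source_selection(patterns, sources, ask):
    """Return (pattern -> relevant sources, number of ASK requests sent)."""
    relevant, requests = {}, 0
    for tp in patterns:
        relevant[tp] = []
        for src in sources:
            requests += 1  # one ASK request per (triple pattern, source) pair
            if ask(src, tp):
                relevant[tp].append(src)
    return relevant, requests


SOURCES = [f"S{i}" for i in range(1, 10)]
LD3 = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]
# Assumption: ASK answers replayed from the LD3 result slide, not live endpoints.
ASK_ANSWERS = {
    LD3[0]: {"S1"},
    LD3[1]: {"S1"},
    LD3[2]: {"S1"},
    LD3[3]: {"S4"},
    LD3[4]: set(SOURCES) - {"S3"},
}


def stub_ask(source, tp):
    return source in ASK_ANSWERS[tp]


selected, ask_requests = index_free_source_selection(LD3, SOURCES, stub_ask)
total_selected = sum(len(srcs) for srcs in selected.values())
print(ask_requests, total_selected)  # prints: 45 12
```

The request count depends only on the number of triple patterns and sources (5 × 9 here), regardless of how many sources turn out to be relevant.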
18. Index-free Source Selection
FedBench (LD3): Return for all US presidents their party
membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President .
?president dbpedia:nationality dbpedia:United_States .
?president dbpedia:party ?party .
?x nyt:topicPage ?page .
?x owl:sameAs ?president .
}
[Figure: source selection algorithm over nine data sources, S1 to S9: DBpedia, KEGG, ChEBI, NYT, SWDF, LMDB, Jamendo, GeoNames, DrugBank. Triple patterns are labelled TP1 to TP5 in query order.]
Triple pattern-wise source selection:
TP1 = S1
22. Index-free Source Selection
FedBench (LD3) query and data sources S1 to S9 as on slide 18.
Triple pattern-wise source selection:
TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1, S2, S4-S9
Total number of SPARQL ASK requests used = 45
Total triple pattern-wise sources selected = 12
23. Index-only Source Selection (LHD)
Input: SPARQL query Q, set of all data sources D, data source index I storing all distinct predicates for
all data sources in D
Output: Map M from each triple pattern to its relevant data sources
M = {};
for each triple pattern ti in SPARQL query Q
    Ri = {}; // set of relevant data sources for triple pattern ti
    p = Pred(ti); // predicate of ti
    if (bound(p))
        Ri = Lookup(I, p); // index lookup for predicate of ti
    else
        Ri = D; // all data sources are relevant
    end if
    M = M ∪ {(ti, Ri)};
end for
return M
Why is this the less efficient approach (i.e., why does it greatly overestimate the relevant data sources)?
• Source selection is based only on the predicate of each triple pattern
• All data sources are simply selected for triple patterns with unbound predicates
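A minimal Python sketch of the predicate-index lookup described above. The index contents are an assumption chosen to match the per-pattern results reported for the LD3 example in this tutorial; a real index would be built from the data sources themselves.

```python
def is_bound(term):
    """A term is bound unless it is a variable such as ?x."""
    return not term.startswith("?")


def index_only_source_selection(patterns, sources, index):
    """Select sources using only a predicate -> sources index (no ASK requests)."""
    relevant = {}
    for tp in patterns:
        predicate = tp[1]
        if is_bound(predicate):
            relevant[tp] = sorted(index.get(predicate, set()))
        else:
            relevant[tp] = list(sources)  # unbound predicate: every source qualifies
    return relevant


SOURCES = [f"S{i}" for i in range(1, 10)]
LD3 = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]
# Assumption: index contents taken from the LD3 result slide.
PREDICATE_INDEX = {
    "rdf:type": set(SOURCES),
    "dbpedia:nationality": {"S1"},
    "dbpedia:party": {"S1"},
    "nyt:topicPage": {"S4"},
    "owl:sameAs": set(SOURCES) - {"S3"},
}

selected = index_only_source_selection(LD3, SOURCES, PREDICATE_INDEX)
total_selected = sum(len(srcs) for srcs in selected.values())
print(total_selected)  # prints: 20
```

Because the lookup ignores the bound object `dbpedia:President`, the `rdf:type` pattern selects all nine sources, which is where the overestimation comes from.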
24. Index-only Source Selection
FedBench (LD3): Return for all US presidents their party
membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President .
?president dbpedia:nationality dbpedia:United_States .
?president dbpedia:party ?party .
?x nyt:topicPage ?page .
?x owl:sameAs ?president .
}
[Figure: source selection algorithm over nine data sources, S1 to S9: DBpedia, KEGG, ChEBI, NYT, SWDF, LMDB, Jamendo, GeoNames, DrugBank. Triple patterns are labelled TP1 to TP5 in query order.]
Triple pattern-wise source selection:
TP1 = S1-S9
28. Index-only Source Selection
FedBench (LD3) query and data sources S1 to S9 as on slide 24.
Triple pattern-wise source selection:
TP1 = S1-S9
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1, S2, S4-S9
Total number of SPARQL ASK requests used = 0
Total triple pattern-wise sources selected = 20
29. Hybrid Source Selection
Input: SPARQL query Q, set of all data sources D, data source index I storing all distinct predicates for all data
sources in D
Output: Map M from each triple pattern to its relevant data sources
M = {};
for each triple pattern ti in SPARQL query Q
    Ri = {}; // set of relevant data sources for triple pattern ti
    s = Subj(ti), p = Pred(ti), o = Obj(ti); // subject, predicate, and object of ti
    if (!bound(p) || bound(s) || bound(o))
        for each data source dj in D
            if SPARQL ASK(dj, ti) = true
                Ri = Ri ∪ {dj};
            end if
        end for
    else
        Ri = Lookup(I, p); // index lookup for predicate of ti
    end if
    M = M ∪ {(ti, Ri)};
end for
return M
What is the total number of SPARQL ASK requests used?
(Number of triple patterns with a bound subject, a bound object,
or an unbound predicate) × total number of data sources
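The hybrid rule can be sketched in Python as below. The index contents and the `stub_ask` answers are assumptions replayed from the LD3 results in this tutorial, so the sketch only illustrates how the ASK request count arises; it does not contact any endpoint.

```python
def is_bound(term):
    """A term is bound unless it is a variable such as ?x."""
    return not term.startswith("?")


def hybrid_source_selection(patterns, sources, index, ask):
    """Use ASK for patterns with a bound subject/object or unbound predicate,
    and the predicate index for all other patterns."""
    relevant, requests = {}, 0
    for tp in patterns:
        s, p, o = tp
        if not is_bound(p) or is_bound(s) or is_bound(o):
            relevant[tp] = []
            for src in sources:
                requests += 1
                if ask(src, tp):
                    relevant[tp].append(src)
        else:
            relevant[tp] = sorted(index.get(p, set()))
    return relevant, requests


SOURCES = [f"S{i}" for i in range(1, 10)]
LD3 = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]
# Assumption: index entries for the three patterns answered from the index.
PREDICATE_INDEX = {
    "dbpedia:party": {"S1"},
    "nyt:topicPage": {"S4"},
    "owl:sameAs": set(SOURCES) - {"S3"},
}


def stub_ask(source, tp):
    # Assumption: only S1 answers true for the two patterns with a bound object.
    return source == "S1"


selected, ask_requests = hybrid_source_selection(LD3, SOURCES, PREDICATE_INDEX, stub_ask)
total_selected = sum(len(srcs) for srcs in selected.values())
print(ask_requests, total_selected)  # prints: 18 12
```

Only the first two patterns have a bound object, so ASK requests drop from 45 to 2 × 9 = 18 while the selected sources stay at 12.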
30. Hybrid Source Selection
FedBench (LD3): Return for all US presidents their party
membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President .
?president dbpedia:nationality dbpedia:United_States .
?president dbpedia:party ?party .
?x nyt:topicPage ?page .
?x owl:sameAs ?president .
}
[Figure: source selection algorithm over nine data sources, S1 to S9: DBpedia, KEGG, ChEBI, NYT, SWDF, LMDB, Jamendo, GeoNames, DrugBank. Triple patterns are labelled TP1 to TP5 in query order.]
Triple pattern-wise source selection:
TP1 = S1
34. Hybrid Source Selection
FedBench (LD3) query and data sources S1 to S9 as on slide 30.
Triple pattern-wise source selection:
TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1, S2, S4-S9
Total number of SPARQL ASK requests used = 18
Total triple pattern-wise sources selected = 12
Does anything still need to be improved?
35. Source Selection
• Triple pattern-wise source selection
– Ensures 100% recall
– Can overestimate capable sources
– Can be expensive, e.g., in the total number of SPARQL ASK
requests used
– Performed by FedX, SPLENDID, LHD, DARQ, ADERIS, etc.
• Join-aware triple pattern-wise source selection
– Ensures 100% recall
– May select optimal or close-to-optimal capable sources
– Can be expensive, e.g., in the total number of SPARQL ASK
requests used
– Can significantly reduce the query execution time
– Performed by ANAPSID, HiBISCuS
36. HiBISCuS: Hypergraph-Based Source Selection for
SPARQL Endpoint Federation
• Hybrid source selection
• Join-aware triple pattern-wise source selection
• Makes use of the hypergraph representation of
SPARQL queries
• Makes use of URI authorities
• Makes use of a cache to store recent SPARQL
ASK query results
37. Motivation
FedBench (LD3): Return for all US presidents their party
membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President .
?president dbpedia:nationality dbpedia:United_States .
?president dbpedia:party ?party .
?x nyt:topicPage ?page .
?x owl:sameAs ?president .
}
[Figure: source selection algorithm over nine data sources, S1 to S9: DBpedia, KEGG, ChEBI, NYT, SWDF, LMDB, Jamendo, GeoNames, DrugBank. Triple patterns are labelled TP1 to TP5 in query order.]
Triple pattern-wise source selection:
TP1 = S1
38. Motivation
FedBench (LD3) query and data sources S1 to S9 as on slide 37.
Triple pattern-wise source selection:
TP1 = S1
TP2 = S1
45. Problem Statement
• Overestimation in triple pattern-wise source selection can
be expensive
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• How do we perform join-aware triple pattern-wise source
selection in a time-efficient way?
46. HiBISCuS: Key Concept
• Makes use of URI authorities
http://dbpedia.org/ontology/party
Scheme: http | Authority: dbpedia.org | Path: /ontology/party
For URI details: http://tools.ietf.org/html/rfc3986
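As a small illustration of the URI decomposition, the Python standard library can split a URI into the RFC 3986 components named above. The helper name `uri_authority` is hypothetical, and the sketch shows only the splitting step, not how HiBISCuS uses authorities during source selection.

```python
from urllib.parse import urlsplit


def uri_authority(uri):
    """Return the authority component of a URI (e.g. the host of an http URI)."""
    return urlsplit(uri).netloc


parts = urlsplit("http://dbpedia.org/ontology/party")
print(parts.scheme, parts.netloc, parts.path)  # prints: http dbpedia.org /ontology/party
```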
62. Complete Local Integration
• Triple patterns are individually and completely
evaluated against every endpoint
• Triple pattern results are integrated locally using
different join techniques, e.g., nested loop join (NLJ), hash join, etc.
• Less efficient if the query contains common
predicates such as rdf:type and owl:sameAs
• Retrieves a large number of potentially irrelevant
intermediate results
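To make the local integration step concrete, here is a minimal hash join over two lists of solution mappings, written in Python. The bindings are made-up placeholders, and the sketch assumes every mapping in a list binds the same variables; it is one possible local join, not a specific engine's operator.

```python
from collections import defaultdict


def hash_join(left, right):
    """Join two lists of solution mappings (dicts) on their shared variables."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    buckets = defaultdict(list)
    for row in left:  # build phase: hash the left input on the join variables
        buckets[tuple(row[v] for v in shared)].append(row)
    joined = []
    for row in right:  # probe phase: look up each right mapping
        for match in buckets.get(tuple(row[v] for v in shared), []):
            joined.append({**match, **row})
    return joined


# Hypothetical intermediate results fetched separately from two endpoints.
party_rows = [
    {"?president": "ex:A", "?party": "ex:P1"},
    {"?president": "ex:B", "?party": "ex:P2"},
]
same_as_rows = [
    {"?x": "ex:n1", "?president": "ex:A"},
    {"?x": "ex:n2", "?president": "ex:C"},
]
result = hash_join(party_rows, same_as_rows)
print(result)
```

Every mapping in both inputs has to be shipped to the federator first, which is why patterns with very common predicates make this strategy costly.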
63. Iterative Integration
• Evaluate the query iteratively, pattern by pattern
• Start with a single triple pattern
• Substitute mappings from the previous triple pattern
into the subsequent evaluation
• Evaluate the query in NLJ fashion
• NLJ can cause many remote requests
• Block NLJ reduces the number of remote requests
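The iterative strategy can be sketched as a nested loop over solution mappings. In this Python sketch an in-memory list of triples stands in for the remote endpoints (an assumption for illustration), and each evaluation of a substituted pattern is counted as one remote request.

```python
def substitute(tp, binding):
    """Replace variables in a triple pattern that are already bound."""
    return tuple(binding.get(term, term) for term in tp)


def match(tp, triple):
    """Return the variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for term, value in zip(tp, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def iterative_nlj(patterns, triples):
    solutions, requests = [{}], 0
    for tp in patterns:
        next_solutions = []
        for sol in solutions:
            requests += 1  # one simulated remote request per input mapping
            bound_tp = substitute(tp, sol)
            for triple in triples:
                b = match(bound_tp, triple)
                if b is not None:
                    next_solutions.append({**sol, **b})
        solutions = next_solutions
    return solutions, requests


# Hypothetical data standing in for the endpoints' contents.
TRIPLES = [
    ("a", "party", "p1"),
    ("b", "party", "p2"),
    ("n1", "sameAs", "a"),
    ("n2", "sameAs", "b"),
]
patterns = [("?pres", "party", "?party"), ("?x", "sameAs", "?pres")]
solutions, requests = iterative_nlj(patterns, TRIPLES)
print(len(solutions), requests)  # prints: 2 3
```

The request count grows with the number of intermediate mappings (1 for the first pattern plus 2 for the second here), which is the cost that block NLJ is meant to reduce.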
65. Join Order Selection
• Left-deep trees
– Joins take place in a left-to-right sequential order
– Result of the join is used as an outer input for the next join
– Used in FedX, DARQ
• Right-deep trees
– Joins take place in a right-to-left sequential order
– Result of the join is used as an inner input for the next join
• Bushy trees
– Joins take place in sub-trees on both the left and right sides
– Used in ANAPSID
• Dynamic programming
– Used in SPLENDID
66. Join Order Selection Example
Compute Micronutrients using Drugbank and KEGG
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-cat:micronutrient . # TP1
?drug drugbank:casRegistryNumber ?id . # TP2
?keggDrug rdf:type kegg:Drug . # TP3
?keggDrug bio2rdf:xRef ?id . # TP4
?keggDrug dc:title ?title . # TP5
}
[Figure: three alternative join trees over TP1 to TP5, each under the projection π ?drug, ?title: a left-deep tree, a right-deep tree, and a bushy tree]
Goal: Execute smallest cardinality joins first
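One simple way to approximate this goal is a greedy left-deep ordering: start with the triple pattern of smallest estimated cardinality and keep appending the smallest pattern that shares a variable with those already joined. The Python sketch below assumes made-up cardinality estimates and is a generic heuristic, not the planner of any engine named in this tutorial.

```python
QUERY = {
    "TP1": ("?drug", "drugbank:drugCategory", "drugbank-cat:micronutrient"),
    "TP2": ("?drug", "drugbank:casRegistryNumber", "?id"),
    "TP3": ("?keggDrug", "rdf:type", "kegg:Drug"),
    "TP4": ("?keggDrug", "bio2rdf:xRef", "?id"),
    "TP5": ("?keggDrug", "dc:title", "?title"),
}
# Assumption: illustrative cardinality estimates, not measured values.
ESTIMATED_CARDINALITY = {"TP1": 50, "TP2": 5000, "TP3": 8000, "TP4": 20000, "TP5": 9000}


def variables(tp):
    return {term for term in tp if term.startswith("?")}


def left_deep_order(query, cardinality):
    """Greedy join order that avoids cross products where possible."""
    remaining = set(query)
    order, bound = [], set()
    while remaining:
        connected = [name for name in remaining if variables(query[name]) & bound]
        candidates = connected or list(remaining)
        best = min(candidates, key=lambda name: cardinality[name])
        order.append(best)
        bound |= variables(query[best])
        remaining.remove(best)
    return order


order = left_deep_order(QUERY, ESTIMATED_CARDINALITY)
plan = order[0]
for name in order[1:]:
    plan = f"({plan} ⋈ {name})"  # each join result is the outer input of the next join
print(plan)  # prints: ((((TP1 ⋈ TP2) ⋈ TP4) ⋈ TP3) ⋈ TP5)
```

TP4 is placed before TP3 despite its larger estimate because it is the only pattern connected to the variables bound so far.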
68. Join Order Optimization
• Exclusive Groups
– Group triple patterns with the same relevant data source
– Evaluation in a single (remote) sub-query
– Push join to the data source, i.e., endpoint
• Variable count-heuristic
– Iteratively determine the join order based on the free-variable
count of triple patterns and groups
– Consider “resolved” variable mappings from earlier iterations
• Using Selectivities
– Store distinct predicates, avg. subject selectivities, and avg.
object selectivities for each predicate in the index
– Use the predicate count, avg. subject selectivities, and avg.
object selectivities to estimate the join cardinality
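The variable-count heuristic can be sketched as follows, assuming that fewer free variables means a more selective pattern and that ties are broken by input order (both assumptions of this sketch). Variables bound by an earlier pattern count as resolved in later iterations.

```python
def variables(tp):
    return {term for term in tp if term.startswith("?")}


def variable_count_order(names, query):
    """Repeatedly pick the pattern with the fewest still-free variables."""
    remaining = list(names)
    resolved, order = set(), []
    while remaining:
        # min() returns the first minimal element, so ties follow input order
        best = min(remaining, key=lambda name: len(variables(query[name]) - resolved))
        order.append(best)
        resolved |= variables(query[best])
        remaining.remove(best)
    return order


QUERY = {
    "TP1": ("?drug", "drugbank:drugCategory", "drugbank-cat:micronutrient"),
    "TP2": ("?drug", "drugbank:casRegistryNumber", "?id"),
    "TP3": ("?keggDrug", "rdf:type", "kegg:Drug"),
    "TP4": ("?keggDrug", "bio2rdf:xRef", "?id"),
    "TP5": ("?keggDrug", "dc:title", "?title"),
}
order = variable_count_order(["TP5", "TP4", "TP3", "TP2", "TP1"], QUERY)
print(order)  # prints: ['TP3', 'TP5', 'TP4', 'TP2', 'TP1']
```

TP3 goes first because it has a single free variable; once `?keggDrug` is resolved, TP5 and TP4 each have only one free variable left.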
69. Exclusive Groups
SELECT ?President ?Party ?TopicPage WHERE {
?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates .
?President dbpedia:party ?Party .
?nytPresident owl:sameAs ?President .
?nytPresident nytimes:topicPage ?TopicPage .
}
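Source selection (triple patterns TP1 to TP4 in query order):
TP1 @ DBpedia
TP2 @ DBpedia
TP3 @ DBpedia, NYTimes
TP4 @ NYTimes
Exclusive Group: TP1 and TP2 (DBpedia is their only relevant source)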
Advantage:
Delegate joins to the endpoint by forming exclusive groups (i.e. executing the
respective patterns in a single subquery)
Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data
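A minimal Python sketch of forming exclusive groups from a source selection result. The function names are hypothetical and prefix declarations are omitted from the generated sub-query; the sketch only shows the grouping idea, not FedX's actual code.

```python
from collections import defaultdict


def exclusive_groups(selection):
    """Split (pattern, sources) pairs into per-source exclusive groups and the rest."""
    groups, rest = defaultdict(list), []
    for tp, sources in selection:
        if len(sources) == 1:
            groups[sources[0]].append(tp)  # only this source can answer the pattern
        else:
            rest.append(tp)
    return dict(groups), rest


def group_subquery(patterns):
    """Combine a group's patterns into one sub-query so the endpoint performs the join."""
    body = " . ".join(" ".join(tp) for tp in patterns)
    return f"SELECT * WHERE {{ {body} }}"


SELECTION = [
    (("?President", "rdf:type", "dbpedia-yago:PresidentsOfTheUnitedStates"), ["DBpedia"]),
    (("?President", "dbpedia:party", "?Party"), ["DBpedia"]),
    (("?nytPresident", "owl:sameAs", "?President"), ["DBpedia", "NYTimes"]),
    (("?nytPresident", "nytimes:topicPage", "?TopicPage"), ["NYTimes"]),
]
groups, rest = exclusive_groups(SELECTION)
dbpedia_query = group_subquery(groups["DBpedia"])
print(dbpedia_query)
```

The two DBpedia-only patterns end up in one sub-query, so their join is evaluated remotely instead of by the federator.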
70. Exclusive Groups Join Order Optimization
1 SPARQL Query
Compute Micronutrients using Drugbank and KEGG
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-cat:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug dc:title ?title .
}
2 Unoptimized Internal Representation
4x Local Join = 4x NLJ
3 Optimized Internal Representation
Exclusive Group, Remote Join
Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data
71. Selectivity Based Join Order Optimization
[] a sd:Service ;
    sd:endpointUrl <http://localhost:8890/sparql> ;
    sd:capability [
        sd:predicate diseasome:name ;
        sd:totalTriples 147 ;   # total number of triples with predicate sd:predicate
        sd:avgSbjSel "0.0068" ; # 1 / number of distinct subjects with predicate sd:predicate
        sd:avgObjSel "0.0069" ; # 1 / number of distinct objects with predicate sd:predicate
    ] ;
    sd:capability [
        sd:predicate diseasome:chromosomalLocation ;
        sd:totalTriples 160 ;
        sd:avgSbjSel "0.0062" ;
        sd:avgObjSel "0.0072" ;
    ] ;
Example: four triples with predicate P
S1 P O1 .
S1 P O2 .
S2 P O1 .
S3 P O2 .
totalTriples = 4
avgSbjSel(p) = 1/3 (three distinct subjects)
avgObjSel(p) = 1/2 (two distinct objects)
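The per-predicate statistics can be derived from the triples as in the following Python sketch, which reuses the four-triple example above. The function name and the dictionary keys are illustrative choices that mirror the property names in the service description.

```python
from fractions import Fraction


def predicate_statistics(triples, predicate):
    """Compute totalTriples, avgSbjSel and avgObjSel for one predicate.

    Assumes at least one triple uses the predicate (otherwise division by zero).
    """
    matching = [(s, o) for s, p, o in triples if p == predicate]
    subjects = {s for s, _ in matching}
    objects = {o for _, o in matching}
    return {
        "totalTriples": len(matching),
        "avgSbjSel": Fraction(1, len(subjects)),  # 1 / distinct subjects
        "avgObjSel": Fraction(1, len(objects)),   # 1 / distinct objects
    }


TRIPLES = [("S1", "P", "O1"), ("S1", "P", "O2"), ("S2", "P", "O1"), ("S3", "P", "O2")]
stats = predicate_statistics(TRIPLES, "P")
print(stats["totalTriples"], stats["avgSbjSel"], stats["avgObjSel"])  # prints: 4 1/3 1/2
```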
72. Selectivity Based Join Order Optimization
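• Triple pattern cardinality
Let \(p = \mathrm{pred}(tp)\) and let \(T\) be the total number of triples having predicate \(p\).
\[
C(tp) =
\begin{cases}
T & \text{if neither subject nor object is bound} \\
T \times \mathit{avgSbjSel}(p) & \text{if subject is bound} \\
T \times \mathit{avgObjSel}(p) & \text{if object is bound}
\end{cases}
\]
• Join cardinality
\[
C(J(tp_1, tp_2)) =
\begin{cases}
C(tp_1) \times C(tp_2) \times \mathit{avgPredJoinSel}(tp_1) \times \mathit{avgPredJoinSel}(tp_2) & \text{if p-p join} \\
C(tp_1) \times C(tp_2) \times \mathit{avgSbjJoinSel}(tp_1) \times \mathit{avgSbjJoinSel}(tp_2) & \text{if s-s join} \\
C(tp_1) \times C(tp_2) \times \mathit{avgSbjJoinSel}(tp_1) \times \mathit{avgObjJoinSel}(tp_2) & \text{if s-o join}
\end{cases}
\]
How to calculate avgPredJoinSel, avgSbjJoinSel, and avgObjJoinSel?
DARQ selected 0.5 as the avgJoinSel value for all joins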
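The two estimates can be sketched in Python using the statistics from the service description on the previous slide. Two points are assumptions of this sketch rather than statements from the slides: a pattern with both subject and object bound falls into the subject case, and the fixed value 0.5 is applied to each of the two join selectivity factors.

```python
def is_bound(term):
    return not term.startswith("?")


def triple_pattern_cardinality(tp, stats):
    """Estimate C(tp) from per-predicate statistics (predicate must be bound)."""
    s, p, o = tp
    entry = stats[p]
    if is_bound(s):
        return entry["totalTriples"] * entry["avgSbjSel"]
    if is_bound(o):
        return entry["totalTriples"] * entry["avgObjSel"]
    return entry["totalTriples"]


def join_cardinality(c1, c2, sel1, sel2):
    """Estimate C(J(tp1, tp2)) as C(tp1) x C(tp2) x sel1 x sel2."""
    return c1 * c2 * sel1 * sel2


STATS = {
    "diseasome:name": {"totalTriples": 147, "avgSbjSel": 0.0068, "avgObjSel": 0.0069},
    "diseasome:chromosomalLocation": {"totalTriples": 160, "avgSbjSel": 0.0062, "avgObjSel": 0.0072},
}
c_name = triple_pattern_cardinality(("?d", "diseasome:name", "?n"), STATS)
c_loc = triple_pattern_cardinality(("?d", "diseasome:chromosomalLocation", "?l"), STATS)
c_join = join_cardinality(c_name, c_loc, 0.5, 0.5)  # s-s join on ?d
print(c_name, c_loc, c_join)  # prints: 147 160 5880.0
```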
74. Join Implementations
• Bound Joins
– Start with a single triple pattern (lowest cardinality)
– Substitute mappings from previous triple pattern in the
subsequent evaluation
– Bound Joins in NLJ fashion
• Execute bound joins in nested loop join fashion
• Too many remote requests
– Bound Joins in Block NLJ fashion
• Execute bound joins in block nested loop join fashion
• Make use of the SPARQL UNION construct
• The number of remote requests is reduced by a factor of the block size
• Other join techniques
– E.g., hash joins
75. Bound Joins in Block NLJ
SELECT ?President ?Party ?TopicPage WHERE {
?President rdf:type dbpedia:PresidentsOfTheUnitedStates .
?President dbpedia:party ?Party .
?nytPresident owl:sameAs ?President .
?nytPresident nytimes:topicPage ?TopicPage .
}
Assume that the following intermediate results have been computed as input for the last triple pattern
Block Input
“Barack Obama”
“George W. Bush”
…
Before (NLJ)
SELECT ?TopicPage WHERE { “Barack Obama” nytimes:topicPage ?TopicPage }
SELECT ?TopicPage WHERE { “George W. Bush” nytimes:topicPage ?TopicPage }
…
Now: Evaluation in a single remote request using a SPARQL UNION
construct + local post processing (SPARQL 1.0)
Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data
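A minimal Python sketch of building such a UNION query for one block of input bindings. Giving each branch its own suffixed variable is one way to match results back to their input mapping during local post-processing; that naming scheme, the helper names, and the block size of 15 are assumptions of this sketch. The quoted names are the same placeholders used on the slide.

```python
def chunk(items, size):
    """Split the input bindings into blocks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def union_query(subjects, predicate, variable):
    """Build one SELECT with a UNION branch per bound subject."""
    names = [f"?{variable}_{i}" for i in range(len(subjects))]
    branches = [f"{{ {s} {predicate} {name} }}" for s, name in zip(subjects, names)]
    return f"SELECT {' '.join(names)} WHERE {{ {' UNION '.join(branches)} }}"


block = ['"Barack Obama"', '"George W. Bush"']
query = union_query(block, "nytimes:topicPage", "TopicPage")
print(query)

# With 100 input bindings and blocks of 15, 7 remote requests replace 100.
blocks = chunk(list(range(100)), 15)
print(len(blocks))  # prints: 7
```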
76. Parallelization and Pipelining
• Execute sub-queries concurrently on different data
sources
• Multithreaded worker pool to execute the joins
and UNION operators in parallel
• Pipelining approach for intermediate results
• See FedX and LHD implementations
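Concurrent sub-query execution can be sketched with a thread pool from the Python standard library. The `execute` callback is a stand-in for sending a query to an endpoint (here a fake that returns a label), and the endpoint URLs are made up; neither reflects how FedX or LHD are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subqueries(subqueries, execute, max_workers=4):
    """Run (endpoint, query) pairs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(execute, endpoint, query) for endpoint, query in subqueries]
        return [future.result() for future in futures]


def fake_execute(endpoint, query):
    # Assumption: placeholder for a remote request, returns a label instead of rows.
    return f"{endpoint} answered {len(query)} chars"


SUBQUERIES = [
    ("http://example.org/dbpedia/sparql", "SELECT * WHERE { ?s ?p ?o }"),
    ("http://example.org/nyt/sparql", "ASK { ?s ?p ?o }"),
]
results = run_subqueries(SUBQUERIES, fake_execute)
print(results)
```

Collecting the futures in submission order keeps the results aligned with the sub-queries even when the endpoints respond at different speeds.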
78. Performance Metrics
• Efficient source selection in terms of
– Total triple pattern-wise sources selected
– Total number of SPARQL ASK requests used during source
selection
– Source selection time
• Query execution time
• Result completeness and correctness
• Number of remote requests during query execution
• Index compression ratio (1 - index size / data dump size)
• See https://code.google.com/p/bigrdfbench/
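The index compression ratio is a one-line computation; the Python sketch below uses made-up sizes purely to show the arithmetic.

```python
def index_compression_ratio(index_size, dump_size):
    """Return 1 - index size / data dump size (both in the same unit)."""
    return 1 - index_size / dump_size


# Hypothetical sizes: a 5 MB index over a 1000 MB data dump.
ratio = index_compression_ratio(5, 1000)
print(round(ratio, 3))  # prints: 0.995
```

A ratio close to 1 means the index is small relative to the data it summarizes.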
79. Evaluation Setup
• Local dedicated network
• Local SPARQL endpoints (One per machine)
• Run each query 10 times and present the average results
• Statistically analyze the results, e.g., with the Wilcoxon signed-rank
test or Student's t-test