Diese Präsentation wurde erfolgreich gemeldet.
Die SlideShare-Präsentation wird heruntergeladen. ×

Federated SPARQL query processing over the Web of Data

Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Wird geladen in …3
×

Hier ansehen

1 von 83 Anzeige

Weitere Verwandte Inhalte

Diashows für Sie (20)

Ähnlich wie Federated SPARQL query processing over the Web of Data (20)

Anzeige

Weitere von Muhammad Saleem (13)

Aktuellste (20)

Anzeige

Federated SPARQL query processing over the Web of Data

  1. 1. Federated SPARQL Query Processing Over the Web of Data Muhammad Saleem, Axel-Cyrille Ngonga Ngomo Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany, 25/11/2014
  2. 2. Agenda • SPARQL Query Federation Approaches • SPARQL Query Federation Optimization – Query Rewriting – Source Selection – Data Integration Options – Join Order Selection – Join Order Optimization – Join Implementations • Performance Metrics and Discussion
  3. 3. SPARQL Query Federation Approaches • SPARQL Endpoint Federation (SEF) • Linked Data Federation (LDF) • Distributed Hash Tables (DHTs) • Hybrid of SEF+LDF
  4. 4. SPARQL Endpoint Federation Approaches • Most commonly used approaches • Make use of SPARQL endpoints URLs • Fast query execution • RDF data needs to be exposed via SPARQL endpoints • E.g., HiBISCus, FedX, SPLENDID, ANAPSID, LHD etc.
  5. 5. Linked Data Federation Approaches • Data needs not be exposed via SPARQL endpoints • Uses URI lookups at runtime • Data should follow Linked Data principles • Slower as compared to previous approaches • E.g., LDQPS, SIHJoin, WoDQA etc.
  6. 6. Query federation on top of Distributed Hash Tables • Uses DHT indexing to federate SPARQL queries • Space efficient • Cannot deal with whole LOD • E.g., ATLAS
  7. 7. Hybrid of SEF+LDF • Federation over SPARQL endpoints and Linked Data • Can potentially deal with whole LOD • E.g., ADERIS-Hybrid
  8. 8. SPARQL Endpoint Federation Parsing/Rewriting Source Selection Federator Optimzer Integrator S1 S2 S3 S4 RDF RDF RDF RDF Rewrite query and get Individual Triple Patterns Identify capable source against Individual Triple Patterns Generate optimized sub-query Exe. Plan Execute sub-queries Integrate sub-queries results
  9. 9. SPARQL Query Rewriting
  10. 10. SPARQL Query Rewriting FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality ?nationality. ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . Filter (?nationality = dbpedia:United_States ) } FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } Try to simplify/avoid SPARQL FILTER and REGEX expressions
  11. 11. Source Selection
  12. 12. Source Selection FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  13. 13. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  14. 14. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  15. 15. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  16. 16. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4-S9 Total triple pattern-wise sources selected = Jamendo RDF TP2 = S1 1+1+1+1+8 => 12 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  17. 17. Types of Source Selection • Index-free – Using SPARQL ASK queries – No index maintenance required – Potentially ensures result set completeness – SPARQL ASK queries can be expensive – Can make use of the cache to store recent SPARQL ASK queries results – E.g., FedX • Index-only – Only make use of Index/data summaries – Less efficient but fast source selection – Result set completeness is not ensured – E.g., DARQ, LHD • Hybrid – Make use of index+SPARQL ASK – Most efficient – Result set completeness is not ensured – Can make use of the cache to store recent SPARQL ASK queries results – E.g., HiBISCuS, ANAPSID, SPLENDID
  18. 18. Index-free Source Selection Input: SPARQL query Q , set of all data sources D Output: Triple pattern to relevant data sources map M for each triple pattern ti in SPARQL query Q Ri = {}; // set of relevant data sources for triple pattern ti for each data source di in D if SPARQL ASK(di , ti) = true Ri = Ri U {di}; end if end for M = M U {Ri}; end for return M What is the total number of SPARQL ASK requests used? total number of triple patterns * total number of data sources
  19. 19. Index-free Source Selection FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  20. 20. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-free Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  21. 21. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-free Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  22. 22. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-free Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  23. 23. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-free Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4-S9 Total number of SPARQL ASK requests used = 45 Total triple pattern-wise sources selected = 12 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  24. 24. Index-only Source Selection (LHD) Input: SPARQL query Q , set of all data sources D, data sources index I storing all distinct predicates for all data sources in D Output: Triple pattern to relevant data sources map M for each triple pattern ti in SPARQL query Q Ri = {}; // set of relevant data sources for triple pattern ti p = Pred(ti) // predicate of ti if (bound (p)) Ri = Lookup (I, p) // index lookup for predicate of ti else Ri = D ; // all data sources are relevant end if M = M U {Ri} ; end for return M Why it is the less efficient approach (i.e., greatly overestimate relevant data sources)? • Source selection is only based on predicate of triple patterns • Simply select all data sources for triple patterns having unbound predicates
  25. 25. Index-only Source Selection FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1-S9 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  26. 26. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-only Source Selection Triple pattern-wise source selection TP1 = KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF S1-S9 TP2 = S1 Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  27. 27. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-only Source Selection Triple pattern-wise source selection TP1 = KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF S1-S9 TP3 = S1 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  28. 28. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-only Source Selection Triple pattern-wise source selection TP1 = KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF S1-S9 TP3 = S1 TP4 = S4 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  29. 29. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Index-only Source Selection Triple pattern-wise source selection TP1 = KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF S1-S9 TP3 = S1 TP4 = S4 TP5 = S1 S2 S4-S9 Total number of SPARQL ASK requests used = 0 Total triple pattern-wise sources selected = 20 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  30. 30. Hybrid Source Selection Input: SPARQL query Q , set of all data sources D, data sources index I storing all distinct predicates for all data sources in D Output: Triple pattern to relevant data sources map M for each triple pattern ti in SPARQL query Q Ri = {}; // set of relevant data sources for triple pattern ti s = Subj(ti) , p = Pred(ti) , o = Obj(ti) ; // subject, predicate, and object of ti if (!bound (p) || bound (s) || bound (o) ) for each data source di in D if SPARQL ASK(di , ti) = true Ri = RiU {di}; end if end for else Ri = Lookup (I, p) // index lookup for predicate of ti end if M = M U {Ri} end for return M What is the total number of SPARQL ASK requests used? total number of triple patterns with bound subject or bound object or unbound predicate * total number of data sources
  31. 31. Hybrid Source Selection FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  32. 32. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Hybrid Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  33. 33. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Hybrid Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  34. 34. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Hybrid Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  35. 35. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } Anything still needs to be improved? dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Hybrid Source Selection Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4-S9 Total number of SPARQL ASK requests used = 18 Total triple pattern-wise sources selected = 12 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  36. 36. Source Selection • Triple pattern-wise source selection – Ensures 100% recall – Can over-estimate capable sources – Can be expensive, e.g., total number of SPARQL ASK requests used – Performed by FedX, SPLENDID, LHD, DARQ, ADERIS etc. • Join-aware triple-pattern wise source selection – Ensures 100% recall – May selects optimal/close to optimal capable sources – Can be expensive, e.g., total number of SPARQL ASK requests used – Can significantly reduce the query execution time – Performed by ANAPSID, HiBISCuS
  37. 37. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation • Hybrid source selection • Join-aware triple-pattern wise source selection • Makes use of the hypergraph representation of SPARQL queries • Makes use of the URI authorities • Makes use of the cache to store recent SPARQL ASK queries results
  38. 38. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  39. 39. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  40. 40. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  41. 41. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 Jamendo RDF TP2 = S1 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  42. 42. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4 S5 Jamendo RDF TP2 = S1 S6 S7 S8 S9 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  43. 43. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4 S5 Total triple pattern-wise selected sources = 12 Total SPARQL ASK queries : 9*5 = 45 Jamendo RDF TP2 = S1 S6 S7 S8 S9 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  44. 44. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF TP3 = S1 TP4 = S4 TP5 = S1 S2 S4 S5 Total triple pattern-wise selected sources = 12 Total SPARQL ASK queries : 9*5 = 45 Jamendo RDF TP2 = S1 S6 S7 S8 S9 Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  45. 45. Motivation FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF //TP3 //TP4 //TP5 Source Selection Algorithm Triple pattern-wise source selection TP1 = S1 TP3 = S1 TP2 = S1 TP4 = S4 TP5 = S1 S2 S4 S5 S6 S7 S8 S9 Optimal triple pattern-wise selected sources 5 KEGG RDF ChEBI RDF NYT RDF //TP1 SWDF RDF //TP2 LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9
  46. 46. Problem Statement • An overestimation of triple pattern-wise source selection can be expensive – Resources are wasted – Query runtime is increased – Extra traffic is generated • How do we perform join-aware triple pattern wise source selection in time efficient way?
  47. 47. HiBISCuS: Key Concept • Makes use of the URI’s authorities http://dbpedia.org/ontology/party Scheme Authority Path For URI details: http://tools.ietf.org/html/rfc3986
  48. 48. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President
  49. 49. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_S tates dbpedia: nationality
  50. 50. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_S tates dbpedia: party dbpedia: nationality ?party
  51. 51. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_S tates dbpedia: party dbpedia: nationality ?party ?x nyt:topi cPage ?page
  52. 52. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_S tates dbpedia: party dbpedia: nationality ?party ?x nyt:topi cPage ?page owl: SameAs
  53. 53. HiBISCuS: SPARQL Query as Hypergraph SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_S tates dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page Star simple hybrid Tail of hyperedge
  54. 54. HiBISCuS: Data Summaries [] a ds:Service ; ds:endpointUrl <http://dbpedia.org/sparql> ; ds:capability [ ds:predicate dbpedia:party ; ds:sbjAuthority <http://dbpedia.org/> ; ds:objAuthority <http://dbpedia.org/> ; ] ; ds:capability [ ds:predicate rdf:type ; ds:sbjAuthority <http://dbpedia.org/> ; ds:objAuthority owl:Thing, dbpedia:President; #we store all distinct classes ] ; ds:capability [ ds:predicate dbpedia:postalCode ; ds:sbjAuthority <http://dbpedia.org/> ; #No objAuthority as the object value for dbpedia:postalCode is string ] ;
  55. 55. HiBISCuS: Triple Pattern-wise Source Selection SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page dbpedia KEGG NYT SWDF LMDB Geo DrgBnk Jamendo
  56. 56. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page dbpedia KEGG NYT SWDF DrgBnk LMDB Geo Jamendo Obj. auth. dbpedia Sbj. auth. Sbj. auth. KEGG Sbj. auth. NYT Sbj. auth. SWDF Sbj. auth. LMDB Sbj. auth. Geo Sbj. auth. DrgBnk Sbj. auth. Jamendo
  57. 57. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page dbpedia Sbj. auth. Sbj. auth. KEGG Sbj. auth. NYT Sbj. auth. SWDF Sbj. auth. LMDB Sbj. auth. Geo Sbj. auth. DrgBnk Sbj. auth. Jamendo dbpedia KEGG NYT SWDF DrgBnk LMDB Geo Jamendo Obj. auth.
  58. 58. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page dbpedia KEGG NYT SWDF DrgBnk LMDB Geo Jamendo Obj. auth.
  59. 59. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page NYT Obj. auth.
  60. 60. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page NYT Obj. auth.
  61. 61. HiBISCuS: Triple Pattern-wise Source Pruning SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } ?president rdf:type dbpedia: President dbpedia: United_ States dbpedia: nationality ?x owl: SameAs dbpedia: party ?party nyt:topi cPage ?page Total triple pattern-wise selected sources = 5 Total SPARQL ASK queries : 0
  62. 62. Data Integration Options
  63. 63. Complete Local Integration • Triple patterns are individually and completely evaluated against every endpoint • Triple pattern results are locally integrated using different join techniques, e.g., NLJ, Hash Join etc. • Less efficient if query contains common predicates such rdf:type and owl:sameAs • Large amount of potentially irrelevant intermediate results retrieval
  64. 64. Iterative Integration • Evaluate query iteratively pattern by pattern • Start with a single triple pattern • Substitute mappings from previous triple pattern in the subsequent evaluation • Evaluate query in a NLJ fashion • NLJ can cause many remote requests • Block NLJ fashion minimize the remote requests
  65. 65. Join Order Selection
  66. 66. Join Order Selection • Left-deep trees – Joins take place in a left-to-right sequential order – Result of the join is used as an outer input for the next join – Used in FedX, DARQ • Right-deep trees – Joins take place in a right-to-left sequential order – Result of the join is used as an inner input for the next join • Bushy trees – Joins take place in sub-tress both on left and right sides – Used in ANAPSID • Dynamic programming – Used in SPLENDID
  67. 67. Join Order Selection Example Compute Micronutrients using Drugbank and KEGG SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-cat:micronutrient. // TP1 ?drug drugbank:casRegistryNumber ?id . // TP2 ?keggDrug rdf:type kegg:Drug . // TP3 ?keggDrug bio2rdf:xRef ?id . // TP4 ?keggDrug dc:title ?title . // TP5 } 67 휋 ? 푑푟푢푔, ? 푡푖푡푙푒 TP1 TP2 TP3 TP4 TP5 Left-deep tree 휋 ? 푑푟푢푔, ? 푡푖푡푙푒 TP1 TP2 TP3 TP4 TP5 Right-deep tree Bushy tree 휋 ? 푑푟푢푔, ? 푡푖푡푙푒 TP1 TP2 TP3 TP5 TP4 Goal: Execute smallest cardinality joins first
  68. 68. Join Order Optimization
  69. 69. Join Order Optimization • Exclusive Groups – Group triple patterns with the same relevant data source – Evaluation in a single (remote) sub-query – Push join to the data source, i.e., endpoint • Variable count-heuristic – Iteratively determine the join order based on free variables count of triple patterns and groups – Consider “resolved ” variable mappings from earlier iteration • Using Selectivities – Store distinct predicates, avg. subject selectivities , and avg. object selectivities for each predicate in index – Use the predicate count, avg. subject selectivities , and avg. object selectivities to estimate the join cardinality
  70. 70. Exclusive Groups SELECT ?President ?Party ?TopicPage WHERE { ?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates . ?President dbpedia:party ?Party . ?nytPresident owl:sameAs ?President . ?nytPresident nytimes:topicPage ?TopicPage . } Source Selection @ DBpedia @ DBpedia @ DBpedia, NYTimes @ NYTimes Exclusive Group Advantage: Delegate joins to the endpoint by forming exclusive groups (i.e. executing the respective patterns in a single subquery) 70
  71. 71. Exclusive Groups Join Order Optimization 2 Unoptimized Internal Representation 1 SPARQL Query Compute Micronutrients using Drugbank and KEGG SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-cat:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug dc:title ?title . } 3 Optimized Internal Representation 4x Local Join = 4x NLJ Exlusive Group  Remote Join 71
  72. 72. Selectivity Based Join Order Optimization [] a sd:Service ; sd:endpointUrl <http://localhost:8890/sparql> ; sd:capability [ sd:predicate diseasome:name ; sd:totalTriples 147 ; // Total number of triple patterns with predicate value sd:predicate sd:avgSbjSel ``0.0068'' ; // 1/ distinct subjects with predicate value sd:predicate sd:avgObjSel ``0.0069'' ; // 1/ distinct Objects with predicate value sd:predicate ] ; sd:capability [ sd:predicate diseasome:chromosomalLocation ; sd:totalTtriples 160 ; sd:avgSbjSel ``0.0062'' ; sd:avgObjSel ``0.0072'' ; ] ; S1 P O1 . S1 P O2 . S2 P O1 . S3 P O2 . totalTriples = 4 avgSbjSel(p) = 1/3 avgObjSel(p) =1/2
  73. 73. Selectivity Based Join Order Optimization • Triple pattern cardinality • Join Cardinality 푝 = pred(tp) , 푇 = Total triple having predicate 푝 퐶(푡푝) = 푇 푖푓 푛푒푖푡ℎ푒푟 푠푢푏푗푒푐푡 푛표푟 표푏푗푒푐푡 푖푠 푏표푢푛푑 푇 × 푎푣푔푆푏푗푆푒푙 푝 푖푓 푠푢푏푗푒푐푡 푖푠 푏표푢푛푑 푇 × 푎푣푔푂푏푗푆푒푙 푝 푖푓표푏푗푒푐푡 푖푠 푏표푢푛푑 퐶(퐽 푡푝1, 푡푝2 ) = 퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푃푟푒푑퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푃푟푒푑퐽표푖푛푆푒푙 푡푝2 푖푓 푝 − 푝 푗표푖푛 퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝2 푖푓 푠 − 푠 푗표푖푛 퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푂푏푗퐽표푖푛푆푒푙 푡푝2 푖푓 푠 − 표 푗표푖푛 How to calculate avgPredJoinSel, avgSbjJoinSel, and avgObjJoinSel? DARQ selected 0.5 as the avgJoinSel value for all joins
  74. 74. Join Implementations
  75. 75. Join Implementations • Bound Joins – Start with a single triple pattern (lowest cardinality) – Substitute mappings from previous triple pattern in the subsequent evaluation – Bound Joins in NLJ fashion • Execute bound joins in nested loop join fashion • Too many remote requests – Bound Joins in Block NLJ fashion • Execute bound joins in block nested loop join fashion • Make use of SPARQL UNION construct • Remote requests are reduced by the block size • Other Join techniques – E.g, Hash Joins
  76. 76. Bound Joins in Block NLJ SELECT ?President ?Party ?TopicPage WHERE { ?President rdf:type dbpedia:PresidentsOfTheUnitedStates . ?President dbpedia:party ?Party . ?nytPresident owl:sameAs ?President . ?nytPresident nytimes:topicPage ?TopicPage . } Assume that the following intermediate results have been computed as input for the last triple pattern Block Input “Barack Obama” “George W. Bush” … Before (NLJ) SELECT ?TopicPage WHERE { “Barack Obama” nytimes:topicPage ?TopicPage } SELECT ?TopicPage WHERE { “George W. Bush” nytimes:topicPage ?TopicPage } … Now: Evaluation in a single remote request using a SPARQL UNION construct + local post processing (SPARQL 1.0) 76
  77. 77. Parallelization and Pipelining • Execute sub-queries concurrently on different data sources • Multithreaded worker pool to execute the joins and UNION operators in parallel • Pipelining approach for intermediate results • See FedX and LHD implementations
  78. 78. Performance Metrics and Discussion
  79. 79. Performance Metrics • Efficient source selection in terms of – Total triple pattern-wise sources selected – Total number of SPARQL ASK requests used during source selection – Source selection time • Query execution time • Results completeness and correctness • Number of remote requests during query execution • Index compression ratio (1- index size/datadump size) • See https://code.google.com/p/bigrdfbench/
  80. 80. Evaluation Setup • Local dedicated network • Local SPARQL endpoints (One per machine) • Run each query 10 times and present the average results • Statistically analyzed the results, e.g., Wilcoxon signed rank test, student T-test
  81. 81. SPARQL Query Federation Engines • FedX • SPLENDID • HiBISCuS+FedX • HiBISCuS+SPLENDID • ANAPSID • LHD • DARQ 81
  82. 82. AKSW SPARQL Federation Publications • HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation by Muhammad Saleem and Axel-Cyrille Ngonga Ngomo, in (ESWC, 2014) • DAW: Duplicate-AWare Federated Query Processing over the Web of Data by Muhammad Saleem Axel-Cyrille Ngonga Ngomo, Josiane Xavier Parreira , Helena Deus , and Manfred Hauswirth , in (ISWC 2013). • TopFed: TCGA Tailored Federated Query Processing and Linking to LOD by Muhammad Saleem, Shanmukha Sampath , Axel-Cyrille Ngonga Ngomo , Aftab Iqbal, Jonas Almeida , and Helena F. Deus , in (Journal of Biomedical Semantics, 2014). • A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems by Muhammad Saleem, Yasar Khan, Ali Hasnain, Ivan Ermilov, and Axel-Cyrille Ngonga Ngomo , in (Semantic Web Journal, 2014) • BigRDFBench: A Billion Triples Benchmark for SPARQL Query Federation by Muhammad Saleem, Ali Hasnain, Axel-Cyrille Ngonga Ngomo , in (submitted WWW, 2015). • SAFE: Policy-Aware SPARQL Query Federation Over RDF Data Cubes By Yasar Khan, Muhammed Saleem , Aftab Iqbal, Muntazir Mehdi, Aidan Hogan, Panagiotis Hasapis, Axel-Cyrille Ngonga Ngomo, Stefan Decker, and Ratnesh Sahay, in (SWAT4LS, 2014) • QFed: Query Set For Federated SPARQL Query Benchmark by Nur Aini Rakhmawati, Sarasi lithsena , Muhammad Saleem , Stefan Decker, in (iiWAS, 2014) 82
  83. 83. Thanks {saleem,ngonga}@informatik.uni-leipzig.de AKSW, University of Leipzig, Germany

×