
Joins in a distributed world - Lucian Precup


Many database-related algorithms are harder to implement in a distributed environment. Quite often, the "distributed" version is far from the "classical" version: constraints are dropped (see the CAP theorem), only specific cases are supported (for example, the data involved must be co-located within the distributed system), and so on. This talk focuses on joins. We start by presenting join implementations in "classical" relational databases, then we lead the audience through the challenges and solutions involved in making these functions available in a distributed environment. While we start from a theoretical point of view, we finish with real-life examples from implementations in ETL systems (known for joining heterogeneous databases and therefore quite advanced in this area, but often not real-time) and in some modern NoSQL databases (where most systems choose to offer fewer join features).

Published in: Data & Analytics

  1. Joins in a distributed world. @LucianPrecup, @dmconf Barcelona 2015, #dmconf15, 2015-11-21
  2. whoami
     • CTO of Adelean (http://adelean.com/)
     • Integrate search, NoSQL and big data technologies
     • to support ETL, BI, data mining, data processing and data visualization use cases
  3. Poll - How many of you …
     • Use a NoSQL database within your current application / information system?
       – Which NoSQL database?
     • Use a relational database within your current application / information system?
       – Which RDBMS?
     • Store the data in more than one system?
     • Distribute data across several servers?
     • Use an ORM framework (Hibernate, Doctrine, …)?
  4. Distributed databases
     • Issues
       – CAP theorem
       – Two-phase commit
       – Distributed transactions
  5. Objective
     • Show what distributed databases can and cannot do about joins, and why
     • Give YOU the ability to implement joins in your applications (your service layers)
  6. Joins
     • A join combines the output from two sources
     • A join condition: the relationship between the sources
     • Join trees
  7. The "Query Optimizer": equivalent queries (≡)
       SELECT DISTINCT offer_status FROM offer;
       ≡
       SELECT offer_status FROM offer GROUP BY offer_status;

       SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y FROM A LEFT OUTER JOIN B ON A.ID = B.ID WHERE B.ID IS NOT NULL;
       ≡
       SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y FROM A INNER JOIN B ON A.ID = B.ID;
  8. The query optimizer - decisions
     • Access paths
       – The optimizer must choose an access path to retrieve data from each table in the join statement, for example a full table scan or an index scan.
     • Join methods
       – To join each pair of row sources, the database must decide how to do it: nested loops, sort merge or hash join. Each join method has specific situations in which it is more suitable than the others.
     • Join types
       – The join condition determines the join type: an inner join retrieves only rows that match the join condition, while an outer join also retrieves rows that do not match it.
     • Join order
       – To execute a statement that joins more than two tables, the database joins two tables, then joins the resulting row source to the next table, and so on.
     • Calculating the cost of a query plan
       – Based on metrics: I/Os (single-block I/O, multiblock I/O), estimated CPU, functions and expressions
  9. Join methods
     • Nested Loops Joins
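A minimal sketch of a nested loops join, assuming both inputs are in-memory lists of dictionaries. The function name `nested_loops_join` and the sample rows are made up for illustration and do not come from the slides.

```python
def nested_loops_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row: O(len(outer) * len(inner))."""
    result = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result


# Hypothetical sample data.
persons = [{"pid": 1, "name": "Ana"}, {"pid": 2, "name": "Bob"}]
contracts = [{"cid": 10, "owner": 1}, {"cid": 11, "owner": 1}, {"cid": 12, "owner": 3}]

joined = nested_loops_join(persons, contracts, "pid", "owner")
print(joined)
```

In a real database the inner loop is usually an index lookup rather than a scan, which is what makes this method attractive when the outer input is small.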
  10. Join methods
     • Hash Joins
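A hash join can be sketched as follows, assuming the smaller ("build") input fits in memory. The names `hash_join`, `build` and `probe` and the sample rows are illustrative assumptions.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Hash the build side once, then stream the probe side through the table."""
    table = defaultdict(list)
    for b in build:
        table[b[build_key]].append(b)
    for p in probe:
        # .get() avoids creating empty buckets for keys that have no match
        for b in table.get(p[probe_key], []):
            yield {**b, **p}


# Hypothetical sample data: persons is the small build side.
persons = [{"pid": 1, "name": "Ana"}, {"pid": 2, "name": "Bob"}]
contracts = [{"cid": 10, "owner": 2}, {"cid": 11, "owner": 1}, {"cid": 12, "owner": 3}]

joined = list(hash_join(persons, contracts, "pid", "owner"))
print(joined)
```

The output follows the order of the probe side, because that side is streamed while the build side is only looked up.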
  11. Join methods
     • Sort Merge Joins (figure: two sorted key lists being merged)
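A sort merge join can be sketched like this, assuming rows are `(key, value)` tuples. The function name and the sample keys are assumptions chosen for illustration, not values taken from the slide figure.

```python
def sort_merge_join(left, right):
    """Equijoin of two lists of (key, value) pairs; returns (key, left_val, right_val)."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append((lk, l_row[1], r_row[1]))
            i, j = i_end, j_end
    return out


# Hypothetical sample data.
left = [(1, "a"), (3, "b"), (7, "c"), (9, "d")]
right = [(1, "w"), (2, "x"), (2, "y"), (7, "z"), (9, "q")]
merged = sort_merge_join(left, right)
print(merged)
```

If both inputs already arrive sorted on the join key (for example from an index or a previous merge step), the two `sorted` calls can be dropped and the join becomes a single linear pass.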
  12. Join types
     • Inner Joins
       – Equijoins
       – Nonequijoins
     • Outer Joins
       – Nested Loop Outer Joins
       – Hash Join Outer Joins
       – Sort Merge Outer Joins
       – Full Outer Joins
       – Multiple Tables on the Left of an Outer Join
     • Semijoins
       – ~ IN or EXISTS clause
     • Antijoins
       – ~ NOT IN or NOT EXISTS
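Semijoins and antijoins can be sketched as set-membership filters, assuming rows are dictionaries and the join key is never NULL (SQL's `NOT IN` behaves differently when NULLs are present). The function names and the sample rows are illustrative assumptions.

```python
def semi_join(left, right, key):
    """Rows of `left` that have at least one match in `right` (~ EXISTS)."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


def anti_join(left, right, key):
    """Rows of `left` that have no match in `right` (~ NOT EXISTS)."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


# Hypothetical sample data.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": 1, "item": "x"}, {"id": 1, "item": "y"}, {"id": 3, "item": "z"}]

with_orders = semi_join(customers, orders, "id")
without_orders = anti_join(customers, orders, "id")
print(with_orders, without_orders)
```

Unlike an inner join, the semijoin returns customer 1 only once even though two orders match it.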
  13. Optimizing tricks
     • Push filters as low as possible in the execution tree
     • Cache small tables in memory
     • Calculate statistics on each data source and use them
     • Think use cases
     • "Fortunately, many of the lessons learned over the years for optimizing and executing join queries can be directly translated to the distributed world." (see Nicolas Bruno et al. in the references below)
  14. Push filters
     • Before: Filter 1 applied on top of Join (Table 1, Table 2)
     • After: Join' applied to Filter 1' (Table 1) and Filter 1'' (Table 2)
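The effect of pushing a filter below the join can be sketched as follows. The helper `join`, the tables and the `country` filter are assumptions made up for this example.

```python
def join(left, right, key):
    """Simple in-memory equijoin used by both plans below."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data: 1000 orders spread over 100 customers.
orders = [{"cust": n % 100, "amount": n} for n in range(1000)]
customers = [{"cust": c, "country": "ES" if c < 10 else "FR"} for c in range(100)]

# Plan 1: join everything, then filter (1000 joined rows are materialized).
late = [row for row in join(orders, customers, "cust") if row["country"] == "ES"]

# Plan 2: filter the customers first, then join (only 10 customers are indexed).
spanish = [c for c in customers if c["country"] == "ES"]
early = join(orders, spanish, "cust")

print(len(late), len(early), late == early)
```

Both plans return the same rows; the second one simply moves less data through the join, which matters even more when the join requires shipping rows between nodes.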
  15. Take advantage of small tables
     • Before: Join (Table 1, Table 2)
     • After: Table 1 filtered with data from Table 2
  16. Use statistics (figure: alternative join orders over Table 1, Table 2 and Table 3)
  17. Other join optimisations (technical)
     • Join algorithms
       – Different ways to implement a logical join operator: nested-loop joins, sort-based joins, hash-based joins.
       – Auxiliary data structures: secondary indexes, join indexes, bitmap indexes, Bloom filters (e.g., indexed loop join, indexed sort-merge join, and distributed semi-join).
     • Bloom filters
       – Especially useful when the amount of memory needed to store the filter is small relative to the amount of data in the data set, and when most data is expected to fail the membership test.
     • Partition-wise joins
       – A partition-wise join is a join optimization that divides a large join of two tables, one of which must be partitioned on the join key, into several smaller joins.
       – Full partition-wise join: both tables must be equipartitioned on their join keys. We can then divide a large join into smaller joins between two partitions.
       – Partial partition-wise join: only one table is partitioned on the join key. The other table may or may not be partitioned.
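A Bloom filter used as a join pre-filter can be sketched as below. The class `BloomFilter`, its default sizes and the SHA-256 based hashing are assumptions chosen for illustration; production systems typically use faster hash functions and size the filter from the expected number of keys.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Build the filter from the small side's join keys (hypothetical data).
small_keys = [3, 7, 42]
bloom = BloomFilter()
for k in small_keys:
    bloom.add(k)

# On the node holding the large side, drop rows that cannot match.
big_rows = [(k, f"row-{k}") for k in range(100)]
candidates = [row for row in big_rows if bloom.might_contain(row[0])]

# The exact join still checks the real keys, so false positives are harmless.
key_set = set(small_keys)
matches = [row for row in candidates if row[0] in key_set]
print(len(candidates), matches)
```

Only the compact filter needs to be shipped to the large side, and only the surviving candidates need to be shipped back for the exact join.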
  18. The "Query Optimizer": SQL/RDBMS (power to the DBA) vs. NoSQL
  19. The "Query Optimizer": SQL/RDBMS (power to the DBA) vs. NoSQL (power to the developer)
  20. The "Query Optimizer": SQL/RDBMS vs. NoSQL / Distributed / Multi-Model
  21. Distributing data
     • Clusters of nodes
     • Partitioning the data (sharding)
     • Partition keys (random or functional)
     • Joining heterogeneous systems, e.g. Persons (Oracle), Contracts (SQL Server), Logs (Elasticsearch)
  22. Partitions (shards)
     • "Random" partitioning: Data A, Data B, Data C spread over the shards
     • "Functional" routing key, which enables colocated joins: Customers by customer_id and Contracts by customer_id on the same shard
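A colocated join on a functional routing key can be sketched as follows, assuming both data sets are hash-partitioned on `customer_id` with the same function and shard count. The helpers `shard_of`, `partition` and `local_join` and the sample rows are hypothetical.

```python
import zlib


def shard_of(key, num_shards):
    """Stable routing function (CRC32 of the key, modulo the shard count)."""
    return zlib.crc32(str(key).encode()) % num_shards


def partition(rows, key, num_shards):
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_of(row[key], num_shards)].append(row)
    return shards


def local_join(customers, contracts):
    """Join executed inside one shard, with no data crossing the network."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**by_id[k["customer_id"]], **k}
        for k in contracts
        if k["customer_id"] in by_id
    ]


NUM_SHARDS = 3
customers = [{"customer_id": i, "name": f"cust-{i}"} for i in range(6)]
contracts = [{"contract_id": 100 + i, "customer_id": i % 6} for i in range(10)]

customer_shards = partition(customers, "customer_id", NUM_SHARDS)
contract_shards = partition(contracts, "customer_id", NUM_SHARDS)

# Each shard joins only its own rows; the global result is the concatenation.
joined = [
    row
    for s in range(NUM_SHARDS)
    for row in local_join(customer_shards[s], contract_shards[s])
]
print(len(joined))
```

Because a contract always lands on the same shard as its customer, no contract is lost by joining shard by shard. With "random" partitioning this guarantee disappears and one side has to be redistributed first.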
  23. The join graph topology
     • Partitioning schemes: a partition function (e.g., hash partitioning, range partitioning, random partitioning, and custom partitioning), a partition key, and a partition count.
     • Data exchange operators: initial partitioning, repartitioning, full merge, partial repartitioning and partial merge
     • Merging schemes: random merge, sort merge, concat-merge and sort concat-merge.
     • Distribution policies: distribution with duplication and distribution without duplication
  24. Joins from a functional perspective
     • Think use cases: model and distribute the data according to the queries you are expecting
       – Enrich a table with data from lookup tables; I need objects of type A filtered by criteria on properties of type B
     • Loosen the generality
       – I only want star schemas; I only want to bulk load data at night and query it all day; I only want to run a few really expensive queries, not millions of tiny ones
     • Polyglot persistence: store your data in multiple ways
       – Graph databases for relations, document store for business objects, key-value for lookup properties
     • Services and micro-services: you access the data through the services layer
       – Then you can implement the joins in many ways
  25. The service layer
     • Seamless integration for applications
     • Caches the data model and exposes a business model (services fit for business needs)
     • (diagram: Application 1, Application 2, … call the Service Layer, which sits on top of Customers, Contracts, …)
  26. Implementing your own joins
     • According to use cases and capacities of each module, joins can be implemented at different levels
     • (diagram: Source 1 and Source 2 feed a NoSQL store through a Batch (init or delta) or a JMS queue (real-time data injection); the Service Layer and the Front End sit on top of the NoSQL store)
     • The datasource executes the join
     • Parallel reading of the two datasources and join done by the Batch (~Sort Merge Join)
     • "GET" calls to the second data source (~Nested Loops Join)
     • Read one source, hash it, then stream the other one (~Hash Join)
     • Join executed by the NoSQL database (e.g. Parent/Child in Elasticsearch)
     • Join implemented by the Service layer at query time (~"My Own Custom Join")
  27. Joins with NoSQL databases: normalized database vs. document
       {"film" : {
         "id" : "183070",
         "title" : "The Artist",
         "published" : "2011-10-12",
         "genre" : ["Romance", "Drama", "Comedy"],
         "language" : ["English", "French"],
         "persons" : [
           {"person" : { "id" : "5079", "name" : "Michel Hazanavicius", "role" : "director" }},
           {"person" : { "id" : "84145", "name" : "Jean Dujardin", "role" : "actor" }},
           {"person" : { "id" : "24485", "name" : "Bérénice Bejo", "role" : "actor" }},
           {"person" : { "id" : "4204", "name" : "John Goodman", "role" : "actor" }}
         ]
       }}
  28. SQL vs. NoSQL: the issue with joins :-)
     • Let's say you have two relational entities: Persons and Contracts
       – A Person has zero, one or more Contracts
       – A Contract is attached to one or more Persons (e.g. the Subscriber, the Grantee, …)
     • Need search services:
       – S1: getPersonsDetailsByContractProperties
       – S2: getContractsDetailsByPersonProperties
     • Simple solution with SQL:
       SELECT P.* FROM P, C WHERE P.id = C.pid AND C.a = 'A'
       SELECT C.* FROM P, C WHERE P.id = C.pid AND P.a = 'A'
  29. The issue with joins - solutions
     • Solution 1
       – Store Persons with Contracts together for S1: {"person" : { "details" : …, … , "contracts" : ["contract" :{"id" : 1, …}, …] }}
       – Store Contracts with Persons together for S2: {"contract" : { "details" : …, …, "persons" : ["person" :{"id" : 1, "role" : "S", …}, …]}}
     • Issues with solution 1:
       – A lot of data duplication
       – Have to get Contracts when indexing Persons and vice versa
     • Solution 2
       – Use the joins provided by the NoSQL system (e.g. Elasticsearch's Parent/Child)
     • Issues with solution 2:
       – Works in one direction but not the other (only one parent for n children, a 1-to-n relationship)
     • Solution 3
       – Store Persons and Contracts separately
       – Launch two NoSQL queries and join the results in your application to get the final response
       – For S1: first get all Contract ids by Contract properties, then get Persons by Contract ids (terms query or mget)
       – For S2: first get all Person ids by Person properties, then get Contracts by Person ids (terms query or mget)
       – The response to the second query can be returned "as is" to the client (pagination, etc.)
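Solution 3 for service S1 can be sketched as a two-step lookup in the service layer. Everything below is a stand-in: the in-memory lists `CONTRACTS` and `PERSONS` replace the NoSQL store, and `search_contract_ids` and `search_persons_by_contract_ids` are hypothetical helpers playing the role of the first query and of the terms query, not calls from any real client library.

```python
# In-memory stand-ins for two separately stored collections (hypothetical data).
CONTRACTS = [
    {"id": 1, "a": "A"},
    {"id": 2, "a": "B"},
    {"id": 3, "a": "A"},
]
PERSONS = [
    {"id": 10, "name": "Ana", "contract_ids": [1]},
    {"id": 11, "name": "Bob", "contract_ids": [1, 3]},
    {"id": 12, "name": "Cy", "contract_ids": [2]},
]


def search_contract_ids(**properties):
    """First query: ids of the contracts matching the given properties."""
    return [
        c["id"]
        for c in CONTRACTS
        if all(c.get(k) == v for k, v in properties.items())
    ]


def search_persons_by_contract_ids(contract_ids):
    """Second query (~ terms query): persons attached to any of the contracts."""
    wanted = set(contract_ids)
    return [p for p in PERSONS if wanted.intersection(p["contract_ids"])]


def get_persons_details_by_contract_properties(**properties):
    """S1: the join is done by chaining the two queries in the service layer."""
    contract_ids = search_contract_ids(**properties)
    return search_persons_by_contract_ids(contract_ids)


result = get_persons_details_by_contract_properties(a="A")
print([p["name"] for p in result])
```

Each person appears once even when several of their contracts match, and the second query's response can be handed back unchanged, which keeps pagination on the store side. S2 follows the same pattern with the roles of the two collections swapped.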
  30. Optimizing tricks. Distributed.
     • Model the data according to use cases
     • Duplicate the data if different modeling schemas are needed
     • Loosen the generality
     • Choose a good partitioning key
     • Colocate joins as much as possible. The less redistribution the better.
     • Implement your own joins :-)
  31. References
     • Back to the future: SQL 92 for Elasticsearch? - Lucian Precup - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/lucian_precup_back_to_the_future_sql_92_for_elasticsearch.pdf)
     • Oracle Database Online Documentation - Database SQL Tuning Guide (https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#TGSQL242)
     • Advanced Join Strategies for Large-Scale Distributed Computation - Nicolas Bruno, YongChul Kwon, Ming-Chuan Wu - VLDB (http://www.vldb.org/pvldb/vol7/p1484-bruno.pdf)
     • "Distributed joins are hard to scale" - Interview with Dwight Merriman by Roberto V. Zicari (http://www.odbms.org/blog/2011/02/distributed-joins-are-hard-to-scale-interview-with-dwight-merriman/)
     • Joins and aggregations in a distributed NoSQL DB - Max Neunhöffer - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/NeunhoefferDublin.pdf)