DIY: A distributed database cluster, or: MySQL Cluster



Live from the International PHP Conference 2013: MySQL Cluster is a distributed, auto-sharding database offering 99.999% high availability. It runs on a Raspberry Pi as well as on a cluster of multi-core machines. A 30-node cluster was able to deliver 4.3 billion (not million) read transactions per second in 2012. Take a deeper look into the theory behind all the MySQL replication/clustering solutions (including 3rd party) and learn how they differ.


  1. MySQL Cluster talk: DIY. No best practices. No product presentation... you have been warned. No marketing fluff.
  2. Foreword and disclaimer: Do it yourself, become a maker, get famous! In this course you will learn how to create an eager update anywhere cluster. You need:
     • A soldering iron, solder
     • Wires (multiple colors recommended)
     • A collection of computers
     By the end of the talk you can either challenge MySQL, or get MySQL Cluster for free – it is Open Source, as it always has been. Get armed with the distributed system theory you, as a developer, need to master any distributed database.
  3. DIY – Distributed Database Cluster, or: MySQL Cluster. Ulf Wendel, MySQL/Oracle. No marketing fluff.
  4. Live on stage: Making a Cluster
  5. The speaker says...
     Beautiful work, but unfortunately the DIY troubles begin before the first message has been delivered in our cluster. Long before we can speak about the latest hat fashion, we have to fix wiring and communication! Communication should be:
     • Fast
     • Reliable (loss, retransmission, checksum, ordering)
     • Secure
     Network performance is a limiting factor for distributed systems. Hmm, we better go back to the drawing board before we mess up more computers...
  6. Back to the beginning: goals
     Availability
     • Cluster as a whole unaffected by loss of nodes
     Scalability
     • Geographic distribution
     • Scale size in terms of users and data
     • Database specific: read and/or write load
     Distribution Transparency
     • Access, Location, Migration, Relocation (while in use)
     • Replication
     • Concurrency, Failure
  7. The speaker says...
     A distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency.
     MySQL Cluster has a shared-nothing design good enough for 99.999% availability (five minutes downtime per year). It scales from a Raspberry Pi run in a briefcase to 1.2 billion write transactions per second on a 30 data node cluster (if using possibly unsupported bleeding edge APIs). It offers full distribution transparency, with the exception of partition relocation, which has to be triggered manually but is performed transparently by the cluster. That is the bar to beat. Let's learn what kinds of clusters exist, how they tick and what the best algorithms are.
  8. What kind of cluster?
     Two questions span the design space: where are transactions run (Primary Copy vs. Update Anywhere), and when does synchronization happen (Eager vs. Lazy)?
     • Eager + Primary Copy: not available for MySQL
     • Eager + Update Anywhere: MySQL Cluster, 3rd party
     • Lazy + Primary Copy: MySQL Replication, 3rd party
     • Lazy + Update Anywhere: MySQL Cluster Replication
  9. The speaker says...
     A wide range of clusters can be categorized by asking where transactions are run and when replicas synchronize their data. Any eager solution ensures that all replicas are synchronized at any time: it offers strong consistency. A transaction cannot commit before synchronization is done. Please note what this means for transaction rates:
     • Single computer tx rate ~ disk/fsync rate
     • Lazy cluster tx rate ~ disk/fsync rate
     • Eager cluster tx rate ~ network round-trip time (RTT)
     Test: Would you deploy MySQL Cluster on Amazon EC2 :-) ?
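The proportionalities above can be turned into a back-of-the-envelope sketch. All latency figures below are illustrative assumptions, not measurements:

```python
# Rough commit-rate ceilings for a strictly serial stream of transactions,
# where each commit must wait for one fsync or one network round trip.
fsync_latency_s = 0.005   # assumed ~5 ms per fsync on a spinning disk
lan_rtt_s = 0.0002        # assumed ~0.2 ms round trip within a data center
wan_rtt_s = 0.080         # assumed ~80 ms round trip between distant regions

def max_serial_commit_rate(latency_s):
    """Upper bound on commits/second when every commit waits one latency."""
    return 1.0 / latency_s

print(max_serial_commit_rate(fsync_latency_s))  # single node / lazy: ~200 tx/s
print(max_serial_commit_rate(lan_rtt_s))        # eager on a LAN: ~5000 tx/s
print(max_serial_commit_rate(wan_rtt_s))        # eager across a WAN: ~12 tx/s
```

This is also why the EC2 quiz question answers itself: the higher the RTT between replicas, the lower the commit ceiling of an eager cluster.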
  10. Lazy Primary Copy we have...
      [diagram: a Master (Primary) taking writes and replicating to three Slaves (Copies) that serve reads]
      • Lazy synchronization: eventual consistency
      • Primary Copy: where any transaction may run
  11. The speaker says...
      MySQL Replication falls into the category of lazy Primary Copy clusters. It is a rather inflexible solution, as all updates must be sent to the primary. However, this simplifies concurrency control of conflicting, concurrent update transactions. Concurrency control is no different from a single database.
      Lazy replication can be fast. Transactions don't have to wait for synchronization of replicas. The price of the fast execution is the risk of stale reads and eventual consistency. Transactions can be lost when the primary crashes after commit and before any copy has been updated. (This is something you can avoid by using MySQL semi-sync replication, which delays the commit until delivery to a copy.)
  12. BTW, confusing: Multi-Master
      [diagram: two Primary/Copy pairs; SET A = 1 runs on the left primary, SET B = 1 on the right primary; each primary replicates both A and B from the other]
  13. The speaker says...
      Be aware of the term Multi-Master. The MySQL community sometimes uses it to describe a set of Primary Copy clusters where the primaries (masters) replicate from each other. This is one of the many possible topologies that you can build with MySQL Replication. In the example, the PC cluster on the left manages table A and the PC cluster on the right manages table B. The primaries copy table A respectively table B from each other. There is no concurrency control and conflicts can arise. There is no distribution transparency. This is not its own kind of cluster with regard to our where and when criteria. And, it is rarely what you want...
      Not a good goal for DIY – let's move on.
  14. Let's do Eager Update Anywhere
      [diagram: four replicas; reads and writes may go to any of them]
      • Eager synchronization: strong consistency
      • Update Anywhere: any transaction can run on any replica
  15. The speaker says...
      An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. Synchronization happens as part of the commit, thus strong consistency is achieved. Remember: transaction rate ~ network RTT. Failure tolerance is better than with Primary Copy. There is no single point of failure – the primary – that can cause a total outage of the cluster. Nodes may fail without bringing the cluster down immediately. Concurrency control (synchronization) is complex, as concurrent transactions from different replicas may conflict.
  16. Concurrency Control: 1SR
      [diagram: two replicas receive t0: SET a = 1 and t0: SET a = 2 concurrently; both must apply them in the same order to converge]
      One-Copy-Serializability (1SR) for correctness
      • All replicas must decide on the same transaction order
  17. The speaker says...
      Concurrent ACID transactions must be isolated from each other to ensure correctness. The database system needs a mechanism to detect conflicts. If there are any, transactions need to be serialized. The challenge is to have all replicas commit transactions in the same serial order. One-Copy-Serializability (1SR) demands the concurrent execution of transactions in a replicated database to be equivalent to a serial execution of these transactions over a single logical copy of the database. 1SR is the highest level of consistency; lower ones exist, for example snapshot isolation. Given that, the questions are:
      • How to detect conflicting transactions?
      • How to enforce a global total order?
  18. Certification: detect conflicts
      [diagram: an update transaction runs on one replica, a read query on another; the update's read set (a = 1) and write set (b = 12) are sent to all replicas for certification]
      • Transactions get executed and certified before commit
      • Conflict detection is based on read and write sets
      • Multi-primary deferred update
  19. The speaker says...
      (For brevity we discuss multi-primary deferred update only.) In a multi-primary deferred update system a read query can be served by a replica without consulting any of the other replicas. A write transaction must be certified by all other replicas before it can commit. During the execution of the transaction, the replica records all data items read and written. The read/write sets are then forwarded by the replica to all other replicas to certify the remote transaction. The other replicas check whether the remote transaction includes data items modified by an active local transaction. The outcome of the certification decides on commit or abort. Either symmetric (statement based) or asymmetric (row based) replication can be used.
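The certification step just described can be sketched in a few lines. The set representation and the function name are illustrative, not any product's API:

```python
# Hedged sketch of deferred-update certification: a transaction ships its
# read set and write set; a replica certifies it by checking for overlap
# with the write sets of concurrently committed transactions.

def certify(tx_read_set, tx_write_set, concurrent_write_sets):
    """Return True (commit) if neither the read set nor the write set of the
    incoming transaction intersects any concurrent committed write set."""
    for ws in concurrent_write_sets:
        if tx_read_set & ws or tx_write_set & ws:
            return False  # conflict -> abort
    return True  # no conflict -> commit

# The update transaction from the slide: it read a and wrote b.
print(certify({"a"}, {"b"}, [{"c"}]))  # True: no overlap, commits
print(certify({"a"}, {"b"}, [{"a"}]))  # False: stale read of a, aborts
```

Because every replica runs the same check on the same sets, the decision is local once the sets have been delivered.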
  20. Concurrency Control
      [diagram: two replicas executing t0: SET a = 1 and t0: SET a = 2 concurrently]
      Various synchronization mechanisms
      • Atomic commit
      • Atomic broadcast
      • Strict two-phase locking (2PL)
      • Optimistic, physical clock, Lamport's clock, vector clock, ...
  21. The speaker says...
      One challenge remains: replicas must agree on a global total order for committing transactions, no matter in which order they receive messages.
      We will discuss atomic commit (two-phase commit) and atomic broadcast. The other approaches are out of scope.
  22. Atomic commit for CC
      State machine: Execute → Committing → PreCommit → Committed / Aborted
      Formula (background): serial execution, unnecessary aborts
  23. The speaker says...
      Atomic commit can be expressed as a state machine with the final states abort and commit. Once a transaction has been executed, it enters the committing state in which certification/voting takes place. Given the absence of conflicting concurrent transactions, a replica sets the transaction's status to precommit. If all replicas precommit, the transaction is committed, otherwise it is aborted.
      Don't worry about the formula. It checks for concurrent transactions – as we did before – and ensures, in case of conflicts, that only one transaction can commit at a time. Problem: it may also do unnecessary aborts depending on message delivery order, as it requires all servers to precommit->commit in the same order.
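The voting rule of the state machine can be sketched as follows; the state names follow the slide, the logic is deliberately simplified:

```python
# Minimal sketch of the atomic-commit state machine described above:
# a transaction in "committing" moves to "precommit" on each replica that
# sees no conflict; it commits only if every replica precommits.

STATES = {"execute", "committing", "precommit", "committed", "aborted"}

def run_vote(replica_votes):
    """Decide the final state from the per-replica votes."""
    if all(vote == "precommit" for vote in replica_votes):
        return "committed"
    return "aborted"

print(run_vote(["precommit", "precommit", "precommit"]))  # committed
print(run_vote(["precommit", "abort", "precommit"]))      # aborted
```

The "unnecessary aborts" mentioned above happen when two non-conflicting transactions reach the replicas in different orders and each replica precommits them in its own order.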
  24. Atomic broadcast for CC
      Atomic broadcast guarantees
      • Agreement: if one server delivers a message, all will
      • Total order: all servers deliver messages in the same order
      Greatly simplified concurrency check
      • Deterministic: no extra communication after local decision
  25. The speaker says...
      Atomic broadcast ensures that transactions are delivered in the same order to all replicas. Thus, certification of transactions is deterministic: all replicas will make the same decision about commit or abort because they all base their decision on the same facts. This in turn means that there is no need to coordinate the decisions of all replicas – all replicas will make the same decision.
      A transaction does not conflict, and thus will commit, if it is executed after the commit of any other transaction, or its read set does not overlap with the write set of any other transaction. The formula is greatly simplified! Great for DIY!
  26. Voting quorum: ROWA, or...?
      Read-One Write-All is a special quorum
      • Quorum constraints: NR + NW > N, NW > N/2
      Example: N = 12, read quorum NR = 3, write quorum NW = 10
      Example: N = 3, read quorum NR = 2, write quorum NW = 2
      [diagram: the replicas of each example, marked by quorum membership]
  27. The speaker says...
      So far we have silently assumed a Read-One Write-All (ROWA) quorum for voting. Reads could be served locally because updates have been applied to all replicas. Alternatively, we could make a rule that an update has to be agreed by and applied to half of the replicas plus one. This may be faster than achieving agreement among all replicas. However, for a correct read we now have to contact half of the replicas plus one and check whether they all give the same reply. If so, we must have read the latest version, as the remaining, unchecked replicas form a minority that cannot have been updated. The read quorum overlaps the write quorum by at least one element.
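The two quorum constraints from the slide are easy to check mechanically; this sketch just encodes them:

```python
# Quorum constraints: a read quorum must overlap every write quorum
# (NR + NW > N), and any two write quorums must overlap (NW > N/2).

def valid_quorum(n, n_r, n_w):
    """True if the read/write quorum sizes satisfy both overlap constraints."""
    return (n_r + n_w > n) and (n_w > n / 2)

print(valid_quorum(12, 3, 10))  # True: the N = 12 example from the slide
print(valid_quorum(3, 2, 2))    # True: the N = 3 example
print(valid_quorum(3, 1, 3))    # True: this is ROWA (read one, write all)
print(valid_quorum(12, 3, 6))   # False: two write quorums need not overlap
```

Note that ROWA is simply the corner case NR = 1, NW = N.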
  28. Voting quorum: ROWA!
      ROWA almost always performs better
      • Are Quorums an Alternative for Data Replication? (Jiménez-Peris et al.)
      • „The obvious conclusion from these results is that ROWAA is the best choice for a wide range of application scenarios. It offers good scalability (within the limitations of replication protocols), very good availability, and an acceptable communication overhead. It also has the significant advantage of being very simple to implement and very easy to adapt to configuration changes. For very peculiar loads and configurations, it is possible that some variation of quorum does better than ROWAA.“
      • Background: scale out results from study
  29. The speaker says...
      Judging from the paper, ROWA respectively Read-One Write-All-Available (ROWAA) is a promising approach. For example, it offers linear scalability for read-only workloads but still remains competitive for mixed update and read loads. It requires a high write-to-read ratio before the various quorum algorithms outperform ROWA on scalability. In sum: ROWA beats quorums by a magnitude for reads but does not drop by a magnitude for writes, and the web is read dominated. Scalability is one aspect. Quorums also help with availability – the study's finding is similar: ROWA is fine.
      DIY decision on concurrency control: ROWA, atomic broadcast. Quiz: name a system using quorums? Riak! Next: availability and fault tolerance.
  30. Fault Tolerance: 2PC
      [diagram: a Coordinator sends a Vote Request to two Participants, both answer PreCommit, the Coordinator sends Global Commit and the Participants commit]
      Complex failure handling required
      • Later evolution: Three-Phase Commit (3PC)
  31. The speaker says...
      When discussing atomic commit we have effectively shown the Two-Phase Commit (2PC) protocol. 2PC starts with a vote request multicast from a coordinator to all participants. The participants either vote to commit (precommit) or abort. Then, the coordinator checks the voting result. If all voted to commit, it sends a global commit message and the participants commit. Otherwise the coordinator sends a global abort command. Various issues may arise in case of network or process failures. Some cannot be cured using timeouts. For example, consider the situation when a participant precommits but gets no global commit or global abort. The participant cannot unilaterally leave the state. At best, it can ask another participant what to do.
  32. Fault Tolerance: 2PC
      [diagram: the Coordinator sends a Vote Request, both Participants answer PreCommit, then the Coordinator crashes before announcing the outcome]
      Two-Phase Commit is a blocking protocol
  33. The speaker says...
      The worst case scenario is a crash of the coordinator after all participants have voted to precommit. The participants cannot leave the precommit state before the coordinator has recovered. They do not know whether all of them have voted to commit or not. Thus, they do not know whether a global commit or global abort has to be performed. As none of them has received a message about the outcome of the voting, the participants cannot contact one another and ask for the outcome.
      Two-Phase Commit is also known as a blocking protocol.
  34. Virtual Synchrony
      Reliable multicast/broadcast
      • Built on the idea of group views and view changes
      [diagram: messages M1, M2, M3 are delivered in group view G1 = {P1, P2, P3}; a view change (VC) adds P4 and installs G2 = {P1, P2, P3, P4}, in which M4 is delivered]
  35. The speaker says...
      Virtual Synchrony is a mechanism that does not block. It is built around the idea of associating multicast messages with the notion of a group. A message is delivered to all members of a group but no other processes. Either the message is delivered to all members of a group or to none of them. All members of the group agree that they are part of the group before the message is multicast (group view). In the example, M1...M3 are associated with the group G1 = {P1, P2, P3}. If a process wants to join or leave a group, a view change message is multicast. In the example, P4 wants to join the group and a VC message is sent while M3 is still being delivered. Virtual Synchrony requires that M3 is either delivered to all of G1 before the view change takes place or to none.
  36. Virtual Synchrony
      View changes act as a message barrier
      • Remember the issues with 2PC...?
      [diagram: M5, M6, M7 are sent in view G2 = {P1, P2, P3, P4}; P4 crashes while sending M7; a view change installs G3 = {P1, P2, P3}, in which M8 is delivered]
  37. The speaker says...
      There is only one condition under which a multicast message is allowed not to be delivered: if the sender crashed. Assume the processes continue working and multicast messages M5, M6, M7 to group G2 = {P1, P2, P3, P4}. While P4 sends M7 it crashes. P4 has managed to deliver its message to {P3} only. The crash of P4 is noticed and a view change is triggered. Because Virtual Synchrony requires a message to be delivered to all members of the group associated with it, but the sender crashed, P3 is free to drop M7 and the view change can take place. A new group view G3 is established and messages can be exchanged again.
  38. Reliable: delivered vs. received
      Wire: message ordering and fault tolerance
      • Common choices: UDP or TCP over IP
      [diagram: two replicas multicast Update 1 and Update 2; one replica sees t1: Update 1, t2: Update 2, the other t1: Update 2, t2: Update 1 (lost)]
  39. The speaker says...
      Virtual Synchrony offers reliable multicast. Reliability can best be achieved using a protocol higher up in the OSI model. Isis, an early framework implementing Virtual Synchrony, used TCP point-to-point connections if reliable service was requested. TCP is a connection oriented protocol (endpoint failures can be detected easily) with error handling and message delivery in the order sent. However, using TCP only, there are no ordering constraints between messages from any two senders. Those ordering constraints have to be implemented at the application layer. We say a message can be received on the network layer in a different order than it is delivered to the application by the model discussed. Vector clocks can be used for global total ordering.
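A vector clock can be sketched in a few lines; this is an illustrative implementation, not the one Isis used. Each process keeps one counter per process, and the resulting partial order can be extended to a total order (for example by breaking ties on process id):

```python
# Illustrative vector-clock sketch: counters are merged on receive, and
# happened_before() recovers the causal (partial) order of events.

def vc_new(n):
    return [0] * n

def vc_send(clock, pid):
    clock[pid] += 1
    return list(clock)  # timestamp attached to the outgoing message

def vc_receive(clock, pid, msg_clock):
    for i in range(len(clock)):
        clock[i] = max(clock[i], msg_clock[i])
    clock[pid] += 1

def happened_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

p0, p1 = vc_new(2), vc_new(2)
m = vc_send(p0, 0)            # P0 sends a message, timestamp [1, 0]
vc_receive(p1, 1, m)          # P1 receives it; its clock becomes [1, 1]
print(happened_before(m, p1)) # True: the send causally precedes the receive
```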
  40. Atomic broadcast definition
      AB = Virtual Synchrony offering total-order delivery
      • „Synchrony“ does not refer to temporal aspects
      [diagram: unordered delivery – P1...P4 deliver M1 and M2 in different orders – versus ordered delivery, where every process delivers M1 before M2]
  41. The speaker says...
      Atomic broadcast means Virtual Synchrony used with total-order message delivery. When Virtual Synchrony was introduced back in the mid 80s, it was explicitly designed to allow other message orderings. For example, it should be able to support distributed applications that have a notion of finding messages that commute, and thus may be applied in an order different from the order sent to improve performance. If events are applied in different orders on different processes, the system cannot be called synchronous any more – the inventors called it virtually synchronous.
      However, recall we are only after total ordering for 1SR.
  42. How to cook brains
      Wash the brain without marketing fluff, split brain, done!
      • System dependent... e.g. the Isis failure detector was very basic
      [diagram: split brain – the connection between {P1, P2} and {P3, P4} is lost; n1({P1, P2, P3, P4}) = 4, n2({P1, P2}) = 2 < n1/2 + 1]
  43. The speaker says...
      The failure of individual processes – or database replicas – has been discussed. The model has measures to handle them using a fail-stop approach.
      To conclude the discussion of fault tolerance we look at a situation called split brain: one half of the cluster has lost the connection to the other half. Which shall survive? The answer is often implementation dependent. For example, the early Virtual Synchrony framework Isis has a rule that a new group view can only be installed if it contains n / 2 + 1 members, with n being the number of members in the current group. In the example both halves would shut down. Brain splitting question: how many replicas would you project for a cluster if you don't know the split brain implementation details?
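The Isis-style majority rule cited above is a one-liner; the function name is illustrative:

```python
# Majority rule: a new group view may only be installed if it contains
# at least n/2 + 1 of the current group's n members.

def view_survives(current_size, new_size):
    """True if the proposed view holds a strict majority of the old view."""
    return new_size >= current_size // 2 + 1

# A 4-node cluster splitting 2/2: neither half reaches 3, both shut down.
print(view_survives(4, 2))  # False
# A 5-node cluster splitting 3/2: only the majority side survives.
print(view_survives(5, 3))  # True
print(view_survives(5, 2))  # False
```

This also answers the quiz question: project an odd number of replicas, so that a symmetric split – where both halves shut down – cannot occur.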
  44. DIY: Hack MySQL (oh, oh), or...?
      In-core architecture
      [diagram: PHP clients → Load Balancer (PECL/mysqlnd_ms, MySQL Proxy) → two MySQL DBMS processes, each containing a Reflector and a Replicator, connected through a GCS]
  45. The speaker says...
      Here is a generic architecture made of the following components:
      • Clients (PHP, Java, ...) using well known interfaces
      • Load Balancer (for example PECL/mysqlnd_ms)
      • The actual database system
      • The Reflector, which allows inspection and modification of on-going transactions
      • The (distributed) Replicator handling concurrency control
      • The Group Communication System (GCS) providing communication primitives such as multicast (GCS examples: Appia, JGroups – Java, Spread – C/C++)
  46. DIY: Hack MySQL (oh, oh), or...?
      Middleware architecture
      [diagram: Clients → Load Balancer → two Virtual DBMS layers, each with a Reflector and a Replicator over a GCS, in front of the actual DBMS processes]
  47. The speaker says...
      An in-core design requires support for a reflector by the database. Strictly speaking there is no API inside MySQL one can use. The APIs used for MySQL Replication are not sufficient. Nonetheless, MySQL Replication can be classified as in-core in our model. Due to the lack of a reflector API, the only third party product following an in-core design (Galera by Codership) has to patch the MySQL core.
      Tungsten Replicator by Continuent is a middleware design. Clients contact a virtual database. Requests are intercepted, parsed and replicated. The challenge is in the interception: statements using non-deterministic calls such as NOW() and TIME() must be taken care of.
  48. DIY: Hack MySQL (oh, oh), or...?
      Hybrid architecture
      [diagram: Clients → Load Balancer → two DBMS processes with Reflector plugins; the Replicators run as separate processes connected through a GCS]
  49. The speaker says...
      In a hybrid architecture the reflector runs within the database process but the replicator layer uses extra processes.
      It is not a perfect comparison, as we will see later, but for the sake of our model we can classify MySQL Cluster as a hybrid architecture. The reflector is implemented as a storage engine. The replicator layer uses extra processes. This design has some neat MySQL NDB Cluster specific benefits. If any MySQL product has NoSQL genes, it is MySQL Cluster.
  50. DIY: Summary
      • Eager + Primary Copy: not available for MySQL
      • Eager + Update Anywhere: MySQL Cluster (hybrid), Galera (in-core)
      • Lazy + Primary Copy: MySQL Replication (in-core), Tungsten (middleware)
      • Lazy + Update Anywhere: MySQL Cluster Replication (hybrid)
  51. The speaker says...
      Time for a summary before coding ants and compilers start their work. From a DIY perspective we can skip lazy Primary Copy: it has simple concurrency control, it does not depend on network speed, it is great for flaky and slow WAN connections, but it offers eventual consistency only (hint: enjoy PECL/mysqlnd_ms!) and it has no means to scale writes. And, it exists – no karma...
      An eager update anywhere solution offering the highest level of correctness (1SR) gives you strong consistency. It scales writes to some degree because they can be executed on any replica, which parallelizes execution load. Commit performance is network bound.
  52. DIY: The Master Class
      Capability matrix: full vs. partial replication, read vs. write scale-out
      • Full replication, read scale-out: MySQL Replication (lazy primary copy, in-core), Tungsten (primary copy, middleware), Galera (eager update anywhere, in-core)
      • Full replication, write scale-out: if 1SR – hard limit
      • Partial replication, read and write scale-out: MySQL Cluster (eager update anywhere, hybrid)
  53. The speaker says...
      The DIY master class for maximum karma is a partial replication solution offering strong consistency. Partial replication is the only way to ultimately scale write requests. The explanation is simple: every write adds load to the entire cluster. Remember that writes need to be coordinated; remember that concurrency control involves all replicas (ROWA) or a good number of them (quorum). Thus, every additional replica adds load to all others. The solution is to partition the data set and keep each partition on a subset of all replicas only. NoSQL calls it sharding, MySQL Cluster calls it partitioning. Partial replication – that is the DIY masterpiece that will give you KARMA.
  54. MySQL (NDB) Cluster goals
      Availability
      • Shared-nothing, high availability (99.999%)
      • WAN replication to secondary data centers
      Scalability
      • Read and write scale-out through partial replication (partitioning)
      • Distributed queries (parallelize work), real-time guarantees
      • Focus on in-memory, with a disk storage extension
      • Sophisticated thread model for multi-core CPUs
      • Optimized for short transactions (hundreds of operations)
      Distribution Transparency
      • SQL level: 100%, low-level interfaces available
  55. The speaker says...
      I am not aware of text books discussing partial replication theory in depth. Thus, we have to reverse engineer an existing system. As this is a talk about MySQL Cluster, how about talking about MySQL Cluster finally?
      MySQL Cluster has originally been developed to serve telecommunication systems. It aims to parallelize work as much as possible, hence it is a distributed database. It started as an in-memory solution but can meanwhile store data on disk. It runs best in environments offering low network latency and high network throughput, and issuing short transactions. Applications should not demand complex joins. There is no chance you throw Drupal at it and Drupal runs super-fast out of the box! Let's see why...
  56. MySQL Cluster is a hybrid
      SQL view: Cluster is yet another table storage engine
      [diagram: Clients → Load Balancer → two MySQL servers; Reflector plugin = NDB storage engine; Replicator = NDB data node; connected through a GCS]
  57. The speaker says...
      MySQL Cluster has a hybrid architecture. It consists of the green elements on the slide. The Reflector is implemented as a MySQL storage engine. From a SQL user's perspective, it is just another storage engine, similar to MyISAM, InnoDB or others (distribution transparency). From a SQL client's perspective there is no change: all MySQL APIs can be used. The Reflector (NDB storage engine) runs as part of the MySQL process. The Replicator is a separate process called an NDB data node. Please note, node means process, not machine. MySQL Cluster does not fit perfectly into the model: an NDB data node combines Replicator and storage tasks.
      BTW, what happens to the Cluster if a MySQL Server fails?
  58. MySQL Cluster is a beast
      Fast low-level access: bypassing the SQL layer
      [diagram: as before, but additional clients talk to the NDB data nodes directly – 4.3 billion read tx/s and 1.2 billion write tx/s (in 2012)]
  59. The speaker says...
      From the perspective of MySQL Cluster, a MySQL Server is yet another application client. MySQL Server happens to be an application that implements a SQL view on the relational data stored inside the cluster.
      MySQL Cluster users often bypass the SQL layer by implementing application clients on their own. SQL is a rich query language, but parsing a SQL query can take 30...50% of the total runtime of a query. Thus, bypassing is a good idea. The top benchmark results we show for Cluster are achieved using C/C++ clients directly accessing MySQL Cluster. There are many extra APIs for this special case: NDB API (C/C++, low level), ClusterJ (ORM style), ClusterJPA (low level), ... – even for node.js (ORM style).
  60. Partitioning (auto-sharding)
      [diagram: four NDB data nodes in two node groups. Node group 0: node 1 holds partition 0 (primary) and partition 2 (copy), node 2 holds partition 0 (copy) and partition 2 (primary). Node group 1: node 3 holds partition 1 (primary) and partition 3 (copy), node 4 holds partition 1 (copy) and partition 3 (primary)]
  61. The speaker says...
      There is a lot to say about how MySQL Cluster partitions a table and spreads it over nodes. The manual has all the details, just all...
      The key idea is to use an eager primary copy approach for partitions, combined with a mindful distribution of each partition's primary and its copies. NDB supports zero or one copies (replication factor). The failure of a partition's primary does not cause a failure of the Cluster. In the example, the failure of any one node has no impact. Also, nodes 1 and 4 may fail without a stop of the Cluster (fail-stop model). But the cluster shuts down if all nodes of a node group fail.
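The availability rule just stated – the cluster stays up as long as every node group still has at least one live node – can be sketched directly; node ids and names are illustrative:

```python
# Fail-stop availability rule for node groups: the cluster survives a set
# of node failures iff no node group has lost all of its members.

def cluster_up(node_groups, failed):
    """node_groups: list of sets of node ids; failed: set of failed nodes."""
    return all(group - failed for group in node_groups)

groups = [{1, 2}, {3, 4}]          # the two node groups from the slide
print(cluster_up(groups, {1}))     # True: any single node may fail
print(cluster_up(groups, {1, 4}))  # True: one node per group may fail
print(cluster_up(groups, {1, 2}))  # False: node group 0 is gone entirely
```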
  62. Concurrency Control: 2PL, „2PC“
      [diagram: the four data nodes and partitions as before; writes (W) take locks on a partition's primary, reads (R) are served under read locks]
  63. The speaker says...
      Buuuuh? Two-Phase Locking (2PL) and Two-Phase Commit (2PC) are used for concurrency control. Cluster uses traditional row locking to isolate transactions. Read and write locks can be distributed throughout the cluster. The locks are set on the primary partitions. Transactions are serialized during execution. When a transaction commits, an optimized Two-Phase Commit is used to synchronize the partition copies.
      The SQL layer recognizes the commit as soon as the copies are updated (and before logs have been written to disk). The low-level NDB C/C++ application API is asynchronous. Fire and forget is possible: your application can continue before transaction processing has even begun!
  64. Brain Masala
      [diagram: the four data nodes and partitions as before, plus an Arbitrator process]
  65. The speaker says...
      The failure of a single node is detected using a heartbeat protocol: the details are documented, future improvements are possible. Both MySQL Cluster and Virtual Synchrony separate message delivery from node failure detection.
      The worst case scenario of a split brain is cured by the introduction of arbitrators. If the nodes split and each half is able to keep the Cluster up, the nodes try to contact the arbitrator. It is then up to the arbitrator to decide who stays up and who shuts down. Arbitrators are extra processes, ideally run on extra machines. Management nodes can act as arbitrators too. You need at least one management node for administration, thus you always have an arbitrator readily available.
  66. Drupal? Sysbench? Oh, oh...
      [diagram: a MySQL server issuing a query that has to touch partitions spread over all four data nodes]
  67. The speaker says...
      Partial replication (here: partitioning, sharding) is the only known solution to the write scale-out problem. But it comes at the high price of distributed queries.
      A SQL query may require reading data from many partitions. On the one hand, work is nicely parallelized over many nodes; on the other hand, records found have to be transferred within the cluster from one node to another. Although Cluster tries to batch requests together efficiently to minimize communication delays, transferring data from node to node to answer queries remains an expensive operation.
  68. Oh, oh... tune your partitions!
      [diagram: the four data nodes as before; a MySQL server runs the following]

      CREATE TABLE cities (
        id INT NOT NULL,
        population INT UNSIGNED,
        city_name VARCHAR(100),
        PRIMARY KEY (city_name, id)
      );

      SELECT id FROM cities WHERE city_name = 'Kiel';
  69. The speaker says...
      How much traffic and latency occurs depends on the actual SQL query and the partitioning scheme. By default a table is partitioned into 3840 virtual fragments (think vBuckets) using its primary key. The partitioning can and should be tuned.
      Try to find partitioning keys that make your common, expensive or time-critical queries run on a single node. Assume you have a list of cities. City names are not unique, thus you have introduced a numeric primary key. It is likely that your most common query checks for the city name, not for the numeric primary key only. Therefore, your partitioning should be based on the city name as well.
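Why the key choice matters can be illustrated with a toy routing function. This is not NDB's actual hash, just a stable stand-in: a lookup can be routed to a single node only when the WHERE clause contains the whole partitioning key.

```python
import zlib

NODES = 4  # illustrative: one partition owner per data node

def partition_for(*key_values):
    """Toy partition routing: stable hash of the partitioning key, mod nodes."""
    data = "|".join(str(v) for v in key_values).encode()
    return zlib.crc32(data) % NODES

# Partitioning on (city_name): "WHERE city_name = 'Kiel'" hits one node.
node = partition_for("Kiel")
print(0 <= node < NODES)   # True: exactly one node needs to be asked

# Partitioning on (id) while filtering on city_name: the partition cannot
# be derived from the predicate, so every node must be scanned.
nodes_to_ask = set(range(NODES))
print(len(nodes_to_ask))   # 4
```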
  70. The ultimate Key-Value-Store?
      [diagram: the four data nodes as before; the queries below each hit a single node]

      CREATE TABLE cities (
        id INT NOT NULL,
        city_name VARCHAR(100),
        PRIMARY KEY (id)
      );

      SELECT * FROM cities WHERE id = 1;
      SELECT * FROM cities WHERE id = 100;
  71. The speaker says...
      I may have stated it before: if there is any product at MySQL that can compete with NoSQL (as in key-value store) on the issue of distributed data stores, it is MySQL Cluster.
      An optimal query load for MySQL Cluster is one that primarily performs lookups on partition keys. Each query will execute on one node only. There is little traffic within the cluster – little network overhead. The workload is perfectly parallelized.
      Will your unmodified PHP application perform on Cluster?
  72. 72. Joins: 24...70x faster
SELECT t1.a, t2.b FROM t1, t2 WHERE t1.pk = 1 AND t1.a = t2.pk
Then:
NDB_API> read a from table t1 where pk = 1
[round trip]
(a = 15)
NDB_API> read b from table t2 where pk = 15
[round trip]
(b = 30)
[return a = 15, b = 30]
Now:
NDB_API> read @a=a from table t1 where pk = 1;
         read b from table t2 where pk = @a
[round trip]
  73. 73. The speaker says...
In 7.2 we claim certain joins execute 24...70x faster with the help of AQL (condition push-down)! How come?
Partial replication does not go together well with joins. Take this simple nested join as an example. There are two tables to join. The join condition of the second table depends on the values of the first table. Thus, t1 has to be searched before t2 can be searched and the result can be returned to the user. That makes two operations and two round trips.
As of 7.2, there is a new batched way of doing it. It saves round trips. Some round trips avoided means, at the extreme, 24...70x faster: the network is your enemy #1.
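The round-trip arithmetic behind the slide can be simulated. The sketch below is a model of the idea only (the function names and the one-round-trip-per-request cost are assumptions, not the real NDB API): a dependent read costs one round trip per step, while a linked, batched read ships the whole chain in a single round trip.

```python
# Model: count cluster round trips for
#   SELECT t1.a, t2.b FROM t1, t2 WHERE t1.pk = 1 AND t1.a = t2.pk

t1 = {1: {"a": 15}}   # pk -> row
t2 = {15: {"b": 30}}  # pk -> row

round_trips = 0

def read(table, pk):
    """Unbatched primary-key read: one round trip per call."""
    global round_trips
    round_trips += 1
    return table[pk]

# Old style: the second read must wait for the first result.
round_trips = 0
a = read(t1, 1)["a"]
b = read(t2, a)["b"]
naive = round_trips   # two dependent reads -> two round trips

def read_linked(pk):
    """Batched 'linked' read: the dependent lookup is resolved
    inside the cluster, so the chain costs one round trip."""
    global round_trips
    round_trips += 1
    row1 = t1[pk]
    return row1["a"], t2[row1["a"]]["b"]

# New style (7.2, AQL-like): same answer, one round trip.
round_trips = 0
a2, b2 = read_linked(1)
batched = round_trips

print(naive, batched)  # 2 1
```

Halving the round trips on a two-table join is the smallest case; deeper join chains over slow links are where the quoted 24...70x extremes come from.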
  74. 74. Benchmark pitfall: connections
Many, many clients
Load Balancer
MySQL (NDB Storage Engine) | MySQL (NDB Storage Engine)
NDB Data Node 1, NDB Data Node 2, NDB Data Node 3, NDB Data Node 4
  75. 75. The speaker says...
If you ever come to the point of evaluating MySQL Cluster, make sure you configure the MySQL-to-Cluster connections appropriately (ndb_cluster_connection_pool).
A MySQL Server with only one connection (the default setting) from itself to the cluster may not be able to serve many concurrent clients at the rate the Cluster part itself might be able to handle them. The single connection may impose an artificial limitation on the cluster throughput.
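As a minimal sketch, the option goes into the configuration of each MySQL Server (API node). The pool size of 4 below is an example value for illustration, not a recommendation; tune it against your core count and client concurrency.

```ini
# my.cnf of a MySQL Server attached to the cluster (example values)
[mysqld]
ndbcluster
# Default is 1; a pool lets concurrent clients fan out over
# several cluster connections instead of queueing on one.
ndb_cluster_connection_pool = 4
```

Note that, to my knowledge, each pooled connection occupies its own API slot in the cluster configuration, so the cluster side must provide enough free [api] sections to cover the pool.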
  76. 76. Adding nodes, rebalancing
NDB Data Node 1 (Partitions), NDB Data Node 2 (Partitions)
NDB Data Node 3 (Partitions), NDB Data Node 4 (Partitions)
NDB Data Node 5 (new, empty), NDB Data Node 6 (new, empty)
  77. 77. The speaker says...
Adding nodes, growing the capacity of your cluster in terms of size and computing power, is an online operation. At any time you can add nodes to your cluster.
New nodes do not immediately participate in operations. You have to tell the cluster what to do with them: use them for new tables, or use them to grow the capacity available to existing tables. When growing existing tables, data needs to be redistributed to the new nodes.
Rebalancing is an online operation: it does not block clients. The partitioning algorithm used by Cluster ensures that data is copied to the new nodes only; there is no traffic between the nodes currently holding fragments of the table to be rebalanced.
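The "copy to new nodes only" property can be illustrated with a linear-hashing-style scheme (a sketch of the general technique, not Cluster's actual algorithm): when the partition count doubles, each row either stays where it is or moves to a brand-new partition; existing partitions never exchange rows with each other.

```python
# Linear-hash style repartitioning sketch: doubling the partition
# count moves rows only to the newly added partitions.

def hash_val(key):
    # Toy deterministic hash (illustrative only).
    return sum(key.encode()) * 2654435761 % (2 ** 32)

def partition(key, num_partitions):
    # num_partitions must be a power of two for this scheme.
    return hash_val(key) % num_partitions

OLD, NEW = 4, 8  # grow from 4 to 8 partitions
rows = [f"row-{i}" for i in range(1000)]

moved = []
for row in rows:
    src = partition(row, OLD)
    dst = partition(row, NEW)
    if dst != src:
        moved.append((src, dst))

# A row's new partition is either its old one (h mod 8 == h mod 4)
# or old + OLD (the extra hash bit is set) -- never another old one.
assert all(dst == src + OLD for src, dst in moved)
print(f"{len(moved)} of {len(rows)} rows moved, all to new partitions")
```

Because `h mod 8` can only equal `h mod 4` or `h mod 4 + 4`, old partitions only ever ship data to new ones, which is why rebalancing causes no traffic among the nodes that already hold fragments.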
  78. 78. DIY - Summary
We shall...
• Code an eager update-anywhere cluster
• Prefer a hybrid design so as not to get too deep into MySQL
• Not fear the lack of text books on partial replication
• Read CPU vendor tuning guides like comics
• Like Sweden or Finland
Send your application to the MySQL Cluster team. Cluster is different. MySQL Cluster is perfect for web session storage. Whether your Drupal, WordPress, … runs faster is hard to tell; possibly not faster.
PS (marketing fluff): ask Sales for a show!
  79. 79. The speaker says...
By the end of this talk you should remember at least this:
● There are four kinds of replication solutions, based on a matrix asking "where can all transactions run" and "when are replicas synchronized"
● Clusters don't make everything faster: the network is your enemy. For read scale-out there are proven solutions.
● Write scale-out is only possible through partial replication (a small write quorum would impact read performance)
  80. 80. THE END
Contact:
  81. 81. The speaker says...
Thank you for your attendance!
Upcoming shows:
Talk&Show! (ask... :-)), YourPlace, any time
PHP Summit, Munich, December 2013