Everything you always wanted to know about distributed databases, at Devoxx London, by Javier Ramirez, teowaki.
Basic concepts of distributed systems, such as consensus, gossip and infection protocols, vector clocks, and sharded storage, so you can build highly available distributed systems.
2. @supercoco9 #distributed-devoxx
Everything you always wanted to know about
highly available distributed databases
• Javier Ramirez:
–20 years in web development (C/Java/Ruby/Python)
–6 years in NoSQL (Redis, Mongo, Neo4j)
–4 years in Cloud (AWS, GCP)
–3 years in Big Data (BigQuery, Spark, Apache Beam/Dataflow)
–Google Developer Expert and Authorised trainer on the Google Cloud Platform
My projects:
• https://teowaki.com
• https://aprendoaprogramar.com
10. Some data center outages reported in 2015:
* Amazon Web Services
* Apple iCloud
* Microsoft Azure
* IBM Softlayer
* Google Cloud Platform
* And of course every hosting provider with scheduled maintenance operations (Rackspace, DigitalOcean, OVH...)
16. A main server sends a binary log of changes to one or more replicas
* This log is also known as the Write-Ahead Log, or WAL (see the sketch below)
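A minimal sketch of this log-shipping idea in Python; the class and method names are illustrative, not any real database's API:

```python
# Minimal sketch of master-slave log shipping (illustrative names).
class Primary:
    def __init__(self):
        self.log = []          # the binary log / write-ahead log (WAL)
        self.data = {}
        self.replicas = []

    def write(self, key, value):
        entry = (len(self.log), key, value)   # (sequence number, key, value)
        self.log.append(entry)                # append to the WAL first
        self.data[key] = value                # then apply locally
        for replica in self.replicas:         # ship the entry to each replica
            replica.apply(entry)

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_upto = -1

    def apply(self, entry):
        seq, key, value = entry
        if seq == self.applied_upto + 1:      # apply entries strictly in order
            self.data[key] = value
            self.applied_upto = seq

primary = Primary()
primary.replicas.append(Replica())
primary.write("user:1", "javier")
```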
17. Master-slave is good, but:
* All the operations are replicated on all slaves
* Good scalability on reads, but not on writes
* Cannot function during a network partition
* The master is a single point of failure (SPOF)
19. Every server can accept reads or writes, and sends its binary log to all the other servers
* Also referred to as update-anywhere
20. Multi-master is great, but:
* All the operations are replicated on all masters
* When synchronous, high latency (consistency achieved via locks, coordination and serializable transactions)
* When asynchronous, typically poor conflict resolution
* Hard to scale up or down automatically
21. The system I want:
* Always ON, even with network partitions
* Scales out both reads and writes. Doesn't need to keep all the data on all the servers
* Runs on cheap, diverse commodity hardware
* Runs locally to my users (low latency)
* Grows/shrinks elastically and survives server failures
22. Then you need to let go of many convenient things you take for granted in databases
26. Distributed DB design decisions
* data (keys) distribution
* data replication/durability
* conflict resolution
* membership
* status of the other peers
* operation under partitions and during unavailability of peers
* incremental scalability
27. Data distribution
Consistent hashing based on the key.
This usually implies operations work on single keys. Some solutions, like Redis, allow clients to group related keys consistently. Some solutions, like BigTable, allow you to colocate data by group or column family.
Queries are frequently limited to lookups by key or by secondary indexes (say bye to the power of SQL). See the hash ring sketch below.
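A minimal Python sketch of a consistent hash ring, including the virtual nodes mentioned later in the notes; the class name and the choice of hash are illustrative:

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
class HashRing:
    def __init__(self, nodes, vnodes=8):
        self.ring = []                        # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):           # more vnodes = bigger key share
                h = self._hash(f"{node}#{i}")
                self.ring.append((h, node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        # the first vnode clockwise from the key's position owns the key
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # always the same node for the same key
```

Because only the vnodes of a joining or leaving node change owners, adding or removing a machine moves a small fraction of the keys instead of reshuffling everything.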
29. Data Replication
How many replicas of each key? Typically at least 3, so in case of conflict there can be a quorum.
Often, the distribution of keys takes into account the physical location of nodes, so replicas live in different racks or different datacentres.
30. Replication: durability
If we want a durable system, we need to make sure the data is replicated to at least 2 nodes before confirming the transaction to the client.
This is called the write quorum, and in many systems it can be configured per operation.
Not all data are equally important, and not all systems have the same R/W ratio.
Systems can be configured to be “always writable” or “always readable” (see the quorum sketch below).
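A minimal sketch of the quorum arithmetic behind these knobs, for N replicas with write quorum W and read quorum R; the function is illustrative:

```python
# Sketch of quorum arithmetic for N replicas, write quorum W, read quorum R.
def check_quorum(n, w, r):
    # W + R > N guarantees every read quorum overlaps every write quorum,
    # so a read contacts at least one replica holding the latest write.
    overlap = w + r > n
    print(f"N={n} W={w} R={r}: read/write overlap={'yes' if overlap else 'no'}")

check_quorum(3, 2, 2)   # classic quorum: durable writes, consistent reads
check_quorum(3, 1, 1)   # "always writable" and fast, but reads may be stale
check_quorum(3, 3, 1)   # fast reads, but writes block if any replica is down
```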
31. Conflicts
I see a record that I thought was deleted
I created a record but cannot see it
I have different values in two nodes
Something should be unique, but it's not
32. No-Conflict strategies
Quorum-based systems: Paxos, Raft.
Require coordination of processes, with continuous leader elections and consensus.
Worse latency.
Last Write Wins (LWW):
Doesn't require coordination.
Good latency (see the sketch below).
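A minimal sketch of Last-Write-Wins resolution, assuming every write carries a timestamp; the tuple layout is illustrative:

```python
# Minimal Last-Write-Wins merge: each write carries a timestamp.
# Note: with concurrent writes or skewed clocks, one value silently loses.
def lww_merge(a, b):
    # a and b are (timestamp, value) pairs for the same key on two replicas
    return a if a[0] >= b[0] else b

print(lww_merge((1700000002, "v2"), (1700000001, "v1")))  # (1700000002, 'v2')
```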
34. Vector clocks
* Don't need synchronised time
* There can be several versions of the same item
* Need consolidation to prune their size
* Usually the client needs to fix the conflict and update
(see the comparison sketch below)
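A minimal sketch of vector clocks as plain Python dicts mapping node name to counter; the function names are illustrative:

```python
# Sketch of vector clocks: one counter per node, bumped on local writes.
def bump(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= n for node, n in b.items())

def compare(a, b):
    if descends(a, b): return "a is newer or equal"
    if descends(b, a): return "b is newer"
    return "conflict: concurrent versions, the client must reconcile"

v1 = bump({}, "node-a")        # {'node-a': 1}
v2 = bump(v1, "node-b")        # {'node-a': 1, 'node-b': 1}
v3 = bump(v1, "node-c")        # a sibling of v2, written concurrently
print(compare(v2, v3))         # conflict: concurrent versions...
```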
36. Gossip
A centralised server is a SPOF.
Communicating state with every node is very time-consuming and doesn't tolerate partitions.
In gossip protocols, pairs of random nodes talk at regular, frequent intervals and exchange information. Based on that exchange, a new status is agreed (see the toy round below).
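A toy gossip round in Python, assuming each node keeps a heartbeat counter per peer (a heavily simplified version of real gossip):

```python
import random

# Toy gossip round: random pairs exchange and merge their view of the cluster.
# Each node's state maps peer name -> heartbeat counter (higher = fresher).
def gossip_round(states):
    for node in states:
        states[node][node] += 1                      # bump own heartbeat
        peer = random.choice([n for n in states if n != node])
        for name in set(states[node]) | set(states[peer]):
            freshest = max(states[node].get(name, 0), states[peer].get(name, 0))
            states[node][name] = states[peer][name] = freshest

nodes = {n: {m: 0 for m in ("a", "b", "c", "d")} for n in ("a", "b", "c", "d")}
for _ in range(5):
    gossip_round(nodes)
print(nodes["a"])   # node a's view converges on everyone's latest heartbeats
```

A peer whose heartbeat stops rising in everyone's view is eventually suspected dead, which is how gossip doubles as a failure detector.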
38. Incremental scalability
When a new node enters the system, the rest of the nodes notice via gossip. The node claims a partition of the ring and asks the replicas of that partition to send it their data.
When the rest of the nodes decide (after gossiping) that a node has left the system and it's not a temporary failure, the data assigned to that node's partitions is copied to more replicas to get back to N copies.
The whole process is automatic and transparent.
39. Operation under partition: Hinted Handoff
During a network partition, it can happen that fewer than W nodes of a segment are reachable. In that case, the data is still replicated to W nodes, even if a node wasn't responsible for that segment. The data is kept with a “hint” and stored in a special area.
Periodically, the server will try to contact the original destination and will “hand off” the data to it (see the sketch below).
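A minimal sketch of the hint-and-retry cycle; the Node class and its method names are illustrative, not any real system's API:

```python
# Sketch of hinted handoff: if an intended replica is down, another node
# stores the write plus a "hint" and retries delivery later.
class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data, self.hints = {}, []   # hints: (intended owner, key, value)

    def store(self, key, value, intended_for=None):
        if intended_for and intended_for.name != self.name:
            self.hints.append((intended_for, key, value))  # hint area
        else:
            self.data[key] = value

    def hand_off(self):
        remaining = []
        for owner, key, value in self.hints:
            if owner.up:
                owner.store(key, value)   # deliver to the original destination
            else:
                remaining.append((owner, key, value))
        self.hints = remaining

a, b = Node("a"), Node("b")
b.up = False                       # b is on the wrong side of a partition
a.store("k", "v", intended_for=b)  # a accepts the write on b's behalf
b.up = True                        # the partition heals
a.hand_off()
print(b.data)                      # {'k': 'v'}
```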
40. Anti Entropy
A system with handoffs alone can be chaotic and not very effective.
Anti-entropy is implemented to make sure hints are handed off or synchronized to other nodes.
Anti-entropy is usually achieved using Merkle trees, a hash-of-hashes structure that is very efficient for comparing differences between nodes (see the sketch below).
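A minimal sketch of the Merkle tree idea: hash each key/value pair, then hash pairs of hashes up to a single root; identical roots mean identical data, so replicas only walk down the branches whose hashes differ. All names here are illustrative:

```python
import hashlib

# Sketch of anti-entropy with a Merkle tree (hash of hashes).
def merkle_root(hashes):
    if len(hashes) == 1:
        return hashes[0]
    if len(hashes) % 2:                      # duplicate last hash if odd count
        hashes = hashes + [hashes[-1]]
    parents = [hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
               for i in range(0, len(hashes), 2)]
    return merkle_root(parents)

def tree_root(items):
    leaves = [hashlib.sha256(f"{k}={v}".encode()).digest()
              for k, v in sorted(items.items())]
    return merkle_root(leaves)

replica1 = {"k1": "v1", "k2": "v2"}
replica2 = {"k1": "v1", "k2": "stale"}
# One cheap root comparison replaces shipping the whole dataset over the wire.
print(tree_root(replica1) == tree_root(replica2))   # False -> drill down
```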
41. All these features mean your clients need to be aware of some internals of the system
42. Clients must
* Know which nearby nodes are responsible for each segment of the ring, and hash keys locally**
* Be aware of when nodes become available or unavailable**
* Decide on durability
* Handle conflict resolution, unless using LWW
** some solutions offer a load-balancer proxy to abstract the client from that complexity, at the cost of extra latency
43. Now you know how it works:
* A system that can always work, even with network partitions
* That scales out both reads and writes
* On cheap, diverse commodity hardware
* Running locally to your users (low latency)
* That can grow/shrink elastically and survive server failures
44. Extra level: Build your own distributed database
Netflix dynomite, built in Java
Uber ringpop, built in JavaScript
“A squirrel did take out half of our Santa Clara data centre two years back.”
Mike Christian, Yahoo Director of Engineering, at a conference in 2012
That's the reason why Google wraps its submarine fibre cables in Kevlar: so shark bites won't damage them.
Rackspace was taken down when a truck driver had an accident during a delivery to the data centre.
Hurricanes, truck drivers, sharks eating transoceanic cables, and of course electronic and mechanical failures, human errors, and malicious attacks.
Starbucks customers couldn't buy any coffee for a whole morning.
Tinder users temporarily lost their matches for a few hours.
Twilio did well.
Netflix had a few problems in the past, but now they are awesome.
Of course this doesn't give you high availability, but it at least prevents data loss to an extent (depending on your backup practices).
Frequently used not only in relational databases, but in every kind of distributed system. Redis, when configured as master-slave, works in a very similar way too.
So the more writes you have, the busier all of your servers will be.
When I say “write” I mean updates and deletions too
Recovery is not fully automatic and, at best, requires some extra coordination
OrientDB is quite good, so I include it among the distributed databases.
The more writes you have, the more load in the whole system
Also, the usual case is all the data lives on all the servers, and that simply doesn't scale
Netflix: several thousand Cassandra nodes.
Facebook: several tens of thousands of nodes for analytics.
Cheap hardware: it's important that the system tolerates heterogeneous machines, or else it's very difficult to support.
Forget about:
flexible queries and table designs where everything can be queried no matter what (even if slowly)
transactions
strong consistency
delegating all the complexity to the servers
Eventually consistent (Eric Brewer)
You know some of the names in relational, traditional, non-distributed databases:
MySQL
MariaDB
Oracle
PostgreSQL
SQL Server
IBM DB2
SQLite
SAP HANA
The Amazon Dynamo paper and the Google BigTable paper are behind many of the concepts of modern distributed databases, together with the work of Leslie Lamport, the creator of LaTeX and a member of Microsoft Research.
There is a new generation of systems based on the Google Spanner paper
Some systems allow you to define virtual nodes, so one physical node actually hosts several logical nodes. That's one way to allow heterogeneity in the system (the hash ring sketch above uses virtual nodes).
Parameters W and R can also be configured to LOCAL_QUORUM, so they need agreement only from local nodes and not across datacentres.
By combining global quorum for writes and local quorum for reads, Netflix gets 500 ms from the time it writes in one region until the data can be read from another, while keeping very fast reads.
usually due to load balancing, concurrency, or network partitions
Riak: CRDTs (conflict-free replicated data types); see the counter sketch below.
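A minimal sketch of one CRDT, the grow-only counter (G-Counter): each node increments only its own slot, and a merge takes the element-wise maximum, so replicas converge without coordination no matter the order in which updates arrive. Names are illustrative:

```python
# Minimal G-Counter CRDT sketch.
def bump(counter, node):
    counter = dict(counter)
    counter[node] = counter.get(node, 0) + 1   # each node owns one slot
    return counter

def merge(a, b):
    # element-wise max: commutative, associative and idempotent,
    # so replicas converge regardless of delivery order or duplicates
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

c1 = bump({}, "node-a")        # node-a counts an event
c2 = bump({}, "node-b")        # node-b counts one concurrently
print(value(merge(c1, c2)))    # 2, with no coordination needed
```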
Systems based on gossip for membership and liveness can be extended by adding extra monitoring information. This approach, for example, is used at CERN to monitor grids of thousands of nodes and track memory/CPU usage.
Amazon Dynamo uses gossip to spread ring distribution information, apart from using it to detect disconnected/failed/new nodes.
Adding more than one node at a time is tricky
Netflix performance: Chaos Monkeys, and 500 ms for recovery across regions.
Of course you can always read the source of any open-source solution, but it's easier to plug in a generic ring/membership component and extend it.