48. -F1: A Distributed SQL Database That Scales, Google
“Because the data is synchronously replicated
across multiple datacenters, and because
we’ve chosen widely distributed datacenters,
the commit latencies are relatively high (50-150
ms).”
49. -Kohavi and Longbotham 2007
“Every 100 ms increase in load time of
Amazon.com decreased sales by 1%.”
(~$120M of losses per 100 ms)
50. WANs Fail
“Average partition duration ranged from 6 minutes for
software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes;
95th percentile of 19.9 minutes and 3.7 days,
respectively).”
-The Network is Reliable
53. -F1: A Distributed SQL Database That Scales, Google
“We also have a lot of experience with eventual
consistency systems at Google. In all such
systems, we find developers spend a
significant fraction of their time building
extremely complex and error-prone
mechanisms to cope with eventual consistency
and handle data that may be out of date. We
think this is an unacceptable burden to place
on developers and that consistency problems
should be solved at the database level. ”
55. “A shared-data system can have at most
two of the three following properties:
Consistency, Availability, and tolerance to
network Partitions.”
-Dr. Eric Brewer
56. On Consistency
• ACID Consistency: Any transaction or operation
brings the database from one valid state to
another
• CAP Consistency: All nodes see the same data at
the same time (synchrony)
57. On Partition Tolerance
• The network will be allowed to lose arbitrarily many
messages sent from one node to another.
• Database systems, in order to be useful, must
communicate over the network
• Clients count: client-to-node links can partition too
58. There is no such thing as
a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
59. We Can Have Both*
(*Just not at the same time)
60. PNUTS
• Paper released by Yahoo! research in 2008
• Operations:
• Read-Any
• Read-Critical(Required-Version)*
• Read-Latest
• Write
• Test-and-set-write(Required-Version)
* Will fall back to CP operation
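A toy model can make the PNUTS read levels concrete. The class and method names below are my own sketch of the per-record mastership scheme, not the paper's actual API:

```python
# Toy model of PNUTS-style read levels over versioned replicas.
# The Replica/Record layout and names are illustrative, not the paper's API.

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

class Record:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.master = 0  # PNUTS orders all writes through a per-record master

    def write(self, value):
        m = self.replicas[self.master]
        m.version += 1
        m.value = value
        return m.version  # replication to the other replicas is asynchronous

    def replicate(self, i):
        m = self.replicas[self.master]
        self.replicas[i].version, self.replicas[i].value = m.version, m.value

    def read_any(self, i):
        """Cheapest: whatever this replica has, possibly stale."""
        return self.replicas[i].value

    def read_critical(self, i, required_version):
        """Serve locally if fresh enough, else fall back to the master (CP)."""
        r = self.replicas[i]
        if r.version >= required_version:
            return r.value
        return self.replicas[self.master].value

    def read_latest(self):
        return self.replicas[self.master].value
```

A reader holding the version returned by its own write can use Read-Critical to guarantee it never sees data older than that write, without always paying the cost of Read-Latest.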
63. “This is a specific form of weak
consistency; the storage system
guarantees that if no new
updates are made to the object,
eventually all accesses will
return the last updated value.”
Definition of “Eventual Consistency” from “Eventually
Consistent - Revisited” - Werner Vogels
109. Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed
systems
• Can determine whether events are concurrent, and
detect causality violations
• Preserves happens-before relationships
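A minimal sketch of the vector-clock operations described above, assuming dict-based clocks keyed by process ID:

```python
# Minimal vector clock: dicts mapping process ID -> event count.

def increment(clock, pid):
    """Advance this process's entry on a local event or message send."""
    c = dict(clock)
    c[pid] = c.get(pid, 0) + 1
    return c

def merge(a, b):
    """Entrywise max, taken on message receive."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def happened_before(a, b):
    """True iff a -> b: a <= b entrywise and a != b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a, b):
    """Neither clock happened before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

Two events whose clocks are incomparable (e.g. `{"p": 1}` vs `{"q": 1}`) are concurrent; that is exactly the case a Lamport clock cannot detect.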
111. CRDTs
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures that stay writeable on both sides of a partition,
and replay after the partition heals
• Enable distributed computation across monotonic functions
• Two types:
• CvRDTs
• CmRDTs
112. CvRDTs
• State-based (value-based) CRDTs
• Minimal state
• Don’t require active garbage collection
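The canonical CvRDT is the grow-only counter; a minimal sketch, with replica IDs and the dict representation as illustrative choices:

```python
# Minimal state-based G-Counter (a CvRDT): one entry per replica ID.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> local increment count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Join: entrywise max. Commutative, associative, and idempotent,
        so replicas converge regardless of how states are gossiped."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Two replicas can increment independently during a partition; merging their states in either order (or repeatedly) yields the same total.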
116. CRDTs in the Wild
• Sets
• Observe-remove set
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
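The observe-remove set from the list above can be sketched as follows; tagging each add with a fresh unique ID is the standard trick, and the field names are mine:

```python
import uuid

class ORSet:
    """Toy observed-remove set: each add carries a unique tag; a remove
    tombstones only the tags observed locally, so a concurrent add on the
    other side of a partition survives the merge (adds win)."""
    def __init__(self):
        self.adds = set()        # (element, tag) pairs
        self.tombstones = set()  # removed (element, tag) pairs

    def add(self, elem):
        self.adds.add((elem, uuid.uuid4().hex))

    def remove(self, elem):
        self.tombstones |= {(e, t) for (e, t) in self.adds if e == elem}

    def contains(self, elem):
        return any(e == elem for (e, t) in self.adds - self.tombstones)

    def merge(self, other):
        self.adds |= other.adds
        self.tombstones |= other.tombstones
```

If one replica removes an element while another concurrently re-adds it, the re-add's fresh tag is not in any tombstone set, so the element is present after merge.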
117. Data structures that are CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum counter
• Running average
• Operational Transform
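A Bloom filter illustrates why these probabilistic structures are convergent: its merge is a bitwise OR, which is commutative, associative, and idempotent. A minimal sketch, where the filter size and double-hashing scheme are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Tiny grow-only Bloom filter; merge is bitwise OR, so replicated
    filters converge no matter the order or repetition of merges."""
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = 0  # an int used as an m-bit array

    def _positions(self, item):
        # Double hashing: derive k positions from one SHA-256 digest.
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def merge(self, other):
        self.bits |= other.bits
```

As with any Bloom filter, membership answers can be false positives but never false negatives, and merging preserves that property.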
118. CRDTs
• Incredibly powerful primitive
• Not only useful for in-database manipulation, but
also for client-database interaction
• You can compose them, and build your own
• Garbage collection is tricky
137. Operations Requiring Weak Consistency vs. Strong Consistency

Invariant            Operation   AP / CP
Specify unique ID    Any         CP
Generate unique ID   Any         AP
>                    INCREMENT   AP
>                    DECREMENT   CP
<                    INCREMENT   CP
<                    DECREMENT   AP
Secondary Index      Any         AP
Materialized View    Any         AP
AUTO_INCREMENT       INSERT      CP
Linearizability      CAS         CP
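The increment/decrement rows can be sketched numerically. Under a lower-bound invariant like `balance > 0`, increments made on both sides of a partition always merge safely, while concurrent decrements can jointly violate the invariant even though each passed its local check; this toy merge-by-summing-deltas model is my own illustration:

```python
# Sketch: invariant "balance > 0", replicas apply deltas independently
# during a partition and merge by summing them afterwards.

def merge(base, deltas):
    """Combine a starting balance with deltas from all replicas."""
    return base + sum(deltas)

base = 10

# Two replicas increment independently: no coordination needed (AP),
# because increments can only move the balance further from the bound.
assert merge(base, [+5, +3]) > 0

# Two replicas each decrement 7; each local check (10 - 7 > 0) passes,
# but the merged outcome is -4, violating the invariant -- hence
# decrement under a ">" invariant needs coordination (CP).
assert not merge(base, [-7, -7]) > 0
```

The `<` rows are the mirror image: under an upper bound, it is the increments that require coordination.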
138. BASE not ACID
• Basically Available: There will be a response to
every request (failure or success)
• Soft State: Any two reads against the system
may yield different data (when measured over
time)
• Eventually Consistent: The system will
eventually become consistent once all failures
have healed and time goes to infinity
140. Technology Timeline
• 1996 - Log structured merge tree
• 2000 - CAP Theorem
• 2007 - Amazon Dynamo Paper
• 2011 - INRIA CRDT Technical Report
• 2014 - Riak DT map: a composable, convergent
replicated dictionary
141. Further Reading
• Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area
Storage with COPS
• PNUTS: Yahoo!’s Hosted Data Serving Platform
• F1: A Distributed SQL Database That Scales
• Spanner: Google's Globally-Distributed Database
• The Network is Reliable: An informal survey of real-world communications
failures
• A Comprehensive Study of Convergent and Commutative Replicated Data
Types
• Riak DT Map: A Composable, Convergent Replicated Dictionary
142. Get in Touch
• If you’re interested in cheating the speed of light
• Come use our software
• If you’re interested in solving today’s computer science
problems
• Come work for us
• If you’d like to learn more about distributed systems at
scale
• Maybe you have a better idea