NoSQL databases were created to solve scalability problems with SQL databases. It turns out these problems are profoundly connected with Einstein's theory of relativity (no, honestly), and understanding this illuminates the SQL/NoSQL divide in surprising ways.
1. NoSQL and Relativity
Lars Marius Garshol, lars.marius.garshol@schibsted.com
2015-09-10, JavaZone 2015
http://twitter.com/larsga
2. Summary
• CAP theorem says we must choose between
availability and consistency
• To scale, NoSQL databases choose availability
• Einstein's theory of relativity shows clearly why this
is a trade-off
• Finally, we learn that the trade-off looks different
from what one might think
4. Consistency
[Diagram: a client writes x = 5 to one DB node and issues read(x) against each of the two nodes.]
Consistency = these two reads are guaranteed to return the same value
(which need not be 5)
5. Availability
[Diagram: a client issues write(x = 5) to one DB node. If the database accepts
the write, you have availability. The other node may be accepting other
writes, even to x, such as write(x = 217).]
6. The CAP theorem
• Consistency
• all nodes always give the same answer
• Availability
• all nodes always answer queries and accept updates
• Partition-tolerance
• database continues working, even if some nodes
disappear
The theorem: choose any two!
And you can't drop partition-tolerance
7. CAP history
• First formulated by Eric Brewer in 2000
• described the SQL/NoSQL divide very well
• Formalized and proven in 2002
• by Seth Gilbert and Nancy Lynch
• Today CAP is better understood
• widely considered a key tradeoff in distributed systems
• theoretical justification for NoSQL databases
8. What defines NoSQL?
• SQL is not the query language
• usually something more primitive
• sometimes just key/value lookup
• Sacrifices consistency for availability
• which is what this talk is about
• Schemaless
• that is, data need not conform to a predefined schema
9. Why NoSQL?
• Big internet sites couldn't scale relational databases
• consistency requires communication between all nodes,
doesn't scale to high numbers of nodes
• long downtimes during schema changes
• Therefore, switched to NoSQL databases
• schemalessness means no schema changes
• sacrificing consistency means higher performance
• Downside is complexity moves out of database and
into the application
10. Achieving availability
[Diagram: a client issues write(x = 5) to one DB node and receives OK, while
the other node accepts write(x = 217).]
It's not acceptable to have the two nodes forever disagreeing on the value
of x. One solution is eventual consistency.
11. Eventual consistency
• Promises that
• if no further writes are made,
• eventually all nodes will be consistent
• A very weak guarantee
• when is "eventually"?
• what happens before "eventually"?
• Stronger guarantees are sometimes made
• for example, by quantifying actual behaviour in practice
12. Implementing eventual
consistency
• Nodes must inform all other nodes of writes
• when receiving and applying a write, it must be passed on
to all other nodes,
• signal OK to writer before complete agreement is reached
• Requires a conflict resolution mechanism
• all nodes must agree on the resolution
• common solution: clock value in write, let write with highest
value win
• clocks need not be in sync
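The clock-value scheme above can be sketched as last-write-wins resolution. A minimal, hypothetical sketch (real systems add vector clocks or other tie-breakers; names are illustrative):

```python
# Last-write-wins: every write carries a clock value, and each node keeps
# the write with the highest (clock, value) pair it has seen. Comparing the
# pair, not just the clock, makes ties resolve identically on all nodes.

class Node:
    def __init__(self):
        self.store = {}  # key -> (clock, value)

    def apply(self, key, clock, value):
        incoming = (clock, value)
        if key not in self.store or incoming > self.store[key]:
            self.store[key] = incoming

    def read(self, key):
        return self.store[key][1]

# Two nodes receive the same writes in opposite order...
a, b = Node(), Node()
a.apply("x", 1, 5); a.apply("x", 2, 217)
b.apply("x", 2, 217); b.apply("x", 1, 5)
# ...yet both converge on the write with the highest clock.
assert a.read("x") == b.read("x") == 217
```

Note the clocks need not be in sync, as the slide says: agreement only requires that all nodes apply the same deterministic comparison.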
13. When eventually is too late
[Diagram: two DB nodes, two clients, account X with balance 100.
Client 1 (node 1): read account X balance -> 100, then set account X balance = 0.
Client 2 (node 2): read account X balance -> 100, then set account X balance = 0.
Happy customer walks away, richer by 200. Nodes eventually agree balance is 0.]
14. The key problem
• The ordering of events affects the outcome
• that is, what application logic chooses to do is affected by
what the database says
• what the database says depends on the ordering of events
• The different nodes do not observe the same order of
events
• This can be solved
• at the cost of a communication delay
• which is key to the consistency/availability tradeoff
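The point that ordering affects the outcome can be shown in a few lines, assuming plain overwrite semantics:

```python
# Two nodes receive the same two writes, but in different orders.
# Because plain "set" writes are not commutative, the nodes diverge.

def apply_writes(writes):
    state = {}
    for key, value in writes:
        state[key] = value  # last write to each key overwrites
    return state

node1 = apply_writes([("x", 5), ("x", 217)])
node2 = apply_writes([("x", 217), ("x", 5)])
assert node1 != node2  # node1 ends with x=217, node2 with x=5
```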
16. Time, Clocks and the Ordering of Events in a Distributed System
"The origin of this paper was a note titled The Maintenance of Duplicate
Databases by Paul Johnson and Bob Thomas. I believe their note introduced the
idea of using message timestamps in a distributed algorithm. I happen to have
a solid, visceral understanding of special relativity (see [5]). This enabled
me to grasp immediately the essence of what they were trying to do. ... I
realized that the essence of Johnson and Thomas's algorithm was the use of
timestamps to provide a total ordering of events that was consistent with the
causal order. ... It didn't take me long to realize that an algorithm for
totally ordering events could be used to implement any distributed system."
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
17. Order theory
• An ordering relation is any relation ≤ such that
• a ≤ a (reflexivity)
• if a ≤ b and b ≤ a then a = b (antisymmetry)
• if a ≤ b and b ≤ c then a ≤ c (transitivity)
• A total order is an order such that
• a ≤ b or b ≤ a (totality)
• A partial order is an order that is not necessarily total
• that is, there may be pairs a and b with neither a ≤ b nor b ≤ a
18. Examples of partial order
• Ordering sets by the subset relation
• for some sets neither a ≤ b nor b ≤ a
• Ordering values of type 'duration' from XML
Schema
• “In general, the ·order-relation· on duration is a partial order
since there is no determinate relationship between certain
durations such as one month (P1M) and 30 days (P30D)”
• 1 day ≤ 2 days
• but 1 month and 30 days?
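Python's built-in sets illustrate the subset example directly, since `<=` on sets is the subset test:

```python
# The subset relation is a partial order: some pairs of sets are
# incomparable in either direction.

a = {1, 2}
b = {2, 3}
assert not (a <= b) and not (b <= a)  # incomparable pair

# Numbers, by contrast, are totally ordered: one direction always holds.
x, y = 1, 30
assert (x <= y) or (y <= x)
```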
20. History of physics
• 1687
• Isaac Newton publishes Philosophiæ Naturalis Principia
Mathematica
• physics begins
• no changes over next two centuries
• 1905
• Albert Einstein publishes special relativity
• abandons notions of fixed time and space
• 1916
• Einstein’s general relativity
• takes into account gravity
• no changes since
22. The barn is too small
• Three people (M, F, and B) own a board (5m wide) and a barn
(4m wide)
• The board doesn’t fit inside the barn!
• What to do?
24. We have a solution!
As seen by F & B: the board is 4m long (relativistic shortening), and the
barn continues to be 4m wide.
When the board is exactly inside the barn, F and B will close their doors
simultaneously, and the problem will be solved.
(As seen by M: the board is 5m long (at rest relative to him), and the barn
is shortened to 3.2m wide. Pay no attention to this.)
25. What they observe
F and B:
• When the board is just inside, both close their doors simultaneously!
• Right after, the board crashes through the back door
M:
• B shuts his door just as the front of the board reaches him!
• 0.6 seconds later, F closes his door!
• Board crashes through the back door
26. The key point
• They don’t agree on the order of events!
• Change the story slightly and the three people could
have three different orders of events
• This is not a paradox
• it is in fact how the universe works
• the ordering of events in the universe is a partial order
• What then of causality?
• if A causes B, but some people think B happened
before A, then what?
27. Resolution
• A, B, and C are events
• The cone is the “light cone” from A
• that is, the spread of light from A
• C is outside the cone
• therefore A cannot influence C
• observers may disagree on order of A&C
• B is inside the cone
• therefore A can influence B
• observers may not disagree on order
30. How to order events?
• A total ordering of events is impossible
• unless a communications delay is introduced
• this is just part of how the universe works
• If all nodes are inside the light cone of an event they
can agree on the order
• this is where the delay comes from
• so time taken by light to traverse physical distance is the
ultimate limit
• in practice, the effective limit is higher, due to hardware and
design constraints
31. One solution: Paxos
• Created by Leslie Lamport from that original insight
• Can be used to introduce a logical clock
• basically a counter all nodes agree on
• This, in turn, can be used to create a total order for events
• As you can see, the cost is the communications delay
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#paxos-simple
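The "logical clock" in its simplest form is Lamport's clock from the paper above. A sketch of the clock itself (not of Paxos, which builds consensus on top of such ideas):

```python
# A Lamport logical clock: a per-node counter incremented on local events
# and advanced past any timestamp seen on incoming messages. The resulting
# timestamps give a total order consistent with the causal
# ("happened-before") order.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):               # local event (e.g. sending a message)
        self.time += 1
        return self.time

    def receive(self, msg_time):  # message arrives carrying sender's time
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a's clock: 1
t_recv = b.receive(t_send)   # b jumps to 2: the effect is ordered after the cause
assert t_send < t_recv
```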
32. Or eventual consistency
• In this scenario we accept that what the database tells you might
be wrong
• can be handled by application logic
• or there may be separate business processes to handle errors
• For example, Amazon might have to compensate disappointed
customers with a gift card once per 100,000 transactions
• benefit of staying in business outweighs cost of error
• This happens even in banking
• ATM that loses network access may still allow withdrawals up to a limit,
accepting the risk that accounts are overdrawn
• customers who overdraw will pay fees and interest anyway
33. How eventual is eventual
consistency?
• Two papers by Peter Bailis (2012 and 2014) give
formulas for computing the odds of a stale read
• Shows that usually you can get 99%+ odds of
consistency after short time window
• But ... 1M transactions/day, 99.99% odds, still
means 100 stale reads/day
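The slide's arithmetic, spelled out:

```python
# Even 99.99% odds of a consistent read leave a steady trickle
# of stale reads at scale.

transactions_per_day = 1_000_000
p_consistent = 0.9999
stale_per_day = transactions_per_day * (1 - p_consistent)
assert round(stale_per_day) == 100  # 100 stale reads per day
```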
34. Or CALM
• CALM = Consistency As Logical Monotonicity
• that is, facts used by clients to make decisions never
change
• this preserves causality
• A database that never deletes or overwrites is
CALM
• an event log, such as a record of stock exchange trades,
timeseries data, or ...
• not suitable for all systems, though
35. A CALM example
• Client 1 reads A = 10
• Client 1 uses this to decide to write B = 5
• If Client 2 now reads B = 5, then they must also
read either A = 10 or a later value of A
• This preserves causal consistency
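A minimal sketch of such a monotonic, append-only store (class and method names are illustrative, not from any particular system):

```python
# CALM-friendly storage: facts are only ever appended, never overwritten
# or deleted, so anything a client has read stays true forever.

class EventLog:
    def __init__(self):
        self.events = []

    def append(self, fact):
        self.events.append(fact)

    def read(self, key):
        # Latest fact about key; earlier facts remain in the log.
        for fact in reversed(self.events):
            if fact[0] == key:
                return fact
        return None

log = EventLog()
log.append(("A", 10))   # Client 1 reads A = 10 ...
assert log.read("A") == ("A", 10)
log.append(("B", 5))    # ... and uses it to decide to write B = 5
# A client reading B = 5 can still see the A = 10 fact that caused it:
assert log.read("B") == ("B", 5)
assert ("A", 10) in log.events  # the causal fact is never lost
```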
36. Or ACID 2.0
• Not really ACID at all, so rather misleading
• Requires update operations to have these properties:
• associativity a + (b + c) = (a + b) + c
• commutativity a + b = b + a
• idempotence f(x) = f(f(x))
• distributed
• Usual approach is to use datatypes which guarantee
this
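Set union is a ready-made example of an operation with these properties:

```python
# Set union is associative, commutative, and idempotent, so replicas can
# merge states in any order, any number of times, and still converge.

a, b, c = {1}, {2}, {3}
assert (a | b) | c == a | (b | c)   # associativity
assert a | b == b | a               # commutativity
assert (a | b) | b == a | b         # idempotence
```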
37. CRDTs give ACID 2.0
• CRDT = Commutative Replicated Data Types
• also, "Conflict-free Replicated Data Types"
• Datatypes designed so that the order of operations
doesn't affect the outcome
• stronger than eventual consistency because writes don't
conflict
• requires "odd" datatypes, however
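A G-Counter (grow-only counter) is perhaps the simplest CRDT. A hypothetical sketch, not taken from any particular library:

```python
# G-Counter: each node increments only its own slot; merge takes the
# per-slot maximum. Merge is associative, commutative, and idempotent,
# so replicas converge regardless of message order or duplication.

class GCounter:
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.slots = [0] * n_nodes

    def increment(self, amount=1):
        self.slots[self.node_id] += amount

    def merge(self, other):
        self.slots = [max(x, y) for x, y in zip(self.slots, other.slots)]

    def value(self):
        return sum(self.slots)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(3)
b.increment(4)
a.merge(b); b.merge(a)       # exchange state in either direction
assert a.value() == b.value() == 7
```

This is the "odd datatype" cost the slide mentions: you get convergence, but only for operations that fit the merge discipline (here, counting up).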
38. The ATM example
• The problem in the ATM example is the writes
• read(X), client does X = X - 100, write(X)
• the time window in between allows for conflict
• What if the operation were "increment(X, -100)" instead?
• this is associative and commutative (but not necessarily
idempotent)
• In this case the "if X >= 100" check in the application logic could still be fooled
• however, the customer's balance would be "-100"
• so information would not be lost
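The contrast can be replayed in a few lines, assuming simple in-memory state:

```python
# Read-modify-write: two concurrent clients both read 100, then both
# write their own result, and one withdrawal vanishes.
balance = 100
r1 = balance                 # client 1 reads 100
r2 = balance                 # client 2 reads 100 concurrently
balance = r1 - 100           # client 1 writes 0
balance = r2 - 100           # client 2 overwrites with 0
assert balance == 0          # only one withdrawal is recorded

# Operation-based: increments commute, so both withdrawals survive
# no matter how the operations interleave.
balance = 100
ops = [-100, -100]
for op in reversed(ops):     # any order gives the same result
    balance += op
assert balance == -100       # overdrawn, but no information is lost
```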
39. An example of a real CRDT
https://github.com/aphyr/meangirls
40. Thus far, everyone agrees
but "solutions" are half-solutions, and pretty awkward
41. “To go wildly faster, one must
remove all four sources of the
overhead discussed above.
This is possible in either a SQL
context or some other context.”
44. The AdWords experience
This backend was originally based on a MySQL database that was manually
sharded many ways. The uncompressed dataset is tens of terabytes, which is
small compared to many NoSQL instances, but was large enough to cause
difficulties with sharded MySQL. The MySQL sharding scheme assigned each
customer and all related data to a fixed shard. This layout enabled the use of
indexes and complex query processing on a per-customer basis, but required
some knowledge of the sharding in application business logic.
Resharding this revenue-critical database as it grew in the number of
customers and their data was extremely costly. The last resharding took over
two years of intense effort, and involved coordination and testing across
dozens of teams to minimize risk.
45. AdWords requirements
We store financial data and have hard
requirements on data integrity and consistency.
We also have a lot of experience with eventual
consistency systems at Google. In all such systems,
we find developers spend a significant fraction of
their time building extremely complex and error-
prone mechanisms to cope with eventual
consistency and handle data that may be out of
date. We think this is an unacceptable burden to
place on developers and that consistency problems
should be solved at the database level.
46. More experience
At least 300 applications within Google use Megastore (despite its
relatively low performance) because its data model is simpler to
manage than Bigtable’s, and because of its support for
synchronous replication across datacenters. (Bigtable only
supports eventually-consistent replication across data-centers.)
Examples of well-known Google applications that use Megastore are
Gmail, Picasa, Calendar, Android Market, and AppEngine.
47. Requirements
• Scalability
• scale simply by adding hardware
• no manual sharding
• Availability
• no downtime, for any reason
• Consistency
• strong database consistency
• Usability
• full SQL with indexes
Uh, didn’t we just learn
that this is impossible?
48. Spanner
• Globally distributed semi-relational database
• SQL as query language
• versioned data with non-locking read-only transactions
• Externally consistent reads/writes
• Atomic schema updates
• even while transactions are running
• Basic availability
• experiment showed killing 25 out of 125 servers reduced
throughput, but had no other effect
52. TrueTime
• Time API with uncertainty (ε)
• use atomic clock and GPS masters to reduce ε
• ε usually around 4 milliseconds
• TT.now() = [earliest, latest] = [now() - ε, now() + ε]
• TT.after(t) = t < TT.now().earliest (t has definitely passed)
• TT.before(t) = t > TT.now().latest (t has definitely not arrived)
[Diagram: the interval [earliest, latest] around now(), extending ε on each side.]
54. Reads
• Non-locking
• System assigns a read timestamp t
• t = TT.now().latest
• (in reality somewhat smarter)
• Replicas maintain a value tsafe
• the timestamp by which the replica is 100% up to date
• Replica can reply to read as long as t ≤ tsafe
• may require waiting for tsafe to progress
55. Linearizability
• If commit(T1) < start(T2), then ts(T1) < ts(T2)
• In addition, transactions use pessimistic locking
• This guarantees
• causal consistency
• external consistency
• linearizability
[Diagram: timeline showing commit(T1) happening before start(T2).]
56. Writes
• Commit timestamp is set to t ≥ TT.now().latest
• Data remains invisible until TT.after(t)
• that means commit wait ≥ 2ε
• After commit wait, apply change and release locks
• Paxos is used to handle locking and ordering
• this requires a write quorum of a majority of the nodes
• as a result, Spanner is CP, not AP
57. F1 - layer above Spanner
• Builds on Spanner, adds
• distributed SQL queries
• including joins from external sources
• transactionally consistent indexes
• asynchronous schema changes
• optimistic transactions
• automatic change history
58. Why versioned data?
“Many database users build mechanisms to log changes, either from application code or using
database features like triggers. In the MySQL system that AdWords used before F1, our Java
application libraries added change history records into all transactions. This was nice, but it was
inefficient and never 100% reliable. Some classes of changes would not get history records,
including changes written from Python scripts and manual SQL data changes.”
Application code is not enough to enforce business rules,
because many important changes are made behind the
application code. For example, data conversion.
Look at any database that’s a few years old, and you’ll
find data disallowed by the application code, but allowed
by the schema.
60. Two interfaces
• NoSQL interface
• basically a simple key->row lookup
• simpler in code for object lookup
• faster because no SQL parsing
• Full SQL interface
• good for analytics and more complex interactions
61. Status
• >100 terabytes of uncompressed data
• distributed across 5 data centers
• Five nines (99.999%) uptime
• Serves up to hundreds of thousands of requests/second
• SQL queries scan trillions of rows/day
• No observable increase of latency compared to MySQL-
based backend
• but change tracking and sharding now invisible to application
63. Conclusion
• NoSQL is mostly about high availability & eventual
consistency
• to some degree also schemalessness
• NoSQL is eventually consistent because of CAP
• The CAP Theorem is a consequence of the theory of relativity
• New systems seem to indicate that consistency may scale,
after all
• basically, the speed of light is greater than we thought
• basic availability is enough if you have enough nodes
64. Further reading
• NoSQL eMag, InfoQ, pilot issue May 2013
• http://www.infoq.com/minibooks/emag-NoSQL
• Brewer’s original presentation
• http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Proof by Lynch & Gilbert
• http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
• Why E=mc2, Cox & Forshaw
• Eventual Consistency Today: Limitations, Extensions, and Beyond,
ACM Queue
• http://queue.acm.org/detail.cfm?id=2462076
65. Even further reading
• Bailis papers
• http://www.bailis.org/papers/pbs-vldb2012.pdf
• http://www.bailis.org/papers/pbs-vldbj2014.pdf
• Spanner paper
• http://research.google.com/archive/spanner.html
• F1 papers
• http://research.google.com/pubs/pub38125.html
• http://research.google.com/pubs/pub41376.html
slideshare.net/larsga