NoSQL and Relativity
Lars Marius Garshol, lars.marius.garshol@schibsted.com
2015-09-10, JavaZone 2015
http://twitter.com/larsga
Summary
• CAP theorem says we must choose between
availability and consistency
• To scale, NoSQL databases choose availability
• Einstein's theory of relativity shows clearly why this
is a trade-off
• Finally, we learn that the trade-off looks different
from what one might think
NoSQL and CAP
Consistency
[Diagram: Client, DB node, DB node; operations write(x = 5) and read(x)]
Consistency = these two reads are guaranteed to return the same value
(which need not be 5)
Availability
[Diagram: Client, DB node, DB node; operations write(x = 5), write(x = 5), write(x = 217)]
If the database accepts the write, you have availability
(the other node may be accepting other writes, even to x)
The CAP theorem
• Consistency
• all nodes always give the same answer
• Availability
• all nodes always answer queries and accept updates
• Partition-tolerance
• database continues working, even if some nodes
disappear
The theorem: choose any two!
And you can't drop partition-tolerance
CAP history
• First formulated by Eric Brewer in 2000
• described the SQL/NoSQL divide very well
• Formalized and proven in 2002
• by Seth Gilbert and Nancy Lynch
• Today CAP is better understood
• widely considered a key tradeoff in distributed systems
• theoretical justification for NoSQL databases
What defines NoSQL?
• SQL is not the query language
• usually something more primitive
• sometimes just key/value lookup
• Sacrifices consistency for availability
• which is what this talk is about
• Schemaless
• that is, data need not conform to a predefined schema
Why NoSQL?
• Big internet sites couldn't scale relational databases
• consistency requires communication between all nodes,
which doesn't scale to high numbers of nodes
• long downtimes during schema changes
• Therefore, switched to NoSQL databases
• schemalessness means no schema changes
• sacrificing consistency means higher performance
• Downside is that complexity moves out of the database and
into the application
Achieving availability
[Diagram: Client, DB node, DB node; operations write(x = 5), write(x = 5), OK, write(x = 217)]
It's not acceptable to have the two nodes
forever disagreeing on the value of x.
One solution is eventual consistency.
Eventual consistency
• Promises that
• if no further writes are made,
• eventually all nodes will be consistent
• A very weak guarantee
• when is "eventually"?
• what happens before "eventually"?
• Stronger guarantees are sometimes made
• for example, by quantifying actual behaviour in practice
Implementing eventual consistency
• Nodes must inform all other nodes of writes
• a node that receives and applies a write must pass it on
to all other nodes
• the node signals OK to the writer before complete agreement is reached
• Requires a conflict resolution mechanism
• all nodes must agree on the resolution
• common solution: clock value in write, let write with highest
value win
• clocks need not be in sync
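The "highest clock value wins" rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular database: the names `Versioned` and `merge` are hypothetical, and the tie-break on node id is an assumption added here so that every node resolves equal clock values the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A written value tagged with the writer's clock value and node id (hypothetical)."""
    value: int
    clock: int
    node_id: str


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the highest clock value; break ties by node id."""
    return max(a, b, key=lambda w: (w.clock, w.node_id))


# Two nodes accepted conflicting writes to x.
w1 = Versioned(value=5, clock=10, node_id="node1")
w2 = Versioned(value=217, clock=12, node_id="node2")

# Every node applies the same rule, so they converge whatever the arrival order.
resolved_on_node1 = merge(w1, w2)
resolved_on_node2 = merge(w2, w1)
```

Note that the losing write is silently discarded, which is why this resolution strategy is simple but can lose information.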
When eventually is too late
[Diagram: Client 1 and Client 2 talk to DB node 1 and DB node 2. Operations shown: read account X balance -> 100 (twice), set account X balance = 0 (three times).]
Happy customer walks away, richer by 200.
Nodes eventually agree balance is 0.
The key problem
• The ordering of events affects the outcome
• that is, what application logic chooses to do is affected by
what the database says
• what the database says depends on the ordering of events
• The different nodes do not observe the same order of
events
• This can be solved
• at the cost of a communication delay
• which is key to the consistency/availability tradeoff
Brief math digression
Time, Clocks and the Ordering of Events in a Distributed System
“The origin of this paper was a note titled The Maintenance of Duplicate Databases by Paul Johnson and Bob Thomas. I believe their note introduced the idea of using message timestamps in a distributed algorithm. I happen to have a solid, visceral understanding of special relativity (see [5]). This enabled me to grasp immediately the essence of what they were trying to do. ... I realized that the essence of Johnson and Thomas's algorithm was the use of timestamps to provide a total ordering of events that was consistent with the causal order. ...
It didn't take me long to realize that an algorithm for totally ordering events could be used to implement any distributed system.”
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
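The idea of timestamps that give a total ordering consistent with the causal order can be sketched as a Lamport-style logical clock. This is a minimal sketch: the class and method names are illustrative, and breaking ties between equal counters by node id is an assumption made here to turn the order into a total one.

```python
from typing import Tuple


class LamportClock:
    """Logical clock: a counter that advances on local events and on message receipt."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.time = 0

    def tick(self) -> Tuple[int, str]:
        """Local event or send: advance the counter and return a timestamp."""
        self.time += 1
        return (self.time, self.node_id)

    def receive(self, sender_time: int) -> Tuple[int, str]:
        """On receipt, jump past the sender's counter so the effect follows the cause."""
        self.time = max(self.time, sender_time) + 1
        return (self.time, self.node_id)


a, b = LamportClock("a"), LamportClock("b")
send_ts = a.tick()               # node a sends a message
recv_ts = b.receive(send_ts[0])  # node b receives it
other_ts = a.tick()              # an unrelated later event on node a

# Sorting (counter, node_id) pairs gives one total order that every node computes identically.
ordered = sorted([recv_ts, other_ts, send_ts])
```

The send is always ordered before the matching receive, so the total order never contradicts causality, although the order it picks for unrelated events is arbitrary.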
Order theory
• An ordering relation is any relation ≤ such that
• a ≤ a (reflexivity)
• if a ≤ b and b ≤ a then a = b (antisymmetry)
• if a ≤ b and b ≤ c then a ≤ c (transitivity)
• A total order is an order such that
• a ≤ b or b ≤ a (totality)
• A partial order is an order that need not be total
• that is, for some pairs a and b there may be neither a ≤ b nor b ≤ a
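The properties above can be checked mechanically on small finite sets. The sketch below is illustrative only (the function names are made up for this example); it tests the usual ≤ on integers and the subset relation on sets.

```python
from itertools import product


def is_order(elements, leq) -> bool:
    """Check reflexivity, antisymmetry and transitivity on a finite set."""
    reflexive = all(leq(a, a) for a in elements)
    antisymmetric = all(
        a == b or not (leq(a, b) and leq(b, a))
        for a, b in product(elements, repeat=2)
    )
    transitive = all(
        leq(a, c) or not (leq(a, b) and leq(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    return reflexive and antisymmetric and transitive


def is_total(elements, leq) -> bool:
    """Totality: every pair is comparable one way or the other."""
    return all(leq(a, b) or leq(b, a) for a, b in product(elements, repeat=2))


numbers = [1, 2, 3]
sets = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]

numbers_is_order = is_order(numbers, lambda a, b: a <= b)
numbers_is_total = is_total(numbers, lambda a, b: a <= b)
sets_is_order = is_order(sets, lambda a, b: a <= b)  # <= on frozensets means "subset of"
sets_is_total = is_total(sets, lambda a, b: a <= b)  # {1} and {2} are incomparable
```

Integers come out as a total order, while the subset relation satisfies the three order properties but fails totality.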
Examples of partial order
• Ordering sets by the subset relation
• for some sets neither a ≤ b nor b ≤ a
• Ordering values of type 'duration' from XML
Schema
• “In general, the ·order-relation· on duration is a partial order
since there is no determinate relationship between certain
durations such as one month (P1M) and 30 days (P30D)”
• 1 day ≤ 2 days
• but 1 month and 30 days?
Relativity
History of physics
• 1687
• Isaac Newton publishes Philosophiæ Naturalis Principia
Mathematica
• physics begins
• no changes over next two centuries
• 1905
• Albert Einstein publishes special relativity
• abandons notions of fixed time and space
• 1916
• Einstein’s general relativity
• takes into account gravity
• no changes since
...
(we skip 270 slides)
The barn is too small
• Three people (M, F, and B) own a board (5m long) and a barn
(4m wide)
• The board doesn’t fit inside the barn!
• What to do?
We have a solution!
As seen by F & B: board is 4m long (relativistic shortening), barn continues to be 4m wide.
When the board is exactly inside the barn, F and B will close their doors simultaneously, and the problem will be solved.
(As seen by M: board is 5m long (at rest relative to him), barn shortened to 3.2m wide. Pay no attention to this.)
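The lengths quoted above follow from the standard length-contraction formula. The slides do not state the speed; the value v = 0.6c below is inferred from a 5m board appearing 4m long, so treat it as an assumption of this worked example.

```latex
\gamma = \frac{1}{\sqrt{1 - v^2/c^2}}, \qquad L = \frac{L_0}{\gamma}
\\[4pt]
\frac{5\,\mathrm{m}}{\gamma} = 4\,\mathrm{m}
\;\Rightarrow\; \gamma = 1.25
\;\Rightarrow\; 1 - \frac{v^2}{c^2} = 0.64
\;\Rightarrow\; v = 0.6\,c
\\[4pt]
L_{\text{barn, as seen by M}} = \frac{4\,\mathrm{m}}{1.25} = 3.2\,\mathrm{m}
```

The same factor applies in both directions: each observer sees the other's object shortened, which is why F and B see a 4m board while M sees a 3.2m barn.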
What they observe
F and B:
• When the board is just inside, both close their doors simultaneously
• Right after, the board crashes through back door
M:
• B shuts his door just as the front of the board reaches him
• 0.6 seconds later, F closes his door
• Board crashes through back door
The key point
• They don’t agree on the order of events!
• Change the story slightly and the three people could
have three different orders of events
• This is not a paradox
• it is in fact how the universe works
• the ordering of events in the universe is a partial order
• What then of causality?
• if A causes B, but some people think B happened
before A, then what?
Resolution
• A, B, and C are events
• The cone is the “light cone” from A
• that is, the spread of light from A
• C is outside the cone
• therefore A cannot influence C
• observers may disagree on order of A&C
• B is inside the cone
• therefore A can influence B
• observers cannot disagree on the order of A&B
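The light-cone test above reduces to asking whether light has had time to cover the distance between the two events. A minimal sketch, measured in a single reference frame and with a hypothetical function name:

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def can_influence(dt_seconds: float, distance_m: float) -> bool:
    """True if an event can affect another that happens dt_seconds later, distance_m away.

    The second event lies inside (or on) the light cone of the first when
    light could have covered the distance in the time available.
    """
    return dt_seconds >= 0 and SPEED_OF_LIGHT * dt_seconds >= distance_m


# An event 1 ms later and 100 km away: light covers about 300 km in 1 ms, so it is inside the cone.
inside_cone = can_influence(1e-3, 100_000.0)
# An event 1 ms later and 1000 km away: outside the cone, so observers may disagree on the order.
outside_cone = can_influence(1e-3, 1_000_000.0)
```

For pairs outside the cone there is no observer-independent answer to "which came first", which is exactly the partial order described earlier.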
...
(we skip 532 slides)
Back to CAP
How to order events?
• A total ordering of events is impossible
• unless a communications delay is introduced
• this is just part of how the universe works
• If all nodes are inside the light cone of an event they
can agree on the order
• this is where the delay comes from
• so time taken by light to traverse physical distance is the
ultimate limit
• in practice, the effective limit is higher, due to hardware and
design constraints
One solution: Paxos
• Created by Leslie Lamport from that original insight
• Can be used to introduce a logical clock
• basically a counter all nodes agree on
• This, in turn, can be used to create a total order for events
• As you can see, the cost is the communications delay
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#paxos-simple
Or eventual consistency
• In this scenario we accept that what the database tells you might
be wrong
• can be handled by application logic
• or there may be separate business processes to handle errors
• For example, Amazon might have to compensate disappointed
customers with a gift card once per 100,000 transactions
• benefit of staying in business outweighs cost of error
• This happens even in banking
• ATM that loses network access may still allow withdrawals up to a limit,
accepting the risk of overdrafts
• customers who overdraw will pay fees and interest, anyway
How eventual is eventual consistency?
• Two papers by Peter Bailis (2012 and 2014) give
formulas for computing the odds of a stale read
• They show that you can usually get 99%+ odds of
consistency after a short time window
• But ... 1M transactions/day, 99.99% odds, still
means 100 stale reads/day
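The arithmetic behind that figure is simply volume times the probability of a stale read. A small sketch, with a hypothetical helper name:

```python
def expected_stale_reads(transactions_per_day: int, consistency_odds: float) -> float:
    """Expected number of stale reads per day at the given odds of a consistent read."""
    return transactions_per_day * (1.0 - consistency_odds)


# 1M transactions/day at 99.99% odds of consistency: about 100 stale reads per day.
stale_per_day = expected_stale_reads(1_000_000, 0.9999)
```

Adding another nine to the odds only divides the count by ten, so at high enough volume some stale reads are always to be expected.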
Or CALM
• CALM = Consistency As Logical Monotonicity
• that is, facts used by clients to make decisions never
change
• this preserves causality
• A database that never deletes or overwrites is
CALM
• an event log, such as a record of stock exchange trades,
timeseries data, or ...
• not suitable for all systems, though
A CALM example
• Client 1 reads A = 10
• Client 1 uses this to decide to write B = 5
• If Client 2 now reads B = 5, then they must also
read either A = 10 or a later value of A
• This preserves causal consistency
Or ACID 2.0
• Not really ACID at all, so rather misleading
• Requires update operations to have these properties:
• associativity a + (b + c) = (a + b) + c
• commutativity a + b = b + a
• idempotence f(x) = f(f(x))
• distributed
• Usual approach is to use datatypes which guarantee
this
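These properties can be tested on sample values. In the sketch below (helper names are illustrative), an update is modelled as `op(state, v)`, so idempotence means that applying the same update twice equals applying it once. `max` passes all three checks, while integer addition is associative and commutative but not idempotent.

```python
from itertools import product


def is_associative(op, samples) -> bool:
    return all(
        op(a, op(b, c)) == op(op(a, b), c)
        for a, b, c in product(samples, repeat=3)
    )


def is_commutative(op, samples) -> bool:
    return all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))


def is_idempotent(op, samples) -> bool:
    """Applying the same update twice equals applying it once: f(f(x)) == f(x)."""
    return all(op(op(x, v), v) == op(x, v) for x, v in product(samples, repeat=2))


def add(a, b):
    return a + b


samples = [0, 1, 5, -3]
max_props = (is_associative(max, samples), is_commutative(max, samples), is_idempotent(max, samples))
add_props = (is_associative(add, samples), is_commutative(add, samples), is_idempotent(add, samples))
```

A check on a handful of samples is evidence, not a proof, but it is a cheap way to catch an update operation that cannot be replayed or reordered safely.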
CRDTs give ACID 2.0
• CRDT = Commutative, Replicated Data Types
• also, "Conflict-free Replicated Data Types"
• Datatypes designed so that the order of operations
doesn't affect the outcome
• stronger than eventual consistency because writes don't
conflict
• requires "odd" datatypes, however
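One of the simplest such datatypes is a grow-only counter, sketched below. This is a textbook-style illustration rather than code from any specific library: each node increments only its own slot, and merging takes the element-wise maximum, so merges can arrive in any order and any number of times.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts = {}  # node id -> number of increments seen from that node

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


n1, n2 = GCounter("n1"), GCounter("n2")
n1.increment(3)
n2.increment(4)

n1.merge(n2)
n2.merge(n1)
n2.merge(n1)  # a duplicate delivery changes nothing
```

The "odd" part is visible here: the counter is really a map of per-node counters, and it can only grow. Supporting decrements takes a second map.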
The ATM example
• The problem in the ATM example is the writes
• read(X), client does X = X - 100, write(X)
• the time window in between allows for conflict
• What if the operation were "increment(X, -100)" instead?
• this is associative and commutative (but not necessarily
idempotent)
• In this case the "if X >= 100" test in the application logic could still be fooled
• however, the customer's balance would be "-100"
• so information would not be lost
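The difference between the two styles of write can be shown with a small simulation. The starting balance of 100 and the variable names are assumptions for illustration.

```python
from itertools import permutations

START = 100

# Read-modify-write: both clients read 100, subtract 100 and write the result back.
read_by_client1 = read_by_client2 = START
writes = [read_by_client1 - 100, read_by_client2 - 100]
final_with_overwrites = writes[-1]  # whichever write wins, one withdrawal is lost

# Increment operations: each node records the delta, not the resulting balance.
deltas = [-100, -100]
finals_with_increments = {START + sum(order) for order in permutations(deltas)}
```

With overwrites the account ends at 0 and 100 has vanished from the books; with increments every ordering ends at -100, so the overdraft is at least recorded.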
An example of a real CRDT
https://github.com/aphyr/meangirls
but "solutions" are half-solutions, and pretty awkward
Thus far, everyone agrees
“To go wildly faster, one must
remove all four sources of the
overhead discussed above.
This is possible in either a SQL
context or some other context.”
Didn't we just learn this isn't right?
Meanwhile, at Google...
The AdWords experience
This backend was originally based on a MySQL database that was manually
sharded many ways. The uncompressed dataset is tens of terabytes, which is
small compared to many NoSQL instances, but was large enough to cause
difficulties with sharded MySQL. The MySQL sharding scheme assigned each
customer and all related data to a fixed shard. This layout enabled the use of
indexes and complex query processing on a per-customer basis, but required
some knowledge of the sharding in application business logic.
Resharding this revenue-critical database as it grew in the number of
customers and their data was extremely costly. The last resharding took over
two years of intense effort, and involved coordination and testing across
dozens of teams to minimize risk.
AdWords requirements
We store financial data and have hard
requirements on data integrity and consistency.
We also have a lot of experience with eventual
consistency systems at Google. In all such systems,
we find developers spend a significant fraction of
their time building extremely complex and error-prone
mechanisms to cope with eventual
consistency and handle data that may be out of
date. We think this is an unacceptable burden to
place on developers and that consistency problems
should be solved at the database level.
More experience
At least 300 applications within Google use Megastore (despite its
relatively low performance) because its data model is simpler to
manage than Bigtable’s, and because of its support for
synchronous replication across datacenters. (Bigtable only
supports eventually-consistent replication across data-centers.)
Examples of well-known Google applications that use Megastore are
Gmail, Picasa, Calendar, Android Market, and AppEngine.
Requirements
• Scalability
• scale simply by adding hardware
• no manual sharding
• Availability
• no downtime, for any reason
• Consistency
• strong database consistency
• Usability
• full SQL with indexes
Uh, didn’t we just learn
that this is impossible?
Spanner
• Globally distributed semi-relational database
• SQL as query language
• versioned data with non-locking read-only transactions
• Externally consistent reads/writes
• Atomic schema updates
• even while transactions are running
• Basic availability
• experiment showed killing 25 out of 125 servers reduced
throughput, but had no other effect
Spanner architecture
Spanner architecture #2
Spanner data model
TrueTime
• Time API with uncertainty (ε)
• uses atomic clocks and GPS masters to reduce ε
• ε usually around 4 milliseconds
• TT.now() = [earliest, latest] = [now() - ε, now() + ε]
• TT.after(t) = t < TT.now().earliest (t has definitely passed)
• TT.before(t) = t > TT.now().latest (t has definitely not arrived)
[Diagram: timeline with now() in the middle, earliest at now() - ε and latest at now() + ε]
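A toy model of this interface is sketched below. It is not Google's API: the injected `clock` callable, the microsecond units and the class names are assumptions for illustration. Following the Spanner paper's definitions, `after(t)` is true only once t has definitely passed, and `before(t)` only while t has definitely not arrived.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: int
    latest: int


class TrueTime:
    """Toy TrueTime: a local clock reading (in microseconds) widened by uncertainty epsilon."""

    def __init__(self, clock: Callable[[], int], epsilon_us: int) -> None:
        self._clock = clock  # zero-argument callable returning local time in microseconds
        self.epsilon_us = epsilon_us

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon_us, t + self.epsilon_us)

    def after(self, t: int) -> bool:
        """True only if t has definitely passed."""
        return t < self.now().earliest

    def before(self, t: int) -> bool:
        """True only if t has definitely not arrived yet."""
        return t > self.now().latest


tt = TrueTime(clock=lambda: 1_000_000, epsilon_us=4_000)
interval = tt.now()
```

A timestamp that falls inside the interval is neither "after" nor "before": the API makes the uncertainty explicit instead of hiding it.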
Versioned rows
Key Data Data Timestamp
id1 ... ... t1
id2 ... ... t2
id1 ... ... t3
id3 ... ... t4
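Reading versioned rows at a given timestamp can be sketched as picking the newest version that is not later than the read timestamp. The integer timestamps and data values below are made up for illustration (the slide shows only t1 to t4 and elided data), and the function name is hypothetical.

```python
# (key, data, timestamp): rows are never overwritten, new versions are appended.
rows = [
    ("id1", "a", 1),
    ("id2", "b", 2),
    ("id1", "c", 3),
    ("id3", "d", 4),
]


def read_at(rows, key, ts):
    """Return the newest version of key with timestamp <= ts, or None if there is none."""
    candidates = [row for row in rows if row[0] == key and row[2] <= ts]
    return max(candidates, key=lambda row: row[2], default=None)


old_version = read_at(rows, "id1", 2)   # sees the version written at timestamp 1
new_version = read_at(rows, "id1", 3)   # sees the version written at timestamp 3
not_yet_written = read_at(rows, "id3", 1)
```

Because old versions stay in place, a read at a past timestamp needs no locks: later writes cannot change what it sees.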
Reads
• Non-locking
• System assigns a read timestamp t
• t = TT.now().latest
• (in reality somewhat smarter)
• Replicas maintain a value tsafe
• the latest timestamp up to which the replica is fully up to date
• Replica can reply to a read as long as t ≤ tsafe
• may require waiting for tsafe to progress
Linearizability
• If commit(T1) < start(T2), then ts(T1) < ts(T2)
• In addition, transactions use pessimistic locking
• This guarantees
• causal consistency
• external consistency
• linearizability
[Diagram: timeline on which T1 ends at commit(T1) before T2 begins at start(T2)]
Writes
• Commit timestamp is set to t ≥ TT.now().latest
• Data remains invisible until TT.after(t)
• that means commit wait ≥ 2ε
• After commit wait, apply change and release locks
• Paxos is used to handle locking and ordering
• this requires a write quorum of a majority of the nodes
• as a result, Spanner is CP, not AP
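The commit wait can be illustrated with a simulated clock. Everything here is a sketch under stated assumptions: `SimClock`, the 500 microsecond polling step and the fixed ε of 4 ms are invented for the example, and real Spanner overlaps the wait with Paxos communication.

```python
EPSILON_US = 4_000  # uncertainty, in microseconds


class SimClock:
    """Simulated local clock so the example runs instantly and deterministically."""

    def __init__(self) -> None:
        self.now_us = 0

    def advance(self, us: int) -> None:
        self.now_us += us


def tt_now(clock: SimClock):
    """Return (earliest, latest) around the local clock reading."""
    return (clock.now_us - EPSILON_US, clock.now_us + EPSILON_US)


def commit(clock: SimClock, step_us: int = 500):
    """Pick t = TT.now().latest, then wait until t has definitely passed."""
    started = clock.now_us
    commit_ts = tt_now(clock)[1]
    while not commit_ts < tt_now(clock)[0]:  # i.e. until TT.after(commit_ts)
        clock.advance(step_us)
    return commit_ts, clock.now_us - started


commit_ts, waited_us = commit(SimClock())
```

The wait comes out slightly above 2ε because the timestamp sits ε ahead of the local clock and the clock must then move a further ε past it before every node is guaranteed to agree that the timestamp is in the past.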
F1 - layer above Spanner
• Builds on Spanner, adds
• distributed SQL queries
• including joins from external sources
• transactionally consistent indexes
• asynchronous schema changes
• optimistic transactions
• automatic change history
Why versioned data?
“Many database users build mechanisms to log changes, either from application code or using
database features like triggers. In the MySQL system that AdWords used before F1, our Java
application libraries added change history records into all transactions. This was nice, but it was
inefficient and never 100% reliable. Some classes of changes would not get history records,
including changes written from Python scripts and manual SQL data changes.”
Application code is not enough to enforce business rules,
because many important changes are made behind the
application code. For example, data conversion.
Look at any database that’s a few years old, and you’ll
find data disallowed by the application code, but allowed
by the schema.
Distributed queries
Two interfaces
• NoSQL interface
• basically a simple key->row lookup
• simpler in code for object lookup
• faster because no SQL parsing
• Full SQL interface
• good for analytics and more complex interactions
Status
• >100 terabytes of uncompressed data
• distributed across 5 data centers
• Five nines (99.999%) uptime
• Serves up to hundreds of thousands of requests/second
• SQL queries scan trillions of rows/day
• No observable increase of latency compared to MySQL-
based backend
• but change tracking and sharding now invisible to application
Conclusion
Conclusion
• NoSQL is mostly about high availability & eventual
consistency
• to some degree also schemalessness
• NoSQL is eventually consistent because of CAP
• The CAP Theorem is a consequence of the theory of relativity
• New systems seem to indicate that consistency may scale,
after all
• basically, the speed of light is greater than we thought
• basic availability is enough if you have enough nodes
Further reading
• NoSQL eMag, InfoQ, pilot issue May 2013
• http://www.infoq.com/minibooks/emag-NoSQL
• Brewer’s original presentation
• http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Proof by Lynch & Gilbert
• http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
• Why Does E=mc²?, Cox & Forshaw
• Eventual Consistency Today: Limitations, Extensions, and Beyond,
ACM Queue
• http://queue.acm.org/detail.cfm?id=2462076
Even further reading
• Bailis papers
• http://www.bailis.org/papers/pbs-vldb2012.pdf
• http://www.bailis.org/papers/pbs-vldbj2014.pdf
• Spanner paper
• http://research.google.com/archive/spanner.html
• F1 papers
• http://research.google.com/pubs/pub38125.html
• http://research.google.com/pubs/pub41376.html
slideshare.net/larsga

如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 

NoSQL and Einstein's theory of relativity

  • 1. NoSQL and Relativity Lars Marius Garshol, lars.marius.garshol@schibsted.com 2015-09-10, JavaZone 2015 http://twitter.com/larsga
  • 2. Summary • CAP theorem says we must choose between availability and consistency • To scale, NoSQL databases choose availability • Einstein's theory of relativity shows clearly why this is a trade-off • Finally, we learn that the trade-off looks different from what one might think
  • 4. Consistency • Diagram (Client, DB node, DB node): write(x = 5), read(x) • Consistency = these two reads are guaranteed to return the same value (which need not be 5)
  • 5. Availability • Diagram (Client, DB node, DB node): write(x = 5), write(x = 217) • If the database accepts the write, you have availability • (the other node may be accepting other writes, even to x)
  • 6. The CAP theorem • Consistency • all nodes always give the same answer • Availability • all nodes always answer queries and accept updates • Partition-tolerance • database continues working, even if some nodes disappear • The theorem: choose any two! And you can't drop partition-tolerance
  • 7. CAP history • First formulated by Eric Brewer in 2000 • described the SQL/NoSQL divide very well • Formalized and proven in 2002 • by Seth Gilbert and Nancy Lynch • Today CAP is better understood • widely considered a key tradeoff in distributed systems • theoretical justification for NoSQL databases
  • 8. What defines NoSQL? • SQL is not the query language • usually something more primitive • sometimes just key/value lookup • Sacrifices consistency for availability • which is what this talk is about • Schemaless • that is, data need not conform to a predefined schema
  • 9. Why NoSQL? • Big internet sites couldn't scale relational databases • consistency requires communication between all nodes, which doesn't scale to high numbers of nodes • long downtimes during schema changes • Therefore, switched to NoSQL databases • schemalessness means no schema changes • sacrificing consistency means higher performance • Downside is that complexity moves out of the database and into the application
  • 10. Achieving availability • Diagram (Client, DB node, DB node): write(x = 5) -> OK, write(x = 217) • It's not acceptable to have the two nodes forever disagreeing on the value of x. One solution is eventual consistency.
  • 11. Eventual consistency • Promises that • if no further writes are made, • eventually all nodes will be consistent • A very weak guarantee • when is "eventually"? • what happens before "eventually"? • Stronger guarantees are sometimes made • for example, by quantifying actual behaviour in practice
  • 12. Implementing eventual consistency • Nodes must inform all other nodes of writes • when a node receives and applies a write, it must pass it on to all other nodes • the node signals OK to the writer before complete agreement is reached • Requires a conflict resolution mechanism • all nodes must agree on the resolution • common solution: clock value in write, let write with highest value win • clocks need not be in sync
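The "highest clock value wins" rule on this slide can be sketched roughly as follows. This is an illustration only, not code from the talk: `Write`, `node_id` and `merge` are hypothetical names, and the node-id tie-breaker is an added assumption so that every node resolves equal clock values the same way.

```python
from typing import NamedTuple

class Write(NamedTuple):
    clock: int      # clock value attached to the write (need not be wall time)
    node_id: str    # assumed tie-breaker so equal clocks resolve identically
    value: int

def merge(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the highest (clock, node_id)."""
    return max(a, b, key=lambda w: (w.clock, w.node_id))

w1 = Write(clock=7, node_id="n1", value=5)
w2 = Write(clock=9, node_id="n2", value=217)

# Every node picks the same winner, whatever order the writes arrive in.
assert merge(w1, w2) == merge(w2, w1) == w2
```

Note that the losing write is silently discarded, which is exactly the weakness the next slide illustrates.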
  • 13. When eventually is too late • Diagram (Client 1, DB node 1, DB node 2, Client 2) • Each client reads account X balance -> 100 from its own node • Each client then sets account X balance = 0 • Happy customer walks away, richer by 200. • Nodes eventually agree balance is 0.
  • 14. The key problem • The ordering of events affects the outcome • that is, what application logic chooses to do is affected by what the database says • what the database says depends on the ordering of events • The different nodes do not observe the same order of events • This can be solved • at the cost of a communication delay • which is key to the consistency/availability tradeoff
  • 16. Time, Clocks and the Ordering of Events in a Distributed System • “The origin of this paper was a note titled The Maintenance of Duplicate Databases by Paul Johnson and Bob Thomas. I believe their note introduced the idea of using message timestamps in a distributed algorithm. I happen to have a solid, visceral understanding of special relativity (see [5]). This enabled me to grasp immediately the essence of what they were trying to do. ... I realized that the essence of Johnson and Thomas's algorithm was the use of timestamps to provide a total ordering of events that was consistent with the causal order. ... It didn't take me long to realize that an algorithm for totally ordering events could be used to implement any distributed system.” • http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
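The idea of timestamps that give a total order consistent with causal order is usually presented as a logical clock: a counter that advances on every local event and jumps past any timestamp it receives. A minimal sketch, assuming integer process ids as the tie-breaker; `LamportClock`, `tick` and `receive` are illustrative names, not taken from the slides.

```python
class LamportClock:
    def __init__(self, pid: int):
        self.pid = pid
        self.time = 0

    def tick(self) -> tuple[int, int]:
        """Local event or send: advance the counter and return a timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def receive(self, msg_time: int) -> tuple[int, int]:
        """On receipt, jump past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)

a, b = LamportClock(1), LamportClock(2)
send = a.tick()             # a sends a message
b.tick()                    # unrelated local event on b
recv = b.receive(send[0])   # b receives a's message

# Comparing (time, pid) tuples totally orders events, and a cause
# always sorts before its effect.
assert send < recv
```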
  • 17. Order theory • An ordering relation is any relation ≤ such that • a ≤ a (reflexivity) • if a ≤ b and b ≤ a then a = b (antisymmetry) • if a ≤ b and b ≤ c then a ≤ c (transitivity) • A total order is an order such that, for all a and b, • a ≤ b or b ≤ a (totality) • A partial order is an order that need not be total • that is, for some pairs a and b, it may be that neither a ≤ b nor b ≤ a
  • 18. Examples of partial order • Ordering sets by the subset relation • for some sets neither a ≤ b nor b ≤ a • Ordering values of type 'duration' from XML Schema • “In general, the ·order-relation· on duration is a partial order since there is no determinate relationship between certain durations such as one month (P1M) and 30 days (P30D)” • 1 day ≤ 2 days • but 1 month and 30 days?
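The subset example can be checked directly, since Python's `<=` on sets is the subset relation. The three sets and the `comparable` helper below are made up for illustration.

```python
a, b, c = frozenset({1}), frozenset({1, 2}), frozenset({3})

def comparable(x: frozenset, y: frozenset) -> bool:
    """True if x ≤ y or y ≤ x under the subset relation."""
    return x <= y or y <= x

assert a <= b                  # {1} is a subset of {1, 2}
assert comparable(a, b)
assert not comparable(b, c)    # neither is a subset of the other
```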
  • 20. History of physics • 1687 • Isaac Newton publishes Philosophiæ Naturalis Principia Mathematica • physics begins • no changes over next two centuries • 1905 • Albert Einstein publishes special relativity • abandons notions of fixed time and space • 1916 • Einstein’s general relativity • takes into account gravity • no changes since
  • 21. ... (we skip 270 slides)
  • 22. The barn is too small • Three people (M, F, and B) own a board (5m wide) and a barn (4m wide) • The board doesn’t fit inside the barn! • What to do?
  • 23.
  • 24. We have a solution! • As seen by F & B: board is 4m long (relativistic shortening), barn continues to be 4m wide. • When the board is exactly inside the barn, F and B will close their doors simultaneously, and the problem will be solved. • (As seen by M: board is 5m long (at rest relative to him), barn shortened to 3.2m wide. Pay no attention to this.)
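The lengths on this slide are consistent with the standard length-contraction formula L = L0 / γ, where γ = 1 / sqrt(1 - v²/c²). The slide does not state the speed; the value 0.6c below is inferred from the 5m board appearing 4m long, so treat it as an assumption.

```python
import math

def gamma(beta: float) -> float:
    """Lorentz factor for speed v = beta * c."""
    return 1.0 / math.sqrt(1.0 - beta ** 2)

beta = 0.6                         # assumed speed as a fraction of c
g = gamma(beta)                    # ≈ 1.25
board_seen_by_f_and_b = 5.0 / g    # ≈ 4.0 m: the moving board fits the 4m barn
barn_seen_by_m = 4.0 / g           # ≈ 3.2 m: for M it is the barn that shrinks
```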
  • 25. What they observe • F and B: • When the board is just inside, both close their doors simultaneously • Right after, the board crashes through back door • M: • B shuts his door just as the front of the board reaches him • 0.6 seconds later, F closes his door • Board crashes through back door
  • 26. The key point • They don’t agree on the order of events! • Change the story slightly and the three people could have three different orders of events • This is not a paradox • it is in fact how the universe works • the ordering of events in the universe is a partial order • What then of causality? • if A causes B, but some people think B happened before A, then what?
  • 27. Resolution • A, B, and C are events • The cone is the “light cone” from A • that is, the spread of light from A • C is outside the cone • therefore A cannot influence C • observers may disagree on order of A & C • B is inside the cone • therefore A can influence B • observers cannot disagree on the order of A & B
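The light-cone test can be written down in a few lines. This sketch assumes one spatial dimension and represents an event as a (time in seconds, position in metres) pair; `can_influence` and the sample events are hypothetical.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def can_influence(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if event b lies inside (or on) the future light cone of event a."""
    dt = b[0] - a[0]
    dx = abs(b[1] - a[1])
    return dt >= 0 and C * dt >= dx

event_a = (0.0, 0.0)
event_b = (1.0, 1000.0)     # 1 s later, 1 km away: light has had time to arrive
event_c = (1e-9, 1000.0)    # 1 ns later, 1 km away: outside the cone

assert can_influence(event_a, event_b)
assert not can_influence(event_a, event_c)
```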
  • 28. ... (we skip 532 slides)
  • 30. How to order events? • A total ordering of events is impossible • unless a communications delay is introduced • this is just part of how the universe works • If all nodes are inside the light cone of an event they can agree on the order • this is where the delay comes from • so the time taken by light to traverse the physical distance is the ultimate lower bound on the delay • in practice, the effective delay is higher, due to hardware and design constraints
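To get a feel for the size of that lower bound, the one-way delay is simply distance divided by the speed of light. The 6,000 km distance below is a made-up figure for two far-apart data centers, not a number from the talk.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def min_one_way_delay(distance_m: float) -> float:
    """Lower bound, in seconds, on the time for any signal to cover the distance."""
    return distance_m / C

delay_ms = min_one_way_delay(6_000_000) * 1000   # hypothetical 6,000 km link
# Roughly 20 ms one way, before any hardware or protocol overhead is added.
```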
  • 31. One solution: Paxos • Created by Leslie Lamport from that original insight • Can be used to introduce a logical clock • basically a counter all nodes agree on • This, in turn, can be used to create a total order for events • As you can see, the cost is the communications delay • http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#paxos-simple
  • 32. Or eventual consistency • In this scenario we accept that what the database tells you might be wrong • can be handled by application logic • or there may be separate business processes to handle errors • For example, Amazon might have to compensate disappointed customers with a gift card once per 100,000 transactions • benefit of staying in business outweighs cost of error • This happens even in banking • an ATM that loses network access may still allow withdrawals up to a limit, accepting the risk of overdrafts • customers who overdraw will pay fees and interest, anyway
  • 33. How eventual is eventual consistency? • Two papers by Peter Bailis (2012 and 2014) give formulas for computing the odds of a stale read • They show that you can usually get 99%+ odds of consistency after a short time window • But ... 1M transactions/day, 99.99% odds, still means 100 stale reads/day
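The arithmetic behind the last bullet, using exact fractions to avoid floating-point rounding (variable names are illustrative):

```python
from fractions import Fraction

transactions_per_day = 1_000_000
p_consistent = Fraction(9999, 10000)    # 99.99% odds of a consistent read

stale_per_day = transactions_per_day * (1 - p_consistent)
assert stale_per_day == 100
```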
  • 34. Or CALM • CALM = Consistency As Logical Monotonicity • that is, facts used by clients to make decisions never change • this preserves causality • A database that never deletes or overwrites is CALM • an event log, such as a record of stock exchange trades, timeseries data, or ... • not suitable for all systems, though
  • 35. A CALM example • Client 1 reads A = 10 • Client 1 uses this to decide to write B = 5 • If Client 2 now reads B = 5, then they must also read either A = 10 or a later value of A • This preserves causal consistency
  • 36. Or ACID 2.0 • Not really ACID at all, so rather misleading • Requires update operations to have these properties: • associativity a + (b + c) = (a + b) + c • commutativity a + b = b + a • idempotence f(x) = f(f(x)) • distributed • Usual approach is to use datatypes which guarantee this
  • 37. CRDTs give ACID 2.0 • CRDT = Commutative, Replicated Data Types • also, "Conflict-free Replicated Data Types" • Datatypes designed so that the order of operations doesn't affect the outcome • stronger than eventual consistency because writes don't conflict • requires "odd" datatypes, however
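One well-known datatype with these properties is the grow-only counter, where each node keeps its own count and replicas merge by taking the per-node maximum. This is a generic sketch and not the datatype shown in the talk; the node names and counts are invented.

```python
def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Merge two grow-only counters by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def value(counter: dict[str, int]) -> int:
    """The counter's value is the sum of all per-node counts."""
    return sum(counter.values())

x, y, z = {"n1": 2}, {"n2": 3}, {"n1": 1, "n3": 4}

assert merge(x, y) == merge(y, x)                        # commutative
assert merge(x, merge(y, z)) == merge(merge(x, y), z)    # associative
assert merge(x, x) == x                                  # idempotent
assert value(merge(merge(x, y), z)) == 9
```

Because merging is associative, commutative and idempotent, replicas can exchange state in any order, any number of times, and still converge.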
  • 38. The ATM example • The problem in the ATM example is the writes • read(X), client does X = X - 100, write(X) • the time window in between allows for conflict • What if the operation were "increment(X, -100)" instead? • this is associative and commutative (but not necessarily idempotent) • In this case the "if X >= 100" test could still be fooled • however, the customer's balance would end up at -100 • so information would not be lost
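The difference between the two styles of write can be shown with a toy model, assuming a starting balance of 100 and two withdrawals of 100 as in the earlier ATM slide. Both functions are hypothetical and ignore replication entirely.

```python
from itertools import permutations

def read_modify_write(start: int, amount: int, clients: int) -> int:
    """Every client reads the same stale balance and writes back an absolute value."""
    writes = [start - amount for _ in range(clients)]
    return writes[-1]              # last write wins; the other withdrawal is lost

def apply_increments(start: int, deltas: tuple[int, ...]) -> int:
    """Apply increment operations in the given order."""
    balance = start
    for d in deltas:
        balance += d
    return balance

assert read_modify_write(100, 100, clients=2) == 0

# Increments commute: every delivery order ends at the same balance.
outcomes = {apply_increments(100, order) for order in permutations((-100, -100))}
assert outcomes == {-100}
```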
  • 39. An example of a real CRDT • https://github.com/aphyr/meangirls
  • 40. Thus far, everyone agrees • but "solutions" are half-solutions, and pretty awkward
  • 41. “To go wildly faster, one must remove all four sources of the overhead discussed above. This is possible in either a SQL context or some other context.”
  • 42. Didn't we just learn this isn't right?
  • 44. The AdWords experience “This backend was originally based on a MySQL database that was manually sharded many ways. The uncompressed dataset is tens of terabytes, which is small compared to many NoSQL instances, but was large enough to cause difficulties with sharded MySQL. The MySQL sharding scheme assigned each customer and all related data to a fixed shard. This layout enabled the use of indexes and complex query processing on a per-customer basis, but required some knowledge of the sharding in application business logic. Resharding this revenue-critical database as it grew in the number of customers and their data was extremely costly. The last resharding took over two years of intense effort, and involved coordination and testing across dozens of teams to minimize risk.”
  • 45. AdWords requirements “We store financial data and have hard requirements on data integrity and consistency. We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.”
  • 46. More experience “At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across data-centers.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.”
  • 47. Requirements • Scalability • scale simply by adding hardware • no manual sharding • Availability • no downtime, for any reason • Consistency • strong database consistency • Usability • full SQL with indexes • Uh, didn’t we just learn that this is impossible?
  • 48. Spanner • Globally distributed semi-relational database • SQL as query language • versioned data with non-locking read-only transactions • Externally consistent reads/writes • Atomic schema updates • even while transactions are running • Basic availability • experiment showed killing 25 out of 125 servers reduced throughput, but had no other effect
  • 52. TrueTime • Time API with uncertainty (ε) • use atomic clock and GPS masters to reduce ε • ε usually around 4 milliseconds • TT.now() = [earliest, latest] = [now() - ε, now() + ε] • TT.after(t) = t < TT.now().earliest (t has definitely passed) • TT.before(t) = t > TT.now().latest (t has definitely not arrived)
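A toy model of this API, following the Spanner paper's definitions in which after(t) is true only once t has definitely passed and before(t) only while t has definitely not arrived. The class below is a sketch, not Google's implementation: the fixed ε, the injectable `clock` parameter and the frozen fake clock are all assumptions made for illustration.

```python
import time
from typing import Callable, NamedTuple

class TTInterval(NamedTuple):
    earliest: float
    latest: float

class TrueTime:
    """Toy model: a local clock plus a fixed uncertainty bound epsilon (seconds)."""

    def __init__(self, epsilon: float = 0.004,
                 clock: Callable[[], float] = time.time):
        self.epsilon = epsilon
        self.clock = clock

    def now(self) -> TTInterval:
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True once t has definitely passed."""
        return t < self.now().earliest

    def before(self, t: float) -> bool:
        """True while t has definitely not arrived."""
        return t > self.now().latest

tt = TrueTime(clock=lambda: 100.0)    # frozen fake clock for illustration

# A timestamp inside the uncertainty window is neither "after" nor "before".
assert not tt.after(100.0) and not tt.before(100.0)
assert tt.after(99.99) and tt.before(100.01)
```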
  • 53. Versioned rows
      Key | Data | Data | Timestamp
      id1 | ...  | ...  | t1
      id2 | ...  | ...  | t2
      id1 | ...  | ...  | t3
      id3 | ...  | ...  | t4
  • 54. Reads • Non-locking • System assigns a read timestamp t • t = TT.now().latest • (in reality somewhat smarter) • Replicas maintain a value tsafe • the timestamp up to which the replica is 100% up to date • Replica can reply to read as long as t ≤ tsafe • may require waiting for tsafe to progress
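A read at timestamp t over versioned rows amounts to picking, for a key, the newest version whose timestamp is not later than t. A minimal sketch that mirrors the versioned-rows slide; the data values and the `read_at` helper are invented, and the tsafe check is left out.

```python
# (key, data, commit timestamp), mirroring the versioned-rows table
rows = [
    ("id1", "a", 1),
    ("id2", "b", 2),
    ("id1", "c", 3),
    ("id3", "d", 4),
]

def read_at(key: str, t: int):
    """Return the newest version of key with timestamp <= t, or None."""
    versions = [(ts, data) for k, data, ts in rows if k == key and ts <= t]
    return max(versions)[1] if versions else None

assert read_at("id1", 2) == "a"     # the version written at t3 is not yet visible
assert read_at("id1", 3) == "c"
assert read_at("id3", 3) is None    # id3 does not exist before t4
```

Since old versions are never overwritten, such a read needs no locks: later writes cannot change what a read at t returns.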
  • 55. Linearizability • If commit(T1) < start(T2), then ts(T1) < ts(T2) • In addition, transactions use pessimistic locking • This guarantees • causal consistency • external consistency • linearizability
  • 56. Writes • Commit timestamp is set to t ≥ TT.now().latest • Data remains invisible until TT.after(t) • that means commit wait ≥ 2ε • After commit wait, apply change and release locks • Paxos is used to handle locking and ordering • this requires a write quorum of a majority of the nodes • as a result, Spanner is CP, not AP
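Where the 2ε comes from: the commit timestamp is chosen ε ahead of the local clock, and the data may only become visible once the earliest possible current time has passed that timestamp, which takes another ε. A sketch of the arithmetic, assuming a constant ε and a hypothetical `commit` helper that works on plain numbers rather than a real clock:

```python
def commit(now: float, epsilon: float) -> tuple[float, float]:
    """Return (commit timestamp, local time after which the write may be visible)."""
    ts = now + epsilon            # t = TT.now().latest
    visible_at = ts + epsilon     # TT.now().earliest reaches ts at this local time
    return ts, visible_at

ts, visible_at = commit(now=10.0, epsilon=0.004)
commit_wait = visible_at - 10.0   # about 2 * epsilon, i.e. roughly 8 ms here
```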
  • 57. F1 - layer above Spanner • Builds on Spanner, adds • distributed SQL queries • including joins from external sources • transactionally consistent indexes • asynchronous schema changes • optimistic transactions • automatic change history
  • 58. Why versioned data? “Many database users build mechanisms to log changes, either from application code or using database features like triggers. In the MySQL system that AdWords used before F1, our Java application libraries added change history records into all transactions. This was nice, but it was inefficient and never 100% reliable. Some classes of changes would not get history records, including changes written from Python scripts and manual SQL data changes.” • Application code is not enough to enforce business rules, because many important changes are made behind the application code. For example, data conversion. • Look at any database that’s a few years old, and you’ll find data disallowed by the application code, but allowed by the schema.
  • 60. Two interfaces • NoSQL interface • basically a simple key->row lookup • simpler in code for object lookup • faster because no SQL parsing • Full SQL interface • good for analytics and more complex interactions
  • 61. Status • >100 terabytes of uncompressed data • distributed across 5 data centers • Five nines (99.999%) uptime • Serves up to hundreds of thousands of requests/second • SQL queries scan trillions of rows/day • No observable increase of latency compared to MySQL-based backend • but change tracking and sharding now invisible to application
  • 63. Conclusion • NoSQL is mostly about high availability & eventual consistency • to some degree also schemalessness • NoSQL is eventually consistent because of CAP • The CAP Theorem is a consequence of the theory of relativity • New systems seem to indicate that consistency may scale, after all • basically, the speed of light is greater than we thought • basic availability is enough if you have enough nodes
  • 64. Further reading • NoSQL eMag, InfoQ, pilot issue May 2013 • http://www.infoq.com/minibooks/emag-NoSQL • Brewer’s original presentation • http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • Proof by Lynch & Gilbert • http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf • Why E=mc2, Cox & Forshaw • Eventual Consistency Today: Limitations, Extensions, and Beyond, ACM Queue • http://queue.acm.org/detail.cfm?id=2462076
  • 65. Even further reading • Bailis papers • http://www.bailis.org/papers/pbs-vldb2012.pdf • http://www.bailis.org/papers/pbs-vldbj2014.pdf • Spanner paper • http://research.google.com/archive/spanner.html • F1 papers • http://research.google.com/pubs/pub38125.html • http://research.google.com/pubs/pub41376.html • slideshare.net/larsga