48. -F1: A Distributed SQL Database That Scales, Google
“Because the data is synchronously replicated
across multiple datacenters, and because
we’ve chosen widely distributed datacenters,
the commit latencies are relatively high (50-150
ms).”
49. -Kohavi and Longbotham 2007
“Every 100 ms increase in load time of
Amazon.com decreased sales by 1%.”
(~$120M of losses per 100 ms)
50. WANs Fail
“Average partition duration ranged from 6 minutes for
software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes;
95th percentile of 19.9 minutes and 3.7 days,
respectively).”
-The Network is Reliable
53. -F1: A Distributed SQL Database That Scales, Google
“We also have a lot of experience with eventual
consistency systems at Google. In all such
systems, we find developers spend a
significant fraction of their time building
extremely complex and error-prone
mechanisms to cope with eventual consistency
and handle data that may be out of date. We
think this is an unacceptable burden to place
on developers and that consistency problems
should be solved at the database level. ”
55. “A shared-data system can have at most
two of the three following properties:
Consistency, Availability, and tolerance to
network Partitions.”
-Dr. Eric Brewer
56. On Consistency
• ACID Consistency: Any transaction or operation
brings the database from one valid state to
another
• CAP Consistency: All nodes see the same data at
the same time (synchrony)
57. On Partition Tolerance
• The network will be allowed to lose arbitrarily many
messages sent from one node to another.
• Database systems, in order to be useful, must
communicate over the network
• Clients count: client-to-node links can partition too
58. There is no such thing as
a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
59. We Can Have Both*
(*Just not at the same time)
60. PNUTS
• Paper released by Yahoo! research in 2008
• Operations:
• Read-Any
• Read-Critical(Required-Version)*
• Read-Latest
• Write
• Test-and-set-write(Required-Version)
* Will fall back to CP operation
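A toy model can make the PNUTS read levels concrete. The class and method names below are my own sketch of the per-record mastership scheme, not the paper's actual API:

```python
# Toy model of PNUTS-style read levels over versioned replicas.
# The Replica/Record layout and names are illustrative, not the paper's API.

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

class Record:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.master = 0  # PNUTS orders all writes through a per-record master

    def write(self, value):
        m = self.replicas[self.master]
        m.version += 1
        m.value = value
        return m.version  # replication to the other replicas is asynchronous

    def replicate(self, i):
        m = self.replicas[self.master]
        self.replicas[i].version, self.replicas[i].value = m.version, m.value

    def read_any(self, i):
        """Cheapest: whatever this replica has, possibly stale."""
        return self.replicas[i].value

    def read_critical(self, i, required_version):
        """Serve locally if fresh enough, else fall back to the master (CP)."""
        r = self.replicas[i]
        if r.version >= required_version:
            return r.value
        return self.replicas[self.master].value

    def read_latest(self):
        return self.replicas[self.master].value
```

A reader holding the version returned by its own write can use Read-Critical to guarantee it never sees data older than that write, without always paying the cost of Read-Latest.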
63. “This is a specific form of weak
consistency; the storage system
guarantees that if no new
updates are made to the object,
eventually all accesses will
return the last updated value.”
Definition of “Eventual Consistency” from “Eventually
Consistent - Revisited” - Werner Vogels
109. Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed
systems
• Can determine whether events are concurrent, and
detect causality violations
• Preserves happens-before relationships
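A minimal sketch of the vector-clock operations described above, assuming dict-based clocks keyed by process ID:

```python
# Minimal vector clock: dicts mapping process ID -> event count.

def increment(clock, pid):
    """Advance this process's entry on a local event or message send."""
    c = dict(clock)
    c[pid] = c.get(pid, 0) + 1
    return c

def merge(a, b):
    """Entrywise max, taken on message receive."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def happened_before(a, b):
    """True iff a -> b: a <= b entrywise and a != b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a, b):
    """Neither clock happened before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

Two events whose clocks are incomparable (e.g. `{"p": 1}` vs `{"q": 1}`) are concurrent; that is exactly the case a Lamport clock cannot detect.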
111. CRDTs
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures that stay writeable on both sides of a partition,
and replay after the partition heals
• Enable distributed computation across monotonic functions
• Two types:
• CvRDTs
• CmRDTs
112. CvRDTs
• State-based (value-based) CRDTs
• Minimal state
• Don’t require active garbage collection
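The canonical CvRDT is the grow-only counter; a minimal sketch, with replica IDs and the dict representation as illustrative choices:

```python
# Minimal state-based G-Counter (a CvRDT): one entry per replica ID.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> local increment count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Join: entrywise max. Commutative, associative, and idempotent,
        so replicas converge regardless of how states are gossiped."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Two replicas can increment independently during a partition; merging their states in either order (or repeatedly) yields the same total.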
116. CRDTs in the Wild
• Sets
• Observe-remove set
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
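The observe-remove set from the list above can be sketched as follows; tagging each add with a fresh unique ID is the standard trick, and the field names are mine:

```python
import uuid

class ORSet:
    """Toy observed-remove set: each add carries a unique tag; a remove
    tombstones only the tags observed locally, so a concurrent add on the
    other side of a partition survives the merge (adds win)."""
    def __init__(self):
        self.adds = set()        # (element, tag) pairs
        self.tombstones = set()  # removed (element, tag) pairs

    def add(self, elem):
        self.adds.add((elem, uuid.uuid4().hex))

    def remove(self, elem):
        self.tombstones |= {(e, t) for (e, t) in self.adds if e == elem}

    def contains(self, elem):
        return any(e == elem for (e, t) in self.adds - self.tombstones)

    def merge(self, other):
        self.adds |= other.adds
        self.tombstones |= other.tombstones
```

If one replica removes an element while another concurrently re-adds it, the re-add's fresh tag is not in any tombstone set, so the element is present after merge.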
117. Data structures that are CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum counter
• Running average
• Operational Transform
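A Bloom filter illustrates why these probabilistic structures are convergent: its merge is a bitwise OR, which is commutative, associative, and idempotent. A minimal sketch, where the filter size and double-hashing scheme are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Tiny grow-only Bloom filter; merge is bitwise OR, so replicated
    filters converge no matter the order or repetition of merges."""
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = 0  # an int used as an m-bit array

    def _positions(self, item):
        # Double hashing: derive k positions from one SHA-256 digest.
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def merge(self, other):
        self.bits |= other.bits
```

As with any Bloom filter, membership answers can be false positives but never false negatives, and merging preserves that property.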
118. CRDTs
• Incredibly powerful primitive
• Not only useful for in-database manipulation, but
also for client-database interaction
• You can compose them, and build your own
• Garbage collection is tricky
137. Operations Requiring Weak Consistency vs. Strong Consistency

Invariant            Operation   AP / CP
Specify unique ID    Any         CP
Generate unique ID   Any         AP
>                    INCREMENT   AP
>                    DECREMENT   CP
<                    INCREMENT   CP
<                    DECREMENT   AP
Secondary Index      Any         AP
Materialized View    Any         AP
AUTO_INCREMENT       INSERT      CP
Linearizability      CAS         CP
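The increment/decrement rows can be sketched numerically. Under a lower-bound invariant like `balance > 0`, increments made on both sides of a partition always merge safely, while concurrent decrements can jointly violate the invariant even though each passed its local check; this toy merge-by-summing-deltas model is my own illustration:

```python
# Sketch: invariant "balance > 0", replicas apply deltas independently
# during a partition and merge by summing them afterwards.

def merge(base, deltas):
    """Combine a starting balance with deltas from all replicas."""
    return base + sum(deltas)

base = 10

# Two replicas increment independently: no coordination needed (AP),
# because increments can only move the balance further from the bound.
assert merge(base, [+5, +3]) > 0

# Two replicas each decrement 7; each local check (10 - 7 > 0) passes,
# but the merged outcome is -4, violating the invariant -- hence
# decrement under a ">" invariant needs coordination (CP).
assert not merge(base, [-7, -7]) > 0
```

The `<` rows are the mirror image: under an upper bound, it is the increments that require coordination.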
138. BASE not ACID
• Basically Available: There will be a response to
every request (failure or success)
• Soft State: Any two reads against the system
may yield different data (when measured over
time)
• Eventually Consistent: The system will
eventually become consistent once all failures
have healed and time goes to infinity
140. Technology Timeline
• 1996 - Log structured merge tree
• 2000 - CAP Theorem
• 2007 - Amazon Dynamo Paper
• 2011 - INRIA CRDT Technical Report
• 2014 - Riak DT map: a composable, convergent
replicated dictionary
141. Further Reading
• Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area
Storage with COPS
• PNUTS: Yahoo!’s Hosted Data Serving Platform
• F1: A Distributed SQL Database That Scales
• Spanner: Google's Globally-Distributed Database
• The Network is Reliable: An informal survey of real-world communications
failures
• A Comprehensive Study of Convergent and Commutative Replicated Data
Types
• Riak DT Map: A Composable, Convergent Replicated Dictionary
142. Get in Touch
• If you’re interested in cheating the speed of light
• Come use our software
• If you’re interested in solving today’s computer science
problems
• Come work for us
• If you’d like to learn more about distributed systems at
scale
• Maybe you have a better idea