Cassandra: Open Source Bigtable + Dynamo

Motivation
● Scaling reads to a relational database is
hard
● Scaling writes to a relational database is
virtually impossible
● … and when you do, it usually isn't relational
anymore

The new face of data
● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware

CAP theorem
● Pick two of Consistency, Availability,
Partition tolerance

Two famous papers
● Bigtable: A distributed storage system for
structured data, 2006
● Dynamo: Amazon's highly available
key-value store, 2007

Two approaches
● Bigtable: “How can we build a distributed
db on top of GFS?”
● Dynamo: “How can we build a distributed
hash table appropriate for the data
center?”

10,000 ft summary
● Dynamo partitioning and replication
● Log-structured ColumnFamily data model
similar to Bigtable's

Cassandra highlights
● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency
and latency
● Minimal administration
● No single point of failure (SPOF)

Architecture details
● O(1) node lookup
● Explicit replication
● Eventually consistent
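
A minimal sketch of Dynamo-style node lookup, assuming an MD5-based token and a hypothetical `Ring` class (neither name comes from the slides). Each node holds the full token list, so finding the replicas for a key is a local computation with no routing hops, which is roughly what "O(1) node lookup" refers to.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (using MD5 here is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical in-memory view of the cluster that every node would hold."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sort nodes by token so the ring can be searched with bisect.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key):
        """Return the N nodes found by walking clockwise from the key's token."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(self.replicas, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(count)]
```

Usage: `Ring(["node1", "node2", "node3", "node4"]).replicas_for("keyA")` returns three distinct node names, and the same key always maps to the same three.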

Architecture layers
Messaging service    Commit log    Tombstones
Gossip               Memtable      Hinted handoff
Failure detection    SSTable       Read repair
Cluster state        Indexes       Bootstrap
Partitioner          Compaction    Monitoring
Replication                        Admin tools

Writes
● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
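
The local part of the write path above can be sketched as follows. This is a simplified, single-node illustration: the `Node` class, the `memtable_limit` threshold, and the use of Python lists in place of on-disk files are all assumptions made for the example.

```python
class Node:
    """Hypothetical single node: commit log, memtable, flushed SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in memory: key -> {column: (value, timestamp)}
        self.sstables = []     # immutable, key-sorted snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the memtable; no read from disk is needed.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write rows out in key order; the SSTable is never modified afterwards.
        sstable = [(k, dict(self.memtable[k])) for k in sorted(self.memtable)]
        self.sstables.append(sstable)
        self.memtable = {}
        # Simplification: log entries covered by the flush are dropped.
        self.commit_log.clear()
```

The design point is that a write only appends and updates memory; sorting happens at flush time and merging is left to compaction.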

Memtable / SSTable

(Diagram: disk, commit log)

SSTable format
● Key / data

SSTable Indexes
● Bloom filter
● Key
● Column

(Similar to Hadoop MapFile / TFile)
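
A Bloom filter lets a read skip an SSTable that cannot contain the key. The sketch below is a generic Bloom filter, not Cassandra's implementation; the class name, the sizes, and the SHA-256-based hashing are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: bytes):
        # Derive several hash positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A reader would consult `might_contain` before touching the key index, and only seek into the SSTable when it returns `True`.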

Compaction
● Merge keys
● Combine columns
● Discard tombstones
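
The three compaction steps above can be sketched as a merge over in-memory dictionaries. The representation is an assumption: each SSTable is a dict of `key -> {column: (value, timestamp)}`, and a tombstone is a column whose value is `None`.

```python
TOMBSTONE = None  # assumption: a deleted column is stored with value None


def compact(sstables, drop_tombstones=True):
    """Merge several SSTables into one, keeping the newest version of each column."""
    merged = {}
    for sstable in sstables:
        for key, columns in sstable.items():
            row = merged.setdefault(key, {})          # merge keys
            for name, (value, ts) in columns.items():
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)           # combine columns
    result = {}
    for key in sorted(merged):
        cols = {name: col for name, col in merged[key].items()
                if not (drop_tombstones and col[0] is TOMBSTONE)}
        if cols:                                      # discard tombstones
            result[key] = cols
    return result
```

For example, compacting an older table holding `column1` and `column2` for `keyA` with a newer table that overwrites `column1` and deletes `column2` leaves only the newer `column1`.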

Remove
● Deletion marker (tombstone) necessary
to suppress data in older SSTables, until
compaction
● Read repair complicates things a little
● Eventually consistent complicates things
more
● Solution: configurable delay before
tombstone GC, after which tombstones
are not repaired
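
A small sketch of why the delay matters, under simplifying assumptions: a column is a `(value, timestamp)` pair, a tombstone has value `None`, and the tombstone's timestamp doubles as its deletion time. The function names and the `gc_grace` parameter are hypothetical.

```python
def reconcile(a, b):
    """Newest timestamp wins, so a newer tombstone suppresses an older value."""
    return a if a[1] >= b[1] else b


def purgeable(column, now, gc_grace):
    """A tombstone may be dropped only after the configured delay has passed."""
    value, deleted_at = column
    return value is None and now - deleted_at >= gc_grace
```

If a tombstone were purged while some replica still held the old value, a later read repair would see only the old value and copy it back. Keeping the tombstone for `gc_grace` gives every replica time to learn about the delete first.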

Cassandra write properties
● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable

Read path
● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the
background and perform read repair
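
The read path above, sketched with replicas modelled as plain dicts of `key -> (value, timestamp)`. This is an illustration only: the function name is hypothetical, the "background" step runs inline, and the first R replicas in the list stand in for the first R to respond.

```python
def read(replicas, key, r):
    """Answer from the first R responses, then repair stale replicas."""
    responses = [(node, node.get(key)) for node in replicas]

    # Answer the client using only the first R responses.
    first_r = [col for _, col in responses[:r] if col is not None]
    result = max(first_r, key=lambda col: col[1]) if first_r else None

    # "Background" step: compare all N responses and push the newest
    # version to any replica that is missing it or holds an older one.
    present = [col for _, col in responses if col is not None]
    if present:
        newest = max(present, key=lambda col: col[1])
        for node, col in responses:
            if col is None or col[1] < newest[1]:
                node[key] = newest
    return result
```

With R=1 the first read may return a stale value, but the repair step means a second read of the same key sees the newest one.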

Cassandra read properties
● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows

Consistency in a BASE world
● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = floor(N / 2) + 1 (a quorum)
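
The rule works because any W replicas and any R replicas must share at least one node when W + R > N, so every read touches at least one replica that saw the latest write. A short check of the three configurations above, using hypothetical helper names:

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas: floor(n / 2) + 1."""
    return n // 2 + 1


def is_consistent(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set."""
    return w + r > n


n = 3
q = quorum(n)
configs = {"W=1, R=N": (1, n), "W=N, R=1": (n, 1), "W=Q, R=Q": (q, q)}
results = {name: is_consistent(n, w, r) for name, (w, r) in configs.items()}
```

All three entries in `results` are `True` for N=3, while `is_consistent(3, 1, 1)` is `False`: the fastest setting gives up the overlap guarantee.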

vs MySQL with 50GB of data
● MySQL
   ● ~300ms write
   ● ~350ms read
● Cassandra
   ● ~0.12ms write
   ● ~15ms read

● Caution!

Data model
● Rows, ColumnFamilies, Columns

ColumnFamilies

keyA column1 column2 column3
keyC column1 column7 column11

Column
Byte[] Name
Byte[] Value
I64 timestamp

Super ColumnFamilies

keyF Super1 Super2

column column column column column column

keyJ Super1 Super5

column column column column column column

Types of queries
● Single column
● Slice
   ● Set of names / range of names
   ● Simple slice -> columns
   ● Super slice -> supercolumns
● Key range

Range queries
● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning
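
A sketch of the third option, order-preserving partitioning. Instead of hashing, the key itself is the token, so a key range maps to a contiguous run of nodes. The class name, node names, and token values are assumptions for the example.

```python
import bisect


class OrderPreservingRing:
    """Each node owns the keys up to and including its token, in sort order."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; tokens are compared as plain strings.
        self.entries = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]

    def node_for(self, key):
        # Keys past the last token wrap around to the first node.
        i = bisect.bisect_left(self.tokens, key) % len(self.entries)
        return self.entries[i][1]

    def nodes_for_range(self, start, stop):
        """Nodes to contact, in key order, for start <= key <= stop."""
        first = bisect.bisect_left(self.tokens, start)
        last = bisect.bisect_left(self.tokens, stop)
        nodes = []
        for i in range(first, last + 1):
            name = self.entries[i % len(self.entries)][1]
            if name not in nodes:
                nodes.append(name)
        return nodes
```

With tokens `g`, `p`, `z`, the range `a` to `h` touches only the first two nodes. The tradeoff is that skewed keys produce skewed load, which hashing would have evened out.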

Modification
● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for

Thrift
struct Column {
   1: binary                        name,
   2: binary                        value,
   3: i64                           timestamp,
}

struct SuperColumn {
   1: binary                        name,
   2: list<Column>                  columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="",
                           stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp,
            block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

Example: a multiuser blog
Two queries
- the most recent posts belonging to a
given blog, in reverse chronological order
- a single post and its comments, in
chronological order

First try

JBE blog
   Cassandra is teh awesome: post, comment, comment
   BASE FTW: post, comment, comment

Evan blog
   I like kittens: post, comment, comment
   And Ruby: post, comment, comment

<ColumnFamily
Type="Super"
CompareWith="TimeString"
CompareSubcolumnsWith="UUID"
Name="Blog"/>

Second try
Blog ColumnFamily
   JBE blog: Cassandra is teh awesome, BASE FTW
   Evan blog: I like kittens, And Ruby

Comment ColumnFamily
   Cassandra is teh awesome: comment, comment
   BASE FTW: comment, comment
   I like kittens: comment, comment
   And Ruby: comment, comment

<ColumnFamily
CompareWith="UUIDType"
Name="Blog"/>

<ColumnFamily
CompareWith="UUIDType"
Name="Comment"/>
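
The second layout can be sketched with nested dicts to show how the two queries map onto it. Everything here is illustrative: integer ids stand in for time-ordered UUID column names, the comment texts are placeholders, and the helper functions are hypothetical rather than Cassandra API calls.

```python
# ColumnFamily "Blog": row key = blog, column name = time-ordered post id.
blog = {
    "JBE blog": {101: "Cassandra is teh awesome", 102: "BASE FTW"},
    "Evan blog": {201: "I like kittens", 202: "And Ruby"},
}

# ColumnFamily "Comment": row key = post id, column name = time-ordered comment id.
comment = {
    101: {1: "comment 1", 2: "comment 2"},
    102: {1: "comment 1", 2: "comment 2"},
}


def recent_posts(blog_key, limit=10):
    """Query 1: most recent posts of a blog, newest first (a reversed slice)."""
    posts = blog[blog_key]
    return [posts[pid] for pid in sorted(posts, reverse=True)[:limit]]


def post_with_comments(blog_key, post_id):
    """Query 2: one post plus its comments in chronological order."""
    comments = comment.get(post_id, {})
    return blog[blog_key][post_id], [comments[cid] for cid in sorted(comments)]
```

Because column names sort by time, each query is a single-row slice: one row of Blog for the first, one column of Blog plus one row of Comment for the second.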

Cassandra 0.3
● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support

Cassandra 0.4
● Branched May 18
● Data file format change to support billions
of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface

Cassandra 0.5
● Bootstrap
● Load balancing
● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
● This will require another data format change
● Multiget
● Callout support

Users
Production: Facebook, RocketFuel
Production soon: Digg, Rackspace
No date yet: IBM Research, Twitter
Evaluating: 50+ in #cassandra on freenode

More
● Eventual consistency:
http://www.allthingsdistributed.com/2008/12/
● Introduction to distributed databases by
Todd Lipcon at NoSQL 09:
http://www.vimeo.com/5145059
● Other articles/videos about Cassandra:
http://wiki.apache.org/cassandra/ArticlesAndP
● #cassandra on irc.freenode.net
