An introduction to Cassandra, including replication and partitioning options, data centre awareness, the local storage model, and a data modelling example. Presented by Andrew Byde on 25th August 2011 at NoSQLNow! in San Jose, California.
4. History
• 2007: Started at Facebook for inbox search
• July 2008: Open sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: Version 0.8
5. What it’s good for
• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
6. What it’s not so good for
• Transactions
• Read heavy workloads
• Low latency applications
• compared to in-memory DBs
11. Rows and columns
[figure: a sparse grid of rows (row1–row7) by columns (col1–col7); each row holds only the columns it has values for]
12. Reads
• get
• get_slice One row, some cols
• name predicate
• slice range
• multiget_slice Multiple rows
• get_range_slices
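The semantics of these four calls can be illustrated with a toy two-level map in Python. This is a sketch only: the data and helper names are invented, and it models what each Thrift call selects, not the real client API.

```python
# Hypothetical in-memory model of a column family:
# row key -> dict of column name -> value.
cf = {
    "row1": {"col1": "a", "col3": "b", "col6": "c"},
    "row2": {"col1": "d", "col2": "e", "col4": "f", "col5": "g", "col7": "h"},
}

def get(row, col):
    """get: one row, one named column."""
    return cf[row][col]

def get_slice_names(row, names):
    """get_slice with a name predicate: one row, the named columns."""
    return {c: v for c, v in cf[row].items() if c in names}

def get_slice_range(row, start, finish):
    """get_slice with a slice range: one row, a contiguous range of column names."""
    return {c: v for c, v in sorted(cf[row].items()) if start <= c <= finish}

def multiget_slice(rows, names):
    """multiget_slice: the same column predicate applied to several rows."""
    return {r: get_slice_names(r, names) for r in rows}

def get_range_slices(start_row, end_row, names):
    """get_range_slices: a contiguous range of rows (needs the ordered partitioner)."""
    return {r: get_slice_names(r, names)
            for r in sorted(cf) if start_row <= r <= end_row}

print(get("row1", "col3"))                      # b
print(get_slice_range("row2", "col2", "col5"))  # {'col2': 'e', 'col4': 'f', 'col5': 'g'}
```

Note that in every call the "WHERE" part is on keys and column names; only what is selected differs.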
13. get
[figure: the grid with a single cell highlighted: one row, one named column]
14. get_slice: name predicate
[figure: the grid with one row highlighted at a set of named columns]
15. get_slice: slice range
[figure: the grid with one row highlighted across a contiguous range of columns]
16. multiget_slice: name predicate
[figure: the grid with the same named columns highlighted across several rows]
17. get_range_slices: slice range
[figure: the grid with a contiguous range of rows highlighted, each with a column slice]
25. Partitioning + Replication
• Partitioning data on to nodes
• load balancing
• row-based
• Replication
• to protect against failure
• better availability
26. Partitioning
• Random: take hash of row key
• good for load balancing
• bad for range queries
• Ordered: subdivide key space
• bad for load balancing
• good for range queries
• Or build your own...
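The random option can be sketched as a toy hash ring. The node names and token placement here are made up, and the ring is shrunk to 32 bits; the real RandomPartitioner uses a much larger MD5-based token space.

```python
import hashlib
from bisect import bisect_left

# Toy hash ring: three made-up nodes, each claiming a token on a 32-bit ring.
RING_SIZE = 2**32
ring = sorted([(0, "n1"), (RING_SIZE // 3, "n2"), (2 * RING_SIZE // 3, "n3")])
tokens = [t for t, _ in ring]

def token(key):
    # Hashing destroys key order: good for load balancing, bad for range queries.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def replicas_for(key, n=2):
    # The key belongs to the first node whose token is >= hash(key), wrapping
    # around the ring; replicas are that node plus the next n-1 clockwise.
    i = bisect_left(tokens, token(key)) % len(tokens)
    return [ring[(i + j) % len(ring)][1] for j in range(n)]

print(replicas_for("user3"))  # a list of 2 distinct node names
```

Walking clockwise for replicas is also how the replication in the previous slide places its copies.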
36. Consistency Level
• How many replicas must respond in order to declare success
• W/N must succeed for write to succeed
• write with client-generated timestamp
• R/N must succeed for read to succeed
• return most recent, by timestamp
• Tuneable per request
37. Consistency Level
• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in local data center
• Quorum in each data center
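The overlap argument behind these levels fits in a few lines (a sketch, not Cassandra code): with N replicas, any read set of size R and write set of size W must share at least one replica whenever R + W > N, which is why quorum reads plus quorum writes always see the latest timestamp.

```python
# Why quorum/quorum is consistent: an R-replica read set and a W-replica
# write set out of N replicas must intersect when R + W > N.

def quorum(n):
    return n // 2 + 1          # "more than half"

def overlap_guaranteed(n, r, w):
    return r + w > n

n = 3
print(overlap_guaranteed(n, quorum(n), quorum(n)))  # True: quorum both ways
print(overlap_guaranteed(n, 1, 1))                  # False: CL.ONE may read stale data
```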
39. Read repair
• If the replicas disagree on read, send most recent data back
[figure: the client sends "read k?" to replicas n1, n2, n3]
40. Read repair
[figure: n1 answers (v, t1), n2 answers "not found!", n3 answers (v’, t2)]
41. Read repair
[figure: the most recent value, (v’, t2), is returned to the user]
42. Read repair
[figure: write (k, v’, t2) is sent back to the out-of-date replicas]
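The resolution step in the sequence above can be sketched as follows; the node names and the `resolve` helper are illustrative only.

```python
# Read-repair resolution sketch: given each replica's (value, timestamp)
# answer (None for "not found"), pick the most recent value by timestamp
# and identify the replicas that need the repaired value written back.

def resolve(responses):
    """responses: dict of node -> (value, timestamp) or None."""
    latest = max((r for r in responses.values() if r is not None),
                 key=lambda vt: vt[1])
    stale = [node for node, r in responses.items() if r != latest]
    return latest, stale

winner, to_repair = resolve({"n1": ("v", 1), "n2": None, "n3": ("v2", 2)})
print(winner)     # ('v2', 2) -- returned to the user
print(to_repair)  # n1 and n2 get (k, v2, t2) written back
```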
43. Hinted handoff
• When a node is unavailable
• Writes can be written to any node as a hint
• Delivered when the node comes back online
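A minimal sketch of the hint mechanism (all names invented; this ignores details like hint expiry and where hints are actually stored):

```python
# Toy hinted handoff: a write destined for a down node is buffered as a hint
# and replayed when the target comes back online.
hints = {}                      # down node -> buffered (key, value) writes
live = {"n1": {}, "n3": {}}     # n2 is currently down

def write(node, key, value):
    if node in live:
        live[node][key] = value
    else:
        hints.setdefault(node, []).append((key, value))

def node_recovers(node):
    live[node] = {}
    for key, value in hints.pop(node, []):   # deliver the stored hints
        live[node][key] = value

write("n2", "k", "v")    # stored as a hint while n2 is down
node_recovers("n2")
print(live["n2"])        # {'k': 'v'}
```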
44. Anti-entropy
• Equivalent to ‘read repair all’
• Requires reading all data (woah)
• (Although only hashes are sent to calculate diffs)
• Manual process
48. Data-centric model
m1: {
sender: user1
content: “Mary had a little lamb”
recipients: user2, user3
}
• but how to do ‘recipients’ for Inbox?
• one-to-many modelled by a join table
49. To join
m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}
user2: {
  m1: true
}
user3: {
  m1: true
  m2: true
}
user4: {
  m2: true
  m3: true
}
50. .. or not to join
• Joins are expensive, so de-normalise to trade off space for time
• We can have lots of columns, so think BIG:
• Make message id a time-typed super-column.
• This makes get_slice an efficient way of searching for messages in a time window
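To make the time-window point concrete, here is a toy denormalised inbox in which column names are message timestamps (data and names invented; for simplicity the columns here are kept in increasing time order, whereas the deck suggests a decreasing-time comparator).

```python
from bisect import bisect_left, bisect_right

# Denormalised inbox sketch: one row per recipient, one column per received
# message, column name = message timestamp. Because columns are stored in
# sorted order, a time window is a contiguous column slice.
inbox = {
    "user3": [(1000, "m1"), (1005, "m2"), (1010, "m3")],  # sorted by time
}

def messages_between(user, t_start, t_end):
    """get_slice with a slice range over time-typed column names."""
    cols = inbox[user]
    times = [t for t, _ in cols]
    lo = bisect_left(times, t_start)
    hi = bisect_right(times, t_end)
    return [msg for _, msg in cols[lo:hi]]

print(messages_between("user3", 1001, 1010))  # ['m2', 'm3']
```

Because the slice maps to a contiguous run of columns on disk, the query costs one seek plus a sequential read rather than a point lookup per message.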
52. De-normalisation + Cassandra
• have to write a copy of the record for each recipient ... but writes are very cheap
• get_slice fetches columns for a particular row, so gets received messages for a user
• on-disk column order is optimal for this query
54. What it’s good for
• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
We provide Cassandra training and support and the Acunu Data Platform, high performance storage software that incorporates Cassandra. Come and talk to us if you want to know more. We have an ebook to give away to those that want to dive into Cassandra details.
You've probably heard about 'eventual consistency', 'scale out', 'de-normalisation'... I'm going to explain what they mean.
but... Tables have a fixed structure, described in a schema.
Columns are much more flexible; no fixed schema in the RDBMS sense; little structure.
Add a column whenever you want.
You don't need the same columns in each row, etc.
* two-level map
* everything in Cassandra has a timestamp which is used to help with consistency.
* You might use your own timestamp as a key but you don't normally do anything with the internal timestamps.
* (Of course this means your clocks need to be reasonably accurate, so you can tell people they need to use NTP).
* three-level map
* sparse
* up to 2 billion rows
* ... but big rows are a problem (repair etc. done based on row)
* on a single node, data sorted by row key
* Queries are all key-based, i.e. the 'WHERE' is all on the key; the calls above differ in what they SELECT
* note that the predicate is on NAME -- can't do 'WHERE col3=x' with this
* memtable default is skip list
* background compaction of SSTables
* BENEFIT IS SEQUENTIAL WRITES
* data is sorted, key then value
* compactions are streaming, hence efficient
* reads go everywhere in parallel
* Bloom filters are per-row, so help with get_slice but not multi-row range queries
Amazon Dynamo: connect to any node in the cluster; nodes talk to one another using a p2p protocol called 'gossip' -- entirely symmetric.
Hash ring based: keys are hashed; regions of hash output space are claimed by nodes
PER REQUEST
* Merkle trees
* at scale you have to optimise for queries
* de-normalisation not specific to Cassandra
* but Cassandra is well suited to it, because writes are relatively cheap and there's little infrastructure for queries
get inbox for user 3
* extra table holding recipient -> msg
* have to do a point query per message to show the inbox for a user
* note, content not duplicated, only subject -- row would become too large
* columns need to be ordered by time decreasing -- custom comparator