1. Apache Cassandra
Fundamentals
or:
How I stopped worrying and learned to love the CAP theorem
Russell Spitzer
@RussSpitzer
Software Engineer in Test at DataStax
2. Who am I?
• Former Bioinformatics Student
at UCSF
• Work on the integration of
Cassandra (C*) with Hadoop,
Solr, and Redacted!
• I spend a lot of time spinning up
clusters on EC2, GCE, Azure, …
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make
sure that C* Scales
3. Apache Cassandra is a Linearly Scaling
and Fault Tolerant NoSQL Database
Linearly Scaling:
The power of the database
increases linearly with the
number of machines
2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down
4. CAP Theorem Limits What
Distributed Systems can do
Consistency
When I ask the same question to any part of the system I should get the same answer
How many planes do we have?
5. CAP Theorem Limits What
Distributed Systems can do
Consistency
When I ask the same question to any part of the system I should get the same answer
How many planes do we have?
Consistent
1 1 1 1 1 1 1
6. CAP Theorem Limits What
Distributed Systems can do
Consistency
When I ask the same question to any part of the system I should get the same answer
How many planes do we have?
Not Consistent
1 4 1 2 1 8 1
7. CAP Theorem Limits What
Distributed Systems can do
Availability
When I ask a question I will get an answer
How many planes do we have?
Available
1 zzzzz *snort* zzz
8. CAP Theorem Limits What
Distributed Systems can do
Availability
When I ask a question I will get an answer
How many planes do we have?
I have to wait for major snooze to wake up
zzzzz *snort* zzz
Not Available
9. CAP Theorem Limits What
Distributed Systems can do
Partition Tolerance
I can ask questions even when the system is having intra-system communication
problems
How many planes do we have?
Team Edward Team Jacob
1
Tolerant
10. CAP Theorem Limits What
Distributed Systems can do
Partition Tolerance
I can ask questions even when the system is having intra-system communication
problems
How many planes do we have?
Not Tolerant
Team Edward Team Jacob
I’m not sure without asking those
vampire lovers and we aren’t speaking
11. Cassandra is an AP System
which is Eventually Consistent
Eventually consistent:
New information will make it to everyone eventually
How many planes do we have? How many planes do we have?
I don’t know without asking those
vampire lovers and we aren’t speaking
1 1 1 1 1 1
I just heard we actually have 2!
2 2 2 2 2 2 2
12. Two knobs control fault tolerance in
C*: Replication and Consistency Level
Server Side - Replication:
How many copies of the data should exist in the cluster?
Coordinator
for this operation
ABD ABC
ACD
BCD
RF=3
Client
SimpleStrategy: replicas placed around the ring
NetworkTopologyStrategy: replicas specified per datacenter
13. Two knobs control fault tolerance in
C*: Replication and Consistency Level
Client Side - Consistency Level:
How many replicas should we check before
acknowledgment?
ABD ABC
ACD
BCD
Client
Coordinator
for this operation
CL = One
14. Two knobs control fault tolerance in
C*: Replication and Consistency Level
Client Side - Consistency Level:
How many replicas should we check before
acknowledgment?
ABD ABC
ACD
BCD
CL = Quorum
Client
Coordinator
for this operation
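The quorum threshold in the slide above comes from simple arithmetic; a minimal sketch, assuming Cassandra's definition of quorum as a majority of the RF replicas, floor(RF/2) + 1:

```java
// Sketch of the quorum calculation: at CL = Quorum the coordinator waits for a
// majority of the RF replicas to acknowledge before answering the client.
public class QuorumMath {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1; // floor(RF/2) + 1
    }

    public static void main(String[] args) {
        System.out.println(quorum(3)); // RF=3 -> 2 acks needed
        System.out.println(quorum(5)); // RF=5 -> 3 acks needed
    }
}
```

With RF=3 and CL=Quorum, writes and reads keep working with one replica down.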
15. Nodes own data whose partition key
hashes to their token ranges
ABD ABC
ACD
BCD
Every piece of data belongs on the
node that owns the Murmur3 hash
of its partition key, plus (RF-1)
other nodes
Partition Key | Clustering Key | Rest of Data
ID: ICBM_432  | Time: 30       | Loc: SF, Status: Idle

ID: ICBM_432 -> Murmur3Hash -> token A
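The "hash decides the owner" rule can be sketched with a sorted map standing in for the token ring. The tokens and node names below are made up; real Cassandra hashes with Murmur3 over the full signed 64-bit range:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of token-range ownership: each node owns the ring segment ending at
// its token, so a key lands on the first node whose token is >= the key's
// hash, wrapping around past the largest token.
public class TokenRing {
    static String ownerOf(long keyHash, TreeMap<Long, String> ring) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(keyHash);
        return owner != null ? owner.getValue()
                             : ring.firstEntry().getValue(); // wrap around the ring
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(-3_000_000_000L, "nodeA");
        ring.put(0L, "nodeB");
        ring.put(3_000_000_000L, "nodeC");
        System.out.println(ownerOf(-42L, ring));          // nodeB
        System.out.println(ownerOf(5_000_000_000L, ring)); // wraps to nodeA
    }
}
```

With RF > 1, the next (RF-1) nodes clockwise on the ring hold the other replicas.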
16. Cassandra writes are FAST
due to log-append storage
[Diagram: an incoming write (partition key, clustering column, rest of row) is appended to the commit log on disk and inserted into a memtable in memory; full memtables are flushed to immutable SSTables on disk.]
17. Deletes in a distributed
System are Challenging
We need to keep records of
deletions in case of network
partitions
[Diagram: Node2 is down (power outage) when a delete happens on Node1; the tombstone recorded on Node1 is later replayed to Node2 so the deleted data cannot resurrect.]
18. Compactions merge and
unify data in our SSTables
SSTable 1 + SSTable 2 -> SSTable 3
Since SSTables are immutable
this is our chance to
consolidate rows and remove
tombstones (After GC Grace)
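The merge step of a compaction might be sketched as follows, assuming a toy cell with a value, a write timestamp, and a tombstone flag (a made-up stand-in for Cassandra's real on-disk format):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of compacting two SSTables: for each key the newest write wins, and
// tombstones older than gc_grace are purged for good instead of rewritten.
public class CompactionSketch {
    record Cell(String value, long timestamp, boolean tombstone) {}

    static Map<String, Cell> compact(Map<String, Cell> older, Map<String, Cell> newer,
                                     long now, long gcGrace) {
        Map<String, Cell> merged = new HashMap<>(older);
        newer.forEach((key, cell) ->
                merged.merge(key, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        // Tombstones past gc_grace are dropped rather than carried forward.
        merged.values().removeIf(c -> c.tombstone() && now - c.timestamp() > gcGrace);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> older = Map.of(
                "x", new Cell("1", 100, false),
                "y", new Cell("stale", 50, false));
        Map<String, Cell> newer = Map.of(
                "y", new Cell(null, 200, true)); // delete of y
        Map<String, Cell> result = compact(older, newer, 10_000, 1_000);
        System.out.println(result.keySet()); // [x] -- y's tombstone was purged
    }
}
```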
19. Layout of Data Allows for Rapid
Queries Along Clustering Columns
ID: ICBM_432  | Time: 30, Loc: SF, Status: Idle     | Time: 45, Loc: SF, Status: Idle     | Time: 60, Loc: SF, Status: Idle
ID: ICBM_900  | Time: 30, Loc: Boston, Status: Idle | Time: 45, Loc: Boston, Status: Idle | Time: 60, Loc: Boston, Status: Idle
ID: ICBM_9210 | Time: 30, Loc: Tulsa, Status: Idle  | Time: 45, Loc: Tulsa, Status: Idle  | Time: 60, Loc: Tulsa, Status: Idle
Disclaimer: Not exactly like this (Use sstable2json to see real layout)
20. CQL allows easy definition
of Table Structures
ID: ICBM_432 | Time: 30, Loc: SF, Status: Idle | Time: 45, Loc: SF, Status: Idle | Time: 60, Loc: SF, Status: Idle
CREATE TABLE icbmlog (
name text,
time timestamp,
location text,
status text,
PRIMARY KEY (name,time)
);
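Because rows inside a partition are stored sorted by the clustering key, slicing along time within one partition is a single sequential read. A hypothetical query against the icbmlog table above (timestamp values written as epoch milliseconds for brevity):

```sql
SELECT time, location, status
FROM icbmlog
WHERE name = 'ICBM_432'
  AND time >= 30 AND time <= 60;
```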
21. Reading data is FAST but
limited by disk IO
[Diagram: a read collects the matching row fragments from the memtable and from each SSTable on disk, then merges them with last-write-wins (LWW) before answering the client; other replicas hold their own copies.]
22. Reading data is FAST but
limited by disk IO
[Same read diagram: the coordinator also compares the merged result against the other replicas; a replica holding a stale version is sent the newer value (a read repair).]
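The last-write-wins rule in the read diagrams can be sketched as a comparison of write timestamps. The Versioned type below is hypothetical; real replicas return full row fragments:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of LWW reconciliation on read: the coordinator keeps the value with
// the newest write timestamp; replicas that returned an older version are
// then sent the winner, which is the read repair.
public class LwwRead {
    record Versioned(String value, long writeTimestamp) {}

    static Versioned reconcile(List<Versioned> replicaResponses) {
        return replicaResponses.stream()
                .max(Comparator.comparingLong(Versioned::writeTimestamp))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Versioned winner = reconcile(List.of(
                new Versioned("1 plane", 100),
                new Versioned("2 planes", 200))); // newest write wins
        System.out.println(winner.value()); // 2 planes
    }
}
```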
23. New Clients provide a
holistic view of the C* cluster
Client
ABD ABC
ACD
BCD
Initial Contact
Cluster.builder().addContactPoint("127.0.0.1").build()
24. Session Objects Are used
for Executing Requests
session = cluster.connect()
session.execute("DROP KEYSPACE IF EXISTS icbmkey")
session.execute("CREATE KEYSPACE icbmkey WITH replication = {'class':'SimpleStrategy', 'replication_factor':'1'}")
For highest throughput use asynchronous methods
ResultSetFuture executeAsync(Query query)
Then add a callback or Queue the ResultSetFutures
[Diagram: a queue of outstanding ResultSetFutures]
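The "queue the futures" pattern can be sketched with java.util.concurrent standing in for the driver (the real driver returns Guava-style ResultSetFutures; the names below are stand-ins):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the queue-the-futures pattern: fire every request asynchronously,
// collect the futures, then drain the queue, blocking only at the end instead
// of once per request. CompletableFuture stands in for ResultSetFuture.
public class AsyncQueue {
    static List<String> runAll(int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Queue<CompletableFuture<String>> inFlight = new ArrayDeque<>();
        for (int i = 0; i < requests; i++) {
            final int id = i;
            // stand-in for session.executeAsync(query)
            inFlight.add(CompletableFuture.supplyAsync(() -> "row-" + id, pool));
        }
        List<String> results = new ArrayList<>();
        while (!inFlight.isEmpty()) {
            results.add(inFlight.poll().join()); // wait for each in submit order
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runAll(3)); // [row-0, row-1, row-2]
    }
}
```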
25. Token Aware Policies Reduce the Number
of Intra-Cluster Requests
Client
ABD ABC
ACD
BCD
A
26. Prepared statements allow for
sending less data over the wire
The query is prepared on all nodes by the driver
Prepared batch statements
can further improve throughput
PreparedStatement ps = session.prepare("INSERT INTO messages (user_id, msg_id, title, body) VALUES (?, ?, ?, ?)");
BatchStatement batch = new BatchStatement();
batch.add(ps.bind(uid, mid1, title1, body1));
batch.add(ps.bind(uid, mid2, title2, body2));
batch.add(ps.bind(uid, mid3, title3, body3));
session.execute(batch);
27. Avoid
• Preparing statements more than once
• Creating batches which are too large
• Running statements in serial
• Using consistency-levels above your need
• Secondary Indexes in your main queries
• or really at all, unless you are doing analytics