Cassandra for Sysadmins
1. Cassandra 101
for System Administrators
nathan@milford.io
twitter.com/NathanMilford
http://blog.milford.io
2. What is Cassandra?
• It is a distributed, columnar database.
• Originally created at Facebook in 2008, it is now a top-level
Apache project.
• Combines the best features of Amazon's Dynamo
(replication, mostly) and Google's Big Table (data model).
6. @
Rocking ~30 Billion impressions a month, like a bawse.
Used for semi-persistent storage of recommendations.
• 14 nodes in two data centers.
• Dell R610, 8 cores, 32G of RAM, 6 x 10K SAS drives.
• Using 0.8 currently, just upgraded. In production since 0.4.
• We use Hector.
• ~70-80G per node, ~550G dataset unreplicated.
• RP + OldNTS @ RF2. PropFileSnitch, RW @ CL.ONE.
Excited for NTS
Excited for TTLs!
7. How We Use Cassandra
+--------------+
|    Tomcat    |---+    Tomcat serves Recs from Memcached.
+------+-------+   |
       ^           L    Flume ships logs to Hadoop DWH.
       |           O    Bunch of algos run against log data.
+------+------+    G
|  Memcached  |    S    Results are crammed into Cassandra.
+------+------+    |    Keyspace per algo.
       ^           V
       |           I    CacheWarmer sources recs from Cassandra
+------+------+    A    (and other sources) and dumps them in
| CacheWarmer |    |    Memcached.
+------+------+    F
       ^           L
       |           U
+------+------+    M
|  Cassandra  |    E
+------+------+    |
       ^           |
       |           v
+------+------+----+
| Hadoop/Hive |
+-------------+
* Simplified Workflow
8. Before I get too technical, relax.
The following slides may sound complex at first, but at the
end of the day, to get your feet wet all you need to do is:
yum install / apt-get install
Define a Seed.
Define Strategies.
service cassandra start
Go get a beer.
In my experience, once the cluster has been set up there is
not much else to do other than occasional tuning as you
learn how your data behaves.
9. Why Cassandra?
• Minimal Administration.
• No Single Point of Failure.
• Scales Horizontally.
• Writes are durable.
• Consistency is tuneable as needed on reads and writes.
• Schema is flexible, can be updated live.
• Handles failure gracefully, Cassandra is crash-only.
• Replication is easy, Rack and Datacenter aware.
10. Data Model
Keyspace = {
Column Family: {
Row Key: {
Column Name: "Column Value"
Column Name: "Column Value"
}
}
}
A Keyspace is a container for Column Families. Analogous to a
database in MySQL.
A Column Family is a container for a group of columns. Analogous to
a table in MySQL.
A Column is the basic unit. Key, Value and Timestamp.
A Row is the collection of columns stored under a single Row Key.
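The hierarchy above can be sketched as plain nested maps. This is a toy illustration, not a client API; the keyspace, row key, and column names are made up, and it also previews how Cassandra resolves write conflicts by column timestamp:

```python
# Toy model of the storage hierarchy: Keyspace -> Column Family ->
# Row Key -> Columns, where a column is a (value, timestamp) pair.
keyspace = {
    "Recommendations": {                # Column Family (like a MySQL table)
        "user:42": {                    # Row Key
            "rec1": ("article-9", 1),   # Column Name: (Value, Timestamp)
            "rec2": ("article-3", 1),
        }
    }
}

def upsert(cf, row_key, name, value, timestamp):
    """Write a column; the newest timestamp wins on conflict."""
    row = cf.setdefault(row_key, {})
    old = row.get(name)
    if old is None or timestamp >= old[1]:
        row[name] = (value, timestamp)

cf = keyspace["Recommendations"]
upsert(cf, "user:42", "rec1", "article-7", 2)   # newer write wins
upsert(cf, "user:42", "rec1", "article-1", 0)   # stale write is ignored
print(cf["user:42"]["rec1"])                    # ('article-7', 2)
```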
11. Gossip
In config, define seed(s).
Used for intra-cluster
communication.
Cluster self-assembles.
Works with failure detection.
Routes client requests.
12. Pluggable Partitioning
RandomPartitioner (RP)
Orders by MD5 of the key.
Most common.
Distributes relatively evenly.
There are others, but you probably will not use them.
13. Distributed Hash Table: The Ring
For Random Partitioner:
• Ring is made up of a range from
0 to 2**127.
• Token is MD5(Key).
• Each node is given a slice of the
ring
o Initial token is defined,
node owns that token up to
the next node's initial token.
Rock your tokens here:
http://blog.milford.io/cassandra-token-calculator/
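The token math is simple enough to sketch. This is a toy version of the calculator's arithmetic plus the slide's ownership rule ("a node owns its token up to the next node's token"); function names are made up:

```python
import hashlib

RING_SIZE = 2 ** 127  # RandomPartitioner token space: 0 to 2**127

def initial_tokens(node_count):
    """Evenly spaced initial tokens for a balanced ring."""
    return [i * RING_SIZE // node_count for i in range(node_count)]

def token_for_key(key):
    """RandomPartitioner sketch: place a key at MD5(key) in the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def owner(tokens, t):
    """A node owns from its initial token up to the next node's
    (the last node wraps around to cover tokens below the smallest)."""
    eligible = [tok for tok in sorted(tokens) if tok <= t]
    return eligible[-1] if eligible else sorted(tokens)[-1]

tokens = initial_tokens(4)
t = token_for_key("user:42")
print(owner(tokens, t) in tokens)  # True
```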
14. Pluggable Topology Discovery
Cassandra needs to know about your network to direct
replica placement. Snitches inform Cassandra about it.
SimpleSnitch
Default, good for 1 data center.
RackInferringSnitch
Infers location from the IP's octets.
10.D.R.N (Data center, Rack, Node)
PropertyFileSnitch
cassandra-topology.properties
IP=DC:RACK (arbitrary values)
10.10.10.1=NY1:R1
EC2Snitch
Discovers AWS AZ and Regions.
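The two rule-based snitches are simple enough to imitate in a few lines. A rough Python sketch, not the Java implementations; function names are mine:

```python
def rack_inferring_snitch(ip):
    """RackInferringSnitch sketch: for an IP of the form 10.D.R.N,
    the second octet is the datacenter and the third is the rack."""
    octets = ip.split(".")
    return {"dc": octets[1], "rack": octets[2]}

def parse_topology(lines):
    """PropertyFileSnitch sketch: parse IP=DC:RACK lines in the
    style of cassandra-topology.properties."""
    topology = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ip, location = line.split("=")
        dc, rack = location.split(":")
        topology[ip] = {"dc": dc, "rack": rack}
    return topology

print(rack_inferring_snitch("10.20.114.7"))   # {'dc': '20', 'rack': '114'}
print(parse_topology(["10.10.10.1=NY1:R1"]))
```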
15. Pluggable Replica Placement
SimpleStrategy
Places replicas in the adjacent nodes on the ring.
NetworkTopologyStrategy
Used with property file snitch.
Explicitly pick how replicas are placed.
strategy_options = [{NY1:2, LA1:2}];
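SimpleStrategy's "adjacent nodes on the ring" rule can be sketched on a toy ring. Illustration only; the real strategy works on endpoints, and NetworkTopologyStrategy adds the per-DC and per-rack constraints on top of this walk:

```python
def simple_strategy(sorted_tokens, key_token, rf):
    """SimpleStrategy sketch: the first replica goes to the node owning
    the key's token; the remaining RF-1 replicas go to the next nodes
    clockwise around the ring."""
    n = len(sorted_tokens)
    # Index of the owning node: largest token <= key_token, wrapping.
    idx = max((i for i, t in enumerate(sorted_tokens) if t <= key_token),
              default=n - 1)
    return [sorted_tokens[(idx + i) % n] for i in range(rf)]

ring = [0, 25, 50, 75]               # toy token ring, one token per node
print(simple_strategy(ring, 60, 2))  # [50, 75]
print(simple_strategy(ring, 80, 2))  # [75, 0]  (wraps around the ring)
```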
16. Reading & Writing
Old method uses Thrift which is
usually abstracted using APIs
(ex. Hector, PyCassa, PHPCass)
Now we have CQL and JDBC!
SELECT * FROM ColumnFamily WHERE rowKey='Name';
Since all nodes are equal, you can read and write to any node.
The node you connect to becomes a Coordinator for that
request and routes your data to the proper nodes.
Connection pooling to nodes is sometimes handled by the API
Framework, otherwise use RRDNS or HAProxy.
17. Tunable Consistency
It is difficult to keep replicas of data consistent across nodes, let alone
across continents.
In any distributed system you have to make tradeoffs between how
consistent your dataset is versus how available it is and how tolerant
the system is of partitions. (a.k.a. the CAP theorem.)
Cassandra chooses to focus on making the data available and partition
tolerant, and empowers you to choose how consistent you need it to be.
Cassandra is awesomesauce because you choose what is more
important to your query, consistency or latency.
18. Per-Query Consistency Levels
Latency increases the more nodes you have to involve.
ANY: For writes only. Writes to any available node and expects
Cassandra to sort it out. Fire and forget.
ONE: Reads or writes to the closest replica.
QUORUM: Writes to half+1 of the appropriate replicas before the
operation is successful. A read is successful when half+1 replicas agree
on a value to return.
LOCAL_QUORUM: Same as above, but only to the local datacenter in
a multi-datacenter topology.
ALL: For writes, all replicas need to ack the write. For reads, returns
the record with the newest timestamp once all replicas reply. In both
cases, if we're missing even one replica, the operation fails.
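The quorum arithmetic behind these levels is worth spelling out. A sketch: `quorum` is the half+1 rule from the slide, and `strongly_consistent` expresses the standard R + W > RF overlap condition (the function names are mine):

```python
def quorum(rf):
    """QUORUM is half the replicas plus one: floor(RF/2) + 1."""
    return rf // 2 + 1

def strongly_consistent(read_replicas, write_replicas, rf):
    """Reads are guaranteed to see the latest write when the read and
    write replica sets must overlap: R + W > RF."""
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                                        # 2
print(strongly_consistent(quorum(rf), quorum(rf), rf))   # True: 2 + 2 > 3
print(strongly_consistent(1, 1, rf))                     # False: ONE/ONE can read stale data
```

Note that at RF=2 (as in the deck's cluster), QUORUM is 2, i.e. the same as ALL, which is one reason RF=3 is the common choice when quorum reads/writes are wanted.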
19. Cassandra Write Path
Cassandra identifies which node owns the token you're trying to
write based on your partitioning, replication and placement
strategies.
Data Written to CommitLog
Sequential writes to disk, kinda like a MySQL binlog.
Mostly written to, is only read from upon a restart.
Data Written to Memtable.
Acts as a Write-back cache of data.
When the Memtable hits a (configurable) threshold, it is flushed to
disk as an SSTable. An SSTable (Sorted String Table) is an immutable
file on disk. More on compaction later.
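The write path above can be sketched as a toy node. Illustration only; the class, method names, and flush threshold are made up, and real Memtable flushes trigger on size, not entry count:

```python
class ToyNode:
    """Write-path sketch: append to a commit log for durability, buffer
    in a memtable, flush to an immutable sorted SSTable at a threshold."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []       # sequential, append-only (durability)
        self.memtable = {}         # in-memory write-back cache
        self.sstables = []         # immutable sorted [(key, value)] lists
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable sequential append
        self.memtable[key] = value             # 2. buffer in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. dump the memtable as a new immutable Sorted String Table
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = ToyNode()
for i in range(4):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)   # 1 {'k3': 3}
```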
20. Cassandra Read Path
Cassandra identifies which node owns the token you're trying to
read based on your partitioning, replication and placement
strategies.
First checks the Bloom filter, which can save us some time.
A space-efficient structure that tests if a key is on the node.
False positives are possible.
False negatives are impossible.
Then checks the index.
Tells us which SStable file the data is in.
And how far into the SStable file to look so we don't need to
scan the whole thing.
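A Bloom filter with exactly these properties fits in a few lines. A toy filter, not Cassandra's; the bit-array size and MD5-based hashing scheme are arbitrary choices for the sketch:

```python
import hashlib

class ToyBloomFilter:
    """Bloom filter sketch: k hash functions each set one bit per key.
    Lookups can false-positive, but can never false-negative."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # True means "maybe here"; False means "definitely not here".
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (never a false negative)
print(bf.might_contain("user:99"))   # almost certainly False; a rare True
                                     # would only cost a wasted disk seek
```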
21. Distributed Deletes
Hard to delete stuff in a distributed system.
Difficult to keep track of replicas.
SSTables are immutable.
Deleted items are tombstoned (marked for deletion).
Data still exists, just can't be read by API.
Cleaned out during major compaction, when SSTables are
merged/remade.
22. Compaction
• When you have enough disparate SSTable files taking
up space, they are merge sorted into single SSTable
files.
• An expensive process (lots of GC, can eat up half of your
disk space)
• Tombstones discarded.
• Manual or automatic.
• Pluggable in 1.0.
• Leveled Compaction in 1.0
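The merge step, including tombstone removal, can be sketched as follows. Toy code: real compaction merge-sorts on-disk files, compares column timestamps, and only drops tombstones older than GCGraceSeconds:

```python
TOMBSTONE = object()   # deletion marker written in place of a value

def compact(sstables):
    """Compaction sketch: merge SSTables with newest-wins per key,
    then drop tombstoned keys; this is when deleted data really
    disappears from disk."""
    merged = {}
    for table in sstables:           # iterate oldest table first
        for key, value in table:
            merged[key] = value      # later (newer) tables overwrite
    live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(live.items())

old = [("a", 1), ("b", 2)]
new = [("a", 9), ("b", TOMBSTONE)]   # b was deleted after the first flush
print(compact([old, new]))           # [('a', 9)]
```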
23. Repair
Anti-Entropy and Read Repair
During node repair and QUORUM & ALL reads,
ColumnFamilies are compared with replicas and
discrepancies resolved.
Put manual repair in cron to run at an interval <= the
value of GCGraceSeconds to catch old tombstones, or
risk forgotten deletes.
Hinted Handoff
If a node is down, writes spool on other nodes and are
handed off when it comes back.
Sometimes left off, since a returning node can get
flooded.
24. Caching
Key Cache
Puts the location of keys in memory.
Improves seek times for keys on disk.
Enabled per ColumnFamily.
On by default at 200,000 keys.
Row Cache
Keeps full rows of hot data in memory.
Enable per ColumnFamily.
Skinny rows are more efficient.
Row Cache is consulted first,
then the Key Cache
Will require a bit of tuning.
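The lookup order can be sketched as follows. Toy code with made-up names; the point is only that a row-cache hit avoids disk entirely, while a key-cache hit still costs one seek:

```python
def read_with_caches(row_cache, key_cache, key, sstable_read):
    """Read-cache sketch: row cache first (whole row in memory, no disk
    I/O), then key cache (known disk position, one cheap seek), else a
    full index lookup plus seek."""
    if key in row_cache:
        return row_cache[key], "row cache hit"
    if key in key_cache:
        return sstable_read(key_cache[key]), "key cache hit (1 seek)"
    return sstable_read(None), "cache miss (index + seek)"

rows = {"hot": {"c1": "v1"}}                 # hot rows held entirely in memory
keys = {"warm": 4096}                        # key -> byte offset into an SSTable
fake_disk = lambda offset: {"c1": "disk"}    # stand-in for a disk read

print(read_with_caches(rows, keys, "hot", fake_disk))
print(read_with_caches(rows, keys, "warm", fake_disk))
```

The "skinny rows are more efficient" advice follows directly from the row cache holding entire rows: one huge row can evict many small hot ones.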
25. Hardware
RAM: Depends on use. Stores some
objects off Heap.
CPU: More cores the better.
Cassandra is built with concurrency in mind.
Disk: Cassandra tries to minimize random IO. Minimum of 2
disks. Keep CommitLog and Data on separate spindles.
RAID10 or RAID0 as you see fit. I set mine up thus:
1 disk = OS + CommitLog; RAID10 = Data (SSTables)
Network: 1 x 1gigE is fine, more the better and Gossip and
Data can be defined on separate interfaces.
26. What about 'Cloud' environments?
EC2Snitch
• Maps EC2 Regions to DCs
• Maps EC2 Availability Zones to Racks
• Use NetworkTopologyStrategy
Avoid EBS. Use RAID0/RAID10 across ephemeral drives.
Replicate across Availability Zones.
Netflix is moving to 100% Cassandra on EC2:
http://www.slideshare.net/adrianco/migrating-netflix-from-
oracle-to-global-cassandra
27. Installing
RedHat
rpm -i http://rpm.datastax.com/EL/6/x86_64/riptano-release-5-1.el6.noarch.rpm
yum -y install apache-cassandra
Debian
Add to /etc/apt/sources.list
deb http://www.apache.org/dist/cassandra/debian unstable main
deb-src http://www.apache.org/dist/cassandra/debian unstable main
wget http://www.apache.org/dist/cassandra/KEYS -O- | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
29. Hot Tips
• Use Sun/Oracle JVM (1.6 u22+)
• Use JNA Library.
o Keep disk_access_mode as auto.
o BTW, it is not using all your RAM; it behaves like FS cache.
• Don't use autobootstrap, specify initial token.
• Super columns impose a performance penalty.
• Enable GC logging in cassandra-env.sh
• Don't use a large heap. (Yay off-heap caching!)
• Don't use swap.
30. Monitoring
Install MX4J jar into class path or ping JMX directly.
curl | grep | awk it into Nagios, Ganglia, Cacti or what have you.
31. What to Monitor
Heap Size and Usage
Garbage Collections
IO Wait
RowMutationStage (Writes) - Active and Pending
CompactionStage
Compaction Count
Cache Hit Rate
ReadStage - Active and Pending
32. Adding/Removing/Replacing Nodes
Adding a Node
Calculate new tokens.
Set correct initial token on the new node
Once it has bootstrapped, run nodetool cleanup on the other nodes.
Removing a Node
nodetool decommission drains data to other nodes
nodetool removetoken tells the cluster to get the
data from other replicas (faster, more expensive on live
nodes).
Replacing a Node
Bring up replacement node with same IP and token.
Run nodetool repair.
33. Useful nodetool commands.
nodetool info - Displays node-level info.
nodetool ring - Displays info on nodes on the ring.
nodetool cfstats - Displays ColumnFamily statistics.
nodetool tpstats - Displays what operations Cassandra
is doing right now.
nodetool netstats - Displays streaming information.
nodetool drain - Flushes Memtables to SSTables on disk
and stops accepting writes. Useful before a restart to make
startup quicker (no CommitLog to replay)
39. Backups
Single Node Snapshot
nodetool snapshot
nodetool clearsnapshot
Makes a hardlink of SSTables that you can tarball.
Cluster-wide Snapshot.
clustertool global_snapshot
clustertool clear_global_snapshot
Just does local snapshots on all nodes.
To restore:
Stop the node.
Clear CommitLogs.
Zap *.db files in the Keyspace directory.
Copy the snapshot over from the snapshots subdirectory.
Start the node and wait for load to decrease.
40. Shutdown Best Practice
While Cassandra is crash-safe, you can make a cleaner
shutdown and save some time during startup thus:
Make other nodes think this one is down.
nodetool -h $(hostname) -p 8080 disablegossip
Wait a few secs, cut off anyone from writing to this node.
nodetool -h $(hostname) -p 8080 disablethrift
Flush all memtables to disk.
nodetool -h $(hostname) -p 8080 drain
Shut it down.
/etc/init.d/cassandra stop
41. Rolling Upgrades
From 0.7 you can do rolling upgrades. Check for cassandra.yaml changes!
On each node, one by one:
Shutdown as in previous slide, but do a snapshot after draining.
Remove old jars, rpms, debs. Your data will not be touched.
Add new jars, rpms, debs.
/etc/init.d/cassandra start
Wait for the node to come back up and for the other nodes to see it.
When done, before you run repair, on each node run:
nodetool -h $(hostname) -p 8080 scrub
This rebuilds the SSTables to bring them up to date with the new
version's on-disk format.
It is essentially a major compaction without the merging, so it is a bit
expensive.
Run repair on your nodes to clean up the data.
nodetool -h $(hostname) -p 8080 repair
42. Join Us!
http://www.meetup.com/NYC-Cassandra-User-Group/
These slides can be found here:
http://www.slideshare.net/nmilford/cassandra-for-sysadmins