Ooyala has been using Apache Cassandra since version 0.4. Our data ingest volume has exploded since then, and Cassandra has scaled along with us. Al covers how to manage, tune, and scale Cassandra in a production environment, from an operational perspective.
C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey
1. PRACTICE MAKES PERFECT:
EXTREME CASSANDRA OPTIMIZATION
@AlTobey
Tech Lead, Compute and Data Services
I didn't name this talk. The conference people did, but I like it a lot.
2. Outline
• About me / Ooyala
• How not to manage your Cassandra clusters
• Make it suck less
• How to be a heuristician
• Tools of the trade
• More Settings
• Show & Tell
3. @AlTobey
• Tech Lead, Compute and Data Services at Ooyala, Inc.
• C&D team is #devops: 3 ops, 3 eng, me
• C&D team is #bdaas: Big Data as a Service
• ~100 Cassandra nodes, expanding quickly
• Obligatory: we're hiring
• I won't go into devops today, but I'm happy to talk about it later.
• 2 years at Ooyala: SRE → TL Tools Team → C&D
• C&D builds BDaaS for Ooyala: fully managed Cassandra / Spark / Hadoop / ZooKeeper / Kafka
• 11 clusters, 5-36 nodes, working on something big
• BEFORE: engineers deployed systems themselves (expensive, error-prone). AFTER: engineers use APIs & consult.
4. Ooyala
• Founded in 2007
• 230+ employees globally
• 200M unique users, 110+ countries
• Over 1 billion videos played per month
• Over 2 billion analytics events per day
5. Ooyala & Cassandra
Ooyala has been using Cassandra since v0.4
Use cases:
• Analytics data (real-time and batch)
• Highly available K/V store
• Time-series data
• Play head tracking (cross-device resume)
• Machine learning data
7. Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0
• CF to track how much I expect my team at Ooyala to drink
• Row keys are names
• Column keys are days
• Values are a count of drinks
8. Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0
• Next day, after a flush
• I'm speaking so I decided to drink less
• Phillip informs me that he has quit drinking
9. Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 22   Wednesday 0

ssTable:
Albert     Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0
• I'm drinking with all you people, so I decide to add 20
• read 2, add 20, write 22
10. Avoiding read-modify-write

cassandra13_drinks column family

ssTable:
Albert     Tuesday 22   Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 0    Wednesday 1
• After compaction & conflict resolution
• Overwriting the same value is just fine! Works really well for some patterns such as time-series data
• Separate read/write streams are handy for debugging, but not a big deal
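The same idea as a minimal CQL3 sketch (an approximate translation of the Thrift-style CF above; cqlsh usage, the keyspace name, and replication settings are illustrative assumptions, not from the deck): every write in Cassandra is an upsert, so you overwrite rather than lock-and-update.

cqlsh <<'CQL'
-- illustrative CQL3 approximation of the cassandra13_drinks CF
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE demo.cassandra13_drinks (
  name   text,
  day    text,
  drinks int,
  PRIMARY KEY (name, day)
);
-- no locking, no read-before-write: overwrite the cell and let
-- timestamps pick the winner at read/compaction time
INSERT INTO demo.cassandra13_drinks (name, day, drinks)
  VALUES ('Albert', 'Tuesday', 22);
CQL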
11. 2011: 0.6 → 0.8
• Migration is still a largely unsolved problem
• Wrote a tool in Scala to scrub data and write via Thrift
• Rebuilt indexes - faster than copying
[diagram: cassandra → GlusterFS P2P → hadoop running Scala Map/Reduce → Thrift → cassandra]
• Because of some legacy choices, we knew we had a bunch of expired tombstones
• GlusterFS: userspace, ionice(1), fast & easy
• Scala M/R: sstabledump etc. were TOO SLOW; the Scala M/R job only took a week (with production running too!)
12. Changes: 0.6 → 0.8
• Cassandra 0.8
• 24GiB heap
• Sun Java 1.6 update
• Linux 2.6.36
• XFS on MD RAID5
• Disabled swap, or at least vm.swappiness=1
• More on XFS settings & bugs later
• Got significant improvements from RAID & readahead tuning (more later)
• Al's first rule of tuning databases: disable swap or GTFO (sketch below)
• Fixed lots of applications by simply disabling swap
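A minimal sketch of that rule (the fstab sed pattern and sysctl.d path are assumptions; adjust for your layout):

swapoff -a                                    # stop swapping immediately
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab     # keep swap off across reboots
echo 'vm.swappiness = 1' >> /etc/sysctl.d/99-cassandra.conf   # fallback if swap must stay on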
13. 2012: Capacity Increase
• 18 nodes → 36 nodes
• DSE 3.0
• Stale tombstones again!
• No downtime!
[diagram: cassandra → GlusterFS P2P → Scala Map/Reduce → Thrift → DSE 3.0]
• I switched teams, working on Hastur; didn't document enough, and repairs were forgotten again (a scheduling sketch follows)
• The 60-day GC grace period expired ... 3 months ago
• rsync is not enough for hardware moves: do rebuilds!
• Use DSE Map/Reduce to isolate most of the load from production
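One way to keep repairs from being forgotten, as a sketch (the cron schedule, user, and log path are assumptions; stagger nodes so the whole cluster finishes within gc_grace_seconds):

# /etc/cron.d/cassandra-repair -- this node repairs its primary range Sundays at 02:00
0 2 * * 0  cassandra  nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1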
14. System Changes: Apache 1.0 → DSE 3.0
• DSE 3.0 installed via apt packages
• Unchanged: heap, distro
• Ran much faster this time!
• Mistake: moved to MD RAID 0
  Fix: RAID10 or RAID5; MD, ZFS, or btrfs
• Mistake: running on Ubuntu Lucid
  Fix: Ubuntu Precise
• Previously deployed with Capistrano
• DSE 3's Hadoop is compiled on Debian 6, so its native components will not load on 10.04's libc
• Still gradually rebuilding nodes from RAID0 → RAID5 and Lucid → Precise
15. Config Changes: Apache 1.0 → DSE 3.0
• Schema: compaction_strategy = LCS
• Schema: bloom_filter_fp_chance = 0.1
• Schema: sstable_size_in_mb = 256
• Schema: compression_options = Snappy
• YAML: compaction_throughput_mb_per_sec: 0
• LCS is a huge improvement in operational life (no more major compactions)
• Bloom filters were tipping over a 24GiB heap
• With lots of data per node, sstable sizes in LCS must be MUCH bigger
• > 100,000 open files slows everything down, especially startup
• 256MB vs. 5MB is a 50x reduction in file count
• Compaction can't keep up: even huge rates don't work, it must be disabled
• Try to adjust heap etc. so you're flushing nearly-full memtables, to reduce compaction needs
• backreference RMW?
• might be fixed in >= 1.2
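Roughly how those changes look, as a sketch (cassandra-cli syntax from the C* 1.0/1.1 era; the keyspace and column family names are illustrative, and exact option names vary by version):

# schema changes via cassandra-cli; the throughput change goes in
# cassandra.yaml: compaction_throughput_mb_per_sec: 0
cassandra-cli -h localhost <<'CLI'
use my_keyspace;
update column family events with
  compaction_strategy = 'LeveledCompactionStrategy' and
  compaction_strategy_options = {sstable_size_in_mb: 256} and
  bloom_filter_fp_chance = 0.1 and
  compression_options = {sstable_compression: SnappyCompressor};
CLI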
16. 2013: Datacenter Move
• 36 nodes → lots more nodes
• As usual, no downtime!
[diagram: two DSE 3.1 clusters with replication between them]
• Size omitted in published slides. I was asked not to publish it yet; I will tweet, etc. in a couple of weeks.
• Wasn't the original plan, but we save a lot of $$ by leaving the old cage
• Prep for the next-generation architecture!
17. Coming Soon for Cassandra at Ooyala
Upcoming use cases:
• Store every event from our players at full resolution
• Cache code for our Spark job server
• AMPLab Tachyon backend?
• This is the intro for the next slide / diagram.
• Considering an Astyanax or CQL3 backend for Tachyon so we can contribute it back
18. Next Generation Architecture: Ooyala Event Store
[diagram: players, loggers, and the API feed kafka → ingest → DSE 3.1, with spark and the job server on top; Tachyon?]
• Look mom! No Hadoop! Remember what I said about latency?
• But we're not just running DSE on these machines. They're running: DSE, Spark, KVM, and CDH3u4 (legacy)
• Secret is cgroups!
• Also, ZFS (later)
19. There's more to tuning than performance:
• Security
• Cost of Goods Sold
• Operations / support
• Developer happiness
• Physical capacity (cpu/memory/network/disk)
• Reliability / Resilience
• Compromise
Shifting themes: philosophy of tuning
• Security is always #1: the decision to disable security features is an important decision!
• Example: EC2 instance sizes vary wildly in consistency and raw performance
• Leveled vs. size-tiered compaction, ZFS/LVM/MDRAID, bare metal vs. EC2
• How much of this stuff do my devs need to know? How much work is it to get a new KS/CF?
• Speed of node rebuilds, risk incurred by extended rebuilds, speed of repair
  e.g. it takes a full 24 hours to repair each node in our 36-node cluster, so > 1 month to repair the whole cluster
• Repeatable configurations: do future admins have to remember to do stuff, or is it automated?
• Look up "John Allspaw Resilience"
• Compromise: maybe you only have access to EC2 or old hardware, or your company has an OS/filesystem/settings policy (e.g. my $PREVIOUS_JOB's hardened CentOS 5.3 distro on Linux 2.6.18.x)
There are others, of course.
20. I am not a scientist ... heuristician?
• I'd love to be more scientific, but production comes first
• Sometimes you have to make educated guesses
• It's not as difficult as it's made out to be
• Your brain is great at heuristics. Trust it.
• Concentrate on bottlenecks
• Make incremental changes
• Read Malcolm Gladwell's "Blink"
• A truly scientific approach would take a lot of time and resources.
• When under time pressure and things are slow, you have to move fast and measure "by the seat of your pants"
• Be educated, do research, and make sensible decisions without months of testing; be prepared to do better next time
• It's actually pretty fast and easy this way!
• More on what tools I use later on.
21. The OODA Loop
Observe, Orient, Decide, Act:
• Observe the system in production under load
• Make small, safe changes
• Observe
• Commit or Revert
• Understand YOUR production workload first!
• Look at OpsCenter latency numbers
• cl-netstat.pl (later)
Examples:
• Changing /proc/sys/vm/dirty_background_ratio is fairly safe and shows results quickly (a worked loop below).
• Some network settings can take your node offline temporarily or require manual intervention.
• Changing the compaction scheme requires a lot of time and has other implications.
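One concrete pass through the loop using that dirty_background_ratio example (the values are illustrative, not recommendations from the deck):

cat /proc/sys/vm/dirty_background_ratio    # Observe: current value, plus OpsCenter/dstat/iostat
sysctl -w vm.dirty_background_ratio=2      # Act: one small, safe, non-persistent change
# Observe again under production load, then either commit...
echo 'vm.dirty_background_ratio = 2' >> /etc/sysctl.conf
# ...or revert:
sysctl -w vm.dirty_background_ratio=10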
22. Testing Shiny Things
• Like kernels
• And Linux distributions
• And ZFS
• And btrfs
• And JVMs & parameters
• Test them in production!
• Testing stuff in a lab is fine, if you have one and you have the time.
• Take (responsible) advantage of Cassandra's resilience:
• Test things you think should work well in production on ONE node, or a couple of nodes well spaced out.
23. Testing Shiny Things: In Production
[diagram: a ring of nodes on ext4, with single canary nodes running ZFS, btrfs, and a kernel upgrade]
• Use your staging / non-prod environments first if you have them (some people don't, and that's unfortunate, but it happens)
• Test things you think should work well in production on ONE node, or a couple of nodes well spaced out.
24. Brendan Gregg's Tool Chart
http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
• Brendan Gregg's chart is so good, I just copied it for now.
• Original: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
• I'll briefly talk about a few
25. dstat -lrvn 10
• Just like vmstat, but prettier and does way more
• 35 lines of output = about 5 minutes of 10-second snapshots
• What's interesting?
• IO wait starting at line 5, but all numbers are going up, so this is probably during a map/reduce job
• IO wait is high, but disk throughput isn't impressive at all
• ~2 blocked "procs" (which includes threads)
• Not bothering to tune this right now because production latency is fine.
27. iostat -x 1
• Mostly I just look at the *wait numbers here.
• Great for finding a bad disk with high latency.
28. htop
• Per-CPU utilization bars are nice
• Displays threads by default (hit 'H' in plain top)
• Very configurable!
• For example: 1 thread at 100% CPU is usually the GC
29. jconsole
• Looks like I can reduce the heap size on this cluster, but should probably increase -Xmn to 100MB * (physical cores), not counting hypercores
31. nodetool ring
10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 1012046694721756637024691720378965
10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 1026714038123521225967078556906197
10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 1041381381525285814909465393433428
10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 1056048724927050403851852229960659
10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 1070716068328814992794239066487891
10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 1100423945662575060114582859200003
10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 1137814208669076757916163680305794
10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 1196501513956187970179620530735245
10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 1394248867770897155613247921498720
10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 1435882108713996181107000284314407
10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 1465773686249280216901752503449044
10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 1481401683578223483181070489250370
• hotspots (note the uneven load and ownership percentages across nodes)
32. nodetool cfstats
Keyspace: gostress
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Column Family: stressful
SSTable count: 1
Space used (live): 32981239
Space used (total): 32981239
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 336
Compacted row minimum size: 7007507
Compacted row maximum size: 8409007
Compacted row mean size: 8409007
Could be using a lot of heap
Controllable by sstable_size_in_mb
• bloom filters
• sstable_size_in_mb
33. nodetool proxyhistograms
Offset Read Latency Write Latency Range Latency
35 0 20 0
42 0 61 0
50 0 82 0
60 0 440 0
72 0 3416 0
86 0 17910 0
103 0 48675 0
124 1 97423 0
149 0 153109 0
179 2 186205 0
215 5 139022 0
258 134 44058 0
310 2656 60660 0
372 34698 742684 0
446 469515 7359351 0
535 3920391 31030588 0
642 9852708 33070248 0
770 4487796 9719615 0
924 651959 984889 0
• Units are microseconds
• Can give you a good idea of how much latency coordinator hops are costing you
34. nodetool compactionstats
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 9819749801 16922291634 58.03%
Compaction hastur counter_archive 12141850720 16147440484 75.19%
Compaction hastur mark_archive 647389841 1475432590 43.88%
Active compaction remaining time : n/a
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 10239806890 16922291634 60.51%
Compaction hastur counter_archive 12544404397 16147440484 77.69%
Compaction hastur mark_archive 1107897093 1475432590 75.09%
Active compaction remaining time : n/a
35. Stress Testing Tools
• cassandra-stress
• YCSB
• Production
• Terasort (DSE)
• Homegrown
• We mostly focus on cassandra-stress for burn-in of new clusters (see the sketch below)
• Can quickly figure out the right setting for -Xmn
• Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it's fast!)
• Consider writing custom benchmarks for your application patterns
• Sometimes it's faster to write one than figure out how to make a generic tool do what you want
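For reference, a burn-in along those lines (flags follow the 1.x-era cassandra-stress and node names are placeholders; check --help on your version):

# write burn-in: 10M keys, 200 threads, against the new nodes
cassandra-stress -d node1,node2,node3 -o insert -n 10000000 -t 200
# read it back to exercise caches, bloom filters, and GC (-Xmn)
cassandra-stress -d node1,node2,node3 -o read -n 10000000 -t 200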
36. /etc/sysctl.conf
kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1
• pid_max doesn't fix anything, I just like it and have never had a problem with it
• These are my starting-point settings for nearly every system/application.
• Generally safe for production.
• vm.dirty*ratio can go big for fake fast writes; generally safe for Cassandra, but beware that you're more likely to see FS/file corruption on power loss
• You will get latency spikes if you hit dirty_ratio (a percentage of RAM), so don't tune it too low
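Applying these is standard sysctl usage:

sysctl -p /etc/sysctl.conf    # load the settings without a reboot
sysctl vm.swappiness          # spot-check that a value took effect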
37. /etc/rc.local
ra=$((2**14))                                # readahead: 16KiB in bytes
ss=$(blockdev --getss /dev/sda)              # logical sector size
blockdev --setra $(($ra / $ss)) /dev/sda     # --setra takes sectors, not bytes
echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size
• Lower readahead is better for latency on seeky workloads
• More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
• nr_requests = number of IO structs the kernel will keep in flight; don't go crazy
• Deadline is best for raw throughput
• CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
• The default stripe cache is 128. The increase seems to help MD RAID5 a lot.
• Don't forget to set readahead separately for MD RAID devices (see below)
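For that last point, a sketch against the same md7 array as above (readahead is per block device; member-disk settings do not propagate up to the array):

ss=$(blockdev --getss /dev/md7)
blockdev --setra $((2**14 / $ss)) /dev/md7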
38. JVM Args
-Xmx8G leave it alone
-Xms8G leave it alone
-Xmn1200M 100MiB * nCPU
-Xss180k should be fine
-XX:+UseNUMA
numactl --interleave
• In general, most people should leave the defaults alone. Especially the heap, which can cause no end of trouble if you do it wrong and cause GC pauses.
• Don't count hypercores.
• Our biggest bang for the buck so far has been tuning newsize.
• Have you ever seen "out of memory" when there's plenty of memory available? You probably have a full NUMA node.
• NUMA is how modern machines are built. Older Apache Cassandra distros had numactl --interleave, but this doesn't seem to be in the DSE scripts. I've been running +UseNUMA for about a year and a half now and it seems to work fine.
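A sketch of the 100MiB-per-physical-core rule for -Xmn in cassandra-env.sh (the lscpu parsing is an assumption about its output format; verify on your distro):

# physical cores = sockets * cores per socket (hyperthreads excluded)
cores=$(lscpu | awk '/^Socket\(s\)/ {s=$2} /^Core\(s\) per socket/ {c=$4} END {print s*c}')
JVM_OPTS="$JVM_OPTS -Xmn$((cores * 100))M"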
39. cgroups
Provides fine-grained control over Linux resources
• Makes the Linux scheduler better
• Lets you manage systems under extreme load
• Useful on all Linux machines
• Can choose between determinism and flexibility
• Static resource assignment has better determinism / consistency
• Weighted resources provide most of the advantage with a lot more flexibility
40. cgroups
cat >> /etc/default/cassandra <<'EOF'
# quoted heredoc: $cpucg and $$ must expand at cassandra startup, not now
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir -p $cpucg
# inherit the parent's NUMA nodes and cores (assumes cpuset is co-mounted)
cat $cpucg/../cpuset.mems > $cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus > $cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF
• Automatically adds cassandra to a CG called "cassandra"
• cpuset.mems can be used to limit NUMA nodes if you have huge machines
• cpuset.cpus can restrict tasks to specific cores (like taskset, but stricter)
• shares is just a number; set your own scale, 1-1000 works for me
• Adding a task to a CG is as simple as adding its PID
• Children are not necessarily added; you must add threads too if joining after startup (ps -efL), as sketched below
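A sketch of joining an already-running Cassandra thread by thread (the pgrep pattern is an assumption; the cgroup tasks file takes one TID at a time):

pid=$(pgrep -f CassandraDaemon | head -1)
for tid in /proc/$pid/task/*; do
  echo ${tid##*/} > /sys/fs/cgroup/cpu/cassandra/tasks
done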
41. Successful Experiment: btrfs
# -m raid10 mirrors+stripes metadata; -d raid0 stripes data across 6 disks
mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
# lzo compression trades a little CPU for less disk IO
mount -o compress=lzo /dev/sdc1 /data
• Like ZFS, btrfs can manage multiple disks without mdraid or LVM.
• We have one production system in EC2 running btrfs flawlessly.
• I'm told there are problems when the disk fills up, so don't do that.
• noatime isn't necessary on modern Linux; relatime is the default for xfs/ext4 and is good enough
42. Successful Experiment: ZFS on Linux
zpool create data raidz /dev/sd[c-h]        # one RAIDZ vdev across 6 disks
zfs create data/cassandra
zfs set compression=lzjb data/cassandra     # cheap inline compression
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra   # favor throughput over sync latency
• ZFS really is the ultimate filesystem.
• RAIDZ is like RAID5 but totally different:
  • variable-width stripes
  • no write hole
  • VERY fast, plays well with C*
• Stable! (so far)
43. Conclusions
• Tuning is multi-dimensional
• Production load is your most important benchmark
• Lean on Cassandra, experiment!
• No one metric tells the whole story