3. What does Ooyala use it for?
Fast access to data generated by Map/Reduce
High availability key/value out of Storm
Cross-device resume (playhead tracking)
ML predictions
Time-series data, raw events & application metrics
4. The Beginning
Our data is doubling every year
Cluster size: 18 nodes
Biggest CF: 2TB
Repairs becoming a problem
Expired tombstones
5. First Migration
Upgrade C* from 0.6 to 0.8
Remove expired tombstones
Scrub data and rebuild indexes
Lots of Linux performance tuning
Map/Reduce
6. Second Migration
Upgrade to Cassandra 1.0
Remove expired tombstones
Update schema
More Linux performance tuning
Map/Reduce - this time using DSE Hadoop
7. Tuning Highlights
Bloom filter false-positive chance (schema)
Index density (schema)
LeveledCompaction ssTable size (schema)
XFS filesystem bugs (Linux)
Stick with ext4 if you like to sleep.
NO SWAP!
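On the bloom filter item above: a rough sketch of why the false-positive chance is worth tuning, using the textbook formula for an optimally sized Bloom filter (bits per key = -ln(p) / (ln 2)^2). This is an assumption for illustration; Cassandra's own sizing logic may round differently, and the function names are hypothetical.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an ideal Bloom filter at the given false-positive chance."""
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_mib(keys: int, fp_chance: float) -> float:
    """Approximate filter size in MiB for `keys` entries (textbook estimate)."""
    return keys * bloom_bits_per_key(fp_chance) / 8 / (1024 * 1024)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"fp_chance={p}: {bloom_bits_per_key(p):.2f} bits/key, "
              f"{bloom_mib(10**9, p):.0f} MiB per billion keys")
```

Under this formula every 10x reduction in false positives costs about 4.8 more bits per key, so relaxing the setting from 0.01 to 0.1 roughly halves the filter's memory.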
8. More Information
cassandra-users mailing list
irc.freenode.net #cassandra / #cassandra-ops
http://www.datastax.com/docs/1.1/index
@AlTobey / al@ooyala.com
Contact me about open positions at Ooyala.
9. Rejected Slides follow:
An old version of this deck was a lot more
technical. I've added those slides back for
online posting since people have asked about
the specifics.
10. Linux: General Observations
● use a modern kernel, 2.6.32 is ancient
○ Running 3.4.11 on new production hardware
○ default Ubuntu Lucid / Oneiric kernels in EC2
● I have yet to use XFS bug-free
○ 2.6.38 has an especially fun bug
○ allocsize=64m allocates 64m always & forever
○ echo 1 > /proc/sys/vm/drop_caches
● Put commit log on a different filesystem
● btrfs works fine in production
● Block alignment is hard
○ use GPT disk labels and it's generally not an issue
○ or just skip disk labels and RAID whole disks
11. Linux: almost a server OS
/etc/security/limits.conf
* - memlock unlimited
* - nofile 1048576
* - fsize unlimited
* - nproc 999999
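A minimal sketch for checking whether a running process actually picked up these limits, using Python's standard `resource` module. The `WANTED` table mirrors the values on this slide; the helper names are hypothetical, and the check only reflects the process it runs in (run it as the Cassandra user to be meaningful).

```python
import resource

# Values from the limits.conf slide; "unlimited" maps to RLIM_INFINITY.
WANTED = {
    "memlock": (resource.RLIMIT_MEMLOCK, resource.RLIM_INFINITY),
    "nofile": (resource.RLIMIT_NOFILE, 1048576),
    "fsize": (resource.RLIMIT_FSIZE, resource.RLIM_INFINITY),
    "nproc": (resource.RLIMIT_NPROC, 999999),
}


def meets(current: int, wanted: int) -> bool:
    """True if the current limit is at least as generous as the wanted one."""
    if current == resource.RLIM_INFINITY:
        return True
    if wanted == resource.RLIM_INFINITY:
        return False
    return current >= wanted


def check_limits() -> dict:
    """Map each limit name to whether this process's soft limit is high enough."""
    report = {}
    for name, (res, wanted) in WANTED.items():
        soft, _hard = resource.getrlimit(res)
        report[name] = meets(soft, wanted)
    return report


if __name__ == "__main__":
    for name, ok in check_limits().items():
        print(f"{name}: {'ok' if ok else 'TOO LOW'}")
```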
13. Ubuntu: FFFFFFFUUUUUUUUUUU
/etc/fstab
/dev/md4 /commit ext4 nobootwait,barrier=1,journal_ioprio=0,rw 0 0
/dev/md7 /srv xfs nobootwait,rw 0 0
● force barriers for journal
● noatime & relatime aren't necessary anymore
○ since ~ 2.6.31
● nobootwait is an upstart option
○ set this or upstart will troll you at 4am
○ mountall hangs on boot for any error without this
○ use on both hardware and EC2 unless you love
using OOB consoles
● As noted, XFS is buggy, so consider ext4.
14. Linux: Final Adjustments
/etc/rc.local (or whatever you prefer)
● CFQ disk scheduler
○ deadline is still faster, but no cgroup support
○ noop is a popular choice on EC2, SSDs, and HW RAID
● Tune readahead
○ don't go crazy, 64k is a decent choice
○ big RA will inflate your bandwidth numbers, but
really large values will waste IO on unused data
● If running MD RAID5/6
○ echo 16384 > /sys/block/$md/md/stripe_cache_size
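Some arithmetic behind the readahead and stripe cache numbers above, as a small sketch. It assumes `blockdev --setra` units of 512-byte sectors and the memory formula from the kernel's md documentation (page size * number of disks * stripe_cache_size). The 4 KiB page size and the 6-disk array are assumptions for illustration, not values from the deck.

```python
SECTOR_BYTES = 512  # blockdev --setra counts in 512-byte sectors


def readahead_sectors(readahead_kib: int) -> int:
    """Convert a readahead size in KiB to the sector count blockdev expects."""
    return readahead_kib * 1024 // SECTOR_BYTES


def stripe_cache_bytes(stripe_cache_size: int, nr_disks: int,
                       page_size: int = 4096) -> int:
    """Memory pinned by the MD RAID5/6 stripe cache, per the md docs formula."""
    return page_size * nr_disks * stripe_cache_size


if __name__ == "__main__":
    print("64k readahead =", readahead_sectors(64), "sectors")
    mib = stripe_cache_bytes(16384, nr_disks=6) / (1024 * 1024)
    print(f"stripe_cache_size=16384 on 6 disks pins about {mib:.0f} MiB")
```

With 4 KiB pages, a stripe_cache_size of 16384 works out to 64 MiB per member disk, so budget for that memory alongside the JVM heap.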
15. JVM: ALL THE MEMORY
● Use Oracle JVM 1.6 for Cassandra
○ OpenJDK works, still not recommended
○ Use fpm to create packages if you don't have them
● Default Cassandra GC settings are OK
○ -XX:+UseNUMA
■ works fine in production
■ Apache scripts will use numactl if installed
● DSE does not! (yet)
○ Bigger data will need bigger heaps.
■ 12G seems to work OK
■ 24G works, but approaching limits of JVM
■ too little free memory causes excessive
memtable flushing (more on this later)
16. Cassandra.(?:ya|f)ml
● index_interval: 512
○ save some memory on indexes
● compaction_throughput_mb_per_sec: 0
○ 0 disables throttling; this can hurt your read latency,
but in my experience leveled compaction falls behind
under very high insert loads without it. Use a bigger
heap to compensate?
● rpc_server_type: hsha
○ if you have lots & lots of connections, e.g. from
Hadoop, saves memory
17. Cassandra: Schema Tuning
● Enable compression
○ compression_options = {'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor'};
● Examine bloom filter false-positives
○ nodetool -h localhost cfstats |grep Bloom
○ bloom_filter_fp_chance = 0.1 # diminishing returns
● Reduce ssTable count
○ memory pressure caused frequent memtable flushes
○ compaction throttling made it worse
○ compaction_strategy_options = {'sstable_size_in_mb': 256}
● Give yourself time to repair
○ gc_grace = 5184000 # 60 days
○ shoot for (node_count * 86400 * 3) to be safe
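The gc_grace rule of thumb above, worked as a short sketch. The function name and the `days_per_node` parameter are hypothetical; the 18-node figure is the cluster size quoted earlier in the deck and may not match the cluster this setting was tuned for.

```python
SECONDS_PER_DAY = 86400


def gc_grace_seconds(node_count: int, days_per_node: int = 3) -> int:
    """Rule of thumb from the slide: node_count * 86400 * 3."""
    return node_count * SECONDS_PER_DAY * days_per_node


if __name__ == "__main__":
    seconds = gc_grace_seconds(18)
    print(f"18 nodes: gc_grace >= {seconds} seconds "
          f"({seconds // SECONDS_PER_DAY} days)")
```

For 18 nodes this gives 4665600 seconds (54 days), which fits under the 60-day value of 5184000 shown above.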
18. Future
● Upgrade all clusters to DSE 2.2
● Chef cookbook (likely open)
● Mixing CQL3 and Thrift API access
○ all lower case CF names
○ WITH COMPACT STORAGE
● Cassandra 1.2
○ native protocol
○ JBOD support
○ vnodes
○ compound row key support in CQL3
19. MOAR
● Freenode IRC is a great resource
○ #cassandra, #cassandra-ops
● cassandra-users mailing list
● DataStax Enterprise
○ The Hadoop integration works and is useful
○ Still playing with Solr
○ OpsCenter is really nice
● Me:
○ @AlTobey on Twitter
○ tobert on irc.freenode.net
○ https://gist.github.com/tobert
20. More Information (again)
cassandra-users mailing list
irc.freenode.net #cassandra / #cassandra-ops
http://www.datastax.com/docs/1.1/index
@AlTobey / al@ooyala.com
Contact me about open positions at Ooyala.