2. Build a compaction strategy that compacts the most overlapping sstables together
3. 0. Setting up your IDE
https://wiki.apache.org/cassandra/RunningCassandraInIDEA
http://wiki.apache.org/cassandra/RunningCassandraInEclipse
4. 1. Implement a no-op compaction strategy
● class Xyz extends AbstractCompactionStrategy {..}
● Implement the abstract methods
○ getNextBackgroundTask
■ Return a CompactionTask containing the sstables you want to compact, or null if there are none
○ getMaximalTask
■ ‘Major compaction’ - should compact all sstables
○ ...
● ALTER TABLE foo WITH compaction = { 'class': 'Xyz' }
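A minimal sketch of the shape of such a strategy. CompactionTask, StrategySketch and NoOpStrategy below are hypothetical stand-ins, not Cassandra's classes: the real AbstractCompactionStrategy takes extra arguments and declares further abstract methods (the "..." above), and its signatures differ between versions.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for a compaction task: just the sstables to compact.
final class CompactionTask {
    final List<String> sstables;

    CompactionTask(Collection<String> sstables) {
        this.sstables = new ArrayList<>(sstables);
    }
}

// Hypothetical stand-in for the abstract base class, reduced to the two
// methods named on the slide.
abstract class StrategySketch {
    protected final List<String> sstables = new ArrayList<>();

    void addSSTable(String name) {
        sstables.add(name);
    }

    // Background compaction: return a task, or null if there is nothing to do.
    abstract CompactionTask getNextBackgroundTask();

    // Major compaction: return a task covering all sstables.
    abstract CompactionTask getMaximalTask();
}

// No-op strategy: never schedules background work, but still supports
// a user-triggered major compaction.
final class NoOpStrategy extends StrategySketch {
    @Override
    CompactionTask getNextBackgroundTask() {
        return null;
    }

    @Override
    CompactionTask getMaximalTask() {
        return sstables.isEmpty() ? null : new CompactionTask(sstables);
    }
}
```

Starting from a strategy that never returns a background task makes it easy to verify the wiring (class loading, ALTER TABLE) before adding any selection logic.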
5. 2. Make it compact the most overlapping sstables
● Compacting the sstables that overlap the most should reduce disk usage the most
● CompactionMetadata has ICardinality
○ HyperLogLog - count unique items in a stream
○ Currently used to estimate how large the bloom filters allocated during compaction need to be
○ https://github.com/addthis/stream-lib
○ SSTableReader#getApproximateKeyCount
○ ICardinality#merge - merge several of these estimators to estimate the number of keys in the union of the sstables
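To illustrate why merging estimators helps, here is a minimal sketch of a HyperLogLog with a merge operation. It is a simplified toy, not the stream-lib implementation: HllSketch, its register count and the hash function are assumptions made for this example. The idea is that |A| + |B| - |A ∪ B| approximates the number of keys two sstables share, so the pair with the largest value is a candidate for compacting together.

```java
// Toy HyperLogLog, assumed parameters: 2^12 registers, splitmix64-style hash.
final class HllSketch {
    private static final int P = 12;      // index bits
    private static final int M = 1 << P;  // number of registers
    private final byte[] registers = new byte[M];

    // 64-bit mixing function so sequential keys spread over all registers.
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void offer(long key) {
        long h = mix(key);
        int idx = (int) (h >>> (64 - P));  // top P bits pick the register
        long rest = h << P;                // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Union of two sketches: register-wise maximum.
    HllSketch merge(HllSketch other) {
        HllSketch out = new HllSketch();
        for (int i = 0; i < M; i++) {
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
        return out;
    }

    long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double est = alpha * M * M / sum;
        // Small-range correction (linear counting) while registers are still empty.
        if (est <= 2.5 * M && zeros > 0) {
            est = M * Math.log((double) M / zeros);
        }
        return Math.round(est);
    }

    // Inclusion-exclusion estimate of the keys present in both inputs.
    static long estimatedSharedKeys(HllSketch a, HllSketch b) {
        long shared = a.cardinality() + b.cardinality() - a.merge(b).cardinality();
        return Math.max(0, shared);
    }
}
```

Because the estimate is a difference of three approximate counts, it is noisy for sstables that share few keys. It is better suited to ranking candidate pairs than to giving exact overlap sizes.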
6. 3. Add support for worthDroppingTombstones
● Single-sstable compaction to drop tombstones
● Tries to figure out how much the sstables overlap, then estimates how many tombstones lie outside that overlap
● Currently we check for range overlap
● Could probably be improved if we used ICardinality
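The range-overlap check can be sketched roughly as follows. This is an illustration under stated assumptions, not Cassandra's code: TombstoneCheck and Table are hypothetical, tombstones are assumed to be spread evenly over the token range, and the threshold is passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;

final class TombstoneCheck {
    // Hypothetical stand-in for an sstable: token range (first < last)
    // and the estimated ratio of droppable tombstones it contains.
    static final class Table {
        final long first;
        final long last;
        final double droppableRatio;

        Table(long first, long last, double droppableRatio) {
            this.first = first;
            this.last = last;
            this.droppableRatio = droppableRatio;
        }
    }

    // Fraction of the candidate's token range that no other sstable covers.
    static double uncoveredFraction(Table c, List<Table> others) {
        List<long[]> clipped = new ArrayList<>();
        for (Table o : others) {
            long lo = Math.max(c.first, o.first);
            long hi = Math.min(c.last, o.last);
            if (lo < hi) clipped.add(new long[] {lo, hi});
        }
        clipped.sort((x, y) -> Long.compare(x[0], y[0]));
        long covered = 0;
        long reach = c.first;  // right edge of the range covered so far
        for (long[] r : clipped) {
            long start = Math.max(r[0], reach);
            if (r[1] > start) {
                covered += r[1] - start;
                reach = r[1];
            }
        }
        return 1.0 - (double) covered / (c.last - c.first);
    }

    // Worth a single-sstable compaction if enough droppable tombstones
    // are estimated to lie outside the overlap with other sstables.
    static boolean worthDroppingTombstones(Table c, List<Table> others, double threshold) {
        return c.droppableRatio * uncoveredFraction(c, others) > threshold;
    }
}
```

A range check like this is pessimistic: two sstables with interleaved but distinct keys count as fully overlapping, which is the gap a cardinality-based estimate could narrow.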
7. 4. Add heuristics to avoid n² CompactionMetadata comparisons
Algorithms!
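One possible heuristic, offered only as an assumption since the slide does not name one: prune by token range first, and run the expensive cardinality comparison only on pairs whose ranges intersect. The CandidatePairs class below is hypothetical. It sorts sstables by first token and sweeps, so sstables with disjoint ranges are never paired.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CandidatePairs {
    // ranges[i] = {firstToken, lastToken} of sstable i, both inclusive.
    // Returns index pairs whose token ranges intersect.
    static List<int[]> overlappingPairs(long[][] ranges) {
        Integer[] order = new Integer[ranges.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices by first token so the inner loop can stop early.
        Arrays.sort(order, (a, b) -> Long.compare(ranges[a][0], ranges[b][0]));

        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < order.length; i++) {
            long lastToken = ranges[order[i]][1];
            // Every later sstable starts at or after this one; once one starts
            // past lastToken, none of the remaining ones can intersect.
            for (int j = i + 1; j < order.length && ranges[order[j]][0] <= lastToken; j++) {
                pairs.add(new int[] {order[i], order[j]});
            }
        }
        return pairs;
    }
}
```

The worst case is still quadratic when every sstable spans the whole token range, so this only helps when the ranges are reasonably disjoint.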
8. Summary
1. Implement a no-op compaction strategy
2. Make it compact the most overlapping sstables
3. Add support for worthDroppingTombstones
4. Add heuristics to avoid n² comparisons
Slides: bit.ly/1pd9Bws