At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
The Spotify backend
• Around 3000 servers in 3 datacenters
• Volumes
o We have ~ 12 soccer fields of music
o Streaming ~ 4 Wikipedias/second
o ~ 24 000 000 active users
The Spotify backend
• Specialized software powering Spotify
o ~ 70 services
o Mostly Python, some Java
o Small, simple services, each responsible for a single task
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
o Poor cross-site replication support
o Write master failure requires manual intervention
o Sharding throws most relational advantages out the
window
Cassandra @ Spotify
• We started using Cassandra ~2 years ago
• About a dozen services use it by now
• Back then, there was little information about how to
design efficient, scalable storage schemas for
Cassandra
• So we screwed up
• A lot
Read repair
• Repairs inconsistencies from outages during regular read operations
• With RR, all reads request hash digests from all replicas
• The result is still returned as soon as enough replicas have replied
• If the digests mismatch, a repair is performed
Read repair
• Useful factoid: Read repair is performed across all data
centers
• So in a multi-DC setup, all reads will result in requests being
sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair_chance
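As a rough sketch (column family name hypothetical, attribute names as in the 1.1-era cassandra-cli; verify against your version), read repair is tunable per column family so most reads only repair within the local DC:

    update column family playlist_head
      with read_repair_chance = 0.0
      and dclocal_read_repair_chance = 0.1;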
Row cache
• Cassandra can be configured to cache entire data rows in
RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate entire row
• Don't use unless you understand all use cases
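For reference, the cache mode is a per-CF setting in 1.1+; a sketch in cassandra-cli (CF name hypothetical):

    update column family user_profiles with caching = 'keys_only';

'keys_only' keeps the key cache without the full-row behaviour above; only consider 'rows_only' or 'all' for small, read-mostly rows where those semantics are acceptable.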
Compression
• Cassandra supports transparent compression of all data
• Compression algorithm (snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down
fast reads
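A sketch of what enabling it looks like in cassandra-cli (CF name hypothetical; option names may differ slightly between versions), so the same CF can be benchmarked with and without it:

    update column family playlist_head
      with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

Measure read latency before and after rather than assuming it is a free win.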
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over and over for a
long time, performance goes down
• We've seen clusters where reads touch a dozen SSTables
on average
• nodetool cfhistograms is your friend
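For example (keyspace/CF names hypothetical):

    nodetool -h localhost cfhistograms playlist playlist_head

The SSTables column shows how many SSTables recent reads had to touch; a healthy CF is dominated by 1-2, not a dozen.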
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores the first/last column it contains, so reads
can skip SSTables outside the requested range
• Time series-like data is effectively partitioned across SSTables
Few cross-continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay
• Reduced packet count by 20 %
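For reference, this ended up as a cassandra.yaml option in later releases (a sketch; check the option name against your version):

    inter_dc_tcp_nodelay: false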
How not to upgrade Cassandra
• Very few total cluster outages
o Clusters have been up and running since the early
0.7 days; they have been rolling-upgraded, expanded,
had full hardware replacements, etc.
• Never lost any data!
o No matter how spectacularly Cassandra fails, it has
never written bad data
o Immutable SSTables FTW
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed rolling upgrade would work
o It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it a day later
• Takeaways:
o ALWAYS try rolling upgrades in a testing environment
o Don't believe what people on the Internet tell you
Upgrade 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data
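The scrub mentioned above is an online, per-node nodetool operation that rewrites every SSTable (and with it the bloom filters); keyspace/CF names hypothetical:

    nodetool -h localhost scrub playlist playlist_head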
Upgrading 1.0 to 1.1
• After the previous upgrades, we did all the tests with
production data and everything worked fine...
• Until we redid it in production, and we had reports of
missing rows
• Scrub ⇾ restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Network hiccup
• Major GC
• Reality
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice
quickly and drop the node
• But what happens if a node is just super slow?
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in
the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
What happens if one node is slow?
• Solution: Partitioner awareness in the client
• A slow node then affects at most the ~3 replicas for its token
ranges, not the whole cluster
• Available in Astyanax
Deleting data
How is data deleted?
• SSTables are immutable, we can't remove the data
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other
write
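For example, a delete in cassandra-cli is just another timestamped write (row key and column name hypothetical):

    del playlist_head['some-row-key']['some-column'];

The old column stays in its SSTable; a tombstone with a newer timestamp is written next to it.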
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into
SSTables that hold the original data, making the
tombstones redundant
• Once a tombstone is the only value for a specific
column, the tombstone can go away
• Still need a grace period (gc_grace) to handle node downtime
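The grace period is the per-CF gc_grace setting in cassandra-cli (gc_grace_seconds in CQL); a sketch with the 10-day default, CF name hypothetical:

    update column family playlist_head with gc_grace = 864000;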
How not to delete data
• Tombstones can only be dropped once all the older non-
tombstone values they shadow have been compacted away
• If you're using SizeTiered compaction, 'old' rows will
rarely get deleted
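Switching strategies is a per-CF change; a sketch in cassandra-cli (CF name and SSTable size hypothetical):

    update column family playlist_head
      with compaction_strategy = 'LeveledCompactionStrategy'
      and compaction_strategy_options = {sstable_size_in_mb: 10};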
How not to delete data
• Tombstones are a problem even when using levelled
compaction
• In theory, 90 % of all rows should live in a single
SSTable
• In production, we've found that 20 - 50 % of all reads hit
more than one SSTable
• Frequently updated columns will exist in many levels,
causing tombstones to stick around
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will
rarely get deleted from «popular» rows
• Avoid schemas that delete data!
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted
away, right?
• We know we don't need the data anymore, no need for
a tombstone, so it should be fast, right?
• Noooooo...
• Expired columns still turn into tombstones: without one, an older
overwritten value in another SSTable could bounce back
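Setting a TTL is part of the write itself; a sketch in cassandra-cli (names hypothetical):

    set playlist_head['some-row-key']['some-column'] = 'value' with ttl = 86400;

On expiry the column still becomes a tombstone at compaction time, for exactly the reason above.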
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making
the backup process really slow.
• Home-brewed replication model that didn't work very
well
• Frequent downtimes, huge scalability problems
• Perfect test case for Cassandra!
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that
works really well!
• That's actually a really useful feature... said no one ever
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is optimization
• Cassandra is pretty neat for storing this kind of stuff
• Can use consistency level ONE safely
Tombstone hell
Noticed that HEAD requests took several seconds for some
lists
Easy to reproduce in cassandra-cli
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency - should be < 0.1 s
Copy head SSTables to development machine for
investigation
The Cassandra tool sstable2json showed that the row contained
600 000 tombstones!
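sstable2json dumps an SSTable as JSON, with tombstones visible in the output, which is how they could be counted (filename hypothetical):

    sstable2json playlist_head-hf-1234-Data.db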
Tombstone hell
We expected tombstones would be deleted after 30 days
• Nope, all tombstones from the past 1.5 years were still there
Revelation: Rows existing in 4+ SSTables never have
tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
Major compaction (CF size cut in half)
Zombie tombstones
• Ran major compaction manually on all nodes over a
few days.
• All seemed well...
• But a week later, the same lists took several seconds
again‽‽‽
Repair vs major compactions
A repair between the major compactions "resurrected" the
tombstones :(
New solution:
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
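A sketch of the per-node routine (keyspace/CF names hypothetical):

    Monday-Friday:     nodetool repair -pr playlist playlist_head
    Saturday-Sunday:   nodetool compact playlist playlist_head

Running the major compaction after the week's repairs means repair streaming cannot re-introduce already-compacted tombstones in between.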
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
Cassandra counters
• There are lots of places in the Spotify UI where we
count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that
sounds suitable
• Is this awesome?
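A minimal sketch in cassandra-cli (CF and key names hypothetical, assuming a CF whose default validator is CounterColumnType):

    incr playlist_counters['some-playlist-id']['followers'];
    get playlist_counters['some-playlist-id'];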
Lessons
• There are still various esoteric problems with large scale
Cassandra installations
• Debugging them is interesting
• If you agree with the above statements, you should
totally come work with us
Lessons
• Cassandra read performance is heavily dependent on
the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns
make read performance slowly decrease
• Super hard to perform realistic benchmarks
Lessons
• Avoid repeatedly writing data to the same row over very
long spans of time
• If you're working at scale, you'll need to know how
Cassandra works under the hood
• nodetool cfhistograms is your friend