Discord initially used MongoDB and then Cassandra to store messages, but ran into scalability and performance problems with both. To get the most out of ScyllaDB, they built a data service library in Rust and a "Superdisk" storage topology that combines local NVMe SSDs with persistent disks for fast, durable storage. Discord then migrated trillions of messages from Cassandra to ScyllaDB in just 9 days, ending up with lower disk usage, more consistent latencies, and a quieter system. Together, the Rust library, Superdisk, and ScyllaDB let Discord reliably store trillions of messages at scale.
2. Progress Over Perfection
Day 1: MongoDB/TokuMX (a temporary choice)
Ultimately, we wanted a database that was:
● Scalable
● Fault Tolerant
● Low Maintenance
● Not a blob store (serialization is expensive!)
5. What Even Is A Message?
CREATE TABLE messages (
    channel_id bigint,  -- partition key: the channel the message belongs to
    bucket int,         -- partition key: a coarse time window, so no single partition grows unbounded
    message_id bigint,  -- clustering key: a time-ordered Snowflake ID
    author_id bigint,
    content text,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);  -- newest messages first
6. Cassandra Problems
Size of dataset and access patterns cause trouble
● Hot partitions tank latency
● Compactions run behind
● 2 expensive 2 repair
● Garbage collection tax
If messages are slow, Discord is slow!
7. 2020 Architecture
● Python API monolith
● Talked directly to databases
● No more Cassandra… except for messages
● Figuring out ScyllaDB - how can we get the most performance?
9. Built in Rust!
● Experience
● It’s fast!
● It’s fun!
● “Fearless concurrency”
● Strong Cassandra/ScyllaDB support
● Leverage Tokio ecosystem for async I/O
● Lets us say we “rewrote it in Rust”
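As a rough sketch of what a read path through such a Rust data service might look like, here is a minimal Tokio program using the community scylla driver against the messages schema above. The node address, function name, and error handling are illustrative, and Session::query as used here comes from pre-1.0 releases of the driver; this is not Discord's actual code.

use scylla::{IntoTypedRows, Session, SessionBuilder};

// Fetch the newest messages in one (channel_id, bucket) partition.
async fn recent_messages(
    session: &Session,
    channel_id: i64,
    bucket: i32,
) -> Result<(), Box<dyn std::error::Error>> {
    let result = session
        .query(
            "SELECT message_id, author_id, content FROM messages \
             WHERE channel_id = ? AND bucket = ? LIMIT 50",
            (channel_id, bucket),
        )
        .await?;
    // CLUSTERING ORDER BY (message_id DESC) means rows already arrive newest-first.
    for row in result.rows.unwrap_or_default().into_typed::<(i64, i64, String)>() {
        let (message_id, author_id, content) = row?;
        println!("{message_id} {author_id}: {content}");
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let session = SessionBuilder::new()
        .known_node("127.0.0.1:9042") // illustrative address
        .build()
        .await?;
    recent_messages(&session, 42, 0).await
}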
13. Better, but it’s still Cassandra
● It helps!
● Still having issues (just not quite as frequently)
● Concurrently, we’re ironing out ScyllaDB performance
We want the best ScyllaDB we can have so that Discord thrives
15. ScyllaDB Disk Options in GCP
Local NVMe SSDs
● ScyllaDB’s preference
● Fast
● Low reliability = increased risk of quorum loss and downtime
● Potential data loss on underlying host error
Persistent Disks
● Discord’s preference
● Point-in-time snapshots
● Slow
● Cascading latency if used in production
22. We turned it on…
● Reads still occurring at the speed of PD writes
● ScyllaDB uses one I/O bandwidth channel by default, shared by reads and writes
● Worked with ScyllaDB to get duplex I/O: separate channels for reads and writes
● Write intent bitmap speeds up recovery, but was causing latency issues, so we turned it off
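The duplex change is easier to see as a model than in prose. A purely conceptual sketch in Rust (this is not ScyllaDB's scheduler or its configuration; the semaphores just stand in for per-direction bandwidth budgets):

use std::sync::Arc;
use tokio::sync::Semaphore;

// With one shared budget, fast NVMe reads queue behind slow PD writes and
// degrade to PD speed. Duplex I/O gives each direction its own budget.
struct DuplexIo {
    read_bw: Arc<Semaphore>,
    write_bw: Arc<Semaphore>,
}

impl DuplexIo {
    fn new(read_permits: usize, write_permits: usize) -> Self {
        Self {
            read_bw: Arc::new(Semaphore::new(read_permits)),
            write_bw: Arc::new(Semaphore::new(write_permits)),
        }
    }

    async fn read(&self) {
        // Reads only compete with other reads for permits.
        let _permit = self.read_bw.acquire().await.unwrap();
        // ...issue the read against NVMe...
    }

    async fn write(&self) {
        // Slow PD-bound writes consume their own budget, not the readers'.
        let _permit = self.write_bw.acquire().await.unwrap();
        // ...issue the write...
    }
}

#[tokio::main]
async fn main() {
    let io = DuplexIo::new(64, 16); // illustrative budgets
    tokio::join!(io.read(), io.write());
}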
23. Reducing Complexity
● Superdisk makes for a system that’s fast, scalable, durable… and complex
● Automation + logical simplicity: can we reduce the number of states the system can be in?
● Bias towards removing/replacing nodes in an unknown scenario
● Write bitmap removal: having it gives faster resyncs, but if a failure ends in either a minuscule resync or a full rebuild, then what advantage does it have?
24. Preparing the RAID
● Preflight script runs before ScyllaDB startup - in charge of giving the green light
● Brand new node - provision RAID, join the cluster
● Data on PD but not NVMe - unclean shutdown, wait for the RAID to recover
● Data on both PD and NVMe - health check the RAID
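That decision tree is small enough to write down as code. A hedged sketch in Rust, with hypothetical type and action names (the actual preflight script isn't shown in the talk):

// Hypothetical model of the preflight decision described above.
#[derive(Debug)]
enum DiskState {
    BrandNewNode,       // no data on either disk
    PersistentDiskOnly, // data on PD but not NVMe: unclean shutdown
    BothDisks,          // data on both PD and NVMe
}

#[derive(Debug)]
enum Action {
    ProvisionRaidAndJoinCluster,
    WaitForRaidRecovery,
    HealthCheckRaid,
}

fn preflight(state: DiskState) -> Action {
    match state {
        DiskState::BrandNewNode => Action::ProvisionRaidAndJoinCluster,
        DiskState::PersistentDiskOnly => Action::WaitForRaidRecovery,
        DiskState::BothDisks => Action::HealthCheckRaid,
    }
}

fn main() {
    // e.g. a node whose NVMe contents were lost on a host error
    println!("{:?}", preflight(DiskState::PersistentDiskOnly));
}

Enumerating every state the node can boot into is exactly the “reduce the number of states” idea from the previous slide.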
25. Read More
How Discord Supercharges Network Disks for Extreme Low Latency
26. The Groundwork
● Data service library - optimized database accesses
● Superdisk storage topology - fast + reliable
Meanwhile, Cassandra is a high-toil system…
28. Plan v1
Goal: Get value quickly because Cassandra is giving us literal headaches
● Try to use ScyllaDB for recent data and migrate historical data behind it
● Set up ScyllaDB’s Spark migrator, which requires a lot of tuning
ETA: 3+ months
31. Plan v2
Goal: Get value even faster than Plan v1
● Rewrite it in Rust!
● Reimplement the data migrator in an afternoon via our data service code (sketched below)
ETA: 9 days
Up to 3.2 million messages per second!
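The talk doesn't show the migrator itself, but its shape is easy to sketch: split the token ring into chunks and copy each chunk in its own Tokio task, reading from Cassandra and writing to ScyllaDB over the same CQL driver. Everything below is illustrative (cluster addresses, chunk count, the pre-1.0 scylla-driver API, and the glossed-over paging and retry logic), not Discord's code:

use std::sync::Arc;
use scylla::{IntoTypedRows, Session, SessionBuilder};

type DynError = Box<dyn std::error::Error + Send + Sync>;

// Copy one token-range chunk of the messages table. Sketch only: real code
// would page through results, batch writes, and retry failures.
async fn copy_chunk(src: Arc<Session>, dst: Arc<Session>, lo: i64, hi: i64) -> Result<(), DynError> {
    let result = src
        .query(
            "SELECT channel_id, bucket, message_id, author_id, content FROM messages \
             WHERE token(channel_id, bucket) >= ? AND token(channel_id, bucket) <= ?",
            (lo, hi),
        )
        .await?;
    for row in result
        .rows
        .unwrap_or_default()
        .into_typed::<(i64, i32, i64, i64, String)>()
    {
        dst.query(
            "INSERT INTO messages (channel_id, bucket, message_id, author_id, content) \
             VALUES (?, ?, ?, ?, ?)",
            row?,
        )
        .await?;
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), DynError> {
    let src = Arc::new(SessionBuilder::new().known_node("cassandra:9042").build().await?);
    let dst = Arc::new(SessionBuilder::new().known_node("scylla:9042").build().await?);

    // Split the token ring into equal chunks and migrate them concurrently.
    let (chunks, span) = (1024i128, (u64::MAX as i128) + 1);
    let mut tasks = Vec::new();
    for i in 0..chunks {
        let lo = (i64::MIN as i128 + i * span / chunks) as i64;
        let hi = (i64::MIN as i128 + (i + 1) * span / chunks - 1) as i64;
        tasks.push(tokio::spawn(copy_chunk(src.clone(), dst.clone(), lo, hi)));
    }
    for task in tasks {
        task.await??; // surface both task panics and query errors
    }
    Ok(())
}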
33. It’s Great!
● Quiet (knock on wood)!
● Better storage density - less than half the nodes of Cassandra
● 53% reduction in disk utilization
● Much more consistent latencies