Distributed systems are usually optimized with particular workloads in mind. At the same time, a system should still behave in a sane way when those workload assumptions do not hold - notably, one user shouldn't be able to ruin performance for the whole system. Buggy components can be a source of overload as well, so it is worth considering overload protection on a per-component basis. For example, ScyllaDB's shared-nothing architecture gives it great scalability, but at the same time makes it prone to the "hot partition" problem: a single partition accessed with disproportionate frequency can ruin performance for other requests handled by the same shards. This talk describes how we implemented per-partition rate limiting, which reduces the performance impact in such cases, and how we reduced the CPU cost of handling failed requests such as timeouts (spoiler: it's about C++ exceptions).
2. Piotr Dulikowski
■ Holds a BA and an MSc in Computer Science from the University of Warsaw
■ Involved in the development of several ScyllaDB features, including CDC and per-partition rate limiting
■ Maintainer of the ScyllaDB Rust Driver
4. ■ A ScyllaDB cluster consists of multiple nodes
■ Each node is divided into shards (CPU core + part of RAM)
■ Shards within a node handle separate data (shared-nothing architecture)
■ Data is split into partitions
■ A partition consists of rows sharing the same partition key
■ Each partition has a subset of nodes called replicas, responsible for storing the partition
■ Requests can be handled by any node/shard, but the coordinator has to contact the replicas
Data Distribution in ScyllaDB
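The replica-selection idea above can be sketched with a simple token ring. This is illustrative only - names and the one-token-per-node assumption are mine, not ScyllaDB's actual placement logic (which uses vnodes and pluggable replication strategies):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative token-ring sketch, not ScyllaDB's implementation:
// each node owns one token; a partition is replicated on the `rf`
// nodes found by walking clockwise from the partition's token.
std::vector<int> replicas_for(uint64_t token,
                              const std::vector<std::pair<uint64_t, int>>& ring,
                              int rf) {
    // `ring` is sorted by token; find the first owner at or past `token`
    auto it = std::lower_bound(ring.begin(), ring.end(),
                               std::make_pair(token, 0));
    std::vector<int> out;
    for (int i = 0; i < rf; i++) {
        if (it == ring.end()) {
            it = ring.begin();  // wrap around the ring
        }
        out.push_back(it->second);
        ++it;
    }
    return out;
}
```

With replication factor 2, every partition lands on two adjacent ring positions - which is why a hot partition also degrades other partitions that share those replicas.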
5. Each partition has a limited amount of computing resources assigned to it, and it's easy to exhaust them if the workload becomes too unbalanced.
Partitions whose replicas intersect with the hot partition's replicas will be affected, too.
The “Hot Partition” Problem
■ Keep in mind what your expected workload looks like
■ Hot partitions may appear due to badly chosen schema
■ ScyllaDB won’t fix those issues for you - schema is your responsibility
Choose Appropriate Schema
7. It makes sense to optimize your schema for the common case. What about the "uncommon case"?
You can always encounter:
■ Malicious/misbehaving users
■ Parts of your system going awry due to bugs
The system does not have to satisfy these requests, but they should not affect the rest of the system too much.
It’s Not Always About Bad Schema
8. ■ Requests will start piling up on overloaded shards
■ When latency exceeds the request timeout, most of the work is wasted
■ We can reject some requests early
■ Accept only as much as we can comfortably handle
■ Rejecting some requests early leaves more resources for handling the remaining ones
How to Retain Goodput?
9. A maximum read/write rate can be set for a table. ScyllaDB will reject some operations in an effort to keep the rate of successful requests under the limit.
Per-Partition Rate Limiting
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
    'max_writes_per_second': 100,
    'max_reads_per_second': 200
};
11. ■ A shard tracks a "hit count" for tuples of (token, table name, operation type)
■ Every second, all counters are halved
■ Assuming a steady rate of X ops/s, a counter will eventually oscillate between X and 2X
https://github.com/scylladb/scylladb/blob/master/db/rate_limiter.hh
Measurements on Replica Side
(token, table, operation type)          counter
2c042489794ad03b, 'table1', 'write'     100
6fc6353cbcd7808,  'table1', 'read'      2
3ea0c947c5fcd34e, 'table2', 'read'      1
12. The coordinator increments the counter relevant to the operation and chooses to reject with some probability.
■ If the operation is accepted, it proceeds as usual and the replicas increment their counters
■ If the operation is rejected, communication with the replicas is skipped
Case: Coordinator is a Replica
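One way to derive such a rejection probability is sketched below. This is an illustrative formula, not ScyllaDB's exact one: since a counter oscillates between X and 2X for a steady X ops/s, anything above twice the limit is treated as excess:

```cpp
#include <cstdint>

// Illustrative sketch (not ScyllaDB's exact formula): when the tracked
// counter exceeds what a steady rate at the limit would produce (between
// X and 2X for X ops/s), reject with a probability chosen so that the
// accepted rate stays near the configured limit.
double rejection_probability(uint32_t counter, uint32_t limit) {
    if (counter <= 2 * limit) {
        return 0.0;  // within the limit: accept everything
    }
    return 1.0 - double(2 * limit) / double(counter);
}
```

For example, at 4x the limit the excess half of the traffic is rejected, bringing the accepted rate back toward the limit.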
13. The coordinator lets the replicas decide whether to accept or reject.
■ The coordinator chooses a random value and sends it to the replicas
■ Replicas compute the probability of rejection based on their counters and decide to reject based on the random value
■ Replicas should have counter values that are close to each other, so they usually reach the same decision
Case: Coordinator is not a Replica
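The shared-random-value trick can be sketched like this (function names and the probability formula are illustrative, not ScyllaDB's): because every replica compares its local probability against the same random value, replicas with similar counters reach the same decision without any extra coordination round-trip:

```cpp
#include <cstdint>

// Illustrative sketch: each replica turns its local counter into a
// rejection probability (here: excess over twice the limit) ...
double local_rejection_probability(uint32_t counter, uint32_t limit) {
    if (counter <= 2 * limit) return 0.0;
    return 1.0 - double(2 * limit) / double(counter);
}

// ... and compares it against the single random value chosen by the
// coordinator. Replicas with close counters decide identically.
bool replica_accepts(uint32_t local_counter, uint32_t limit,
                     double shared_random) {
    return shared_random >= local_rejection_probability(local_counter, limit);
}
```

If the replicas rolled their own dice instead, one could accept while another rejects, leaving the partition's replicas inconsistent for no benefit.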
14. On writes, all live replicas participate: all replica counters are updated, and every replica has a good estimate of the request rate for the partition.
On reads, not all replicas participate - the exact number depends on the replication factor and consistency level. This can lead to read operations being under-counted, but that is still fine for rate limiting.
Reads vs. Writes Accuracy
16. ■ People have mixed feelings about exceptions
■ They are a part of the language, and they are used in the standard library
■ …but they have some undesirable properties, e.g. hard-to-predict performance
■ We are using exceptions in ScyllaDB
■ Leads to more idiomatic code, and our framework supports them well
■ They aren’t a big problem, as long as you aren’t throwing them in large volumes
■ Throwing exceptions can be slow
■ It involves acquiring a global mutex, which does not scale
■ We worked around it, but had to disable caching in the process - throwing is now scalable, but slow
■ https://github.com/scylladb/seastar/blob/master/src/core/exception_hacks.cc
What’s Wrong with C++ Exceptions?
17. Seastar gives us flow-control constructs that do not use throwing underneath.
Exceptions can be stored in std::exception_ptr and passed around without throwing.
The problem is that the exception inside a std::exception_ptr must be rethrown in order to access it.
Exceptions in Seastar
future<> do_thing() {
    return really_do_thing().finally([] {
        std::cout << "Did the thing\n";
    });
}

future<> really_do_thing() {
    if (fail_flag) {
        return make_exception_future<>(
            std::runtime_error("oh no!"));
    } else {
        return make_ready_future<>();
    }
}
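The rethrow-to-inspect limitation mentioned above looks like this in portable C++. The function name is mine; the pattern itself is the standard one, and it pays the full cost of a throw just to find out what is stored:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// The only portable way to look inside a std::exception_ptr is to
// rethrow it and catch - paying the full (mutex-guarded) cost of a
// throw, which is exactly the cost we want to avoid.
std::string describe(const std::exception_ptr& ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::runtime_error& e) {
        return std::string("runtime_error: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("exception: ") + e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```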
18. Use Boost.Outcome's result to return the outcome (contains either a success value or an exception).
Use a custom container that allows inspecting the exception.
Results in portable code, but converting existing code is very tedious.
Approach 1: Avoid Them
future<result<>> do_thing() {
    return really_do_thing().then(
        [] (result<> res) -> result<> {
            if (res) {
                // handle success
            } else {
                // handle failure
            }
            return res;
        }
    );
}
19. Introduce an "exception_ptr inspector" function and replace existing try..catch blocks in a straightforward way.
Make sure that for everything else we use the existing tools.
Non-portable code, but much less work!
Approach 2: Implement Missing Parts
std::exception_ptr ep = get_exception();
if (auto* ex = try_catch<std::logic_error>(ep)) {
    // ...
} else if (auto* ex = try_catch<std::runtime_error>(ep)) {
    // ...
} else {
    // ...
}
Based on the C++ proposal:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1066r1.html
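For comparison, a portable equivalent of such an inspector can be written with a rethrow inside - which keeps the convenient interface but loses the whole performance benefit. This sketch is mine (it takes a callback instead of returning a pointer, to sidestep questions about the caught exception object's lifetime):

```cpp
#include <exception>
#include <stdexcept>

// Portable (but slow) counterpart of a non-portable try_catch: it
// rethrows internally to perform the type test, so it does not avoid
// the cost of throwing. Shown only to illustrate the intended semantics.
template <typename T, typename Fn>
bool try_catch_portable(const std::exception_ptr& ep, Fn&& fn) {
    try {
        std::rethrow_exception(ep);
    } catch (T& ex) {
        fn(ex);        // invoke the handler with the caught exception
        return true;
    } catch (...) {
        return false;  // stored exception is not a T
    }
}
```

A non-portable implementation achieves the same dispatch by inspecting the exception_ptr's type information directly, without unwinding - that is the "missing part" the slide refers to.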