Designing how to implement aggregates in a distributed database is a non-trivial task. When dealing with aggregates that scan the entire cluster, it is important to consider the performance impact: done poorly, full table scans can bring production systems to their knees. So how can you implement aggregate functions without hurting real-time availability and performance for other read/write operations? Learn how distributed aggregates were implemented in ScyllaDB to balance performance across large NoSQL distributed database clusters.
3. Why distribute aggregates?
How are aggregate queries executed currently?
■ The coordinator downloads all required data from the other nodes
■ Then it aggregates the whole table locally
Reasons to distribute:
■ This is slow and very inefficient
■ The entire workload lands on a single machine (see the example below)
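For example, a plain full-scan aggregate like the one below (keyspace and table names are illustrative) forces the coordinator to pull every matching row to itself before it can produce a single number:
SELECT COUNT(*) FROM ks.users;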
13. Distributing the aggregates
The parallelization of aggregates works in a pretty straightforward way:
■ The coordinator that receives the query becomes a super-coordinator
■ The super-coordinator splits the query into subqueries for the other nodes
● Each node receives a query only for the data it holds
● Each partition range appears in only one subquery, so no data is counted twice
■ Each participating node executes its subquery and returns a partial result
● A partial result is the state of the aggregate without the final step applied
■ The super-coordinator receives the partial results, reduces them and finalizes the value (subqueries are sketched below)
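Conceptually, each subquery is the original query restricted to one of the node's token ranges, roughly equivalent to the following (table name and token bounds are illustrative; in practice the ranges travel inside the internal request rather than as CQL text):
SELECT COUNT(*) FROM ks.users WHERE token(id) > -4611686018427387904 AND token(id) <= 0;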
15. Distributing aggregates
Scylla distributes the work not only at the cluster level: each node also splits the query across its shards. Every shard gets a query covering only its own data, so it doesn't have to communicate with the other shards.
All native aggregates have a distributed implementation, and Scylla uses it by default; users don't have to think about it (see the example after the list).
Native aggregates:
■ COUNT
■ SUM
■ AVG
■ MAX
■ MIN
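For example, a plain full-scan query like this one (keyspace, table and column names are illustrative) is transparently parallelized across nodes and shards with no change on the client side:
SELECT AVG(age) FROM ks.users;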
16. User-Defined Aggregates
UDA consists of:
■ Required fields
● State function
● State type
■ Optional fields
● Final function
● Initial condition
CREATE AGGREGATE my_aggregate(int)
SFUNC state_f
STYPE tuple<int, bigint>
FINALFUNC final_f
INITCOND (0, 0);
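The slide above only declares the aggregate; the function bodies are not shown. Below is a minimal sketch of what state_f and final_f could look like for an average-style aggregate, assuming the tuple state holds (count, sum) and that Scylla's Lua bindings expose a CQL tuple as a 1-indexed Lua table. The bodies are illustrative, not taken from the original deck:
CREATE FUNCTION state_f(state tuple<int, bigint>, val int)
RETURNS NULL ON NULL INPUT
RETURNS tuple<int, bigint>
LANGUAGE lua
AS $$
-- illustrative layout: state[1] = row count, state[2] = running sum
return { state[1] + 1, state[2] + val }
$$;
CREATE FUNCTION final_f(state tuple<int, bigint>)
RETURNS NULL ON NULL INPUT
RETURNS bigint
LANGUAGE lua
AS $$
-- avoid division by zero on an empty table
if state[1] == 0 then return 0 end
return math.floor(state[2] / state[1])
$$;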
17. Distributed User-Defined Aggregates
UDA consists of:
■ Required fields
● State function
● State type
■ Optional fields
● Reduce function
● Final function
● Initial condition
CREATE AGGREGATE my_aggregate(int)
SFUNC state_f
STYPE tuple<int, bigint>
REDUCEFUNC reduce_f
FINALFUNC final_f
INITCOND (0, 0);
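Likewise, the reduce function merges two partial states. For the (count, sum) layout sketched earlier, reduce_f could look like this (again illustrative, under the same assumptions about the Lua tuple representation):
CREATE FUNCTION reduce_f(s1 tuple<int, bigint>, s2 tuple<int, bigint>)
RETURNS NULL ON NULL INPUT
RETURNS tuple<int, bigint>
LANGUAGE lua
AS $$
-- merge two partial states by adding their counts and sums
return { s1[1] + s2[1], s1[2] + s2[2] }
$$;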
18. Example of UDA
How many people follow Elon Musk on Twitter?
Operations:
■ full scan of the table
■ iterate over each follows list
CREATE TABLE twitter_users (
id bigint,
name text,
follows list<bigint>,
PRIMARY KEY(id)
)
19. Example of UDA
CREATE FUNCTION count_if_follows_elon(acc bigint, follows list<bigint>)
RETURNS NULL ON NULL INPUT
RETURNS bigint
LANGUAGE lua
AS $$
for _, f in ipairs(follows) do
if f == 100 then -- 100 is assumed to be Elon Musk's user id in this data set
return acc + 1
end
end
return acc
$$
CREATE FUNCTION reduce_followers_count(acc1 bigint, acc2 bigint)
RETURNS NULL ON NULL INPUT
RETURNS bigint
LANGUAGE lua
AS $$
return acc1 + acc2
$$
20. Example of UDA
CREATE AGGREGATE count_elon_followers(list<bigint>)
SFUNC count_if_follows_elon
STYPE bigint
REDUCEFUNC reduce_followers_count
INITCOND 0
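With the aggregate defined, answering the question becomes a single full-scan query that Scylla now distributes across nodes and shards:
SELECT count_elon_followers(follows) FROM twitter_users;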
21. Reducing the results
■ Each coordinator reduces the results from its shards and sends the reduced result to the super-coordinator
■ The super-coordinator is responsible for reducing the partial results from the other coordinators
■ Reduction is done by executing the proper reduce function
■ All of the partial results are stored in memory (for now!)
22. What if the subquery fails?
The parallelization service provides a retry mechanism in case a subquery fails.
■ When a subquery fails
● the super-coordinator acts as its coordinator …
● and executes it itself
This way the query is more likely to succeed, for instance by selecting another available replica.
23. Limitations
The main limitation is that only queries without WHERE and GROUP BY clauses can be distributed.
Because a SELECT statement can't be serialized directly, it is represented internally as a forward_request:
struct forward_request {
    struct aggregation_info {
        db::functions::function_name name;   // aggregate function to execute
        std::vector<sstring> column_names;   // columns the aggregate reads
    };
    query::read_command cmd;                 // the read to run on each node
    dht::partition_range_vector pr;          // partition ranges covered by this subquery
    db::consistency_level cl;                // consistency level of the original query
    lowres_clock::time_point timeout;        // deadline for the request
    std::vector<aggregation_info> aggregation_infos;  // one entry per aggregate in the SELECT
};
25. Benchmarks
Specification:
■ 3-node cluster on AWS
■ Instances: i3.4xlarge (16 vCPUs, 112 GiB memory)
■ Each result is the average of 10 queries
■ Queries executed one by one with the “BYPASS CACHE” option (example query below)
Data:
■ 50 million rows
■ “Twitter accounts” table
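The exact benchmark statements aren't shown in the deck; presumably they were full-scan aggregates like the ones above. An illustrative example (table name hypothetical):
SELECT COUNT(*) FROM twitter_accounts BYPASS CACHE;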