Ever wondered just how many CPU cores of KSQL Server you need to provision to handle your planned stream-processing workload? Or how many gigabits of aggregate network bandwidth, spread across some number of processing threads, you'll need to handle the combined peak throughput of multiple queries? In this talk we'll first explore the basic drivers of KSQL throughput and hardware requirements, building up to more advanced query-plan analysis and capacity-planning techniques, and review some real-world testing results along the way. Finally, we'll recap how and what to monitor to confirm you got it right!
7.
Anatomy of a KSQL Query
Tuning goals
Performance Factors
What to Monitor
Rules of Thumb
8.
Apps, CPUs, Topologies and Threads, Oh My!
● Every KSQL continuous query results in a Kafka Streams Application
● An Application has a Topology…
● …which may have sub-topologies…
● …which are executed on StreamThreads
10.
Topologies, Tasks, & Partitions
• Topologies are divided into sub-topologies at read-write boundaries
- Read-process-write loop
• Within a sub-topology, tasks are created to match the maximum partition count across its input topics
- If there are multiple input topics, they are co-processed, e.g. for joins
- Internal topics, such as *-rekey ones, are counted too
• Each task is assigned to at most one StreamThread
- A StreamThread results in at least 3 JVM threads being created
- A StreamThread has its own Consumer and Producer instance
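The task arithmetic above can be sketched in a few lines. This is a hedged illustration, not Kafka Streams' actual assignment code; the topic names and partition counts are hypothetical.

```python
# Sketch of the task-count rule described above: each sub-topology gets
# one task per partition of its widest input topic.
# Topic names and partition counts below are hypothetical.

def tasks_for_subtopology(input_partitions):
    """Task count = max partition count across the sub-topology's inputs."""
    return max(input_partitions.values())

# A join sub-topology co-processes its (co-partitioned) inputs,
# including internal topics such as *-rekey ones.
sub_topology_1 = {"clickstream": 8, "clickstream-rekey": 8}
sub_topology_2 = {"users": 4}

total_tasks = (tasks_for_subtopology(sub_topology_1)
               + tasks_for_subtopology(sub_topology_2))
print(total_tasks)  # 12 tasks, each assigned to at most one StreamThread
```

Since each task goes to at most one StreamThread, running more threads than your total task count buys nothing.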
22.
State Stores (RocksDB)
Tables - consider key-space cardinality and message size
Joins - join type, join windows
Aggregates - window sizes, group cardinality
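A back-of-envelope sizing sketch for the factors listed above. These formulas are rough upper bounds: they ignore RocksDB index overhead and compression, and every input value is hypothetical.

```python
# Rough state-store sizing from cardinality, message size, and windows.
# Ignores RocksDB overhead and compression; all inputs are hypothetical.

def table_state_bytes(key_cardinality, avg_value_bytes):
    # Non-windowed table: one entry per distinct key.
    return key_cardinality * avg_value_bytes

def windowed_agg_state_bytes(group_cardinality, avg_entry_bytes,
                             window_size_ms, retention_ms):
    # Windowed aggregate: one entry per group per retained window.
    windows_retained = retention_ms // window_size_ms
    return group_cardinality * avg_entry_bytes * windows_retained

# 10M distinct keys x 320-byte values -> roughly 3 GiB of table state
print(table_state_bytes(10_000_000, 320) / 2**30)
```

The same arithmetic explains why window size and retention dominate windowed-aggregate state: halving the window size doubles the retained entries.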
23.
Fault-Tolerance, powered by Kafka
A key challenge of distributed stream processing is fault-tolerant state. State is automatically migrated in case of server failure.
(Diagram: the KSQL / Kafka Streams app performs a “streaming backup” of Server A’s local state to a changelog topic in Kafka; on failure, a “streaming restore” replays that changelog to Server B.)
Server A: “I do stateful stream processing, like tables, joins, aggregations.”
Server B: “I restore the state and continue processing where server A stopped.”
24.
Some Measurements
● KSQL Servers – i3.xlarge
○ 4 vCPUs
○ 30.5 GB memory
○ “up to 10Gbit network” (experimentally measured at ~1.2 Gb/s full-duplex baseline)
○ 200GB EBS SSD
● JVM Settings
○ Heap size 16GB (~50% of RAM, to leave space for state-stores)
25.
Test Highlights
• Simple project query (“speed-of-light”)
• CREATE STREAM foo AS SELECT * FROM bar;

# Queries | msg/s | MB/s  | msg size | CPU % | MB Mem Max
2         | 193k  | 59.14 | 320      | 99.19 | 18,949
10        | 189k  | 57.67 | 320      | 99.74 | 20,101
20        | 175k  | 53.43 | 320      | 99.68 | 23,377
50        | 168k  | 51.37 | 320      | 96.61 | 28,291

• 4 cores can’t saturate a 1Gb network link in this test (but larger messages get close)
26.
Test Highlights
• Simple project query (“speed-of-light”)
• CREATE STREAM foo AS SELECT * FROM bar;

# Queries | # Servers | msg/s  | msg/s/host | MB/s | CPU %
2         | 1         | 193k   | 193k       | 59   | 99
2         | 3         | 585k   | 195k       | 179  | 96
2         | 10        | 1,855k | 185k       | 567  | 96

Message throughput scales with server count (same query, same data, msg-size = 300 bytes)
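Since throughput scaled near-linearly in this test, a first-cut host count for a target rate is simple division. A hedged sketch: the per-host figure is the measured 10-server result from the table above, and the linear-scaling assumption was only validated up to 10 servers.

```python
import math

# Rough host count for a target throughput, assuming the near-linear
# scaling measured above continues to hold. The per-host default is the
# measured ~185k msg/s/host figure for this specific query and data.

def servers_needed(target_msgs_per_sec, per_host_msgs_per_sec=185_000):
    return math.ceil(target_msgs_per_sec / per_host_msgs_per_sec)

print(servers_needed(1_000_000))  # 6 hosts for ~1M msg/s of this query
```

In practice you would re-measure the per-host baseline for your own query shape and message size before applying this.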
27.
CREATE STREAM vip_actions AS
SELECT userid, page, action, zipcode
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
28.
Test Highlights
• Stream-Table join
• Runs at ~50% of the throughput of the simple project query

# Queries | msg/s | MB/s | msg size | CPU % | MB Mem Max
2         | 88k   | 26   | 314      | 99.8  | 18,022
10        | 80k   | 24   | 314      | 99.8  | 19,931
29.
Further Results
• A non-windowed aggregate on the same data ran at ~47k msgs/sec
• A windowed aggregate ran at ~24k msgs/sec (varies with window params)
• Re-partitioning (an extra internal write and re-read) reduces these numbers further
31.
Take-Aways (1)
• Establish c: your baseline throughput for a simple select query on your own hardware and data
• Project and filter queries are cheap and fast
• Joins are slower, aggregates more so
• If select throughput (c) is 100%, then
• Joins run at about 50% of c
• Aggregates run at about 25%
• Windowed aggregates run at ~10-15%
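The rules of thumb above can be turned into a first-pass estimator. A hedged sketch: the fractions come straight from the slide, the windowed-aggregate value is a midpoint of the stated 10-15% range, and the baseline c here is the hypothetical single-host figure from the earlier measurements.

```python
# Apply the rule-of-thumb fractions above to a measured baseline c.
# Fractions are from the slide; 0.12 approximates the 10-15% range.

FRACTION_OF_C = {
    "project_filter": 1.00,
    "join": 0.50,
    "aggregate": 0.25,
    "windowed_aggregate": 0.12,
}

def expected_msgs_per_sec(c, query_kind):
    return c * FRACTION_OF_C[query_kind]

c = 193_000  # e.g. a measured speed-of-light throughput on one host
print(expected_msgs_per_sec(c, "join"))  # ~96.5k msg/s
```

These are planning heuristics, not guarantees: repartitioning, window parameters, and message size can all pull a real query well below its bracket.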
32.
Take-Aways (2)
• (de)serialization is the most expensive part of any query
• Use Avro message format
• Start with 4 CPU cores for “serious” message volumes
• Use SSD for any state stores (speed > size)