Confluent Cloud runs a modified version of Apache Kafka - redesigned to be cloud-native and deliver a serverless user experience. In this talk, we will discuss key improvements we've made to Kafka and how they contribute to Confluent Cloud availability, elasticity, and multi-tenancy. You'll learn about innovations that you can use on-prem, and everything you need to make the most of Confluent Cloud.
3. Cloud Native Data Systems
Aurora DynamoDB Kinesis
S3 Spanner Snowflake
4. What is a Cloud-Native Data System?
ELASTIC
USAGE-BASED COST MODEL
INFINITE
API-DRIVEN OPERATIONS
SECURE AND RELIABLE
SERVERLESS
GLOBAL
MULTITENANT
5. Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
Confluent Cloud
6. “I could run Apache Kafka in the
cloud myself. How is Confluent
Cloud any different?”
Some Dude, Highly Inflated Engineering Title
8. “Downtime” is the total accumulated minutes during a
calendar month for a given Confluent Cloud Service cluster
during which the entire cluster is unavailable.
A minute is considered unavailable for a given cluster if all
continuous attempts by Confluent’s monitoring system to
write to the cluster within the minute fail. Confluent’s
monitoring system connects to the same endpoints that
Customer uses.
SLA is what the lawyers wrote
9. Kafka Healthcheck - How it works
1. Topic with N partitions
2. Leader on each broker
3. Each producer attempts to produce 100 events per minute to its partition
4. Consumer consumes all of them
5. SLI: Success rate and latencies are reported
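A minimal sketch of how the SLI for one partition-minute could be computed, together with the "unavailable minute" rule from the SLA slide (a minute counts as down only if every write attempt in it fails). The names `ProduceAttempt`, `minute_sli`, and `minute_is_down` are hypothetical, and the choice of average and maximum latency is an assumption; the slides only say that success rate and latencies are reported.

```python
from dataclasses import dataclass


@dataclass
class ProduceAttempt:
    """One healthcheck produce attempt (hypothetical record shape)."""
    partition: int
    succeeded: bool
    latency_ms: float


def minute_sli(attempts):
    """Summarize one minute of produce attempts for a single partition."""
    ok = [a for a in attempts if a.succeeded]
    latencies = [a.latency_ms for a in ok]
    return {
        "success_rate": len(ok) / len(attempts) if attempts else 0.0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else None,
        "max_latency_ms": max(latencies) if latencies else None,
    }


def minute_is_down(attempts):
    """SLA rule: the minute is unavailable only if all write attempts failed."""
    return len(attempts) > 0 and all(not a.succeeded for a in attempts)


if __name__ == "__main__":
    # 100 attempts in the minute, one of which failed.
    minute = [ProduceAttempt(0, True, 10.0) for _ in range(99)]
    minute.append(ProduceAttempt(0, False, 0.0))
    print(minute_sli(minute))
    print(minute_is_down(minute))
```

Note the gap between the two functions: a minute with a 1% success rate hurts the SLI badly but does not count as SLA downtime.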
[Diagram: Brokers 0-3 each lead one partition of hc_topic (partitions 0-3). The Kafka Healthcheck service contains Producers 0-3, a Consumer, a Topic Check, a TLS Check, and a Metric Reporter.]
10. Kafka Healthcheck - Broker crashed
[Diagram: Broker 0 is marked with an X and no longer holds hc_topic partition 0; Brokers 1-3 still hold partitions 1-3. The Kafka Healthcheck service contains Producers 0-3, a Consumer, a Topic Check, a TLS Check, and a Metric Reporter.]
We report on customer experience.
If leader election was successful, there is no downtime.
11. Kafka Healthcheck - How it works
If the leader is unreachable, we alert, even if “Kafka is fine”.
[Diagram: Brokers 0-3 each hold their hc_topic partition (0-3), with an X marking the unreachable leader. The Kafka Healthcheck service contains Producers 0-3, a Consumer, a Topic Check, a TLS Check, and a Metric Reporter.]
12. SLO is the thing that wakes you up at night
1. Produce success < 100 for over 5 min
2. Consume success < 100 for over 5 min
3. Produce latency > 500ms for over 30 min
4. Failure to list topics
5. TLS certificate with less than 30 days till expiry
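The alert conditions on this slide can be sketched as a small rule evaluator. This is illustrative only: `MinuteSample`, `sustained`, and `evaluate_alerts` are hypothetical names, "success < 100" is read as fewer than 100 successful events out of the 100 attempted per minute, and "for over N min" is approximated as N consecutive one-minute samples.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MinuteSample:
    """Healthcheck results for one minute (hypothetical shape)."""
    produce_success: int       # successful produces out of 100 attempted
    consume_success: int       # successfully consumed events out of 100
    produce_latency_ms: float


def sustained(samples, minutes, predicate):
    """True if the last `minutes` samples exist and all satisfy `predicate`."""
    window = samples[-minutes:]
    return len(window) == minutes and all(predicate(s) for s in window)


def evaluate_alerts(samples, list_topics_ok, cert_expiry, now):
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    if sustained(samples, 5, lambda s: s.produce_success < 100):
        alerts.append("produce_success")
    if sustained(samples, 5, lambda s: s.consume_success < 100):
        alerts.append("consume_success")
    if sustained(samples, 30, lambda s: s.produce_latency_ms > 500):
        alerts.append("produce_latency")
    if not list_topics_ok:
        alerts.append("list_topics")
    if cert_expiry - now < timedelta(days=30):
        alerts.append("tls_expiry")
    return alerts


if __name__ == "__main__":
    now = datetime(2021, 1, 1)
    degraded = [MinuteSample(99, 100, 20.0)] * 5
    print(evaluate_alerts(degraded, True, now + timedelta(days=90), now))
```

A single bad minute does not page anyone in this sketch; only a sustained breach does, which matches the "for over 5 min" wording.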
13. Is this enough?
1. Tiny messages, low throughput
2. No consumer group rebalances
3. Blind to issues on a subset of customer partitions
We have multiple other monitoring systems that cover the healthcheck “blind spots”.
15. Mothership
Confluent Cloud Control Plane
[Diagram components: Gateway, Scheduler, Mothership, RDS, Kafka spec service, Let's Encrypt, DNS (Route 53, GCP DNS, Azure DNS). Per network region: Customer Data Plane, NLB, Envoy, K8s satellite, sync, Operator. Flow label: customer control plane request.]
https://www.youtube.com/watch?v=ss5OEBejFCs
16. Mothership
Example: Provision 2 CKU
[Diagram components: Gateway, Scheduler, Mothership, RDS, spec service, Let's Encrypt, DNS (Route 53, GCP DNS, Azure DNS). Per network region: NLB, Envoy, K8s satellite, sync, Operator.]
Flow labels on the diagram, in order:
- “I want 2 CKU with high availability”
- Auth, ratelimit
- Pod count and placement, network config
- Image and version, resource limits, kafka configs
- Cluster configuration
- Physical config
- Dynamic config
18. Elasticity Flavor #1 - Basic/Standard
19. Getting 100 MB/s
1. Use ~100 partitions
2. Use ~4 clients at ~30 MB/s each
3. --producer-props acks=all linger.ms=10
4. Use defaults for everything else. Especially request.timeout.ms
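As a rough sketch of the recipe above, the helper below builds a `kafka-producer-perf-test` command line for one client. Only the `acks=all` and `linger.ms=10` properties come from the slide. The topic name, the 1000-byte record size, the 60-second duration, the `client.properties` file, and the 1 MB = 1,000,000 bytes conversion are all assumptions for illustration, and `perf_test_command` is a hypothetical helper.

```python
def perf_test_command(topic, target_mb_per_sec, record_size=1000,
                      duration_sec=60, config_file="client.properties"):
    """Build an argv list for one kafka-producer-perf-test client."""
    # --throughput is expressed in records per second, not MB/s.
    records_per_sec = int(target_mb_per_sec * 1_000_000 / record_size)
    return [
        "kafka-producer-perf-test",
        "--topic", topic,
        "--num-records", str(records_per_sec * duration_sec),
        "--record-size", str(record_size),
        "--throughput", str(records_per_sec),
        # From the slide; everything else stays at client defaults.
        "--producer-props", "acks=all", "linger.ms=10",
        # Assumed file holding bootstrap servers and API key credentials.
        "--producer.config", config_file,
    ]


if __name__ == "__main__":
    # Four clients at 30 MB/s each attempt 120 MB/s in aggregate.
    for _ in range(4):
        print(" ".join(perf_test_command("benchmark-topic", 30)))
```

Running several moderately loaded clients rather than one saturated client is the point of step 2.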
20. Tail at Scale
Throughput (MB/s)   | avg latency (ms) | max latency (ms)
30                  | 43               | 670
60                  | 89               | 858
90                  | 739              | 9836
100 (attempted 120) | 1362             | 11367
21. Elasticity flavor #2 - Dedicated
22. Same benchmark as before, but...
Producer: 24.79 MB/sec, 16.88 ms avg latency, 1851.00 ms max latency, 7 ms 50th, 57 ms 95th, 211 ms 99th, 566 ms 99.9th
If you need even lower latencies, add CKU.
Clusters are elastic
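To compare runs like the one above, the producer summary line can be parsed into numbers. A minimal sketch, assuming the summary always lists the fields in the order shown on this slide; `parse_summary` is a hypothetical helper, not part of any Kafka tool.

```python
import re

# Matches the producer summary fields in the order they appear on the slide.
SUMMARY = re.compile(
    r"(?P<mb_per_sec>[\d.]+) MB/sec.*?"
    r"(?P<avg_ms>[\d.]+) ms avg latency.*?"
    r"(?P<max_ms>[\d.]+) ms max latency.*?"
    r"(?P<p50_ms>\d+) ms 50th.*?"
    r"(?P<p95_ms>\d+) ms 95th.*?"
    r"(?P<p99_ms>\d+) ms 99th.*?"
    r"(?P<p999_ms>\d+) ms 99\.9th",
    re.DOTALL,  # tolerate the summary being wrapped across lines
)


def parse_summary(text):
    """Extract throughput and latency figures from a producer summary."""
    match = SUMMARY.search(text)
    if match is None:
        raise ValueError("no producer summary found")
    return {name: float(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    line = ("24.79 MB/sec, 16.88 ms avg latency, 1851.00 ms max latency, "
            "7 ms 50th, 57 ms 95th, 211 ms 99th, 566 ms 99.9th")
    stats = parse_summary(line)
    print(stats)
    # Tail amplification: how much worse the 99th percentile is than the median.
    print(stats["p99_ms"] / stats["p50_ms"])
```

Tracking the ratio between tail and median latency, rather than the average alone, makes the kind of degradation shown on the "Tail at Scale" slide visible early.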
23. Confluent Cloud is not a race car
25. Logical Clusters
Authentication
API keys used to authenticate belong to one logical cluster.
Private Namespace
This includes:
- Topics
- Consumer groups
- ACLs
- Metrics
- “Broker configuration”
Limits
● Throughput
● Partitions
● Connections
● Connection rate
● Max message size
● Request size
● Backpressure
● Create/delete topics
26. Interceptor - list topics
[Diagram: broker network layer with network threads, processor threads, request queue, and response queue. Labels: tenant identifier (shown twice), modified responses, + tenant metrics.]
27. Interceptor - add topic
[Diagram: broker network layer with network threads, processor threads, request queue, and response queue. Labels: tenant identifier (shown twice). Validate: Are you below partition limit?]
28. Interceptor - alter config
[Diagram: broker network layer with network threads, processor threads, request queue, and response queue. Labels: tenant identifier (shown twice). Validate: Are tenants allowed? Is the new value allowed?]
29. Interceptor - controlled shutdown
[Diagram: broker network layer with network threads, processor threads, request queue, and response queue. Labels: tenant identifier (shown twice).]
LOL NOPE
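The four interceptor slides can be summarized in one hedged sketch. Everything here is illustrative: `TenantInterceptor` and `RequestRejected` are hypothetical names, and isolating tenants by prefixing resource names with the tenant identifier is an assumption about how the "modified responses" are produced; the slides only show that a tenant identifier is attached and that requests are validated or refused.

```python
from dataclasses import dataclass


class RequestRejected(Exception):
    """Raised when a tenant request fails validation."""


@dataclass
class TenantInterceptor:
    tenant_id: str
    partition_limit: int
    allowed_configs: dict  # config name -> predicate on the new value
    partitions_in_use: int = 0

    @property
    def prefix(self):
        # Assumed naming scheme: "<tenant id>_<topic name>".
        return f"{self.tenant_id}_"

    def list_topics(self, all_topics):
        """Modified response: only this tenant's topics, prefix stripped."""
        return [t[len(self.prefix):] for t in all_topics
                if t.startswith(self.prefix)]

    def create_topic(self, name, partitions):
        """Validate: is the tenant still below its partition limit?"""
        if self.partitions_in_use + partitions > self.partition_limit:
            raise RequestRejected("partition limit exceeded")
        self.partitions_in_use += partitions
        return self.prefix + name

    def alter_config(self, name, value):
        """Validate: may tenants change this config, and is the value allowed?"""
        check = self.allowed_configs.get(name)
        if check is None:
            raise RequestRejected(f"tenants may not alter {name}")
        if not check(value):
            raise RequestRejected(f"value {value!r} not allowed for {name}")
        return name, value

    def controlled_shutdown(self):
        """Tenants never get to shut down a shared broker."""
        raise RequestRejected("controlled shutdown is not available to tenants")


if __name__ == "__main__":
    seven_days_ms = 7 * 24 * 60 * 60 * 1000
    interceptor = TenantInterceptor(
        tenant_id="lkc-abc",  # hypothetical logical cluster id
        partition_limit=10,
        allowed_configs={"retention.ms": lambda v: 0 < int(v) <= seven_days_ms},
    )
    print(interceptor.list_topics(["lkc-abc_orders", "lkc-xyz_orders"]))
    print(interceptor.create_topic("payments", 6))
```

Because every check runs before the request reaches the rest of the broker, each tenant sees what looks like a private cluster while sharing the same physical brokers.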
30. Limits - we use Kafka Quotas
[Diagram: broker network layer with network threads, processor threads, request queue, and response queue. Labels: tenant identifier (shown twice).]
Record request and check quota.
Add delay to response.
If throttled, mute channel.
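The record, check, and delay steps can be sketched with a simplified per-tenant rate quota. The delay formula (overshoot divided by quota, times the window length) follows the general shape of Apache Kafka's quota throttling, but this is a sketch under assumptions, not broker code: `TenantQuota` and `throttle_delay_ms` are hypothetical names, and a single non-expiring window stands in for Kafka's sliding sample windows.

```python
from collections import defaultdict


def throttle_delay_ms(observed_rate, quota, window_ms):
    """Delay needed to bring the observed rate back under the quota."""
    if observed_rate <= quota:
        return 0
    return int((observed_rate - quota) / quota * window_ms)


class TenantQuota:
    """Tracks bytes per tenant over one fixed window (simplified)."""

    def __init__(self, quota_bytes_per_sec, window_sec=10.0):
        self.quota = quota_bytes_per_sec
        self.window_sec = window_sec
        self.bytes_in_window = defaultdict(float)

    def record(self, tenant, nbytes):
        """Record a request and return the delay to add to its response."""
        self.bytes_in_window[tenant] += nbytes
        observed = self.bytes_in_window[tenant] / self.window_sec
        return throttle_delay_ms(observed, self.quota, self.window_sec * 1000)


if __name__ == "__main__":
    quota = TenantQuota(quota_bytes_per_sec=1000)
    # 15,000 bytes in a 10 s window is 1,500 B/s, 50% over quota.
    print(quota.record("tenant-a", 15_000))
    # A second tenant under its quota is not delayed.
    print(quota.record("tenant-b", 5_000))
```

A non-zero delay is where "mute channel" comes in: while the delay runs, the broker stops reading from that client's connection, so a throttled tenant cannot pile up further requests.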
31. Quotas! Same as in Apache Kafka but…
Quota type                   | In Apache Kafka           | In Confluent Cloud
Produce / consume throughput | KIP-13, KIP-219, KIP-257  | Per-tenant, cluster-wide, control-plane integration
Thread utilization           | KIP-124                   | Auto-tuned, based on workload and broker utilization
Connections                  | KIP-402                   | Pre-tuned based on benchmarks
Connection rate              | KIP-612                   | Pre-tuned, soon to be auto-tuned
Replication quota            | KIP-73, rarely used       | Always-on for smooth recovery and auto-balancing
Create / delete topic        | KIP-599                   | Pre-tuned based on benchmarks
There’s a talk for that:
https://videos.confluent.io/watch/1Vt6hGj7TLa2uSCKwdidzA
32. Goals of Multi-Tenant Quotas:
- Encourage efficiency
- Fair resource allocation
- Availability
34. Key Takeaways
Confluent Cloud is more than someone else’s Apache Kafka.
Provisioning automation, configuration management, specialized monitoring, elasticity and multi-tenancy are all unique capabilities.
User experience is more than GUI
Quotas, benchmark results, latencies, SLA, bug fixes applied in hours: these are all part of our user experience.
Getting the Most out of Confluent Cloud
Our pricing rewards efficiency.
- 100s of partitions
- 10s of clients
- Long-lived clients
- linger.ms=0
- acks=all
- Default client configuration (request.timeout.ms, retries, delivery.timeout.ms)