Kafka is one of the most important foundation services at Zendesk. It became even more crucial with the introduction of the Global Event Bus, which my team built to propagate events between Kafka clusters hosted in different parts of the world and between different products. As part of its rollout, we had to add mTLS support to all of our Kafka clusters (we have quite a few of them) to secure the propagation of events between clusters in different regions. It was quite a journey, but we eventually built a solution that is working well for us.
Things I will be sharing as part of the talk:
1. Establishing the use case/problem we were trying to solve (why we needed mTLS)
2. Building a Certificate Authority with open source tools (with self-signed Root CA)
3. Building helper components for both Kafka clients and brokers that generate certificates automatically and regenerate them before they expire, which lets us use a shorter TTL (Time To Live), a good security practice
4. Hot reloading regenerated certificates on Kafka brokers without downtime
5. What we built to rotate the self-signed root CA across the board, also without downtime
6. Monitoring and alerts on TTL of certificates
7. Performance impact of using TLS (along with why TLS affects Kafka’s performance)
8. What we are doing to drive adoption of mTLS among existing Kafka clients that use the PLAINTEXT protocol, by making onboarding easier
9. How this will become a base for other features we want, e.g. ACLs and rate limiting (by using the principal from the TLS certificate as the identity of clients)
21. SOLUTION ARCHITECTURE - AUTH MANAGER
PKI auth manager is a wrapper around consul-template
https://github.com/hashicorp/consul-template
● one-off script or a daemon
● render templates using Consul and Vault APIs
● regenerate secrets based on TTL (Time To Live)
● script execution on updates
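A minimal sketch of what "a wrapper around consul-template" could look like, covering the one-off and daemon modes from the slide. This is not the actual PKI auth manager: the function names and the config path are hypothetical, and the only consul-template flags assumed are `-config` and `-once`.

```python
import subprocess
from typing import List


def consul_template_command(config_path: str, once: bool) -> List[str]:
    """Build the consul-template command line for the given mode."""
    cmd = ["consul-template", f"-config={config_path}"]
    if once:
        # One-off mode: render the templates a single time and exit.
        cmd.append("-once")
    # Without -once, consul-template keeps running as a daemon and
    # re-renders templates when the underlying data changes.
    return cmd


def run_auth_manager(config_path: str, once: bool = False) -> int:
    """Run consul-template and return its exit code (hypothetical wrapper)."""
    result = subprocess.run(consul_template_command(config_path, once), check=True)
    return result.returncode
```

For example, `run_auth_manager("/etc/auth-manager/config.hcl", once=True)` would render certificates once, assuming a config file exists at that illustrative path and the `consul-template` binary is on the `PATH`.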
24. SOLUTION ARCHITECTURE - CA ROTATION
☐ global record of current root
☐ certificate regeneration on root changes
☐ broadcasting root changes
☐ reasonably fast rotation
☐ reload regenerated certificates
27. SOLUTION ARCHITECTURE - CA ROTATION - BLOCKING QUERIES
GET /v1/kv/path-to-consul-key?index=10&wait=1m0s
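The request above is a Consul blocking query: the call returns once the key's index moves past `index` or the `wait` time elapses. A rough sketch of a watcher loop built on it follows, assuming the new index is read from the `X-Consul-Index` response header. The function names, the base URL, and the `max_polls` parameter are illustrative and not taken from the talk.

```python
import urllib.request
from typing import Callable, Iterator, Optional, Tuple

Fetch = Callable[[str], Tuple[int, bytes]]


def blocking_query_url(base_url: str, key: str, index: int, wait: str = "1m0s") -> str:
    """Build the KV blocking-query URL shown on the slide."""
    return f"{base_url}/v1/kv/{key}?index={index}&wait={wait}"


def http_fetch(url: str) -> Tuple[int, bytes]:
    """Perform the request; the client timeout must exceed the wait time."""
    with urllib.request.urlopen(url, timeout=90.0) as resp:
        return int(resp.headers["X-Consul-Index"]), resp.read()


def watch_key(
    base_url: str,
    key: str,
    fetch: Fetch = http_fetch,
    max_polls: Optional[int] = None,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, body) each time the watched key changes."""
    index = 0  # index=0 makes the first query return immediately
    polls = 0
    while max_polls is None or polls < max_polls:
        new_index, body = fetch(blocking_query_url(base_url, key, index))
        polls += 1
        if new_index < index:
            # An index that goes backwards means state was reset; start over.
            index = 0
            continue
        if new_index != index:
            index = new_index
            yield index, body
        # Same index: the wait simply timed out, so poll again.
```

A consumer could iterate over `watch_key("http://localhost:8500", "path-to-consul-key")` and trigger certificate regeneration on every yielded change, which is one way a root change could be broadcast quickly without tight polling.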
28. SOLUTION ARCHITECTURE - CA ROTATION
To do CA rotation, we need:
✅ global record of current root
✅ certificate regeneration on root changes
✅ broadcasting root changes
✅ reasonably fast rotation
☐ reload regenerated certificates
30. SOLUTION ARCHITECTURE - CA ROTATION
To do CA rotation, we need:
✅ global record of current root
✅ certificate regeneration on root changes
✅ broadcasting root changes
✅ reasonably fast rotation
✅ reload regenerated certificates
38. PERFORMANCE IMPACT - REASONS
Some reasons behind the performance impact:
● mTLS network traffic overhead
● encryption and decryption are CPU intensive
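● no zero-copy optimisation (data must be encrypted in user space, so the broker cannot hand file contents straight to the socket)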
● performance of JVM SSL engine
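To illustrate the zero-copy point, here is a small sketch contrasting the two send paths. It is not Kafka code: `toy_cipher` is a stand-in XOR transform, not real TLS encryption, and all function names are hypothetical. The plaintext path lets the kernel move file bytes to the socket, while the encrypted path has to pull every chunk into the process first.

```python
import socket
from typing import BinaryIO, Callable


def send_plaintext(sock: socket.socket, log_file: BinaryIO) -> int:
    """Zero-copy style path: socket.sendfile uses os.sendfile where available."""
    return sock.sendfile(log_file)


def send_encrypted(
    sock: socket.socket,
    log_file: BinaryIO,
    encrypt: Callable[[bytes], bytes],
    chunk_size: int = 64 * 1024,
) -> int:
    """TLS-like path: read into user space, transform, then write."""
    sent = 0
    while True:
        chunk = log_file.read(chunk_size)
        if not chunk:
            return sent
        sock.sendall(encrypt(chunk))  # extra copy plus CPU work per chunk
        sent += len(chunk)


def toy_cipher(data: bytes) -> bytes:
    """Placeholder for encryption (XOR with a constant); illustration only."""
    return bytes(b ^ 0x5A for b in data)
```

The second path is the shape any encrypting sender is forced into, which is why the read, transform, and write steps show up as extra CPU and memory traffic on the broker.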
44. CLIENT ONBOARDING
How are we driving adoption?
✅ onboarding guides
✅ local Kafka cluster with TLS
✅ example clients configured with TLS
✅ pairing with early adopters
☐ tighter K8S integration
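As an idea of what an "example client configured with TLS" might contain, the sketch below builds the settings for a librdkafka-based client such as `confluent-kafka`. The talk does not say which client libraries were used, so the choice of library, the function names, and all paths and hostnames are assumptions. The property names are standard librdkafka configuration keys.

```python
from typing import Dict


def plaintext_client_config(bootstrap_servers: str) -> Dict[str, str]:
    """Settings for an existing client that still uses PLAINTEXT."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "PLAINTEXT",
    }


def mtls_client_config(
    bootstrap_servers: str, ca_path: str, cert_path: str, key_path: str
) -> Dict[str, str]:
    """Settings for a client that authenticates with a TLS client certificate."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SSL",
        # CA bundle used to verify the broker's certificate.
        "ssl.ca.location": ca_path,
        # Client certificate and private key presented to the broker (mTLS).
        "ssl.certificate.location": cert_path,
        "ssl.key.location": key_path,
    }
```

With `confluent-kafka` installed, the resulting dictionary could be passed as `Producer(mtls_client_config(...))`. Onboarding then comes down to switching the listener address and adding three file paths, with the certificate files themselves kept fresh by the auth manager.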
47. LESSONS LEARNT
Revocation check of public SSL certificates is widely broken
● Neglected tooling
● Kafka doesn’t support custom revocation checks
Two ways of checking revocation:
● Certificate Revocation List (CRL)
● Online Certificate Status Protocol (OCSP)
48. LESSONS LEARNT - REVOCATION
https://somedomain.com/kafka-pki/crl
-Dcom.sun.security.enableCRLDP=true -Dcom.sun.net.ssl.checkRevocation=true
Certificate chain of *.somedomain.com
0 .. Sectigo Limited….
1 .. The USERTRUST Network/CN=USERTrust RSA
2 .. CN=AddTrust External CA Root
3 .. CN=AddTrust External CA Root