Cloud is changing the world; Kubernetes is changing the world; real-time event streaming is changing the world. In this talk we explore some of the best practices for combining these paradigm shifts to achieve a much greater return on your Kafka investments. From declarative deployments, zero-downtime upgrades, and elastic scaling to self-healing and automated governance, learn how you can bring the next level of speed, agility, resilience, and security to your Kafka implementations.
3. ARCHITECTURE EVOLUTION
Messaging, Event Log
1990s: Monolith
2000s: Service Oriented Architecture
2010s: Microservices, Events, Containers, Serverless
The speed of doing business is increasing…
• Application delivery acceleration – CI/CD pipelines speed up
delivery, but bring an exponential increase in the quantity
(though not the complexity) of operations work.
• Kubernetes – version 1.0 released in July 2015; Site Reliability
Engineering (SRE) best practices published; massive
automation of systems administration tasks (self-healing).
• An Operator is an automated Site Reliability Engineer.
4. KUBERNETES OPERATOR PATTERN
1. Operators are custom controllers watching custom
resources
2. Allow Infrastructure Engineers and Developers to provide
application specific features to manage their site and software.
3. The logic needed to maintain, scale, and heal a specific piece of
software is encoded into an operator application that runs as a
container in the cluster
4. The code in the operator is responsible for more targeted and
advanced health detection and healing than can be achieved via
Kubernetes’ generic self-healing
5. Vendors are writing custom operators to make cloud-native
management of their software easy (see https://operatorhub.io/)
6. Confluent has created an Operator for Kafka
7. Other Kafka Operators
1. https://operatorhub.io/operator/banzaicloud-kafka-operator
2. https://operatorhub.io/operator/strimzi-kafka-operator
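The "custom controller watching custom resources" idea in the list above comes down to a reconcile step: compare the declared (desired) state with the observed state and derive the actions that close the gap. The following is a minimal sketch of that idea only. It does not use the Kubernetes API, and `ClusterState`, its fields, and the version strings are hypothetical names chosen for illustration, not part of any real operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, heavily simplified view of a Kafka cluster."""
    brokers: int
    version: str


def reconcile(desired: ClusterState, observed: ClusterState) -> List[str]:
    """Return the actions needed to move the observed state toward the desired state."""
    actions = []
    if observed.brokers < desired.brokers:
        actions.append(f"add {desired.brokers - observed.brokers} broker(s)")
    elif observed.brokers > desired.brokers:
        actions.append(f"remove {observed.brokers - desired.brokers} broker(s)")
    if observed.version != desired.version:
        actions.append(f"rolling upgrade {observed.version} -> {desired.version}")
    return actions


if __name__ == "__main__":
    # Illustrative values only.
    desired = ClusterState(brokers=3, version="v2")
    observed = ClusterState(brokers=2, version="v1")
    for action in reconcile(desired, observed):
        print(action)
```

A real operator would run this comparison continuously, triggered by changes to the custom resource, which is what makes the deployment declarative rather than imperative.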
5. CONFLUENT OPERATOR
› CLOUD NATIVE DEPLOYMENT ON KUBERNETES
› DECLARATIVE VS. IMPERATIVE SEMANTICS
› IMMUTABLE (CONFLUENT CERTIFIED DOCKER IMAGES)
› SELF-HEALING (CONTINUOUS)
› INFRASTRUCTURE AS CODE BEST PRACTICES (HELM, YAML)
› AUTOMATED DEPLOYMENT
› CERTIFIED IMAGES PULLED FROM CONTAINER REGISTRIES
› IMAGE SCANNING FOR VULNERABILITY
› CI/CD BEST PRACTICES (HELM CHARTS, JENKINS)
› AUTOMATED ROLLING UPGRADES (DOWNGRADES)
› STOP BROKER
› UPGRADE BINARIES
› PARTITION LEADER REASSIGNMENT
› START BROKER
› VERIFY ZERO UNDER-REPLICATED PARTITIONS
› ELASTIC SCALING
› KUBERNETES METRICS SERVER
› SPIN UP NEW BROKERS
› SPIN UP NEW CONNECT WORKERS
› SECURITY
› AUTOMATED CONFIGURATION OF TRUSTSTORES & KEYSTORES
› SECRETS MANAGEMENT
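The automated rolling upgrade steps listed above (stop broker, upgrade binaries, partition leader reassignment, start broker, verify zero under-replicated partitions) can be sketched as a loop that handles one broker at a time and refuses to move on until replication has caught up. This is a sketch under stated assumptions, not the Confluent Operator's implementation: every callback name is hypothetical, the step order simply follows the slide, and a real loop would sleep between polls.

```python
from typing import Callable, List


def rolling_upgrade(
    broker_ids: List[int],
    stop: Callable[[int], None],
    upgrade_binaries: Callable[[int], None],
    reassign_leaders: Callable[[int], None],
    start: Callable[[int], None],
    under_replicated_partitions: Callable[[], int],
    max_polls: int = 10,
) -> List[int]:
    """Upgrade brokers one at a time; return the ids that were upgraded."""
    upgraded = []
    for broker_id in broker_ids:
        stop(broker_id)
        upgrade_binaries(broker_id)
        reassign_leaders(broker_id)
        start(broker_id)
        # Gate: do not touch the next broker until no partition is under-replicated.
        for _ in range(max_polls):
            if under_replicated_partitions() == 0:
                break
        else:
            raise RuntimeError(
                f"broker {broker_id}: partitions still under-replicated after {max_polls} polls"
            )
        upgraded.append(broker_id)
    return upgraded


if __name__ == "__main__":
    # Simulated cluster: the first broker needs three polls to catch up.
    log = []
    counts = iter([2, 1, 0, 0, 0])
    result = rolling_upgrade(
        [0, 1, 2],
        stop=lambda b: log.append(f"stop {b}"),
        upgrade_binaries=lambda b: log.append(f"upgrade {b}"),
        reassign_leaders=lambda b: log.append(f"reassign leaders {b}"),
        start=lambda b: log.append(f"start {b}"),
        under_replicated_partitions=lambda: next(counts),
    )
    print(result)
    print(log)
```

Taking brokers one at a time and gating on under-replicated partitions is what keeps the upgrade zero-downtime: with a replication factor of 3, at most one replica of any partition is offline at a time.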
6. GOAL AND CHALLENGES
Operations
✓ Installation
✓ Upgrades
✓ Patches
✓ Rollbacks
✓ Elastic Scaling (up/down)
✓ Fault Tolerance
• Disaster Recovery
✓ Security (inflight, at rest)
• Logging, Monitoring, Alerting
✓ Secrets Management
• …
Application Development
• Application Onboarding
• Creating Topics
• Increasing Partitions
• Deleting Topics
• Security
• Monitoring
• Best Practices – Producers,
Consumers, KSQL, KStreams
• …
• Pager Duty Self Healing
Challenges | Goal
7. SELF-SERVICE, AUTOMATION
› APPLICATION ONBOARDING
› TOPIC MANAGEMENT
› PARTITIONS
› WORKFLOWS
› HOUSEKEEPING
› HEALTHCHECK
› LIVENESS, READINESS
› NO OFFLINE PARTITIONS
› ABILITY TO PRODUCE AND CONSUME
› AUTOMATED DR
› ACTIVE-PASSIVE, ACTIVE-ACTIVE OR STRETCH
› OFFSET SYNCHRONIZATION
› PROXY SERVICE
› CI/CD PIPELINES
› CERTIFIED IMAGES IN CONTAINER REGISTRY
› HELM CHARTS
› ZERO DOWNTIME OPERATIONS
› UPGRADES/PATCHES
› DOWNGRADES
› RESTARTS
› ELASTIC SCALING
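For elastic scaling, one common way to turn a metric into a replica count is the formula used by the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). The sketch below applies that formula with clamping. Whether a given Kafka operator scales brokers or Connect workers this way is an assumption here, and the function name, parameters, and bounds are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Apply the HPA-style ratio formula and clamp the result to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 workers at 90% average CPU against a 60% target.
    print(desired_replicas(3, 90.0, 60.0))
```

Scaling Connect workers in and out is comparatively cheap. Removing a Kafka broker is not, because its partitions have to be moved to other brokers first, so a lower bound such as `min_replicas` matters more for brokers.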
REST API
Web Page
Jenkins, Ansible
Jira Tickets, Manual
http://kafka/API
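The HEALTHCHECK items on this slide (liveness, readiness, no offline partitions, ability to produce and consume) can be combined into a single verdict. The sketch below shows one way to layer them, assuming the individual signals have already been collected; `BrokerProbe` and its field names are hypothetical, and no real Kafka or Kubernetes API is called.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BrokerProbe:
    """Hypothetical snapshot of the signals a health check would gather."""
    process_alive: bool       # liveness: is the broker process running?
    accepting_requests: bool  # readiness: can it serve client traffic?
    offline_partitions: int   # partitions with no available leader
    round_trip_ok: bool       # canary message produced and consumed successfully


def evaluate(probe: BrokerProbe) -> Dict[str, bool]:
    """Layer the checks: ready implies live, healthy implies ready."""
    live = probe.process_alive
    ready = live and probe.accepting_requests
    healthy = ready and probe.offline_partitions == 0 and probe.round_trip_ok
    return {"live": live, "ready": ready, "healthy": healthy}


if __name__ == "__main__":
    print(evaluate(BrokerProbe(True, True, 0, True)))
    print(evaluate(BrokerProbe(True, True, 2, True)))
```

Keeping the three levels separate matters because they call for different reactions: a failed liveness check justifies a restart, a failed readiness check only removes the broker from traffic, and an unhealthy but ready cluster is a signal for alerting.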
GOVERNANCE
› BEST PRACTICES
› TOPICS (REPLICATION FACTOR = 3)
› PARTITION SIZING
› PRODUCERS
› ACKS (1, ALL)
› ERROR HANDLING (RETRIABLE/NON-RETRIABLE)
› CONSUMERS (OFFSET MANAGEMENT)
› BROKERS
› KSQL, KSTREAMS
› NAMING CONVENTIONS
› METADATA MANAGEMENT
› OWNERSHIP, ATTRIBUTION
› ENTITLEMENT MANAGEMENT
› RBAC
› CAPACITY RESERVATION
› QUOTA MANAGEMENT
› LOGGING, MONITORING, ALERTING
› 2 AM PRODUCTION ISSUE RESOLUTION
› LONG TERM DATA PIPELINE OPTIMIZATION
› SLACK, EMAIL, PAGERDUTY INTEGRATION
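Governance rules such as those listed above can be enforced automatically when a topic is requested through a self-service workflow. The sketch below validates a request against a small policy. Only the replication factor of 3 comes from the slide; the naming pattern, the partition limit, and the request field names are hypothetical placeholders for whatever conventions an organization adopts.

```python
import re
from typing import Dict, List

REQUIRED_REPLICATION_FACTOR = 3  # from the slide
MAX_PARTITIONS = 48              # hypothetical sizing limit
# Hypothetical naming convention: <domain>.<team>.<name>
TOPIC_NAME_PATTERN = re.compile(r"^[a-z0-9]+\.[a-z0-9]+\.[a-z0-9-]+$")


def validate_topic_request(request: Dict) -> List[str]:
    """Return a list of policy violations; an empty list means the request passes."""
    violations = []
    if not TOPIC_NAME_PATTERN.match(request.get("name", "")):
        violations.append("name must look like <domain>.<team>.<name>")
    if request.get("replication_factor") != REQUIRED_REPLICATION_FACTOR:
        violations.append(f"replication factor must be {REQUIRED_REPLICATION_FACTOR}")
    if not 1 <= request.get("partitions", 0) <= MAX_PARTITIONS:
        violations.append(f"partitions must be between 1 and {MAX_PARTITIONS}")
    if not request.get("owner"):
        violations.append("owner is required for attribution")
    return violations


if __name__ == "__main__":
    request = {
        "name": "payments.checkout.order-events",
        "replication_factor": 3,
        "partitions": 12,
        "owner": "checkout-team",
    }
    print(validate_topic_request(request))
```

Returning all violations at once, rather than failing on the first, gives the requesting team a complete list to fix in one pass.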
8. GOAL AND CHALLENGES
Operations
✓ Installation
✓ Upgrades
✓ Patches
✓ Rollbacks
✓ Elastic Scaling (up/down)
✓ Fault Tolerance
✓ Disaster Recovery
✓ Security (inflight, at rest)
✓ Logging, Monitoring, Alerting
✓ Secrets Management
✓ …
Application Development
✓ Application Onboarding
✓ Creating Topics
✓ Increasing Partitions
✓ Deleting Topics
✓ Security
✓ Monitoring
✓ Best Practices – Producers,
Consumers, KSQL, KStreams
✓ …
✓ Pager Duty Self Healing
Challenges | Goal
9. Snapshot
USAGE, ROI BY TENANT
Usage Cost = function(Compute, Storage, Network, Human Effort)
Trend over time
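The slide leaves the usage cost function unspecified. As one possible reading, the sketch below treats it as a linear combination of the four inputs, computed per tenant. The unit rates, the `TenantUsage` fields, and the tenant names are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TenantUsage:
    """Hypothetical per-tenant usage for one billing period."""
    compute_hours: float
    storage_gb: float
    network_gb: float
    human_hours: float


# Hypothetical unit rates in an arbitrary currency.
RATES = {"compute": 0.5, "storage": 0.25, "network": 0.125, "human": 64.0}


def usage_cost(usage: TenantUsage, rates: Dict[str, float] = RATES) -> float:
    """Linear cost model: one assumed form of f(Compute, Storage, Network, Human Effort)."""
    return (
        usage.compute_hours * rates["compute"]
        + usage.storage_gb * rates["storage"]
        + usage.network_gb * rates["network"]
        + usage.human_hours * rates["human"]
    )


if __name__ == "__main__":
    tenants = {
        "team-a": TenantUsage(100, 200, 400, 2),
        "team-b": TenantUsage(40, 80, 160, 0),
    }
    for name, usage in tenants.items():
        print(name, usage_cost(usage))
```

Recording this figure per tenant for each period gives the trend over time that the slide refers to.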
10. AUGUST 2020
WE ARE HUMAN
KAFKA EXCELLENCE AT SCALE
CLOUD, KUBERNETES, INFRASTRUCTURE-AS-CODE