Kafka Excellence at Scale – Cloud, Kubernetes, Infrastructure as Code (Vik Walia, Slower)

AUGUST 2020
WE ARE HUMAN
KAFKA EXCELLENCE AT SCALE
CLOUD, KUBERNETES, INFRASTRUCTURE-AS-CODE

GOAL AND CHALLENGES
Operations
• Installation
• Upgrades
• Patches
• Rollbacks
• Elastic Scaling (up/down)
• Fault Tolerance
• Disaster Recovery
• Security (inflight, at rest)
• Logging, Monitoring, Alerting
• Secrets Management
• …
Application Development
• Application Onboarding
• Creating Topics
• Increasing Partitions
• Deleting Topics
• Security
• Monitoring
• Best Practices – Producers, Consumers, KSQL, KStreams
• …
• PagerDuty Self Healing
Challenges / Goal

ARCHITECTURE EVOLUTION
Messaging / Event Log
1990s: Monolith
2000s: Service Oriented Architecture
2010s: Microservices, Events, Containers, Serverless
The speed of doing business is increasing…
• Application delivery acceleration – CI/CD pipelines speed up
delivery, but the quantity (not the complexity) of operations
work increases exponentially.
• Kubernetes – version 1.0 released in July 2015; Site Reliability
Engineering (SRE) best practices published; massive
automation of systems administration tasks (self-healing).
• An Operator is an automated Site Reliability Engineer.

KUBERNETES OPERATOR PATTERN
1. Operators are custom controllers watching custom
resources
2. Allow Infrastructure Engineers and Developers to provide
application specific features to manage their site and software.
3. The logic needed to maintain, scale, and heal a specific piece of
software is encoded into an operator application that runs as a
container in the cluster
4. The code in the operator is responsible for more targeted and
advanced health detection and healing than can be achieved via
Kubernetes’ generic self-healing
5. Vendors are writing custom operators to make cloud-native
management of their software easy https://operatorhub.io/
6. Confluent has created an Operator for Kafka
7. Other Kafka Operators:
   • https://operatorhub.io/operator/banzaicloud-kafka-operator
   • https://operatorhub.io/operator/strimzi-kafka-operator
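
The operator pattern above comes down to a control loop: compare the desired state declared in a custom resource with the observed state, then act on the difference. The following is a minimal sketch in plain Python with no Kubernetes client involved. The names KafkaClusterSpec, ObservedState and reconcile, and their fields, are hypothetical and chosen for illustration; they are not taken from any real operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KafkaClusterSpec:
    """Hypothetical custom resource: the desired state of a cluster."""
    brokers: int
    version: str


@dataclass
class ObservedState:
    """Hypothetical snapshot of what is actually running."""
    brokers: int
    version: str


def reconcile(desired: KafkaClusterSpec, observed: ObservedState) -> List[str]:
    """Return the actions needed to move the observed state toward the desired one."""
    actions = []
    if observed.brokers < desired.brokers:
        actions.append(f"add {desired.brokers - observed.brokers} broker(s)")
    elif observed.brokers > desired.brokers:
        actions.append(f"remove {observed.brokers - desired.brokers} broker(s)")
    if observed.version != desired.version:
        actions.append(f"rolling upgrade {observed.version} -> {desired.version}")
    return actions


if __name__ == "__main__":
    desired = KafkaClusterSpec(brokers=3, version="2.6")
    observed = ObservedState(brokers=2, version="2.5")
    for action in reconcile(desired, observed):
        print(action)
```

A real operator would run this comparison continuously, triggered by watch events on the custom resource, and would execute the actions rather than list them.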

CONFLUENT OPERATOR
› CLOUD NATIVE DEPLOYMENT ON KUBERNETES
› DECLARATIVE VS. IMPERATIVE SEMANTICS
› IMMUTABLE (CONFLUENT CERTIFIED DOCKER IMAGES)
› SELF-HEALING (CONTINUOUS)
› INFRASTRUCTURE AS CODE BEST PRACTICES (HELM, YAML)
› AUTOMATED DEPLOYMENT
› CERTIFIED IMAGES PULLED FROM CONTAINER REGISTRIES
› IMAGE SCANNING FOR VULNERABILITY
› CI/CD BEST PRACTICES (HELM CHARTS, JENKINS)
› AUTOMATED ROLLING UPGRADES (DOWNGRADES)
› STOP BROKER
› UPGRADE BINARIES
› PARTITION LEADER REASSIGNMENT
› START BROKER
› VERIFY ZERO UNDER-REPLICATED PARTITIONS
› ELASTIC SCALING
› KUBERNETES METRICS SERVER
› SPIN UP NEW BROKERS
› SPIN UP NEW CONNECT WORKERS
› SECURITY
› AUTOMATED CONFIGURATION OF TRUSTSTORES & KEYSTORES
› SECRETS MANAGEMENT
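
The rolling upgrade steps listed above (stop broker, upgrade binaries, partition leader reassignment, start broker, verify zero under-replicated partitions) can be sketched as a loop that handles one broker at a time and refuses to continue while the cluster is unhealthy. This is a simulation under stated assumptions, not Confluent Operator code: rolling_upgrade, the versions dictionary and the under_replicated probe are hypothetical, and a real implementation would wait and re-check rather than halt on the first non-zero reading.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    versions: Dict[int, str],
    target: str,
    under_replicated: Callable[[], int],
) -> List[str]:
    """Upgrade brokers one at a time, gating each step on cluster health.

    versions maps broker id -> running version and is updated in place.
    under_replicated is a hypothetical probe that returns the current
    number of under-replicated partitions.
    """
    log = []
    for broker_id in sorted(versions):
        if versions[broker_id] == target:
            continue  # already on the target version
        log.append(f"stop broker {broker_id}")
        versions[broker_id] = target  # stands in for swapping binaries/image
        log.append(f"upgrade broker {broker_id} to {target}")
        log.append(f"reassign partition leaders for broker {broker_id}")
        log.append(f"start broker {broker_id}")
        if under_replicated() != 0:
            # Never take a second broker down while replicas are behind.
            log.append(f"halt after broker {broker_id}")
            break
        log.append(f"broker {broker_id} verified")
    return log


if __name__ == "__main__":
    cluster = {0: "2.5", 1: "2.5", 2: "2.5"}
    for line in rolling_upgrade(cluster, "2.6", lambda: 0):
        print(line)
```

Gating on the under-replicated count is what keeps the procedure zero-downtime: with a replication factor of 3, at most one replica of any partition is offline at any moment.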

GOAL AND CHALLENGES
Operations
✓ Installation
✓ Upgrades
✓ Patches
✓ Rollbacks
✓ Elastic Scaling (up/down)
✓ Fault Tolerance
• Disaster Recovery
✓ Security (inflight, at rest)
• Logging, Monitoring, Alerting
✓ Secrets Management
• …
• Application Onboarding
• Creating Topics
• Increasing Partitions
• Deleting Topics
• Security
• Monitoring
• Best Practices – Producers, Consumers, KSQL, KStreams
• …
• PagerDuty Self Healing
Challenges / Goal

SELF-SERVICE, AUTOMATION
› APPLICATION ONBOARDING
› TOPIC MANAGEMENT
› PARTITIONS
› WORKFLOWS
› HOUSEKEEPING
› HEALTHCHECK
› LIVENESS, READINESS
› NO OFFLINE PARTITIONS
› ABILITY TO PRODUCE AND CONSUME
› AUTOMATED DR
› ACTIVE-PASSIVE, ACTIVE-ACTIVE OR STRETCH
› OFFSET SYNCHRONIZATION
› PROXY SERVICE
› CI/CD PIPELINES
› CERTIFIED IMAGES IN CONTAINER REGISTRY
› HELM CHARTS
› ZERO DOWNTIME OPERATIONS
› UPGRADES/PATCHES
› DOWNGRADES
› RESTARTS
› ELASTIC SCALING
REST API
Web Page
Jenkins, Ansible
Jira Tickets, Manual
http://kafka/API
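
The health check items on this slide (no offline partitions, ability to produce and consume) could be combined into a single report that a self-service endpoint returns. The sketch below assumes two caller-supplied probes, offline_partitions and produce_and_consume. Both are hypothetical stand-ins for whatever metrics query and canary round trip a given platform uses; no Kafka client is involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthReport:
    """Result of each named check; healthy only if all of them passed."""
    checks: Dict[str, bool]

    @property
    def healthy(self) -> bool:
        return all(self.checks.values())


def run_healthcheck(
    offline_partitions: Callable[[], int],
    produce_and_consume: Callable[[], bool],
) -> HealthReport:
    """Run both probes and collect the results.

    offline_partitions: hypothetical probe returning the offline partition count.
    produce_and_consume: hypothetical canary that writes a record and reads it
    back, returning True on success.
    """
    checks = {"no_offline_partitions": offline_partitions() == 0}
    try:
        checks["produce_and_consume"] = bool(produce_and_consume())
    except Exception:
        # A canary that raises is treated as a failed check, not a crash.
        checks["produce_and_consume"] = False
    return HealthReport(checks)


if __name__ == "__main__":
    report = run_healthcheck(lambda: 0, lambda: True)
    print(report.healthy, report.checks)
```

Keeping the individual results in the report, rather than a single boolean, lets a readiness probe and an alerting rule consume the same output.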
GOVERNANCE
› BEST PRACTICES
› TOPICS (REPLICATION FACTOR = 3)
› PARTITION SIZING
› PRODUCERS
› ACKS (1, ALL)
› ERROR HANDLING (RETRIABLE/NON-RETRIABLE)
› CONSUMERS (OFFSET MANAGEMENT)
› BROKERS
› KSQL, KSTREAMS
› NAMING CONVENTIONS
› METADATA MANAGEMENT
› OWNERSHIP, ATTRIBUTION
› ENTITLEMENT MANAGEMENT
› RBAC
› CAPACITY RESERVATION
› QUOTA MANAGEMENT
› LOGGING, MONITORING, ALERTING
› 2 AM PRODUCTION ISSUE RESOLUTION
› LONG TERM DATA PIPELINE OPTIMIZATION
› SLACK, EMAIL, PAGERDUTY INTEGRATION
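
Governance rules such as the ones above (replication factor = 3, naming conventions, ownership) lend themselves to automated checks at topic-creation time. The following is a minimal sketch of such a validator. Only the replication factor of 3 comes from the slide; the team.domain.event naming pattern, the function name validate_topic_request and the error messages are assumptions made for illustration.

```python
import re
from typing import List

REQUIRED_REPLICATION_FACTOR = 3  # stated on the slide

# Hypothetical naming convention: <team>.<domain>.<event>, lowercase.
TOPIC_NAME_PATTERN = re.compile(r"^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$")


def validate_topic_request(
    name: str, replication_factor: int, partitions: int, owner: str
) -> List[str]:
    """Return a list of governance violations; an empty list means approved."""
    errors = []
    if not TOPIC_NAME_PATTERN.match(name):
        errors.append("name must look like <team>.<domain>.<event>")
    if replication_factor != REQUIRED_REPLICATION_FACTOR:
        errors.append(
            f"replication factor must be {REQUIRED_REPLICATION_FACTOR}"
        )
    if partitions < 1:
        errors.append("partitions must be at least 1")
    if not owner.strip():
        errors.append("owner is required for attribution")
    return errors


if __name__ == "__main__":
    print(validate_topic_request("payments.orders.created", 3, 6, "team-payments"))
    print(validate_topic_request("Orders", 1, 0, ""))
```

Returning every violation at once, instead of failing on the first, suits a self-service workflow where the requester fixes the whole request in one pass.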

GOAL AND CHALLENGES
Operations
✓ Installation
✓ Upgrades
✓ Patches
✓ Rollbacks
✓ Elastic Scaling (up/down)
✓ Fault Tolerance
✓ Disaster Recovery
✓ Security (inflight, at rest)
✓ Logging, Monitoring, Alerting
✓ Secrets Management
✓ …
✓ Application Onboarding
✓ Creating Topics
✓ Increasing Partitions
✓ Deleting Topics
✓ Security
✓ Monitoring
✓ Best Practices – Producers, Consumers, KSQL, KStreams
✓ …
✓ PagerDuty Self Healing
Challenges / Goal

Snapshot
USAGE, ROI BY TENANT
Usage Cost = function(Compute, Storage, Network, Human Effort)
Trend over time
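
The slide states only that usage cost is a function of compute, storage, network and human effort. One simple way to make that concrete is a linear model with a unit price per input. The linear form, the Rates and TenantUsage types, the units and the numbers below are all assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; the values used below are placeholders."""
    per_cpu_hour: float
    per_gb_month: float
    per_gb_transferred: float
    per_engineer_hour: float


@dataclass(frozen=True)
class TenantUsage:
    """Hypothetical per-tenant usage for one billing period."""
    cpu_hours: float
    storage_gb_months: float
    network_gb: float
    engineer_hours: float


def usage_cost(usage: TenantUsage, rates: Rates) -> float:
    """Assumed linear model: sum of each input times its unit price."""
    return (
        usage.cpu_hours * rates.per_cpu_hour
        + usage.storage_gb_months * rates.per_gb_month
        + usage.network_gb * rates.per_gb_transferred
        + usage.engineer_hours * rates.per_engineer_hour
    )


if __name__ == "__main__":
    rates = Rates(1.0, 0.5, 0.25, 100.0)
    usage = TenantUsage(cpu_hours=10, storage_gb_months=4, network_gb=8, engineer_hours=2)
    print(usage_cost(usage, rates))  # 10 + 2 + 2 + 200 = 214.0
```

Computing this per tenant for each period gives both the snapshot and the trend over time that the slide refers to.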
