Application teams in JPMC have started shifting towards building event-driven architectures and real-time streaming pipelines, and Kafka has been at the core of this journey. As application teams adopt Kafka rapidly, the need for a centrally managed Kafka as a service has emerged. We began delivering Kafka as a service in early 2018 and have been running in production for more than a year, operating 80+ clusters (and growing) across all environments. One of the key requirements is to provide a truly segregated, secure multi-tenant environment with an RBAC model while satisfying financial regulations and controls at the same time. Operating clusters at large scale requires scalable self-service capabilities and cluster management orchestration. In this talk we will present: - Our experiences in delivering and operating secure, multi-tenant and resilient Kafka clusters at scale. - Internals of our service framework/control plane, which enables self-service capabilities for application teams, plus cluster build/patch orchestration and capacity management capabilities for TSE/admin teams. - Our approach to enabling automated cross-datacenter failover for application teams using the service framework and Confluent Replicator.
Secure Kafka at scale in true multi-tenant environment (Vishnu Balusu & Ashok Kadambala, JP Morgan Chase) Kafka Summit SF 2019
1. Kafka as a Managed Service
Secure Kafka at scale in true Multi-Tenant Environment
Kafka Summit, SFO 2019
Presenters: Vishnu Balusu & Ashok Kadambala
2. Agenda
Part 1
• Motivation & Design Principles
• Kafka-scape
• Cluster Design
• Data-Driven Control Plane
• App Resiliency
Part 2
• Self-Service API
• Schema Management
• Kafka Streams
• Orchestrator (Cluster Patching)
• Ubiquitous Access (Multi-Cloud)
Final Remarks
• Lessons Learned
• Future Ahead
3. Problem Statement
Why a Managed Service?
Many bespoke implementations across the firm
• Varied design and patterns
• Different standards of security and resiliency
• Lack of firm-wide governance in risk management
• Lack of real end-to-end self-service
• No metadata-driven APIs
• No centralized view of Data Lineage
Solution: a fully managed service built on these design principles
✓ Centralized Service
✓ Secure from Start
✓ Consumable from Hybrid Cloud and Platforms
✓ Data-Driven End-to-End Self-Service APIs
✓ Scalable on Demand
✓ Built per Customer Requirements
8. Control Plane: Multi-Tenancy & Capacity Management
• Logical abstraction at the metadata level for every Kafka cluster
• Allows applications to reserve storage on the cluster
• All Kafka artefacts created by an application are maintained within the application namespace
• Topic sizes and quotas are enforced
[Diagram: a physical Kafka cluster partitioned into logical namespaces (Tenant 1 … Tenant N), each carrying its own metadata, entitlements, governance and quotas; topics of various sizes (e.g. 10/15/5/2 GB) are reserved per tenant through an automated admin workflow]
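The namespace model above can be sketched as a small capacity manager: applications reserve storage against the physical cluster, and topic creation is enforced against that reservation. This is a minimal illustration; the class and method names are hypothetical, not the actual control plane API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-namespace capacity model: reservations
// are checked against physical cluster capacity, and topic sizes are
// checked against the namespace's reservation.
class NamespaceCapacityManager {
    private final int clusterCapacityGb;                              // physical cluster capacity
    private final Map<String, Integer> reservedGb = new HashMap<>();  // reservation per namespace
    private final Map<String, Integer> usedGb = new HashMap<>();      // topic usage per namespace

    NamespaceCapacityManager(int clusterCapacityGb) {
        this.clusterCapacityGb = clusterCapacityGb;
    }

    /** Reserve storage for an application namespace; fails if the cluster would be oversubscribed. */
    boolean reserve(String namespace, int gb) {
        int totalReserved = reservedGb.values().stream().mapToInt(Integer::intValue).sum();
        if (totalReserved + gb > clusterCapacityGb) {
            return false; // not enough physical capacity left
        }
        reservedGb.merge(namespace, gb, Integer::sum);
        return true;
    }

    /** Create a topic of the given size inside a namespace; enforced against the reservation. */
    boolean createTopic(String namespace, int sizeGb) {
        int reserved = reservedGb.getOrDefault(namespace, 0);
        int used = usedGb.getOrDefault(namespace, 0);
        if (used + sizeGb > reserved) {
            return false; // namespace quota exceeded
        }
        usedGb.merge(namespace, sizeGb, Integer::sum);
        return true;
    }
}
```

The point of the abstraction is that tenants only ever see their own namespace, while the control plane sees aggregate reservations against the physical cluster.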
9. App Resiliency: Connection Profile
• Unique cluster names – RREnnnn (Region, Env, Numeric)
• Connection profile is queried via API using the cluster name
• Applications are insulated from infrastructure changes
{
  "clusterName": "NAD1700",
  "topicSuffix": "na1700",
  "kafkaBrokerConnectionProtocols": [
    {
      "protocol": "SASL_SSL",
      "bootstrapServersStr": "",
      "serviceName": "jpmckafka"
    }
  ],
  "schemaRegistryURLs": [],
  "restProxyURLs": [],
  "clusterReplicationPattern": "ACTIVE_ACTIVE",
  "replicatedClusterProfile": {
    "clusterName": "NAD1701",
    "topicSuffix": "na1701",
    "kafkaBrokerConnectionProtocols": [
      {
        "protocol": "SASL_SSL",
        "bootstrapServersStr": "",
        "serviceName": "jpmckafka"
      }
    ],
    "schemaRegistryURLs": [],
    "restProxyURLs": []
  }
}
GET /applications/{appid}/cluster/{ClusterName}/connectionProfile – Connection profile for a given cluster
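A client consuming this profile can resolve the primary cluster and fall back to the replicated cluster profile when the primary is unhealthy. The sketch below uses hypothetical types standing in for the JSON payload; it is not the actual client library.

```java
// Illustrative resolver: pick the primary cluster's profile, or its
// ACTIVE_ACTIVE replica when the primary is known to be down.
class ConnectionProfileResolver {
    static class ClusterProfile {
        final String clusterName;
        final String topicSuffix;
        final String bootstrapServers;
        final ClusterProfile replicatedClusterProfile; // null on the replica entry itself

        ClusterProfile(String clusterName, String topicSuffix,
                       String bootstrapServers, ClusterProfile replica) {
            this.clusterName = clusterName;
            this.topicSuffix = topicSuffix;
            this.bootstrapServers = bootstrapServers;
            this.replicatedClusterProfile = replica;
        }
    }

    /** Returns the profile a client should connect with, given the primary's health. */
    static ClusterProfile resolve(ClusterProfile primary, boolean primaryHealthy) {
        if (primaryHealthy || primary.replicatedClusterProfile == null) {
            return primary;
        }
        return primary.replicatedClusterProfile;
    }
}
```

Because applications look clusters up by logical name through the API, broker hostnames and other infrastructure details can change without touching application configuration.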
10. App Resiliency: Cluster Health Index
• Health index is determined from:
  ✓ Ability to produce/consume externally as a client
  ✓ Number of Kafka/ZooKeeper processes up and running
  ✓ Offline partitions within the cluster
• Cluster health index is persisted as a metric in Prometheus and exposed via an API to application teams
• Recommended to integrate into automated application resiliency
[Diagram: the control plane runs periodic health checks on clusters, determines the cluster health index, scrapes it into Prometheus, and serves it through the Health Check API]
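The three signals above could be combined into a single score along the following lines. The weighting and thresholds are assumptions for illustration; the talk does not specify the exact formula.

```java
// Assumed scoring: any failure of the client probe or any offline
// partition zeroes the index, otherwise the index reflects the fraction
// of expected Kafka/ZooKeeper processes that are running.
class ClusterHealthIndex {
    /**
     * @param canProduceConsume external produce/consume probe succeeded
     * @param processesUp       Kafka/ZooKeeper processes currently running
     * @param processesExpected processes expected on the cluster
     * @param offlinePartitions partitions with no active leader
     * @return health index in [0.0, 1.0]; 1.0 means fully healthy
     */
    static double compute(boolean canProduceConsume, int processesUp,
                          int processesExpected, int offlinePartitions) {
        if (!canProduceConsume || offlinePartitions > 0) {
            return 0.0; // clients cannot safely rely on the cluster
        }
        return (double) processesUp / processesExpected;
    }
}
```

An application integrating this into automated resiliency would poll the index via the API and switch to the replicated cluster once the index drops below a chosen threshold.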
11. App Resiliency: Active-Active Clusters
Multi-DC Resiliency
• Better utilization of infrastructure
• Little manual intervention required to recover from a datacenter failure
• Eventual Consistency | High Availability | Partition Tolerance
16. Schema Management
Securing Schema Registry
• GET requests should be open to everyone
• POST/PUT/DELETE requests should be authorized
• Schema registry ownership and lineage should be maintained
resource.extension.class – Fully qualified class name of a valid implementation of the SchemaRegistryResourceExtension interface. This can be used to inject user-defined resources such as filters; typically used to add custom capabilities like logging and security.
17. Schema Registry: AuthX Extension

resource.extension.class=com.jpmorgan.kafka.schemaregistry.security.SchemaRegistryAuthXExtension

package com.jpmorgan.kafka.schemaregistry.security;

import javax.annotation.Priority;
import javax.ws.rs.Priorities;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;

// JAX-RS filter invoked for every REST request before it reaches a
// resource; the authentication logic goes in filter().
@Priority(Priorities.AUTHENTICATION)
public class AuthenticationFilter implements ContainerRequestFilter {

    @Override
    public void filter(ContainerRequestContext containerRequestContext) {
        // Authenticate the caller here; abort the request on failure,
        // e.g. containerRequestContext.abortWith(...) with a 401 response.
    }
}

package com.jpmorgan.kafka.schemaregistry.security;

import javax.ws.rs.core.Configurable;

import io.confluent.kafka.schemaregistry.rest.SchemaRegistryConfig;
import io.confluent.kafka.schemaregistry.rest.extensions.SchemaRegistryResourceExtension;
import io.confluent.kafka.schemaregistry.storage.SchemaRegistry;

// Extension class wired in via resource.extension.class; it registers
// the authentication filter with the Schema Registry REST layer.
public class SchemaRegistryAuthXExtension implements SchemaRegistryResourceExtension {

    @Override
    public void register(Configurable<?> configurable,
                         SchemaRegistryConfig schemaRegistryConfig,
                         SchemaRegistry schemaRegistry) {
        configurable.register(new AuthenticationFilter());
    }

    @Override
    public void close() {
    }
}
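The authorization rule from the previous slide (reads open, writes entitled) reduces to a simple check on the HTTP method. This is a hedged sketch: the entitlement lookup here is a boolean stand-in for the firm's RBAC check, not the actual implementation.

```java
import java.util.Set;

// Sketch of the GET-open / write-entitled rule for the Schema Registry.
// `principalEntitled` stands in for a hypothetical RBAC lookup.
class SchemaRegistryAuthorizer {
    private static final Set<String> MUTATING_METHODS = Set.of("POST", "PUT", "DELETE");

    /** Returns true when the request may proceed. */
    static boolean isAllowed(String httpMethod, boolean principalEntitled) {
        if (!MUTATING_METHODS.contains(httpMethod)) {
            return true; // reads are open to everyone
        }
        return principalEntitled; // writes require an RBAC entitlement
    }
}
```

A filter like the AuthX extension above would run this check inside `filter()` and abort unauthorized mutating requests with a 403.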
21. Orchestrator: Cluster Patching
• Find the active controller broker and patch it at the end
• For each Kafka broker:
  1. Stop the Kafka broker
  2. Deploy config/binaries
  3. Start the Kafka broker
  4. Invoke health check
     • Wait for URPs (under-replicated partitions) to reach zero
     • Produce/consume on a test topic
  5. Abort patching if the health check fails
[Diagram: the orchestrator drives the patch workflow across brokers 1…n, using metadata from the control plane and telemetry for the health checks]
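The ordering rule above (patch every broker except the active controller first, controller last, so leadership moves only once) can be sketched as follows. Names are illustrative, not the orchestrator's real API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the rolling-patch ordering: the active controller broker is
// deferred to the end of the patch sequence.
class PatchPlanner {
    /** Order broker ids so the active controller is patched last. */
    static List<Integer> patchOrder(List<Integer> brokerIds, int controllerId) {
        List<Integer> order = new ArrayList<>();
        for (int id : brokerIds) {
            if (id != controllerId) {
                order.add(id);
            }
        }
        order.add(controllerId); // controller goes last
        return order;
    }
}
```

The orchestrator would then walk this list, running the stop/deploy/start/health-check cycle per broker and aborting the whole run on the first failed health check.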
22. Ubiquitous Access (Multi-Cloud)
• Common control plane
• On-prem private cloud: marketplace tile
• On-prem Kubernetes platform: service catalog
• Public cloud: TLS/OAuth
• OAuth via federated ADFS (KIP-255: OAuth Authentication via SASL/OAUTHBEARER)
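For reference, a KIP-255-style client configuration looks roughly like the fragment below. The login module and the `security.protocol`/`sasl.mechanism` keys are standard Kafka settings; the callback handler class name is a hypothetical placeholder for a handler that fetches tokens from the federated ADFS endpoint.

```properties
security.protocol=SASL_SSL
sasl.mechanism=OAUTHBEARER
sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required;
# Hypothetical custom handler that obtains OAuth tokens from ADFS
sasl.login.callback.handler.class=com.example.AdfsOAuthBearerLoginCallbackHandler
```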
23. Lessons Learned
• Data-driven APIs with tollgates
• Automate Everything {large scale infra}
• Centralized Schema Registry {multiple clusters}
• New Features ≠ Stability
• Offset Management {replicated clusters} – offsets on a cluster and its replica do not line up one-to-one
• Scaling & Monitoring is not an easy job!!
24. Future Ahead…
• Fleet Management (State Machines)
• Self-Healing Kafka
• Auto Throttling & Kill Switch
• Centralized Schema Management
• 2.5 DC Stretch Clusters
• Chaos Engineering – Failure is a norm!!