2. Agenda
Confluent Tech Talk: Hyperconnect, ASEM Tower

Time | Session | Speaker
15:00 | Introduction to Confluent | 원설아
15:10-15:40 | Confluent roadmap for 2022; monitoring approaches to keep Kafka and connectors continuously managed | 신준희
15:10-15:40 | KRaft vs. ZooKeeper: what is the difference? | 황주필
15:40-16:00 | How can Confluent help you? | 원설아
16:00-16:30 | Q&A | Confluent
3. Confluent Tech Talks
Who: a private seminar for commercial customers only
What: technical sessions and the roadmap, real-world deployment experience, and plenty of open Q&A
When: held every quarter
Sponsors: run open-house style at a venue borrowed from a commercial customer

April 7 | Hyperconnect, ASEM Tower
July | TBD
October | TBD
December | TBD
4. Theme and Purpose
An exclusive program for commercial customers: a members-only webinar where we talk technology, not promotion.
A time for candid conversation: building a two-way channel for sharing experience, with the longer-term goal of forming a knowledge-sharing community.
Confluent Tech Talk
Let's Explore: A Day in the Life of a Data Professional
5. Today's host: Hyperconnect
Hyperconnect is one of Korea's best-known homegrown startups, the company behind the video messenger Azar and the live video service Hakuna. Just seven years after founding, it hit the jackpot: acquisition by the US Match Group (owner of Tinder) for roughly KRW 2 trillion.
Note: Azar has been downloaded about 500 million times. It was the first app to implement WebRTC on mobile, enabling fast communication without passing through a central server, so video calls stay smooth even on poor networks and without heavy battery drain.
7. The Confluent Q1 '22 Launch
Announcing the latest updates to our cloud-native data streaming platform, Confluent Cloud: complete, cloud native, everywhere.

50+ Fully Managed Connectors: quickly and reliably connect your entire business in real time with no operational or management burden.
Dedicated Cluster Expand & Shrink: run at massive scale, cost effectively, with self-service, programmatic expansion and shrink of GBps+ clusters.
Global Schema Linking: maintain trusted, compatible data streams across cloud and hybrid environments with shared schemas that sync in real time.
Single CSU Pricing for ksqlDB: easily leverage the power of stream processing with a pricing option fit for any budget.
One Confluent CLI: manage all of Confluent, across clouds and on-premises, from a single interface.
Integrated Monitoring with Datadog & Prometheus: gain deep visibility into Confluent Cloud from within the tools you already use.

Learn more! Read the announcement blog or attend the instructional demo webinar.
8. Connect your entire business with just a few clicks
50+ fully managed connectors, including:
Amazon S3, Amazon Redshift, Amazon DynamoDB, AWS Lambda, Amazon SQS, Amazon Kinesis, Azure Service Bus, Azure Event Hubs, Azure Synapse Analytics, Azure Blob Storage, Azure Functions, Azure Data Lake, Google Cloud Spanner, Google Bigtable
9. Easily leverage the power of stream processing with a pricing option fit for any budget

A CSU (Confluent Streaming Unit) is an abstract unit that represents the linearity of performance. For example, if a workload achieves a certain level of throughput with 4 CSUs, you can expect ~3X the throughput with 12 CSUs.

Light (1 or 2 CSUs): a great option for testing, development, and low-throughput workloads. ~$140-$320 per month.
Standard (4 CSUs, recommended): sufficient for most workloads; this is what ~95% of customers are currently leveraging. ~$650 per month.
Heavy (8 or 12 CSUs): a minimum of 8 CSUs in a multi-AZ Kafka cluster is required for high availability and SLA coverage. ~$1300-$2000 per month.
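The linear-scaling claim above reduces to a back-of-the-envelope estimator. A minimal sketch (illustration only, not official sizing guidance; the numbers are hypothetical):

```python
def expected_throughput(baseline_mbps, baseline_csus, target_csus):
    """Rough estimate implied by the CSU model: throughput scales
    ~linearly with the number of CSUs. Illustration only."""
    return baseline_mbps * (target_csus / baseline_csus)

# e.g., a workload doing 10 MBps on 4 CSUs should do ~30 MBps on 12 CSUs
```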
INTERNAL USE ONLY
10. Monitor data streams directly alongside the rest of your technology stack within Datadog
Integrate Confluent Cloud with Prometheus for unified monitoring of data streams within your chosen open-source platform, using the new /export endpoint in the Confluent Cloud Metrics API.
Set up in just a few clicks, the Datadog integration provides businesses with a single source of truth for monitoring their business:
● Unified monitoring
● Serverless operation
● Out-of-the-box dashboards
● Proactive alerting
Create custom data streaming use cases and dashboards with our fully managed Datadog Metrics sink connector.
Easy setup docs: Datadog, Prometheus
11. Run at massive scale, cost effectively with self-service expansion and shrink of GBps+ clusters
Enhanced elasticity for fully managed, cloud-native Apache Kafka® clusters

Dedicated clusters
● Supporting GBps+ use cases
● Add or remove capacity through the Confluent Cloud UI, CLI, or API
● Make informed decisions with the Load Metrics API
● Adjust capacity confidently with built-in safeguards and automatic load re-balancing

Basic & Standard clusters
● Instant and elastic scalability up to 100 MBps
● Scale-to-$0 pricing (Basic) with no sizing or capacity management required
12. Shared schemas for every Cluster Linking use case
Customers running across environments have an easy option to bring their schemas along.

Global Data Sharing: a cost-effective, secure, and performant data transport layer across public clouds and regions.
Regional Disaster Recovery: near real-time DR with failover to perfectly mirrored Kafka topics in a separate region or on a different cloud provider altogether.
Cluster Migrations: prescriptive, low-downtime migrations between cloud providers, across regions, or from another Kafka service to Confluent Cloud.

Bring data in motion to every cloud application, across public and private clouds, with linked clusters & schemas that sync in real time.
13. Maintain trusted data streams across environments with shared schemas that sync in real time
14. One Confluent CLI
Manage all of Confluent, across clouds and on-premises, from a single interface.
Everywhere: whether managing an on-premises deployment, provisioning a new cluster in the cloud, or even linking the two together, all of this and more can now be handled through a single tool.
Migration Guide & New Client Download
Confluent v2 CLI
15. Security Product Roadmap Timeline (Confluent Cloud)

Now:
● SAML/SSO for user authentication
● At-rest BYOK on AWS/GCP
● Cluster RBAC (control plane) with limited roles
● Audit logs: Kafka authentication and management logs for dedicated/standard clusters, RBAC events
● API keys API

2H'21:
● RBAC: +data plane, +Connectors, KSQL, and SR
● Audit logs: org-level actions
● BYOK on GCP + Infinite Storage

1H'22:
● OAuth/OIDC
● Audit logs: produce/consume logs
● BYOK on Azure

2H'22+:
● mTLS
● Native IAM integration
● Audit logs: produce/consume logs
● Client-side encryption
16. Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
CFK Roadmap

Recent (Q4 '21):
● Declarative API: Cluster Links; Schema, Connectors; standalone REST Proxy
● Cloud-Native Automation: cluster shrink

Now (1H '22):
● Declarative API: Schema Links (preview now, GA to follow); Multi-Region Clusters
● Kubernetes Ecosystem: custom volume mounts

Future:
● Self-service control plane across multiple infrastructures
18. Agenda
1. Confluent Cloud UI: what to expect when using a local Control Center
2. Monitoring Kafka Connect and connectors: using the REST API and JMX
3. Client metrics: monitor your clients
4. Confluent Cloud Metrics API: a queryable HTTP API for server-side metrics
5. Monitoring consumer lag: the different ways to monitor consumer lag
20. Cluster Metrics
● Cluster status
● Production: average over the last hour
● Consumption: average over the last hour
● Storage: storage used (excluding replication)
● Resources: current number of topics, ksqlDB apps, connectors, and clients
● Cluster metadata: cluster ID, type, location
21. Cluster Dashboard
● Cluster load
● Throughput (last hour): production average, consumption average
● Storage: capacity used (excluding replication)
● Topics: current number of topics
● Partitions: current number of partitions
● Connectors
● ksqlDB applications
23. Clients
● Users can access producer/consumer information in the Cloud UI under the "Data Integration -> Clients" tab
● Provides a view of producer/consumer behavior
● Provides a consumer lag view; click through to inspect lag from the consumer-group perspective
24. Cluster Capacity
● Shows usage vs. current limits
● Useful for judging whether the cluster is running above or below capacity, and whether resources should be adjusted

Load Metric
The load metric gives visibility into the cluster's workload. It is similar to the Linux load average in that it combines many low-level elements of system utilization into a single, Kafka-specific metric. Details: Cluster Load Metric for Dedicated Clusters.
The load metric is available for dedicated clusters through the Metrics API as io.confluent.kafka.server/cluster_load_percent.
25. Stream Lineage
● Visualizes the data flow paths between producers, topics, and consumers in a cluster
● Lets you drill into details for producers, topics, and consumers
● Inspect flow throughput metrics between producers, topics, consumer groups, and consumers
● ℹ VPC-peered users need the Data Flow public internet endpoint to use this feature
27. Monitoring Kafka Connect and Connectors
• Connect workers, connectors, and clients can be managed and monitored using JMX and the REST API

Using the REST interface: manage the cluster through Kafka Connect's REST API, which includes endpoints to show connector configuration and task status, change configuration, restart tasks, and more.
Using JMX to monitor Connect: Connect reports a variety of metrics through JMX (Java Management Extensions); additional pluggable stats reporters can be configured via Connect's metric.reporters setting.
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
28. Using the REST interface
• The Kafka Connect cluster can be managed through its REST API
• Includes endpoints to inspect connector configuration and task status
• Supports changing current behavior (e.g., changing configuration and restarting tasks)
• Once a Kafka Connect cluster is up and running, you can monitor and modify it
• Because Kafka Connect is intended to run as a service, it also supports a REST API for managing connectors (port 8083 by default)
• When run in distributed mode, the REST API is the primary interface to the cluster; requests can be made to any cluster member
• The REST API automatically forwards requests when required
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
29. Connector and task status
• You can use the REST API to view the current status of a connector and its tasks, including the ID of the worker each was assigned to
• Connectors and their tasks publish status updates to a shared topic (configured with status.storage.topic), which all workers in the cluster monitor
• Because workers consume this topic asynchronously, there is typically a short delay before a state change shows up via the status API
• Connector/task states:
  • UNASSIGNED: the connector/task has not yet been assigned to a worker
  • RUNNING: the connector/task is running
  • PAUSED: the connector/task has been administratively paused
  • FAILED: the connector/task has failed (usually by raising an exception, which is reported in the status output)
• In most cases connector and task states match, but they may differ for short periods while changes occur or tasks fail; for example, when a connector is first started there may be a noticeable delay before the connector and its tasks all transition to the RUNNING state
• States will also diverge when tasks fail, because Connect does not automatically restart failed tasks
• It is sometimes useful to temporarily stop a connector's message processing; for example, while a remote system is under maintenance it is better for a source connector to stop polling for new data than to fill the logs with exception spam (use the pause/resume API)
• The paused state is persistent, so even after a cluster restart the connector will not resume message processing until it has been resumed
• There may be a delay before all of a connector's tasks transition to the PAUSED state
• A failed task will not transition to the PAUSED state until it has been restarted
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
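Since Connect does not restart failed tasks on its own, a small watchdog against the status endpoint is a common pattern. A minimal sketch using only the Python standard library; the worker URL and connector name are placeholders for your environment:

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # default Connect REST port

def failed_task_ids(status):
    """Given a parsed /connectors/{name}/status response, return the IDs
    of tasks currently in the FAILED state."""
    return [t["id"] for t in status.get("tasks", []) if t["state"] == "FAILED"]

def restart_failed_tasks(name):
    """Fetch a connector's status and restart any FAILED tasks."""
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        status = json.load(resp)
    for task_id in failed_task_ids(status):
        req = urllib.request.Request(
            f"{CONNECT_URL}/connectors/{name}/tasks/{task_id}/restart",
            method="POST")
        urllib.request.urlopen(req)
```

Keep the caveat above in mind: status propagates asynchronously through status.storage.topic, so a just-restarted task may briefly still report FAILED.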
30. Common REST examples
(Screenshots of example requests for:)
• Cluster info & connector plugins
• Connector tasks & operations
• Connector config/tasks/status
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
31. Using JMX to monitor Connect
• Connect reports a variety of metrics through JMX (Java Management Extensions)
• Additional pluggable stats reporters can be configured via Connect's metric.reporters setting
• MBeans and metrics (see the Appendix for details):
  • Connector metrics
    • MBean: kafka.connect:type=connector-metrics,connector=([-.\w]+)
  • Common task metrics
    • MBean: kafka.connect:type=connector-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Worker metrics
    • MBean: kafka.connect:type=connect-worker-metrics
  • Worker rebalance metrics
    • MBean: kafka.connect:type=connect-worker-rebalance-metrics
  • Source task metrics
    • MBean: kafka.connect:type=source-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Sink task metrics
    • MBean: kafka.connect:type=sink-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Client metrics
    • MBean: kafka.connect:type=connect-metrics,client-id=([-.\w]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
33. Java: Client JMX metrics
• Java Kafka applications expose a number of internal JMX (Java Management Extensions) metrics
• Many users run a JMX exporter to feed metrics into their monitoring system (Grafana, Datadog, InfluxDB, etc.)
• JMX metrics are not available for Confluent Cloud clusters themselves (👉 use the Metrics API instead)
• Important JMX metrics to monitor:
  • General producer metrics and producer throttling time
  • Consumer metrics
  • ksqlDB & Kafka Streams metrics
  • Kafka Connect metrics
• Use JMX-Exporter to extract metrics and feed them into Prometheus
• The exporter can be configured to extract and forward only the metrics you want

Data pipeline pattern for client metrics: clients emitting JMX -> JMX client (e.g., JMX-Exporter) -> Prometheus -> observability app
34. Client Throttling
• Depending on your Confluent Cloud service plan, produce (write) and consume (read) traffic is limited to specific throughput rates
• When a client application exceeds its rate, the broker's quota mechanism detects it and the broker throttles the client's requests
• If your clients are being throttled, consider two options:
  • Where possible, modify the application to optimize its throughput (see the Optimizing for Throughput section)
  • Upgrade to a cluster configuration with higher limits
• ℹ The Metrics API can provide server-side throughput metrics, but not client-side throughput metrics

Client JMX metrics to monitor:
Metric | Description
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg | The average time in ms that a request was throttled by a broker
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max | The maximum time in ms that a request was throttled by a broker
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg | The average time in ms that a broker spent throttling a fetch request
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max | The maximum time in ms that a broker spent throttling a fetch request
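Once these attributes are scraped (e.g., by a JMX exporter), flagging throttled clients is a simple filter. A sketch over hypothetical pre-collected samples; the collection mechanism itself is out of scope:

```python
def throttled_clients(samples, threshold_ms=0.0):
    """Return the client IDs whose average produce/fetch throttle time
    exceeds threshold_ms, i.e., clients running into quota limits.
    `samples` maps (client_id, metric_name) -> sampled value in ms."""
    flagged = {
        client_id
        for (client_id, metric), value in samples.items()
        if metric in ("produce-throttle-time-avg", "fetch-throttle-time-avg")
        and value > threshold_ms
    }
    return sorted(flagged)
```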
39. Metrics API
● Easily pull server-side metrics over HTTPS
● Metrics aggregated at the topic, partition, and cluster level
● Available metrics:
  ○ received_bytes (cluster, topic, partition)
  ○ sent_bytes (cluster, topic, partition)
  ○ received_records (cluster, topic, and partition)
  ○ sent_records (cluster, topic, and partition)
  ○ retained_bytes (topic and partition)
  ○ active_connection_count (cluster)
  ○ request_count (cluster, type)
  ○ partition_count (cluster)
  ○ successful_authentication_count
  ○ cluster_link (various)
  ○ dead_letter_queue_records
  ○ ksql/streaming_unit_count
  ○ schema_registry/schema_count
● See the full API specification
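As a concrete illustration, a Metrics API query can be sketched as follows (Python standard library only; the endpoint path and payload shape follow the Metrics API v2 as documented at the time of writing, and the cluster ID is a placeholder):

```python
import base64
import json
import urllib.request

METRICS_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"

def build_query(metric, cluster_id, interval, granularity="PT1M"):
    """Build a query body for one server-side metric on one cluster."""
    return {
        "aggregations": [{"metric": f"io.confluent.kafka.server/{metric}"}],
        "filter": {"field": "resource.kafka.id", "op": "EQ",
                   "value": cluster_id},
        "granularity": granularity,
        "intervals": [interval],
    }

def query_metrics(api_key, api_secret, body):
    """POST the query, authenticating with a Cloud API key (basic auth)."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    req = urllib.request.Request(
        METRICS_URL, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```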
40. Metrics API
● Use a user Cloud API key created via the UI or the CLI:
  confluent api-key create --resource cloud
● Best practice: create a Cloud API key for a service account and grant it the MetricsViewer role (UI and CLI examples available)
● Note: there are per-organization limits on API keys
● Consumer lag is available through the API:
  ○ bytes produced or consumed per minute, grouped by topic
  ○ the maximum hourly bytes retained over two hours for a given topic or cluster
41. Native Datadog Integration
Simple setup:
1. In the integration tile, go to the Configuration tab
2. Click "Add API Key" and enter your Confluent Cloud API key and API secret
   Note: use a service account with the MetricsViewer role
3. Click Save. Datadog discovers the accounts associated with the credentials
4. Add your Confluent Cloud cluster IDs or connector IDs. Datadog crawls Confluent Cloud metrics and loads them within a few minutes
References:
● Confluent Docs
● KB Article
44. #1: Using the Confluent Cloud UI
Consumer lag is available in the Data Integration -> Clients / Consumers sections of the navigation bar
(Screenshots: Clients section, Consumers section)
45. #2: Using JMX (Java clients only)
• If you use Java consumers, you can capture JMX metrics and monitor records-lag-max
• Note: the consumer's records-lag-max JMX metric calculates lag by comparing the most recent offset the consumer has seen with the most recent offset in the log, so it is a more real-time measurement

Metric | Description
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max | The maximum lag in number of records for any partition. A value that increases over time is the best indication that the consumer group is not keeping up with the producers.
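Conceptually, record lag is just the distance between the log end and the consumer's position; per-partition lag and the records-lag-max roll-up can be sketched as:

```python
def consumer_lag(log_end_offset, consumer_offset):
    """Lag for one partition: how far the consumer trails the log end."""
    return max(0, log_end_offset - consumer_offset)

def records_lag_max(offsets):
    """The worst lag across all partitions, as records-lag-max reports.
    `offsets` maps partition -> (log_end_offset, consumer_offset)."""
    return max(consumer_lag(end, pos) for end, pos in offsets.values())
```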
47. #4: Using the consumer lag REST API
Consumer lag API (Confluent Cloud API Reference Documentation)
Currently:
• Limited to 25 requests per second
• The API is GA
• Supports public and VPC-peered dedicated clusters
Sample Command:
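The sample command on the original slide was a screenshot and is not recoverable here. As a hypothetical substitute, a call to the consumer-group lag endpoint (Kafka REST v3; check the API reference for the exact path for your cluster) might look like this, with every ID below being a placeholder:

```python
import base64
import json
import urllib.request

def lag_url(rest_endpoint, cluster_id, group_id):
    """Consumer-group lag endpoint (Kafka REST v3) for a cluster."""
    return (f"{rest_endpoint}/kafka/v3/clusters/{cluster_id}"
            f"/consumer-groups/{group_id}/lags")

def fetch_lags(rest_endpoint, cluster_id, group_id, api_key, api_secret):
    """GET the per-partition lags, authenticating with an API key."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    req = urllib.request.Request(
        lag_url(rest_endpoint, cluster_id, group_id),
        headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]
```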
51. Using JMX to monitor Connect
• Connect reports a variety of metrics through JMX (Java Management Extensions)
• Additional pluggable stats reporters can be configured via Connect's metric.reporters setting
• MBeans and metrics (detailed on the following slides):
  • Connector metrics
    • MBean: kafka.connect:type=connector-metrics,connector=([-.\w]+)
  • Common task metrics
    • MBean: kafka.connect:type=connector-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Worker metrics
    • MBean: kafka.connect:type=connect-worker-metrics
  • Worker rebalance metrics
    • MBean: kafka.connect:type=connect-worker-rebalance-metrics
  • Source task metrics
    • MBean: kafka.connect:type=source-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Sink task metrics
    • MBean: kafka.connect:type=sink-task-metrics,connector=([-.\w]+),task=([\d]+)
  • Client metrics
    • MBean: kafka.connect:type=connect-metrics,client-id=([-.\w]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
52. JMX to monitor Connect - Connector metrics
• MBean: kafka.connect:type=connector-metrics,connector=([-.\w]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
connector-type Connector type: source or sink
connector-class Connector class name
connector-version Connector class version (as reported by the connector)
status Connector status: running, paused, or stopped
53. JMX to monitor Connect - Common task metrics
• MBean: kafka.connect:type=connector-task-metrics,connector=([-.\w]+),task=([\d]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
status Current task status: unassigned, running, paused, failed, or destroyed
pause-ratio Fraction of time the task has spent in a paused state
running-ratio Fraction of time the task has spent in the running state
offset-commit-success-percentage Average percentage of the task’s offset commit attempts that succeeded
offset-commit-failure-percentage Average percentage of the task's offset commit attempts that failed or had an error
offset-commit-max-time-ms Maximum time in milliseconds taken by the task to commit offsets
offset-commit-avg-time-ms Average time in milliseconds taken by the task to commit offsets
batch-size-max Maximum size of the batches processed by the connector
batch-size-avg Average size of the batches processed by the connector
54. JMX to monitor Connect - Worker metrics
• MBean: kafka.connect:type=connect-worker-metrics
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
task-count Number of tasks that have run in this worker
connector-count The number of connectors that have run in this worker
connector-startup-attempts-total Total number of connector startups that this worker has attempted
connector-startup-success-total Total number of connector starts that succeeded
connector-startup-success-percentage Average percentage of the worker’s connector starts that succeeded
connector-startup-failure-total Total number of connector starts that failed
connector-startup-failure-percentage Average percentage of the worker's connector starts that failed
task-startup-attempts-total Total number of task startups that the worker has attempted
task-startup-success-total Total number of task starts that succeeded
task-startup-success-percentage Average percentage of the worker’s task starts that succeeded
task-startup-failure-total Total number of task starts that failed
task-startup-failure-percentage Average percentage of the worker’s task starts that failed
55. JMX to monitor Connect - Worker rebalance metrics
• MBean: kafka.connect:type=connect-worker-rebalance-metrics
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
leader-name Name of the group leader
epoch Epoch or generation number of the worker
completed-rebalances-total Total number of rebalances completed by the worker
rebalancing Whether the worker is currently rebalancing
rebalance-max-time-ms Maximum time the worker spent rebalancing (in milliseconds)
rebalance-avg-time-ms Average time the worker spent rebalancing (in milliseconds)
time-since-last-rebalance-ms Time since the most recent worker rebalance (in milliseconds)
56. JMX to monitor Connect - Source task metrics
• MBean: kafka.connect:type=source-task-metrics,connector=([-.\w]+),task=([\d]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
source-record-write-total
Number of records output from the transformations and written to Kafka for the task belonging to
the named source connector in the worker (since the task was last restarted)
source-record-write-rate
After transformations are applied, this is the average per-second number of records output from
the transformations and written to Kafka for the task belonging to the named source connector in
the worker (excludes any records filtered out by the transformations)
source-record-poll-total
Before transformations are applied, this is the number of records produced or polled by the task
belonging to the named source connector in the worker (since the task was last restarted)
source-record-poll-rate
Before transformations are applied, this is the average per-second number of records produced or
polled by the task belonging to the named source connector in the worker
source-record-active-count-max Maximum number of records polled by the task but not yet completely written to Kafka
source-record-active-count-avg Average number of records polled by the task but not yet completely written to Kafka
source-record-active-count Most recent number of records polled by the task but not yet completely written to Kafka
poll-batch-max-time-ms Maximum time in milliseconds taken by this task to poll for a batch of source records
poll-batch-avg-time-ms Average time in milliseconds taken by this task to poll for a batch of source records
57. JMX to monitor Connect - Sink task metrics
• MBean: kafka.connect:type=sink-task-metrics,connector=([-.\w]+),task=([\d]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
sink-record-read-rate
Before transformations are applied, this is the average per-second number of records read from Kafka for the task belonging to the
named sink connector in the worker
sink-record-read-total
Before transformations are applied, this is the total number of records produced or polled by the task belonging to the named sink
connector in the worker (since the task was last restarted)
sink-record-send-rate
After transformations are applied, this is the average per-second number of records output from the transformations and sent to the task
belonging to the named sink connector in the worker (excludes any records filtered out by the transformations)
sink-record-send-total
Total number of records output from the transformations and sent to the task belonging to the named sink connector in the worker
(since the task was last restarted)
sink-record-active-count Most recent number of records read from Kafka but not yet completely committed, flushed, or acknowledged by the sink task
sink-record-active-count-max
Maximum number of records read from Kafka, but that have not yet completely been committed, flushed, or acknowledged by the sink
task
sink-record-active-count-avg
Average number of records read from Kafka, but that have not yet completely been committed, flushed, or acknowledged by the sink
task
partition-count Number of topic partitions assigned to the task and which belong to the named sink connector in the worker
offset-commit-seq-no Current sequence number for offset commits
offset-commit-completion-rate Average per-second number of offset commit completions that have completed successfully
offset-commit-completion-total Total number of offset commit completions that were completed successfully
offset-commit-skip-rate Average per-second number of offset commit completions that were received too late and skipped, or ignored
offset-commit-skip-total Total number of offset commit completions that were received too late and skipped, or ignored
put-batch-max-time-ms Maximum time in milliseconds taken by this task to put a batch of sink records
58. JMX to monitor Connect - Client metrics
• MBean: kafka.connect:type=connect-metrics,client-id=([-.\w]+)
Source: https://docs.confluent.io/platform/current/connect/monitoring.html#connector-and-task-status
Metric Explanation
connection-close-rate Connections closed per second in the window
connection-count Current number of active connections
connection-creation-rate New connections established per second in the window
failed-authentication-rate Connections that failed authentication
incoming-byte-rate Bytes per second read off all sockets
io-ratio Fraction of time the I/O thread spent doing I/O
io-time-ns-avg Average length of time for I/O per select call in nanoseconds
io-wait-ratio Fraction of time the I/O thread spent waiting
io-wait-time-ns-avg Average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds
network-io-rate Average number of network operations (reads or writes) on all connections per second
outgoing-byte-rate Average number of outgoing bytes sent per second to all servers
request-rate Average number of requests sent per second
request-size-avg Average size of all requests in the window
request-size-max Maximum size of any request sent in the window
response-rate Responses received and sent per second
select-rate Number of times the I/O layer checked for new I/O to perform per second
successful-authentication-rate Connections that were successfully authenticated using SASL or SSL
62. KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Description: today, Kafka stores metadata about brokers and partitions in ZooKeeper, and also uses ZooKeeper to elect the Kafka controller. Managing this metadata in a more robust and scalable way lets Kafka support many more partitions.
KIP-500: Replace Zookeeper with a Self-Managed Metadata Quorum
Changes:
● How metadata is managed, and the controller's role
● How brokers reach consensus
● Client changes
(Diagram: with ZooKeeper, the controller asks "what's the truth?!" and the other brokers ask "what did ZooKeeper say?!"; with a quorum controller, the quorum simply agrees: "we agree!!")
63. Timeline (subject to change…)
Right now you can try the feature and spin up a cluster without ZK:
https://github.com/apache/kafka/blob/2.8/config/kraft/README.md
64. KIP-631: The Quorum-based Kafka Controller
With ZooKeeper (before):
● The controller keeps a metadata cache (broker info, partition info), but the source of truth is ZooKeeper (broker info, partition info, consumer info)
● ZooKeeper is also accessible from individual brokers
● Not every change can be broadcast to everyone
● Electing a new controller is slow because the metadata must be loaded from ZooKeeper
● The data in ZooKeeper and in the controller can diverge
With a quorum controller (KIP-631):
● Multiple controllers (3+), which reach consensus among themselves (Raft)
● Brokers talk to the active controller (same as with the current controller)
● Brokers fetch metadata from the active controller
● Broker states: registered, unregistered, and fenced (new)
65. KIP-595: A Raft Protocol for the Metadata Quorum
● Metadata consensus (leader/follower)
● Leader election
KIP-595: A Raft Protocol for the Metadata Quorum

From a metadata store (ZooKeeper) to log-based (Kafka): Kafka already treats user data and metadata as logs, so it is a natural fit for the Raft model. Metadata is managed in the internal topic __cluster_metadata, and snapshots plus the topic allow metadata to be restored quickly.

KIP-380 controller epoch: KIP-380 (Detect outdated control requests and bounced brokers using broker generation) already introduced metadata (an epoch) that makes a controller returning from being offline give up its active-controller status. (The epoch is synonymous with Raft's term.)
66. KIP-630: Kafka Raft Snapshot
Restoring state from zero: the controller keeps its state in memory, updated and synchronized from the metadata log. Replaying that log is what reconstructs the metadata state, but the log grows without bound, and the time to restore state grows proportionally.
The active controller periodically takes a snapshot of the metadata store at a committed state.
A newly joining controller (new or recovering) starts from the snapshot and replays the metadata log from the required offset to restore its state.
67. KRaft Cluster Metadata
(Diagram: the active controller, follower controllers, and brokers each hold an in-memory metadata structure backed by a local copy of the __cluster_metadata topic log, __cluster_metadata_0/<baseoffset>.log.)
1. When cluster metadata changes, the active controller writes the change as an event to the __cluster_metadata topic.
2. Controllers (followers) and brokers (observers) replicate the __cluster_metadata topic events. The active controller commits an event once it has been replicated by a quorum (majority) of the controllers.
3. Once events are committed, controllers and brokers update their in-memory metadata structures from the log events.
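The commit rule in steps 1-3 can be illustrated with a toy model (not Kafka code; just the majority-replication idea):

```python
class MetadataQuorumToy:
    """Toy model of KRaft's commit rule: an event is committed once a
    majority of controllers hold it, and only committed events are
    applied to the in-memory metadata structure."""

    def __init__(self, num_controllers):
        self.num_controllers = num_controllers
        self.log = []    # each entry: {"event": (key, value), "acks": n}
        self.state = {}  # in-memory metadata structure

    def append(self, key, value):
        # The active controller writes the event to its own log first.
        self.log.append({"event": (key, value), "acks": 1})

    def replicate(self, index):
        # One follower controller has fetched (replicated) the event.
        self.log[index]["acks"] += 1

    def committed(self, index):
        majority = self.num_controllers // 2 + 1
        return self.log[index]["acks"] >= majority

    def apply_committed(self):
        # Controllers and brokers apply committed events only.
        for i, entry in enumerate(self.log):
            if self.committed(i):
                key, value = entry["event"]
                self.state[key] = value
```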
68. Internal vs. External Modes

KRaft External Mode:
● Overview: quorum controllers run on separate nodes, replacing the current ZK nodes
● Impact on node counts: none; a 1-to-1 replacement of ZK nodes with controllers
● Confluent recommendation: the default recommendation for CP customers

KRaft Internal Mode:
● Overview: quorum controllers run within the broker / Confluent Server
● Impact on node counts: ZK nodes are removed, with no replacement nodes needed for controllers
● Confluent recommendation: not recommended for production by Confluent
75. Sample Customer Journey
● Review, Plan & Design: review the current ecosystem, plan the onboarding strategy, and design the Confluent implementation
● Build the Service: build a highly automated Confluent service and prepare for migration and new use cases
● Consume the Service: continuous onboarding of use cases and applications
● Migration: migrate existing applications in line with the migration plan
● Enablement: ensure that all teams are enabled to operate and consume the service, including documentation and education
76. What does the Confluent project team look like?

Confluent Project Team (Confluent investment):
● Engagement Manager
● Event Streaming Strategist (if needed)
● Resident Solution Architect
● Resident Field Engineer
● Business Value Consultant (if needed)
● Product & Engineering (if needed)
● DevX (if needed)
● Confluent PS&E Director

Confluent Account Team (Confluent account resources):
● Confluent Executive
● Confluent Account Team
● Confluent Customer Success Mgmt
77. Sample Engagement Timeline
(Gantt chart, February through May: Environment Assessment in February; Platform Blueprint in March; Operator Blueprint and Streaming Blueprint in April; Developer Blueprint in May. A Training Week on Administration and a proposed Training Week for Developers are included. Solution architects (SA) and field engineers (FE) are engaged throughout, with enablement / non-engagement periods in between; some scheduling is still to be confirmed.)
78. Confluent Pathfinder Adoption Plan (CONFIDENTIAL)

Pathfinder session:
● Run by: Confluent Engagement Manager & Solution Architect
● Session: 90 mins
● Purpose: for PS to understand project goals and use cases, including the proposed architecture. Provides a clear path to success along the maturity curve, with goals and activities defined for the engagement with Professional Services.

Adoption plan session:
● Run by: Confluent Engagement Manager
● Session: 60 mins
● Purpose: Confluent PS presents back a proposal for PS&E delivery of the topics identified in Pathfinder. The adoption plan is used during delivery to map activities to customer project milestones and to jointly define workshops.
79. CSTA: Customer Success Technical Architect
Korea's CSTA, 박보순: responsible for Platinum (self-managed) and Premier (cloud) customers
• Technical service support for Platinum and Premier customers
• Regular roadmap reviews, relaying customer feedback to the product and engineering teams
• Joint reviews of major releases and agreement on upgrade plans
• Follow-up management of support cases that involved serious issues
• Product demos of newly released features
• Support for testing private-beta features in advance
80. Confluent Training Programs
https://www.confluent.io/ko-kr/training/

Five courses:
● Fundamentals: Confluent Fundamentals for Apache Kafka®
● Administrator track: Apache Kafka® Administration by Confluent; Confluent Advanced Skills for Optimizing Apache Kafka®
● Developer track: Confluent Developer Skills for Building Apache Kafka®; Confluent Stream Processing Using ksqlDB & Apache Kafka® Streams

Delivered in three formats: on-site, public, and online.
81. Developer site relaunch: https://developer.confluent.io/
For, by, and of developers
Jun Rao, Kafka co-creator
https://developer.confluent.io/learn-kafka/architecture/get-started/