The Design and Implementation of Kafka
wang xing

Kafka
“Kafka is a high-throughput distributed pub/sub messaging
system, a distributed, partitioned, replicated commit log service,
with a unique design.”
[Figure: pub/sub system. Producers publish with produce(topic, msg) to Topic 1, Topic 2, and Topic 3; Consumers subscribe to topics and receive each msg.]

Concept
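Broker: a Kafka server; a Kafka cluster consists of multiple Brokers;
Producer: the producer of messages, which actively pushes messages to a Broker;
Consumer: the consumer of messages, which actively pulls messages from a Broker;
Topic: every message is assigned a Topic; the messages of each Topic are split into multiple Partitions that are spread across the Brokers;
Partition: a Topic can be divided into multiple Partitions; the messages within each Partition are ordered, and each Partition has a number of Replicas.
Every message in a Partition has a consecutive sequence number called the Offset, which uniquely identifies a message within the Partition.
At any given time, the messages of one Partition are assigned to only one Consumer in a Consumer Group;
Consumer Group:
several Consumers can form a Consumer Group, and each Consumer consumes the messages of some of the Partitions.
A message is consumed only once within a given Consumer Group.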

Releases
0.9:
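• Security features
• Authentication and authorization for client-to-Broker and Broker-to-ZK connections, encryption of data in transit, etc.
• Kafka Connect
• Builds data-flow connections to external systems and data sets
• New Consumer API
• Drops the dependency on Scala and ZK; pure Java
0.10:
• Kafka Streams
• Implements a set of stream-processing operations such as join, filter, and aggregate, for building low-latency stream-processing systems
• Rack awareness
• Spreads the Replicas of a Partition across different racks
• Timestamp added to Message
• The time at which the message was sent or committed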

Producer
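KafkaProducer:
1. Waits for the topic metadata to be updated, then serializes the key and value;
2. Computes the partition the message belongs to from the topic's partition count and the key's value, then appends the message to the
RecordAccumulator;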
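Step 2 above can be sketched as follows. This is a minimal illustration for a keyed message, not the client's actual code: `choose_partition` is a hypothetical name, and CRC32 is a stand-in hash picked only because it is deterministic and in the standard library. The real client uses its own hash function.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a serialized key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Stand-in hash (assumption): any deterministic hash gives the property
    # that matters here, namely that equal keys land on the same partition.
    return zlib.crc32(key) % num_partitions
```

Because the result depends on the partition count, adding partitions to a topic changes where a given key lands.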
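RecordAccumulator:
1. Uses batches ( ConcurrentMap<TopicPartition, Deque<RecordBatch>> ) to buffer the messages waiting to be sent to each topic partition;
2. The append method tries to tryAppend the message to the last RecordBatch in the corresponding Deque. If there is not enough space,
that RecordBatch becomes read-only and a new RecordBatch is started at the tail of the Deque;
3. The drain method goes through each Broker node in the connected state; for every Partition whose Leader is that Broker node, it takes
the first RecordBatch from the corresponding Deque in batches and assembles a Map<Integer, List<RecordBatch>>, where the key is
the Broker node id and the value is the list of RecordBatches to send to that node;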
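Sender:
1. When KafkaProducer is initialized it starts a KafkaThread that runs the Sender object (a Runnable), which continuously
sends the messages accumulated in the RecordAccumulator;
2. It calls the drain method of RecordAccumulator to get the list of RecordBatches for each Broker node, wraps them in
a ClientRequest, and sends it through the send method of NetworkClient;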
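The append and drain behaviour described on this slide can be modelled in a few lines. This is a simplified sketch, not the Kafka source: the class names mirror the slide, but the fields, the byte-counting rule, and the `leaders` and `connected_nodes` parameters are assumptions made for illustration.

```python
from collections import defaultdict, deque


class RecordBatch:
    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.records = []
        self.size_bytes = 0
        self.writable = True

    def try_append(self, record: bytes) -> bool:
        """Append if the batch is still writable and the record fits."""
        if not self.writable or self.size_bytes + len(record) > self.capacity_bytes:
            return False
        self.records.append(record)
        self.size_bytes += len(record)
        return True


class RecordAccumulator:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        # (topic, partition) -> deque of RecordBatch, oldest batch first
        self.batches = defaultdict(deque)

    def append(self, topic_partition, record: bytes):
        dq = self.batches[topic_partition]
        if dq and dq[-1].try_append(record):
            return
        if dq:
            dq[-1].writable = False  # the full batch becomes read-only
        # Oversized records get a batch of their own (assumption).
        batch = RecordBatch(max(self.batch_size, len(record)))
        batch.try_append(record)
        dq.append(batch)

    def drain(self, leaders, connected_nodes):
        """Take the first batch of each partition whose leader is connected.

        leaders: topic_partition -> broker id. Returns broker id -> batches.
        """
        out = {}
        for tp, dq in self.batches.items():
            node = leaders.get(tp)
            if dq and node in connected_nodes:
                out.setdefault(node, []).append(dq.popleft())
        return out
```

A Sender loop would call `drain` repeatedly and turn each broker's list into one request. Readiness checks such as batch fullness and linger time are left out here.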

Producer
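Case 1: messages are sent to the Broker one at a time with a send latency of 2 ms, so 500 messages can be sent per second;
Case 2: sending is delayed by 8 ms; assuming 20 messages are collected within those 8 ms, 2000 messages can be sent per second;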
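The arithmetic behind the two cases can be checked with a small calculation. It assumes, as the slide's numbers imply, that a batch still takes 2 ms to send and that sends do not overlap, so each cycle lasts the wait time plus the send latency. `messages_per_second` is a hypothetical helper.

```python
def messages_per_second(send_latency_ms, linger_ms=0.0, batch_messages=1):
    """Throughput when each cycle waits linger_ms, then spends send_latency_ms sending."""
    cycle_ms = linger_ms + send_latency_ms
    return 1000.0 / cycle_ms * batch_messages


# Case 1: one message per 2 ms send.
case1 = messages_per_second(send_latency_ms=2)
# Case 2: wait 8 ms, then send the 20 collected messages in one 2 ms send.
case2 = messages_per_second(send_latency_ms=2, linger_ms=8, batch_messages=20)
print(case1, case2)  # 500.0 2000.0
```

The price of the fourfold throughput is that a message may now wait up to 8 ms before it leaves the producer.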
Latency or Throughput?
batch.size: An upper limit on how much data the Kafka Producer will attempt to batch before
sending, specified in bytes (default is 16 KB). Kafka may send a batch before this limit is reached, but will
always send once it is reached. Setting this limit too low therefore hurts throughput
without improving latency. The main reason to set it low is lack of memory: Kafka always
allocates enough memory for the entire batch size, even if latency requirements cause it to send
half-empty batches.
linger.ms: How long the producer waits before sending, so that more messages can
accumulate in the same batch (default is 0). Sometimes we are willing to wait a little longer
to improve overall throughput at the expense of slightly higher latency.
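How the two settings interact can be reduced to one rule: a batch goes out when it is full or when it has waited long enough. The function below is a deliberately simplified model of that rule. `batch_ready` and its parameters are hypothetical names, and the real producer takes further conditions into account.

```python
def batch_ready(batch_bytes, waited_ms, batch_size=16384, linger_ms=0):
    """True when a batch should be sent under the simplified rule.

    batch_size and linger_ms stand for the producer settings of the same
    names; the defaults mirror the defaults quoted above.
    """
    is_full = batch_bytes >= batch_size
    has_lingered = waited_ms >= linger_ms
    return is_full or has_lingered
```

With the default `linger_ms=0` every batch is ready immediately, which is why batching under default settings only happens when messages arrive faster than they can be sent.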

Broker
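• A Topic is divided into multiple Partitions; the messages within each Partition are ordered chronologically;
• Incoming messages are appended one after another to the end of the Partition file;
• Every (Partition, Consumer Group) pair has an offset, which is the next Position this Consumer Group
will consume within that Partition;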
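The three points above amount to an append-only list plus one read position per Consumer Group. The sketch below models that in memory. `PartitionLog` is a hypothetical class, and advancing the position inside `poll` is a simplification of how offsets are actually tracked and committed.

```python
class PartitionLog:
    """In-memory model of one Partition: an append-only message list."""

    def __init__(self):
        self.messages = []
        self.group_positions = {}  # consumer group -> next offset to consume

    def append(self, message):
        """Append at the end of the log and return the message's offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def poll(self, group, max_messages=1):
        """Return the next messages for this group and advance its position."""
        start = self.group_positions.get(group, 0)
        batch = self.messages[start:start + max_messages]
        self.group_positions[group] = start + len(batch)
        return batch
```

Because each group keeps its own position, two groups can read the same Partition independently without affecting each other.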

Broker
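• Each Partition of a Topic has its own storage directory, named Topic name + Partition index;
• A Partition is made up of multiple Segments; each Segment physically consists of an .index index file and a .log data file
Example: the 3rd message in the data file (message 368772 of the whole Partition) is at physical offset 497 in the file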
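A lookup through a sparse index can be sketched with a binary search. Only the entry (3, 497) comes from the slide. The base offset is derived from the slide's numbers (368772 - 3), and the other index entries are invented for illustration. `locate` is a hypothetical helper.

```python
from bisect import bisect_right

# Derived from the slide: message 3 of the segment is message 368772 of the partition.
BASE_OFFSET = 368772 - 3

# (relative offset, byte position in the .log file).
# Only (3, 497) is from the slide; the other entries are illustrative.
INDEX = [(1, 0), (3, 497), (6, 1407)]


def locate(offset):
    """Return (byte position to start scanning from, offset stored at that position)."""
    relative = offset - BASE_OFFSET
    keys = [rel for rel, _ in INDEX]
    i = bisect_right(keys, relative) - 1  # last entry with rel <= relative
    if i < 0:
        raise ValueError("offset precedes the first index entry of this segment")
    rel, position = INDEX[i]
    return position, BASE_OFFSET + rel
```

An offset without its own index entry, such as 368773, resolves to the nearest earlier entry, and the reader then scans forward in the data file from byte 497.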

Broker
Replica assignment:
1. Assume there are n Brokers
2. Make Broker ( i mod n ) the Leader of
Partition i
3. Assign Replica j of Partition i
to Broker ((i + j) mod n)
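The rule above translates directly into code. In this sketch `assign_replicas` is a hypothetical function, Brokers and Partitions are numbered from 0, and Replica 0 of each Partition is treated as the Leader, which follows from setting j = 0 in step 3.

```python
def assign_replicas(num_brokers, num_partitions, replication_factor):
    """Return partition -> list of broker ids; the first id is the Leader."""
    if replication_factor > num_brokers:
        raise ValueError("replication_factor cannot exceed the number of brokers")
    return {
        i: [(i + j) % num_brokers for j in range(replication_factor)]
        for i in range(num_partitions)
    }


print(assign_replicas(num_brokers=3, num_partitions=4, replication_factor=2))
# {0: [0, 1], 1: [1, 2], 2: [2, 0], 3: [0, 1]}
```

Consecutive Partitions get consecutive Leaders, so leadership is spread evenly, and the Replicas of one Partition always sit on different Brokers.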

Broker
• HighWatermark: marks the position up to which all Followers in the ISR have finished replicating
• LogEndOffset: marks the position up to which the Leader has finished appending
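The relationship between the two markers can be expressed as a minimum over the ISR. This is a sketch under the assumption that every replica reports its own log end offset and that the ISR includes the Leader. `high_watermark` and the replica names are hypothetical.

```python
def high_watermark(log_end_offsets, isr):
    """Smallest log end offset among the replicas currently in the ISR."""
    return min(log_end_offsets[replica] for replica in isr)


# The Leader has appended up to 10; follower f1 has replicated up to 8, f2 up to 5.
leo = {"leader": 10, "f1": 8, "f2": 5}
hw_without_f2 = high_watermark(leo, isr={"leader", "f1"})
hw_with_f2 = high_watermark(leo, isr={"leader", "f1", "f2"})
print(hw_without_f2, hw_with_f2)  # 8 5
```

A slow Follower that stays in the ISR holds the HighWatermark back. Once it is dropped from the ISR, the mark can advance.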

Consumer
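KafkaConsumer:
1. The poll method calls the fetchedRecords method of Fetcher, which returns buffered messages that are ready to be consumed. If there are none, it calls
the sendFetches method of Fetcher, which sends a FetchRequest to the Leader Broker of each Partition to obtain new messages
to consume;
Fetcher:
1. Uses records (List<PartitionRecords<K, V>>) to buffer the messages pulled from the Leader Broker that are ready to be consumed
2. The fetchedRecords method takes the buffered messages out of records, converts them into a Map<TopicPartition, List<ConsumerRecord<K,
V>>>, and calls the position method of SubscriptionState to maintain the position consumed so far in each Partition;
3. The sendFetches method builds, for every fetchable Partition, a FetchRequest that carries the position consumed so far in that Partition
and sends it to the corresponding Leader Broker. When handling the FetchResponse, it puts the messages waiting to be consumed for each Partition
into records;
SubscriptionState:
1. Uses assignment (Map<TopicPartition, TopicPartitionState>) to maintain the Partitions currently assigned for consumption and the
position consumed so far in each Partition;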
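The interplay of poll, fetchedRecords, and sendFetches can be modelled with a synchronous toy version. Everything here is a simplified sketch: `FakeLeader` is a hypothetical stand-in for the Leader Broker, the `positions` dict plays the role of SubscriptionState, and requests complete immediately instead of going over the network.

```python
class FakeLeader:
    """Hypothetical stand-in for a Leader Broker: one message list per partition."""

    def __init__(self, logs):
        self.logs = logs

    def fetch(self, partition, position, max_records):
        return self.logs[partition][position:position + max_records]


class Fetcher:
    def __init__(self, leader, assignment, max_records=2):
        self.leader = leader
        self.positions = dict(assignment)  # partition -> next offset to fetch
        self.records = []                  # buffered (partition, first offset, messages)
        self.max_records = max_records

    def fetched_records(self):
        """Hand out the buffered messages and advance each partition's position."""
        out = {}
        for partition, first_offset, messages in self.records:
            out.setdefault(partition, []).extend(messages)
            self.positions[partition] = first_offset + len(messages)
        self.records = []
        return out

    def send_fetches(self):
        """Ask the leader for messages starting at each partition's position."""
        for partition, position in self.positions.items():
            messages = self.leader.fetch(partition, position, self.max_records)
            if messages:
                self.records.append((partition, position, messages))


def poll(fetcher):
    records = fetcher.fetched_records()
    if not records:
        fetcher.send_fetches()
        records = fetcher.fetched_records()
    return records
```

Each call to `poll` returns the next slice of every Partition, and the positions move forward only when messages are handed to the caller.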

Consumer
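For each Consumer Group, one Broker is chosen as the Coordinator. When a new Consumer joins, an existing Consumer leaves,
or the metadata changes, it coordinates the reassignment of Partitions, i.e. rebalancing;
ConsumerCoordinator:
• In the ensurePartitionAssignment method, if the Consumer has just joined the Group or a HeartBeat response signals a rebalance, it
calls the send method of ConsumerNetworkClient to send a JoinGroupRequest to the coordinator. After receiving the assignment
result from the coordinator, it updates the assignment in SubscriptionState;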
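What a rebalance produces can be illustrated with a round-robin split of Partitions over the current group members. This is one possible strategy chosen for brevity, not necessarily the one a given deployment uses. `assign_partitions` and the member names are hypothetical.

```python
def assign_partitions(members, partitions):
    """Give every partition exactly one owner among the group members."""
    members = sorted(members)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


before = assign_partitions(["c1", "c2"], range(4))
after_join = assign_partitions(["c1", "c2", "c3"], range(4))   # c3 joins
after_leave = assign_partitions(["c2", "c3"], range(4))        # c1 leaves
print(before)       # {'c1': [0, 2], 'c2': [1, 3]}
print(after_join)   # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
print(after_leave)  # {'c2': [0, 2], 'c3': [1, 3]}
```

Every membership change recomputes the whole assignment, and in each result a Partition still has exactly one owner within the group.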

Consumer
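KafkaConsumer is not thread-safe, so multiple threads must not operate on the same KafkaConsumer. Kafka's official recommendations for multi-threaded
processing on the Consumer side:
1. Each thread owns its own KafkaConsumer object
• Simple to implement, needs no coordination between threads, and lets the messages within each Partition be processed in order;
• Each KafkaConsumer must maintain a TCP connection to the Kafka cluster, and the number of threads cannot exceed the number of Partitions;
2. A single KafkaConsumer thread pulls messages, and multiple Worker threads consume them
• The number of Worker threads can be chosen freely and is not limited by the number of Partitions;
• The order in which messages are consumed cannot be guaranteed, and it is hard to decide when to commit offsets.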
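The second option can be sketched with a queue between one pulling thread and a pool of Workers. No Kafka client is involved here: the `messages` iterable stands in for what a single KafkaConsumer would return from poll, and `run_pipeline` and `handle` are hypothetical names.

```python
import queue
import threading


def run_pipeline(messages, num_workers, handle):
    """One puller feeds a queue; num_workers threads process the messages."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def worker():
        while True:
            item = tasks.get()
            if item is stop:
                break
            out = handle(item)
            with results_lock:
                results.append(out)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    # The calling thread is the only one that would touch the consumer.
    for message in messages:
        tasks.put(message)
    for _ in workers:
        tasks.put(stop)
    for w in workers:
        w.join()
    return results  # completion order, not necessarily input order
```

The returned list is in completion order, which shows the drawback noted above: once processing is handed to Workers, ordering is lost and the puller cannot easily tell which offsets are safe to commit.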

References
• http://www.jasongj.com/categories/Kafka/
• https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
• http://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0.9-consumer-client
• https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
• http://ingest.tips/2015/07/19/tips-for-improving-performance-of-kafka-producer/
• http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
• https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
