In this session we share our experience of building real-time data pipelines at Tencent PCG - pipelines that handle 20 trillion daily messages across 700 clusters, with bursting traffic of 100Gb/s from a single app. We discuss our roadmap for enhancing Kafka to break its limits in terms of scalability, robustness, and cost of operation.
We first built a proxy layer that aggregates physical clusters in a way that is agnostic to the clients. While this architecture solves many operational problems, it requires significant development to stay future-proof. After retrospectives with our customers and careful study of the ongoing work in the community, we then designed a region-federation solution in the broker layer, which allows us to deploy clusters at a much larger scale than previously possible, while at the same time providing better failure recovery and operability. We discuss how we keep this development compatible with KIP-500 and KIP-405, and the two KIPs (693 and 694) that we submitted for discussion.
Enhancing Apache Kafka for Large Scale Real-Time Data Pipeline at Tencent | Kahn Chen and Thirteen Wang, Tencent
1. Kahn Chen, Liang Wang 13/06/2021
Enhancing Apache Kafka for Large Scale Real-Time Data Pipeline at Tencent
2. Self-Introduction
Agenda
Main application scenarios at Tencent
Business requirements in using Apache Kafka
Architecture evolution in using Apache Kafka
Operational data at Tencent
Incorporating new features from the Apache Kafka community
4. Main application scenarios
As one of the world’s biggest internet-based platform companies, Tencent uses technology to enrich the lives of users and assist the digital upgrade of
enterprises. An example product is the popular WeChat application, which has over one billion active users worldwide. The Platform and Content Group (PCG)
is responsible for integrating Tencent’s internet, social, and content platforms. PCG promotes the cross-platform and multi-modal development of IP, with
the overall goal of creating more diversified premium digital content experiences. Since its inception, many major products—from the well-known QQ,
QZone, video, App Store, news, and browser, to relatively new members of the ecosystem such as live broadcast, anime, and movies—have been evolving on
top of consolidated infrastructure and a set of foundational technology components.
At our center stands the real-time messaging system that connects data and computing. We have proudly built many essential data pipelines and messaging
queues around Apache Kafka. Our application of Kafka is similar to other organizations: We build pipelines for cross-region log ingestion, machine learning
platforms, and asynchronous communication among microservices. The unique challenges come from stretching our architecture along multiple dimensions,
namely scalability, customization, and SLA.
5. Business requirements in using Apache Kafka
- Low latency and a high SLA
In production, our businesses often require very low latency. We also found that the annual disk failure rate is as high as 5%, and data-center network failures can occur as well. However, community Kafka currently provides disaster tolerance only at the broker level, which makes it difficult to deliver the low latency our businesses require.
- Topic partitions need to support lossless scale-down
In some real-world scenarios, such as after the traffic peak of a holiday, we need to recycle the machines that were temporarily added to the cluster and scale down the temporarily expanded partitions. However, Kafka currently does not support this capability; it requires developers to change their business logic to achieve it.
- A single cluster needs to support a large number of topics
In Tencent's real-time data warehouse scenarios (table1 -> flink_sql -> table2 ...), table1 and table2 are usually single topics in a real-time stream, so a large number of topics is needed. However, because of the ZooKeeper performance bottleneck, Apache Kafka could not handle more than one million partitions well before version 2.8.0.
6. What we did to meet these requirements
7. First federated Kafka design
Example: A logical topic with eight partitions (P0–P7) is distributed to two physical clusters, each with four partitions (P0–P3)
8. On the one hand, we add two proxy services, one for the producer (pproxy) and another for the consumer (cproxy), which implement the core protocols of the Kafka broker.
They are also responsible for mapping logical topics to their physical incarnations. The application uses the standard Kafka SDK to connect directly to the proxy, which
acts as a broker.
On the other hand, we extract the logic of the Kafka controller node (such as topic metadata) into a separate service, also called "the controller," though
it differs from Kafka's own controller functionality. This service is responsible for collecting the metadata of the physical clusters, composing the partition
information of the logical topics, and publishing it to the proxy brokers.
Key components and workflows
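The proxy's routing of logical partitions can be sketched as follows. This is a minimal illustration assuming the even split from the example above (eight logical partitions over two sub-clusters of four each); the function name and scheme are illustrative, not the actual proxy code.

```python
def route_partition(logical_partition: int, partitions_per_cluster: int = 4):
    """Map a logical partition id to (physical cluster index, physical partition id).

    Assumes logical partitions are split contiguously across equally sized
    sub-clusters, as in the eight-partition example above.
    """
    cluster = logical_partition // partitions_per_cluster
    physical_partition = logical_partition % partitions_per_cluster
    return cluster, physical_partition

# Logical P0-P3 land on cluster 0, logical P4-P7 on cluster 1, each as local P0-P3.
assert route_partition(2) == (0, 2)
assert route_partition(6) == (1, 2)
```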
9. 1. Controller reads the mapping between logical topic and physical topic from the config database.
2. Controller scans metadata of all physical topics from each Kafka cluster.
3. Controller finds the list of available pproxies from heartbeat messages.
4. Controller composes the metadata for the logical topics.
5. Controller pushes both the topic mapping and topic metadata to proxy brokers.
6. The metadata is sent to the Kafka SDK.
Cluster provisioning workflow
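Steps 1, 2, and 4 above (composing logical-topic metadata from physical topics) can be sketched like this. The data structures are hypothetical stand-ins for the controller's real state.

```python
def compose_logical_metadata(mapping, physical_metadata):
    """Compose metadata for logical topics from their physical topics.

    mapping: logical topic -> list of (cluster, physical topic) pairs,
             as read from the config database (step 1).
    physical_metadata: (cluster, physical topic) -> partition count,
             as scanned from each Kafka cluster (step 2).
    Logical partition ids are assigned by renumbering each sub-cluster's
    physical partitions in order.
    """
    logical = {}
    for topic, parts in mapping.items():
        partitions = []
        for cluster, phys_topic in parts:
            for p in range(physical_metadata[(cluster, phys_topic)]):
                # (logical partition id, owning cluster, physical partition id)
                partitions.append((len(partitions), cluster, p))
        logical[topic] = partitions
    return logical

meta = compose_logical_metadata(
    {"logical_a": [("cluster1", "a"), ("cluster2", "a")]},
    {("cluster1", "a"): 4, ("cluster2", "a"): 4},
)
assert len(meta["logical_a"]) == 8           # P0-P7 visible as one topic
assert meta["logical_a"][5] == (5, "cluster2", 1)
```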
10. Cluster decommission workflow
1. Controller marks the cluster (cluster 1) to be decommissioned in read-only mode in the config database.
2. Controller picks up this state change when it fetches the metadata from the database.
3. Controller reconstructs metadata for producers, removing cluster 1 from the physical clusters.
4. Controller pushes a new version of the producer metadata to the proxy. At this moment, data stops being written to cluster 1, while the consuming
side is not impacted.
5. Once all topics in the cluster have expired, the controller reconstructs the consumer metadata again and removes the physical topics that are no
longer valid.
6. The consumer proxy loads the new metadata, and newly arriving consumers no longer pick up cluster 1. This completes the
decommission operation.
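The decommission workflow can be modeled as follows: the producer view drops the read-only cluster immediately, while the consumer view keeps it until all retained data has expired. This is a simplified model with hypothetical names, not the controller's actual code.

```python
def build_metadata(clusters, decommissioned, retained_data_expired):
    """Return (producer_clusters, consumer_clusters) during decommission.

    clusters: all physical sub-clusters backing a logical topic.
    decommissioned: clusters marked read-only in the config database.
    retained_data_expired: True once all topic data on the decommissioned
    clusters has passed retention and been deleted.
    """
    producer_view = [c for c in clusters if c not in decommissioned]
    if retained_data_expired:
        consumer_view = producer_view      # steps 5-6: stale topics removed
    else:
        consumer_view = list(clusters)     # step 4: consumers unaffected
    return producer_view, consumer_view

clusters = ["cluster1", "cluster2"]
# Right after step 4: writes to cluster 1 stop, reads continue.
assert build_metadata(clusters, {"cluster1"}, False) == (["cluster2"], ["cluster1", "cluster2"])
# After retention expires (steps 5-6): cluster 1 disappears for consumers too.
assert build_metadata(clusters, {"cluster1"}, True) == (["cluster2"], ["cluster2"])
```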
11. Advantages and disadvantages of Cluster Federation
Advantages
✓ Topic partitions can be expanded and contracted losslessly
✓ Disaster tolerance can be achieved at the sub-cluster level
Disadvantages
- The proxy service must implement all Kafka protocols, making it difficult to
incorporate new features from the Apache Kafka community
- Deploying the proxy service adds roughly 30% to the cost
12. Now we have two new problems
1. How to reduce the additional 30% overhead caused by the proxy service
2. How to make it easy to incorporate new features from the Apache Kafka
community
13. First, the role of the proxy service is to route partition requests for a logical topic to the correct physical topic. If we want to remove the proxy service, the
metadata obtained by the SDK must point directly at the real brokers. We can merge the metadata of the topics distributed across multiple sub-clusters so that
the SDK sees one complete topic.
Example: A logical topic with eight partitions (P0–P7) is distributed to two physical clusters, the first with four partitions (P0–P3) and the other with four
partitions (P4–P7, partition IDs starting from 4). The topic name must be the same in both clusters.
New design of federated Kafka (aims to remove the proxy service)
(Diagram: the second cluster hosts P4–P7, so its partition IDs do not start from 0.)
14. The improved version of Cluster Federation (without the proxy service layer)
To let the SDK get merged metadata, we added a new module, the "Master cluster": a complete Kafka cluster with ZooKeeper, but with no broker role on it. Under
this "Master cluster" we mount multiple Kafka clusters with brokers, and divide these Kafka clusters into groups. In operation, we ensure that a topic's partitions are
allocated across the sub-clusters of a single group; spanning groups is not allowed. Therefore, when a topic does not have enough partitions, this improved federated
cluster can expand the partitions by expanding the sub-clusters under the topic's group; when the federated cluster does not have enough topic capacity, it can
expand by adding groups.
The "Master cluster" obtains the metadata of all sub-clusters, then sorts and aggregates the metadata of each topic into one complete view that it sends to the SDK. With
that, non-transactional message production already works. Since the sub-clusters are independent of each other and a consumer group tends to subscribe to multiple
topics, the management of consumer groups and transactions must be implemented on the "Master cluster".
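Because partition IDs are globally unique across sub-clusters in this design, the "Master cluster" can merge metadata with a simple sort instead of renumbering. A minimal sketch, with hypothetical data structures:

```python
def merge_metadata(sub_cluster_metadata):
    """Merge per-sub-cluster metadata for the same topics into one view.

    sub_cluster_metadata: cluster name -> {topic: [partition ids]}.
    Partition ids are assumed globally unique (e.g. cluster 2 starts at 4).
    """
    merged = {}
    for cluster, topics in sub_cluster_metadata.items():
        for topic, partitions in topics.items():
            merged.setdefault(topic, {})
            for p in partitions:
                merged[topic][p] = cluster  # lookup: partition -> owning cluster
    # Present partitions in sorted order so the SDK sees one complete topic.
    return {t: dict(sorted(parts.items())) for t, parts in merged.items()}

merged = merge_metadata({
    "cluster1": {"U_TopicA": [0, 1, 2, 3]},
    "cluster2": {"U_TopicA": [4, 5, 6, 7]},
})
assert list(merged["U_TopicA"]) == [0, 1, 2, 3, 4, 5, 6, 7]
assert merged["U_TopicA"][5] == "cluster2"
```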
15. How does the new Cluster Federation work?
After the "Master cluster" starts, the metadata of all sub-clusters is aggregated and sorted by topic.
1. The SDK obtains metadata from the "Master cluster".
2. The "Master cluster" returns the aggregated topic metadata.
3. The SDK reads and writes messages normally based on this metadata, but Metadata update and FindCoordinator requests go only to the
"Master cluster".
Example: a Produce/Fetch request for partition 0 and partition 3 of U_TopicA
16. How do we realize the expansion and disaster tolerance of the "Master Cluster"?
1. We implemented the broker roles from KIP-500, so it becomes very easy for the Coordinator to scale out according to actual needs.
2. When a new sub-cluster is registered or an existing one is unregistered, the change is written to Redis.
3. The data carried by OffsetCommit requests is also written to Redis asynchronously.
4. When the "Master Cluster" fails, we use a backup cluster to recover the data from Redis immediately.
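The offset-backup-and-recovery flow in steps 3 and 4 can be sketched as follows. A plain dict stands in for Redis, and the class and method names are illustrative only.

```python
class MasterCluster:
    """Toy model of the "Master Cluster" backing its state to a Redis-like store."""

    def __init__(self, backing_store):
        self.store = backing_store  # stands in for Redis
        self.offsets = {}

    def commit_offset(self, group, topic, partition, offset):
        self.offsets[(group, topic, partition)] = offset
        # In production this write to Redis is asynchronous (step 3).
        self.store[(group, topic, partition)] = offset

    @classmethod
    def recover(cls, backing_store):
        """Step 4: a backup cluster rebuilds committed offsets from the store."""
        replacement = cls(backing_store)
        replacement.offsets = dict(backing_store)
        return replacement

redis_like = {}
master = MasterCluster(redis_like)
master.commit_offset("g1", "U_TopicA", 0, 42)
backup = MasterCluster.recover(redis_like)  # original master has failed
assert backup.offsets[("g1", "U_TopicA", 0)] == 42
```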
17. How to achieve high cluster performance and high availability
- Limit the number of partitions according to the cluster specifications, to prevent performance
problems caused by an out-of-control file count.
- Mute the topics and partitions of a failed cluster; we have submitted KIP-694 for this.
- Mute reads and writes for lossless cluster upgrades.
- The SDK can mute abnormal partitions; we have also submitted KIP-693 for this.
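The SDK-side muting idea behind KIP-693 can be sketched as a partition selector that round-robins only over healthy partitions. This is an illustrative sketch of the idea, not the KIP's actual implementation.

```python
from itertools import count

# The counter default is created once, so it persists across calls (round-robin).
def pick_partition(partitions, muted, counter=count()):
    """Round-robin over partitions, skipping any that are currently muted."""
    available = [p for p in partitions if p not in muted]
    if not available:
        raise RuntimeError("all partitions are muted")
    return available[next(counter) % len(available)]

partitions = [0, 1, 2, 3]
muted = {1}  # e.g. partition 1 lives on a failed sub-cluster
chosen = [pick_partition(partitions, muted) for _ in range(3)]
assert chosen == [0, 2, 3]  # writes flow only to healthy partitions
```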
18. Operational data at Tencent
Read and write recovery time after cluster failure
- Apache Kafka: it usually takes 30-60 minutes to restart a broker with 20TB of data
- Tencent's customized Kafka: the client stops writing to the abnormal sub-cluster, and the recovery time is <1 min
Number of brokers in a single cluster
- Apache Kafka: before version 2.8.0, about 200 brokers per cluster
- Tencent's customized Kafka: up to 1000 brokers per cluster
Machine recycling and partition shrinking after traffic peaks
- Apache Kafka: machines can be recycled through rebalance, but partition scale-down is not supported
- Tencent's customized Kafka: business-insensitive partition scale-down and machine recycling without rebalancing
19. Incorporating new features from the Apache Kafka community
When we designed the federation mode, Kafka 2.8.0 had not yet been released, so our overall
design aims to stay compatible with the upcoming KIP-500 features, such as
role splitting and unified metadata management in the Controller.
In the future we will incorporate these cool features from the community.