1) Apache InLong is an open source data integration framework that provides automatic, secure, and reliable data transmission. It supports both batch and stream processing using different message queues like Apache Pulsar.
2) Apache Pulsar is used with Apache InLong because it offers very low latency, high throughput, reliable data transmission, and multi-tenancy. KoP allows migrating Kafka workloads to Pulsar.
3) The InLong team contributes to Apache Pulsar with 6 contributors, 60+ Pulsar pull requests, and 50+ KoP pull requests. InLong uses Pulsar for automatic disaster tolerance, multi-tenancy of data streams, and auditing of data streams.
5. About Apache InLong
The History of Apache InLong
[Chart: Average daily data reporting volume (100 million pieces/day), 2013-06 to 2021-07]
2013-06: 200
2014-06: 6,201
2015-06: 25,052
2016-06: 54,905
2017-06: 91,096
2018-06: 138,718
2019-06: 227,360
2020-06: 333,029
2020-12: 457,552
2021-07: 663,285
Milestones marked on the chart: "TubeMQ open source" and "Rename InLong".
6. About Apache InLong
What is Apache InLong
Apache InLong (incubating) is a one-stop data integration framework that provides automatic, secure, and reliable data
transmission capabilities. InLong supports both batch and stream data processing, which makes it well suited for
building data analysis, modeling, and other real-time applications on streaming data.
• Ease of Use: data can be reported, transferred, and distributed easily and quickly.
• Stability & Reliability: delivers high-performance processing for data streams at the 10-trillion level.
• Comprehensive Features: integrates with different types of Message Queue (MQ) services and provides real-time data extract, transform, and load (ETL) and sorting capabilities.
• Scalability: adopts a pluggable architecture that lets you plug modules into the system based on specific protocols.
7. About Apache InLong
The Architecture of InLong
[Figure: InLong architecture. Pipeline stages: Ingest -> Converge -> Cache -> Sort -> Storage.]
• Ingest: SDK, File, HTTP, DB
• Converge: DataProxy
• Cache: TubeMQ, Pulsar, Kafka
• Sort: Real-time, Offline, SDK
• Storage: Hive, Iceberg, HBase, ClickHouse
• Manager: OpenAPI, Metadata, Authority, Scheduler, Naming Service, Audit, Monitor, Cluster
8. Contents
• About Apache InLong
• Apache InLong + Pulsar
• Use Cases of Apache InLong
9. Apache InLong + Pulsar
The Pulsar Data Stream
[Figure: two data streams in InLong Group 1 flow through DataProxy into a Pulsar cluster, and Sort writes them into separate tables.]
• people stream: Smith | 24, Jones | 33, Kevin | 19, ……
• stocks stream: Apple | 175.64, AT&T | 24.78, Tesla | 908.87, ……
• Pulsar topics: tenant/group1/people and tenant/group1/stocks
• Sort output, People table: (Smith, 24), (Jones, 33), (Kevin, 19)
• Sort output, Stocks table: (Apple, 175.64), (AT&T, 24.78), (Tesla, 908.87)
• InLong Stream: a data stream; each stream has a specific flow direction.
• InLong Group: a data stream group; it contains multiple data streams.
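The flow on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not InLong Sort itself: the function names `parse_record` and `sort_into_tables` are hypothetical, records are assumed to be pipe-delimited text as drawn on the slide, and the last segment of the topic name is assumed to identify the stream.

```python
from typing import Dict, Iterable, List, Tuple, Union

Number = Union[int, float]
Row = Tuple[str, Number]


def parse_value(text: str) -> Number:
    """Keep whole numbers as int (ages) and everything else as float (prices)."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_record(line: str) -> Row:
    """Turn 'Smith | 24' into ('Smith', 24). Assumes one '|' separator."""
    name, value = (part.strip() for part in line.split("|", 1))
    return name, parse_value(value)


def sort_into_tables(messages: Iterable[Tuple[str, str]]) -> Dict[str, List[Row]]:
    """Group (topic, line) pairs into one table per stream.

    Assumption: the stream name is the last path segment of the topic,
    e.g. 'tenant/group1/people' -> 'people'.
    """
    tables: Dict[str, List[Row]] = {}
    for topic, line in messages:
        stream = topic.rsplit("/", 1)[-1]
        tables.setdefault(stream, []).append(parse_record(line))
    return tables


if __name__ == "__main__":
    incoming = [
        ("tenant/group1/people", "Smith | 24"),
        ("tenant/group1/stocks", "Apple | 175.64"),
        ("tenant/group1/people", "Jones | 33"),
    ]
    for stream, rows in sort_into_tables(incoming).items():
        print(stream, rows)
```

Keeping one topic per stream means the consumer side needs no per-record type tag; the topic alone decides the target table.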
10. Apache InLong + Pulsar
Why Choose Pulsar?
Comparison | TubeMQ | Kafka | Pulsar
Latency | Very low, 10 ms | Low, 250 ms | Very low, 10 ms
TPS | High, 140,000+/s | Normal, 100,000+/s | High, 140,000+/s
Filter consume | Supports client filter or server filter | Supports client filter | Supports client filter
Data | No copies | Multiple copies | Multiple copies
Reliability | Relies on RAID 10 | Low | High, auto-recovery
Stability | High, running in Tencent for almost 7 years with 33 trillion messages per day | Unstable when the number of topics grows | High
Client language | Supports Java or C++ | 1 client (official support) | 7 kinds of clients
CAP model | AP | AP or CP | CP or AP
11. Apache InLong + Pulsar
KoP (Kafka on Pulsar) Replaces Kafka
• Migrates the Kafka business workloads to Pulsar
• The first team to put KoP into a production environment
• 2 KoP maintainers
[Figure: Kafka producers and InLong DataProxy send messages through KoP on the brokers of a Pulsar cluster (3 brokers, 4 bookies); Kafka consumers and InLong Sort read the messages.]
12. Apache InLong + Pulsar
Pulsar Auto Disaster Tolerance for InLong
[Figure: a primary producer writes to Pulsar Cluster1 and a failover producer writes to Pulsar Cluster2; a Monitor performs the checks, and a consumer reads the data.]
Procedures:
1. Initialize two producers, one for each of the two clusters.
2. Keep only one producer active at a time.
3. Change producer: the Monitor checks the errors inside a time window and switches the active producer.
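The procedure above can be sketched as follows. This is a minimal illustration under assumptions, not InLong's implementation and not the Pulsar client API: the class names `ErrorWindowMonitor` and `FailoverProducer` are hypothetical, each cluster is represented by a plain callable, and the switching rule (a fixed number of errors inside a sliding window) is an assumed policy.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence


class ErrorWindowMonitor:
    """Counts send errors inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float, max_errors: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self.clock = clock
        self.errors: Deque[float] = deque()

    def record_error(self) -> None:
        self.errors.append(self.clock())

    def should_switch(self) -> bool:
        # Drop errors that fell out of the window, then compare to the limit.
        cutoff = self.clock() - self.window_seconds
        while self.errors and self.errors[0] < cutoff:
            self.errors.popleft()
        return len(self.errors) >= self.max_errors

    def reset(self) -> None:
        self.errors.clear()


class FailoverProducer:
    """Holds one sender per cluster; only one of the two is active."""

    def __init__(self, senders: Sequence[Callable[[bytes], None]],
                 monitor: ErrorWindowMonitor) -> None:
        if len(senders) != 2:
            raise ValueError("expected exactly two senders")
        self.senders = senders
        self.monitor = monitor
        self.active = 0  # index of the active sender; 0 is the primary

    def send(self, message: bytes) -> bool:
        """Send via the active producer; return False if the send failed."""
        try:
            self.senders[self.active](message)
            return True
        except Exception:  # broad on purpose: any send failure counts
            self.monitor.record_error()
            if self.monitor.should_switch():
                self.active = 1 - self.active
                self.monitor.reset()
            return False
```

A failed send returns `False` instead of retrying, so the caller decides whether to resend the message after a switch. Resetting the window after a switch keeps errors from the old cluster from immediately triggering a switch back.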
13. Apache InLong + Pulsar
Pulsar Multi Tenancy for InLong Data Stream
Pulsar topic name: persistent://tenant/namespace/topic
• tenant = business
• namespace = InLong group
• topic = InLong stream
• InLong Stream: a data stream; each stream has a specific flow direction.
• InLong Group: a data stream group; it contains multiple data streams.
[Figure: example "school" with the streams "students" and "teachers", which are written to the students table and the teachers table.]
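The mapping on this slide can be written down as a small helper. This is a sketch under assumptions: `pulsar_topic` and `split_topic` are hypothetical names, not part of InLong or the Pulsar client, and the only rule enforced is that no part may be empty or contain "/".

```python
from typing import Tuple


def pulsar_topic(business: str, group: str, stream: str) -> str:
    """Build persistent://tenant/namespace/topic from the InLong names.

    business -> tenant, InLong group -> namespace, InLong stream -> topic.
    """
    for part in (business, group, stream):
        if not part or "/" in part:
            raise ValueError(f"invalid name part: {part!r}")
    return f"persistent://{business}/{group}/{stream}"


def split_topic(topic: str) -> Tuple[str, str, str]:
    """Inverse of pulsar_topic: return (business, group, stream)."""
    _scheme, sep, rest = topic.partition("://")
    parts = rest.split("/")
    if not sep or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a tenant/namespace/topic name: {topic!r}")
    return parts[0], parts[1], parts[2]


if __name__ == "__main__":
    print(pulsar_topic("tenant", "group1", "people"))
    print(split_topic("persistent://tenant/group1/stocks"))
```

Because the group maps to a namespace, all streams in one group share that namespace and can be managed together on the Pulsar side.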
14. Apache InLong + Pulsar
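InLong Data Audit Using Pulsar
• Separate audit data stream
• No data loss
[Figure: the AuditSDK embedded in InLong Agent, InLong DataProxy, and InLong Sort reports to Audit Proxy, which writes to Pulsar; AuditDds reads from Pulsar and stores the audit data in MySQL, ES, and HDFS; the Audit Report is shown at Minute, Hour, and Day granularity.]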
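One way to read the audit flow is as a per-minute count comparison across the three modules that carry the AuditSDK. The sketch below assumes exactly that; the slide does not specify the audit record format, so the `(module, timestamp, count)` tuple, the module names, and the functions `aggregate` and `find_gaps` are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

# Assumed module labels, one per component that embeds the AuditSDK.
MODULES = ("agent", "dataproxy", "sort")

AuditEvent = Tuple[str, float, int]  # (module, unix timestamp, record count)


def minute_bucket(ts_seconds: float) -> str:
    """Format a unix timestamp as a UTC minute, e.g. '1970-01-01 00:01'."""
    moment = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M")


def aggregate(events: Iterable[AuditEvent]) -> Counter:
    """Sum record counts per (minute, module)."""
    totals: Counter = Counter()
    for module, ts_seconds, count in events:
        totals[(minute_bucket(ts_seconds), module)] += count
    return totals


def find_gaps(totals: Counter) -> Dict[str, Dict[str, int]]:
    """Return the minutes in which the modules disagree on the count."""
    gaps: Dict[str, Dict[str, int]] = {}
    for minute in sorted({minute for minute, _ in totals}):
        counts = {m: totals.get((minute, m), 0) for m in MODULES}
        if len(set(counts.values())) > 1:
            gaps[minute] = counts
    return gaps
```

Hour and day reports could be built the same way by changing the bucket format; a minute that appears in `find_gaps` points at the stage where records were delayed or lost.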
15. Apache InLong + Pulsar
InLong Contributions to Pulsar
• 6 contributors
• 60+ Pulsar PRs
• 50+ KoP PRs
17. Use Cases of Apache InLong
Tencent Ads
• Background
• Advertisers' account statements are used as data input for analysis or reconciliation. The inputs are mainly binlogs from MySQL.
• InLong features used:
• Low latency: no more than 10 ms
• No data loss
• Massive consumers: thousands of consumers for one topic
• Massive data: over 100 billion records per day
[Figure: Binlog -> InLong DB Agent -> InLong DataProxy -> Pulsar; Flink, Pulsar Client, and InLong Sort consume from Pulsar; Druid and Hive are the sinks.]
18. Use Cases of Apache InLong
Tencent Security Platform
• Background
• As business moves to the cloud, the number of security agents keeps growing. If a particular module becomes abnormal, the volume of background data can skyrocket and cause an avalanche. A transmission layer is required to act as a "barrier" that cushions the impact on the system.
• InLong features used:
• No data loss
• Massive agents: over 1 million agents
[Figure: many Security Agents -> InLong DataProxy -> Pulsar; Flink and InLong Sort consume from Pulsar; Hive is the sink.]