Building data pipelines is pretty hard! Building a multi-datacenter, active-active, real-time data pipeline for multiple classes of data with different durability, latency, and availability guarantees is much harder. Real-time infrastructure powers critical pieces of Uber (think Surge), and in this talk we will discuss our architecture, technical challenges, and learnings, and how a blend of open-source infrastructure (Apache Kafka and Flink) and in-house technologies has helped Uber scale.
18. Producer Libraries
● High Throughput (average case)
○ Non-blocking, async, batched
● At-least-once (critical use case)
○ Blocking, sync
● Topic Discovery
○ Discovers which Kafka cluster a topic belongs to
○ Able to multiplex to different Kafka clusters (see the sketch below)
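A minimal sketch of the topic-discovery idea in Java: a client-side mapping from topic to owning cluster, consulted before each produce. The class and method names (TopicDiscovery, clusterFor) and the idea of refreshing from a metadata service are assumptions for illustration, not Uber's actual library.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: route each topic to the Kafka cluster that owns it,
// so one client can multiplex across several clusters.
public class TopicDiscovery {
    // topic -> bootstrap servers of the owning cluster; in practice this
    // mapping would be fetched from a metadata service and refreshed.
    private final Map<String, String> topicToCluster = new ConcurrentHashMap<>();

    public TopicDiscovery(Map<String, String> initialMapping) {
        topicToCluster.putAll(initialMapping);
    }

    public String clusterFor(String topic) {
        String cluster = topicToCluster.get(topic);
        if (cluster == null) {
            throw new IllegalStateException("No cluster registered for topic " + topic);
        }
        return cluster;
    }
}

// Example (names invented):
//   new TopicDiscovery(Map.of("surge_events", "surge-kafka:9092"))
//       .clusterFor("surge_events")   ->   "surge-kafka:9092"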
20. Kafka Local Agent
● Producer side persistence
○ Local storage
● Isolates clients from downstream outages, backpressure
● Controlled backfill upon recovery
○ Prevents overwhelming a recovering cluster
24. Kafka Rest Proxy: Internals
● Based on Confluent's open-source REST Proxy
● Performance enhancements
○ Simple HTTP servlets on Jetty instead of Jersey
○ Optimized for binary payloads
○ Performance increase from 7K* to 45K QPS/box
● Caching of topic metadata
● Reliability improvements*
○ Support for Fallback cluster
○ Support for multiple producers (SLA-based segregation)
● Plan to contribute back to community
*Based on benchmarking & analysis done in June 2015
26. Kafka Secondary Cluster
● High availability on regional cluster failure
● REST proxy produces to the secondary cluster on regional cluster failure
● uReplicator/MirrorMaker backfills data to the regional cluster on recovery
31. At-Least-Once
[Diagram: at-least-once pipeline - Application Process (ProxyClient) -> Kafka Proxy Server -> Regional Kafka -> uReplicator -> Aggregate Kafka, with numbered steps 1-8]
● Most of the infrastructure is tuned for high throughput
○ Batching at each stage
○ Acks are returned before data is persisted (ack'ed != committed)
● A single-node failure at any stage can lead to data loss
● Need a reliable pipeline for high-value data, e.g. payments (see the sketch below)
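As a contrast to the throughput-tuned defaults above, here is a minimal sketch of an at-least-once producer using the standard Apache Kafka Java client: acks=all plus a blocking get() makes the send synchronous, so an ack really does mean committed. The broker address and topic name are placeholders; Uber's actual pipeline achieves this end to end through the proxy, not with this client directly.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "regional-kafka:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");   // ack only when all in-sync replicas have the write
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on get() makes the send synchronous: the call does not
            // return until the broker has acknowledged the replicated write.
            producer.send(new ProducerRecord<>("payments", "trip-42", "charge")).get();
        }
    }
}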
32. At-least-once Kafka: Data Flow
[Diagram: at-least-once data flow - Application Process (ProxyClient) -> Kafka Proxy Server -> Regional Kafka -> uReplicator -> Aggregate Kafka, with numbered acknowledgement steps 1-8]
35. Offset Sync Service
● Syncs offsets between aggregate clusters on failover
● MirrorMaker periodically snapshots the regional-to-aggregate offset map to an external datastore
● The offset map is used to recover a safe consumer offset to resume from in the passive DC (see the sketch below)
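A minimal sketch of the offset-translation idea, under the assumption that checkpoints map regional offsets to aggregate offsets per topic-partition; the class name OffsetSync and the in-memory TreeMap store are illustrative (the real service persists the map to an external datastore).

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of offset translation on failover. MirrorMaker would
// periodically record (regionalOffset -> aggregateOffset) checkpoints for
// one topic-partition in one aggregate cluster.
public class OffsetSync {
    private final NavigableMap<Long, Long> regionalToAggregate = new TreeMap<>();

    public void recordCheckpoint(long regionalOffset, long aggregateOffset) {
        regionalToAggregate.put(regionalOffset, aggregateOffset);
    }

    // Find a safe aggregate offset to resume from: the checkpoint at or
    // before the given regional offset. Rewinding slightly is acceptable
    // because the pipeline is at-least-once.
    public long safeResumeOffset(long regionalOffset) {
        Map.Entry<Long, Long> e = regionalToAggregate.floorEntry(regionalOffset);
        return e == null ? 0L : e.getValue();
    }
}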
39. Chaperone - End-to-End Auditing
● In-house Auditing Solution for Kafka
● Running in Production for ~2 Years
○ Audits 20k+ topics for 99.99% completeness
● Open Sourced: https://github.com/uber/chaperone
● Uber Engineering Blog: https://eng.uber.com/chaperone/
[George]
Uber as a product is the real-time movement of people and things.
As a result, Kafka (stream processing) is a critical component of many real-time systems at Uber.
[George]
The rider app sends information to our servers, which is fed to Kafka.
The driver app sends information to our servers, which is fed to Kafka.
This info is passed to a stream processing framework, which does useful calculations.
Then info is passed back to the user in the form of:
Match
Routing info
ETA
Promote Uber Eats....
Uber Eats ETAs change based on timing and need historical input on all trips, i.e. submission time, preparation time, pickup time, etc. This is more complex than the rider app because there is an offline component.
[George]
Of course, this is just the tip of a very large iceberg
[George]
General pub/sub between services.
Kafka is the basis of all stream processing systems at Uber. AthenaX (our self-serve platform) is built on top of Kafka and uses Samza / Flink.
All data that needs to be ingested is written to Kafka.
Changelog transport. Slightly different from the above use cases because of its ordering & durability guarantees.
Logging is used to feed ELK
[George]
We are one of the largest users of Kafka.
[George]
Excluding replication
[George]
Kafka is the hub in Uber’s data infrastructure.
On the left side, we can find many kinds of applications and services. They generate data or logs and send them to Kafka.
On the other side, we have stream processing engines, batch processing engines, and various services to process the data.
Now, let's look a bit deeper into the Kafka box.
Highlight Surge as an important use case to maintain marketplace health.
For example, Surge adjusts prices based on demand/supply statistics, which are derived from data generated by the rider and driver apps.
ELK indexes log messages for troubleshooting.
Samza and Flink are general stream processing engines, used to find insights from the data in real time,
while Hadoop represents the set of tools that process the data in batches.
Meanwhile, data in Kafka is copied to HDFS and S3 for long-term backup.
[George]
We are not using a single giant Kafka cluster per datacenter,
since Kafka itself does not have good support for multi-tenancy and resource isolation.
Instead, we have set up multiple clusters to support specific use cases. For example,
we have a dedicated cluster for Surge, which is super critical for Uber's business,
and we have a cluster for logging topics, which needs very high throughput.
Besides these, we have a secondary cluster in each datacenter,
which accepts data from the REST proxy if the primary Kafka goes down.
[George]
This is a high-level overview of the Kafka architecture at Uber.
Multiple DCs
Producer -> REST Proxy -> DC-local regional cluster -> MirrorMaker/uReplicator -> aggregate cluster (global view of data)
[George]
The next half of the presentation will cover some of the components we've added to scale Kafka at Uber:
Producer Library / Local Agent [Mingmin]
REST Proxy [Mingmin]
Secondary Cluster [Mingmin]
uReplicator [Mingmin]
Offset Sync Service [George]
Transition: Mingmin will discuss the producer-side components.
[Mingmin]
Essentially, the client libraries are HTTP clients,
but we use many techniques inside to achieve high throughput and low produce latency,
like non-blocking/async calls and batching.
Produce latency is how long a produce() call takes to return.
End-to-end latency is how long it takes for consumers to see the data.
As mentioned, we have multiple Kafka clusters.
The client library needs to discover which cluster a topic belongs to and send messages there.
What's more, the client library integrates with the Local Agent to ensure data reliability.
We're going to talk about this in the following section; a sketch of the batching idea follows.
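A minimal sketch of the async, batched produce path in Java, assuming an HTTP endpoint on the REST proxy; the class name BatchingProducer, the newline-delimited batch format, and the endpoint URI are invented for illustration, not Uber's actual client library.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a non-blocking, batching HTTP producer client.
public class BatchingProducer {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI produceUri;                 // e.g. http://rest-proxy:8080/topics/demo (assumed)
    private final List<String> buffer = new ArrayList<>();
    private final int maxBatchSize;
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public BatchingProducer(URI produceUri, int maxBatchSize, long lingerMs) {
        this.produceUri = produceUri;
        this.maxBatchSize = maxBatchSize;
        // Flush periodically so small batches don't wait forever (akin to Kafka's linger.ms).
        flusher.scheduleAtFixedRate(this::flush, lingerMs, lingerMs, TimeUnit.MILLISECONDS);
    }

    // produce() only buffers and returns immediately: this is the low "produce latency".
    public synchronized void produce(String message) {
        buffer.add(message);
        if (buffer.size() >= maxBatchSize) flush();
    }

    // Ship the whole batch in one HTTP request without blocking callers.
    private synchronized void flush() {
        if (buffer.isEmpty()) return;
        String body = String.join("\n", buffer);
        buffer.clear();
        HttpRequest req = HttpRequest.newBuilder(produceUri)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // sendAsync keeps the producer non-blocking; in the real pipeline a
        // failure here would fail over to the Local Agent.
        http.sendAsync(req, HttpResponse.BodyHandlers.discarding())
            .exceptionally(e -> { System.err.println("produce failed: " + e); return null; });
    }
}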
[Mingmin]
The Local Agent is deployed on every host and has come in handy in production on several occasions. It's designed to use minimal resources so that it won't affect the services on that host.
When the REST proxy fails, data from clients fails over to the Local Agent, which keeps the data until the REST proxy comes back.
And when the REST proxy is back, the backfill rate is controlled to avoid overloading it (sketched below).
Data stored on disk uses the Kafka 'Log' format.
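A minimal sketch of the controlled-backfill idea: drain locally spilled records at a capped rate so a recovering REST proxy is not overwhelmed. The record source, send callback, and rate cap are assumptions, not the Local Agent's actual interface.

import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical sketch: replay locally persisted records at a fixed maximum rate.
public class Backfill {
    public static void drain(Iterator<byte[]> spilledRecords,
                             Consumer<byte[]> send,
                             int maxRecordsPerSecond) throws InterruptedException {
        long intervalNanos = 1_000_000_000L / maxRecordsPerSecond;
        long next = System.nanoTime();
        while (spilledRecords.hasNext()) {
            send.accept(spilledRecords.next());   // re-produce to the proxy
            next += intervalNanos;
            long sleepNanos = next - System.nanoTime();
            if (sleepNanos > 0) {
                // Pace the replay so we never exceed maxRecordsPerSecond.
                Thread.sleep(sleepNanos / 1_000_000, (int) (sleepNanos % 1_000_000));
            }
        }
    }
}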
[Mingmin]
And here we build this pipeline to address those requirements.
Basically, in each datacenter there is a regional Kafka cluster.
In front of it, we set up the Kafka REST proxy, which is essentially a web service.
Applications use the proxy client to publish data to Kafka.
At the other end, we have the aggregate Kafka cluster.
uReplicator copies data from multiple regional clusters into the aggregate cluster.
Besides these, the Local Agent and secondary Kafka cluster are used for fault tolerance.
[Mingmin]
So why build it? Why not publish to Kafka directly?
First of all, it simplifies the implementation of the client library,
and therefore makes it feasible to support multiple languages.
The Kafka protocol is not well documented and is hard to implement,
but with the REST proxy, the client library is essentially an HTTP client.
Secondly, it decouples clients from the Kafka cluster. This makes
Kafka maintenance easier to conduct and transparent to end users.
What's more, the number of connections to the Kafka brokers is greatly reduced.
Besides, we have built quota management into the REST proxy to ensure
abnormal producers won't affect the normal ones (see the sketch below).
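A minimal sketch of per-producer quota enforcement as a token bucket keyed by client id; the keying, limits, and class name QuotaManager are assumptions for illustration, not Uber's implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a token bucket per producer, refilled at the allowed
// rate, so one abnormal producer cannot starve the normal ones.
public class QuotaManager {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos = System.nanoTime();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final double ratePerSecond;   // allowed messages/sec per producer (assumed limit)
    private final double burst;           // bucket capacity

    public QuotaManager(double ratePerSecond, double burst) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
    }

    // Returns false when the producer is over quota and should be throttled.
    public boolean tryAcquire(String clientId) {
        Bucket b = buckets.computeIfAbsent(clientId, id -> {
            Bucket nb = new Bucket();
            nb.tokens = burst;
            return nb;
        });
        synchronized (b) {
            long now = System.nanoTime();
            b.tokens = Math.min(burst, b.tokens + (now - b.lastRefillNanos) / 1e9 * ratePerSecond);
            b.lastRefillNanos = now;
            if (b.tokens < 1.0) return false;
            b.tokens -= 1.0;
            return true;
        }
    }
}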
[Mingmin]
The regional clusters are just regular Kafka clusters, but we have a secondary cluster in each DC, which guarantees HA when the regional cluster is unavailable (a fallback sketch follows).
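A minimal sketch of the fallback path: try the regional cluster first, and on failure produce to the secondary cluster. The callback-shaped Producer interface is invented for illustration; the real proxy makes this decision per request.

import java.util.function.BiConsumer;

// Hypothetical sketch of the proxy's fallback produce path.
public class FallbackProducer {
    private final BiConsumer<String, byte[]> regional;
    private final BiConsumer<String, byte[]> secondary;

    public FallbackProducer(BiConsumer<String, byte[]> regional,
                            BiConsumer<String, byte[]> secondary) {
        this.regional = regional;
        this.secondary = secondary;
    }

    public void produce(String topic, byte[] payload) {
        try {
            regional.accept(topic, payload);
        } catch (RuntimeException regionalDown) {
            // The secondary cluster keeps the pipeline available; uReplicator
            // later backfills these messages into the regional cluster.
            secondary.accept(topic, payload);
        }
    }
}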
[Mingmin]
uReplicator copies data from multiple regional clusters into the aggregate cluster.
It is a replacement for the open-source MirrorMaker.
[Mingmin]
It copies thousands of topics between clusters.
Why did we build it?
The open-source MirrorMaker had long rebalance times, up to 20 minutes.
Apache Helix lets us embed customized balancing logic in case certain workers are heavily loaded (sketched below).
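A minimal sketch of the sticky-rebalancing idea that Helix enables: keep live assignments and move only orphaned partitions to the least-loaded workers, instead of reshuffling everything the way the old high-level consumer rebalance did. The data shapes and class name are invented for illustration.

import java.util.*;

// Hypothetical sketch of sticky rebalancing for partition -> worker assignments.
public class StickyRebalancer {
    public static Map<String, String> rebalance(Map<String, String> current,
                                                Set<String> liveWorkers) {
        if (liveWorkers.isEmpty()) throw new IllegalStateException("no live workers");
        Map<String, String> next = new HashMap<>();
        Map<String, Integer> load = new HashMap<>();
        liveWorkers.forEach(w -> load.put(w, 0));

        // Keep assignments whose worker is still alive.
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (liveWorkers.contains(e.getValue())) {
                next.put(e.getKey(), e.getValue());
                load.merge(e.getValue(), 1, Integer::sum);
            } else {
                orphaned.add(e.getKey());
            }
        }
        // Reassign only orphaned partitions, always to the least-loaded worker.
        for (String partition : orphaned) {
            String target = Collections.min(load.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            next.put(partition, target);
            load.merge(target, 1, Integer::sum);
        }
        return next;
    }
}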
[Mingmin]
Most of our Kafka clusters are tuned for high throughput using batching and async techniques.
By tuning the configuration and patching a few parts of the pipeline,
the data can be shipped over without any loss.
[Mingmin]
[George]
Consumers may consume from two different places:
Regional Kafka clusters
The global aggregate cluster, to see a global view of the data
[George]
Chaperone is embedded in or deployed alongside all the components of the pipeline to count every message that flows through it.
The audit results are stored in Cassandra so that users can query them to check whether there is message loss or delay.
In Chaperone, the different kinds of components are called tiers, like the rest-proxy tier, regional tier, or aggregate tier.
The REST proxy and client libraries publish counts to the Chaperone Web Service.
Chaperone then consumes from the Kafka tiers and finally generates a per-topic report on the amount of data in each tier during a given 10-minute window.
If counts during a window differ by more than 0.01% (i.e. below 99.99% completeness), an alert is triggered; a sketch of the counting idea follows.
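A minimal sketch of the counting side of this audit: bucket messages per topic into 10-minute windows and compare tier counts against the 99.99% completeness threshold. Class and method names are invented for illustration; Chaperone's actual implementation is at the GitHub link above.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: count messages per (topic, 10-minute window) at one
// tier, so counts can later be compared across tiers.
public class AuditCounter {
    private static final long WINDOW_MS = 10 * 60 * 1000;
    // key: topic + "@" + windowStart
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(String topic, long eventTimeMs) {
        long windowStart = (eventTimeMs / WINDOW_MS) * WINDOW_MS;
        counts.computeIfAbsent(topic + "@" + windowStart, k -> new LongAdder())
              .increment();
    }

    // Completeness of a downstream tier relative to an upstream tier for one
    // topic/window; anything below 99.99% would trigger an alert.
    public static boolean complete(long upstreamCount, long downstreamCount) {
        if (upstreamCount == 0) return true;
        return (double) downstreamCount / upstreamCount >= 0.9999;
    }
}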
[George]
If there is no loss, the message count is supposed to be the same at each tier.
If there is loss, the gap in the figure highlights when the loss happened and by how much.
((For example, say 10 messages are generated between 11:00am and 11:10am.
When those 10 messages arrive at the regional broker, an audit message saying that
10 messages generated in this 10-minute window have arrived at the regional broker
is generated and stored in the database.
So we can check whether those 10 messages generated in this window have reached all components.))
[George]
Besides completeness, Chaperone tracks message latency and message rate.