Real-Time Data Pipeline @ Uber
Mingmin Chen
George Teo
Seattle Apache Kafka Meetup
Jan 18, 2018
Agenda
● Use Cases & Current Scale
● Data Infrastructure @ Uber
● Kafka @ Uber
○ Rest Proxy & Clients
○ Local Agent
○ uReplicator (Mirrormaker)
○ Offset Sync Service
○ Chaperone (Auditing)
○ Cluster Balancing
● Future Work
Use Cases
Real-time Driver-Rider Matching
[Diagram: App Views and Vehicle information -> KAFKA -> Stream Processing (Driver-Rider Match, ETA)]
UberEATS - Real-Time ETAs
A bunch more...
● Fraud Detection
● Share My ETA
● Driver & Rider Signups
● Etc.
Kafka - Use Cases
● General Pub-Sub
● Stream Processing
○ AthenaX - Self-Serve Platform (Samza, Flink)
● Database Changelog Transport
○ Schemaless, Cassandra, MySQL
● Ingestion
○ HDFS, S3
● Logging
Scale (excluding replication)
* obligatory show-off slide
● Trillion+ Messages/Day
● ~PBs Data Volume
● Tens of Thousands of Topics
Data Infrastructure @ Uber
Apache Kafka is Uber’s Data Hub
[Diagram: Kafka in the center, with producers on one side and consumers on the other]
PRODUCERS: Rider App, Driver App, API / Services, (Internal) Services, DATABASES (Cassandra, Schemaless, MySQL), etc.
CONSUMERS: Samza / Flink (real-time analytics, alerts, dashboards), ELK (debugging), Hadoop, Vertica / Hive, AWS S3, Surge, Mobile App
Consumer-side uses: applications, data science, analytics, reporting, ad-hoc exploration
Kafka @ Uber
Requirements
● Scale Horizontally
● API Latency (<5ms typically)
● Availability -> 99.99%
● Durability -> 99.99%; 100% -> Critical Customers
● Multi-DC Replication
● Multi-Language Support
○ Java, Go, Python, Node.js, C++
● Auditing
Kafka Clusters
● Running Kafka 0.10.2
● Use Case-based
○ Logging
○ Database Changelogs
○ Highly Isolated & Reliable e.g. Surge
○ High Value Data (e.g. Signups)
● Fallback Secondary Clusters
● Global Aggregates
○ Offset Sync Service
Kafka Ecosystem @ Uber
[Diagram, same layout in DC1 and DC2: Applications [ProxyClient] -> Kafka REST Proxy -> Regional Kafka -> uReplicator -> Aggregate Kafka. A Local Agent and a Secondary Kafka cluster sit beside the produce path, and the Offset Sync Service connects the Aggregate Kafka clusters of the two DCs.]
Producer Libraries
● High Throughput (average case)
○ Non-blocking, async, batched
● At-least-once (critical use case)
○ Blocking, sync
● Topic Discovery
○ Discovers the Kafka cluster a topic belongs to
○ Able to multiplex to different Kafka clusters
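Topic discovery and multiplexing can be pictured roughly as follows. This is only a sketch under assumptions: the lookup table, the cluster names, and the functions `discover_cluster` and `group_by_cluster` are hypothetical and are not taken from Uber's client libraries.

```python
from collections import defaultdict

# Hypothetical topic -> cluster table; a real client would likely fetch this
# from a metadata service rather than hard-code it.
TOPIC_TO_CLUSTER = {
    "app-logs": "logging-cluster",
    "db-changelog": "changelog-cluster",
    "surge-events": "surge-cluster",
}
DEFAULT_CLUSTER = "regional-cluster"  # assumed fallback for unknown topics


def discover_cluster(topic):
    """Return the cluster that owns a topic, falling back to a default."""
    return TOPIC_TO_CLUSTER.get(topic, DEFAULT_CLUSTER)


def group_by_cluster(messages):
    """Multiplex (topic, payload) pairs into one batch per destination cluster."""
    batches = defaultdict(list)
    for topic, payload in messages:
        batches[discover_cluster(topic)].append((topic, payload))
    return dict(batches)
```

Grouping by cluster before sending keeps one batch per destination, which fits the batched, non-blocking mode listed above.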
Kafka Local Agent
● Producer side persistence
○ Local storage
● Isolates clients from downstream outages, backpressure
● Controlled backfill upon recovery
○ Prevents overwhelming a recovering cluster
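The controlled backfill idea can be sketched as a buffer that replays only a bounded number of messages per drain cycle. The class `LocalBuffer` and its `max_per_tick` parameter are hypothetical, and the buffer here is in memory for brevity, whereas the slide describes local storage.

```python
from collections import deque


class LocalBuffer:
    """Holds messages locally while the downstream is unavailable (hypothetical)."""

    def __init__(self, max_per_tick):
        self.pending = deque()
        self.max_per_tick = max_per_tick  # cap on messages replayed per drain call

    def append(self, message):
        self.pending.append(message)

    def drain(self, send):
        """Replay at most max_per_tick messages, oldest first; stop on first failure."""
        sent = 0
        while self.pending and sent < self.max_per_tick:
            if not send(self.pending[0]):
                break  # downstream still unhealthy; keep the message for later
            self.pending.popleft()
            sent += 1
        return sent
```

Calling `drain` on a timer gives a simple upper bound on the backfill rate, so a cluster that has just recovered is not hit with the whole backlog at once.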
Local Agent in Action
[Figure missing in source]
Why Kafka REST Proxy?
● Simplified Client API
○ Multi-lang Support
● Decouple Clients from Kafka Brokers
○ Thin Clients = Operational Ease
○ Easier Kafka Upgrades
● Enhanced Reliability
○ Quota Management
○ Primary & Secondary Clusters
Kafka Rest Proxy: Internals
● Based on Confluent’s open-source REST Proxy
● Performance enhancements
○ Simple HTTP servlets on Jetty instead of Jersey
○ Optimized for binary payloads
○ Performance increase from 7K* to 45K QPS/box
● Caching of topic metadata
● Reliability improvements*
○ Support for fallback cluster
○ Support for multiple producers (SLA-based segregation)
● Plan to contribute back to community
*Based on benchmarking & analysis done in Jun 2015
Kafka Secondary Cluster
● High availability on regional cluster failure
● REST Proxy produces to the Secondary Cluster on Regional Cluster failure
● uReplicator/MirrorMaker backfills data to the Regional Cluster on recovery
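The fallback path can be sketched as a try-primary-then-secondary wrapper. The function `produce_with_fallback`, the exception `ClusterUnavailable`, and the two produce callables are hypothetical stand-ins; the REST Proxy's real failure detection is not described in the slides.

```python
class ClusterUnavailable(Exception):
    """Raised by a produce function when its cluster cannot accept writes."""


def produce_with_fallback(message, produce_primary, produce_secondary):
    """Try the regional (primary) cluster first; on failure, write to the secondary.

    Returns the name of the cluster that accepted the message.
    """
    try:
        produce_primary(message)
        return "regional"
    except ClusterUnavailable:
        produce_secondary(message)
        return "secondary"
```

Messages that land on the secondary cluster still need the backfill step described above before consumers of the regional cluster can see them.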
uReplicator
● In-house Intercluster Replication Solution
○ Apache Helix-based
○ Mirror all traffic between & within DCs
○ Lower rebalance latencies
● Running in Production ~2 Years
● Open Sourced: https://github.com/uber/uReplicator
● Uber Engineering Blog: https://eng.uber.com/ureplicator/
Cluster Balancing
● No Auto Rebalancing
● Manual Placement is Hard
● Auto Plan Generation
○ And execution!
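The slides do not say how plans are generated. As a stand-in, the sketch below uses a simple greedy heuristic: place the heaviest partitions first, each on the currently least-loaded broker. The function `generate_plan`, the load numbers, and the single-replica simplification are all assumptions.

```python
import heapq


def generate_plan(partition_load, brokers):
    """Assign each partition to the least-loaded broker, heaviest partitions first."""
    # Min-heap of (total assigned load, broker name).
    heap = [(0.0, broker) for broker in sorted(brokers)]
    heapq.heapify(heap)
    plan = {}
    # Sort by descending load; break ties by partition name for determinism.
    for partition, load in sorted(partition_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, broker = heapq.heappop(heap)
        plan[partition] = broker
        heapq.heappush(heap, (total + load, broker))
    return plan
```

A production planner would also have to account for replicas, rack placement, and the cost of moving data, none of which this sketch models.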
At-Least-Once
[Diagram: Application Process (ProxyClient) -> Kafka Proxy Server -> Regional Kafka -> uReplicator -> Aggregate Kafka, with steps numbered 1-8]
● Most of the infrastructure is tuned for high throughput
○ Batching at each stage
○ Ack before being persisted (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments
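The difference between "ack'ed" and "committed" can be illustrated with a toy model of one pipeline stage. The `Stage` class is hypothetical and only models the ordering of acknowledgement and persistence, not any real Kafka or proxy component.

```python
class Stage:
    """One hop in the pipeline (toy model, not Uber's implementation)."""

    def __init__(self, ack_after_persist):
        self.ack_after_persist = ack_after_persist
        self.buffer = []     # in-memory batch, lost on crash
        self.persisted = []  # durable storage, survives a crash

    def receive(self, message):
        """Accept a message; return True once the sender may treat it as acked."""
        if self.ack_after_persist:
            self.persisted.append(message)  # persist first, then ack
        else:
            self.buffer.append(message)     # ack now, persist later in a batch
        return True

    def flush(self):
        """Write the buffered batch to durable storage."""
        self.persisted.extend(self.buffer)
        self.buffer.clear()

    def crash(self):
        """Simulate a node failure: anything acked but not yet flushed is gone."""
        self.buffer.clear()
```

With `ack_after_persist=False`, a crash between `receive` and `flush` loses a message the sender already considers delivered, which is the single-node failure case named above.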
At-least-once Kafka: Data Flow
[Diagram: Application Process (ProxyClient) -> Kafka Proxy Server -> Regional Kafka -> uReplicator -> Aggregate Kafka, with steps numbered 1-8]
Consumer
[Diagram, DC1: Applications [ProxyClient] -> Kafka REST Proxy -> Regional Kafka -> uReplicator -> Aggregate Kafka, with two Consumer Applications; the one labeled (Global View) reads from Aggregate Kafka]
Offset Sync Service
● Used for syncing offsets between aggregate clusters on failover
● MirrorMaker periodically snapshots the regional-to-aggregate offset map to an external datastore
● Uses the offset map to recover a safe consumer offset to resume from in the passive DC
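One way to read the offset map is as a list of checkpoints that pair a regional offset with the aggregate offset it was mirrored to. The sketch below translates a committed offset from the active aggregate cluster to the passive one by rounding down to the nearest checkpoint. The function name, the checkpoint layout, the single-partition view, and the "start from 0 when no checkpoint exists" rule are all assumptions.

```python
import bisect


def safe_resume_offset(committed, active_map, passive_map):
    """Translate an offset committed in the active aggregate cluster to the passive one.

    Each map is a list of (regional_offset, aggregate_offset) checkpoints sorted
    ascending. Rounding down may replay some messages but never skips any.
    """
    # Latest checkpoint in the active cluster at or before the committed offset.
    active_aggregate = [agg for _, agg in active_map]
    i = bisect.bisect_right(active_aggregate, committed) - 1
    if i < 0:
        return 0  # no checkpoint yet: assume restart from the beginning
    regional = active_map[i][0]

    # Latest checkpoint in the passive cluster at or before that regional offset.
    passive_regional = [reg for reg, _ in passive_map]
    j = bisect.bisect_right(passive_regional, regional) - 1
    return passive_map[j][1] if j >= 0 else 0
```

Rounding down at both steps trades duplicate delivery for safety, which matches the at-least-once goal stated earlier in the deck.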
Auditing - Chaperone
[Screenshot omitted in source: Chaperone - Track Counts]
[Screenshot omitted in source: Chaperone - Track Latency]
Chaperone - End to End Auditing
● In-house Auditing Solution for Kafka
● Running in Production for ~2 Years
○ Audit 20k+ topics for 99.99% completeness
● Open Sourced: https://github.com/uber/chaperone
● Uber Engineering Blog: https://eng.uber.com/chaperone/
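A count-based completeness audit can be sketched by bucketing message timestamps into fixed windows at two points in the pipeline and comparing the counts. The functions `bucket_counts` and `completeness` and the 600-second default window are assumptions for illustration, not Chaperone's actual interface.

```python
from collections import Counter


def bucket_counts(timestamps, bucket_seconds=600):
    """Count messages per fixed-size time bucket, keyed by bucket start time."""
    return Counter(ts - ts % bucket_seconds for ts in timestamps)


def completeness(upstream, downstream):
    """Per-bucket fraction of upstream messages that were also seen downstream."""
    return {
        bucket: min(downstream.get(bucket, 0) / count, 1.0)
        for bucket, count in upstream.items()
    }
```

A bucket whose ratio falls below the target (for example 0.9999) would flag loss between the two audit points.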
Future Work
● Richer consumer semantics for service owners
○ DLQ
○ Per partition competing consumer
● Multi-zone Clusters
○ Durability during DC wide outages
● Chargebacks
● Efficiency Enhancements
○ Intelligent aggregates, automated topic GC, etc.
● uReplicator 2.0
● Open Source
Thank you
More open-source projects at eng.uber.com

Editor's Notes

  1. Introductions
  2. [George]
  3. [George] Uber as a product is the real-time movement of people and things. As a result, Kafka (stream processing) is a critical component of many real-time systems at Uber.
  4. [George] The rider app sends information to our servers, which is fed to Kafka. The driver app sends information to our servers, which is also fed to Kafka. This information is passed to a stream processing framework, which does useful calculations. The results are then passed back to the user in the form of a match, routing info, and an ETA.
  5. Promote UberEATS. ETAs change based on timings. They need historical input on all trips, i.e. submission time, preparation time, pickup time, etc. This is more complex than the rider app because there is an offline component.
  6. [George] Of course, this is just the tip of a very large iceberg
  7. [George] General pub-sub between services. Kafka is the basis of all stream processing systems at Uber. AthenaX (our self-serve platform) is built on top of Kafka and uses Samza / Flink. All data that needs to be ingested is written to Kafka. Changelog transport is slightly different from the above use cases because of its ordering and durability guarantees. Logging is used to feed ELK.
  8. [George] We are one of the largest users of Kafka.
  9. [George] Excluding replication
  10. [George]
  11. [George]
  12. [George] Kafka is the hub in Uber’s data infrastructure. On the left side, we can find many kinds of applications and services. They generate data or logs and send them to Kafka. On the other side, we have stream processing engines, batch processing engines, and various services that process the data. Now, let’s look a bit deeper into the Kafka box. Highlight Surge as an important use case to maintain marketplace health? For example, Surge adjusts prices based on demand/supply statistics, which are derived from data generated by the rider and driver apps. ELK indexes log messages for troubleshooting. Samza and Flink are general stream processing engines, used to find insights in the dataset in real time, while Hadoop represents the set of tools that process the data in batches. Meanwhile, data in Kafka is copied to HDFS and S3 for long-term backup.
  13. [George]
  14. [George]
  15. [George] We are not using a single giant Kafka cluster per data center, since Kafka itself does not have good support for multi-tenancy and resource isolation. Instead, we have set up multiple clusters to support specific use cases. For example, we have a dedicated cluster for Surge, which is super critical for Uber's business, and a cluster for logging topics, which need very high throughput. Besides, we have a secondary cluster in each data center, which accepts data from the REST proxy if the primary Kafka cluster goes down.
  16. [George] This is a high-level overview of the Kafka architecture at Uber. Multiple DCs. Producer -> REST Proxy -> DC-local Regional Cluster -> MirrorMaker/uReplicator -> Aggregate Cluster (global view of data).
  17. [George] The next half of the presentation will cover some of the components we’ve added to scale Kafka at Uber: Producer Library/Local Agent [Mingmin], REST Proxy [Mingmin], Secondary [Mingmin], uReplicator [Mingmin], Offset Sync Service [George]. Transition: Mingmin will discuss the producer-side components.
  18. [Mingmin] Essentially, the client libraries are HTTP clients. But we use many techniques inside to achieve high throughput and low produce latency, like non-blocking/async calls and batching. Produce latency is how long it takes to call produce() and return from the method call. End-to-end latency is how long it takes for consumers to see the data. As mentioned, we have multiple Kafka clusters. The client library needs to discover which cluster a topic belongs to and send messages there. What’s more, the client library integrates with LocalAgent to ensure data reliability. We’re going to talk about this in the following section.
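A minimal sketch of the batching and topic-to-cluster routing described in this note. This is not Uber's client library: `BatchingProducer`, `topic_to_cluster`, and the `send` callback (a stand-in for the HTTP POST to the REST proxy) are all hypothetical names, and the sketch flushes synchronously where a real client would be non-blocking.

```python
from collections import defaultdict


class BatchingProducer:
    """Toy proxy client: produce() only buffers, batches are sent when full."""

    def __init__(self, topic_to_cluster, send, max_batch=3):
        self.topic_to_cluster = topic_to_cluster  # hypothetical topic -> cluster lookup
        self.send = send                          # stand-in for the HTTP call to the REST proxy
        self.max_batch = max_batch
        self.buffers = defaultdict(list)          # (cluster, topic) -> pending messages

    def produce(self, topic, message):
        # Discover which cluster owns the topic, then buffer the message.
        key = (self.topic_to_cluster[topic], topic)
        self.buffers[key].append(message)
        if len(self.buffers[key]) >= self.max_batch:
            self._flush_one(key)

    def _flush_one(self, key):
        batch = self.buffers.pop(key, [])
        if batch:
            cluster, topic = key
            self.send(cluster, topic, batch)

    def flush(self):
        # Send whatever is still buffered, one batch per (cluster, topic).
        for key in list(self.buffers):
            self._flush_one(key)
```

Because produce() returns as soon as the message is buffered, produce latency stays low, while end-to-end latency also includes the time a message waits for its batch to fill or be flushed.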
  19. [Mingmin]
  20. [Mingmin] LocalAgent is deployed on every host. It has come in handy in production on several occasions. It’s been designed to use minimal resources, so that it won’t affect services on that host. When the REST proxy (RP) fails, data from the client fails over to LocalAgent, which keeps the data until the RP comes back. When the RP is back, the backfilling rate is controlled to avoid overloading it. Data stored on disk uses the Kafka ‘Log’.
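The failover and controlled backfill described in this note can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual LocalAgent: `LocalAgentSketch`, `proxy_send`, and `backfill_per_tick` are hypothetical names, and an in-memory deque stands in for the on-disk Kafka log.

```python
from collections import deque


class LocalAgentSketch:
    """Toy fallback: spool messages while the proxy is down, backfill slowly."""

    def __init__(self, proxy_send, backfill_per_tick=2):
        self.proxy_send = proxy_send              # hypothetical; returns True on success
        self.backfill_per_tick = backfill_per_tick
        self.backlog = deque()                    # stand-in for the on-disk log

    def produce(self, msg):
        # Try the REST proxy first; keep the message locally if it fails.
        if not self.proxy_send(msg):
            self.backlog.append(msg)

    def backfill_tick(self):
        """Replay at most backfill_per_tick spooled messages; return how many were sent."""
        sent = 0
        while self.backlog and sent < self.backfill_per_tick:
            if not self.proxy_send(self.backlog[0]):
                break                             # proxy still down, retry on the next tick
            self.backlog.popleft()
            sent += 1
        return sent
```

Calling `backfill_tick()` on a fixed timer caps the replay rate, which is one simple way to keep a recovering proxy from being overloaded.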
  21. [Mingmin]
  22. [Mingmin] And here is the pipeline we built to address those requirements. Basically, in each data center, there is a regional Kafka cluster. In front of it, we set up the Kafka REST proxy, which is essentially a web service. Applications use the proxy client to publish data to Kafka. At the other end, we have the aggregate Kafka cluster. uReplicator copies data from multiple regional clusters into the aggregate cluster. Besides, LocalAgent and SecondaryKafka are used for fault tolerance.
  23. [Mingmin] So why build it? Why not publish to Kafka directly? First of all, it simplifies the implementation of the client library and therefore makes it feasible to support multiple languages. The Kafka protocol is not well documented and is hard to implement, but with the REST proxy the client library is essentially an HTTP client. Secondly, it decouples the client from the Kafka cluster. This makes Kafka maintenance easier to conduct and transparent to end users. What’s more, the number of connections to Kafka brokers is reduced a lot. Besides, we have built quota management into the REST proxy to ensure an abnormal producer won’t affect the normal ones.
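The talk does not say how the REST proxy enforces quotas. One common way to keep an abnormal producer from starving the others is a per-producer token bucket; the sketch below assumes that approach, and `ProducerQuota`, `rate_per_sec`, and `burst` are hypothetical names.

```python
import time


class ProducerQuota:
    """Per-producer token bucket (an assumed technique, not confirmed by the talk)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec    # tokens refilled per second
        self.burst = burst          # maximum tokens a producer can accumulate
        self.clock = clock          # injectable so the sketch can be tested
        self.buckets = {}           # producer_id -> (tokens, last_refill_time)

    def allow(self, producer_id, cost=1):
        now = self.clock()
        tokens, last = self.buckets.get(producer_id, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self.buckets[producer_id] = (tokens, now)
        return allowed
```

Each producer draws from its own bucket, so a noisy producer that exhausts its tokens is rejected without affecting requests from other producers.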
  24. [Mingmin]
  25. [Mingmin] The regional clusters are just regular Kafka clusters, but we also have a secondary cluster in each DC, which guarantees HA when the regional cluster is unavailable.
  26. [Mingmin]
  27. [Mingmin] uReplicator copies data from multiple regional clusters into the aggregate cluster. It is a replacement for the open-source MirrorMaker.
  28. [Mingmin] Copies thousands of topics between clusters. Why did we build it? Long rebalance times, up to 20 minutes. Apache Helix lets us embed customized balancing logic in case certain workers are heavily loaded.
  29. [Mingmin]
  30. [Mingmin]
  31. [Mingmin] Most of our Kafka clusters are tuned for high throughput using batching and async techniques. By tuning the configuration and patching a few parts of the pipeline, the data can be shipped over without any loss.
  32. [Mingmin]
  33. [George] Consumers may consume from two different places: the regional Kafka clusters, or the global aggregate cluster to see a global view of data.
  34. [George]
  35. [George]
  36. [George] Chaperone is embedded in or deployed for all the components along the pipeline to count every message that flows through them. The audit results are stored in Cassandra so that users can query them to check whether there is message loss or delay. In Chaperone, the different kinds of components are called tiers, like Rest_proxy_tier, regional_tier, or aggregate tier. The REST proxy and client libraries publish counts to the Chaperone Web Service. Chaperone then consumes from the Kafka tiers and finally generates a per-topic report on the amount of data in each tier during a given 10-minute window. If counts during a window differ by more than 0.01% (i.e. 99.99% completeness), an alert is triggered.
  37. [George] If there is no loss, the message count is supposed to be the same at each tier. If there is loss, the gap in the figure highlights when the loss happened and by how much. (For example, 10 messages are generated between 11:00am and 11:10am. When those 10 messages arrive at the regional broker, an audit message saying that 10 messages generated in this 10-minute window have arrived at the regional broker can be generated and stored in the database. So we can check whether the 10 messages generated in this window have reached all components.)
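The windowed counting and the 0.01% completeness check can be sketched as below. This is a toy model of the idea, not Chaperone's implementation: `AuditCounter`, the tier labels, and the in-memory `Counter` (standing in for the audit results kept in Cassandra) are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 600       # 10-minute audit window
LOSS_THRESHOLD = 0.0001    # 0.01%, i.e. 99.99% completeness


def window_start(event_ts):
    """Bucket a message by the time it was generated, not the time it arrived."""
    return int(event_ts) - int(event_ts) % WINDOW_SECONDS


class AuditCounter:
    def __init__(self):
        self.counts = Counter()   # (tier, topic, window_start) -> message count

    def record(self, tier, topic, event_ts):
        self.counts[(tier, topic, window_start(event_ts))] += 1

    def should_alert(self, topic, window, upstream, downstream):
        """True if the two tiers' counts for this window differ by more than 0.01%."""
        up = self.counts[(upstream, topic, window)]
        down = self.counts[(downstream, topic, window)]
        if up == 0:
            return False          # nothing was produced in this window
        return abs(up - down) / up > LOSS_THRESHOLD
```

Bucketing by generation time is what lets every tier count the same 10 messages into the same window, so the per-tier counts are directly comparable.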
  38. [George] Besides, Chaperone tracks msg latency and msg rate.
  39. [George]
  40. [George]
  41. [George]