Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop sits at the core of the data infrastructure. This talk covers the journey of Hadoop at Uber and our plans for scaling to billions of trips: how we grew from 10 to 2,000 nodes and how we plan to reach tens of thousands, the mistakes we made, the lessons we learned, the wins we had, and how we process billions of events per day. We discuss the unique challenges and real-world use cases we face, and how we plan to co-locate Uber’s service architecture with batch workloads (data pipelines, machine learning, and analytics). Uber has made many improvements to the Hadoop ecosystem and solved some problems in ways that have not been done before. We hope this presentation serves as an example that encourages the audience to enhance the ecosystem, grow the communities around these projects, and benefit the big data space as a whole. The audience is anyone working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. The talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
Hadoop Infrastructure @Uber Past, Present and Future
1. U B E R | Data
Hadoop Infrastructure @Uber Past, Present and Future
Mayank Bansal
2. U B E R | Data
Senior Software Engineer, Uber
Hadoop Committer, Oozie PMC
Past: Sr. Staff @ eBay, Worked on Hadoop
Sr. Eng @ Yahoo, Worked on Oozie
Who Am I
3. U B E R | Data
• Past
• Challenges
• Present
• Done Along the way
• Future
• Work Ahead
Agenda
4. U B E R | Data
“Transportation as reliable as running water, everywhere, for everyone”
Uber’s Mission
75+ Countries 500+ Cities
And growing…
9. U B E R | Data
Uber’s Data Audience
● 1000s of City Operators (Uber Ops!)
○ On-the-ground team who run and scale Uber’s
transportation network
● 100s of Data Scientists and Analysts
○ Spread across various functional groups including
Engineering, Marketing, BizDev, etc.
● 10s of Engineering Teams
○ Focused on building automated Data Applications
10. U B E R | Data
Data Infra Once Upon a time.. (2014)
(Architecture diagram)
Sources: Kafka Logs, Key-Val DB, RDBMS DBs, S3, Applications, …
Warehouse: Vertica Data Warehouse, EMR
Consumers: ETL, Business Ops, A/B Experiments, Adhoc Analytics (City Ops), Data Science
11. U B E R | Data
Pain Points
● Scalability
○ Data grew faster than we expected
● Reliability
○ There were no checks in place to validate the data
12. U B E R | Data
Hadoop Scale (2015)
~Few servers, some data
No Hive, no Presto
~100 jobs/day
Some Spark apps
13. U B E R | Data
Data Infrastructure Today
(Architecture diagram)
Sources: Kafka8 Logs, Schemaless DB, SOA DBs, Service Accounts, …
Storage & compute: HDFS, Hive, Spark | Presto, Vertica
Consumers: ETL, Machine Learning, Experimentation (Data Science), Adhoc Analytics (Ops/Data Science), City Ops, Data Science
14. U B E R | Data
Hadoop Scale Today
~Few thousand servers
Many, many PBs
~20k Hive queries/day
~100k Presto queries/day
100k jobs/day
Few thousand Spark apps/day
15. U B E R | Data
A Few things we solved along the way..
● Strict Schema Management
○ Because our largest data audience is SQL
savvy! (1000s of Uber Ops!)
○ SQL = Strict Schema
● Big Data Processing Tools Unlocked -
Hive, Presto and Spark
○ Migrate SQL savvy users from Vertica to Hive
& Presto (1000s of Ops & 100s of data
scientists & analysts)
○ Spark for more advanced users - 100s of data
scientists (see the sketch below)
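To make the Spark path concrete, here is a minimal PySpark sketch of the kind of job an advanced user might run against a Hive-registered table. The table names (trips, trips_daily) and columns are hypothetical placeholders, not actual Uber datasets.

    from pyspark.sql import SparkSession

    # Hypothetical example: aggregate a Hive-registered table with Spark SQL.
    spark = (SparkSession.builder
             .appName("trips-daily-agg")
             .enableHiveSupport()       # read tables registered in the Hive metastore
             .getOrCreate())

    # 'trips' is a placeholder table name, not an actual Uber dataset.
    daily = spark.sql("""
        SELECT city_id, to_date(request_at) AS day, count(*) AS trips
        FROM trips
        GROUP BY city_id, to_date(request_at)
    """)
    daily.write.mode("overwrite").saveAsTable("trips_daily")
    spark.stop()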
16. U B E R | Data
A Few things we solved along the way..
● Scalable Ingestion Model
○ Data Grows exponentially
○ Need to think about this from the beginning
● Data Tools
○ Automated Hive registration (Hdrone)
○ Janus, an HTTP endpoint for running Hive and Presto
queries; used by the query builder (example below)
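The slides do not document Janus’ actual API, so the following is only a hedged sketch: it assumes a hypothetical REST endpoint that accepts a SQL string plus an engine name and returns rows, and it illustrates how a query builder could submit Hive or Presto queries over HTTP.

    import requests

    # Hypothetical endpoint and payload shape -- illustrative only, not Janus' real API.
    JANUS_URL = "http://janus.example.internal/api/v1/query"

    def run_query(sql, engine="presto"):
        """Submit a SQL query to a Janus-like HTTP service and return rows."""
        resp = requests.post(JANUS_URL, json={"engine": engine, "query": sql}, timeout=300)
        resp.raise_for_status()
        return resp.json()["rows"]          # assumed response shape

    rows = run_query("SELECT count(*) FROM trips WHERE city_id = 1")
    print(rows)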
19. U B E R | Data
Hadoop Evolution @ eBay
3000+ nodes, 30,000+ cores, 50+ PB

Hadoop Evolution @ Uber
2014: Few nodes, some data
2015: ~100s of nodes, few PB of data
2016: ~1000 nodes, ~10s of PB of data
2017: ~5000 nodes, ~100s of PB of data
20. U B E R | Data
Hadoop Cluster Utilization
• Over-provisioning for the peak loads
• Over capacity in anticipation of future growth
21. U B E R | Data
Hadoop Evolution @ eBay
3000+ nodes, 30,000+ cores, 50+ PB

Mesos Evolution @ Uber
2014: 0 nodes
2015: Few nodes
2016: ~1000s of nodes
2017: ~10s of thousands of nodes
22. U B E R | Data
Mesos Cluster Utilization
• Over-provisioning for the peak loads
• Over capacity in anticipation of future growth
26. U B E R | Data
Mesos vs YARN
• Scheduler: YARN is a single-level scheduler; Mesos is a two-level scheduler (scales better)
• Isolation: both use cgroups for isolation (similar isolation)
• Resources: YARN models CPU and memory as resources; Mesos models CPU, memory and disk (disk as a resource is better)
• Workloads: YARN works well with Hadoop workloads; Mesos works well with longer-running services (this is important)
• Reservations: YARN supports time-based reservations; Mesos has no support for reservations (important for batch SLAs)
• Scheduling policy: YARN does dominant resource scheduling (see the DRF sketch below); in Mesos, scheduling is done by the frameworks on a case-by-case basis (better for batch)
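To make the dominant resource scheduling row concrete, here is a minimal dominant-resource-fairness (DRF) sketch in Python. It is not YARN’s actual implementation; the cluster capacities and user usages are made up. Each user’s dominant share is the largest fraction of any single resource they use, and the next allocation goes to the user with the smallest dominant share.

    # Minimal dominant-resource-fairness (DRF) sketch; illustrative, not YARN code.
    CLUSTER = {"cpu": 1000, "memory_gb": 4000}

    def dominant_share(usage):
        """Return a user's dominant share: max fraction used of any resource."""
        return max(usage[r] / CLUSTER[r] for r in CLUSTER)

    def next_user_to_schedule(usages):
        """Pick the user with the smallest dominant share (they are furthest behind)."""
        return min(usages, key=lambda user: dominant_share(usages[user]))

    usages = {
        "batch": {"cpu": 300, "memory_gb": 500},   # dominant share = 0.30 (cpu)
        "adhoc": {"cpu": 100, "memory_gb": 900},   # dominant share = 0.225 (memory)
    }
    print(next_user_to_schedule(usages))           # -> "adhoc"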
27. U B E R | Data
Let’s tie them together
YARN is good for Hadoop / Batch
Mesos is good for Longer Running Services
In a Nutshell
29. U B E R | Data
• Myriad is a Mesos framework for Apache
YARN
• Mesos manages Data Center resources
• YARN manages Hadoop workloads
• Myriad
• Gets resources from Mesos
• Launches Node Managers (flow sketched below)
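A schematic sketch of that flow, using only hypothetical types and thresholds rather than the real Mesos or Myriad APIs: Mesos offers resources, Myriad launches a YARN NodeManager sized to the offer, and YARN then schedules its own containers inside it.

    # Schematic sketch of the Myriad flow; hypothetical types, not real Mesos/Myriad APIs.
    from dataclasses import dataclass

    @dataclass
    class Offer:
        cpus: float
        memory_gb: float

    def handle_offer(offer):
        """Mesos offers resources; Myriad launches a YARN NodeManager sized to them."""
        if offer.cpus >= 4 and offer.memory_gb >= 16:
            # In the real system this is a Mesos task running the NodeManager;
            # YARN then places its own containers inside that NodeManager.
            return {"task": "yarn-nodemanager",
                    "cpus": offer.cpus, "memory_gb": offer.memory_gb}
        return None  # decline offers too small for a useful NodeManager

    print(handle_offer(Offer(cpus=8, memory_gb=32)))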
30. U B E R | Data
• YARN will handle the resources handed over to it
• Mesos will work on the rest of the resources
Myriad’s Limitations
Static Resource Partitioning
31. U B E R | Data
• YARN will never be able to do oversubscription
• The Node Manager will go away
• Fragmentation of resources
• Mesos oversubscription can kill YARN too
Myriad’s Limitations
Resource Over Subscription
32. U B E R | Data
• No Global Quota Enforcement
• No Global Priorities
Myriad’s Limitations
33. U B E R | Data
• Elastic Resource Management
• Utilization
• Stability
• Long List …
Myriad’s Limitations
35. U B E R | Data
Few Takeaways …
• We need one scheduling layer across all workloads
• Partitioning resources is not good
• We can save at least 20-30% of resources
• Stability and simplicity win in production
• Multiple levels of resource management and scheduling will not scale
36. U B E R | Data
High Level Characteristics
• Global Quota Management
• Central Scheduling policies
• Oversubscription for both Online and Batch
• Isolation and bin packing (placement sketch below)
• SLA guarantees at Global Level
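A toy Python sketch of two of these characteristics, global quota enforcement and bin packing. All names, quotas, and host sizes are hypothetical; this is not Peloton’s placement engine, only an illustration of the idea.

    # Toy sketch of global quota enforcement plus bin-packing placement; hypothetical names.
    QUOTAS = {"batch": 400, "online": 600}          # max CPUs per workload class
    usage  = {"batch": 0, "online": 0}
    hosts  = [{"id": "h1", "free_cpus": 16}, {"id": "h2", "free_cpus": 40}]

    def place(task_class, cpus):
        # 1. Global quota check across the whole cluster, not per partition.
        if usage[task_class] + cpus > QUOTAS[task_class]:
            return None
        # 2. Bin packing: pick the fitting host with the least free CPUs (tightest fit),
        #    leaving larger hosts free for bigger tasks or oversubscription.
        candidates = [h for h in hosts if h["free_cpus"] >= cpus]
        if not candidates:
            return None
        host = min(candidates, key=lambda h: h["free_cpus"])
        host["free_cpus"] -= cpus
        usage[task_class] += cpus
        return host["id"]

    print(place("batch", 8))    # -> "h1" (tighter fit)
    print(place("online", 32))  # -> "h2"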
42. U B E R | Data
Peloton – Done so far
• Batch Workloads Support
• Spark Support
• GPU support
• Distributed Tensorflow support
• Gang Scheduling (sketch below)
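Gang scheduling means a job’s tasks (for example, distributed TensorFlow workers) start all together or not at all. A minimal all-or-nothing admission sketch follows, with hypothetical names rather than Peloton’s implementation.

    # Minimal gang-scheduling sketch: admit the whole gang or nothing; hypothetical names.
    def admit_gang(free_slots, gang_size):
        """Return slot ids for the whole gang, or None if it cannot start atomically."""
        if len(free_slots) < gang_size:
            return None                      # never start a partial gang
        placed = free_slots[:gang_size]
        del free_slots[:gang_size]
        return placed

    slots = ["s1", "s2", "s3", "s4"]
    print(admit_gang(slots, 3))   # -> ['s1', 's2', 's3']
    print(admit_gang(slots, 3))   # -> None (only one slot left; gang waits instead of starting partially)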
43. U B E R | Data
Peloton – WIP
• YARN APIs
• Stateful and stateless services
• Separate placement engines
• Stateful
• Stateless
• Control plane
• Peloton deploys Peloton
44. U B E R | Data
Peloton – Timelines
• Beta Released
• Production Early Q3
• Open Source – Q3-Q4 time frame
45. U B E R | Data
Peloton – Team
Min Shi Jimmy Eskil
Zhitao Tengfei Anant Mayank