Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, Li Gao and Rohit Menon cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale.
Topics include:
● Key traits of Apache Spark on Kubernetes.
● Deep dive into Lyft's multi-cluster setup and operations to handle petabytes of production data.
● How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling.
● Dynamic job scale estimation and runtime dynamic job configuration.
● How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speakers: Li Gao, Rohit Menon
2. Scaling Spark on Kubernetes
Li Gao, Lyft
Rohit Menon, Lyft
#UnifiedAnalytics #SparkAISummit
3. Introduction
Li Gao
Works in the Data Platform team at Lyft, currently leading the Compute Infra
initiatives including Spark on Kubernetes.
Previously at Salesforce, Fitbit, Groupon, and other startups.
Rohit Menon
Rohit Menon is a Software Engineer on the Data Platform team at Lyft. Rohit's
primary area of focus is building and scaling out the Spark and Hive Infrastructure
for ETL and machine learning use cases.
Previously at EA and VMware.
5. Data Landscape
● Batch data Ingestion and ETL
● Data Streaming
● ML platforms
● Notebooks and BI tools
● Query and Visualization
● Operational Analytics
● Data Discovery & Lineage
● Workflow orchestration
● Cloud Platforms
7. What batch compute is used for
[Diagram: Events, Ext Data, RDB/KV, and Sys Events feed Ingest Pipelines into AWS S3; Batch Compute Clusters sit between AWS S3 stores; HMS, Presto, Hive Client, and BI Tools serve Analysts, Engineers, Scientists, and Services.]
9. Batch Compute Challenges
● 3rd Party vendor dependency limitations
● Data ETL expressed solely in SQL
● Complex logic expressed in Python that is hard to express
in SQL
● Different dependencies and versions
● Resource load balancing for heterogeneous workloads
12. What about Python functions?
“I want to express my processing logic in Python functions
with external geo libraries (e.g. GeoMesa) and interact with
Hive tables” --- Lyft data engineer
13. How can Spark help?
[Diagram: Spark connecting RDB/KV stores, Applications, APIs, Environments, and Data Sources and Data Sinks.]
14. What challenges remain?
● Per job custom dependencies
● Handling version requirements (Py3 vs. Py2)
● Still need to run on shared clusters for cost efficiency
19. What challenges still remain?
● Spark on k8s is still in its early days
● Single cluster scaling limit
● CRD and control plane update
● Pod churn and IP allocation throttling
● ECR container registry reliability
20. Current scale
● Tens of PB in the data lake
● O(100k) batch jobs running daily
● ~1000s of EC2 nodes spanning multiple
clusters and AZs
● ~1000s of workflows running daily
21. How Lyft scales Spark on K8s
● # of Clusters
● # of Namespaces
● # of Pods
● Pod Churn Rate
● # of Nodes
● Pod Size
● Job:Pod ratio
● IP Alloc Rate Limit
● ECR Rate Limit
● Affinity & Isolation
● QoS & Quota
24. HA in Cluster Pool
[Diagram: Cluster Pool A containing Cluster 1, Cluster 2, Cluster 3, and Cluster 4.]
● Cluster rotation within a cluster pool
● Automated provisioning of a new cluster, which is then (manually) added into rotation
● Throttle at the lower bound when a rotation is in progress
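The rotation and throttling behavior listed above can be sketched roughly as follows. This is a minimal illustration and not Lyft's implementation: the `ClusterPool` class, its method names, the cluster names, and the job-cap numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterPool:
    """Hypothetical model of a cluster pool with rotation (illustrative names)."""
    max_jobs: int  # normal cap on concurrent jobs (assumed knob)
    min_jobs: int  # lower-bound cap applied while rotating (assumed knob)
    in_rotation: List[str] = field(default_factory=list)
    rotating: bool = False
    _next: int = 0

    def job_cap(self) -> int:
        # Throttle at the lower bound while a rotation is in progress.
        return self.min_jobs if self.rotating else self.max_jobs

    def begin_rotation(self, old: str) -> None:
        # Take the old cluster out of rotation; capacity drops, so throttle.
        self.in_rotation.remove(old)
        self.rotating = True

    def finish_rotation(self, new: str) -> None:
        # Add the freshly provisioned cluster (a manual step per the slide).
        self.in_rotation.append(new)
        self.rotating = False

    def route(self) -> str:
        # Round-robin over the clusters currently in rotation.
        cluster = self.in_rotation[self._next % len(self.in_rotation)]
        self._next += 1
        return cluster


pool = ClusterPool(max_jobs=100, min_jobs=20,
                   in_rotation=["cluster-1", "cluster-2", "cluster-3"])
pool.begin_rotation("cluster-1")
cap_during = pool.job_cap()   # lower bound applies here
pool.finish_rotation("cluster-4")
cap_after = pool.job_cap()    # back to the normal cap
```

Round-robin routing is only one possible choice here; the slide does not say how jobs are distributed across clusters in a pool.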
25. Multiple Namespaces (Groups)
[Diagram: Namespaces 1 to 3, each holding several pods, placed across Nodes A to D; nodes are annotated with Role1, Role1, Role2 and with Max Pod Size 1, Max Pod Size 2.]
● Practical limit of ~3K active pods per namespace observed
● Less preemption required when namespaces are isolated by quota
● Different namespaces can map to different IAM roles and sidecar
configurations
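One way to respect the observed per-namespace pod ceiling is to check it before placing a job. The sketch below assumes a job may be placed in any of several namespaces belonging to its group, which the slides do not state; `pick_namespace`, `MAX_ACTIVE_PODS`, and the namespace names are hypothetical.

```python
from typing import Dict, Optional

# Assumed soft cap, taken from the ~3K active pods per namespace noted above.
MAX_ACTIVE_PODS = 3000


def pick_namespace(active_pods: Dict[str, int], pods_needed: int) -> Optional[str]:
    """Return the least-loaded namespace that can still fit the job, or None."""
    candidates = [
        (count, name)
        for name, count in active_pods.items()
        if count + pods_needed <= MAX_ACTIVE_PODS
    ]
    if not candidates:
        return None  # a caller would queue or throttle the job instead
    return min(candidates)[1]


usage = {"ns-1": 2950, "ns-2": 1200, "ns-3": 2100}
chosen = pick_namespace(usage, pods_needed=101)
```

In a real cluster the hard limit would more likely be enforced with a Kubernetes ResourceQuota per namespace; a pre-check like this only avoids submitting jobs that are bound to be rejected.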
26. Pod Sharing
[Diagram: a Job Controller with a Spark driver pod and Spark executor pods; separate driver and executor pods for Job 2 and Job 3; "Shared Pods" serving Jobs 1 to 4 versus "Dedicated & Isolated Pods"; dependencies (Dep) and AWS S3.]
29. Pod Priority and Preemption (WIP)
● Priority-based preemption
● Driver pod has higher priority than executor pod
● Experimental
[Diagram: the K8s Scheduler handles a new pod request (E5). Before: D1 D2 E1 E2 E3 E4. After: D1 D2 E5 E2 E3 E4. Evicted: E1.]
https://github.com/kubernetes/kubernetes/issues/71486
https://github.com/kubernetes/enhancements/issues/564
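To make "driver pod has higher priority than executor pod" concrete, the sketch below builds Kubernetes `PriorityClass` manifests as plain Python dicts and attaches one to a pod spec via `priorityClassName`. The class names, priority values, and container image are hypothetical, and the slides do not show how Lyft wires priorities into its Spark pods.

```python
def priority_class(name: str, value: int) -> dict:
    """Build a Kubernetes PriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,          # higher value means higher scheduling priority
        "globalDefault": False,
    }


def with_priority(pod_spec: dict, class_name: str) -> dict:
    """Return a copy of a pod spec that references the given PriorityClass."""
    return {**pod_spec, "priorityClassName": class_name}


# Hypothetical names and values: drivers outrank executors, so an executor
# is evicted before a driver when the scheduler needs to preempt.
driver_pc = priority_class("spark-driver", 1000)
executor_pc = priority_class("spark-executor", 500)

base_spec = {"containers": [{"name": "spark", "image": "example/spark:latest"}]}
driver_spec = with_priority(base_spec, "spark-driver")
executor_spec = with_priority(base_spec, "spark-executor")
```

Preempting an executor costs recomputation of its tasks, while losing a driver fails the whole job, which is the usual reason for ranking drivers higher.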
30. Taints and Tolerations (WIP)
[Diagram: pods P1 to P10 spread across Nodes A to F. Core Nodes (tainted) host controllers and watchers; Worker Nodes (tainted) host Job 1 and Job 2.]
● Other considerations: node labels and node selectors to separate GPU-based and CPU-based
workloads
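As an illustration of combining taints with node selectors, the sketch below builds the pod spec fragment a Spark pod would need in order to land on tainted worker nodes of a given hardware type. The taint key, label key, and values (`example.com/node-role`, `example.com/accelerator`) are hypothetical; only the `tolerations` and `nodeSelector` field names come from the Kubernetes pod spec.

```python
def worker_pod_spec(accelerator: str) -> dict:
    """Pod spec fragment: tolerate the worker taint and select nodes by label."""
    return {
        # Node selector separates GPU-based from CPU-based workloads.
        "nodeSelector": {"example.com/accelerator": accelerator},
        # Toleration lets the pod schedule onto nodes tainted as workers.
        "tolerations": [
            {
                "key": "example.com/node-role",
                "operator": "Equal",
                "value": "worker",
                "effect": "NoSchedule",
            }
        ],
    }


gpu_spec = worker_pod_spec("gpu")
cpu_spec = worker_pod_spec("cpu")
```

Pods without the toleration, such as unrelated services, are kept off the worker nodes, and the core nodes would carry a different taint that only controllers and watchers tolerate.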
31. What about ECR reliability?
[Diagram: Nodes 1 to 3, each running pods; DaemonSet + Docker In Docker; ECR Container Images.]
32. Spark Job Config Overlays (DML)
[Diagram: Cluster Pool Defaults, Cluster Defaults, Spark Job User Specified Config, and Cluster and Namespace Overrides are layered into the Final Spark Job Config. Components shown: Config Composer & Event Watcher, Spark Operator.]
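The overlay idea can be sketched as a left-to-right merge of config layers. The precedence used here (pool defaults, then cluster defaults, then user config, then cluster and namespace overrides) follows the order the layers are listed on the slide and is an assumption; `compose_config` and the sample values are hypothetical, while the property names are standard Spark settings.

```python
from functools import reduce


def compose_config(*layers: dict) -> dict:
    """Merge config layers left to right; later layers win on key conflicts."""
    return reduce(lambda merged, layer: {**merged, **layer}, layers, {})


# Sample layers (values are illustrative only).
pool_defaults = {"spark.executor.memory": "4g", "spark.executor.instances": "2"}
cluster_defaults = {"spark.kubernetes.namespace": "ns-1"}
user_config = {"spark.executor.instances": "50", "spark.executor.memory": "8g"}
overrides = {"spark.executor.instances": "20"}  # e.g. a namespace-level cap

final_config = compose_config(pool_defaults, cluster_defaults, user_config, overrides)
```

Placing the overrides last lets operators cap or pin settings regardless of what a user submits, while users still control everything the overrides leave untouched.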
39. Remaining work
● More intelligent & resilient job routing/scheduler and
parameter setting
● Serverless and self-serviceable user experiences for
any-to-any batch data compute
● Finer grained cost attribution
● Improved docker image distribution
● Spark 3.0 & Kubernetes v1.14+
40. Key Takeaways
● Apache Spark can help unify different batch data compute
use cases
● Kubernetes can help solve the dependency and multi-version
requirements using its containerized approach
● Spark on Kubernetes can scale significantly by using a
multi-cluster compute mesh approach with proper resource
isolation and scheduling techniques
● Challenges remain when running Spark on Kubernetes at
scale