Running Spark on Kubernetes:
Best Practices and Pitfalls
Jean-Yves Stephan, Co-Founder & CEO @ Data Mechanics
Julien Dumazert, Co-Founder & CTO @ Data Mechanics
Who We Are
Jean-Yves “JY” Stephan
Co-Founder & CEO @ Data Mechanics
jy@datamechanics.co
Previously:
Software Engineer and
Spark Infrastructure Lead @ Databricks
Julien Dumazert
Co-Founder & CTO @ Data Mechanics
julien@datamechanics.co
Previously:
Lead Data Scientist @ ContentSquare
Data Scientist @ BlaBlaCar
Who Are You?
Poll: What is your experience with running Spark on Kubernetes?
● 61% - I’ve never used it, but I’m curious about it.
● 24% - I’ve prototyped using it, but I’m not using it in production.
● 15% - I’m using it in production.
This slide was edited after the conference to show the results for the poll.
You can see and take the poll at https://www.datamechanics.co/spark-summit-poll
Agenda
A quick primer on Data Mechanics
Spark on Kubernetes
Core Concepts & Setup
Configuration & Performance Tips
Monitoring & Security
Future Work
Conclusion: Should you get started?
Data Mechanics - A serverless Spark platform
● Applications start and autoscale in
seconds.
● Seamless transition from local
development to running at scale.
● Tunes the infra parameters and Spark
configurations automatically for each
pipeline to make them fast and stable.
https://www.datamechanics.co
Customer story: Impact of automated tuning on Tradelab
For details, watch our SSAI 2019 Europe talk
How to automate performance tuning for Apache Spark
● Stability: Automatic remediation of
OutOfMemory errors and timeouts
● 2x performance boost on average
(speed and cost savings)
(Diagram: data engineers and data scientists connect through a gateway)
We’re deployed on k8s in our customers’ cloud accounts.
Spark on Kubernetes:
Core Concepts & Setup
Where does Kubernetes fit within Spark?
Kubernetes is the newest cluster manager/scheduler option for Spark:
● Standalone
● Apache Mesos
● YARN
● Kubernetes (since version 2.3)
Spark on Kubernetes - Architecture
Two ways to submit Spark applications on k8s

Spark-submit
● The “vanilla” way, from the main Spark open-source repo
● Configs spread between Spark config (mostly) and k8s manifests
● Little pod customization support before Spark 3.0
● App management is more manual

spark-on-k8s operator
● Open-sourced by Google (but works on any platform)
● Configs in k8s-style YAML with sugar on top (configmaps, volumes, affinities)
● Tooling to read logs, kill, restart, schedule apps
● Requires a long-running system pod
App management in practice
Spark-submit

# Run an app
$ spark-submit --master k8s://https://<api-server> …
# List apps
$ kubectl get pods -l spark-role=driver
NAME            READY   STATUS      RESTARTS   AGE
my-app-driver   0/1     Completed   0          25h
# Read logs
$ kubectl logs my-app-driver
# Describe app
# No way to describe an app and its parameters…

spark-on-k8s operator

# Run an app
$ kubectl apply -f <app-manifest>.yaml
# List apps
$ k get sparkapplications
NAME AGE
my-app 2d22h
# Read logs
sparkctl log my-app
# Describe app
$ k get sparkapplications my-app -o yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
arguments:
- gs://path/to/data.parquet
mainApplicationFile: local:///opt/my-app/main.jar
...
status:
applicationState:
state: COMPLETED
...
Dependency Management Comparison

YARN
● Lack of isolation
○ Global Spark version
○ Global Python version
○ Global dependencies
● Lack of reproducibility
○ Flaky init scripts
○ Subtle differences in AMIs or system

Kubernetes
● Full isolation
○ Each Spark app runs in its own Docker container
● Control your environment
○ Package each app in a Docker image
○ Or build a small set of Docker images for major changes and specify your app code using URIs
Spark on Kubernetes:
Configuration & Performance Tips
A surprise when sizing executors on k8s
Assume you have a k8s cluster with 16GB-RAM 4-core instances.
Do one of these and you’ll never get an executor!
● Set spark.executor.cores=4
● Set spark.executor.memory=11g
k8s-aware executor sizing
What happened?
→ Only a fraction of each node’s capacity is allocatable to Spark pods,
and spark.executor.cores=4 makes each executor pod request 4 full cores, so it can never be scheduled!
Compute available resources
● Estimate node allocatable: usually 95%
● Measure what’s taken by your daemonsets (say 10%)
→ 85% of cores are available
Configure Spark
spark.executor.cores=4
spark.kubernetes.executor.request.cores=3400m
Node capacity
− resources reserved for k8s and system daemons
= node allocatable
− resources requested by daemonsets
= remaining space for Spark pods!
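The arithmetic above can be sketched as a small helper. The 95% allocatable and 10% daemonset figures are the illustrative estimates from this slide, not universal constants; measure your own cluster.

```python
def k8s_request_cores(node_cores, allocatable_pct=95, daemonset_pct=10):
    """Estimate the value for spark.kubernetes.executor.request.cores
    so one executor fits per node. Percentages are illustrative:
    ~95% of node capacity is allocatable, daemonsets take ~10% more."""
    available_pct = allocatable_pct - daemonset_pct
    millicores = node_cores * 1000 * available_pct // 100
    return f"{millicores}m"

# A 4-core node with the assumptions above leaves 85% for Spark pods:
conf = {
    "spark.executor.cores": "4",  # task parallelism stays at 4
    "spark.kubernetes.executor.request.cores": k8s_request_cores(4),  # "3400m"
}
```

Note that spark.executor.cores still controls task parallelism; only the k8s CPU request is lowered.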
More configuration tips here
Dynamic allocation on Kubernetes
● Full dynamic allocation is not available. When killing an exec pod, you
may lose shuffle files that are expensive to recompute.
There is ongoing work to enable it (JIRA: SPARK-24432).
● In the meantime, a soft dynamic allocation is available from Spark 3.0:
only executors that do not hold active shuffle files can be scaled down.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
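As a sketch, the soft dynamic allocation configs above can be assembled into spark-submit flags. The min/max executor bounds are illustrative additions, not values from this talk:

```python
# Configs for Spark 3.0's "soft" dynamic allocation on k8s
# (shuffle tracking instead of an external shuffle service).
dyn_alloc_confs = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Example bounds, to be tuned per workload:
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}

submit_flags = [f"--conf {k}={v}" for k, v in sorted(dyn_alloc_confs.items())]
```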
Cluster autoscaling & dynamic allocation
k8s can be configured to autoscale if pending pods
cannot be allocated.
Autoscaling plays well with dynamic allocation:
● <10s to get a new exec if there is room in the cluster
● 1-2 min if the cluster needs to autoscale
Requires installing the cluster autoscaler yourself on AKS (Azure)
and EKS (AWS); it comes natively installed on GKE (GCP).
(Diagram: each Spark application uses dynamic allocation; the k8s cluster autoscales underneath)
Overprovisioning to speed up dynamic allocation
To further improve the speed of dynamic allocation,
overprovision the cluster with low-prio pause pods:
● The pause pods force k8s to scale up
● Spark pods preempt pause pods’ resources when
needed
Cluster autoscaler doc about overprovisioning.
(Diagram: the same cluster, with a low-priority pause pod reserving headroom alongside the Spark applications)
Further cost reduction with spot instances
Spot (or preemptible) instances can reduce costs by up to 75%.
● If an executor is killed, Spark can recover
● If the driver is killed, game over!
Node selectors and affinities can be used to
constrain drivers to non-preemptible nodes.
(Diagram: driver pods on a non-preemptible node; executor pods spread across preemptible nodes)
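A hypothetical SparkApplication fragment (spark-on-k8s operator style) showing the idea. The `lifecycle` node label key and its values are placeholders for whatever labels your cluster uses:

```python
# Sketch: pin the driver to on-demand capacity, let executors
# land on preemptible nodes. Labels are illustrative placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "spec": {
        "driver": {
            # keep the driver off preemptible capacity
            "nodeSelector": {"lifecycle": "on-demand"},
        },
        "executor": {
            "nodeSelector": {"lifecycle": "preemptible"},
        },
    },
}
```

With plain spark-submit, the same effect can be approximated with node selector configs or a driver pod template.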
I/O with an object storage
Usually with Spark on Kubernetes, data is read from and written to an object storage.
Cloud providers offer optimized committers for their object storages, like the S3A
committers.
If none is available for yours, use version 2 of the Hadoop output committer bundled with Spark:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The performance boost can be up to 2x (if you write many files)!
Improve shuffle performance with volumes
I/O speed is critical in shuffle-bound workloads, because Spark uses local files as
scratch space.
Docker filesystem is slow → Use volumes to improve performance!
● emptyDir: use a temporary directory on the host (the default in Spark 3.0)
● hostPath: leverage a fast disk mounted on the host (e.g. NVMe-based SSD)
● tmpfs: Use your RAM as local storage (⚠ dangerous)
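For example, a hostPath volume can be mounted as scratch space through Spark’s k8s volume configs. The paths below are placeholders, and the `spark-local-dir-` volume name prefix is what (in Spark 3.x) tells Spark to use the mount as local storage:

```python
# Sketch: mount a fast host disk as Spark scratch space on k8s.
# Volume name must start with "spark-local-dir-" to be picked up
# as local storage; host/mount paths are illustrative.
volume = "hostPath.spark-local-dir-1"
shuffle_confs = {
    f"spark.kubernetes.executor.volumes.{volume}.mount.path": "/tmp/spark-local",
    f"spark.kubernetes.executor.volumes.{volume}.options.path": "/mnt/nvme",
}
```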
Performance
We ran performance benchmarks to compare Kubernetes and YARN.
Results will be published on our blog early July 2020.
(Sneak peek: There is no performance penalty for running on k8s if you follow our recommendations)
Spark on Kubernetes:
Monitoring & Security
Monitor pod resource usage with k8s tools
Workload-agnostic tools to monitor pod usages:
● Kubernetes dashboard (installation on EKS)
● The GKE console
Issues:
● Hard to reconcile with Spark jobs/stages/tasks
● Executor metadata is lost once the Spark app
completes
Spark history server
Setting up a history server is relatively easy:
● Direct your Spark event logs to S3/GCS/Azure Storage with
the spark.eventLog.dir config
● Install the Spark history server Helm chart on your cluster
What’s missing: resource usage metrics!
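A minimal sketch of the event-log configs involved (the bucket path is a placeholder):

```python
# Event logging so the history server can replay past applications.
history_confs = {
    "spark.eventLog.enabled": "true",
    # Placeholder bucket; point the history server at the same path.
    "spark.eventLog.dir": "s3a://my-bucket/spark-events/",
}
```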
“Spark Delight” - A Spark UI replacement
Note: This slide was added after the conference.
Sorry for the self-promotion. We look forward to the feedback from the community!
● We’re building a better Spark UI
○ better UX
○ new system metrics
○ automated performance
recommendations
○ free of charge
○ cross-platform
● Not released yet, but we’re working
on it! Learn more and leave us
feedback.
Export Spark metrics to a time-series database
Spark leverages the DropWizard library to produce
detailed metrics.
The metrics can be exported to a time-series
database:
● InfluxDB (see spark-dashboard by Luca Canali)
● Prometheus
○ Spark has a built-in Prometheus servlet since version 3.0
○ The spark-operator proposes a Docker image with a
Prometheus java agent for older versions
Use sparkmeasure to pipe task metrics and stage
boundaries to the database.
(Image: Luca Canali, spark-dashboard)
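As an illustration, the Spark 3.0 Prometheus endpoints can be enabled with configs like these (a sketch, to be adapted to your metrics setup):

```python
# Sketch: enable Spark 3.0's built-in Prometheus endpoints.
prometheus_confs = {
    # Executor metrics on the driver UI's Prometheus endpoint
    "spark.ui.prometheus.enabled": "true",
    # Expose DropWizard metrics via the PrometheusServlet sink
    "spark.metrics.conf.*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheusServlet.path":
        "/metrics/prometheus",
}
```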
Security
Kubernetes security best practices apply to Spark on Kubernetes for free!
Access control
Strong built-in RBAC system in Kubernetes
Spark apps and pods benefit from it as native k8s resources
Secrets management
Kubernetes secrets as a first step
Integrations with solutions like HashiCorp Vault
Networking
Mutual TLS, Network policies (since v1.18)
Service mesh like Istio
Spark on Kubernetes:
Future Work
Features being worked on
● Shuffle improvements: disaggregating storage and compute
○ Use remote storage for persisting shuffle data: SPARK-25299
○ Goal: enable full dynamic allocation, and make Spark resilient to node loss (e.g. spot/preemptible VMs)
● Better handling for node shutdown
○ Copy shuffle and cache data during graceful decommissioning of a node: SPARK-20624
● Support local Python dependency upload (SPARK-27936)
● Job queues and resource management
Spark on Kubernetes:
Should You Get Started?
We chose Kubernetes for our platform - should you?

Pros
● Native containerization
● A single cloud-agnostic infrastructure for your
entire tech stack, with a rich ecosystem
● Efficient resource sharing, guaranteeing both
resource isolation and cost efficiency

Cons
● Learning curve if you’re new to Kubernetes
● A lot to set up yourself, since most managed
platforms do not support Kubernetes
● Marked as experimental until Spark 2.4, with missing
features like the external shuffle service
For more details, read our blog post
The Pros and Cons of Running Apache Spark on Kubernetes
Checklist to get started with Spark-on-Kubernetes
● Set up the infrastructure
○ Create the Kubernetes cluster
○ Optional: set up the Spark operator
○ Create a Docker registry
○ Host the Spark history server
○ Set up monitoring for Spark application logs and metrics
● Configure your apps for success
○ Configure node pools and pod sizes for optimal bin-packing
○ Optimize I/O with proper libraries and volume mounts
○ Optional: enable k8s autoscaling and Spark dynamic allocation
○ Optional: use spot/preemptible VMs for cost reduction
● Enjoy the ride!
Our platform helps
with this, and we’re
happy to help too!
The Simplest Way To Run Spark
https://www.datamechanics.co
Thank you!
Appendix
Cost reduction with cluster autoscaling
Configure two node pools for your k8s cluster:
● A node pool of small instances for system pods (e.g. ingress controller,
autoscaler, spark-operator)
● A node pool of larger instances for Spark applications
Since node pools can scale down to zero on all cloud providers,
● you have large instances at your disposal for Spark apps,
● and you only pay for a small instance when the cluster is idle!
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Kürzlich hochgeladen (20)

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 

Running Apache Spark on Kubernetes: Best Practices and Pitfalls

Spark on Kubernetes - Architecture

(Diagram: driver and executor pods scheduled by the Kubernetes API server. Source linked in the original slide.)
Two ways to submit Spark applications on k8s

Spark-submit
● “Vanilla” way from the Spark main open source repo
● Configs spread between Spark config (mostly) and k8s manifests
● Little pod customization support before Spark 3.0
● App management is more manual

spark-on-k8s operator
● Open-sourced by Google (but works on any platform)
● Configs in k8s-style YAML with sugar on top (configmaps, volumes, affinities)
● Tooling to read logs, kill, restart, and schedule apps
● Requires a long-running system pod
App management in practice

Spark-submit
# Run an app
$ spark-submit --master k8s://https://<api-server> …

# List apps
$ kubectl get pods -l "spark-role=driver"
NAME            READY   STATUS      RESTARTS   AGE
my-app-driver   0/1     Completed   0          25h

# Read logs
$ kubectl logs my-app-driver

# Describe app
# No way to actually describe an app and its parameters…

spark-on-k8s operator
# Run an app
$ kubectl apply -f <app-manifest>.yaml

# List apps
$ kubectl get sparkapplications
NAME     AGE
my-app   2d22h

# Read logs
$ sparkctl log my-app

# Describe app
$ kubectl get sparkapplications my-app -o yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  arguments:
    - gs://path/to/data.parquet
  mainApplicationFile: local:///opt/my-app/main.jar
  ...
status:
  applicationState:
    state: COMPLETED
  ...
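For reference, a minimal SparkApplication manifest for the operator might look like the sketch below. The image name, main class, and resource sizes are illustrative placeholders, not values from the talk:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-app
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: my-registry/my-app:latest      # hypothetical image
  mainClass: com.example.Main           # hypothetical entry point
  mainApplicationFile: local:///opt/my-app/main.jar
  sparkVersion: "3.0.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 2
    cores: 2
    memory: 4g
```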
Dependency Management Comparison

YARN
● Lack of isolation
  ○ Global Spark version
  ○ Global Python version
  ○ Global dependencies
● Lack of reproducibility
  ○ Flaky init scripts
  ○ Subtle differences in AMIs or system

Kubernetes
● Full isolation
  ○ Each Spark app runs in its own Docker container
● Control your environment
  ○ Package each app in a Docker image
  ○ Or build a small set of Docker images for major changes, and specify your app code using URIs
A surprise when sizing executors on k8s

Assume you have a k8s cluster with 16GB-RAM, 4-core instances.
Do one of these and you’ll never get an executor!
● Set spark.executor.cores=4
● Set spark.executor.memory=11g
k8s-aware executor sizing

What happened? → Only a fraction of node capacity is available to Spark pods, and spark.executor.cores=4 requests 4 full cores!

Node capacity breaks down as:
● Node capacity
● minus resources reserved for k8s and system daemons → node allocatable
● minus resources requested by daemonsets
● = remaining space for Spark pods!

Compute available resources
● Estimate node allocatable: usually 95%
● Measure what’s taken by your daemonsets (say 10%)
→ 85% of cores are available

Configure Spark
spark.executor.cores=4
spark.kubernetes.executor.request.cores=3400m

More configuration tips here
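The capacity arithmetic above can be sketched in a few lines. This is a toy helper, not part of Spark; the 95% allocatable and 10% daemonset figures are the slide's example numbers and will vary per cluster:

```python
def executor_cpu_request_millicores(node_cores: int,
                                    allocatable_fraction: float = 0.95,
                                    daemonset_fraction: float = 0.10) -> int:
    """CPU request (in millicores) that fits one executor per node,
    after subtracting what k8s reserves and what daemonsets request."""
    available_fraction = allocatable_fraction - daemonset_fraction
    return round(node_cores * 1000 * available_fraction)

# On a 4-core node: 4 * 1000 * 0.85 = 3400m, matching the config above.
print(executor_cpu_request_millicores(4))  # 3400
```

You would then keep spark.executor.cores at the logical value (4, used for task parallelism) while setting spark.kubernetes.executor.request.cores to the computed millicore value.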
Dynamic allocation on Kubernetes

● Full dynamic allocation is not available: when killing an executor pod, you may lose shuffle files that are expensive to recompute. There is ongoing work to enable it (JIRA: SPARK-24432).
● In the meantime, a soft dynamic allocation is available from Spark 3.0: only executors which do not hold active shuffle files can be scaled down.

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
Cluster autoscaling & dynamic allocation

k8s can be configured to autoscale if pending pods cannot be allocated.
Autoscaling plays well with dynamic allocation:
● <10s to get a new executor if there is room in the cluster
● 1-2 min if the cluster needs to autoscale

Requires installing the cluster autoscaler on AKS (Azure) and EKS (AWS). It is natively installed on GKE (GCP).
Overprovisioning to speed up dynamic allocation

To further improve the speed of dynamic allocation, overprovision the cluster with low-priority pause pods:
● The pause pods force k8s to scale up
● Spark pods preempt the pause pods’ resources when needed

See the cluster autoscaler documentation about overprovisioning.
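The overprovisioning pattern is typically a negative-priority PriorityClass plus a Deployment of pause containers. A sketch, where the replica count and resource requests are illustrative (in practice you would size them to roughly one Spark executor each):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10              # lower than any real workload, so Spark pods preempt these
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:        # illustrative: size to match one executor pod
              cpu: "3400m"
              memory: "10Gi"
```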
Further cost reduction with spot instances

Spot (or preemptible) instances can reduce costs by up to 75%.
● If an executor is killed, Spark can recover
● If the driver is killed, game over!

Node selectors and affinities can be used to constrain drivers to non-preemptible nodes, while executors run on preemptible nodes.
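With the operator, driver and executor placement can be set separately. A sketch using hypothetical node labels (`lifecycle: on-demand` / `lifecycle: spot` - your cluster's actual labels will differ):

```yaml
spec:
  driver:
    nodeSelector:
      lifecycle: on-demand   # keep the driver off preemptible nodes
  executor:
    nodeSelector:
      lifecycle: spot        # executors can tolerate preemption
```

With plain spark-submit, `spark.kubernetes.node.selector.[labelKey]` exists but applies to driver and executors alike, so the operator (or pod templates in Spark 3.0) gives finer control.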
I/O with an object storage

Usually in Spark on Kubernetes, data is read from and written to an object storage.

Cloud providers write optimized committers for their object storages, like the S3A committers.

If that’s not the case, use version 2 of the Hadoop committer bundled with Spark:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

The performance boost may be up to 2x! (if you write many files)
Improve shuffle performance with volumes

I/O speed is critical in shuffle-bound workloads, because Spark uses local files as scratch space.
The Docker filesystem is slow → use volumes to improve performance!
● emptyDir: use a temporary directory on the host (the default in Spark 3.0)
● hostPath: leverage a fast disk mounted on the host (NVMe-based SSD)
● tmpfs: use your RAM as local storage (⚠ dangerous)

Performance

We ran performance benchmarks to compare Kubernetes and YARN. Results will be published on our blog in early July 2020.
(Sneak peek: there is no performance penalty for running on k8s if you follow our recommendations.)
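For example, mounting a fast host disk as Spark scratch space can be done with the `spark.kubernetes.executor.volumes.*` configs - volumes whose name starts with `spark-local-dir-` are used as local directories. The paths below are illustrative:

```properties
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/spark-scratch
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly=false
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/nvme0
```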
Monitor pod resource usage with k8s tools

Workload-agnostic tools to monitor pod usage:
● The Kubernetes dashboard (installation on EKS)
● The GKE console

Issues:
● Hard to reconcile with Spark jobs/stages/tasks
● Executor metadata is lost once the Spark app completes
Spark history server

Setting up a history server is relatively easy:
● Direct your Spark event logs to S3/GCS/Azure Storage with the spark.eventLog.dir config
● Install the Spark history server Helm chart on your cluster

What’s missing: resource usage metrics!
“Spark Delight” - A Spark UI replacement

Note: this slide was added after the conference. Sorry for the self-promotion - we look forward to feedback from the community!

We’re building a better Spark UI:
● better UX
● new system metrics
● automated performance recommendations
● free of charge
● cross-platform

Not released yet, but we’re working on it! Learn more and leave us feedback.
Export Spark metrics to a time-series database

Spark leverages the DropWizard library to produce detailed metrics, which can be exported to a time-series database:
● InfluxDB (see spark-dashboard by Luca Canali)
● Prometheus
  ○ Spark has a built-in Prometheus servlet since version 3.0
  ○ The spark-operator provides a Docker image with a Prometheus Java agent for older versions

Use sparkmeasure to pipe task metrics and stage boundaries to the database.

(Image: Luca Canali, spark-dashboard)
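For Spark 3.0's built-in Prometheus support, the relevant knobs are sketched below (the first line is a Spark config, the others go in metrics.properties; paths shown are the defaults to the best of our knowledge):

```properties
# Spark config: expose executor metrics on the driver UI
spark.ui.prometheus.enabled=true

# metrics.properties: expose component metrics via the Prometheus servlet
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```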
Security

Kubernetes security best practices apply to Spark on Kubernetes for free!

Access control
● Strong built-in RBAC system in Kubernetes
● Spark apps and pods benefit from it as native k8s resources

Secrets management
● Kubernetes secrets as a first step
● Integrations with solutions like HashiCorp Vault

Networking
● Mutual TLS, network policies (since v1.18)
● Service mesh like Istio
Features being worked on

● Shuffle improvements: disaggregating storage and compute
  ○ Use remote storage for persisting shuffle data: SPARK-25299
  ○ Goal: enable full dynamic allocation, and make Spark resilient to node loss (e.g. spot/preemptible VMs)
● Better handling for node shutdown
  ○ Copy shuffle and cache data during graceful decommissioning of a node: SPARK-20624
● Support local Python dependency upload (SPARK-27936)
● Job queues and resource management
Spark on Kubernetes:
Should You Get Started?
We chose Kubernetes for our platform - should you?

Pros
● Native containerization
● A single cloud-agnostic infrastructure for your entire tech stack, with a rich ecosystem
● Efficient resource sharing guaranteeing both resource isolation and cost efficiency

Cons
● Learning curve if you’re new to Kubernetes
● A lot to set up yourself, since most managed platforms do not support Kubernetes
● Marked as experimental (until 2.4), with missing features like the external shuffle service

For more details, read our blog post The Pros and Cons of Running Apache Spark on Kubernetes.
Checklist to get started with Spark-on-Kubernetes

Setup the infrastructure
● Create the Kubernetes cluster
● Optional: set up the Spark operator
● Create a Docker registry
● Host the Spark history server
● Set up monitoring for Spark application logs and metrics

Configure your apps for success
● Configure node pools and your pod sizes for optimal binpacking
● Optimize I/O with proper libraries and volume mounts
● Optional: enable k8s autoscaling and Spark app dynamic allocation
● Optional: use spot/preemptible VMs for cost reduction

Enjoy the ride!
Our platform helps with this, and we’re happy to help too!
The Simplest Way To Run Spark
https://www.datamechanics.co
Thank you!
Cost reduction with cluster autoscaling

Configure two node pools for your k8s cluster:
● A node pool of small instances for system pods (e.g. ingress controller, autoscaler, spark-operator)
● A node pool of larger instances for Spark applications

Since node pools can scale down to zero on all cloud providers:
● you have large instances at your disposal for Spark apps
● you only pay for a small instance when the cluster is idle!