This document discusses running Spark in the cloud: the advantages, the challenges, and how Qubole addresses them. Key advantages include using S3 for storage (which lets storage and compute scale independently), the ability to create ephemeral clusters on demand, and autoscaling. Challenges involve cluster lifecycle management, the different interfaces users need, Spark autoscaling, debuggability across clusters, and handling spot instances. Qubole provides tools that automate cluster management, autoscale Spark, and make the experience seamless across clusters and interfaces.
Running Spark on Cloud
1. Running Spark on Cloud
Advantages and Challenges - Praveen Seluka
2. Introduction
• About Qubole:
• Big Data as a Service in the cloud - started by Ashish Thusoo and Joydeep Sen Sarma, who created Apache Hive at Facebook
• Hadoop, Hive, Spark, Presto and other technologies
• Easy to use and highly performant in the cloud
• About me:
• I lead the Spark as a Service effort at Qubole
3. Highlights
• 170+ PB of data processed per month
• 10-3000 node clusters on a daily basis
• 300,000 machines per month
• 20,000 jobs on a daily basis
4. Agenda
1. Getting started with Spark on cloud
2. Advantages of running in the cloud
3. Challenges, how Qubole solves them, and the tools required for a complete Spark experience
5. 1) Getting Started: Spark on Cloud
• Install Spark on EC2 (and HDFS if required)
• Ability to spin up a cluster of instances
• Choosing a Spark backend cluster mode and configuring it:
• Standalone
• YARN
• Mesos
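The cluster mode is selected via the `--master` flag of `spark-submit`. A minimal sketch of the three options above (the application class `com.example.MyApp`, the jar name, and the host names are placeholders, not from the original):

```shell
# Standalone mode: point --master at the standalone master's URL
spark-submit --master spark://<master-host>:7077 --class com.example.MyApp my-app.jar

# YARN cluster mode: the driver itself runs inside the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# Mesos: point --master at the Mesos master
spark-submit --master mesos://<mesos-master>:5050 --class com.example.MyApp my-app.jar
```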
6. 1) spark-ec2 scripts can help
• http://spark.apache.org/docs/latest/ec2-scripts.html
• Helps you spin up named clusters
• Creates a security group; comes pre-baked with Spark installed - ready to work
• Ability to choose instance type, region, zone, Spark version…
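A sketch of a typical spark-ec2 session, assuming an existing EC2 key pair; the key-pair, cluster name, and the specific instance type, region, and Spark version shown are illustrative placeholders:

```shell
# Launch a named 5-node cluster; key pair, instance type, region/zone
# and Spark version are all selectable via flags
./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
  --slaves=5 --instance-type=m3.xlarge \
  --region=us-east-1 --zone=us-east-1a \
  --spark-version=1.6.1 \
  launch my-spark-cluster

# Later: log in to the master, or tear the cluster down
./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem login my-spark-cluster
./spark-ec2 destroy my-spark-cluster
```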
7. 2a) Advantages: S3 as Datalake
• Separates compute and storage - they can scale independently
• S3 is highly available, reliable and scalable. We have never seen object loss.
• Cost effective
• HDFS vs. S3: very little difference in performance
• Same access pattern as HDFS: the Hadoop FileSystem API works for S3, via NativeS3FileSystem or S3AFileSystem
8. 2b) Advantages: Ephemeral Clusters
• The biggest advantage of using S3 as storage is that clusters can be spun up only when needed
• Ability to have multiple clusters - one per team or individual
• Netflix users spin up a cluster for each job - simple, though not efficient
9. 2c) Advantages: Flexibility
• Ability to choose instance types
• High-memory instances (r3.*) for workloads where you cache RDDs and access them
• c3.* for CPU-intensive workloads
• Spot instances
• Add EBS disks if the instance is low on ephemeral storage
10. 2d) Advantages: Autoscaling
• Big-data workloads are bursty in nature
• Scale the cluster on demand, and shrink it when idle
• Highly cost effective
• Multi-tenancy
• Qubole provides efficiently autoscaling Spark clusters - more on that later
11. 3a) Challenges: Cluster Lifecycle
• Automate the cluster lifecycle: terminate when idle - it's easy to forget
• Periodically check for bad instances and remove them
• Cluster health checks, with terminate/restart
• A simple interface is required to create, delete and configure multiple clusters
• We forked MIT StarCluster years back and have added significant functionality to it, like lifecycle management
12. 3b) Challenges: Interfaces
• Data engineers need to submit ML/graph algorithms through an API/SDK
• Data analysts need to use Tableau, Mode, or their tool of choice
• Data scientists need notebooks for interactive exploration and analysis
13. 3c) Challenges: Spark Autoscaling
• A Spark job has static resource allocation:
• --num-executors 20 --executor-memory 5G --executor-cores 4
• Each executor is a long-running JVM, held by the Spark job for the lifetime of the application
• Each executor can run multiple tasks. For the above configuration, with spark.task.cpus=1, there can be 20*4 = 80 tasks running in parallel
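The parallelism arithmetic above can be checked directly: executor count, cores per executor, and `spark.task.cpus` determine how many tasks can run concurrently. A minimal sketch (plain Python, not Spark code; the function name is illustrative):

```python
def max_parallel_tasks(num_executors: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Maximum number of tasks a static allocation can run at once:
    each executor offers executor_cores // task_cpus task slots."""
    return num_executors * (executor_cores // task_cpus)

# --num-executors 20 --executor-cores 4, with spark.task.cpus=1
print(max_parallel_tasks(20, 4))  # 80
```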
14. 3c) Challenges: Spark Autoscaling
• Problem: it's hard to predict the amount of resources required for a job
• We added APIs to add/remove executors at runtime while the job is running (contributed to open source)
• sc.requestExecutors(x)
• sc.removeExecutors(List())
• The Spark driver program can now add or remove executors at runtime
15. 3c) Challenges: Spark Autoscaling
• We built an autoscaling algorithm which can request and release executors dynamically, based on load
• A stage = x number of tasks
• Once a task completes within a stage, we know the task runtime t, so we can estimate stageRuntime = x * t
• We try to complete the stage within a (configurable) threshold. If the stage is expected to take longer than the threshold, we upscale: we determine the number of executors required to complete the stage within the expected time, and request them.
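The upscaling estimate above can be sketched as follows. This is a simplified reconstruction, not Qubole's actual implementation: it assumes uniform task runtimes and linear scaling with slot count, and all names and parameters are illustrative.

```python
import math

def executors_to_request(tasks_in_stage, observed_task_runtime,
                         current_executors, slots_per_executor, threshold):
    """Estimate how many extra executors are needed to finish the stage
    within `threshold` seconds, assuming tasks are uniform in runtime and
    throughput scales linearly with task slots."""
    total_work = tasks_in_stage * observed_task_runtime       # stageRuntime = x * t (serial)
    current_slots = current_executors * slots_per_executor
    estimated_runtime = total_work / current_slots            # with today's parallelism
    if estimated_runtime <= threshold:
        return 0                                              # on track: no upscale needed
    slots_needed = math.ceil(total_work / threshold)          # slots to meet the threshold
    executors_needed = math.ceil(slots_needed / slots_per_executor)
    return max(0, executors_needed - current_executors)

# 400 tasks of ~10s each on 5 executors x 4 slots = ~200s; to finish
# within a 100s threshold we need 40 slots, i.e. 5 more executors
print(executors_to_request(400, 10, 5, 4, threshold=100))  # 5
```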
16. 3c) Challenges: Spark Autoscaling
• Downscaling is tricky: executors hold cached RDDs and shuffle data
• Use the external shuffle service (a YARN auxiliary service), so shuffle data survives executor removal
• We downscale when there are no running stages, but if an executor holds cached RDDs we don't remove it (removing it would require recomputing the RDDs from source, which can be expensive)
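The downscaling policy described above boils down to a few guard conditions. A sketch of that decision logic in plain Python (function and parameter names are illustrative, not Spark APIs):

```python
def can_remove_executor(running_stages: int, cached_rdd_blocks: int,
                        external_shuffle_service: bool) -> bool:
    """Release an executor only when no stage is running, it holds no
    cached RDD blocks (recomputing them from source is expensive), and
    an external shuffle service owns its shuffle data."""
    if running_stages > 0:
        return False          # work in flight: keep all executors
    if cached_rdd_blocks > 0:
        return False          # cached RDDs would need recomputation
    return external_shuffle_service
```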
17. 3d) Challenges: Debuggability
• The Spark history server runs inside the cluster to serve the Spark UI even after a job completes
• Copy application logs (container logs) and event logs (Spark UI) to S3
• Run the history server anywhere outside the cluster: it can render the Spark UI by reading event logs from S3
• Requires a control tier which has the list of all applications that have run so far
• Qubole makes the whole experience seamless - here is how
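Shipping event logs to S3 so an external history server can render them uses standard Spark properties; a sketch of the relevant configuration (the bucket name is a placeholder):

```properties
# spark-defaults.conf on the (ephemeral) cluster: write event logs to S3
spark.eventLog.enabled           true
spark.eventLog.dir               s3a://my-bucket/spark-event-logs

# On a long-lived history server outside the cluster, read the same path
spark.history.fs.logDirectory    s3a://my-bucket/spark-event-logs
```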
18. 3e) Challenges: Moves are slow in S3
• In S3, move = copy + delete
• Use DFOC/ParquetDFOC (direct file output committers), which write results to the destination directly
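DFOC/ParquetDFOC are Qubole-specific committers, so their class names are not shown here; for comparison, stock Hadoop exposes a related knob that reduces rename-based commit work (an alternative, not the committer the slide describes):

```properties
# FileOutputCommitter algorithm version 2 moves task output to the final
# destination at task commit rather than renaming everything at job commit
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```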
19. 3f) Challenges: Spot Instances
• Really useful for short-running workloads: if the job fails due to spot loss, retry
• In general, using spot instances for Spark is a hard problem. Can a long-running Spark job recover and keep running in spite of a big spot-node loss?
• Yes, but very inefficiently
• Mechanisms to improve this: RDD caching with replica placement on on-demand nodes
• A shuffle service that stores shuffle data in HDFS or other replicated storage
20. 3g) Spark JobServer
• Enables in-memory RDDs to be accessible from multiple Spark applications
• Long-running Spark contexts