While the performance delivered by Spark enables data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often the process of manually configuring the underlying Spark jobs (including the number and size of the executors) is a significant and time-consuming undertaking. Not only does this configuration process typically rely heavily on repeated trial and error, it requires data scientists to have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement and to develop algorithms that automatically tune Spark jobs with minimal user involvement.
In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations used in the flow, and the cluster's size, configuration, and real-time utilization to automatically determine the optimal Spark job configuration for peak performance.
4. We use Spark
• End-to-end support
– Problem inception through model operationalization
• Extensive use of Spark at all stages
– ETL
– Feature engineering
– Model training
– Model scoring
7. Inefficient
• Manual Spark tuning is a time-consuming and inefficient process
– Frequently results in a “trial-and-error” approach
– Can be hours (or more) before OOM failures occur
• Less problematic for unchanging operationalized jobs
– Run same analysis every hour/day/week on new data
– Worth spending time & resources to fine tune
8. AutoML
[Diagram: AutoML flow — a data set is fed to N candidate ML algorithms; 1) investigate the N ML algorithms, 2) tune the top-performing algorithms, with feature engineering and feature elimination refining the candidate set.]
12. It depends….
• Size & complexity of data
• Complexity of algorithm(s)
• Nuances of the implementation
• Cluster size
• YARN configuration
• Cluster utilization
13. Default settings?
• Spark defaults are targeted at toy data sets & small clusters
– Not applicable to real-world data science
• Resource computations often assume a dedicated cluster
– Looking at total vCPU and memory resources
• Enterprise clusters are typically shared
– Applying dedicated-cluster computations is wasteful
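For context, these are the documented stock defaults that make the point (check the values for your Spark version):

  spark.executor.memory = 1g   (one gigabyte per executor)
  spark.executor.cores  = 1    (on YARN: a single core per executor)
  spark.driver.memory   = 1g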
15. Challenges
• Algorithm needs to
– Determine compute requirements for current analysis
– Reconcile requirements with available resources
– Allow users to set specific variables
• Changing one parameter will generally impact others
1. Increase memory per executor → decrease number of executors
2. Increase cores per executor → decrease memory per core
[N.B. Not just a single “correct” configuration!!!]
16. Understanding the inputs
• Sample from input data set(s)
• Estimate
– Schema inference
– Dimensionality
– Cardinality
– Row count
• Remember that compressed formats expand in memory
• Offline background scans feasible
– Retain snapshots in metastore
• Take into account impact of downstream operations
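As an illustration, a minimal Scala sketch of this sampling step (hypothetical code, not Alpine's implementation; the 1% sample fraction and the 3x expansion factor for compressed formats are assumed placeholders):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.SparkSession

  def profileInput(spark: SparkSession, path: String, fraction: Double = 0.01): Unit = {
    // Schema inference and dimensionality from a read of the input
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    val dimensionality = df.schema.fields.length
    // Extrapolate the total row count from a small sample
    val estRows = (df.sample(withReplacement = false, fraction).count() / fraction).toLong
    // On-disk size; compressed formats expand when cached, so apply an
    // assumed expansion factor (3x here is a placeholder, not a measurement)
    val diskBytes = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      .getContentSummary(new Path(path)).getLength
    val estCacheBytes = diskBytes * 3
    println(s"cols=$dimensionality rows≈$estRows disk=$diskBytes cache≈$estCacheBytes")
  }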
18. Algorithmic complexity
• Each operation has unique requirements
– Wide range in memory and compute requirements across algorithms
• Provide levers for algorithms to signal their resource needs
– Resource heavy or resource light (memory & CPU)
• Per-algorithm modifiers computed for all algorithms in the application toolbox
19. Wrapping OSS
• Autotuning APIs are made available to users, and OSS algorithms are wrapped using our extensions SDK
– H2O
– MLlib
– Etc.
• Wrappers provide input on relative memory requirements and computational complexity
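A sketch of what such a lever could look like; the trait and the values below are purely illustrative, not the actual extensions SDK API:

  // Hypothetical resource-hint lever; names and values are illustrative only.
  trait ResourceHints {
    def memHungry: Double      // multiplier on per-core memory (1.0 = neutral)
    def computeHungry: Double  // multiplier on required executor count (1.0 = neutral)
  }

  // e.g. a wrapped iterative MLlib-style algorithm that caches its training
  // set: memory heavy, relatively compute light.
  object WrappedLogisticRegression extends ResourceHints {
    val memHungry     = 2.0
    val computeHungry = 0.5
  }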
20. YARN Queues
• Queues frequently used in enterprise settings
– Control resource utilization
• Variety of queuing strategies
– Users bound to queues
– Groups bound to queues
– Applications bound to queues
• Queues can have own maximum container sizes
• Queues can have different scheduling policies
21. Dynamic allocation
• Spark provides support for auto-scaling
– Dynamically add/remove executors according to demand
• Is preemption also enabled?
– Minimize wait time for other users
• Still need to determine optimal executor and driver sizing
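For reference, the standard Spark configuration keys involved (the values shown are illustrative, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "50")
    .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
    // Auto-scaling does not size the containers; these still must be chosen:
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "4")
    .set("spark.driver.memory", "4g")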
22. Query cluster
• Query YARN Resource Manager to get cluster state
– Kerberized REST API
• Determine key resource considerations
– Preemption enabled?
– Queue mapping
– Queue type
– Queue resources
– Real-time node utilization
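A minimal sketch of the query step, assuming an unsecured cluster (the host and port are placeholders; a Kerberized cluster additionally needs SPNEGO authentication, and real code would parse the JSON rather than hold raw strings):

  import scala.io.Source

  val rm = "http://resourcemanager.example.com:8088" // placeholder RM address
  // Documented YARN Resource Manager REST endpoints:
  val metrics   = Source.fromURL(s"$rm/ws/v1/cluster/metrics").mkString   // cluster totals
  val scheduler = Source.fromURL(s"$rm/ws/v1/cluster/scheduler").mkString // queue mapping, type, resources
  val nodes     = Source.fromURL(s"$rm/ws/v1/cluster/nodes").mkString     // real-time node utilization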
23. Executor Sizing
• Try to make efficient use of the available resources
– Start with a balanced strategy
• Consume resources based on their relative abundance
MemPerCore = (QueueMem / QueueCores) × MemHungry
CoresPerExecutor = min( MaxUsableMemoryPerContainer / MemPerCore, 5 )
MemPerExecutor = max( 1 GB, MemPerCore × CoresPerExecutor )
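Transcribed into Scala, the heuristic looks roughly like this (a sketch; units are MB, and memHungry comes from the per-algorithm modifiers discussed earlier):

  def executorSizing(queueMemMb: Long, queueCores: Int,
                     maxUsableMemPerContainerMb: Long, memHungry: Double): (Int, Long) = {
    val memPerCoreMb = (queueMemMb.toDouble / queueCores) * memHungry
    // Cap at 5 cores per executor, a commonly cited ceiling for HDFS client throughput
    val coresPerExecutor = math.max(1, math.min((maxUsableMemPerContainerMb / memPerCoreMb).toInt, 5))
    val memPerExecutorMb = math.max(1024L, (memPerCoreMb * coresPerExecutor).toLong)
    (coresPerExecutor, memPerExecutorMb)
  }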
24. Number of executors (1/2)
• Determine how many executors can be accommodated
per node
• Consider queue constraints
• Consider testing per node with different core values
AvailableExecutors = Σ_{n=0}^{ActiveNodes} min( AvailableNodeMemory_n / (MemPerExecutor + Overheads), AvailableNodeCores_n / CoresPerExecutor )
UseableExecutors = min( AvailableExecutors, QueueMemory / MemPerExecutor, QueueCores / CoresPerExecutor )
25. Number of executors (2/2)
• Compute executor count required for memory
requirements
• Determine final executor count
NeededExecutors = CacheSize / (SparkMemPerExecutor × SparkStorageFraction)
FinalExecutors = min( UseableExecutors, NeededExecutors × ComputeHungry )
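Putting both slides together in Scala (a sketch; nodes holds an (availableMemoryMb, availableCores) pair per active node as reported by the Resource Manager, and units are MB):

  def executorCount(nodes: Seq[(Long, Int)], memPerExecutorMb: Long, overheadMb: Long,
                    coresPerExecutor: Int, queueMemMb: Long, queueCores: Int,
                    cacheSizeMb: Long, sparkMemPerExecutorMb: Long,
                    storageFraction: Double, computeHungry: Double): Int = {
    // Per-node fit, summed across active nodes
    val available = nodes.map { case (memMb, cores) =>
      math.min(memMb / (memPerExecutorMb + overheadMb), (cores / coresPerExecutor).toLong).toInt
    }.sum
    // Clamp to what the queue allows
    val useable = Seq(available, (queueMemMb / memPerExecutorMb).toInt, queueCores / coresPerExecutor).min
    // Executors needed to cache the working set
    val needed = math.ceil(cacheSizeMb / (sparkMemPerExecutorMb * storageFraction)).toInt
    math.max(1, math.min(useable, (needed * computeHungry).toInt))
  }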
26. Data Partitioning
• Minimally want a partition per executor core
• Increase number of partitions if memory constrained
MinPartitions = NumExecutors × CoresPerExecutor
MemPerTask = SparkMemPerExecutor × (1 − SparkStorageFraction) / CoresPerExecutor
MemPartitions = CacheSize / MemPerTask
FinalPartitions = max( MinPartitions, MemPartitions )
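And the partitioning rule in the same style (a sketch; units are MB):

  def partitionCount(numExecutors: Int, coresPerExecutor: Int,
                     sparkMemPerExecutorMb: Long, storageFraction: Double,
                     cacheSizeMb: Long): Int = {
    val minPartitions = numExecutors * coresPerExecutor              // one partition per core, minimum
    val memPerTaskMb = sparkMemPerExecutorMb * (1 - storageFraction) / coresPerExecutor
    val memPartitions = math.ceil(cacheSizeMb / memPerTaskMb).toInt  // more partitions if memory constrained
    math.max(minPartitions, memPartitions)
  }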
27. Driver sizing
• Leverage executor size as baseline
• Increase driver memory if analysis requires
– Some algorithms very driver memory heavy
• Reduce executor count if needed to reflect presence of cluster-side driver
28. Sanity checks
• Make sure none of the chosen parameters look crazy!
– Massive partition counts
– Crazy resourcing for small files
– Complex jobs squeezed for overwhelmed clusters
• Think about edge cases
• Iterate if necessary!
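For example, a couple of clamp-style checks (the thresholds here are invented for illustration):

  def sanityCheck(partitions: Int, executors: Int, inputMb: Long): (Int, Int) = {
    val p = math.min(partitions, 20000)                   // no massive partition counts
    val e = if (inputMb < 512) math.min(executors, 2)     // tiny inputs get tiny resources
            else executors
    (p, e)
  }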
30. Future opportunities
Cloud Scaling
• Autoscale cloud EMR clusters
– Leverage understanding of required resources to dynamically manage the number of compute nodes
– Significant performance benefits and cost savings
Iterative improvements
• Retain decisions from prior runs in metastore
• Harvest additional sizing and runtime information
– Spark REST APIs
• Leverage data to fine tune future runs
31. Conclusions
• Enterprise clusters are noisy, shared environments
• Configuring Spark can be labor intensive & error prone
– Significant trial and error process
• Important to automate the Spark configuration process
– Remove requirement for data scientists to understand Spark and cluster configuration details
• Software automatically reconciles analysis requirements with available resources
• Even simple estimators can do a good job
32. More details?
• Too little time…
– Will blog more details, example resourcing, and Scala code fragments at
alpinedata.com/blog