This document covers the challenges of operationalizing big data applications and how full-stack performance intelligence helps DataOps teams address them: automated diagnosis and remediation to fix problems, automated detection and prevention to be proactive, and automated what-if analysis and planning to prepare for future use. Real-life examples show how intelligence can help with proactively detecting SLA violations, diagnosing Hive/Spark application failures, and planning a migration of applications to the cloud.
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Problems while Operationalizing Big Data Apps
1. Slow, Stuck, or Runaway Apps?
Learn How to Quickly Fix Problems
While Operationalizing Big Data Apps
Shivnath Babu
CTO @ Unravel Data
shivnath@unraveldata.com
2. About me
Shivnath Babu
Co-founder/CTO,
Unravel Data Systems
Adjunct Professor,
Duke University
Menlo Park, CA 94025
• R&D on Hadoop, Spark, NoSQL, streaming,
& MPP to simplify ongoing app/system
management
• Led work at Duke on first self-tuning Hadoop
platform: Starfish
• Awards from NSF, IBM, HP
• PhD, Stanford University
3. Missed SLAs
Poor performance
Failed applications
Underutilized clusters
Low throughput
Unused datasets
Poor data layout
Content
• Challenges in operationalizing big data apps
• How we can improve state-of-the-art
• Real-life examples
7. What can go wrong?
• Failures
• My query failed after 6 hours!
• What does this Exception mean?
• Bad performance
• My app is very slow
• Pipeline is not meeting 4hr SLA
• Unreliable performance
• My app is stuck
• Latency is 3x worse today
• Poor scalability
• Oh, but it worked on the dev cluster!
• Bad App(le)s
• Tom’s query brought the cluster down
9. What can go wrong? And why?
• Application Problems
• Poor joins/transformations
• Ineffective caching
• Bloated data structures
• Data/Storage Problems
• Skewed data, load imbalance
• Small files, poor data partitioning
• Configuration Problems
• Suboptimal container sizes
• Scheduler weight/capacity settings
• Resource Problems
• Resource contention
• Service degradation (e.g., NameNode)
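One of the data/storage problems above, small files, is easy to check for mechanically. A minimal sketch, assuming a local directory stands in for an HDFS listing and 128 MB stands in for the block size:

```python
import os

def small_file_ratio(root, threshold_bytes=128 * 1024 * 1024):
    """Fraction of files under `threshold_bytes` (e.g. one HDFS block).

    A high ratio suggests a small-files / poor-partitioning problem.
    """
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    if not sizes:
        return 0.0
    return sum(size < threshold_bytes for size in sizes) / len(sizes)
```

In practice the same ratio would be computed from the NameNode's file listing rather than a local walk.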
10. How do DataOps teams address this problem today?
11. Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand
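The first step most teams take is to pull those scattered logs together by hand. A toy sketch of that merge, assuming one `*.log` file per host and a hypothetical line format of `"<date> <time> <LEVEL> <message>"`:

```python
import re
from pathlib import Path

# Hypothetical log layout: one file per host, lines beginning with a
# timestamp followed by the log level.
LINE = re.compile(r"^(\S+ \S+) (ERROR|WARN)\b(.*)")

def merged_errors(log_dir):
    """Collect ERROR/WARN lines from every *.log file and sort them by
    timestamp, so events scattered across hosts read as one sequence."""
    events = []
    for path in Path(log_dir).glob("*.log"):
        for line in path.read_text().splitlines():
            m = LINE.match(line)
            if m:
                events.append((m.group(1), path.name, m.group(2) + m.group(3)))
    return sorted(events)
```

Even this much is tedious at cluster scale, which is the point of the slide: the raw logs alone are a poor interface.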
14. There has to be a better way
Full Stack Performance Intelligence
15. Full Stack Performance Intelligence from 30k ft
[Diagram: the Big Data Stack — applications (ETL, BI/SQL, data pipelines, streaming, ML) running on Hadoop, Spark, Kafka, Cassandra, Elasticsearch, and MPP engines, on hardware or in the cloud — emits logs, profiles, metrics, and events; applying predictive analytics to this telemetry yields the intelligence needed by DataOps.]
16. Why Full Stack?
• Because problems can happen all over the stack
o Otherwise, we will be blindsided and give wrong insights
• Because it is now possible to:
o Get full-stack telemetry data (high volume, velocity, & variety)
o Reuse distributed systems to process and store this data
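Much of that telemetry is already exposed; for example, the YARN ResourceManager serves cluster-wide counters at `/ws/v1/cluster/metrics`. A minimal sketch that derives memory utilization from such a payload (the field names follow the Hadoop REST docs; the sample values are invented):

```python
def memory_utilization(metrics_payload):
    """Memory in use as a fraction of total capacity, from a
    ResourceManager /ws/v1/cluster/metrics response (parsed from JSON)."""
    m = metrics_payload["clusterMetrics"]
    total = m["allocatedMB"] + m["availableMB"]
    return m["allocatedMB"] / total if total else 0.0

# Invented sample in the shape the ResourceManager returns:
sample = {"clusterMetrics": {"allocatedMB": 6144, "availableMB": 2048}}
```

The same pattern applies across the stack: each service exposes metrics, and the intelligence layer is what turns the collected streams into answers.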
17. What is “Intelligence”?
• Not just graphs and time-series charts
• And not simply throwing some AI/ML and seeing what comes out
Intelligence = Automation to Augment DataOps
18. Let us Dig Deeper
• We surveyed 250+ DataOps professionals across many verticals to
understand where and how intelligence can benefit them
• Use cases from this survey fall into three categories (aka the Three P’s)
1. I have a Problem that I need to fix
2. I want to be Proactive in detecting and fixing problems
3. I need to Plan for future use
19. Intelligence = Automation to Augment DataOps

DataOps Need → Intelligence Needed
• I have a problem → 1. Automated Diagnosis; 2. Automated Remediation
• I want to be proactive → 1. Automated Detection; 2. Automated Diagnosis; 3. Automated Prevention/Remediation
• I need to plan → 1. Automated Prediction; 2. Automated What-if Analysis
27. But This is Just One Type of Contention
• At Resource Manager Level
• App admission time
• Container allocation for Application Master
• Container allocation for tasks
• Container allocation for Executor
• At Application Level
• Workflow Scheduler, e.g., Oozie
• Query Engine, e.g., HiveServer2
• At Master Daemon Level
• NameNode
• Hive MetaStore
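Resource-Manager-level contention shows up as a gap between when an app is submitted and when it actually starts. A sketch of flagging outliers in that admission delay, assuming per-app records with hypothetical `submitted_ms`/`started_ms` timestamps:

```python
from statistics import median

def flag_slow_admissions(apps, factor=3.0):
    """Flag apps whose submission-to-start wait exceeds `factor` times
    the median wait -- one signal of Resource Manager contention.

    `apps` is a list of dicts with hypothetical keys
    'id', 'submitted_ms', 'started_ms'.
    """
    waits = {a["id"]: a["started_ms"] - a["submitted_ms"] for a in apps}
    med = median(waits.values())
    return sorted(app_id for app_id, w in waits.items() if w > factor * med)
```

Contention at the other levels (workflow scheduler queues, HiveServer2, NameNode RPC latency) can be surfaced with the same outlier-against-peers pattern on the relevant wait metric.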
28. Key Takeaways
Resource contention at different levels affects app performance
• Different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently
• Manual diagnosis can be hard and time-consuming
It is possible to diagnose and remedy such problems automatically
• By analyzing full-stack telemetry data
• By combining: Automated Baselining, Anomaly Detection, & Correlation Analysis
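The baselining-plus-anomaly-detection combination named above can be sketched in a few lines: baseline an app's run history with the median, measure spread with the median absolute deviation (MAD), and flag runs that sit far outside it. A simplistic stand-in for the real analysis:

```python
from statistics import median

def is_anomalous(history, latest, k=5.0):
    """Automated baselining + anomaly detection, minimally:
    baseline = median of past durations; spread = median absolute
    deviation (MAD); flag `latest` if it is more than k*MAD away."""
    base = median(history)
    mad = median(abs(x - base) for x in history)
    return abs(latest - base) > k * max(mad, 1e-9)
```

Correlation analysis then connects a flagged run to what else changed at the same time (queue load, a degraded service, a data volume jump).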
29. Real-life Problem: Hive/Spark App Failure
• My SQL query failed. Why?
• A MapReduce job failed. Why?
• A Task failed. Why?
• JVM went Out-of-Memory. Why?
• Data skew. Where?
• Reduce-side. Got it!
• How to Fix it?
1. At Resource layer, e.g., larger containers
2. At Configuration layer, e.g., turn on dynamic adaptation to skew
3. At Data layer, e.g., separate skewed keys from others
4. At App layer, e.g., filter skew keys or change algorithm
5. Some combination of the above
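Two pieces of that drill-down can be sketched directly: the skew check (compare the largest reducer's input to the median) and, for fix #3 above, salting the hot keys so their records spread across several reducers. Pure Python, not tied to any engine, with invented thresholds:

```python
from statistics import median

def skew_ratio(records_per_reducer):
    """max/median of reducer input sizes; a large ratio (say > 5)
    points to reduce-side data skew."""
    return max(records_per_reducer) / median(records_per_reducer)

def salt_key(key, record_id, hot_keys, buckets=8):
    """Spread a skewed key over `buckets` sub-keys; non-hot keys pass
    through unchanged. The other join side must then replicate its hot
    keys across all bucket variants."""
    if key in hot_keys:
        return f"{key}#{hash(record_id) % buckets}"
    return key
```

Engines also offer the configuration-layer fix from step 2, e.g. Hive's skew-join optimization, which handles this without touching the application.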
30. Real-life Planning: Migrate Apps to Cloud
• How to create perf baselines for on-prem vs. cloud comparison?
• What type of instances to get for same performance on cloud?
• How many permanent vs. spun-on-demand instances are needed?
• Which configuration settings will need tuning for on-prem vs. cloud?
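The instance-sizing question above reduces to comparing candidate shapes against the on-prem baseline. A toy what-if sketch; the instance catalog, specs, and prices below are invented purely for illustration:

```python
import math

def instances_needed(baseline_cores, baseline_mem_gb, instance):
    """Smallest instance count matching on-prem capacity on both
    cores and memory."""
    by_cores = math.ceil(baseline_cores / instance["cores"])
    by_mem = math.ceil(baseline_mem_gb / instance["mem_gb"])
    return max(by_cores, by_mem)

# Invented catalog for illustration only:
catalog = {
    "medium": {"cores": 4, "mem_gb": 16, "usd_per_hr": 0.20},
    "large": {"cores": 8, "mem_gb": 32, "usd_per_hr": 0.38},
}

def cheapest(baseline_cores, baseline_mem_gb):
    """Catalog entry with the lowest hourly cost at the required
    count -- one slice of an automated what-if analysis."""
    return min(
        catalog,
        key=lambda name: instances_needed(
            baseline_cores, baseline_mem_gb, catalog[name]
        ) * catalog[name]["usd_per_hr"],
    )
```

A real analysis would replace raw core/memory capacity with measured per-app resource profiles from the telemetry, which is where the performance baselines in the first bullet come in.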
32. To Summarize
• Operationalizing big data apps is very challenging for DataOps
• Full Stack Performance Intelligence will augment DataOps to:
1. Deliver quick and high ROI on the Big Data Stack
2. Do more in less time
3. Help them sleep better