Running Spark and MapReduce Together in Production
1. RUNNING SPARK AND MAPREDUCE TOGETHER IN PRODUCTION
David Chaiken, CTO of Altiscale
chaiken@altiscale.com
#HadoopSherpa
2. AGENDA
• Why run MapReduce and Spark together in production?
• What about H2O, Impala, and other memory-intensive frameworks?
• Batch + Interactive = Challenges
• Specific issues and solutions
• Ongoing Challenges: Keeping Things Running
• Perspective: Hadoop as a Service versus DIY*
* do it yourself
3. ALTISCALE PERSPECTIVE:
INFRASTRUCTURE NERDS
• Experienced Technical Yahoos
• Raymie Stata, CEO. Former Yahoo! CTO,
advocate of Apache Software Foundation
• David Chaiken, CTO.
Former Yahoo! Chief Architect
• Charles Wimmer, Head of Operations.
Former Yahoo! SRE
• Hadoop as a Service, built and managed by Big Data,
SaaS, and enterprise software veterans
• Yahoo!, Google, LinkedIn, VMware, Oracle, ...
4. SOLVED: COST-EFFECTIVE DATA SCIENCE AT SCALE
But how do you make it easier for data scientists?
Two bad options:
1. Use Hadoop directly, using unfamiliar and unproductive command-line tools and APIs
2. Use Hadoop indirectly, via a back-and-forth with data engineers who translate needs into Hadoop programs
11. INTERACTIVE: INCREASE CONTAINER SIZE
Challenge: Memory-intensive systems take as much local DRAM as available.
Solutions:
• Spark and H2O: Increase the YARN container memory size (see the sketch below)
• Impala: Box it in using operating system containers
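A minimal yarn-site.xml sketch of the memory side (the property names are standard YARN; the values are illustrative and depend on node hardware):

  <!-- yarn-site.xml: illustrative values only; size to your hardware -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>98304</value> <!-- DRAM this NodeManager may hand out (96 GB) -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>32768</value> <!-- ceiling for one container; raise it for large Spark/H2O executors -->
  </property>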
12. HIVE + INTERACTIVE: WATCH OUT FOR LARGE CONTAINER SIZE
• Caution: Larger YARN container settings for interactive jobs may not be right for batch systems like Hive
• Container sizing needs to combine vcores and memory, for example:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
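The vcore side of that sizing also lives in yarn-site.xml; again, the values here are hypothetical and should be tuned to the workload mix:

  <!-- yarn-site.xml: hypothetical values; tune to your workload mix -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value> <!-- vcores this NodeManager may allocate -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value> <!-- ceiling for one container; keeps batch and interactive balanced -->
  </property>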
13. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
• Caution: Scheduling interactive systems alongside batch systems like Hive may result in fragmentation
• Interactive systems may require all-or-nothing scheduling
• Batch jobs with lots of little tasks may starve interactive jobs
14. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
Solutions:
• Reserve interactive nodes before starting batch jobs
• Reduce interactive container size (if the algorithm permits)
• Node labels (YARN-2492) and gang scheduling (YARN-624); a node-label sketch follows
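Assuming a Hadoop version that has node labels (YARN-2492), a capacity-scheduler.xml sketch along these lines pins an interactive queue to labeled nodes. The queue and label names are hypothetical, and the label itself would first be registered with yarn rmadmin -addToClusterNodeLabels:

  <!-- capacity-scheduler.xml: "batch" and "interactive" are example names -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>batch,interactive</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.accessible-node-labels</name>
    <value>interactive</value> <!-- grant this queue access to nodes labeled "interactive" -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.accessible-node-labels.interactive.capacity</name>
    <value>100</value> <!-- the interactive queue gets all capacity on those nodes -->
  </property>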
16. CHALLENGE: SECURITY
• Challenge: User management is not uniform
• MapReduce: collaboration requires getting groups right
• Hive: proxyuser settings have to be right for hiveserver2 (see the sketch below)
• Spark: application owner versus connected users
• Impala: “I just gotta be me!”
• As usual, watch out for cluster administrator accounts!
• Challenge: Port and protocol management
• Best security practice: open specific ports for specific protocols
• Spark: “I just gotta be free!”
• Spark improved between versions 1.0.2 and 1.1.0, but it is still confusing
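For the hiveserver2 point, the impersonation knobs live in core-site.xml. A sketch, with a hypothetical host and groups (avoid wildcards in production):

  <!-- core-site.xml: let the hive service user impersonate connected users -->
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>hiveserver2.example.com</value> <!-- hypothetical host running hiveserver2 -->
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>analysts,data-eng</value> <!-- hypothetical groups that may be impersonated -->
  </property>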
17. CHALLENGE: WEB SERVING
• How to provide interactive services to business users?
• Concerns: security, variable resources, latency, availability
• Keep serving infrastructure separate from Hadoop
18. CHALLENGE: RESOURCE ATTRIBUTION (BILLING)
• Accounting for long-running Spark, H2O, Impala clusters?
• Is reserving resources the same as using the resources?
• Trade-off: availability/response time vs. oversubscription.
19. CHALLENGE: STABILITY VERSUS AGILITY
• Never-ending story: the latest hotness versus SLAs*
• New-system stability curve. Example…
• SPARK-1476: 2GB limit in Spark for blocks
• Interoperation issues. Example…
• IMPALA-1416: Queries fail with metastore exception after upgrade and compute stats
• HIVE-8627: Compute stats on a table from Impala caused the table to be corrupted
• Many issues come down to YARN container size and JVM heap size configuration (sketched below)
* service level agreements
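A common pattern for the container-versus-heap point, with illustrative numbers, is to leave headroom between the YARN allocation and the JVM's -Xmx so that native memory and JVM overhead do not trip YARN's limit:

  <!-- mapred-site.xml: keep the heap around 80% of the container -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value> <!-- the limit YARN enforces -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3276m</value> <!-- the heap the JVM is allowed -->
  </property>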
20. PERSPECTIVE: HADOOP AS A SERVICE VERSUS DIY (DO IT YOURSELF)
• Data Scientists and Data Engineers:
use the right tools for the right job
• Data Scientists and Data Engineers:
don’t spend your time on cluster maintenance
• Hadoop As A Service: have your cake and eat it, too
• Benefit from the experiences of other customers
• One size does not fit all, but one configuration schema does
• Leave the maintenance to us infrastructure nerds
http://2015.hadoopsummit.org/san-jose/agenda/
Abstract: Clusters must be tuned properly to run memory-intensive systems like Spark, H2O, and Impala alongside traditional MapReduce jobs. This talk describes Altiscale's experience running the new memory-intensive systems in production for our customers. We focus on the cluster tuning that we needed to do to create environments that run a mix of processing frameworks reliably and efficiently. Our results show that there's no need to rip and replace MapReduce clusters in favor of Spark, or any other memory-intensive system.
back-and-forth between data engineers: latency, misunderstanding
good news: newer tools (developed over the last 5 years) eliminate the need for data scientists to use the raw interfaces
think about nested loops of activity.
inner loop: modeling
pop out to outer loop (flattening): e.g. to use different classifier to get better signal
outer loop: exploration, looks at source form data directly.
note that data is often stored twice: source form data (can always go back to it if you need it) and structured/cleaned data typically in hive: more convenient to use this data set in general.
from time to time, need to go back to the source form data to look for signal that may not be in the structured source. kind of like flattening, but data is dirtier, harder to understand.
common but not universal workflow. mostly an example for the tips in the rest of the presentation
meme: Google stopped using Map/Reduce years ago
reality: there are still lots of M/R jobs running in Google’s infrastructure
best of breed suite also applies to modeling, directly on big data, directly on Hadoop
over the last few years, huge evolution of tools: ability to do scale-out modeling directly on top of Hadoop. in the past, used to be a relatively rare thing, e.g. Mahout was fairly difficult, only worthwhile using when there’s a lot of benefit. more recently: move modeling off of workstations and directly onto Hadoop cluster
tools on top: Hadoop-native, built for scale-out, big data manner, where Hadoop is strong
lower: legacy tools that are embracing scale-out computation directly on Hadoop, directly on big data
You want them on one cluster because big data is big. When you have the data in multiple environments, like EMR, you pay a penalty. Your jobs run two times slower because you have to keep moving the data around.
Data scientists need a mixed environment. It’s not effective for them to have Spark off on its own cluster. It’s just not how they work.
However, the community has not come to grips with mixed workloads yet. It’s a bit unstable and you can see this when you start asking yourself questions like “Why is my Spark job not starting?” or “Why is my Spark job consuming so many resources?”
Analogy: OLTP + OLAP = Challenges
Map/Reduce (especially hidden under SQL) is still awesome for data cleaning and other tasks that are a high bandwidth game.
Spark, H2O, Impala are great for interactive, iterative,“inner-loop” data analysis that is a low latency game.
Map/Reduce tends to generate lots of little tasks; the newer frameworks self-schedule and need lots of DRAM (soon: lots of DRAM and/or Flash)
Running both types together causes resource conflicts!
Challenge: Memory-intensive systems take as much local DRAM as available.
Solutions: Increase YARN container memory size for DRAM-intensive systems like Spark and H2O. Use operating system containers to box Impala on datanodes.
Note: alignment of diagrams on the next few slides is critical
The diagram could benefit from a legend!
- circles instead of squares to avoid the Hermann grid illusion
Stinger (Hive 0.13 + Tez) is intended to be more balanced
At Altiscale, we think that AWS is awesome for web serving – even though we know that AWS is not great for Hadoop.
Operating system containers (namespaces + cgroups) can help with container/heap size issues.
Wouldn’t it be great if the JVM could ask for more resources instead of putting itself into a GC loop?
Interoperation issues aren’t just technical (or even mostly technical). Impala and Hive have to interoperate, but they are championed by competitors (Cloudera and Hortonworks).
SPARK-1476 details are in https://altiscale.zendesk.com/agent/tickets/1589
IMPALA-1416/HIVE-8627 are in https://altiscale.zendesk.com/agent/tickets/1510
Data scientists should never be spending time getting all of these frameworks to work well together. That’s the job that infrastructure nerds should be doing.
Hadoop As A Service: modeled after the internal services of Internet companies