At the StampedeCon 2013 Big Data conference in St. Louis, Shrikanth Shankar, Head of Engineering at Qubole, presented "Cloud-Friendly Hadoop and Hive." The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks like Hive, Oozie, and Sqoop are becoming the tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, Shankar describes how Qubole's experience led it to extend Hadoop and Hive to exploit these new tradeoffs, and presents use cases showing how technologies born of large-scale challenges at Facebook are now easy for significantly smaller end users to leverage in the cloud.
2. INTRODUCTION
• Hadoop has revolutionized big data processing
• Becoming the de-facto platform for new data projects
• Started as a file system (HDFS) + programming framework (MapReduce). An ecosystem of
projects has sprung up on top of Hadoop
• Hive, Pig, Cascading etc. - Simple ways of processing data
• Sqoop, Flume etc. - Data movement into and out of HDFS
• Oozie, Azkaban etc. - Workflow scheduling
• However, these systems were all designed with an on-premise architecture in mind.
• The cloud is different enough - Some things can/should change.
Thursday, July 25, 13
3. ON-PREMISE HADOOP ARCHITECTURE
[Diagram: a Hadoop cluster - Namenode, JobTracker, and a row of DataNode/TaskTracker (DN/TT) machines - under IT control, with relational systems (Hive metastore etc.) alongside, serving many end users]
4. HADOOP ON-PREMISE
• Usually deployed on bare-metal nodes*
• HDFS is store of choice (3-way replication for safety). Locality of data
access is a big design point
• Clusters are mostly static - new machines are added on IT schedule*
• Static clusters means users can focus on their tasks (MR jobs, Hive
queries) and not on cluster management
• IT bears the burden of managing clusters
5. HADOOP ON-PREMISE
• Partitioning of resources
• Static partitioning with different clusters for Batch and
Interactive workloads
• Within a cluster load balancing is done by the JT scheduler
• Capex costs are significant
• IT controlled - requires an Ops team (Hadoop ops, Sysadmin
etc.)
8. INFRASTRUCTURE
CHARACTERISTICS
• Running in a VM
• Not that big a deal usually - except plan for performance variability
• No locality information
• Nodes are ephemeral - if you lose a node you will lose data on the node
• AZ-wide correlated failures are to be expected. Region wide are possible (but rare)
• High capacity Object stores with high cross sectional bandwidth
• High latency, variable performance, REMOTE*. Not POSIX compliant
• Persistent block stores
• REMOTE, variable performance
9. INFRASTRUCTURE
CHARACTERISTICS
• ELASTIC
• Add 100 nodes on demand in a few minutes
• Costs are Op-ex (largely).
• Nodes are per hour (CPU + Disk), Storage is per GB
• Cost management is a key challenge
• Some interesting payment choices (On-demand, Spot, Reserved)
11. STORAGE
• From a cost perspective using HDFS for long term storage
means you pay for both CPU and disk.
• It's also more expensive to make HDFS reliable (cross AZ,
maybe even cross Region?)
• Using an object store allows you to pay only for storage
• With object stores you see latency issues since data is remote
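A rough back-of-the-envelope sketch of this cost argument, using hypothetical per-GB prices (actual rates vary by provider and region):

```python
# Hypothetical monthly prices -- substitute your provider's actual rates.
OBJECT_STORE_PER_GB_MONTH = 0.023  # flat per-GB object storage charge
NODE_DISK_PER_GB_MONTH = 0.10      # amortized cost of disk on an always-on node

def hdfs_storage_cost(logical_gb, replication=3):
    """HDFS keeps `replication` physical copies on node-attached disk,
    so long-term storage also pays for the CPUs those disks ride on."""
    return logical_gb * replication * NODE_DISK_PER_GB_MONTH

def object_store_cost(logical_gb):
    """Object stores bill once per logical GB; durability is built in."""
    return logical_gb * OBJECT_STORE_PER_GB_MONTH

print(hdfs_storage_cost(1000))  # 3-way replication on node disk
print(object_store_cost(1000))  # single per-GB charge
```

With these assumed prices, a terabyte kept in HDFS costs roughly an order of magnitude more per month than the same data in an object store, before even counting the cross-AZ replication question above.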
12. STORAGE
• But node storage is still needed when jobs and queries are
active
• For intermediate job results (not all results should go back
to S3 - e.g. stage outputs in Hive)
• For intermediate data (mapper output)
• Makes scaling nodes challenging
• Also since performance is better - may want to move remote
data to HDFS before accessing
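A minimal sketch of such a "promote hot remote data" policy; the threshold and helper names are illustrative assumptions, not Qubole's actual logic:

```python
# Track per-path access counts and flag paths worth copying from the
# object store into HDFS. Purely illustrative.
access_counts = {}

HOT_THRESHOLD = 3  # assumed: promote after this many accesses

def record_access(path):
    """Count an access; return True once the path is 'hot' enough to cache."""
    access_counts[path] = access_counts.get(path, 0) + 1
    return access_counts[path] >= HOT_THRESHOLD

def copy_command(src, dst):
    # In practice the copy could shell out to `hadoop distcp`.
    return ["hadoop", "distcp", src, dst]

hot = False
for _ in range(3):
    hot = record_access("s3://bucket/table/part-0000")
print(hot)  # True after the third access
```

The real tradeoff is that the copy itself costs time and bandwidth, so it only pays off for data read repeatedly while the cluster is up.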
13. COMPUTE AND CLUSTERS
• If you don't need Hadoop for persistent storage - when do
you need a cluster?
• Bring them up on demand - maybe for every job?
• But that can be expensive - no multiplexing
• Ideally you want to share Hadoop clusters as much as
possible. Shut down cluster when not being used
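The "shut down when idle" policy can be sketched as a small monitor; the 30-minute timeout is an assumed value, and the clock is injected so the logic is testable:

```python
import time

IDLE_TIMEOUT_S = 30 * 60  # assumed policy: shut down after 30 idle minutes

class ClusterMonitor:
    """Tracks the last job submission and decides when to terminate."""
    def __init__(self, now=time.time):
        self.now = now
        self.last_activity = now()

    def record_job(self):
        self.last_activity = self.now()

    def should_shut_down(self):
        return self.now() - self.last_activity > IDLE_TIMEOUT_S

# Usage with a fake clock for determinism:
t = [0.0]
mon = ClusterMonitor(now=lambda: t[0])
mon.record_job()
t[0] = 31 * 60  # 31 minutes later, no new jobs
print(mon.should_shut_down())  # True: idle longer than the timeout
```

This preserves the multiplexing benefit - jobs arriving within the window share one cluster - while still releasing nodes when demand goes away.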
14. COMPUTE AND CLUSTERS
• If the cluster is dynamic and you need sharing - how do you
'discover' it?
• How about cluster sizing?
• Static sizing is a leftover from on-premise
• Be dynamic on the cloud. Hard for end users to do manually
15. COMPUTE AND CLUSTERS
• Adding nodes needs to be done based on load
• E.g. most of the time jobs need < 5 nodes. A batch job
comes in that needs 100 nodes. We should expand the cluster
(for as long as needed)
• Removing nodes is trickier
• If we lose intermediate results lots of work will be lost.
• Job1 uses 100 nodes, produces data spread over all of them.
Job 2 consumes results but only needs 10 nodes. How do you
give up 90 nodes?
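One illustrative answer is to make the downscaler data-aware: only release nodes that hold no live intermediate results. A minimal sketch (not an actual scheduler):

```python
def removable_nodes(nodes, data_bytes, target_size):
    """Pick nodes to release, never evicting one that still holds
    live intermediate data.

    nodes: list of node ids
    data_bytes: node id -> bytes of live intermediate data on that node
    target_size: desired cluster size after shrinking
    """
    excess = len(nodes) - target_size
    if excess <= 0:
        return []
    empty = [n for n in nodes if data_bytes.get(n, 0) == 0]
    return empty[:excess]

nodes = [f"node-{i}" for i in range(100)]
# Suppose Job 1 left its output on the first 10 nodes only:
data = {f"node-{i}": 1 for i in range(10)}
print(len(removable_nodes(nodes, data, target_size=10)))  # 90
```

In the slide's scenario the 90 data-free nodes can be returned immediately; the alternative (copying or re-replicating the data off the doomed nodes first) trades network traffic for faster shrinking.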
16. COMPUTE AND CLUSTERS
• Pricing choices are interesting
• E.g. spot nodes average half the price of an on-demand
node
• But if price spikes you lose all the spot nodes at once
• Hadoop fault tolerance can retry failed jobs (but expensive) -
what about data loss when you lose all the spot nodes?
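Whether spot is still worth it can be framed as expected-cost arithmetic. The prices, interruption probability, and re-run factor below are all hypothetical, chosen only to match the "half price" rule of thumb on this slide:

```python
ON_DEMAND = 1.00    # $/node-hour (hypothetical)
SPOT = 0.50         # average spot price: half of on-demand, per the slide
P_INTERRUPT = 0.10  # assumed chance a price spike kills the spot fleet mid-job
RERUN_FACTOR = 1.0  # assumed: an interruption costs one full re-run of the job

def expected_spot_cost(node_hours):
    """Expected cost of running a job on spot, including re-run risk."""
    base = node_hours * SPOT
    rerun = P_INTERRUPT * RERUN_FACTOR * node_hours * SPOT
    return base + rerun

print(expected_spot_cost(100))  # roughly 55, vs 100 on-demand
```

Under these assumptions spot still wins comfortably, but the margin shrinks as the interruption probability or the cost of lost intermediate data grows - which is why the data-loss question above matters.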
17. END USER EXPERIENCE
• The cloud isn't just about cost - it's also about agility. To allow
this we need to focus on the end user experience
• End users would prefer to focus on higher level APIs
• e.g. Run a Hadoop job or a Hive query - specifics of
clusters should be hidden from them
• Some things should be persistent (log files, results, ...)
• They get this for free on premise
18. BETTER END STATE
• IT/dev ops/users should set high level controls
• Usage governance (max cluster size, max bill, cpu hours used
per month etc.)
• End users should focus at the level they understand
• Smart software should bridge the gap
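What such high-level controls might look like, sketched as a simple policy check; the limit names and values are illustrative, not a real API:

```python
# Hypothetical governance limits set by IT/dev ops.
LIMITS = {
    "max_cluster_size": 100,    # nodes
    "max_monthly_bill": 5000,   # dollars
    "max_cpu_hours": 20000,     # per month
}

def check_request(requested_nodes, projected_bill, cpu_hours_used):
    """Return the list of violated limits; empty means the request is allowed."""
    violations = []
    if requested_nodes > LIMITS["max_cluster_size"]:
        violations.append("max_cluster_size")
    if projected_bill > LIMITS["max_monthly_bill"]:
        violations.append("max_monthly_bill")
    if cpu_hours_used > LIMITS["max_cpu_hours"]:
        violations.append("max_cpu_hours")
    return violations

print(check_request(150, 1200, 500))  # ['max_cluster_size']
```

The "smart software" in the slide sits between the end user's job submission and checks like this one, so users never have to think about cluster mechanics directly.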