The Hadoop framework was originally designed to run natively on commodity hardware. With the growing adoption of cloud computing, however, there is increasing demand to build Hadoop clusters on public/private clouds so that customers can benefit from virtualization and multi-tenancy. This talk introduces the challenges of providing a Hadoop service on a virtualization platform (performance, rack awareness, job scheduling, memory overcommitment, etc.) and proposes solutions.
2. Cloud: Big Shifts in Simplification and Optimization
1. Reduce the Complexity: simplify operations and maintenance
2. Dramatically Lower Costs: redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery: meet and anticipate the needs of the business
3. Infrastructure, Apps and now Data…
Build, run, and manage across private and public clouds:
• Simplify Infrastructure with Cloud
• Simplify App Platform through PaaS
• Next Trend: Simplify Data
4. Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
Yes, you are part of the yotta generation…
New data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies
Source: The Information Explosion, 2009
6. Trend 3/3: Value from Data Exceeds Hardware Cost
Value from the intelligence of data analytics now outstrips the cost of hardware
• Hadoop enables the use of lower-cost hardware
• Hardware costs halve roughly every 18 months
[Chart: value vs. cost. Big Iron at $40k/CPU vs. a commodity cluster at $1k/CPU]
7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
The trend is "not just Hadoop" for big data
• Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
• Unify the infrastructure platform for all of them: Big SQL, NoSQL, Hadoop, and DSS clusters on a common hardware base, private or public
• Eliminate the hardware/driver/testing phase
• Use the existing team for ordering, diagnosis, and capacity management of the hardware farm
8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
I WANT MY HADOOP CLUSTER NOW!
Instant Cluster Provisioning
• Provision Hadoop clusters instantly
• Automatable using provisioning engines/scripts, e.g. Apache Whirr
9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
Increase Utilization
• The Hadoop cluster uses only the resources it needs
• Spare resources can be used by other applications when Hadoop is idle
Eliminate Single Points of Failure
• Use vSphere HA for the NameNode and JobTracker
Use VM Isolation
• Create separate clusters with defensible security
• Enables multiple versions of Hadoop on the same infrastructure
• Extends to Hadoop and Linux environments
Leverage Resource Management
• Control/assign resources through resource pools
• E.g., use spare cycles for Hadoop processing through priority control
10. What? Hadoop in a VM? Really?
Actually, Hadoop performs well in a virtual machine
13. Hadoop Configuration
Distribution
• Cloudera CDH3u0
• Based on Apache open-source 0.20.2
Parameters
• dfs.datanode.max.xcievers=4096
• dfs.replication=2
• dfs.block.size=134217728
• io.file.buffer.size=131072
• mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
• mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
Network topology
• Hadoop uses topology info for reliability and performance
• Multiple VMs/host: Each host is a “rack”
14. Benchmarks
Derived from test apps included in the distro
Pi
• Direct-exec Monte Carlo estimation of pi: π ≈ 4R/(R+G) (≈ 22/7), where R samples fall inside the quarter circle and G outside
• # map tasks = # logical processors
• 1.68 T samples
TestDFSIO
• Streaming write and read
• 1 TB
• More tasks than processors
Terasort
• 3 phases: teragen, terasort, teravalidate
• 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
• More tasks than processors
• Exercises CPU, networking, and storage I/O
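The Pi benchmark's estimator can be sketched in a few lines. This is a minimal single-process analogue, not the distributed Hadoop job (which splits the sample count across map tasks and aggregates the inside/outside counts in a reduce step):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: throw points into the unit square
    and count how many land inside the quarter circle (R) vs. outside (G).
    Then pi ~= 4*R/(R+G)."""
    rng = random.Random(seed)
    inside = 0  # R; samples - inside is G
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

With 1.68 T samples as on the slide, the estimate is far tighter than this toy run; the error shrinks roughly as the inverse square root of the sample count.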
15. Performance of Hadoop for Several Workloads
[Chart: ratio of elapsed time to native (lower is better) for each workload, comparing 1-VM and 2-VMs-per-host configurations; y-axis "Ratio to Native", 0 to 1.2]
16. Architecting Hadoop as a Service using Virtualization
Goals
• Make it fast and easy to provision new Hadoop clusters on demand
• Leverage virtual machines to provide isolation (esp. for multi-tenancy)
• Optimize Hadoop's performance based on virtual topologies
• Make the system reliable based on virtual topologies
Leveraging Virtualization
• Elastic scale in/out
• Use high availability to protect the NameNode/JobTracker
• Resource controls and sharing: re-use underutilized memory and CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
17. Provisioning
Leverage the vSphere APIs to auto-deploy a cluster
• Whirr, HOD, or custom tooling using Ruby, Chef, etc.
Use linked clones to rapidly fork many nodes
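A scripted-provisioning wrapper can be sketched as follows, assuming the Apache Whirr CLI is on the PATH; the recipe values (cluster name, provider, instance templates) are illustrative, not from the talk:

```python
import subprocess

# Illustrative cluster recipe; property names follow Apache Whirr's format.
RECIPE = """\
whirr.cluster-name=demo-hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
"""

def build_launch_command(config_path: str) -> list:
    """Assemble the Whirr CLI invocation that launches a cluster."""
    return ["whirr", "launch-cluster", "--config", config_path]

def provision(config_path: str = "hadoop.properties") -> None:
    # Write the recipe, then hand it to Whirr to bring the cluster up.
    with open(config_path, "w") as f:
        f.write(RECIPE)
    subprocess.run(build_launch_command(config_path), check=True)
```

The same pattern works with HOD or a custom engine: generate a recipe, then invoke the provisioning tool against it.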
18. Fast Provisioning
From a "seed" node to a cluster
• Thin provisioning: a 60 GB image occupies ~3.5 GB on disk
• Linked clone: ~6 seconds
19. SAN, NAS or Local Disk?
Shared Storage (SAN or NAS)
• Easy to provision
• Automated cluster rebalancing
Hybrid Storage
• SAN for boot images, VMs, and other workloads
• Local disk for HDFS
• Scalable bandwidth, lower cost/GB
[Diagram: Hadoop VMs and other VMs mixed across six hosts]
20. Enable Automatic Rack awareness through vSphere
Important for a robust Hadoop cluster
Automatic network-topology detection is an important vSphere feature
The rack script is generated automatically
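Hadoop consumes rack information through a topology script (configured via topology.script.file.name in the 0.20/CDH3 line used here). A minimal sketch of an auto-generated script, assuming a hypothetical VM-IP-to-host mapping in which each physical host is treated as a "rack":

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping from VM IP to physical host, as produced by the
# provisioning layer; each physical host becomes one Hadoop "rack".
VM_TO_HOST = {
    "10.0.0.11": "esx-host-01",
    "10.0.0.12": "esx-host-01",
    "10.0.0.21": "esx-host-02",
}

def rack_of(ip: str) -> str:
    """Return the rack path Hadoop expects, e.g. /esx-host-01."""
    return "/" + VM_TO_HOST.get(ip, "default-rack")

if __name__ == "__main__":
    # Hadoop invokes the script with one or more node addresses and
    # reads back one rack path per address on stdout.
    print(" ".join(rack_of(ip) for ip in sys.argv[1:]))
```

Regenerating this mapping whenever VMs migrate keeps the reported topology in sync with the physical layout.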
21. Multi-tenant: share cluster or not
Shared big cluster vs. isolated small clusters
• Shared big cluster: high performance, large scale, pre-job provisioning
• Isolated small clusters: secure, flexible, post-job provisioning
A combination is possible, as customers' requirements differ
22. Elastic Hadoop Cluster
Traditional Hadoop cluster
• Easy to scale out: fast-provision new Hadoop nodes and join them into the existing cluster
• Hard to scale in:
while (clusterIsTooLarge) {
    choose node k;
    kill(node k);
    wait(k's data blocks are recovered);
    if necessary, hadoop.rebalance();
}
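The elastic-node idea sidesteps this loop: a node that runs only a TaskTracker (no DataNode) stores no HDFS blocks, so it can be removed without waiting for block recovery. A toy sketch of that selection policy (node names and fields are illustrative, not from any Hadoop API):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    runs_datanode: bool      # stores HDFS blocks
    runs_tasktracker: bool   # provides compute only

def scale_in_candidates(nodes, to_remove: int):
    """Prefer elastic (compute-only) nodes: removing them requires no
    HDFS block recovery, so scale-in completes immediately."""
    elastic = [n.name for n in nodes
               if n.runs_tasktracker and not n.runs_datanode]
    return elastic[:to_remove]

cluster = [
    Node("worker-1", True, True),    # normal node: DataNode + TaskTracker
    Node("worker-2", True, True),
    Node("elastic-1", False, True),  # elastic node: TaskTracker only
    Node("elastic-2", False, True),
]
print(scale_in_candidates(cluster, 1))  # → ['elastic-1']
```

Normal nodes still need the slow kill-and-recover path; elastic nodes make the compute tier freely resizable.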
Elastic hadoop cluster
[Diagram: elastic Hadoop cluster. The NameNode/JobTracker VM is fixed; normal nodes run both a TaskTracker and a DataNode; elastic nodes run only a TaskTracker]
23. Replica Placement
Second Replica
• Different rack
• Rack awareness required
Third Replica
• Same rack, different physical host
• Nodes may share a host (in a virtualized environment)
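The placement policy described above can be sketched as follows. This is a simplified model, not HDFS's actual BlockPlacementPolicy; the node, host, and rack values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DN:
    name: str
    host: str   # physical host (may be shared by several VMs)
    rack: str

def place_replicas(writer, nodes):
    """First replica on the writer's node; second on a different rack;
    third on the second replica's rack but a different physical host."""
    first = writer
    second = next(n for n in nodes if n.rack != first.rack)
    third = next(n for n in nodes
                 if n.rack == second.rack and n.host != second.host
                 and n != first)
    return [first, second, third]

nodes = [
    DN("vm-a1", "host-a", "rack1"),
    DN("vm-b1", "host-b", "rack2"),
    DN("vm-c1", "host-c", "rack2"),
]
print([n.name for n in place_replicas(nodes[0], nodes)])  # → ['vm-a1', 'vm-b1', 'vm-c1']
```

The `host != second.host` check is the virtualization-specific twist: two VMs on the same rack may share one physical machine, and placing both copies there would defeat the purpose of the third replica.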
25. Performance
Create more, smaller VMs
• Makes Hadoop scale better
• Allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
Sizing/configuration of storage is critical
• Plan on ~50 MB/s of bandwidth per core
• SANs are typically configured by default for IOPS, not bandwidth
• Ensure SAN ports/switch topology allow the required aggregate bandwidth
• Performance of the backend storage should be tested/sized
• Local disks give ~100-140 MB/s per disk: pick the correct controller
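As a worked example of the ~50 MB/s-per-core rule (the host core count and per-disk throughput below are illustrative assumptions, not from the talk):

```python
def required_bandwidth_mb_s(cores: int, mb_per_core: int = 50) -> int:
    """Aggregate storage bandwidth a host needs per the rule of thumb."""
    return cores * mb_per_core

def disks_needed(bandwidth_mb_s: int, mb_per_disk: int = 120) -> int:
    """Local disks needed to supply that bandwidth (~100-140 MB/s each)."""
    return -(-bandwidth_mb_s // mb_per_disk)  # ceiling division

need = required_bandwidth_mb_s(16)   # e.g. a 16-core host
print(need, disks_needed(need))      # → 800 7
```

So a 16-core host wants roughly 800 MB/s of aggregate bandwidth, i.e. about seven local disks, which is why a SAN tuned only for IOPS often becomes the bottleneck.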
26. Summary
Hadoop works well in a virtual environment
Plan a virtual cluster and enable other big-data solutions on the same infrastructure
Leverage the recipes to automate your configuration and deployment