2.
Small vs Large Clusters
Small Production Clusters and Proof of Concept
– Built and run by a few skilled people
– Can be a natural extension to conventional IT
– You know the servers by name
Large Production Clusters
– Built and run by pioneers
– Large development staff
– Major Hadoop contributors
– Understand the problems of scale
Images: Creative Commons 2.0 – Attribution: Andrew Morrell (Flickr)
3.
Large Scale Early Adopters
– Have, or want to start with, a small PoC (tens of nodes)
– Want to scale quickly to a large cluster (hundreds of nodes)
– Want the scale of a large cluster, but with the build and operational model of a small one
– Want to run the cluster rather than build and develop it
– Need to integrate it with existing systems
Unfortunately, not all things in life scale as well as Hadoop:
Design – The Technology Challenge
Build – The Engineering Challenge
Transfer to Operations – The Service Management Challenge
4.
Design – The Technology Challenge
Selecting all the right bits
Server Selection
– Core Nodes: Resilient, Big Memory, RAID
– Data Nodes: Not resilient, no RAID or hot swap, basic iLO
   – Trade off Disks vs Cores vs Memory to match the target load
   – Need to consider the disk allocation policy
   – Network redundancy is useful for riding out rack switch failures
– Edge Nodes (data ingress/egress and management)
   – Higher-spec data nodes
   – Help provide the “appliance” view of the cluster
   – Have Hadoop installed but don’t run as part of the cluster
Network Selection
– Dual 1Gb from data nodes to rack switches
– 10Gb from rack switches to core, and from Edge Nodes
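The disk and network trade-offs above come down to simple arithmetic. The sketch below is illustrative only: the number of nodes per rack, the number of 10Gb uplinks, the disk sizes and the reserve fraction are all assumed values, not figures from this deck. The replication factor of 3 is the HDFS default.

```python
def rack_oversubscription(nodes_per_rack, links_per_node, link_gbps,
                          uplinks, uplink_gbps):
    """Ratio of total node-facing bandwidth to rack uplink bandwidth.

    A ratio of 1.0 means the uplinks can carry everything the nodes can
    send out of the rack; higher values mean the uplinks are the bottleneck.
    """
    downstream = nodes_per_rack * links_per_node * link_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


def usable_hdfs_tb(data_nodes, disks_per_node, disk_tb,
                   replication=3, reserve_fraction=0.25):
    """Rough usable HDFS capacity in TB.

    reserve_fraction is an assumed allowance for temporary and non-HDFS
    data on the same disks; tune it from a pilot system.
    """
    raw_tb = data_nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve_fraction) / replication


if __name__ == "__main__":
    # Assumed rack: 20 data nodes with dual 1Gb links, two 10Gb uplinks.
    # 40 Gb/s towards the nodes vs 20 Gb/s towards the core.
    print("oversubscription:", rack_oversubscription(20, 2, 1, 2, 10))

    # Assumed cluster: 500 data nodes, 12 x 2 TB disks each.
    print("usable TB:", usable_hdfs_tb(500, 12, 2.0))
```

Running the same functions with figures measured on a pilot system gives a quick check on whether a candidate server and rack design matches the target load.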
5.
Build – The Engineering Challenge
Do you realise how many cardboard boxes that is?
Building at the scale of 500+ servers has its own set of problems:
• Space and Environment
• Consistency of Build
• Failures during the Build
• Deployment time and the cost of rework
Two things we found very helpful:
• Factory Integration Services
• Cluster Management Utility
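Consistency of build is the kind of problem that can be checked mechanically. As a minimal sketch, assume each node reports a small inventory of attributes (the attribute names and node names below are hypothetical); the most common value of each attribute is treated as the expected one and any node that deviates is flagged.

```python
from collections import Counter


def find_inconsistent_nodes(inventory):
    """Return {node: {attribute: (found, expected)}} for deviating nodes.

    inventory maps node name -> {attribute: value}. The expected value of
    each attribute is the most common one across the cluster, which is an
    assumption: it only works if the majority of nodes were built correctly.
    """
    if not inventory:
        return {}

    attributes = set().union(*(attrs.keys() for attrs in inventory.values()))
    expected = {}
    for attr in sorted(attributes):
        counts = Counter(attrs.get(attr) for attrs in inventory.values())
        expected[attr] = counts.most_common(1)[0][0]

    deviations = {}
    for node, attrs in inventory.items():
        diff = {attr: (attrs.get(attr), want)
                for attr, want in expected.items()
                if attrs.get(attr) != want}
        if diff:
            deviations[node] = diff
    return deviations


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from the
    # deployment tool or a per-node collection script.
    inventory = {
        "dn001": {"bios": "1.4", "disks": 12, "kernel": "2.6.32"},
        "dn002": {"bios": "1.4", "disks": 12, "kernel": "2.6.32"},
        "dn003": {"bios": "1.2", "disks": 11, "kernel": "2.6.32"},
    }
    for node, diff in find_inconsistent_nodes(inventory).items():
        print(node, diff)
```

A check like this is cheap to run after every build wave, which keeps the cost of rework down by catching a bad node before Hadoop is layered on top of it.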
6.
Build – HP Factory Integration Services
Reducing risk and time
• Many years’ experience of building large clusters
• Site inspection
• Build, Configure, Soak Test
• Diagnose and fix DOAs (dead-on-arrival units)
• Rack and Label
• Asset tagging
• Custom build and set-up
• Pack and Ship
• On-Site build and integration
www.hp.com/go/factoryexpress
Complex solutions ... made simple
7.
Build – HP Cluster Management Utility
Rack aware deployment and monitoring
• Proven cluster deployment and management tool
   – 11 years of experience
   – Proven with clusters of 3500+ nodes
• Deployment
   – Network and power load aware deployment
   – Easily extensible
   – Kickstart integration
• Monitoring
   – Scalable, non-intrusive monitoring
   – Collectl integration
• Administration
   – Command line or GUI
   – Cluster-wide configuration
www.hp.com/go/cmu
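One way to picture network and power load aware deployment is as a scheduling problem: never install more than a fixed number of nodes in any one rack at the same time, so that no single rack switch or power feed takes the whole load. The sketch below is an illustration of that idea only. It is not how CMU works internally, and the function, node and rack names are hypothetical.

```python
from collections import defaultdict, deque


def plan_waves(node_racks, max_per_rack):
    """Split nodes into deployment waves.

    node_racks maps node name -> rack name. Each wave contains at most
    max_per_rack nodes from any single rack (an assumed per-rack limit
    standing in for the rack's network and power budget).
    """
    if max_per_rack < 1:
        raise ValueError("max_per_rack must be at least 1")

    # One queue of pending nodes per rack, in a stable order.
    queues = defaultdict(deque)
    for node in sorted(node_racks):
        queues[node_racks[node]].append(node)

    waves = []
    while any(queues.values()):
        wave = []
        for rack in sorted(queues):
            pending = queues[rack]
            for _ in range(min(max_per_rack, len(pending))):
                wave.append(pending.popleft())
        waves.append(wave)
    return waves


if __name__ == "__main__":
    nodes = {"a1": "rack1", "a2": "rack1", "a3": "rack1", "b1": "rack2"}
    for number, wave in enumerate(plan_waves(nodes, max_per_rack=2), start=1):
        print("wave", number, wave)
```

Spreading each wave across racks keeps the overall deployment time short while capping the load on any one rack.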
10.
Operate – The Organisational Challenge
How do we know when it’s working?
Clusters are not just large numbers of servers
• At scale it may never be 100% up (like a network), but it can be 100% down (like a server)
• Need to think more in terms of “How healthy is it?”
   – Core nodes are important
   – Data nodes much less so, unless they fail in patterns
   – Edge nodes are somewhere in between
• Look at HDFS health for replication counts
• Nagios and Ganglia
• Collectl / CMU to visualise the cluster
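The "how healthy is it?" view above can be sketched as a small status function: any core node down is critical, scattered data node failures are tolerated, and data nodes failing in a pattern (here, a large share of one rack) raise a warning. The threshold, the status names and the input layout are assumptions for illustration, not part of any monitoring tool named on this slide.

```python
from collections import Counter


def cluster_health(core_up, data_nodes, rack_alert_fraction=0.5):
    """Summarise cluster health.

    core_up maps core node name -> bool (True if up).
    data_nodes maps data node name -> (rack name, True if up).
    rack_alert_fraction is an assumed threshold: a rack with at least this
    share of its data nodes down is treated as a failure pattern.
    """
    total = Counter()
    down = Counter()
    for rack, is_up in data_nodes.values():
        total[rack] += 1
        if not is_up:
            down[rack] += 1

    suspect_racks = sorted(
        rack for rack in total
        if down[rack] / total[rack] >= rack_alert_fraction
    )

    if total:
        data_up_fraction = 1 - sum(down.values()) / sum(total.values())
    else:
        data_up_fraction = 0.0

    if not all(core_up.values()):
        status = "critical"   # core nodes are important
    elif suspect_racks:
        status = "degraded"   # data nodes failing in a pattern
    else:
        status = "healthy"    # scattered data node failures are tolerated

    return {
        "status": status,
        "data_up_fraction": data_up_fraction,
        "suspect_racks": suspect_racks,
    }


if __name__ == "__main__":
    core = {"namenode": True, "jobtracker": True}
    data = {
        "dn001": ("rack1", True), "dn002": ("rack1", False),
        "dn003": ("rack2", True), "dn004": ("rack2", True),
    }
    print(cluster_health(core, data))
```

Grouping failures by rack is only one possible pattern; the same approach applies to grouping by power feed, switch or build batch.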
11.
Summary
Key considerations when building a large cluster
• Use a pilot system to establish your server configuration
• Stand on the shoulders of the Pioneers
• Build and test in the factory if you can
• Consistency in the build and configuration is vital
• Cherish the NameNode, protect the Edge Nodes, and develop the right level of indifference to the Data Nodes
• Practice the key recovery cases
• Match training and support to the service expectations
And remember: not all things in life scale as well as Hadoop