Lessons from building large clusters

©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Phil Day, HP Consulting
8th November 2010

Small vs Large Clusters
Small Production Clusters and Proof of Concept
– Built and run by a few skilful people
– Can be a natural extension to conventional IT
– You know the servers by name
Large Production Clusters
– Built and run by pioneers
– Large development staff
– Major Hadoop contributors
– Understand the problems of scale
Images: Creative Commons 2.0 – Attribution Andrew Morrell (Flickr)

Large Scale Early Adopters
– Have, or want to start with, a small PoC (tens of nodes)
– Want to scale quickly to a large cluster (hundreds of nodes)
– Want the scale of large clusters, but with the build and operational model of a small one
– Want to run the cluster rather than build and develop it
– Need to integrate it with existing systems
Unfortunately not all things in life scale as well as Hadoop
Design – The Technology Challenge
Build – The Engineering Challenge
Transfer to Operations – The Service Management Challenge

Design – The Technology Challenge
Selecting all the right bits
Server Selection
– Core Nodes: Resilient, Big Memory, RAID
– Data Nodes: Not resilient, no RAID or hot swap, basic iLO
– Trade off Disks vs Cores vs Memory to match target load
– Need to consider disk allocation policy
– Network redundancy is useful to avoid rack switch failures
– Edge Nodes (data ingress/egress and management)
   – Higher-spec data nodes
   – Help provide the “appliance” view of the cluster
   – Have Hadoop installed but don’t run as part of the cluster
Network Selection
– Dual 1Gb from data nodes to rack switches
– 10Gb from rack switches to core, and from Edge nodes
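
The network figures above imply an oversubscription ratio at each rack switch, which is worth estimating before fixing the rack layout. The following is a minimal sketch, not a sizing rule from the talk: the function name, the number of nodes per rack, and the number of 10Gb uplinks are all assumptions, and it treats both 1Gb links on a data node as active (if one is standby for redundancy, halve the node-side figure).

```python
def rack_oversubscription(nodes_per_rack, links_per_node=2, node_link_gbps=1.0,
                          uplinks_per_rack=1, uplink_gbps=10.0):
    """Return node-facing bandwidth divided by rack uplink bandwidth.

    Defaults follow the slide (dual 1Gb per data node, 10Gb to the core);
    nodes_per_rack and uplinks_per_rack are hypothetical inputs.
    """
    if nodes_per_rack < 0 or uplinks_per_rack < 1:
        raise ValueError("need a non-negative node count and at least one uplink")
    downstream_gbps = nodes_per_rack * links_per_node * node_link_gbps
    upstream_gbps = uplinks_per_rack * uplink_gbps
    return downstream_gbps / upstream_gbps


if __name__ == "__main__":
    # Compare a few hypothetical rack densities.
    for nodes in (10, 20, 40):
        ratio = rack_oversubscription(nodes)
        print(f"{nodes} nodes/rack -> {ratio:.1f}:1 oversubscription")
```

A ratio well above 1:1 is not necessarily a problem, since Hadoop tries to keep work close to the data, but it indicates how much cross-rack traffic (for example re-replication after a failure) the uplink can absorb.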

Build – The Engineering Challenge
Do you realise how many cardboard boxes that is?
Building at the scale of 500+ servers has its own set of problems
• Space and Environment
• Consistency of Build
• Failures during the Build
• Deployment time and the cost of rework
Two things we found very helpful:
Factory Integration Services
Cluster Management Utility

Build – HP Factory Integration Services
Reducing risk and time
• Many years’ experience of building large clusters
• Site inspection
• Build, Configure, Soak Test
• Diagnose and fix DoAs
• Rack and Label
• Asset tagging
• Custom build and set-up
• Pack and Ship
• On-Site build and integration
www.hp.com/go/factoryexpress
Complex solutions ...
... Made simple

Build – HP Cluster Management Utility
Rack aware deployment and monitoring
• Proven cluster deployment and management tool
   – 11 years of experience
   – Proven with clusters of 3500+ nodes
• Deployment
   – Network and power load aware deployment
   – Easily extensible
   – Kickstart integration
• Monitoring
   – Scalable, non-intrusive monitoring
   – Collectl integration
• Administration
   – Command line or GUI
   – Cluster-wide configuration
www.hp.com/go/cmu
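
To illustrate what "network and power load aware deployment" can mean in practice, here is a minimal sketch that staggers a deployment so that only a limited number of nodes in any one rack are imaged at the same time. It is an illustration only and does not describe how CMU schedules work; the function name, the `(node, rack)` input shape, and the `max_per_rack` limit are all hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def rack_aware_batches(nodes, max_per_rack):
    """Group (node_name, rack_name) pairs into deployment batches.

    Each batch contains at most max_per_rack nodes from any single rack,
    so no rack switch or power feed sees more than that many nodes
    being installed at once.
    """
    if max_per_rack < 1:
        raise ValueError("max_per_rack must be at least 1")

    by_rack = defaultdict(list)
    for node, rack in nodes:
        by_rack[rack].append(node)

    # Cut each rack's node list into chunks of max_per_rack nodes.
    chunks_per_rack = [
        [members[i:i + max_per_rack] for i in range(0, len(members), max_per_rack)]
        for members in by_rack.values()
    ]

    # Batch k is made of the k-th chunk from every rack that still has one.
    batches = []
    for group in zip_longest(*chunks_per_rack, fillvalue=[]):
        batches.append([node for chunk in group for node in chunk])
    return batches


if __name__ == "__main__":
    inventory = [("r1n1", "r1"), ("r1n2", "r1"), ("r1n3", "r1"), ("r2n1", "r2")]
    for number, batch in enumerate(rack_aware_batches(inventory, 2), start=1):
        print(number, batch)
```

The same grouping idea applies to rework: when a batch fails, only that batch needs to be redeployed, which limits the cost of rework noted on the build slide.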

Cluster Performance over time
[Chart: Disk (read), CPU, Disk (write), Network, Map and Red series plotted against time; axis ticks at 05:00, 10:00 and 15:00]

Operate – the organisational challenge
How do we know when it’s working?
Clusters are not just large numbers of servers
• At scale it may never be 100% up (like a network), but it can be 100% down (like a server)
• Need to think more in terms of “How healthy is it?”
• Core nodes are important
• Data nodes much less so – unless they fail in patterns
• Edge nodes – somewhere in between
• Look at HDFS health for replication counts
• Nagios & Ganglia
• Collectl / CMU to visualise the cluster
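
The "How healthy is it?" view above can be sketched as a small classifier that weights node roles differently and watches for data node failures clustering in one rack. This is a sketch under stated assumptions rather than a monitoring recipe from the talk: the `NodeStatus` type, the three health labels, and both thresholds are hypothetical and would need tuning against a real cluster.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    name: str
    role: str  # "core", "edge" or "data"
    rack: str
    up: bool


def cluster_health(nodes, rack_pattern_threshold=3, data_down_fraction=0.1):
    """Classify the cluster as "healthy", "degraded" or "critical".

    - Any core node down is critical.
    - An edge node down, several data nodes down in the same rack
      (a failure pattern), or too large a share of data nodes down
      is degraded.
    - Isolated data node failures are tolerated.
    """
    down = [n for n in nodes if not n.up]
    if any(n.role == "core" for n in down):
        return "critical"

    data_nodes = [n for n in nodes if n.role == "data"]
    data_down = [n for n in down if n.role == "data"]

    # Failures concentrated in one rack suggest a switch or power problem.
    down_per_rack = Counter(n.rack for n in data_down)
    patterned = any(count >= rack_pattern_threshold
                    for count in down_per_rack.values())

    too_many = bool(data_nodes) and (
        len(data_down) / len(data_nodes) > data_down_fraction)
    edge_down = any(n.role == "edge" for n in down)

    if patterned or too_many or edge_down:
        return "degraded"
    return "healthy"
```

A check like this complements rather than replaces the HDFS replication view: a cluster can have every node up and still hold under-replicated blocks.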

Summary
Key considerations when building a large cluster
• Use a pilot system to establish your server configuration
• Stand on the shoulders of the Pioneers
• Build and test in the factory if you can
• Consistency in the build and configuration is vital
• Cherish the NameNode, protect the Edge Nodes, and develop the right level of indifference to the Data Nodes
• Practise the key recovery cases
• Match training and support to the service expectations
And remember: not all things in life scale as well as Hadoop
