1. How to Build a Cluster
A High-Level Overview of the Key Issues
Ramsay Key – May 2017
2. Why should you care?
• Although a lot of computing has been commoditized and delivered via IT
and “the cloud”, it’s still useful to understand infrastructure:
• Building your own “bare-metal” servers for yourself or a customer
• Deploying your hardware into your customer’s data-center
• Deploying your software into your customer’s data-center
• Understanding considerations for non-traditional systems (e.g. vehicles, devices)
• Some components need to be built regardless of “public cloud” or private infrastructure
6. Gathering Requirements
• Operations
• What is the intended operation of the system? Prototype? Production? 24x7? Call-ins?
• Type of system? Query? Analytics? Batch? Streaming?
• Could the system grow?
• Dataflow
• How much data? Ingest volume (bytes/records)? Query load? Timeframe?
• Compute
• Type of computing: CPU heavy? RAM heavy? File I/O heavy? Network I/O heavy?
• If these answers are not clear – try to derive upper and lower bounds on the needs
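When the requirements are fuzzy, a back-of-envelope calculation over bounded guesses is often enough to size the hardware order. A minimal sketch, where all the ingest rates, record sizes, and replication factor are hypothetical illustration values, not figures from the slides:

```python
# Back-of-envelope sizing: derive storage bounds from ingest estimates.
# All input numbers are hypothetical examples.

def storage_needed(records_per_day, bytes_per_record, retention_days, replication=3):
    """Raw bytes needed to retain the ingest, including replication."""
    return records_per_day * bytes_per_record * retention_days * replication

TB = 1e12

# Lower bound: 10M records/day at 1 KB each, 90-day retention
low = storage_needed(10_000_000, 1_000, 90)
# Upper bound: 100M records/day at 5 KB each, 90-day retention
high = storage_needed(100_000_000, 5_000, 90)

print(f"storage needed: {low / TB:.1f} TB to {high / TB:.1f} TB")
```

Even a 10x spread between the bounds tells you something useful: whether you are shopping for one server or one rack.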
7. RAM goes a long way these days!
• All 4B IPv4 addresses fit in 18GB of RAM
• 256GB of RAM can hold 16B MD5 hashes
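These claims check out with simple arithmetic on the raw value sizes (the slide's 18GB figure presumably includes per-entry overhead on top of the raw 16GB):

```python
# Sanity-check the slide's RAM claims using raw value sizes (no overhead).
GB = 2**30

ipv4_count = 2**32            # every possible IPv4 address
ipv4_bytes = ipv4_count * 4   # each address is 4 bytes
print(f"{ipv4_bytes / GB:.0f} GB raw for all IPv4 addresses")  # 16 GB raw, ~18GB with overhead

md5_count = 16_000_000_000
md5_bytes = md5_count * 16    # an MD5 digest is 16 bytes
print(f"{md5_bytes / GB:.0f} GB raw for 16B MD5 hashes")       # ~238 GB, fits in 256GB
```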
8. Design Considerations
• Reliability
• How reliable does your system need to be? How many “nines” of availability do you need?
• Failure
• How much redundancy is required? How much do you have? What about the facilities plant?
• Scalability
• Horizontal vs. Vertical. How much load can the system process? Scalability also includes people processes
• Backup
• Do you have a backup plan?
• Application Deployments
• How to deploy, manage, troubleshoot applications?
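The "nines" question above has concrete consequences; translating an availability target into allowed downtime makes the trade-off tangible:

```python
# Translate "nines" of availability into allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (2, 3, 4, 5):
    availability = 1 - 10**-nines
    downtime_min = 10**-nines * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability:.5f}): {downtime_min:,.1f} min/year of downtime")
```

Three nines still allows nearly nine hours of outage a year; five nines allows barely five minutes, which is hard to achieve without serious redundancy.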
9. Helpful Philosophies
• Keep It Simple Stupid
• Very easy to create complex server infrastructure. Strive for simplicity
• Be Homogenous
• Heterogeneity complicates scaling, debugging, and logistics
• Expect Failure
• Components will fail. Disks fail all the time. More computers ➔ more frequent failures
• Automate Everything
• If you can reproduce your infrastructure quickly and easily, it is a good sign it is healthy
11. Financial Considerations
• Electricity availability/cost fundamentally dictates scale
• Appropriate accounting/purchasing allows hardware to be depreciated
• Can you “buy” your way out of scalability problems? (e.g. horizontal scaling)
• Capital Expenditure (CapEx) vs. Operational Expenditures (OpEx)
• CapEx = hardware, facilities
• OpEx = labor, support, maintenance
• Trade-off between CapEx and OpEx
• Clusters generally try to minimize OpEx via automation, homogeneity
• However, clusters don’t run and fix themselves – still need labor to support them
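A rough total-cost-of-ownership calculation makes the CapEx/OpEx trade-off concrete. All the prices below are made-up illustration values, not vendor quotes:

```python
# Hypothetical CapEx vs. OpEx sketch; every dollar figure is an assumption.

def tco(server_cost, server_count, annual_opex_per_server, years, depreciation_years=3):
    """Return (capex, total opex, straight-line annual depreciation)."""
    capex = server_cost * server_count
    opex = annual_opex_per_server * server_count * years
    annual_depreciation = capex / depreciation_years
    return capex, opex, annual_depreciation

capex, opex, dep = tco(server_cost=8_000, server_count=40,
                       annual_opex_per_server=1_500, years=3)
print(f"CapEx ${capex:,}  OpEx ${opex:,}  depreciation ${dep:,.0f}/yr")
```

Note how automation shows up in the model: it lowers `annual_opex_per_server`, which is multiplied by both server count and years, so small per-server savings compound.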
13. Rackspace & Power, Space, Cooling
• Datacenters have real physical constraints generally characterized as Power,
Space, and Cooling (PSC)
• Datacenters are laid out in “racks” (a.k.a. cabinets)
• Datacenters have different “Tiers” (1-4) for handling different failure levels
• Racks are standardized at 42 “rack units” (a.k.a. “U”) of height
• “Rack servers” commonly come in 1U and 2U heights; width is standardized
• 3U+ generally implies more specialized hardware
• Prefer rack-servers over “blade centers”
14. Rack Considerations
• 1U servers are considered “dense”
• Need to pay attention to cooling and cabling
• Can be hard to fit more elaborate components in 1U (GPUs or large hard-drives)
• 2U servers are good all-around chassis
• May lose some density per U
• A good reference for an “all-up” rack is 40 servers and 2 “top-of-rack”
(TOR) switches
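The "all-up" reference exactly fills a standard rack, and it also drives the power budget you must negotiate with the datacenter. The per-server wattage below is an assumed typical draw, not a figure from the slides:

```python
# Check the "all-up" rack reference against a 42U rack, and estimate
# the per-rack power draw (watts_per_server is an assumption).
RACK_UNITS = 42

servers, tor_switches = 40, 2
units_used = servers * 1 + tor_switches * 1   # 1U servers, 1U switches

watts_per_server = 400                        # assumed average draw
rack_power_kw = servers * watts_per_server / 1000

print(f"{units_used}U of {RACK_UNITS}U used, ~{rack_power_kw:.1f} kW per rack")
```

At ~16 kW per rack, power and cooling (not floor space) is usually the binding constraint, which is why PSC comes up before server specs.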
15. Example Rack (annotated photo, from: https://techbloc.net/archives/970)
• Power Distribution Units (PDU) – can run 2 for redundancy
• 1U servers
• 2 top-of-rack (TOR) switches – red and blue cables are “bonded” and provide redundancy and performance
• 1 “management” switch (yellow) for administration – a separate network for when the main network is down
• Fiber uplinks to the datacenter spine
• HID – badge-swipe access / alarm
16. Server Selection
• For purchasing, best to work through a value-added reseller (VAR)
• Can assist with questions, delivery, coordination
• Consider redundant power supplies depending on production level
• Consider redundant network ports depending on production level
• Hot-swappable hard-drives make life easy (almost standard now)
• Make sure NICs will “PXEBoot” (i.e. network boot)
• Consider RAID (Redundant Array of Inexpensive Disks) level. Popular options:
• RAID10 a good mix between redundancy and performance
• JBOD = Just a bunch of disks – let applications manage redundancy
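The RAID choice directly changes how much of the raw disk you can actually use. The formulas below are the standard idealized ones; real arrays reserve some space for metadata:

```python
# Idealized usable capacity for common disk layouts over N identical disks.

def usable_tb(n_disks, disk_tb, level):
    if level == "raid10":   # mirrored pairs, then striped: half the raw space
        return (n_disks // 2) * disk_tb
    if level == "raid5":    # one disk's worth of parity
        return (n_disks - 1) * disk_tb
    if level == "jbod":     # no redundancy at the disk layer
        return n_disks * disk_tb
    raise ValueError(f"unknown level: {level}")

for level in ("raid10", "raid5", "jbod"):
    print(f"{level}: {usable_tb(10, 2, level)} TB usable from 10 x 2TB disks")
```

RAID10 gives up half the raw capacity in exchange for fast rebuilds and good random I/O; JBOD keeps it all but pushes the redundancy problem up to the application (as HDFS-style systems expect).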
17. Typical commodity server – circa 2017 (annotated photo)
• 10 2TB hot-swappable hard-drives
• Dual hot-swappable power supplies
• Dual CPUs, 10 cores
• 256GB RAM
• Fan bank for cooling
• (Not viewable) 4 10G NIC ports, 2 1G NIC ports, 1 IPMI port
• Generally don’t need tools to replace parts!
18. Operating Systems
• Linux is the OS of choice when building clusters
• Lots of tools for managing and tuning Linux clusters at scale
• Licensed software complicates scaling a cluster
• These days many excellent open-source alternatives exist
• CentOS (derivative of Redhat) and Ubuntu are both popular options
• CentOS generally about stability and security
• Popular with enterprises, IT, and operations people
• Ubuntu generally newer and modern (closer to latest constituent software releases)
• Popular with innovators, researchers, etc.
19. Provisioning
• Provisioning = building or rebuilding a node
• Typical flow is for the node to “PXEBoot” into a kickstart (RedHat/CentOS) or preseed (Ubuntu) installer
• Node first boots via DHCP then downloads a kickstart/preseed file
• The kickstart/preseed file points to an installer and associated packages
• Foreman is a system that facilitates the “PXEBoot”, kickstart/preseed process
• Typical model is to put the bare minimum into kickstart/preseed and then let
a configuration management system take over
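To make the "bare minimum in kickstart" idea concrete, here is a minimal kickstart sketch. This is illustrative only, not a tested configuration; the mirror URL, password hash, and disk layout are placeholder assumptions:

```text
# Minimal CentOS kickstart sketch -- illustrative placeholders throughout
install
url --url=http://mirror.example.com/centos/7/os/x86_64/
lang en_US.UTF-8
keyboard us
timezone UTC
rootpw --iscrypted $6$examplesalt$examplehash
clearpart --all --initlabel
autopart
bootloader --location=mbr
reboot

%packages
@core
%end

%post
# Bare minimum: bootstrap the config-management agent, then let it take over
yum -y install puppet-agent
%end
```

Everything beyond partitioning and bootstrapping the agent is deliberately left to the configuration management system, so a rebuild is just "PXEBoot, wait, done".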
22. Network Considerations
• Typical server network configuration would have:
• 1Gb IPMI port
• Allows interaction with basic server functions (power-on/power-off, etc.)
• 1Gb management port
• For server administration tasks, separate from data network
• 2 10Gb data ports (in a bonded configuration)
• For passing data between nodes
• Keep in mind that disk I/O speed may be slower than the network
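The disk-vs-network point is easy to quantify. Assuming ~150 MB/s sequential throughput per spinning disk (a typical 7.2k SATA figure, not from the slides):

```python
# Compare bonded network bandwidth against aggregate spinning-disk throughput.
# Per-disk speed is an assumed typical value for 7.2k SATA drives.
network_gbps = 2 * 10                  # two bonded 10Gb ports
network_bytes_s = network_gbps / 8 * 1e9

disks, disk_bytes_s = 10, 150e6        # ~150 MB/s sequential per disk
disk_total_bytes_s = disks * disk_bytes_s

print(f"network: {network_bytes_s / 1e9:.1f} GB/s, "
      f"disks: {disk_total_bytes_s / 1e9:.1f} GB/s")
```

With these assumptions the bonded network (2.5 GB/s) outruns ten spinning disks combined (1.5 GB/s), so for sequential workloads disk I/O, not the network, is the bottleneck.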
23. Network Fabric
• Typical rack configuration has two “top-of-rack” (TOR) switches that connect all
the internal rack servers together, plus an “uplink” to the datacenter spine of “core”
switches so it can talk to the other racks
• Use two switches per rack for redundancy and throughput
• Spine typically runs at 25Gb, 40Gb, or 100Gb
Leaf-spine fabric diagram (TOR switches per rack, uplinked to spine switches), from: http://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-7000-series-switches/white-paper-c11-737022.docx/_jcr_content/renditions/white-paper-c11-737022_3.jpg
25. Configuration Management
• Configuration Management = tools that guarantee server configurations
• Popular cluster configuration management tools:
• Puppet: Most common, well-known, agent architecture
• Chef: An early alternative to Puppet, also agent architecture
• Ansible: Popular, agent-less architecture, integrates with networking gear well
• SaltStack: More recent alternative, both agent and agent-less architecture
• All tools have pros & cons - doesn’t matter so much what you use, just use one!
• Possible to entirely define your infrastructure within the tools
• Version control (git) the tool configurations and you get “infrastructure as code”
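As a flavor of what "infrastructure as code" looks like, here is a minimal Ansible playbook sketch. The host group and package/service names are illustrative assumptions:

```yaml
# Minimal Ansible playbook sketch -- group and package names are assumptions.
- hosts: cluster_nodes
  become: yes
  tasks:
    - name: Ensure ntp is installed
      package:
        name: ntp
        state: present
    - name: Ensure ntpd is running and enabled at boot
      service:
        name: ntpd
        state: started
        enabled: yes
```

Commit files like this to git and a node's entire configuration becomes reviewable, diffable, and reproducible; Puppet, Chef, and SaltStack express the same idea in their own syntaxes.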
27. Monitoring
• Bad things will happen in your cluster!
• The larger and more complex the cluster, the more fantastic the ways it can fail
• Absolutely need monitoring of your cluster
• Nagios is the most common tool for monitoring
• Many alternatives: ElastAlert, Zenoss, Prometheus, xymon, Zabbix, Sensu, …
• Tools send an alert (email, syslog, IM) when bad things happen
• Tools usually come with some defaults – and have pluggable architectures
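Monitoring plugins in the Nagios tradition follow a simple convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A sketch of a disk-usage check, with assumed threshold values:

```python
# Nagios-style check sketch: one status line, exit code encodes severity.
# warn_pct/crit_pct thresholds are assumptions.
import shutil

def check_disk(path="/", warn_pct=80, crit_pct=90):
    """Return (exit_code, status_line) following the Nagios plugin convention."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total
    if used_pct >= crit_pct:
        return 2, f"CRITICAL - {used_pct:.0f}% used on {path}"
    if used_pct >= warn_pct:
        return 1, f"WARNING - {used_pct:.0f}% used on {path}"
    return 0, f"OK - {used_pct:.0f}% used on {path}"

code, message = check_disk()
print(message)   # a real plugin would also sys.exit(code)
```

The pluggable architectures the slide mentions are mostly this: the scheduler runs small programs like the above and reacts to their exit codes.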
29. Health & Status (Metrics)
• “Measure Anything. Measure Everything” – Etsy
• Instrument everything you can:
• Useful for performance tuning
• When bad things happen, these will be handy for identifying the root-cause
• Applications, operating systems, processes, disks, network, memory, etc.
• Ganglia is the most common tool for metrics
• Many excellent alternatives: Grafana, Prometheus, Logstash, Graphite, statsd, collectd, OpenTSDB, Timely
• Pick one and use it!
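Instrumenting an application is cheap with statsd-style tools: metrics are plain-text UDP datagrams of the form `name:value|type` (`c` = counter, `g` = gauge, `ms` = timer). A minimal sketch, assuming a statsd daemon on the default port 8125:

```python
# Sketch of the statsd line protocol; host/port assume a local daemon.
import socket

def statsd_line(name, value, metric_type):
    """Format one metric in the statsd line protocol."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(line, host="127.0.0.1", port=8125):
    # Fire-and-forget UDP: sending is harmless even if no daemon is listening.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line, (host, port))
    sock.close()

line = statsd_line("ingest.records", 1, "c")
print(line)
send_metric(line)
```

Because it is UDP and fire-and-forget, instrumentation adds negligible overhead and cannot take the application down if the metrics system is unavailable.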
31. Coordination Services
• Often useful to run a “coordination service” within a cluster
• Coordination service provides distributed reliable services for applications:
• Configuration (identifying masters)
• Naming (finding other services)
• Synchronization (tracking state)
• Typically present a “key-value” interface to clients
• Popular implementations: Zookeeper, etcd, Consul, etc.
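The "identifying masters" use case reduces to an atomic operation on that key-value interface. A sketch of compare-and-set leader election, using an in-memory dict to stand in for the real coordination service (the actual Zookeeper/etcd/Consul APIs differ, and real systems also attach leases or ephemeral ownership):

```python
# Leader election sketch over a key-value interface with compare-and-set.
# An in-memory dict stands in for the coordination service.
class KVStore:
    def __init__(self):
        self._data = {}

    def compare_and_set(self, key, expected, value):
        """Atomically set key to value only if it currently equals expected."""
        if self._data.get(key) == expected:
            self._data[key] = value
            return True
        return False

def try_become_leader(store, node_id):
    # Whoever first sets the key (from absent, i.e. None) wins the election.
    return store.compare_and_set("leader", None, node_id)

store = KVStore()
print(try_become_leader(store, "node-01"))  # True: election won
print(try_become_leader(store, "node-02"))  # False: node-01 already leads
```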
32. Other Useful Tools
• Public-Key Infrastructure (PKI, ssh) should be used for authentication
• Avoid passwords – only root should have a password, if at all
• Use LDAP to manage user accounts
• Have some “database” that records which servers provide which functions
• genders is a simple, popular way to do this
• pdsh is a parallel shell command useful for running commands across a cluster
• dshbak cleans up the output
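Illustrative pdsh/dshbak usage (the node names and genders group below are hypothetical, and `-g` assumes pdsh was built with the genders module):

```shell
# Run a command on an explicit node range, coalescing identical output
pdsh -w node[01-40] uptime | dshbak -c

# Or target a group defined in the genders database (hypothetical group name)
pdsh -g webservers 'df -h /data' | dshbak -c
```

`dshbak -c` folds nodes that produced identical output into one block, which is exactly what you want when checking whether 40 machines agree.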
34. Software Deployment Considerations
• Always a good idea to partition your cluster into:
• Production servers - stuff you really care about…doesn’t need to be 24x7 to be production
• Integration servers - last stop before being added to production
• Test servers - general developer playground similar to production systems
• Package, version, and install your software like a product
• Helps for automation and traceability
• Scripting languages (python, perl, etc.) can be risky to deploy because they can easily be
changed once installed