Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best way to meet service level agreements and avoid prolonged troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early. Fortunately, deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how we developed, tested, and deployed a comprehensive health process based on real-life events and experiences. The table-driven health check runs a full scan in ~2 seconds and includes: a checklist, ‘positive’ error pattern matching, enabling and disabling node blacklisting, logging, validating file systems, processing very large log files, trapping in-rack network faults (adds 5 seconds to accurately detect packet loss), and recommissioning nodes into production.
A Cluster Is Only As Strong As its Weakest Link
1. A cluster is only as strong
as its weakest link.
@DanRomike
Hadoop Tooling Engineer / Configuration
Manager
@Twitter
1#HadoopSummit
2. Introduction
• Hadoop health at Twitter:
– Scope of our operation
– What are some of our weak links?
– What is in our checkup?
– Where does our health check run?
– Which faults are meaningful to us?
– What is our future health strategy?
– Summary of our achievements
3. Cluster Health Pyramid
Us
Tools and
Jenkins
A Cluster
Management Shell
Health Scans
Management of 1000s/Nodes,
10s/Clusters
5. The Health Pyramid
Us
Tools and
Jenkins
A Cluster
Management Shell
Health Scans
Management of 1000s/Nodes,
10s/Clusters
6. Clusters
Data
Warehouse
/ HBase
Large number of
computing jobs:
10s of thousands/day
High storage
consumption
Tripled in Size
Processing
Large number of
computing jobs:
10s of thousands/day
Doubled in Size
Backups
HDFS Storage
Doubled in Size
Test
Test releases
Evaluate jobs
7. Site Operations
Central Site
Operations
Team
• Ticket based
• Short repair times
• Infrastructure
Generally, what
breaks?
• PSU, LOM, BIOS, Wiring
• Network Bonding
• Disks, Controllers
• TOR Switches
• Rack Power
13. The Health Pyramid
Us
Tools and
Jenkins
A Cluster
Management Shell
Health Scans
Management of
1000s/Nodes, 10s/Clusters
14. Health Check Mission
Create and deploy a
comprehensive
health check that
reports failing
nodes, reduces
impact to
performance, and
uses common
standard tools.
Fast: logs may grow quickly,
avoid timeouts
Adjustable: setting the right
thresholds
Reliable: must not cause issues
or ‘brownouts’
Reusable: new tools will use
status and results
15. Health Goals
Reduce on-call incidents
Reduce
troubleshooting
Prevent cascading
failures
Verify after
maintenance
Facilitate change
and growth
20. Faults to Detect
• Network
– Speed decrease
– Partial rack power outages, loss of services
– Rack switch packet loss
– Errors/drops/retries bursts
• Reported memory vs. installed memory
• Induced fault: for node maintenance
21. More Faults
• Storage
– Full
– Incorrect disk installed
– Correct inodes per file system
– File system type: ext4
– HW disk controller issues
• Kernel is too old
• High CPU spikes with high loads
• Datanode failure
22. Log Checking
• Which logs to check
– System logs
– Datanode logs
– Tasktracker logs
• How to check
– Relevant records
– Bottom up scan
– Positive Pattern Matching
– Use of fault counters and scan thresholds
24. The Health Pyramid
Us
Tools and
Jenkins
A Cluster
Management Shell
Health Scans
Management of
1000s/Nodes, 10s/Clusters
25. Management Shell
• Health Shell (CLI) maintains a working list
– Refines the list as node state changes
– Interactive BASH Shell is the CLI
– Concurrent execution functions
– Interfaces to all Hadoop admin functions
– Familiar interface
26. Today’s Health Pyramid
Us
Tools and
Jenkins
A Cluster
Management Shell
Health Scans
Management of
1000s/Nodes, 10s/Clusters
Dan Romike, Hadoop Tooling Engineer / Configuration Manager, Twitter, Inc.
Dan Romike started with Hadoop in the summer of 2008 at Yahoo!, Inc., in their Hadoop data warehouse and site operations teams and received a ‘You Rock’ award for a very large data management project. He has since worked with Hadoop operations at eBay, Inc. and now at Twitter as a Hadoop Reliability Engineer. He recently gave a presentation at the 2011 Summit discussing Hadoop automation and has an extensive background building and managing Unix-based production environments.
Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best cure to meeting service level agreements and preventing or avoiding elongated troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early.
Deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how Twitter's Hadoop Reliability team developed, tested, and deployed a broad spectrum cluster health check that detects problems quickly and early.
Clusters run at full efficiency when all LIVE nodes are working at their peak. During node failures, partial or full, the cluster may behave in unexpected ways, and the failing node becomes a weak link. Finding a small problem on thousands of nodes is time consuming. What we deployed is an internal check that is able to change the cluster’s behavior and blacklist failing nodes, thus preventing new tasks from starting on a node in a failed condition.
We start with a high-level review of the Hadoop environment at Twitter.
We are a very small operational team, and we need the ability to manage a large Hadoop environment from installation to production while avoiding time lost to troubleshooting issues that affect the cluster.
Our team effort is to build the missing layers in the health and management pyramid, which will provide meaningful and simple interfaces for the Hadoop admins.
Each cluster has a primary use. We run close to maximum for storage and processing on most clusters, so it is important to test and evaluate all releases and production changes to prevent failures on the large clusters.
These clusters are thousands of nodes and 10s of petabytes in multiple datacenters, with a large number of jobs per day.
The Site Operations team manages our infrastructure and corrects node failures (after the nodes are withdrawn from the clusters). We ticket each failure, one per ticket, and they quickly and accurately correct the issues and return the node ready for commissioning. The support we receive is invaluable: without it we would not have been able to grow as quickly as we have in the last year.
Some of the issues that are discovered and resolved by Site Operations are discussed.
Our nodes belong to roles managed by an internal Configuration Manager. All nodes must belong to a role, each node has inherited attributes, and we may effect a role-wide operation by executing commands through the manager.
To ensure that our code and configurations are accurate, we have a rigorous process that includes: peer reviews, review boards, staging, validations, canary, restaging, production. The code and configurations are checked in and distributed to the nodes via Puppet, without exception.
Reliability covers many aspects of cluster management and is part of the daily maintenance, outages, preventative care, and health evaluation that every cluster, irrespective of size, requires.
Our focus is the HEALTH aspect of Hadoop and being able to manage failures without intervention. We do so with a complex health process that has simple roots: it isolates node issues, and reported failures are rolled up into the monitoring system, which is an independent function.
Hadoop is highly dependent on a healthy cluster, be it 10 or 1000 nodes. A cluster may exhibit failed behaviors from minor issues on a single node, so discovering the issue and immediately blacklisting the node is important.
Listed here are most of the weak links that will cause data and job issues.
This section covers what we wanted to achieve to obtain full cluster health. We realized early that the health process plays an important role in node health as well as in validation, ensuring that returning nodes enter the cluster fully functional. The same script is able to perform multiple tasks with no code changes.
What are some of the best methods for building and deploying a check? There is a limited amount of time, measured in seconds, to run checks and to scan for other issues, and a full log body scan was neither reasonable nor necessarily accurate. Here are some of the aspects we sought.
After the script was deployed, we needed to verify these goals. Though they are difficult to track, we used our work load and the number of people required to manage our clusters as a primary indicator. We are pleased with the results.
Each Hadoop cluster has three primary columns of health; we created two and one is provided:
• The health check finds issues collecting in logs and process states, based on thresholds and time.
• Our monitoring system notifies us of issues over time using aggregation.
• Finally, Hadoop manages heartbeats for both datanodes and tasks; these provide critical information on the node’s status. Should the heartbeat be delayed too long, the cluster will automatically take corrective actions.
The administrator takes manual actions to exclude or include nodes in the cluster; however, in some cases, nodes have to be excluded to kill an issue.
To install a health check, update these properties, as described.
The actual process of the health check is to return a result message to the job manager. An ‘ERROR’ indicates the node is to be taken out of circulation, but running attempts are allowed to finish. Any other term may be used to indicate to automation that the node PASSed or that other issues exist and actions are required.
Because tasks finish, instead of being terminated, the blacklist gives us time to evaluate the problem and take corrective actions such as fail-tasking.
Stories
• Full file systems: errant jobs filled node storage; health caused a brownout, shown in the blacklisted nodes
• Errant jars cause Full GCs in TT: updated health to count Full GCs over 999 records to restart TT
• Racks lost packets: added rack packet loss detection in same rack to blacklist the rack and wrote a crawler for inter-rack
• Predictive disk failures in the controller: detected and blacklisted
• Kickstart installed root on the wrong disk: detected
• High load averages slow down jobs: blacklist immediately
• Memory shortfall: detect and blacklist the node
sbin    fsused       fsused               $E  $FAULTS_DF  root    $ERRS_DF
sbin    mkfswrong    fsused               $W  $FAULTS_DF  root    $WARN_DF
sbin    diskwrong    fsused               $W  $FAULTS_DW  root    $WARN_DW
file    mounts       $proc/mounts         $E  1           root    ^\/dev\/
file    fstab        $etc/fstab           $E  1           root    ^LABEL=
file    loadavg      $proc/loadavg        $W  70.0        root    [0-9.]+
proc    datanode     $dnpid               $E  1           hadoop  $PROC_DN
proc    tasktracker  $ttpid               $W  1           hadoop  $PROC_TT
proc    regionserver $rspid               $W  1           hadoop  $PROC_RS
proc    monit        $log/monit.log       $W  1           root    $ubin/monit
proc    syslogd      $run/syslog-ng.pid   $W  1           root    syslog-ng
proc    scribed      $run/scribe.pid      $W  1           root    $usbin/scribed
log     syslog-dev   $log/syslog          $E  $FAULTS_DV  root    $ERRS_DV
log     syslog-hw    $log/syslog          $E  $FAULTS_HW  root    $ERRS_HW
log     mcelog-hw    $log/mcelog          $E  $FAULTS_MC  root    $ERRS_MC
log     ttlog        $ttlog               $W  $FAULTS_TT  hadoop  $ERRS_TT
log     dnlog        $dnlog               $W  $FAULTS_DN  hadoop  $ERRS_DN
log     rslog        $rslog               $W  $FAULTS_RS  hadoop  $ERRS_RS
log     scribe       $sclog               $W  $FAULTS_SC  hadoop  $ERRS_SC
log     shortmem     $proc/meminfo        $W  $FAULTS_SM  root    $ERRS_SM
log     bonding      $bonding             $F  $FAULTS_EB  root    $ERRS_EB
toggle  blacklisted  $bllog               $E  $FAULTS_BL  hadoop  $ERRS_BL
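As a rough illustration of the table-driven idea, here is a minimal Python sketch. It is not the Bash script described in this talk. The column meanings (check type, name, target, severity, threshold, user, pattern), the pass/fail rule per check type, and the `WARN` result word are all assumptions made for the example; only the `ERROR` and PASS conventions come from the notes above. The sample table uses literal values in place of the shell variables.

```python
import re
from dataclasses import dataclass


@dataclass
class Check:
    kind: str         # assumed meaning: check type ("file" or "log" in this sketch)
    name: str         # label reported when the check fails
    target: str       # file or log to inspect
    severity: str     # "E" (error) or "W" (warning)
    threshold: float  # count the rule is compared against
    user: str         # account the check would run as (unused here)
    pattern: str      # regular expression applied to each line


def parse_table(text):
    """Turn whitespace-separated table rows into Check records."""
    checks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, name, target, sev, thr, user, pattern = line.split(None, 6)
        checks.append(Check(kind, name, target, sev, float(thr), user, pattern))
    return checks


# Assumed rules: a "file" check fails when too FEW lines match (a required
# entry is missing); a "log" check fails when too MANY lines match.
FAILS = {
    "file": lambda hits, thr: hits < thr,
    "log": lambda hits, thr: hits >= thr,
}


def run_checks(checks, read_lines):
    """Evaluate every row and return a single result message."""
    failed = {"E": [], "W": []}
    for check in checks:
        rx = re.compile(check.pattern)
        hits = sum(1 for ln in read_lines(check.target) if rx.search(ln))
        if FAILS[check.kind](hits, check.threshold):
            failed[check.severity].append(check.name)
    if failed["E"]:
        return "ERROR " + " ".join(failed["E"])  # node should be blacklisted
    if failed["W"]:
        return "WARN " + " ".join(failed["W"])
    return "PASS"


# Hypothetical table and canned file contents, for demonstration only.
TABLE = r"""
file mounts    /proc/mounts    E 1 root ^/dev/
log  syslog-hw /var/log/syslog E 2 root I/O error|Medium Error
"""
FAKE_FILES = {
    "/proc/mounts": ["/dev/sda1 / ext4 rw 0 0"],
    "/var/log/syslog": ["kernel: I/O error, dev sdb", "kernel: sdb Medium Error"],
}

result = run_checks(parse_table(TABLE), lambda path: FAKE_FILES.get(path, []))
print(result)  # ERROR syslog-hw
```

Adding a new check in this style means adding a table row, not code, which is consistent with the note that the same script performs multiple tasks with no code changes.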
Detecting faults is based on real-life experiences and is usually taught by errors and failures. This section describes some of the faults we scan for, which provide the basis of a health check.
We also induce faults into the health system to perform maintenance operations. It is easier to do maintenance by blacklisting a node than by excluding it.
Managing a node’s expected performance is a major concern and ‘weakness’ in working with large clusters. A single node issue may cause cascading problems that extend job run times.
Some of the possible issues are network losses, speed reductions, and issues caused by manual interventions as part of general maintenance. The health process needs to trap and blacklist nodes that are not meeting specifications.
The Hadoop storage system may be difficult to maintain as the cluster grows. With 10s of petabytes spinning on 1000s of nodes, storage issues have caused major problems in the past.
However, with improvements in disks and controllers, storage has been far less of an issue. We are now focused on performance gains and storage efficiency:
• Running the latest file systems
• Reducing inodes to recover 3% in storage
• Improve build time by reduced inodes
• Improve fsck time by reduced inodes
• Hadoop may only use 1-2% of inodes
• FSCK time dramatically improves
• Kickstart improvements
• Old kernels have security issues
• Datanode goes down, TT needs to blacklist
• Task tracker failures on disk full
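The storage recovered by reducing inodes can be estimated with simple arithmetic: on ext-family file systems the inode tables take roughly the inode size divided by the bytes-per-inode ratio. The sketch below works one example. The inode size, both ratios, and the disk size are illustrative assumptions; the notes do not say which mkfs settings were actually used.

```python
def inode_table_fraction(inode_size_bytes, bytes_per_inode):
    """Approximate share of a file system consumed by its inode tables."""
    return inode_size_bytes / bytes_per_inode


def recovered_bytes(fs_bytes, inode_size_bytes, old_ratio, new_ratio):
    """Space regained by moving from old_ratio to new_ratio bytes per inode."""
    old = inode_table_fraction(inode_size_bytes, old_ratio)
    new = inode_table_fraction(inode_size_bytes, new_ratio)
    return int(fs_bytes * (old - new))


# Assumed values: 256-byte inodes, one inode per 8 KiB before the change,
# one inode per 1 MiB after it, on a 2 TB data disk.
before = inode_table_fraction(256, 8192)
after = inode_table_fraction(256, 1024 * 1024)
print(f"before: {before:.2%}, after: {after:.2%}")
print(recovered_bytes(2 * 10**12, 256, 8192, 1024 * 1024), "bytes recovered")
```

Under these assumed settings the inode tables shrink from about 3% of the disk to a small fraction of a percent, which is the order of saving the list above mentions. Fewer inodes also mean less work for mkfs and fsck.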
Bottom-up log scans are an effective method of limiting the amount of data to process and of locating only recent issues. Some logs are large and may be stale, so keeping the information fresh and current prevents brownouts and reduces blacklisting.
We also use ‘positive exception’ matching logic via egrep, which we adopted after receiving many ‘false positives’. We match the majority of the pattern directly and then use a positive column match such as [123]; on the ‘not’ side, we negated a column match, as in [^123]. We want to match what we are looking for, not what we are not looking for.
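A bottom-up scan with a fault counter and a scan threshold can be sketched as follows. This is a Python illustration under stated assumptions, not the egrep-based Bash implementation: the function names, the log line format, and the `fault class [123]` pattern are hypothetical. The 999-record scan depth echoes the Full GC story in the notes.

```python
import os
import re
import tempfile


def tail_lines(path, max_lines, block_size=4096):
    """Return up to max_lines final lines of a file, newest first.

    Reads backwards in blocks so that a very large log is never read in full.
    Assumes max_lines >= 1.
    """
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        data = b""
        # One extra newline guarantees the oldest kept line is complete.
        while pos > 0 and data.count(b"\n") <= max_lines:
            step = min(block_size, pos)
            pos -= step
            fh.seek(pos)
            data = fh.read(step) + data
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines[-max_lines:][::-1]


def scan(path, pattern, fault_threshold, max_records=999):
    """True when at least fault_threshold of the newest records match."""
    rx = re.compile(pattern)
    faults = 0
    for line in tail_lines(path, max_records):
        if rx.search(line):
            faults += 1
            if faults >= fault_threshold:
                return True  # stop early, the threshold is reached
    return False


# Positive match: name the fault classes of interest (1, 2 or 3) directly.
POSITIVE = r"fault class [123] "

# Build a throwaway log: many old healthy records, a few recent faults.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    for i in range(5000):
        tmp.write(f"record {i} status ok\n")
    tmp.write("ctl0 fault class 2 on slot 4\n")
    tmp.write("ctl0 fault class 7 on slot 4\n")  # class 7 is not of interest
    tmp.write("ctl0 fault class 1 on slot 5\n")
    log_path = tmp.name

print(scan(log_path, POSITIVE, fault_threshold=2))  # True
print(scan(log_path, POSITIVE, fault_threshold=3))  # False
os.unlink(log_path)
```

Because the scan starts at the newest record and stops as soon as the threshold is reached, its cost depends on the scan depth, not on the size of the log.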
We have two layers remaining in our health strategy, or pyramid.
Underway is a management shell that will ease the process of managing lists and faults and reduce recovery time.
We are currently wrapping up our management CLI, which assists the Hadoop administrators in starting and stopping clusters, managing lists, and performing break/fix actions, to name a few.
Our goal is to reduce the time to manage and recover a cluster:
• Improve recovery time from a crash
• Reduce node time to repair
• Reduce recovery from brownouts
• Improve the ability to manage nodes based on state without a SQL database
The management shell is a BASH CLI that eases the administrative functions for large clusters.
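The working-list idea (run an action on many nodes concurrently, then refine the list by the state each node reports) can be sketched as below. This is a hypothetical Python illustration, not the BASH CLI itself: `run_on_nodes`, `refine`, and the stand-in probe `fake_health` are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(nodes, action, max_workers=16):
    """Run action(node) on every node concurrently; return {node: result}."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(action, nodes))  # map keeps input order
    return dict(zip(nodes, results))


def refine(working_list, states, wanted):
    """Keep only the nodes whose reported state equals wanted."""
    return [node for node in working_list if states.get(node) == wanted]


def fake_health(node):
    # Stand-in for a remote health probe; a real one would contact the node.
    return "ERROR" if node.endswith("7") else "PASS"


working = [f"node{i:02d}" for i in range(1, 11)]
states = run_on_nodes(working, fake_health)
working = refine(working, states, "ERROR")
print(working)  # ['node07']
```

Keeping the list in memory and narrowing it after each action is one way to manage nodes by state without a database behind the shell.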
The last part of the pyramid is the future integration of automation and tools, where the health process plays an essential role.
Our Hadoop clusters have become more efficient because fewer tasks fail due to node hardware faults. Long tails rarely occur due to node issues. On-call incidents have also declined because we have fewer stuck jobs to troubleshoot. We are looking forward to hearing your comments.