A Cluster Is Only As Strong As its Weakest Link

A cluster is only as strong as its weakest link.
@DanRomike
Hadoop Tooling Engineer / Configuration Manager
@Twitter
#HadoopSummit

Introduction
• Hadoop health at Twitter:
– Scope of our operation
– What are some of our weak links?
– What is in our checkup?
– Where does our health check run?
– Which faults are meaningful to us?
– What is our future health strategy?
– Summary of our achievements

Cluster Health Pyramid
• Us
• Tools and Jenkins
• A Cluster Management Shell
• Health Scans
• Management of 1000s of nodes, 10s of clusters

MANAGING HADOOP
What we support

The Health Pyramid
• Us
• Tools and Jenkins
• A Cluster Management Shell
• Health Scans
• Management of 1000s of nodes, 10s of clusters

Clusters
• Data Warehouse / HBase
– Large number of computing jobs: 10s of thousands per day
– High storage consumption
– Tripled in size
• Processing
– Large number of computing jobs: 10s of thousands per day
– Doubled in size
• Backups
– HDFS storage
– Doubled in size
• Test
– Test releases
– Evaluate jobs

Site Operations
Central Site Operations Team
• Ticket based
• Short repair times
• Infrastructure
Generally, what breaks?
• PSU, LOM, BIOS, Wiring
• Network Bonding
• Disks, Controllers
• TOR Switches
• Rack Power

Our Configuration Manager
Role
Run
Attribute

Automation
• Refined processes
• Source control repository
• Config management: Puppet

Cluster Reliability Team
• Manage
– Build, grow, and migrate
– On-boarding
– Migrate: distcp harness
• Configuration
– Optimized properties
– heartbeats.in.seconds: set to cluster size
• Reliability
– Data integrity
– Failures, under-rep, 3-reps
– fsck, -report, metasave
– Violated, MISSING
• Balance
– Balancer
– rack-topology.sh
• Nodes
– LIVE, DEAD, B-LIST
– Break/fix
– Recommission
• HEALTH
– Scan
– Isolate issues
– Report failures

Weak Links
Node Issues
• Performance loss, slow
• Storage failures
• High CPU usage
• Memory failures
• Onboard network failures
• Power On/Off
Infrastructure Issues
• Changes, adds and moves
• Site power maintenance
• Rack issues
• Unscheduled changes
• Cooling
• Network infrastructure

CLUSTER HEALTH
Health checks for Hadoop production environments

The Health Pyramid
• Us
• Tools and Jenkins
• A Cluster Management Shell
• Health Scans
• Management of 1000s of nodes, 10s of clusters

Health Check Mission
Create and deploy a comprehensive health check that reports failing nodes, reduces impact to performance, and uses common standard tools.
• Fast: logs may grow quickly, avoid timeouts
• Adjustable: setting the right thresholds
• Reliable: must not cause issues or ‘brownouts’
• Reusable: new tools will use status and results

Health Goals
• Reduce on-call incidents
• Reduce troubleshooting
• Prevent cascading failures
• Verify after maintenance
• Facilitate change and growth

Early Detection
• Health: 1-3 mins
• Thresholds: preset level
• Blacklist: ERROR, exclude
• Notify: alert
• Monitor: threshold alert
• Alerts: email, page, on-call
• Heartbeats: it’s alive
• Delays: performance
• Datanodes: 0-3 secs
• Tasks: 0-5 secs

mapred-site.xml
<property>
  <name>mapred.healthChecker.script.path</name>
  <value>/etc/hadoop/conf/healthcheck2</value>
</property>
<property>
  <name>mapred.healthChecker.interval</name>
  <value>180000</value>
</property>
<property>
  <name>mapred.healthChecker.script.timeout</name>
  <value>45000</value>
</property>
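As a rough illustration of how these three settings relate, the sketch below parses them and checks that the script timeout (45000 ms) fits inside the check interval (180000 ms). The `<configuration>` wrapper, the helper names `read_properties` and `check_health_settings`, and the rule that the timeout should be shorter than the interval are assumptions made for this example, not something stated on the slide.

```python
import xml.etree.ElementTree as ET

# Copy of the slide's properties, wrapped in a <configuration> root
# (an assumption) so the fragment parses on its own.
SAMPLE = """<configuration>
  <property>
    <name>mapred.healthChecker.script.path</name>
    <value>/etc/hadoop/conf/healthcheck2</value>
  </property>
  <property>
    <name>mapred.healthChecker.interval</name>
    <value>180000</value>
  </property>
  <property>
    <name>mapred.healthChecker.script.timeout</name>
    <value>45000</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop-style XML config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def check_health_settings(props):
    """Hypothetical sanity check: list any problems found in the settings."""
    problems = []
    interval_ms = int(props["mapred.healthChecker.interval"])
    timeout_ms = int(props["mapred.healthChecker.script.timeout"])
    if timeout_ms >= interval_ms:
        problems.append("script timeout should be shorter than the check interval")
    if not props.get("mapred.healthChecker.script.path"):
        problems.append("no health script path configured")
    return problems


props = read_properties(SAMPLE)
problems = check_health_settings(props)
print(problems)
```

With the values from the slide, the script runs every three minutes and is given 45 seconds to finish, so the list of problems is empty.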

Healthy to Blacklisted
• Health: Configure, Execute, Evaluate
• States: PASS, WARN, ERROR, FAIL
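One way to picture the evaluate step is a small routine that runs each check and reports the worst state. Hadoop's node health checker treats any output line beginning with ERROR as an unhealthy node, which is what leads to blacklisting. Everything else in this sketch is an assumption: the check names, the severity ordering, and the choice to map a check that cannot run to FAIL without printing an ERROR line.

```python
# Assumed ordering of the four states named on the slide, least to most severe.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2, "ERROR": 3}


def evaluate(checks):
    """Run (name, check) pairs; return the worst state and the report lines.

    Each check returns (level, message). A check that raises is reported
    as FAIL (an assumption) so a broken check does not emit an ERROR line.
    """
    worst = "PASS"
    lines = []
    for name, check in checks:
        try:
            level, message = check()
        except Exception as exc:
            level, message = "FAIL", f"check could not run: {exc}"
        if level != "PASS":
            lines.append(f"{level} {name}: {message}")
        if SEVERITY[level] > SEVERITY[worst]:
            worst = level
    return worst, lines


# Hypothetical checks standing in for real probes.
def disk_ok():
    return "PASS", ""


def net_slow():
    return "WARN", "link speed below expected"


def mem_bad():
    return "ERROR", "reported memory below installed memory"


state, lines = evaluate(
    [("disk", disk_ok), ("network", net_slow), ("memory", mem_bad)]
)
for line in lines:
    print(line)
```

Only the memory line starts with ERROR, so in this example that single check would be the one that marks the node unhealthy.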

FAULTS
What to scan for

Faults to Detect
• Network
– Speed decrease
– Partial rack power outages, loss of services
– Rack switch packet loss
– Errors/drops/retries bursts
• Reported memory vs. installed memory
• Induced fault: for node maintenance
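The reported versus installed memory comparison could be sketched as below, using the `MemTotal` line of Linux `/proc/meminfo`. The installed size, the 5% tolerance, the sample text, and the function names are assumptions for illustration only.

```python
# Illustrative /proc/meminfo excerpt (values are made up for the example).
SAMPLE_MEMINFO = """MemTotal:       65842108 kB
MemFree:         1234567 kB
"""


def mem_total_kb(meminfo_text):
    """Extract the MemTotal value, in kB, from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memory_check(meminfo_text, installed_kb, tolerance=0.05):
    """Return ERROR when reported memory falls well short of installed memory.

    The kernel always reports somewhat less than the installed amount,
    so a tolerance (assumed here to be 5%) avoids false alarms.
    """
    reported_kb = mem_total_kb(meminfo_text)
    shortfall_kb = installed_kb - reported_kb
    if shortfall_kb > installed_kb * tolerance:
        return "ERROR", f"reported {reported_kb} kB, expected about {installed_kb} kB"
    return "PASS", ""


INSTALLED_KB = 64 * 1024 * 1024  # assumed 64 GiB node
print(memory_check(SAMPLE_MEMINFO, INSTALLED_KB))
print(memory_check("MemTotal:       49283072 kB\n", INSTALLED_KB))
```

On a real node the text would come from reading `/proc/meminfo`, and the expected size would come from the hardware inventory.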

More Faults
• Storage
– Full
– Incorrect disk installed
– Correct inodes per file system
– File system type: ext4
– HW disk controller issues
• Kernel is too old
• High CPU spikes with high loads
• Datanode failure
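Two of the faults above, a full file system and a kernel that is too old, lend themselves to short checks. This is a minimal sketch: the 90% limit, the minimum kernel version, and the function names are assumed values for the example, not thresholds taken from the talk.

```python
import re
import shutil


def usage_percent(total, used):
    """Percentage of a file system that is in use."""
    return 100.0 * used / total


def disk_check(path, max_percent=90.0):
    """Return ERROR when the file system holding `path` is too full."""
    usage = shutil.disk_usage(path)
    percent = usage_percent(usage.total, usage.used)
    if percent >= max_percent:
        return "ERROR", f"{path} is {percent:.0f}% full"
    return "PASS", ""


def kernel_tuple(release):
    """Turn a release string such as '3.10.0-957.el7.x86_64' into (3, 10, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release}")
    return tuple(int(part) if part else 0 for part in match.groups())


def kernel_check(release, minimum=(2, 6, 32)):
    """Return ERROR when the kernel release is older than `minimum`."""
    if kernel_tuple(release) < minimum:
        return "ERROR", f"kernel {release} is older than required"
    return "PASS", ""


print(disk_check("."))
print(kernel_check("2.6.18-1"))
print(kernel_check("3.10.0-957.el7.x86_64"))
```

In practice the release string would come from `platform.release()` and the paths from the datanode's configured data directories.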

Log Checking
• Which logs to check
– System logs
– Datanode logs
– Tasktracker logs
• How to check
– Relevant records
– Bottom up scan
– Positive Pattern Matching
– Use of fault counters and scan thresholds
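The log-checking points above can be read as: start from the newest records, match only known fault patterns, count the hits, and stop as soon as a counter reaches its threshold. The sketch below follows that reading. The patterns, thresholds, sample lines, and the `scan_log` name are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumed fault patterns and per-fault thresholds.
PATTERNS = {
    "io_error": re.compile(r"I/O error"),
    "oom": re.compile(r"Out of memory"),
}
THRESHOLDS = {"io_error": 3, "oom": 1}


def scan_log(lines, max_records=1000):
    """Scan from the newest record upward.

    Returns (fault, counts): `fault` is the first fault whose counter
    reached its threshold, or None if no threshold was reached within
    the newest `max_records` records.
    """
    counts = Counter()
    for examined, line in enumerate(reversed(lines)):
        if examined >= max_records:
            break
        for fault, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[fault] += 1
                if counts[fault] >= THRESHOLDS[fault]:
                    return fault, counts
    return None, counts


# Illustrative log records, oldest first.
SAMPLE_LOG = [
    "kernel: end_request: I/O error, dev sdb, sector 1024",
    "datanode: heartbeat sent",
    "kernel: end_request: I/O error, dev sdb, sector 2048",
    "datanode: heartbeat sent",
    "kernel: end_request: I/O error, dev sdb, sector 4096",
]

print(scan_log(SAMPLE_LOG))
print(scan_log(SAMPLE_LOG, max_records=2))
```

Limiting the scan to the newest records and returning early keeps the check short even when logs grow quickly, which helps it stay inside the script timeout.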

FUTURE STRATEGY
Reduce recovery time by building a management shell

The Health Pyramid
• Us
• Tools and Jenkins
• A Cluster Management Shell
• Health Scans
• Management of 1000s of nodes, 10s of clusters

Management Shell
• Health Shell (CLI) maintains a working list
– Refines the list as node state changes
– Interactive BASH Shell is the CLI
– Concurrent execution functions
– Interfaces to all Hadoop admin functions
– Familiar interface
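The working list and concurrent execution ideas might look roughly like the sketch below. The talk describes an interactive Bash shell. This Python version only illustrates the concept, and the `WorkingList` class, its methods, the node names, and the stand-in `fake_scan` action are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class WorkingList:
    """Hypothetical working list: tracks a state per node and refines selections."""

    def __init__(self, nodes):
        self.states = dict.fromkeys(nodes, "UNKNOWN")

    def update(self, node, state):
        """Record a new state for one node."""
        self.states[node] = state

    def select(self, state):
        """Return the sorted nodes currently in the given state."""
        return sorted(n for n, s in self.states.items() if s == state)

    def run(self, action, nodes, workers=8):
        """Apply `action` to every node concurrently; return node -> result."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(action, nodes))
        return dict(zip(nodes, results))


def fake_scan(node):
    """Stand-in for a real remote health scan (no network access here)."""
    return "ERROR" if node.endswith("3") else "PASS"


working = WorkingList(["node1", "node2", "node3"])
for node, state in working.run(fake_scan, list(working.states)).items():
    working.update(node, state)

print(working.select("ERROR"))
print(working.select("PASS"))
```

After the scan, the list narrows to the nodes that need attention, which is the refinement step the slide describes.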

Today’s Health Pyramid
• Us
• Tools and Jenkins
• A Cluster Management Shell
• Health Scans
• Management of 1000s of nodes, 10s of clusters

CONCLUSION
Change weak links into strong links

Achievements
• Failing nodes are blacklisted
• New cluster validations
• Fewer Job tails
• Less intervention
• Increased job throughput
• Improved health

#ThankYou
@DanRomike
