This deck summarizes the hardware, software configuration, and management of a large Hadoop cluster at Facebook. The cluster consists of 320 nodes arranged in 8 racks, with machines dedicated to the distributed file system, MapReduce jobs, and testing. Tools such as Hypershell and Cfengine handle administration, and common issues and performance optimization techniques are also discussed.
2. Managing a Large Hadoop Cluster
Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
3. Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
▪ CPU: dual-socket quad-core Intel Xeon (8 cores per box)
▪ Memory: 16 GB ECC DRAM
▪ Disk: 4 x 1 TB 7200 RPM SATA
▪ Network: 1 GbE
▪ Topology
▪ 320 nodes arranged into 8 racks of 40 nodes each
▪ 8 x 1 Gbps links out to the core switch
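▪ Assuming those 8 uplinks are per rack, 40 nodes x 1 Gbps of edge bandwidth sits behind 8 Gbps of uplink, i.e. roughly 5:1 oversubscription for cross-rack traffic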
4. Anatomy of the Facebook Cluster
Functional Separation
▪ Need to have test, staging, and production clusters
▪ Break nodes into groups of 10
▪ First 30 machines on each rack run DFS
▪ Last 10 machines used for DFS and upgrade testing or left idle
▪ Run main MapReduce cluster on 20 machines in each rack
▪ Run test MapReduce cluster on 10 machines in four racks
▪ A few other MapReduce clusters for isolated applications
5. Anatomy of the Facebook Cluster
Software for Administration
▪ Most utilities are included in hadoop/bin
▪ Format DFS, start/stop daemons, fsck, rebalance blocks, etc. (see the sketch after this list)
▪ Hypershell (internal): provides distributed shell functionality
▪ See also: dsh, GXP, Capistrano, ClusterIt
▪ Cfengine: ensure uniform system images, configuration, and libraries
▪ ODS (internal): monitoring and alerting
▪ See also: Ganglia for monitoring, Nagios for alerting
▪ Cacti: network monitoring
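A quick sketch of the stock hadoop/bin utilities referenced at the top of this list (paths are relative to the Hadoop install; the fsck target is illustrative):

  bin/hadoop namenode -format   # format a new DFS (destroys existing NN metadata)
  bin/start-dfs.sh              # start NN and DN daemons
  bin/start-mapred.sh           # start JT and TT daemons
  bin/hadoop fsck /             # check DFS health
  bin/hadoop balancer           # rebalance blocks across DNs (0.16 and later)
  bin/slaves.sh uptime          # run a command on every slave, dsh-style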
6. Anatomy of the Facebook Cluster
Excerpts from Facebook’s conf/hadoop-site.xml
Property                            Value                                        Notes
dfs.block.size                      134217728                                    Larger block size for less NN metadata
dfs.datanode.du.reserved            1024000000                                   Don't fill up the local disk
dfs.namenode.handler.count          40                                           More NN server threads for DN RPCs
dfs.network.script                  /mnt/vol/hive/stable/bin/rackid.pl           Print machine network name
fs.trash.interval                   1440
fs.trash.root                       /Trash
io.file.buffer.size                 32768                                        Size of r/w buffer used by SequenceFile
io.sort.factor                      100                                          More streams merged while sorting
io.sort.mb                          200                                          Higher memory limit while sorting data
mapred.child.java.opts              -Xmx1024m -Djava.net.preferIPv4Stack=true    Large heap size; avoid RPC timeout
mapred.linerecordreader.maxlength   1000000                                      Skip malformed lines
mapred.min.split.size               65536
mapred.reduce.copy.backoff          5
mapred.reduce.parallel.copies       20                                           More threads to fetch map output data
mapred.tasktracker.tasks.maximum    5
mapred.speculative.map.enabled      true
mapred.speculative.reduce.enabled   false
mapred.speculative.map.gap          1
webinterface.private.actions        true
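Each row above corresponds to a property element in conf/hadoop-site.xml. A minimal sketch using two of the values quoted on this slide:

  <?xml version="1.0"?>
  <configuration>
    <!-- 128 MB blocks mean less NN metadata -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
    </property>
    <!-- large task heap; stick to IPv4 to avoid RPC timeouts -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m -Djava.net.preferIPv4Stack=true</value>
    </property>
  </configuration>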
7. Anatomy of the Facebook Cluster
HDFS Tips from Dhruba Borthakur
▪ Be careful when using profilers to examine NN state
▪ Never load many small files
▪ Always use Java 1.6; otherwise the NN consumes about 50% more CPU
▪ When decommissioning DNs, do at most 10 machines or so at a time, otherwise the NN gets overloaded
▪ Run fsck every night and monitor the number of missing/under-replicated blocks
▪ If a block stays unreplicated, force its replication factor up, then down
▪ When adding new DNs to the cluster, run the rebalancing script
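A hedged sketch of how these tips map onto the stock command-line tools (the file path, hostname, and exclude-file location are illustrative; dfs.hosts.exclude must already point at the exclude file):

  bin/hadoop fsck / > fsck-$(date +%F).log        # nightly cron; alert on missing/under-replicated counts
  bin/hadoop dfs -setrep 4 /path/to/stuck/file    # force replication up...
  bin/hadoop dfs -setrep 3 /path/to/stuck/file    # ...then back down
  echo dn1043.example.com >> /etc/hadoop/excludes # decommission at most ~10 DNs at a time
  bin/hadoop dfsadmin -refreshNodes               # tell the NN to start decommissioning
  bin/hadoop balancer                             # run after adding new DNs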
8. Anatomy of the Facebook Cluster
Common Issues
▪ Client libraries out of sync
▪ Non-uniform availability of software or libraries on TT nodes
▪ Bad disk: manifests as a read-only file system (ROFS)
▪ NIC decides to go into 100 Mbps Ethernet mode
▪ DN reserved amount not honored, resulting in disks filled to capacity
▪ Resource contention
9. Anatomy of the Facebook Cluster
More About Monitoring
▪ Hadoop has an abstract interface for metrics reporting
▪ org.apache.hadoop.metrics.spi
▪ Currently has “file” and “ganglia” implementations
▪ Every Metric belongs to a Context and a Record
▪ Metrics can also have Tags for disambiguation
▪ See conf/hadoop-metrics.properties for configuration
▪ Web interfaces to NN and JT also have detailed information
▪ A variety of cron’d scripts also take care of system-level monitoring
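A sketch of conf/hadoop-metrics.properties pointing the dfs context at the ganglia implementation and the mapred context at the file implementation (host and file names are placeholders):

  # each context chooses an implementation of org.apache.hadoop.metrics.spi
  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  dfs.period=10
  dfs.servers=ganglia-collector.example.com:8649
  mapred.class=org.apache.hadoop.metrics.file.FileContext
  mapred.period=10
  mapred.fileName=/tmp/mapred_metrics.log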
10. Anatomy of the Facebook Cluster
More About Performance
▪ In addition to the metrics package, logs are a rich source of information
▪ Starting to regularly parse logs and store the results in a MySQL database
▪ Multiple research labs working on this area
▪ Berkeley RAD Lab
▪ Carnegie Mellon PDL
▪ Watch OSDI this year for papers
11. Anatomy of the Facebook Cluster
Recent DFS Performance Numbers
▪ All DNs are on the same rack, taking the core switch out of the measurement
▪ 8 DNs, each with 2 map slots: hence performance levels off at 16 files
▪ Each mapper writes one 1 GB file. Block size is 128 MB; replication factor is 3.
▪ Uses Java 1.6
Number of Files    0.15.4 (MB/s)    0.17.0 (MB/s)
              1               30               60
              2               25               53
              3               20               43
              5               18               33
              8                9               27
             13                8               18
             20                9               17
             24                8               18
             28                8               16
13. Anatomy of the Facebook Cluster
Resource Management and Job Scheduling
▪ By far the most intensive cluster management responsibility
▪ At Facebook: manually set job priorities and kill jobs (see the sketch after this list)
▪ HOD (Hadoop on Demand)
▪ Integrates with Torque resource manager
▪ Torque frequently paired with Maui cluster scheduler
▪ Other options
▪ Sun Grid Engine
▪ Condor
▪ Platform LSF (commercial)
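What the manual approach looks like in practice (the job ID is illustrative). The kill and change-priority links in the JT web UI come from the webinterface.private.actions setting shown on the configuration slide:

  bin/hadoop job -list                        # list running jobs and their IDs
  bin/hadoop job -kill job_200805281234_0042  # kill a runaway job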
15. Anatomy of the Facebook Cluster
Recent Cluster Statistics
▪ From May 2nd to May 21st:
▪ Total jobs: 8,794
▪ Total map tasks: 1,362,429
▪ Total reduce tasks: 86,806
▪ Average duration of a successful job: 296 s
▪ Average duration of a successful map: 81 s
▪ Average duration of a successful reduce: 678 s
16. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0