This deck summarizes the hardware, software configuration, and management of a large Hadoop cluster at Facebook. The cluster consists of 320 nodes arranged in 8 racks, with machines dedicated to the distributed file system, MapReduce jobs, and testing. Tools such as Hypershell and Cfengine handle administration, and common issues and performance optimization techniques are also discussed.
2. Managing a Large Hadoop Cluster
Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
3. Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
▪ CPU: dual-socket quad-core Intel Xeon (8 cores per box)
▪ Memory: 16 GB ECC DRAM
▪ Disk: 4 x 1 TB 7200 RPM SATA
▪ Network: 1 GbE
▪ Topology
▪ 320 nodes arranged into 8 racks of 40 nodes each
▪ 8 x 1 Gbps links out to the core switch
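▪ Assuming those 8 uplinks are per rack, 40 nodes x 1 Gbps of edge bandwidth sits behind 8 Gbps of uplink, i.e. roughly 5:1 oversubscription for cross-rack traffic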
4. Anatomy of the Facebook Cluster
Functional Separation
▪ Need to have test, staging, and production clusters
▪ Break nodes into groups of 10
▪ First 30 machines on each rack run DFS
▪ Last 10 machines used for DFS and upgrade testing or left idle
▪ Run main MapReduce cluster on 20 machines in each rack
▪ Run test MapReduce cluster on 10 machines in four racks
▪ A few other MapReduce clusters for isolated applications
5. Anatomy of the Facebook Cluster
Software for Administration
▪ Most utilities are included in hadoop/bin
▪ Format DFS, start/stop daemons, fsck, rebalance blocks, etc. (see the sketch after this list)
▪ Hypershell (internal): provides distributed shell functionality
▪ See also: dsh, GXP, Capistrano, ClusterIt
▪ Cfengine: ensure uniform system images, configuration, and libraries
▪ ODS (internal): monitoring and alerting
▪ See also: Ganglia for monitoring, Nagios for alerting
▪ Cacti: network monitoring
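A quick sketch of the stock hadoop/bin utilities referenced at the top of this list (paths are relative to the Hadoop install; the fsck target is illustrative):

  bin/hadoop namenode -format   # format a new DFS (destroys existing NN metadata)
  bin/start-dfs.sh              # start NN and DN daemons
  bin/start-mapred.sh           # start JT and TT daemons
  bin/hadoop fsck /             # check DFS health
  bin/hadoop balancer           # rebalance blocks across DNs (0.16 and later)
  bin/slaves.sh uptime          # run a command on every slave, dsh-style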
6. Anatomy of the Facebook Cluster
Excerpts from Facebook’s conf/hadoop-site.xml
Property                            Value                                        Notes
dfs.block.size                      134217728                                    Larger block size for less NN metadata
dfs.datanode.du.reserved            1024000000                                   Don't fill up the local disk
dfs.namenode.handler.count          40                                           More NN server threads for DN RPCs
dfs.network.script                  /mnt/vol/hive/stable/bin/rackid.pl           Print machine network name
fs.trash.interval                   1440
fs.trash.root                       /Trash
io.file.buffer.size                 32768                                        Size of r/w buffer used by SequenceFile
io.sort.factor                      100                                          More streams merged while sorting
io.sort.mb                          200                                          Higher memory limit while sorting data
mapred.child.java.opts              -Xmx1024m -Djava.net.preferIPv4Stack=true    Large heap size; avoid RPC timeout
mapred.linerecordreader.maxlength   1000000                                      Skip malformed lines
mapred.min.split.size               65536
mapred.reduce.copy.backoff          5
mapred.reduce.parallel.copies       20                                           More threads to fetch map output data
mapred.tasktracker.tasks.maximum    5
mapred.speculative.map.enabled      true
mapred.speculative.reduce.enabled   false
mapred.speculative.map.gap          1
webinterface.private.actions        true
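Each row above corresponds to a property element in conf/hadoop-site.xml. A minimal sketch using two of the values quoted on this slide:

  <?xml version="1.0"?>
  <configuration>
    <!-- 128 MB blocks mean less NN metadata -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
    </property>
    <!-- large task heap; stick to IPv4 to avoid RPC timeouts -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m -Djava.net.preferIPv4Stack=true</value>
    </property>
  </configuration>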
7. Anatomy of the Facebook Cluster
HDFS Tips from Dhruba Borthakur
▪ Be careful when using profilers to examine NN state
▪ Never load many small files
▪ Always use Java 1.6; otherwise the NN consumes about 50% more CPU
▪ When decommissioning DNs, do at most 10 machines or so at a time, otherwise the NN gets overloaded
▪ Run fsck every night and monitor the number of missing/under-replicated blocks
▪ If a block stays unreplicated, force its replication factor up, then down
▪ When adding new DNs to the cluster, run the rebalancing script
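A hedged sketch of how these tips map onto the stock command-line tools (the file path, hostname, and exclude-file location are illustrative; dfs.hosts.exclude must already point at the exclude file):

  bin/hadoop fsck / > fsck-$(date +%F).log        # nightly cron; alert on missing/under-replicated counts
  bin/hadoop dfs -setrep 4 /path/to/stuck/file    # force replication up...
  bin/hadoop dfs -setrep 3 /path/to/stuck/file    # ...then back down
  echo dn1043.example.com >> /etc/hadoop/excludes # decommission at most ~10 DNs at a time
  bin/hadoop dfsadmin -refreshNodes               # tell the NN to start decommissioning
  bin/hadoop balancer                             # run after adding new DNs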
8. Anatomy of the Facebook Cluster
Common Issues
▪ Client libraries out of sync
▪ Non-uniform availability of software or libraries on TT nodes
▪ Bad disk: manifests as a read-only file system (ROFS)
▪ NIC decides to go into 100 Mbps Ethernet mode
▪ DN reserved amount not honored, resulting in disks filled to capacity
▪ Resource contention
9. Anatomy of the Facebook Cluster
More About Monitoring
▪ Hadoop has an abstract interface for metrics reporting
▪ org.apache.hadoop.metrics.spi
▪ Currently has “file” and “ganglia” implementations
▪ Every Metric belongs to a Context and a Record
▪ Metrics can also have Tags for disambiguation
▪ See conf/hadoop-metrics.properties for configuration
▪ Web interfaces to NN and JT also have detailed information
▪ A variety of cron’d scripts also take care of system-level monitoring
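A sketch of conf/hadoop-metrics.properties pointing the dfs context at the ganglia implementation and the mapred context at the file implementation (host and file names are placeholders):

  # each context chooses an implementation of org.apache.hadoop.metrics.spi
  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  dfs.period=10
  dfs.servers=ganglia-collector.example.com:8649
  mapred.class=org.apache.hadoop.metrics.file.FileContext
  mapred.period=10
  mapred.fileName=/tmp/mapred_metrics.log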
10. Anatomy of the Facebook Cluster
More About Performance
▪ In addition to the metrics package, logs are a rich source of information
▪ Starting to regularly parse logs and store the results in a MySQL database
▪ Multiple research labs working on this area
▪ Berkeley RAD Lab
▪ Carnegie Mellon PDL
▪ Watch OSDI this year for papers
11. Anatomy of the Facebook Cluster
Recent DFS Performance Numbers
▪ All DNs are on the same rack, taking the core switch out of the measurement
▪ 8 DNs, each with 2 map slots: hence performance levels off at 16 files
▪ Each mapper writes one 1 GB file. Block size is 128 MB; replication factor is 3.
▪ Uses Java 1.6
Number of Files    0.15.4 (MB/s)    0.17.0 (MB/s)
              1               30               60
              2               25               53
              3               20               43
              5               18               33
              8                9               27
             13                8               18
             20                9               17
             24                8               18
             28                8               16
13. Anatomy of the Facebook Cluster
Resource Management and Job Scheduling
▪ By far the most intensive cluster management responsibility
▪ At Facebook: manually set job priorities and kill jobs (see the sketch after this list)
▪ HOD (Hadoop on Demand)
▪ Integrates with Torque resource manager
▪ Torque frequently paired with Maui cluster scheduler
▪ Other options
▪ Sun Grid Engine
▪ Condor
▪ Platform LSF (commercial)
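What the manual approach looks like in practice (the job ID is illustrative). The kill and change-priority links in the JT web UI come from the webinterface.private.actions setting shown on the configuration slide:

  bin/hadoop job -list                        # list running jobs and their IDs
  bin/hadoop job -kill job_200805281234_0042  # kill a runaway job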
15. Anatomy of the Facebook Cluster
Recent Cluster Statistics
▪ From May 2nd to May 21st:
▪ Total jobs: 8,794
▪ Total map tasks: 1,362,429
▪ Total reduce tasks: 86,806
▪ Average duration of a successful job: 296 s
▪ Average duration of a successful map: 81 s
▪ Average duration of a successful reduce: 678 s
16. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0