The document discusses performance issues on Hadoop clusters and proposes three solutions:
1) Data placement - Distributing data across heterogeneous cluster nodes according to their computing capabilities to reduce data transfer time.
2) Prefetching - Improving performance by preloading required data before tasks are assigned to reduce CPU stall time.
3) Preshuffling - Minimizing data shuffling during reduce by pipelining intermediate data between map and reduce tasks.
Improving Hadoop Performance with Data Placement and Prefetching
1. Performance Issues on Hadoop Clusters
Jiong Xie
Advisor: Dr. Xiao Qin
Committee Members:
Dr. Cheryl Seals
Dr. Dean Hendrix
University Reader:
Dr. Fa Foster Dai
2. Overview of My Research

Problem          Solution                                    Status
Data locality    Data placement on heterogeneous cluster    [HCW 10]
Data movement    Prefetching data from disk to memory       [Submitted to IPDPS]
Data shuffling   Reduce network congestion                  [To be submitted]
6. Hadoop Overview -- MapReduce Running System
(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137–150)
9. Existing Hadoop Clusters
• Observation 1: Cluster nodes are dedicated
  – Data locality issues
  – Data transfer time
• Observation 2: The number of nodes is increasing
  – Scalability issues
  – Shuffling overhead goes up
11. Solutions
• P1: Data placement (offline): distribute data across heterogeneous nodes
• P2: Prefetching (online): preload data before tasks run
• P3: Preshuffling: reduce traffic from intermediate data movement
12. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters
13. Motivational Example
[Figure: task-completion timeline, in minutes, for three nodes]
• Node A (fast): 1 task/min
• Node B (slow): 2x slower
• Node C (slowest): 3x slower
14. The Native Strategy
[Figure: per-node timelines, in minutes, showing loading, transferring, and processing phases]
• Node A: 6 tasks
• Node B: 3 tasks
• Node C: 2 tasks
15. Our Solution -- Reducing Data Transfer Time
[Figure: per-node timelines (Node A shown for comparison with Node A'), with loading, transferring, and processing phases]
• Node A': 6 tasks
• Node B': 3 tasks
• Node C': 2 tasks
16. Challenges
• Does the distribution strategy depend on applications?
• Initialization of data distribution
• The data skew problem
  – New data arrival
  – Data deletion
  – Data updating
  – Newly joining nodes
17. Measure Computing Ratios
• Computing ratio: each node's relative speed at processing the same amount of data
• Fast machines process large data sets
[Figure: task timelines; Node A: 1 task/min, Node B: 2x slower, Node C: 3x slower]
18. Measuring Computing Ratios
1. Run an application and collect the response time on each node
2. Assign ratio 1 to the node offering the shortest response time
3. Normalize the ratios of the other nodes
4. Calculate the least common multiple of these ratios
5. Determine the amount of data (number of file fragments) processed by each node

Node     Response time (s)   Ratio   # of file fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest

(A sketch of this procedure follows.)
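The following minimal Python sketch illustrates steps 2-5 using the response times from the table above; the variable names are illustrative, not from the dissertation.

    from math import lcm  # Python 3.9+

    # Measured response times (seconds) for the same benchmark, from the table above.
    response_times = {"Node A": 10, "Node B": 20, "Node C": 30}

    # Steps 2-3: normalize against the fastest node, whose ratio is 1.
    # (Assumes the ratios normalize to integers, as in the table; measured
    # ratios such as 3.3 would need rounding before taking an LCM.)
    fastest = min(response_times.values())
    ratios = {node: t // fastest for node, t in response_times.items()}

    # Step 4: the least common multiple of the ratios (1, 2, 3 -> 6).
    total = lcm(*ratios.values())

    # Step 5: each node's file-fragment count is LCM / ratio, so faster
    # nodes receive proportionally more data.
    fragments = {node: total // r for node, r in ratios.items()}
    print(fragments)  # {'Node A': 6, 'Node B': 3, 'Node C': 2}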
19. Initialize Data Distribution
• Input files are split into 64MB blocks
• A round-robin data distribution algorithm assigns blocks to datanodes in portions matching the computing ratios (3:2:1)
[Figure: the Namenode distributes blocks 1-9 and a-c of File1 across Datanodes A, B, and C in a 3:2:1 portion]
(A sketch of ratio-weighted round-robin placement follows.)
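A minimal sketch of the ratio-weighted round-robin idea; the function name is hypothetical and the shares are the per-round fragment counts computed above. Each round hands out blocks to the nodes in proportion 3:2:1.

    def place_blocks(blocks, shares):
        """Assign blocks to nodes round-robin, weighted by each node's share.

        shares: {node: blocks_per_round}, e.g. {"A": 3, "B": 2, "C": 1}.
        Returns {node: [blocks]}.
        """
        placement = {node: [] for node in shares}
        it = iter(blocks)
        done = False
        while not done:
            for node, share in shares.items():
                for _ in range(share):
                    block = next(it, None)
                    if block is None:
                        done = True
                        break
                    placement[node].append(block)
                if done:
                    break
        return placement

    # 12 blocks of a 768MB file split into 64MB blocks.
    print(place_blocks(list(range(12)), {"A": 3, "B": 2, "C": 1}))
    # A gets 6 blocks, B gets 4, C gets 2 -- a 3:2:1 portion.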
20. Data Redistribution
1. Get the network topology, the computing ratios, and disk utilization
2. Build and sort two lists: an under-utilized node list (L1) and an over-utilized node list (L2)
3. Select a source node from L2 and a destination node from L1
4. Transfer data from the source to the destination
5. Repeat steps 3 and 4 until the lists are empty
[Figure: the Namenode directs block transfers among Datanodes A, B, and C toward the 3:2:1 portion]
(A sketch of this rebalancing loop follows.)
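A minimal sketch of the rebalancing loop; the names are hypothetical, and "utilization" is simplified here to blocks held relative to a node's target share. One block moves per iteration from the most over-utilized node to the most under-utilized one.

    def rebalance(holdings, targets):
        """holdings/targets: {node: block_count}. Mutates holdings;
        returns the list of (source, destination) block moves."""
        moves = []
        while True:
            # Step 2: build and sort the over- and under-utilized lists.
            over = sorted((n for n in holdings if holdings[n] > targets[n]),
                          key=lambda n: targets[n] - holdings[n])
            under = sorted((n for n in holdings if holdings[n] < targets[n]),
                           key=lambda n: holdings[n] - targets[n])
            if not over or not under:
                break                        # Step 5: stop when a list is empty.
            src, dst = over[0], under[0]     # Step 3: pick source and destination.
            holdings[src] -= 1               # Step 4: transfer one block.
            holdings[dst] += 1
            moves.append((src, dst))
        return moves

    # Node C holds too many blocks for its 3:2:1 share; they move toward A.
    print(rebalance({"A": 4, "B": 4, "C": 4}, {"A": 6, "B": 4, "C": 2}))
    # [('C', 'A'), ('C', 'A')]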
21. Experimental Environment

Node     CPU Model           CPU (Hz)     L1 Cache (KB)
Node A   Intel Core 2 Duo    2 x 1GHz     204
Node B   Intel Celeron       2.8GHz       256
Node C   Intel Pentium 3     1.2GHz       256
Node D   Intel Pentium 3     1.2GHz       256
Node E   Intel Pentium 3     1.2GHz       256

Five nodes in a heterogeneous Hadoop cluster
22. Benchmarks
• Grep: a tool that searches for a regular expression in a text file
• WordCount: a program that counts the words in a text file
• Sort: a program that lists the inputs in sorted order
23. Response Time of Grep and WordCount on Each Node
• The computing ratio is application dependent but data-size independent
24. Computing Ratios for Two Applications

Computing Node   Ratio for Grep   Ratio for WordCount
Node A           1                1
Node B           2                2
Node C           3.3              5
Node D           3.3              5
Node E           3.3              5

Computing ratios of the five nodes with respect to the Grep and WordCount applications
26. Impact of Data Placement on the Performance of Grep
27. Impact of Data Placement on the Performance of WordCount
28. Summary of Data Placement
P1: Data Placement Strategy
• Motivation: fast machines should process large data sets
• Problem: data locality in heterogeneous clusters
• Contributions: distribute data according to computing capability
  – Measure computing ratios
  – Initialize data placement
  – Redistribute data
30. Prefetching
• Goal: improving performance
• Approach
  – Make a best effort to guarantee data locality
  – Keep data close to computing nodes
  – Reduce CPU stall time
31. Challenges
• What to prefetch?
• How to prefetch?
• What size of blocks should be prefetched?
32. Dataflow in Hadoop
[Figure: baseline Hadoop dataflow. 1. Submit job; 2. Schedule; 3. Read input (Blocks 1 and 2 from HDFS); 4. Run map, writing output to the local FS; 5. Heartbeat; 6. Next task; 7. Read new file (reduce).]
33. Dataflow in Hadoop with Prefetching
[Figure: Hadoop dataflow with the predictive scheduler. 1. Submit job; 2. Schedule (+ one more task and its metadata); 3. Read input (Blocks 1 and 2 from HDFS); 4. Run map; 5. Heartbeat; 5.1. Read new file (prefetch); 6. Next task.]
41. Summary
P2: Predictive Scheduler and Prefetching
• Goal: move data before the task is assigned
• Problem: synchronizing tasks and data
• Contributions: preload the required data earlier than the task assignment
  – Predictive scheduler
  – Prefetching mechanism
  – Worker thread
(A sketch of the worker-thread idea follows.)
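A minimal sketch of the worker-thread idea, not the dissertation's actual implementation: while the CPU processes the current block, a background thread preloads the next block from disk into memory, so the map task does not stall on I/O. All names here are hypothetical.

    import queue
    import threading

    def process(data):
        """Stand-in for the map function applied to one in-memory block."""
        return len(data)

    def prefetch_worker(paths, out):
        """Worker thread: read blocks from disk ahead of the compute loop."""
        for path in paths:
            with open(path, "rb") as f:
                out.put((path, f.read()))  # Block is in memory before map needs it.
        out.put(None)                      # Signal that all blocks are loaded.

    def run(paths, depth=1):
        """Process blocks while the worker preloads ahead (at most `depth` blocks)."""
        out = queue.Queue(maxsize=depth)   # Bounded queue: prefetch one block ahead.
        threading.Thread(target=prefetch_worker, args=(paths, out),
                         daemon=True).start()
        while (item := out.get()) is not None:
            _path, data = item
            process(data)                  # CPU works while the next block is loading.

    # run(["block_0000", "block_0001"], depth=1) would overlap disk I/O with map work.

The bounded queue of depth one mirrors the conservative policy described later in the notes: two tasks assigned per node, one block prefetched.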
43. Preshuffling
• Observation 1: Too much data moves from map workers to reduce workers
  – Solution 1: Map nodes apply pre-shuffling (combiner-style) functions to their local output (see the sketch after this list)
• Observation 2: No reduce can start until a map is complete
  – Solution 2: Intermediate data is pipelined between mappers and reducers
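A minimal sketch of Solution 1 in plain Python (not Hadoop's actual Combiner API): aggregating map output locally before it crosses the network shrinks the shuffled data from one pair per word occurrence to one pair per distinct word.

    from collections import Counter

    def map_phase(lines):
        """Emit (word, 1) for every word -- the raw map output."""
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def combine(pairs):
        """Combiner: pre-aggregate pairs on the map node before shuffling."""
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())   # One pair per distinct word.

    lines = ["the quick fox", "the lazy dog", "the fox"]
    raw = list(map_phase(lines))
    combined = combine(raw)
    print(len(raw), "pairs shuffled without combiner;", len(combined), "with")
    # 8 pairs shuffled without combiner; 5 with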
44. Preshuffling
• Goal: minimize data shuffling during the reduce phase
• Approach
  – Pipelining
  – Overlap map computation with data movement
  – Group map and reduce tasks
• Challenges
  – Synchronizing map and reduce
  – Data locality
45. Dataflow in Hadoop with Preshuffling
[Figure: Hadoop dataflow with preshuffling. 1. Submit job; 2. Schedule / new task; 3. Read input (Blocks 1 and 2 from HDFS), with the reduce requesting data via HTTP GET; 4. Run map / send data; 5. Write data (reduce output to HDFS); 6. Next task; heartbeat.]
46. PreShuffle
[Figure: map tasks grouped with fixed reduce tasks; each reducer sends data requests directly to its paired mappers]
56. Run Time Affected by Network Condition
Experiment results conducted by Yixian Yang
57. Traffic Volume Affected by Network Condition
Experiment results conducted by Yixian Yang
Editor's Notes
Comment: three colors.
1. Copy the MapReduce program to be executed to the master and to every worker machine. 2. The master decides which workers run the map program and which run the reduce program. 3. All data blocks are assigned to the workers running the map program for mapping. 4. The map results are stored on the workers' local disks. 5. The workers running the reduce program remotely read each map result, merge and sort them, and run the reduce program. 6. The results the user needs are written out.
Hadoop MapReduce Single master node, many worker nodes Client submits a job to master node Master splits each job into tasks (map/reduce), and assigns tasks to worker nodes Hadoop Distributed File System (HDFS) Single name node, many data nodes Files stored as large, fixed-size (e.g. 64MB) blocks HDFS typically holds map input and reduce output
Comment: no “want to”
Emphasize.
Comments: differences between shuffling overhead and data transfer time.
Comment: what is computing ratio.
Comment 1: make step descriptions short. Comment 2: animations map to steps.
blocks
More I/O features
Higher resolution
Comment: Titles of all the slides must use the same format. Up to 33.1% and 10.2%, with averages of 17.3% and 7.1%.
Which data block should be loaded, and where is it? How do we synchronize the computation with the prefetching? If data is loaded into the cache long before it is required, it wastes resources; conversely, if the data arrives later than the computation needs it, it is useless. How do we optimize the prefetching rate, that is, the best percentage of data to preload on each machine? MapReduce benefits most from the best prefetching rate while the prefetching overhead is minimized. When to prefetch controls how early to trigger the prefetching action on each node. In our previous research, we observed that a node processes blocks of the same size in a fixed time, so before the last block finishes, the next block starts loading. In this design, we estimate the execution time of each block on each node; since the execution time of the same application may differ across nodes, we collect and adapt the average running time of a block on each node. What to prefetch determines which block to load into memory. Initially, the predictive scheduler assigns two tasks to each node; when the preloading work is triggered, it accesses the data according to the required information of the second task. How much to prefetch decides the amount of data preloaded into memory. According to the scheduler, while one task is running, one or more tasks are waiting on the node and their metadata is already in memory; when the prefetching action is triggered, the preloading worker automatically loads data from disk to memory. In view of the large size of HDFS blocks, this number should not be aggressive: here we set the number of tasks to two and the number of prefetched blocks to one.
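A minimal sketch of the trigger-timing rule just described; the function names are hypothetical. The node's per-block execution time is tracked as a running average, and the prefetch fires just early enough that the next block arrives as the current one finishes.

    def update_avg(avg, sample, alpha=0.2):
        """Adapt the node's average per-block execution time (moving average)."""
        return sample if avg is None else (1 - alpha) * avg + alpha * sample

    def prefetch_trigger_time(block_start, avg_exec_time, load_time):
        """Fire the prefetch so the next block is in memory when this one finishes.

        Too early wastes memory holding idle data; too late leaves the CPU stalled.
        """
        return block_start + max(0.0, avg_exec_time - load_time)

    # A node averages 60s per block and loading a block takes 10s, so start
    # prefetching 50s into the current block.
    print(prefetch_trigger_time(block_start=0.0, avg_exec_time=60.0,
                                load_time=10.0))  # 50.0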
Comment: add animations.
Figure 4.3: Comparing the execution time of Grep in native and prefetching MapReduce. Improvements: Grep 9.5% (1GB) and 8.5% (2GB); WordCount 8.9% (1GB) and 8.1% (2GB).
Above: for a single large file, about a 9% improvement; for small files, 24%. As we increase the number of computing nodes, we get the same improvement through PSP.
The animations in the third part are fairly typical, but the relationship among slides 44-53 still needs to be distilled; ideally it should be strengthened.
There is one map task for each block of the input file; it applies the user-defined map function to each record in the block (record = <key, value>). There is a user-defined number of reduce tasks, and each reduce task is assigned a set of record groups (record group = all records with the same key). For each group, the reduce task applies the user-defined reduce function to the record values in that group. Reduce tasks read from every map task, and each read returns the record groups for that reduce task.
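A minimal pure-Python sketch of this record/group model, with WordCount as the user-defined functions (no Hadoop APIs): map emits <key, value> records, the shuffle groups records by key, and reduce folds each group's values.

    from itertools import groupby
    from operator import itemgetter

    def user_map(record):
        """User-defined map: one input line -> (word, 1) records."""
        for word in record.split():
            yield (word, 1)

    def user_reduce(key, values):
        """User-defined reduce: fold all values that share a key."""
        return (key, sum(values))

    def run_job(records):
        # Map phase: one logical map call per record in the block.
        pairs = [kv for record in records for kv in user_map(record)]
        # Shuffle: group records by key (sort then group, like Hadoop's sort/merge).
        pairs.sort(key=itemgetter(0))
        # Reduce phase: apply user_reduce to each record group.
        return [user_reduce(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0))]

    print(run_job(["the quick fox", "the lazy dog"]))
    # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]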
To reduce the amount of transferred data, the map workers apply combiner functions to their local outputs before storing or transferring the intermediate data, reducing traffic. The combining strategy minimizes the amount of data that must be transferred to the reducers and speeds up the execution time of the overall job. To reduce communication between nodes, intermediate results are pipelined between fixed mappers and reducers: a reducer does not need to ask every map node in the cluster, and reducers begin processing data as soon as it is produced by the mappers. To exploit the overlap between data movement and map computation: our experiments show that the shuffle is always much longer than the map computation time, and when the network is saturated the overlap disappears entirely. Reducing network latency improves throughput but also increases the possibility of network conflicts; the pipeline idea can be borrowed to hide this overlap. To limit the communication between map and reduce: in MapReduce's parallel execution, a node's execution includes map computation, waiting for the shuffle, and reduce computation, and the critical path is the slowest node. To improve performance, the critical path has to be shortened. However, the natural overlap of one node's shuffle wait with another node's execution does not shorten the critical path; instead, the operations within a single node need to be overlapped. The reducer checks the available data from all the map nodes in the cluster; if some reduce nodes and map nodes can be grouped around particular key-value pairs, the network communication cost can be shortened.
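A minimal sketch of the pipelining idea using plain Python threads (hypothetical names, not the dissertation's implementation): a mapper paired with a fixed reducer pushes each intermediate pair into a shared queue as soon as it is produced, so the reducer starts consuming while the map is still running, overlapping map computation with data movement.

    import queue
    import threading
    from collections import Counter

    def mapper(lines, pipe):
        """Paired map task: emit intermediate pairs into the pipe as produced."""
        for line in lines:
            for word in line.split():
                pipe.put((word, 1))   # Shipped immediately, not after map ends.
        pipe.put(None)                # End-of-stream marker.

    def reducer(pipe, result):
        """Paired reduce task: start aggregating as soon as pairs arrive."""
        counts = Counter()
        while (item := pipe.get()) is not None:
            word, n = item
            counts[word] += n
        result.update(counts)

    pipe = queue.Queue(maxsize=64)    # Bounded pipe between the fixed map/reduce pair.
    result = {}
    t1 = threading.Thread(target=mapper, args=(["the quick fox", "the fox"], pipe))
    t2 = threading.Thread(target=reducer, args=(pipe, result))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(result)   # {'the': 2, 'quick': 1, 'fox': 2}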
Figure 5.4.2 illustrates the average response time of native Hadoop for 1GB WordCount: the reduce phase does not start until all the map tasks finish. Figure 5.1(b) illustrates the average response time of preshuffling Hadoop for 1GB WordCount: the reduce tasks receive map outputs a little earlier and can begin sorting earlier, which reduces the time required for the final merge and achieves better cluster utilization, by 14%. Native Hadoop finishes all its map tasks earlier than preshuffling Hadoop, because preshuffling allows reduce tasks to begin executing more quickly; the reduce tasks hold their required resources for longer, causing the map phase to take longer.
Figure 5.2 reports the improvement of preshuffling over native Hadoop for WordCount. As the block size increases, preshuffling achieves a larger improvement; when the block size is less than 16MB, the preshuffling algorithm does not gain a significant performance improvement. The same trend appears in Figure 5.3. When the block size reaches 128MB or more, the improvement levels off at a fixed value. Preshuffling Hadoop gains on average 14% for WordCount and 12.5% for Sort. Figure 5.2: The improvement of preshuffling with different block sizes for WordCount. Figure 5.3: The improvement of preshuffling with different block sizes for Sort.
Average 14% improvement
Hadoop Archive: Hadoop Archive (HAR) is a file archiving tool that efficiently places small files into HDFS blocks. It packs multiple small files into a single HAR file, reducing namenode memory usage while still allowing transparent access to the files. To archive all the small files under a directory /foo/bar into /outputdir/zoo.har: hadoop archive -archiveName zoo.har -p /foo/bar /outputdir. The HAR block size can also be specified (with -Dhar.block.size). HAR is a file system layered on top of the Hadoop file system, so all fs shell commands work on HAR files; only the path format differs, and a HAR path can take either of two forms. Sequence file: a sequence file consists of a series of binary key/value pairs; with the small file's name as the key and its contents as the value, a large batch of small files can be merged into one large file. Hadoop-0.21.0 provides the SequenceFile Writer, Reader, and SequenceFileSorter classes for writing, reading, and sorting; for Hadoop versions below 0.21.0, see [3] for an implementation. CombineFileInputFormat: CombineFileInputFormat is a new InputFormat that combines multiple files into a single split, and it also takes data locality into account.
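A minimal pure-Python sketch of the sequence-file idea (not Hadoop's SequenceFile API): pack many small files into one large file as (filename, contents) key/value records with a simple length-prefixed layout.

    import struct

    def pack(paths, out_path):
        """Pack small files into one large file of length-prefixed key/value records."""
        with open(out_path, "wb") as out:
            for path in paths:
                key = path.encode()
                with open(path, "rb") as f:
                    value = f.read()
                # Record layout: key length, value length, key bytes, value bytes.
                out.write(struct.pack(">II", len(key), len(value)))
                out.write(key)
                out.write(value)

    def unpack(in_path):
        """Iterate (filename, contents) records back out of the packed file."""
        with open(in_path, "rb") as f:
            while header := f.read(8):
                klen, vlen = struct.unpack(">II", header)
                yield f.read(klen).decode(), f.read(vlen)

    # pack(["a.txt", "b.txt"], "packed.seq"); list(unpack("packed.seq"))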