The document discusses MapReduce runtime environments, including their design, performance optimizations, and applications. It provides an overview of MapReduce, describing the programming model and key-value data processing. It also discusses the design of MapReduce execution runtimes, including their use of distributed file systems and handling of parallelization, load balancing, and failures. Finally, it outlines areas of ongoing research to improve MapReduce performance and applicability.
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
MapReduce Runtime Design and Performance Optimizations
1. MapReduce Runtime Environments
Design, Performance, Optimizations
Gilles Fedak
Many Figures imported from authors’ works
Gilles.Fedak@inria.fr
INRIA/University of Lyon, France
Escola Regional de Alto Desempenho - Região Nordeste
2. Outline of Topics
1. What is MapReduce?
   The Programming Model
   Applications and Execution Runtimes
2. Design and Principles of MapReduce Execution Runtimes
   File Systems
   MapReduce Execution
   Key Features of MapReduce Runtime Environments
3. Performance Improvements and Optimizations
   Handling Heterogeneity and Stragglers
   Extending MapReduce Application Range
4. MapReduce beyond the Data Center
   Background on BitDew
   MapReduce for Desktop Grid Computing
5. Conclusion
3. Section 1
What is MapReduce ?
4. What is MapReduce ?
• Programming model for data-intensive applications
• Proposed by Google in 2004
• Simple, inspired by functional programming
  • the programmer simply defines Map and Reduce tasks
• Building block for other parallel programming tools
• Strong open-source implementation: Hadoop
• Highly scalable
  • accommodates large-scale clusters of faulty and unreliable resources
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
5. Subsection 1
The Programming Model
7. MapReduce a la Google
Map(k, v) → list(k′, v′)
Group (k′, v′) pairs by k′
Reduce(k′, {v1, . . . , vn}) → v″

Input → Mapper → Group by key → Reducer → Output

• Map(key, value) is run on each item in the input data set
  • emits a new (key′, val′) pair
• Reduce(key′, vals) is run for each unique key emitted by the Map
  • emits the final output
8. Example: Word Count
• Input: a large number of text documents
• Task: compute the word count across all the documents
• Solution:
  • Mapper: for every word in a document, emit (word, 1)
  • Reducer: sum all occurrences of a word and emit (word, total_count)
9. WordCount Example
// Pseudo-code for the Map function
map(String key, String value):
    // key: document name
    // value: document contents
    for each word in split(value):
        EmitIntermediate(word, "1");

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int word_count = 0;
    for each v in values:
        word_count += ParseInt(v);
    Emit(key, AsString(word_count));
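To make the dataflow concrete, here is a minimal, self-contained Python sketch of the same WordCount logic, simulating the map, group-by-key and reduce steps in a single process (the function names mirror the pseudo-code above and are illustrative; this is not Hadoop's API):

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Map: emit (word, 1) for every word in the document
        return [(word, 1) for word in contents.split()]

    def reduce_fn(word, counts):
        # Reduce: sum all occurrences of the word
        return word, sum(counts)

    def mapreduce(documents):
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)   # group by key
        return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
    print(mapreduce(docs))   # {'the': 3, 'quick': 1, 'brown': 1, ...}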
10. Subsection 2
Applications and Execution Runtimes
11. MapReduce Applications Skeleton
• Read large data set
• Map: extract relevant information from each data item
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Store the result
• Iterate if necessary
12. MapReduce Usage at Google
Jerry Zhao (MapReduce Group Tech lead @ Google)
13. MapReduce + Distributed Storage
Powerful when associated with a distributed storage system
• GFS (Google File System), HDFS (Hadoop File System)
14. The Execution Runtime
• The MapReduce runtime handles transparently:
  • Parallelization
  • Communications
  • Load balancing
  • I/O: network and disk
  • Machine failures
  • Laggers (stragglers)
16. Existing MapReduce Systems
#$%&'()*+,-./0-(12$3-4$((
""#$%&'()*%+,-%&
• Google MapReduce
./")/0%1(/2&(3+&-$"4%+&4",/-%&
• Proprietary and close source
5$"4%$2&$036%+&1"&!78& to GFS
• Closely linked
9:)$%:%31%+&03&5;;&&<01=&.21="3&(3+&>(?(&@03+03#4&
• Implemented in C++ with Python and Java Bindings
8(<A($$&
• Hadoop
(+"")&
• Open source (Apache projects)
• Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly
C)%3&4",/-%&DE)(-=%&)/"F%-1G&
more)
5(4-(+03#H&>(?(H&*,@2H&.%/$H&.21="3H&.B.H&*H&"/&5;;&
• Many side projects: HDFS, PIG, Hive, Hbase, Zookeeper
'(32&40+%&)/"F%-14&I&BJ78H&.9!H&B@(4%H&B0?%H&K""6%)%/&
• Used by Yahoo!, Facebook, Amazon and the Google IBM research Cloud
L4%+&@2&M(=""NH&7(-%@""6H&E:(A"3&(3+&1=%&!""#$%&9O'&
/%4%(/-=&5$",+&
17. Other Specialized Runtime Environments
• Hardware platforms
  • Multicore (Phoenix)
  • FPGA (FPMR)
  • GPU (Mars)
• Frameworks based on MapReduce
  • Iterative MapReduce (Twister)
  • Incremental MapReduce (Nephele, Online MR)
  • Languages (PigLatin)
• Security
  • Data privacy (Airavat)
• MapReduce/Virtualization
  • EC2 Amazon
  • CAM
• Log analysis: In situ MR
19. Research Areas to Improve MapReduce Performance
• File System
  • higher throughput, reliability, replication efficiency
• Scheduling
  • data/task affinity, scheduling multiple MR applications, hybrid distributed computing infrastructures
• Failure detection
  • failure characterization, failure detectors and prediction
• Security
• Data privacy, distributed result/error checking
• Greener MapReduce
• Energy efficiency, renewable energy
20. Research Areas to Improve MapReduce Applicability
• Scientific Computing
• Iterative MapReduce, numerical processing, algorithmics
• Parallel Stream Processing
  • language extensions, runtime optimization for parallel stream processing, incremental processing
• MapReduce on widely distributed data sets
  • MapReduce on embedded devices (sensors), data integration
21. CloudBlast: MapReduce and Virtualization for BLAST Computing

NCBI BLAST
• The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
• It compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

Hadoop + BLAST
• the stream of sequences is split by Hadoop into chunks on new-line boundaries; a StreamPatternRecordReader (StreamInputFormat) splits the chunks into sequences, passed as keys to the mapper (empty value)
• each BLAST Mapper compares its sequences to the gene database
• Hadoop provides fault tolerance and load balancing
• BLAST output is written to local files (part-00000, part-00001, ...); HDFS is used to merge the results (DFS merger)
• the database is not chunked

CloudBlast: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. A. Matsunaga, M. Tsugawa and J. Fortes. In eScience'08, 2008.
25. File System for Data-intensive Computing
MapReduce is associated with parallel file systems
• GFS (Google File System) and HDFS (Hadoop File System)
• Parallel, scalable, fault-tolerant file systems
• Based on inexpensive commodity hardware (failures happen)
• Allow mixing storage and computation

GFS/HDFS specificities
• Many files, large files (> x GB)
• Two types of reads
  • large streaming reads (MapReduce)
  • small random reads (usually batched together)
• Write once (rarely modified), read many
• High sustained throughput (favored over latency)
• Consistent concurrent execution
26. Example: the Hadoop Architecture
Figure: the Hadoop architecture. Data analytics applications run on top of distributed computing (MapReduce) and distributed storage (HDFS). The masters are the JobTracker (MapReduce) and the NameNode (HDFS); each slave node runs a DataNode (storage) and a TaskTracker (computation).
27. HDFS Main Principles
HDFS Roles
• Application
  • splits a file into chunks (the more chunks, the more parallelism)
  • asks HDFS to store the chunks
• Name Node
  • keeps track of file metadata (the list of chunks)
  • ensures the distribution and replication (3 by default) of chunks on Data Nodes
• Data Node
  • stores file chunks
  • replicates file chunks

Figure: the application writes chunks A, B, C of File.txt; the NameNode answers "Ok, use Data Nodes 2, 4, 6", and the chunks end up replicated across Data Nodes 1-7.
28. Data locality in HDFS
Rack Awareness
• the Name Node groups Data Nodes into racks (e.g., rack 1: DN1-DN4; rack 2: DN5-DN8)
• assumption: network performance differs between in-rack and inter-rack communication
• Replication: keep 2 replicas in-rack and 1 in another rack (e.g., chunk A on DN1, DN3, DN5 and chunk B on DN2, DN4, DN7), as sketched below
• minimizes inter-rack communications
• tolerates a full rack failure
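A rough Python sketch of the placement policy described above (2 replicas in the local rack, 1 in another rack); this is a simplified illustration with assumed rack names, not HDFS's actual code:

    import random

    def place_replicas(racks, local_rack, n_replicas=3):
        # Two copies in the writer's rack, the rest in one remote rack.
        local = random.sample(racks[local_rack], 2)
        remote_rack = random.choice([r for r in racks if r != local_rack])
        remote = random.sample(racks[remote_rack], n_replicas - 2)
        return local + remote

    racks = {"rack1": ["DN1", "DN2", "DN3", "DN4"],
             "rack2": ["DN5", "DN6", "DN7", "DN8"]}
    print(place_replicas(racks, "rack1"))   # e.g. ['DN1', 'DN3', 'DN5']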
29. Writing to HDFS
Client → Name Node: write File.txt in dir d
Name Node → Client: Writeable

For each data chunk:
  Client → Name Node: AddBlock A
  Name Node → Client: Data Nodes 1, 3, 5 (IP, port, rack number)
  the Client initiates a TCP connection to Data Node 1, which connects to Data Nodes 3 and 5 in turn; "Pipeline Ready" flows back to the Client

• the Name Node checks permissions and selects a list of Data Nodes
• a pipeline is opened from the Client through each Data Node
30. Pipeline Write HDFS
For each data chunk, the Client sends Block A down the pipeline (Client → Data Node 1 → Data Node 3 → Data Node 5); "Block Received A" acknowledgments and Success messages flow back, and the TCP connections are closed.

• Data Nodes send and receive chunks concurrently
• blocks are transferred one by one
• the Name Node updates the metadata table (A → DN1, DN3, DN5)
31. Fault Resilience and Replication
NameNode
• Ensures fault detection
  • a heartbeat is sent by each DataNode to the NameNode every 3 seconds
  • after 10 successful heartbeats: a Block Report, from which the metadata is updated
• When a fault is detected:
  • the NameNode lists, from the metadata table, the missing chunks
  • the NameNode instructs the Data Nodes to replicate the chunks that were on the faulty nodes (sketched below)

Backup NameNode
• updated every hour
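The re-replication step can be sketched as follows, assuming a metadata table mapping each chunk to the Data Nodes that hold it (an illustrative Python mock, not the NameNode's implementation):

    def chunks_to_replicate(meta, live_nodes, target=3):
        # For each chunk, drop dead holders and count the missing copies.
        work = {}
        for chunk, holders in meta.items():
            alive = [n for n in holders if n in live_nodes]
            if len(alive) < target:
                work[chunk] = (alive, target - len(alive))
        return work

    meta = {"A": ["DN1", "DN3", "DN5"], "B": ["DN2", "DN4", "DN7"]}
    live = {"DN1", "DN2", "DN4", "DN5", "DN7"}   # DN3 missed its heartbeats
    print(chunks_to_replicate(meta, live))       # {'A': (['DN1', 'DN5'], 1)}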
33. Anatomy of a MapReduce Execution
Figure: input files are split into chunks, processed by Mappers into intermediate results (optionally combined), then partitioned, shuffled and merged by Reducers into the final output files (Output1, Output2, Output3).

• Distribution:
  • split the input file into chunks
  • distribute the chunks to the DFS
34. Anatomy of a MapReduce Execution
(same figure as above)

• Map:
  • Mappers execute the Map function on each chunk
  • they compute the intermediate results (i.e., list(k, v))
  • optionally, a Combine function is executed to reduce the size of the IR
35. Anatomy of a MapReduce Execution
(same figure as above)

• Partition:
  • intermediate results are sorted
  • IR are partitioned; each partition corresponds to a reducer
  • possibly with a user-provided partition function
36. Anatomy of a MapReduce Execution
(same figure as above)

• Shuffle:
  • references to IR partitions are passed to their corresponding reducer
  • Reducers access the IR partitions using remote-read RPCs
37. Anatomy of a MapReduce Execution
(same figure as above)

• Reduce:
  • IR are sorted and grouped by their intermediate keys
  • Reducers apply the reduce function to the set of IR they have received
  • Reducers write the output results
38. Anatomy of a MapReduce Execution
(same figure as above)

• Combine:
  • the output of the MapReduce computation is available as R output files
  • optionally, the outputs are combined into a single result or passed as input to another MapReduce computation
39. Return to WordCount: Application-Level Optimization
The Combiner Optimization
• How to decrease the amount of information exchanged between mappers and reducers?
→ Combiner: apply the reduce function to the map output before it is sent to the reducers
// Pseudo-code for the Combine function
combine(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int partial_word_count = 0;
    for each v in values:
        partial_word_count += ParseInt(v);
    Emit(key, AsString(partial_word_count));
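A quick way to see the benefit: with a combiner, each mapper ships one pair per distinct word instead of one pair per word occurrence. A small illustrative Python sketch:

    from collections import Counter

    def map_with_combiner(contents):
        # Combine locally: one (word, partial_count) pair per distinct word.
        return list(Counter(contents.split()).items())

    chunk = "to be or not to be"
    print(map_with_combiner(chunk))
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]: 4 pairs are shuffled
    # instead of the 6 (word, 1) pairs the plain mapper would emit.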
40. Let’s go further: Word Frequency
• Input: a large number of text documents
• Task: compute the word frequency across all the documents
→ the frequency is computed using the total word count
• A naive solution chains two MapReduce programs:
  • MR1 counts the total number of words (using combiners)
  • MR2 counts the occurrences of each word and divides them by the total number of words provided by MR1
• How to turn this into a single MR computation? Two ingredients:
  1. Property: reduce keys are processed in sorted order
  2. A function EmitToAllReducers(k, v)
• Map tasks send their local total word count to ALL the reducers under the key ""
• the key "" is the first key processed by each reducer
• the sum of its values gives the total number of words
41. Word Frequency Mapper and Reducer
// Pseudo-code for the Map function
map(String key, String value):
    // key: document name, value: document contents
    int word_count = 0;
    for each word in split(value):
        EmitIntermediate(word, "1");
        word_count++;
    EmitIntermediateToAllReducers("", AsString(word_count));

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    if (key == ""):
        int total_word_count = 0;
        for each v in values:
            total_word_count += ParseInt(v);
    else:
        int word_count = 0;
        for each v in values:
            word_count += ParseInt(v);
        Emit(key, AsString(word_count / total_word_count));
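A minimal Python simulation of this single-pass scheme: one dictionary stands in for all the reducers, and iterating over the keys in sorted order reproduces the ordering property, so "" is processed before any real word (an illustrative sketch, not Hadoop code):

    from collections import defaultdict

    def word_frequency(documents):
        intermediate = defaultdict(list)
        for contents in documents:
            words = contents.split()
            for w in words:
                intermediate[w].append(1)
            intermediate[""].append(len(words))  # EmitToAllReducers("", count)
        total, freqs = 0, {}
        for key in sorted(intermediate):         # "" sorts before any word
            if key == "":
                total = sum(intermediate[key])
            else:
                freqs[key] = sum(intermediate[key]) / total
        return freqs

    print(word_frequency(["a b a", "b c"]))      # {'a': 0.4, 'b': 0.4, 'c': 0.2}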
42. Subsection 3
Key Features of MapReduce Runtime Environments
43. Handling Failures
On very large clusters made of commodity servers, failures are not rare events.

Fault-tolerant Execution
• Workers' failures are detected by heartbeat
• In case of a machine crash:
  1. in-progress Map and Reduce tasks are marked as idle and re-scheduled to another alive machine
  2. completed Map tasks are also rescheduled
  3. all reducers are warned of the Mapper's failure
• Result replica "disambiguation": only the first result is considered
• A simple FT scheme, yet it handles the failure of large numbers of nodes.
44. Scheduling
Data Driven Scheduling
• Data-chunk replication
  • GFS divides files into 64MB chunks and stores 3 replicas of each chunk on different machines
  • a map task is scheduled first to a machine that hosts a replica of its input chunk
  • second, to the nearest machine, i.e., one on the same network switch
• Over-partitioning of the input file
  • #input chunks ≫ #worker machines
  • #mappers > #reducers

Speculative Execution
• Backup tasks to alleviate the slowdown caused by stragglers:
  • straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in a computation
  • backup task: at the end of the computation, the scheduler replicates the remaining in-progress tasks
  • the result is provided by the first task that completes
45. Speculative Execution in Hadoop
• When a node is idle, the scheduler selects a task in priority order:
  1. failed tasks have the highest priority
  2. then non-running tasks with data local to the node
  3. then speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of input read
  • Reduce task:
    • considers 3 phases: the copy phase, the sort phase, the reduce phase
    • in each phase, the score is the fraction of the data processed. Example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 × 1/2) = 5/6
  • straggler: a task whose progress score is 0.2 less than the average
48. Issues with the Hadoop Scheduler in Heterogeneous Environments
Hadoop makes several assumptions :
• nodes are homogeneous
• tasks are homogeneous
which leads to bad decisions :
• Too many backups, thrashing shared resources like network bandwidth
• Wrong tasks backed up
• Backups may be placed on slow nodes
• Breaks when tasks start at different times
49. The LATE Scheduler
The LATE Scheduler (Longest Approximate Time to End)
• estimates task completion times and backs up the task with the longest approximate time to end
• throttles backup execution:
  • limits the number of simultaneous backup tasks (SpeculativeCap)
  • the slowest nodes are not scheduled backup tasks (SlowNodeThreshold)
  • only tasks that are sufficiently slow are backed up (SlowTaskThreshold)
• Estimating the task completion time:

  progress rate = progress score / execution time
  estimated time left = (1 − progress score) / progress rate

In practice, SpeculativeCap = 10% of the available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and task progress rate.
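The heuristic is easy to state in code. A sketch of the estimator under the slide's definitions (illustrative; not the actual Hadoop/LATE implementation):

    def estimated_time_left(progress_score, elapsed):
        rate = progress_score / elapsed          # progress rate
        return (1.0 - progress_score) / rate     # estimated time left

    # Of two running tasks, back up the one estimated to finish last:
    tasks = {"t1": (0.66, 120.0), "t2": (0.20, 90.0)}  # (score, seconds elapsed)
    late = max(tasks, key=lambda t: estimated_time_left(*tasks[t]))
    print(late)   # 't2': 360s left vs ~62s for 't1'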
Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. in OSDI’08: Sixth
Symposium on Operating System Design and Implementation, San Diego, CA, December, 2008.
50. Progress Rate Example

Figure: three nodes run tasks over a 2-minute window; Node 1 progresses at 1 task/min, Node 2 is 3x slower and Node 3 is 1.9x slower.
51. LATE Example
Figure: after 2 minutes, Node 2 is at progress = 66% with estimated time left (1 − 0.66) / (1/3) = 1, while Node 3 is at progress = 5.3% with estimated time left (1 − 0.05) / (1/1.9) = 1.8. LATE correctly picks Node 3 for backup.
52. LATE Performance (EC2 Sort benchmark)
• Stragglers: emulated by running dd on 4 nodes
  → average 58% speedup over Hadoop's scheduler, 220% over no backups
• Heterogeneity: 243 VMs on 99 hosts (1-7 VMs per host); each host sorted 128MB, for a total of 30GB of data
  → average 27% speedup over Hadoop's scheduler, 31% over no backups

Figure: worst, best and average response times of the EC2 Sort benchmark with and without stragglers, comparing no backups, the Hadoop native scheduler and the LATE scheduler.
53. Understanding Outliers
• Stragglers: tasks that take ≥ 1.5× the median task duration in their phase
• Re-computes: when a task's output is lost, dependent tasks wait until the output is regenerated

Figure: prevalence of outliers. (a) What fraction of tasks in a phase are outliers (CDF of the fraction of high-runtime and recompute outliers)? (b) How much longer do outliers take to finish (CDF of the ratio of straggler duration to the duration of the median task)?

Causes of outliers
• Data skew: a few keys correspond to a lot of data; such tasks (longer than the median task) are well explained by the amount of data they process. Duplicating them would not make them run faster and would waste resources.
• Network contention: 70% of the cross-rack traffic is induced by reduce tasks
• Bad or busy machines: half of the recomputes happen on 5% of the machines
54. The Mantri Scheduler

• cause-aware and resource-aware scheduling
• real-time monitoring of tasks

Figure: percentage speed-up of job completion time in the ideal case where (some combination of) outliers do not occur.

Figure: the outlier problem, causes and solutions. Causes: unequal division of work (data skew), contention for resources, input becoming unavailable. Solutions: start the tasks that do more work first, network-aware placement, duplicate or kill-and-restart outlier tasks, replicate outputs, pre-compute.

By inducing high variability in repeat runs of the same job, outliers make it hard to meet SLAs: the ratio of stdev to mean in job completion time is high, i.e., jobs have a non-trivial probability of taking twice as long or finishing half as quickly.

Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha and E. Harris. In OSDI'10: Ninth Symposium on Operating System Design and Implementation, Vancouver, Canada, December 2010.
55. Resource Aware Restart

• Restart outlier tasks on better machines
• Two options: kill & restart, or duplicate

Figure: a stylized example illustrating the main ideas (baseline, duplicate, and kill & restart schedules); tasks that are eventually killed are filled with stripes, repeat instances of a task with a lighter mesh.

• restart or duplicate a task if P(t_new < t_rem) is high
• restart (kill) if the remaining time is large: t_rem > E(t_new) + Δ
• duplicate only if it decreases the total amount of resources used: P(t_rem > t_new · (c+1)/c) > δ, where c is the number of copies already running and δ = 0.25
• restart tasks early rather than near completion, to avoid being stuck with the large ones
• a subtle point with outlier mitigation: reducing the completion time of a task may in fact increase the job completion time; for example, replicating the output of every task would drastically reduce recomputations, but waste resources
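Putting the rules together, the restart-vs-duplicate decision can be sketched as below, assuming estimates of the remaining time t_rem and of a fresh run t_new are available. The probabilistic tests are replaced by their expected values, and the thresholds are assumed for illustration; this is not Mantri's actual code:

    DELTA = 30.0    # safety margin (seconds); assumed value
    GAMMA = 3.0     # assumed factor meaning "t_rem is much larger than E[t_new]"

    def mitigate_outlier(t_rem, t_new_mean, copies):
        if t_rem > GAMMA * t_new_mean + DELTA:
            return "kill-and-restart"     # remaining time is large
        if t_rem > t_new_mean * (copies + 1) / copies:
            return "duplicate"            # a copy lowers total resource usage
        return "leave-alone"              # close to completion, not worth it

    print(mitigate_outlier(t_rem=400, t_new_mean=60, copies=1))  # kill-and-restart
    print(mitigate_outlier(t_rem=150, t_new_mean=60, copies=1))  # duplicate
    print(mitigate_outlier(t_rem=90,  t_new_mean=60, copies=1))  # leave-alone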
56. Network Aware Placement
• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement:
  • does not track bandwidth changes
  • does not require global coordination
• Placement of a reduce phase with n tasks, running on a cluster with r racks, where I_{n,r} is the input data matrix:
  • for rack i, let d_u^i and d_v^i be the data to be uploaded and downloaded, and b_u^i and b_d^i the corresponding available bandwidths
  • the chosen placement is arg min_i max(d_u^i / b_u^i, d_v^i / b_d^i)
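The arg-min rule translates directly into code; a small Python sketch with made-up bandwidth and data-volume numbers (illustrative only):

    def place_reduce_phase(racks):
        # racks: rack -> (d_up, d_down, b_up, b_down) in bytes and bytes/sec.
        # Pick the rack whose worst-case transfer time is smallest.
        def cost(r):
            d_up, d_down, b_up, b_down = racks[r]
            return max(d_up / b_up, d_down / b_down)
        return min(racks, key=cost)

    racks = {
        "rack1": (8e9, 2e9, 1e8, 1e8),   # uploads dominate: max(80s, 20s)
        "rack2": (1e9, 6e9, 1e8, 1e8),   # downloads dominate: max(10s, 60s)
    }
    print(place_reduce_phase(racks))     # 'rack2' (60s < 80s)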
57. Mantri Performance

• Speeds up the median job by 55%
• Network-aware placement reduces the completion time of the typical reduce phase by 31%, and of half of the reduce phases by more than 60%

Figure: (a) change in completion time and (b) change in resource usage of Mantri versus the baseline implementation, on a production cluster of thousands of servers, for four representative jobs.

Figure: comparing Mantri's network-aware spread of tasks with the baseline implementation on a production cluster of thousands of servers.

Figure: comparing straggler mitigation strategies, per phase and per job; Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes.
58. Subsection 2
Extending MapReduce Application Range
59. !"#$%&'()(*%&'+,-&(.+/0&123&(
Twister: Iterative MapReduce
! !"#$"%&$"'%('%(#$)$"&()%*(
+),")-./(*)$)(
! (0'%123,)-./(.'%2(,3%%"%2(
4&)&5/)-./6(7)89,/*3&/($)#:#((
! ;3-9#3-(7/##)2"%2(-)#/*(
&'773%"&)$"'%9*)$)($,)%#</,#((
! =>&"/%$(#388',$(<',(?$/,)$"+/(
@)8A/*3&/(&'783$)$"'%#(
4/B$,/7/.C(<)#$/,($5)%(D)*''8(
',(!,C)*9!,C)*E?FG6((
! 0'7-"%/(85)#/($'(&'../&$()..(
,/*3&/('3$83$#((
! !)$)()&&/##(+")(.'&).(*"#:#(
E"25$H/"25$(4IJKLL(."%/#('<(
M)+)(&'*/6((
! N388',$(<',($C8"&).(@)8A/*3&/(
&'783$)$"'%#(O''.#($'(7)%)2/(
*)$)((
Ekanayake, Jaliya, et al. Twister: A Runtime for Iterative MapReduce. International Symposium on High Performance Distributed Computing (HPDC 2010).
60. Moon: MapReduce on Opportunistic Environments
• Considers volatile resources during their idle time
• Considers the Hadoop runtime
→ Hadoop does not tolerate massive failures
• The MOON approach:
  • consider hybrid resources: very volatile nodes plus more stable nodes
  • two queues are managed for speculative execution: frozen tasks and slow tasks
  • 2-phase task replication: the job is separated into a normal and a homestretch phase
→ the job enters the homestretch phase when the number of remaining tasks < H% of the available execution slots

Lin, Heshan, et al. MOON: MapReduce On Opportunistic eNvironments. International Symposium on High Performance Distributed Computing (HPDC 2010).
61. Towards Greener Data Center
• Sustainable computing infrastructures minimize the impact on the environment:
  • reduction of energy consumption
  • reduction of greenhouse gas emissions
  • better use of renewable energy (smart grids, co-location, etc.)

Figure: solar array at the Apple Maiden data center: 100 acres (400,000 m²), 20 MW peak.
62. Section 4
MapReduce beyond the Data Center
63. Introduction to Computational Desktop Grids
High!"#$$%&'%&()$(*$&+&,)-%&'./"'#0)1%*"-&
Throughput Computing
over2%"-/00%$-Idle
Large Sets of
Desktop Computers
! 34"54%"&$%-&2*#--)0(%-&'%&
()$(*$&'%&67-&2%0')01&&
$%*"&1%82-&'.#0)(1#9#15
• Volunteer Computing Systems
" :/$&'%&(;($%&<&5(/0/8#-%*"&
(SETI@Home, Distributed.net)
'.5(")0
• Open Source Desktop Grid
"
)22$#()1#/0-&2)")$$=$%-&'%&4")0'%&
(BOINC, XtremWeb, Condor,
1)#$$%
OurGrid)
" >"4)0#-)1#/0&'%&?(/0(/*"-@&'%&
2*#--)0(%
• Some numbers with BOINC
" A%-&*1#$#-)1%*"-&'/#9%01&2/*9/#"&
(October 21, 2013)
8%-*"%"&$%*"&(/01"#,*1#/0
• Active: 0.5M users, 1M
5(/0/8#%&'*&'/0
hosts.
• 24-hour average: 12
PetaFLOPS.
• EC2 equivalent : 1b$/year
64. MapReduce over the Internet
Computing on Volunteers' PCs (à la SETI@Home)
• Potential aggregate computing power and storage
  • average PC: 1 TFlops, 4GB RAM, 1TB disk
  • 1 billion PCs, 100 million GPUs + tablets + smartphones + set-top boxes
  • hardware and electrical power are paid for by the consumers
  • self-maintained by the volunteers
But ...
• high number of resources
• low performance
• volatility
• owned by volunteers
• scalable, but mainly for embarrassingly parallel applications with few I/O requirements
• the challenge is to broaden the application domain
66. Subsection 1
Background on BitDew
67. Large Scale Data Management
BitDew : a Programmable Environment for Large Scale Data
Management
Key Idea 1: provides an API and a runtime environment which integrates
several P2P technologies in a consistent way
Key Idea 2: relies on metadata (Data Attributes) to transparently drive data management operations: replication, fault tolerance, distribution, placement, life-cycle.

BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello. International Conference for High Performance Computing (SC'08), Austin, 2008.
68. BitDew : the Big Cloudy Picture
• Aggregates storage into a single Data Space:
  • clients put and get data from the data space
  • clients define data attributes
• Distinguishes service nodes (stable) from client and Worker nodes (volatile)
  • Service: ensures fault tolerance, indexing and scheduling of data to Worker nodes
  • Worker: stores data on Desktop PCs
• push/pull protocol between client → service ← Worker

Figure: a client puts/gets data (e.g., with REPLICA = 3) into the Data Space; Reservoir (Worker) nodes pull data from the service nodes.
70. Data Attributes
REPLICA: indicates how many occurrences of a datum should be available at the same time in the system
RESILIENCE: controls the resilience of data in the presence of machine crashes
LIFETIME: a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
AFFINITY: drives the movement of data according to dependency rules
TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
DISTRIBUTION: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node
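To give a flavour of attribute-driven data management, here is a toy Python mock of a scheduler acting on REPLICA and AFFINITY metadata. Note this is not the BitDew API; the class and function names are invented for illustration:

    class Attribute:
        def __init__(self, replica=1, resilient=False, affinity=None):
            self.replica = replica        # copies kept alive in the system
            self.resilient = resilient    # reschedule after a machine crash
            self.affinity = affinity      # place near this other datum

    def schedule(data, attrs, placement, workers):
        # Toy scheduler: honour REPLICA and AFFINITY for each datum.
        for d in data:
            a = attrs[d]
            if a.affinity and a.affinity in placement:
                placement[d] = list(placement[a.affinity])   # follow its datum
            else:
                placement[d] = workers[:a.replica]           # fan out copies
        return placement

    attrs = {"input1": Attribute(replica=3),
             "map_out1": Attribute(affinity="input1")}
    print(schedule(["input1", "map_out1"], attrs, {}, ["w1", "w2", "w3", "w4"]))
    # {'input1': ['w1', 'w2', 'w3'], 'map_out1': ['w1', 'w2', 'w3']}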
72. Anatomy of a BitDew Application
• Clients
  1. create Data and Attributes
  2. put a file into the created data slot
  3. schedule the data with their corresponding attribute
• Reservoirs
  • are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and Reservoirs can both create data and install callbacks

Figure: the client creates, puts and schedules data through the service nodes' Data Space (dr/dt, dc/ds); Reservoir nodes pull the data and receive onDataScheduled events.
74. Examples of BitDew Applications
• Data-Intensive Applications
  • DI-BOT: data-driven master-worker Arabic character recognition (M. Labidi, University of Sfax)
  • MapReduce (X. Shi, L. Lu, HUST, Wuhan, China)
  • framework for distributed result checking (M. Moca, G. S. Silaghi, Babes-Bolyai University of Cluj-Napoca, Romania)
• Data Management Utilities
  • file sharing for social networks (N. Kourtellis, Univ. Florida)
  • distributed checkpoint storage (F. Bouabache, Univ. Paris XI)
  • Grid data staging (IEP, Chinese Academy of Science)
75. Experiment Setup
Hardware Setup
• Experiments are conducted in a controlled environment (a cluster for the microbenchmarks and Grid5K for the scalability tests) as well as on an Internet platform
• GdX cluster: a 310-node cluster of dual-Opteron CPUs (AMD-64, 2GHz, 2GB SDRAM), with a 1Gb/s switched Ethernet network
• Grid5K: an experimental Grid of 4 clusters, including GdX
• DSL-lab: a low-power, low-noise cluster hosted by real Internet users on DSL broadband networks; 20 Mini-ITX nodes (Pentium-M 1GHz, 512MB SDRAM, 2GB Flash storage)
• Each experiment is averaged over 30 measures
77. Collective File Transfer with Replication
Scenario: all-to-all file transfer with replication
• {d1, d2, d3, d4} is created, where each datum has the REPLICA attribute set to 4
• the all-to-all collective is implemented with the AFFINITY attribute; each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1
• during the collective, file transfers are performed concurrently, and the data size is 5MB

Results: experiments on DSL-Lab
• one run every hour during approximately 12 hours
• Figure: Cumulative Distribution Function (CDF) of the measured bandwidth (roughly 150 to 450 KBytes/sec) for the all-to-all collective
78. Multi-sources and Multi-protocols File Distribution
Scenario
• 4 clusters of 50 nodes each, each cluster hosting its own data repository (DC/DR)
• first phase: a client uploads a large datum to the 4 data repositories
• second phase: the cluster nodes get the datum from the data repository of their own cluster
We vary:
1. the number of data repositories, from 1 to 4
2. the protocol used to distribute the data (BitTorrent vs FTP)

Figure: clusters C1-C4 with data repositories dr1/dc1 to dr4/dc4; the 1st phase uploads, the 2nd phase downloads.
79. Multi-sources and Multi-protocols File Distribution
Table: throughput (MB/s) on the orsay, nancy, bordeaux and rennes sites (and their mean) for BitTorrent and FTP, in the 1st phase (upload) and 2nd phase (download), with the 1DR/1DC, 2DR/2DC and 4DR/4DC configurations.

• for both protocols, increasing the number of data sources also increases the available bandwidth
• FTP performs better in the first phase, and BitTorrent gives superior performance in the second phase
• the best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second phase) takes 26.46 seconds and outperforms the best single-data-source, single-protocol (BitTorrent) configuration by 168.6%
80. Subsection 2
MapReduce for Desktop Grid Computing
81. Challenge of MapReduce over the Internet
(the usual MapReduce dataflow: data split → Mappers → intermediate results → Reducers → output files)

On Desktop Grids:
• there is no shared file system and no direct communication between hosts
• faults and host churn must be handled
• data replicas must be managed
• intermediate data need result certification
• collective operations are required (scatter + gather/reduction)

Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He and G. Fedak. In Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.
82. Implementing MapReduce over BitDew (1/2)
Initialisation by the Master
1. Data slicing: the Master creates a DataCollection DC = d1, ..., dn, a list of data chunks sharing the same MapInput attribute
2. It creates reducer tokens, each with its own TokenReducX attribute (affinity set to the Reducer)
3. It creates and pins collector tokens with the TokenCollector attribute (affinity set to the Collector Token)

Algorithm 2: Execution of Map tasks
Require: Map, the Map function to execute; M, the number of Mappers, and m, a single mapper; d_m = d_{1,m}, ..., d_{k,m}, the set of map input data received by worker m
  {on the Master node}
  create a single data MapToken with affinity set to DC
  {on the Worker node}
  if MapToken is scheduled then
    for all data d_{j,m} in d_m do
      execute Map(d_{j,m}); create list_{j,m}(k, v)
    end for
  end if

1. When a Worker receives the MapToken, it launches the corresponding MapTasks and computes the intermediate values list_{j,m}(k, v).
83. Implementing MapReduce over BitDew (2/2)

Shuffling intermediate results (Algorithm 3)
Require: M, the number of Mappers, and m, a single worker; R, the number of Reducers; list_m(k, v), the set of key/value pairs of intermediate results on worker m
  {on the Master node}
  for all r in [1, ..., R] do
    create Attribute ReduceAttr_r with distrib = 1
    create data ReduceToken_r
    schedule data ReduceToken_r with attribute ReduceTokenAttr
  end for
  {on the Worker node}
  split list_m(k, v) into if_{m,1}, ..., if_{m,r} intermediate files
  for all files if_{m,r} do
    create reduce input data ir_{m,r} and upload if_{m,r}
    schedule data ir_{m,r} with affinity = ReduceToken_r
  end for

1. The Worker partitions the intermediate results according to the partition function and the number of reducers, and creates ReduceInput data with the attribute ReduceToken_r. The ReduceTokenAttr carries the distrib=1 flag, which ensures a fair distribution of the workload between reducers. The Shuffle phase is then simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag equals the corresponding ReduceToken.

Execution of Reduce tasks (Algorithm 4)
1. When a reducer (the worker holding ReduceToken_r) receives ReduceInput files ir_{m,r}, it launches the corresponding ReduceTask Reduce(ir_{m,r}) on the (k, list(v)) pairs.
2. When all the ir_r have been processed, the reducer combines the results into a single final result o_r, created with affinity = MasterToken.
3. Eventually, the final results can be sent back to the Master: the Master creates a MasterToken data that is pinAndSchedule'd, i.e., known to the BitDew scheduler but never sent to any other node; data scheduled with an affinity tag set to MasterToken (here the o_r) are therefore sent to the Master, which can combine them into a single result.

Desktop Grid features
• The worker is multi-threaded (Mapper, Reducer and ReducePutter threads communicating through queues Q1, Q2 and Q3; see Figure 3), overlapping communication with computation and prefetching data: several concurrent file transfers can proceed even over a synchronous protocol such as HTTP, and Map or Reduce tasks are enqueued as soon as their transfers finish. The maximum numbers of concurrent Map and Reduce tasks can be configured, as well as the minimum number of queued tasks before computation starts. Prefetching is natural in MapReduce, since data are staged on the distributed file system before computations are launched.
• Collective file operations structure the processing, similarly to collective communications in parallel programming (MPI): the Distribution, i.e., the initial distribution of file chunks (similar to MPI_Scatter), and the Shuffle, i.e., the redistribution of intermediate results between the Mapper and Reducer nodes (similar to MPI_Alltoall).
84. Handling Communications
Latency Hiding
• Multi-threaded workers overlap communication and computation.
• The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations can start.
Barrier-free computation
• Reducers detect duplication of intermediate results (which happens because of faults and/or lags).
• Early reduction: IR are processed as they arrive → this allowed us to remove the barrier between Map and Reduce tasks (see the sketch below).
• But ... the IR are then not sorted.
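For an associative and commutative reduce function such as a sum, early reduction can be sketched as follows: intermediate results are folded in as they arrive, and duplicates (caused by faults or lags) are filtered by identifier, which is why the lack of sorting does not matter here (illustrative Python sketch):

    def early_reduce(incoming, combine=lambda a, b: a + b):
        # Fold intermediate results as they stream in; ignore duplicates.
        seen, totals = set(), {}
        for ir_id, key, value in incoming:
            if ir_id in seen:
                continue          # duplicate from a fault or a lagging mapper
            seen.add(ir_id)
            totals[key] = combine(totals[key], value) if key in totals else value
        return totals

    stream = [("m1-0", "the", 2), ("m2-0", "the", 1), ("m1-0", "the", 2)]
    print(early_reduce(stream))   # {'the': 3}; the duplicate m1-0 is ignored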
85. Scheduling and Fault-tolerance
2-level scheduling
1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
2. Workers periodically report the state of their ongoing computations to the MR-scheduler running on the master node.
3. The MR-scheduler determines whether there are more nodes available than tasks to execute, which can avoid the lagger effect.
Fault-tolerance
• In Desktop Grids, computing resources have high failure rates:
  → during the computation (execution of Map or Reduce tasks)
  → during the communication, i.e., file upload and download
• MapInput data and the ReduceToken tokens have the resilient attribute enabled.
• Because the intermediate results ir_{j,r}, j in [1, m], have their affinity set to ReduceToken_r, they are automatically re-downloaded by the node on which the ReduceToken data is rescheduled.
86. Data Security and Privacy
Distributed Result Checking
• Traditional DG or VC projects implement result checking on the server
• IR are too large to be sent back to the server → distributed result checking:
  • replicate the MapInput data and the Reducers
  • select correct results by majority voting: a reducer receives r_m versions of each intermediate file corresponding to a map input f_i; after receiving r_m/2 + 1 identical versions, it considers the result correct and accepts it as input for a Reduce task (failed results are ignored). To check the results produced by the reducers themselves, the master schedules the ReduceTokens with the attribute replica = r_r (r_r > 1).
• workers are assumed to sabotage independently of one another (colluding nodes are not considered)

Figure 2: dataflows in the MapReduce implementation: (a) no replication, (b) replication of the Map tasks (r_m = 3), (c) replication of both Map and Reduce tasks.

Ensuring Data Privacy
• use a hybrid infrastructure composed of private and public resources
• use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely
87. BitDew Collective Communication
Figure: execution time (s) of the Broadcast, Scatter and Gather collectives for a 2.7GB file on a varying number of nodes (1 to 128). The chunk size is 16MB, chosen so that there are enough chunks to measure the Scatter/Gather collectives.
88. MapReduce Evaluation
#&!!
0?6,@A?B@9:3C5D/4
#$!!
#!!!
'!!
(!!
&!!
$!!
!
&
'
a 2.7GB
Figure 6.
'
#(
%$
(&
*+,-./
#$'
$"(
"#$
Scalability evaluation on the WordCount application: the y axis presents
Figure: Scalability evaluationand the xWordCount application: the y from 1presents the
the throughput in MB/s on the axis the number of nodes varying axis to 512.
throughput in MB/s and the x axis the number of nodes varying from 1 to 512.
lective
measure
ks size
Table II
E VALUATION OF THE PERFORMANCE ACCORDING TO THE NUMBER OF M APPERS
AND R EDUCERS .
G. Fedak (INRIA, France)
#Mappers
4
8MapReduce
16
32
32
32
32
Salvador da Bahia, Brazil, 22/10/2013
80 / 84
89. Fault-Tolerance Scenario

Figure: fault-tolerance scenario (5 mappers, 2 reducers): crashes are injected during the download of a file (F1), during the execution of a map task (F2), and during the execution of the reduce task on the last intermediate result (F3).
90. Hadoop vs MapReduce/BitDew on Extreme Scenarios

Objective
• Evaluate MapReduce/BitDew against Hadoop under CPU heterogeneity, stragglers, host churn and network failures. Experiments run on Grid5K; heterogeneity is emulated with wrekavoc. GdX has dual-core nodes, so two workers run per node.

CPU heterogeneity and stragglers
• Heterogeneity is emulated by adjusting the CPU frequency of half of the nodes; a straggler is emulated by decreasing a node's CPU frequency to 10%. Both Hadoop and BitDew-MR launch backup tasks on idle machines when laggers slow down the group. With a growing number of stragglers, the BitDew-MR job makespan stays nearly constant, while Hadoop's grows and re-executes map tasks.

Host churn
• The MapReduce worker process is periodically killed on one node and a new one launched on another node. Hadoop fails whenever the churn interval is below 30 seconds: the JobTracker marks temporarily disconnected nodes as dead and blindly removes all the tasks completed on these nodes from the successful task list, so they must be re-executed, which significantly prolongs the job makespan. BitDew-MR is almost insensitive to churn (job makespan roughly between 357 and 457 seconds across churn intervals), because the intermediate outputs of completed map tasks are safely stored on the stable central storage.

Network fault tolerance
• Firewall and NAT rules temporarily turn down network links: a 25-second offline window is injected on worker nodes at different job progress points (12.5% to 100%), with the worker heartbeat timeout set to 30 seconds for both systems. Hadoop re-executes the tasks of the temporarily disconnected nodes; BitDew-MR workers can go temporarily offline without any performance penalty (no re-executed map tasks), since there is no inter-worker communication and all intermediate data are transferred through the resilient central storage.

• The overhead of transferring data to central storage makes this design most suitable for applications that generate few intermediate data, such as Word Count and Distributed Grep; to mitigate the transfer overhead, a storage cluster with large aggregated bandwidth, or emerging public cloud storage services, could be used.

• Related work: MOON (MapReduce On Opportunistic eNvironments) limits the system scale to a campus and assumes hybrid resources, provisioning a fixed fraction of dedicated stable computers to supplement the volatile personal computers.
92. Conclusion
• MapReduce Runtimes
  • an attractive programming model for data-intensive computing
  • many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling)
  • many others: security, QoS, iterative MapReduce, ...
• MapReduce beyond the Data Center:
  • enables large-scale data-intensive computing: Internet MapReduce
• Want to learn more?
  • home page for the Mini-Curso: http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
  • the MapReduce Workshop @ HPDC (http://graal.ens-lyon.fr/mapreduce)
  • Desktop Grid Computing, ed. C. Cerin and G. Fedak, CRC Press
  • our websites: http://www.bitdew.net and http://www.xtremweb-hep.org