MapReduce Runtime Environments
Design, Performance, Optimizations

Gilles Fedak
Many Figures imported from authors’ works
Gilles.Fedak@inria.fr
INRIA/University of Lyon, France

Escola Regional de Alto Desempenho - Região Nordeste

Outline of Topics
1. What is MapReduce ?
   • The Programming Model
   • Applications and Execution Runtimes
2. Design and Principles of MapReduce Execution Runtime
   • File Systems
   • MapReduce Execution
   • Key Features of MapReduce Runtime Environments
3. Performance Improvements and Optimizations
   • Handling Heterogeneity and Stragglers
   • Extending MapReduce Application Range
4. MapReduce beyond the Data Center
   • Background on BitDew
   • MapReduce for Desktop Grid Computing
5. Conclusion
Section 1
What is MapReduce ?

What is MapReduce ?

• Programming model for data-intensive applications
• Proposed by Google in 2004
• Simple, inspired by functional programming
  • the programmer simply defines Map and Reduce tasks
• Building block for other parallel programming tools
• Strong open-source implementation: Hadoop
• Highly scalable
  • accommodates large-scale clusters: faulty and unreliable resources

MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat, in OSDI’04:
Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.

Subsection 1
The Programming Model

MapReduce in Lisp

• (map f (list l1 . . . ln))
  • (map square '(1 2 3 4))
  • (1 4 9 16)
• (reduce + '(1 4 9 16))
  • (+ 1 (+ 4 (+ 9 16)))
  • 30

MapReduce a la Google

Map(k, v) → (k′, v′)
Group (k′, v′) pairs by k′
Reduce(k′, {v′1, . . . , v′n}) → v′′

Figure: Input → Mapper → group by key → Reducer → Output.

• Map(key, value) is run on each item in the input data set
  • emits a new (key, val) pair
• Reduce(key, vals) is run for each unique key emitted by the Map
  • emits the final output

Example: Word Count

• Input: a large number of text documents
• Task: compute the word count across all the documents
• Solution:
  • Mapper: for every word in a document, emits (word, 1)
  • Reducer: sums all occurrences of each word and emits (word, total_count)

WordCount Example
// Pseudo-code for the Map function
map(String key, String value):
  // key: document name
  // value: document contents
  foreach word in split(value):
    EmitIntermediate(word, "1");

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int word_count = 0;
  foreach v in values:
    word_count += ParseInt(v);
  Emit(key, AsString(word_count));
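For comparison, a minimal sketch of the same WordCount written against the Hadoop MapReduce Java API; the class names are illustrative, not from the original slides:

// Sketch of WordCount with the Hadoop Java API (illustrative).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // key: byte offset in the input split, value: one line of text
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);                    // emit (word, 1)
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable v : values) count += v.get();  // sum all occurrences
      context.write(key, new IntWritable(count));     // emit (word, total_count)
    }
  }
}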

Subsection 2
Applications and Execution Runtimes

MapReduce Applications Skeleton

• Read large data set
• Map: extract relevant information from each data item
• Shuffle and Sort
• Reduce : aggregate, summarize, filter, or transform
• Store the result
• Iterate if necessary

MapReduce Usage at Google

Jerry Zhao (MapReduce Group Tech lead @ Google)

MapReduce + Distributed Storage

Powerful when associated with a distributed storage system
• GFS (Google File System), HDFS (Hadoop File System)
The Execution Runtime

• The MapReduce runtime transparently handles:
  • Parallelization
  • Communications
  • Load balancing
  • I/O: network and disk
  • Machine failures
  • Stragglers (laggers)

MapReduce Compared

Jerry Zhao

Existing MapReduce Systems
• Google MapReduce
  • Proprietary and closed source
  • Closely linked to GFS
  • Implemented in C++ with Python and Java bindings
• Hadoop
  • Open source (Apache project)
  • Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly more)
  • Many side projects: HDFS, Pig, Hive, HBase, ZooKeeper
  • Used by Yahoo!, Facebook, Amazon and the Google/IBM research Cloud

Other Specialized Runtime Environments

• Hardware platforms
  • Multicore (Phoenix)
  • FPGA (FPMR)
  • GPU (Mars)
• Frameworks based on MapReduce
  • Iterative MapReduce (Twister)
  • Incremental MapReduce (Nephele, Online MR)
  • Languages (PigLatin)
  • Log analysis: In-situ MR
• Security
  • Data privacy (Airavat)
• MapReduce/Virtualization
  • Amazon EC2
  • CAM

Hadoop Ecosystem

Research Areas to Improve MapReduce Performance

• File System
• Higher Throughput, reliability, replication efficiency

• Scheduling
• data/task affinity, scheduling multiple MR apps, hybrid distributed computing

infrastructures
• Failure detections
• Failure characterization, failure detector and prediction

• Security
• Data privacy, distributed result/error checking

• Greener MapReduce
• Energy efficiency, renewable energy

Research Areas to Improve MapReduce Applicability

• Scientific Computing
• Iterative MapReduce, numerical processing, algorithmics

• Parallel Stream Processing
• language extension, runtime optimization for parallel stream processing,

incremental processing
• MapReduce on widely distributed data-sets
• MapReduce on embedded devices (sensors), data integration

CloudBlast: MapReduce and Virtualization for BLAST Computing

NCBI BLAST
• The Basic Local Alignment Search Tool (BLAST)
  • finds regions of local similarity between sequences
  • compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches

Hadoop + BLAST
• the stream of sequences is split by Hadoop
  • the input is split into chunks at newline boundaries; a StreamPatternRecordReader splits the chunks into sequences, passed as keys to the mapper (with an empty value)
• the Mapper compares each sequence to the gene database; the database is not chunked
• BLAST output is written to a local file; HDFS is used to merge the results (DFS merger)
• provides fault-tolerance and load balancing

CloudBlast: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. A. Matsunaga, M. Tsugawa and J. Fortes. (eScience'08), 2008.
CloudBlast Deployment and Performance

• Use ViNe to deploy Hadoop across several sites (24 Xen-based VMs at the University of Florida plus Xen-based VMs at the University of California, connected through ViNe into one distributed environment)
• Each VM contains Hadoop + BLAST + the database (3.5GB)
• Compared with mpiBLAST; the publicly available mpiBLAST can be further hand-tuned for higher performance

Results
• Virtualization overhead is barely noticeable when executing BLAST; VMs deliver performance comparable to physical machines
• Speedup graphs show no performance impact due to ViNe
• Scalability was evaluated with different numbers of nodes, processors, database fragments and input data sets; overall, CloudBLAST shows a small advantage over mpiBLAST
Section 2
Design and Principles of MapReduce Execution
Runtime

Subsection 1
File Systems

File System for Data-intensive Computing
MapReduce is associated with parallel file systems
• GFS (Google File System) and HDFS (Hadoop File System)
• Parallel, Scalable, Fault tolerant file system
• Based on inexpensive commodity hardware (failures)
• Allows storage and computation to be co-located on the same nodes

GFS/HDFS specificities
• Many files, large files (> x GB)
• Two types of reads
• Large stream reads (MapReduce)
• Small random reads (usually batched together)

• Write Once (rarely modified), Read Many
• High sustained throughput (favored over latency)
• Consistent concurrent execution

Example: the Hadoop Architecture
Distributed Computing (MapReduce) and Distributed Storage (HDFS)

Figure: data analytics applications sit on top of the Hadoop stack. Masters: the JobTracker (MapReduce) and the NameNode (HDFS). Slaves: each slave node runs a TaskTracker and a DataNode.
HDFS Main Principles
HDFS Roles
• Application
  • Splits a file into chunks (the more chunks, the more parallelism)
  • Asks HDFS to store the chunks
• Name Node
  • Keeps track of the file metadata (list of chunks)
  • Ensures the distribution and replication (3 by default) of chunks on Data Nodes
• Data Node
  • Stores file chunks
  • Replicates file chunks

Figure: an application writes chunks A, B, C of File.txt; the Name Node answers "Ok, use Data Nodes 2, 4, 6", and the chunks are stored and replicated across the Data Nodes.
Data locality in HDFS
Rack Awareness
• The Name Node groups Data Nodes into racks
  • rack 1: DN1, DN2, DN3, DN4
  • rack 2: DN5, DN6, DN7, DN8
• Assumption: network performance differs between in-rack and inter-rack communication
• Replication: keep 2 replicas in-rack and 1 in another rack (sketched below)
  • chunk A: DN1, DN3, DN5
  • chunk B: DN2, DN4, DN7
• Minimizes inter-rack communications
• Tolerates a full rack failure
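A minimal sketch of the placement policy as described on this slide (2 replicas in the writer's rack, 1 in another rack); the types and helper are hypothetical, not HDFS internals:

// Illustrative "2 in-rack + 1 off-rack" replica choice (hypothetical helper).
import java.util.ArrayList;
import java.util.List;

class RackAwarePlacement {
  // Pick 3 Data Nodes for a new chunk: two in the local rack, one in a remote rack.
  static List<String> chooseReplicas(List<String> localRackNodes,
                                     List<String> remoteRackNodes) {
    List<String> targets = new ArrayList<>();
    targets.add(localRackNodes.get(0));   // 1st replica: local rack
    targets.add(localRackNodes.get(1));   // 2nd replica: same rack (cheap in-rack copy)
    targets.add(remoteRackNodes.get(0));  // 3rd replica: another rack, survives rack failure
    return targets;
  }
}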
Writing to HDFS
Client

Data Node 1

Data Node 3

Data Node 5

Name Node

write File.txt in dir d
Writeable

For each Data chunk
AddBlock A
Data nodes 1, 3, 5 (IP, port, rack number)
initiate TCP connection
Data Nodes 3, 5
initiate TCP connection
Data Node 5
initiate TCP connection

Pipeline Ready
Pipeline Ready
Pipeline Ready

• The Name Node checks permissions and selects a list of Data Nodes
• A pipeline is opened from the Client to each Data Node (a client-side sketch follows)
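From the client's point of view, the whole pipeline is hidden behind a simple stream API; a minimal sketch using the standard Hadoop FileSystem client (path and contents are illustrative):

// Minimal HDFS write from a client; the Name Node picks the pipeline of Data Nodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("/d/File.txt"))) {
      out.writeBytes("some content");                // chunking and replication are transparent
    }
    fs.close();
  }
}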

Pipeline Write HDFS
Client

Data Node 1

Data Node 3

Data Node 5

Name Node

For each Data chunk
Send Block A
Block Received A
Send Block A
Block Received A
Send Block A
Block Received A
Success
Close TCP connection
Success
Close TCP connection
Success
Close TCP connection
Success

• Data Nodes send and receive chunks concurrently
• Blocks are transferred one by one
• The Name Node updates the metadata table (e.g., chunk A → DN1, DN3, DN5)
Fault Resilience and Replication

NameNode
• Ensures fault detection
• Heartbeat is sent by DataNode to the NameNode every 3 seconds
• After 10 successful heartbeats : Block Report
• Meta data is updated

• When a fault is detected
  • the NameNode lists the missing chunks from the metadata table
  • the NameNode instructs the Data Nodes to replicate the chunks that were on the faulty nodes (a detection sketch follows below)

Backup NameNode
• updated every hour
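The detection side boils down to a missed-heartbeat check; a minimal illustrative sketch (the class, the map layout and the 10-heartbeat budget are assumptions, not the actual HDFS code — the real timeout is configurable):

// Illustrative heartbeat-based failure detection (hypothetical helper).
import java.util.Map;

class FailureDetector {
  static final long HEARTBEAT_PERIOD_MS = 3_000;                // DataNode heartbeat every 3 s
  static final long DEAD_AFTER_MS = 10 * HEARTBEAT_PERIOD_MS;   // assumed missed-heartbeat budget

  // lastSeen: DataNode id -> timestamp of its last heartbeat
  static boolean isDead(Map<String, Long> lastSeen, String dataNode, long now) {
    Long t = lastSeen.get(dataNode);
    return t == null || now - t > DEAD_AFTER_MS;
  }
  // On detection, the NameNode re-replicates every chunk the dead node was holding.
}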

Subsection 2
MapReduce Execution

Anatomy of a MapReduce Execution
Figure: the input files are split into chunks; Mappers compute intermediate results (optionally combined locally); intermediate results are shuffled to Reducers, which produce the output files (Output1, Output2, Output3) forming the final output.

• Distribution :
• Split the input file in chunks
• Distribute the chunks to the DFS

Anatomy of a MapReduce Execution

• Map :
  • Mappers execute the Map function on each chunk
  • Compute the intermediate results (i.e., list(k, v))
  • Optionally, a Combine function is executed to reduce the size of the IR

Anatomy of a MapReduce Execution

• Partition :
  • Intermediate results are sorted
  • IR are partitioned; each partition corresponds to a reducer
  • Possibly a user-provided partition function (see the sketch below)
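In Hadoop, the user-provided partition function is a Partitioner; a minimal sketch (hash partitioning on the key, which is also Hadoop's default behavior):

// Custom partition function: decides which reducer receives each intermediate key.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Same key always maps to the same reducer; the mask keeps the value non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
// Registered on the job with job.setPartitionerClass(WordPartitioner.class).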

Anatomy of a MapReduce Execution

• Shuffle :
  • References to the IR partitions are passed to their corresponding reducer
  • Reducers access the IR partitions using remote-read RPCs

Anatomy of a MapReduce Execution

• Reduce:
  • IR are sorted and grouped by their intermediate keys
  • Reducers execute the reduce function on the set of IR they have received
  • Reducers write the output results

Anatomy of a MapReduce Execution

• Combine :
  • The output of the MapReduce computation is available as R output files
  • Optionally, the outputs are combined into a single result or passed as a parameter to another MapReduce computation
Return to WordCount: Application-Level Optimization
The Combiner Optimization
• How to decrease the amount of information exchanged between mappers

and reducers ?
→ Combiner : Apply reduce function to map output before it is sent to
reducer
// Pseudo-code for the Combine function
combine(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int partial_word_count = 0;
  foreach v in values:
    partial_word_count += ParseInt(v);
  Emit(key, AsString(partial_word_count));
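In Hadoop the combiner is simply plugged into the job; because WordCount's reduce is associative, the reducer class itself can serve as the combiner. A minimal driver sketch (paths and class names are illustrative, reusing the earlier WordCount sketch):

// Job driver: the combiner runs on map output before the shuffle.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenMapper.class);
    job.setCombinerClass(WordCount.SumReducer.class);  // local pre-aggregation of (word, 1) pairs
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}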

Let’s go further: Word Frequency
• Input : Large number of text documents
• Task: Compute word frequency across all the documents
→ Frequency is computed using the total word count

• A naive solution relies on chaining two MapReduce programs
  • MR1 counts the total number of words (using combiners)
  • MR2 counts the occurrences of each word and divides them by the total number of words provided by MR1
• How to turn this into a single MR computation?
  1. Property: reduce keys are ordered
  2. Function EmitToAllReducers(k, v)
• Map tasks send their local total word count to ALL the reducers with the key ""
• The key "" is the first key processed by each reducer
• The sum of its values gives the total number of words

Word Frequency Mapper and Reducer
// Pseudo-code for the Map function
map(String key, String value):
  // key: document name, value: document contents
  int word_count = 0;
  foreach word in split(value):
    EmitIntermediate(word, "1");
    word_count++;
  EmitIntermediateToAllReducers("", AsString(word_count));

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word, values: a list of counts
  if (key == ""):
    int total_word_count = 0;
    foreach v in values:
      total_word_count += ParseInt(v);
  else:
    int word_count = 0;
    foreach v in values:
      word_count += ParseInt(v);
    Emit(key, AsString(word_count / total_word_count));

Subsection 3
Key Features of MapReduce Runtime Environments

Handling Failures

On very large clusters made of commodity servers, failures are not a rare event.

Fault-tolerant Execution
• Workers' failures are detected by heartbeat
• In case of a machine crash:
  1. In-progress Map and Reduce tasks are marked as idle and re-scheduled to another alive machine
  2. Completed Map tasks are also rescheduled (their output resides on the local disk of the failed machine)
  3. All reducers are warned of the Mapper's failure
• Result replica "disambiguation": only the first result is considered
• A simple FT scheme, yet it handles the failure of a large number of nodes.

Scheduling
Data Driven Scheduling
• Data-chunk replication
  • GFS divides files into 64MB chunks and stores 3 replicas of each chunk on different machines
• A map task is scheduled first to a machine that hosts a replica of its input chunk; second, to the nearest machine, i.e. on the same network switch
• Over-partitioning of the input file
  • #input chunks >> #worker machines
  • #mappers > #reducers

Speculative Execution
• Backup tasks alleviate the slowdown caused by stragglers:
  • straggler: a machine that takes an unusually long time to complete one of the few last map or reduce tasks of a computation
  • backup task: at the end of the computation, the scheduler replicates the remaining in-progress tasks
  • the result is provided by the first task that completes

Speculative Execution in Hadoop

• When a node is idle, the scheduler selects a task in priority order:
  1. failed tasks (highest priority)
  2. non-running tasks with data local to the node
  3. speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of input read
  • Reduce task:
    • considers 3 phases: the copy phase, the sort phase and the reduce phase
    • in each phase, the score is the fraction of the data processed. Example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 × 1/2) = 5/6
• Straggler: a task whose progress score is 0.2 below the average progress score (see the sketch below)
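The scoring rule can be captured in a few lines; a sketch following the rule on this slide, not Hadoop's actual scheduler code:

// Illustrative progress score, as described on this slide.
class ProgressScore {
  // Map task: fraction of its input read.
  static double mapScore(double fractionInputRead) {
    return fractionInputRead;
  }
  // Reduce task: copy, sort and reduce phases each weigh 1/3.
  static double reduceScore(int phasesCompleted, double fractionOfCurrentPhase) {
    return phasesCompleted / 3.0 + fractionOfCurrentPhase / 3.0;   // halfway in reduce: 2/3 + 1/6 = 5/6
  }
  // A task is a straggler candidate if its score is 0.2 below the average.
  static boolean isStraggler(double score, double averageScore) {
    return score < averageScore - 0.2;
  }
}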

Section 3
Performance Improvements and Optimizations

Subsection 1
Handling Heterogeneity and Stragglers

Issues with Hadoop Scheduler and heterogeneous
environment
Hadoop makes several assumptions :
• nodes are homogeneous
• tasks are homogeneous

which leads to bad decisions :
• Too many backups, thrashing shared resources like network bandwidth
• Wrong tasks backed up
• Backups may be placed on slow nodes
• Breaks when tasks start at different times

The LATE Scheduler
The LATE Scheduler (Longest Approximate Time to End)
• Estimate each task's completion time and back up the task with the longest approximate time to end
• Throttle backup execution:
  • Limit the number of simultaneous backup tasks (SpeculativeCap)
  • Do not schedule backup tasks on the slowest nodes (SlowNodeThreshold)
  • Only back up tasks that are sufficiently slow (SlowTaskThreshold)
• Estimate task completion time (sketched below):
  • progress rate = progress score / execution time
  • estimated time left = (1 − progress score) / progress rate

In practice SpeculativeCap = 10% of the available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and task progress rate.

Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. in OSDI’08: Sixth
Symposium on Operating System Design and Implementation, San Diego, CA, December, 2008.
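The two estimates boil down to a couple of lines; a sketch mirroring the formulas above:

// LATE-style estimate of the time left for a running task (sketch).
class LateEstimate {
  // progress rate = progress score / elapsed execution time
  static double progressRate(double progressScore, double elapsedMinutes) {
    return progressScore / elapsedMinutes;
  }
  // estimated time left = (1 - progress score) / progress rate
  static double timeLeft(double progressScore, double elapsedMinutes) {
    return (1.0 - progressScore) / progressRate(progressScore, elapsedMinutes);
  }
}
// Example from the next slides: a task at 66% after 2 minutes has rate 1/3 per minute,
// hence an estimated (1 - 0.66) / (1/3) ≈ 1 minute left.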

Progress Rate Example

Node 1

1 task/min

Node 2

3x slower

Node 3

1.9x slower

Time (min)

LATE Example (at t = 2 min)
Estimated time left:
(1-0.66) / (1/3) = 1

Node 1
Node 2

Progress = 66%

Estimated time left:
(1-0.05) / (1/1.9) = 1.8

Progress = 5.3%

Node 3

Time (min)

LATE correctly picks Node 3
LATE Performance (EC2 Sort benchmark)

Setup
• Heterogeneity experiment: 243 VMs on 99 hosts, 1-7 VMs per host
• Straggler experiment: stragglers emulated by running dd on 4 nodes
• Each host sorted 128MB, for a total of 30GB of data

Results
• Heterogeneity experiment: average 27% speedup over native Hadoop, 31% over no backups
• Straggler experiment: average 58% speedup over native Hadoop, 220% over no backups
Understanding Outliers
• Stragglers: tasks that take ≥ 1.5× the median task duration in that phase
• Recomputes: when a task's output is lost, dependent tasks wait until the output is regenerated

Figure: prevalence of outliers — (a) what fraction of tasks in a phase are outliers? (b) how much longer do outliers take to finish?

Causes of outliers
• Data skew: a few keys correspond to a lot of data; such stragglers are well explained by the amount of data they process, so duplicating them would not make them run faster and would waste resources
• Network contention: 70% of the cross-rack traffic is induced by reduce tasks
• Bad or busy machines: a small fraction of the machines accounts for a large share of the recomputes
The Mantri Scheduler

• Cause- and resource-aware scheduling
• Real-time monitoring of running tasks

Figure: percentage speed-up of job completion time in the ideal case when (some combination of) outliers do not occur.

Figure: the outlier problem, causes and solutions — unavailable input → replicate output, pre-compute; contention for resources → restart (kill & restart or duplicate) with awareness of cost; network-aware placement of tasks; unequal work division (data skew) → start the largest tasks first.

• Outliers induce high variability in repeat runs of the same job, making it hard to meet SLAs: at the median, jobs have a non-trivial probability of taking twice as long or finishing half as quickly.

Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris. In OSDI'10: Ninth Symposium on Operating System Design and Implementation, Vancouver, Canada, December, 2010.
Resource Aware Restart
• Restart outlier tasks on better machines
• Consider two options: kill & restart, or duplicate

Figure: a stylized example illustrating the main ideas; tasks that are eventually killed are filled with stripes, repeat instances of a task with a lighter mesh.

• Restart or duplicate only if P(t_new < t_rem) is high
• Restart if the remaining time is large: t_rem > E(t_new) + ∆
• Duplicate only if it decreases the total amount of resources used: P(t_rem > t_new (c+1)/c) > δ, where c is the number of copies already running and δ = 0.25
• Schedule restarts early, before the other tasks, to avoid being stuck with large tasks near completion
• Subtle point of outlier mitigation: replicating the output of every task would drastically reduce recomputations, but may increase job completion time and resource usage
Network Aware Placement

• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement
  • does not track bandwidth changes
  • does not require global co-ordination
• Placement of a reduce phase with n tasks, running on a cluster with r racks, with I_{n,r} the input data matrix
• Let d_u^i, d_v^i be the data to be uploaded and downloaded on rack i, and b_u^i, b_d^i the corresponding available bandwidths
• The chosen placement is arg min_i max(d_u^i / b_u^i, d_v^i / b_d^i) (see the sketch below)
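A direct transcription of that rule; arrays are indexed by rack and the names are illustrative:

// Pick the rack minimizing the worse of upload and download time for a reduce task (sketch).
class NetworkAwarePlacement {
  // dUp/dDown: data to move per rack; bUp/bDown: available bandwidth per rack.
  static int bestRack(double[] dUp, double[] dDown, double[] bUp, double[] bDown) {
    int best = 0;
    double bestCost = Double.MAX_VALUE;
    for (int i = 0; i < dUp.length; i++) {
      double cost = Math.max(dUp[i] / bUp[i], dDown[i] / bDown[i]);
      if (cost < bestCost) { bestCost = cost; best = i; }
    }
    return best;
  }
}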
Mantri Performances

• Straggler mitigation speeds up the median job by 55% (Mantri vs. the baseline implementation on a production cluster of thousands of servers, for four representative jobs)
• Network-aware data placement reduces the completion time of typical reduce phases by 31%; half of the reduce phases speed up by more than 60%

Figure: comparing straggler mitigation strategies — Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes.
Subsection 2
Extending MapReduce Application Range

!"#$%&'()(*%&'+,-&(.+/0&123&(

Twister: Iterative MapReduce
!  !"#$"%&$"'%('%(#$)$"&()%*(

+),")-./(*)$)(

!  (0'%123,)-./(.'%2(,3%%"%2(

4&)&5/)-./6(7)89,/*3&/($)#:#((

!  ;3-9#3-(7/##)2"%2(-)#/*(

&'773%"&)$"'%9*)$)($,)%#</,#((

!  =>&"/%$(#388',$(<',(?$/,)$"+/(

@)8A/*3&/(&'783$)$"'%#(
4/B$,/7/.C(<)#$/,($5)%(D)*''8(
',(!,C)*9!,C)*E?FG6((
!  0'7-"%/(85)#/($'(&'../&$()..(
,/*3&/('3$83$#((
!  !)$)()&&/##(+")(.'&).(*"#:#(
E"25$H/"25$(4IJKLL(."%/#('<(
M)+)(&'*/6((
!  N388',$(<',($C8"&).(@)8A/*3&/(
&'783$)$"'%#(O''.#($'(7)%)2/(
*)$)((

Ekanayake, Jaliya, et al. Twister: a Runtime for Iterative Mapreduce. International Symposium on High Performance Distributed Computing. HPDC 2010.
Moon: MapReduce on Opportunistic Environments
• Consider volatile resources during their idleness time
• Consider the Hadoop runtime
→ Hadoop does not tolerate massive failures

• The Moon approach :
• Consider hybrid resources: very volatile nodes and more stable nodes
• two queues are managed for speculative execution : frozen and slow tasks
• 2-phase task replication : separate the job between normal and homestretch

phase
→ the job enters the homestretch phase when the number of remaining tasks < H% of the available execution slots

Lin, Heshan, et al. Moon: Mapreduce on opportunistic environments International Symposium on High Performance Distributed Computing. HPDC 2010.

Towards Greener Data Center
• Sustainable computing infrastructures which minimize the impact on the environment
• Reduction of energy consumption
• Reduction of greenhouse gas emissions
• Improve the use of renewable energy (smart grids, co-location, etc.)

Figure: solar array at the Apple Maiden data center (about 100 acres, roughly 400,000 m2; 20 MW peak).
Section 4
MapReduce beyond the Data Center

Introduction to Computational Desktop Grids
High Throughput Computing over Large Sets of Idle Desktop Computers

• Volunteer Computing systems (SETI@Home, Distributed.net)
• Open-source Desktop Grid middleware (BOINC, XtremWeb, Condor, OurGrid)
• Some numbers with BOINC (October 21, 2013)
  • Active: 0.5M users, 1M hosts
  • 24-hour average: 12 PetaFLOPS
  • EC2 equivalent: 1 billion $/year
MapReduce over the Internet
Computing on Volunteers’ PCs (a la SETI@Home)
• Potential of aggregate computing power and storage
• Average PC: 1 TFlops, 4GB RAM, 1 TByte disk
• 1 billion PCs, 100 millions GPUs + tablets + smartphone + STB

• hardware and electrical power are paid for by consumers
• Self-maintained by the volunteers

But ...
• High number of resources

• Low performance

• Volatility

• Owned by volunteer

• Scalable but mainly for embarrassingly parallel applications with few I/O

requirements
• Challenge is to broaden the application domain
Subsection 1
Background on BitDew

Large Scale Data Management

BitDew : a Programmable Environment for Large Scale Data
Management
Key Idea 1: provides an API and a runtime environment which integrates
several P2P technologies in a consistent way
Key Idea 2: relies on metadata (Data Attributes) to drive transparently data
management operation : replication, fault-tolerance, distribution,
placement, life-cycle.
BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello. International Conference for High Performance Computing (SC'08), Austin, 2008.
BitDew : the Big Cloudy Picture
• Aggregates storage in a single Data Space:
  • Clients put and get data from the data space
  • Clients define data attributes
• Distinguishes service nodes (stable) from client and Worker nodes (volatile)
  • Service: ensures fault tolerance, indexing and scheduling of data to Worker nodes
  • Worker: stores data on Desktop PCs
• push/pull protocol between client → service ← Worker

Figure: a client puts a datum with REPLICA = 3 into the data space; Workers pull it from the service nodes.
Data Attributes

REPLICA: indicates how many occurrences of the data should be available at the same time in the system
RESILIENCE: controls the resilience of the data in presence of machine crashes
LIFETIME: a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
AFFINITY: drives the movement of data according to dependency rules
TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
DISTRIBUTION: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node

Architecture Overview
Figure: BitDew layered architecture. Applications (Master/Worker, command-line tool) use the programming APIs; the BitDew service container runs the Active Data, Data Scheduler, Transfer Manager, Data Catalog, Data Repository and Data Transfer services; back-ends include an SQL server, the file system and a DHT for information storage, and HTTP, FTP and BitTorrent for data transfer.

BitDew Runtime Environment

• Programming APIs to create Data, Attributes, manage file transfers and

program applications
• Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer

and provide resiliency to the Data
• Several information storage backends and file transfer protocols

Anatomy of a BitDew Application
• Clients
  1. create Data and Attributes
  2. put files into the created data slots
  3. schedule the data with their corresponding attribute
• Reservoirs (Workers)
  • are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and Reservoirs can both create data and install callbacks

Figure: the client creates, puts and schedules data through the service nodes (dc/ds, dr/dt); reservoir nodes pull the data and receive onDataScheduled callbacks.
Examples of BitDew Applications

• Data-Intensive Application
• DI-BOT : data-driven master-worker Arabic characters recognition (M.

Labidi, University of Sfax)
• MapReduce, (X. Shi, L. Lu HUST, Wuhan China)
• Framework for distributed results checking (M. Moca, G. S. Silaghi

Babes-Bolyai University of Cluj-Napoca, Romania)
• Data Management Utilities
• File Sharing for Social Network (N. Kourtellis, Univ. Florida)
• Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI)
• Grid Data Stagging (IEP, Chinese Academy of Science)

Experiment Setup

Hardware Setup
• Experiments are conducted in a controlled environment (cluster for the

microbenchmark and Grid5K for scalability tests) as well as an Internet
platform
• GdX Cluster : 310-nodes cluster of dual opteron CPU (AMD-64 2Ghz,

2GB SDRAM), network 1Gbs Ethernet switches
• Grid5K : an experimental Grid of 4 clusters including GdX
• DSL-lab : Low-power, low-noise cluster hosted by real Internet users on

DSL broadband network. 20 Mini-ITX nodes, Pentium-M 1Ghz, 512MB
SDRam, with 2GB Flash storage
• Each experiment is averaged over 30 measurements

Fault-Tolerance Scenario
Evaluation of BitDew in Presence of Host Failures

Scenario: a datum is created with the attributes REPLICA = 5, FAULT_TOLERANCE = true and PROTOCOL = "ftp". Every 20 seconds, we simulate a machine crash by killing the BitDew process on one machine owning the data. The experiment runs in the DSL-Lab environment (nodes DSL01 to DSL10, with download bandwidths between roughly 53 KB/s and 492 KB/s).

Figure: the Gantt chart presents the main events: the blue box shows the download duration, the red box the waiting time, and the red star indicates a node crash.
Collective File Transfer with Replication
Scenario: all-to-all file transfer with replication
• {d1, d2, d3, d4} is created, where each datum has the REPLICA attribute set to 4
• The all-to-all collective is implemented with the AFFINITY attribute: each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1

Results: experiments on DSL-Lab
• one run every hour during approximately 12 hours
• during the collective, file transfers are performed concurrently and the data size is 5MB
• Figure: Cumulative Distribution Function (CDF) of the per-transfer throughput (KBytes/sec) for the all-to-all collective
Multi-sources and Multi-protocols File Distribution
Scenario
• 4 clusters of 50 nodes each, each cluster hosting its own data repository (DC/DR)
• First phase: a client uploads a large datum to the 4 data repositories
• Second phase: the cluster nodes get the datum from the data repository of their own cluster
We vary:
1. the number of data repositories, from 1 to 4
2. the protocol used to distribute the data (BitTorrent vs FTP)
Multi-sources and Multi-protocols File Distribution
Table: file distribution throughput (MB/s) per site (Orsay, Nancy, Bordeaux, Rennes) for BitTorrent and FTP, during the 1st phase (upload to the repositories) and the 2nd phase (download by the cluster nodes), with 1, 2 or 4 data repositories (1DR/1DC, 2DR/2DC, 4DR/4DC).
• for both protocols, increasing the number of data sources also increases

the available bandwidth
• FTP performs better in the first phase and BitTorrent gives superior performance in the second phase
• the best combination (4DC/4DR, FTP for the first phase and BitTorrent for

the second phase) takes 26.46 seconds and outperforms by 168.6% the
best single data source, single protocol (BitTorrent) configuration
Subsection 2
MapReduce for Desktop Grid Computing

Challenge of MapReduce over the Internet

• No shared file system nor direct communication between hosts
• Faults and host churn
• Needs data replica management
• Needs result certification of intermediate data
• Needs collective operations (scatter + gather/reduction)

Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He, G. Fedak. In Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.
Implementing MapReduce over BitDew (1/2)
Initialisation by the Master
1. Data slicing: the Master creates a DataCollection DC = d1, . . . , dn, a list of data chunks sharing the same MapInput attribute.
2. It creates Reducer tokens, each one with its own TokenReducX attribute (affinity set to the Reducer).
3. It creates and pins Collector tokens with the TokenCollector attribute (affinity set to the Collector token).

Algorithm: execution of Map tasks
• Let Map be the Map function to execute, M the number of Mappers, m a single mapper, and dm = d1,m, . . . , dk,m the set of map input data received by worker m.
• On the Master node: create a single data MapToken with affinity set to DC.
• On the Worker node: if MapToken is scheduled, then for all data dj,m in dm, execute Map(dj,m) and create listj,m(k, v).

Execution of Map tasks
1. When a Worker receives the MapToken, it launches the corresponding Map tasks and computes the intermediate values listj,m(k, v).
Implementing MapReduce over BitDew (2/2)

Shuffling intermediate results
• Let M be the number of Mappers, R the number of Reducers, and listm(k, v) the set of (key, value) pairs of intermediate results on worker m.
• On the Master node: for all r ∈ [1, . . . , R], create an attribute ReduceAttr_r with distrib = 1, create a data ReduceToken_r and schedule it with that attribute.
• On the Worker node: split listm(k, v) into intermediate files ifm,1, . . . , ifm,r; for each file ifm,r, create a reduce input data irm,r, upload ifm,r and schedule irm,r with affinity = ReduceToken_r.
• The distrib = 1 flag ensures a fair distribution of the workload between reducers; the Shuffle phase is simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag equals the corresponding ReduceToken.

Execution of Reduce tasks
1. When a reducer (a worker holding a ReduceToken) receives ReduceInput files irm,r, it launches the corresponding Reduce(irm,r) tasks.
2. When all the irr have been processed, the reducer combines the results into a single result or, created with affinity = MasterToken.
3. Eventually, the final results can be sent back to the Master: the Master creates a MasterToken data that is pinned and scheduled (known to the BitDew scheduler but kept on the Master node), so the outputs of the Reduce tasks, scheduled with affinity set to MasterToken, are sent to the Master, where they can be combined into a single result.

Figure: the worker runs Mapper, Reducer and ReducePutter threads; the queues Q1, Q2 and Q3 are used for communication between threads.

Collective file operations: MapReduce processing is composed of several collective file operations which resemble collective communications in parallel programming (MPI): the Distribution, i.e. the initial distribution of file chunks (similar to MPI_Scatter), and the Shuffle, i.e. the redistribution of intermediate results between the Mapper and the Reducer nodes (similar to MPI_AlltoAll).
Handling Communications
Latency Hiding
• Multi-threaded worker to overlap communication and computation (sketched below)

• The number of maximum concurrent Map
and Reduce tasks can be configured, as
well as the minimum number of tasks in
the queue before computations can start.

Barrier-free computation
• Reducers detect duplication of intermediate results (that happen because of faults and/or
lags).

• Early reduction : process IR as they arrive → allowed us to remove the barrier between
Map and Reduce tasks.

• But ... IR are not sorted.
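A generic sketch of this latency-hiding pattern (producer/consumer queues between the transfer thread and the compute threads); this is an assumption-level illustration, not the actual BitDew worker code:

// Overlap file transfers with Map execution via a bounded work queue (sketch).
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OverlappingWorker {
  private final BlockingQueue<String> readyChunks = new ArrayBlockingQueue<>(64);
  private final ExecutorService mappers = Executors.newFixedThreadPool(4); // max concurrent tasks

  // Called by the transfer thread as soon as a chunk download finishes.
  void onChunkDownloaded(String localPath) throws InterruptedException {
    readyChunks.put(localPath);                // enqueue; computation starts as data arrive
  }

  // Compute side: drain the queue and run Map tasks concurrently.
  void runMappers() {
    for (int i = 0; i < 4; i++) {
      mappers.submit(() -> {
        try {
          while (true) {
            String chunk = readyChunks.take(); // wait for the next downloaded chunk
            runMapTask(chunk);                 // apply the user-supplied Map to this chunk
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
  }

  private void runMapTask(String chunk) { /* apply the Map function to the chunk */ }
}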

Scheduling and Fault-tolerance
2-level scheduling
1 Data placement is ensured by the BitDew scheduler, which is mainly guided by the data

attribute.
2 Workers periodically report to the MR-scheduler, running on the master node the state of

their ongoing computation.
3 The MR-scheduler determines if there are more nodes available than tasks to execute

which can avoid the lagger effect.
Fault-tolerance

• In Desktop Grid, computing resources have high failure rates :
→ during the computation (either execution of Map or Reduce tasks)
→ during the communication, that is file upload and download.
• MapInput data and ReduceToken token have the resilient attribute enabled.
• Because intermediate results irj,r , j ∈ [1, m] have the affinity set to

ReduceTokenr , they will be automatically downloaded by the node on which
the ReduceToken data is rescheduled.

Data Security and Privacy
Distributed Result Checking
• Traditional DG or VC projects implement result checking on the server.
• IR are too large to be sent back to the server
→ distributed result checking
• replicate the MapInput and the Reducers
• select correct results by majority voting

Figure 2. Dataflows in the MapReduce implementation: (a) No replication, (b) Replication of the
Map tasks, (c) Replication of both Map and Reduce tasks.

For a map input fi , a reducer receives a set of rm versions of the intermediate file, produced by
distinct workers. After receiving rm /2 + 1 identical versions, the reducer considers the result
correct and accepts it as input for a Reduce task; Figure 2b illustrates the dataflow of an execution
where rm = 3. To activate the checking of the results received from the reducers, the master node
schedules the ReduceTokens with the attribute replica = rr . Workers are assumed to sabotage
independently of one another (colluding nodes are not considered).

Ensuring Data Privacy
• Use a hybrid infrastructure : composed of private and public resources
• Use IDA (Information Dispersal Algorithms) approach to distribute and store securely the data

G. Fedak (INRIA, France)
MapReduce
Salvador da Bahia, Brazil, 22/10/2013
78 / 84
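
Here is a sketch of the reducer-side check described above: an intermediate result for a map input
is accepted once a strict majority of its rm replicated versions agree, versions being compared
through a digest of the file content (class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

// Sketch of distributed result checking by majority voting: each map input is
// replicated on rm workers, and a reducer accepts an intermediate result once
// it has received floor(rm / 2) + 1 identical versions (compared by digest).
public class MajorityVoter {
    private final int rm;                                    // replication factor of map tasks
    private final Map<String, Map<String, Integer>> votes = new HashMap<>();

    public MajorityVoter(int rm) { this.rm = rm; }

    /** Returns true when the result for this map input can be accepted. */
    public boolean offer(String mapInputId, String resultDigest) {
        Map<String, Integer> perDigest = votes.computeIfAbsent(mapInputId, k -> new HashMap<>());
        int count = perDigest.merge(resultDigest, 1, Integer::sum);
        return count >= rm / 2 + 1;                          // strict majority reached
    }

    public static void main(String[] args) {
        MajorityVoter voter = new MajorityVoter(3);          // rm = 3, as in Figure 2b
        System.out.println(voter.offer("f1", "digestA"));    // false: 1 vote
        System.out.println(voter.offer("f1", "digestB"));    // false: sabotaged copy
        System.out.println(voter.offer("f1", "digestA"));    // true : 2 of 3 agree
    }
}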
BitDew Collective Communication
Figure: Execution time of the Broadcast, Scatter and Gather collectives for a 2.7GB file on a
varying number of nodes.

The next experiment evaluates the performance of the collective file operations. Considering that
we need enough chunks to measure the Scatter/Gather collective, we selected 16MB as the chunk
size.

G. Fedak (INRIA, France)

MapReduce

Salvador da Bahia, Brazil, 22/10/2013

79 / 84
MapReduce Evaluation
Figure: Scalability evaluation on the WordCount application: the y axis presents the throughput in
MB/s and the x axis the number of nodes, varying from 1 to 512.

Table II. Evaluation of the performance according to the number of Mappers and Reducers.

G. Fedak (INRIA, France)
MapReduce
Salvador da Bahia, Brazil, 22/10/2013
80 / 84
Fault-Tolerance Scenario

Figure: Fault-tolerance scenario (5 mappers, 2 reducers): crashes are injected during
the download of a file (F1), the execution of the map task (F2) and during the execution
of the reduce task on the last intermediate result (F3).

G. Fedak (INRIA, France)

MapReduce

Salvador da Bahia, Brazil, 22/10/2013

81 / 84
Hadoop vs MapReduce/BitDew on Extreme Scenarios
Objective
• Evaluate Hadoop and BitDew-MapReduce on scenarios that go beyond the normal cluster case:
CPU heterogeneity and stragglers, host churn, network failures, massive failures, sabotage,
firewalls, laggers, etc.
• CPU heterogeneity and stragglers : CPU heterogeneity is emulated with wrekavoc by adjusting
the CPU frequency of half of the machines; stragglers (machines that take an unusually long time
to complete one of the last few tasks of the job) are emulated by setting the CPU frequency of the
target nodes to 10%.
• Host churn : we periodically kill the MapReduce worker process on one node and launch a new
worker process on another node.
• Network fault tolerance : the network is shut down during 25 seconds at different job progress
points (from 12.5% to 100%); the worker heartbeat timeout is set to 30 seconds for both Hadoop
and BitDew-MapReduce, and the HDFS chunk replica factor is set to 3.

Some results
• Host churn : Hadoop fails when the churn interval is smaller than 30 seconds. The Hadoop
JobTracker marks all temporarily disconnected nodes as dead, although they are still running
tasks, and blindly removes the tasks done by these nodes from the successful task list;
re-executing these tasks significantly prolongs the job makespan. BitDew-MapReduce allows
workers to go temporarily offline without performance penalty: the intermediate outputs of
completed Map tasks are safely stored on the stable central storage, so the master does not
re-execute the Map tasks of failed workers.
• Stragglers : both Hadoop and BitDew-MapReduce launch backup copies of the last tasks on idle
machines when a group of nodes gets slow (Table III reports the job makespans for an increasing
number of stragglers).
• Network failures : with firewall and NAT rules turning down the network links between workers,
Hadoop cannot even launch a job, because HDFS needs inter-communication between
DataNodes; BitDew-MapReduce works properly, since there is no inter-worker communication
and all transfers go through the stable central storage. This approach suits applications that
generate few intermediate data (e.g. Word Count, Distributed Grep); a storage cluster with large
aggregated bandwidth or public cloud storage could mitigate the transfer overhead.
• Related work : many studies improve MapReduce performance [17, 18, 19, 28] or support
MapReduce on different architectures [16, 22, 23]; the closest work is MOON [23] (MapReduce
On Opportunistic eNvironments), which limits the system scale to a campus and assumes hybrid
resources, provisioning a fixed fraction of dedicated stable computers to supplement the volatile
personal computers.

(Fig. 2: performance evaluation of the host churn scenario; Table 1: network fault tolerance;
Table III: stragglers. Figures from Anthony Simonet, MapReduce on Desktop Grids, December 2012.)

G. Fedak (INRIA, France)
MapReduce
Salvador da Bahia, Brazil, 22/10/2013
82 / 84
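
As a generic illustration of the heartbeat mechanism discussed above (this is neither Hadoop's nor
BitDew-MapReduce's actual code), here is a sketch of timeout-based failure detection on the master
side: a worker whose last heartbeat is older than the timeout is declared dead and its tasks are
marked for re-execution. The policy applied to already-completed Map tasks (re-execute them, as
Hadoop does, or reuse their outputs from stable storage, as BitDew-MapReduce does) then
determines the cost of a false suspicion.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic sketch of heartbeat-based failure detection: workers report
// periodically, and a worker whose last heartbeat is older than the timeout
// is declared dead so that its in-progress tasks can be re-scheduled.
public class FailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public FailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Called when a heartbeat message is received from a worker. */
    public void heartbeat(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    /** Called periodically by the master to detect suspected-dead workers. */
    public void checkFailures(long now) {
        lastHeartbeat.forEach((worker, last) -> {
            if (now - last > timeoutMillis) {
                lastHeartbeat.remove(worker);
                reschedule(worker);
            }
        });
    }

    private void reschedule(String worker) {
        // Re-enqueue the Map/Reduce tasks that were running on this worker.
        System.out.println("worker " + worker + " declared dead, re-scheduling its tasks");
    }
}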
Section 5
Conclusion

G. Fedak (INRIA, France)

MapReduce

Salvador da Bahia, Brazil, 22/10/2013

83 / 84
Conclusion
• MapReduce Runtimes
• Attractive programming model for data-intense computing
• Many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling)
• Many others : security, QoS, iterative MapReduce, ...

• MapReduce beyond the Data Center :
• enable large scale data-intensive computing : Internet MapReduce

• Want to learn more ?
• Home page for the Mini-Curso : http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
• the MapReduce Workshop@HPDC (http://graal.ens-lyon.fr/mapreduce)
• Desktop Grid Computing ed. C. Cerin and G. Fedak – CRC Press
• our websites : http://www.bitdew.net and http://www.xtremweb-hep.org

G. Fedak (INRIA, France)

MapReduce

Salvador da Bahia, Brazil, 22/10/2013

84 / 84

MapReduce Runtime Design and Performance Optimizations

  • 1. MapReduce Runtime Environments Design, Performance, Optimizations Gilles Fedak Many Figures imported from authors’ works Gilles.Fedak@inria.fr INRIA/University of Lyon, France Escola Regional de Alto Desempenho - Região Nordeste G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 1 / 84
  • 2. Outline of Topics 1 What is MapReduce ? The Programming Model Applications and Execution Runtimes 2 Design and Principles of MapReduce Execution Runtime File Systems MapReduce Execution Key Features of MapReduce Runtime Environments 3 Performance Improvements and Optimizations Handling Heterogeneity and Stragglers Extending MapReduce Application Range 4 MapReduce beyond the Data Center Background on BitDew MapReduce for Desktop Grid Computing 5 Conclusion G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 2 / 84
  • 3. Section 1 What is MapReduce ? G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 3 / 84
  • 4. What is MapReduce ? • Programming Model for data-intense applications • Proposed by Google in 2004 • Simple, inspired by functionnal programming • programmer simply defines Map and Reduce tasks • Building block for other parallel programming tools • Strong open-source implementation: Hadoop • Highly scalable • Accommodate large scale clusters: faulty and unreliable resources MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat, in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 4 / 84
  • 5. Subsection 1 The Programming Model G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 5 / 84
  • 6. MapReduce in Lisp • (map f (list l1 , . . . , ln )) • (map square ‘(1 2 3 4)) • (1 4 9 16) • (+ 1 4 9 16) • (+1 (+4 (+9 16))) • 30 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 6 / 84
  • 7. MapReduce a la Google Map(k , v ) → (k , v ) Mapper Group(k , v ) by k Reduce(k , {v1 , . . . , vn }) → v Reducer Input Output • Map(key , value) is run on each item in the input data set • Emits a new (key , val) pair • Reduce(key , val) is run for each unique key emitted by the Map • Emits final output G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 7 / 84
  • 8. Exemple: Word Count • Input: Large number of text documents • Task: Compute word count across all the document • Solution : • Mapper : For every word in document, emits (word, 1) • Reducer : Sum all occurrences of word and emits (word, total_count) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 8 / 84
  • 9. WordCount Example / / Pseudo−code f o r t h e Map f u n c t i o n map( S t r i n g key , S t r i n g v a l u e ) : / / key : document name , / / v a l u e : document c o n t e n t s f o r e a c h word i n s p l i t ( v a l u e ) : E m i t I n t e r m e d i a t e ( word , ’ ’ 1 ’ ’ ) ; / / Peudo−code f o r t h e Reduce f u n c t i o n reduce ( S t r i n g key , I t e r a t o r v a l u e ) : / / key : a word / / v a l u e s : a l i s t o f counts i n t word_count = 0 ; foreach v i n values : word_count += P a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( word_count ) ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 9 / 84
  • 10. Subsection 2 Applications and Execution Runtimes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 10 / 84
  • 11. MapReduce Applications Skeleton • Read large data set • Map: extract relevant information from each data item • Shuffle and Sort • Reduce : aggregate, summarize, filter, or transform • Store the result • Iterate if necessary G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 11 / 84
  • 12. MapReduce Usage at Google Jerry Zhao (MapReduce Group Tech lead @ Google) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 12 / 84
  • 13. MaReduce + Distributed Storage !  !"#$%&'(()#*$+),--"./,0$1)#/0*),)1/-0%/2'0$1)-0"%,3$) -4-0$5) Powerfull when associated with a distributed storage system ! •678)96""3($)78:;)<=78)9<,1"">)7/($)84-0$5:) • GFS (Google FS), HDFS (Hadoop File System) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 13 / 84
  • 14. The Execution Runtime • MapReduce runtime handles transparently: • • • • • • Parallelization Communications Load balancing I/O: network and disk Machine failures Laggers G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 14 / 84
  • 15. MapReduce Compared Jerry Zhao G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 15 / 84
  • 16. Existing MapReduce Systems #$%&'()*+,-./0-(12$3-4$(( ""#$%&'()*%+,-%& • Google MapReduce   ./")/0%1(/2&(3+&-$"4%+&4",/-%& • Proprietary and close source   5$"4%$2&$036%+&1"&!78& to GFS • Closely linked   9:)$%:%31%+&03&5;;&&<01=&.21="3&(3+&>(?(&@03+03#4& • Implemented in C++ with Python and Java Bindings   8(<A($$& • Hadoop (+"")& • Open source (Apache projects) • Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly   C)%3&4",/-%&DE)(-=%&)/"F%-1G& more)   5(4-(+03#H&>(?(H&*,@2H&.%/$H&.21="3H&.B.H&*H&"/&5;;& • Many side projects: HDFS, PIG, Hive, Hbase, Zookeeper   '(32&40+%&)/"F%-14&I&BJ78H&.9!H&B@(4%H&B0?%H&K""6%)%/& • Used by Yahoo!, Facebook, Amazon and the Google IBM research Cloud   L4%+&@2&M(=""NH&7(-%@""6H&E:(A"3&(3+&1=%&!""#$%&9O'& /%4%(/-=&5$",+& G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 16 / 84
  • 17. Other Specialized Runtime Environments • Hardware Platform • Frameworks based on • Multicore (Phoenix) • FPGA (FPMR) • GPU (Mars) MapReduce • Iterative MapReduce (Twister) • Security • Incremental MapReduce • Data privacy (Airvat) (Nephele, Online MR) • MapReduce/Virtualization • Languages (PigLatin) • Log analysis : In situ MR • EC2 Amazon • CAM G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 17 / 84
  • 18. Hadoop Ecosystem G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 18 / 84
  • 19. Research Areas to Improve MapReduce Performance • File System • Higher Throughput, reliability, replication efficiency • Scheduling • data/task affinity, scheduling multiple MR apps, hybrid distributed computing infrastructures • Failure detections • Failure characterization, failure detector and prediction • Security • Data privacy, distributed result/error checking • Greener MapReduce • Energy efficiency, renewable energy G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 19 / 84
  • 20. Research Areas to Improve MapReduce Applicability • Scientific Computing • Iterative MapReduce, numerical processing, algorithmics • Parallel Stream Processing • language extension, runtime optimization for parallel stream processing, incremental processing • MapReduce on widely distributed data-sets • Mapreduce on embedded devices (sensors), data integration G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 20 / 84
  • 21. CloudBlast: MapReduce and Virtualization for BLAST Computing NCBI BLAST • The Basic Local Alignment Search Tool (BLAST) • finds regions of local similarity between Split into chunks based on new line. CCAAGGTT >seq1 AACCGGTT >seq2 >seq3 GGTTAACC >seq4 TTAACCGG sequences. StreamPatternRecordReader splits chunks into sequences as keys for the mapper (empty value). • compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. Hadoop + BLAST • stream of sequences are split by Hadoop. • Mapper compares each sequence to the gene database • provides fault-tolerance, load balancing. G. Fedak (INRIA, France) >seq3 GGTTAACC >seq4 TTAACCGG >seq1 AACCGGTT >seq2 CCAAGGTT BLAST Mapper BLAST Mapper BLAST output is written to a local file. part-00000 part-00001 DFS merger • HDFS is used to merge the result • database are not chunked MapReduce 'I'8%1#</ &';<%&8'; 455A#841#< '/:#&</C Y)!,55 P3!#C4$ 02'/! /' C4#/14#/' 1'C5A41'! 4%1<C41#8 &'A#':';! % ;#1';?! =&< 7'5A<FC' 1<! 7'4A! 0 .'C5A41' E#<A<$#;1; #/1'&5&'1#/ 455A#4/8' StreamInputFormat ! "#$%&'!()!! *+,-.#/$!0#12!3456'7%8')!9#:'/!4!;'1!<=!#/5%1!;'>%'/8';?! @47<<5!;5A#1;!12'!#/5%1!=#A'!#/1<!82%/B;!<=!;4C'!;#D'?!5<;;#EAF!;5A#11#/$!#/! 12'!C#77A'!<=!4!;'>%'/8')!-1&'4CG411'&/6'8<&76'47'&!04;!7':'A<5'7!1<! Cloudblast: Combining mapreduce and virtualization on #/1'&5&'1!8<&&'81AF!12'!E<%/74&#';!<=!;'>%'/8';?!02#82!4&'!54;;'7!4;!B'F;!<=! distributed resources for bioinformatics applications. A. 4!B'FH:4A%'!54#&!1<!12'!3455'&!1241!'I'8%1';!*+,-.)!J"-!&'1&#':';!4AA! Matsunaga, M.A<84A!=#A'!<%15%1;!8<CE#/#/$!#/1<!4!;#/$A'!&';%A1)! Tsugawa, & J. Fortes. (eScience’08), 2008. <5'&41#/$!;F;1'C;!4/7!477#1#</4A!;<=104&'!K')$)!1<!;%55<&1!4! 54&1#8%A4&! 1F5'! <=! /'10<&B! 5&<1<8<A! <&! 1'82/<A<$FL)! -#C#A4&AF?!4!@47<<5ME4;'7!455&<482!&'>%#&';!C482#/';!0#12! 12'! @47<<5! &%/1#C'! 4/7! =#A'! ;F;1'C! ;<=104&'?! 02#82! 84/! </AF!'I'8%1'!#=!</'! <=!;':'&4A!488'514EA'!:'&;#</!<=!N4:4!#;! Salvador da Bahia, Brazil, 22/10/2013 !"! #$%&% *#<#/= ;'1;!<=!': 5&<1'#/! ;' YL)! G'&#<7 <=='&! %5M %5741';!#; &';<%&8'; J414! 9P"-!TYc 12'! 7#;1&# @<0':'&? 7';#$/'7! C4/4$'C <E14#/'7! /<7';?! #7 47:4/14$' 21 / 84
  • 22. ,'&L?&-0.>'! 8#LL'&'.>'! 7'/N''.! =1?%8FGHIJ! 0.8! -,#FGHIJ! #2! -?&'! .?/#>'071'! N@'.! 2'K%'.>'2! 0&'! 2@?&/T! #.8#>0/#.$! 0.! 'LL#>#'./! #-,1'-'./0/#?.! ?L! #.,%/! 2'K%'.>'! 2,1#/!0.8!V?7!>??&8#.0/#?.!7M!C08??,)!JN?62#/'!&'2%1/2!2@?N! 7'//'&!2,''8%,!/@0.!?.'62#/'!8%'!/?!,&?>'22?&2!/@0/!0&'!L02/!0/! ;=!P0,,&?+#-0/'1M!<)3!/#-'2!L02/'&R)!J@'!,%71#>1M!0B0#1071'! -,#FGHIJ! >0.! 7'! L%&/@'&! @0.86/%.'8! L?&! @#$@'&! JHFG*!EE)!! *5D*SE:*AJHG!I*Y;*A=*I!=CHSH=J*SEIJE=I)! • Use Vine to deploy Hadoop on several clusters (24 Xen-based ?,/#-#_'8! #-,1'-'./0/#?.2!12 .?/! ,'&L?&-0.>'T! 7%/! 2%>@! VM @ Univ. Florida + 0&'! ! "#$!%&'(! Calif.) "#$!)*&+,! ,%71#>1M!0>>'22#71'!a4b)! Xen based VMs at Univ WUX! WUX! -!&.!)/01/'2/)! H1/@?%$@! /@'! './#&'! /0&$'/! 80/0702'! L#/2! #./?! -'-?&M! #.! • Each VM contains Hadoop + BLAST + databse2'/%,! %2'8T! &%.2! N#/@! /@'! 80/0702'! 2'$-'./'8! #./?! [X! /@'! (3.5GB) (TZ<[! Z(X! 3&'(/),!)/01/'2/! L&0$-'./2!N'&'!,'&L?&-'8!#.!?&8'&!/?!#.>&'02'!/@'!0-?%./!?L! 3<(! <X<! 4*&+,/),!)/01/'2/! • Compared with MPIBlast .'/N?&O! >?--%.#>0/#?.! L?&! -,#FGHIJ! &%.2! P"#$)! ZR)! (<(! <X! 4,5'65+6!6/785,8&'! I,''8%,! $&0,@2! P"#$)! R! 2@?N! /@0/! /@'&'! N02! .?! #-,0>/! ?.! <TX3T<4[! 43(T[(! 9&,5%!'1:;/+!&.! /@'!,'&L?&-0.>'!8%'!/?!9A)! '12%/&,86/)! ! <<<U)Z3! 44[)4! <7/+5(/!'12%/&,86/)! >?-,%/'68?-#.0/'8! 0,,1#>0/#?.2! ?&! 'B'.! #-,&?B#.$ =/+!)/01/'2/! ,'&L?&-0.>'! 8%'! /?! 0&/#L0>/2! ?L! #-,1'-'./0/#?.T! ')$)! L#1' 4[!@?%&2!0.8! <!@?%&2!0.8! 4/01/',85%! 2M2/'-!8?%71'!>0>@#.$!a4Ub)!! 4U!-#.%/'2! XU!-#.%/'2! />/21,8&'!,8:/! J?! 'B01%0/'! /@'! 2>0107#1#/M! ?L! -,#FGHIJ! 0.8 =1?%8FGHIJT!'+,'&#-'./2!N#/@!8#LL'&'./!.%-7'&2!?L!.?8'2T ,&?>'22?&2T! 80/0702'! L&0$-'./2! 0.8! #.,%/! 80/0! 2'/2! N'&' E9)!*9HG;HJE]A!HA^!*5D*SE:*AJHG!S*I;GJI! '+'>%/'8)! *+0>/1M! 3! ,&?>'22?&2! 0&'! %2'8! #.! '0>@! .?8') J?! 'B01%0/'! /@'! ?B'&@'08! ?L! -0>@#.'! B#&/%01#_0/#?.T! Q@'.'B'&! ,?22#71'T! /@'! .%-7'&2! ?L! .?8'2! #.! '0>@! 2#/'! 0&' '+,'&#-'./2! N#/@! FGHIJ! N'&'! >?.8%>/'8! ?.! ,@M2#>01! O',/! 7010.>'8! P#)')T! L?&! '+,'&#-'./2! N#/@! Z! ,&?>'22?&2T! 3 -0>@#.'2!P,%&'!G#.%+!2M2/'-2!N#/@?%/!5'.!9::R!0.8!10/'&! .?8'2! #.! '0>@! 2#/'! 0&'! %2'8c! L?&! '+,'&#-'./2! N#/@! <U ?.! /@'! 5'.! 9:2! @?2/'8! 7M! /@'! 20-'! -0>@#.'2T! %2#.$! /@'! ,&?>'22?&2T! 4! .?8'2! 0&'! %2'8! #.! '0>@! 2#/'T! 0.8! 2?! ?.R)! "?& ,@M2#>01! .?8'2! 0/! ;")! 9#&/%01#_0/#?.! ?B'&@'08! #2! 70&'1M! '+,'&#-'./2! N#/@! U4! ,&?>'22?&2T! <3! .?8'2! 0/! ;=! 0.8! 3X! 0/ .?/#>'071'!N@'.!'+'>%/#.$!FGHIJ!02!#/!>0.!7'!?72'&B'8!#.! ! ;"! N'&'! %2'8! 8%'! /?! 1#-#/'8! .%-7'&! ?L! 0B0#1071'! .?8'2! 0/ "#$%&'!U)!! =?-,0&#2?.!?L!FGHIJ!2,''86%,2!?.!,@M2#>01!0.8!B#&/%01! "#$)!U)!J@#2!#2!8%'!/?!/@'!-#.#-01!.''8!L?&!E`]!?,'&0/#?.2!02! "#$%&'!()!! *+,'&#-'./01!2'/%,)!34!5'.6702'8!9:2!0/!;"!0.8!<3!5'.6 ;=)! -0>@#.'2)!I,''8%,!#2!>01>%10/'8!&'10/#B'!/?!2'K%'./#01!FGHIJ!?.!B#&/%01! 702'8!9:2!0/!;=!0&'!>?..'>/'8!/@&?%$@!9#A'!/?!,&?B#8'!0!8#2/&#7%/'8! 0! &'2%1/! ?L! &',1#>0/#.$! /@'! 80/0702'! #.! '0>@! .?8'! 0.8! L%11M! S'2%1/2!0&'!2%--0&#_'8!#.!"#$)!)!"?&!011!#.,%/!80/0!2'/2T &'2?%&>')!9:2!8'1#B'&!,'&L?&-0.>'!>?-,0&071'!N#/@!,@M2#>01!-0>@#.'2! '.B#&?.-'./!/?!'+'>%/'!C08??,!0.8!:DE6702'8!FGHIJ!0,,1#>0/#?.2)!H! 011?>0/#.$! #/! /?! -0#.! -'-?&M)! E/! 012?! >?.L#&-2! ?/@'&! =1?%8FGHIJ! 2@?N2! 0! 2-011! 08B0./0$'! ?B'&! -,#FGHIJ) N@'.!'+'>%/#.$!FGHIJ)! %2'&!>0.!&'K%'2/!0!B#&/%01!>1%2/'&!L?&!FGHIJ!'+'>%/#?.!7M!>?./0>/#.$!/@'! ?72'&B0/#?.2! ?L! B#&/%01#_0/#?.! @0B#.$! .'$1#$#71'! #-,0>/! ?.! "?&!#.,%/!80/0!2'/!?L!1?.$!2'K%'.>'2! 0!(6L?18!2,''8%,!>0. 
B#&/%01!N?&O2,0>'2!2'&B'&!P9QIR)! 7'! ?72'&B'8! N#/@! =1?%8FGHIJ! N@'.! %2#.$! U4! ,&?>'22?&2 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 22 / 84 ?.!3!2#/'2T!N@#1'!-,#FGHIJ!8'1#B'&2!(36L?18!2,''8%,)!J@' ()!H!9S!9:!#2!>1?.'8!0.8!2/0&/'8!#.!'0>@!2#/'T!?LL'&#.$! #.,%/)! J@'! 7102/+! ,&?$&0-! L&?-! /@'! A=FE! FGHIJ! 2%#/'T! N@#>@! >?-,0&'2! .%>1'?/#8'2! 2'K%'.>'2! 0$0#.2/! ,&?/'#.! 2'K%'.>'2! 7M! /&0.210/#.$! /@'! .%>1'?/#8'2! #./?! 0-#.?! 0>#82T! N02! %2'8)! =@0&0>/'&#2/#>2! ?L! /@'! 3! #.,%/! 2'/2! 0&'! 2@?N.! #.! JHFG*!EE)! CloudBlast Deployment and Performance
  • 23. Section 2 Design and Principles of MapReduce Execution Runtime G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 23 / 84
  • 24. Subsection 1 File Systems G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 24 / 84
  • 25. File System for Data-intensive Computing MapReduce is associated with parallel file systems • GFS (Google File System) and HDFS (Hadoop File System) • Parallel, Scalable, Fault tolerant file system • Based on inexpensive commodity hardware (failures) • Allows to mix storage and computation GFS/HDFS specificities • Many files, large files (> x GB) • Two types of reads • Large stream reads (MapReduce) • Small random reads (usually batched together) • Write Once (rarely modified), Read Many • High sustained throughput (favored over latency) • Consistent concurrent execution G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 25 / 84
  • 26. Example: the Hadoop Architecture Distributed Computing (MapReduce) Data  Analy)cs   Applica)ons   JobTracker   Distributed Storage (HDFS) NameNode   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Masters Data  Node   TaskTracker   G. Fedak (INRIA, France) MapReduce Slaves Salvador da Bahia, Brazil, 22/10/2013 26 / 84
  • 27. HDFS Main Principles HDFS Roles File.txt • Application A   • Split a file in chunks • The more chunks, the more B   C   Write  chunks  A,   B,  C  of  file   File.txt   parallelism • Asks HDFS to store the Ok,  use  Data   nodes  2,  4,  6   NameNode   Applica'on   chunks • Name Node • Keeps track of File metadata (list of chunks) Data  Node  1   • Stores file chunks • Replicates file chunks G. Fedak (INRIA, France) MapReduce Data  Node  6   C   Data  Node  4   B   • Data Node Data  Node  5   Data  Node  3   replication (3) of chunks on Data Nodes Data  Node  4   Data  Node  2   A   • Ensures the distribution and Data  Node  7   Salvador da Bahia, Brazil, 22/10/2013 27 / 84
  • 28. Data locality in HDFS Rack Awareness • Name node groups Data Node in racks rack 1 rack 2 DN1, DN2, DN3, DN4 DN5, DN6, DN7, DN8 • Assumption that network performance differs if the communication is in-rack vs inter-racks A   Data  Node  1   • Replication Data  Node  2   B   • Keep 2 replicas in-rack and 1 in an other rack A B A   Data  Node  3   Data  Node  4   DN1, DN3, DN5 DN2, DN4, DN7 Data  Node  4   B   Data  Node  5   A   Data  Node  6   Data  Node  7   B   • Minimizes inter-rack communications • Tolerates full rack failure G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 28 / 84
  • 29. Writing to HDFS Client Data Node 1 Data Node 3 Data Node 5 Name Node write File.txt in dir d Writeable For each Data chunk AddBlock A Data nodes 1, 3, 5 (IP, port, rack number) initiate TCP connection Data Nodes 3, 5 initiate TCP connection Data Node 5 initiate TCP connection Pipeline Ready Pipeline Ready Pipeline Ready • Name Nodes checks permissions and select a list of Data Nodes • A pipeline is opened from the Client to each Data Nodes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 29 / 84
  • 30. Pipeline Write HDFS Client Data Node 1 Data Node 3 Data Node 5 Name Node For each Data chunk Send Block A Block Received A Send Block A Block Received A Send Block A Block Received A Success Close TCP connection Success Close TCP connection Success Close TCP connection Success • Data nodes send and receive chunk concurrently • Blocks are transferred one by one • Name Node updates the meta data table A G. Fedak (INRIA, France) DN1, DN3, DN5 MapReduce Salvador da Bahia, Brazil, 22/10/2013 30 / 84
  • 31. Fault Resilience and Replication NameNode • Ensures fault detection • Heartbeat is sent by DataNode to the NameNode every 3 seconds • After 10 successful heartbeats : Block Report • Meta data is updated • Fault is detected • NameNode lists from the meta-data table the missing chunks • Name nodes instruct the Data Nodes to replicate the chunks that were on the faulty nodes Backup NameNode • updated every hour G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 31 / 84
  • 32. Subsection 2 MapReduce Execution G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 32 / 84
  • 33. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Distribution • Distribution : • Split the input file in chunks • Distribute the chunks to the DFS G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 34. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Map • Map : • Mappers execute the Map function on each chunks • Compute the intermediate results (i.e., list(k , v )) • Optionally, a Combine function is executed to reduce the size of the IR G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 35. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Partition • Partition : • Intermediate results are sorted • IR are partitioned, each partition corresponds to a reducer • Possibly user-provided partition function G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 36. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Shuffle • Shuffle : • references to IR partitions are passed to their corresponding reducer • Reducers access to the IR partitions using remore-read RPC G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 37. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Reduce • Reduce: • IR are sorted and grouped by their intermediate keys • Reducers execute the reduce function to the set of IR they have received • Reducers write the Output results G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 38. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Combine • Combine : • the output of the MapReduce computation is available as R output files • Optionally, output are combined in a single result or passed as a parameter for another MapReduce computation G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  • 39. Return to WordCount: Application-Level Optimization The Combiner Optimization • How to decrease the amount of information exchanged between mappers and reducers ? → Combiner : Apply reduce function to map output before it is sent to reducer / / Pseudo−code f o r t h e Combine f u n c t i o n combine ( S t r i n g key , I t e r a t o r v a l u e s ) : / / key : a word , / / v a l u e s : l i s t o f counts i n t partial_word_count = 0; f o r each v i n v a l u e s : p a r t i a l _ w o r d _ c o u n t += p a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( p a r t i a l _ w o r d _ c o u n t ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 34 / 84
  • 40. Let’s go further: Word Frequency • Input : Large number of text documents • Task: Compute word frequency across all the documents → Frequency is computed using the total word count • A naive solution relies on chaining two MapReduce programs • MR1 counts the total number of words (using combiners) • MR2 counts the number of each word and divide it by the total number of words provided by MR1 • How to turn this into a single MR computation ? 1 2 Property : reduce keys are ordered function EmitToAllReducers(k,v) • Map task send their local total word count to ALL the reducers with the key "" • key "" is the first key processed by the reducer • sum of its values gives the total number of words G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 35 / 84
  • 41. Word Frequency Mapper and Reducer / / Pseudo−code f o r t h e Map f u n c t i o n map( S t r i n g key , S t r i n g v a l u e ) : / / key : document name , v a l u e : document c o n t e n t s i n t word_count = 0 ; f o r e a c h word i n s p l i t ( v a l u e ) : E m i t I n t e r m e d i a t e ( word , ’ ’ 1 ’ ’ ) ; word_count ++; E m i t I n t e r m e d i a t e T o A l l R e d u c e r s ( " " , A s S t r i n g ( word_count ) ) ; / / Peudo−code f o r t h e Reduce f u n c t i o n reduce ( S t r i n g key , I t e r a t o r v a l u e ) : / / key : a word , v a l u e s : a l i s t o f counts i f ( key= " " ) : i n t total_word_count = 0; foreach v i n values : t o t a l _ w o r d _ c o u n t += P a r s e I n t ( v ) ; else : i n t word_count = 0 ; foreach v i n values : word_count += P a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( word_count / t o t a l _ w o r d _ c o u n t ) ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 36 / 84
  • 42. Subsection 3 Key Features of MapReduce Runtime Environments G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 37 / 84
  • 43. Handling Failures On very large cluster made of commodity servers, failures are not a rare event. Fault-tolerant Execution • Workers’ failures are detected by heartbeat • In case of machine crash : 1 2 3 In progress Map and Reduce task are marked as idle and re-scheduled to another alive machine. Completed Map task are also rescheduled Any reducers are warned of Mappers’ failure • Result replica “disambiguation”: only the first result is considered • Simple FT scheme, yet it handles failure of large number of nodes. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 38 / 84
  • 44. Scheduling Data Driven Scheduling • Data-chunks replication • GFS divides files in 64MB chunks and stores 3 replica of each chunk on different machines • map task is scheduled first to a machine that hosts a replica of the input chunk • second, to the nearest machine, i.e. on the same network switch • Over-partitioning of input file • #input chunks >> #workers machines • # mappers > # reducers Speculative Execution • Backup task to alleviate stragglers slowdown: • stragglers : machine that takes unusual long time to complete one of the few last map or reduce task in a computation • backup task : at the end of the computation, the scheduler replicates the remaining in-progress tasks. • the result is provided by the first task that completes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 39 / 84
• 45. Speculative Execution in Hadoop
• When a node is idle, the scheduler selects a task in this order:
  1 failed tasks have the highest priority
  2 non-running tasks with data local to the node
  3 speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of input read
  • Reduce task:
    • considers 3 phases: the copy phase, the sort phase, the reduce phase
    • in each phase, the score is the fraction of the data processed. Example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 × 1/2) = 5/6
• straggler: a task whose progress score is more than 0.2 below the average
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 40 / 84
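A small Python sketch of this progress-score heuristic, assuming per-task phase fractions are available (illustrative only, not Hadoop's implementation):

    def hadoop_progress_score(task):
        # Map tasks: fraction of input read; Reduce tasks: average over the three
        # phases (copy, sort, reduce), each contributing up to 1/3.
        if task["type"] == "map":
            return task["input_read_fraction"]
        copy, sort, red = task["phase_fractions"]   # each in [0, 1]
        return (copy + sort + red) / 3.0

    def stragglers(tasks):
        # Hadoop's default heuristic: a task is a straggler candidate if its score
        # is more than 0.2 below the average score.
        scores = [hadoop_progress_score(t) for t in tasks]
        avg = sum(scores) / len(scores)
        return [t["id"] for t, s in zip(tasks, scores) if s < avg - 0.2]

    tasks = [
        {"id": "r1", "type": "reduce", "phase_fractions": (1.0, 1.0, 0.5)},  # 5/6, the slide's example
        {"id": "r2", "type": "reduce", "phase_fractions": (1.0, 1.0, 0.9)},
        {"id": "r3", "type": "reduce", "phase_fractions": (1.0, 0.0, 0.0)},  # lagging in the sort phase
    ]
    print(stragglers(tasks))   # ['r3']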
  • 46. Section 3 Performance Improvements and Optimizations G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 41 / 84
  • 47. Subsection 1 Handling Heterogeneity and Stragglers G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 42 / 84
• 48. Issues with the Hadoop Scheduler in Heterogeneous Environments
Hadoop makes several assumptions:
• nodes are homogeneous
• tasks are homogeneous
which lead to bad decisions:
• too many backup tasks, thrashing shared resources such as network bandwidth
• the wrong tasks are backed up
• backup tasks may be placed on slow nodes
• the heuristic breaks when tasks start at different times
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 43 / 84
• 49. The LATE Scheduler
The LATE Scheduler (Longest Approximate Time to End)
• estimate task completion times and back up the task with the longest estimated time to end
• throttle backup execution:
  • limit the number of simultaneous backup tasks (SpeculativeCap)
  • the slowest nodes are not scheduled backup tasks (SlowNodeThreshold)
  • only back up tasks that are sufficiently slow (SlowTaskThreshold)
• Estimating the task completion time:
  • progress rate = progress score / execution time
  • estimated time left = (1 − progress score) / progress rate
In practice, SpeculativeCap = 10% of the available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and of task progress rate
Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. in OSDI’08: Eighth Symposium on Operating System Design and Implementation, San Diego, CA, December, 2008.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 44 / 84
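A rough Python sketch of the LATE selection logic under the stated thresholds; it omits the SlowNodeThreshold test and uses made-up task records, so it illustrates the heuristic rather than reproducing the authors' code:

    def late_candidates(tasks, now, running_backups, slots,
                        speculative_cap=0.10, slow_task_pct=0.25):
        # Estimate time left for each running task: (1 - progress) / progress_rate,
        # with progress_rate = progress / elapsed time, then back up the slowest ones.
        if running_backups >= speculative_cap * slots:
            return []
        rates = {t["id"]: t["progress"] / (now - t["start"]) for t in tasks}
        time_left = {t["id"]: (1.0 - t["progress"]) / rates[t["id"]] for t in tasks}
        # SlowTaskThreshold: only tasks whose progress rate is in the slowest quartile.
        sorted_rates = sorted(rates.values())
        cutoff = sorted_rates[int(slow_task_pct * (len(sorted_rates) - 1))]
        slow = [tid for tid, r in rates.items() if r <= cutoff]
        # Rank candidates by longest estimated time to end.
        return sorted(slow, key=lambda tid: -time_left[tid])

    tasks = [
        {"id": "t1", "progress": 0.66, "start": 0.0},
        {"id": "t2", "progress": 0.95, "start": 0.0},
        {"id": "t3", "progress": 0.05, "start": 1.0},
    ]
    print(late_candidates(tasks, now=2.0, running_backups=0, slots=20))  # ['t3']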
• 50. Progress Rate Example
[Figure: timeline of three nodes over 2 minutes — Node 1 runs at 1 task/min, Node 2 is 3x slower, Node 3 is 1.9x slower; x axis is Time (min)]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 45 / 84
• 51. LATE Example
[Figure: after 2 minutes — Node 2, progress = 66%, estimated time left: (1-0.66) / (1/3) = 1 min; Node 3, progress = 5.3%, estimated time left: (1-0.05) / (1/1.9) = 1.8 min; x axis is Time (min)]
LATE correctly picks Node 3
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 46 / 84
• 52. LATE Performance (EC2 Sort benchmark)
• Heterogeneity experiment: 243 VMs on 99 hosts (1-7 VMs/host), each sorting 128MB, for a total of 30GB of data
  → average 27% speedup over Hadoop's native speculation, 31% over no backups
• Straggler experiment: stragglers emulated by running intensive background load on a few nodes
  → average 58% speedup over Hadoop's native speculation, 220% over no backups
[Figures: response time of Sort with and without stragglers, comparing Hadoop's scheduler, no speculation, and the LATE scheduler]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 47 / 84
• 53. Understanding Outliers
• Straggler: a task that takes ≥ 1.5× the duration of the median task in its phase
• Re-compute: when a task's output is lost, dependent tasks wait until the output is regenerated
[Figure: Prevalence of Outliers — (a) what fraction of tasks in a phase are outliers? (b) how much longer do outliers take to finish?]
Causes of outliers
• Data skew: a few keys correspond to a lot of data; such long tasks are well explained by the amount of data they process, so duplicating them would not make them run faster and would waste resources
• Network contention: 70% of the cross-rack traffic is induced by reduce tasks
• Bad machines: half of the recomputes happen on 5% of the machines
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 48 / 84
• 54. The Mantri Scheduler
• cause- and resource-aware scheduling of outlier tasks
• real-time monitoring of task progress
[Figure: CDF of the speed-up of job completion time in the ideal case when (some combination of) outliers do not occur]
[Figure: The Outlier Problem — causes and solutions]
• By inducing high variability in repeat runs of the same job, outliers make it hard to meet SLAs: jobs have a non-trivial probability of taking twice as long or finishing half as quickly
Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris in OSDI’10: Ninth Symposium on Operating System Design and Implementation, Vancouver, Canada, December, 2010.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 49 / 84
• 55. Resource-Aware Restart
• Restart outlier tasks on better machines
• Consider two options: kill & restart, or duplicate
  • restart or duplicate if P(trem > tnew) is high, i.e. the task is likely to finish earlier when started anew
  • kill & restart only if the remaining time is large: trem > E(tnew) + ∆
  • duplicate only if it decreases the total amount of resources used: P(trem > tnew (c+1)/c) > δ, where c is the number of copies already running and δ = 0.25
• schedule restarted tasks before the others, to avoid being stuck with the large ones near completion
• a subtle point of outlier mitigation: reducing the completion time of a single task may in fact increase the job completion time (for example, replicating the output of every task would drastically reduce recomputations, but at a high resource cost)
[Figure: a stylized example illustrating kill & restart vs. duplication — tasks that are eventually killed are striped, repeat instances of a task are shown in a lighter mesh]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 50 / 84
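As a sketch of the restart-vs-duplicate test above (illustrative, not Mantri's code), the snippet below treats the estimates of t_rem and t_new as paired samples and applies the two conditions from the slide:

    def mantri_action(t_rem_samples, t_new_samples, copies, delta=0.25, slack=1.0):
        # Decide between leaving a task alone, duplicating it, or kill & restart,
        # following the Mantri-style tests on the slide. Inputs are assumed to be
        # sampled estimates of the remaining time and of a fresh run.
        e_new = sum(t_new_samples) / len(t_new_samples)
        e_rem = sum(t_rem_samples) / len(t_rem_samples)
        c = copies
        # Duplicate only if it is expected to save resources overall:
        # P(t_rem > t_new * (c+1)/c) > delta, with c copies currently running.
        p_dup = sum(r > n * (c + 1) / c
                    for r, n in zip(t_rem_samples, t_new_samples)) / len(t_rem_samples)
        if p_dup > delta:
            return "duplicate"
        # Kill & restart only when the remaining time is clearly larger than a fresh run.
        if e_rem > e_new + slack:
            return "kill-and-restart"
        return "leave-alone"

    # A long-running copy (remaining ~10 time units) vs. an expected fresh run of ~2.
    print(mantri_action([9, 10, 11], [2, 2, 3], copies=1))   # duplicate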
• 56. Network-Aware Placement
• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement:
  • does not track bandwidth changes
  • does not require global co-ordination
• Placement of a reduce phase with n tasks, running on a cluster with r racks, with In,r the input data matrix
• For each rack i, let du^i and dv^i be the data to be uploaded and downloaded, and bu^i and bd^i the corresponding available bandwidths
• the chosen placement is arg min over placements of max_i ( du^i / bu^i , dv^i / bd^i )
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 51 / 84
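A small Python illustration of this cost model (assumed data layout and bandwidth values, and a brute-force search instead of Mantri's local approximation):

    from itertools import product

    def placement_cost(fractions, data_in_rack, b_up, b_down):
        # fractions[i] = share of the reduce work placed on rack i.
        # Rack i uploads the data it holds that is consumed elsewhere and downloads
        # the share it processes that is held elsewhere; the cost is the slowest
        # rack's transfer time, max(upload / b_up, download / b_down).
        total = sum(data_in_rack)
        cost = 0.0
        for i, f in enumerate(fractions):
            up = data_in_rack[i] * (1.0 - f)          # data leaving rack i
            down = f * (total - data_in_rack[i])      # data entering rack i
            cost = max(cost, up / b_up[i], down / b_down[i])
        return cost

    def best_placement(data_in_rack, b_up, b_down, step=0.25):
        # Brute-force search over coarse-grained placements (sketch only).
        grid = [x * step for x in range(int(1 / step) + 1)]
        candidates = [p for p in product(grid, repeat=len(data_in_rack))
                      if abs(sum(p) - 1.0) < 1e-9]
        return min(candidates, key=lambda p: placement_cost(p, data_in_rack, b_up, b_down))

    # Rack 1 holds most of the intermediate data but rack 2 has more spare download bandwidth.
    print(best_placement([80, 20], b_up=[10, 10], b_down=[5, 20]))   # (0.75, 0.25)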
• 57. Mantri Performance
• Speeds up the median job on a production cluster of thousands of servers
• Network-aware data placement reduces the completion time of the typical reduce phase by 31%, and of half of the reduce phases by more than 60%
[Figures: (a) change in completion time and (b) change in resource usage, comparing Mantri's network-aware spread of tasks with the baseline implementation on a production cluster of thousands of servers, for four representative jobs]
[Figure: comparing straggler mitigation strategies (Hadoop, Dryad, LATE, Mantri) — Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 52 / 84
  • 58. Subsection 2 Extending MapReduce Application Range G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 53 / 84
• 59. Twister: Iterative MapReduce
• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Efficient support for iterative MapReduce computations (much faster than Hadoop or Dryad/DryadLINQ)
• Combine phase to collect all reduce outputs
• Data access via local disks
• Lightweight implementation in Java
• Support for typical MapReduce computations; tools to manage data
Ekanayake, Jaliya, et al. Twister: a Runtime for Iterative Mapreduce. International Symposium on High Performance Distributed Computing. HPDC 2010.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 54 / 84
• 60. MOON: MapReduce on Opportunistic Environments
• Considers volatile resources during their idle time
• Starts from the Hadoop runtime → Hadoop does not tolerate massive failures
• The MOON approach:
  • considers hybrid resources: very volatile nodes and more stable nodes
  • two queues are managed for speculative execution: frozen tasks and slow tasks
  • 2-phase task replication: the job is separated into a normal and a homestretch phase → the job enters the homestretch phase when the number of remaining tasks < H% of the available execution slots
Lin, Heshan, et al. Moon: Mapreduce on opportunistic environments. International Symposium on High Performance Distributed Computing. HPDC 2010.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 55 / 84
• 61. Towards Greener Data Centers
• Sustainable computing infrastructures that minimize the impact on the environment
  • Reduction of energy consumption
  • Reduction of greenhouse gas emissions
  • Better use of renewable energy (smart grids, co-location, etc.)
[Photo: the solar array of the Apple Maiden data center — about 100 acres (400,000 m²), 20 MW peak]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 56 / 84
  • 62. Section 4 MapReduce beyond the Data Center G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 57 / 84
• 63. Introduction to Computational Desktop Grids
High-Throughput Computing over Large Sets of Idle Desktop Computers
• Aggregate the computing power of PCs during their idle time
  • cycle stealing, e.g. via a screensaver
  • large-scale parallel applications
  • organization of computing-power "contests"
  • users must be able to measure their contribution (donation economy)
• Volunteer Computing Systems (SETI@Home, Distributed.net)
• Open Source Desktop Grids (BOINC, XtremWeb, Condor, OurGrid)
• Some numbers with BOINC (October 21, 2013)
  • Active: 0.5M users, 1M hosts
  • 24-hour average: 12 PetaFLOPS
  • EC2 equivalent: $1B/year
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 58 / 84
• 64. MapReduce over the Internet
Computing on Volunteers' PCs (a la SETI@Home)
• Potential of aggregate computing power and storage
  • Average PC: 1 TFlops, 4GB RAM, 1TB disk
  • 1 billion PCs, 100 million GPUs + tablets + smartphones + set-top boxes
  • hardware and electrical power are paid for by the consumers
  • self-maintained by the volunteers
But ...
• High number of resources
• Low performance
• Volatility
• Owned by volunteers
• Scalable, but mainly for embarrassingly parallel applications with few I/O requirements
• The challenge is to broaden the application domain
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 59 / 84
  • 66. Subsection 1 Background on BitDew G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 60 / 84
• 67. Large Scale Data Management
BitDew: a Programmable Environment for Large Scale Data Management
Key Idea 1: provides an API and a runtime environment which integrates several P2P technologies in a consistent way
Key Idea 2: relies on metadata (Data Attributes) to transparently drive data management operations: replication, fault-tolerance, distribution, placement, life-cycle
BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello. International Conference for High Performance Computing (SC'08), Austin, 2008
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 61 / 84
• 68. BitDew: the Big Cloudy Picture
• Aggregates storage in a single Data Space
• Clients put and get data from the data space
• Clients define data attributes (e.g. REPLICA = 3)
• Distinguishes service nodes (stable) from client and worker nodes (volatile)
  • Service: ensures fault tolerance, indexing and scheduling of data to worker nodes
  • Worker: stores data on Desktop PCs
• push/pull protocol between client → service ← worker
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 62 / 84
• 70. Data Attributes
REPLICA: indicates how many occurrences of a datum should be available at the same time in the system
RESILIENCE: controls the resilience of data in the presence of machine crashes
LIFETIME: a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
AFFINITY: drives the movement of data according to dependency rules
TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
DISTRIBUTION: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 63 / 84
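As a conceptual illustration only (this is not BitDew's actual API; all names are made up), the Python sketch below shows how a scheduler could interpret the REPLICA and DISTRIBUTION attributes when placing data on hosts:

    def schedule(data_items, hosts, placements):
        # placements: dict data_id -> set of hosts currently holding the datum.
        # For each datum, keep creating copies until its REPLICA target is met,
        # while respecting the DISTRIBUTION limit (max pieces per host).
        orders = []
        for d in data_items:
            holders = placements.setdefault(d["id"], set())
            for host in hosts:
                if len(holders) >= d["attr"]["replica"]:
                    break
                per_host = sum(1 for ds in placements.values() if host in ds)
                if host not in holders and per_host < d["attr"]["distribution"]:
                    holders.add(host)
                    orders.append((d["id"], host))
        return orders

    data_items = [{"id": "chunk-1", "attr": {"replica": 2, "distribution": 1}},
                  {"id": "chunk-2", "attr": {"replica": 2, "distribution": 1}}]
    print(schedule(data_items, ["h1", "h2", "h3"], {}))
    # [('chunk-1', 'h1'), ('chunk-1', 'h2'), ('chunk-2', 'h3')]
    # chunk-2 gets only one copy here because every host already stores one piece
    # (DISTRIBUTION = 1), showing how the two attributes interact.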
• 71. Architecture Overview
[Figure: BitDew runtime environment — applications and a command-line tool built on the BitDew service API (BitDew, Active Data, Transfer Manager); services hosted in a service container (Data Catalog, Data Scheduler, Data Repository, Data Transfer); back-ends: SQL Server, file system, DHT, master/worker storage, and HTTP/FTP/BitTorrent transfer protocols]
• Programming APIs to create Data and Attributes, manage file transfers and program applications
• Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the Data
• Several information storage back-ends and file transfer protocols
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 64 / 84
• 72. Anatomy of a BitDew Application
• Clients
  1 create Data and Attributes
  2 put files into the created data slots
  3 schedule the data with their corresponding attribute
• Reservoirs
  • are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and Reservoirs can both create data and install callbacks
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 65 / 84
  • 74. Examples of BitDew Applications • Data-Intensive Application • DI-BOT : data-driven master-worker Arabic characters recognition (M. Labidi, University of Sfax) • MapReduce, (X. Shi, L. Lu HUST, Wuhan China) • Framework for distributed results checking (M. Moca, G. S. Silaghi Babes-Bolyai University of Cluj-Napoca, Romania) • Data Management Utilities • File Sharing for Social Network (N. Kourtellis, Univ. Florida) • Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI) • Grid Data Stagging (IEP, Chinese Academy of Science) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 66 / 84
• 75. Experiment Setup
Hardware Setup
• Experiments are conducted in a controlled environment (a cluster for the micro-benchmarks and Grid5K for the scalability tests) as well as on an Internet platform
• GdX cluster: 310-node cluster of dual-Opteron CPUs (AMD-64 2GHz, 2GB SDRAM), 1Gb/s Ethernet switches
• Grid5K: an experimental Grid of 4 clusters including GdX
• DSL-lab: low-power, low-noise cluster hosted by real Internet users on DSL broadband networks; 20 Mini-ITX nodes, Pentium-M 1GHz, 512MB SDRAM, with 2GB Flash storage
• Each experiment is averaged over 30 measures
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 67 / 84
• 76. Fault-Tolerance Scenario
Evaluation of BitDew in the presence of host failures
Scenario: a datum is created with the attributes REPLICA = 5, FAULT_TOLERANCE = true and PROTOCOL = "ftp". Every 20 seconds, we simulate a machine crash by killing the BitDew process on one machine owning the data. The experiment runs in the DSL-Lab environment.
Figure: Gantt chart of the main events on nodes DSL01-DSL10 (bandwidths from 53KB/s to 492KB/s) — the blue box shows the download duration, the red box the waiting time, and the red star indicates a node crash.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 68 / 84
• 77. Collective File Transfer with Replication
Scenario: all-to-all file transfer with replication on DSL-Lab
• Four data {d1, d2, d3, d4} are created, each with the REPLICA attribute set to 4
• The all-to-all collective is implemented with the AFFINITY attribute: each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1
Results:
• one run every hour during approximately 12 hours
• during the collective, file transfers are performed concurrently and the data size is 5MB
• Cumulative Distribution Function (CDF) plot of the achieved bandwidth (KBytes/sec) for the all-to-all collective
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 69 / 84
• 78. Multi-source and Multi-protocol File Distribution
Scenario
• 4 clusters of 50 nodes each, each cluster hosting its own data repository (DC/DR)
• first phase: a client uploads a large datum to the 4 data repositories
• second phase: the cluster nodes get the datum from the data repository of their cluster
We vary:
  1 the number of data repositories, from 1 to 4
  2 the protocol used to distribute the data (BitTorrent vs FTP)
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 70 / 84
• 79. Multi-source and Multi-protocol File Distribution
[Table: throughput (MB/s) per cluster (orsay, nancy, bordeaux, rennes, and mean) for BitTorrent and FTP, during the first and second phases, with 1, 2 or 4 data repositories (1DR/1DC, 2DR/2DC, 4DR/4DC)]
• for both protocols, increasing the number of data sources also increases the available bandwidth
• FTP performs better in the first phase and BitTorrent gives superior performance in the second phase
• the best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second phase) takes 26.46 seconds and outperforms the best single-data-source, single-protocol (BitTorrent) configuration by 168.6%
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 71 / 84
  • 80. Subsection 2 MapReduce for Desktop Grid Computing G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 72 / 84
• 81. Challenge of MapReduce over the Internet
[Figure: MapReduce dataflow — data split / input files → mappers → intermediate results → reducers → output files → combine → final output]
• Needs data replica management
• Result certification of intermediate data
• No shared file system nor direct communication between hosts
• Faults and host churn
• Collective operations (scatter + gather/reduction)
Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He, G. Fedak, in Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 73 / 84
• 82. Implementing MapReduce over BitDew (1/2)
Initialisation by the Master
1 Data slicing: the Master creates a DataCollection DC = d1, ..., dn, a list of data chunks sharing the same MapInput attribute
2 Creates reducer tokens, each one with its own TokenReducX attribute (affinity set to the Reducer)
3 Creates and pins collector tokens with the TokenCollector attribute (affinity set to the Collector token)
Execution of Map Tasks
1 When a Worker receives a MapToken, it launches the corresponding MapTask and computes the intermediate values listj,m(k, v)

Algorithm 2 EXECUTION OF MAP TASKS
Require: Let Map be the Map function to execute
Require: Let M be the number of Mappers and m a single mapper
Require: Let dm = d1,m, ..., dk,m be the set of map input data received by worker m
  {on the Master node}
  Create a single data MapToken with affinity set to DC
  {on the Worker node}
  if MapToken is scheduled then
    for all data dj,m ∈ dm do
      execute Map(dj,m)
      create listj,m(k, v)
    end for
  end if

Algorithm 3 SHUFFLING INTERMEDIATE RESULTS
Require: Let M be the number of Mappers and m a single worker
Require: Let R be the number of Reducers
Require: Let listm(k, v) be the set of (key, value) pairs of intermediate results on worker m
  {on the Master node}
  for all r ∈ [1, ..., R] do
    Create Attribute ReduceAttr_r with distrib = 1
    Create data ReduceToken_r
    schedule data ReduceToken_r with attribute ReduceTokenAttr
  end for
  {on the Worker node}
  split listm(k, v) into intermediate files ifm,1, ..., ifm,r
  for all files ifm,r do
    create reduce input data irm,r and upload ifm,r
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 74 / 84
• 83. Implementing MapReduce over BitDew (2/2)
Shuffling Intermediate Results
1 The Worker partitions its intermediate results according to the partition function and the number of reducers, and creates ReduceInput data with the attribute ReduceToken_r
2 The distrib = 1 flag of ReduceTokenAttr ensures a fair distribution of the workload between reducers
3 The Shuffle phase is simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag is equal to the corresponding ReduceToken
Execution of Reduce Tasks
1 When a reducer receives ReduceInput files ir, it launches the corresponding Reduce(ir)
2 When all the ir have been processed, the reducer combines the results into a single final result Or
3 Eventually, the final results can be sent back to the master using the TokenCollector attribute: a MasterToken data is pinned on the Master node, and the results of the Reduce tasks, scheduled with affinity set to MasterToken, are sent to the Master, which can combine them

Reduce phase (pseudo-code):
  {on the Worker node}
  if irm,r is scheduled then
    execute Reduce(irm,r)
    if all irr have been processed then
      create or with affinity = MasterToken
    end if
  end if
  {on the Master node}
  if all or have been received then
    Combine the or into a single result
  end if

Figure 3: the Worker handles the following threads: Mapper, Reducer and ReducePutter; Q1, Q2 and Q3 are queues used for communication between the threads.
Collective file operations: MapReduce processing is composed of several collective file operations which, in some aspects, look similar to collective communications in parallel programming such as MPI: the Distribution, i.e. the initial distribution of file chunks (similar to MPI_Scatter), and the Shuffle, i.e. the redistribution of intermediate results between the Mapper and the Reducer nodes (similar to MPI_AlltoAll).
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 75 / 84
• 84. Handling Communications
Latency Hiding
• Multi-threaded worker to overlap communication and computation
• The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations start
Barrier-free computation
• Reducers detect duplication of intermediate results (which happens because of faults and/or lags)
• Early reduction: intermediate results are processed as they arrive → this allowed us to remove the barrier between the Map and Reduce tasks
• But ... intermediate results are not sorted
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 76 / 84
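A minimal Python sketch of barrier-free early reduction with duplicate filtering, assuming a commutative and associative reduce function such as word count (illustrative only, not the actual worker code):

    class EarlyReducer:
        # Process intermediate results as they arrive (no barrier with the map phase),
        # while discarding duplicates caused by faults or speculative re-execution.
        def __init__(self):
            self.seen = set()        # (mapper_id, chunk_id) already reduced
            self.partial = {}        # running word counts

        def on_intermediate_result(self, mapper_id, chunk_id, pairs):
            if (mapper_id, chunk_id) in self.seen:
                return               # duplicate intermediate result, ignore it
            self.seen.add((mapper_id, chunk_id))
            for key, value in pairs:     # pairs arrive unsorted
                self.partial[key] = self.partial.get(key, 0) + value

        def final_result(self):
            return dict(sorted(self.partial.items()))

    r = EarlyReducer()
    r.on_intermediate_result("m1", 0, [("b", 2), ("a", 1)])
    r.on_intermediate_result("m1", 0, [("b", 2), ("a", 1)])   # replayed after a fault
    r.on_intermediate_result("m2", 1, [("a", 4)])
    print(r.final_result())   # {'a': 5, 'b': 2}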
• 85. Scheduling and Fault-Tolerance
2-level scheduling
1 Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes
2 Workers periodically report the state of their ongoing computations to the MR-scheduler running on the master node
3 The MR-scheduler determines whether there are more available nodes than tasks to execute, which can avoid the lagger effect
Fault-tolerance
• In Desktop Grids, computing resources have high failure rates:
  → during the computation (execution of Map or Reduce tasks)
  → during the communication, that is file upload and download
• MapInput data and ReduceToken tokens have the resilient attribute enabled
• Because the intermediate results irj,r, j ∈ [1, m] have their affinity set to ReduceToken_r, they are automatically downloaded by the node on which the ReduceToken data is rescheduled
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 77 / 84
• 86. Data Security and Privacy
Distributed Result Checking
• Traditional Desktop Grid or Volunteer Computing projects implement result checking on the server
• Intermediate results are too large to be sent back to the server → distributed result checking
  • replicate the MapInput data and the Reducers
  • select correct results by majority voting
[Figure: dataflows in the MapReduce implementation — (a) no replication, (b) replication of the Map tasks, (c) replication of both Map and Reduce tasks]
Ensuring Data Privacy
• Use a hybrid infrastructure composed of private and public resources
• Use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 78 / 84
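A toy Python sketch of majority voting on replicated results; the "strict majority of the replicas" threshold is an assumption made for this illustration:

    from collections import Counter

    def certify(versions, replication):
        # versions: result digests received from distinct workers for the same map
        # input. Accept a result once a strict majority of the replicas agree;
        # return None while no result can be certified yet.
        counts = Counter(versions)
        digest, occurrences = counts.most_common(1)[0]
        if occurrences >= replication // 2 + 1:
            return digest
        return None

    # Map task replicated on 3 volunteers; one of them returns a corrupted result.
    print(certify(["sha1:ab12", "sha1:ab12", "sha1:ffff"], replication=3))   # sha1:ab12
    print(certify(["sha1:ab12", "sha1:ffff"], replication=3))                # None (wait for more)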
• 87. BitDew Collective Communication
Figure: execution time of the Broadcast, Scatter and Gather collectives for a 2.7GB file on a varying number of nodes. Since enough chunks are needed to measure the Scatter/Gather collectives, the chunk size is set to 16MB.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 79 / 84
• 88. MapReduce Evaluation
Figure: scalability evaluation on the WordCount application — the y axis presents the throughput in MB/s and the x axis the number of nodes, varying from 1 to 512.
[Table II: evaluation of the performance according to the number of Mappers and Reducers]
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 80 / 84
• 89. Fault-Tolerance Scenario
Figure: fault-tolerance scenario (5 mappers, 2 reducers) — crashes are injected during the download of a file (F1), the execution of a map task (F2) and during the execution of the reduce task on the last intermediate result (F3).
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 81 / 84
• 90. Hadoop vs MapReduce/BitDew on Extreme Scenarios
Objective: evaluate MapReduce/BitDew on Internet-like scenarios — CPU heterogeneity and stragglers, host churn, network failures, and restricted connectivity (firewalls/NAT).
• CPU heterogeneity and stragglers: heterogeneity is emulated with wrekavoc by lowering the CPU frequency of half of the machines, and stragglers by decreasing the CPU frequency of a few target nodes to 10%. Both Hadoop and BitDew-MapReduce launch backup tasks; BitDew-MapReduce's makespan stays close to the baseline.
• Host churn: a worker process is periodically killed on one node and a new one is launched on another node. Hadoop fails when the churn interval is below 30 seconds — the JobTracker marks temporarily disconnected nodes as dead and blindly re-executes all their tasks, including completed map tasks — whereas BitDew-MapReduce keeps making progress with a makespan close to the failure-free case.
• Network fault tolerance: a 25-second off-line window is injected on worker nodes at different points of job progress (from 12.5% to 100%), with a 30-second heartbeat timeout for both runtimes. Hadoop re-executes the tasks of the "dead" nodes, which significantly prolongs the job makespan; BitDew-MapReduce re-executes no map task because the outputs of completed map tasks are safely stored on the stable central storage.
• Network connectivity: with custom firewall and NAT rules on the worker nodes, Hadoop cannot even launch a job, while BitDew-MapReduce still runs since workers never communicate directly — all data are transferred through the central stable storage. The resulting transfer overhead makes this approach best suited to applications such as Word Count or Distributed Grep; storage clusters with large aggregated bandwidth, or public cloud storage services, could mitigate it.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 82 / 84
  • 91. Section 5 Conclusion G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 83 / 84
• 92. Conclusion
• MapReduce Runtimes
  • Attractive programming model for data-intensive computing
  • Many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling)
  • Many others: security, QoS, iterative MapReduce, ...
• MapReduce beyond the Data Center:
  • enables large-scale data-intensive computing: Internet MapReduce
• Want to learn more?
  • Home page for the Mini-Curso: http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
  • the MapReduce Workshop@HPDC (http://graal.ens-lyon.fr/mapreduce)
  • Desktop Grid Computing, ed. C. Cerin and G. Fedak – CRC Press
  • our websites: http://www.bitdew.net and http://www.xtremweb-hep.org
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 84 / 84