The document discusses MapReduce runtime environments, including their design, performance optimizations, and applications. It provides an overview of MapReduce, describing the programming model and key-value data processing. It also discusses the design of MapReduce execution runtimes, including their use of distributed file systems and handling of parallelization, load balancing, and failures. Finally, it outlines areas of ongoing research to improve MapReduce performance and applicability.
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
MapReduce Runtime Design and Performance Optimizations
1. MapReduce Runtime Environments
Design, Performance, Optimizations
Gilles Fedak
Many Figures imported from authors’ works
Gilles.Fedak@inria.fr
INRIA/University of Lyon, France
Escola Regional de Alto Desempenho - Região Nordeste
2. Outline of Topics
1. What is MapReduce?
   The Programming Model
   Applications and Execution Runtimes
2. Design and Principles of MapReduce Execution Runtimes
   File Systems
   MapReduce Execution
   Key Features of MapReduce Runtime Environments
3. Performance Improvements and Optimizations
   Handling Heterogeneity and Stragglers
   Extending MapReduce Application Range
4. MapReduce beyond the Data Center
   Background on BitDew
   MapReduce for Desktop Grid Computing
5. Conclusion
3. Section 1
What is MapReduce ?
4. What is MapReduce ?
• Programming model for data-intensive applications
• Proposed by Google in 2004
• Simple, inspired by functional programming
  • the programmer simply defines Map and Reduce tasks
• Building block for other parallel programming tools
• Strong open-source implementation: Hadoop
• Highly scalable
  • accommodates large-scale clusters of faulty and unreliable resources
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
5. Subsection 1
The Programming Model
7. MapReduce a la Google
Map(k, v) → list(k′, v′)
Group (k′, v′) pairs by k′
Reduce(k′, {v1, . . . , vn}) → v″

Input → Mapper → Group by key → Reducer → Output

• Map(key, value) is run on each item in the input data set
  • emits a new (key′, val′) pair
• Reduce(key′, vals) is run for each unique key emitted by the Map
  • emits the final output
8. Example: Word Count
• Input: a large number of text documents
• Task: compute the word count across all the documents
• Solution:
  • Mapper: for every word in a document, emit (word, 1)
  • Reducer: sum all occurrences of a word and emit (word, total_count)
9. WordCount Example
// Pseudo-code for the Map function
map(String key, String value):
    // key: document name
    // value: document contents
    for each word in split(value):
        EmitIntermediate(word, "1");

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int word_count = 0;
    for each v in values:
        word_count += ParseInt(v);
    Emit(key, AsString(word_count));
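To make the dataflow concrete, here is a minimal, self-contained Python sketch of the same WordCount logic, simulating the map, group-by-key and reduce steps in a single process (the function names mirror the pseudo-code above and are illustrative; this is not Hadoop's API):

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Map: emit (word, 1) for every word in the document
        return [(word, 1) for word in contents.split()]

    def reduce_fn(word, counts):
        # Reduce: sum all occurrences of the word
        return word, sum(counts)

    def mapreduce(documents):
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)   # group by key
        return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
    print(mapreduce(docs))   # {'the': 3, 'quick': 1, 'brown': 1, ...}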
10. Subsection 2
Applications and Execution Runtimes
11. MapReduce Applications Skeleton
• Read large data set
• Map: extract relevant information from each data item
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Store the result
• Iterate if necessary
12. MapReduce Usage at Google
Jerry Zhao (MapReduce Group Tech lead @ Google)
13. MapReduce + Distributed Storage
Powerful when associated with a distributed storage system
• GFS (Google File System), HDFS (Hadoop File System)
14. The Execution Runtime
• The MapReduce runtime handles transparently:
  • Parallelization
  • Communications
  • Load balancing
  • I/O: network and disk
  • Machine failures
  • Laggers (stragglers)
16. Existing MapReduce Systems
#$%&'()*+,-./0-(12$3-4$((
""#$%&'()*%+,-%&
• Google MapReduce
./")/0%1(/2&(3+&-$"4%+&4",/-%&
• Proprietary and close source
5$"4%$2&$036%+&1"&!78& to GFS
• Closely linked
9:)$%:%31%+&03&5;;&&<01=&.21="3&(3+&>(?(&@03+03#4&
• Implemented in C++ with Python and Java Bindings
8(<A($$&
• Hadoop
(+"")&
• Open source (Apache projects)
• Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly
C)%3&4",/-%&DE)(-=%&)/"F%-1G&
more)
5(4-(+03#H&>(?(H&*,@2H&.%/$H&.21="3H&.B.H&*H&"/&5;;&
• Many side projects: HDFS, PIG, Hive, Hbase, Zookeeper
'(32&40+%&)/"F%-14&I&BJ78H&.9!H&B@(4%H&B0?%H&K""6%)%/&
• Used by Yahoo!, Facebook, Amazon and the Google IBM research Cloud
L4%+&@2&M(=""NH&7(-%@""6H&E:(A"3&(3+&1=%&!""#$%&9O'&
/%4%(/-=&5$",+&
17. Other Specialized Runtime Environments
• Hardware platforms
  • Multicore (Phoenix)
  • FPGA (FPMR)
  • GPU (Mars)
• Frameworks based on MapReduce
  • Iterative MapReduce (Twister)
  • Incremental MapReduce (Nephele, Online MR)
  • Languages (PigLatin)
• Security
  • Data privacy (Airavat)
• MapReduce/Virtualization
  • EC2 Amazon
  • CAM
• Log analysis: In situ MR
19. Research Areas to Improve MapReduce Performance
• File System
  • higher throughput, reliability, replication efficiency
• Scheduling
  • data/task affinity, scheduling multiple MR applications, hybrid distributed computing infrastructures
• Failure detection
  • failure characterization, failure detectors and prediction
• Security
• Data privacy, distributed result/error checking
• Greener MapReduce
• Energy efficiency, renewable energy
20. Research Areas to Improve MapReduce Applicability
• Scientific Computing
• Iterative MapReduce, numerical processing, algorithmics
• Parallel Stream Processing
  • language extensions, runtime optimization for parallel stream processing, incremental processing
• MapReduce on widely distributed data sets
  • MapReduce on embedded devices (sensors), data integration
21. CloudBlast: MapReduce and Virtualization for BLAST Computing

NCBI BLAST
• The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
• It compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

Hadoop + BLAST
• the stream of sequences is split by Hadoop into chunks on new-line boundaries; a StreamPatternRecordReader (StreamInputFormat) splits the chunks into sequences, passed as keys to the mapper (empty value)
• each BLAST Mapper compares its sequences to the gene database
• Hadoop provides fault tolerance and load balancing
• BLAST output is written to local files (part-00000, part-00001, ...); HDFS is used to merge the results (DFS merger)
• the database is not chunked

CloudBlast: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. A. Matsunaga, M. Tsugawa and J. Fortes. In eScience'08, 2008.
25. File System for Data-intensive Computing
MapReduce is associated with parallel file systems
• GFS (Google File System) and HDFS (Hadoop File System)
• Parallel, scalable, fault-tolerant file systems
• Based on inexpensive commodity hardware (failures happen)
• Allow mixing storage and computation

GFS/HDFS specificities
• Many files, large files (> x GB)
• Two types of reads
  • large streaming reads (MapReduce)
  • small random reads (usually batched together)
• Write once (rarely modified), read many
• High sustained throughput (favored over latency)
• Consistent concurrent execution
26. Example: the Hadoop Architecture
Figure: the Hadoop architecture. Data analytics applications run on top of distributed computing (MapReduce) and distributed storage (HDFS). The masters are the JobTracker (MapReduce) and the NameNode (HDFS); each slave node runs a DataNode (storage) and a TaskTracker (computation).
27. HDFS Main Principles
HDFS Roles
• Application
  • splits a file into chunks (the more chunks, the more parallelism)
  • asks HDFS to store the chunks
• Name Node
  • keeps track of file metadata (the list of chunks)
  • ensures the distribution and replication (3 by default) of chunks on Data Nodes
• Data Node
  • stores file chunks
  • replicates file chunks

Figure: the application writes chunks A, B, C of File.txt; the NameNode answers "Ok, use Data Nodes 2, 4, 6", and the chunks end up replicated across Data Nodes 1-7.
28. Data locality in HDFS
Rack Awareness
• the Name Node groups Data Nodes into racks (e.g., rack 1: DN1-DN4; rack 2: DN5-DN8)
• assumption: network performance differs between in-rack and inter-rack communication
• Replication: keep 2 replicas in-rack and 1 in another rack (e.g., chunk A on DN1, DN3, DN5 and chunk B on DN2, DN4, DN7), as sketched below
• minimizes inter-rack communications
• tolerates a full rack failure
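A rough Python sketch of the placement policy described above (2 replicas in the local rack, 1 in another rack); this is a simplified illustration with assumed rack names, not HDFS's actual code:

    import random

    def place_replicas(racks, local_rack, n_replicas=3):
        # Two copies in the writer's rack, the rest in one remote rack.
        local = random.sample(racks[local_rack], 2)
        remote_rack = random.choice([r for r in racks if r != local_rack])
        remote = random.sample(racks[remote_rack], n_replicas - 2)
        return local + remote

    racks = {"rack1": ["DN1", "DN2", "DN3", "DN4"],
             "rack2": ["DN5", "DN6", "DN7", "DN8"]}
    print(place_replicas(racks, "rack1"))   # e.g. ['DN1', 'DN3', 'DN5']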
29. Writing to HDFS
Client → Name Node: write File.txt in dir d
Name Node → Client: Writeable

For each data chunk:
  Client → Name Node: AddBlock A
  Name Node → Client: Data Nodes 1, 3, 5 (IP, port, rack number)
  the Client initiates a TCP connection to Data Node 1, which connects to Data Nodes 3 and 5 in turn; "Pipeline Ready" flows back to the Client

• the Name Node checks permissions and selects a list of Data Nodes
• a pipeline is opened from the Client through each Data Node
30. Pipeline Write HDFS
For each data chunk, the Client sends Block A down the pipeline (Client → Data Node 1 → Data Node 3 → Data Node 5); "Block Received A" acknowledgments and Success messages flow back, and the TCP connections are closed.

• Data Nodes send and receive chunks concurrently
• blocks are transferred one by one
• the Name Node updates the metadata table (A → DN1, DN3, DN5)
31. Fault Resilience and Replication
NameNode
• Ensures fault detection
  • a heartbeat is sent by each DataNode to the NameNode every 3 seconds
  • after 10 successful heartbeats: a Block Report, from which the metadata is updated
• When a fault is detected:
  • the NameNode lists, from the metadata table, the missing chunks
  • the NameNode instructs the Data Nodes to replicate the chunks that were on the faulty nodes (sketched below)

Backup NameNode
• updated every hour
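The re-replication step can be sketched as follows, assuming a metadata table mapping each chunk to the Data Nodes that hold it (an illustrative Python mock, not the NameNode's implementation):

    def chunks_to_replicate(meta, live_nodes, target=3):
        # For each chunk, drop dead holders and count the missing copies.
        work = {}
        for chunk, holders in meta.items():
            alive = [n for n in holders if n in live_nodes]
            if len(alive) < target:
                work[chunk] = (alive, target - len(alive))
        return work

    meta = {"A": ["DN1", "DN3", "DN5"], "B": ["DN2", "DN4", "DN7"]}
    live = {"DN1", "DN2", "DN4", "DN5", "DN7"}   # DN3 missed its heartbeats
    print(chunks_to_replicate(meta, live))       # {'A': (['DN1', 'DN5'], 1)}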
33. Anatomy of a MapReduce Execution
Figure: input files are split into chunks, processed by Mappers into intermediate results (optionally combined), then partitioned, shuffled and merged by Reducers into the final output files (Output1, Output2, Output3).

• Distribution:
  • split the input file into chunks
  • distribute the chunks to the DFS
34. Anatomy of a MapReduce Execution
(same figure as above)

• Map:
  • Mappers execute the Map function on each chunk
  • they compute the intermediate results (i.e., list(k, v))
  • optionally, a Combine function is executed to reduce the size of the IR
35. Anatomy of a MapReduce Execution
(same figure as above)

• Partition:
  • intermediate results are sorted
  • IR are partitioned; each partition corresponds to a reducer
  • possibly with a user-provided partition function
36. Anatomy of a MapReduce Execution
(same figure as above)

• Shuffle:
  • references to IR partitions are passed to their corresponding reducer
  • Reducers access the IR partitions using remote-read RPCs
37. Anatomy of a MapReduce Execution
(same figure as above)

• Reduce:
  • IR are sorted and grouped by their intermediate keys
  • Reducers apply the reduce function to the set of IR they have received
  • Reducers write the output results
38. Anatomy of a MapReduce Execution
(same figure as above)

• Combine:
  • the output of the MapReduce computation is available as R output files
  • optionally, the outputs are combined into a single result or passed as input to another MapReduce computation
39. Return to WordCount: Application-Level Optimization
The Combiner Optimization
• How to decrease the amount of information exchanged between mappers and reducers?
→ Combiner: apply the reduce function to the map output before it is sent to the reducers
// Pseudo-code for the Combine function
combine(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int partial_word_count = 0;
    for each v in values:
        partial_word_count += ParseInt(v);
    Emit(key, AsString(partial_word_count));
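A quick way to see the benefit: with a combiner, each mapper ships one pair per distinct word instead of one pair per word occurrence. A small illustrative Python sketch:

    from collections import Counter

    def map_with_combiner(contents):
        # Combine locally: one (word, partial_count) pair per distinct word.
        return list(Counter(contents.split()).items())

    chunk = "to be or not to be"
    print(map_with_combiner(chunk))
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]: 4 pairs are shuffled
    # instead of the 6 (word, 1) pairs the plain mapper would emit.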
40. Let’s go further: Word Frequency
• Input: a large number of text documents
• Task: compute the word frequency across all the documents
→ the frequency is computed using the total word count
• A naive solution chains two MapReduce programs:
  • MR1 counts the total number of words (using combiners)
  • MR2 counts the occurrences of each word and divides them by the total number of words provided by MR1
• How to turn this into a single MR computation? Two ingredients:
  1. Property: reduce keys are processed in sorted order
  2. A function EmitToAllReducers(k, v)
• Map tasks send their local total word count to ALL the reducers under the key ""
• the key "" is the first key processed by each reducer
• the sum of its values gives the total number of words
41. Word Frequency Mapper and Reducer
// Pseudo-code for the Map function
map(String key, String value):
    // key: document name, value: document contents
    int word_count = 0;
    for each word in split(value):
        EmitIntermediate(word, "1");
        word_count++;
    EmitIntermediateToAllReducers("", AsString(word_count));

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    if (key == ""):
        int total_word_count = 0;
        for each v in values:
            total_word_count += ParseInt(v);
    else:
        int word_count = 0;
        for each v in values:
            word_count += ParseInt(v);
        Emit(key, AsString(word_count / total_word_count));
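A minimal Python simulation of this single-pass scheme: one dictionary stands in for all the reducers, and iterating over the keys in sorted order reproduces the ordering property, so "" is processed before any real word (an illustrative sketch, not Hadoop code):

    from collections import defaultdict

    def word_frequency(documents):
        intermediate = defaultdict(list)
        for contents in documents:
            words = contents.split()
            for w in words:
                intermediate[w].append(1)
            intermediate[""].append(len(words))  # EmitToAllReducers("", count)
        total, freqs = 0, {}
        for key in sorted(intermediate):         # "" sorts before any word
            if key == "":
                total = sum(intermediate[key])
            else:
                freqs[key] = sum(intermediate[key]) / total
        return freqs

    print(word_frequency(["a b a", "b c"]))      # {'a': 0.4, 'b': 0.4, 'c': 0.2}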
42. Subsection 3
Key Features of MapReduce Runtime Environments
43. Handling Failures
On very large clusters made of commodity servers, failures are not rare events.

Fault-tolerant Execution
• Workers' failures are detected by heartbeat
• In case of a machine crash:
  1. in-progress Map and Reduce tasks are marked as idle and re-scheduled to another alive machine
  2. completed Map tasks are also rescheduled
  3. all reducers are warned of the Mapper's failure
• Result replica "disambiguation": only the first result is considered
• A simple FT scheme, yet it handles the failure of large numbers of nodes.
44. Scheduling
Data Driven Scheduling
• Data-chunk replication
  • GFS divides files into 64MB chunks and stores 3 replicas of each chunk on different machines
  • a map task is scheduled first to a machine that hosts a replica of its input chunk
  • second, to the nearest machine, i.e., one on the same network switch
• Over-partitioning of the input file
  • #input chunks ≫ #worker machines
  • #mappers > #reducers

Speculative Execution
• Backup tasks to alleviate the slowdown caused by stragglers:
  • straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in a computation
  • backup task: at the end of the computation, the scheduler replicates the remaining in-progress tasks
  • the result is provided by the first task that completes
45. Speculative Execution in Hadoop
• When a node is idle, the scheduler selects a task in priority order:
  1. failed tasks have the highest priority
  2. then non-running tasks with data local to the node
  3. then speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of input read
  • Reduce task:
    • considers 3 phases: the copy phase, the sort phase, the reduce phase
    • in each phase, the score is the fraction of the data processed. Example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 × 1/2) = 5/6
  • straggler: a task whose progress score is 0.2 less than the average
48. Issues with the Hadoop Scheduler in Heterogeneous Environments
Hadoop makes several assumptions :
• nodes are homogeneous
• tasks are homogeneous
which leads to bad decisions :
• Too many backups, thrashing shared resources like network bandwidth
• Wrong tasks backed up
• Backups may be placed on slow nodes
• Breaks when tasks start at different times
49. The LATE Scheduler
The LATE Scheduler (Longest Approximate Time to End)
• estimates task completion times and backs up the task with the longest approximate time to end
• throttles backup execution:
  • limits the number of simultaneous backup tasks (SpeculativeCap)
  • the slowest nodes are not scheduled backup tasks (SlowNodeThreshold)
  • only tasks that are sufficiently slow are backed up (SlowTaskThreshold)
• Estimating the task completion time:

  progress rate = progress score / execution time
  estimated time left = (1 − progress score) / progress rate

In practice, SpeculativeCap = 10% of the available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and task progress rate.
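The heuristic is easy to state in code. A sketch of the estimator under the slide's definitions (illustrative; not the actual Hadoop/LATE implementation):

    def estimated_time_left(progress_score, elapsed):
        rate = progress_score / elapsed          # progress rate
        return (1.0 - progress_score) / rate     # estimated time left

    # Of two running tasks, back up the one estimated to finish last:
    tasks = {"t1": (0.66, 120.0), "t2": (0.20, 90.0)}  # (score, seconds elapsed)
    late = max(tasks, key=lambda t: estimated_time_left(*tasks[t]))
    print(late)   # 't2': 360s left vs ~62s for 't1'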
Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. in OSDI’08: Sixth
Symposium on Operating System Design and Implementation, San Diego, CA, December, 2008.
50. Progress Rate Example

Figure: three nodes run tasks over a 2-minute window; Node 1 progresses at 1 task/min, Node 2 is 3x slower and Node 3 is 1.9x slower.
51. LATE Example
Figure: after 2 minutes, Node 2 is at progress = 66% with estimated time left (1 − 0.66) / (1/3) = 1, while Node 3 is at progress = 5.3% with estimated time left (1 − 0.05) / (1/1.9) = 1.8. LATE correctly picks Node 3 for backup.
52. LATE Performance (EC2 Sort benchmark)
• Stragglers: emulated by running dd on 4 nodes
  → average 58% speedup over Hadoop's scheduler, 220% over no backups
• Heterogeneity: 243 VMs on 99 hosts (1-7 VMs per host); each host sorted 128MB, for a total of 30GB of data
  → average 27% speedup over Hadoop's scheduler, 31% over no backups

Figure: worst, best and average response times of the EC2 Sort benchmark with and without stragglers, comparing no backups, the Hadoop native scheduler and the LATE scheduler.
53. Understanding Outliers
• Stragglers: tasks that take ≥ 1.5× the median task duration in their phase
• Re-computes: when a task's output is lost, dependent tasks wait until the output is regenerated

Figure: prevalence of outliers. (a) What fraction of tasks in a phase are outliers (CDF of the fraction of high-runtime and recompute outliers)? (b) How much longer do outliers take to finish (CDF of the ratio of straggler duration to the duration of the median task)?

Causes of outliers
• Data skew: a few keys correspond to a lot of data; such tasks (longer than the median task) are well explained by the amount of data they process. Duplicating them would not make them run faster and would waste resources.
• Network contention: 70% of the cross-rack traffic is induced by reduce tasks
• Bad or busy machines: half of the recomputes happen on 5% of the machines
54. The Mantri Scheduler

• cause-aware and resource-aware scheduling
• real-time monitoring of tasks

Figure: percentage speed-up of job completion time in the ideal case where (some combination of) outliers do not occur.

Figure: the outlier problem, causes and solutions. Causes: unequal division of work (data skew), contention for resources, input becoming unavailable. Solutions: start the tasks that do more work first, network-aware placement, duplicate or kill-and-restart outlier tasks, replicate outputs, pre-compute.

By inducing high variability in repeat runs of the same job, outliers make it hard to meet SLAs: the ratio of stdev to mean in job completion time is high, i.e., jobs have a non-trivial probability of taking twice as long or finishing half as quickly.

Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha and E. Harris. In OSDI'10: Ninth Symposium on Operating System Design and Implementation, Vancouver, Canada, December 2010.
55. Resource Aware Restart

• Restart outlier tasks on better machines
• Two options: kill & restart, or duplicate

Figure: a stylized example illustrating the main ideas (baseline, duplicate, and kill & restart schedules); tasks that are eventually killed are filled with stripes, repeat instances of a task with a lighter mesh.

• restart or duplicate a task if P(t_new < t_rem) is high
• restart (kill) if the remaining time is large: t_rem > E(t_new) + Δ
• duplicate only if it decreases the total amount of resources used: P(t_rem > t_new · (c+1)/c) > δ, where c is the number of copies already running and δ = 0.25
• restart tasks early rather than near completion, to avoid being stuck with the large ones
• a subtle point with outlier mitigation: reducing the completion time of a task may in fact increase the job completion time; for example, replicating the output of every task would drastically reduce recomputations, but waste resources
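Putting the rules together, the restart-vs-duplicate decision can be sketched as below, assuming estimates of the remaining time t_rem and of a fresh run t_new are available. The probabilistic tests are replaced by their expected values, and the thresholds are assumed for illustration; this is not Mantri's actual code:

    DELTA = 30.0    # safety margin (seconds); assumed value
    GAMMA = 3.0     # assumed factor meaning "t_rem is much larger than E[t_new]"

    def mitigate_outlier(t_rem, t_new_mean, copies):
        if t_rem > GAMMA * t_new_mean + DELTA:
            return "kill-and-restart"     # remaining time is large
        if t_rem > t_new_mean * (copies + 1) / copies:
            return "duplicate"            # a copy lowers total resource usage
        return "leave-alone"              # close to completion, not worth it

    print(mitigate_outlier(t_rem=400, t_new_mean=60, copies=1))  # kill-and-restart
    print(mitigate_outlier(t_rem=150, t_new_mean=60, copies=1))  # duplicate
    print(mitigate_outlier(t_rem=90,  t_new_mean=60, copies=1))  # leave-alone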
56. Network Aware Placement
• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement:
  • does not track bandwidth changes
  • does not require global coordination
• Placement of a reduce phase with n tasks, running on a cluster with r racks, where I_{n,r} is the input data matrix:
  • for rack i, let d_u^i and d_v^i be the data to be uploaded and downloaded, and b_u^i and b_d^i the corresponding available bandwidths
  • the chosen placement is arg min_i max(d_u^i / b_u^i, d_v^i / b_d^i)
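The arg-min rule translates directly into code; a small Python sketch with made-up bandwidth and data-volume numbers (illustrative only):

    def place_reduce_phase(racks):
        # racks: rack -> (d_up, d_down, b_up, b_down) in bytes and bytes/sec.
        # Pick the rack whose worst-case transfer time is smallest.
        def cost(r):
            d_up, d_down, b_up, b_down = racks[r]
            return max(d_up / b_up, d_down / b_down)
        return min(racks, key=cost)

    racks = {
        "rack1": (8e9, 2e9, 1e8, 1e8),   # uploads dominate: max(80s, 20s)
        "rack2": (1e9, 6e9, 1e8, 1e8),   # downloads dominate: max(10s, 60s)
    }
    print(place_reduce_phase(racks))     # 'rack2' (60s < 80s)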
57. Mantri Performance

• Speeds up the median job by 55%
• Network-aware placement reduces the completion time of the typical reduce phase by 31%, and of half of the reduce phases by more than 60%

Figure: (a) change in completion time and (b) change in resource usage of Mantri versus the baseline implementation, on a production cluster of thousands of servers, for four representative jobs.

Figure: comparing Mantri's network-aware spread of tasks with the baseline implementation on a production cluster of thousands of servers.

Figure: comparing straggler mitigation strategies, per phase and per job; Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes.
58. Subsection 2
Extending MapReduce Application Range
59. !"#$%&'()(*%&'+,-&(.+/0&123&(
Twister: Iterative MapReduce
! !"#$"%&$"'%('%(#$)$"&()%*(
+),")-./(*)$)(
! (0'%123,)-./(.'%2(,3%%"%2(
4&)&5/)-./6(7)89,/*3&/($)#:#((
! ;3-9#3-(7/##)2"%2(-)#/*(
&'773%"&)$"'%9*)$)($,)%#</,#((
! =>&"/%$(#388',$(<',(?$/,)$"+/(
@)8A/*3&/(&'783$)$"'%#(
4/B$,/7/.C(<)#$/,($5)%(D)*''8(
',(!,C)*9!,C)*E?FG6((
! 0'7-"%/(85)#/($'(&'../&$()..(
,/*3&/('3$83$#((
! !)$)()&&/##(+")(.'&).(*"#:#(
E"25$H/"25$(4IJKLL(."%/#('<(
M)+)(&'*/6((
! N388',$(<',($C8"&).(@)8A/*3&/(
&'783$)$"'%#(O''.#($'(7)%)2/(
*)$)((
Ekanayake, Jaliya, et al. Twister: A Runtime for Iterative MapReduce. International Symposium on High Performance Distributed Computing (HPDC 2010).
60. Moon: MapReduce on Opportunistic Environments
• Considers volatile resources during their idle time
• Considers the Hadoop runtime
→ Hadoop does not tolerate massive failures
• The MOON approach:
  • consider hybrid resources: very volatile nodes plus more stable nodes
  • two queues are managed for speculative execution: frozen tasks and slow tasks
  • 2-phase task replication: the job is separated into a normal and a homestretch phase
→ the job enters the homestretch phase when the number of remaining tasks < H% of the available execution slots

Lin, Heshan, et al. MOON: MapReduce On Opportunistic eNvironments. International Symposium on High Performance Distributed Computing (HPDC 2010).
61. Towards Greener Data Center
• Sustainable computing infrastructures minimize the impact on the environment:
  • reduction of energy consumption
  • reduction of greenhouse gas emissions
  • better use of renewable energy (smart grids, co-location, etc.)

Figure: solar array at the Apple Maiden data center: 100 acres (400,000 m²), 20 MW peak.
62. Section 4
MapReduce beyond the Data Center
63. Introduction to Computational Desktop Grids
High!"#$$%&'%&()$(*$&+&,)-%&'./"'#0)1%*"-&
Throughput Computing
over2%"-/00%$-Idle
Large Sets of
Desktop Computers
! 34"54%"&$%-&2*#--)0(%-&'%&
()$(*$&'%&67-&2%0')01&&
$%*"&1%82-&'.#0)(1#9#15
• Volunteer Computing Systems
" :/$&'%&(;($%&<&5(/0/8#-%*"&
(SETI@Home, Distributed.net)
'.5(")0
• Open Source Desktop Grid
"
)22$#()1#/0-&2)")$$=$%-&'%&4")0'%&
(BOINC, XtremWeb, Condor,
1)#$$%
OurGrid)
" >"4)0#-)1#/0&'%&?(/0(/*"-@&'%&
2*#--)0(%
• Some numbers with BOINC
" A%-&*1#$#-)1%*"-&'/#9%01&2/*9/#"&
(October 21, 2013)
8%-*"%"&$%*"&(/01"#,*1#/0
• Active: 0.5M users, 1M
5(/0/8#%&'*&'/0
hosts.
• 24-hour average: 12
PetaFLOPS.
• EC2 equivalent : 1b$/year
64. MapReduce over the Internet
Computing on Volunteers' PCs (à la SETI@Home)
• Potential aggregate computing power and storage
  • average PC: 1 TFlops, 4GB RAM, 1TB disk
  • 1 billion PCs, 100 million GPUs + tablets + smartphones + set-top boxes
  • hardware and electrical power are paid for by the consumers
  • self-maintained by the volunteers
But ...
• high number of resources
• low performance
• volatility
• owned by volunteers
• scalable, but mainly for embarrassingly parallel applications with few I/O requirements
• the challenge is to broaden the application domain
66. Subsection 1
Background on BitDew
67. Large Scale Data Management
BitDew : a Programmable Environment for Large Scale Data
Management
Key Idea 1: provides an API and a runtime environment which integrates
several P2P technologies in a consistent way
Key Idea 2: relies on metadata (Data Attributes) to transparently drive data management operations: replication, fault tolerance, distribution, placement, life-cycle.

BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello. International Conference for High Performance Computing (SC'08), Austin, 2008.
68. BitDew : the Big Cloudy Picture
• Aggregates storage into a single Data Space:
  • clients put and get data from the data space
  • clients define data attributes
• Distinguishes service nodes (stable) from client and Worker nodes (volatile)
  • Service: ensures fault tolerance, indexing and scheduling of data to Worker nodes
  • Worker: stores data on Desktop PCs
• push/pull protocol between client → service ← Worker

Figure: a client puts/gets data (e.g., with REPLICA = 3) into the Data Space; Reservoir (Worker) nodes pull data from the service nodes.
70. Data Attributes
REPLICA: indicates how many occurrences of a datum should be available at the same time in the system
RESILIENCE: controls the resilience of data in the presence of machine crashes
LIFETIME: a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
AFFINITY: drives the movement of data according to dependency rules
TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
DISTRIBUTION: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node
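To give a flavour of attribute-driven data management, here is a toy Python mock of a scheduler acting on REPLICA and AFFINITY metadata. Note this is not the BitDew API; the class and function names are invented for illustration:

    class Attribute:
        def __init__(self, replica=1, resilient=False, affinity=None):
            self.replica = replica        # copies kept alive in the system
            self.resilient = resilient    # reschedule after a machine crash
            self.affinity = affinity      # place near this other datum

    def schedule(data, attrs, placement, workers):
        # Toy scheduler: honour REPLICA and AFFINITY for each datum.
        for d in data:
            a = attrs[d]
            if a.affinity and a.affinity in placement:
                placement[d] = list(placement[a.affinity])   # follow its datum
            else:
                placement[d] = workers[:a.replica]           # fan out copies
        return placement

    attrs = {"input1": Attribute(replica=3),
             "map_out1": Attribute(affinity="input1")}
    print(schedule(["input1", "map_out1"], attrs, {}, ["w1", "w2", "w3", "w4"]))
    # {'input1': ['w1', 'w2', 'w3'], 'map_out1': ['w1', 'w2', 'w3']}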
72. Anatomy of a BitDew Application
• Clients
  1. create Data and Attributes
  2. put a file into the created data slot
  3. schedule the data with their corresponding attribute
• Reservoirs
  • are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and Reservoirs can both create data and install callbacks

Figure: the client creates, puts and schedules data through the service nodes' Data Space (dr/dt, dc/ds); Reservoir nodes pull the data and receive onDataScheduled events.
74. Examples of BitDew Applications
• Data-Intensive Applications
  • DI-BOT: data-driven master-worker Arabic character recognition (M. Labidi, University of Sfax)
  • MapReduce (X. Shi, L. Lu, HUST, Wuhan, China)
  • framework for distributed result checking (M. Moca, G. S. Silaghi, Babes-Bolyai University of Cluj-Napoca, Romania)
• Data Management Utilities
  • file sharing for social networks (N. Kourtellis, Univ. Florida)
  • distributed checkpoint storage (F. Bouabache, Univ. Paris XI)
  • Grid data staging (IEP, Chinese Academy of Science)
75. Experiment Setup
Hardware Setup
• Experiments are conducted in a controlled environment (a cluster for the microbenchmarks and Grid5K for the scalability tests) as well as on an Internet platform
• GdX cluster: a 310-node cluster of dual-Opteron CPUs (AMD-64, 2GHz, 2GB SDRAM), with a 1Gb/s switched Ethernet network
• Grid5K: an experimental Grid of 4 clusters, including GdX
• DSL-lab: a low-power, low-noise cluster hosted by real Internet users on DSL broadband networks; 20 Mini-ITX nodes (Pentium-M 1GHz, 512MB SDRAM, 2GB Flash storage)
• Each experiment is averaged over 30 measures
77. Collective File Transfer with Replication
Scenario: all-to-all file transfer with replication
• {d1, d2, d3, d4} is created, where each datum has the REPLICA attribute set to 4
• the all-to-all collective is implemented with the AFFINITY attribute; each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1
• during the collective, file transfers are performed concurrently, and the data size is 5MB

Results: experiments on DSL-Lab
• one run every hour during approximately 12 hours
• Figure: Cumulative Distribution Function (CDF) of the measured bandwidth (roughly 150 to 450 KBytes/sec) for the all-to-all collective
78. Multi-sources and Multi-protocols File Distribution
Scenario
• 4 clusters of 50 nodes each, each cluster hosting its own data repository (DC/DR)
• first phase: a client uploads a large datum to the 4 data repositories
• second phase: the cluster nodes get the datum from the data repository of their own cluster
We vary:
1. the number of data repositories, from 1 to 4
2. the protocol used to distribute the data (BitTorrent vs FTP)

Figure: clusters C1-C4 with data repositories dr1/dc1 to dr4/dc4; the 1st phase uploads, the 2nd phase downloads.
79. Multi-sources and Multi-protocols File Distribution
Table: throughput (MB/s) on the orsay, nancy, bordeaux and rennes sites (and their mean) for BitTorrent and FTP, in the 1st phase (upload) and 2nd phase (download), with the 1DR/1DC, 2DR/2DC and 4DR/4DC configurations.

• for both protocols, increasing the number of data sources also increases the available bandwidth
• FTP performs better in the first phase, and BitTorrent gives superior performance in the second phase
• the best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second phase) takes 26.46 seconds and outperforms the best single-data-source, single-protocol (BitTorrent) configuration by 168.6%
80. Subsection 2
MapReduce for Desktop Grid Computing
81. Challenge of MapReduce over the Internet
(the usual MapReduce dataflow: data split → Mappers → intermediate results → Reducers → output files)

On Desktop Grids:
• there is no shared file system and no direct communication between hosts
• faults and host churn must be handled
• data replicas must be managed
• intermediate data need result certification
• collective operations are required (scatter + gather/reduction)

Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He and G. Fedak. In Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.
82. Implementing MapReduce over BitDew (1/2)
Initialisation by the Master
1. Data slicing: the Master creates a DataCollection DC = d1, ..., dn, a list of data chunks sharing the same MapInput attribute
2. It creates reducer tokens, each with its own TokenReducX attribute (affinity set to the Reducer)
3. It creates and pins collector tokens with the TokenCollector attribute (affinity set to the Collector Token)

Algorithm 2: Execution of Map tasks
Require: Map, the Map function to execute; M, the number of Mappers, and m, a single mapper; d_m = d_{1,m}, ..., d_{k,m}, the set of map input data received by worker m
  {on the Master node}
  create a single data MapToken with affinity set to DC
  {on the Worker node}
  if MapToken is scheduled then
    for all data d_{j,m} in d_m do
      execute Map(d_{j,m}); create list_{j,m}(k, v)
    end for
  end if

1. When a Worker receives the MapToken, it launches the corresponding MapTasks and computes the intermediate values list_{j,m}(k, v).
83. Implementing MapReduce over BitDew (2/2)

Shuffling intermediate results (Algorithm 3)
Require: M, the number of Mappers, and m, a single worker; R, the number of Reducers; list_m(k, v), the set of key/value pairs of intermediate results on worker m
  {on the Master node}
  for all r in [1, ..., R] do
    create Attribute ReduceAttr_r with distrib = 1
    create data ReduceToken_r
    schedule data ReduceToken_r with attribute ReduceTokenAttr
  end for
  {on the Worker node}
  split list_m(k, v) into if_{m,1}, ..., if_{m,r} intermediate files
  for all files if_{m,r} do
    create reduce input data ir_{m,r} and upload if_{m,r}
    schedule data ir_{m,r} with affinity = ReduceToken_r
  end for

1. The Worker partitions the intermediate results according to the partition function and the number of reducers, and creates ReduceInput data with the attribute ReduceToken_r. The ReduceTokenAttr carries the distrib=1 flag, which ensures a fair distribution of the workload between reducers. The Shuffle phase is then simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag equals the corresponding ReduceToken.

Execution of Reduce tasks (Algorithm 4)
1. When a reducer (the worker holding ReduceToken_r) receives ReduceInput files ir_{m,r}, it launches the corresponding ReduceTask Reduce(ir_{m,r}) on the (k, list(v)) pairs.
2. When all the ir_r have been processed, the reducer combines the results into a single final result o_r, created with affinity = MasterToken.
3. Eventually, the final results can be sent back to the Master: the Master creates a MasterToken data that is pinAndSchedule'd, i.e., known to the BitDew scheduler but never sent to any other node; data scheduled with an affinity tag set to MasterToken (here the o_r) are therefore sent to the Master, which can combine them into a single result.

Desktop Grid features
• The worker is multi-threaded (Mapper, Reducer and ReducePutter threads communicating through queues Q1, Q2 and Q3; see Figure 3), overlapping communication with computation and prefetching data: several concurrent file transfers can proceed even over a synchronous protocol such as HTTP, and Map or Reduce tasks are enqueued as soon as their transfers finish. The maximum numbers of concurrent Map and Reduce tasks can be configured, as well as the minimum number of queued tasks before computation starts. Prefetching is natural in MapReduce, since data are staged on the distributed file system before computations are launched.
• Collective file operations structure the processing, similarly to collective communications in parallel programming (MPI): the Distribution, i.e., the initial distribution of file chunks (similar to MPI_Scatter), and the Shuffle, i.e., the redistribution of intermediate results between the Mapper and Reducer nodes (similar to MPI_Alltoall).
84. Handling Communications
Latency Hiding
• Multi-threaded workers overlap communication and computation.
• The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations can start.
Barrier-free computation
• Reducers detect duplication of intermediate results (which happens because of faults and/or lags).
• Early reduction: IR are processed as they arrive → this allowed us to remove the barrier between Map and Reduce tasks (see the sketch below).
• But ... the IR are then not sorted.
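For an associative and commutative reduce function such as a sum, early reduction can be sketched as follows: intermediate results are folded in as they arrive, and duplicates (caused by faults or lags) are filtered by identifier, which is why the lack of sorting does not matter here (illustrative Python sketch):

    def early_reduce(incoming, combine=lambda a, b: a + b):
        # Fold intermediate results as they stream in; ignore duplicates.
        seen, totals = set(), {}
        for ir_id, key, value in incoming:
            if ir_id in seen:
                continue          # duplicate from a fault or a lagging mapper
            seen.add(ir_id)
            totals[key] = combine(totals[key], value) if key in totals else value
        return totals

    stream = [("m1-0", "the", 2), ("m2-0", "the", 1), ("m1-0", "the", 2)]
    print(early_reduce(stream))   # {'the': 3}; the duplicate m1-0 is ignored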
85. Scheduling and Fault-tolerance
2-level scheduling
1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
2. Workers periodically report the state of their ongoing computations to the MR-scheduler running on the master node.
3. The MR-scheduler determines whether there are more nodes available than tasks to execute, which can avoid the lagger effect.
Fault-tolerance
• In Desktop Grids, computing resources have high failure rates:
  → during the computation (execution of Map or Reduce tasks)
  → during the communication, i.e., file upload and download
• MapInput data and the ReduceToken tokens have the resilient attribute enabled.
• Because the intermediate results ir_{j,r}, j in [1, m], have their affinity set to ReduceToken_r, they are automatically re-downloaded by the node on which the ReduceToken data is rescheduled.
86. Data Security and Privacy
Distributed Result Checking
• Traditional DG or VC projects implement result checking on the server
• IR are too large to be sent back to the server → distributed result checking:
  • replicate the MapInput data and the Reducers
  • select correct results by majority voting: a reducer receives r_m versions of each intermediate file corresponding to a map input f_i; after receiving r_m/2 + 1 identical versions, it considers the result correct and accepts it as input for a Reduce task (failed results are ignored). To check the results produced by the reducers themselves, the master schedules the ReduceTokens with the attribute replica = r_r (r_r > 1).
• workers are assumed to sabotage independently of one another (colluding nodes are not considered)

Figure 2: dataflows in the MapReduce implementation: (a) no replication, (b) replication of the Map tasks (r_m = 3), (c) replication of both Map and Reduce tasks.

Ensuring Data Privacy
• use a hybrid infrastructure composed of private and public resources
• use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely
87. BitDew Collective Communication
Figure: execution time (s) of the Broadcast, Scatter and Gather collectives for a 2.7GB file on a varying number of nodes (1 to 128). The chunk size is 16MB, chosen so that there are enough chunks to measure the Scatter/Gather collectives.
88. MapReduce Evaluation
#&!!
0?6,@A?B@9:3C5D/4
#$!!
#!!!
'!!
(!!
&!!
$!!
!
&
'
a 2.7GB
Figure 6.
'
#(
%$
(&
*+,-./
#$'
$"(
"#$
Scalability evaluation on the WordCount application: the y axis presents
Figure: Scalability evaluationand the xWordCount application: the y from 1presents the
the throughput in MB/s on the axis the number of nodes varying axis to 512.
throughput in MB/s and the x axis the number of nodes varying from 1 to 512.
lective
measure
ks size
Table II
E VALUATION OF THE PERFORMANCE ACCORDING TO THE NUMBER OF M APPERS
AND R EDUCERS .
G. Fedak (INRIA, France)
#Mappers
4
8MapReduce
16
32
32
32
32
Salvador da Bahia, Brazil, 22/10/2013
80 / 84
89. Fault-Tolerance Scenario

Figure: fault-tolerance scenario (5 mappers, 2 reducers): crashes are injected during the download of a file (F1), during the execution of a map task (F2), and during the execution of the reduce task on the last intermediate result (F3).
90. Hadoop vs MapReduce/BitDew on Extreme Scenarios

Objective
• Evaluate MapReduce/BitDew against Hadoop under CPU heterogeneity, stragglers, host churn and network failures. Experiments run on Grid5K; heterogeneity is emulated with wrekavoc. GdX has dual-core nodes, so two workers run per node.

CPU heterogeneity and stragglers
• Heterogeneity is emulated by adjusting the CPU frequency of half of the nodes; a straggler is emulated by decreasing a node's CPU frequency to 10%. Both Hadoop and BitDew-MR launch backup tasks on idle machines when laggers slow down the group. With a growing number of stragglers, the BitDew-MR job makespan stays nearly constant, while Hadoop's grows and re-executes map tasks.

Host churn
• The MapReduce worker process is periodically killed on one node and a new one launched on another node. Hadoop fails whenever the churn interval is below 30 seconds: the JobTracker marks temporarily disconnected nodes as dead and blindly removes all the tasks completed on these nodes from the successful task list, so they must be re-executed, which significantly prolongs the job makespan. BitDew-MR is almost insensitive to churn (job makespan roughly between 357 and 457 seconds across churn intervals), because the intermediate outputs of completed map tasks are safely stored on the stable central storage.

Network fault tolerance
• Firewall and NAT rules temporarily turn down network links: a 25-second offline window is injected on worker nodes at different job progress points (12.5% to 100%), with the worker heartbeat timeout set to 30 seconds for both systems. Hadoop re-executes the tasks of the temporarily disconnected nodes; BitDew-MR workers can go temporarily offline without any performance penalty (no re-executed map tasks), since there is no inter-worker communication and all intermediate data are transferred through the resilient central storage.

• The overhead of transferring data to central storage makes this design most suitable for applications that generate few intermediate data, such as Word Count and Distributed Grep; to mitigate the transfer overhead, a storage cluster with large aggregated bandwidth, or emerging public cloud storage services, could be used.

• Related work: MOON (MapReduce On Opportunistic eNvironments) limits the system scale to a campus and assumes hybrid resources, provisioning a fixed fraction of dedicated stable computers to supplement the volatile personal computers.
92. Conclusion
• MapReduce Runtimes
  • an attractive programming model for data-intensive computing
  • many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling)
  • many others: security, QoS, iterative MapReduce, ...
• MapReduce beyond the Data Center:
  • enables large-scale data-intensive computing: Internet MapReduce
• Want to learn more?
  • home page for the Mini-Curso: http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
  • the MapReduce Workshop @ HPDC (http://graal.ens-lyon.fr/mapreduce)
  • Desktop Grid Computing, ed. C. Cerin and G. Fedak, CRC Press
  • our websites: http://www.bitdew.net and http://www.xtremweb-hep.org