2. Map Reduce
A paradigm for divide and conquer
The input list is divided into n splits, processed by m blades
Aggregation is performed on the results
Close integration between middleware and infrastructure
Software and hardware interact to achieve a very specific target
-> Diverges from the "service-oriented" paradigm, where the infrastructure is abstracted away from the program logic
3. Map Reduce
[Figure: two side-by-side stacks, each with the layers Program / Data management / INFRASTRUCTURE, contrasting layered separation with MapReduce's integration of the layers]
4. Map Reduce
Programming model
Based on functional programming
Map : λx . x²
Reduce : λx . λy . x+y
-> Σᵢ xᵢ² (the sum of the squares of the inputs)
A number of algorithms can be implemented as a sequence of map/reduce steps.
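A minimal sketch of this composition in Python, using the built-in map and functools.reduce (the sample input is illustrative):

from functools import reduce

# Map : λx . x² (square each element independently)
# Reduce : λx . λy . x + y (fold the squares into a single sum)
xs = [2, 4, 3, 2, 5, 3]
squares = map(lambda x: x * x, xs)
total = reduce(lambda x, y: x + y, squares)  # Σᵢ xᵢ² = 67
print(total)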
5. Map Reduce
Programming model
Input: 2 4 3 2 5 3
Map -> intermediate values
<2,4> <4,16> <3,9> <2,4> <5,25> <3,9>
Reduce -> final results
<2,8> <3,18> <4,16> <5,25>
The Reduce phase cannot begin until the Map phase has finished
-> No streaming
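An end-to-end sketch of this example in Python, with an explicit shuffle step between map and reduce (the function names are illustrative):

from collections import defaultdict

def map_phase(numbers):
    # Emit <x, x²> for every input number
    return [(x, x * x) for x in numbers]

def shuffle(pairs):
    # Group the intermediate values by key; reduce cannot start
    # until all of the map output has been collected
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the values of each key
    return {key: sum(values) for key, values in sorted(groups.items())}

intermediate = map_phase([2, 4, 3, 2, 5, 3])
# [(2, 4), (4, 16), (3, 9), (2, 4), (5, 25), (3, 9)]
print(reduce_phase(shuffle(intermediate)))
# {2: 8, 3: 18, 4: 16, 5: 25}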
7. Map Reduce
REDUCE :
The data output by map is sorted and grouped by key:
Map output : <k22,v22> <k21,v21> <k21,v20>
Sorted : <k21,v21> <k21,v20> <k22,v22>
Grouped : <k21,[v21,v20]> <k22,v22>
The reducer iterates over this list and combines the values for each key.
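A small sketch of this sort-then-group step using Python's itertools.groupby (the key/value names mirror the slide):

from itertools import groupby
from operator import itemgetter

map_output = [("k22", "v22"), ("k21", "v21"), ("k21", "v20")]

# Sorting makes equal keys adjacent ...
sorted_output = sorted(map_output, key=itemgetter(0))

# ... so grouping adjacent pairs yields <key, [values]> for the reducer
for key, group in groupby(sorted_output, key=itemgetter(0)):
    print(key, [value for _, value in group])
# k21 ['v21', 'v20']
# k22 ['v22']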
8. Map Reduce
Map functions run in parallel
They transform independent input data into independent intermediate data
Reduce functions run in parallel
They aggregate the values of independent output keys
No data is shared between tasks (a sketch follows below)
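A sketch of this shared-nothing parallelism with Python's multiprocessing.Pool (the pool size and helper names are illustrative assumptions):

from multiprocessing import Pool
from collections import defaultdict

def square(x):
    # Map task: depends only on its own input element
    return (x, x * x)

def sum_values(item):
    # Reduce task: aggregates one independent key
    key, values = item
    return (key, sum(values))

if __name__ == "__main__":
    with Pool(4) as pool:
        intermediate = pool.map(square, [2, 4, 3, 2, 5, 3])
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    with Pool(4) as pool:
        results = pool.map(sum_values, list(groups.items()))
    print(dict(results))  # {2: 8, 4: 16, 3: 18, 5: 25}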
9. Map Reduce
map (key, file-contents):
    for each number in file-contents:
        emit (number, number²)
reduce (key, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (key, sum)
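A runnable Python rendering of this pseudocode, assuming the map input value is a whitespace-separated string of numbers (the input format is an assumption, not stated on the slide):

def map_fn(key, file_contents):
    # key could be e.g. the file name; it is unused here
    for token in file_contents.split():
        number = int(token)
        yield (number, number * number)

def reduce_fn(key, values):
    total = 0
    for value in values:
        total = total + value
    yield (key, total)

pairs = list(map_fn("input.txt", "2 4 3 2 5 3"))
# [(2, 4), (4, 16), (3, 9), (2, 4), (5, 25), (3, 9)]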
11. Map Reduce
PARTITIONER :
• We use multiple reducers
• After shuffling, records with the same key (the icons of the same shape in the figure) end up at the same reducer
• The partitioner decides which key goes to which reducer
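A minimal sketch of hash-style partitioning in Python, in the spirit of Hadoop's default HashPartitioner (the function name is illustrative):

def partition(key, num_reducers):
    # Every occurrence of a key maps to the same reducer,
    # so all of its values are aggregated in one place
    return hash(key) % num_reducers

for key in [2, 4, 3, 2, 5, 3]:
    print(key, "-> reducer", partition(key, 2))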
13. Map Reduce
COMBINER :
• An intermediate step between Map and Reduce
• Runs on the mapper nodes
• Saves bandwidth before sending data to the reducers
• The combining operation must be associative:
(a op b) op c = a op (b op c)
and commutative:
(a op b) = (b op a)
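A sketch of a combiner for the sum-of-squares example: addition is associative and commutative, so pre-aggregating on the mapper is safe (the names are illustrative):

from collections import defaultdict

def combine(map_output):
    # Pre-aggregate locally on the mapper node so that fewer
    # pairs cross the network to the reducers
    partial = defaultdict(int)
    for key, value in map_output:
        partial[key] += value  # op = '+', associative and commutative
    return list(partial.items())

map_output = [(2, 4), (4, 16), (3, 9), (2, 4), (5, 25), (3, 9)]
print(combine(map_output))  # [(2, 8), (4, 16), (3, 18), (5, 25)]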
15. Map Reduce
The data presented to reduce() is sorted by key on each node
-> The output of the sort is written to a file, so reduce() can read the file
sequentially and do its processing
The sorting is actually performed during the map phase and merged, during the
shuffle phase, on the reduce node
+ Fair distribution of the processing: the optimisation is done in the middleware
- No clear separation of responsibilities: capacity planning is more difficult
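A sketch of that merge step: each mapper ships its output pre-sorted by key, and the reduce node only performs a streaming merge of the sorted runs (Python's heapq.merge stands in for the middleware's merge; the data is illustrative):

import heapq

# Each mapper's output arrives already sorted by key
run_from_mapper_1 = [(2, 4), (3, 9), (5, 25)]
run_from_mapper_2 = [(2, 4), (3, 9), (4, 16)]

# The reduce node merges the sorted runs sequentially,
# holding only one record per run in memory at a time
merged = heapq.merge(run_from_mapper_1, run_from_mapper_2)
print(list(merged))
# [(2, 4), (2, 4), (3, 9), (3, 9), (4, 16), (5, 25)]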