1. 28th International Symposium on Distributed Computing (DISC 2014)
Austin, Texas, USA (12-15 October 2014)
Assignment of Different-Sized Inputs in MapReduce
Shantanu Sharma2
joint work with
Foto N. Afrati1, Shlomi Dolev2, Ephraim Korach2, and
Jeffrey D. Ullman3
1 National Technical University of Athens, Greece
2 Ben-Gurion University of the Negev, Israel
3 Stanford University, USA
2. Introduction
• Cluster Computing
– Terabytes or petabytes of data cannot be
processed on a single computer
– Cluster of computers
– How to mask failures, e.g., hardware failures
• MapReduce is a programming model used for
parallel processing over large-scale data
3. Introduction
MapReduce job: Map Phase and Reduce Phase
[Figure: MapReduce execution. A master process forks workers. Map-phase workers read the chunks of the input data (Chunk 0, Chunk 1, Chunk 2) and write their results locally; Reduce-phase workers read those results remotely, sort them, and write Output File 0 and Output File 1.]
• Map Phase: applies a user-defined Map function
• Reduce Phase: applies a user-defined Reduce function
4. MapReduce working example – Word Count
[Figure: word count with two mappers and six reducers, one reducer per word]
• Mapper 1 input: "I like apple. Apple is fruit."
• Mapper 1 output: (I, 1), (like, 1), (apple, 2), (is, 1), (fruit, 1)
• Mapper 2 input: "I like banana."
• Mapper 2 output: (I, 1), (like, 1), (banana, 1)
• Reducer outputs: (I, 2), (like, 2), (apple, 2), (is, 1), (fruit, 1), (banana, 1)
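The word-count flow above can be sketched in plain Python. This is a minimal single-process simulation under stated assumptions, not a Hadoop program: the function names `map_phase`, `shuffle`, and `reduce_phase`, the tokenizing regex, and the rule of lowercasing every word except "I" are all illustrative choices that do not come from the slides.

```python
import re
from collections import Counter, defaultdict

def map_phase(chunk):
    """Emit (word, count) pairs for one input chunk, counting locally."""
    words = re.findall(r"[A-Za-z]+", chunk)
    # Assumption: keep "I" as-is and lowercase every other word.
    normalized = [w if w == "I" else w.lower() for w in words]
    return list(Counter(normalized).items())

def shuffle(mapper_outputs):
    """Group the values of all mappers by key (one group per reducer)."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts received for each key."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["I like apple. Apple is fruit.", "I like banana."]
result = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(result)
```

Each entry of `groups` plays the role of one "Reducer for <word>" box in the figure.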
5. Inputs and outputs in our context
[Figure: the word-count example of slide 4, annotated with two labels. "Inputs" marks the key-value pairs that the mappers provide to the reducers; "Outputs" marks the pairs that the reducers produce.]
6. Reducer Capacity
• Values, provided by each mapper, have some sizes
(input size)
• Reducer capacity: an upper bound on the sum of the
sizes of the values that are assigned to the reducer
• Example: the reducer capacity may be the size of the main
memory of the processor on which the reducer runs
We consider two special matching problems
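The capacity constraint can be illustrated with a small check. This is only a sketch: the helper names `reducer_load` and `fits`, the input sizes, and the capacity value `q = 8` are hypothetical and chosen for illustration.

```python
def reducer_load(sizes, assigned):
    """Sum of the sizes of the inputs assigned to one reducer."""
    return sum(sizes[i] for i in assigned)

def fits(sizes, assigned, q):
    """True if the assigned inputs respect the reducer capacity q."""
    return reducer_load(sizes, assigned) <= q

sizes = {"a": 4, "b": 3, "c": 5}   # hypothetical input sizes
q = 8                              # hypothetical reducer capacity

print(fits(sizes, {"a", "b"}, q))  # load 4 + 3 = 7, within capacity
print(fits(sizes, {"a", "c"}, q))  # load 4 + 5 = 9, exceeds capacity
```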
7. State-of-the-Art
• F. Afrati, A.D. Sarma, S. Salihoglu, and J.D. Ullman,
“Upper and Lower Bounds on the Cost of a Map-Reduce
Computation,” PVLDB, 2013.
• Unit input size
• Reducer Size
– Maximum number of inputs that a given reducer
can have.
8. Problem Statement
• Communication cost between the map and the
reduce phases is a significant factor
• How can we reduce the communication cost?
– Fewer reducers, and hence a smaller
communication cost
– How to minimize the total number of reducers
while respecting their limited capacity?
• Not an easy task
– All-to-All mapping schema problem
– X-to-Y mapping schema problem
[Figure: two ways to assign three inputs, one mapper per input, so that every pair of inputs meets at a reducer. Notation: ki is a key.]
• Three reducers
– Mapper outputs: input1 with keys k1 and k2; input2 with keys k1 and k3; input3 with keys k2 and k3
– Reducer for k1 receives inputs (1, 2); reducer for k2 receives (1, 3); reducer for k3 receives (2, 3)
• One reducer
– Mapper outputs: input1, input2, and input3, each with key k1
– Reducer for k1 receives inputs (1, 2, 3)
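The two assignments in the figure can be compared numerically. The sketch below assumes that communication cost is measured as the total size of all input copies sent from the map phase to the reducers, and that every input has unit size; both are assumptions made for illustration, and `communication_cost` is a hypothetical helper name.

```python
def communication_cost(sizes, schema):
    """Total size of all input copies sent to the reducers."""
    return sum(sizes[i] for reducer in schema for i in reducer)

sizes = {1: 1, 2: 1, 3: 1}                 # assumed unit-size inputs

three_reducers = [{1, 2}, {1, 3}, {2, 3}]  # each input is sent twice
one_reducer = [{1, 2, 3}]                  # each input is sent once

print(communication_cost(sizes, three_reducers))
print(communication_cost(sizes, one_reducer))
```

Under these assumptions the single-reducer assignment sends half as much data, but it is only possible when the reducer capacity can hold all three inputs together.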
9. A2A Mapping Schema Problem
• A set of inputs is given
• Each pair of inputs corresponds to one output
• Example
– Computing common friends
• Lists of friends of m persons are given
• Find the common friends of every pair among the given m persons
• Every two friend lists must be assigned to at least one
common reducer
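A minimal sketch of the common-friends example, computed directly rather than with MapReduce. The person names, friend lists, and the function name `common_friends` are hypothetical; the point is that there is one output per pair of friend lists, so both lists of a pair must be available in the same place.

```python
from itertools import combinations

# Hypothetical friend lists for m = 3 persons (not from the slides).
friend_lists = {
    "p1": {"ann", "bob", "cat"},
    "p2": {"bob", "cat", "dan"},
    "p3": {"cat", "eve"},
}

def common_friends(friend_lists):
    """Return one output per pair of persons: the intersection of their lists."""
    return {
        (a, b): friend_lists[a] & friend_lists[b]
        for a, b in combinations(sorted(friend_lists), 2)
    }

outputs = common_friends(friend_lists)
for pair, friends in outputs.items():
    print(pair, sorted(friends))
```

For m persons this produces m(m-1)/2 outputs, which is why every pair of inputs has to meet in at least one reducer.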
10. A2A Mapping Schema Problem
[Figure: four mappers, one per friend, and three reducers. Notation: ki is a key; fli is the ith friend list.]
• Reducer capacity is enough to hold some of the friend lists together
• Required pairs of friend lists: (1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)
• Mapper outputs: fl1 with keys k1 and k2; fl2 with keys k1 and k2; fl3 with keys k1 and k3; fl4 with keys k2 and k3
• Reducer for k1 receives (1, 2, 3); reducer for k2 receives (1, 2, 4); reducer for k3 receives (3, 4)
11. A2A Mapping Schema Problem
[Figure: four mappers, one per friend, and a single reducer. Notation: ki is a key; fli is the ith friend list.]
• Reducer capacity is enough to hold all the friend lists together
• Required pairs of friend lists: (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)
• Mapper outputs: fl1, fl2, fl3, and fl4, each with key k1
• Reducer for k1 receives (1, 2, 3, 4)
12. A2A Mapping Schema Problem
• What to do?
– Assign the given m inputs to a given number of
reducers, without exceeding the reducer capacity q, such
that every input is paired with every other input in
at least one common reducer
• Polynomial-time solution for one and two
reducers
• NP-hard for z > 2 reducers
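The two requirements of an A2A mapping schema (capacity and pair coverage) can be verified mechanically. The checker below is a sketch; the name `is_valid_a2a`, the representation of a schema as a list of sets, and the unit input sizes are assumptions made for illustration. The example schema is the three-reducer assignment of the four friend lists shown earlier.

```python
from itertools import combinations

def is_valid_a2a(sizes, schema, q):
    """Check capacity and all-pairs coverage of a candidate schema.

    sizes:  dict mapping input id -> size
    schema: list of sets, one set of input ids per reducer
    q:      reducer capacity
    """
    # 1. No reducer may receive more than q in total size.
    for reducer in schema:
        if sum(sizes[i] for i in reducer) > q:
            return False
    # 2. Every pair of inputs must share at least one reducer.
    covered = set()
    for reducer in schema:
        covered.update(combinations(sorted(reducer), 2))
    required = set(combinations(sorted(sizes), 2))
    return required <= covered

sizes = {1: 1, 2: 1, 3: 1, 4: 1}          # assumed unit-size friend lists
schema = [{1, 2, 3}, {1, 2, 4}, {3, 4}]   # reducers k1, k2, k3

print(is_valid_a2a(sizes, schema, q=3))
print(is_valid_a2a(sizes, schema, q=2))
print(is_valid_a2a(sizes, [{1, 2, 3}, {1, 2, 4}], q=3))
```

The one-reducer case reduces to checking whether the total size of all inputs is at most q, which the same function covers when `schema` holds a single set.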
13. Heuristics for A2A Mapping Schema Problem
• Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing
(BFD) bin-packing algorithm
– Pseudo-polynomial bin-packing algorithm*
– 2-step Algorithms
– The selection of a prime number p
• A fixed reducer capacity is given
*D. R. Karger and J. Scott. Efficient algorithms for fixed-precision instances of bin
packing and Euclidean TSP. In APPROX-RANDOM, pages 104–117, 2008.
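The slides do not spell out the steps of the bin-packing-based heuristic, so the following is only one plausible sketch of how First-Fit Decreasing could be used for the A2A problem. It assumes every input is at most q/2 in size, packs the inputs into bins of capacity q/2, and assigns each pair of bins to one reducer. The function names and the example sizes are hypothetical.

```python
from itertools import combinations

def first_fit_decreasing(sizes, bin_capacity):
    """Pack items into bins with First-Fit Decreasing; return lists of ids."""
    bins = []  # each entry is [remaining capacity, [item ids]]
    for item in sorted(sizes, key=sizes.get, reverse=True):
        if sizes[item] > bin_capacity:
            raise ValueError(f"input {item!r} does not fit in a bin")
        for b in bins:
            if sizes[item] <= b[0]:      # first bin with enough room
                b[0] -= sizes[item]
                b[1].append(item)
                break
        else:                            # no bin had room: open a new one
            bins.append([bin_capacity - sizes[item], [item]])
    return [ids for _, ids in bins]

def a2a_schema(sizes, q):
    """Assumed heuristic: bins of capacity q/2, one reducer per pair of bins."""
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [set(bins[0])]
    return [set(a) | set(b) for a, b in combinations(bins, 2)]

sizes = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 1, "f": 1}  # hypothetical
schema = a2a_schema(sizes, q=8)
for reducer in schema:
    print(sorted(reducer), sum(sizes[i] for i in reducer))
```

Two bins of capacity q/2 always fit together in a reducer of capacity q, and any two inputs lie either in the same bin or in two bins that share a reducer, so every pair is covered.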
14. X2Y Mapping Schema Problem
• Two disjoint sets X and Y are given
• Each pair of elements ⟨xi, yj⟩ (where xi ∈ X, yj ∈
Y, ∀ i, j) of the sets X and Y corresponds to
one output
• Example
– Skew Join
• Two relations X(A, B) and Y(B, C) are given, where many
tuples have a common “b” value
• Every X tuple and every Y tuple with an identical “b” value
must be assigned to at least one common reducer
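The skew join example above can be made concrete with a tiny in-memory join. The tuples and the function name `join` are hypothetical; the sketch only illustrates that every X tuple paired with every Y tuple on the shared "b" value yields one output.

```python
# Hypothetical relations in which all tuples share the same "b" value.
X = [("a1", "b"), ("a2", "b"), ("a3", "b")]   # tuples of X(A, B)
Y = [("b", "c1"), ("b", "c2")]                # tuples of Y(B, C)

def join(X, Y):
    """Natural join of X(A, B) and Y(B, C) on attribute B."""
    return [(a, b, c) for a, b in X for b2, c in Y if b == b2]

result = join(X, Y)
print(len(result))   # one output per (X tuple, Y tuple) pair
print(result[:2])
```

With a heavily repeated "b" value, the number of outputs is |X| times |Y|, so a single reducer may not have the capacity to hold all the tuples involved.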
15. X2Y Mapping Schema Problem
• What to do?
– Assign each input of the set X together with each input
of the set Y to at least one common reducer,
without exceeding the reducer capacity q
• Polynomial-time solution for one reducer
– Can we assign all the inputs of the sets X and Y to
a single reducer?
• NP-hard for z > 1 reducers
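A sketch of the corresponding checks for the X2Y problem. The names `fits_one_reducer` and `is_valid_x2y`, the representation of a reducer as a pair (X ids, Y ids), and the example sizes are assumptions made for illustration.

```python
def fits_one_reducer(x_sizes, y_sizes, q):
    """One-reducer case: do all inputs of X and Y fit within capacity q?"""
    return sum(x_sizes.values()) + sum(y_sizes.values()) <= q

def is_valid_x2y(x_sizes, y_sizes, schema, q):
    """schema is a list of (x_ids, y_ids) pairs, one pair per reducer."""
    # 1. Capacity: the total size at each reducer must not exceed q.
    for xs, ys in schema:
        load = sum(x_sizes[x] for x in xs) + sum(y_sizes[y] for y in ys)
        if load > q:
            return False
    # 2. Coverage: every (x, y) pair must meet in at least one reducer.
    covered = {(x, y) for xs, ys in schema for x in xs for y in ys}
    required = {(x, y) for x in x_sizes for y in y_sizes}
    return required <= covered

x_sizes = {"x1": 2, "x2": 2}   # hypothetical sizes
y_sizes = {"y1": 1, "y2": 1}
schema = [({"x1"}, {"y1", "y2"}), ({"x2"}, {"y1", "y2"})]

print(fits_one_reducer(x_sizes, y_sizes, q=4))
print(is_valid_x2y(x_sizes, y_sizes, schema, q=4))
```

Unlike the A2A case, two inputs from the same set (for example x1 and x2) never need to share a reducer.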
16. Heuristics for X2Y Mapping Schema Problem
• Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing
(BFD) bin-packing algorithm
• A fixed reducer capacity is given
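As with A2A, the slides do not give the steps of the heuristic, so this is one plausible sketch of a First-Fit Decreasing approach for X2Y. It assumes the capacity q is split evenly: inputs of X are packed into bins of capacity q/2, inputs of Y into bins of capacity q/2, and each (X bin, Y bin) pair is assigned to one reducer. The even split, the function names, and the example sizes are all assumptions.

```python
def first_fit_decreasing(sizes, bin_capacity):
    """Pack items into bins with First-Fit Decreasing; return lists of ids."""
    bins = []  # each entry is [remaining capacity, [item ids]]
    for item in sorted(sizes, key=sizes.get, reverse=True):
        if sizes[item] > bin_capacity:
            raise ValueError(f"input {item!r} does not fit in a bin")
        for b in bins:
            if sizes[item] <= b[0]:      # first bin with enough room
                b[0] -= sizes[item]
                b[1].append(item)
                break
        else:                            # no bin had room: open a new one
            bins.append([bin_capacity - sizes[item], [item]])
    return [ids for _, ids in bins]

def x2y_schema(x_sizes, y_sizes, q):
    """Assumed heuristic: one reducer per (X bin, Y bin) pair."""
    x_bins = first_fit_decreasing(x_sizes, q / 2)
    y_bins = first_fit_decreasing(y_sizes, q / 2)
    return [(set(xb), set(yb)) for xb in x_bins for yb in y_bins]

x_sizes = {"x1": 2, "x2": 2, "x3": 1}   # hypothetical sizes
y_sizes = {"y1": 3, "y2": 1, "y3": 2}
schema = x2y_schema(x_sizes, y_sizes, q=6)
for xs, ys in schema:
    print(sorted(xs), sorted(ys))
```

Every x lies in some X bin and every y in some Y bin, and each pair of such bins shares a reducer, so all (x, y) pairs are covered without exceeding q.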
17. Conclusion
• Reducer capacity
– An important parameter to be considered in all MapReduce
algorithms
– The capacity is expressed in terms of memory size (not necessarily
identical), with auxiliary size augmented and added to the index of
the data item(s)
• Two assignment schemas of MapReduce are given
– All-to-All (A2A) mapping schema problem
– X-to-Y (X2Y) mapping schema problem
• Several heuristics for A2A and X2Y mapping schema
problems are provided
18. Presentation is available at
http://www.cs.bgu.ac.il/~sharmas/publication.html
Foto Afrati1, Shlomi Dolev2, Ephraim Korach3,
Shantanu Sharma2, and Jeffrey D. Ullman4
1 School of Electrical and Computing Engineering, National Technical
University of Athens, Greece
afrati@softlab.ece.ntua.gr
2 Department of Computer Science, Ben-Gurion University of the
Negev, Israel
{dolev,sharmas}@cs.bgu.ac.il
3 Department of Industrial Engineering and Management, Ben-Gurion
University of the Negev, Israel
korach@bgu.ac.il
4 Department of Computer Science, Stanford University, USA
ullman@cs.stanford.edu