1. A hands-on introduction
to scientific data analysis
with Hadoop
A matrix computations perspective
DAVID F. GLEICH, PURDUE UNIVERSITY
ICME MAPREDUCE WORKSHOP @ STANFORD
David Gleich · Purdue
MRWorkshop
2. Who is this for?
workshop project groups
those curious about
“MapReduce” and “Hadoop”
those who think about
problems as matrices
3. What should you get out of it?
1. understand some problems that
MapReduce solves effectively.
2. techniques to solve them using
Hadoop and dumbo
3. learn some Hadoop words
4. What you won’t learn …
latest and greatest in
MapReduce algorithms
how to improve the
performance of your Hadoop job
how to write wordcount
in Hadoop
5. Slides will be online soon.
Code samples and short tutorials at
github.com/dgleich/mrmatrix
6. 1. HPC vs. Data (redux)
2. MapReduce vs. Hadoop
3. Dive into Hadoop with
Hadoop streaming
4. Sparse matrix methods
with Hadoop
10. MapReduce is designed to
solve a different set of problems
11. Supercomputer, data computing cluster, engineer
Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.
13. The MapReduce
programming model
Input a list of (key, value) pairs
Map apply a function f to all pairs
Reduce apply a function g to
all values with key k (for all k)
Output a list of (key, value) pairs
14. The MapReduce
programming model
Input a list of (key, value) pairs
Map apply a function f to all pairs
Reduce apply a function g to
all values with key k (for all k)
Output a list of (key, value) pairs
Map function f must be side-effect free
Reduce function g must be side-effect free
15. The MapReduce
programming model
Input a list of (key, value) pairs
Map apply a function f to all pairs
Reduce apply a function g to
all values with key k (for all k)
Output a list of (key, value) pairs
All map functions can be done in parallel
All reduce functions (for key k) can be done
in parallel
16. The MapReduce
programming model
Input a list of (key, value) pairs
Map apply a function f to all pairs
Reduce apply a function g to
all values with key k (for all k)
Output a list of (key, value) pairs
Shuffle group all pairs with key k together
(sorting suffices)
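The model is small enough to mimic in plain Python. The sketch below is a toy single-machine driver (nothing Hadoop-specific), with the shuffle done by sorting, as the slide notes:

```python
from itertools import groupby

def mapreduce(pairs, f, g):
    """Toy single-machine MapReduce: map, shuffle (sort + group), reduce."""
    mapped = [out for kv in pairs for out in f(*kv)]   # Map: apply f to every pair
    mapped.sort(key=lambda kv: kv[0])                  # Shuffle: sorting suffices
    return [out
            for k, grp in groupby(mapped, key=lambda kv: kv[0])
            for out in g(k, [v for _, v in grp])]      # Reduce: apply g per key

# example: sum values per key
f = lambda k, v: [(k, v)]
g = lambda k, vals: [(k, sum(vals))]
print(mapreduce([('a', 1), ('b', 2), ('a', 3)], f, g))  # [('a', 4), ('b', 2)]
```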
17. Mesh point variance in MapReduce
[Figure: three simulation runs (Run 1, Run 2, Run 3), each with snapshots at T=1, T=2, T=3.]
18. Mesh point variance in MapReduce
[Figure: the snapshots from each run (T=1, T=2, T=3) feed a mapper M; the shuffle routes each mesh point to a reducer R.]
1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
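In dumbo style, the variance step might look like the sketch below. The record layout (one line of text per sample: run, time step, mesh point id, value) is an assumption for illustration, not the workshop's actual data format; hook it up with dumbo.run(mapper, reducer) as in the dumbo example later in the deck.

```python
# Sketch only: assumes each input line reads "run timestep meshpoint value".
def mapper(key, value):
    run, t, point, val = value.split()
    yield (point, t), float(val)     # same mesh point and time step -> same key

def reducer(key, values):
    vals = list(values)
    mean = sum(vals) / len(vals)
    # population variance across runs for this mesh point and time step
    yield key, sum((v - mean) ** 2 for v in vals) / len(vals)
```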
19. MapReduce vs. Hadoop.
MapReduce
A computation model with:
Map, a local data transform
Shuffle, a grouping function
Reduce, an aggregation
Hadoop
An implementation of MapReduce using the HDFS parallel file-system.
Others
Phoenix++, Twister, Google MapReduce, Spark, …
20. Why so many limitations?
21. Data scalability
[Figure: map tasks M1 through M5 run where the data lives; a shuffle routes their output to reducers R.]
The idea
Bring the computations to the data.
MR can schedule map functions without moving data.
22. Mesh point variance in MapReduce
[Figure: the snapshots from each run (T=1, T=2, T=3) feed a mapper M; the shuffle routes each mesh point to a reducer R.]
1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
Bring the computations to the data!
23. heartbreak on node rs252
After waiting in the queue for a month and
after 24 hours of finding eigenvalues, one node randomly hiccups.
24. Fault tolerant
[Figure: mappers M feed reducers R through the shuffle.]
Input stored in triplicate.
Map output persisted to disk before shuffle.
Reduce input/output on disk.
Redundant input helps make maps data-local.
Just one type of communication: shuffle.
25. Fault injection
[Figure: time to completion (sec) vs. 1/Prob(failure), the mean number of successes per failure, for a 200M-by-200 and an 800M-by-10 matrix, each with and without injected faults.]
With 1/5 of tasks failing, the job only takes twice as long.
27. Tools I like
hadoop streaming
dumbo
mrjob
hadoopy
C++
28. Tools I don’t use but other
people seem to like …
pig
java
hbase
Eclipse
Cassandra
29. hadoop streaming
the map function is a program:
(key,value) pairs are sent via stdin;
output (key,value) pairs go to stdout
the reduce function is a program:
(key,value) pairs are sent via stdin;
keys are grouped;
output (key,value) pairs go to stdout
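Concretely, a streaming map program for row sums can be an ordinary script; the file name and the line-number key below are made up for illustration:

```python
#!/usr/bin/env python
# rowsum_map.py (hypothetical name): a Hadoop streaming mapper is just a
# program that reads lines from stdin and writes "key<tab>value" to stdout.
import sys

def run(infile=sys.stdin, outfile=sys.stdout):
    for lineno, line in enumerate(infile):
        vals = [float(v) for v in line.split()]
        outfile.write('%d\t%s\n' % (lineno, sum(vals)))

if __name__ == '__main__':
    run()
```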
30. dumbo
a wrapper around hadoop streaming for
map and reduce functions in python
#!/usr/bin/env dumbo
def mapper(key,value):
    """ Each record is a line of text.
    key=<byte that the line starts in the file>
    value=<line of text>
    """
    valarray = [float(v) for v in value.split()]
    yield key, sum(valarray)

if __name__=='__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper,dumbo.lib.identityreducer)
31. Synthetic data test
100,000,000-by-500 matrix (~500 GB)
Computing the R in a QR factorization.
(New matrix, 500 GB too.)

How can Hadoop streaming possibly be fast?
Codes implemented in MapReduce streaming.
Matrix stored as TypedBytes lists of doubles.
Python frameworks use Numpy+Atlas.
Custom C++ TypedBytes reader/writer with Atlas.
The Java implementation is non-streaming.

          Iter 1      Iter 1         Iter 2         Overall
          QR (secs.)  Total (secs.)  Total (secs.)  Total (secs.)
Dumbo     67725       960            217            1177
Hadoopy   70909       612            118            730
C++       15809       350            37             387
Java      n/a         436            66             502

C++ in streaming beats a native Java implementation.
All timing results from the Hadoop job tracker.
David Gleich (Sandia), MapReduce 2011
32. Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!
33. How does Hadoop know
key = <byte offset in file>,
value = <line of text>?
InputFormat
Maps a file on HDFS to (key,value) pairs
TextInputFormat
Maps a text file to (<byte offset>, <line>) pairs
34. The Hadoop Distributed File System (HDFS)
and a big text file
HDFS stores files in 64MB chunks
Each chunk is a FileSplit
FileSplits are stored in parallel
An InputFormat converts FileSplits
into a sequence of key-value records.
FileSplits can cross record borders
(a small bit of communication).
35. Tall-and-skinny matrix
storage in MapReduce
A : m x n, m ≫ n
[Figure: A split into stacked submatrices A1, A2, A3, A4.]
Key is an arbitrary row-id.
Value is the 1 x n array for a row.
Each submatrix Ai is an InputSplit (the input to a map task).
36. hadoop vs. MPI
hadoop:
output row-sum for all local rows
MPI:
parallel load
for my-batch-of-rows, compute row-sum
parallel save
37. Isn’t reading and writing text
files rather inefficient?
38. Sequence Files and OutputFormat
SequenceFile
An internal Hadoop file format to store (key, value) pairs efficiently. Used between map and reduce steps.
OutputFormat
Maps (key, value) pairs to output on disk
TextOutputFormat
Maps (key, value) pairs to key<tab>value strings
39. typedbytes
A simple binary serialization scheme.
[<1-byte-type-flag> <binary-value>]*
Roughly equivalent to JSON
(Optionally) used to communicate to and
from Hadoop streaming.
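The flag-then-value pattern is easy to sketch. The fragment below handles just two of the typedbytes type flags (3 = 32-bit int, 6 = double) and is a toy round-trip, not the library's implementation:

```python
import struct

# Toy typedbytes-style serialization: each value is a 1-byte type flag
# followed by its big-endian binary encoding (3 = 32-bit int, 6 = double).
def write_tb(value):
    if isinstance(value, int):
        return struct.pack('>bi', 3, value)
    if isinstance(value, float):
        return struct.pack('>bd', 6, value)
    raise TypeError(type(value))

def read_tb(buf, pos=0):
    flag = buf[pos]
    if flag == 3:
        return struct.unpack_from('>i', buf, pos + 1)[0], pos + 5
    if flag == 6:
        return struct.unpack_from('>d', buf, pos + 1)[0], pos + 9
    raise ValueError(flag)

data = write_tb(42) + write_tb(3.5)
v1, pos = read_tb(data)
v2, _ = read_tb(data, pos)
# v1 == 42, v2 == 3.5
```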
40. typedbytes example
def _read(self):
    t = unpack_type(self.file.read(1))[0]
    self.t = t
    return self.handler_table[t](self)

def read_vector(self):
    r = self._read
    count = unpack_int(self.file.read(4))[0]
    return tuple(r() for i in xrange(count))
42. Column sums in dumbo
#!/usr/bin/env dumbo
def mapper(key,value):
    """ Each record is a line of text. """
    valarray = [float(v) for v in value.split()]
    for col,val in enumerate(valarray):
        yield col, val

def reducer(col,values):
    yield col, sum(values)

if __name__=='__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper,reducer)
43. Isn’t this just moving the data
to the computation?
Yes. It seems much worse than MPI.
MPI:
parallel load
for my-batch-of-rows, update sum of each column
parallel reduce partial column sums
parallel save
44. The MapReduce
programming model
Input a list of (key, value) pairs
Map apply a function f to all pairs
Combine apply g to local values with key k
Shuffle group all pairs with key k together
Reduce apply a function g to
all values with key k
Output a list of (key, value) pairs
45. Column sums in dumbo
#!/usr/bin/env dumbo
def mapper(key,value):
    """ Each record is a line of text. """
    valarray = [float(v) for v in value.split()]
    for col,val in enumerate(valarray):
        yield col, val

def reducer(col,values):
    yield col, sum(values)

if __name__=='__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper,reducer,combiner=reducer)
46. How many mappers and
reducers?
The number of maps is the number of
InputSplits.
You choose how many reducers.
Each reducer outputs to a separate file.
47. Demo 3
Column sums with multiple
reducers
48. Which reducer does my key
go to?
Partitioner
Maps a given key to a reducer
HashPartitioner
Distributes keys by hashing (roughly uniformly)
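In effect, a partitioner is a function from a key (and the number of reducers) to a reducer index. A hash partitioner, sketched in Python with a made-up function name; Hadoop's real HashPartitioner hashes the serialized key in Java, so the actual indices won't match this sketch:

```python
def hash_partition(key, num_reducers):
    # Same key -> same reducer index, and keys spread roughly evenly.
    return hash(key) % num_reducers

# every (key, value) pair for 'row-17' reaches the same reducer:
assert hash_partition('row-17', 4) == hash_partition('row-17', 4)
assert 0 <= hash_partition('row-17', 4) < 4
```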
50. Storing a matrix by rows
[Figure: a weighted directed graph and its 6×6 weighted adjacency matrix, shown in compressed sparse row (rp, ci, ai) and compressed sparse column (cp, ri, ai) storage, and as per-row lists of (column, value) pairs:
Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6 (no nonzeros)]
51. Storing a matrix by rows in a text-file
[Figure: the same graph and compressed sparse row/column storage as the previous slide, now with each row of the matrix written as a line of (column, value) pairs:
Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6 (no nonzeros)]
52. Sparse matrix-vector product
[Ax]_i = Σ_j A_{i,j} x_j
[Figure: the matrix stored by rows,
Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
and the vector x = (2.1, -1.3, 0.5, 0.6, -1.2, 0.89).]
To make this work, we need to get the value of the vector to the same function as the column of the matrix.
53. Sparse matrix-vector product
[Ax]_i = Σ_j A_{i,j} x_j
[Figure: the same matrix-by-rows and vector as the previous slide.]
We need to “join” the matrix and vector based on the column.
54. Sparse matrix-vector product
takes two MR tasks
Task 1: form Aij xj for each nonzero.
Map (two types of records!):
if vector, emit (row, vecval)
if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce (one of these values is not like the others):
find vecval in the input
for each (col,(row,val)), emit (row, val*vecval)
Task 2: regroup data by rows, compute sums.
Map: identity
Reduce: on (row, [Aij xj, …]), emit (row, sum(Aij xj))
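A single-machine sketch of the two tasks. The record conventions are invented for illustration: vector entries arrive as (j, ('x', xj)) and matrix nonzeros as (anything, ('A', row, col, val)):

```python
def map1(key, value):
    if value[0] == 'x':                # vector entry, keyed by its index j
        yield key, value
    else:                              # matrix nonzero ('A', row, col, val)
        _, row, col, val = value
        yield col, ('A', row, val)     # key by column so it meets x_j

def reduce1(col, values):
    vals = list(values)                # note: buffers the whole column
    xj = next(v[1] for v in vals if v[0] == 'x')
    for v in vals:
        if v[0] == 'A':
            yield v[1], v[2] * xj      # emit (row, A_ij * x_j)

def reduce2(row, products):            # the map of task 2 is the identity
    yield row, sum(products)           # [Ax]_i = sum_j A_ij x_j
```

Running the two shuffles by hand on a 2-by-2 example reproduces Ax entry by entry.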
55. What about a “dense” row?
Map:
if vector, emit (row, vecval)
if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce:
find vecval in the input
for each (col,(row,val)), emit (row, val*vecval)
Form Aij xj for each nonzero.
How do we find vecval without looking through (and buffering) all the input?
56. Sparse matrix-vector product
takes two MR tasks
Map (two types of records!):
if vector, emit ((row,-1), vecval)
if matrix, for each non-zero (row,col,val), emit ((col,0), (row,val))
Reduce:
find vecval in the input keys
for each (col,(row,val)), emit (row, val*vecval)
Use a custom partitioner to make sure that (row,*) all get mapped to the same reducer, and that we always see (row,-1) before (row,0).
Form Aij xj for each nonzero; regroup data by rows, compute sums.
60. Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
[Figure: Mapper 1 runs a serial TSQR on A1 through A4, stacking rows and factoring as it goes (qr gives Q2 R2, then Q3 R3, then Q4 R4) and emits R4. Mapper 2 does the same on A5 through A8 and emits R8. Reducer 1 runs a serial TSQR on R4 and R8 (qr) and emits the final R.]
61. In hadoopy
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self,blocksize,isreducer):
        self.bsize=blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(
            numpy.array(self.data),'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self,key,value):
        self.data.append(value)
        if len(self.data)>self.bsize*len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0,2000000000)
            yield key, row

    def mapper(self,key,value):
        self.collect(key,value)

    def reducer(self,key,values):
        for value in values: self.mapper(key,value)

if __name__=='__main__':
    mapper = SerialTSQR(blocksize=3,isreducer=False)
    reducer = SerialTSQR(blocksize=3,isreducer=True)
    hadoopy.run(mapper, reducer)
62. Related resources
Apache Mahout
Machine learning for Hadoop
… lots of matrices there …
Another fantastic tutorial
http://www.eurecom.fr/~michiard/
teaching/webtech/tutorial.pdf
63. Way too much stuff!
I hope to keep expanding this tutorial
over the week…
Keep checking the git repo.