A hands-on introduction
to scientific data analysis
with Hadoop

A matrix computations perspective

DAVID F. GLEICH, PURDUE UNIVERSITY
ICME MAPREDUCE WORKSHOP @ STANFORD




Who is this for?

      workshop project groups
      
      those curious about
      “MapReduce” and “Hadoop”
      
      those who think about
      problems as matrices




What should you get out of it?

      1. understand some problems that
      MapReduce solves effectively
      
      2. learn techniques to solve them
      using Hadoop and dumbo
      
      3. learn some Hadoop words




What you won’t learn …

     the latest and greatest in
     MapReduce algorithms
     
     how to improve the performance
     of your Hadoop job
     
     how to write wordcount
     in Hadoop




Slides will be online soon.

Code samples and short tutorials at
github.com/dgleich/mrmatrix




1.  HPC vs. Data (redux)
2.  MapReduce vs. Hadoop
3.  Dive into Hadoop with
    Hadoop streaming
4.  Sparse matrix methods
    with Hadoop




High performance
computing 
vs.
Data intensive
computing




Supercomputer              Data computing cluster
224k cores                 80k cores
10 PB drive                50 PB drive
1.7 Pflops                 ? Pflops
7 MW                       ? MW
Custom interconnect        Gb ethernet
$104 M                     $?? M
45 GB/core                 625 GB/core
icme-hadoop1
12 nodes; 4-core i7 processors, 24 GB RAM/node, 1 Gb ethernet

12 TB/node, 3000 GB/core, 50 TB usable space (3x redundancy)




MapReduce is designed to
solve a different set of problems




Supercomputer → Data computing cluster → Engineer

Each multi-day HPC simulation generates gigabytes of data.
A data cluster can hold hundreds or thousands of old
simulations … enabling engineers to query and analyze months
of simulation data for all sorts of neat purposes.
MapReduce and
Hadoop overview




The MapReduce
programming model
     Input a list of (key, value) pairs
     Map apply a function f to all pairs
     Reduce apply a function g to
       all values with key k (for all k)
     Output a list of (key, value) pairs
     
     
     




The MapReduce
programming model
     Input a list of (key, value) pairs
     Map apply a function f to all pairs
     Reduce apply a function g to
       all values with key k (for all k)
     Output a list of (key, value) pairs
     
     Map function f must be side-effect free
     Reduce function g must be side-effect free
     




The MapReduce
programming model
     Input a list of (key, value) pairs
     Map apply a function f to all pairs
     Reduce apply a function g to
       all values with key k (for all k)
     Output a list of (key, value) pairs
     
     All map functions can be done in parallel
     All reduce functions (for key k) can be done
       in parallel




The MapReduce
programming model
     Input a list of (key, value) pairs
     Map apply a function f to all pairs
     Reduce apply a function g to
       all values with key k (for all k)
     Output a list of (key, value) pairs
     
     Shuffle group all pairs with key k together
       (sorting suffices)
     




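The whole model fits in a few lines of plain Python. This is an
in-memory sketch of the semantics only (not Hadoop): map, shuffle
by sorting, then reduce.

     from itertools import groupby
     from operator import itemgetter

     def mapreduce(pairs, f, g):
         # map: apply f to every (key, value) pair
         mapped = [kv for key, value in pairs for kv in f(key, value)]
         # shuffle: group all pairs with the same key (sorting suffices)
         mapped.sort(key=itemgetter(0))
         # reduce: apply g to each key's group of values
         return [kv for k, group in groupby(mapped, key=itemgetter(0))
                    for kv in g(k, [v for _, v in group])]

     # toy example: sum values by key
     f = lambda k, v: [(k, v)]              # identity map
     g = lambda k, vals: [(k, sum(vals))]   # summing reduce
     print mapreduce([('a', 1), ('b', 2), ('a', 3)], f, g)
     # [('a', 4), ('b', 2)]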
Mesh point variance in MapReduce

[Figure: three simulation runs (Run 1, Run 2, Run 3), each with
timesteps T=1, T=2, T=3 over the same mesh]




Mesh point variance in MapReduce

[Figure: runs 1–3, timesteps T=1–3, flowing through mappers (M)
and then reducers (R)]

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the
   same reducer.
3. Reducers just compute a numerical variance.




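In dumbo, the variance job might look like the sketch below. The
record format — mesh-point id as key, the scalar value at that
point in one run/timestep as value — is an assumption made for
illustration, not the workshop's actual data layout.

     #!/usr/bin/env dumbo
     # Sketch of the mesh-point variance job. Assumes each record is
     # (key=<mesh point id>, value=<value at that point in one run>).

     def mapper(key, value):
         # the mesh point id is already the key; pass the value through
         yield key, float(value)

     def reducer(key, values):
         # one-pass variance over all runs/timesteps for this mesh point
         n, s, s2 = 0, 0.0, 0.0
         for v in values:
             n += 1; s += v; s2 += v*v
         yield key, s2/n - (s/n)**2

     if __name__ == '__main__':
         import dumbo
         dumbo.run(mapper, reducer)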
MapReduce vs. Hadoop
     MapReduce
     A computation model with:
       Map – a local data transform
       Shuffle – a grouping function
       Reduce – an aggregation
     
     Hadoop
     An implementation of MapReduce
     using the HDFS parallel file-system.
     
     Others
     Phoenix++, Twister,
     Google MapReduce, Spark, …




Why so many limitations?




Data scalability

[Figure: map tasks (M) run directly on the nodes storing data
blocks 1–5; reducers (R) receive the shuffled map output]

      The idea
      Bring the computations to the data.
      MR can schedule map functions without
      moving data.




Mesh point variance in MapReduce

[Figure: runs 1–3, timesteps T=1–3, flowing through mappers (M)
and then reducers (R)]

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the
   same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!




heartbreak on node rs252
After waiting in the queue for a month and "
after 24 hours of finding eigenvalues, one node randomly hiccups. 




Fault tolerant

[Figure: input stored in triplicate feeds the map tasks (M); map
output is persisted to disk before the shuffle; reduce (R)
input/output is on disk]

  Redundant input helps make maps data-local
  Just one type of communication: shuffle




Fault injection

[Plot: time to completion (sec) vs. 1/Prob(failure) – the mean
number of successes per failure, from 10 to 1000 – for jobs with
and without faults on 200M-by-200 and 800M-by-10 matrices]

With 1/5 tasks failing, the job only takes twice as long.




Diving into Hadoop
(with python)




Tools I like


      hadoop streaming
        dumbo
        mrjob
        hadoopy
        C++




Tools I don’t use but other
people seem to like …

      pig
      java
      hbase
      Eclipse
      Cassandra
      




hadoop streaming

     the map function is a program:
     (key,value) pairs are sent via stdin;
     output (key,value) pairs go to stdout
     
     the reduce function is a program:
     (key,value) pairs are sent via stdin;
     keys are grouped;
     output (key,value) pairs go to stdout




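For example, a single Python file could serve as both streaming
programs. This is a sketch: the word-count-style logic and the
'map'/'reduce' argument convention are mine, but the tab-separated
stdin/stdout protocol is Hadoop streaming's default.

     #!/usr/bin/env python
     # Minimal Hadoop streaming sketch; run with -mapper "prog.py map"
     # and -reducer "prog.py reduce". Streaming sends records on stdin
     # and expects key<TAB>value lines on stdout.
     import sys

     def map_stdin():
         # emit (first token, 1) for every input line
         for line in sys.stdin:
             words = line.split()
             if words:
                 print '%s\t%d' % (words[0], 1)

     def reduce_stdin():
         # input arrives grouped (sorted) by key; sum each key's values
         key, total = None, 0
         for line in sys.stdin:
             k, v = line.rstrip('\n').split('\t', 1)
             if k != key:
                 if key is not None: print '%s\t%d' % (key, total)
                 key, total = k, 0
             total += int(v)
         if key is not None: print '%s\t%d' % (key, total)

     if __name__ == '__main__':
         map_stdin() if sys.argv[1] == 'map' else reduce_stdin()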
dumbo
    a wrapper around hadoop streaming for
    map and reduce functions in python

    #!/usr/bin/env dumbo

    def mapper(key,value):
        """ Each record is a line of text.
        key=<byte that the line starts in the file>
        value=<line of text>
        """
        valarray = [float(v) for v in value.split()]
        yield key, sum(valarray)

    if __name__=='__main__':
        import dumbo
        import dumbo.lib
        dumbo.run(mapper,dumbo.lib.identityreducer)




How can Hadoop streaming possibly be fast?

Synthetic data test: 100,000,000-by-500 matrix (~500 GB).
Computing the R in a QR factorization.

  Codes implemented in MapReduce streaming
  Matrix stored as TypedBytes lists of doubles
  Python frameworks use Numpy+Atlas
  Custom C++ TypedBytes reader/writer with Atlas
  New non-streaming Java implementation too

          Iter 1        Iter 1          Iter 2          Overall
          QR (secs.)    Total (secs.)   Total (secs.)   Total (secs.)
Dumbo     67725         960             217             1177
Hadoopy   70909         612             118             730
C++       15809         350             37              387
Java                    436             66              502

        C++ in streaming beats a native Java implementation.
        (All timing results from the Hadoop job tracker.)




Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!




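The actual demo scripts are in the git repo; a minimal stand-in for
steps 1 and 2 might look like this (the file name and matrix size
are illustrative, not the workshop's demo values).

     #!/usr/bin/env python
     # Generate a small random tall-and-skinny matrix as text, one row
     # per line (step 1), then copy it into HDFS (step 2).
     import random, subprocess

     m, n = 10000, 10   # rows, columns
     with open('matrix.txt', 'w') as f:
         for _ in xrange(m):
             f.write(' '.join('%.6f' % random.gauss(0, 1)
                              for _ in xrange(n)) + '\n')

     # put the file on HDFS; the row-sum job reads it as text
     subprocess.check_call(['hadoop', 'fs', '-put',
                            'matrix.txt', 'matrix.txt'])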
How does Hadoop know
     key = byte in file
     value = line of text?
     
InputFormat
Map a file on HDFS to (key,value) pairs

TextInputFormat
Map a text file to (<byte offset>, <line>) pairs




The Hadoop Distributed File System (HDFS)
and a big text file
             HDFS stores files in 64MB chunks.
             Each chunk is a FileSplit.
             FileSplits are stored in parallel.
             
             An InputFormat converts FileSplits
             into a sequence of key-value records.
             FileSplits can cross record borders
             (a small bit of communication).




Tall-and-skinny matrix
storage in MapReduce
A : m x n, m ≫ n

Key is an arbitrary row-id.
Value is the 1 x n array for a row.

Each submatrix Ai is an InputSplit
(the input to a map task).

[Figure: A split by rows into submatrices A1, A2, A3, A4]



hadoop
  output row-sum for all local rows

MPI
  parallel load
  for my-batch-of-rows
    compute row-sum
  parallel save



Isn’t reading and writing text
files rather inefficient?




Sequence Files and
OutputFormat
      SequenceFile
      An internal Hadoop file format to store
      (key, value) pairs efficiently. Used between
      map and reduce steps.
      
      OutputFormat
      Map (key, value) pairs to output on disk.
      
      TextOutputFormat
      Map (key, value) pairs to key<tab>value strings.




typedbytes

     A simple binary serialization scheme.
     [<1-byte-type-flag> <binary-value>]*
     Roughly equivalent to JSON
     
     (Optionally) used to communicate to and
     from Hadoop streaming.




typedbytes example

     def _read(self):
       t = unpack_type(self.file.read(1))[0]
       self.t = t
       return self.handler_table[t](self)

     def read_vector(self):
       r = self._read
       count = unpack_int(self.file.read(4))[0]
       return tuple(r() for i in xrange(count))




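The same package makes it easy to read a typedbytes stream of
pairs outside Hadoop, too; this copy loop is essentially the usage
pattern from the typedbytes module's own documentation.

     import sys
     import typedbytes

     input = typedbytes.PairedInput(sys.stdin)
     output = typedbytes.PairedOutput(sys.stdout)

     # copy a stream of (key, value) pairs, record by record
     for (key, value) in input:
         output.write((key, value))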
Demo 2 
Column sums




Column sums in dumbo

     #!/usr/bin/env dumbo

     def mapper(key,value):
         """ Each record is a line of text. """
         valarray = [float(v) for v in value.split()]
         for col,val in enumerate(valarray):
             yield col, val

     def reducer(col,values):
         yield col, sum(values)

     if __name__=='__main__':
         import dumbo
         import dumbo.lib
         dumbo.run(mapper,reducer)




Isn’t this just moving the data
to the computation?

      Yes.
      
      It seems much worse than MPI.

      MPI
        parallel load
        for my-batch-of-rows
          update sum of each column
        parallel reduce partial column sums
        parallel save




The MapReduce
programming model
     Input a list of (key, value) pairs
     Map apply a function f to all pairs
     Combine apply g to local values with key k
     Shuffle group all pairs with key k together
     Reduce apply a function g to
       all values with key k
     Output a list of (key, value) pairs
     




Column sums in dumbo

     #!/usr/bin/env dumbo

     def mapper(key,value):
         """ Each record is a line of text. """
         valarray = [float(v) for v in value.split()]
         for col,val in enumerate(valarray):
             yield col, val

     def reducer(col,values):
         yield col, sum(values)

     if __name__=='__main__':
         import dumbo
         import dumbo.lib
         dumbo.run(mapper,reducer,combiner=reducer)




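Reusing the reducer as the combiner works here because sum is
associative and commutative and because the combiner's output type
matches the reducer's input type; a reducer that must see all
values at once (a median, say) cannot double as a combiner.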
How many mappers and
reducers?


    The number of maps is the number of
    InputSplits.
    
    You choose how many reducers.
    Each reducer outputs to a separate file.




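With dumbo, the reducer count is a launch-time option; for
instance (the paths and program name here are illustrative):

     dumbo start colsums.py -hadoop $HADOOP_INSTALL \
       -input matrix.txt -output colsums \
       -numreducetasks 4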
Demo 3 
Column sums with multiple
reducers




Which reducer does my key
go to?


     Partitioner
     Map a given key to a reducer
     
     HashPartitioner
     Distribute keys by their hash
     (effectively at random)




Sparse matrix methods




Storing a matrix by rows

Row 1  (2,16.) (3,13.)
Row 2  (3,10.) (4,12.)
Row 3  (2,4.) (5,14.)
Row 4  (3,9.) (6,20.)
Row 5  (4,7.) (6,4.)
Row 6

[Background: a page illustrating compressed sparse row and
compressed sparse column storage of the same matrix]
Storing a matrix by rows in a text-file

Row 1  (2,16.) (3,13.)
Row 2  (3,10.) (4,12.)
Row 3  (2,4.) (5,14.)
Row 4  (3,9.) (6,20.)
Row 5  (4,7.) (6,4.)
Row 6

[Background: a page illustrating compressed sparse row and
compressed sparse column storage of the same matrix]
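As a concrete text encoding — one of many possible, and an
assumption here rather than the repo's exact format — each line
could hold a row id followed by its (column, value) pairs:

     1  2:16.0  3:13.0
     2  3:10.0  4:12.0
     3  2:4.0   5:14.0
     4  3:9.0   6:20.0
     5  4:7.0   6:4.0
     6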
Sparse matrix-vector product

[Ax]_i = Σ_j A_ij x_j

The matrix (by rows)            The vector
Row 1  (2,16.) (3,13.)          1   2.1
Row 2  (3,10.) (4,12.)          2  -1.3
Row 3  (2,4.) (5,14.)           3   0.5
Row 4  (3,9.) (6,20.)           4   0.6
Row 5  (4,7.) (6,4.)            5  -1.2
Row 6                           6   0.89

To make this work, we need to get the value of the vector to the
same function as the column of the matrix.
Sparse matrix-vector product

[Ax]_i = Σ_j A_ij x_j

The matrix (by rows)            The vector
Row 1  (2,16.) (3,13.)          1   2.1
Row 2  (3,10.) (4,12.)          2  -1.3
Row 3  (2,4.) (5,14.)           3   0.5
Row 4  (3,9.) (6,20.)           4   0.6
Row 5  (4,7.) (6,4.)            5  -1.2
Row 6                           6   0.89

We need to “join” the matrix and vector based on the column.
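Before distributing anything, it helps to pin down the computation
in plain Python on the slide's example; this loop is exactly what
the two MapReduce jobs below reproduce.

     # [Ax]_i = sum_j A_ij x_j on the example matrix and vector
     A = {(1,2):16., (1,3):13., (2,3):10., (2,4):12., (3,2):4.,
          (3,5):14., (4,3):9., (4,6):20., (5,4):7., (5,6):4.}
     x = {1:2.1, 2:-1.3, 3:0.5, 4:0.6, 5:-1.2, 6:0.89}
     y = {}
     for (i, j), a in A.items():
         y[i] = y.get(i, 0.0) + a*x[j]   # accumulate A_ij * x_j into y_i
     print sorted(y.items())
     # roughly: y1=-14.3, y2=12.2, y3=-22.0, y4=22.3, y5=7.76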
Sparse matrix-vector product
takes two MR tasks

Job 1 (two types of records!)
Map
  If vector, emit (row, vecval)
  If matrix, for each non-zero (row,col,val),
    emit (col, (row,val))
Reduce
  Find vecval in the input values
  (one of these values is not like the others)
  For each (col,(row,val)), emit (row, val*vecval)
  – forms Aij xj for each nonzero

Job 2
Map  Identity
Reduce (row, [(Aij xj), …])
  emit (row, sum(Aij xj))
  – regroups data by rows and computes the sums




What about a “dense” row?

Map
  If vector, emit (row, vecval)
  If matrix, for each non-zero (row,col,val),
    emit (col, (row,val))
Reduce
  Find vecval in the input values
  For each (col,(row,val)), emit (row, val*vecval)
  – forms Aij xj for each nonzero

How do we find vecval without looking through
(and buffering) all the input?




Sparse matrix-vector product
takes two MR tasks

Map (two types of records!)
  If vector, emit ((row,-1), vecval)
  If matrix, for each non-zero (row,col,val),
    emit ((col,0), (row,val))

Use a custom partitioner to make sure that (row,*)
all get mapped to the same reducer, and that we
always see (row,-1) before (row,0).

Reduce
  Find vecval in the input keys
  For each (col,(row,val)), emit (row, val*vecval)
  – forms Aij xj for each nonzero
Then regroup data by rows and compute sums.




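With dumbo, the first job might look like the sketch below. The
record formats are assumptions (vector records ('v', row, x) and
matrix records ('m', row, col, a)), and the job must be launched
with a partitioner that partitions on the first key component
only — so (col,-1) and (col,0) reach the same reducer, vector
entry first; that wiring is omitted here.

     #!/usr/bin/env dumbo
     def mapper(key, value):
         if value[0] == 'v':                # vector entry x_row
             _, row, x = value
             yield (row, -1), x
         else:                              # matrix entry A[row,col]
             _, row, col, a = value
             yield (col, 0), (row, a)

     class Reducer:
         def __init__(self):
             self.vecval = None             # x_col for the current column
         def __call__(self, key, values):
             col, flag = key
             if flag == -1:
                 for x in values:           # exactly one vector entry
                     self.vecval = x
             else:
                 for row, a in values:
                     yield row, a*self.vecval   # emit A[row,col]*x[col]

     if __name__ == '__main__':
         import dumbo
         dumbo.run(mapper, Reducer)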
Demo 4 
Sparse matrix vector products




Matrix factorizations




Algorithm
     Data   Rows of a matrix
     Map    QR factorization of rows
     Reduce QR factorization of rows

[Figure: Mapper 1 (Serial TSQR) streams A1–A4 through repeated
qr steps and emits R4; Mapper 2 streams A5–A8 and emits R8;
Reducer 1 (Serial TSQR) runs qr on R4 and R8 and emits the
final R]




In hadoopy
  Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
  def __init__(self,blocksize,isreducer):
    self.bsize=blocksize
    self.data = []
    if isreducer: self.__call__ = self.reducer
    else: self.__call__ = self.mapper

  def compress(self):
    R = numpy.linalg.qr(
          numpy.array(self.data),'r')
    # reset data and re-initialize to R
    self.data = []
    for row in R:
      self.data.append([float(v) for v in row])

  def collect(self,key,value):
    self.data.append(value)
    if len(self.data) > self.bsize*len(self.data[0]):
      self.compress()

  def close(self):
    self.compress()
    for row in self.data:
      key = random.randint(0,2000000000)
      yield key, row

  def mapper(self,key,value):
    self.collect(key,value)

  def reducer(self,key,values):
    for value in values: self.mapper(key,value)

if __name__=='__main__':
  mapper = SerialTSQR(blocksize=3,isreducer=False)
  reducer = SerialTSQR(blocksize=3,isreducer=True)
  hadoopy.run(mapper, reducer)




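The blocksize parameter trades memory for the number of QR calls:
collect buffers up to blocksize times n rows for an n-column
matrix before compress replaces the buffer with the n-by-n factor
R, so the local working set stays a small multiple of n² no
matter how tall the input is.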
Related resources

     Apache Mahout
     Machine learning for Hadoop 
     … lots of matrices there …
     
     Another fantastic tutorial
     http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf




Way too much stuff!



     I hope to keep expanding this tutorial
     over the week… 
     
     Keep checking the git repo.





 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 

Kürzlich hochgeladen

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 

Kürzlich hochgeladen (20)

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 

MapReduce for scientific simulation analysis

• 11. Supercomputer / Data computing cluster / Engineer. Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.
• 12. MapReduce and Hadoop overview
• 13. The MapReduce programming model. Input: a list of (key, value) pairs. Map: apply a function f to all pairs. Reduce: apply a function g to all values with key k (for all k). Output: a list of (key, value) pairs.
• 14. The MapReduce programming model, continued. The map function f must be side-effect free. The reduce function g must be side-effect free.
• 15. The MapReduce programming model, continued. All map functions can be done in parallel. All reduce functions (for key k) can be done in parallel.
• 16. The MapReduce programming model, continued. Shuffle: group all pairs with key k together (sorting suffices).
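The model is small enough to simulate. Below is a minimal, hedged sketch in plain Python (no Hadoop involved) of the map-shuffle-reduce pipeline on an in-memory list of pairs; the function names are illustrative only, not part of any framework.

    from itertools import groupby
    from operator import itemgetter

    def mapreduce(pairs, f, g):
        # Map: apply f to every (key, value) pair; f yields new pairs.
        mapped = [out for kv in pairs for out in f(*kv)]
        # Shuffle: group pairs by key (sorting suffices).
        mapped.sort(key=itemgetter(0))
        # Reduce: apply g to all values sharing a key, for each key k.
        return [out for k, grp in groupby(mapped, key=itemgetter(0))
                    for out in g(k, [v for _, v in grp])]

    # Column sums of a tiny matrix, stored one row of text per record:
    def f(key, line):
        for col, v in enumerate(float(x) for x in line.split()):
            yield col, v

    def g(col, vals):
        yield col, sum(vals)

    print(mapreduce([(0, '1 2'), (8, '3 4')], f, g))
    # [(0, 4.0), (1, 6.0)]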
• 17. Mesh point variance in MapReduce. [Figure: three simulation runs, each with mesh snapshots at T=1, T=2, T=3.]
• 18. Mesh point variance in MapReduce. 1. Each mapper outputs the mesh points, giving the same mesh point the same key. 2. Shuffle moves all values from the same mesh point to the same reducer. 3. Reducers just compute a numerical variance. (A sketch of this pipeline follows.)
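A hedged dumbo-style sketch of that pipeline. The record layout — key = (mesh point, run, time), value = the scalar at that point — is an assumption for illustration, not the workshop's actual data format.

    #!/usr/bin/env dumbo
    def mapper(key, value):
        # key = (mesh_point, run, time); value = scalar field value there.
        # Re-key by mesh point so the shuffle groups every run and time
        # step for that point onto one reducer.
        mesh_point, run, time = key
        yield mesh_point, float(value)

    def reducer(mesh_point, values):
        # Compute a numerical variance of all values at this mesh point.
        vals = list(values)
        mean = sum(vals) / len(vals)
        yield mesh_point, sum((v - mean)**2 for v in vals) / len(vals)

    if __name__ == '__main__':
        import dumbo
        dumbo.run(mapper, reducer)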
• 19. MapReduce vs. Hadoop. MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation). Hadoop: an implementation of MapReduce using the HDFS parallel file-system. Other implementations: Phoenix++, Twisted, Google MapReduce, Spark, …
• 20. Why so many limitations?
• 21. Data scalability. [Figure: map tasks scheduled next to the input blocks they read, feeding a shuffle into reduce tasks.] The idea: bring the computations to the data. MapReduce can schedule map functions without moving data.
• 22. Mesh point variance in MapReduce, revisited: the same three-step pipeline as slide 18 brings the computations to the data!
• 23. Heartbreak on node rs252. After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.
• 24. Fault tolerant. Input is stored in triplicate. Map output is persisted to disk before the shuffle. Reduce input and output live on disk. Redundant input helps make maps data-local. There is just one type of communication: the shuffle.
• 25. Fault injection. [Figure: time to completion versus 1/Prob(failure), the mean number of successes per failure, for 200M-by-200 and 800M-by-10 matrices with and without faults.] With 1/5 of tasks failing, the job only takes twice as long.
• 26. Diving into Hadoop (with Python)
• 27. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++.
• 28. Tools I don't use but other people seem to like: Pig, Java, HBase, Eclipse, Cassandra.
• 29. hadoop streaming. The map function is a program: (key, value) pairs are sent via stdin and output (key, value) pairs go to stdout. The reduce function is a program: (key, value) pairs are sent via stdin, keys arrive grouped, and output (key, value) pairs go to stdout. (A sketch follows.)
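Concretely, a streaming mapper is any executable that turns tab-separated (key, value) lines on stdin into the same format on stdout. A hedged sketch of a row-sum mapper; the launch command in the comment is illustrative (the streaming jar path varies by installation, and `cat` serves as an identity reducer).

    #!/usr/bin/env python
    # rowsum_map.py -- launched with something like (paths illustrative):
    #   hadoop jar hadoop-streaming.jar -input mat.txt -output rowsums \
    #     -mapper rowsum_map.py -reducer cat -file rowsum_map.py
    import sys

    for i, line in enumerate(sys.stdin):
        # With TextInputFormat the value is the line of text; we emit
        # key<TAB>value pairs, keyed by the local line number.
        vals = [float(v) for v in line.split()]
        sys.stdout.write('%d\t%.6f\n' % (i, sum(vals)))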
• 30. dumbo: a wrapper around hadoop streaming for map and reduce functions in python.

    #!/usr/bin/env dumbo
    def mapper(key,value):
        """ Each record is a line of text.
        key=<byte that the line starts in the file>
        value=<line of text>
        """
        valarray = [float(v) for v in value.split()]
        yield key, sum(valarray)

    if __name__=='__main__':
        import dumbo
        import dumbo.lib
        dumbo.run(mapper, dumbo.lib.identityreducer)
• 31. Synthetic data test: a 100,000,000-by-500 matrix (~500 GB). How can Hadoop streaming possibly be fast? The codes are implemented in MapReduce streaming; the matrix is stored as TypedBytes lists of doubles; the Python frameworks use NumPy+Atlas; the C++ code uses a custom TypedBytes reader/writer with Atlas. The task: compute the R in a QR factorization of a new 500 GB matrix; a non-streaming Java implementation is included too.

                Iter 1 QR (secs.)   Iter 1 Total (secs.)   Iter 2 Total (secs.)   Overall Total (secs.)
    Dumbo       67725               960                    217                    1177
    Hadoopy     70909               612                    118                    730
    C++         15809               350                    37                     387
    Java        —                   436                    66                     502

    C++ in streaming beats a native Java implementation. All timing results are from the Hadoop job tracker. [From David Gleich (Sandia), MapReduce 2011.]
• 32. Demo 1. 1. Generate data. 2. Get data to Hadoop. 3. Run row sums. 4. See row sums! (An illustrative shell walkthrough follows.)
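For readers following along, a hedged sketch of what Demo 1's four steps might look like from the shell; the script and path names are illustrative stand-ins for whatever the demo repository actually uses.

    # 1. generate data locally (script name is illustrative):
    #      python gen_matrix.py > mat.txt
    # 2. get the data to Hadoop:
    #      hadoop fs -put mat.txt mat.txt
    # 3. run the dumbo row-sum job from the previous slide:
    #      dumbo start rowsums.py -hadoop $HADOOP_HOME -input mat.txt -output rowsums
    # 4. see the row sums:
    #      dumbo cat rowsums -hadoop $HADOOP_HOME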
• 33. How does Hadoop know key = byte offset in the file and value = line of text? InputFormat: maps a file on HDFS to (key, value) pairs. TextInputFormat: maps a text file to (<byte offset>, <line>) pairs.
• 34. The Hadoop Distributed File System (HDFS) and a big text file. HDFS stores files in 64MB chunks. Each chunk is a FileSplit. FileSplits are stored in parallel. An InputFormat converts FileSplits into a sequence of (key, value) records. FileSplits can cross record borders (a small bit of communication).
• 35. Tall-and-skinny matrix storage in MapReduce. A is m x n with m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is an InputSplit (the input to a map task).
• 36. hadoop versus MPI. hadoop: output the row-sum for all local rows. MPI: parallel load; for my-batch-of-rows, compute the row-sum; parallel save.
• 37. Isn't reading and writing text files rather inefficient?
• 38. Sequence Files and OutputFormat. SequenceFile: an internal Hadoop file format that stores (key, value) pairs efficiently; used between the map and reduce steps. OutputFormat: maps (key, value) pairs to output on disk. TextOutputFormat: maps (key, value) pairs to key<tab>value strings.
• 39. typedbytes. A simple binary serialization scheme: [<1-byte-type-flag> <binary-value>]*. Roughly equivalent to JSON. (Optionally) used to communicate to and from Hadoop streaming.
• 40. typedbytes example (reading, from a reader class):

    def _read(self):
        t = unpack_type(self.file.read(1))[0]
        self.t = t
        return self.handler_table[t](self)

    def read_vector(self):
        r = self._read
        count = unpack_int(self.file.read(4))[0]
        return tuple(r() for i in xrange(count))
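For the writing direction, a hedged sketch of the same scheme. The type codes used here (6 = double, 8 = vector) follow the Hadoop typedbytes specification as I understand it; verify them against your Hadoop version before relying on them.

    import struct

    def write_double(f, x):
        # one-byte type flag (6 = double), then the big-endian IEEE value
        f.write(struct.pack('>bd', 6, x))

    def write_vector(f, vals):
        # one-byte type flag (8 = vector), a 4-byte count, then each element
        f.write(struct.pack('>bi', 8, len(vals)))
        for v in vals:
            write_double(f, v)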
• 41. Demo 2: column sums.
• 42. Column sums in dumbo:

    #!/usr/bin/env dumbo
    def mapper(key,value):
        """ Each record is a line of text. """
        valarray = [float(v) for v in value.split()]
        for col,val in enumerate(valarray):
            yield col, val

    def reducer(col,values):
        yield col, sum(values)

    if __name__=='__main__':
        import dumbo
        import dumbo.lib
        dumbo.run(mapper, reducer)
  • 43. Isn’t this just moving the data to the computation? MPI! parallel load Yes. for my-batch-of-rows update sum of each columns It seems much" parallel reduce partial worse than MPI. column sums parallel save 43 David Gleich · Purdue MRWorkshop
  • 44. The MapReduce programming model Input a list of (key, value) pairs Map apply a function f to all pairs Combine apply g to local values with key k! Shuffle group all pairs with key k together! Reduce apply a function g to " all values with key k Output a list of (key, value) pairs ! 44 David Gleich · Purdue MRWorkshop
• 45. Column sums in dumbo, now with a combiner:

    #!/usr/bin/env dumbo
    def mapper(key,value):
        """ Each record is a line of text. """
        valarray = [float(v) for v in value.split()]
        for col,val in enumerate(valarray):
            yield col, val

    def reducer(col,values):
        yield col, sum(values)

    if __name__=='__main__':
        import dumbo
        import dumbo.lib
        dumbo.run(mapper, reducer, combiner=reducer)

• 46. How many mappers and reducers? The number of maps is the number of InputSplits. You choose how many reducers. Each reducer outputs to a separate file.
• 47. Demo 3: column sums with multiple reducers.
• 48. Which reducer does my key go to? Partitioner: maps a given key to a reducer. HashPartitioner: distributes keys by a hash, effectively at random. (A streaming configuration sketch follows.)
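In streaming jobs you rarely write a Partitioner class; instead you configure key-field-based partitioning. A hedged sketch of the usual knobs — the option names are the standard Hadoop streaming ones to the best of my recollection, so check them against your version.

    # Sort on the first two tab-separated fields of the key but partition
    # on the first field only, so (row,-1) and (row,0) reach the same
    # reducer with (row,-1) arriving first:
    #
    #   hadoop jar hadoop-streaming.jar \
    #     -D stream.num.map.output.key.fields=2 \
    #     -D mapred.text.key.partitioner.options=-k1,1 \
    #     -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    #     ...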
• 49. Sparse matrix methods
• 50. Storing a matrix by rows. [Figure: a weighted directed graph on 6 nodes, its sparse adjacency matrix, and its compressed sparse row arrays rp = 1 3 5 7 9 11 11, ci = 2 3 3 4 2 5 3 6 4 6, ai = 16 13 10 12 4 14 9 20 7 4, alongside the compressed sparse column arrays cp = 1 1 3 6 8 9 11, ri = 1 3 1 2 4 2 5 3 4 5, ai = 16 4 13 10 9 12 7 14 20 4.]
• 51. Storing a matrix by rows in a text-file. Each row record lists its (column, value) pairs: Row 1: (2,16.) (3,13.); Row 2: (3,10.) (4,12.); Row 3: (2,4.) (5,14.); Row 4: (3,9.) (6,20.); Row 5: (4,7.) (6,4.); Row 6: (empty).
• 52. Sparse matrix-vector product: [Ax]_i = sum_j A_{i,j} x_j. [Figure: the matrix rows from slide 51 next to the vector x = (2.1, -1.3, 0.5, 0.6, -1.2, 0.89).]
• 53. Sparse matrix-vector product, continued. To make this work, we need to get the value x_j of the vector to the same function as column j of the matrix: we need to "join" the matrix and vector representations on the column. (A serial reference sketch follows.)
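As a serial reference point before the MapReduce version, a hedged sketch of y = Ax with the compressed sparse row arrays from the figure; the slides use 1-based indices, and this sketch shifts them to 0-based.

    def csr_matvec(rp, ci, ai, x):
        # y[i] = sum over nonzeros nz in row i of ai[nz] * x[ci[nz]]
        y = [0.0] * (len(rp) - 1)
        for i in range(len(rp) - 1):
            for nz in range(rp[i], rp[i + 1]):
                y[i] += ai[nz] * x[ci[nz]]
        return y

    # The 6-node example from the figure, shifted to 0-based indices:
    rp = [0, 2, 4, 6, 8, 10, 10]
    ci = [1, 2, 2, 3, 1, 4, 2, 5, 3, 5]
    ai = [16., 13., 10., 12., 4., 14., 9., 20., 7., 4.]
    x  = [2.1, -1.3, 0.5, 0.6, -1.2, 0.89]
    print(csr_matvec(rp, ci, ai, x))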
• 54. Sparse matrix-vector product takes two MR tasks. Two types of records! Task 1 Map: if the record is a vector entry, emit (row, vecval); if it is a matrix nonzero (row, col, val), emit (col, (row, val)). Task 1 Reduce: find vecval among the inputs (one of these values is not like the others); for each (col, (row, val)) emit (row, val*vecval), forming A_ij x_j for each nonzero. Task 2 regroups the data by rows and computes the sums: Reduce on (row, [(A_ij x_j), …]) emits (row, sum(A_ij x_j)).
• 55. What about a "dense" row? Map: if vector, emit (row, vecval); if matrix, for each nonzero (row, col, val) emit (col, (row, val)). Reduce: find vecval among the inputs; for each (col, (row, val)) emit (row, val*vecval). The problem: how do we find vecval without looking through (and buffering) all the input?
• 56. Sparse matrix-vector product in two MR tasks, fixed. Map: if vector, emit ((row, -1), vecval); if matrix, for each nonzero (row, col, val) emit ((col, 0), (row, val)). Use a custom partitioner to make sure that all keys (row, *) get mapped to the same reducer, and that we always see (row, -1) before (row, 0). Reduce: remember vecval from the (row, -1) key; for each (col, (row, val)) emit (row, val*vecval), forming A_ij x_j for each nonzero. The second task regroups the data by rows and computes the sums. (A sketch follows.)
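A hedged dumbo-style sketch of the first job. The input record layout is an assumption — vector records ('x', j, x_j) and matrix records ('M', i, j, a_ij) — and the reducer relies on the partitioner/sort trick above so that each reduce task sees (j, -1) before any (j, 0); dumbo's handling of stateful callable classes is assumed here rather than guaranteed.

    #!/usr/bin/env dumbo
    def mapper(key, value):
        # vector records: ('x', j, xj); matrix records: ('M', i, j, aij)
        if value[0] == 'x':
            yield (value[1], -1), value[2]
        else:
            yield (value[2], 0), (value[1], value[3])

    class Reducer:
        def __init__(self):
            self.xj = None   # the last vector value seen for this column
        def __call__(self, key, values):
            j, flag = key
            if flag == -1:
                for v in values:   # remember x_j for column j
                    self.xj = v
            else:
                for (i, aij) in values:   # emit A_ij * x_j per nonzero
                    yield i, aij * self.xj

    if __name__ == '__main__':
        import dumbo
        dumbo.run(mapper, Reducer)
        # A second job with an identity mapper and a summing reducer
        # then regroups the (i, A_ij*x_j) pairs by row.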
• 57. Demo 4: sparse matrix-vector products.
• 58. [Figure-only slide.]
• 59. Matrix factorizations
• 60. [Figure: MapReduce TSQR. The data are the rows of a matrix; Map and Reduce both perform QR factorizations of rows. Mapper 1 runs serial TSQR on blocks A1–A4, chaining local QR factorizations and emitting R4; Mapper 2 does the same on A5–A8 and emits R8; Reducer 1 stacks R4 and R8, runs serial TSQR once more, and emits the final R.]
• 61. In hadoopy. Full code in hadoopy:

    import random, numpy, hadoopy

    class SerialTSQR:
        def __init__(self,blocksize,isreducer):
            self.bsize=blocksize
            self.data = []
            if isreducer: self.__call__ = self.reducer
            else: self.__call__ = self.mapper

        def compress(self):
            R = numpy.linalg.qr(
                numpy.array(self.data),'r')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])

        def collect(self,key,value):
            self.data.append(value)
            if len(self.data)>self.bsize*len(self.data[0]):
                self.compress()

        def close(self):
            self.compress()
            for row in self.data:
                key = random.randint(0,2000000000)
                yield key, row

        def mapper(self,key,value):
            self.collect(key,value)

        def reducer(self,key,values):
            for value in values: self.mapper(key,value)

    if __name__=='__main__':
        mapper = SerialTSQR(blocksize=3,isreducer=False)
        reducer = SerialTSQR(blocksize=3,isreducer=True)
        hadoopy.run(mapper, reducer)

    [From David Gleich (Sandia), MapReduce 2011.]
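A hedged local sanity check of the idea (pure NumPy, no Hadoop): QR-factoring row blocks, stacking the small R factors, and factoring once more reproduces the R of the full matrix, up to row signs.

    import numpy

    A = numpy.random.randn(1000, 10)
    # factor each of 4 row blocks, keeping only the R factors
    Rs = [numpy.linalg.qr(block, 'r') for block in numpy.split(A, 4)]
    # one more QR of the stacked R factors gives R of the full matrix
    R_tsqr = numpy.linalg.qr(numpy.vstack(Rs), 'r')
    R_full = numpy.linalg.qr(A, 'r')
    print(numpy.allclose(abs(R_tsqr), abs(R_full)))  # True, up to row signs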
• 62. Related resources. Apache Mahout: machine learning for Hadoop … lots of matrices there … Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf
• 63. Way too much stuff! I hope to keep expanding this tutorial over the week… Keep checking the git repo.