Behm Shah Pagerank
1. Computing PageRank Using Hadoop (+ Introduction to MapReduce). Alexander Behm, Ajey Shah, University of California, Irvine. Instructor: Prof. Chen Li
7. MapReduce Flow
Input: Split the input into Key-Value pairs.
Map: For each K-V pair, call Map. Each Map produces a new set of K-V pairs.
Sort: The Map output is sorted by key.
Reduce(K, V[ ]): For each distinct key, call Reduce. It produces one K-V pair for each distinct key.
Output: A set of Key-Value pairs.
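The flow on this slide can be modeled in a few lines of Python. This is only a single-process sketch for illustration, not Hadoop code: `run_mapreduce`, `identity_map`, and `sum_reduce` are hypothetical names, and a real framework would run the map and reduce calls in parallel on different machines.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process model of the flow: map, sort by key, reduce."""
    # Map phase: call map_fn once per input pair; each call may emit many pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Sort phase: bring pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key, one output pair per key.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.append(reduce_fn(key, values))
    return output


def identity_map(key, value):
    # Hypothetical map function: pass the pair through unchanged.
    return [(key, value)]


def sum_reduce(key, values):
    # Hypothetical reduce function: add up all values seen for this key.
    return (key, sum(values))


pairs = [("a", 1), ("b", 2), ("a", 3)]
result = run_mapreduce(pairs, identity_map, sum_reduce)
print(result)  # [('a', 4), ('b', 2)]
```

The sort step is what guarantees that Reduce sees all values for a key together, which is why it sits between the two phases.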
8. MapReduce WordCount Example
Input: File containing words
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop
Output: Number of occurrences of each word
Bye 3
Hadoop 4
Hello 3
World 2
How can we do this within the MapReduce framework? Basic idea: parallelize on lines in input file!
9. MapReduce WordCount Example
Input (one Map call per line):
1, “Hello World Bye World”
2, “Hello Hadoop Bye Hadoop”
3, “Bye Hadoop Hello Hadoop”
Map Output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
<Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
Map(K, V) {
  For each word w in V
    Collect(w, 1);
}
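A runnable Python version of the word count may help make the example concrete. `word_count_map` follows the Map pseudocode on this slide. The reduce step is not shown here, so `word_count_reduce` (summing the 1s per word) is an assumption, as is the dictionary grouping that stands in for the framework's sort step.

```python
from collections import defaultdict


def word_count_map(line_number, line):
    # Mirrors the Map pseudocode: emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)


def word_count_reduce(word, counts):
    # Assumed reduce step: add up the 1s collected for this word.
    return (word, sum(counts))


lines = [
    (1, "Hello World Bye World"),
    (2, "Hello Hadoop Bye Hadoop"),
    (3, "Bye Hadoop Hello Hadoop"),
]

# Group the map output by key (stands in for the framework's sort step).
grouped = defaultdict(list)
for line_number, line in lines:
    for word, one in word_count_map(line_number, line):
        grouped[word].append(one)

counts = dict(word_count_reduce(w, grouped[w]) for w in sorted(grouped))
print(counts)  # {'Bye': 3, 'Hadoop': 4, 'Hello': 3, 'World': 2}
```

The result matches the expected output from the previous slide.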
25. PageRank on MapReduce
Why is storage a challenge?
UCI domain: 500000 pages
Assuming 4 Bytes per entry:
Size of Vector: 500000 * 4 = 2000000 Bytes = 2MB
Size of Matrix: 500000 * 500000 * 4 = 10^12 Bytes = 1TB
This assumes a fully connected graph. Clearly this is very unrealistic for web pages!
Solution: Sparse Matrix
But: Row-Wise or Column-Wise? Depends on usage patterns (i.e. how we do parallel matrix multiplication, updating of matrix, etc.)
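The storage arithmetic can be checked directly. The page count and entry size come from the slide. The average of 10 links per page is a made-up figure used only to show the order of magnitude a sparse matrix might save, and the sparse estimate assumes each stored nonzero needs one index and one value.

```python
N_PAGES = 500_000        # pages in the UCI domain (from the slide)
BYTES_PER_ENTRY = 4      # entry size assumed on the slide

vector_bytes = N_PAGES * BYTES_PER_ENTRY
dense_matrix_bytes = N_PAGES * N_PAGES * BYTES_PER_ENTRY

# Hypothetical average number of links per page, chosen only for illustration.
AVG_LINKS_PER_PAGE = 10
# Assumption: each stored nonzero needs an index plus a value (2 entries).
sparse_matrix_bytes = N_PAGES * AVG_LINKS_PER_PAGE * 2 * BYTES_PER_ENTRY

print(vector_bytes)         # 2000000 (2 MB)
print(dense_matrix_bytes)   # 1000000000000 (10**12 bytes, 1 TB)
print(sparse_matrix_bytes)  # 40000000 (40 MB under the assumptions above)
```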
26. PageRank on MapReduce
Parallel Matrix Multiplication
Requirement: Make it work! Simple but practical solution.
M x V: every row of M is “combined” with V, yielding one element of M x V each.
Intuition:
- Parallelize on rows: each parallel task computes one final value
- Use row-wise sparse matrix, so above can be done easily! (column-wise is actually better for PageRank)
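The slide notes that column-wise storage is better for PageRank but does not show how it would work. One possible reading, offered here as an assumption rather than the authors' method: each map task holds one column j and needs only the single vector element v[j], emitting a partial product for every row it touches, and the reduce step sums the partial products per row. The names `column_map` and `column_reduce` and the small 3x3 matrix are hypothetical.

```python
from collections import defaultdict


def column_map(col_index, column, v_j):
    # column: list of (row_index, value) nonzeros in column col_index.
    # Only one vector element, v_j, is needed for the whole column.
    for row_index, value in column:
        yield (row_index, value * v_j)


def column_reduce(row_index, partial_products):
    # Sum the contributions that every column made to this row.
    return (row_index, sum(partial_products))


# 3x3 example matrix stored column-wise (zeros omitted):
#     [0.0 0.5 1.0]
# M = [0.5 0.0 0.0]
#     [0.5 0.5 0.0]
columns = {
    0: [(1, 0.5), (2, 0.5)],
    1: [(0, 0.5), (2, 0.5)],
    2: [(0, 1.0)],
}
v = [0.25, 0.25, 0.5]

# Group partial products by row (stands in for the framework's sort step).
grouped = defaultdict(list)
for j, column in columns.items():
    for i, product in column_map(j, column, v[j]):
        grouped[i].append(product)

result = [column_reduce(i, grouped[i])[1] for i in range(len(v))]
print(result)  # [0.625, 0.125, 0.25]
```

In this variant the reduce step does real work, and no task needs the whole vector, which may be part of why the slide prefers it for PageRank.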
28. PageRank on MapReduce
Map(Key, Row) {
  Vector v = getVector();
  Float sum = 0;
  For each Element e in Row
    sum += e.value * v.at(e.columnNumber);
  collect(Key, sum);
}
Reduce(Key, Value) {
  collect(Key, Value);
}
Map-Reduce procedures for parallel matrix*vector multiplication using row-wise sparse matrix
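The row-wise procedure above translates almost line for line into Python. This is a single-process sketch with hypothetical names (`row_map`, `row_reduce`). The vector is passed in as an argument instead of being fetched with `getVector()`, and the small 3x3 matrix is made up for illustration.

```python
def row_map(row_index, row, vector):
    # row: list of (column_index, value) nonzeros of one matrix row.
    total = 0.0  # floating-point accumulator, since the products are fractional
    for column_index, value in row:
        total += value * vector[column_index]
    return (row_index, total)


def row_reduce(row_index, value):
    # Identity reduce: the map step already produced the final element.
    return (row_index, value)


# 3x3 example matrix stored row-wise (zeros omitted):
#     [0.0 0.5 1.0]
# M = [0.5 0.0 0.0]
#     [0.5 0.5 0.0]
rows = {
    0: [(1, 0.5), (2, 1.0)],
    1: [(0, 0.5)],
    2: [(0, 0.5), (1, 0.5)],
}
v = [0.25, 0.25, 0.5]

mapped = [row_map(i, row, v) for i, row in rows.items()]
result = [row_reduce(i, value)[1] for i, value in sorted(mapped)]
print(result)  # [0.625, 0.125, 0.25]
```

Each map call needs the full vector, which is cheap here (2MB in the UCI estimate) but is the main cost of the row-wise layout.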
31. References
[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Sixth Symposium on Operating System Design and Implementation (OSDI'04), San Francisco, CA, December 2004
[2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html
[3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
[4] http://lucene.apache.org/hadoop/
32. Flow
TextInputFormat implements InputFormat: getSplits(), getRecordReader()
LineRecordReader implements RecordReader: one for each Split; Next(Key, Value)
FileSplit implements InputSplit: File, Start Offset, End Offset, Hosts where chunks of File live
FileInputFormat implements InputFormat: getSplits(), getRecordReader()
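To illustrate how these pieces fit together, here is a small Python model of splitting a file and reading records from each split. It is not Hadoop's implementation. The names mirror the slide, but `get_splits`, `line_record_reader`, the fixed split size, and the rule for lines that cross a split boundary are all assumptions of this sketch, and the "hosts" field of a split is left out.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FileSplit:
    path: str
    start: int  # first byte of the split
    end: int    # one past the last byte of the split


def get_splits(path, split_size):
    # Cut the file into fixed-size byte ranges, ignoring line boundaries.
    size = os.path.getsize(path)
    return [FileSplit(path, s, min(s + split_size, size))
            for s in range(0, size, split_size)]


def line_record_reader(split):
    # Yield (byte offset, line) for every line that starts inside the split.
    with open(split.path, "rb") as f:
        if split.start > 0:
            # Step back one byte and discard up to the next newline, so a
            # line that began in the previous split is not read twice.
            f.seek(split.start - 1)
            f.readline()
        pos = f.tell()
        while pos < split.end:
            line = f.readline()
            if not line:
                break
            yield (pos, line.decode("utf-8").rstrip("\n"))
            pos = f.tell()


with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Hello World Bye World\nHello Hadoop Bye Hadoop\n"
              b"Bye Hadoop Hello Hadoop\n")

records = []
for split in get_splits(tmp.name, split_size=30):
    records.extend(line_record_reader(split))
os.remove(tmp.name)
print(records)
```

Each record is a (key, value) pair with the byte offset as key and the line as value, which is the kind of input the Map function in the word count example receives. Every line is read exactly once even though the 30-byte splits cut through the middle of lines.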