2. It's a pleasure …
"Approximating Personalized PageRank with Minimal Use of Web Graph Data"
David Gleich and Marzia Polito, Internet Mathematics Vol. 3, No. 3: 257-294.
Intel intern, 2005, in the Application Research Lab in Santa Clara, resulting in one of my favorite papers!
Could you run your own search engine and crawl the web to compute your own PageRank vector if you are highly concerned with privacy?
Yes! Theory, Experiments, Implementation!
Abstract. In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access. Our algorithms provide both the approximation to the personalized PageRank as well as guidance in using only the necessary information—and therefore reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation bound for all. Finally, we propose a local, personalized web-search system for a future client system using our algorithms.
3. Massive MapReduce Matrix Computations
Yangyang Hou (Purdue, CS)
Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University)
James Demmel (UC Berkeley)
Joe Ruthruff, Jeremy Templeton (Sandia CA)
Funded by the Sandia National Labs CSAR project.
4. By 2013(?) all Fortune 500
companies will have a data
computer
5. Data computers I've worked with …
Magellan Cluster @ NERSC: 80 nodes, 640 cores, 128GB/core storage, infiniband
Student Cluster @ Stanford: 11 nodes, 44 cores, 3TB/core storage, GB ethernet, cost $30k
Nebula Cluster @ Sandia CA: 64 nodes, 256 cores, 2TB/core storage, GB ethernet, cost $150k
These systems are good for working with enormous matrix data!
6. How do you program them?
8. MapReduce in a picture
Map tasks run in parallel; reduce tasks run in parallel; the shuffle between them is like an MPI all-to-all.
9. Computing a histogram: a simple MapReduce example
Input: images, keyed by ImageId with the pixels as the value. Output: the number of pixels of each color.
Map(ImageId, Pixels)
  for each pixel
    emit Key = (r,g,b), Value = 1
(shuffle groups the emitted values by color)
Reduce(Color, Values)
  emit Key = Color, Value = sum(Values)
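A minimal pure-Python sketch of this histogram job, with the shuffle simulated in memory (the helper names here are illustrative; a real job would run through hadoopy or Hadoop streaming, as in the code later in the talk):

from collections import defaultdict

def mapper(image_id, pixels):
    # emit (color, 1) for every pixel in the image
    for (r, g, b) in pixels:
        yield (r, g, b), 1

def reducer(color, values):
    # sum the counts for one color
    yield color, sum(values)

def run_mapreduce(records, mapper, reducer):
    # "shuffle": group all mapped values by key
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # reduce each group independently (the part a cluster runs in parallel)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

if __name__ == '__main__':
    images = [('img1', [(255, 0, 0), (255, 0, 0), (0, 0, 255)]),
              ('img2', [(255, 0, 0), (0, 255, 0)])]
    print(run_mapreduce(images, mapper, reducer))
    # {(255, 0, 0): 3, (0, 0, 255): 1, (0, 255, 0): 1}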
10. Why a limited computational model?
Data scalability, fault tolerance.
The idea! Bring the computations to the data: MapReduce can schedule map functions without moving data.
Why fault tolerance matters: after waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups, and all you get is the last page of a 136-page error dump.
11. Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion), a few columns (under 10,000).
(Example image data from the tinyimages collection.)
Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, simulation data analysis, and big-data SVD/PCA.
12. Scientific simulations as Tall-and-Skinny matrices
Input parameters s; time history of the simulation f(s), ~100GB per run.
The simulation as a vector: stack every spatial snapshot in time order,
f(s) = [ q(x_1, t_1, s), …, q(x_n, t_1, s), q(x_1, t_2, s), …, q(x_n, t_2, s), …, q(x_n, t_k, s) ]^T.
The simulation as a matrix: space-by-time.
The database of simulations, s_1 -> f_1, s_2 -> f_2, …, s_k -> f_k, is a very tall-and-skinny matrix A.
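A small numpy sketch of this stacking, using a toy stand-in solver q(x, t, s) (the names n_space, n_time, and simulate are illustrative, not from the actual simulation pipeline):

import numpy as np

n_space, n_time = 1000, 300          # grid points and time steps

def simulate(s):
    # stand-in for a real solver: entry [i, j] is q(x_i, t_j, s)
    x = np.linspace(0, 1, n_space)[:, None]
    t = np.linspace(0, 1, n_time)[None, :]
    return np.exp(-s * t) * np.sin(np.pi * x)

def f(s):
    # column-major flatten stacks space fastest, matching the layout above:
    # [q(x_1,t_1,s), ..., q(x_n,t_1,s), q(x_1,t_2,s), ...]
    return simulate(s).flatten(order='F')

params = [0.5, 1.0, 2.0]
A = np.column_stack([f(s) for s in params])   # (n_space*n_time) x (#runs): very tall and skinny
print(A.shape)                                 # (300000, 3)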
13. Model reduction
Constantine & Gleich, ICASSP 2012
A Large Scale Example
Nonlinear heat transfer model
80k nodes, 300 time-steps
104 basis runs
SVD of 24m x 104 data matrix
500x reduction in wall clock time
(100x including the SVD)
14. PCA of 80,000,000 images
Constantine & Gleich, MapReduce 2010.
A is 80,000,000 images (rows) by 1000 pixels (columns), with the rows shifted to zero mean.
TSQR on MapReduce computes R; post-processing computes the SVD of R to get the principal components V and the top 100 singular values.
(Figure: the first 16 columns of V displayed as images.)
16. Quick review of QR
Let A ∈ R^{m×n}, real. The QR factorization is A = QR, where Q is orthogonal (Q^T Q = I) and R is upper triangular.
QR is block normalization: "normalize a vector" usually generalizes to "compute a QR factorization" in block algorithms.
Using QR for regression: the least-squares solution of min ||Ax − b|| is given by solving Rx = Q^T b.
Current MapReduce algorithms use the normal equations, A^T A = R^T R via Cholesky and then Q = AR^{-1}, which can limit numerical accuracy.
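A quick serial numpy illustration of why the normal-equations route can limit accuracy (a generic demo, not the MapReduce code): on an ill-conditioned tall matrix, Q = AR^{-1} with R from the Cholesky factor of A^T A loses orthogonality, while Householder QR does not.

import numpy as np

m, n = 10000, 50
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T   # condition number about 1e6

# normal-equations route: A^T A = R^T R by Cholesky, then Q = A R^{-1}
L = np.linalg.cholesky(A.T @ A)                # L L^T = A^T A, so R = L^T
Q_ne = A @ np.linalg.inv(L.T)

# Householder QR
Q_hh, _ = np.linalg.qr(A)

I = np.eye(n)
print(np.linalg.norm(Q_ne.T @ Q_ne - I))       # large: roughly cond(A)^2 * eps
print(np.linalg.norm(Q_hh.T @ Q_hh - I))       # near machine precision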
17. There are good MPI
implementations.
Why MapReduce?
18. Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
19. Tall-and-skinny matrix storage in MapReduce
A: m x n, m ≫ n, split into row blocks A1, A2, A3, A4.
Key is an arbitrary row-id; value is the 1 x n array for a row.
Each submatrix Ai is the input to a map task.
20. Numerical stability was a problem for prior approaches
(Figure: norm(Q^T Q − I) against the condition number of A, from 10^5 to 10^20.)
Previous methods (AR^{-1}; Constantine & Gleich, MapReduce 2010) couldn't ensure that the matrix Q was orthogonal.
AR^{-1} with iterative refinement and Direct TSQR (Benson, Gleich, Demmel, submitted) fix this.
21. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)
Data: rows of a matrix. Map: QR factorization of its rows. Reduce: QR factorization of its rows.
Mapper 1 runs a serial TSQR on blocks A1 through A4, compressing them through local qr steps and emitting R4; Mapper 2 does the same for A5 through A8 and emits R8. Reducer 1 runs a serial TSQR on R4 and R8 and emits the final R. (Use the standard Householder method for each local qr?)
A "manual reduce" can make it faster by adding a second iteration.
This computes only R and not Q; Q can be recovered via Q = AR^{-1} with another MapReduce iteration.
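A serial numpy sketch of the same R-only reduction (the block count and the two-level tree are illustrative; on MapReduce the per-block factorizations are map tasks and the final factorization is the reduce):

import numpy as np

def tsqr_r(A, nblocks=8):
    blocks = np.array_split(A, nblocks, axis=0)
    # "map": QR of each block of rows, keeping only R
    Rs = [np.linalg.qr(block, mode='r') for block in blocks]
    # "reduce": stack the small R factors and factor once more
    return np.linalg.qr(np.vstack(Rs), mode='r')

A = np.random.randn(100000, 20)
R = tsqr_r(A)
R_direct = np.linalg.qr(A, mode='r')
# R is unique up to the signs of its rows
print(np.allclose(np.abs(R), np.abs(R_direct)))   # True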
22. Taking care of business by keeping track of Q
1. Output local Q and R in separate files: each mapper factors its block, Ai = Qi Ri, writing Qi to the Q output and Ri to the R output.
2. Collect R on one node, compute Qs for each piece: a single task stacks R1, …, R4, factors the stack into the final R, and splits the resulting orthogonal factor into pieces Q11, Q21, Q31, Q41.
3. Distribute the pieces of Q*1 and form the true Q: a final map multiplies each local Qi by its piece Qi1.
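A serial numpy sketch of these three steps, assuming the blocks fit in memory (on MapReduce each step is a separate map or reduce stage, and the Qi and Qi1 pieces live in separate files):

import numpy as np

def direct_tsqr(A, nblocks=4):
    blocks = np.array_split(A, nblocks, axis=0)
    # Step 1: local QR of each block; keep the Qi and Ri separately
    QRs = [np.linalg.qr(block) for block in blocks]
    Qs = [qr[0] for qr in QRs]
    Rs = [qr[1] for qr in QRs]
    # Step 2: collect the small Ri on one task, factor the stack, and split
    # the resulting orthogonal factor into pieces conformal with each Ri
    Q_small, R = np.linalg.qr(np.vstack(Rs))
    Q1s = np.split(Q_small, nblocks, axis=0)   # each piece is n x n
    # Step 3: distribute the pieces and form the true Q = blkdiag(Qi) * Q_small
    Q = np.vstack([Qi @ Q1i for Qi, Q1i in zip(Qs, Q1s)])
    return Q, R

A = np.random.randn(20000, 10)
Q, R = direct_tsqr(A)
print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # near machine precision
print(np.linalg.norm(Q @ R - A))              # small residual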
23. The price is right! Based on a performance model and tests
Experiment on the NERSC Magellan computer: 80 nodes, 640 processors, 80TB disk.
DirectTSQR is faster than refinement for few columns … and not any slower for many columns.
(Figure: run times in seconds, roughly 500 to 2500, for matrices of size 800M-by-10, 7.5B-by-4, 150M-by-100, and 500M-by-50.)
24. Ongoing work
Make AR^{-1} stable with targeted quad-precision arithmetic to get a numerically orthogonal Q. The performance model says it's feasible!
How to handle more than ~10,000 columns? Some randomized methods?
Do we need quad-precision for big data? Standard error analysis gives an n·𝜀 bound for computing a sum, and I've seen this matter in PageRank computations!
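A tiny illustration of the n·𝜀 concern (a generic demo, not the PageRank code): sequentially accumulating 10^7 random values in single precision drifts visibly, while accumulating in double precision does not.

import numpy as np

x = np.random.rand(10_000_000).astype(np.float32)
sequential32 = x.cumsum(dtype=np.float32)[-1]   # one-by-one accumulation in single precision
accurate = x.sum(dtype=np.float64)              # accumulate in double precision
print(sequential32, accurate)
print('relative error:', abs(float(sequential32) - accurate) / accurate)
# typically only the first 4 or 5 digits of the float32 running sum are correct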
25. Multicore Graph Algorithms
Assefaw Gebremedhin, Arif Khan, Alex Pothen, Ryan Rossi (Purdue, CS)
Mahantesh Halappanavar (PNNL)
Chen Greif, David Kurokawa (Univ. British Columbia)
Mohsen Bayati, Amin Saberi, Ying Wang (now Google) (Stanford)
Funded by the DOE CSCAPES Institute grant (DE-FC02-08ER25864), NSF CAREER grant 1149756-CCF, and the Center for Adaptive Supercomputing Software - Multithreaded Architectures (CASS-MT) at PNNL.
26. Network alignment
What is the best way of matching graph A to B?
(Figure: two small example graphs A and B.)
27. Network alignment
From Sharan and Ideker, Modeling cellular machinery through biological network comparison. Nat. Biotechnol. 24, 4 (Apr. 2006), 427-433.
(Figure 2 caption, adapted from Sharan and Ideker: The NetworkBLAST local network alignment algorithm. Given two input networks, a network alignment graph is constructed. Nodes in this graph correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions. A search algorithm identifies highly similar subnetworks that follow a prespecified interaction pattern.)
28. Network alignment
What is the best way of matching graph A to B using only edges in L?
(Figure: graphs A and B as before, with a bipartite graph L of candidate matches between them.)
29. Network alignment
Matching? A 1-1 relationship between the vertices of A and B.
Best? Highest weight and overlap, where an overlap is an edge in A whose endpoints are matched to the endpoints of an edge in B.
(Figure: graphs A, L, B with one overlap highlighted.)
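For concreteness, "highest weight and overlap" is usually written as an integer program of the following form (this is the standard formulation from the network alignment literature, e.g. Bayati et al.; the notation here is not copied from the slide):

\max_{x \in \{0,1\}^{|E_L|}} \; \alpha\, w^T x + \tfrac{\beta}{2}\, x^T S x
\quad \text{subject to } x \text{ encoding a one-to-one matching between the vertices of } A \text{ and } B,

where x is indexed by the edges of L, w holds the edge weights of L, and S_{ee'} = 1 exactly when selecting both edges e and e' of L creates an overlap (an edge of A mapped onto an edge of B).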
30. Our contributions
A new belief propagation method (Bayati et al. 2009, 2013): outperformed state-of-the-art PageRank- and optimization-based heuristic methods.
High performance C++ implementations (Khan et al. 2012): 40 times faster overall (C++ ~3x, complexity ~2x, threading ~8x); a 5 million edge alignment takes ~10 sec.
www.cs.purdue.edu/~dgleich/codes/netalignmc
32. Each iteration involves
Matrix-vector-ish computations with a sparse matrix, e.g. sparse matrix-vector products in a semi-ring, dot-products, axpy, etc.
Bipartite max-weight matching using a different weight vector at each iteration.
No "convergence": 100-1000 iterations.
Let x[i] be the score for each pair-wise match in L:
for i=1 to ...
  update x[i] to y[i]
  compute a max-weight match with y
  update y[i] to x[i] (using the match in MR)
33. The methods
Each iteration involves matrix-vector-ish computations with a sparse matrix (sparse matrix-vector products in a semi-ring, dot-products, axpy, etc.) and a bipartite max-weight matching using a different weight vector at each iteration.
Belief Propagation (Listing 2: a belief-propagation message passing procedure for network alignment; see the text for a description of othermax and the rounding heuristic):
1  y(0) = 0, z(0) = 0, d(0) = 0, S(0) = 0
2  for k = 1 to niter
3    F = bound_{0,β}[ S + S(k−1) ]^T                 Step 1: compute F
4    d = αw + Fe                                     Step 2: compute d
5    y(k) = d − othermaxcol(z(k−1))                  Step 3: othermax
6    z(k) = d − othermaxrow(y(k−1))
7    S(k) = diag(y(k) + z(k) − d)S − F               Step 4: update S
8    (y(k), z(k), S(k)) = γ_k (y(k), z(k), S(k)) + (1 − γ_k)(y(k−1), z(k−1), S(k−1))   Step 5: damping
9    round heuristic(y(k))                           Step 6: matching
10   round heuristic(z(k))
11 end
12 return y(k) or z(k) with the largest objective value
In the belief-propagation interpretation, the weight vectors are usually called messages as they communicate the "beliefs" of each "agent." In this particular problem, the neighborhood of an agent represents all of the other edges in graph L incident on the same vertex in graph A (1st vector), all edges in L incident on the same vertex in graph B (2nd vector), or the edges in L that are …
34. The NEW methods
The same belief propagation iteration as Listing 2, with two changes:
The matrix-vector-ish computations with the sparse matrix are parallelized.
In Step 6, approximate bipartite max-weight matching is used instead of exact matching.
35. Approximation doesn't hurt the belief propagation algorithm
Synthetic experiment: randomly perturb a power-law graph to get A, and generate L from the true match plus random edges.
(Figure: fraction of correct matches versus the expected degree of noise in L (p · n), for MR, ApproxMR, BP, and ApproxBP. BP and ApproxBP are indistinguishable. Alignment with a power-law graph shows the large effect that approximate rounding can have on solutions from Klau's method (MR): with that method, exact rounding yields the identity matching for all problems, whereas the approximation does not. Matching is much more integral to Klau's method.)
Two larger real problems: the Library of Congress subject headings aligned to Wikipedia categories (lcsh-wiki), and to the subject headings of the French National Library (Rameau), with the weights in L computed by matching heading strings (via translated headings for Rameau).
36. A local dominating edge method for bipartite matching
Based on work by Preis (1999), Manne and Bisseling (2008), and Halappanavar et al. (2012).
A locally dominating edge is an edge heavier than all neighboring edges.
The method guarantees a 1/2-approximation and a maximal matching.
For the bipartite case, work on the smaller side only.
37. A local dominating edge method for bipartite matching
Queue all vertices.
Until the queue is empty, in parallel over vertices: match each vertex to its heaviest edge; if there's a conflict, check the winner and find an alternative for the loser; add the endpoints of non-dominating edges back to the queue.
A locally dominating edge is an edge heavier than all neighboring edges. For the bipartite case, work on the smaller side only.
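A serial Python sketch of this procedure (the threaded version replaces the queue loop with parallel work plus locks, as on the next slide; the dict-of-dicts graph format and the tie handling here are illustrative):

from collections import deque

def dominating_edge_matching(adj):
    # adj[u][v] = weight of edge (u, v); assumed symmetric
    mate = {}
    queue = deque(adj)                       # start with every vertex queued
    def best(u):
        # heaviest edge at u among still-unmatched neighbors
        cands = [(w, v) for v, w in adj[u].items() if v not in mate]
        return max(cands)[1] if cands else None
    while queue:
        u = queue.popleft()
        if u in mate:
            continue
        v = best(u)
        if v is None:
            continue
        if best(v) == u:                     # (u, v) dominates locally: match it
            mate[u] = v
            mate[v] = u
            # losers adjacent to u or v may need a new candidate
            for x in list(adj[u]) + list(adj[v]):
                if x not in mate:
                    queue.append(x)
        else:
            queue.append(u)                  # try again after v is settled
    return mate

adj = {'r': {'w': 3, 'v': 1}, 's': {'v': 2},
       'w': {'r': 3}, 'v': {'r': 1, 's': 2}}
print(dominating_edge_matching(adj))         # {'r': 'w', 'w': 'r', 's': 'v', 'v': 's'}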
38. A local dominating edge method for bipartite matching
Implementation notes: a customized first iteration (with all vertices); OpenMP locks to update choices; sync_and_fetch_add for queue updates.
39. Remaining multi-threading procedures are straightforward
Standard OpenMP for the matrix computations; use schedule=dynamic to handle skew.
We can batch the matching procedures in the BP method for additional parallelism:
for i=1 to ...
  update x[i] to y[i]
  save y[i] in a buffer
  when the buffer is full,
    compute a max-weight match for all in the buffer
    and save the best
40. Performance evaluation
(2 x 4) sockets of 10-core Intel E7-8870 at 2.4 GHz (80 cores); 16 GB of memory per processor (128 GB total).
Scaling study:
1. Thread binding: scattered vs. compact.
2. Memory binding: interleaved vs. bind.
41. Scaling
BP with no batching, lcsh-rameau, 400 iterations, scatter thread binding and interleaved memory.
1450 seconds for 1 thread, 115 seconds for 40 threads (about 12.5x speedup).
(Figure: speedup versus number of threads, up to 80 threads.)
42. Ongoing work
Better memory handling: numactl and affinity are insufficient for full scaling.
Better models: these get to be much bigger computations.
Distributed memory: trying to get an MPI version, looking into GraphLab.
43. PageRank details
PageRank was created by Google to rank web-pages.
The model:
1. follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we'll assume everywhere is equally likely, so the jump vector is v = [1/n, …, 1/n]^T with e^T v = 1.
P is the column-stochastic transition matrix of the graph (e^T P = e^T; column j holds the out-links of page j, each with probability 1/deg(j)).
Markov chain: [αP + (1 − α)v e^T] x = x.
Linear system: (I − αP) x = (1 − α)v, with a unique solution x ≥ 0, e^T x = 1.
The places we find the surfer most often are the important pages.
(Dangling nodes are ignored here; they are patched back to v.)
(Figure: a small example graph and its column-stochastic matrix P.)
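A small dense numpy sketch of this model on a toy graph (illustrative only; real codes such as prpack use sparse data structures): build the column-stochastic P and solve the linear system (I − αP)x = (1 − α)v.

import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (2, 3)]   # a toy 4-page web, no dangling nodes
n, alpha = 4, 0.85
P = np.zeros((n, n))
for i, j in edges:
    P[j, i] = 1.0                       # column i holds the out-links of page i
P = P / P.sum(axis=0)                   # column-stochastic: e^T P = e^T
v = np.full(n, 1.0 / n)                 # uniform teleportation vector

x = np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)
print(x, x.sum())                       # PageRank scores; they sum to 1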
45. Multicore PageRank
… similar story …
Serialized preprocessing.
Parallelize the linear algebra via an asynchronous Gauss-Seidel iterative method.
~10x scaling on the same (80-core) machine (1M nodes, 15M edges, synthetic).
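A serial sketch of one piece of that story: a Gauss-Seidel sweep for (I − αP)x = (1 − α)v that updates x in place, so new values are used immediately (the in-neighbor-list format is illustrative; the multicore version runs these updates asynchronously across threads).

import numpy as np

def pagerank_gauss_seidel(in_nbrs, outdeg, alpha=0.85, tol=1e-10, maxit=1000):
    n = len(in_nbrs)
    v = 1.0 / n
    x = np.full(n, v)
    for it in range(maxit):
        delta = 0.0
        for i in range(n):
            # x[i] = (1-alpha)*v + alpha * sum over in-neighbors j of x[j]/outdeg[j]
            xi = (1 - alpha) * v + alpha * sum(x[j] / outdeg[j] for j in in_nbrs[i])
            delta += abs(xi - x[i])
            x[i] = xi                     # in-place update: later rows see the new value
        if delta < tol:
            break
    return x / x.sum()                    # renormalize; intermediate iterates need not sum to 1

# in-neighbor lists and out-degrees for the same toy 4-page graph as above
in_nbrs = [[2], [0], [0, 1, 3], [2]]
outdeg  = [2, 1, 2, 1]
print(pagerank_gauss_seidel(in_nbrs, outdeg))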
46. Questions?
Papers on my webpage
www.cs.purdue.edu/homes/dgleich
Codes
github.com/arbenson/mrtsqr
www.cs.purdue.edu/homes/dgleich/codes/netalignmc
github.com/dgleich/prpack