PageRank
1. Adding uncertainty to the PageRank random surfer
DAVID F. GLEICH, PURDUE UNIVERSITY, COMPUTER SCIENCE
UTRC SEMINAR, 13 DECEMBER 2011
1/40
UTRC Seminar
David Gleich, Purdue
4. are a great way to model and
study problems in network
science and physical science
I hope I’m preaching to the choir.
5. A cartoon websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit measures to human evaluations
5. Produce rankings
6. Continuously update
7. What is PageRank?
PageRank by Google
[Figure: example graph with six nodes, numbered 1 to 6]
The Model
1. follow edges uniformly with probability $\alpha$, and
2. randomly jump with probability $1 - \alpha$; we’ll assume everywhere is equally likely.
The places we find the surfer most often are important pages.
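The random surfer model on this slide can be simulated directly. The following is a minimal sketch, not code from the talk: the out-link lists are read off the columns of the six-node matrix P on the "PageRank details" slide, and the function name, the value 0.85 for the follow probability, the step count, and the fixed seed are all illustrative assumptions.

```python
import random

# Out-links of the six-node example graph, read off the columns of the
# matrix P on the "PageRank details" slide. Node 1 has no out-links.
OUT_LINKS = {1: [], 2: [1, 3], 3: [4, 5], 4: [2, 3, 5], 5: [6], 6: [5]}


def simulate_surfer(alpha, steps, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    nodes = sorted(OUT_LINKS)
    visits = {node: 0 for node in nodes}
    page = rng.choice(nodes)
    for _ in range(steps):
        links = OUT_LINKS[page]
        if links and rng.random() < alpha:
            page = rng.choice(links)   # follow an out-link uniformly
        else:
            page = rng.choice(nodes)   # jump anywhere, equally likely
        visits[page] += 1
    return {node: visits[node] / steps for node in nodes}


freq = simulate_surfer(alpha=0.85, steps=100_000)
ranking = sorted(freq, key=freq.get, reverse=True)
print(ranking)
```

In this sketch a page without out-links always jumps uniformly, which corresponds to patching the dangling node with the uniform vector.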
8. The most important page on the web.
9. PageRank via
PageRank details
[Figure: the six-node example graph]
$$P = \begin{bmatrix}
1/6 & 1/2 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 1/3 & 0 & 0 \\
1/6 & 1/2 & 0 & 1/3 & 0 & 0 \\
1/6 & 0 & 1/2 & 0 & 0 & 0 \\
1/6 & 0 & 1/2 & 1/3 & 0 & 1 \\
1/6 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}, \quad P_{ij} \ge 0, \quad e^T P = e^T$$
“jump” $\to v = [\tfrac{1}{n} \; \dots \; \tfrac{1}{n}]^T \ge 0$, $e^T v = 1$
Markov chain: $[\alpha P + (1 - \alpha) v e^T] x = x$
unique $x \Rightarrow x_j \ge 0$, $e^T x = 1$.
Linear system: $(I - \alpha P) x = (1 - \alpha) v$
Ignored: dangling nodes patched back to $v$; algorithms later
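The step from the Markov chain form to the linear system is short enough to write out. This is a worked derivation added for clarity, using only the normalization $e^T x = 1$ stated on the slide:

```latex
\begin{align*}
[\alpha P + (1-\alpha) v e^T]\, x &= x \\
\alpha P x + (1-\alpha)\, v\, (e^T x) &= x \\
\alpha P x + (1-\alpha)\, v &= x && \text{since } e^T x = 1 \\
(I - \alpha P)\, x &= (1-\alpha)\, v
\end{align*}
```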
12. Richardson is a robust, simple
algorithm to compute PageRank
Given $\alpha$, $P$, $v$
$(I - \alpha P) x = (1 - \alpha) v$
Richardson $\Rightarrow x^{(k+1)} = \alpha P x^{(k)} + (1 - \alpha) v$
error $= \| x^{(k)} - x \|_1 \le 2 \alpha^k$
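A minimal sketch of the Richardson iteration on the six-node example follows. It assumes the column-stochastic matrix from the "PageRank details" slide, a uniform jump vector, a starting vector equal to v, and $\alpha = 0.85$; the helper names are hypothetical. It also compares the measured error with the $2\alpha^k$ bound, using a long run as the reference solution.

```python
# Column-stochastic P for the six-node example (dangling node 1 patched
# with the uniform vector) and the uniform jump vector v.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
v = [1/6] * 6


def matvec(M, x):
    """Dense matrix-vector product with plain Python lists."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def richardson(P, v, alpha, num_iters):
    """Run num_iters steps of x <- alpha*P*x + (1-alpha)*v, starting at v."""
    x = list(v)
    for _ in range(num_iters):
        Px = matvec(P, x)
        x = [alpha * px + (1 - alpha) * vi for px, vi in zip(Px, v)]
    return x


def one_norm_diff(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


alpha = 0.85
x_ref = richardson(P, v, alpha, 500)  # long run used as the reference
errors = {k: one_norm_diff(richardson(P, v, alpha, k), x_ref)
          for k in (5, 10, 20, 40)}
for k, err in errors.items():
    print(k, err, 2 * alpha**k)  # measured error next to the bound
```

Each step costs one matrix-vector product, which is why the iteration is attractive for large sparse graphs.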
13. Sensitivity
14. Which sensitivity?
PageRank circa 2006
$(I - \alpha P) x = (1 - \alpha) v$
Sensitivity to the links: examined and understood
Sensitivity to the jump: examined, understood, and useful
Sensitivity to $\alpha$: less well understood
For information about how to compute the PageRank derivative, see:
Gleich, Glynn, Golub, Greif. Three results on the PageRank vector, 2007.
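One way to see the sensitivity to $\alpha$ concretely: differentiating $(I - \alpha P)x = (1 - \alpha)v$ with respect to $\alpha$ gives $(I - \alpha P)x' = Px - v$, a second linear system with the same matrix. The sketch below is an illustration under that observation, not necessarily the method of the cited paper. It solves both systems by fixed-point iteration on the six-node example and checks the result against a central finite difference. The helper names, $\alpha = 0.85$, and the step $h$ are assumptions.

```python
# Six-node example matrix (column-stochastic) and uniform jump vector.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
v = [1/6] * 6


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def solve_fixed_point(rhs, alpha, num_iters=600):
    """Solve (I - alpha*P) y = rhs by iterating y <- alpha*P*y + rhs."""
    y = list(rhs)
    for _ in range(num_iters):
        Py = matvec(P, y)
        y = [alpha * py + r for py, r in zip(Py, rhs)]
    return y


def pagerank(alpha):
    return solve_fixed_point([(1 - alpha) * vi for vi in v], alpha)


def pagerank_derivative(alpha):
    """Solve (I - alpha*P) x' = P*x - v for the derivative dx/dalpha."""
    x = pagerank(alpha)
    rhs = [px - vi for px, vi in zip(matvec(P, x), v)]
    return solve_fixed_point(rhs, alpha)


alpha, h = 0.85, 1e-5
dx = pagerank_derivative(alpha)
fd = [(hi - lo) / (2 * h)
      for hi, lo in zip(pagerank(alpha + h), pagerank(alpha - h))]
max_gap = max(abs(a - b) for a, b in zip(dx, fd))
print(dx, max_gap)
```

Because $e^T x = 1$ for every $\alpha$, the entries of the derivative sum to zero, which makes a convenient sanity check.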
15. Wikipedia test case
PageRank on Wikipedia
α = 0.50          α = 0.85               α = 0.99
United States     United States          C:Contents
C:Living people   C:Main topic classif.  C:Main topic classif.
France            C:Contents             C:Fundamental
Germany           C:Living people        United States
England           C:Ctgs. by country     C:Wikipedia admin.
United Kingdom    United Kingdom         P:List of portals
Canada            C:Fundamental          P:Contents/Portals
Japan             C:Ctgs. by topic       C:Portals
Poland            C:Wikipedia admin.     C:Society
Australia         France                 C:Ctgs. by topic
Note Top 10 articles on Wikipedia with highest PageRank
16. What is alpha?
The teleportation parameter!
Author                  α
Brin and Page (1998)    0.85
Najork et al. (2007)    0.85
Litvak et al. (2006)    0.5
Experiment (slide 19)   0.63
Algorithms (...)        0.85
For you, α is clear.
Google wants PageRank for everyone
17. What about me?
Multiple surfers should have an impact!
Each person picks $\alpha$ from distribution $A$
...
$x(E[A]) \ne E[x(A)]$
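The gap between PageRank at the mean $\alpha$ and the mean PageRank shows up even on the six-node example. The sketch below uses a deliberately simple two-point distribution ($\alpha$ is 0.5 or 0.99 with equal probability). That distribution, the helper names, and the iteration count are illustrative assumptions, not the model used in the talk.

```python
# Six-node example matrix (column-stochastic) and uniform jump vector.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
v = [1/6] * 6


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def pagerank(alpha, num_iters=3000):
    """Fixed-point iteration x <- alpha*P*x + (1-alpha)*v."""
    x = list(v)
    for _ in range(num_iters):
        Px = matvec(P, x)
        x = [alpha * px + (1 - alpha) * vi for px, vi in zip(Px, v)]
    return x


# Hypothetical two-point distribution: each alpha has probability 1/2.
alphas = [0.5, 0.99]
mean_alpha = sum(alphas) / len(alphas)

x_of_mean = pagerank(mean_alpha)                  # x(E[A])
samples = [pagerank(a) for a in alphas]
mean_of_x = [sum(col) / len(alphas) for col in zip(*samples)]  # E[x(A)]

gap = sum(abs(a - b) for a, b in zip(x_of_mean, mean_of_x))
print(x_of_mean)
print(mean_of_x)
print(gap)
```

The two vectors differ because $x(\alpha)$ is a nonlinear function of $\alpha$, so taking the expectation does not commute with solving the system.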
18. Random alpha PageRank (RAPr)
or PageRank meets UQ
Model PageRank as the random variables $x(A)$ and look at $E[x(A)]$ and $\mathrm{Std}[x(A)]$.
Explored in Constantine and Gleich, WAW2007; and Constantine and Gleich, J. Internet Mathematics 2011.
19. Alpha, measured from users!
What is alpha based on users?
[Figure: density of raw α measured from users (x-axis: raw α, 0.0 to 1.0; y-axis: density), labeled InfBeta(3.2, 2.0, 1.9e−05, 0.0019); mean 0.63, mode 0.69]
see Gleich et al. WWW2010 for more
Constantine, Flaxman, Gleich, Gunawardana, Tracking the Random Surfer, WWW2010.
20. What is A?
A simple model for alpha
$A \sim \mathrm{Beta}(a, b, l, r)$
21. An Example
The PageRank random variables
[Figure: the six-node example graph next to the distributions of $x_1, \dots, x_6$; horizontal axis from 0 to 0.5]
22. A theoretical concern isn’t really a problem
Just one second ...
$$E[x(A)] = \int_0^1 x(\alpha) \rho(\alpha) \, d\alpha = \int_0^1 (1 - \alpha)(I - \alpha P)^{-1} v \, \rho(\alpha) \, d\alpha$$
$\alpha = 1 \to (I - P)^{-1}$
$P$ stochastic: singular?
Yes, but ...
$\lim_{\alpha \to 1} (1 - \alpha)(I - \alpha P)^{-1} v = x$ is unique
(Think about $P = 1$, use Jordan Form of $P$ to generalize)
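One way to read the hint about $P = 1$ is to work the scalar case and then look at one eigenvalue at a time. The following is a sketch of that reasoning, not a full proof:

```latex
\begin{align*}
\text{Scalar case } P = 1:\quad
  & (1-\alpha)(1-\alpha \cdot 1)^{-1} v = v \quad \text{for every } \alpha < 1,
    \text{ so the limit as } \alpha \to 1 \text{ is } v. \\
\text{Eigenvalue } \lambda \ne 1,\ |\lambda| \le 1:\quad
  & \frac{1-\alpha}{1-\alpha\lambda} \to \frac{0}{1-\lambda} = 0
    \quad \text{as } \alpha \to 1.
\end{align*}
```

So $I - P$ is singular, but the factor $(1 - \alpha)$ cancels the singularity at $\lambda = 1$ and suppresses every other eigenvalue, which is why the limit exists.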
23. Many PageRank properties are unchanged by a random alpha
What changes? Really, what stays the same!
$x(A)$, $A \sim \mathrm{Beta}(a, b, l, r)$ with $0 \le l < r \le 1$
1. $E[x_i(A)] \ge 0$ and $\| E[x(A)] \| = 1$;
thus $E[x(A)]$ is a probability distribution.
2. $E[x(A)] = \sum_{\ell=0}^{\infty} E[A^{\ell} - A^{\ell+1}] P^{\ell} v$;
thus we can interpret $E[x(A)]$ in length-$\ell$ paths.
3. for page $i$ with no in-links, $x_i(A) = (1 - A) v_i$;
thus $E[x_i(A)] = x_i(E[A])$ and $\mathrm{Std}[x_i(A)] = v_i \, \mathrm{Std}[A]$
But is this one useful?
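Property 2 follows from the Neumann series for $(I - \alpha P)^{-1}$. This short derivation is added for clarity and assumes the expectation can be taken term by term:

```latex
\begin{align*}
x(\alpha) &= (1-\alpha)(I - \alpha P)^{-1} v
           = (1-\alpha) \sum_{\ell=0}^{\infty} \alpha^{\ell} P^{\ell} v
           = \sum_{\ell=0}^{\infty} (\alpha^{\ell} - \alpha^{\ell+1}) P^{\ell} v, \\
E[x(A)]   &= \sum_{\ell=0}^{\infty} E[A^{\ell} - A^{\ell+1}]\, P^{\ell} v .
\end{align*}
```

Each term $P^{\ell} v$ collects the walks of length $\ell$ started from $v$, and the weights $E[A^{\ell} - A^{\ell+1}]$ are nonnegative and sum to 1.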
24. Wikipedia test case (take 2)
RAPr on Wikipedia
E[x(A)]           Std[x(A)]
United States     United States
C:Living people   C:Living people
France            C:Main topic classif.
United Kingdom    C:Contents
Germany           C:Ctgs. by country
England           United Kingdom
Canada            France
Japan             C:Fundamental
Poland            England
Australia         C:Ctgs. by topic
Note $A \sim \mathrm{Beta}(0.5, 1.5, [0, 1]) \approx$ empirical distribution on Wikipedia
25. Ulam Networks
PageRank on a dynamical system nicely illustrates the uncertainty.
Chirikov map
$y_{t+1} = \eta y_t + k \sin(x_t + \theta_t)$
$x_{t+1} = x_t + y_{t+1}$
Ulam network
1. divide phase space into uniform cells
2. form P based on trajectories.
[Figure: two panels, $\log(E[x(A)])$ and $\log(\mathrm{Std}[x(A)]) / \log(E[x(A)])$]
Note $A \sim \mathrm{Beta}(2, 16)$. White is larger, black is smaller.
Model from Shepelyansky and Zhirov, Google matrix, dynamical attractors, and Ulam networks, Phys. Rev. E, 2011 (arXiv).
26. Algorithms & Convergence
1. Monte Carlo
$E[x(A)] \approx \frac{1}{N} \sum_{i=1}^{N} x(\alpha_i)$, $\alpha_i \sim A$
2. Path Damping (No Std)
$E[x(A)] \approx \sum_{\ell=0}^{N} E[A^{\ell} - A^{\ell+1}] P^{\ell} v$
3. Quadrature
$E[x(A)] = \int_l^r x(\alpha) \, d\rho(\alpha) \approx \sum_{i=1}^{N} x(z_i) w_i$
[Figure: three convergence plots (Monte Carlo, Path Damping, Quadrature); error on a log scale from $10^{0}$ down to $10^{-15}$ against N]
Convergence to semi-exact solutions on a 335-node graph (harvard500 strong component).
Blue = Beta(2, 16)
Green = Beta(1, 1, 0.1, 0.9)
Salmon = Uniform(0.6, 0.9)
Red = Beta(-0.5, -0.5, 0.2, 0.7)
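The three estimators can be compared on a small case. The sketch below uses the six-node example graph and $A \sim \mathrm{Uniform}(0.6, 0.9)$, one of the distributions listed on the slide, because its moments have a simple closed form. The helper names, the sample count, the number of series terms, and the choice of a 3-point Gauss-Legendre rule are illustrative assumptions, not the experimental setup from the talk.

```python
import random

# Six-node example matrix (column-stochastic) and uniform jump vector.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]
v = [1/6] * 6
LO, HI = 0.6, 0.9  # A ~ Uniform(LO, HI)


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def pagerank(alpha, num_iters=400):
    x = list(v)
    for _ in range(num_iters):
        Px = matvec(P, x)
        x = [alpha * px + (1 - alpha) * vi for px, vi in zip(Px, v)]
    return x


def monte_carlo(num_samples, seed=0):
    """Average x(alpha_i) over samples alpha_i drawn from A."""
    rng = random.Random(seed)
    total = [0.0] * len(v)
    for _ in range(num_samples):
        x = pagerank(rng.uniform(LO, HI))
        total = [t + xi for t, xi in zip(total, x)]
    return [t / num_samples for t in total]


def uniform_moment(k):
    """E[A^k] for A ~ Uniform(LO, HI)."""
    return (HI**(k + 1) - LO**(k + 1)) / ((k + 1) * (HI - LO))


def path_damping(num_terms):
    """Truncated series sum_k E[A^k - A^(k+1)] P^k v."""
    total = [0.0] * len(v)
    Pk_v = list(v)  # holds P^k v
    for k in range(num_terms + 1):
        coeff = uniform_moment(k) - uniform_moment(k + 1)
        total = [t + coeff * p for t, p in zip(total, Pk_v)]
        Pk_v = matvec(P, Pk_v)
    return total


def gauss_legendre_3():
    """3-point Gauss-Legendre rule mapped from [-1, 1] to [LO, HI]."""
    mid, half = (LO + HI) / 2, (HI - LO) / 2
    t = (3 / 5) ** 0.5
    total = [0.0] * len(v)
    for node, weight in [(-t, 5/9), (0.0, 8/9), (t, 5/9)]:
        x = pagerank(mid + half * node)
        total = [s + (weight / 2) * xi for s, xi in zip(total, x)]
    return total


def dist(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


mc = monte_carlo(200)
pd = path_damping(200)
gq = gauss_legendre_3()
print(dist(mc, pd), dist(gq, pd))
```

On this example three PageRank solves with quadrature land much closer to the path-damping result than 200 Monte Carlo solves do.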
28. Random alpha PageRank has a rigorous convergence theory.
Method                             Conv.                Work Required                  What is N?
Monte Carlo                        $1/\sqrt{N}$         N PageRank systems             number of samples from A
Path Damping (without Std[x(A)])   $r^{N+2}/N^{1+a}$    N + 1 matrix vector products   terms of Neumann series
Gaussian Quadrature                $r^{2N}$             N PageRank systems             number of quadrature points
$a$ and $r$ are parameters from $\mathrm{Beta}(a, b, l, r)$
29. Convergence of quadrature in the r=1 regime is matrix dependent.
Singularities
[Figure: two panels of singularities; left panel axes from −10 to 10, right panel labeled $1/\lambda$ with real axis from 0.97 to 1.03 and imaginary axis from −0.03 to 0.03]
Note 500-node harvard500 graph from Cleve Moler, left plot is $\log_{10}(9 + |1/\lambda|) e^{i \arg(1/\lambda)}$
30. Establishing this theoretical convergence proved independently useful.
Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010.
$A(s) x(s) = b(s)$
$\Leftrightarrow A(J_\infty) x(J_\infty) = b(J_\infty)$
$\Rightarrow A(J_N) x(J_N) = b(J_N)$ or
$\Rightarrow A_N(J_\infty) x_N(J_\infty) = b_N(J_\infty)$
Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC 2011.
How to compute the Galerkin solution in a weakly intrusive manner!
31. A real test-case
Webspam application
Hosts of uk-2006 are labeled as spam, not-spam, other
P R f FP FN
Baseline 0.694 0.558 0.618 0.034 0.442
Beta(0.5,1.5) 0.695 0.561 0.621 0.034 0.439
Beta(1,1) 0.698 0.562 0.622 0.033 0.438
Beta(2,16) 0.699 0.562 0.623 0.033 0.438
Note Bagged (10) J48 decision tree classifier in Weka, mean of 50 repetitions from
10-fold cross-validation of 4948 non-spam and 674 spam hosts (5622 total).
Becchetti et al. Link analysis for Web spam detection, 2008.
33. Data driven surrogate functions
Beyond spectral methods for UQ
34. Network alignment
[Figure: graphs A and B joined by candidate matches L, with a square marked]
maximize $w^T x + \frac{1}{2} x^T S x$
35. Nuclear-norm matrix completion based ranking
Gleich and Lim, KDD2011
Overlapping clusters for distributed computation
Andersen, Gleich, and Mirrokni, WSDM2012
36. Local methods for massive network analysis
Top-k algorithm for Katz
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
This is possible for
personalized PageRank!
39. What about time?
Real networks evolve in time.
What to do?
Look towards dynamical systems!
Now I must be preaching to the choir!