Ph.D. Defense: Models and Algorithms for PageRank sensitivity
1. Models and Algorithms for PageRank Sensitivity
David F. Gleich
Stanford University
Ph.D. Oral Defense
Institute for Computational and Mathematical Engineering
May 26, 2009
Gleich (Stanford) Ph.D. Defense 1 / 41
7. PageRank by Google
The places we find the surfer most often are important pages.
The Model
1. follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we’ll assume everywhere is equally likely
[Figure: example web graph on six nodes, numbered 1 to 6]
8. Some PageRank details
[Figure: the six-node example graph]
P =
  1/6  1/2   0    0    0    0
  1/6   0    0   1/3   0    0
  1/6  1/2   0   1/3   0    0
  1/6   0   1/2   0    0    0
  1/6   0   1/2  1/3   0    1
  1/6   0    0    0    1    0
with P_ij ≥ 0 and e^T P = e^T
“jump” → v = [1/n ... 1/n]^T, with v_i ≥ 0 and e^T v = 1
Markov chain: (αP + (1 − α) v e^T) x = x
  unique x ⇒ x_j ≥ 0, e^T x = 1.
Linear system: (I − αP) x = (1 − α) v
Small detail: dangling nodes patched back to v
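The Markov chain and linear system forms on this slide can be checked numerically. What follows is a minimal Python sketch, assuming the six-node matrix P from the slide and a uniform v; the helper names `matvec` and `pagerank` are illustrative and do not come from the talk. It runs the power iteration x ← αPx + (1 − α)v and then measures the residual of (I − αP)x = (1 − α)v.

```python
# Column-stochastic P for the six-node example (column j = out-links of node j).
# Column 1 is uniform, which is consistent with a dangling node patched to v.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]

def matvec(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def pagerank(P, alpha, v, tol=1e-12, max_iter=100000):
    """Power iteration x <- alpha*P*x + (1 - alpha)*v, stopped on the 1-norm step size."""
    x = list(v)
    for _ in range(max_iter):
        y = matvec(P, x)
        x_new = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

n = len(P)
v = [1.0 / n] * n
alpha = 0.85
x = pagerank(P, alpha, v)

# 1-norm residual of the linear system (I - alpha*P) x = (1 - alpha) v.
Px = matvec(P, x)
residual = sum(abs(xi - alpha * pxi - (1 - alpha) * vi)
               for xi, pxi, vi in zip(x, Px, v))
print([round(xi, 4) for xi in x], residual)
```

Nodes 5 and 6 link only to each other in this example, so they collect the largest share of the PageRank mass.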
10. My other projects
Prior
  PageRank Parallel Krylov Methods
    Gleich, Zhukov, and Berkhin, Yahoo! Research Labs Technical Report, YRL-2004-038; Gleich and Zhukov, SuperComputing poster, 2005.
    Does existing software work for computing PageRank on a cluster?
  Approximate Personal PageRank
    Gleich and Polito, Internet Math. 3(3):257–294, 2007.
    Can you build a web search engine on your PC?
Ongoing
  Parameterized Matrix Problems (with Paul Constantine)
    A(s)x(s) = b(s)
    Come back here for his defense on Monday, June 1st at 1:30pm!
  Network Alignment (with Mohsen Bayati, Margot Gerritsen, Amin Saberi, and Ying Wang)
My Software Packages: MatlabBGL, vismatrix, libbvg, gaimc, parameterized matrix package (with Paul)
Publications: Random α PageRank, Inner-Outer PageRank
11. Sensitivity
Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary
12. Which sensitivity?
Sensitivity to the links: examined and understood
Sensitivity to the jump: examined, understood, and useful
Sensitivity to α: less well understood
13. PageRank on Wikipedia
α = 0.50          α = 0.85                α = 0.99
United States     United States           C:Contents
C:Living people   C:Main topic classif.   C:Main topic classif.
France            C:Contents              C:Fundamental
Germany           C:Living people         United States
England           C:Ctgs. by country      C:Wikipedia admin.
United Kingdom    United Kingdom          P:List of portals
Canada            C:Fundamental           P:Contents/Portals
Japan             C:Ctgs. by topic        C:Portals
Poland            C:Wikipedia admin.      C:Society
Australia         France                  C:Ctgs. by topic
Note: Top 10 articles on Wikipedia with highest PageRank
14. The PageRank function
Look at the PageRank vector as a function of α
(I − αP) x(α) = (1 − α) v
and examine its derivative.
My Contributions (Gleich, Glynn, Golub, Greif, Dagstuhl proceedings, 2007.)
  Compute the derivative with just simple PageRank solves.
  Empirically evaluated the derivative as a rank change predictor.
Others (Golub and Greif, 2004; Boldi et al., 2005; Berkhin, 2005; Langville and Meyer, 2006.)
  PageRank becomes more sensitive as α → 1.
  PageRank vector at α = 1 well defined.
α matters!
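Differentiating (I − αP)x(α) = (1 − α)v with respect to α gives (I − αP)x′(α) = Px(α) − v. One way to turn this into PageRank solves, offered here as an illustrative sketch and not necessarily the technique of the cited paper, is to note that Px is itself a probability vector: if z is the PageRank vector with jump vector Px, then subtracting the two PageRank systems yields x′(α) = (z − x)/(1 − α). The Python below assumes the six-node example matrix and uniform v, uses hypothetical helper names, and compares the result with a central finite difference.

```python
# Six-node example matrix (column-stochastic), as in the earlier slide.
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]

def matvec(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def pagerank(P, alpha, v, tol=1e-13, max_iter=100000):
    """Power iteration for (I - alpha*P) x = (1 - alpha) v."""
    x = list(v)
    for _ in range(max_iter):
        y = matvec(P, x)
        x_new = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

def pagerank_derivative(P, alpha, v):
    """x'(alpha) from two PageRank solves: one with jump v, one with jump P x."""
    x = pagerank(P, alpha, v)
    z = pagerank(P, alpha, matvec(P, x))  # jump vector P x sums to 1
    return [(zi - xi) / (1 - alpha) for zi, xi in zip(z, x)]

v = [1.0 / 6] * 6
alpha, h = 0.85, 1e-4
dx = pagerank_derivative(P, alpha, v)

# Central finite difference, used only as a sanity check.
x_plus = pagerank(P, alpha + h, v)
x_minus = pagerank(P, alpha - h, v)
fd = [(a - b) / (2 * h) for a, b in zip(x_plus, x_minus)]
max_gap = max(abs(a - b) for a, b in zip(dx, fd))
print([round(d, 4) for d in dx], max_gap)
```

Because every x(α) sums to one, the entries of the derivative sum to zero: mass only moves between pages as α changes.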
15. Random sensitivity
Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary
16. What is alpha?
Author α
Brin and Page (1998) 0.85
Najork et al. (2007) 0.85
Litvak et al. (2006) 0.5
Experiment (slide 20) 0.375
Algorithms (...) ≥ 0.85
For you, α is clear
Google wants PageRank for everyone
17. Multiple surfers
Each person picks α from distribution A
Two ways to summarize the surfers: x(E[A]) or E[x(A)]
In general, x(E[A]) ≠ E[x(A)]
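A tiny numerical illustration of the difference between x(E[A]) and E[x(A)], as a hedged sketch: it assumes the six-node example matrix, a uniform v, and a made-up population of just two surfers with α = 0.5 and α = 0.9 (so E[A] = 0.7). None of these choices come from the talk; they only show that PageRank at the mean α differs from the mean of the PageRank vectors.

```python
# Six-node example matrix (column-stochastic).
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]

def matvec(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def pagerank(P, alpha, v, tol=1e-12, max_iter=100000):
    """Power iteration for (I - alpha*P) x = (1 - alpha) v."""
    x = list(v)
    for _ in range(max_iter):
        y = matvec(P, x)
        x_new = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

v = [1.0 / 6] * 6

# Hypothetical two-surfer population: alpha is 0.5 or 0.9 with equal probability.
x_at_mean = pagerank(P, 0.7, v)                      # x(E[A])
x_lo, x_hi = pagerank(P, 0.5, v), pagerank(P, 0.9, v)
mean_of_x = [(a + b) / 2 for a, b in zip(x_lo, x_hi)]  # E[x(A)]

gap = sum(abs(a - b) for a, b in zip(x_at_mean, mean_of_x))
print("1-norm gap between x(E[A]) and E[x(A)]:", gap)
```

Both vectors are probability distributions, but they differ because x(α) is a nonlinear function of α.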
18. Random alpha PageRank
RAPr
Model PageRank as the random variables x(A)
and look at E[x(A)] and Std[x(A)].
Gleich and Constantine, Workshop on Algorithms on the Web Graph, 2007
19. What is A?
A ∼ Beta(a, b, l, r)
[Figure: densities on [0, 1] of Beta(0,0,0.6,0.9), Beta(2,16,0,1), Beta(1,1,0.1,0.9), and Beta(−0.5,−0.5,0.2,0.7)]
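The slides do not spell out the Beta(a, b, l, r) parameterization. The sketch below assumes a density proportional to (x − l)^b (r − x)^a on [l, r], that is, a standard Beta(b + 1, a + 1) variable rescaled to [l, r]. This reading is an inference: it makes Beta(0, 0, l, r) uniform, and it reproduces the mean 0.375 and mode 0.25 that the deck reports for the fit Beta(1.5, 0.5). The function name `sample_alpha` is hypothetical.

```python
import random

def sample_alpha(a, b, l, r, rng):
    """Draw A ~ Beta(a, b, l, r), ASSUMING density ~ (x - l)^b * (r - x)^a on [l, r].

    Under that assumption A = l + (r - l) * B with B a standard Beta(b + 1, a + 1).
    """
    return l + (r - l) * rng.betavariate(b + 1, a + 1)

rng = random.Random(0)
n_draws = 20000

# Beta(2, 16, 0, 1): standard Beta(17, 3), whose mean is 17/20 = 0.85.
main_draws = [sample_alpha(2, 16, 0, 1, rng) for _ in range(n_draws)]
mean_main = sum(main_draws) / n_draws

# Beta(0, 0, 0.6, 0.9): uniform on [0.6, 0.9] under the assumed reading.
uniform_draws = [sample_alpha(0, 0, 0.6, 0.9, rng) for _ in range(n_draws)]

# Beta(1.5, 0.5) on [0, 1]: standard Beta(1.5, 2.5), whose mean is 0.375.
fit_draws = [sample_alpha(1.5, 0.5, 0, 1, rng) for _ in range(n_draws)]
mean_fit = sum(fit_draws) / n_draws

print(round(mean_main, 3), round(mean_fit, 3))
```

Under this reading the choice Beta(2, 16, 0, 1) centers the surfers on the usual α = 0.85.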
20. Alpha is
[Figure: histogram of α with a density fit, Beta(1.5,0.5); mean 0.375, mode 0.25. Horizontal axis: α from 0 to 1; vertical axis: density from 0 to 2.]
Data provided by Abraham Flaxman and Asela Gunawardana at Microsoft.
21. Example
[Figure: the six-node example graph next to the distributions of x1 through x6; horizontal axis from 0 to 0.5]
22. What changes?
x(A), A ∼ Beta(a, b, l, r) with 0 ≤ l < r ≤ 1
1. E[x_i(A)] ≥ 0 and ‖E[x(A)]‖₁ = 1;
   thus E[x(A)] is a probability distribution.
2. E[x(A)] = Σ_{ℓ=0}^{∞} E[A^ℓ − A^(ℓ+1)] P^ℓ v;
   thus we can interpret E[x(A)] in length-ℓ paths.
3. for a page i with no in-links, x_i(A) = (1 − A) v_i;
   thus E[x_i(A)] = x_i(E[A]) and Std[x_i(A)] = v_i Std[A]
But is this one useful?
23. RAPr on Wikipedia
E[x(A)]           Std[x(A)]
United States     United States
C:Living people   C:Living people
France            C:Main topic classif.
United Kingdom    C:Contents
Germany           C:Ctgs. by country
England           United Kingdom
Canada            France
Japan             C:Fundamental
Poland            England
Australia         C:Ctgs. by topic
24. Std vs. PageRank
Does it tell us more than just PageRank?
uk2006 — 77M nodes and 2B edges
isim_k(Y, Z) = (1/k) Σ_{i=1}^{k} |Diff[Y(1:i), Z(1:i)]| / (2i)
[Figure: intersection similarity isim(k) versus k (log axis, 10^0 to 10^6), ranging from 0 (identical) to 1 (disjoint), for Std[x(A1)] vs. x(0.85), Std[x(A2)] vs. x(0.5), and Std[x(A3)] vs. x(0.85)]
Kendall’s τ: τ(x(E1), S1) = +0.3; τ(x(E2), S2) = −0.5; τ(x(0.85), S3) = −0.2
A1 ∼ Beta(2, 16, [0, 1]), A2 ∼ Beta(1, 1, [0, 1]), A3 ∼ Beta(0.5, 1.5, [0, 1])
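The intersection similarity measure on this slide can be sketched in a few lines of Python. The sketch assumes that Diff denotes the symmetric set difference of the two top-i lists, which is consistent with the plot's range from 0 (identical) to 1 (disjoint); the function name `isim` and the toy rankings are illustrative only.

```python
def isim(Y, Z, k):
    """Intersection similarity of the top-k prefixes of two rankings Y and Z.

    Averages |symmetric difference of the top-i sets| / (2i) over i = 1..k,
    so identical rankings give 0 and disjoint rankings give 1.
    """
    top_y, top_z = set(), set()
    total = 0.0
    for i in range(1, k + 1):
        top_y.add(Y[i - 1])
        top_z.add(Z[i - 1])
        total += len(top_y ^ top_z) / (2 * i)
    return total / k

same = isim(["a", "b", "c"], ["a", "b", "c"], 3)       # identical rankings
disjoint = isim(["a", "b", "c"], ["d", "e", "f"], 3)   # no page in common
swapped = isim(["a", "b", "c"], ["b", "a", "c"], 3)    # top two pages swapped
print(same, disjoint, swapped)
```

In the swapped case only the top-1 sets disagree, so the value is (1 + 0 + 0)/3.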
25. Computation
1. Monte Carlo
   E[x(A)] ≈ (1/N) Σ_{i=1}^{N} x(α_i), α_i ∼ A
2. path damping
   E[x(A)] ≈ Σ_{ℓ=0}^{N} E[A^ℓ − A^(ℓ+1)] P^ℓ v
3. quadrature
   E[x(A)] = ∫_l^r x(α) dρ(α) ≈ Σ_{i=1}^{N} x(ζ_i) ω_i
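The first two methods can be compared on a toy problem. This is a hedged Python sketch, not the code behind the talk: it assumes the six-node example matrix, a uniform v, and A distributed as a standard Beta(17, 3) on [0, 1] (mean 0.85), and it solves each small PageRank system by Gaussian elimination. The moment recurrence E[A^(ℓ+1)] = E[A^ℓ] (p + ℓ)/(p + q + ℓ) is the standard one for a Beta(p, q) variable. All function names are hypothetical.

```python
import random

P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]

def matvec(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def pagerank(P, alpha, v):
    """Direct solve of (I - alpha*P) x = (1 - alpha) v (fine for tiny graphs)."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - alpha * P[i][j] for j in range(n)] for i in range(n)]
    return solve(A, [(1 - alpha) * vi for vi in v])

def monte_carlo_mean(P, v, p, q, N, rng):
    """Average x(alpha_i) over N samples alpha_i from a standard Beta(p, q)."""
    total = [0.0] * len(v)
    for _ in range(N):
        x = pagerank(P, rng.betavariate(p, q), v)
        total = [t + xi for t, xi in zip(total, x)]
    return [t / N for t in total]

def path_damping_mean(P, v, p, q, N):
    """Truncated series sum_{l=0}^{N} E[A^l - A^(l+1)] P^l v for A ~ Beta(p, q)."""
    mean, term, moment = [0.0] * len(v), list(v), 1.0  # moment = E[A^l]
    for l in range(N + 1):
        next_moment = moment * (p + l) / (p + q + l)
        w = moment - next_moment
        mean = [m + w * t for m, t in zip(mean, term)]
        term, moment = matvec(P, term), next_moment
    return mean

v = [1.0 / 6] * 6
mc = monte_carlo_mean(P, v, 17, 3, 2000, random.Random(0))
pd = path_damping_mean(P, v, 17, 3, 2000)
print([round(m, 3) for m in mc])
print([round(m, 3) for m in pd])
```

The path damping estimate needs only matrix-vector products, while each Monte Carlo sample needs a full PageRank solve, which mirrors the work comparison in the talk.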
26. Time
cnr2000 — 325k nodes and 3M edges
[Figure: log-log plot against time in seconds (10^−2 to 10^4), vertical axis from 10^0 down to 10^−15, for Monte Carlo, Path Damping, and Quadrature]
27. Convergence theory
Method                             Conv.             Work Required                  What is N?
Monte Carlo                        1/√N              N PageRank systems             number of samples from A
Path Damping (without Std[x(A)])   r^(N+2)/N^(1+a)   N + 1 matrix vector products   terms of Neumann series
Gaussian Quadrature                r^(2N)            N PageRank systems             number of quadrature points
a and r are parameters from Beta(a, b, l, r)
28. Webspam application
Hosts of uk-2006 are labeled as spam, not-spam, other
                P      R      f      FP     FN
Baseline        0.694  0.558  0.618  0.034  0.442
Beta(0.5,1.5)   0.695  0.561  0.621  0.034  0.439
Beta(1,1)       0.698  0.562  0.622  0.033  0.438
Beta(2,16)      0.699  0.562  0.623  0.033  0.438
(P: precision, R: recall, f: F-measure, FP: false positive rate, FN: false negative rate)
Note: Bagged (10) J48 decision tree classifier in Weka, mean of 50 repetitions from 10-fold cross-validation of 4948 non-spam and 674 spam hosts (5622 total).
Becchetti et al. Link analysis for Web spam detection, 2008.
29. Inner-Outer
Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary
30. Motivation
Why another PageRank algorithm?
For the RAPr codes, we need
1. reliable code
2. fast code over a range of α’s
   → Use Matlab’s “\”
3. code for big problems
   → Use a Gauss-Seidel or custom Richardson method
4. code with only matvec products
   → Use the inner-outer iteration
5. code with only 2 vectors of memory
   → Use the power method
(The options run from fancy at the top to simple at the bottom.)
31. Inner-Outer
Note: PageRank is easier when α is smaller
Thus: Solve PageRank with itself using β < α!
Outer: (I − βP) x^(k+1) = (α − β) P x^(k) + (1 − α) v ≡ f^(k)
Inner: y^(j+1) = β P y^(j) + (α − β) P x^(k) + (1 − α) v
A new parameter? What is β? 0.5
How many inner iterations? Until a residual of 10^−2
Gray, Greif, Lau, 2007.
32. Inner-Outer algorithm
Input: P, v, α, τ, (β = 0.5, η = 10^−2)
Output: x
  x ← v
  y ← Px
  while ‖αy + (1 − α)v − x‖₁ ≥ τ
    f ← (α − β)y + (1 − α)v
    repeat
      x ← f + βy
      y ← Px
    until ‖f + βy − x‖₁ < η
  end while
  x ← αy + (1 − α)v
Properties:
  if 0 ≤ β ≤ α, convergence with any η
  uses only three vectors of memory
  with β = 0.5, η = 10^−2, often faster than the power method (or just a titch slower)
Note: the inner loop checks its condition after doing one iteration.
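The algorithm on this slide translates almost line for line into Python. The sketch below is an illustration under assumptions: it uses the six-node example matrix with a uniform v, dense lists instead of a sparse web graph, and hypothetical helper names. It keeps the three working vectors x, y, and f, and it mirrors the repeat/until structure so the inner test runs after each inner step.

```python
P = [
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
]

def matvec(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def norm1_diff(a, b):
    """1-norm of a - b."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def inner_outer(P, v, alpha, tau, beta=0.5, eta=1e-2):
    """Inner-outer iteration for (I - alpha*P) x = (1 - alpha) v."""
    x = list(v)
    y = matvec(P, x)
    # Outer loop: stop when the PageRank residual drops below tau.
    while norm1_diff([alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)], x) >= tau:
        f = [(alpha - beta) * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        while True:  # repeat ... until: the test comes after one inner step
            x = [fi + beta * yi for fi, yi in zip(f, y)]
            y = matvec(P, x)
            if norm1_diff([fi + beta * yi for fi, yi in zip(f, y)], x) < eta:
                break
    return [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]

v = [1.0 / 6] * 6
alpha = 0.85
x = inner_outer(P, v, alpha, tau=1e-10)

# Residual of the PageRank system at the returned vector.
Px = matvec(P, x)
residual = norm1_diff([alpha * pxi + (1 - alpha) * vi for pxi, vi in zip(Px, v)], x)
print([round(xi, 4) for xi in x], residual)
```

Once the outer residual is small, each pass performs a single inner step, and that step is exactly one power-method update x ← αPx + (1 − α)v.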
34. Extensions
1. A large scale shared-memory parallel version on
compressed web graphs
2. A Gauss-Seidel variant
3. A BiCG-STAB preconditioner
4. A conjecture about the performance of the iteration
5. Showed the algorithm converges for “any” β, η
Gleich, Gray, Greif, Lau, submitted.
35. Convergence Result
Sketch of convergence result
1. error after j steps of the inner iteration
   f^(j) = [ α β^(j−1) P^j + ((α − β)/β) Σ_{ℓ=1}^{j−1} β^ℓ P^ℓ ] f^(0)
2. upper bound error by
   ‖f^(j)‖ ≤ [ ((α − β) + (1 − α) β^j) / (1 − β) ] ‖f^(0)‖.
3. notice
   ‖f^(j)‖ ≤ α ‖f^(0)‖, j ≥ 1
4. hence, convergence as long as β ≤ α
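The step from the error expression to the bound can be checked with a little arithmetic. Taking 1-norms term by term, with ‖P‖₁ = 1 for a column-stochastic P, gives αβ^(j−1) + ((α − β)/β) Σ_{ℓ=1}^{j−1} β^ℓ, and the geometric sum collapses to ((α − β) + (1 − α)β^j)/(1 − β). The short Python sketch below, with hypothetical function names and the values α = 0.85, β = 0.5 chosen for illustration, confirms that the two forms agree, that the factor equals α at j = 1, and that it does not grow with j.

```python
def series_bound(alpha, beta, j):
    """Term-by-term norm bound: alpha*beta^(j-1) + ((alpha-beta)/beta) * sum_{l=1}^{j-1} beta^l."""
    geometric = sum(beta ** l for l in range(1, j))
    return alpha * beta ** (j - 1) + (alpha - beta) / beta * geometric

def closed_bound(alpha, beta, j):
    """Closed form of the same bound: ((alpha - beta) + (1 - alpha)*beta^j) / (1 - beta)."""
    return ((alpha - beta) + (1 - alpha) * beta ** j) / (1 - beta)

alpha, beta = 0.85, 0.5
factors = [closed_bound(alpha, beta, j) for j in range(1, 11)]
agree = all(abs(series_bound(alpha, beta, j) - closed_bound(alpha, beta, j)) < 1e-12
            for j in range(1, 11))
print(agree, [round(f, 4) for f in factors])
```

For β ≤ α the factor decreases from α at j = 1 toward (α − β)/(1 − β), so every inner sweep contracts the error by at least α.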
36. Summary
Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary
38. Contributions
1. Derivative
   Gleich, Glynn, Golub, Greif, 2007.
   New technique to compute the derivative using just PageRank
2. RAPr
   Constantine and Gleich, 2007; Constantine, Gleich, and Iaccarino, submitted.
   New PageRank model and sensitivity measure
   Range of algorithms and algorithmic analysis
   Empirically helpful for spam identification
   Robust software
3. Inner-Outer
   Gleich, Gray, Greif, Lau, submitted.
   Improved convergence analysis
   Gauss-Seidel and preconditioning variants
   Shared-memory parallel implementation
   Robust software
41. THANK YOU
Margot Gerritsen, Debbie Heimowitz, Peter Glynn, Jason Azicri, Walter Murray, Steven Fan,
Reid Andersen, Paul Constantine, Pavel Berkhin, Michael Atkinson, Kevin Lang, Jeremy Kozdon,
Amy Langville, Esteban Arcaute, Matthew Rasmussen, Sebastiano Vigna, Adam Guetz, Will Fong,
Leonid Zhukov, Andrew Bradley, Indira Choudhury, Seth Tornborg, Nick Henderson, Chris Maes,
Brian Tempero, Nicole Taheri, Prisilla Williams, Ying Wang, Deb Michael, Nick West,
Mayita Romero, Kaustuv's Rum, Les Fletcher, Saeco Coffee Machine, Hugh Fletcher, Napa Valley,
Lindsey Fletcher, Matlab, Jane Fletcher, superlu