2. Two projects.
1. Circulant structure in tensors and a linear system solver.
2. Localized structure in the matrix exponential and relaxation methods.
3. Circulant algebra
Introduction
Kilmer, Martin, and Perrone (2008) presented a circulant algebra: a set of operations that generalizes matrix algebra to three-way data, and provided an SVD. The essence of this approach is to view three-dimensional objects as two-dimensional arrays (i.e., matrices) of one-dimensional arrays (i.e., vectors). Braman (2010) developed spectral and other decompositions.
We have extended this algebra with the ingredients required for iterative methods such as the power method and Arnoldi method, and have characterized the behavior of these algorithms.
With Chen Greif and James Varah at UBC.
4. Three-way arrays
Given an $m \times n \times k$ table of data, we view this data as an $m \times n$ matrix where each "scalar" is a vector of length $k$:
$$A \in \mathbb{K}_k^{m \times n}.$$
We denote the space of length-$k$ scalars as $\mathbb{K}_k$. These scalars interact like circulant matrices.
5. Circulant matrices are a commutative, closed class under the standard matrix operations:
$$\begin{bmatrix}
\alpha_1 & \alpha_k & \cdots & \alpha_2 \\
\alpha_2 & \alpha_1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \alpha_k \\
\alpha_k & \cdots & \alpha_2 & \alpha_1
\end{bmatrix}$$
We’ll see more of their properties shortly!
The circ operation
We denote the space of length-$k$ scalars as $\mathbb{K}_k$. These scalars interact like circulant matrices:
$$\alpha = \{\alpha_1\ \ldots\ \alpha_k\} \in \mathbb{K}_k
\quad\longleftrightarrow\quad
\operatorname{circ}(\alpha) \equiv
\begin{bmatrix}
\alpha_1 & \alpha_k & \cdots & \alpha_2 \\
\alpha_2 & \alpha_1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \alpha_k \\
\alpha_k & \cdots & \alpha_2 & \alpha_1
\end{bmatrix}.$$
$$\alpha + \beta \;\longleftrightarrow\; \operatorname{circ}(\alpha) + \operatorname{circ}(\beta)
\qquad\text{and}\qquad
\alpha\beta \;\longleftrightarrow\; \operatorname{circ}(\alpha)\operatorname{circ}(\beta);$$
$$0 = \{0\ 0\ \ldots\ 0\}, \qquad 1 = \{1\ 0\ \ldots\ 0\}.$$
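A minimal MATLAB sketch of these correspondences (circ here is a one-line local helper, not library code, and the Fourier identity assumes MATLAB's fft convention):

    % K_3 scalars as circulant matrices: products match circular convolution.
    circ  = @(a) toeplitz(a, a([1; end:-1:2]));  % circulant with first column a
    alpha = [2 3 1]';  beta = [3 1 1]';
    C = circ(alpha) * circ(beta);                % again circulant
    gamma = C(:,1);                              % gamma = alpha*beta in K_3
    norm(gamma - ifft(fft(alpha).*fft(beta)))    % ~ 1e-15: Fourier equivalence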
6. Scalars to matrix-vector products
Operations (cont.)
More operations are simplified in the Fourier space too. Let $\operatorname{cft}(\alpha) = \operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k]$. Because the $\hat\alpha_j$ values are the eigenvalues of $\operatorname{circ}(\alpha)$, we have:
$$\operatorname{abs}(\alpha) = \operatorname{icft}(\operatorname{diag}[\,|\hat\alpha_1|, \ldots, |\hat\alpha_k|\,]),$$
$$\bar\alpha = \operatorname{icft}(\operatorname{diag}[\,\bar{\hat\alpha}_1, \ldots, \bar{\hat\alpha}_k\,]) = \operatorname{icft}(\operatorname{cft}(\alpha)^*), \quad\text{and}$$
$$\operatorname{angle}(\alpha) = \operatorname{icft}(\operatorname{diag}[\,\hat\alpha_1/|\hat\alpha_1|, \ldots, \hat\alpha_k/|\hat\alpha_k|\,]).$$
cft and icft
We define the "Circulant Fourier Transform" or cft
$$\operatorname{cft} : \mathbb{K}_k \to \mathbb{C}^{k \times k}$$
and its inverse
$$\operatorname{icft} : \mathbb{C}^{k \times k} \to \mathbb{K}_k$$
as follows:
$$\operatorname{cft}(\alpha) \equiv \operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k] = F\operatorname{circ}(\alpha)F^*,
\qquad
\operatorname{icft}(\operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k]) \equiv \alpha \;\longleftrightarrow\; F^*\operatorname{cft}(\alpha)F,$$
where the $\hat\alpha_j$ are the eigenvalues of $\operatorname{circ}(\alpha)$ as produced by the Fourier transform. These transformations satisfy $\operatorname{icft}(\operatorname{cft}(\alpha)) = \alpha$ and provide a convenient way of moving between operations in $\mathbb{K}_k$ and the more familiar environment of diagonal matrices in $\mathbb{C}^{k \times k}$.
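Concretely, since the eigenvalues of $\operatorname{circ}(\alpha)$ are the DFT of $\alpha$, cft and icft reduce to fft/ifft on the length-$k$ representation. A minimal sketch, assuming MATLAB's fft convention:

    cft  = @(a) diag(fft(a));     % alpha -> diag(ahat_1, ..., ahat_k)
    icft = @(D) ifft(diag(D));    % diagonal matrix -> alpha in K_k
    alpha = [2 3 1]';
    icft(cft(alpha))              % recovers {2 3 1} (up to roundoff)
    abs_alpha = real(icft(diag(abs(fft(alpha)))))   % abs(alpha) from the formula above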
The circ operation on matrices
$$A \circ x =
\begin{bmatrix} \sum_{j=1}^{n} A_{1,j}\,x_j \\ \vdots \\ \sum_{j=1}^{n} A_{m,j}\,x_j \end{bmatrix}
\;\longleftrightarrow\;
\begin{bmatrix} \operatorname{circ}(A_{1,1}) & \cdots \\ \vdots & \ddots \\ \operatorname{circ}(A_{m,1}) & \cdots \end{bmatrix}\cdots$$
Define
$$\operatorname{circ}(A) \equiv
\begin{bmatrix}
\operatorname{circ}(A_{1,1}) & \cdots & \operatorname{circ}(A_{1,n}) \\
\vdots & \ddots & \vdots \\
\operatorname{circ}(A_{m,1}) & \cdots & \operatorname{circ}(A_{m,n})
\end{bmatrix}.$$
7. The special structure
This circulant structure is our special structure for this first problem. We look at two types of iterative methods:
1. the power method and
2. the Arnoldi method.
8. A perplexing result!
Example
Run the power method on
$$A = \begin{bmatrix} \{2\ 3\ 1\} & \{0\ 0\ 0\} \\ \{0\ 0\ 0\} & \{3\ 1\ 1\} \end{bmatrix}.$$
Result: $\lambda = \tfrac{1}{3}\{10\ 4\ 4\}$.
10. Some understanding through decoupling
Example
Let $A = \begin{bmatrix} \{2\ 3\ 1\} & \{8\ {-2}\ 0\} \\ \{-2\ 0\ 2\} & \{3\ 1\ 1\} \end{bmatrix}$. The results of the circ and cft operations are:
$$\operatorname{circ}(A) =
\begin{bmatrix}
2 & 1 & 3 & 8 & 0 & -2 \\
3 & 2 & 1 & -2 & 8 & 0 \\
1 & 3 & 2 & 0 & -2 & 8 \\
-2 & 2 & 0 & 3 & 1 & 1 \\
0 & -2 & 2 & 1 & 3 & 1 \\
2 & 0 & -2 & 1 & 1 & 3
\end{bmatrix},$$
$$(I \otimes F)\operatorname{circ}(A)(I \otimes F)^* =
\begin{bmatrix}
6 & & & 6 & & \\
& -\sqrt{3}\,i & & & 9+\sqrt{3}\,i & \\
& & \sqrt{3}\,i & & & 9-\sqrt{3}\,i \\
0 & & & 5 & & \\
& -3+\sqrt{3}\,i & & & 2 & \\
& & -3-\sqrt{3}\,i & & & 2
\end{bmatrix},$$
$$\operatorname{cft}(A) =
\begin{bmatrix}
6 & 6 & & & & \\
0 & 5 & & & & \\
& & -\sqrt{3}\,i & 9+\sqrt{3}\,i & & \\
& & -3+\sqrt{3}\,i & 2 & & \\
& & & & \sqrt{3}\,i & 9-\sqrt{3}\,i \\
& & & & -3-\sqrt{3}\,i & 2
\end{bmatrix}.$$
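These numbers can be checked directly. A short MATLAB verification, assuming the standard unitary-DFT diagonalization of circulants (circ is a local helper, not library code):

    circ = @(a) toeplitz(a, a([1; end:-1:2]));
    A11 = [2 3 1]'; A12 = [8 -2 0]'; A21 = [-2 0 2]'; A22 = [3 1 1]';
    circA = [circ(A11) circ(A12); circ(A21) circ(A22)];
    F = fft(eye(3)) / sqrt(3);                    % unitary DFT matrix
    D = kron(eye(2), F) * circA * kron(eye(2), F)';
    norm(D(1:3,1:3) - diag(fft(A11)))             % each 3x3 block is diagonal: ~ 0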
11. Example
$$A = \begin{bmatrix} \{2\ 3\ 1\} & \{0\ 0\ 0\} \\ \{0\ 0\ 0\} & \{3\ 1\ 1\} \end{bmatrix},$$
$$\hat A_1 = \begin{bmatrix} 6 & 0 \\ 0 & 5 \end{bmatrix}, \quad
\hat A_2 = \begin{bmatrix} -\sqrt{3}\,i & 0 \\ 0 & 2 \end{bmatrix}, \quad
\hat A_3 = \begin{bmatrix} \sqrt{3}\,i & 0 \\ 0 & 2 \end{bmatrix}.$$
$$\begin{aligned}
\lambda_1 &= \operatorname{icft}(\operatorname{diag}[6\ \ 2\ \ 2]) = \tfrac{1}{3}\{10\ 4\ 4\} \\
\lambda_2 &= \operatorname{icft}(\operatorname{diag}[5\ \ {-\sqrt{3}\,i}\ \ \sqrt{3}\,i]) = \tfrac{1}{3}\{5\ 8\ 2\} \\
\lambda_3 &= \operatorname{icft}(\operatorname{diag}[6\ \ {-\sqrt{3}\,i}\ \ \sqrt{3}\,i]) = \{2\ 3\ 1\} \\
\lambda_4 &= \operatorname{icft}(\operatorname{diag}[5\ \ 2\ \ 2]) = \{3\ 1\ 1\}.
\end{aligned}$$
The corresponding eigenvectors are
$$x_1 = \begin{bmatrix} \{1/3\ \ 1/3\ \ 1/3\} \\ \{2/3\ \ {-1/3}\ \ {-1/3}\} \end{bmatrix}; \quad
x_2 = \begin{bmatrix} \{2/3\ \ {-1/3}\ \ {-1/3}\} \\ \{1/3\ \ 1/3\ \ 1/3\} \end{bmatrix}; \quad
x_3 = \begin{bmatrix} \{1\ 0\ 0\} \\ \{0\ 0\ 0\} \end{bmatrix}; \quad
x_4 = \begin{bmatrix} \{0\ 0\ 0\} \\ \{1\ 0\ 0\} \end{bmatrix}.$$
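A short MATLAB sketch reproduces the perplexing eigenvalue through this decoupling: take the dominant eigenvalue of each $\hat A_j$ and map back with icft (again assuming MATLAB's fft convention):

    ahat = fft([2 3 1]');  bhat = fft([3 1 1]');  % spectra of the diagonal entries
    lamhat = zeros(3,1);
    for j = 1:3
        pair = [ahat(j) bhat(j)];                 % eigenvalues of Ahat_j
        [~, idx] = max(abs(pair));
        lamhat(j) = pair(idx);                    % dominant eigenvalue per component
    end
    lambda1 = real(ifft(lamhat))                  % = [10 4 4]'/3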
12. Convergence of the power method is in terms of the individual blocks
The power method converges
Let $A \in \mathbb{K}_k^{n \times n}$ have a canonical set of eigenvalues $\lambda_1, \ldots, \lambda_n$ where $|\lambda_1| > |\lambda_2|$; then the power method in the circulant algebra converges to an eigenvector $x_1$ with eigenvalue $\lambda_1$.
Here we use the ordering
$$\alpha < \beta \;\longleftrightarrow\; \operatorname{cft}(\alpha) < \operatorname{cft}(\beta) \text{ elementwise}.$$
The example above has more eigenvalues than just $\lambda_1, \ldots, \lambda_4$:
$$\lambda_5 = \operatorname{icft}(\operatorname{diag}[6\ \ {-\sqrt{3}\,i}\ \ 2]), \quad
\lambda_6 = \operatorname{icft}(\operatorname{diag}[6\ \ 2\ \ \sqrt{3}\,i]), \quad
\lambda_7 = \operatorname{icft}(\operatorname{diag}[5\ \ {-\sqrt{3}\,i}\ \ 2]), \quad
\lambda_8 = \operatorname{icft}(\operatorname{diag}[5\ \ 2\ \ \sqrt{3}\,i]),$$
altogether a polynomial number, which exceeds the dimension of the matrix.
Definition. A canonical set of eigenvalues and eigenvectors is a set of minimum size, ordered such that $\operatorname{abs}(\lambda_1) \ge \operatorname{abs}(\lambda_2) \ge \ldots$, which contains the information to reproduce any eigenvalue or eigenvector of $A$.
In this case, the only canonical set is $\{(\lambda_1, x_1), (\lambda_2, x_2)\}$. (We need two pairs, and we have $\operatorname{abs}(\lambda_1) \ge \operatorname{abs}(\lambda_2)$.)
13.
[Figure: The convergence behavior of the power method in the circulant algebra. The gray lines show the error in each eigenvalue in Fourier space; the curves track the predictions made based on the eigenvalues as discussed in the text, with rates of the form $\bigl(\tfrac{2+2\cos(2\pi/n)}{2+2\cos(\pi/n)}\bigr)^{2i}$, $\bigl(\tfrac{6+2\cos(2\pi/n)}{6+2\cos(\pi/n)}\bigr)^{2i}$, and $\bigl(\tfrac{6+2\cos(2\pi/n)}{6+2\cos(\pi/n)}\bigr)^{i}$. Axes: iteration vs. eigenvalue error and eigenvector change.]
[Figure: The convergence behavior of a GMRES procedure using the circulant Arnoldi process. The gray lines show the error in each Fourier component and the red line shows the magnitude of the residual. We observe poor convergence in one component. Axes: Arnoldi iteration vs. magnitude; legend: absolute error, residual magnitude.]
The Arnoldi process
Let $A$ be an $n \times n$ matrix with real-valued entries. Then the Arnoldi method is a technique to build an orthogonal basis for the Krylov subspace $K_t(A, v) = \operatorname{span}\{v, Av, \ldots, A^{t-1}v\}$, where $v$ is an initial vector. We have the decomposition
$$A Q_t = Q_{t+1} H_{t+1,t},$$
where $Q_t$ is an $n \times t$ matrix, and $H_{t+1,t}$ is a $(t+1) \times t$ upper Hessenberg matrix.
Using our repertoire of operations, the Arnoldi method in the circulant algebra is equivalent to individual Arnoldi processes on each matrix $\hat A_j$; this is equivalent to a block Arnoldi process. Using the cft and icft operations, we produce an Arnoldi factorization:
$$A \circ Q_t = Q_{t+1} \circ H_{t+1,t}.$$
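For reference, a minimal MATLAB sketch of the standard Arnoldi process that each Fourier component runs (a generic textbook implementation, not the paper's code):

    function [Q, H] = arnoldi(A, v, t)
    % Build an orthonormal basis Q of K_t(A, v) with A*Q(:,1:t) = Q*H.
    n = size(A, 1);
    Q = zeros(n, t+1);  H = zeros(t+1, t);
    Q(:,1) = v / norm(v);
    for j = 1:t
        w = A * Q(:,j);
        for i = 1:j                        % modified Gram-Schmidt
            H(i,j) = Q(:,i)' * w;
            w = w - H(i,j) * Q(:,i);
        end
        H(j+1,j) = norm(w);
        if H(j+1,j) == 0, break; end       % invariant subspace found
        Q(:,j+1) = w / H(j+1,j);
    end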
14. A number of interesting mathematical results from this algebra
1. A case study of how "decoupled" block iterations arise and are meaningful for an application.
2. It's a beautiful algebra. E.g.
More operations are simplified in the Fourier space too. Let $\operatorname{cft}(\alpha) = \operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k]$. Because the $\hat\alpha_j$ values are the eigenvalues of $\operatorname{circ}(\alpha)$, we have:
$$\operatorname{abs}(\alpha) = \operatorname{icft}(\operatorname{diag}[\,|\hat\alpha_1|, \ldots, |\hat\alpha_k|\,]), \qquad
\bar\alpha = \operatorname{icft}(\operatorname{cft}(\alpha)^*), \qquad
\operatorname{angle}(\alpha) = \operatorname{icft}(\operatorname{diag}[\,\hat\alpha_1/|\hat\alpha_1|, \ldots, \hat\alpha_k/|\hat\alpha_k|\,]).$$
Proofs are simple, e.g. $\operatorname{angle}(\alpha)\,\overline{\operatorname{angle}(\alpha)} = 1$.
Live Matlab demo
15. Conclusion to the circulant algebra
Paper available from https://www.cs.purdue.edu/homes/dgleich/
"The power and Arnoldi method in an algebra of circulants," NLA 2013, Gleich, Greif, Varah.
Code available from https://www.cs.purdue.edu/homes/dgleich/codes/camat
16. Project 2
Fast relaxation methods to estimate a column of the matrix exponential.
With Kyle Kloster
17. Matrix exponentials
$\exp(A)$ is defined as
$$\exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!} A^k \qquad\text{(always converges)}$$
$$\frac{dx}{dt} = Ax(t) \;\Longrightarrow\; x(t) = \exp(tA)\,x(0) \qquad\text{(evolution operator for an ODE)}$$
Here $A$ is $n \times n$, real. This is a special case of a function of a matrix $f(A)$; others are $f(x) = 1/x$, $f(x) = \sinh(x)$, ...
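A quick MATLAB check of the series definition against the built-in expm on a small dense matrix:

    A = [0 1; -1 0];
    X = eye(2);  term = eye(2);
    for k = 1:20
        term = term * A / k;     % A^k / k!
        X = X + term;
    end
    norm(X - expm(A))            % ~ 1e-16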
18. Matrix exponentials on large networks
$$\exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!} A^k$$
If $A$ is the adjacency matrix, then $A^k$ counts the number of length-$k$ paths between node pairs. [Estrada 2000; Farahat et al. 2002, 2006] Large entries denote important nodes or edges. Used for link prediction and centrality.
$$\exp(P) = \sum_{k=0}^{\infty} \frac{1}{k!} P^k$$
If $P$ is a transition matrix (column stochastic), then $P^k$ gives the probabilities of length-$k$ walks between node pairs. [Kondor & Lafferty 2002; Kunegis & Lommatzsch 2009; Chung 2007] Used for link prediction, kernels, and clustering or community detection.
19. This talk: a column of the matrix exponential
$$x = \exp(P)\,e_c$$
$x$: the solution. $P$: the matrix. $e_c$: the column.
20. This talk: a column of the matrix exponential
$$x = \exp(P)\,e_c$$
$x$: the solution (localized). $P$: the matrix (large, sparse, stochastic). $e_c$: the column.
22. Our mission!
Exploit the structure. Find the solution with work roughly proportional to the localization, not the matrix.
24. Matrix exponentials on large networks
Is a single column interesting? Yes!
$$\exp(P)\,e_c = \sum_{k=0}^{\infty} \frac{1}{k!} P^k e_c$$
Link prediction scores for node $c$; a community relative to node $c$.
But ... modern networks are large, $\sim O(10^9)$ nodes; sparse, $\sim O(10^{11})$ edges; and constantly changing ... and so we'd like speed over accuracy.
25. Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later. Cleve Moler and Charles Van Loan, SIAM Review, Vol. 45, No. 1, pp. 3-49, 2003.
26. Our underlying method
Direct expansion!
$$x = \exp(P)\,e_c \approx \sum_{k=0}^{N} \frac{1}{k!} P^k e_c = x_N$$
A few matvecs; quick loss of sparsity due to fill-in. This method is stable for stochastic $P$: no cancellation, no unbounded norms, etc.
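A Horner-style MATLAB sketch of this truncated expansion using only sparse matvecs ($P$ and the target column index $c$ are assumed to be given):

    N = 11;  n = size(P, 1);
    ec = sparse(c, 1, 1, n, 1);
    x = ec;
    for k = N:-1:1
        x = ec + (P * x) / k;    % x <- e_c + P x / k (Horner on the Taylor series)
    end
    % now x = sum_{k=0}^{N} P^k e_c / k!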
27. Our underlying method as a linear system
Direct expansion
$$x = \exp(P)\,e_c \approx \sum_{k=0}^{N} \frac{1}{k!} P^k e_c = x_N$$
becomes the block linear system
$$\begin{bmatrix}
I & & & & \\
-P/1 & I & & & \\
& -P/2 & \ddots & & \\
& & \ddots & \ddots & \\
& & & -P/N & I
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ \vdots \\ v_N \end{bmatrix}
=
\begin{bmatrix} e_c \\ 0 \\ \vdots \\ \vdots \\ 0 \end{bmatrix},
\qquad x_N = \sum_{i=0}^{N} v_i,$$
or compactly
$$(I \otimes I_N - S_N \otimes P)\,v = e_1 \otimes e_c.$$
Lemma. We approximate $x_N$ well if we approximate $v$ well.
29. Coordinate descent, Gauss-Southwell, Gauss-Seidel, relaxation & "push" methods
Procedurally (the slide's pseudocode, made runnable in MATLAB; a tolerance replaces the original infinite loop, and a unit diagonal, diag(A) = 1, is assumed, as in our system):

    function x = relax_solve(A, b, tol)
    % Relaxation ("push") solver for A*x = b, assuming diag(A) = 1.
    x = sparse(size(A, 1), 1);
    r = b;
    while norm(r, 1) > tol
        j = find(r, 1);          % pick j where r(j) ~= 0
        z = r(j);                % (Gauss-Southwell picks the largest |r(j)|)
        x(j) = x(j) + z;
        r = r - z * A(:, j);     % updates all i with A(i,j) ~= 0; zeros r(j)
    end
Algebraically:
$$Ax = b, \qquad r^{(k)} = b - Ax^{(k)},$$
$$x^{(k+1)} = x^{(k)} + e_j e_j^T r^{(k)}, \qquad r^{(k+1)} = r^{(k)} - r^{(k)}_j A e_j.$$
30. Back to the exponential
$$(I \otimes I_N - S_N \otimes P)\,v = e_1 \otimes e_c, \qquad x_N = \sum_{i=0}^{N} v_i$$
(the same block bidiagonal system as before). Solve this system via the same method.
Optimization 1: build the system implicitly.
Optimization 2: don't store the $v_i$; just store the sum $x_N$.
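A minimal MATLAB sketch combining both optimizations (a simplified, hypothetical variant of gexpm: it relaxes any nonzero residual entry rather than the largest one; a sparse stochastic P and target node c are assumed):

    N = 11;  n = size(P, 1);  tol = 1e-4;
    x = sparse(n, 1);
    R = sparse(n, N+1);  R(c, 1) = 1;         % residual blocks r_0, ..., r_N
    while sum(abs(R(:))) > tol
        [i, k] = find(R, 1);                  % optimization 1: system never formed
        z = R(i, k);  R(i, k) = 0;
        x(i) = x(i) + z;                      % optimization 2: accumulate x_N directly
        if k <= N
            R(:, k+1) = R(:, k+1) + z * P(:, i) / k;   % push along column i of -P/k
        end
    end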
31. Error analysis for Gauss-Southwell
Theorem. Assume $P$ is column-stochastic and $v^{(0)} = 0$ in the system $(I \otimes I_N - S_N \otimes P)\,v = e_1 \otimes e_c$. Then:
(Nonnegativity, "easy") the iterates and residuals are nonnegative: $v^{(l)} \ge 0$ and $r^{(l)} \ge 0$;
(Convergence, "annoying") the residual goes to 0:
$$\|r^{(l)}\|_1 \le \prod_{k=1}^{l} \Bigl(1 - \frac{1}{2dk}\Bigr) \le l^{-1/(2d)},$$
where $d$ is the largest degree.
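The product bound yields the stated rate via $1 - x \le e^{-x}$ and the harmonic sum:
$$\prod_{k=1}^{l}\Bigl(1 - \frac{1}{2dk}\Bigr) \le \exp\Bigl(-\frac{1}{2d}\sum_{k=1}^{l}\frac{1}{k}\Bigr) \le \exp\Bigl(-\frac{\log l}{2d}\Bigr) = l^{-1/(2d)}.$$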
32. Proof sketch
Gauss-Southwell picks the largest residual
⇒ bound the update by the average number of nonzeros in the residual (sloppy)
⇒ algebraic convergence with a slow rate, but each update is REALLY fast: $O(d_{\max} \log n)$.
If $d$ is $\log\log n$, then our method runs in sublinear time (but so does just about anything).
33. Overall error analysis
Components: truncation to $N$ terms; residual to error; approximate solve.
Theorem. After $\ell$ steps of Gauss-Southwell,
$$\|x_N^{(\ell)} - x\|_1 \le \frac{1}{N!\,N} + \frac{1}{e}\cdot \ell^{-1/(2d)}.$$
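For a sense of scale, the truncation term is already tiny for modest $N$: $\frac{1}{8!\cdot 8} \approx 3.1 \times 10^{-6}$, so after the first few terms it is the $\ell$-dependent relaxation term that dominates the bound.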
34. More recent error analysis
Theorem (Gleich and Kloster, 2013, arXiv:1310.3423). Consider computing the matrix exponential using the Gauss-Southwell relaxation method in a graph with a Zipf law in the degrees with exponent $p = 1$ and max degree $d$; then the work involved in getting a solution with 1-norm error $\varepsilon$ is
$$\text{work} = O\Bigl(\log\bigl(\tfrac{1}{\varepsilon}\bigr)\,\bigl(\tfrac{1}{\varepsilon}\bigr)^{3/2}\, d^2 (\log d)^2\Bigr).$$
35. Problem size & runtimes
[Figure: The median runtime of our methods for the seven graphs over 100 trials. Axes: $|V| + \operatorname{nnz}(P)$ (from $10^6$ to $10^{10}$) vs. runtime in seconds ($10^{-4}$ to $10^4$); methods: expmv, half, gexpmq, gexpm, expmimv.]
Table 3. The real-world datasets we use in our experiments span three orders of magnitude in size.
Graph          |V|          nnz(P)         nnz(P)/|V|
itdk0304       190,914      1,215,220      6.37
dblp-2010      226,413      1,432,920      6.33
flickr-scc     527,476      9,357,071      17.74
ljournal-2008  5,363,260    77,991,514     14.54
webbase-2001   118,142,155  1,019,903,190  8.63
twitter-2010   33,479,734   1,394,440,635  41.65
friendster     65,608,366   3,612,134,270  55.06
Real-world networks. The datasets used are summarized in Table 3. They include a version of the flickr graph from [Bonchi et al., 2012] containing just the largest strongly-connected component of the original graph; dblp-2010 from [Boldi et al., 2011]; itdk0304 from [The Cooperative Association for Internet Data Analysis, 2005]; ljournal-2008 from [Boldi et al., 2011; Chierichetti et al., 2009]; twitter-2010 [Kwak et al., 2010]; webbase-2001 from [Hirai et al., 2000; Boldi and Vigna, 2005]; and the friendster graph from [Yang and Leskovec, 2012].
Implementation details. All experiments were performed on either a dual-processor Xeon e5-2670 system with 16 cores (total) and 256 GB of RAM or a single-processor Intel i7-990X (3.47 GHz) with 24 GB of RAM. Our algorithms were implemented in C++ using the Matlab MEX interface. All data structures used are memory-efficient: the solution and residual are stored as hash tables using Google's sparsehash package. The precise code for the algorithms and the experiments below is available via https://www.cs.purdue.edu/homes/dgleich/codes/nexpokit/.
Comparison. We compare our implementation with a state-of-the-art Matlab function for computing the exponential of a matrix times a vector, expmv [Al-Mohy and Higham, 2011]. We customized this method with the knowledge that $\|P\|_1 = 1$; this single change results in a great improvement to the runtime of their code. In each experiment, we use as the "true solution" the result of a call to expmv using the 'single' option, which guarantees a 1-norm error bounded by $2^{-24}$, or, for smaller problems, we use a Taylor approximation with the number of terms predicted by Lemma 12.
36. References and ongoing work
Kloster and Gleich, Workshop on Algorithms for the Web-graph, 2013. Also see the journal version on arXiv.
www.cs.purdue.edu/homes/dgleich/codes/nexpokit
• Error analysis using the queue (almost done ...)
• Better linear systems for faster convergence
• Asynchronous coordinate descent methods
• Scaling up to billion-node graphs (done ...)
Supported by NSF CAREER 1149756-CCF
www.cs.purdue.edu/homes/dgleich