This document summarizes Jason Riedy's Ph.D. dissertation on making static pivoting scalable and dependable. The dissertation extends iterative refinement to provide small forward errors dependably, even for difficult systems, improves static pivoting heuristics, and develops a distributed memory algorithm for static pivoting. The work defines what it means for a solver to be dependable and introduces error measures and a difficulty metric. It presents results showing the method provides dependable errors for a higher percentage of test systems than a previous method.
Making Static Pivoting Scalable and Dependable
1. Making Static Pivoting Scalable and Dependable
Ph.D. Dissertation Talk
E. Jason Riedy
jason@acm.org
EECS Department
University of California, Berkeley
Committee: Dr. James Demmel (chair), Dr. Katherine Yelick, Dr. Sanjay Govindjee
17 December, 2010
2. Outline
1 Introduction
2 Solving Ax = b dependably
3 Extending dependability to static pivoting
4 Distributed matching for static pivoting
5 Summary
Jason Riedy (UCB) Static Pivoting 17 Dec, 2010 2 / 59
3. Motivation: Ever Larger Ax = b
Systems Ax = b are growing larger, more difficult
Omega3P: n = 7.5 million with τ = 300 million entries
Quantum Mechanics: precondition with blocks of dimension
200-350 thousand
Large barrier-based optimization problems: Many solves, similar
structure, increasing condition number
Huge systems are generated, solved, and analyzed automatically.
Large, highly unsymmetric systems need scalable parallel solvers.
Low-level routines: No expert in the loop!
4. Motivation: Solving Ax = b better
Many people work to solve Ax = b faster.
Today we start with how to solve it better.
Better enables faster.
Use extra floating-point precision within iterative refinement to
obtain a dependable solution, adding O(n²) work after an O(n³)
factorization.
Accelerate sparse factorization through static pivoting,
decoupling symbolic, numeric phases.
Refine the perturbed solution without needing extra triangular
solves for condition estimation.
5. Contributions
Iterative refinement
Extend iterative refinement to provide small forward errors
dependably (to be defined)
Set and use a methodology to demonstrate dependability
Show that condition estimation (expensive for sparse systems) is
not necessary for obtaining a dependable solution
Static pivoting
Improve static pivoting heuristics
Demonstrate that an approximate maximum weight bipartite
matching is faster and just as accurate
Develop a memory-scalable distributed memory auction
algorithm for static pivoting
6. Defining “dependable”
A dependable solver for Ax = b returns a result x with small error
often enough that you expect success with a small error, and clearly
signals results that likely contain large errors.
True error          Difficulty  Alg. reports  Likeliness
O(mach. precision)  not bad     success       Very likely
                                failure       Somewhat rare
larger              not bad     success       (not yet seen)
                                failure       Practically certain
O(mach. precision)  difficult   success       Whenever feasible
                                failure       Practically certain
larger              difficult   success       (not yet seen)
                                failure       Very likely
7. Introducing the errors and targets
[Figure: the computed y1 and the true x = A⁻¹b inside the perturbed neighborhood of (A, b).]
[Plots: error vs. difficulty, shaded by percent of systems. Left, "LU: Small backward error": backward error stays roughly between 2⁻²⁵ and 2⁻⁴⁵. Right, "LU: Error in y ∝ difficulty": forward error grows in proportion to difficulty.]
8. Introducing the errors and targets
[Figure: refinement iterates y1, …, yk approaching x = A⁻¹b within the perturbed neighborhood of (A, b).]
Refined: Accepted with small errors in y, or flagged with unknown error.
[Plots: error vs. difficulty for "Successful" and "Flagged" systems, shaded by percent of systems.]
9. Iterative refinement
Newton’s method applied to Ax = b.
Repeat until done:
1 Compute the residual ri = b − Ayi using extra precision εr.
2 Solve A dyi = ri for the correction using working precision εw.
3 Increment yi+1 = yi + dyi, maintaining y to extra precision εx.
Precisions:
Working precision εw: the precision used for storing (and factoring)
A: IEEE 754 single (εw = 2⁻²⁴), double (εw = 2⁻⁵³), etc.
Residual precision εr: at least double working precision, εr ≤ εw²
Solution precision εx: at least double working precision, εx ≤ εw²
Latter two may be implemented in software.
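The loop above can be sketched in a few lines; a minimal numpy illustration, assuming a small dense system, with float32 as the working precision and float64 for the residual and solution (a real solver would factor A once in working precision and reuse the factors for each correction, whereas np.linalg.solve here re-factors for brevity):

```python
import numpy as np

def refine(A, b, steps=5):
    """Sketch of iterative refinement with extra-precision residual/solution.

    Working precision eps_w: float32 (the precision A is "factored" in).
    Residual precision eps_r and solution precision eps_x: float64.
    """
    A32 = A.astype(np.float32)
    b64 = b.astype(np.float64)
    # Initial solve entirely in working precision.
    y = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(steps):
        r = b64 - A.astype(np.float64) @ y                # residual in eps_r
        dy = np.linalg.solve(A32, r.astype(np.float32))   # correction in eps_w
        y = y + dy.astype(np.float64)                     # increment kept in eps_x
    return y
```

For a well-conditioned system this converges to nearly full float64 accuracy even though every correction is computed with float32 factors.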
10. Definitions
Errors:
Backward (relative) error
Forward (relative) error
Difficulty:
Condition numbers: sensitivity to perturbations
Element growth: error from factorization
11. Error measures: Backward error
How close is the nearest system satisfying Ay1 = b?
[Figure: perturbation diagram around (A, b) and x = A⁻¹b.]
Three ways, given r1 = b − Ay1:
Normwise: ‖r1‖∞ / (‖A‖∞ ‖y1‖∞ + ‖b‖∞)
Componentwise: ‖ |r1| / (|A| |y1| + |b|) ‖∞
Columnwise: ‖r1‖∞ / ((max |A|) |y1| + ‖b‖∞)
Note: elementwise division, 0/0 = 0, and max produces a row vector (column maxima)
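These three measures can be computed directly; a sketch assuming dense numpy arrays, using the 0/0 = 0 convention for the elementwise division (not the dissertation's implementation):

```python
import numpy as np

def backward_errors(A, y, b):
    """The three backward-error measures for an approximate solution y."""
    r = b - A @ y
    absA, absy, absb = np.abs(A), np.abs(y), np.abs(b)

    # Normwise: ||r||_inf / (||A||_inf ||y||_inf + ||b||_inf).
    norm = np.max(np.abs(r)) / (np.max(absA.sum(axis=1)) * absy.max() + absb.max())

    # Componentwise: || |r| / (|A||y| + |b|) ||_inf, with 0/0 = 0.
    denom = absA @ absy + absb
    comp = np.max(np.divide(np.abs(r), denom,
                            out=np.zeros_like(r), where=denom != 0))

    # Columnwise: max over rows of |A| gives a row vector of column maxima.
    colw = np.max(np.abs(r)) / (absA.max(axis=0) @ absy + absb.max())
    return norm, comp, colw
```

An exact solution drives all three measures to zero; any perturbation makes them positive.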
12. Error measures: Forward error
How close is y1 to x?
[Figure: perturbation diagram around (A, b) and x = A⁻¹b.]
Two ways and two measuring sticks:
Normwise: ‖y1 − x‖∞ / ‖x‖∞ or ‖y1 − x‖∞ / ‖y1‖∞
Componentwise: ‖(y1 − x)/x‖∞ or ‖(y1 − x)/y1‖∞
13. Error sensitivity: Conditioning
How sensitive is y1 to perturbations in A and b?
[Figure: perturbation diagram around (A, b) and x = A⁻¹b.]
forward error ≤ condition number × backward error
Each combination has a condition number. We choose two for use in
our difficulty measure.
14. Difficulty: condition number × element growth
Condition number:
Backward error: κ(A) = ‖A⁻¹‖∞ ‖A‖∞
Normwise forw. err.: κ(A, x, b) = ‖A⁻¹‖∞ (‖A‖∞ ‖x‖∞ + ‖b‖∞)
Componentwise forw. err.: ccond(A, x, b) = ‖ |A⁻¹| (|A| |x| + |b|) ‖∞
Element growth, est. δAi in (A + δAi)y = b:
|δAi| ≤ 3 nd |L| |U| ≤ p(nd) · g · max |A|
We use a col.-scaling-indep. expression allowing |L| > 1:
gc = max_j [ (max_{1≤k≤j} max_i |L(i,k)|) · (max_i |U(i,j)|) ] / max_i |A(i,j)|
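The gc expression can be evaluated directly from the factors; a sketch assuming dense numpy arrays for L, U, and A (a sparse code would instead track column maxima during factorization):

```python
import numpy as np

def column_growth(A, L, U):
    """Column-scaling-independent element growth g_c:

        g_c = max_j [ (max_{k<=j} max_i |L(i,k)|) * (max_i |U(i,j)|) ]
                    / max_i |A(i,j)|
    """
    # Running maximum over columns k <= j of the column maxima of |L|.
    Lmax = np.maximum.accumulate(np.abs(L).max(axis=0))
    Umax = np.abs(U).max(axis=0)   # column maxima of |U|
    Amax = np.abs(A).max(axis=0)   # column maxima of |A|
    return float(np.max(Lmax * Umax / Amax))
```

For an exact, well-behaved factorization such as the identity, gc is 1; large gc signals element growth that inflates the factorization's backward error.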
15. Dense test systems
30 × 30 single, double, complex, and double complex:
250k, 4 right-hand sides, 1M test systems
Size chosen to sample ill-conditioned region well
Generated as in Demmel, et al., plus b → x
Difficulty measure: κ∞(A) = ‖A⁻¹‖∞ ‖A‖∞
[Histograms: percent of population vs. difficulty, 2⁰ to 2⁷⁰, for Single, Double, Complex, and Double Complex.]
16. Dense test systems
Same test population as the previous slide.
Difficulty measure: κ(A, x, b) = ‖A⁻¹‖∞ (‖A‖∞ ‖x‖∞ + ‖b‖∞)
[Histograms: percent of population vs. difficulty, 2⁰ to 2⁷⁰, for Single, Double, Complex, and Double Complex.]
17. Dense test systems
Same test population as the previous slide.
Difficulty measure: ccond(A, x, b) = ‖ |A⁻¹| (|A| |x| + |b|) ‖∞
[Histograms: percent of population vs. difficulty, 2⁰ to 2⁸⁰, for Single, Double, Complex, and Double Complex.]
19. How?
[Plots: componentwise backward error (cberr) and componentwise forward error (cferr) vs. difficulty, shaded by percent of systems.]
Carry the intermediate soln. yi to twice the working precision.
Refine the backward error down to nearly εw².
By “forward error ≤ conditioning × backward error”, the
forward error for well-enough conditioned problems is nearly εw.
21. Results: Comparison with xGESVXX
Precision        Accepted (well / ill)   Rejected (well / ill)
Single           79% / 15%               1% / 5%
Single complex   76% / 19%               1% / 4%
Double           87% / 9%                1% / 5%
Double complex   85% / 11%               1% / 3%
Accepted, ill-conditioned systems are those gained by our routine
that xGESVXX rejects.
Rejected, well-conditioned systems are those lost by our routine
but accepted by xGESVXX.
26. Static pivoting
If a pivot |A(j, j)| < T , perturb up to T by adding
sign(A(j, j)) · (T − |A(j, j)|).
Forcibly increases backward error, decreases element growth
In sparse systems, few updates should occur to an entry.
Large diagonal entries should remain large...
Thresholding heuristics
SuperLU: γ · ‖A‖₁
column-relative: γ · max |A(:, j)|
diagonal-relative: γ · |A(j, j)|
γ = 2⁻²⁶ ≈ √εw, 2⁻³⁸, or 2⁻⁴³ = 2¹⁰ εw
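A sketch of the perturbation rule above; the threshold T comes from one of the heuristics listed, and taking sign(0) as +1 is an assumption of this sketch:

```python
def perturb_pivot(a_jj, threshold):
    """Static-pivoting perturbation: if |A(j,j)| < T, bump the pivot
    up to magnitude T by adding sign(A(j,j)) * (T - |A(j,j)|).

    Returns the (possibly perturbed) pivot value.
    """
    if abs(a_jj) < threshold:
        sign = 1.0 if a_jj >= 0 else -1.0   # sign(0) taken as +1
        return a_jj + sign * (threshold - abs(a_jj))
    return a_jj
```

With, say, the column-relative heuristic, the call would look like perturb_pivot(A[j][j], gamma * max(abs(A[i][j]) for i in rows)); the perturbation forcibly increases backward error while bounding element growth.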
27. Sparse test systems
Matrices are from the UF Collection, chosen from existing
comparisons between SuperLU, MUMPS, and UMFPACK.
Wide range of conditioning and numerical scaling
Compute “True” solutions using a doubled-double-extended
factorization and quad-double-extended refinement with a
modified TAUCS.
Refinement uses LAPACK-style numerical scaling throughout,
but the test systems are generated in the matrix’s given scaling.
Also tested on singular systems; no solutions accepted.
At some point, plan on feeding the “true” solutions into the UF
Collection...
28. Sparse normwise conditioning
[Histogram: percent of population vs. normwise difficulty, roughly 2¹⁰ to 2⁵⁰.]
29. Sparse componentwise conditioning
[Histogram: percent of population vs. componentwise difficulty, roughly 2²⁰ to 2⁶⁰.]
37. Sparse Matrix to Bipartite Graph to Pivots
[Figure: a 4 × 4 sparse matrix, its bipartite row/column graph, and the matched pivots: Col 1→Row 2, Col 2→Row 3, Col 3→Row 1, Col 4→Row 4.]
Bipartite model
Each row and column is a vertex.
Each explicit entry is an edge.
Want to choose “largest” entries for pivots.
Maximum weight complete bipartite matching:
linear assignment problem
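As a sketch of the model's edge weights, assuming the benefit form B(i, j) = c + log₂|A(i, j)| used later in the talk, with (row, col, value) triples for the explicit entries:

```python
import math

def benefits(entries, c=0.0):
    """Build bipartite edge weights B(i, j) = c + log2|A(i, j)| from a
    sparse matrix given as (row, col, value) triples.

    Explicit zeros get weight -inf: they cannot be chosen as pivots.
    """
    B = {}
    for i, j, v in entries:
        B[(i, j)] = c + math.log2(abs(v)) if v != 0 else -math.inf
    return B
```

Maximizing the sum of these logarithms over a complete matching maximizes the product of pivot magnitudes.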
38. Mathematical Form
“Just” a linear optimization problem:
B: n × n matrix of benefits in ℝ ∪ {−∞}, often c + log₂ |A|
X: n × n permutation matrix: the matching
pr, πc: dual variables, will be price and profit
1r, 1c: unit entry vectors corresponding to rows, cols
Lin. assignment prob.:
  maximize Tr BᵀX over X ∈ ℝ^{n×n}
  subject to X 1c = 1r, Xᵀ 1r = 1c, and X ≥ 0.
Dual problem:
  minimize 1rᵀ pr + 1cᵀ πc over pr, πc
  subject to pr 1cᵀ + 1r πcᵀ ≥ B.
39. Mathematical Form
“Just” a linear optimization problem:
B: n × n matrix of benefits in ℝ ∪ {−∞}, often c + log₂ |A|
X: n × n permutation matrix: the matching
pr, πc: dual variables, will be price and profit
1r, 1c: unit entry vectors corresponding to rows, cols
Lin. assignment prob.:
  maximize Tr BᵀX over X ∈ ℝ^{n×n}
  subject to X 1c = 1r, Xᵀ 1r = 1c, and X ≥ 0.
Dual problem, implicit form:
  minimize over pr: 1rᵀ pr + Σ_{j∈C} max_{i∈R} (B(i, j) − pr(i)).
40. Do We Need a Special Method?
The LAP:
  maximize Tr BᵀX over X ∈ ℝ^{n×n}
  subject to X 1c = 1r, Xᵀ 1r = 1c, and X ≥ 0.
Standard form:
  minimize cᵀx over x
  subject to A x = 1_{r+c}, and x ≥ 0.
A: 2n × τ vertex-edge matrix
Network optimization kills simplex methods.
(“Smoothed analysis” does not apply.)
Interior point algs need to round the solution.
(And need to solve Ax = b for a much larger A, although
theoretically great in NC.)
Combinatorial methods should be faster.
(But unpredictable!)
41. Properties from Optimization
Complementary slackness
X ∘ (pr 1cᵀ + 1r πcᵀ − B) = 0 (elementwise product).
If (i, j) is in the matching (X(i, j) = 1), then
pr(i) + πc(j) = B(i, j).
Used to choose matching edges and modify dual variables in
combinatorial algorithms.
42. Properties from Optimization
Relaxed problem
Introduce a parameter µ, two interpretations:
from a barrier function related to X ≥ 0, or
from the auction algorithm (later).
Then
Tr BᵀX∗ ≤ 1rᵀ pr + 1cᵀ πc ≤ Tr BᵀX∗ + (n − 1)µ,
so the computed dual value (and hence the computed primal matching) is
within (n − 1)µ of the optimal primal.
Very useful for finding approximately optimal matchings.
Feasibility bound
Starting from zero prices:
pr(i) ≤ (n − 1)(µ + finite range of B)
43. Algorithms for Solving the LAP
Goal: A parallel algorithm that justifies buying big machines.
Acceptable: A distributed algorithm; matrix is on many nodes.
Choices:
Simplex or continuous / interior-point
Plain simplex blows up, network simplex difficult to parallelize.
Rounding for interior point often falls back on matching.
(Optimal IP algorithm: Goldberg, Plotkin, Shmoys, Tardos.
Needs factorization.)
Augmenting-path based (Mc64: Duff and Koster)
Based on depth- or breadth-first search.
Both are P-complete, inherently sequential (Greenlaw, Reif).
Auctions (Bertsekas, et al.)
Only length-1 or -2 alternating paths; global sync for duals.
44. Auction Algorithms
Discussion will be column-major.
General structure:
1 Each unmatched column finds the “best” row, places a bid.
The dual variable pr holds the prices.
The profit πc is implicit. (No significant FP errors!)
Each entry’s value: benefit B(i, j)− price p(i).
A bid maximally increases the price of the most valuable row.
2 Bids are reconciled.
Highest proposed price wins, forms a match.
Loser needs to re-bid.
Some versions need tie-breaking; here least column.
3 Repeat.
Eventually everyone will be matched, or
some price will be too high.
Seq. implementation in ∼40–50 lines, can compete with Mc64
Some corner cases to handle. . .
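A minimal sequential auction in the spirit of the structure above, assuming a dense benefit matrix (lists of lists, −inf for entries that cannot match); each bid raises the winning row's price by the value gap plus µ, and the displaced column re-bids:

```python
import math

def auction(B, mu=1e-6):
    """Forward auction for the dense linear assignment problem.

    B[i][j] is the benefit of matching row i with column j.
    Returns row_of: for each column j, the matched row index.
    """
    n = len(B)
    price = [0.0] * n           # dual variable p_r: one price per row
    col_of = [None] * n         # current owner (column) of each row
    row_of = [None] * n         # current match (row) of each column
    unmatched = list(range(n))  # columns that still need to bid

    while unmatched:
        j = unmatched.pop()
        # Find the best and second-best rows by value = benefit - price.
        best_i, best_v, second_v = None, -math.inf, -math.inf
        for i in range(n):
            v = B[i][j] - price[i]
            if v > best_v:
                best_i, second_v, best_v = i, best_v, v
            elif v > second_v:
                second_v = v
        # Bid: raise the winner's price by the value gap plus mu.
        # (With a single adjacent row, second_v = -inf and the price
        # becomes infinite: the corner case noted on a later slide.)
        price[best_i] += (best_v - second_v) + mu
        # Reconcile: whoever owned that row loses it and must re-bid.
        if col_of[best_i] is not None:
            row_of[col_of[best_i]] = None
            unmatched.append(col_of[best_i])
        col_of[best_i] = j
        row_of[j] = best_i
    return row_of
```

The resulting matching is within (n − 1)µ of optimal; shrinking µ between passes (the µ-scaling a few slides ahead) trades runtime for accuracy.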
45. The Bid-Finding Loop
For each unmatched column:
For each adjacent row: value = entry − price; save the largest and second-largest values; the bid price increment is the difference in values.
Differences from sparse matrix-vector products
Not all columns, rows used every iteration. (sparse matrix,
sparse vector)
Hence output price updates are scattered.
More local work per entry
46. The Bid-Finding Loop
Little points
Increase bid price by µ to avoid loops
Needs care in floating-point for small µ.
Single adjacent row → ∞ price
Affects feasibility test, computing dual
47. Termination
Once a row is matched, it stays matched.
A new bid may swap it to another column.
The matching (primal) increases monotonically.
Prices only increase.
The dual does not change when a row is newly matched.
But the dual may decrease when a row is taken.
The dual decreases monotonically.
Subtle part: If the dual doesn’t decrease. . .
It’s ok. Can show the new edge begins an augmenting path that
increases the matching or an alternating path that decreases the
dual.
48. Successive Approximation (µ-scaling)
Simple auctions aren’t really competitive with Mc64.
Start with a rough approximation (large µ) and refine.
Called ε-scaling in the literature, but µ-scaling is better.
Preserve the prices pr at each step, but clear the matching.
Note: Do not clear matches associated with ∞ prices!
Equivalent to finding diagonal scaling Dr ADc and matching
again on the new B.
Problem: Performance strongly depends on initial scaling.
Also depends strongly on hidden parameters.
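The schedule itself is simple; a sketch of one possible geometric µ schedule (the factor and endpoints are illustrative assumptions, not the dissertation's hidden parameters):

```python
def mu_schedule(mu0, mu_final, factor=8.0):
    """Successive-approximation (mu-scaling) schedule: start with a
    rough, large mu and shrink it geometrically toward the target.

    Each auction pass keeps the prices p_r but clears the matching;
    the matching computed at a given mu is within (n - 1) * mu of
    the optimal primal value.
    """
    mus = []
    mu = mu0
    while mu > mu_final:
        mus.append(mu)
        mu /= factor
    mus.append(mu_final)
    return mus
```

Early, coarse passes position the prices cheaply; the final, small-µ pass then needs far fewer bids than a cold start.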
50. Sequential performance: Highly variable
Group Name By col (s) By row (s) Row/Col
Bai af23560 0.025 0.028 1.13
FEMLAB poisson3Db 0.014 0.016 1.11
FIDAP ex11 0.060 0.060 1.00
GHS indef cont-300 0.007 0.006 0.84
GHS indef ncvxqp5 0.338 0.318 0.94
Hamm scircuit 0.048 0.047 0.99
Hollinger g7jac200 0.355 0.339 0.95
Mallya lhr14 0.044 0.065 1.47
Schenk IBMSDS 3D 51448 3D 0.031 0.282 9.22
Schenk IBMSDS matrix 9 0.074 0.613 8.29
Schenk ISEI barrier2-4 0.291 0.193 0.66
Vavasis av41092 5.462 4.083 0.75
Zhao Zhao2 1.041 0.609 0.58
51. Sequential performance: Highly variable
Group Name Float (s) Int (s) Int/Float
Bai af23560 0.025 0.040 1.61
FEMLAB poisson3Db 0.015 0.016 1.08
FIDAP ex11 0.060 0.029 0.49
GHS indef cont-300 0.007 0.006 0.91
GHS indef ncvxqp5 0.338 0.425 1.26
Hamm scircuit 0.048 0.016 0.34
Hollinger g7jac200 0.355 1.004 2.83
Mallya lhr14 0.044 0.050 1.12
Schenk IBMSDS 3D 51448 3D 0.031 0.020 0.66
Schenk IBMSDS matrix 9 0.074 0.066 0.89
Schenk ISEI barrier2-4 0.291 0.261 0.91
Vavasis av41092 5.462 5.401 0.99
Zhao Zhao2 1.041 2.269 2.18
52. Approximately maximum matchings
Terminal µ value
Name 0 5.96e-08 2.44e-04 5.00e-01
af23560 Primal 1342850 1342850 1342850 1342670
Time(s) 0.14 0.05 0.03 0
ratio 0.37 0.21 0.02
poisson3Db Primal 2483070 2483070 2483070 2483070
Time(s) 0.02 0.02 0.02 0.02
ratio 1.01 1.04 1.07
g7jac200 Primal 3533980 3533980 3533980 3533340
Time(s) 2.98 1.07 0.28 0.18
ratio 0.36 0.09 0.06
av41092 Primal 3156210 3156210 3156210 3155920
Time(s) 24.51 8.09 2.48 0.11
ratio 0.33 0.10 0.00
Zhao2 Primal 333891 333891 333891 333487
Time(s) 7.69 2.37 3.65 0.02
ratio 0.31 0.47 0.00
53. Setting / Lowering Parallel Expectations
Performance scalability?
Originally proposed (early 1990s) when
cpu speed ≈ memory speed ≈ network speed ≈ slow.
Now:
cpu speed ≫ memory latency > network latency.
The number of communication phases dominates matching
algorithms (auction and others).
Communication patterns are very irregular.
Latency and software overhead is not improving. . .
Scaled back goal
It suffices to not slow down much on distributed data.
59. Iteration order still matters
[Figure: time (s) vs. number of processors (5-20) for av41092 and shyy161, comparing row-major and col-major iteration order.]
60. Many different speed-up profiles
[Figure: time (s) vs. number of processors (5-20) for af23560, bmwcra_1, garon2, and stomach, showing very different speed-up profiles.]
61. So what happens in some cases?
Matrix av41092 has one large strongly connected component.
(The square blocks in a Dulmage-Mendelsohn decomposition.)
The SCC spans all the processors.
Every edge in an SCC is a part of some complete matching.
Horrible performance from:
starting along a non-max-weight matching,
making it almost complete,
then an edge-by-edge search for nearby matchings,
requiring a communication phase almost per edge.
Conjecture: This type of performance land-mine will affect any
0-1 combinatorial algorithm.
62. Improvements?
Approximate matchings: Speeds up the sequential case,
eliminating any “speed-up.”
Rearranging deck chairs: few-to-few communication
Build a directory of which nodes share rows: collapsed BBᵀ.
Send only to/from those neighbors.
Minor improvement over MPI Allgatherv for a huge effort.
Latency not a major factor...
Improving communication may not be worth it. . .
The real problem is the number of comm. phases.
If diagonal is the matching, everything is overhead.
Or if there’s a large SCC. . .
Another alternative: Multiple algorithms at once.
Run Bora Uçar's alg. on one set of nodes, auction on another,
transposed auction on another, . . .
Requires some painful software engineering.
63. Latency not a dominating factor
[Figure: speed-up relative to reducing to the root node, for node × procs-per-node configurations 1x3, 3x1, 1x8, and 2x4.]
64. So, Could This Ever Be Parallel?
For a given matrix-processor layout, constructing a matrix
requiring O(n) communication is pretty easy for combinatorial
algorithms.
Force almost every local action to be undone at every step.
Non-fractional combinatorial algorithms are too restricted.
Using less-restricted optimization methods is promising, but far
slower sequentially.
Existing algs (Goldberg, et al.) are PRAM with n³ processors.
General purpose methods: Cutting planes, successive SDPs
Someone clever might find a parallel rounding algorithm.
Solving the fractional LAP quickly would become a matter of
finding a magic preconditioner. . .
Maybe not a good thing for a direct method?
65. Review of contributions
Iterative refinement
Successfully deliver dependable solutions with a little extra
precision.
Removed need for condition estimation.
Built methodology for evaluating Ax = b solution methods’
accuracy and dependability.
Static pivoting
Tuned static pivoting heuristics to provide dependability.
Demonstrated that an approximate maximum weight bipartite
matching is faster and just as dependable.
Developed a memory-scalable (although not
performance-scalable) distributed memory auction algorithm for
static pivoting.
66. Future directions
Iterative refinement
Least-squares refinement demonstrated (Demmel, Hida, Li, &
Riedy), but needs... refinement.
Perhaps refinement could render an iterative method
dependable. Could improve accuracy of A dyi = ri with extra
iterations as i increases.
Could help build trust in new methods (e.g. CALU).
Distributed matching
Interesting software problem: Run multiple algorithms on
portions of a parallel allotment. How do you signal the others to
terminate?
Interesting algorithm problem: Is there an efficient rounding
method for fractional / interior point algorithms?
68. Bounds
Backward error
‖Di⁻¹ ri‖∞ ≤ (c̄ − ρ)⁻¹ (3(nd + 1) εr + εx)
Here nd is an expression of size, c̄ is the upper bound on per-iteration
decrease, and ρ is a safety factor for the region around 1/εw.
Forward error
‖Di⁻¹ ei‖∞ ≲ 2(4 + ρ̄(nd + 1)) εw · (c̄ − ρ̄)⁻¹
assuming εr ≤ εw², εx ≤ εw². Using only one precision, εr = εx = εw:
(c̄ − ρ̄) ‖Di⁻¹ ei‖∞ ≲ 2(5 + 2(nd + 1) ccond(A, yi)) εd.