Druinsky_SIAMCSE15
1. Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li,
Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
2. What this talk is about
Performance optimization, comparison, and modeling of a novel shared-memory algebraic-multigrid solver, using the SPE10 reservoir-modeling problem, on a node of a Cray XC30 and on a Xeon Phi.
3. How our multigrid solver works
Repeat until converged:
pre-smoothing           y ← x + M⁻¹(b − Ax)
coarse-grid correction  z ← y + P A_c⁻¹ Pᵀ(b − Ay)
post-smoothing          x ← z + M⁻¹(b − Az)
6. What the SPE10 problem is and how we are solving it
Credit: http://www.spe10.org
oil-reservoir modeling benchmark problem
solved using Darcy’s equation (in primal form)
−∇ · (κ(x) ∇p(x)) = f(x),
where p(x) is the pressure and κ(x) the permeability
defined over a 60 × 220 × 85 grid
with isotropic and anisotropic versions
8. What are the machines that we study?
                 Edison             Babbage
name             Ivy Bridge         Knights Corner
model            Xeon E5-2695 v2    Xeon Phi 5110P
clock speed      2.4 GHz            1.053 GHz
cores            12                 60
SMT threads      2                  4
SIMD width       4                  8
peak gflop/s     230.4              1010.88
bandwidth        48.5 GB/s          122.9 GB/s
per-core caches:
  L1-D           32 KB              32 KB
  L2             256 KB             512 KB
shared cache:
  L3             30 MB              none
12. How we chose the preconditioner for PCG
                 SGS          Jacobi
time (s)         1 thread     1 thread    12 threads
  isotropic      83.0         80.3        29.2
  anisotropic    128.6        121.6       43.8
13. Where does the AMG cycle spend most of its time?
[Figure: runtime (s) vs. number of threads (1–120), log-scale axes, showing smoothing time, PCG time, and total cycle time.]
14. How to improve the performance of PCG (Algorithm 1)
1: while not converged do
2: ρ ← σ
3: omp parallel for: w ← Ap
4: omp parallel for: τ ← w · p
5: α ← ρ/τ
6: omp parallel for: x ← x + αp
7: omp parallel for: r ← r − αw
8: omp parallel for: z ← M⁻¹r
9: omp parallel for: σ ← z · r
10: β ← σ/ρ
11: omp parallel for: p ← z + βp
12: end while
15. How to improve the performance of PCG (Algorithm 2)
1: omp parallel
2: while not converged do
3: omp single: τ ← 0.0 implied barrier
4: omp single nowait: ρ ← σ, σ ← 0.0
5: omp for nowait: w ← Ap
6: omp for reduction: τ ← w · p implied barrier
7: α ← ρ/τ
8: omp for nowait: x ← x + αp
9: omp for nowait: r ← r − αw
10: omp for nowait: z ← M⁻¹r
11: omp for reduction: σ ← z · r implied barrier
12: β ← σ/ρ
13: omp for nowait: p ← z + βp
14: end while
15: end omp parallel
16. How to improve the performance of PCG (Algorithm 3)
1: omp parallel
2: while not converged do
3: omp for: w ← Ap
4: omp single
5: τ ← w · p
6: α ← ρ/τ
7: x ← x + αp
8: r ← r − αw
9: z ← M⁻¹r
10: ρ ← σ
11: σ ← z · r
12: β ← σ/ρ
13: p ← z + βp
14: end omp single
15: end while
16: end omp parallel
17. How to improve the performance of PCG (Algorithm 4)
1: while not converged do
2: ρ ← σ
3: omp parallel for: w ← Ap
4: τ ← w · p
5: α ← ρ/τ
6: x ← x + αp
7: r ← r − αw
8: z ← M⁻¹r
9: σ ← z · r
10: β ← σ/ρ
11: p ← z + βp
12: end while
18. How to improve the performance of PCG
[Figure: runtime (s) vs. number of threads (1–120), log-scale axes, comparing Algorithms 1–4.]
19. How the sparse HSS solver works
a sparse matrix-factorization algorithm
represents the frontal matrices as hierarchically semiseparable (HSS) matrices
uses randomized sampling for faster compression
[Figure: HSS partitioning of a frontal matrix into diagonal blocks D_i and low-rank off-diagonal blocks U_i B_i V_jᴴ.]
More details in Pieter Ghysels’ talk tomorrow!
20. How do the parameters of the solver affect performance?
Parameter                  Values
coarse solver              HSS, PCG
elements per agglomerate   64, 128, 256, 512
ν_P                        0, 1, 2
ν_{M⁻¹}                    1, 3, 5
θ                          0.001, 0.001 × 10^0.5, 0.01
21. How do the parameters of the solver affect performance?
[Figure: runtime (s) vs. percentile rank (1%–64%) of sampled configurations, for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG), with the default configuration marked.]
23. What our performance model is
[Figure: runtime (s) vs. number of cores (1–12), comparing the memory bound, the flop bound, and the actual runtime.]
24. Final comments
HSS is an attractive option for solving coarse systems
performance is quite sensitive to parameter tuning
performance model indicates where the bottlenecks are