Druinsky_SIAMCSE15
1. Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li,
Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
2. What this talk is about
Performance optimization, comparison, and modeling of a novel shared-memory algebraic-multigrid solver, using the SPE10 reservoir-modeling problem, on a node of a Cray XC30 and on a Xeon Phi.
3. How our multigrid solver works
Repeat until converged:
pre-smoothing           y ← x + M⁻¹(b − Ax)
coarse-grid correction  z ← y + P A_c⁻¹ Pᵀ(b − Ay)
post-smoothing          x ← z + M⁻¹(b − Az)
6. What the SPE10 problem is and how we are solving it
Credit: http://www.spe10.org
oil-reservoir modeling benchmark problem
solved using Darcy’s equation (in primal form)
−∇ · (κ(x) ∇p(x)) = f(x),
where p(x) is the pressure and κ(x) the permeability
defined over a 60 × 220 × 85 grid
with isotropic and anisotropic versions
8. What are the machines that we study?
                 Edison             Babbage
name             Ivy Bridge         Knights Corner
model            Xeon E5-2695 v2    Xeon Phi 5110P
clock speed      2.4 GHz            1.053 GHz
cores            12                 60
SMT threads      2                  4
SIMD width       4                  8
peak gflop/s     230.4              1010.88
bandwidth        48.5 GB/s          122.9 GB/s
per-core caches:
  L1-D           32 KB              32 KB
  L2             256 KB             512 KB
shared cache:
  L3             30 MB              none
12. How we chose the preconditioner for PCG
                 SGS          Jacobi
time (s)         1 thread     1 thread    12 threads
  isotropic      83.0         80.3        29.2
  anisotropic    128.6        121.6       43.8
13. Where does the AMG cycle spend most of its time?
[Figure: runtime (s) vs. number of threads (1–120), log-scale axes, showing smoothing time, PCG time, and total cycle time.]
14. How to improve the performance of PCG (Algorithm 1)
1: while not converged do
2: ρ ← σ
3: omp parallel for: w ← Ap
4: omp parallel for: τ ← w · p
5: α ← ρ/τ
6: omp parallel for: x ← x + αp
7: omp parallel for: r ← r − αw
8: omp parallel for: z ← M⁻¹r
9: omp parallel for: σ ← z · r
10: β ← σ/ρ
11: omp parallel for: p ← z + βp
12: end while
15. How to improve the performance of PCG (Algorithm 2)
1: omp parallel
2: while not converged do
3: omp single: τ ← 0.0 implied barrier
4: omp single nowait: ρ ← σ, σ ← 0.0
5: omp for nowait: w ← Ap
6: omp for reduction: τ ← w · p implied barrier
7: α ← ρ/τ
8: omp for nowait: x ← x + αp
9: omp for nowait: r ← r − αw
10: omp for nowait: z ← M⁻¹r
11: omp for reduction: σ ← z · r implied barrier
12: β ← σ/ρ
13: omp for nowait: p ← z + βp
14: end while
15: end omp parallel
16. How to improve the performance of PCG (Algorithm 3)
1: omp parallel
2: while not converged do
3: omp for: w ← Ap
4: omp single
5: τ ← w · p
6: α ← ρ/τ
7: x ← x + αp
8: r ← r − αw
9: z ← M⁻¹r
10: ρ ← σ
11: σ ← z · r
12: β ← σ/ρ
13: p ← z + βp
14: end omp single
15: end while
16: end omp parallel
17. How to improve the performance of PCG (Algorithm 4)
1: while not converged do
2: ρ ← σ
3: omp parallel for: w ← Ap
4: τ ← w · p
5: α ← ρ/τ
6: x ← x + αp
7: r ← r − αw
8: z ← M⁻¹r
9: σ ← z · r
10: β ← σ/ρ
11: p ← z + βp
12: end while
18. How to improve the performance of PCG
[Figure: runtime (s) vs. number of threads (1–120), log-scale axes, comparing Algorithms 1–4.]
19. How the sparse HSS solver works
a sparse matrix-factorization algorithm
represents the frontal matrices as hierarchically semiseparable (HSS) matrices
uses randomized sampling for faster compression
[Figure: HSS partitioning of a frontal matrix into diagonal blocks D_i and low-rank off-diagonal blocks U_i B_i V_jᴴ.]
More details in Pieter Ghysels’ talk tomorrow!
20. How do the parameters of the solver affect performance?
Parameter                  Values
coarse solver              HSS, PCG
elements per agglomerate   64, 128, 256, 512
ν_P                        0, 1, 2
ν_{M⁻¹}                    1, 3, 5
θ                          0.001, 0.001 × 10^0.5, 0.01
21. How do the parameters of the solver affect performance?
[Figure: runtime (s) vs. percentile rank (1%–64%) of sampled configurations, for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG), with the default configuration marked.]
23. What our performance model is
[Figure: runtime (s) vs. number of cores (1–12), comparing the memory bound, the flop bound, and the actual runtime.]
24. Final comments
HSS is an attractive option for solving coarse systems
performance is quite sensitive to parameter tuning
performance model indicates where the bottlenecks are