Comparative Performance Analysis of an
Algebraic-Multigrid Solver on
Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li,
Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
What this talk is about
Performance optimization, comparison and modeling of
a novel shared-memory algebraic-multigrid solver
using the SPE10 reservoir-modeling problem
on a node of a Cray XC30 and on a Xeon Phi.
How our multigrid solver works
Repeat until converged:
  pre-smoothing:           $y \leftarrow x + M^{-1}(b - Ax)$
  coarse-grid correction:  $z \leftarrow y + P A_c^{-1} P^T (b - Ay)$
  post-smoothing:          $x \leftarrow z + M^{-1}(b - Az)$
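The three steps of the cycle can be sketched on a toy problem. The following is a minimal illustration, not the authors' solver: the 1D Laplacian, the linear-interpolation `P`, the damped-Jacobi choice for $M$, and the dense direct coarse solve are all assumptions made for the example, as are the function names. It also assumes that `P` maps coarse vectors to fine ones, so the coarse matrix is formed as `P^T A P` here.

```python
# Two-grid cycle on a small 1D Laplacian (illustrative stand-in problem).

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (stand-in coarse solver)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def residual(A, x, b):
    return [bi - ai for bi, ai in zip(b, matvec(A, x))]

def smooth(A, x, b, omega=2.0 / 3.0):
    # x + M^{-1}(b - Ax) with the assumed choice M = D / omega (damped Jacobi)
    r = residual(A, x, b)
    return [xi + omega * ri / A[i][i] for i, (xi, ri) in enumerate(zip(x, r))]

def two_grid_cycle(A, P, Ac, x, b):
    y = smooth(A, x, b)                                # pre-smoothing
    rc = matvec(transpose(P), residual(A, y, b))       # restrict the residual
    ec = solve_dense(Ac, rc)                           # coarse solve
    z = [yi + ei for yi, ei in zip(y, matvec(P, ec))]  # coarse-grid correction
    return smooth(A, z, b)                             # post-smoothing

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

n, nc = 7, 3
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
P = [[0.0] * nc for _ in range(n)]                     # linear interpolation
for j in range(nc):
    P[2 * j][j] = 0.5
    P[2 * j + 1][j] = 1.0
    P[2 * j + 2][j] = 0.5
Ac = matmul(transpose(P), matmul(A, P))                # coarse matrix

b = [1.0] * n
x = [0.0] * n
r0 = norm(residual(A, x, b))
for _ in range(25):                                    # "repeat until converged"
    x = two_grid_cycle(A, P, Ac, x, b)
r_final = norm(residual(A, x, b))
print(f"relative residual after 25 cycles: {r_final / r0:.2e}")
```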
How we construct the interpolator
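[Figure: matrix diagram of the interpolator construction; the surviving labels are S, P and an accented P.]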
How we construct the coarse-grid matrix
$A_c = P A P^T$
What the SPE10 problem is and how we are solving it
Credit: http://www.spe10.org
oil-reservoir modeling benchmark problem
solved using Darcy’s equation (in primal form)
$-\nabla \cdot (\kappa(x) \nabla p(x)) = f(x)$,
where p(x) = pressure, and κ(x) = permeability
defined over a 60 × 220 × 85 grid
with isotropic and anisotropic versions
What are the machines that we study?
                        Edison              Babbage
name                    Ivy Bridge          Knights Corner
model                   Xeon E5-2695 v2     Xeon Phi 5110P
clock speed             2.4 GHz             1.053 GHz
cores                   12                  60
SMT threads             2                   4
SIMD width (doubles)    4                   8
peak gflop/s            230.4               1010.88
bandwidth               48.5 GB/s           122.9 GB/s
per-core caches:
  L1-D                  32 KB               32 KB
  L2                    256 KB              512 KB
shared cache:
  L3                    30 MB               none
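The peak gflop/s figures above are consistent with the product of cores, clock speed, and SIMD width, times two floating-point operations per lane per cycle. The factor of two is an assumption of this sketch (one multiply and one add per cycle), and the dictionary layout and function names are illustrative only. The same numbers give a rough machine balance in flops per byte.

```python
# Reproduce the peak gflop/s rows from the other rows of the table.
machines = {
    "Edison (Xeon E5-2695 v2)": {"cores": 12, "ghz": 2.4, "simd": 4, "bw_gbs": 48.5},
    "Babbage (Xeon Phi 5110P)": {"cores": 60, "ghz": 1.053, "simd": 8, "bw_gbs": 122.9},
}

# Assumption: each SIMD lane retires one multiply and one add per cycle.
FLOPS_PER_LANE_PER_CYCLE = 2

def peak_gflops(m):
    return m["cores"] * m["ghz"] * m["simd"] * FLOPS_PER_LANE_PER_CYCLE

def machine_balance(m):
    """Flops the machine can execute per byte moved from memory."""
    return peak_gflops(m) / m["bw_gbs"]

for name, m in machines.items():
    print(f"{name}: {peak_gflops(m):.2f} gflop/s, "
          f"{machine_balance(m):.2f} flops/byte")
```

A kernel that performs fewer flops per byte than the machine balance is expected to be limited by memory bandwidth rather than by arithmetic.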
What the coarse-grid system is
n = 7,782; nnz = 1,412,840; nnz/n = 181.6
How we chose the preconditioner for PCG
preconditioner            operator
Jacobi                    $z = D^{-1} r$
Symmetric Gauss–Seidel    $z = (L + D)^{-1} D (L + D)^{-T} r$

where $A_c = L + D + L^T$
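Both operators can be applied without forming any inverse. The sketch below is an illustration on a dense list-of-lists matrix, with function names chosen for the example; it is not taken from the authors' code. Jacobi is an independent division per entry, while the symmetric Gauss–Seidel operator as written on the slide needs a backward and a forward triangular solve, each of which carries a sequential dependence from row to row.

```python
def jacobi_apply(A, r):
    """z = D^{-1} r: one independent division per entry."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

def sgs_apply(A, r):
    """z = (L + D)^{-1} D (L + D)^{-T} r for symmetric A = L + D + L^T."""
    n = len(r)
    # Backward substitution: solve (L + D)^T u = r.
    u = [0.0] * n
    for i in reversed(range(n)):
        s = r[i] - sum(A[j][i] * u[j] for j in range(i + 1, n))
        u[i] = s / A[i][i]
    # Scale by the diagonal: v = D u.
    v = [A[i][i] * u[i] for i in range(n)]
    # Forward substitution: solve (L + D) z = v.
    z = [0.0] * n
    for i in range(n):
        s = v[i] - sum(A[i][j] * z[j] for j in range(i))
        z[i] = s / A[i][i]
    return z

# Small symmetric positive-definite example matrix (illustrative only).
A_small = [[4.0, -1.0, 0.0],
           [-1.0, 4.0, -1.0],
           [0.0, -1.0, 4.0]]
r_small = [1.0, 2.0, 3.0]
print("Jacobi:", jacobi_apply(A_small, r_small))
print("SGS:   ", sgs_apply(A_small, r_small))
```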
How we chose the preconditioner for PCG
                 unpreconditioned     Jacobi               SGS
conditioning
  isotropic      $3.37 \times 10^4$   $1.35 \times 10^3$   $1.83 \times 10^2$
  anisotropic    $9.68 \times 10^6$   $1.89 \times 10^4$   $2.91 \times 10^3$
iterations
  isotropic      605.53               194.57               78.87
  anisotropic    1,267.85             288.32               122.85
How we chose the preconditioner for PCG
                 SGS          Jacobi       Jacobi
                 1 thread     1 thread     12 threads
time (s)
  isotropic      83.0         80.3         29.2
  anisotropic    128.6        121.6        43.8
Where does the AMG cycle spend most of its time?
[Figure: runtime (s) versus number of threads (1 to 120), with curves for smoothing, PCG, and total.]
How to improve the performance of PCG
while not converged do
    ρ ← σ
    omp parallel for: w ← Ap
    omp parallel for: τ ← w · p
    α ← ρ/τ
    omp parallel for: x ← x + αp
    omp parallel for: r ← r − αw
    omp parallel for: z ← M^{-1} r
    omp parallel for: σ ← z · r
    β ← σ/ρ
    omp parallel for: p ← z + βp
end while
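For reference, the loop above can be written as a serial routine. This is a minimal sketch that follows the same update order and the same scalar names (`rho`, `sigma`, `tau`, `alpha`, `beta`); the initialization before the loop, the relative-residual stopping test, the Jacobi preconditioner, and the test matrix are assumptions of the example, and the OpenMP parallelism is omitted.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, apply_prec, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients; apply_prec(r) returns M^{-1} r."""
    x = [0.0] * len(b)
    r = b[:]                      # residual for the zero initial guess
    z = apply_prec(r)
    p = z[:]
    sigma = dot(z, r)
    r0 = sqrt(dot(r, r))
    iters = 0
    while sqrt(dot(r, r)) > tol * r0 and iters < max_iter:
        rho = sigma
        w = matvec(A, p)
        tau = dot(w, p)
        alpha = rho / tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        z = apply_prec(r)
        sigma = dot(z, r)
        beta = sigma / rho
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        iters += 1
    return x, iters

# Illustrative test problem: a 10 x 10 tridiagonal SPD matrix.
n = 10
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
x, iters = pcg(A, b, jacobi)
print("iterations:", iters)
```

Each iteration contains one matrix-vector product, two dot products, and three vector updates, which are the operations that the OpenMP variants distribute across threads in different ways.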
How to improve the performance of PCG
omp parallel
    while not converged do
        omp single: τ ← 0.0                      (implied barrier)
        omp single nowait: ρ ← σ, σ ← 0.0
        omp for nowait: w ← Ap
        omp for reduction: τ ← w · p             (implied barrier)
        α ← ρ/τ
        omp for nowait: x ← x + αp
        omp for nowait: r ← r − αw
        omp for nowait: z ← M^{-1} r
        omp for reduction: σ ← z · r             (implied barrier)
        β ← σ/ρ
        omp for nowait: p ← z + βp
    end while
end omp parallel
How to improve the performance of PCG
omp parallel
    while not converged do
        omp for: w ← Ap
        omp single
            τ ← w · p
            α ← ρ/τ
            x ← x + αp
            r ← r − αw
            z ← M^{-1} r
            ρ ← σ
            σ ← z · r
            β ← σ/ρ
            p ← z + βp
        end omp single
    end while
end omp parallel
How to improve the performance of PCG
while not converged do
    ρ ← σ
    omp parallel for: w ← Ap
    τ ← w · p
    α ← ρ/τ
    x ← x + αp
    r ← r − αw
    z ← M^{-1} r
    σ ← z · r
    β ← σ/ρ
    p ← z + βp
end while
How to improve the performance of PCG
[Figure: runtime (s) versus number of threads (1 to 120) for Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4.]
How the sparse HSS solver works
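sparse matrix-factorization algorithm
represents the frontal matrices as hierarchically-semiseparable (HSS) matrices
uses randomized sampling for faster compression
[Figure: block layout of an HSS matrix with diagonal blocks D1, D2, D4, D5, D8, D9, D11, D12 and low-rank off-diagonal blocks such as U3 B3 V6^H and U6 B6 V3^H; the bases are nested, with U7 built from U3 R3 and U6 R6.]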
More details in Pieter Ghysels’ talk tomorrow!
How do the parameters of the solver affect performance?
parameter                   values
coarse solver               HSS, PCG
elements per agglomerate    64, 128, 256, 512
$\nu_P$                     0, 1, 2
$\nu_{M^{-1}}$              1, 3, 5
$\theta$                    $0.001$, $0.001 \times 10^{0.5}$, $0.01$
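If every combination of the listed values is tried, the parameter space has 2 × 4 × 3 × 3 × 3 = 216 configurations. The sketch below enumerates them; the dictionary keys are names invented for this example, and whether the study ran the full Cartesian product is an assumption here.

```python
from itertools import product

# Values copied from the table; key names are hypothetical.
grid = {
    "coarse_solver": ["HSS", "PCG"],
    "elements_per_agglomerate": [64, 128, 256, 512],
    "nu_P": [0, 1, 2],
    "nu_M": [1, 3, 5],
    "theta": [0.001, 0.001 * 10 ** 0.5, 0.01],
}

# One dict per configuration, in the Cartesian product of all value lists.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs), "configurations")
print("first:", configs[0])
```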
How do the parameters of the solver affect performance?
[Figure: runtime (s) versus percentile rank (1% to 64%) of the parameter configurations, with curves for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG); the default configuration is marked.]
What our performance model is
stage                     bytes                                  flops
pre- and post-smooth      $(3\nu + 1)(12\,nz_a + 3 \cdot 8n)$    $2(3\nu + 1)(nz_a + 2n)$
restriction               $12\,nz_a + 12\,nz_p + 3 \cdot 8n$     $2(nz_a + nz_p)$
one coarse solve:
  multiply by $A_c$       $12\,nz_c$                             $2\,nz_c$
  preconditioner          $2 \cdot 8n_c$                         $n_c$
  vector operations       $5 \cdot 8n_c$                         $2 \cdot 5n_c$
interpolation             $12\,nz_p + 8n$                        $2\,nz_p$
stopping criterion        $12\,nz_a + 4 \cdot 8n$                $2(nz_a + n)$
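The table can be turned into two runtime lower bounds by dividing the total bytes by the memory bandwidth and the total flops by the peak flop rate. The sketch below does this row by row. It assumes that the three coarse-solve rows are costs per PCG iteration and multiplies them by an iteration count; the fine-grid sizes, the smoother parameter `nu`, and the iteration count used in the example call are made-up placeholders, while the coarse sizes and the Edison bandwidth and peak rate are the values quoted in the slides.

```python
def amg_cycle_model(n, nz_a, nz_p, n_c, nz_c, nu, coarse_iters):
    """Return {stage: (bytes, flops)}, one entry per row of the table."""
    # Assumption: the coarse-solve rows are per-iteration costs.
    iter_bytes = 12 * nz_c + 2 * 8 * n_c + 5 * 8 * n_c
    iter_flops = 2 * nz_c + n_c + 2 * 5 * n_c
    return {
        "smoothing": ((3 * nu + 1) * (12 * nz_a + 3 * 8 * n),
                      2 * (3 * nu + 1) * (nz_a + 2 * n)),
        "restriction": (12 * nz_a + 12 * nz_p + 3 * 8 * n,
                        2 * (nz_a + nz_p)),
        "coarse solve": (coarse_iters * iter_bytes,
                         coarse_iters * iter_flops),
        "interpolation": (12 * nz_p + 8 * n, 2 * nz_p),
        "stopping criterion": (12 * nz_a + 4 * 8 * n, 2 * (nz_a + n)),
    }

def time_bounds(stages, bandwidth_gbs, peak_gflops):
    """Seconds needed if only memory traffic, or only flops, limited the run."""
    total_bytes = sum(b for b, _ in stages.values())
    total_flops = sum(f for _, f in stages.values())
    return total_bytes / (bandwidth_gbs * 1e9), total_flops / (peak_gflops * 1e9)

stages = amg_cycle_model(
    n=1_000_000, nz_a=27_000_000, nz_p=5_000_000,  # hypothetical fine-grid sizes
    n_c=7_782, nz_c=1_412_840,                     # coarse sizes from the slides
    nu=1, coarse_iters=200,                        # hypothetical
)
memory_bound_s, flops_bound_s = time_bounds(stages, bandwidth_gbs=48.5,
                                            peak_gflops=230.4)
print(f"memory bound: {memory_bound_s:.4f} s per cycle")
print(f"flops bound:  {flops_bound_s:.4f} s per cycle")
```

With roughly 12 bytes moved per 2 flops in the sparse products, the memory bound is the larger of the two for any inputs of this shape.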
What our performance model is
[Figure: runtime (s) versus number of cores (1 to 12), comparing the memory-bound and flops-bound model predictions with the actual runtime.]
Final comments
HSS is an attractive option for solving coarse systems
performance is quite sensitive to parameter tuning
performance model indicates where the bottlenecks are
Thank you!
