Accelerators: the good, the bad, and the ugly
1.
Experts in numerical algorithms and HPC services
Dr Ian Reid
Ian.Reid@nag.co.uk
2.
NAG Introduction
Accelerators – NAG experience
NAG on Intel Xeon Phi
Summary
Agenda
3.
Founded 1970
Not-for-profit organisation
Surpluses fund on-going R&D
Mathematical and Statistical Expertise
Libraries of components
Consulting
HPC Services
Computational Science and Engineering (CSE) support
Procurement advice, market watch, benchmarking
NAG Background
4.
The escalator: want more performance? Just buy the next processor!
That escalator has stopped: to get performance/efficiency we now have to go (massively) parallel
The disruption is prompting a serious look at ‘other’ technologies (and algorithms!)
Even CPUs now have tens of cores
Hybrid, shared-memory and distributed-memory parallelism
Painful whichever way we turn!
Where has my Escalator gone?
5.
Loose definition: hardware on which your software runs better than on your (general-purpose) CPU
Generally NOT an easy win
Significant learning curve and effort
Offload disadvantages…
The good: put some effort in; get a great result!
The bad: put effort in, get an OK result, but learn
lessons which can be re-used (often good!)
The ugly: put significant effort in, get a poor result
and don’t learn anything substantive
Accelerators
6.
The Intel Xeon Phi is a co-processor attached to a
host system via the PCI express bus
Highly parallel architecture
Compiler support for OpenMP parallelism
It has a memory system distinct from the host’s
Several use cases to consider:
Automatic Offloading
Explicit Offloading
Native Applications
Intel Xeon Phi
7.
Relatively easy to take existing OpenMP-based code and port it to the Phi
Tuning for Phi takes some learning and expertise
… but feedback into Xeon code is often very strong
NAG Library for Intel Xeon Phi supports all models
Offload (supports automatic and explicit) and Native libs
Windows version for Intel Xeon Phi now in beta
NAG Experience with Intel Xeon Phi
8.
Offload OpenMP regions to Phi when problem sizes
are above some threshold
Estimating problem size can be complex
Required data is transferred to/from the host before/after executing the OpenMP region
Data transfer takes time, eating into the benefit of running the OpenMP region on the Phi
Transparent to the user of the Library
Just recompile code containing NAG Library function calls
to benefit.
Automatic Offload
9.
All NAG functions can be explicitly offloaded by user
user code modified to include relevant offload statements
allows control of which functions offloaded
Data transfers to the Phi can be decoupled from function offloading, allowing data to remain on the Phi
user responsible for data movement
reduces penalty of offloading data by allowing its use by
multiple offloaded function calls before returning to host
Effort required by the user to re-code application
Explicit Offload
10.
Users may choose to port their entire application
the application is compiled to run natively on the Phi itself
all data lives in the Phi’s own memory, so no offload transfers are needed during the run
the host serves only to launch the application
Effort required by the user to port and tune the whole application
Native Applications
11.
Sandy Bridge CPUs (typically using 32 threads)
Knights Corner Phi coprocessor (typically using 240 threads)
Performance Examples and Lessons
12.
Hierarchical Cluster Analysis (g03ec)
[Chart: time (s) against problem size n; series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=30k; m=3k
Xeon 32t: 1,412s
Phi 240t*: 1,259s
Xeon 32t*: 1,073s
For this size problem
best to stay on CPU
but take the 25%!
15.
Maximum Likelihood Estimates (g03ca)
[Chart: time (s) against problem size (weighted); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=2500; m=2500;
nfac=30; nvar=200
Xeon 32t: 256s
Phi 240t*: 53.6s
Xeon 32t*: 54.7s
Phi gain 4x, but also
Xeon speed-up (green
line under red)
16.
Solve real symmetric positive definite simultaneous linear equations using iterative refinement (f04af)
[Chart: time (s) against problem size n; series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=6,000; nrhs=1,000
Xeon 32t: 171s
Phi 240t*: 66s
Xeon 32t*: 86s
Phi gain ~1.3x (~3x
original)
17.
Parallelism is a real issue we all face
Exciting for some. Challenging for others!
Accelerators are interesting and can offer spectacular wins
The Intel Phi claims less spectacular performance gains
but demands less effort than other accelerators
… and the work often repays on the CPU as well!
Acid test is always solving your (complete) problem!
NAG can help you try out this technology
NAG Library for Phi
NAG expertise
Summary