Accelerators: the good, the bad, and the ugly
1.
Experts in numerical algorithms and HPC services
Dr Ian Reid
Ian.Reid@nag.co.uk
2.
NAG Introduction
Accelerators – NAG experience
NAG on Intel Xeon Phi
Summary
Agenda
3.
Founded 1970
Not-for-profit organisation
Surpluses fund on-going R&D
Mathematical and Statistical Expertise
Libraries of components
Consulting
HPC Services
Computational Science and Engineering (CSE) support
Procurement advice, market watch, benchmarking
NAG Background
4.
The escalator: want more performance? Just buy the next processor!
That escalator has stopped: to get performance/efficiency we now have to go (massively) parallel
The disruption is prompting a serious look at ‘other’ technologies (and algorithms!)
Even CPUs now have tens of cores
Hybrid, shared-memory and distributed-memory parallelism
Painful whichever way we turn!
Where has my Escalator gone?
5.
Loose definition: hardware on which your software runs better than on your (general-purpose) CPU
Generally NOT an easy win
Significant learning curve and effort
Offload disadvantages…
The good: put some effort in; get a great result!
The bad: put effort in, get an OK result, but learn
lessons which can be re-used (often good!)
The ugly: put significant effort in, get a poor result
and don’t learn anything substantive
Accelerators
6.
The Intel Xeon Phi is a co-processor attached to a
host system via the PCI express bus
Highly parallel architecture
Compiler support for OpenMP parallelism
It has a memory system distinct from the host’s
Several use cases to consider:
Automatic Offloading
Explicit Offloading
Native Applications
Intel Xeon Phi
7.
Relatively easy to take existing OpenMP-based code and port it to the Phi
Tuning for Phi takes some learning and expertise
… but feedback into Xeon code is often very strong
NAG Library for Intel Xeon Phi supports all models
Offload (supports automatic and explicit) and Native libs
Windows version for Intel Xeon Phi now in beta
NAG Experience with Intel Xeon Phi
8.
Offload OpenMP regions to Phi when problem sizes
are above some threshold
Estimating problem size can be complex
Required data is transferred to/from the host before/after executing the OpenMP region
Data transfer takes time, eating into the benefit of running the OpenMP region on the Phi
Transparent to the user of the Library
Just recompile code containing NAG Library function calls
to benefit.
Automatic Offload
9.
All NAG functions can be explicitly offloaded by user
user code modified to include relevant offload statements
allows control of which functions offloaded
Data transfers to the Phi can be decoupled from function offloading, allowing data to remain on the Phi
user responsible for data movement
reduces penalty of offloading data by allowing its use by
multiple offloaded function calls before returning to host
Effort required by the user to re-code application
Explicit Offload
10.
Users may choose to port their entire application
the application is compiled to run natively on the Phi itself
all data lives in the Phi’s own memory, so no offload transfers are needed during the run
the host serves only to launch the application
Effort required by the user to port and tune the whole application
Native Applications
11.
Sandy Bridge CPUs (typically using 32 threads)
Knights Corner Phi coprocessor (typically using 240 threads)
Performance Examples and Lessons
12.
Hierarchical Cluster Analysis (g03ec)
[Chart: time (s) against problem size n; series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=30k; m=3k
Xeon 32t: 1,412s
Phi 240t*: 1,259s
Xeon 32t*: 1,073s
For this size problem
best to stay on CPU
but take the 25%!
15.
Maximum Likelihood Estimates (g03ca)
[Chart: time (s) against problem size (weighted); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=2500; m=2500;
nfac=30; nvar=200
Xeon 32t: 256s
Phi 240t*: 53.6s
Xeon 32t*: 54.7s
Phi gain 4x, but also
Xeon speed-up (green
line under red)
16.
Solve real symmetric positive definite simultaneous linear equations using iterative refinement (f04af)
[Chart: time (s) against problem size n; series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
n=6,000; nrhs=1,000
Xeon 32t: 171s
Phi 240t*: 66s
Xeon 32t*: 86s
Phi gain ~1.3x (~3x
original)
17.
Parallelism is a real issue we all face
Exciting for some. Challenging for others!
Accelerators are interesting and can offer spectacular wins
The Intel Phi claims less spectacular performance gains
but demands less effort than other accelerators
… and the work often repays on the CPU as well!
Acid test is always solving your (complete) problem!
NAG can help you try out this technology
NAG Library for Phi
NAG expertise
Summary