7. [Processor block diagram: Cores 0-11, each with a private L2 cache, a shared L3 cache, two memory controllers, and HyperTransport links.]
8. A cache line is 64B
The caches are "victim caches": all references go to L1 immediately and are evicted down through the cache levels
A cache line is usually present in only one level of cache
The hardware prefetcher detects forward and backward strides through memory
Each core can perform a 128b add and a 128b multiply per clock cycle
This requires SSE packed instructions: "stride-one vectorization" (see the sketch below)
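To make the stride-one point concrete, the sketch below walks the leading (contiguous) dimension of an array so consecutive iterations fall in the same 64B line, letting the prefetcher help and the compiler emit packed SSE. The routine and array names are illustrative, not from the slides.

    ! Unit-stride ("stride-one") loop: i runs over the leading dimension,
    ! so successive iterations touch adjacent 8-byte elements and the
    ! compiler can vectorize with packed SSE instructions.
    subroutine scale_columns(a, s, n, m)
      implicit none
      integer, intent(in)    :: n, m
      real(8), intent(inout) :: a(n,m)
      real(8), intent(in)    :: s(m)
      integer :: i, j
      do j = 1, m
        do i = 1, n
          a(i,j) = s(j) * a(i,j)
        end do
      end do
    end subroutine scale_columns

Swapping the i and j loops would stride through memory by n elements per iteration, defeating both the prefetcher and vectorization.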
10. Microkernel on Compute PEs, full-featured Linux on Service PEs
Service PEs specialize by function: Login PE, Network PE, System PE, I/O PE
The Service Partition is built from specialized Linux nodes
The Compute PE software architecture eliminates OS "jitter" and enables reproducible run times
Large machines boot in under 30 minutes, including the filesystem
11. [System diagram: compute nodes on the X/Y/Z torus; login and network nodes attached via GigE and 10 GigE; boot/syslog/database nodes; I/O and metadata nodes connected through Fibre Channel to the RAID subsystem; SMW for system management.]
12. Cray XT5 systems ship with the SeaStar2+ interconnect, now scaled to 225,000 cores
Custom ASIC with integrated NIC / router
DMA engine and MPI offload engine (PowerPC 440 processor with local memory)
HyperTransport interface to the node
6-port router
Connectionless protocol with link-level reliability
Blade control processor interface
Proven scalability to 225,000 cores
14. Node characteristics:
Number of Cores: 16 or 24 (MC), 32 (IL)
Peak Performance: 153 Gflops/sec (MC-8, 2.4 GHz); 211 Gflops/sec (MC-12, 2.2 GHz)
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 83.5 GB/sec direct connect memory
6.4 GB/sec direct connect HyperTransport
Interconnect: Cray SeaStar2+
15. [Magny-Cours package diagram: each Opteron die has six "Greyhound" cores, a 6MB L3 cache, two DDR3 channels, and HT3 links; HT1/HT3 to the interconnect.]
2 Multi-Chip Modules, 4 Opteron Dies
8 Channels of DDR3 Bandwidth to 8 DIMMs
24 (or 16) Computational Cores, 24 MB of L3 cache
Dies are fully connected with HT3
Snoop Filter feature allows the 4-die SMP to scale well
16. Without the snoop filter, a STREAM test achieves 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
17. With the snoop filter, a STREAM test achieves 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.
• This feature will be key for two-socket Magny-Cours nodes, which have the same architecture.
18. New compute blade with 8 AMD
Magny Cours processors
Plug-compatible with XT5 cabinets
and backplanes
Initially will ship with SeaStar
interconnect as the Cray XT6
Upgradeable to Gemini
Interconnect or Cray XE6
Upgradeable to AMD’s
“Interlagos” series
XT6 systems will continue to ship
with the current SIO blade
First customer ship, March 31st
20. Supports 2 nodes per ASIC
168 GB/sec routing capacity
Scales to over 100,000 network endpoints
Link-level reliability and adaptive routing
Advanced resiliency features
Provides a global address space
Advanced NIC designed to efficiently support MPI, one-sided MPI, Shmem, UPC, and Coarray Fortran
[Gemini ASIC diagram: two HyperTransport 3 interfaces, NIC 0 and NIC 1 connected through a Netlink block to a 48-port YARC router.]
21. Cray Baker node characteristics:
Number of Cores: 16 or 24
Peak Performance: 140 or 210 Gflops/s
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 85 GB/sec
[Diagram: 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus; high-radix YARC router with adaptive routing; 168 GB/sec routing capacity.]
22. [Diagram: a module with SeaStar alongside a module with Gemini, showing the X/Y/Z torus connections.]
23. [Gemini NIC block diagram: HT3 Cave, FMA, BTE, ORB, NPT, CQ, NAT, AMO, RMT, RAT, LB ring, and Netlink blocks feeding the router tiles.]
FMA (Fast Memory Access)
Mechanism for most MPI transfers
Supports tens of millions of MPI requests per second
BTE (Block Transfer Engine)
Supports asynchronous block transfers between local and remote memory, in either direction
For use for large MPI transfers that happen in the background
24. Two Gemini ASICs are
packaged on a pin-compatible
mezzanine card
Topology is a 3-D torus
Each lane of the torus is
composed of 4 Gemini router
“tiles”
Systems with SeaStar
interconnects can be upgraded
by swapping this card
100% of the 48 router tiles on
each Gemini chip are used
25. Like SeaStar, Gemini has a DMA offload engine
allowing large transfers to proceed
asynchronously
Gemini provides low-overhead OS-bypass features for short transfers
MPI latency targeted at ~ 1us
NIC provides for many millions of MPI messages per second
“Hybrid” programming not a requirement for performance
RDMA provides a much improved one-sided communication mechanism
AMOs provide a faster synchronization method for barriers
Gemini supports adaptive routing, which
Reduces problems with network hot spots
Allows MPI to survive link failures
26. Globally addressable memory provides efficient
support for UPC, Co-array FORTRAN, Shmem and
Global Arrays
Cray Programming Environment will target this capability
directly
Pipelined global loads and stores
Allows for fast irregular communication patterns
Atomic memory operations
Provides fast synchronization needed for one-sided
communication models
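As a concrete picture of what the globally addressable memory enables, here is a minimal Coarray Fortran sketch of a one-sided put to a neighbor image; the array and its size are illustrative, and on Cray systems it would be built with the ftn wrapper under PrgEnv-cray.

    program caf_halo
      implicit none
      real(8) :: halo(4)[*]        ! coarray: one copy on every image
      integer :: me, np, right
      me = this_image()
      np = num_images()
      right = merge(1, me + 1, me == np)
      halo = 0.0d0
      sync all
      ! One-sided put: write this image's rank into its right neighbor's halo.
      halo(:)[right] = real(me, 8)
      sync all
      print *, 'image', me, 'received', halo(1)
    end program caf_halo

The remote assignment compiles to pipelined global stores rather than matched send/receive pairs, which is what makes irregular communication patterns cheap on this hardware.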
27. Gemini will represent a large improvement over SeaStar in
terms of reliability and serviceability
Adaptive Routing – multiple paths to the same destination
Allows mapping around bad links without rebooting
Supports warm-swap of blades
Prevents hot spots
Reliable Transport of Messages
Packet level CRC carried from start to finish
Large blocks of memory protected by ECC
Can better handle failures on the HT-link, discards packets instead of putting backpressure
into the network
Supports end-to-end reliable communication (used by MPI)
Improved error reporting and handling
The low overhead error reporting allows the programming model to replay failed
transactions
Performance counters allowing tracking of app specific packets
30. [Cabinet airflow diagram: alternating low-velocity and high-velocity airflow regions.]
31. Cool air is released into the computer room
Liquid in, liquid/vapor mixture out
The hot air stream passes through an evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation)
R134a absorbs energy only in the presence of heated air
Phase change is 10x more efficient than pure water cooling
36. 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
Unable to take advantage of file system parallelism
Access to multiple disks adds overhead which hurts performance
[Chart: Single Writer Write Performance, Write (MB/s) vs. Lustre stripe count (1-160) for 1 MB and 32 MB stripe sizes; Figure 3 in the notes describes this chart.]
37. Single OST, 256 MB file size
Performance can be limited by the process (transfer size) or the file system (stripe size)
[Chart: Single Writer, Transfer vs. Stripe Size, Write (MB/s) vs. Lustre stripe size (1-128 MB) for 32 MB, 8 MB, and 1 MB transfers; Figure 4 in the notes describes this chart.]
38. Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size
lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4 MB stripe)
lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16 MB stripe)
export MPICH_MPIIO_HINTS='*: striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written
Set the striping before copying in files (an MPI-IO hints sketch follows)
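Striping can also be requested programmatically when a file is created through MPI-IO. A minimal sketch, assuming the standard ROMIO hint names (striping_factor, striping_unit) are honored by Cray MPT; the file name and values are illustrative, and hints only take effect when the file is created.

    program set_stripe_hints
      use mpi
      implicit none
      integer :: info, fh, ierr
      call MPI_Init(ierr)
      ! Ask MPI-IO to create the file with 160 OSTs and a 4 MB stripe.
      call MPI_Info_create(info, ierr)
      call MPI_Info_set(info, 'striping_factor', '160', ierr)
      call MPI_Info_set(info, 'striping_unit', '4194304', ierr)
      call MPI_File_open(MPI_COMM_WORLD, 'output.dat',               &
                         ior(MPI_MODE_CREATE, MPI_MODE_WRONLY),      &
                         info, fh, ierr)
      ! ... collective writes go here ...
      call MPI_File_close(fh, ierr)
      call MPI_Info_free(info, ierr)
      call MPI_Finalize(ierr)
    end program set_stripe_hints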
40. Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
Fortran Compiler: ftn
C Compiler: cc
C++ Compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
Cray MPT (MPI, Shmem, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
…
The underlying compiler is chosen via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly
Always load the appropriate xtpe-<arch> module for your machine
Enables the proper compiler target
Links optimized math libraries
41.
42. Traditional (scalar) optimizations are controlled via -O# compiler flags
Default: -O2
More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with -Mipa=fast
See man pgf90, man pgcc, or man pgCC for more information about compiler options.
43. Compiler feedback is enabled with -Minfo and -Mneginfo
This can provide valuable information about what optimizations were or were not done, and why.
To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
It's possible to disable optimizations included in -fast if you believe one is causing problems
For example: -fast -Mnolre enables -fast and then disables loop-carried redundancy elimination
To get more information about any compiler flag, add -help with the flag in question
pgf90 -help -fast will give more information about the -fast flag
OpenMP is enabled with the -mp flag
44. Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the compiler can also be made to enforce strict accuracy.
-Kieee: All FP math strictly conforms to IEEE 754 (off by default)
-Ktrap: Turns on processor trapping of FP exceptions
-Mdaz: Treat all denormalized numbers as zero
-Mflushz: Set SSE to flush-to-zero (on with -fast)
-Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
45.
46. Cray has a long tradition of high performance compilers on Cray
platforms (Traditional vector, T3E, X1, X2)
Vectorization
Parallelization
Code transformation
More…
Investigated leveraging an open source compiler called LLVM
First release December 2008
47. [Cray compiler architecture diagram:]
Fortran source and C/C++ source feed the Fortran Front End and the C & C++ Front End
The C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support
Interprocedural Analysis, then Optimization and Parallelization (Cray Inc. compiler technology)
X86 Code Generator and Cray X2 Code Generator produce the object file
X86 code generation comes from open-source LLVM, with additional Cray-developed optimizations and interface support
48. Standard conforming languages and programming models
Fortran 2003
UPC & CoArray Fortran
Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
GASNet with Portals
DMAPP with Gemini & Aries
Ability and motivation to provide high-quality support for custom
Cray network hardware
Cray technology focused on scientific applications
Takes advantage of Cray’s extensive knowledge of automatic
vectorization
Takes advantage of Cray’s extensive knowledge of automatic
shared memory parallelization
Supplements, rather than replaces, the available compiler
choices
49. Make sure it is available
module avail PrgEnv-cray
To access the Cray compiler
module load PrgEnv-cray
To target the various chips
module load xtpe-[barcelona,shanghai,istanbul]
Once you have loaded the module, "cc" and "ftn" are the Cray compilers
Recommend just using the default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
50. Excellent Vectorization
Vectorize more loops than other compilers
OpenMP 3.0
Task and Nesting
PGAS: Functional UPC and CAF available today
C++ Support
Automatic Parallelization
Modernized version of Cray X1 streaming capability
Interacts with OMP directives
Cache optimizations
Automatic Blocking
Automatic Management of what stays in cache
Prefetching, Interchange, Fusion, and much more…
51. Loop Based Optimizations
Vectorization
OpenMP
Autothreading
Interchange
Pattern Matching
Cache blocking/ non-temporal / prefetching
Fortran 2003 Standard; working on 2008
PGAS (UPC and Co-Array Fortran)
Some performance optimizations available in 7.1
Optimization Feedback: Loopmark
52. Cray compiler supports a full and growing set of directives
and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
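A minimal sketch of how one of these directives is used in practice; the routine, arrays, and indirect-addressing pattern are illustrative. !dir$ ivdep placed immediately before a loop asserts that the apparent dependence (here, through the index array) does not prevent vectorization.

    subroutine gather_add(y, x, idx, n)
      implicit none
      integer, intent(in)    :: n, idx(n)
      real(8), intent(in)    :: x(*)
      real(8), intent(inout) :: y(n)
      integer :: i
      ! The indirect addressing would normally make the compiler assume a
      ! possible dependence; ivdep tells it the loop is safe to vectorize.
    !dir$ ivdep
      do i = 1, n
        y(i) = y(i) + x(idx(i))
      end do
    end subroutine gather_add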
53. The compiler can generate a filename.lst file.
Contains an annotated listing of your source code with letters indicating important optimizations
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type:
A - Pattern matched    C - Collapsed    D - Deleted    E - Cloned    I - Inlined
M - Multithreaded    P - Parallel/Tasked    V - Vectorized    W - Unwound
Modifiers:
a - vector atomic memory operation    b - blocked    f - fused    i - interchanged
m - streamed but not partitioned    p - conditional, partial and/or computed
r - unrolled    s - shortloop    t - array syntax temp used    w - unwound
54. • ftn -rm … or cc -hlist=m …
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr > - a(0) * u(i1,i2,i3)
40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
55. ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
56. -hbyteswapio
Link-time option
Applies to all unformatted Fortran I/O
Assign command
With the PrgEnv-cray module loaded, do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
Can use assign to be more precise
57. OpenMP is ON by default
Optimizations controlled by -Othread#
To shut it off, use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default;
-hautothread to turn it on
Modernized version of the Cray X1 streaming capability
Interacts with OMP directives
If you do not want OpenMP but have OMP directives in the code, make sure to compile with OpenMP turned off
58.
59. Traditional model
Tuned general purpose codes
Only good for dense
Not problem sensitive
Not architecture sensitive
60. Goal of scientific libraries
Improve Productivity at optimal performance
Cray use four concentrations to achieve this
Standardization
Use standard or “de facto” standard interfaces whenever available
Hand tuning
Use extensive knowledge of target processor and network to optimize common code
patterns
Auto-tuning
Automate code generation and a huge number of empirical performance evaluations
to configure software to the target platforms
Adaptive Libraries
Make runtime decisions to choose the best kernel/library/routine
61. Three separate classes of standardization, each with a corresponding
definition of productivity
1. Standard interfaces (e.g., dense linear algebra)
Bend over backwards to keep everything the same despite increases in machine complexity.
Innovate ‘behind-the-scenes’
Productivity -> innovation to keep things simple
2. Adoption of near-standard interfaces (e.g., sparse kernels)
Assume near-standards and promote those. Out-mode alternatives. Innovate ‘behind-the-scenes’
Productivity -> innovation in the simplest areas
(requires the same innovation as #1 also)
3. Simplification of non-standard interfaces (e.g., FFT)
Productivity -> innovation to make things simpler than they are
62. Algorithmic tuning
Increased performance by exploiting algorithmic improvements
Sub-blocking, new algorithms
LAPACK, ScaLAPACK
Kernel tuning
Improve the numerical kernel performance in assembly language
BLAS, FFT
Parallel tuning
Exploit Cray’s custom network interfaces and MPT
ScaLAPACK, P-CRAFFT
64. Serial and Parallel versions of sparse iterative linear solvers
Suites of iterative solvers
CG, GMRES, BiCG, QMR, etc.
Suites of preconditioning methods
IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
Support block sparse matrix data format for better performance
Interface to external packages (ScaLAPACK, SuperLU_DIST)
Fortran and C support
Newton-type nonlinear solvers
Large user community
DoE Labs, PSC, CSCS, CSC, ERDC, AWE and more.
http://www-unix.mcs.anl.gov/petsc/petsc-as
65. Cray provides state-of-the art scientific computing packages to strengthen
the capability of PETSc
Hypre: scalable parallel preconditioners
AMG (Very scalable and efficient for specific class of problems)
2 different ILU (General purpose)
Sparse Approximate Inverse (General purpose)
ParMetis: parallel graph partitioning package
MUMPS: parallel multifrontal sparse direct solver
SuperLU: sequential version of SuperLU_DIST
To use Cray-PETSc, load the appropriate module :
module load petsc
module load petsc-complex
(no need to load a compiler specific module)
Treat the Cray distribution as your local PETSc installation
66. The Trilinos Project http://trilinos.sandia.gov/
“an effort to develop algorithms and enabling technologies within an
object-oriented software framework for the solution of large-scale,
complex multi-physics engineering and scientific problems”
A unique design feature of Trilinos is its focus on packages.
Very large user-base and growing rapidly. Important to DOE.
Cray’s optimized Trilinos released on January 21
Includes 50+ trilinos packages
Optimized via CASK
Any code that uses Epetra objects can access the optimizations
Usage :
module load trilinos
67. CASK is a product developed at Cray using the
Cray Auto-tuning Framework (Cray ATF)
The CASK Concept :
Analyze matrix at minimal cost
Categorize matrix against internal classes
Based on offline experience, find best CASK code for particular matrix
Previously assigned "best" compiler flags to CASK code
Assign best CASK kernel and perform Ax
CASK silently sits beneath PETSc on Cray systems
Trilinos support coming soon
Released with PETSc 3.0 in February 2009
Generic and blocked CSR format
68. [Software stack diagram:]
Large-scale application: highly portable, user controlled (all systems)
PETSc / Trilinos / Hypre: highly portable, user controlled (all systems)
CASK: Cray only; XT4 & XT5 specific / tuned; invisible to the user
69. [Chart: speedup on parallel SpMV on 8 cores for 60 different matrices; x-axis Matrix ID# 0-60, y-axis speedup 1.0-1.4.]
70. [Two charts comparing the performance of CASK vs. PETSc, N=65,536 to 67,108,864, GFlops vs. number of cores (0-1024):
Block Jacobi preconditioning: BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc
SpMV: MatMult-CASK vs. MatMult-PETSc]
72. [Chart: geometric mean of MFlops over 80 sparse matrix instances from the U. of Florida collection, for 1-8 vectors; CASK vs. original Trilinos.]
73. In FFTs, the problems are
Which library to choose?
How to use complicated interfaces (e.g., FFTW)
Standard FFT practice
Do a plan stage
Deduce machine and system information and run micro-kernels
Select the best FFT strategy
Do an execute
Our system knowledge can remove some of this cost!
74. CRAFFT is designed with simple-to-use interfaces
Planning and execution stage can be combined into one function call
Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
CRAFFT provides both offline and online tuning
Offline tuning
Which FFT kernel to use
Pre-computed PLANs for common-sized FFT
No expensive plan stages
Online tuning is performed as necessary at runtime as well
At runtime, CRAFFT will adaptively select the best FFT kernel to use based
on both offline and online testing (e.g. FFTW, Custom FFT)
76. 1. Load module fftw/3.2.0 or higher.
2. Add a Fortran statement "use crafft"
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments (shown in red on the original slide)
In-place, implicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
In-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
Out-of-place, explicit memory management:
crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
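Putting the four steps together, a minimal sketch of a serial in-place 3-D complex-to-complex CRAFFT call using the implicit-memory form above; the array size, sign convention, and leading-dimension values are illustrative assumptions, not taken from the slides.

    program crafft_demo
      use crafft
      implicit none
      integer, parameter :: n1 = 64, n2 = 64, n3 = 64
      complex(8) :: input(n1, n2, n3)
      integer :: isign
      call crafft_init()
      input = (1.0d0, 0.0d0)      ! something to transform
      isign = -1                  ! forward transform (FFTW-style sign convention assumed)
      ! In-place 3-D z2z transform with implicit memory management;
      ! the leading dimensions here are simply n1 and n2.
      call crafft_z2z3d(n1, n2, n3, input, n1, n2, isign)
    end program crafft_demo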
77. As of December 2009, CRAFFT includes distributed parallel transforms
Uses the CRAFFT interface prefixed by “p”, with optional arguments
Can provide performance improvement over FFTW 2.1.5
Currently implemented
complex-complex
Real-complex and complex-real
3-d and 2-d
In-place and out-of-place
Upcoming
C language support for serial and parallel
78. 1. Add “use crafft” to Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI initialized and data distributed (see manpage)
4. Call crafft, e.g. (optional arguments in red)
2-d complex-complex, in-place, internal mem management :
call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
2-d complex-complex, in-place with no internal memory :
call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
2-d complex-complex, out-of-place, internal mem manager :
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
2-d complex-complex, out-of-place, no internal memory :
crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has manpage. Also see 3d equivalent :
man crafft_pz2z3d
80. Solves linear systems in single precision
Obtaining solutions accurate to double precision
For well conditioned problems
Serial and Parallel versions of LU, Cholesky, and QR
2 usage methods
IRT Benchmark routines
Uses IRT 'under-the-covers' without changing your code
Simply set an environment variable
Useful when you cannot alter source code
Advanced IRT API
If greater control of the iterative refinement process is required
Allows
condition number estimation
error bounds return
minimization of either forward or backward error
'fall back' to full precision if the condition number is too high
max number of iterations can be altered by users
81. “High Power Electromagnetic Wave
Heating in the ITER Burning Plasma’’
rf heating in tokamak
Maxwell-Boltzmann Eqns
FFT
Dense linear system
Calc Quasi-linear op
Courtesy
Richard Barrett
83. Decide if you want to use advanced API or benchmark API
benchmark API :
setenv IRT_USE_SOLVERS 1
Advanced API :
1. locate the factor and solve in your code (LAPACK or ScaLAPACK)
2. Replace factor and solve with a call to IRT routine
e.g. dgesv -> irt_lu_real_serial
e.g. pzgesv -> irt_lu_complex_parallel
e.g. pzposv -> irt_po_complex_parallel
3. Set advanced arguments
Forward error convergence for most accurate solution
Condition number estimate
“fall-back” to full precision if condition number too high
84. LibSci 10.4.2 February 18th 2010
OpenMP-aware LibSci
Allows calling of BLAS inside or outside a parallel region (see the sketch after this list)
Single library supported
No multi-thread library and single thread library (-lsci and –lsci_mp)
Performance not compromised
(there were some usage restrictions with this version)
LibSci 10.4.3 April 2010
Parallel CRAFFT improvements
Fixes usage restrictions of 10.4.2
OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
Upcoming
PETSc 3.1.0 May 20
Trilinos 10.2 May 20
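A minimal sketch of what "inside or outside a parallel region" means for a LibSci BLAS call; the matrix sizes, block width, and the assumption that OMP_NUM_THREADS governs LibSci's internal threading are illustrative.

    program libsci_omp_demo
      implicit none
      integer, parameter :: n = 1024
      integer :: j, nb
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0

      ! Outside a parallel region: this single DGEMM may be threaded
      ! internally by LibSci, using up to OMP_NUM_THREADS threads.
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      ! Inside a parallel region: each thread makes its own call, here on
      ! a distinct block of columns of B and C, so the calls run concurrently.
    !$omp parallel do private(j, nb) schedule(static)
      do j = 1, n, 256
        nb = min(256, n - j + 1)
        call dgemm('N', 'N', n, nb, n, 1.0d0, a, n, b(1, j), n, 0.0d0, c(1, j), n)
      end do
    !$omp end parallel do
    end program libsci_omp_demo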
99. biolib: Cray Bioinformatics library routines
blacs: Basic Linear Algebra Communication Subprograms
blas: Basic Linear Algebra Subprograms
caf: Co-Array Fortran (Cray X2 systems only)
fftw: Fast Fourier Transform library (64-bit only)
hdf5: manages extremely large and complex data collections
heap: dynamic heap
io: includes stdio and sysio groups
lapack: Linear Algebra Package
lustre: Lustre File System
math: ANSI math
mpi: MPI
netcdf: network common data form (manages array-oriented scientific data)
omp: OpenMP API (not supported on Catamount)
omp-rtl: OpenMP runtime library (not supported on Catamount)
portals: Lightweight message passing API
pthreads: POSIX threads (not supported on Catamount)
scalapack: Scalable LAPACK
shmem: SHMEM
stdio: all library functions that accept or return the FILE* construct
sysio: I/O system calls
system: system calls
upc: Unified Parallel C (Cray X2 systems only)
100. 0: Summary with instruction metrics
1: Summary with TLB metrics
2: L1 and L2 metrics
3: Bandwidth information
4: HyperTransport information
5: Floating point mix
6: Cycles stalled, resources idle
7: Cycles stalled, resources full
8: Instructions and branches
9: Instruction cache
10: Cache hierarchy
11: Floating point operations mix (2)
12: Floating point operations mix (vectorization)
13: Floating point operations mix (SP)
14: Floating point operations mix (DP)
15: L3 (socket-level)
16: L3 (core-level reads)
17: L3 (core-level misses)
18: L3 (core-level fills caused by L2 evictions)
19: Prefetches
101. Regions, useful to break up long routines
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/Enable Profiling, useful for excluding initialization
int PAT_record (int state)
Flush buffer, useful when program isn’t exiting cleanly
int PAT_flush_buffer (void)
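A minimal sketch of using this API from Fortran to exclude initialization and label the solver phase. It assumes the Fortran binding mirrors the C prototypes above with an added status argument and that pat_apif.h supplies the PAT_STATE_ON/OFF constants (check the pat API man pages); init_data and solver are hypothetical routines.

    program craypat_regions
      implicit none
      include "pat_apif.h"                    ! assumed CrayPat API header (PAT_STATE_ON/OFF)
      integer :: istat

      call PAT_record(PAT_STATE_OFF, istat)   ! exclude initialization from profiling
      call init_data()
      call PAT_record(PAT_STATE_ON, istat)

      call PAT_region_begin(1, "solver", istat)   ! label this phase in the report
      call solver()
      call PAT_region_end(1, istat)

    contains
      subroutine init_data()   ! hypothetical setup work
      end subroutine init_data
      subroutine solver()      ! hypothetical compute phase
      end subroutine solver
    end program craypat_regions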
104. Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc.
During tracing and sampling
126. Poor loop order results in poor striding: the inner-most loop strides on a slow dimension of each array. The best the compiler can do is unroll. Little to no cache reuse.
 55. 1              ii = 0
 56. 1 2----------< do b = abmin, abmax
 57. 1 2 3--------<   do j=ijmin, ijmax
 58. 1 2 3              ii = ii+1
 59. 1 2 3              jj = 0
 60. 1 2 3 4------<     do a = abmin, abmax
 61. 1 2 3 4 r8---<       do i = ijmin, ijmax
 62. 1 2 3 4 r8             jj = jj+1
 63. 1 2 3 4 r8             f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 64. 1 2 3 4 r8             f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 65. 1 2 3 4 r8             f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 66. 1 2 3 4 r8             f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
 67. 1 2 3 4 r8--->       end do
 68. 1 2 3 4------>     end do
 69. 1 2 3-------->   end do
 70. 1 2----------> end do
127. Poor loop order results in poor cache reuse: for every L1 cache hit there are two misses. Overall, only 2/3 of all references were found in level 1 or 2 cache.
USER / #1.Original Loops
------------------------------------------------------------
  Time%                                        55.0%
  Time                                     13.938244 secs
  Imb.Time                                  0.075369 secs
  Imb.Time%                                      0.6%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED  11.858M/sec    165279602 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     11.931M/sec    166291054 fills
  PAPI_L1_DCM               23.499M/sec    327533338 misses
  PAPI_L1_DCA               34.635M/sec    482751044 refs
  User time (approx)        13.938 secs  36239439807 cycles  100.0%Time
  Average Time per Call                    13.938244 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    32.2% hits     67.8% misses
  D2 cache hit,miss ratio     49.8% hits     50.2% misses
  D1+D2 cache hit,miss ratio  66.0% hits     34.0% misses
128.
129.
130. Reordered loop nest: now the inner-most loop is stride-1 on both arrays, so memory accesses proceed along cache lines, allowing reuse. The compiler is able to vectorize and make better use of SSE instructions.
 75. 1 2----------< do i = ijmin, ijmax
 76. 1 2              jj = 0
 77. 1 2 3--------<   do a = abmin, abmax
 78. 1 2 3 4------<     do j=ijmin, ijmax
 79. 1 2 3 4              jj = jj+1
 80. 1 2 3 4              ii = 0
 81. 1 2 3 4 Vcr2-<       do b = abmin, abmax
 82. 1 2 3 4 Vcr2           ii = ii+1
 83. 1 2 3 4 Vcr2           f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 84. 1 2 3 4 Vcr2           f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 85. 1 2 3 4 Vcr2           f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 86. 1 2 3 4 Vcr2           f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
 87. 1 2 3 4 Vcr2->       end do
 88. 1 2 3 4------>     end do
 89. 1 2 3-------->   end do
 90. 1 2----------> end do
131. Improved striding greatly improved cache reuse; runtime was cut nearly in half. Still, some 20% of all references are cache misses.
USER / #2.Reordered Loops
------------------------------------------------------------
  Time%                                        31.4%
  Time                                      7.955379 secs
  Imb.Time                                  0.260492 secs
  Imb.Time%                                      3.8%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   0.419M/sec      3331289 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     15.285M/sec    121598284 fills
  PAPI_L1_DCM               13.330M/sec    106046801 misses
  PAPI_L1_DCA               66.226M/sec    526855581 refs
  User time (approx)         7.955 secs  20684020425 cycles  100.0%Time
  Average Time per Call                     7.955379 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    79.9% hits     20.1% misses
  D2 cache hit,miss ratio      2.7% hits     97.3% misses
  D1+D2 cache hit,miss ratio  80.4% hits     19.6% misses
132. First loop, partially vectorized and unrolled by 4:
  95. 1              ii = 0
  96. 1 2----------< do j = ijmin, ijmax
  97. 1 2 i--------<   do b = abmin, abmax
  98. 1 2 i              ii = ii+1
  99. 1 2 i              jj = 0
 100. 1 2 i i------<     do i = ijmin, ijmax
 101. 1 2 i i Vpr4-<       do a = abmin, abmax
 102. 1 2 i i Vpr4           jj = jj+1
 103. 1 2 i i Vpr4           f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 104. 1 2 i i Vpr4           f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 105. 1 2 i i Vpr4->       end do
 106. 1 2 i i------>     end do
 107. 1 2 i-------->   end do
 108. 1 2----------> end do
Second loop, vectorized and unrolled by 4:
 109. 1              jj = 0
 110. 1 2----------< do i = ijmin, ijmax
 111. 1 2 3--------<   do a = abmin, abmax
 112. 1 2 3              jj = jj+1
 113. 1 2 3              ii = 0
 114. 1 2 3 4------<     do j = ijmin, ijmax
 115. 1 2 3 4 Vr4--<       do b = abmin, abmax
 116. 1 2 3 4 Vr4            ii = ii+1
 117. 1 2 3 4 Vr4            f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 118. 1 2 3 4 Vr4            f5d(b,a,j,i) = f5d(b,a,i,j) + tmat7(ii,jj)
 119. 1 2 3 4 Vr4->        end do
 120. 1 2 3 4------>     end do
 121. 1 2 3-------->   end do
 122. 1 2----------> end do
133. Fissioning further improved cache reuse and resulted in better vectorization; runtime was further reduced. The cache hit/miss ratio improved slightly, and the loopmark file points to better vectorization from the fissioned loops.
USER / #3.Fissioned Loops
------------------------------------------------------------
  Time%                                         9.8%
  Time                                      2.481636 secs
  Imb.Time                                  0.045475 secs
  Imb.Time%                                      2.1%
  Calls                       0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   1.175M/sec      2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     34.109M/sec     84646518 fills
  PAPI_L1_DCM               26.424M/sec     65575972 misses
  PAPI_L1_DCA              156.705M/sec    388885686 refs
  User time (approx)         2.482 secs   6452279320 cycles  100.0%Time
  Average Time per Call                     2.481636 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    83.1% hits     16.9% misses
  D2 cache hit,miss ratio      3.3% hits     96.7% misses
  D1+D2 cache hit,miss ratio  83.7% hits     16.3% misses
134.
135. Cache blocking is a combination of strip mining and loop interchange, designed
to increase data reuse.
Takes advantage of temporal reuse: re-reference array elements already
referenced
Good blocking will take advantage of spatial reuse: work with the cache
lines!
Many ways to block any given loop nest
Which loops get blocked?
What block size(s) to use?
Analysis can reveal which ways are beneficial
But trial-and-error is probably faster
136. 2D Laplacian
    do j = 1, 8
      do i = 1, 16
        a = u(i-1,j) + u(i+1,j) &
            - 4*u(i,j)          &
            + u(i,j-1) + u(i,j+1)
      end do
    end do
Cache structure for this example:
Each line holds 4 array elements
Cache can hold 12 lines of u data
No cache reuse between outer loop iterations
[Diagram: the i-j iteration space (i = 1..16, j = 1..8) overlaid with cache lines.]
137. Unblocked loop: 120 cache misses
Block the inner loop:
    do IBLOCK = 1, 16, 4
      do j = 1, 8
        do i = IBLOCK, IBLOCK + 3
          a(i,j) = u(i-1,j) + u(i+1,j) &
                   - 2*u(i,j)          &
                   + u(i,j-1) + u(i,j+1)
        end do
      end do
    end do
Now we have reuse of the "j+1" data
[Diagram: the inner-blocked iteration space (i in blocks of 4, j = 1..8) overlaid with cache lines.]
138. One-dimensional blocking reduced misses from 120 to 80
Iterate over 4 × 4 blocks:
    do JBLOCK = 1, 8, 4
      do IBLOCK = 1, 16, 4
        do j = JBLOCK, JBLOCK + 3
          do i = IBLOCK, IBLOCK + 3
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 2*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
          end do
        end do
      end do
    end do
Better use of spatial locality (cache lines)
[Diagram: the 4 × 4 blocked iteration space (i = 1..16, j = 1..8) overlaid with cache lines.]
139. Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
Operations can be arranged to create multiple levels of blocking
Block for register
Block for cache (L1, L2, L3)
Block for TLB
No further discussion here. Interested readers can see:
Any book on code optimization
Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8
Insert your favorite book here
Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. Develops algorithms and cost models for GEMM in hierarchical memories.
Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. Description of GotoBLAS DGEMM.
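For concreteness, a one-level cache-blocking sketch of C = C + A*B; this is not the multi-level algorithms referenced above, and the block size NB is a tunable assumption, not a recommended value.

    subroutine blocked_gemm(a, b, c, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      integer, parameter :: nb = 64     ! block size: tune so three blocks fit in cache
      integer :: ii, jj, kk, i, j, k
      do jj = 1, n, nb
        do kk = 1, n, nb
          do ii = 1, n, nb
            ! Work on an nb x nb sub-problem so the blocks of a, b, and c
            ! stay resident in cache while they are reused.
            do j = jj, min(jj+nb-1, n)
              do k = kk, min(kk+nb-1, n)
                do i = ii, min(ii+nb-1, n)
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_gemm

In practice, call the LibSci DGEMM instead; this sketch only illustrates the blocking structure the tuned libraries build on.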
140. “I tried cache-blocking my code, but it didn’t help”
You’re doing it wrong.
Your block size is too small (too much loop overhead).
Your block size is too big (data is falling out of cache).
You’re targeting the wrong cache level (?)
You haven’t selected the correct subset of loops to block.
The compiler is already blocking that loop.
Prefetching is acting to minimize cache misses.
Computational intensity within the loop nest is very large, making blocking less
important.
141. Multigrid PDE solver, Class D, 64 MPI ranks
Global grid is 1024 × 1024 × 1024
Local grid is 258 × 258 × 258
Two similar loop nests account for >50% of run time
27-point 3D stencil
There is good data reuse along cache lines in the leading dimension, even without blocking
    do i3 = 2, 257
      do i2 = 2, 257
        do i1 = 2, 257
          ! update u(i1,i2,i3)
          ! using 27-point stencil
        end do
      end do
    end do
[Diagram: the 27-point stencil spanning i1-1..i1+1, i2-1..i2+1, i3-1..i3+1.]
142. Block the inner two loops
Creates blocks extending along the i3 direction
    do I2BLOCK = 2, 257, BS2
      do I1BLOCK = 2, 257, BS1
        do i3 = 2, 257
          do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
            do i1 = I1BLOCK, min(I1BLOCK+BS1-1, 257)
              ! update u(i1,i2,i3)
              ! using 27-point stencil
            end do
          end do
        end do
      end do
    end do
Block size    Mop/s/process
unblocked     531.50
16 × 16       279.89
22 × 22       321.26
28 × 28       358.96
34 × 34       385.33
40 × 40       408.53
46 × 46       443.94
52 × 52       468.58
58 × 58       470.32
64 × 64       512.03
70 × 70       506.92
143. Block the outer two loops
Preserves spatial locality along the i1 direction
    do I3BLOCK = 2, 257, BS3
      do I2BLOCK = 2, 257, BS2
        do i3 = I3BLOCK, min(I3BLOCK+BS3-1, 257)
          do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
            do i1 = 2, 257
              ! update u(i1,i2,i3)
              ! using 27-point stencil
            end do
          end do
        end do
      end do
    end do
Block size    Mop/s/process
unblocked     531.50
16 × 16       674.76
22 × 22       680.16
28 × 28       688.64
34 × 34       683.84
40 × 40       698.47
46 × 46       689.14
52 × 52       706.62
58 × 58       692.57
64 × 64       703.40
70 × 70       693.87
144.
145. C pointers: C pointers don't carry the same rules as Fortran arrays. The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere. The compiler must assume the worst, thus a false data dependency.
( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
                         int cola, int colb)
( 54) {
( 55)   int i, j, k;           /* loop counters */
( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
( 57)   double con;            /* constant value */
( 58)
( 59)   rowb = cola;
( 60)   rowc = rowa;
( 61)   colc = colb;
( 62)
( 63)   for(i=0;i<rowc;i++) {
( 64)     for(k=0;k<cola;k++) {
( 65)       con = *(a + i*cola +k);
( 66)       for(j=0;j<colc;j++) {
( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
( 68)       }
( 69)     }
( 70)   }
( 71) }

mat_mul_daxpy:
     66, Loop not vectorized: data dependency
         Loop not vectorized: data dependency
         Loop unrolled 4 times
146. C pointers, restricted: C99 introduces the restrict keyword, which allows the programmer to promise not to reference the memory via another pointer. If you declare a restricted pointer and break the rules, behavior is undefined by the standard.
( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                         double* restrict c, int rowa, int cola, int colb)
( 54) {
( 55)   int i, j, k;           /* loop counters */
( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
( 57)   double con;            /* constant value */
( 58)
( 59)   rowb = cola;
( 60)   rowc = rowa;
( 61)   colc = colb;
( 62)
( 63)   for(i=0;i<rowc;i++) {
( 64)     for(k=0;k<cola;k++) {
( 65)       con = *(a + i*cola +k);
( 66)       for(j=0;j<colc;j++) {
( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
( 68)       }
( 69)     }
( 70)   }
( 71) }
147. 66, Generated alternate loop with no peeling - executed if loop count <= 24
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with no peeling and more aligned moves -
executed if loop count <= 24 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with more aligned moves - executed if loop
count >= 25 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
• This can also be achieved with the PGI safe pragma and -Msafeptr compiler option, or the Pathscale -OPT:alias option
149. GNU malloc library
malloc, calloc, realloc, free calls
Fortran dynamic variables
Malloc library system calls:
mmap, munmap => for larger allocations
brk, sbrk => increase/decrease heap
Malloc library optimized for low system memory use
Can result in system calls/minor page faults
150. Detecting “bad” malloc behavior
Profile data => “excessive system time”
Correcting “bad” malloc behavior
Eliminate mmap use by malloc
Increase threshold to release heap memory
Use environment variables to alter malloc
MALLOC_MMAP_MAX_ = 0
MALLOC_TRIM_THRESHOLD_ = 536870912
Possible downsides
Heap fragmentation
User process may call mmap directly
User process may launch other processes
PGI’s –Msmartalloc does something similar for you at
compile time
151. Google created a replacement “malloc” library
“Minimal” TCMalloc replaces GNU malloc
Limited testing indicates TCMalloc as good or better
than GNU malloc
Environment variables not required
TCMalloc almost certainly better for allocations in
OpenMP parallel regions
There’s currently no pre-built tcmalloc for Cray XT, but
some users have successfully built it.
152. Linux has a "first touch policy" for memory allocation
*alloc functions don't actually allocate your memory
Memory gets allocated when "touched"
Problem: a code can allocate more memory than is available
Linux assumes there is swap space; we don't have any
Applications won't fail from over-allocation until the memory is finally touched
Problem: memory will be placed on the NUMA node of the "touching" thread
Only a problem if thread 0 allocates all memory for a node
Solution: always initialize your memory immediately after allocating it (see the sketch below)
If you over-allocate, it will fail immediately, rather than at a strange place in your code
If every thread touches its own memory, it will be allocated on the proper socket
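A minimal sketch of first-touch initialization under OpenMP; the array size is illustrative, and it assumes the later compute loops use the same static schedule so each thread works on the pages it placed.

    program first_touch
      implicit none
      integer, parameter :: n = 100000000
      real(8), allocatable :: x(:)
      integer :: i
      allocate(x(n))                ! virtual allocation only; no pages placed yet

      ! First touch: each thread initializes the part it will later use,
      ! so those pages land on that thread's socket (and over-allocation
      ! fails here, not deep inside the solver).
    !$omp parallel do schedule(static)
      do i = 1, n
        x(i) = 0.0d0
      end do
    !$omp end parallel do

      ! Later compute loops should use the same (static) distribution.
    !$omp parallel do schedule(static)
      do i = 1, n
        x(i) = x(i) + 1.0d0
      end do
    !$omp end parallel do
      print *, x(1), x(n)
    end program first_touch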
153.
154. Short Message Eager Protocol
The sending rank “pushes” the message to the receiving rank
Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
Sender assumes that receiver can handle the message
Matching receive is posted - or -
Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space
(MPICH_UNEX_BUFFER_SIZE) to store the message
Long Message Rendezvous Protocol
Messages are “pulled” by the receiving rank
Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
Sender sends small header packet with information for the receiver to pull
over the data
Data is sent only after matching receive is posted by receiving rank
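One practical consequence: pre-posting receives before the matching sends arrive lets eager messages land directly in the application buffer instead of the unexpected-message buffers, and lets rendezvous data start moving as soon as the header arrives. A minimal sketch (the ring exchange, message size, and tag are illustrative):

    program preposted_recv
      use mpi
      implicit none
      integer, parameter :: n = 4096
      real(8) :: sendbuf(n), recvbuf(n)
      integer :: rank, nprocs, left, right, ierr
      integer :: req(2), stats(MPI_STATUS_SIZE, 2)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      sendbuf = real(rank, 8)

      ! Post the receive first, then send: the incoming message matches a
      ! posted receive instead of consuming MPICH_UNEX_BUFFER_SIZE space.
      call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left, 0, &
                     MPI_COMM_WORLD, req(1), ierr)
      call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                     MPI_COMM_WORLD, req(2), ierr)
      call MPI_Waitall(2, req, stats, ierr)

      call MPI_Finalize(ierr)
    end program preposted_recv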
155. [Diagram: receive-side matching. The receiver posts match entries (MEs) to Portals: an application ME for the pre-posted receive, plus an eager short-message ME and a rendezvous long-message ME to handle unexpected messages.
STEP 1: MPI_RECV call on the receiver (rank 1) posts an ME to Portals.
STEP 2: MPI_SEND call on the sender (rank 0).
STEP 3: Portals DMA PUT delivers the data.
Unexpected messages are stored in the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and tracked in the unexpected message queue and unexpected event queue (MPICH_PTL_UNEX_EVENTS); other events go to the other event queue (MPICH_PTL_OTHER_EVENTS).
In this figure, MPI_RECV is posted prior to the MPI_SEND call.]
CQ (Completion Queue): an event notification block used when the processor needs to be notified that BTE or FMA transactions have completed.
NAT (Network Address Translation): responsible for validating and translating addresses from the network address format to an address on the local node.
AMO (Atomic Memory Operation): responsible for AMO-type transactions.
ORB (Outstanding Request Buffer): processes requests to the network and matches responses from the network to the original requests.
RMT (Receive Message Table): tracks groups of packets, or sequences, transmitted from remote nodes of the network.
SSID (Synchronization Sequence Identification): tracks all of the request packets that originate and all of the response packets that terminate at the NIC, in order to perform completion notifications for transactions; assists in the identification of SW operations and processes impacted by errors; monitors errors detected by other NIC blocks.
Figure 2: Logical and Physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. This write operation is not stripe aligned therefore some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention). Additionally, OSTs are accessed by variable numbers of processes (3 OST0, 1 OST1, 2 OST2 and 2 OST3).
Figure 3: Write Performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The Best performance is seen by utilizing a stripe size which matches the size of write operations.
Figure 4: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.
These loops were taken from the nuccor application and provided by Rebecca Hartman-Baker from ORNL. She originally began by comparing various compilers and optimization levels. The results of the following rewrites came at the suggestion of Vince Graziano of Cray.
This code better plays to the strengths of the CPU. More cache reuse, easier prefetching, better chance of vectorizing.
Original: 13.938244 s; Reordered: 7.955379 s
The code further improves on the last by allowing slightly better cache reuse, but significantly better opportunity to vectorize on both a and b. I asked the compiler team why the loop nest on the left was only partially vectorized and they said that their studies showed it would probably not be profitable (probably due to the tmat7 array striding on the second dimension).
Original: 13.938244 s; Reordered: 7.955379 s; Fissioned: 2.481636 s
The following Cache Blocking example was created by Steve Whalen of Cray.
See http://en.wikipedia.org/wiki/Restrict for more information on “Restrict”
The following come from Kim McMahon (Cray)
Figure 5: Write performance of a file-per-process I/O pattern as a function of the number of files/processes. The file size is 128 MB with 32 MB write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder further improvement; each file is subject to the limitations of serial I/O. Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large numbers of files) metadata operations, as well as OSS and OST contention, may hinder overall performance.
Figure 8: Write Performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (Posix, MPI-IO, and HDF5) performance levels off at high core counts.