7. [Processor block diagram: Cores 0-11, each with a private L2 cache, a shared L3 cache, two memory controllers, and HyperTransport links.]
8. A cache line is 64B
The caches are "victim caches": all references go to L1 immediately and are evicted down through the cache levels
A cache line is usually present in only one level of cache
The hardware prefetcher detects forward and backward strides through memory
Each core can perform a 128b add and a 128b multiply per clock cycle
This requires SSE packed instructions: "stride-one vectorization" (see the sketch below)
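To make the stride-one point concrete, the sketch below walks the leading (contiguous) dimension of an array so consecutive iterations fall in the same 64B line, letting the prefetcher help and the compiler emit packed SSE. The routine and array names are illustrative, not from the slides.

    ! Unit-stride ("stride-one") loop: i runs over the leading dimension,
    ! so successive iterations touch adjacent 8-byte elements and the
    ! compiler can vectorize with packed SSE instructions.
    subroutine scale_columns(a, s, n, m)
      implicit none
      integer, intent(in)    :: n, m
      real(8), intent(inout) :: a(n,m)
      real(8), intent(in)    :: s(m)
      integer :: i, j
      do j = 1, m
        do i = 1, n
          a(i,j) = s(j) * a(i,j)
        end do
      end do
    end subroutine scale_columns

Swapping the i and j loops would stride through memory by n elements per iteration, defeating both the prefetcher and vectorization.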
10. Microkernel on Compute PEs, full-featured Linux on Service PEs
Service PEs specialize by function: Login PE, Network PE, System PE, I/O PE
The Service Partition is built from specialized Linux nodes
The Compute PE software architecture eliminates OS "jitter" and enables reproducible run times
Large machines boot in under 30 minutes, including the filesystem
11. [System diagram: compute nodes on the X/Y/Z torus; login and network nodes attached via GigE and 10 GigE; boot/syslog/database nodes; I/O and metadata nodes connected through Fibre Channel to the RAID subsystem; SMW for system management.]
12. Cray XT5 systems ship with the SeaStar2+ interconnect, now scaled to 225,000 cores
Custom ASIC with integrated NIC / router
DMA engine and MPI offload engine (PowerPC 440 processor with local memory)
HyperTransport interface to the node
6-port router
Connectionless protocol with link-level reliability
Blade control processor interface
Proven scalability to 225,000 cores
14. Node characteristics:
Number of Cores: 16 or 24 (MC), 32 (IL)
Peak Performance: 153 Gflops/sec (MC-8, 2.4 GHz); 211 Gflops/sec (MC-12, 2.2 GHz)
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 83.5 GB/sec direct connect memory
6.4 GB/sec direct connect HyperTransport
Interconnect: Cray SeaStar2+
15. [Magny-Cours package diagram: each Opteron die has six "Greyhound" cores, a 6MB L3 cache, two DDR3 channels, and HT3 links; HT1/HT3 to the interconnect.]
2 Multi-Chip Modules, 4 Opteron Dies
8 Channels of DDR3 Bandwidth to 8 DIMMs
24 (or 16) Computational Cores, 24 MB of L3 cache
Dies are fully connected with HT3
Snoop Filter feature allows the 4-die SMP to scale well
16. Without the snoop filter, a STREAM test achieves 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
17. With the snoop filter, a STREAM test achieves 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.
• This feature will be key for two-socket Magny-Cours nodes, which have the same architecture.
18. New compute blade with 8 AMD
Magny Cours processors
Plug-compatible with XT5 cabinets
and backplanes
Initially will ship with SeaStar
interconnect as the Cray XT6
Upgradeable to Gemini
Interconnect or Cray XE6
Upgradeable to AMD’s
“Interlagos” series
XT6 systems will continue to ship
with the current SIO blade
First customer ship, March 31st
20. Supports 2 nodes per ASIC
168 GB/sec routing capacity
Scales to over 100,000 network endpoints
Link-level reliability and adaptive routing
Advanced resiliency features
Provides a global address space
Advanced NIC designed to efficiently support MPI, one-sided MPI, Shmem, UPC, and Coarray Fortran
[Gemini ASIC diagram: two HyperTransport 3 interfaces, NIC 0 and NIC 1 connected through a Netlink block to a 48-port YARC router.]
21. Cray Baker node characteristics:
Number of Cores: 16 or 24
Peak Performance: 140 or 210 Gflops/s
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 85 GB/sec
[Diagram: 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus; high-radix YARC router with adaptive routing; 168 GB/sec routing capacity.]
22. [Diagram: a module with SeaStar alongside a module with Gemini, showing the X/Y/Z torus connections.]
23. [Gemini NIC block diagram: HT3 Cave, FMA, BTE, ORB, NPT, CQ, NAT, AMO, RMT, RAT, LB ring, and Netlink blocks feeding the router tiles.]
FMA (Fast Memory Access)
Mechanism for most MPI transfers
Supports tens of millions of MPI requests per second
BTE (Block Transfer Engine)
Supports asynchronous block transfers between local and remote memory, in either direction
For use for large MPI transfers that happen in the background
24. Two Gemini ASICs are
packaged on a pin-compatible
mezzanine card
Topology is a 3-D torus
Each lane of the torus is
composed of 4 Gemini router
“tiles”
Systems with SeaStar
interconnects can be upgraded
by swapping this card
100% of the 48 router tiles on
each Gemini chip are used
25. Like SeaStar, Gemini has a DMA offload engine
allowing large transfers to proceed
asynchronously
Gemini provides low-overhead OS-bypass features for short transfers
MPI latency targeted at ~ 1us
NIC provides for many millions of MPI messages per second
“Hybrid” programming not a requirement for performance
RDMA provides a much improved one-sided communication mechanism
AMOs provide a faster synchronization method for barriers
Gemini supports adaptive routing, which
Reduces problems with network hot spots
Allows MPI to survive link failures
26. Globally addressable memory provides efficient
support for UPC, Co-array FORTRAN, Shmem and
Global Arrays
Cray Programming Environment will target this capability
directly
Pipelined global loads and stores
Allows for fast irregular communication patterns
Atomic memory operations
Provides fast synchronization needed for one-sided
communication models
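As a concrete picture of what the globally addressable memory enables, here is a minimal Coarray Fortran sketch of a one-sided put to a neighbor image; the array and its size are illustrative, and on Cray systems it would be built with the ftn wrapper under PrgEnv-cray.

    program caf_halo
      implicit none
      real(8) :: halo(4)[*]        ! coarray: one copy on every image
      integer :: me, np, right
      me = this_image()
      np = num_images()
      right = merge(1, me + 1, me == np)
      halo = 0.0d0
      sync all
      ! One-sided put: write this image's rank into its right neighbor's halo.
      halo(:)[right] = real(me, 8)
      sync all
      print *, 'image', me, 'received', halo(1)
    end program caf_halo

The remote assignment compiles to pipelined global stores rather than matched send/receive pairs, which is what makes irregular communication patterns cheap on this hardware.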
27. Gemini will represent a large improvement over SeaStar in
terms of reliability and serviceability
Adaptive Routing – multiple paths to the same destination
Allows mapping around bad links without rebooting
Supports warm-swap of blades
Prevents hot spots
Reliable Transport of Messages
Packet level CRC carried from start to finish
Large blocks of memory protected by ECC
Can better handle failures on the HT-link, discards packets instead of putting backpressure
into the network
Supports end-to-end reliable communication (used by MPI)
Improved error reporting and handling
The low overhead error reporting allows the programming model to replay failed
transactions
Performance counters allowing tracking of app specific packets
30. [Cabinet airflow diagram: alternating low-velocity and high-velocity airflow regions.]
31. Cool air is released into the computer room
Liquid in, liquid/vapor mixture out
The hot air stream passes through an evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation)
R134a absorbs energy only in the presence of heated air
Phase change is 10x more efficient than pure water cooling
36. 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
Unable to take advantage of file system parallelism
Access to multiple disks adds overhead which hurts performance
[Chart: Single Writer Write Performance, Write (MB/s) vs. Lustre stripe count (1-160) for 1 MB and 32 MB stripe sizes; Figure 3 in the notes describes this chart.]
37. Single OST, 256 MB file size
Performance can be limited by the process (transfer size) or the file system (stripe size)
[Chart: Single Writer, Transfer vs. Stripe Size, Write (MB/s) vs. Lustre stripe size (1-128 MB) for 32 MB, 8 MB, and 1 MB transfers; Figure 4 in the notes describes this chart.]
38. Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size
lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4 MB stripe)
lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16 MB stripe)
export MPICH_MPIIO_HINTS='*: striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written
Set the striping before copying in files (an MPI-IO hints sketch follows)
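Striping can also be requested programmatically when a file is created through MPI-IO. A minimal sketch, assuming the standard ROMIO hint names (striping_factor, striping_unit) are honored by Cray MPT; the file name and values are illustrative, and hints only take effect when the file is created.

    program set_stripe_hints
      use mpi
      implicit none
      integer :: info, fh, ierr
      call MPI_Init(ierr)
      ! Ask MPI-IO to create the file with 160 OSTs and a 4 MB stripe.
      call MPI_Info_create(info, ierr)
      call MPI_Info_set(info, 'striping_factor', '160', ierr)
      call MPI_Info_set(info, 'striping_unit', '4194304', ierr)
      call MPI_File_open(MPI_COMM_WORLD, 'output.dat',               &
                         ior(MPI_MODE_CREATE, MPI_MODE_WRONLY),      &
                         info, fh, ierr)
      ! ... collective writes go here ...
      call MPI_File_close(fh, ierr)
      call MPI_Info_free(info, ierr)
      call MPI_Finalize(ierr)
    end program set_stripe_hints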
40. Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
Fortran Compiler: ftn
C Compiler: cc
C++ Compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
Cray MPT (MPI, Shmem, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
…
The underlying compiler is chosen via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly
Always load the appropriate xtpe-<arch> module for your machine
Enables the proper compiler target
Links optimized math libraries
41.
42. Traditional (scalar) optimizations are controlled via -O# compiler flags
Default: -O2
More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with -Mipa=fast
See man pgf90, man pgcc, or man pgCC for more information about compiler options.
43. Compiler feedback is enabled with -Minfo and -Mneginfo
This can provide valuable information about what optimizations were or were not done, and why.
To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
It's possible to disable optimizations included in -fast if you believe one is causing problems
For example: -fast -Mnolre enables -fast and then disables loop-carried redundancy elimination
To get more information about any compiler flag, add -help with the flag in question
pgf90 -help -fast will give more information about the -fast flag
OpenMP is enabled with the -mp flag
44. Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the compiler can also be made to enforce strict accuracy.
-Kieee: All FP math strictly conforms to IEEE 754 (off by default)
-Ktrap: Turns on processor trapping of FP exceptions
-Mdaz: Treat all denormalized numbers as zero
-Mflushz: Set SSE to flush-to-zero (on with -fast)
-Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
45.
46. Cray has a long tradition of high performance compilers on Cray
platforms (Traditional vector, T3E, X1, X2)
Vectorization
Parallelization
Code transformation
More…
Investigated leveraging an open source compiler called LLVM
First release December 2008
47. [Cray compiler architecture diagram:]
Fortran source and C/C++ source feed the Fortran Front End and the C & C++ Front End
The C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support
Interprocedural Analysis, then Optimization and Parallelization (Cray Inc. compiler technology)
X86 Code Generator and Cray X2 Code Generator produce the object file
X86 code generation comes from open-source LLVM, with additional Cray-developed optimizations and interface support
48. Standard conforming languages and programming models
Fortran 2003
UPC & CoArray Fortran
Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
GASNet with Portals
DMAPP with Gemini & Aries
Ability and motivation to provide high-quality support for custom
Cray network hardware
Cray technology focused on scientific applications
Takes advantage of Cray’s extensive knowledge of automatic
vectorization
Takes advantage of Cray’s extensive knowledge of automatic
shared memory parallelization
Supplements, rather than replaces, the available compiler
choices
49. Make sure it is available
module avail PrgEnv-cray
To access the Cray compiler
module load PrgEnv-cray
To target the various chips
module load xtpe-[barcelona,shanghai,istanbul]
Once you have loaded the module, "cc" and "ftn" are the Cray compilers
Recommend just using the default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
50. Excellent Vectorization
Vectorize more loops than other compilers
OpenMP 3.0
Task and Nesting
PGAS: Functional UPC and CAF available today
C++ Support
Automatic Parallelization
Modernized version of Cray X1 streaming capability
Interacts with OMP directives
Cache optimizations
Automatic Blocking
Automatic Management of what stays in cache
Prefetching, Interchange, Fusion, and much more…
51. Loop Based Optimizations
Vectorization
OpenMP
Autothreading
Interchange
Pattern Matching
Cache blocking/ non-temporal / prefetching
Fortran 2003 Standard; working on 2008
PGAS (UPC and Co-Array Fortran)
Some performance optimizations available in 7.1
Optimization Feedback: Loopmark
52. Cray compiler supports a full and growing set of directives
and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
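A minimal sketch of how one of these directives is used in practice; the routine, arrays, and indirect-addressing pattern are illustrative. !dir$ ivdep placed immediately before a loop asserts that the apparent dependence (here, through the index array) does not prevent vectorization.

    subroutine gather_add(y, x, idx, n)
      implicit none
      integer, intent(in)    :: n, idx(n)
      real(8), intent(in)    :: x(*)
      real(8), intent(inout) :: y(n)
      integer :: i
      ! The indirect addressing would normally make the compiler assume a
      ! possible dependence; ivdep tells it the loop is safe to vectorize.
    !dir$ ivdep
      do i = 1, n
        y(i) = y(i) + x(idx(i))
      end do
    end subroutine gather_add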
53. The compiler can generate a filename.lst file.
Contains an annotated listing of your source code with letters indicating important optimizations
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type:
A - Pattern matched    C - Collapsed    D - Deleted    E - Cloned    I - Inlined
M - Multithreaded    P - Parallel/Tasked    V - Vectorized    W - Unwound
Modifiers:
a - vector atomic memory operation    b - blocked    f - fused    i - interchanged
m - streamed but not partitioned    p - conditional, partial and/or computed
r - unrolled    s - shortloop    t - array syntax temp used    w - unwound
54. • ftn -rm … or cc -hlist=m …
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr > - a(0) * u(i1,i2,i3)
40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
55. ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
56. -hbyteswapio
Link-time option
Applies to all unformatted Fortran I/O
Assign command
With the PrgEnv-cray module loaded, do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
Can use assign to be more precise
57. OpenMP is ON by default
Optimizations controlled by -Othread#
To shut it off, use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default;
-hautothread to turn it on
Modernized version of the Cray X1 streaming capability
Interacts with OMP directives
If you do not want OpenMP but have OMP directives in the code, make sure to compile with OpenMP turned off
58.
59. Traditional model
Tuned general purpose codes
Only good for dense
Not problem sensitive
Not architecture sensitive
60. Goal of scientific libraries
Improve Productivity at optimal performance
Cray use four concentrations to achieve this
Standardization
Use standard or “de facto” standard interfaces whenever available
Hand tuning
Use extensive knowledge of target processor and network to optimize common code
patterns
Auto-tuning
Automate code generation and a huge number of empirical performance evaluations
to configure software to the target platforms
Adaptive Libraries
Make runtime decisions to choose the best kernel/library/routine
61. Three separate classes of standardization, each with a corresponding
definition of productivity
1. Standard interfaces (e.g., dense linear algebra)
Bend over backwards to keep everything the same despite increases in machine complexity.
Innovate ‘behind-the-scenes’
Productivity -> innovation to keep things simple
2. Adoption of near-standard interfaces (e.g., sparse kernels)
Assume near-standards and promote those. Out-mode alternatives. Innovate ‘behind-the-scenes’
Productivity -> innovation in the simplest areas
(requires the same innovation as #1 also)
3. Simplification of non-standard interfaces (e.g., FFT)
Productivity -> innovation to make things simpler than they are
62. Algorithmic tuning
Increased performance by exploiting algorithmic improvements
Sub-blocking, new algorithms
LAPACK, ScaLAPACK
Kernel tuning
Improve the numerical kernel performance in assembly language
BLAS, FFT
Parallel tuning
Exploit Cray’s custom network interfaces and MPT
ScaLAPACK, P-CRAFFT
64. Serial and Parallel versions of sparse iterative linear solvers
Suites of iterative solvers
CG, GMRES, BiCG, QMR, etc.
Suites of preconditioning methods
IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
Support block sparse matrix data format for better performance
Interface to external packages (ScaLAPACK, SuperLU_DIST)
Fortran and C support
Newton-type nonlinear solvers
Large user community
DoE Labs, PSC, CSCS, CSC, ERDC, AWE and more.
http://www-unix.mcs.anl.gov/petsc/petsc-as
65. Cray provides state-of-the art scientific computing packages to strengthen
the capability of PETSc
Hypre: scalable parallel preconditioners
AMG (Very scalable and efficient for specific class of problems)
2 different ILU (General purpose)
Sparse Approximate Inverse (General purpose)
ParMetis: parallel graph partitioning package
MUMPS: parallel multifrontal sparse direct solver
SuperLU: sequential version of SuperLU_DIST
To use Cray-PETSc, load the appropriate module :
module load petsc
module load petsc-complex
(no need to load a compiler specific module)
Treat the Cray distribution as your local PETSc installation
66. The Trilinos Project http://trilinos.sandia.gov/
“an effort to develop algorithms and enabling technologies within an
object-oriented software framework for the solution of large-scale,
complex multi-physics engineering and scientific problems”
A unique design feature of Trilinos is its focus on packages.
Very large user-base and growing rapidly. Important to DOE.
Cray’s optimized Trilinos released on January 21
Includes 50+ trilinos packages
Optimized via CASK
Any code that uses Epetra objects can access the optimizations
Usage :
module load trilinos
67. CASK is a product developed at Cray using the
Cray Auto-tuning Framework (Cray ATF)
The CASK Concept :
Analyze matrix at minimal cost
Categorize matrix against internal classes
Based on offline experience, find best CASK code for particular matrix
Previously assigned "best" compiler flags to CASK code
Assign best CASK kernel and perform Ax
CASK silently sits beneath PETSc on Cray systems
Trilinos support coming soon
Released with PETSc 3.0 in February 2009
Generic and blocked CSR format
68. [Software stack diagram:]
Large-scale application: highly portable, user controlled (all systems)
PETSc / Trilinos / Hypre: highly portable, user controlled (all systems)
CASK: Cray only; XT4 & XT5 specific / tuned; invisible to the user
69. [Chart: speedup on parallel SpMV on 8 cores for 60 different matrices; x-axis Matrix ID# 0-60, y-axis speedup 1.0-1.4.]
70. [Two charts comparing the performance of CASK vs. PETSc, N=65,536 to 67,108,864, GFlops vs. number of cores (0-1024):
Block Jacobi preconditioning: BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc
SpMV: MatMult-CASK vs. MatMult-PETSc]
72. [Chart: geometric mean of MFlops over 80 sparse matrix instances from the U. of Florida collection, for 1-8 vectors; CASK vs. original Trilinos.]
73. In FFTs, the problems are
Which library to choose?
How to use complicated interfaces (e.g., FFTW)
Standard FFT practice
Do a plan stage
Deduce machine and system information and run micro-kernels
Select the best FFT strategy
Do an execute
Our system knowledge can remove some of this cost!
74. CRAFFT is designed with simple-to-use interfaces
Planning and execution stage can be combined into one function call
Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
CRAFFT provides both offline and online tuning
Offline tuning
Which FFT kernel to use
Pre-computed PLANs for common-sized FFT
No expensive plan stages
Online tuning is performed as necessary at runtime as well
At runtime, CRAFFT will adaptively select the best FFT kernel to use based
on both offline and online testing (e.g. FFTW, Custom FFT)
76. 1. Load module fftw/3.2.0 or higher.
2. Add a Fortran statement "use crafft"
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments (shown in red on the original slide)
In-place, implicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
In-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
Out-of-place, explicit memory management:
crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
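Putting the four steps together, a minimal sketch of a serial in-place 3-D complex-to-complex CRAFFT call using the implicit-memory form above; the array size, sign convention, and leading-dimension values are illustrative assumptions, not taken from the slides.

    program crafft_demo
      use crafft
      implicit none
      integer, parameter :: n1 = 64, n2 = 64, n3 = 64
      complex(8) :: input(n1, n2, n3)
      integer :: isign
      call crafft_init()
      input = (1.0d0, 0.0d0)      ! something to transform
      isign = -1                  ! forward transform (FFTW-style sign convention assumed)
      ! In-place 3-D z2z transform with implicit memory management;
      ! the leading dimensions here are simply n1 and n2.
      call crafft_z2z3d(n1, n2, n3, input, n1, n2, isign)
    end program crafft_demo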
77. As of December 2009, CRAFFT includes distributed parallel transforms
Uses the CRAFFT interface prefixed by “p”, with optional arguments
Can provide performance improvement over FFTW 2.1.5
Currently implemented
complex-complex
Real-complex and complex-real
3-d and 2-d
In-place and out-of-place
Upcoming
C language support for serial and parallel
78. 1. Add “use crafft” to Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI initialized and data distributed (see manpage)
4. Call crafft, e.g. (optional arguments in red)
2-d complex-complex, in-place, internal mem management :
call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
2-d complex-complex, in-place with no internal memory :
call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
2-d complex-complex, out-of-place, internal mem manager :
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
2-d complex-complex, out-of-place, no internal memory :
crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has manpage. Also see 3d equivalent :
man crafft_pz2z3d
80. Solves linear systems in single precision
Obtaining solutions accurate to double precision
For well conditioned problems
Serial and Parallel versions of LU, Cholesky, and QR
2 usage methods
IRT Benchmark routines
Uses IRT 'under-the-covers' without changing your code
Simply set an environment variable
Useful when you cannot alter source code
Advanced IRT API
If greater control of the iterative refinement process is required
Allows
condition number estimation
error bounds return
minimization of either forward or backward error
'fall back' to full precision if the condition number is too high
max number of iterations can be altered by users
81. “High Power Electromagnetic Wave
Heating in the ITER Burning Plasma’’
rf heating in tokamak
Maxwell-Boltzmann Eqns
FFT
Dense linear system
Calc Quasi-linear op
Courtesy
Richard Barrett
83. Decide if you want to use advanced API or benchmark API
benchmark API :
setenv IRT_USE_SOLVERS 1
Advanced API :
1. locate the factor and solve in your code (LAPACK or ScaLAPACK)
2. Replace factor and solve with a call to IRT routine
e.g. dgesv -> irt_lu_real_serial
e.g. pzgesv -> irt_lu_complex_parallel
e.g. pzposv -> irt_po_complex_parallel
3. Set advanced arguments
Forward error convergence for most accurate solution
Condition number estimate
“fall-back” to full precision if condition number too high
84. LibSci 10.4.2 February 18th 2010
OpenMP-aware LibSci
Allows calling of BLAS inside or outside a parallel region (see the sketch after this list)
Single library supported
No multi-thread library and single thread library (-lsci and –lsci_mp)
Performance not compromised
(there were some usage restrictions with this version)
LibSci 10.4.3 April 2010
Parallel CRAFFT improvements
Fixes usage restrictions of 10.4.2
OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
Upcoming
PETSc 3.1.0 May 20
Trilinos 10.2 May 20
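A minimal sketch of what "inside or outside a parallel region" means for a LibSci BLAS call; the matrix sizes, block width, and the assumption that OMP_NUM_THREADS governs LibSci's internal threading are illustrative.

    program libsci_omp_demo
      implicit none
      integer, parameter :: n = 1024
      integer :: j, nb
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0

      ! Outside a parallel region: this single DGEMM may be threaded
      ! internally by LibSci, using up to OMP_NUM_THREADS threads.
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      ! Inside a parallel region: each thread makes its own call, here on
      ! a distinct block of columns of B and C, so the calls run concurrently.
    !$omp parallel do private(j, nb) schedule(static)
      do j = 1, n, 256
        nb = min(256, n - j + 1)
        call dgemm('N', 'N', n, nb, n, 1.0d0, a, n, b(1, j), n, 0.0d0, c(1, j), n)
      end do
    !$omp end parallel do
    end program libsci_omp_demo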
99. biolib: Cray Bioinformatics library routines
blacs: Basic Linear Algebra Communication Subprograms
blas: Basic Linear Algebra Subprograms
caf: Co-Array Fortran (Cray X2 systems only)
fftw: Fast Fourier Transform library (64-bit only)
hdf5: manages extremely large and complex data collections
heap: dynamic heap
io: includes stdio and sysio groups
lapack: Linear Algebra Package
lustre: Lustre File System
math: ANSI math
mpi: MPI
netcdf: network common data form (manages array-oriented scientific data)
omp: OpenMP API (not supported on Catamount)
omp-rtl: OpenMP runtime library (not supported on Catamount)
portals: Lightweight message passing API
pthreads: POSIX threads (not supported on Catamount)
scalapack: Scalable LAPACK
shmem: SHMEM
stdio: all library functions that accept or return the FILE* construct
sysio: I/O system calls
system: system calls
upc: Unified Parallel C (Cray X2 systems only)
100. 0: Summary with instruction metrics
1: Summary with TLB metrics
2: L1 and L2 metrics
3: Bandwidth information
4: HyperTransport information
5: Floating point mix
6: Cycles stalled, resources idle
7: Cycles stalled, resources full
8: Instructions and branches
9: Instruction cache
10: Cache hierarchy
11: Floating point operations mix (2)
12: Floating point operations mix (vectorization)
13: Floating point operations mix (SP)
14: Floating point operations mix (DP)
15: L3 (socket-level)
16: L3 (core-level reads)
17: L3 (core-level misses)
18: L3 (core-level fills caused by L2 evictions)
19: Prefetches
101. Regions, useful to break up long routines
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/Enable Profiling, useful for excluding initialization
int PAT_record (int state)
Flush buffer, useful when program isn’t exiting cleanly
int PAT_flush_buffer (void)
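A minimal sketch of using this API from Fortran to exclude initialization and label the solver phase. It assumes the Fortran binding mirrors the C prototypes above with an added status argument and that pat_apif.h supplies the PAT_STATE_ON/OFF constants (check the pat API man pages); init_data and solver are hypothetical routines.

    program craypat_regions
      implicit none
      include "pat_apif.h"                    ! assumed CrayPat API header (PAT_STATE_ON/OFF)
      integer :: istat

      call PAT_record(PAT_STATE_OFF, istat)   ! exclude initialization from profiling
      call init_data()
      call PAT_record(PAT_STATE_ON, istat)

      call PAT_region_begin(1, "solver", istat)   ! label this phase in the report
      call solver()
      call PAT_region_end(1, istat)

    contains
      subroutine init_data()   ! hypothetical setup work
      end subroutine init_data
      subroutine solver()      ! hypothetical compute phase
      end subroutine solver
    end program craypat_regions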
104. Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc.
During tracing and sampling
126. Poor loop order results in poor striding: the inner-most loop strides on a slow dimension of each array. The best the compiler can do is unroll. Little to no cache reuse.
 55. 1              ii = 0
 56. 1 2----------< do b = abmin, abmax
 57. 1 2 3--------<   do j=ijmin, ijmax
 58. 1 2 3              ii = ii+1
 59. 1 2 3              jj = 0
 60. 1 2 3 4------<     do a = abmin, abmax
 61. 1 2 3 4 r8---<       do i = ijmin, ijmax
 62. 1 2 3 4 r8             jj = jj+1
 63. 1 2 3 4 r8             f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 64. 1 2 3 4 r8             f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 65. 1 2 3 4 r8             f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 66. 1 2 3 4 r8             f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
 67. 1 2 3 4 r8--->       end do
 68. 1 2 3 4------>     end do
 69. 1 2 3-------->   end do
 70. 1 2----------> end do
127. Poor loop order results in poor cache reuse: for every L1 cache hit there are two misses. Overall, only 2/3 of all references were found in level 1 or 2 cache.
USER / #1.Original Loops
------------------------------------------------------------
  Time%                                        55.0%
  Time                                     13.938244 secs
  Imb.Time                                  0.075369 secs
  Imb.Time%                                      0.6%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED  11.858M/sec    165279602 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     11.931M/sec    166291054 fills
  PAPI_L1_DCM               23.499M/sec    327533338 misses
  PAPI_L1_DCA               34.635M/sec    482751044 refs
  User time (approx)        13.938 secs  36239439807 cycles  100.0%Time
  Average Time per Call                    13.938244 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    32.2% hits     67.8% misses
  D2 cache hit,miss ratio     49.8% hits     50.2% misses
  D1+D2 cache hit,miss ratio  66.0% hits     34.0% misses
128.
129.
130. Reordered loop nest: now the inner-most loop is stride-1 on both arrays, so memory accesses proceed along cache lines, allowing reuse. The compiler is able to vectorize and make better use of SSE instructions.
 75. 1 2----------< do i = ijmin, ijmax
 76. 1 2              jj = 0
 77. 1 2 3--------<   do a = abmin, abmax
 78. 1 2 3 4------<     do j=ijmin, ijmax
 79. 1 2 3 4              jj = jj+1
 80. 1 2 3 4              ii = 0
 81. 1 2 3 4 Vcr2-<       do b = abmin, abmax
 82. 1 2 3 4 Vcr2           ii = ii+1
 83. 1 2 3 4 Vcr2           f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 84. 1 2 3 4 Vcr2           f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 85. 1 2 3 4 Vcr2           f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 86. 1 2 3 4 Vcr2           f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
 87. 1 2 3 4 Vcr2->       end do
 88. 1 2 3 4------>     end do
 89. 1 2 3-------->   end do
 90. 1 2----------> end do
131. Improved striding greatly improved cache reuse; runtime was cut nearly in half. Still, some 20% of all references are cache misses.
USER / #2.Reordered Loops
------------------------------------------------------------
  Time%                                        31.4%
  Time                                      7.955379 secs
  Imb.Time                                  0.260492 secs
  Imb.Time%                                      3.8%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   0.419M/sec      3331289 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     15.285M/sec    121598284 fills
  PAPI_L1_DCM               13.330M/sec    106046801 misses
  PAPI_L1_DCA               66.226M/sec    526855581 refs
  User time (approx)         7.955 secs  20684020425 cycles  100.0%Time
  Average Time per Call                     7.955379 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    79.9% hits     20.1% misses
  D2 cache hit,miss ratio      2.7% hits     97.3% misses
  D1+D2 cache hit,miss ratio  80.4% hits     19.6% misses
132. First loop, partially vectorized and unrolled by 4:
  95. 1              ii = 0
  96. 1 2----------< do j = ijmin, ijmax
  97. 1 2 i--------<   do b = abmin, abmax
  98. 1 2 i              ii = ii+1
  99. 1 2 i              jj = 0
 100. 1 2 i i------<     do i = ijmin, ijmax
 101. 1 2 i i Vpr4-<       do a = abmin, abmax
 102. 1 2 i i Vpr4           jj = jj+1
 103. 1 2 i i Vpr4           f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
 104. 1 2 i i Vpr4           f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
 105. 1 2 i i Vpr4->       end do
 106. 1 2 i i------>     end do
 107. 1 2 i-------->   end do
 108. 1 2----------> end do
Second loop, vectorized and unrolled by 4:
 109. 1              jj = 0
 110. 1 2----------< do i = ijmin, ijmax
 111. 1 2 3--------<   do a = abmin, abmax
 112. 1 2 3              jj = jj+1
 113. 1 2 3              ii = 0
 114. 1 2 3 4------<     do j = ijmin, ijmax
 115. 1 2 3 4 Vr4--<       do b = abmin, abmax
 116. 1 2 3 4 Vr4            ii = ii+1
 117. 1 2 3 4 Vr4            f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
 118. 1 2 3 4 Vr4            f5d(b,a,j,i) = f5d(b,a,i,j) + tmat7(ii,jj)
 119. 1 2 3 4 Vr4->        end do
 120. 1 2 3 4------>     end do
 121. 1 2 3-------->   end do
 122. 1 2----------> end do
133. Fissioning further improved cache reuse and resulted in better vectorization; runtime was further reduced. The cache hit/miss ratio improved slightly, and the loopmark file points to better vectorization from the fissioned loops.
USER / #3.Fissioned Loops
------------------------------------------------------------
  Time%                                         9.8%
  Time                                      2.481636 secs
  Imb.Time                                  0.045475 secs
  Imb.Time%                                      2.1%
  Calls                       0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   1.175M/sec      2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     34.109M/sec     84646518 fills
  PAPI_L1_DCM               26.424M/sec     65575972 misses
  PAPI_L1_DCA              156.705M/sec    388885686 refs
  User time (approx)         2.482 secs   6452279320 cycles  100.0%Time
  Average Time per Call                     2.481636 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios    83.1% hits     16.9% misses
  D2 cache hit,miss ratio      3.3% hits     96.7% misses
  D1+D2 cache hit,miss ratio  83.7% hits     16.3% misses
134.
135. Cache blocking is a combination of strip mining and loop interchange, designed
to increase data reuse.
Takes advantage of temporal reuse: re-reference array elements already
referenced
Good blocking will take advantage of spatial reuse: work with the cache
lines!
Many ways to block any given loop nest
Which loops get blocked?
What block size(s) to use?
Analysis can reveal which ways are beneficial
But trial-and-error is probably faster
136. 2D Laplacian
    do j = 1, 8
      do i = 1, 16
        a = u(i-1,j) + u(i+1,j) &
            - 4*u(i,j)          &
            + u(i,j-1) + u(i,j+1)
      end do
    end do
Cache structure for this example:
Each line holds 4 array elements
Cache can hold 12 lines of u data
No cache reuse between outer loop iterations
[Diagram: the i-j iteration space (i = 1..16, j = 1..8) overlaid with cache lines.]
137. Unblocked loop: 120 cache misses
Block the inner loop:
    do IBLOCK = 1, 16, 4
      do j = 1, 8
        do i = IBLOCK, IBLOCK + 3
          a(i,j) = u(i-1,j) + u(i+1,j) &
                   - 2*u(i,j)          &
                   + u(i,j-1) + u(i,j+1)
        end do
      end do
    end do
Now we have reuse of the "j+1" data
[Diagram: the inner-blocked iteration space (i in blocks of 4, j = 1..8) overlaid with cache lines.]
138. One-dimensional blocking reduced misses from 120 to 80
Iterate over 4 × 4 blocks:
    do JBLOCK = 1, 8, 4
      do IBLOCK = 1, 16, 4
        do j = JBLOCK, JBLOCK + 3
          do i = IBLOCK, IBLOCK + 3
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 2*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
          end do
        end do
      end do
    end do
Better use of spatial locality (cache lines)
[Diagram: the 4 × 4 blocked iteration space (i = 1..16, j = 1..8) overlaid with cache lines.]
139. Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
Operations can be arranged to create multiple levels of blocking
Block for register
Block for cache (L1, L2, L3)
Block for TLB
No further discussion here. Interested readers can see:
Any book on code optimization
Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8
Insert your favorite book here
Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. Develops algorithms and cost models for GEMM in hierarchical memories.
Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. Description of GotoBLAS DGEMM.
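For concreteness, a one-level cache-blocking sketch of C = C + A*B; this is not the multi-level algorithms referenced above, and the block size NB is a tunable assumption, not a recommended value.

    subroutine blocked_gemm(a, b, c, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      integer, parameter :: nb = 64     ! block size: tune so three blocks fit in cache
      integer :: ii, jj, kk, i, j, k
      do jj = 1, n, nb
        do kk = 1, n, nb
          do ii = 1, n, nb
            ! Work on an nb x nb sub-problem so the blocks of a, b, and c
            ! stay resident in cache while they are reused.
            do j = jj, min(jj+nb-1, n)
              do k = kk, min(kk+nb-1, n)
                do i = ii, min(ii+nb-1, n)
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_gemm

In practice, call the LibSci DGEMM instead; this sketch only illustrates the blocking structure the tuned libraries build on.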
140. “I tried cache-blocking my code, but it didn’t help”
You’re doing it wrong.
Your block size is too small (too much loop overhead).
Your block size is too big (data is falling out of cache).
You’re targeting the wrong cache level (?)
You haven’t selected the correct subset of loops to block.
The compiler is already blocking that loop.
Prefetching is acting to minimize cache misses.
Computational intensity within the loop nest is very large, making blocking less
important.
141. Multigrid PDE solver, Class D, 64 MPI ranks
Global grid is 1024 × 1024 × 1024
Local grid is 258 × 258 × 258
Two similar loop nests account for >50% of run time
27-point 3D stencil
There is good data reuse along cache lines in the leading dimension, even without blocking
    do i3 = 2, 257
      do i2 = 2, 257
        do i1 = 2, 257
          ! update u(i1,i2,i3)
          ! using 27-point stencil
        end do
      end do
    end do
[Diagram: the 27-point stencil spanning i1-1..i1+1, i2-1..i2+1, i3-1..i3+1.]
142. Block the inner two loops
Creates blocks extending along the i3 direction
    do I2BLOCK = 2, 257, BS2
      do I1BLOCK = 2, 257, BS1
        do i3 = 2, 257
          do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
            do i1 = I1BLOCK, min(I1BLOCK+BS1-1, 257)
              ! update u(i1,i2,i3)
              ! using 27-point stencil
            end do
          end do
        end do
      end do
    end do
Block size    Mop/s/process
unblocked     531.50
16 × 16       279.89
22 × 22       321.26
28 × 28       358.96
34 × 34       385.33
40 × 40       408.53
46 × 46       443.94
52 × 52       468.58
58 × 58       470.32
64 × 64       512.03
70 × 70       506.92
143. Block the outer two loops
Preserves spatial locality along the i1 direction
    do I3BLOCK = 2, 257, BS3
      do I2BLOCK = 2, 257, BS2
        do i3 = I3BLOCK, min(I3BLOCK+BS3-1, 257)
          do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
            do i1 = 2, 257
              ! update u(i1,i2,i3)
              ! using 27-point stencil
            end do
          end do
        end do
      end do
    end do
Block size    Mop/s/process
unblocked     531.50
16 × 16       674.76
22 × 22       680.16
28 × 28       688.64
34 × 34       683.84
40 × 40       698.47
46 × 46       689.14
52 × 52       706.62
58 × 58       692.57
64 × 64       703.40
70 × 70       693.87
144.
145. C pointers: C pointers don't carry the same rules as Fortran arrays. The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere. The compiler must assume the worst, thus a false data dependency.
( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
                         int cola, int colb)
( 54) {
( 55)   int i, j, k;           /* loop counters */
( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
( 57)   double con;            /* constant value */
( 58)
( 59)   rowb = cola;
( 60)   rowc = rowa;
( 61)   colc = colb;
( 62)
( 63)   for(i=0;i<rowc;i++) {
( 64)     for(k=0;k<cola;k++) {
( 65)       con = *(a + i*cola +k);
( 66)       for(j=0;j<colc;j++) {
( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
( 68)       }
( 69)     }
( 70)   }
( 71) }

mat_mul_daxpy:
     66, Loop not vectorized: data dependency
         Loop not vectorized: data dependency
         Loop unrolled 4 times
146. C pointers, restricted: C99 introduces the restrict keyword, which allows the programmer to promise not to reference the memory via another pointer. If you declare a restricted pointer and break the rules, behavior is undefined by the standard.
( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                         double* restrict c, int rowa, int cola, int colb)
( 54) {
( 55)   int i, j, k;           /* loop counters */
( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
( 57)   double con;            /* constant value */
( 58)
( 59)   rowb = cola;
( 60)   rowc = rowa;
( 61)   colc = colb;
( 62)
( 63)   for(i=0;i<rowc;i++) {
( 64)     for(k=0;k<cola;k++) {
( 65)       con = *(a + i*cola +k);
( 66)       for(j=0;j<colc;j++) {
( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
( 68)       }
( 69)     }
( 70)   }
( 71) }
147. 66, Generated alternate loop with no peeling - executed if loop count <= 24
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with no peeling and more aligned moves -
executed if loop count <= 24 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with more aligned moves - executed if loop
count >= 25 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
• This can also be achieved with the PGI safe pragma and -Msafeptr compiler option, or the Pathscale -OPT:alias option
149. GNU malloc library
malloc, calloc, realloc, free calls
Fortran dynamic variables
Malloc library system calls:
mmap, munmap => for larger allocations
brk, sbrk => increase/decrease heap
Malloc library optimized for low system memory use
Can result in system calls/minor page faults
150. Detecting “bad” malloc behavior
Profile data => “excessive system time”
Correcting “bad” malloc behavior
Eliminate mmap use by malloc
Increase threshold to release heap memory
Use environment variables to alter malloc
MALLOC_MMAP_MAX_ = 0
MALLOC_TRIM_THRESHOLD_ = 536870912
Possible downsides
Heap fragmentation
User process may call mmap directly
User process may launch other processes
PGI’s –Msmartalloc does something similar for you at
compile time
151. Google created a replacement “malloc” library
“Minimal” TCMalloc replaces GNU malloc
Limited testing indicates TCMalloc as good or better
than GNU malloc
Environment variables not required
TCMalloc almost certainly better for allocations in
OpenMP parallel regions
There’s currently no pre-built tcmalloc for Cray XT, but
some users have successfully built it.
152. Linux has a "first touch policy" for memory allocation
*alloc functions don't actually allocate your memory
Memory gets allocated when "touched"
Problem: a code can allocate more memory than is available
Linux assumes there is swap space; we don't have any
Applications won't fail from over-allocation until the memory is finally touched
Problem: memory will be placed on the NUMA node of the "touching" thread
Only a problem if thread 0 allocates all memory for a node
Solution: always initialize your memory immediately after allocating it (see the sketch below)
If you over-allocate, it will fail immediately, rather than at a strange place in your code
If every thread touches its own memory, it will be allocated on the proper socket
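A minimal sketch of first-touch initialization under OpenMP; the array size is illustrative, and it assumes the later compute loops use the same static schedule so each thread works on the pages it placed.

    program first_touch
      implicit none
      integer, parameter :: n = 100000000
      real(8), allocatable :: x(:)
      integer :: i
      allocate(x(n))                ! virtual allocation only; no pages placed yet

      ! First touch: each thread initializes the part it will later use,
      ! so those pages land on that thread's socket (and over-allocation
      ! fails here, not deep inside the solver).
    !$omp parallel do schedule(static)
      do i = 1, n
        x(i) = 0.0d0
      end do
    !$omp end parallel do

      ! Later compute loops should use the same (static) distribution.
    !$omp parallel do schedule(static)
      do i = 1, n
        x(i) = x(i) + 1.0d0
      end do
    !$omp end parallel do
      print *, x(1), x(n)
    end program first_touch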
153.
154. Short Message Eager Protocol
The sending rank “pushes” the message to the receiving rank
Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
Sender assumes that receiver can handle the message
Matching receive is posted - or -
Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space
(MPICH_UNEX_BUFFER_SIZE) to store the message
Long Message Rendezvous Protocol
Messages are “pulled” by the receiving rank
Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
Sender sends small header packet with information for the receiver to pull
over the data
Data is sent only after matching receive is posted by receiving rank
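One practical consequence: pre-posting receives before the matching sends arrive lets eager messages land directly in the application buffer instead of the unexpected-message buffers, and lets rendezvous data start moving as soon as the header arrives. A minimal sketch (the ring exchange, message size, and tag are illustrative):

    program preposted_recv
      use mpi
      implicit none
      integer, parameter :: n = 4096
      real(8) :: sendbuf(n), recvbuf(n)
      integer :: rank, nprocs, left, right, ierr
      integer :: req(2), stats(MPI_STATUS_SIZE, 2)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      sendbuf = real(rank, 8)

      ! Post the receive first, then send: the incoming message matches a
      ! posted receive instead of consuming MPICH_UNEX_BUFFER_SIZE space.
      call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left, 0, &
                     MPI_COMM_WORLD, req(1), ierr)
      call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                     MPI_COMM_WORLD, req(2), ierr)
      call MPI_Waitall(2, req, stats, ierr)

      call MPI_Finalize(ierr)
    end program preposted_recv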
155. [Diagram: receive-side matching. The receiver posts match entries (MEs) to Portals: an application ME for the pre-posted receive, plus an eager short-message ME and a rendezvous long-message ME to handle unexpected messages.
STEP 1: MPI_RECV call on the receiver (rank 1) posts an ME to Portals.
STEP 2: MPI_SEND call on the sender (rank 0).
STEP 3: Portals DMA PUT delivers the data.
Unexpected messages are stored in the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and tracked in the unexpected message queue and unexpected event queue (MPICH_PTL_UNEX_EVENTS); other events go to the other event queue (MPICH_PTL_OTHER_EVENTS).
In this figure, MPI_RECV is posted prior to the MPI_SEND call.]
CQ (Completion Queue): an event notification block used when the processor needs to be notified that BTE or FMA transactions have completed.
NAT (Network Address Translation): responsible for validating and translating addresses from the network address format to an address on the local node.
AMO (Atomic Memory Operation): responsible for AMO-type transactions.
ORB (Outstanding Request Buffer): processes requests to the network and matches responses from the network to the original requests.
RMT (Receive Message Table): tracks groups of packets, or sequences, transmitted from remote nodes of the network.
SSID (Synchronization Sequence Identification): tracks all of the request packets that originate and all of the response packets that terminate at the NIC, in order to perform completion notifications for transactions; assists in the identification of SW operations and processes impacted by errors; monitors errors detected by other NIC blocks.
Figure 2: Logical and Physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. This write operation is not stripe aligned therefore some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention). Additionally, OSTs are accessed by variable numbers of processes (3 OST0, 1 OST1, 2 OST2 and 2 OST3).
Figure 3: Write Performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The Best performance is seen by utilizing a stripe size which matches the size of write operations.
Figure 4: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.
These loops were taken from the nuccor application and provided by Rebecca Hartman-Baker from ORNL. She originally began by comparing various compilers and optimization levels. The results of the following rewrites came at the suggestion of Vince Graziano of Cray.
This code better plays to the strengths of the CPU. More cache reuse, easier prefetching, better chance of vectorizing.
Original: 13.938244 s; Reordered: 7.955379 s
The code further improves on the last by allowing slightly better cache reuse, but significantly better opportunity to vectorize on both a and b. I asked the compiler team why the loop nest on the left was only partially vectorized and they said that their studies showed it would probably not be profitable (probably due to the tmat7 array striding on the second dimension).
Original: 13.938244 s; Reordered: 7.955379 s; Fissioned: 2.481636 s
The following Cache Blocking example was created by Steve Whalen of Cray.
See http://en.wikipedia.org/wiki/Restrict for more information on “Restrict”
The following come from Kim McMahon (Cray)
Figure 5: Write performance of a file-per-process I/O pattern as a function of the number of files/processes. The file size is 128 MB with 32 MB write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder further improvement; each file is subject to the limitations of serial I/O. Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large numbers of files) metadata operations, as well as OSS and OST contention, may hinder overall performance.
Figure 8: Write Performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (Posix, MPI-IO, and HDF5) performance levels off at high core counts.