• Review of XT6 Architecture
   • AMD Opteron
   • Cray Networks
   • Lustre Basics
• Programming Environment
   • PGI Compiler Basics
   • The Cray Compiler Environment
   • Cray Scientific Libraries
• Cray Performance Analysis Tools
• Optimizations
   • CPU
   • Communication
   • I/O
AMD CPU Architecture
Cray Architecture
Lustre Filesystem Basics
                 2003          2005          2007          2008          2009          2010
                 AMD Opteron   AMD Opteron   “Barcelona”   “Shanghai”    “Istanbul”    “Magny-Cours”
Mfg. Process     130nm SOI     90nm SOI      65nm SOI      45nm SOI      45nm SOI      45nm SOI
CPU Core         K8            K8            Greyhound     Greyhound+    Greyhound+    Greyhound+
L2/L3            1MB/0         1MB/0         512kB/2MB     512kB/6MB     512kB/6MB     512kB/12MB
HyperTransport   3x 1.6GT/s    3x 1.6GT/s    3x 2GT/s      3x 4.0GT/s    3x 4.8GT/s    4x 6.4GT/s
Memory           2x DDR1 300   2x DDR1 400   2x DDR2 667   2x DDR2 800   2x DDR2 800   4x DDR3 1333
• 12 cores: 1.7-2.2 GHz, 105.6 Gflops
• 8 cores: 1.8-2.4 GHz, 76.8 Gflops
• Power (ACP): 80 Watts
• Stream: 27.5 GB/s
• Cache: 12x 64KB L1, 12x 512KB L2, 12MB L3
[Die diagram: cores 0-11, each with a private L2 cache, sharing L3 cache, two memory controllers, and four HT links]
• A cache line is 64B
• Cache is a “victim cache”
   • All references go to L1 immediately and get evicted down the caches
   • A cache line is usually only in one level of cache
• Hardware prefetcher detects forward and backward strides through memory
• Each core can perform a 128b add and a 128b multiply per clock cycle
   • This requires SSE packed instructions
   • “Stride-one vectorization” (illustrated in the sketch below)
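
As a hedged illustration (not from the slides; the routine name is invented), this is the kind of unit-stride Fortran loop a compiler can map onto packed SSE adds and multiplies:

  subroutine axpy_like(n, a, x, y)
    ! Unit-stride accesses: consecutive iterations touch consecutive
    ! memory, which feeds both the hardware prefetcher and packed SSE.
    integer, intent(in)    :: n
    real(8), intent(in)    :: a, x(n)
    real(8), intent(inout) :: y(n)
    integer :: i
    do i = 1, n
       y(i) = y(i) + a * x(i)   ! one add and one multiply per element
    end do
  end subroutine axpy_like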
SeaStar (XT-series)
Gemini (XE-series)
• Microkernel on Compute PEs, full-featured Linux on Service PEs
• Service PEs specialize by function:
   • Compute PE
   • Login PE
   • Network PE
   • System PE
   • I/O PE
• Software architecture eliminates OS “jitter”
• Software architecture enables reproducible run times
• Large machines boot in under 30 minutes, including the filesystem

Service partition: specialized Linux nodes




[System diagram: 3-D torus (X, Y, Z) of compute nodes plus login, network, boot/syslog/database, and I/O and metadata nodes; GigE and 10 GigE links, the SMW, and Fibre Channel connections to the RAID subsystem]
• Cray XT5 systems ship with the SeaStar2+ interconnect
• Custom ASIC
• Integrated NIC / Router
• MPI offload engine
• Connectionless protocol
• Link-level reliability
• Proven scalability; now scaled to 225,000 cores

[SeaStar2+ block diagram: 6-port router, DMA engine, HyperTransport interface, memory, PowerPC 440 processor, and blade control processor interface]
Processor        Frequency   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)   2.6         62.4            12.8                 0.21
MC-8             2.0         64.0            42.6                 0.67
MC-8             2.3         73.6            42.6                 0.58
MC-8             2.4         76.8            42.6                 0.55
MC-12            1.9         91.2            42.6                 0.47
MC-12            2.1         100.8           42.6                 0.42
MC-12            2.2         105.6           42.6                 0.40
Characteristics:
  Number of Cores:                16 or 24 (MC), 32 (IL)
  Peak Performance, MC-8 (2.4):   153 Gflops/sec
  Peak Performance, MC-12 (2.2):  211 Gflops/sec
  Memory Size:                    32 or 64 GB per node
  Memory Bandwidth:               83.5 GB/sec

[Node diagram: 6.4 GB/sec direct-connect HyperTransport, 83.5 GB/sec direct-connect memory, and the Cray SeaStar2+ interconnect]
[Node diagram: four dies, each with six Greyhound cores and a 6MB L3 cache; eight DDR3 channels; dies fully connected by HT3, with HT1/HT3 to the interconnect]

• 2 multi-chip modules, 4 Opteron dies
• 8 channels of DDR3 bandwidth to 8 DIMMs
• 24 (or 16) computational cores, 24 MB of L3 cache
• Dies are fully connected with HT3
• Snoop Filter feature allows 4-die SMP to scale well
Without the snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
With the snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.

• This feature will be key for two-socket Magny-Cours nodes, which have the same architecture
• New compute blade with 8 AMD Magny-Cours processors
• Plug-compatible with XT5 cabinets and backplanes
• Initially ships with the SeaStar interconnect, as the Cray XT6
• Upgradeable to the Gemini interconnect, as the Cray XE6
• Upgradeable to AMD’s “Interlagos” series
• XT6 systems will continue to ship with the current SIO blade
• First customer ship: March 31st
• Supports 2 nodes per ASIC
• 168 GB/sec routing capacity
• Scales to over 100,000 network endpoints
• Link-level reliability and adaptive routing
• Advanced resiliency features
• Provides global address space
• Advanced NIC designed to efficiently support
   • MPI
   • One-sided MPI
   • Shmem
   • UPC, Co-array Fortran

[Gemini block diagram: two HyperTransport 3 interfaces feeding NIC 0 and NIC 1, joined through the Netlink block to a 48-port YARC router; LO processor and SB block]
Cray Baker Node Characteristics:
  Number of Cores:    16 or 24
  Peak Performance:   140 or 210 Gflops/s
  Memory Size:        32 or 64 GB per node
  Memory Bandwidth:   85 GB/sec

[Node diagram: 10 12X Gemini channels (each Gemini acts like two nodes on the 3-D torus); high-radix YARC router with adaptive routing and 168 GB/sec capacity]
[Diagram: a module with SeaStar compared to a module with Gemini on the 3-D torus (X, Y, Z)]
[Gemini NIC block diagram: HT3 cave, FMA and BTE engines, and supporting blocks (HARB, NPT, CQ, NAT, AMO, RMT, RAT, TARB, SSID, ORB, CLM) moving network requests and responses through the Netlink onto the router tiles]
• FMA (Fast Memory Access)
   • Mechanism for most MPI transfers
   • Supports tens of millions of MPI requests per second
• BTE (Block Transfer Engine)
   • Supports asynchronous block transfers between local and remote memory, in either direction
   • For use for large MPI transfers that happen in the background (see the sketch below)
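
A hedged Fortran sketch (not from the slides; names and sizes are invented) of the overlap pattern the BTE enables: post a large non-blocking transfer, keep computing, then wait.

  subroutine exchange_and_compute(buf, scratch, n, partner, comm)
    use mpi
    implicit none
    integer, intent(in)    :: n, partner, comm
    real(8), intent(inout) :: buf(n), scratch(n)
    integer :: req, ierr, i
    ! Post a large non-blocking send; the DMA engine can progress the
    ! transfer while the CPU works on other data.
    call MPI_Isend(buf, n, MPI_DOUBLE_PRECISION, partner, 0, comm, req, ierr)
    do i = 1, n                       ! local work overlapped with the transfer
       scratch(i) = 2.0d0 * scratch(i)
    end do
    call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
  end subroutine exchange_and_compute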
• Two Gemini ASICs are packaged on a pin-compatible mezzanine card
• Topology is a 3-D torus
• Each lane of the torus is composed of 4 Gemini router “tiles”
• Systems with SeaStar interconnects can be upgraded by swapping this card
• 100% of the 48 router tiles on each Gemini chip are used
• Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
• Gemini provides low-overhead OS-bypass features for short transfers
   • MPI latency targeted at ~1us
   • NIC provides for many millions of MPI messages per second
   • “Hybrid” programming not a requirement for performance
• RDMA provides a much improved one-sided communication mechanism
• AMOs provide a faster synchronization method for barriers
• Gemini supports adaptive routing, which
   • Reduces problems with network hot spots
   • Allows MPI to survive link failures
• Globally addressable memory provides efficient support for UPC, Co-array Fortran, Shmem and Global Arrays (see the Co-array sketch below)
   • Cray Programming Environment will target this capability directly
• Pipelined global loads and stores
   • Allows for fast irregular communication patterns
• Atomic memory operations
   • Provide fast synchronization needed for one-sided communication models
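
To make the model concrete, a minimal Co-array Fortran sketch (illustrative, not from the slides): the remote store maps onto the global address space with no matching receive.

  program caf_put
    implicit none
    real(8) :: halo(1024)[*]        ! one copy of halo per image (PE)
    integer :: right
    ! Neighbor to the right, wrapping around at the last image.
    right = merge(1, this_image() + 1, this_image() == num_images())
    halo(:)[right] = 1.0d0          ! one-sided remote store
    sync all                        ! barrier; AMOs make such sync fast
  end program caf_put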
Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability:

• Adaptive routing: multiple paths to the same destination
   • Allows mapping around bad links without rebooting
   • Supports warm-swap of blades
   • Prevents hot spots
• Reliable transport of messages
   • Packet-level CRC carried from start to finish
   • Large blocks of memory protected by ECC
   • Better handles failures on the HT link: discards packets instead of putting backpressure into the network
   • Supports end-to-end reliable communication (used by MPI)
• Improved error reporting and handling
   • Low-overhead error reporting allows the programming model to replay failed transactions
   • Performance counters allow tracking of app-specific packets
[Cabinet airflow diagram: alternating low-velocity and high-velocity airflow zones]
Cool air is released into the computer room.

[Diagram: liquid in, liquid/vapor mixture out. The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation).]

• R134a absorbs energy only in the presence of heated air.
• Phase change is 10x more efficient than pure water cooling.
[Photo: R134a piping, with inlet and exit evaporators]
• 32 MB per OST (32 MB - 5 GB) and 32 MB transfer size
   • Unable to take advantage of file system parallelism
   • Access to multiple disks adds overhead, which hurts performance

[Chart: single-writer write performance (MB/s, 0-120) vs. stripe count (1-160) for 1 MB and 32 MB stripe sizes on Lustre]
• Single OST, 256 MB file size
   • Performance can be limited by the process (transfer size) or by the file system (stripe size)

[Chart: single-writer write performance (MB/s, 0-140) vs. stripe size (1-128 MB) for 32 MB, 8 MB, and 1 MB transfer sizes on Lustre]
• Use the lfs command, libLUT, or MPI-IO hints (see the sketch after this list) to adjust your stripe count and possibly stripe size
   • lfs setstripe -c -1 -s 4M <file or directory> (use all OSTs, here 160, with a 4MB stripe)
   • lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16M stripe)
   • export MPICH_MPIIO_HINTS='*: striping_factor=160'
• Files inherit striping information from the parent directory; this cannot be changed once the file is written
   • Set the striping before copying in files
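
A hedged Fortran sketch of setting the striping hint through MPI-IO at file-creation time; the file name is invented, and striping_factor is the standard ROMIO hint name:

  program set_stripe_hint
    use mpi
    implicit none
    integer :: info, fh, ierr
    call MPI_Init(ierr)
    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, 'striping_factor', '160', ierr)  ! stripe over 160 OSTs
    call MPI_File_open(MPI_COMM_WORLD, 'out.dat', &          ! hypothetical file
         ior(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, fh, ierr)
    ! ... collective writes go here ...
    call MPI_File_close(fh, ierr)
    call MPI_Info_free(info, ierr)
    call MPI_Finalize(ierr)
  end program set_stripe_hint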
PGI Compiler
Cray Compiler Environment
Cray Scientific Libraries
• Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
   • Fortran Compiler: ftn
   • C Compiler: cc
   • C++ Compiler: CC
• Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
   • Cray MPT (MPI, Shmem, etc.)
   • Cray LibSci (BLAS, LAPACK, etc.)
   • ...
• Choosing the underlying compiler is done via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly
• Always load the appropriate xtpe-<arch> module for your machine
   • Enables the proper compiler target
   • Links optimized math libraries
• Traditional (scalar) optimizations are controlled via -O# compiler flags
   • Default: -O2
• More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
   • These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
• Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with -Mipa=fast
• See man pgf90, man pgcc, or man pgCC for more information about compiler options.
• Compiler feedback is enabled with -Minfo and -Mneginfo
   • This can provide valuable information about what optimizations were or were not done, and why.
• To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
• It’s possible to disable optimizations included with -fast if you believe one is causing problems
   • For example: -fast -Mnolre enables -fast and then disables loop-carried redundancy elimination
• To get more information about any compiler flag, add -help with the flag in question
   • pgf90 -help -fast will give more information about the -fast flag
• OpenMP is enabled with the -mp flag
Some compiler options may affect both performance and accuracy. Lower accuracy often yields higher performance, but these flags can also enforce accuracy (an illustrative example follows the list).

• -Kieee: All FP math strictly conforms to IEEE 754 (off by default)
• -Ktrap: Turns on processor trapping of FP exceptions
• -Mdaz: Treat all denormalized numbers as zero
• -Mflushz: Set SSE to flush-to-zero (on with -fast)
• -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
   • Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
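
A small, hypothetical Fortran demonstration of where these flags matter; the values are contrived:

  program fp_flags_demo
    implicit none
    real(8) :: a, b, s
    real(4) :: tiny_val
    a = 1.0d0
    b = 1.0d-16
    ! -Kieee forces strict IEEE evaluation order; relaxed or vectorized
    ! math may reassociate sums like this and change low-order bits.
    s = ((a + b) - a) + b
    print *, 's =', s              ! 1.0d-16 with strict ordering
    ! -Mflushz / -Mdaz flush denormals to zero.
    tiny_val = 1.0e-30
    tiny_val = tiny_val * 1.0e-10  ! about 1.0e-40, denormal in real(4)
    print *, 'tiny =', tiny_val    ! 0.0 when flush-to-zero is on
  end program fp_flags_demo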
• Cray has a long tradition of high performance compilers on Cray platforms (traditional vector, T3E, X1, X2)
   • Vectorization
   • Parallelization
   • Code transformation
   • More...
• Investigated leveraging an open source compiler called LLVM
• First release December 2008
[Compiler architecture diagram: Fortran source feeds the Cray Fortran front end; C and C++ source feed a C/C++ front end supplied by Edison Design Group, with Cray-developed code for extensions and interface support. Both feed interprocedural analysis and the optimization and parallelization stage (Cray Inc. compiler technology), then either the X86 code generator (from open source LLVM, with additional Cray-developed optimizations and interface support) or the Cray X2 code generator, producing the object file.]
• Standard conforming languages and programming models
   • Fortran 2003
   • UPC & Co-Array Fortran
      • Fully optimized and integrated into the compiler
      • No preprocessor involved
      • Target the network appropriately:
         • GASNet with Portals
         • DMAPP with Gemini & Aries
• Ability and motivation to provide high-quality support for custom Cray network hardware
• Cray technology focused on scientific applications
   • Takes advantage of Cray’s extensive knowledge of automatic vectorization
   • Takes advantage of Cray’s extensive knowledge of automatic shared memory parallelization
   • Supplements, rather than replaces, the available compiler choices
• Make sure it is available
   • module avail PrgEnv-cray
• To access the Cray compiler
   • module load PrgEnv-cray
• To target the various chips
   • module load xtpe-[barcelona,shanghai,istanbul]
• Once you have loaded the module, “cc” and “ftn” are the Cray compilers
   • Recommend just using the default options
   • Use -rm (Fortran) and -hlist=m (C) to find out what happened
• man crayftn
• Excellent vectorization
   • Vectorizes more loops than other compilers
• OpenMP 3.0
   • Tasking and nesting
• PGAS: functional UPC and CAF available today
• C++ support
• Automatic parallelization
   • Modernized version of the Cray X1 streaming capability
   • Interacts with OMP directives
• Cache optimizations
   • Automatic blocking
   • Automatic management of what stays in cache
• Prefetching, interchange, fusion, and much more...
• Loop-based optimizations
   • Vectorization
   • OpenMP
      • Autothreading
   • Interchange
   • Pattern matching
   • Cache blocking / non-temporal / prefetching
• Fortran 2003 standard; working on 2008
• PGAS (UPC and Co-Array Fortran)
   • Some performance optimizations available in 7.1
• Optimization feedback: Loopmark
• The Cray compiler supports a full and growing set of directives and pragmas, for example (a usage sketch follows):

!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... many more
!dir$ blockable

See man directives and man loop_info.
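
A hedged example of one of these directives in use (the routine is invented): !dir$ ivdep asserts there is no loop-carried dependence, letting a scatter loop vectorize.

  subroutine scatter_add(n, idx, a, b)
    integer, intent(in)    :: n, idx(n)
    real(8), intent(inout) :: a(*)
    real(8), intent(in)    :: b(n)
    integer :: i
    ! The compiler cannot prove idx(:) has no repeats; ivdep asserts it,
    ! so the iterations can be treated as independent and vectorized.
  !dir$ ivdep
    do i = 1, n
       a(idx(i)) = a(idx(i)) + b(i)
    end do
  end subroutine scatter_add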
• The compiler can generate a filename.lst file.
   • Contains an annotated listing of your source code, with letters indicating important optimizations

%%% Loopmark Legend %%%
Primary Loop Type          Modifiers
A - Pattern matched        a - vector atomic memory operation
C - Collapsed              b - blocked
D - Deleted                f - fused
E - Cloned                 i - interchanged
I - Inlined                m - streamed but not partitioned
M - Multithreaded          p - conditional, partial and/or computed
P - Parallel/Tasked        r - unrolled
V - Vectorized             s - shortloop
W - Unwound                t - array syntax temp used
                           w - unwound
• ftn -rm ...  or  cc -hlist=m ...
29. b-------<   do i3=2,n3-1
30. b b-----<      do i2=2,n2-1
31. b b Vr--<        do i1=1,n1
32. b b Vr            u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr      >           + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr            u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr      >           + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->        enddo
37. b b Vr--<        do i1=2,n1-1
38. b b Vr            r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr      >              - a(0) * u(i1,i2,i3)
40. b b Vr      >              - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr      >              - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->        enddo
43. b b----->      enddo
44. b------->    enddo
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
   32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
 A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
   and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
 A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
 A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
 A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
 A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
 A loop starting at line 37 was vectorized.
• -hbyteswapio
   • Link-time option
   • Applies to all unformatted Fortran I/O
• Assign command
   • With the PrgEnv-cray module loaded, do this:

setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du

• Can use assign to be more precise
• OpenMP is ON by default
   • Optimizations controlled by -Othread#
   • To shut it off, use -Othread0, -xomp, or -hnoomp
• Autothreading is NOT on by default
   • -hautothread to turn it on
   • Modernized version of the Cray X1 streaming capability
   • Interacts with OMP directives

If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time.
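
A minimal OpenMP sketch (illustrative): CCE honors the directive by default, while compiling with -hnoomp or -Othread0 leaves the loop serial.

  program omp_demo
    use omp_lib
    implicit none
    integer :: i
    real(8) :: a(1000)
  !$omp parallel do              ! honored by default under CCE
    do i = 1, 1000
       a(i) = sqrt(dble(i))
    end do
  !$omp end parallel do
    print *, omp_get_max_threads(), a(1000)
  end program omp_demo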
• Traditional model
   • Tuned general purpose codes
      • Only good for dense problems
      • Not problem sensitive
      • Not architecture sensitive
• Goal of scientific libraries
   • Improve productivity at optimal performance
• Cray uses four concentrations to achieve this
   • Standardization
      • Use standard or “de facto” standard interfaces whenever available
   • Hand tuning
      • Use extensive knowledge of the target processor and network to optimize common code patterns
   • Auto-tuning
      • Automate code generation and a huge number of empirical performance evaluations to configure software to the target platforms
   • Adaptive libraries
      • Make runtime decisions to choose the best kernel/library/routine
• Three separate classes of standardization, each with a corresponding definition of productivity
   1. Standard interfaces (e.g., dense linear algebra)
      • Bend over backwards to keep everything the same despite increases in machine complexity. Innovate ‘behind the scenes’.
      • Productivity -> innovation to keep things simple
   2. Adoption of near-standard interfaces (e.g., sparse kernels)
      • Assume near-standards and promote those. Out-mode alternatives. Innovate ‘behind the scenes’.
      • Productivity -> innovation in the simplest areas (requires the same innovation as #1 also)
   3. Simplification of non-standard interfaces (e.g., FFT)
      • Productivity -> innovation to make things simpler than they are
• Algorithmic tuning
   • Increased performance by exploiting algorithmic improvements
      • Sub-blocking, new algorithms
   • LAPACK, ScaLAPACK
• Kernel tuning
   • Improve the numerical kernel performance in assembly language
   • BLAS, FFT
• Parallel tuning
   • Exploit Cray’s custom network interfaces and MPT
   • ScaLAPACK, P-CRAFFT
Dense       Sparse     FFT
BLAS        CASK       CRAFFT
LAPACK      PETSc      FFTW
ScaLAPACK   Trilinos   P-CRAFFT
IRT

IRT - Iterative Refinement Toolkit
CASK - Cray Adaptive Sparse Kernels
CRAFFT - Cray Adaptive FFT
• Serial and parallel versions of sparse iterative linear solvers
   • Suites of iterative solvers
      • CG, GMRES, BiCG, QMR, etc.
   • Suites of preconditioning methods
      • IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
   • Support for block sparse matrix data formats for better performance
   • Interface to external packages (ScaLAPACK, SuperLU_DIST)
   • Fortran and C support
   • Newton-type nonlinear solvers
• Large user community
   • DoE labs, PSC, CSCS, CSC, ERDC, AWE and more
• http://www-unix.mcs.anl.gov/petsc/petsc-as
• Cray provides state-of-the-art scientific computing packages to strengthen the capability of PETSc
   • Hypre: scalable parallel preconditioners
      • AMG (very scalable and efficient for a specific class of problems)
      • 2 different ILUs (general purpose)
      • Sparse approximate inverse (general purpose)
   • ParMetis: parallel graph partitioning package
   • MUMPS: parallel multifrontal sparse direct solver
   • SuperLU: sequential version of SuperLU_DIST
• To use Cray-PETSc, load the appropriate module:
   module load petsc
   module load petsc-complex
  (no need to load a compiler-specific module)
• Treat the Cray distribution as your local PETSc installation
• The Trilinos Project: http://trilinos.sandia.gov/
   • “an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems”
• A unique design feature of Trilinos is its focus on packages.
• Very large user base and growing rapidly. Important to DOE.
• Cray’s optimized Trilinos released on January 21
   • Includes 50+ Trilinos packages
   • Optimized via CASK
   • Any code that uses Epetra objects can access the optimizations
• Usage:
   module load trilinos
• CASK is a product developed at Cray using the Cray Auto-tuning Framework (Cray ATF)
• The CASK concept:
   • Analyze the matrix at minimal cost
   • Categorize the matrix against internal classes
   • Based on offline experience, find the best CASK code for the particular matrix
   • Previously assigned “best” compiler flags to CASK code
   • Assign the best CASK kernel and perform Ax
• CASK sits silently beneath PETSc on Cray systems
   • Trilinos support coming soon
• Released with PETSc 3.0 in February 2009
   • Generic and blocked CSR formats
Large-scale application
  • Highly portable
  • User controlled

PETSc / Trilinos / Hypre (all systems)
  • Highly portable
  • User controlled

CASK (Cray only)
  • XT4 & XT5 specific / tuned
  • Invisible to user
[Chart: speedup of parallel SpMV on 8 cores across 60 different matrices; speedups range from 1.0 to about 1.4]
[Charts: SpMV and block Jacobi preconditioning, CASK vs. PETSc, N=65,536 to 67,108,864, on up to 1024 cores. Left: MatMult-CASK vs. MatMult-PETSc (GFlops, up to ~200). Right: BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc (GFlops, up to ~300).]
[Chart: SpMV performance (MFlops, 0-2000) per matrix; geometric mean of 80 sparse matrix instances from the U. of Florida collection]
[Chart: performance (MFlops, 0-5000) vs. number of vectors (1-8), CASK vs. original Trilinos]
• In FFTs, the problems are
   • Which library to choose
   • How to use complicated interfaces (e.g., FFTW)
• Standard FFT practice
   • Do a plan stage
      • Deduce machine and system information and run micro-kernels
      • Select the best FFT strategy
   • Do an execute
• Our system knowledge can remove some of this cost!
• CRAFFT is designed with simple-to-use interfaces
   • Planning and execution stages can be combined into one function call
   • Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
• CRAFFT provides both offline and online tuning
   • Offline tuning
      • Which FFT kernel to use
      • Pre-computed plans for common-sized FFTs
         • No expensive plan stages
   • Online tuning is performed as necessary at runtime as well
• At runtime, CRAFFT will adaptively select the best FFT kernel to use based on both offline and online testing (e.g. FFTW, custom FFT)
              128x128   256x256   512x512
FFTW plan     74        312       2758
FFTW exec     0.105     0.97      9.7
CRAFFT plan   0.00037   0.0009    0.00005
CRAFFT exec   0.139     1.2       11.4
1. Load module fftw/3.2.0 or higher.
2. Add a Fortran statement “use crafft”.
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments (the trailing arguments below); a complete sketch follows.
   In-place, implicit memory management:
      call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
   In-place, explicit memory management:
      call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
   Out-of-place, explicit memory management:
      crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)

Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
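
Putting the steps together, a hedged sketch (the sizes are invented, and the array shapes and meaning of the leading dimensions should be checked against intro_crafft):

  program crafft_demo
    use crafft
    implicit none
    integer, parameter :: n1=64, n2=64, n3=64
    complex(8) :: input(n1,n2,n3)
    integer :: isign
    call crafft_init()
    input = (1.0d0, 0.0d0)
    isign = -1
    ! In-place 3-D complex-to-complex transform, implicit memory
    ! management; leading dimensions here match the array extents.
    call crafft_z2z3d(n1, n2, n3, input, n1, n2, isign)
  end program crafft_demo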
• As of December 2009, CRAFFT includes distributed parallel transforms
• Uses the CRAFFT interface prefixed by “p”, with optional arguments
• Can provide performance improvement over FFTW 2.1.5
• Currently implemented
   • Complex-complex
   • Real-complex and complex-real
   • 3-D and 2-D
   • In-place and out-of-place
• Upcoming
   • C language support for serial and parallel
1. Add “use crafft” to Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI is initialized and data distributed (see manpage)
4. Call crafft, e.g. (trailing arguments are optional):
   2-D complex-complex, in-place, internal memory management:
      call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
   2-D complex-complex, in-place with no internal memory:
      call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
   2-D complex-complex, out-of-place, internal memory manager:
      call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
   2-D complex-complex, out-of-place, no internal memory:
      crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)

Each routine above has a manpage. Also see the 3-D equivalent: man crafft_pz2z3d
[Chart: 2-D FFT (N x N, transposed) on 128 cores; performance (Mflops, up to 140,000) of pcrafft vs. FFTW 2.1.5 for N=128 to 65536]
• Solves linear systems in single precision
• Obtains solutions accurate to double precision
   • For well-conditioned problems
• Serial and parallel versions of LU, Cholesky, and QR
• 2 usage methods
   • IRT benchmark routines
      • Uses IRT ‘under the covers’ without changing your code
      • Simply set an environment variable
      • Useful when you cannot alter source code
   • Advanced IRT API
      • If greater control of the iterative refinement process is required
      • Allows
         • condition number estimation
         • error bounds return
         • minimization of either forward or backward error
         • ‘fall back’ to full precision if the condition number is too high
         • max number of iterations to be altered by users
• “High Power Electromagnetic Wave Heating in the ITER Burning Plasma”
• RF heating in a tokamak
• Maxwell-Boltzmann equations
• FFT
• Dense linear system
• Calculation of the quasi-linear operator

(Courtesy of Richard Barrett)
[Chart: achieved performance relative to theoretical peak]
Decide if you want to use the advanced API or the benchmark API:
• Benchmark API:
   setenv IRT_USE_SOLVERS 1
• Advanced API:
   1. Locate the factor and solve in your code (LAPACK or ScaLAPACK)
   2. Replace factor and solve with a call to the IRT routine
      e.g. dgesv -> irt_lu_real_serial
      e.g. pzgesv -> irt_lu_complex_parallel
      e.g. pzposv -> irt_po_complex_parallel
   3. Set advanced arguments
      • Forward error convergence for most accurate solution
      • Condition number estimate
      • “Fall back” to full precision if the condition number is too high
• LibSci 10.4.2 (February 18th, 2010)
   • OpenMP-aware LibSci
   • Allows calling of BLAS inside or outside a parallel region
   • Single library supported
      • No separate multi-thread and single-thread libraries (-lsci and -lsci_mp)
      • Performance not compromised
      • (there were some usage restrictions with this version)
• LibSci 10.4.3 (April 2010)
   • Parallel CRAFFT improvements
   • Fixes the usage restrictions of 10.4.2
   • OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
• Upcoming
   • PETSc 3.1.0 (May 20)
   • Trilinos 10.2 (May 20)
CrayPAT
• Assist the user with application performance analysis and optimization
   • Help the user identify important and meaningful information from potentially massive data sets
   • Help the user identify problem areas instead of just reporting data
   • Bring optimization knowledge to a wider set of users
• Focus on ease of use and intuitive user interfaces
   • Automatic program instrumentation
   • Automatic analysis
• Target scalability issues in all areas of tool development
   • Data management
      • Storage, movement, presentation
• Supports traditional post-mortem performance analysis
   • Automatic identification of performance problems
      • Indication of causes of problems
      • Suggestions of modifications for performance improvement
• CrayPat
   • pat_build: automatic instrumentation (no source code changes needed)
   • Run-time library for measurements (transparent to the user)
   • pat_report for performance analysis reports
   • pat_help: online help utility
• Cray Apprentice2
   • Graphical performance analysis and visualization tool
• CrayPat
   • Instrumentation of optimized code
   • No source code modification required
   • Data collection transparent to the user
   • Text-based performance reports
   • Derived metrics
   • Performance analysis
• Cray Apprentice2
   • Performance data visualization tool
   • Call tree view
   • Source code mappings
• When performance measurement is triggered
   • External agent (asynchronous)
      • Sampling
         • Timer interrupt
         • Hardware counter overflow
   • Internal agent (synchronous)
      • Code instrumentation
         • Event based
         • Automatic or manual instrumentation
• How performance data is recorded
   • Profile ::= summation of events over time
      • Run-time summarization (functions, call sites, loops, ...)
   • Trace file ::= sequence of events over time
• Millions of lines of code
   • Automatic profiling analysis
      • Identifies top time-consuming routines
      • Automatically creates an instrumentation template customized to your application
• Lots of processes/threads
   • Load imbalance analysis
      • Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
      • Estimates savings if the corresponding section of code were balanced
• Long running applications
   • Detection of outliers
• Important performance statistics:
   • Top time-consuming routines
   • Load balance across computing resources
   • Communication overhead
   • Cache utilization
   • FLOPS
   • Vectorization (SSE instructions)
   • Ratio of computation versus communication
• No source code or makefile modification required
   • Automatic instrumentation at group (function) level
      • Groups: mpi, io, heap, math SW, ...
• Performs link-time instrumentation
   • Requires object files
   • Instruments optimized code
   • Generates a stand-alone instrumented program
   • Preserves the original binary
   • Supports sample-based and event-based instrumentation
• Analyzes the performance data and directs the user to meaningful information
• Simplifies the procedure to instrument and collect performance data for novice users
• Based on a two-phase mechanism
   1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
   2. Provides performance information on the most significant parts of the application
• Performs data conversion
   • Combines information from the binary with raw performance data
• Performs analysis on data
• Generates text report of performance results
• Formats data for input into Cray Apprentice2
• CrayPat / Cray Apprentice2 5.0 released September 10, 2009
   • New internal data format
   • FAQ
   • Grid placement support
   • Better caller information (ETC group in pat_report)
   • Support for larger numbers of processors
   • Client/server version of Cray Apprentice2
   • Panel help in Cray Apprentice2
• Access performance tools software

   % module load xt-craypat apprentice2

• Build application keeping .o files (CCE: -h keepfiles)

   % make clean
   % make

• Instrument application for automatic profiling analysis
   • You should get an instrumented program a.out+pat

   % pat_build -O apa a.out

• Run application to get top time-consuming routines
   • You should get a performance file (“<sdatafile>.xf”) or multiple files in a directory <sdatadir>

   % aprun ... a.out+pat     (or qsub <pat script>)
• Generate report and .apa instrumentation file

   % pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]

• Inspect .apa file and sampling report
• Verify if additional instrumentation is needed
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#        pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf

# ----------------------------------------------------------------------

#     HWPC group to collect by default.

 -Drtenv=PAT_RT_HWPC=1  # Summary with instructions metrics.

# ----------------------------------------------------------------------

#     Libraries to trace.

 -g mpi

# ----------------------------------------------------------------------

#     User-defined functions to trace, sorted by % of samples.
#     Limited to top 200. A function is commented out if it has < 1%
#     of samples, or if a cumulative threshold of 90% has been reached,
#     or if it has size < 200 bytes.

# Note: -u should NOT be specified as an additional option.

# 43.37% 99659 bytes
 -T mlwxyz_

# 16.09% 17615 bytes
 -T half_

# 6.82% 6846 bytes
 -T artv_

# 1.29% 5352 bytes
 -T currenh_

# 1.03% 25294 bytes
 -T bndbo_

# Functions below this point account for less than 10% of samples.

# 1.03% 31240 bytes
#      -T bndto_

...

# ----------------------------------------------------------------------

 -o mhd3d.x+apa                    # New instrumented program.

 /work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
biolib     Cray Bioinformatics library routines
blacs      Basic Linear Algebra communication subprograms
blas       Basic Linear Algebra subprograms
caf        Co-Array Fortran (Cray X2 systems only)
fftw       Fast Fourier Transform library (64-bit only)
hdf5       manages extremely large and complex data collections
heap       dynamic heap
io         includes stdio and sysio groups
lapack     Linear Algebra Package
lustre     Lustre File System
math       ANSI math
mpi        MPI
netcdf     network common data form (manages array-oriented scientific data)
omp        OpenMP API (not supported on Catamount)
omp-rtl    OpenMP runtime library (not supported on Catamount)
portals    Lightweight message passing API
pthreads   POSIX threads (not supported on Catamount)
scalapack  Scalable LAPACK
shmem      SHMEM
stdio      all library functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
upc        Unified Parallel C (Cray X2 systems only)
 0  Summary with instruction metrics
 1  Summary with TLB metrics
 2  L1 and L2 metrics
 3  Bandwidth information
 4  Hypertransport information
 5  Floating point mix
 6  Cycles stalled, resources idle
 7  Cycles stalled, resources full
 8  Instructions and branches
 9  Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches
 Regions, useful to break up long routines
    int PAT_region_begin (int id, const char *label)
    int PAT_region_end (int id)
 Disable/Enable Profiling, useful for excluding initialization
    int PAT_record (int state)
 Flush buffer, useful when program isn’t exiting cleanly
    int PAT_flush_buffer (void)
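A minimal sketch of how these calls can be combined (the pat_api.h header and the PAT_STATE_ON / PAT_STATE_OFF constants are assumed from the CrayPat API; setup() and solve() are hypothetical application routines):

   #include <pat_api.h>

   void setup(void);   /* hypothetical */
   void solve(void);   /* hypothetical */

   int main(void)
   {
       PAT_record(PAT_STATE_OFF);      /* exclude initialization */
       setup();
       PAT_record(PAT_STATE_ON);

       PAT_region_begin(1, "solve");   /* mark a region of interest */
       solve();
       PAT_region_end(1);

       PAT_flush_buffer();  /* flush in case the program exits uncleanly */
       return 0;
   }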
 Instrument application for further analysis (a.out+apa)

       % pat_build –O <apafile>.apa

 Run application

       % aprun … a.out+apa         (or    qsub <apa script>)

 Generate text report and visualization file (.ap2)

       % pat_report –o my_text_report.txt [<datafile>.xf | <datadir>]

 View report in text and/or with Cray Apprentice2

       % app2 <datafile>.ap2
 MUST run on Lustre ( /work/… , /lus/… , /scratch/… , etc.)
 Number of files used to store raw data:
    1 file created for program with 1 – 256 processes
    √n files created for program with 257 – n processes
    Ability to customize with PAT_RT_EXPFILE_MAX
 Full trace files show transient events but are too large
 Current run-time summarization misses transient events
 Plan to add ability to record:
    Top N peak values (N small)
    Approximate std dev over time
    For time, memory traffic, etc.
    During tracing and sampling
 Call graph profile
 Communication statistics
 Time-line view
    Communication
    I/O
 Activity view
 Pair-wise communication statistics
 Text reports
 Source code mapping

Cray Apprentice2 is targeted to help identify and correct:
    Load imbalance
    Excessive communication
    Network contention
    Excessive serialization
    I/O problems
Switch Overview display
Overview chart shows the Min, Avg, and Max values, with -1/+1 std dev marks.
Width  inclusive time

                                   Height  exclusive time


                                                       Filtered
                                                       nodes or
                                                       sub tree
Load balance overview:
Height  Max time
Middle bar  Average time
                                     DUH Button:
Lower bar  Min time
                                     Provides hints
Yellow represents                    for performance
imbalance time                       tuning



          Function
                                                             Zoom
          List




September 21-24, 2009       © Cray Inc.                             113
Call tree interaction:
   Right mouse click on a node: node menu (e.g., hide/unhide children)
   Right mouse click elsewhere: view menu (e.g., Filter)
   Sort options: % Time, Time, Imbalance %, Imbalance time
   The Function List can be toggled off
 Cray Apprentice2 panel help
 pat_help – interactive help on the Cray Performance toolset
 FAQ available through pat_help
 intro_craypat(1)
       Introduces the craypat performance tool
 pat_build
       Instrument a program for performance analysis
 pat_help
       Interactive online help utility
 pat_report
       Generate performance report in both text and for use with GUI
 hwpc(3)
       Describes predefined hardware performance counter groups
 papi_counters(5)
       Lists PAPI event counters
       Use papi_avail or papi_native_avail utilities to get list of events when running on a specific architecture
pat_report: Help for -O option:

Available option values are in left column, a prefix can be specified:

  ct                    -O calltree
  defaults              Tables that would appear by default.
  heap                  -O heap_program,heap_hiwater,heap_leaks
  io                    -O read_stats,write_stats
  lb                    -O load_balance
  load_balance          -O lb_program,lb_group,lb_function
  mpi                   -O mpi_callers
  ---
  callers               Profile by Function and Callers
  callers+hwpc          Profile by Function and Callers
  callers+src           Profile by Function and Callers,      with Line Numbers
  callers+src+hwpc      Profile by Function and Callers,      with Line Numbers
  calltree              Function Calltree View
  calltree+hwpc         Function Calltree View
  calltree+src          Calltree View with Callsite Line      Numbers
  calltree+src+hwpc     Calltree View with Callsite Line      Numbers
  ...


 Interactive by default, or use trailing '.' to just print a topic:


     New FAQ in craypat 5.0.0.


     Has counter and counter group information


          % pat_help counters amd_fam10h groups .




The top level CrayPat/X help topics are listed below.
       A good place to start is:
               overview
       If a topic has subtopics, they are displayed under the heading
       "Additional topics", as below. To view a subtopic, you need
       only enter as many initial letters as required to distinguish
       it from other items in the list. To see a table of contents
       including subtopics of those subtopics, etc., enter:
               toc
       To produce the full text corresponding to the table of contents,
       specify "all", but preferably in a non-interactive invocation:
               pat_help all . > all_pat_help
               pat_help report all . > all_report_help
   Additional topics:
       API                        execute
       balance                    experiment
       build                      first_example
       counters                   overview
       demos                      report
       environment                run
pat_help (.=quit ,=back ^=up /=top ~=search)
=>

CPU Optimizations
Optimizing Communication
    I/O Best Practices
Poor loop order results in poor striding: the inner-most loop strides on a slow dimension of each array. The best the compiler can do is unroll, so there is little to no cache reuse.

55. 1                 ii = 0
56. 1 2-----------< do b = abmin, abmax
57. 1 2 3---------<    do j = ijmin, ijmax
58. 1 2 3                 ii = ii+1
59. 1 2 3                 jj = 0
60. 1 2 3 4-------<       do a = abmin, abmax
61. 1 2 3 4 r8----<         do i = ijmin, ijmax
62. 1 2 3 4 r8                jj = jj+1
63. 1 2 3 4 r8                f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
64. 1 2 3 4 r8                f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
65. 1 2 3 4 r8                f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
66. 1 2 3 4 r8                f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
67. 1 2 3 4 r8---->         end do
68. 1 2 3 4------->       end do
69. 1 2 3--------->    end do
70. 1 2-----------> end do
Poor loop order also results in poor cache reuse: for every L1 cache hit there are two misses, and overall only 2/3 of all references were found in level 1 or 2 cache.

USER / #1.Original Loops
-----------------------------------------------------------------
 Time%                                               55.0%
 Time                                            13.938244 secs
 Imb.Time                                         0.075369 secs
 Imb.Time%                                            0.6%
 Calls                              0.1 /sec           1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED       11.858M/sec      165279602 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          11.931M/sec      166291054 fills
 PAPI_L1_DCM                    23.499M/sec      327533338 misses
 PAPI_L1_DCA                    34.635M/sec      482751044 refs
 User time (approx)             13.938 secs    36239439807 cycles   100.0%Time
 Average Time per Call                           13.938244 sec
 CrayPat Overhead : Time           0.0%
 D1 cache hit,miss ratios          32.2% hits        67.8% misses
 D2 cache hit,miss ratio           49.8% hits        50.2% misses
 D1+D2 cache hit,miss ratio        66.0% hits        34.0% misses
The reordered loop nest makes the inner-most loop stride-1 on both arrays: memory accesses proceed along cache lines, allowing reuse, and the compiler is able to vectorize and make better use of SSE instructions.

75. 1 2-----------< do i = ijmin, ijmax
76. 1 2               jj = 0
77. 1 2 3---------<   do a = abmin, abmax
78. 1 2 3 4-------<     do j = ijmin, ijmax
79. 1 2 3 4               jj = jj+1
80. 1 2 3 4               ii = 0
81. 1 2 3 4 Vcr2--<       do b = abmin, abmax
82. 1 2 3 4 Vcr2            ii = ii+1
83. 1 2 3 4 Vcr2            f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
84. 1 2 3 4 Vcr2            f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
85. 1 2 3 4 Vcr2            f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
86. 1 2 3 4 Vcr2            f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
87. 1 2 3 4 Vcr2-->       end do
88. 1 2 3 4------->     end do
89. 1 2 3--------->   end do
90. 1 2-----------> end do
Improved striding greatly improved cache reuse: runtime was cut nearly in half. Still, some 20% of all references are cache misses.

USER / #2.Reordered Loops
-----------------------------------------------------------------
 Time%                                               31.4%
 Time                                             7.955379 secs
 Imb.Time                                         0.260492 secs
 Imb.Time%                                            3.8%
 Calls                              0.1 /sec           1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED        0.419M/sec        3331289 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          15.285M/sec      121598284 fills
 PAPI_L1_DCM                    13.330M/sec      106046801 misses
 PAPI_L1_DCA                    66.226M/sec      526855581 refs
 User time (approx)              7.955 secs    20684020425 cycles   100.0%Time
 Average Time per Call                            7.955379 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          79.9% hits        20.1% misses
 D2 cache hit,miss ratio            2.7% hits        97.3% misses
 D1+D2 cache hit,miss ratio        80.4% hits        19.6% misses
First loop, partially vectorized and unrolled by 4:

95.  1                 ii = 0
96.  1 2-----------< do j = ijmin, ijmax
97.  1 2 i---------<   do b = abmin, abmax
98.  1 2 i               ii = ii+1
99.  1 2 i               jj = 0
100. 1 2 i i-------<     do i = ijmin, ijmax
101. 1 2 i i Vpr4--<       do a = abmin, abmax
102. 1 2 i i Vpr4            jj = jj+1
103. 1 2 i i Vpr4            f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
104. 1 2 i i Vpr4            f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
105. 1 2 i i Vpr4-->       end do
106. 1 2 i i------->     end do
107. 1 2 i--------->   end do
108. 1 2-----------> end do

Second loop, vectorized and unrolled by 4:

109. 1                 jj = 0
110. 1 2-----------< do i = ijmin, ijmax
111. 1 2 3---------<   do a = abmin, abmax
112. 1 2 3               jj = jj+1
113. 1 2 3               ii = 0
114. 1 2 3 4-------<     do j = ijmin, ijmax
115. 1 2 3 4 Vr4---<       do b = abmin, abmax
116. 1 2 3 4 Vr4             ii = ii+1
117. 1 2 3 4 Vr4             f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
118. 1 2 3 4 Vr4             f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
119. 1 2 3 4 Vr4--->       end do
120. 1 2 3 4------->     end do
121. 1 2 3--------->   end do
122. 1 2-----------> end do
Fissioning further improved cache reuse and resulted in better vectorization: runtime was further reduced, the cache hit/miss ratios improved slightly, and the loopmark file points to better vectorization from the fissioned loops.

USER / #3.Fissioned Loops
-----------------------------------------------------------------
 Time%                                                9.8%
 Time                                             2.481636 secs
 Imb.Time                                         0.045475 secs
 Imb.Time%                                            2.1%
 Calls                              0.4 /sec           1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED        1.175M/sec        2916610 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          34.109M/sec       84646518 fills
 PAPI_L1_DCM                    26.424M/sec       65575972 misses
 PAPI_L1_DCA                   156.705M/sec      388885686 refs
 User time (approx)              2.482 secs     6452279320 cycles   100.0%Time
 Average Time per Call                            2.481636 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          83.1% hits       16.9% misses
 D2 cache hit,miss ratio            3.3% hits       96.7% misses
 D1+D2 cache hit,miss ratio        83.7% hits       16.3% misses
 Cache blocking is a combination of strip mining and loop interchange, designed
  to increase data reuse (see the sketch after this list)
     Takes advantage of temporal reuse: re-reference array elements already
       referenced
     Good blocking will take advantage of spatial reuse: work with the cache
       lines!
 Many ways to block any given loop nest
     Which loops get blocked?
     What block size(s) to use?
 Analysis can reveal which ways are beneficial
 But trial-and-error is probably faster
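A generic sketch (not from the original slides; the routine name, N, and BS are illustrative) of strip mining plus loop interchange in C, blocking the transposed access to b so each strip of cache lines is reused across the whole i sweep:

   #define N  1024
   #define BS 64   /* block size: tune to the target cache level */

   void blocked_transpose_add(double a[N][N], const double b[N][N])
   {
       /* Strip-mine j into blocks of BS, then interchange so the block
          loop is outermost: the BS rows of b touched by b[j][i] stay
          resident in cache while i sweeps all rows of a. */
       for (int jb = 0; jb < N; jb += BS)
           for (int i = 0; i < N; i++)
               for (int j = jb; j < jb + BS; j++)
                   a[i][j] += b[j][i];
   }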
2D Laplacian example, on a grid with i = 1..16 and j = 1..8:

   do j = 1, 8
      do i = 1, 16
         a(i,j) = u(i-1,j) + u(i+1,j) &
                  - 4*u(i,j)          &
                  + u(i,j-1) + u(i,j+1)
      end do
   end do

Cache structure for this example:
   Each line holds 4 array elements
   Cache can hold 12 lines of u data

There is no cache reuse between outer loop iterations.
The unblocked loop takes 120 cache misses. Block the inner loop:

   do IBLOCK = 1, 16, 4
      do j = 1, 8
         do i = IBLOCK, IBLOCK + 3
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 4*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
         end do
      end do
   end do

Now we have reuse of the "j+1" data, and the misses drop to 80.
One-dimensional blocking reduced misses from 120 to 80. Iterate over 4 × 4 blocks:

   do JBLOCK = 1, 8, 4
      do IBLOCK = 1, 16, 4
         do j = JBLOCK, JBLOCK + 3
            do i = IBLOCK, IBLOCK + 3
               a(i,j) = u(i-1,j) + u(i+1,j) &
                        - 4*u(i,j)          &
                        + u(i,j-1) + u(i,j+1)
            end do
         end do
      end do
   end do

Two-dimensional blocking reduces the misses to 60: better use of spatial locality (cache lines).
   Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
   Operations can be arranged to create multiple levels of blocking
      Block for register
      Block for cache (L1, L2, L3)
      Block for TLB
   No further discussion here. Interested readers can see
      Any book on code optimization
             Sun’s Techniques for Optimizing Applications: High Performance Computing contains a decent introductory discussion in
              Chapter 8
             Insert your favorite book here
        Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication
         algorithms for architectures with hierarchical memories. FLAME Working Note #4 TR-
         2001-22, The University of Texas at Austin, Department of Computer Sciences
             Develops algorithms and cost models for GEMM in hierarchical memories
        Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM
         Transactions on Mathematical Software 34, 3 (May), 1-25
             Description of GotoBLAS DGEMM
“I tried cache-blocking my code, but it didn’t help”


 You’re doing it wrong.
    Your block size is too small (too much loop overhead).
    Your block size is too big (data is falling out of cache).
    You’re targeting the wrong cache level (?)
    You haven’t selected the correct subset of loops to block.
 The compiler is already blocking that loop.
 Prefetching is acting to minimize cache misses.
 Computational intensity within the loop nest is very large, making blocking less
  important.
 Multigrid PDE solver
 Class D, 64 MPI ranks
    Global grid is 1024 × 1024 × 1024
    Local grid is 258 × 258 × 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil
    There is good data reuse along cache lines in the leading dimension, even
      without blocking

   do i3 = 2, 257
      do i2 = 2, 257
         do i1 = 2, 257
   !         update u(i1,i2,i3)
   !         using 27-point stencil
         end do
      end do
   end do

(The stencil touches the 3 × 3 × 3 neighborhood i1-1..i1+1, i2-1..i2+1, i3-1..i3+1.)
 Block the inner two loops
 Creates blocks extending along the i3 direction

   do I2BLOCK = 2, 257, BS2
      do I1BLOCK = 2, 257, BS1
         do i3 = 2, 257
            do i2 = I2BLOCK,                   &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = I1BLOCK,                &
                       min(I1BLOCK+BS1-1, 257)
   !              update u(i1,i2,i3)
   !              using 27-point stencil
               end do
            end do
         end do
      end do
   end do

   Block size   Mop/s/process
   unblocked       531.50
   16 × 16         279.89
   22 × 22         321.26
   28 × 28         358.96
   34 × 34         385.33
   40 × 40         408.53
   46 × 46         443.94
   52 × 52         468.58
   58 × 58         470.32
   64 × 64         512.03
   70 × 70         506.92
 Block the outer two loops
 Preserves spatial locality along the i1 direction

   do I3BLOCK = 2, 257, BS3
      do I2BLOCK = 2, 257, BS2
         do i3 = I3BLOCK,                   &
                 min(I3BLOCK+BS3-1, 257)
            do i2 = I2BLOCK,                &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = 2, 257
   !              update u(i1,i2,i3)
   !              using 27-point stencil
               end do
            end do
         end do
      end do
   end do

   Block size   Mop/s/process
   unblocked       531.50
   16 × 16         674.76
   22 × 22         680.16
   28 × 28         688.64
   34 × 34         683.84
   40 × 40         698.47
   46 × 46         689.14
   52 × 52         706.62
   58 × 58         692.57
   64 × 64         703.40
   70 × 70         693.87
C pointers don't carry the same rules as Fortran arrays: the compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere. It must assume the worst, producing a false data dependency.

(   53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
            int cola, int colb)
(   54) {
(   55)     int i, j, k;          /* loop counters */
(   56)     int rowc, colc, rowb; /* sizes not passed as arguments */
(   57)     double con;           /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }

mat_mul_daxpy:
    66, Loop not vectorized: data dependency
        Loop not vectorized: data dependency
        Loop unrolled 4 times
C99 introduces the restrict keyword, which allows the programmer to promise that the memory will not be referenced via another pointer. If you declare a restricted pointer and break that promise, the behavior is undefined by the standard.

(   53) void mat_mul_daxpy(double* restrict a, double* restrict b,
            double* restrict c, int rowa, int cola, int colb)
(   54) {
(   55)     int i, j, k;          /* loop counters */
(   56)     int rowc, colc, rowb; /* sizes not passed as arguments */
(   57)     double con;           /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }
66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with no peeling and more aligned moves -
  executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with more aligned moves - executed if loop
  count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop


• This can also be achieved with the PGI safe pragma and the –Msafeptr compiler
  option, or the PathScale –OPT:alias option.
 GNU malloc library
    malloc, calloc, realloc, free calls
       Fortran dynamic variables
 Malloc library system calls
    mmap, munmap => for larger allocations
    brk, sbrk => increase/decrease heap
 Malloc library optimized for low system memory use
    Can result in system calls/minor page faults
   Detecting “bad” malloc behavior
     Profile data => “excessive system time”
   Correcting “bad” malloc behavior
     Eliminate mmap use by malloc
     Increase threshold to release heap memory
   Use environment variables to alter malloc (see the mallopt sketch below)
      MALLOC_MMAP_MAX_ = 0
      MALLOC_TRIM_THRESHOLD_ = 536870912
   Possible downsides
     Heap fragmentation
     User process may call mmap directly
     User process may launch other processes
 PGI’s –Msmartalloc does something similar for you at
    compile time
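The same two knobs can also be set programmatically. A sketch (not from the original slides) using glibc's mallopt(3):

   #include <malloc.h>

   int main(void)
   {
       /* Equivalent of MALLOC_MMAP_MAX_=0: never satisfy malloc with mmap. */
       mallopt(M_MMAP_MAX, 0);

       /* Equivalent of MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB
          of freed heap memory instead of returning it to the OS. */
       mallopt(M_TRIM_THRESHOLD, 536870912);

       /* ... allocate and compute as usual ... */
       return 0;
   }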

 Google created a replacement “malloc” library
   “Minimal” TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc as good or better
  than GNU malloc
    Environment variables not required
    TCMalloc almost certainly better for allocations in
     OpenMP parallel regions
 There’s currently no pre-built tcmalloc for Cray XT, but
  some users have successfully built it.



                                                             153
 Linux has a "first touch policy" for memory allocation
    *alloc functions don't actually allocate your memory
    Memory gets allocated when "touched"
 Problem: a code can allocate more memory than is available
    Linux assumes there is swap space; we don't have any
    Applications won't fail from over-allocation until the memory is finally
      touched
 Problem: memory is placed on the NUMA node of the "touching" thread
    Only a problem if thread 0 allocates all memory for a node
 Solution: always initialize your memory immediately after allocating it (see
   the sketch below)
    If you over-allocate, it will fail immediately, rather than at a strange
      place in your code
    If every thread touches its own memory, it will be allocated on the
      proper socket
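A minimal sketch (not from the original slides; the size n is illustrative) of first-touch initialization with OpenMP: each thread touches the pages it will later compute on, so the pages land in that thread's local NUMA memory, and any over-allocation fails at the initialization loop rather than deep inside the application:

   #include <stdlib.h>

   int main(void)
   {
       size_t n = 1ul << 27;                    /* ~1 GB of doubles */
       double *x = malloc(n * sizeof(double));  /* no pages mapped yet */
       if (!x) return 1;

       /* First touch: same static schedule as the compute loop below. */
       #pragma omp parallel for schedule(static)
       for (size_t i = 0; i < n; i++)
           x[i] = 0.0;

       /* Compute: each thread now works mostly on node-local pages. */
       #pragma omp parallel for schedule(static)
       for (size_t i = 0; i < n; i++)
           x[i] = 2.0 * x[i] + 1.0;

       free(x);
       return 0;
   }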
 Short Message Eager Protocol
    The sending rank “pushes” the message to the receiving rank
    Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
    Sender assumes that receiver can handle the message
        Matching receive is posted - or -
        Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space
         (MPICH_UNEX_BUFFER_SIZE) to store the message


 Long Message Rendezvous Protocol
    Messages are “pulled” by the receiving rank
    Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
    Sender sends small header packet with information for the receiver to pull
     over the data
    Data is sent only after matching receive is posted by receiving rank
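One practical consequence: pre-posting receives helps both protocols. A minimal sketch (not from the original slides; the routine and arguments are illustrative) in C:

   #include <mpi.h>

   /* Pairwise exchange with the receive posted before the send: an eager
      message lands directly in recvbuf instead of the unexpected buffers,
      and a rendezvous transfer can start as soon as the header arrives. */
   void exchange(double *sendbuf, double *recvbuf, int n, int peer)
   {
       MPI_Request req;

       MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
       MPI_Send(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
       MPI_Wait(&req, MPI_STATUS_IGNORE);
   }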
Eager protocol with a pre-posted receive (MPI_RECV is posted prior to the MPI_SEND call):
   STEP 1: The receiver (rank 1) calls MPI_RECV and posts a match entry (ME) to Portals.
   STEP 2: The sender (rank 0) calls MPI_SEND.
   STEP 3: A Portals DMA PUT delivers the data directly into the application buffer.
MPI also posts match entries to handle unexpected messages: a short-message ME for the eager protocol and a long-message ME for the rendezvous protocol. Unexpected message data lands in the MPI unexpected buffers (sized by MPICH_UNEX_BUFFER_SIZE) and is tracked by the unexpected event queue (MPICH_PTL_UNEX_EVENTS) and the other event queue (MPICH_PTL_OTHER_EVENTS).
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010

Weitere ähnliche Inhalte

Was ist angesagt?

DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
Virtual Network Performance Challenge
Virtual Network Performance ChallengeVirtual Network Performance Challenge
Virtual Network Performance ChallengeStephen Hemminger
 
2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker SessionMikyung Kang
 
thread-clustering
thread-clusteringthread-clustering
thread-clusteringdavidkftam
 
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011Shinya Takamaeda-Y
 
CloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るCloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るsamemoon
 
บทที่ 2 Mobile Aplication
บทที่ 2 Mobile Aplicationบทที่ 2 Mobile Aplication
บทที่ 2 Mobile Aplicationrubtumproject.com
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaObsidian Software
 
Higher Performance SSDs with HLNAND
Higher Performance SSDs with HLNANDHigher Performance SSDs with HLNAND
Higher Performance SSDs with HLNANDrrschuetz
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsShinya Takamaeda-Y
 
Devconf2017 - Can VMs networking benefit from DPDK
Devconf2017 - Can VMs networking benefit from DPDKDevconf2017 - Can VMs networking benefit from DPDK
Devconf2017 - Can VMs networking benefit from DPDKMaxime Coquelin
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicEric Verhulst
 
Jaguar x86 Core Functional Verification
Jaguar x86 Core Functional VerificationJaguar x86 Core Functional Verification
Jaguar x86 Core Functional VerificationDVClub
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practiceOpenCity Community
 
16 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 201216 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 2012Daniel Mar
 

Was ist angesagt? (20)

Brochure NAS LG
Brochure NAS LGBrochure NAS LG
Brochure NAS LG
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Virtual Network Performance Challenge
Virtual Network Performance ChallengeVirtual Network Performance Challenge
Virtual Network Performance Challenge
 
2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session
 
thread-clustering
thread-clusteringthread-clustering
thread-clustering
 
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
 
Virtual net performance
Virtual net performanceVirtual net performance
Virtual net performance
 
CloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るCloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫る
 
บทที่ 2 Mobile Aplication
บทที่ 2 Mobile Aplicationบทที่ 2 Mobile Aplication
บทที่ 2 Mobile Aplication
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpga
 
Higher Performance SSDs with HLNAND
Higher Performance SSDs with HLNANDHigher Performance SSDs with HLNAND
Higher Performance SSDs with HLNAND
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
Brocade VDX 6730 Converged Switch for IBM
Brocade VDX 6730 Converged Switch for IBMBrocade VDX 6730 Converged Switch for IBM
Brocade VDX 6730 Converged Switch for IBM
 
Devconf2017 - Can VMs networking benefit from DPDK
Devconf2017 - Can VMs networking benefit from DPDKDevconf2017 - Can VMs networking benefit from DPDK
Devconf2017 - Can VMs networking benefit from DPDK
 
Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
Jaguar x86 Core Functional Verification
Jaguar x86 Core Functional VerificationJaguar x86 Core Functional Verification
Jaguar x86 Core Functional Verification
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practice
 
ISBI MPI Tutorial
ISBI MPI TutorialISBI MPI Tutorial
ISBI MPI Tutorial
 
16 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 201216 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 2012
 

Ähnlich wie Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010

HP - HPC-29mai2012
HP - HPC-29mai2012HP - HPC-29mai2012
HP - HPC-29mai2012Agora Group
 
Sun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationSun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationxKinAnx
 
QsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale SystemsQsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale SystemsFederica Pisani
 
IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.meye0611
 
Sun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationSun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationxKinAnx
 
Core 2 processors
Core 2 processorsCore 2 processors
Core 2 processorsArun Kumar
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
I3 multicore processor
I3 multicore processorI3 multicore processor
I3 multicore processorAmol Barewar
 
How to Modernize Your Database Platform to Realize Consolidation Savings
How to Modernize Your Database Platform to Realize Consolidation SavingsHow to Modernize Your Database Platform to Realize Consolidation Savings
How to Modernize Your Database Platform to Realize Consolidation SavingsIsaac Christoffersen
 
SUN主机产品介绍.ppt
SUN主机产品介绍.pptSUN主机产品介绍.ppt
SUN主机产品介绍.pptPencilData
 
Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processorsaccount inactive
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmicguest40fc7cd
 
Fujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU SpecificationsFujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU Specificationsinside-BigData.com
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012Agora Group
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationxKinAnx
 

Ähnlich wie Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010 (20)

HP - HPC-29mai2012
HP - HPC-29mai2012HP - HPC-29mai2012
HP - HPC-29mai2012
 
Sun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationSun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentation
 
QsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale SystemsQsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale Systems
 
Fpga technology
Fpga technologyFpga technology
Fpga technology
 
IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.
 
Sun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationSun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentation
 
Core 2 processors
Core 2 processorsCore 2 processors
Core 2 processors
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
I3 multicore processor
I3 multicore processorI3 multicore processor
I3 multicore processor
 
I3
I3I3
I3
 
How to Modernize Your Database Platform to Realize Consolidation Savings
How to Modernize Your Database Platform to Realize Consolidation SavingsHow to Modernize Your Database Platform to Realize Consolidation Savings
How to Modernize Your Database Platform to Realize Consolidation Savings
 
SUN主机产品介绍.ppt
SUN主机产品介绍.pptSUN主机产品介绍.ppt
SUN主机产品介绍.ppt
 
Userspace networking
Userspace networkingUserspace networking
Userspace networking
 
Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processors
 
Vigor Ex
Vigor ExVigor Ex
Vigor Ex
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmic
 
Fujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU SpecificationsFujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU Specifications
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
 

Mehr von Jeff Larkin

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupJeff Larkin
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismJeff Larkin
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010

  • 10.
 Microkernel on Compute PEs, full-featured Linux on Service PEs.
 Service PEs specialize by function: Login, Network, System, and I/O PEs form the service partition (specialized Linux nodes).
 Software architecture eliminates OS "jitter" and enables reproducible run times.
 Large machines boot in under 30 minutes, including the filesystem.
  • 11.
(Diagram: system packaging: compute, login, network, boot/syslog/database, and I/O and metadata nodes; GigE and 10 GigE links, the SMW, and Fibre Channel connections to the RAID subsystem; X/Y/Z torus directions.)
  • 12. Now scaled to 225,000 cores
 Cray XT5 systems ship with the SeaStar2+ interconnect
 Custom ASIC with an integrated NIC/router
 MPI offload engine
 Connectionless protocol
 Link-level reliability
 Proven scalability to 225,000 cores
(Diagram: SeaStar2+ block diagram: DMA engine, HyperTransport interface, 6-port router, memory, blade control processor interface, PowerPC 440 processor.)
  • 13. Processor balance:
Processor        Frequency (GHz)   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)        2.6              62.4             12.8                 0.21
MC-8                  2.0              64.0             42.6                 0.67
MC-8                  2.3              73.6             42.6                 0.58
MC-8                  2.4              76.8             42.6                 0.55
MC-12                 1.9              91.2             42.6                 0.47
MC-12                 2.1             100.8             42.6                 0.42
MC-12                 2.2             105.6             42.6                 0.40
  • 14. Node characteristics:
 Number of cores: 16 or 24 (Magny-Cours); 32 (Interlagos)
 Peak performance: 153 Gflops/sec (MC-8 at 2.4 GHz); 211 Gflops/sec (MC-12 at 2.2 GHz)
 Memory size: 32 or 64 GB per node
 Memory bandwidth: 83.5 GB/sec (direct-connect memory)
 6.4 GB/sec direct-connect HyperTransport; Cray SeaStar2+ interconnect
  • 15.
 2 multi-chip modules, 4 Opteron dies
 8 channels of DDR3 bandwidth to 8 DIMMs
 24 (or 16) computational cores, 24 MB of L3 cache
 Dies are fully connected with HT3
 Snoop filter feature allows the 4-die SMP to scale well
(Diagram: two packages, each die with six Greyhound cores and 6 MB L3 cache, DDR3 channels, and HT1/HT3 links to the interconnect.)
  • 16. Without the snoop filter, a streams test achieves 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
  • 17. With the snoop filter, a streams test achieves 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.
 This feature will be key for two-socket Magny-Cours nodes, which are the same architecture-wise.
  • 18.
 New compute blade with 8 AMD Magny-Cours processors
 Plug-compatible with XT5 cabinets and backplanes
 Initially ships with the SeaStar interconnect as the Cray XT6
 Upgradeable to the Gemini interconnect (Cray XE6)
 Upgradeable to AMD's "Interlagos" series
 XT6 systems will continue to ship with the current SIO blade
 First customer ship: March 31st
  • 20.
 Supports 2 nodes per ASIC
 168 GB/sec routing capacity
 Scales to over 100,000 network endpoints
 Link-level reliability and adaptive routing
 Advanced resiliency features
 Provides a global address space
 Advanced NIC designed to efficiently support MPI, one-sided MPI, Shmem, UPC, and Coarray Fortran
(Diagram: Gemini ASIC with two HyperTransport 3 interfaces, NIC 0 and NIC 1, Netlink block, SB block, LO processor, and 48-port YARC router.)
  • 21. Cray Baker node characteristics:
 Number of cores: 16 or 24
 Peak performance: 140 or 210 Gflops/s
 Memory size: 32 or 64 GB per node
 Memory bandwidth: 85 GB/sec
 10 12X Gemini channels; each Gemini acts like two nodes on the 3-D torus
 High-radix YARC router with adaptive routing; 168 GB/sec capacity
  • 22.
(Diagram: blade comparison, a module with SeaStar versus a module with Gemini, on the X/Y/Z torus.)
  • 23.
 FMA (Fast Memory Access)
 Mechanism for most MPI transfers
 Supports tens of millions of MPI requests per second
 BTE (Block Transfer Engine)
 Supports asynchronous block transfers between local and remote memory, in either direction
 For use for large MPI transfers that happen in the background
(Diagram: Gemini NIC datapath: FMA, BTE, NAT, AMO, CQ, RMT, and RAT blocks between the HT3 cave and the router tiles.)
  • 24.
 Two Gemini ASICs are packaged on a pin-compatible mezzanine card
 Topology is a 3-D torus
 Each lane of the torus is composed of 4 Gemini router "tiles"
 Systems with SeaStar interconnects can be upgraded by swapping this card
 100% of the 48 router tiles on each Gemini chip are used
  • 25.
 Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
 Gemini provides low-overhead OS-bypass features for short transfers
 MPI latency targeted at ~1 us
 NIC provides for many millions of MPI messages per second
 "Hybrid" programming is not a requirement for performance
 RDMA provides a much improved one-sided communication mechanism
 AMOs provide a faster synchronization method for barriers
 Gemini supports adaptive routing, which
 Reduces problems with network hot spots
 Allows MPI to survive link failures
  • 26.
 Globally addressable memory provides efficient support for UPC, Co-Array Fortran, Shmem, and Global Arrays
 The Cray Programming Environment will target this capability directly
 Pipelined global loads and stores
 Allows for fast irregular communication patterns
 Atomic memory operations
 Provide the fast synchronization needed for one-sided communication models
  • 27. Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability:
 Adaptive routing: multiple paths to the same destination
 Allows mapping around bad links without rebooting
 Supports warm swap of blades
 Prevents hot spots
 Reliable transport of messages
 Packet-level CRC carried from start to finish
 Large blocks of memory protected by ECC
 Can better handle failures on the HT link: discards packets instead of putting backpressure into the network
 Supports end-to-end reliable communication (used by MPI)
 Improved error reporting and handling
 The low-overhead error reporting allows the programming model to replay failed transactions
 Performance counters allow tracking of application-specific packets
  • 30.
(Diagram: cabinet airflow, alternating low- and high-velocity airflow paths.)
  • 31.
 Cool air is released into the computer room.
 The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation).
 R134a absorbs energy only in the presence of heated air.
 Phase change is 10x more efficient than pure water cooling.
(Diagram: liquid in, liquid/vapor mixture out.)
  • 32.
(Photo: R134a piping, with inlet and exit evaporators.)
  • 36.
 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
 Unable to take advantage of file system parallelism
 Access to multiple disks adds overhead, which hurts performance
(Chart: single-writer write performance in MB/s versus stripe count, for 1 MB and 32 MB stripes.)
  • 37.
 Single OST, 256 MB file size
 Performance can be limited by the process (transfer size) or by the file system (stripe size)
(Chart: single-writer write performance in MB/s versus stripe size, for 1 MB, 8 MB, and 32 MB transfers.)
  • 38.
 Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly size
 lfs setstripe -c -1 -s 4M <file or directory>   (160 OSTs, 4 MB stripe)
 lfs setstripe -c 1 -s 16M <file or directory>   (1 OST, 16 MB stripe)
 export MPICH_MPIIO_HINTS='*:striping_factor=160'
 Files inherit striping information from the parent directory; this cannot be changed once the file is written
 Set the striping before copying in files
(A hedged MPI-IO sketch follows.)
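As a hedged illustration of the MPI-IO hints route, the sketch below requests a stripe count through an MPI_Info object before opening a file. The file name, communicator, and hint value are illustrative only, and the file system may grant a different layout.

program stripe_hint
  use mpi
  implicit none
  integer :: ierr, info, fh

  call MPI_Init(ierr)
  call MPI_Info_create(info, ierr)
  ! Ask for a stripe count of 160; honored only at file creation.
  call MPI_Info_set(info, 'striping_factor', '160', ierr)
  call MPI_File_open(MPI_COMM_WORLD, 'output.dat', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)
  ! ... collective writes would go here ...
  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program stripe_hint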
  • 39.
 PGI Compiler
 Cray Compiler Environment
 Cray Scientific Libraries
  • 40.
 Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
 Fortran compiler: ftn
 C compiler: cc
 C++ compiler: CC
 Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
 Cray MPT (MPI, Shmem, etc.)
 Cray LibSci (BLAS, LAPACK, etc.)
 …
 Choose the underlying compiler via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly
 Always load the appropriate xtpe-<arch> module for your machine
 Enables the proper compiler target
 Links optimized math libraries
(A hedged build example follows.)
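For illustration, a typical build-and-run sequence might look like the following. Module names vary by site and target (xtpe-istanbul is one of the architecture modules named later in this deck), and the source file name and process count are placeholders:

% module load PrgEnv-pgi
% module load xtpe-istanbul
% ftn -fast -o myapp.x myapp.f90
% aprun -n 64 ./myapp.x

The same wrapper command works unchanged after switching PrgEnv modules, which is the point of building through ftn/cc/CC rather than the underlying compilers.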
  • 42.
 Traditional (scalar) optimizations are controlled via -O# compiler flags
 Default: -O2
 More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
 These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
 Interprocedural analysis allows the compiler to perform whole-program optimizations; it is enabled with -Mipa=fast
 See man pgf90, man pgcc, or man pgCC for more information about compiler options
  • 43.
 Compiler feedback is enabled with -Minfo and -Mneginfo
 This can provide valuable information about which optimizations were or were not done, and why
 To debug an optimized code, the -gopt flag inserts debugging information without disabling optimizations
 It's possible to disable optimizations included with -fast if you believe one is causing problems
 For example: -fast -Mnolre enables -fast and then disables loop-redundant-expression optimization
 To get more information about any compiler flag, add -help with the flag in question
 pgf90 -help -fast gives more information about the -fast flag
 OpenMP is enabled with the -mp flag
(A hedged command-line example follows.)
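Putting those flags together, a hedged example invocation (the source file name is a placeholder):

% ftn -fast -Mipa=fast -Minfo -Mneginfo -c solver.f90

With PrgEnv-pgi loaded, the wrapper passes these PGI flags through, and the -Minfo/-Mneginfo feedback messages appear during compilation.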
  • 44. Some compiler options affect both performance and accuracy; lower accuracy often means higher performance, but flags are also available to enforce stricter accuracy.
 -Kieee: all FP math strictly conforms to IEEE 754 (off by default)
 -Ktrap: turns on processor trapping of FP exceptions
 -Mdaz: treat all denormalized numbers as zero
 -Mflushz: set SSE to flush-to-zero (on with -fast)
 -Mfprelaxed: allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
 Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default
  • 46.
 Cray has a long tradition of high performance compilers on Cray platforms (traditional vector, T3E, X1, X2)
 Vectorization
 Parallelization
 Code transformation
 More…
 Investigated leveraging an open source compiler called LLVM
 First release: December 2008
  • 47.
(Diagram: Cray Compiler Environment architecture: a Cray-developed Fortran front end; a C and C++ front end supplied by the Edison Design Group, with Cray-developed code for extensions and interface support; interprocedural analysis, optimization, and parallelization from Cray compiler technology; X86 code generation from open source LLVM with additional Cray-developed optimizations and interface support, plus a Cray X2 code generator, producing the object file.)
  • 48.
 Standard-conforming languages and programming models
 Fortran 2003
 UPC & Co-Array Fortran
 Fully optimized and integrated into the compiler
 No preprocessor involved
 Targets the network appropriately: GASNet with Portals; DMAPP with Gemini & Aries
 Ability and motivation to provide high-quality support for custom Cray network hardware
 Cray technology focused on scientific applications
 Takes advantage of Cray's extensive knowledge of automatic vectorization and automatic shared memory parallelization
 Supplements, rather than replaces, the available compiler choices
  • 49.
 Make sure it is available: module avail PrgEnv-cray
 To access the Cray compiler: module load PrgEnv-cray
 To target a specific chip: module load xtpe-[barcelona,shanghai,istanbul]
 Once you have loaded the module, "cc" and "ftn" are the Cray compilers
 Recommend just using the default options
 Use -rm (Fortran) and -hlist=m (C) to find out what happened
 man crayftn
  • 50.
 Excellent vectorization
 Vectorizes more loops than other compilers
 OpenMP 3.0: tasking and nesting
 PGAS: functional UPC and CAF available today
 C++ support
 Automatic parallelization
 Modernized version of the Cray X1 streaming capability
 Interacts with OMP directives
 Cache optimizations
 Automatic blocking
 Automatic management of what stays in cache
 Prefetching, interchange, fusion, and much more…
  • 51.
 Loop-based optimizations
 Vectorization
 OpenMP
 Autothreading
 Interchange
 Pattern matching
 Cache blocking / non-temporal / prefetching
 Fortran 2003 standard; working on 2008
 PGAS (UPC and Co-Array Fortran)
 Some performance optimizations available in 7.1
 Optimization feedback: loopmark
  • 52. The Cray compiler supports a full and growing set of directives and pragmas:
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na]
!dir$ blockable
Many more: man directives, man loop_info
(A short source sketch follows.)
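As a hedged sketch of how these directives sit in source (the subroutine, array names, and trip-count hint are illustrative; the directives themselves are the ones listed above):

subroutine scaled_update(n, a, x, y)
  implicit none
  integer :: n, i
  real(8) :: a, x(n), y(n)

  ! Assert there are no loop-carried dependences, so the
  ! compiler is free to vectorize this loop.
!dir$ ivdep
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do

  ! Hint the expected maximum trip count so the compiler can
  ! choose an unrolling/vectorization strategy.
!dir$ loop_info max_trips(512)
  do i = 1, n
     x(i) = a*x(i)
  end do
end subroutine scaled_update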
  • 53.
 The compiler can generate a filename.lst file, containing an annotated listing of your source code with letters indicating important optimizations.
%%% Loopmark Legend %%%
Primary loop type:
  a - vector atomic memory operation
  b - blocked
  f - fused
  i - interchanged
  m - streamed but not partitioned
  p - conditional, partial and/or computed
  r - unrolled
  s - shortloop
  t - array syntax temp used
  w - unwound
Modifiers:
  A - Pattern matched
  C - Collapsed
  D - Deleted
  E - Cloned
  I - Inlined
  M - Multithreaded
  P - Parallel/Tasked
  V - Vectorized
  W - Unwound
  • 54. ftn -rm … or cc -hlist=m … produces:

 29.  b-------<   do i3=2,n3-1
 30.  b b-----<     do i2=2,n2-1
 31.  b b Vr--<       do i1=1,n1
 32.  b b Vr            u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
 33.  b b Vr   >               + u(i1,i2,i3-1) + u(i1,i2,i3+1)
 34.  b b Vr            u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
 35.  b b Vr   >               + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
 36.  b b Vr-->       enddo
 37.  b b Vr--<       do i1=2,n1-1
 38.  b b Vr            r(i1,i2,i3) = v(i1,i2,i3)
 39.  b b Vr   >               - a(0) * u(i1,i2,i3)
 40.  b b Vr   >               - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
 41.  b b Vr   >               - a(3) * ( u2(i1-1) + u2(i1+1) )
 42.  b b Vr-->       enddo
 43.  b b----->     enddo
 44.  b------->   enddo
  • 55.
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.
  • 56.
 -hbyteswapio
 Link-time option
 Applies to all unformatted Fortran I/O
 Assign command: with the PrgEnv-cray module loaded, do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
 Can use assign to be more precise
  • 57.
 OpenMP is ON by default
 Optimizations controlled by -Othread#
 To shut it off, use -Othread0, -xomp, or -hnoomp
 Autothreading is NOT on by default; -hautothread turns it on
 Modernized version of the Cray X1 streaming capability
 Interacts with OMP directives
 If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time
  • 59. The traditional library model:
 Tuned general-purpose codes
 Only good for dense problems
 Not problem sensitive
 Not architecture sensitive
  • 60.
 Goal of the scientific libraries: improve productivity at optimal performance
 Cray uses four concentrations to achieve this:
 Standardization: use standard or "de facto" standard interfaces whenever available
 Hand tuning: use extensive knowledge of the target processor and network to optimize common code patterns
 Auto-tuning: automate code generation and a huge number of empirical performance evaluations to configure software to the target platforms
 Adaptive libraries: make runtime decisions to choose the best kernel/library/routine
  • 61. Three separate classes of standardization, each with a corresponding definition of productivity:
1. Standard interfaces (e.g., dense linear algebra)
 Bend over backwards to keep everything the same despite increases in machine complexity; innovate "behind the scenes"
 Productivity -> innovation to keep things simple
2. Adoption of near-standard interfaces (e.g., sparse kernels)
 Assume near-standards and promote those; out-mode alternatives; innovate "behind the scenes"
 Productivity -> innovation in the simplest areas (requires the same innovation as #1 also)
3. Simplification of non-standard interfaces (e.g., FFT)
 Productivity -> innovation to make things simpler than they are
  • 62.
 Algorithmic tuning: increased performance by exploiting algorithmic improvements
 Sub-blocking, new algorithms
 LAPACK, ScaLAPACK
 Kernel tuning: improve the numerical kernel performance in assembly language
 BLAS, FFT
 Parallel tuning: exploit Cray's custom network interfaces and MPT
 ScaLAPACK, P-CRAFFT
  • 63. Library coverage by area:
Dense:  BLAS, LAPACK, ScaLAPACK, IRT
Sparse: CASK, PETSc, Trilinos
FFT:    CRAFFT, FFTW, P-CRAFFT
IRT: Iterative Refinement Toolkit; CASK: Cray Adaptive Sparse Kernels; CRAFFT: Cray Adaptive FFT
  • 64.
 Serial and parallel versions of sparse iterative linear solvers
 Suites of iterative solvers: CG, GMRES, BiCG, QMR, etc.
 Suites of preconditioning methods: IC, ILU, diagonal block (ILU/IC), additive Schwarz, Jacobi, SOR
 Supports the block sparse matrix data format for better performance
 Interface to external packages (ScaLAPACK, SuperLU_DIST)
 Fortran and C support
 Newton-type nonlinear solvers
 Large user community: DoE labs, PSC, CSCS, CSC, ERDC, AWE, and more
 http://www-unix.mcs.anl.gov/petsc/petsc-as
  • 65.
 Cray provides state-of-the-art scientific computing packages to strengthen the capability of PETSc
 Hypre: scalable parallel preconditioners
 AMG (very scalable and efficient for a specific class of problems)
 2 different ILU (general purpose)
 Sparse approximate inverse (general purpose)
 ParMetis: parallel graph partitioning package
 MUMPS: parallel multifrontal sparse direct solver
 SuperLU: sequential version of SuperLU_DIST
 To use Cray-PETSc, load the appropriate module:
module load petsc
module load petsc-complex
(no need to load a compiler-specific module)
 Treat the Cray distribution as your local PETSc installation
  • 66.
 The Trilinos Project (http://trilinos.sandia.gov/): "an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems"
 A unique design feature of Trilinos is its focus on packages
 Very large user base and growing rapidly; important to DOE
 Cray's optimized Trilinos released on January 21
 Includes 50+ Trilinos packages
 Optimized via CASK
 Any code that uses Epetra objects can access the optimizations
 Usage: module load trilinos
  • 67.
 CASK is a product developed at Cray using the Cray Auto-tuning Framework (Cray ATF)
 The CASK concept:
 Analyze the matrix at minimal cost
 Categorize the matrix against internal classes
 Based on offline experience, find the best CASK code for the particular matrix
 Previously assigned the "best" compiler flags to CASK code
 Assign the best CASK kernel and perform Ax
 CASK silently sits beneath PETSc on Cray systems
 Trilinos support coming soon
 Released with PETSc 3.0 in February 2009
 Generic and blocked CSR formats
  • 68. Software stack:
 Large-scale application: highly portable, user controlled (all systems)
 PETSc / Trilinos / Hypre: highly portable, user controlled (all systems)
 CASK: XT4 & XT5 specific / tuned, invisible to the user (Cray only)
  • 69.
(Chart: speedup of parallel SpMV on 8 cores across 60 different matrices; speedups range from about 1.0x to 1.4x.)
  • 70.
(Charts: CASK versus PETSc for block Jacobi preconditioning and SpMV performance, N = 65,536 to 67,108,864, in GFlops versus core count up to 1024 cores.)
  • 71.
(Chart: SpMV performance in MFlops by matrix name.)
  • 72.
(Chart: geometric mean of 80 sparse matrix instances from the U. of Florida collection; MFlops versus number of vectors, 1 through 8, for CASK versus original Trilinos.)
  • 73.
 In FFTs, the problems are:
 Which library to choose
 How to use complicated interfaces (e.g., FFTW)
 Standard FFT practice:
 Do a plan stage: deduce machine and system information and run micro-kernels; select the best FFT strategy
 Do an execute
 Our system knowledge can remove some of this cost!
  • 74.
 CRAFFT is designed with simple-to-use interfaces
 Planning and execution stages can be combined into one function call
 Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
 CRAFFT provides both offline and online tuning
 Offline tuning: which FFT kernel to use; pre-computed plans for common-sized FFTs (no expensive plan stages)
 Online tuning is performed as necessary at runtime as well
 At runtime, CRAFFT adaptively selects the best FFT kernel to use based on both offline and online testing (e.g., FFTW, custom FFT)
  • 75. Plan and execute times:
               128x128   256x256   512x512
FFTW plan      74        312       2758
FFTW exec      0.105     0.97      9.7
CRAFFT plan    0.00037   0.0009    0.00005
CRAFFT exec    0.139     1.2       11.4
  • 76. Using serial CRAFFT:
1. Load module fftw/3.2.0 or higher.
2. Add the Fortran statement "use crafft".
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments.
 In-place, implicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
 In-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
 Out-of-place, explicit memory management:
crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page. (A hedged sketch follows.)
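Putting the steps together, a minimal hedged sketch of an in-place 3-d complex-to-complex transform. The array size, fill data, and sign convention are illustrative; the crafft_z2z3d argument list is as given above, with the leading dimensions taken from the array shape (see the intro_crafft man page for the authoritative signatures):

program crafft_demo
  use crafft
  implicit none
  integer, parameter :: n1 = 64, n2 = 64, n3 = 64
  complex(8) :: input(n1, n2, n3)
  integer :: isign

  input = (1.0d0, 0.0d0)   ! illustrative data
  call crafft_init()
  isign = -1               ! assumed forward-transform convention
  ! In-place 3-d z2z transform with implicit memory management;
  ! ld_in and ld_in2 are the leading dimensions of input.
  call crafft_z2z3d(n1, n2, n3, input, n1, n2, isign)
end program crafft_demo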
  • 77.
 As of December 2009, CRAFFT includes distributed parallel transforms
 Uses the CRAFFT interface prefixed by "p", with optional arguments
 Can provide a performance improvement over FFTW 2.1.5
 Currently implemented: complex-complex; real-complex and complex-real; 3-d and 2-d; in-place and out-of-place
 Upcoming: C language support for serial and parallel
  • 78. Using parallel CRAFFT:
1. Add "use crafft" to the Fortran code.
2. Initialize CRAFFT using crafft_init.
3. Assume MPI is initialized and data is distributed (see the man page).
4. Call crafft, e.g.:
 2-d complex-complex, in-place, internal memory management:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
 2-d complex-complex, in-place with no internal memory:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
 2-d complex-complex, out-of-place, internal memory manager:
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
 2-d complex-complex, out-of-place, no internal memory:
crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has a man page; also see the 3d equivalent: man crafft_pz2z3d
  • 79.
(Chart: 2-D FFT, N x N transposed, on 128 cores; MFlops versus size N from 128 to 65536, parallel CRAFFT versus FFTW.)
  • 80.
 Solves linear systems in single precision, obtaining solutions accurate to double precision
 For well-conditioned problems
 Serial and parallel versions of LU, Cholesky, and QR
 Two usage methods:
 IRT benchmark routines: use IRT "under the covers" without changing your code; simply set an environment variable; useful when you cannot alter source code
 Advanced IRT API: if greater control of the iterative refinement process is required
 Allows condition number estimation; error-bounds return; minimization of either forward or backward error; "fall back" to full precision if the condition number is too high; the max number of iterations can be altered by users
  • 81.  “High Power Electromagnetic Wave Heating in the ITER Burning Plasma’’  rf heating in tokamak  Maxwell-Bolzmann Eqns  FFT  Dense linear system  Calc Quasi-linear op Courtesy Richard Barrett 82
  • 83. Decide whether you want the advanced API or the benchmark API.
Benchmark API:
setenv IRT_USE_SOLVERS 1
Advanced API:
1. Locate the factor and solve in your code (LAPACK or ScaLAPACK).
2. Replace factor and solve with a call to the IRT routine, e.g.:
 dgesv -> irt_lu_real_serial
 pzgesv -> irt_lu_complex_parallel
 pzposv -> irt_po_complex_parallel
3. Set the advanced arguments:
 Forward error convergence for the most accurate solution
 Condition number estimate
 "Fall back" to full precision if the condition number is too high
(A hedged benchmark-API sketch follows.)
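For the benchmark API, no source change is needed; a hedged sketch under that assumption is below. The matrix size and data are illustrative, and the call is just standard LAPACK dgesv, linked through LibSci:

program irt_demo
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n, n), b(n)
  integer :: ipiv(n), info

  call random_number(a)
  call random_number(b)
  ! With "setenv IRT_USE_SOLVERS 1" in the environment, LibSci can
  ! route this ordinary LAPACK solve through iterative refinement
  ! (single-precision factorization, double-precision accuracy).
  call dgesv(n, 1, a, n, ipiv, b, n, info)
end program irt_demo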
  • 84.
 LibSci 10.4.2 (February 18th, 2010)
 OpenMP-aware LibSci
 Allows calling of BLAS inside or outside a parallel region
 Single library supported: no separate multi-thread and single-thread libraries (-lsci and -lsci_mp)
 Performance not compromised (there were some usage restrictions with this version)
 LibSci 10.4.3 (April 2010)
 Parallel CRAFFT improvements
 Fixes the usage restrictions of 10.4.2
 OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
 Upcoming: PETSc 3.1.0 (May 20); Trilinos 10.2 (May 20)
  • 86.
 Assist the user with application performance analysis and optimization
 Help the user identify important and meaningful information from potentially massive data sets
 Help the user identify problem areas instead of just reporting data
 Bring optimization knowledge to a wider set of users
 Focus on ease of use and intuitive user interfaces
 Automatic program instrumentation
 Automatic analysis
 Target scalability issues in all areas of tool development
 Data management: storage, movement, presentation
  • 87.
 Supports traditional post-mortem performance analysis
 Automatic identification of performance problems
 Indication of causes of problems
 Suggestions of modifications for performance improvement
 CrayPat
 pat_build: automatic instrumentation (no source code changes needed)
 Run-time library for measurements (transparent to the user)
 pat_report for performance analysis reports
 pat_help: online help utility
 Cray Apprentice2
 Graphical performance analysis and visualization tool
  • 88.
 CrayPat
 Instrumentation of optimized code
 No source code modification required
 Data collection transparent to the user
 Text-based performance reports
 Derived metrics
 Performance analysis
 Cray Apprentice2
 Performance data visualization tool
 Call tree view
 Source code mappings
  • 89.
 When performance measurement is triggered:
 External agent (asynchronous): sampling via timer interrupt or hardware counter overflow
 Internal agent (synchronous): code instrumentation, event based, automatic or manual instrumentation
 How performance data is recorded:
 Profile ::= summation of events over time; run-time summarization (functions, call sites, loops, …)
 Trace file ::= sequence of events over time
  • 90.
 Millions of lines of code: automatic profiling analysis
 Identifies the top time-consuming routines
 Automatically creates an instrumentation template customized to your application
 Lots of processes/threads: load imbalance analysis
 Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
 Estimates savings if the corresponding section of code were balanced
 Long-running applications: detection of outliers
  • 91. Important performance statistics:
 Top time-consuming routines
 Load balance across computing resources
 Communication overhead
 Cache utilization
 FLOPS
 Vectorization (SSE instructions)
 Ratio of computation versus communication
  • 92.
 No source code or makefile modification required
 Automatic instrumentation at group (function) level
 Groups: mpi, io, heap, math SW, …
 Performs link-time instrumentation
 Requires object files
 Instruments optimized code
 Generates a stand-alone instrumented program
 Preserves the original binary
 Supports sample-based and event-based instrumentation
  • 93.
 Analyzes the performance data and directs the user to meaningful information
 Simplifies the procedure to instrument and collect performance data for novice users
 Based on a two-phase mechanism:
1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
2. Provides performance information on the most significant parts of the application
  • 94.
 Performs data conversion: combines information from the binary with raw performance data
 Performs analysis on the data
 Generates a text report of performance results
 Formats data for input into Cray Apprentice2
  • 95.
 CrayPat / Cray Apprentice2 5.0 released September 10, 2009
 New internal data format
 FAQ
 Grid placement support
 Better caller information (ETC group in pat_report)
 Support for larger numbers of processors
 Client/server version of Cray Apprentice2
 Panel help in Cray Apprentice2
  • 96.
 Access the performance tools software:
% module load xt-craypat apprentice2
 Build the application, keeping .o files (CCE: -h keepfiles):
% make clean
% make
 Instrument the application for automatic profiling analysis (you should get an instrumented program a.out+pat):
% pat_build -O apa a.out
 Run the application to get the top time-consuming routines (you should get a performance file "<sdatafile>.xf" or multiple files in a directory <sdatadir>):
% aprun … a.out+pat   (or qsub <pat script>)
  • 97.
 Generate the report and the .apa instrumentation file:
% pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
 Inspect the .apa file and sampling report
 Verify whether additional instrumentation is needed
  • 98. Example .apa file (de-interleaved from the slide):
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#   pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
-Drtenv=PAT_RT_HWPC=1   # Summary with instructions metrics.
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
# Note: -u should NOT be specified as an additional option.
# 43.37% 99659 bytes
-T mlwxyz_
# 16.09% 17615 bytes
-T half_
# 6.82% 6846 bytes
-T artv_
# 1.29% 5352 bytes
-T currenh_
# 1.03% 25294 bytes
-T bndbo_
# Functions below this point account for less than 10% of samples.
# 1.03% 31240 bytes
# -T bndto_
...
# ----------------------------------------------------------------------
-o mhd3d.x+apa   # New instrumented program.
/work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
  • 99. Trace groups for pat_build -g:
biolib: Cray Bioinformatics library routines
blacs: Basic Linear Algebra communication subprograms
blas: Basic Linear Algebra subprograms
caf: Co-Array Fortran (Cray X2 systems only)
fftw: Fast Fourier Transform library (64-bit only)
hdf5: manages extremely large and complex data collections
heap: dynamic heap
io: includes stdio and sysio groups
lapack: Linear Algebra Package
lustre: Lustre File System
math: ANSI math
mpi: MPI
netcdf: network common data form (manages array-oriented scientific data)
omp: OpenMP API (not supported on Catamount)
omp-rtl: OpenMP runtime library (not supported on Catamount)
portals: lightweight message passing API
pthreads: POSIX threads (not supported on Catamount)
scalapack: Scalable LAPACK
shmem: SHMEM
stdio: all library functions that accept or return the FILE* construct
sysio: I/O system calls
system: system calls
upc: Unified Parallel C (Cray X2 systems only)
  • 100. Predefined hardware performance counter groups:
0: Summary with instruction metrics
1: Summary with TLB metrics
2: L1 and L2 metrics
3: Bandwidth information
4: Hypertransport information
5: Floating point mix
6: Cycles stalled, resources idle
7: Cycles stalled, resources full
8: Instructions and branches
9: Instruction cache
10: Cache hierarchy
11: Floating point operations mix (2)
12: Floating point operations mix (vectorization)
13: Floating point operations mix (SP)
14: Floating point operations mix (DP)
15: L3 (socket-level)
16: L3 (core-level reads)
17: L3 (core-level misses)
18: L3 (core-level fills caused by L2 evictions)
19: Prefetches
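For example, to collect the L1 and L2 metrics group (2) during an instrumented run, the PAT_RT_HWPC run-time variable named above can be set before launching (csh syntax, matching the deck's other setenv examples):

% setenv PAT_RT_HWPC 2
% aprun … a.out+pat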
  • 101. The CrayPat API:
 Regions, useful to break up long routines:
int PAT_region_begin(int id, const char *label)
int PAT_region_end(int id)
 Disable/enable profiling, useful for excluding initialization:
int PAT_record(int state)
 Flush buffer, useful when the program isn't exiting cleanly:
int PAT_flush_buffer(void)
(A hedged Fortran sketch follows.)
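The prototypes above are the C API. A hedged sketch of the same idea from Fortran is below, assuming the pat_apif.h include file that the perftools package provides for the Fortran binding; the region id, label, and loop are illustrative:

program pat_regions
  implicit none
  include "pat_apif.h"   ! assumed Fortran binding header from perftools
  integer :: istat, i
  real(8) :: s

  s = 0.0d0
  ! Bracket the expensive part of the routine as region 1, so that
  ! pat_report can attribute its time separately.
  call PAT_region_begin(1, "sum_loop", istat)
  do i = 1, 1000000
     s = s + dble(i)
  end do
  call PAT_region_end(1, istat)
  print *, s
end program pat_regions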
  • 102.
 Instrument the application for further analysis (a.out+apa):
% pat_build -O <apafile>.apa
 Run the application:
% aprun … a.out+apa   (or qsub <apa script>)
 Generate a text report and the visualization file (.ap2):
% pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
 View the report in text and/or with Cray Apprentice2:
% app2 <datafile>.ap2
  • 103.
 MUST run on Lustre (/work/…, /lus/…, /scratch/…, etc.)
 Number of files used to store raw data:
 1 file created for a program with 1 – 256 processes
 √n files created for a program with 257 – n processes
 Ability to customize with PAT_RT_EXPFILE_MAX
  • 104.
 Full trace files show transient events but are too large
 Current run-time summarization misses transient events
 Plan to add the ability to record:
 Top N peak values (N small)
 Approximate standard deviation over time
 For time, memory traffic, etc., during tracing and sampling
  • 105.
 Cray Apprentice2 is targeted to help identify and correct:
 Load imbalance
 Excessive communication
 Network contention
 Excessive serialization
 I/O problems
 Views include: call graph profile, communication statistics, time-line view (communication and I/O), activity view, pair-wise communication statistics, text reports, source code mapping
  • 106.
(Screenshot: switching between Overview displays.)
  • 110.
(Screenshot: chart showing min, avg, and max values with -1/+1 standard deviation marks.)
  • 111. Call-tree view:
 Width = inclusive time; height = exclusive time
 Filtered nodes or sub-trees are marked
 Load balance overview: height = max time, middle bar = average time, lower bar = min time; yellow represents imbalance time
 DUH button provides hints for performance tuning
 Function list and zoom
  • 112.
 Right mouse click, node menu: e.g., hide/unhide children
 Right mouse click, view menu: e.g., filter
 Sort options: % time, time, imbalance %, imbalance time
 Function list on/off
  • 119.
 Cray Apprentice2 panel help
 pat_help: interactive help on the Cray performance toolset
 FAQ available through pat_help
  • 120.
 intro_craypat(1): introduces the CrayPat performance tool
 pat_build: instrument a program for performance analysis
 pat_help: interactive online help utility
 pat_report: generate a performance report, both in text and for use with the GUI
 hwpc(3): describes predefined hardware performance counter groups
 papi_counters(5): lists PAPI event counters
 Use the papi_avail or papi_native_avail utilities to get the list of events when running on a specific architecture
  • 121. pat_report: help for the -O option. Available option values are in the left column; a prefix can be specified:
ct              -O calltree
defaults        Tables that would appear by default.
heap            -O heap_program,heap_hiwater,heap_leaks
io              -O read_stats,write_stats
lb              -O load_balance
load_balance    -O lb_program,lb_group,lb_function
mpi             -O mpi_callers
---
callers             Profile by Function and Callers
callers+hwpc        Profile by Function and Callers
callers+src         Profile by Function and Callers, with Line Numbers
callers+src+hwpc    Profile by Function and Callers, with Line Numbers
calltree            Function Calltree View
calltree+hwpc       Function Calltree View
calltree+src        Calltree View with Callsite Line Numbers
calltree+src+hwpc   Calltree View with Callsite Line Numbers
...
  • 122.
 Interactive by default, or use a trailing '.' to just print a topic
 New FAQ in craypat 5.0.0
 Has counter and counter group information:
% pat_help counters amd_fam10h groups .
  • 123.
The top level CrayPat/X help topics are listed below. A good place to start is: overview
If a topic has subtopics, they are displayed under the heading "Additional topics". To view a subtopic, you need only enter as many initial letters as required to distinguish it from other items in the list. To see a table of contents including subtopics of those subtopics, etc., enter: toc
To produce the full text corresponding to the table of contents, specify "all", but preferably in a non-interactive invocation:
pat_help all . > all_pat_help
pat_help report all . > all_report_help
Additional topics:
API, balance, build, counters, demos, environment, execute, experiment, first_example, overview, report, run
pat_help (.=quit ,=back ^=up /=top ~=search) =>
  • 126. Original loop nest:
 Poor loop order results in poor striding: the inner-most loop strides on a slow dimension of each array
 The best the compiler can do is unroll
 Little to no cache reuse

55. 1                 ii = 0
56. 1 2-----------<   do b = abmin, abmax
57. 1 2 3---------<     do j=ijmin, ijmax
58. 1 2 3                 ii = ii+1
59. 1 2 3                 jj = 0
60. 1 2 3 4-------<       do a = abmin, abmax
61. 1 2 3 4 r8----<         do i = ijmin, ijmax
62. 1 2 3 4 r8                jj = jj+1
63. 1 2 3 4 r8                f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
64. 1 2 3 4 r8                f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
65. 1 2 3 4 r8                f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
66. 1 2 3 4 r8                f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
67. 1 2 3 4 r8---->         end do
68. 1 2 3 4------->       end do
69. 1 2 3--------->     end do
70. 1 2----------->   end do
  • 127. CrayPat profile of the original loops:
 Poor loop order results in poor cache reuse: for every L1 cache hit, there are 2 misses
 Overall, only 2/3 of all references were in level 1 or 2 cache

USER / #1.Original Loops
Time%                             55.0%
Time                              13.938244 secs
Imb.Time                          0.075369 secs
Imb.Time%                         0.6%
Calls                             0.1 /sec       1.0 calls
DATA_CACHE_REFILLS:
  L2_MODIFIED:L2_OWNED:
  L2_EXCLUSIVE:L2_SHARED          11.858M/sec    165279602 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
  ALL                             11.931M/sec    166291054 fills
PAPI_L1_DCM                       23.499M/sec    327533338 misses
PAPI_L1_DCA                       34.635M/sec    482751044 refs
User time (approx)                13.938 secs    36239439807 cycles   100.0%Time
Average Time per Call             13.938244 sec
CrayPat Overhead : Time           0.0%
D1 cache hit,miss ratios          32.2% hits     67.8% misses
D2 cache hit,miss ratio           49.8% hits     50.2% misses
D1+D2 cache hit,miss ratio        66.0% hits     34.0% misses
  • 130. Reordered loop nest:
 Now the inner-most loop is stride-1 on both arrays
 Memory accesses now happen along the cache line, allowing reuse
 The compiler is able to vectorize and better use SSE instructions

75. 1 2-----------<   do i = ijmin, ijmax
76. 1 2                 jj = 0
77. 1 2 3---------<     do a = abmin, abmax
78. 1 2 3 4-------<       do j=ijmin, ijmax
79. 1 2 3 4                 jj = jj+1
80. 1 2 3 4                 ii = 0
81. 1 2 3 4 Vcr2--<         do b = abmin, abmax
82. 1 2 3 4 Vcr2              ii = ii+1
83. 1 2 3 4 Vcr2              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
84. 1 2 3 4 Vcr2              f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
85. 1 2 3 4 Vcr2              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
86. 1 2 3 4 Vcr2              f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
87. 1 2 3 4 Vcr2-->         end do
88. 1 2 3 4------->       end do
89. 1 2 3--------->     end do
90. 1 2----------->   end do
  • 131. CrayPat profile of the reordered loops:
 Improved striding greatly improved cache reuse
 Runtime was cut nearly in half
 Still, some 20% of all references are cache misses

USER / #2.Reordered Loops
Time%                             31.4%
Time                              7.955379 secs
Imb.Time                          0.260492 secs
Imb.Time%                         3.8%
Calls                             0.1 /sec       1.0 calls
DATA_CACHE_REFILLS:
  L2_MODIFIED:L2_OWNED:
  L2_EXCLUSIVE:L2_SHARED          0.419M/sec     3331289 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
  ALL                             15.285M/sec    121598284 fills
PAPI_L1_DCM                       13.330M/sec    106046801 misses
PAPI_L1_DCA                       66.226M/sec    526855581 refs
User time (approx)                7.955 secs     20684020425 cycles   100.0%Time
Average Time per Call             7.955379 sec
CrayPat Overhead : Time           0.0%
D1 cache hit,miss ratios          79.9% hits     20.1% misses
D2 cache hit,miss ratio           2.7% hits      97.3% misses
D1+D2 cache hit,miss ratio        80.4% hits     19.6% misses
  • 132. Fissioned loop nests.
First loop, partially vectorized and unrolled by 4:

95.  1                 ii = 0
96.  1 2-----------<   do j = ijmin, ijmax
97.  1 2 i---------<     do b = abmin, abmax
98.  1 2 i                 ii = ii+1
99.  1 2 i                 jj = 0
100. 1 2 i i-------<       do i = ijmin, ijmax
101. 1 2 i i Vpr4--<         do a = abmin, abmax
102. 1 2 i i Vpr4              jj = jj+1
103. 1 2 i i Vpr4              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
104. 1 2 i i Vpr4              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
105. 1 2 i i Vpr4-->         end do
106. 1 2 i i------->       end do
107. 1 2 i--------->     end do
108. 1 2----------->   end do

Second loop, vectorized and unrolled by 4:

109. 1                 jj = 0
110. 1 2-----------<   do i = ijmin, ijmax
111. 1 2 3---------<     do a = abmin, abmax
112. 1 2 3                 jj = jj+1
113. 1 2 3                 ii = 0
114. 1 2 3 4-------<       do j = ijmin, ijmax
115. 1 2 3 4 Vr4---<         do b = abmin, abmax
116. 1 2 3 4 Vr4               ii = ii+1
117. 1 2 3 4 Vr4               f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
118. 1 2 3 4 Vr4               f5d(b,a,j,i) = f5d(b,a,i,j) + tmat7(ii,jj)
119. 1 2 3 4 Vr4--->         end do
120. 1 2 3 4------->       end do
121. 1 2 3--------->     end do
122. 1 2----------->   end do
• 133. USER / #3.Fissioned Loops
  Observations:
  - Fissioning further improved cache reuse and resulted in better vectorization.
  - Runtime was further reduced.
  - The cache hit/miss ratio improved slightly.
  - The loopmark file points to better vectorization from the fissioned loops.
  -----------------------------------------------------------------
  Time%                                          9.8%
  Time                                       2.481636 secs
  Imb.Time                                   0.045475 secs
  Imb.Time%                                       2.1%
  Calls                        0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   1.175M/sec       2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                     34.109M/sec      84646518 fills
  PAPI_L1_DCM               26.424M/sec      65575972 misses
  PAPI_L1_DCA              156.705M/sec     388885686 refs
  User time (approx)         2.482 secs    6452279320 cycles  100.0%Time
  Average Time per Call                      2.481636 sec
  CrayPat Overhead : Time        0.0%
  D1 cache hit,miss ratios     83.1% hits      16.9% misses
  D2 cache hit,miss ratio       3.3% hits      96.7% misses
  D1+D2 cache hit,miss ratio   83.7% hits      16.3% misses
• 135.
  - Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse.
  - It takes advantage of temporal reuse: re-referencing array elements already referenced.
  - Good blocking will also take advantage of spatial reuse: work with the cache lines!
  - There are many ways to block any given loop nest:
    - Which loops get blocked?
    - What block size(s) to use?
  - Analysis can reveal which ways are beneficial, but trial-and-error is probably faster.
• 136. 2D Laplacian:
        do j = 1, 8
          do i = 1, 16
            a(i,j) = u(i-1,j) + u(i+1,j) &
                   - 4*u(i,j)            &
                   + u(i,j-1) + u(i,j+1)
          end do
        end do
  Cache structure for this example:
  - Each line holds 4 array elements.
  - The cache can hold 12 lines of u data.
  - No cache reuse between outer loop iterations.
  [Figure: the 16 x 8 grid of u (i down, j across), with cache misses accumulating to 120 over the sweep.]
• 137.
  - Unblocked loop: 120 cache misses.
  - Block the inner loop:
        do IBLOCK = 1, 16, 4
          do j = 1, 8
            do i = IBLOCK, IBLOCK + 3
              a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 4*u(i,j)            &
                     + u(i,j-1) + u(i,j+1)
            end do
          end do
        end do
  - Now we have reuse of the "j+1" data.
  [Figure: the same grid traversed in four 4-row blocks (i = 1, 5, 9, 13), with misses accumulating to 80.]
• 138.
  - One-dimensional blocking reduced misses from 120 to 80.
  - Iterate over 4 x 4 blocks:
        do JBLOCK = 1, 8, 4
          do IBLOCK = 1, 16, 4
            do j = JBLOCK, JBLOCK + 3
              do i = IBLOCK, IBLOCK + 3
                a(i,j) = u(i-1,j) + u(i+1,j) &
                       - 4*u(i,j)            &
                       + u(i,j-1) + u(i,j+1)
              end do
            end do
          end do
        end do
  - Better use of spatial locality (cache lines).
  [Figure: the grid traversed in 4 x 4 blocks, with misses accumulating to 60.]
• 139.
  - Matrix-matrix multiply (GEMM) is the canonical cache-blocking example.
  - Operations can be arranged to create multiple levels of blocking:
    - Block for registers
    - Block for cache (L1, L2, L3)
    - Block for TLB
  - No further discussion here. Interested readers can see:
    - Any book on code optimization. Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8. Insert your favorite book here.
    - Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. Develops algorithms and cost models for GEMM in hierarchical memories.
    - Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. Describes the GotoBLAS DGEMM.
• 140. “I tried cache-blocking my code, but it didn’t help”
  - You’re doing it wrong:
    - Your block size is too small (too much loop overhead).
    - Your block size is too big (data is falling out of cache).
    - You’re targeting the wrong cache level (?)
    - You haven’t selected the correct subset of loops to block.
  - The compiler is already blocking that loop.
  - Prefetching is acting to minimize cache misses.
  - Computational intensity within the loop nest is very large, making blocking less important.
• 141. Multigrid PDE solver
  - Class D, 64 MPI ranks.
  - Global grid is 1024 × 1024 × 1024; local grid is 258 × 258 × 258.
  - Two similar loop nests account for >50% of run time.
  - 27-point 3D stencil:
        do i3 = 2, 257
          do i2 = 2, 257
            do i1 = 2, 257
              ! update u(i1,i2,i3)
              ! using 27-point stencil
            end do
          end do
        end do
  - There is good data reuse along leading-dimension (i1) cache lines, even without blocking.
  [Figure: the 3 × 3 × 3 stencil neighborhood around (i1,i2,i3).]
• 142. Block the inner two loops, creating blocks that extend along the i3 direction:
        do I2BLOCK = 2, 257, BS2
          do I1BLOCK = 2, 257, BS1
            do i3 = 2, 257
              do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
                do i1 = I1BLOCK, min(I1BLOCK+BS1-1, 257)
                  ! update u(i1,i2,i3)
                  ! using 27-point stencil
                end do
              end do
            end do
          end do
        end do

  Block size    Mop/s/process
  unblocked        531.50
  16 × 16          279.89
  22 × 22          321.26
  28 × 28          358.96
  34 × 34          385.33
  40 × 40          408.53
  46 × 46          443.94
  52 × 52          468.58
  58 × 58          470.32
  64 × 64          512.03
  70 × 70          506.92

  Every block size does worse than the unblocked loop: blocking i1 cuts the stride-1 runs along the leading dimension into short pieces, giving up the spatial locality the unblocked nest already had.
• 143. Block the outer two loops, preserving spatial locality along the i1 direction:
        do I3BLOCK = 2, 257, BS3
          do I2BLOCK = 2, 257, BS2
            do i3 = I3BLOCK, min(I3BLOCK+BS3-1, 257)
              do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
                do i1 = 2, 257
                  ! update u(i1,i2,i3)
                  ! using 27-point stencil
                end do
              end do
            end do
          end do
        end do

  Block size    Mop/s/process
  unblocked        531.50
  16 × 16          674.76
  22 × 22          680.16
  28 × 28          688.64
  34 × 34          683.84
  40 × 40          698.47
  46 × 46          689.14
  52 × 52          706.62
  58 × 58          692.57
  64 × 64          703.40
  70 × 70          693.87

  Leaving i1 unblocked keeps whole cache lines in play while the i2/i3 blocking adds temporal reuse, so every block size now beats the unblocked loop (up to ~33% faster at 52 × 52).
• 145. C pointers
  - C pointers don’t carry the same rules as Fortran arrays.
  - The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere.
  - The compiler must assume the worst, thus a false data dependency.
  ( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa, int cola, int colb)
  ( 54) {
  ( 55)    int i, j, k;           /* loop counters */
  ( 56)    int rowc, colc, rowb;  /* sizes not passed as arguments */
  ( 57)    double con;            /* constant value */
  ( 58)
  ( 59)    rowb = cola;
  ( 60)    rowc = rowa;
  ( 61)    colc = colb;
  ( 62)
  ( 63)    for(i=0;i<rowc;i++) {
  ( 64)       for(k=0;k<cola;k++) {
  ( 65)          con = *(a + i*cola +k);
  ( 66)          for(j=0;j<colc;j++) {
  ( 67)             *(c + i*colc + j) += con * *(b + k*colb + j);
  ( 68)          }
  ( 69)       }
  ( 70)    }
  ( 71) }

  mat_mul_daxpy:
       66, Loop not vectorized: data dependency
           Loop not vectorized: data dependency
           Loop unrolled 4 times
• 146. C pointers, restricted
  - C99 introduces the restrict keyword, which allows the programmer to promise not to reference the memory via another pointer.
  - If you declare a restricted pointer and break the rules, behavior is undefined by the standard.
  ( 53) void mat_mul_daxpy(double* restrict a, double* restrict b, double* restrict c, int rowa, int cola, int colb)
  ( 54) {
  ( 55)    int i, j, k;           /* loop counters */
  ( 56)    int rowc, colc, rowb;  /* sizes not passed as arguments */
  ( 57)    double con;            /* constant value */
  ( 58)
  ( 59)    rowb = cola;
  ( 60)    rowc = rowa;
  ( 61)    colc = colb;
  ( 62)
  ( 63)    for(i=0;i<rowc;i++) {
  ( 64)       for(k=0;k<cola;k++) {
  ( 65)          con = *(a + i*cola +k);
  ( 66)          for(j=0;j<colc;j++) {
  ( 67)             *(c + i*colc + j) += con * *(b + k*colb + j);
  ( 68)          }
  ( 69)       }
  ( 70)    }
  ( 71) }
• 147. With restrict, the compiler now vectorizes the inner loop:
       66, Generated alternate loop with no peeling - executed if loop count <= 24
           Generated vector sse code for inner loop
           Generated 2 prefetch instructions for this loop
           Generated vector sse code for inner loop
           Generated 2 prefetch instructions for this loop
           Generated alternate loop with no peeling and more aligned moves - executed if loop count <= 24 and alignment test is passed
           Generated vector sse code for inner loop
           Generated 2 prefetch instructions for this loop
           Generated alternate loop with more aligned moves - executed if loop count >= 25 and alignment test is passed
           Generated vector sse code for inner loop
           Generated 2 prefetch instructions for this loop
  - This can also be achieved with the PGI safe pragma and the -Msafeptr compiler option, or the PathScale -OPT:alias option.
• 149. GNU malloc library
  - Handles malloc, calloc, realloc, and free calls, and Fortran dynamic variables.
  - Malloc library system calls:
    - mmap, munmap => for larger allocations
    - brk, sbrk => increase/decrease the heap
  - The malloc library is optimized for low system memory use, which can result in system calls and minor page faults.
• 150. Detecting “bad” malloc behavior:
  - Profile data => “excessive system time”
  Correcting “bad” malloc behavior:
  - Eliminate mmap use by malloc.
  - Increase the threshold at which heap memory is released back to the system.
  - Use environment variables to alter malloc:
    - MALLOC_MMAP_MAX_ = 0
    - MALLOC_TRIM_THRESHOLD_ = 536870912
  Possible downsides:
  - Heap fragmentation
  - The user process may call mmap directly.
  - The user process may launch other processes.
  PGI’s -Msmartalloc does something similar for you at compile time; a run-time sketch using glibc’s mallopt() follows.
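For codes that cannot easily set environment variables in the launch script, glibc exposes the same two knobs through mallopt(). A minimal sketch, assuming glibc malloc; the values mirror the settings above, and whether they help is application-dependent:

    #include <malloc.h>   /* glibc: mallopt(), M_MMAP_MAX, M_TRIM_THRESHOLD */
    #include <stdlib.h>

    int main(void)
    {
        /* Run-time equivalent of MALLOC_MMAP_MAX_=0 and
           MALLOC_TRIM_THRESHOLD_=536870912: never satisfy an
           allocation with mmap, and only trim the heap once more
           than 512 MB of freed space sits at its top. */
        mallopt(M_MMAP_MAX, 0);
        mallopt(M_TRIM_THRESHOLD, 536870912);

        /* ... allocation-heavy application code runs here ... */
        double *work = malloc(1 << 20);
        free(work);   /* freed memory now stays in the heap */
        return 0;
    }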
• 151.
  - Google created a replacement “malloc” library: “minimal” TCMalloc replaces GNU malloc.
  - Limited testing indicates TCMalloc is as good as or better than GNU malloc.
  - Environment variables are not required.
  - TCMalloc is almost certainly better for allocations in OpenMP parallel regions.
  - There’s currently no pre-built tcmalloc for the Cray XT, but some users have successfully built it.
• 152.
  - Linux has a “first touch policy” for memory allocation: *alloc functions don’t actually allocate your memory; memory gets allocated when “touched”.
  - Problem: a code can allocate more memory than is available.
    - Linux assumes “swap space”; we don’t have any.
    - Applications won’t fail from over-allocation until the memory is finally touched.
  - Problem: memory will be placed on the NUMA node of the “touching” thread.
    - Only a problem if thread 0 allocates all memory for a node.
  - Solution: always initialize your memory immediately after allocating it (see the sketch below).
    - If you over-allocate, it will fail immediately, rather than at a strange place in your code.
    - If every thread touches its own memory, it will be allocated on the proper socket.
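As a concrete illustration, here is a minimal C/OpenMP sketch of the touch-it-where-you-use-it pattern; the array name and size are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 100000000;                  /* hypothetical size */
        double *u = malloc(n * sizeof(double));    /* reserves address space only */
        if (u == NULL) { perror("malloc"); return 1; }

        /* First touch: each thread writes the pages it will later use,
           so pages land on that thread's socket, and an over-allocation
           fails here instead of deep inside the solver. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            u[i] = 0.0;

        /* ... compute with u using the same static schedule ... */
        free(u);
        return 0;
    }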
• 154. Short Message Eager Protocol
  - The sending rank “pushes” the message to the receiving rank.
  - Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less.
  - The sender assumes that the receiver can handle the message:
    - a matching receive is posted, or
    - there are available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message.
  Long Message Rendezvous Protocol
  - Messages are “pulled” by the receiving rank.
  - Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes.
  - The sender sends a small header packet with information for the receiver to pull over the data.
  - Data is sent only after the matching receive is posted by the receiving rank (see the sketch below).
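The practical consequence is that pre-posting receives pays off under both protocols: an eager message can land directly in the user buffer instead of the unexpected-message buffers, and a rendezvous transfer can start as soon as the header arrives. A minimal sketch; the function, buffer, peer, and tag names are hypothetical:

    #include <mpi.h>

    /* Pre-post the receive before the matching send is expected.
       Eager messages then bypass the unexpected buffers
       (MPICH_UNEX_BUFFER_SIZE); rendezvous messages can be pulled
       immediately once the sender's header packet shows up. */
    void exchange(double *recvbuf, int count, int peer)
    {
        MPI_Request req;
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, /* tag = */ 0,
                  MPI_COMM_WORLD, &req);

        /* ... overlap: do local work while the message may arrive ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }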
• 155. [Figure: eager vs. rendezvous message flow between sender (RANK 0) and receiver (RANK 1) through the SeaStar.
  - The application posts match entries (MEs) to Portals: an eager short-message ME and a rendezvous long-message ME; MPI also posts MEs to handle unexpected messages.
  - STEP 1: The MPI_RECV call on the receiver posts an ME to Portals.
  - STEP 2: The sender makes its MPI_SEND call.
  - STEP 3: Portals delivers the data with a DMA PUT directly into the user buffer, since MPI_RECV was posted prior to the MPI_SEND call.
  - Unexpected messages would instead consume the unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and entries in the unexpected event queue (MPICH_PTL_UNEX_EVENTS); other events use the other event queue (MPICH_PTL_OTHER_EVENTS).]

Editor’s Notes

1. Planned times: Architecture 30-45 min; PGI 10-15 min; CCE 15-20 min; Libsci 15-20 min; CrayPAT 30-45 min; Optimization 60-90 min.
2. NIC blocks:
- CQ (Completion Queue): an event notification block used when the processor needs to be notified that BTE or FMA transactions have completed.
- NAT (Network Address Translation): responsible for validating and translating addresses from the network address format to an address on the local node.
- AMO (Atomic Memory Operation): responsible for AMO-type transactions.
- ORB (Outstanding Request Buffer): processes requests to the network and matches responses from the network to the original requests.
- RMT (Receive Message Table): tracks groups of packets, or sequences, transmitted from remote nodes of the network.
- SSID (Synchronization Sequence Identification): tracks all of the request packets that originate and all of the response packets that terminate at the NIC, in order to perform completion notifications for transactions. Assists in the identification of SW operations and processes impacted by errors, and monitors errors detected by other NIC blocks.
3. Figure 2: Logical and physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. The write operation is not stripe-aligned, so some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention), and the OSTs are accessed by varying numbers of processes (3 access OST0, 1 accesses OST1, 2 access OST2, and 2 access OST3).
4. Figure 3: Write performance for serial I/O at various Lustre stripe counts. The file size is 32 MB per OST utilized, and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The best performance is seen by utilizing a stripe size which matches the size of the write operations.
5. Figure 4: Write performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. The file utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes; either can become the limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.
6. These loops were taken from the NUCCOR application and provided by Rebecca Hartman-Baker of ORNL. She originally began by comparing various compilers and optimization levels. The rewrites that follow came at the suggestion of Vince Graziano of Cray.
7. This code plays better to the strengths of the CPU: more cache reuse, easier prefetching, and a better chance of vectorizing.
8. Original: 13.938244 s; Reordered: 7.955379 s.
9. The code further improves on the last version by allowing slightly better cache reuse, but a significantly better opportunity to vectorize on both a and b. I asked the compiler team why the loop nest on the left was only partially vectorized; they said their studies showed it would probably not be profitable (likely due to the tmat7 array striding on the second dimension).
10. Original: 13.938244 s; Reordered: 7.955379 s; Fissioned: 2.481636 s.
  11. The following Cache Blocking example was created by Steve Whalen of Cray.
  12. See http://en.wikipedia.org/wiki/Restrict for more information on “Restrict”
13. The following slides come from Kim McMahon (Cray).
14. Figure 5: Write performance of a file-per-process I/O pattern as a function of the number of files/processes. The file size is 128 MB with 32 MB write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder further improvement; each file is still subject to the limitations of serial I/O. Improved performance can be obtained from a parallel file system such as Lustre; however, at large process counts (large numbers of files), metadata operations and OSS/OST contention will hinder overall performance.
15. Figure 8: Write performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (POSIX, MPI-IO, and HDF5), performance levels off at high core counts.