Cray XT Best Practices

        Jeff Larkin
    larkin@cray.com
XT REFRESHER


November 09           Slide 2
Quad-Core Cray XT4 Node

  4-way SMP
  >35 Gflops per node
  Up to 8 GB per node
  OpenMP support within socket

  [Node diagram: 6.4 GB/sec direct-connect HyperTransport, 2 – 8 GB of
  memory, 12.8 GB/sec direct-connect memory (DDR 800), Cray SeaStar2+
  interconnect]

November 09                                                              Slide 3
Cray XT5 Node

  8 or 12-way SMP
  >70 Gflops per node
  Up to 32 GB of shared memory per node
  OpenMP support

  [Node diagram: two sockets, 6.4 GB/sec direct-connect HyperTransport,
  2 – 32 GB of memory, 25.6 GB/sec direct-connect memory, Cray
  SeaStar2+ interconnect]

November 09                                                   Slide 4
AMD Quad-core review
    4 DP/8 SP FLOPS per clock cycle, per core
          128-bit packed ADD + 128-bit packed MUL
    64KB Level 1 Data Cache per core
    512KB Level 2 Cache per core
    2 – 6MB Level 3 Cache per socket
          Coherent cache between XT5 sockets
    2 DDR memory controllers that can work synchronously or
    asynchronously
          Determined at boot; we generally run them asynchronously for better
          performance




November 09                                                               Slide 5
XT System Configuration Example

  [System diagram: 3D torus (X, Y, Z) of nodes, with GigE and 10 GigE
  network connections, the SMW, and Fibre Channel links to a RAID
  subsystem]

  Node types:
    Compute node
    Login node
    Network node
    Boot/Syslog/Database nodes
    I/O and Metadata nodes

    November 09                                                    Slide 6
COMPILING


November 09        Slide 7
Choosing a Compiler
    Try every compiler available to you; there's no single "best"
          PGI, Cray, Pathscale, Intel, and GCC are all available, but not
          necessarily on all machines
    Each compiler favors certain optimizations, which may
    benefit applications differently
    Test your answers carefully
          Order of operations may not be the same between compilers or even
          compiler versions
          You may have to decide whether speed or precision is more
          important to you




November 09                                                             Slide 8
Choosing Compiler Flags
    PGI
          -fast -Mipa=fast
          man pgf90; man pgcc; man pgCC
    Cray
          <none, turned on by default>
          man crayftn; man craycc ; man crayCC
    Pathscale
          -Ofast
          man eko (“Every Known Optimization”)
    GNU
          -O2 / -O3
          man gfortran; man gcc; man g++
    Intel
          -fast
          man ifort; man icc; man icpc
November 09                                      Slide 9
PROFILING AND DEBUGGING


November 09                      Slide 10
Profile and Debug “at Scale”
            Codes don't run the same at thousands of cores as at
            hundreds, at tens of thousands as at thousands, or at
            hundreds of thousands as at tens of thousands
            Determine how many nodes you wish to run on and test at
            that size; don't test for throughput

            [Chart: "Scaling (Time)" — wall time in seconds for code
            versions v1 and v2 at core counts from 122 to 150,304]

November 09                                                                                                  Slide 11
Profile and Debug with Real Science
    Choose a real problem,
    not a toy
    What do you want to
    achieve with the
    machine?
    How will you really run
    your code?
    Will you really run
    without I/O?



November 09                           Slide 12
What Data Should We Collect?
    Compiler Feedback
          Most compilers can tell you a lot about what they do to your code
              Did your important loop vectorize?
              Was a routine inlined?
              Was this loop unrolled?
          It’s just as important to know what the compiler didn’t do
              Some C/C++ loops won’t vectorize without some massaging
              Some loop counts aren’t known at runtime, so optimization is
              limited
          Flags to know:
               PGI: -Minfo=all -Mneginfo=all
              Cray: -rm
              Pathscale: -LNO:simd_verbose=ON
              Intel: -vec-report1



November 09                                                                   Slide 13
Compiler Feedback Examples: PGI
! Matrix Multiply
do k = 1, N
  do j = 1, N
    do i = 1, N
       c(i,j) = c(i,j) + &
            a(i,k) * b(k,j)
    end do
  end do
end do

24, Loop interchange produces reordered loop nest: 25,24,26
26, Generated an alternate loop for the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop

November 09                                           Slide 14
Compiler Feedback Examples: Cray
23.                        ! Matrix Multiply
24.      ib------------<   do k = 1, N
25.      ib ibr4-------<     do j = 1, N
26.      ib ibr4 Vbr4--<       do i = 1, N
27.      ib ibr4 Vbr4            c(i,j) = c(i,j) + a(i,k) * b(k,j)
28.      ib ibr4 Vbr4-->       end do
29.      ib ibr4------->     end do
30.      ib------------>   end do

i   –   interchanged
b   –   blocked
r   –   unrolled
V   -   Vectorized




November 09                                                    Slide 15
Compiler Feedback Examples: Pathscale
    -LNO:simd_verbose appears to be broken




November 09                                  Slide 16
Compiler Feedback Examples: Intel
mm.F90(14): (col. 3) remark: PERMUTED LOOP WAS VECTORIZED.
mm.F90(25): (col. 7) remark: LOOP WAS VECTORIZED.




November 09                                                  Slide 17
What Data Should We Collect?
    Hardware Performance Counters
          Using tools like CrayPAT it’s possible to know what the processor is
          doing
          We can get counts for
             FLOPS
             Cache Hits/Misses
             TLB Hits/Misses
             Stalls
             …
          We can derive
             FLOP Rate
             Cache Hit/Miss Ratio
             Computational Intensity
             …



November 09                                                                Slide 18
Gathering Performance Data
  Use CrayPAT's APA
     First gather sampling for a line-number profile
        Light-weight
        Guides what should and should not be instrumented
     Second gather instrumentation (-g mpi,io,blas,lapack,math ...)
        Hardware counters
        MPI message passing information
        I/O information
        Math libraries
        ...

  Typical sequence:
    load module
    make
    pat_build -O apa a.out
    Execute
    pat_report *.xf
    Examine *.apa
    pat_build -O *.apa
    Execute
Sample APA File (Significantly Trimmed)
 # You can edit this file, if desired, and use it
 # ----------------------------------------------------------------------
 #       HWPC group to collect by default.
   -Drtenv=PAT_RT_HWPC=1 # Summary with TLB metrics.
 # ----------------------------------------------------------------------
 #       Libraries to trace.
   -g mpi,math
 # ----------------------------------------------------------------------
   -w # Enable tracing of user-defined functions.
       # Note: -u should NOT be specified as an additional option.
 # 9.08% 188347 bytes
          -T ratt_i_
 # 6.71% 177904 bytes
          -T rhsf_
 # 5.61% 205682 bytes
          -T ratx_i_
 # 5.59% 22005 bytes
          -T transport_m_computespeciesdiffflux_
CrayPAT Groups
  biolib     Cray Bioinformatics library routines
  blacs      Basic Linear Algebra communication subprograms
  blas       Basic Linear Algebra subprograms
  caf        Co-Array Fortran (Cray X2 systems only)
  fftw       Fast Fourier Transform library (64-bit only)
  hdf5       manages extremely large and complex data collections
  heap       dynamic heap
  io         includes stdio and sysio groups
  lapack     Linear Algebra Package
  lustre     Lustre File System
  math       ANSI math
  mpi        MPI
  netcdf     network common data form (manages array-oriented scientific data)
  omp        OpenMP API (not supported on Catamount)
  omp-rtl    OpenMP runtime library (not supported on Catamount)
  portals    Lightweight message passing API
  pthreads   POSIX threads (not supported on Catamount)
  scalapack  Scalable LAPACK
  shmem      SHMEM
  stdio      all library functions that accept or return the FILE* construct
  sysio      I/O system calls
  system     system calls
  upc        Unified Parallel C (Cray X2 systems only)
Useful API Calls, for the control-freaks
 Regions, useful to break up long routines
     int PAT_region_begin (int id, const char *label)
     int PAT_region_end (int id)
 Disable/Enable Profiling, useful for excluding initialization
    int PAT_record (int state)
 Flush buffer, useful when program isn’t exiting cleanly
    int PAT_flush_buffer (void)
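
 A minimal sketch of how these calls might be wired into a code. The
 header name <pat_api.h> and the PAT_STATE_ON / PAT_STATE_OFF constants
 are assumptions to verify against the perftools man pages; initialize()
 and solve() are placeholder routines.

 #include <pat_api.h>

 void initialize(void);   /* setup we want excluded from the profile (placeholder) */
 void solve(void);        /* the work we actually care about (placeholder)         */

 int main(void)
 {
   PAT_record(PAT_STATE_OFF);      /* turn profiling off during initialization */
   initialize();
   PAT_record(PAT_STATE_ON);       /* profile only the interesting phase */

   PAT_region_begin(1, "solve");   /* region ids must be unique */
   solve();
   PAT_region_end(1);

   PAT_flush_buffer();             /* flush data in case the program exits uncleanly */
   return 0;
 }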
Sample CrayPAT HWPC Data
USER / rhsf_
------------------------------------------------------------------------
  Time%                                          74.6%
  Time                                      556.742885 secs
  Imb.Time                                   14.817686 secs
  Imb.Time%                                       2.6%
  Calls                       2.3 /sec          1200.0 calls
  PAPI_L1_DCM              14.406M/sec      7569486532 misses
  PAPI_TLB_DM               0.225M/sec       117992047 misses
  PAPI_L1_DCA             921.729M/sec   484310815400 refs
  PAPI_FP_OPS             871.740M/sec   458044890200 ops
  User time (approx)      525.438 secs 1103418882813 cycles     94.4%Time
  Average Time per Call                       0.463952 sec
  CrayPat Overhead : Time    0.0%
  HW FP Ops / User time   871.740M/sec   458044890200 ops 10.4%peak(DP)
  HW FP Ops / WCT         822.722M/sec
  Computational intensity    0.42 ops/cycle       0.95 ops/ref
  MFLOPS (aggregate)   1785323.32M/sec
  TLB utilization         4104.61 refs/miss      8.017 avg uses
  D1 cache hit,miss ratios 98.4% hits             1.6% misses
  D1 cache utilization (M) 63.98 refs/miss       7.998 avg uses
November 09                                                                 Slide 23
CrayPAT HWPC Groups
 0  Summary with instruction metrics
 1  Summary with TLB metrics
 2  L1 and L2 metrics
 3  Bandwidth information
 4  HyperTransport information
 5  Floating point mix
 6  Cycles stalled, resources idle
 7  Cycles stalled, resources full
 8  Instructions and branches
 9  Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches




November 09                                                      Slide 24
MPI Statistics
    How many times are MPI routines called and with how much
    data?
    Do I have a load imbalance?
    Are my processors waiting for data?
    Could I perform better by adjusting MPI environment
    variables?




November 09                                              Slide 25
Sample CrayPAT MPI Statistics

MPI Msg Bytes | MPI Msg |     MsgSz | 16B<= | 256B<= |       4KB<= |Experiment=1
              |     Count |     <16B | MsgSz |    MsgSz |    MsgSz |Function
              |           |   Count | <256B |       <4KB |   <64KB | Caller
              |           |          | Count |    Count |    Count | PE[mmm]
 3062457144.0 | 144952.0 | 15022.0 | 39.0 | 64522.0 | 65369.0 |Total
|---------------------------------------------------------------------------
| 3059984152.0 | 129926.0 |        -- | 36.0 | 64522.0 | 65368.0 |mpi_isend_
||--------------------------------------------------------------------------
|| 1727628971.0 | 63645.1 |         -- |   4.0 | 31817.1 | 31824.0 |MPP_DO_UPDATE_R8_3DV.in.MPP_DOMAINS_MOD
3|              |           |          |        |          |         | MPP_UPDATE_DOMAIN2D_R8_3DV.in.MPP_DOMAINS_MOD
||||------------------------------------------------------------------------
4||| 1680716892.0 | 61909.4 |         -- |     -- | 30949.4 | 30960.0 |DYN_CORE.in.DYN_CORE_MOD
5|||              |           |          |        |          |         | FV_DYNAMICS.in.FV_DYNAMICS_MOD
6|||              |           |          |        |          |         | ATMOSPHERE.in.ATMOSPHERE_MOD
7|||              |           |          |        |          |         |   MAIN__
8|||              |           |          |        |          |         |    main
|||||||||-------------------------------------------------------------------
9|||||||| 1680756480.0 | 61920.0 |         -- |      -- | 30960.0 | 30960.0 |pe.13666
9|||||||| 1680756480.0 | 61920.0 |         -- |      -- | 30960.0 | 30960.0 |pe.8949
9|||||||| 1651777920.0 | 54180.0 |         -- |      -- | 23220.0 | 30960.0 |pe.12549
|||||||||===================================================================




November 09                                                                                                   Slide 26
I/O Statistics
    How much time is spent in I/O?
    How much data am I writing or reading?
    How many processes are performing I/O?




November 09                                  Slide 27
OPTIMIZING YOUR CODE


November 09                   Slide 29
Step by Step
1. Fix any load imbalance
2. Fix your hotspots
   1. Communication
      • Pre-post receives
      • Overlap computation and communication
      • Reduce collectives
      • Adjust MPI environment variables
      • Use rank reordering
   2. Computation
      • Examine the hardware counters and compiler feedback
      • Adjust the compiler flags, directives, or code structure to
        improve performance
   3. I/O
      • Stripe files/directories appropriately
      • Use methods that scale
        • MPI-IO or subsetting

At each step, check your answers and performance.
Between each step, gather your data again.
November 09                                                                               Slide 30
COMMUNICATION


November 09            Slide 31
Craypat load-imbalance data
Table 1:   Profile by Function Group and Function

Time % |        Time |   Imb. Time |   Imb.   |   Calls |Experiment=1
       |             |             | Time %   |         |Group
       |             |             |          |         | Function
       |             |             |          |         | PE='HIDE'

 100.0% | 1061.141647 |         -- |     -- | 3454195.8 |Total
|--------------------------------------------------------------------
| 70.7% | 750.564025 |           -- |     -- | 280169.0 |MPI_SYNC
||-------------------------------------------------------------------
|| 45.3% | 480.828018 | 163.575446 | 25.4% |      14653.0 |mpi_barrier_(sync)
|| 18.4% | 195.548030 | 33.071062 | 14.5% | 257546.0 |mpi_allreduce_(sync)
||   7.0% |   74.187977 |   5.261545 |   6.6% |    7970.0 |mpi_bcast_(sync)
||===================================================================
| 15.2% | 161.166842 |           -- |     -- | 3174022.8 |MPI
||-------------------------------------------------------------------
|| 10.1% | 106.808182 |     8.237162 |   7.2% | 257546.0 |mpi_allreduce_
||   3.2% |   33.841961 | 342.085777 | 91.0% | 755495.8 |mpi_waitall_
||===================================================================
| 14.1% | 149.410781 |           -- |     -- |       4.0 |USER
||-------------------------------------------------------------------
|| 14.0% | 148.048597 | 446.124165 | 75.1% |          1.0 |main
|====================================================================
Fixing a Load Imbalance
    What is causing the load imbalance?
          Computation
            Is the decomposition appropriate?
            Would RANK_REORDER help?
          Communication
            Is the decomposition appropriate?
            Would RANK_REORDER help?
            Are receives pre-posted?
    OpenMP may help
          Able to spread the workload with less overhead
             Large amount of work to go from all-MPI to hybrid
               • Must accept the challenge to OpenMP-ize a large amount of code




November 09                                                            Slide 33
Why we pre-post: XT MPI – Receive Side

  [Diagram: receive-side match entries. Match entries created by the
  application pre-posting receives (msgX, msgY) are searched first;
  behind them sit match entries posted by MPI to handle unexpected
  short messages (eager short-message buffers) and unexpected long
  messages (Portals EQ event only).]

  Portals matches an incoming message against the pre-posted receives
  and delivers the message data directly into the user buffer. An
  unexpected message instead lands in an unexpected-message buffer and
  generates two entries on the unexpected EQ.

 Cray Inc. Proprietary                                               34
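
 As a concrete illustration of pre-posting, a minimal halo-exchange
 sketch: the receives go up before the matching sends, so Portals can
 deliver incoming data straight into the user buffers rather than
 through the unexpected-message path. The neighbor list, counts, and
 tag are illustrative placeholders, not taken from any particular
 application.

 #include <mpi.h>

 void exchange(double *recvbuf, double *sendbuf, int count,
               int nneigh, const int *neighbors, MPI_Comm comm)
 {
   MPI_Request reqs[2 * nneigh];          /* C99 variable-length array */

   /* Post every receive before any matching send is issued. */
   for (int i = 0; i < nneigh; i++)
     MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE,
               neighbors[i], 99, comm, &reqs[i]);

   for (int i = 0; i < nneigh; i++)
     MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE,
               neighbors[i], 99, comm, &reqs[nneigh + i]);

   /* Independent computation could be overlapped here before waiting. */
   MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
 }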
Using Rank Reordering

  [Slides 45–49: figures illustrating different rank orderings]

Even Smarter Rank Ordering
    If you understand your decomposition well enough, you may
    be able to map it to the network
    Craypat 5.0 adds the grid_order and mgrid_order tools to
    help
    For more information, run grid_order or mgrid_order with no
    options




November 09                                                 Slide 50
grid_order Example
Usage: grid_order -c n1,n2,... -g N1,N2,... [-o d1,d2,...] [-m max]

    This program can be used for placement of the ranks of an
    MPI program that uses communication between nearest neighbors
    in a grid, or lattice.

    For example, consider an application in which each MPI rank
    computes values for a lattice point in an N by M grid,
    communicates with its nearest neighbors in the grid,
    and is run on quad-core processors. Then with the options:
       -c 2,2 -g N,M
    this program will produce a list of ranks suitable for use in
    the MPICH_RANK_ORDER file, such that a block of four nearest
    neighbors is placed on each processor.

    If the same application is run on nodes containing two quad-
    core processors, then either of the following can be used:
       -c 2,4 -g M,N
       -c 4,2 -g M,N



November 09                                                           Slide 51
Auto-Scaling MPI Environment Variables
  Default values for various MPI job sizes

  MPI Environment Variable Name     1,000 PEs   10,000 PEs   50,000 PEs   100,000 PEs

  MPICH_MAX_SHORT_MSG_SIZE          128,000     20,480       4,096        2,048
  (determines whether a message     bytes       bytes        bytes        bytes
  uses the Eager or Rendezvous
  protocol)

  MPICH_UNEX_BUFFER_SIZE            60 MB       60 MB        150 MB       260 MB
  (the buffer allocated to hold
  unexpected Eager data)

  MPICH_PTL_UNEX_EVENTS             20,480      22,000       110,000      220,000
  (Portals generates two events     events      events       events       events
  for each unexpected message
  received)

  MPICH_PTL_OTHER_EVENTS            2,048       2,500        12,500       25,000
  (Portals send-side and            events      events       events       events
  expected events)

    July 2009                                                                  Slide 52
MPICH Performance Variable: MPICH_MAX_SHORT_MSG_SIZE
        Controls the message sending protocol (default: 128,000 bytes)
           Message sizes <= MSG_SIZE: use EAGER
           Message sizes >  MSG_SIZE: use RENDEZVOUS
           Increasing this variable may require that
           MPICH_UNEX_BUFFER_SIZE be increased
        Increase MPICH_MAX_SHORT_MSG_SIZE if the app sends
        large msgs (>128K) and receives are pre-posted
           Can reduce messaging overhead via the EAGER protocol
           Can reduce network contention
        Decrease MPICH_MAX_SHORT_MSG_SIZE if the app sends msgs
        in the 32K–128K range and receives are not pre-posted
Cray Inc. Proprietary             53
COMPUTATION


November 09          Slide 54
Vectorization
  Stride-one memory accesses
  No IF tests
  No subroutine calls
     Inline
     Module functions
     Statement functions
  Know the size of the loop
  Loop nest
     Stride one on the inside
     Longest on the inside
  Unroll small loops
  Increase computational intensity
     CI = (vector flops / number of memory accesses)
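
 An illustrative comparison of computational intensity (not from the
 slides): daxpy performs 2 flops per 3 memory references, while a dot
 product performs 2 flops per 2 references because the accumulator
 stays in a register.

 void daxpy(int n, double a, const double * restrict x, double * restrict y)
 {
   for (int i = 0; i < n; i++)
     y[i] += a * x[i];        /* 2 flops; 2 loads + 1 store -> CI ~ 0.67 */
 }

 double ddot(int n, const double * restrict x, const double * restrict y)
 {
   double s = 0.0;
   for (int i = 0; i < n; i++)
     s += x[i] * y[i];        /* 2 flops; 2 loads, s held in a register -> CI = 1.0 */
   return s;
 }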
C pointers
(     53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa, int cola, int colb)
(     54) {
(     55)   int i, j, k;          /* loop counters */
(     56)   int rowc, colc, rowb; /* sizes not passed as arguments */
(     57)   double con;           /* constant value */
(     58)
(     59)   rowb = cola;
(     60)   rowc = rowa;
(     61)   colc = colb;
(     62)
(     63)   for(i=0;i<rowc;i++) {
(     64)     for(k=0;k<cola;k++) {
(     65)       con = *(a + i*cola +k);
(     66)       for(j=0;j<colc;j++) {
(     67)         *(c + i*colc + j) += con * *(b + k*colb + j);
(     68)       }
(     69)     }
(     70)   }
(     71) }

mat_mul_daxpy:
    66, Loop not vectorized: data dependency
        Loop not vectorized: data dependency
        Loop unrolled 4 times
April 2008                             ORNL/NICS Users' Meeting                          Slide 56
C pointers, rewrite
(    53) void mat_mul_daxpy(double* restrict a, double* restrict b, double*
    restrict c, int rowa, int cola, int colb)
(    54) {
(    55)   int i, j, k;          /* loop counters */
(    56)   int rowc, colc, rowb; /* sizes not passed as arguments */
(    57)   double con;           /* constant value */
(    58)
(    59)   rowb = cola;
(    60)   rowc = rowa;
(    61)   colc = colb;
(    62)
(    63)   for(i=0;i<rowc;i++) {
(    64)     for(k=0;k<cola;k++) {
(    65)       con = *(a + i*cola +k);
(    66)       for(j=0;j<colc;j++) {
(    67)         *(c + i*colc + j) += con * *(b + k*colb + j);
(    68)       }
(    69)     }
(    70)   }
(    71) }

April 2008                       ORNL/NICS Users' Meeting                     Slide 57
C pointers, rewrite
66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with no peeling and more aligned moves -
   executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with more aligned moves - executed if loop
   count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop


• This can also be achieved with the PGI safe pragma and -Msafeptr
    compiler option, or the Pathscale -OPT:alias option




April 2008                      ORNL/NICS Users' Meeting                      Slide 58
Nested Loops

             (    47)       DO 45020 I = 1, N
             (    48)        F(I) = A(I) + .5
             (    49)        DO 45020 J = 1, 10
             (    50)         D(I,J) = B(J) * F(I)
             (    51)         DO 45020 K = 1, 5
             (    52)          C(K,I,J) = D(I,J) * E(K)
             (    53) 45020 CONTINUE



             PGI
               49, Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                   Loop unrolled 2 times (completely unrolled)
             Pathscale
             (lp45020.f:48) LOOP WAS VECTORIZED.
             (lp45020.f:48) Non-contiguous array "C(_BLNK__.0.0)" reference exists. Loop
             was not vectorized.




April 2008                  ORNL/NICS Users' Meeting                                       59
Rewrite
             (   71)    DO 45021 I = 1,N
             (   72)    F(I) = A(I) + .5
             (   73) 45021 CONTINUE
             (   74)
             (   75)    DO 45022 J = 1, 10
             (   76)    DO 45022 I = 1, N
             (   77)     D(I,J) = B(J) * F(I)
             (   78) 45022 CONTINUE
             (   79)
             (   80)    DO 45023 K = 1, 5
             (   81)    DO 45023 J = 1, 10
             (   82)     DO 45023 I = 1, N
             (   83)     C(K,I,J) = D(I,J) * E(K)
             (   84) 45023 CONTINUE




April 2008                     ORNL/NICS Users' Meeting   60
PGI
                 73, Generated an alternate loop for the inner loop
                   Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                   Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                78, Generated 2 alternate loops for the inner loop
                   Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                   Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                   Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
                82, Interchange produces reordered loop nest: 83, 84, 82
                   Loop unrolled 5 times (completely unrolled)
                84, Generated vector sse code for inner loop
                   Generated 1 prefetch instructions for this loop
             Pathscale
             (lp45020.f:73) LOOP WAS VECTORIZED.
             (lp45020.f:78) LOOP WAS VECTORIZED.
             (lp45020.f:78) LOOP WAS VECTORIZED.
             (lp45020.f:84) Non-contiguous array "C(_BLNK__.0.0)" reference
             exists. Loop was not vectorized.
             (lp45020.f:84) Non-contiguous array "C(_BLNK__.0.0)" reference
             exists. Loop was not vectorized.


April 2008                ORNL/NICS Users' Meeting                            61
GNU Malloc
        GNU malloc library
          malloc, calloc, realloc, free calls
            Fortran dynamic variables
        Malloc library system calls
          mmap, munmap => used for larger allocations
          brk, sbrk => increase/decrease the heap
        The malloc library is optimized for low system-memory use
          Can result in system calls/minor page faults


Cray Inc. Proprietary         63
Improving GNU Malloc
        Detecting “bad” malloc behavior
          FPMPI data => “excessive system time”
        Correcting “bad” malloc behavior
          Eliminate mmap use by malloc
          Increase threshold to release heap memory
        Use environment variables to alter malloc
          MALLOC_MMAP_MAX_ = 0
          MALLOC_TRIM_THRESHOLD_ = 536870912
        Possible downsides
          Heap fragmentation
          User process may call mmap directly
          User process may launch other processes


Cray Inc. Proprietary          64
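
 The same tuning can also be done programmatically with glibc's
 mallopt(); a minimal sketch, assuming the code can be modified and
 this is called early in main(). The 512 MB value simply mirrors the
 MALLOC_TRIM_THRESHOLD_ setting above.

 #include <malloc.h>

 void tune_malloc(void)
 {
   mallopt(M_MMAP_MAX, 0);               /* never satisfy allocations via mmap        */
   mallopt(M_TRIM_THRESHOLD, 536870912); /* keep up to 512 MB of free heap unreleased */
 }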
Google TCMalloc
        Google created a replacement “malloc” library
          “Minimal” TCMalloc replaces GNU malloc
        Limited testing indicates TCMalloc as good or
        better than GNU malloc
          Environment variables not required
          TCMalloc almost certainly better for allocations
          in OpenMP parallel regions
        Kevin Thomas built/installed “Minimal” TCMalloc
          To use it or try it:
                        module use /cray/iss/park_bench/lib/modulesfiles
                        module load tcmalloc
                        Link your app
Cray Inc. Proprietary                            65
I/O


November 09   Slide 66
I/O: Don't overdo it.
  Countless techniques for doing I/O operations
     Varying difficulty
     Varying efficiency
  All techniques suffer from the same phenomenon: eventually
  performance turns over.
     Limited disk bandwidth
     Limited interconnect bandwidth
     Limited filesystem parallelism
  Respect the limits or suffer the consequences.

  [Chart: "FPP Write (MB/s)" — file-per-process write bandwidth versus
  number of writers]
Lustre Striping

  Lustre is parallel, not
  paranormal
  Striping is critical and
  often overlooked
  Writing many-to-one
  requires a large stripe
  count
  Writing many-to-many
  requires a single stripe
Lustre Striping: How to do it
  Files inherit the striping of the parent directory
       Input directory must be striped before copying in data
       Output directory must be striped before running
  May also "touch" a file using the lfs command
  An API to stripe a file programmatically is often requested;
  here's how to do it.
       Call from only one processor
  New support in xt-mpt for striping hints
       striping_factor
       striping_size

#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>

int open_striped(char *filename, int mode, int stripe_size,
                 int stripe_offset, int stripe_count)
{
  int fd;
  struct lov_user_md opts = {0};

  opts.lmm_magic = LOV_USER_MAGIC;
  opts.lmm_stripe_size = stripe_size;
  opts.lmm_stripe_offset = stripe_offset;
  opts.lmm_stripe_count = stripe_count;

  fd = open64(filename, O_CREAT | O_EXCL | O_LOV_DELAY_CREATE | mode, 0644);
  if (fd >= 0)
    ioctl(fd, LL_IOC_LOV_SETSTRIPE, &opts);

  return fd;
}
Lustre Striping: Picking a Stripe Size

  We know that large writes perform better, so we buffer
  We can control our buffer size
  We can ALSO control our stripe size
  Misaligning buffer and stripe sizes can hurt your performance

  [Diagram: rows of 4MB and 5MB blocks illustrating aligned versus
  misaligned buffer and stripe sizes]
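
 A tiny helper along these lines can keep the write-buffer size a
 whole multiple of the stripe size; purely illustrative, with no
 particular I/O library assumed.

 #include <stddef.h>

 /* Round a requested buffer size down to a whole number of stripes,
  * but never below one stripe. */
 size_t aligned_buffer_size(size_t requested, size_t stripe_size)
 {
   if (requested < stripe_size)
     return stripe_size;
   return (requested / stripe_size) * stripe_size;
 }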
Subgrouping: MPI-IO Collective I/O
 MPI-IO provides a way to handle buffering and grouping
 behind the scenes
    Advantage: Little or No code changes
    Disadvantage: Little or No knowledge of what’s actually done
 Use Collective file access
    MPI_File_write_all – Specify file view first
    MPI_File_write_at_all – Calculate offset for each write
 Set the cb_* hints
    cb_nodes – number of I/O aggregators
    cb_buffer_size – size of collective buffer
    romio_cb_write – enable/disable collective buffering
 No need to split comms, gather data, etc.
 See David Knaak’s paper at http://docs.cray.com/kbase for
 details
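
 A minimal sketch of what this looks like in code, assuming each rank
 writes one contiguous block at an offset of rank*count; the file name
 and hint values (16 aggregators, 16 MB collective buffer) are
 illustrative, not recommendations.

 #include <mpi.h>

 void write_collective(MPI_Comm comm, const double *buf, int count)
 {
   int rank;
   MPI_File fh;
   MPI_Info info;
   MPI_Offset offset;

   MPI_Comm_rank(comm, &rank);

   MPI_Info_create(&info);
   MPI_Info_set(info, "cb_nodes", "16");             /* number of I/O aggregators   */
   MPI_Info_set(info, "cb_buffer_size", "16777216"); /* collective buffer size      */
   MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering  */

   MPI_File_open(comm, "output.dat",
                 MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

   /* Each rank writes its block at an explicit offset, collectively. */
   offset = (MPI_Offset)rank * count * sizeof(double);
   MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

   MPI_File_close(&fh);
   MPI_Info_free(&info);
 }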
MPIIO Collective Buffering: Outline
 All PE’s must access via:
        MPI_Info_create(info)
        MPI_File_open(comm, filename, amode, info)
        MPI_File_write_all(handle, buffer, count, datatype)
 Specific PE’s will be selected as “aggregators”
 MPI writers will share meta-data with aggregators
 Data will be transferred to aggregators and aligned
 Aggregators will send data off to the OST(s)




                                                              73
 March 2009
MPIIO Collective Buffering: What to know
    setenv MPICH_MPIIO_CB_ALIGN 0|1|2 default is 2 if not
    set
             romio_cb_[read|write]=automatic Enables collective buffering
             cb_nodes=<number of OST’s> (set by lfs)
             cb_config_list=*:* allows as many aggregators as needed on a node
             striping_factor=<number of lustre stripes>
             striping_unit=<set equal to lfs setstripe>
    Limitation: many small files that don’t need aggregation
    Need to set striping pattern manually with lfs
             Ideally 1 OST per aggregator
             If not possible, N aggregators per OST (N a small integer)
    View statistics by setting MPICH_MPIIO_XSTATS=[0|1|2]



March 2009                                                                   74
Application I/O wallclock times (MPI_Wtime)

  zaiowr wallclock times (sec)   Collective Buffering   No Collective Buffering
  7 OSTs                         5.498                  23.167
  8 OSTs                         4.966                  24.989
  28 OSTs                        2.246                  17.466
  Serial I/O, 28 OSTs            22.214
MPI-IO Improvements
    MPI-IO collective buffering
            MPICH_MPIIO_CB_ALIGN=0
              Divides the I/O workload equally among all aggregators
              Inefficient if multiple aggregators reference the same physical I/O block
              Default setting in MPT 3.2 and prior versions

            MPICH_MPIIO_CB_ALIGN=1
              Divides the I/O workload up among the aggregators based on physical I/O
              boundaries and the size of the I/O request
              Allows only one aggregator access to any stripe on a single I/O call
              Available in MPT 3.1

            MPICH_MPIIO_CB_ALIGN=2
              Divides the I/O workload into Lustre stripe-sized groups and assigns them
              to aggregators
              Persistent across multiple I/O calls, so each aggregator always accesses
              the same set of stripes and no other aggregator accesses those stripes
              Minimizes Lustre file system lock contention
              Default setting in MPT 3.3
July 2009                                                                            Slide 76
IOR benchmark, 1,000,000 bytes
      MPI-IO API, non-power-of-2 blocks and transfers, in this case blocks and
      transfers both of 1M bytes and a strided access pattern. Tested on an
      XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220
      segments, 96 GB file.

      [Chart: write bandwidth in MB/sec]

July 2009
HYCOM MPI-2 I/O
      On 5107 PEs, and by application design, a subset of the PEs (88) do the
      writes. With collective buffering, this is further reduced to 22 aggregators
      (cb_nodes) writing to 22 stripes. Tested on an XT5 with 5107 PEs, 8
      cores/node.

      [Chart: write bandwidth in MB/sec]

July 2009
Subgrouping: By Hand

  Lose ease-of-use, gain control
  Countless methods to implement
      Simple gathering
      Serialized sends within group
      Write token
      Double buffered
      Bucket brigade
      ...
  Look for existing groups in your code
  Even the simplest solutions often seem to work.
      Try to keep the pipeline full
      Always be doing I/O
  Now we can think about multiple shared files!

  [Image caption: "I find your lack of faith in ROMIO disturbing."]
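
 For reference, a minimal sketch of the "simple gathering" variant:
 ranks within a subcommunicator gather their blocks to a group leader,
 which performs one large write per group. The group size of 64 and the
 output file naming are illustrative assumptions.

 #include <mpi.h>
 #include <stdio.h>
 #include <stdlib.h>

 void write_by_group(MPI_Comm world, const double *block, int count)
 {
   int rank, grp_rank, grp_size;
   MPI_Comm group;

   MPI_Comm_rank(world, &rank);
   /* e.g. one writer per 64 ranks -> one file per group */
   MPI_Comm_split(world, rank / 64, rank, &group);
   MPI_Comm_rank(group, &grp_rank);
   MPI_Comm_size(group, &grp_size);

   double *gathered = NULL;
   if (grp_rank == 0)
     gathered = malloc((size_t)grp_size * count * sizeof(double));

   MPI_Gather(block, count, MPI_DOUBLE,
              gathered, count, MPI_DOUBLE, 0, group);

   if (grp_rank == 0) {
     char name[64];
     snprintf(name, sizeof(name), "out.%d", rank / 64);
     FILE *fp = fopen(name, "wb");
     fwrite(gathered, sizeof(double), (size_t)grp_size * count, fp);
     fclose(fp);
     free(gathered);
   }
   MPI_Comm_free(&group);
 }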
What about HDF5, NetCDF, Etc?

  Every code uses these very differently
  Follow as many of the same rules as possible
  It is very possible to get good results, but also possible to get bad ones
  Because Parallel HDF5 is written over MPI-IO, it's possible to use hints
  Check out ADIOS from ORNL & GT

  [Chart: "IOR: Shared File Writes" — bandwidth versus number of
  writers for POSIX, MPIIO, and HDF5]
SOME INTERESTING CASE STUDIES
Provided by Lonnie Crosby (NICS/UT)

  [Case-study slides follow in the original deck]

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
Rob Gillen
 
UA Mobile 2012 (English)
UA Mobile 2012 (English)UA Mobile 2012 (English)
UA Mobile 2012 (English)
dmalykhanov
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
Hajime Tazaki
 

Was ist angesagt? (20)

Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
Do Theoretical Flo Ps Matter For Real Application’S Performance Kaust 2012
 
Porting the Source Engine to Linux: Valve's Lessons Learned
Porting the Source Engine to Linux: Valve's Lessons LearnedPorting the Source Engine to Linux: Valve's Lessons Learned
Porting the Source Engine to Linux: Valve's Lessons Learned
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
UA Mobile 2012 (English)
UA Mobile 2012 (English)UA Mobile 2012 (English)
UA Mobile 2012 (English)
 
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
 
How To Train Your Calxeda EnergyCore
How To Train Your  Calxeda EnergyCoreHow To Train Your  Calxeda EnergyCore
How To Train Your Calxeda EnergyCore
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV
 
NvFX GTC 2013
NvFX GTC 2013NvFX GTC 2013
NvFX GTC 2013
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Virtual net performance
Virtual net performanceVirtual net performance
Virtual net performance
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
macOSの仮想化技術について ~Virtualization-rs Rust bindings for virtualization.framework ~
macOSの仮想化技術について ~Virtualization-rs Rust bindings for virtualization.framework ~macOSの仮想化技術について ~Virtualization-rs Rust bindings for virtualization.framework ~
macOSの仮想化技術について ~Virtualization-rs Rust bindings for virtualization.framework ~
 
vkFX: Effect(ive) approach for Vulkan API
vkFX: Effect(ive) approach for Vulkan APIvkFX: Effect(ive) approach for Vulkan API
vkFX: Effect(ive) approach for Vulkan API
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013
 
Java and Containers - Make it Awesome !
Java and Containers - Make it Awesome !Java and Containers - Make it Awesome !
Java and Containers - Make it Awesome !
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 

Andere mochten auch (6)

Drupal1a
Drupal1aDrupal1a
Drupal1a
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
 
1ºtecnologias emergentes-thinkepi-fesabid-2011-masmedios
1ºtecnologias emergentes-thinkepi-fesabid-2011-masmedios1ºtecnologias emergentes-thinkepi-fesabid-2011-masmedios
1ºtecnologias emergentes-thinkepi-fesabid-2011-masmedios
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming Models
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SE
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 

Ähnlich wie XT Best Practices

May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
Jeff Larkin
 
Open Storage Sun Intel European Business Technology Tour
Open Storage Sun Intel European Business Technology TourOpen Storage Sun Intel European Business Technology Tour
Open Storage Sun Intel European Business Technology Tour
Walter Moriconi
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
NETWAYS
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
imec.archive
 
0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02
chon2010
 

Ähnlich wie XT Best Practices (20)

May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRC
 
Open Storage Sun Intel European Business Technology Tour
Open Storage Sun Intel European Business Technology TourOpen Storage Sun Intel European Business Technology Tour
Open Storage Sun Intel European Business Technology Tour
 
Next Stop, Android
Next Stop, AndroidNext Stop, Android
Next Stop, Android
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database Architecture
 
GoSF Jan 2016 - Go Write a Plugin for Snap!
GoSF Jan 2016 - Go Write a Plugin for Snap!GoSF Jan 2016 - Go Write a Plugin for Snap!
GoSF Jan 2016 - Go Write a Plugin for Snap!
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Lev
LevLev
Lev
 
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanScala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
 
0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02
 
OpenPOWER Application Optimisation meet up
OpenPOWER Application Optimisation meet up OpenPOWER Application Optimisation meet up
OpenPOWER Application Optimisation meet up
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
 

Mehr von Jeff Larkin

CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
Jeff Larkin
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Jeff Larkin
 

Mehr von Jeff Larkin (10)

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive Parallelism
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIA
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
 

Kürzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

XT Best Practices

  • 11. Profile and Debug “at Scale”
      Codes don’t run the same at thousands of cores as at hundreds, at tens of thousands as at thousands, or at hundreds of thousands as at tens of thousands.
      Determine how many nodes you wish to run on and test at that size; don’t test for throughput.
      [Chart: “Scaling (Time)” – time in seconds for code versions v1 and v2 at core counts from 122 up to 150,304.]
  • 12. Profile and Debug with Real Science
      Choose a real problem, not a toy.
      What do you want to achieve with the machine?
      How will you really run your code?
      Will you really run without I/O?
  • 13. What Data Should We Collect? Compiler Feedback
      Most compilers can tell you a lot about what they do to your code:
        Did your important loop vectorize?
        Was a routine inlined?
        Was this loop unrolled?
      It’s just as important to know what the compiler didn’t do:
        Some C/C++ loops won’t vectorize without some massaging.
        Some loop counts aren’t known at runtime, so optimization is limited.
      Flags to know:
        PGI: -Minfo=all –Mneginfo=all
        Cray: -rm
        Pathscale: -LNO:simd_verbose=ON
        Intel: -vec-report1
  • 14. Compiler Feedback Examples: PGI
        ! Matrix Multiply
        do k = 1, N
          do j = 1, N
            do i = 1, N
              c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
          end do
        end do

        24, Loop interchange produces reordered loop nest: 25,24,26
        26, Generated an alternate loop for the loop
            Generated vector sse code for the loop
            Generated 2 prefetch instructions for the loop
  • 15. Compiler Feedback Examples: Cray
        23.                    ! Matrix Multiply
        24.  ib------------<   do k = 1, N
        25.  ib ibr4-------<     do j = 1, N
        26.  ib ibr4 Vbr4--<       do i = 1, N
        27.  ib ibr4 Vbr4            c(i,j) = c(i,j) + a(i,k) * b(k,j)
        28.  ib ibr4 Vbr4-->       end do
        29.  ib ibr4------->     end do
        30.  ib------------>   end do
        Loopmark legend: i – interchanged, b – blocked, r – unrolled, V – vectorized
  • 16. Compiler Feedback Examples: Pathscale
        -LNO:simd_verbose appears to be broken
  • 17. Compiler Feedback Examples: Intel
        mm.F90(14): (col. 3) remark: PERMUTED LOOP WAS VECTORIZED.
        mm.F90(25): (col. 7) remark: LOOP WAS VECTORIZED.
  • 18. What Data Should We Collect? Hardware Performance Counters
      Using tools like CrayPat it’s possible to know what the processor is doing.
      We can get counts for:
        FLOPS
        Cache Hits/Misses
        TLB Hits/Misses
        Stalls
        …
      We can derive:
        FLOP Rate
        Cache Hit/Miss Ratio
        Computational Intensity
        …
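      The derived metrics are simple ratios of the raw counter totals. A minimal sketch of the arithmetic in C, using the counter values from the sample CrayPat report on slide 23 below (the variable names are illustrative, not part of CrayPat's output):

        #include <stdio.h>

        int main(void)
        {
            /* Raw per-function totals of the kind CrayPat reports (values copied
               from the sample on slide 23). */
            double fp_ops   = 458044890200.0;  /* PAPI_FP_OPS */
            double l1_refs  = 484310815400.0;  /* PAPI_L1_DCA */
            double l1_miss  = 7569486532.0;    /* PAPI_L1_DCM */
            double tlb_miss = 117992047.0;     /* PAPI_TLB_DM */
            double seconds  = 525.438;         /* user time   */

            printf("FLOP rate:               %.1f MFLOP/s\n", fp_ops / seconds / 1.0e6);
            printf("Computational intensity: %.2f ops/ref\n", fp_ops / l1_refs);
            printf("D1 cache hit ratio:      %.1f%%\n", 100.0 * (1.0 - l1_miss / l1_refs));
            printf("TLB utilization:         %.1f refs/miss\n", l1_refs / tlb_miss);
            return 0;
        }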
  • 19. Gathering Performance Data
      Use CrayPat’s APA:
        First, gather sampling for a line-number profile
          Light-weight; guides what should and should not be instrumented.
        Second, gather instrumentation (-g mpi,io,blas,lapack,math …)
          Hardware counters, MPI message passing information, I/O information, math libraries, …
      Workflow:
        load module, make
        pat_build -O apa a.out
        execute
        pat_report *.xf
        examine *.apa
        pat_build -O *.apa
        execute
  • 20. Sample APA File (Significantly Trimmed)
        # You can edit this file, if desired, and use it
        # ----------------------------------------------------------------------
        # HWPC group to collect by default.
        -Drtenv=PAT_RT_HWPC=1   # Summary with TLB metrics.
        # ----------------------------------------------------------------------
        # Libraries to trace.
        -g mpi,math
        # ----------------------------------------------------------------------
        -w   # Enable tracing of user-defined functions.
             # Note: -u should NOT be specified as an additional option.
        # 9.08%  188347 bytes
        -T ratt_i_
        # 6.71%  177904 bytes
        -T rhsf_
        # 5.61%  205682 bytes
        -T ratx_i_
        # 5.59%  22005 bytes
        -T transport_m_computespeciesdiffflux_
  • 21. CrayPAT Groups
        biolib     Cray Bioinformatics library routines
        blacs      Basic Linear Algebra communication subprograms
        blas       Basic Linear Algebra subprograms
        caf        Co-Array Fortran (Cray X2 systems only)
        fftw       Fast Fourier Transform library (64-bit only)
        hdf5       manages extremely large and complex data collections
        heap       dynamic heap
        io         includes stdio and sysio groups
        lapack     Linear Algebra Package
        lustre     Lustre File System
        math       ANSI math
        mpi        MPI
        netcdf     network common data form (manages array-oriented scientific data)
        omp        OpenMP API (not supported on Catamount)
        omp-rtl    OpenMP runtime library (not supported on Catamount)
        portals    Lightweight message passing API
        pthreads   POSIX threads (not supported on Catamount)
        scalapack  Scalable LAPACK
        shmem      SHMEM
        stdio      all library functions that accept or return the FILE* construct
        sysio      I/O system calls
        system     system calls
        upc        Unified Parallel C (Cray X2 systems only)
  • 22. Useful API Calls, for the control-freaks
      Regions, useful to break up long routines:
        int PAT_region_begin (int id, const char *label)
        int PAT_region_end (int id)
      Disable/Enable profiling, useful for excluding initialization:
        int PAT_record (int state)
      Flush buffer, useful when the program isn’t exiting cleanly:
        int PAT_flush_buffer (void)
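      A minimal sketch of how these calls might be wired into a code, following the prototypes listed above and assuming the CrayPat API header pat_api.h with its PAT_STATE_ON / PAT_STATE_OFF constants (the region id, label, and stub routines are illustrative):

        #include <pat_api.h>

        static void read_input(void)  { /* placeholder for setup work  */ }
        static void solver_step(void) { /* placeholder for real work   */ }

        int main(void)
        {
            PAT_record(PAT_STATE_OFF);              /* exclude initialization from the profile */
            read_input();
            PAT_record(PAT_STATE_ON);               /* profiling back on for the solve */

            for (int step = 0; step < 100; step++) {
                PAT_region_begin(1, "solver_step"); /* id 1; label shows up in pat_report */
                solver_step();
                PAT_region_end(1);
            }

            PAT_flush_buffer();                     /* push buffered records out before exit */
            return 0;
        }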
  • 23. Sample CrayPAT HWPC Data
        USER / rhsf_
        ------------------------------------------------------------------------
        Time%                        74.6%
        Time                    556.742885 secs
        Imb.Time                 14.817686 secs
        Imb.Time%                     2.6%
        Calls                    2.3 /sec           1200.0 calls
        PAPI_L1_DCM          14.406M/sec      7569486532 misses
        PAPI_TLB_DM           0.225M/sec       117992047 misses
        PAPI_L1_DCA         921.729M/sec    484310815400 refs
        PAPI_FP_OPS         871.740M/sec    458044890200 ops
        User time (approx)      525.438 secs  1103418882813 cycles  94.4%Time
        Average Time per Call  0.463952 sec
        CrayPat Overhead : Time  0.0%
        HW FP Ops / User time  871.740M/sec  458044890200 ops  10.4%peak(DP)
        HW FP Ops / WCT        822.722M/sec
        Computational intensity  0.42 ops/cycle   0.95 ops/ref
        MFLOPS (aggregate)     1785323.32M/sec
        TLB utilization        4104.61 refs/miss  8.017 avg uses
        D1 cache hit,miss ratios  98.4% hits      1.6% misses
        D1 cache utilization (M)  63.98 refs/miss 7.998 avg uses
  • 24. CrayPAT HWPC Groups
        0   Summary with instruction metrics
        1   Summary with TLB metrics
        2   L1 and L2 metrics
        3   Bandwidth information
        4   Hypertransport information
        5   Floating point mix
        6   Cycles stalled, resources idle
        7   Cycles stalled, resources full
        8   Instructions and branches
        9   Instruction cache
        10  Cache hierarchy
        11  Floating point operations mix (2)
        12  Floating point operations mix (vectorization)
        13  Floating point operations mix (SP)
        14  Floating point operations mix (DP)
        15  L3 (socket-level)
        16  L3 (core-level reads)
        17  L3 (core-level misses)
        18  L3 (core-level fills caused by L2 evictions)
        19  Prefetches
  • 25. MPI Statistics
      How many times are MPI routines called, and with how much data?
      Do I have a load imbalance?
      Are my processors waiting for data?
      Could I perform better by adjusting MPI environment variables?
  • 26. Sample CrayPAT MPI Statistics (excerpt)
        Columns: MPI Msg Bytes | MPI Msg Count | MsgSz <16B | 16B<=MsgSz<256B | 256B<=MsgSz<4KB | 4KB<=MsgSz<64KB | Function / Caller / PE[mmm]
        Total:          3062457144.0 | 144952.0 | 15022.0 | 39.0 | 64522.0 | 65369.0
        mpi_isend_:     3059984152.0 | 129926.0 |    --   | 36.0 | 64522.0 | 65368.0
          MPP_DO_UPDATE_R8_3DV / MPP_UPDATE_DOMAIN2D_R8_3DV (MPP_DOMAINS_MOD):
                        1727628971.0 |  63645.1 |    --   |  4.0 | 31817.1 | 31824.0
            DYN_CORE / FV_DYNAMICS / ATMOSPHERE / MAIN__ / main:
                        1680716892.0 |  61909.4 |    --   |  --  | 30949.4 | 30960.0
              pe.13666: 1680756480.0 |  61920.0 |    --   |  --  | 30960.0 | 30960.0
              pe.8949:  1680756480.0 |  61920.0 |    --   |  --  | 30960.0 | 30960.0
              pe.12549: 1651777920.0 |  54180.0 |    --   |  --  | 23220.0 | 30960.0
  • 27. I/O Statistics
      How much time is spent in I/O?
      How much data am I writing or reading?
      How many processes are performing I/O?
  • 29. Step by Step
      1. Fix any load imbalance
      2. Fix your hotspots
         1. Communication
            • Pre-post receives
            • Overlap computation and communication
            • Reduce collectives
            • Adjust MPI environment variables
            • Use rank reordering
         2. Computation
            • Examine the hardware counters and compiler feedback
            • Adjust the compiler flags, directives, or code structure to improve performance
         3. I/O
            • Stripe files/directories appropriately
            • Use methods that scale: MPI-IO or subsetting
      At each step, check your answers and performance. Between each step, gather your data again.
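      To illustrate the first two communication items (pre-posting receives and overlapping computation with communication), here is a minimal, hedged sketch of a 1-D halo-style exchange in C with MPI; the neighbor pattern, buffer sizes, and work routines are placeholders rather than anything from a particular application:

        #include <mpi.h>
        #include <stdlib.h>

        #define N 10000   /* halo size in doubles, illustrative */

        static void compute_interior(double *u)                     { (void)u; }
        static void compute_boundary(double *u, const double *halo) { (void)u; (void)halo; }

        int main(int argc, char **argv)
        {
            int rank, size, left, right;
            double *u, *halo_in, *halo_out;
            MPI_Request req[4];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            left  = (rank - 1 + size) % size;   /* periodic 1-D neighbors */
            right = (rank + 1) % size;

            u        = calloc(10 * N, sizeof(double));
            halo_in  = calloc(2 * N,  sizeof(double));
            halo_out = calloc(2 * N,  sizeof(double));

            /* Pre-post the receives before the matching sends are issued, so data is
               delivered straight into the user buffers rather than into the
               unexpected-message buffers (see the "Why we pre-post" slide below). */
            MPI_Irecv(&halo_in[0], N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(&halo_in[N], N, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);

            MPI_Isend(&halo_out[0], N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(&halo_out[N], N, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

            /* Overlap: work that does not need halo data proceeds while messages fly. */
            compute_interior(u);

            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

            /* Only the boundary work had to wait for communication. */
            compute_boundary(u, halo_in);

            free(u); free(halo_in); free(halo_out);
            MPI_Finalize();
            return 0;
        }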
  • 31. CrayPat load-imbalance data
        Table 1: Profile by Function Group and Function (PE='HIDE')
        Time % |        Time |  Imb. Time | Imb. Time % |     Calls | Group / Function
        100.0% | 1061.141647 |         -- |          -- | 3454195.8 | Total
         70.7% |  750.564025 |         -- |          -- |  280169.0 | MPI_SYNC
         45.3% |  480.828018 | 163.575446 |       25.4% |   14653.0 |   mpi_barrier_(sync)
         18.4% |  195.548030 |  33.071062 |       14.5% |  257546.0 |   mpi_allreduce_(sync)
          7.0% |   74.187977 |   5.261545 |        6.6% |    7970.0 |   mpi_bcast_(sync)
         15.2% |  161.166842 |         -- |          -- | 3174022.8 | MPI
         10.1% |  106.808182 |   8.237162 |        7.2% |  257546.0 |   mpi_allreduce_
          3.2% |   33.841961 | 342.085777 |       91.0% |  755495.8 |   mpi_waitall_
         14.1% |  149.410781 |         -- |          -- |       4.0 | USER
         14.0% |  148.048597 | 446.124165 |       75.1% |       1.0 |   main
  • 32. Fixing a Load Imbalance
      What is causing the load imbalance?
        Computation
          Is the decomposition appropriate?
          Would RANK_REORDER help?
        Communication
          Is the decomposition appropriate?
          Would RANK_REORDER help?
          Are receives pre-posted?
      OpenMP may help:
        Able to spread the workload with less overhead.
        Going from all-MPI to hybrid is a large amount of work:
          • Must accept the challenge of OpenMP-izing a large amount of code.
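      For the OpenMP option, the usual first step is to thread the dominant loops inside each MPI rank. A minimal sketch, assuming a loop with independent iterations (the loop body, array names, and trip count are placeholders):

        #include <omp.h>

        #define NLOCAL 100000

        /* Thread the rank-local work so a single MPI rank can drive a whole socket. */
        void update_cells(double *a, const double *b)
        {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < NLOCAL; i++) {
                a[i] = 0.5 * (a[i] + b[i]);   /* placeholder computation */
            }
        }

      Built with the compiler's OpenMP flag (for example -mp with PGI), this lets fewer MPI ranks cover the same cores, which can reduce both the imbalance and the message counts.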
  • 33. Why we pre-post: XT MPI – Receive Side
      [Diagram] Portals matches an incoming message against the Match Entries created by MPI for
      pre-posted receives and delivers the message data directly into the user's application
      buffer. Messages with no matching pre-posted receive instead hit the Match Entries MPI
      creates for unexpected traffic: unexpected short messages are copied into unexpected
      short-message buffers, unexpected long messages leave only a Portals EQ event, and every
      unexpected message generates two entries on the unexpected event queue.
  • 49. Even Smarter Rank Ordering
      If you understand your decomposition well enough, you may be able to map it to the network.
      CrayPat 5.0 adds the grid_order and mgrid_order tools to help.
      For more information, run grid_order or mgrid_order with no options.
  • 50. grid_order Example
        Usage: grid_order -c n1,n2,... -g N1,N2,... [-o d1,d2,...] [-m max]
        This program can be used for placement of the ranks of an MPI program that uses
        communication between nearest neighbors in a grid, or lattice.
        For example, consider an application in which each MPI rank computes values for a
        lattice point in an N by M grid, communicates with its nearest neighbors in the grid,
        and is run on quad-core processors. Then with the options:
            -c 2,2 -g N,M
        this program will produce a list of ranks suitable for use in the MPICH_RANK_ORDER
        file, such that a block of four nearest neighbors is placed on each processor.
        If the same application is run on nodes containing two quad-core processors, then
        either of the following can be used:
            -c 2,4 -g M,N
            -c 4,2 -g M,N
  • 51. Auto-Scaling MPI Environment Variables
      Default values for various MPI job sizes:
        MPI Environment Variable Name   1,000 PEs   10,000 PEs   50,000 PEs   100,000 PEs
        MPICH_MAX_SHORT_MSG_SIZE        128,000 B   20,480       4096         2048
          (determines whether the message uses the Eager or Rendezvous protocol)
        MPICH_UNEX_BUFFER_SIZE          60 MB       60 MB        150 MB       260 MB
          (the buffer allocated to hold unexpected Eager data)
        MPICH_PTL_UNEX_EVENTS           20,480      22,000       110,000      220,000 events
          (Portals generates two events for each unexpected message received)
        MPICH_PTL_OTHER_EVENTS          2048        2500         12,500       25,000 events
          (Portals send-side and expected events)
  • 52. MPICH Performance Variable: MPICH_MAX_SHORT_MSG_SIZE
      Controls the message sending protocol (default: 128000 bytes):
        Message sizes <= MSG_SIZE: use EAGER
        Message sizes >  MSG_SIZE: use RENDEZVOUS
      Increasing this variable may require that MPICH_UNEX_BUFFER_SIZE also be increased.
      Increase MPICH_MAX_SHORT_MSG_SIZE if the application sends large messages (>128K) and
      receives are pre-posted:
        Can reduce messaging overhead via the EAGER protocol.
        Can reduce network contention.
      Decrease MPICH_MAX_SHORT_MSG_SIZE if the application sends messages in the 32K–128K
      range and receives are not pre-posted.
  • 54. Vectorization
      Stride-one memory accesses
      No IF tests
      No subroutine calls
        Inline
        Module functions
        Statement functions
      What is the size of the loop?
      Loop nest
        Stride one on the inside
        Longest on the inside
      Unroll small loops
      Increase computational intensity
        CI = (vector flops) / (number of memory accesses)
  • 55. C pointers
        ( 53) void mat_mul_daxpy(double *a, double *b, double *c,
                                 int rowa, int cola, int colb)
        ( 54) {
        ( 55)   int i, j, k;            /* loop counters */
        ( 56)   int rowc, colc, rowb;   /* sizes not passed as arguments */
        ( 57)   double con;             /* constant value */
        ( 58)
        ( 59)   rowb = cola;
        ( 60)   rowc = rowa;
        ( 61)   colc = colb;
        ( 62)
        ( 63)   for(i=0;i<rowc;i++) {
        ( 64)     for(k=0;k<cola;k++) {
        ( 65)       con = *(a + i*cola +k);
        ( 66)       for(j=0;j<colc;j++) {
        ( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
        ( 68)       }
        ( 69)     }
        ( 70)   }
        ( 71) }

        mat_mul_daxpy:
          66, Loop not vectorized: data dependency
              Loop not vectorized: data dependency
              Loop unrolled 4 times
  • 56. C pointers, rewrite
        ( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                                 double* restrict c, int rowa, int cola, int colb)
        ( 54) {
          ... (body unchanged from slide 55)
        ( 71) }
  • 57. C pointers, rewrite
        66, Generated alternate loop with no peeling - executed if loop count <= 24
            Generated vector sse code for inner loop
            Generated 2 prefetch instructions for this loop
            Generated vector sse code for inner loop
            Generated 2 prefetch instructions for this loop
            Generated alternate loop with no peeling and more aligned moves - executed if
              loop count <= 24 and alignment test is passed
            Generated vector sse code for inner loop
            Generated 2 prefetch instructions for this loop
            Generated alternate loop with more aligned moves - executed if loop count >= 25
              and alignment test is passed
            Generated vector sse code for inner loop
            Generated 2 prefetch instructions for this loop
        • This can also be achieved with the PGI safe pragma and the –Msafeptr compiler
          option, or the Pathscale –OPT:alias option.
  • 58. Nested Loops
        ( 47)       DO 45020 I = 1, N
        ( 48)         F(I) = A(I) + .5
        ( 49)         DO 45020 J = 1, 10
        ( 50)           D(I,J) = B(J) * F(I)
        ( 51)           DO 45020 K = 1, 5
        ( 52)             C(K,I,J) = D(I,J) * E(K)
        ( 53) 45020 CONTINUE

        PGI
          49, Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
              Loop unrolled 2 times (completely unrolled)
        Pathscale
          (lp45020.f:48) LOOP WAS VECTORIZED.
          (lp45020.f:48) Non-contiguous array "C(_BLNK__.0.0)" reference exists.
                         Loop was not vectorized.
  • 59. Rewrite
        ( 71)       DO 45021 I = 1,N
        ( 72)         F(I) = A(I) + .5
        ( 73) 45021 CONTINUE
        ( 74)
        ( 75)       DO 45022 J = 1, 10
        ( 76)         DO 45022 I = 1, N
        ( 77)           D(I,J) = B(J) * F(I)
        ( 78) 45022 CONTINUE
        ( 79)
        ( 80)       DO 45023 K = 1, 5
        ( 81)         DO 45023 J = 1, 10
        ( 82)           DO 45023 I = 1, N
        ( 83)             C(K,I,J) = D(I,J) * E(K)
        ( 84) 45023 CONTINUE
  • 60.
        PGI
          73, Generated an alternate loop for the inner loop
              Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
              Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
          78, Generated 2 alternate loops for the inner loop
              Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
              Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
              Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
          82, Interchange produces reordered loop nest: 83, 84, 82
              Loop unrolled 5 times (completely unrolled)
          84, Generated vector sse code for inner loop
              Generated 1 prefetch instructions for this loop
        Pathscale
          (lp45020.f:73) LOOP WAS VECTORIZED.
          (lp45020.f:78) LOOP WAS VECTORIZED.
          (lp45020.f:78) LOOP WAS VECTORIZED.
          (lp45020.f:84) Non-contiguous array "C(_BLNK__.0.0)" reference exists.
                         Loop was not vectorized.
          (lp45020.f:84) Non-contiguous array "C(_BLNK__.0.0)" reference exists.
                         Loop was not vectorized.
  • 62. GNU Malloc
      GNU malloc library: malloc, calloc, realloc, free calls; Fortran dynamic variables.
      Malloc library system calls:
        mmap, munmap  => for larger allocations
        brk, sbrk     => increase/decrease the heap
      The malloc library is optimized for low system memory use,
      which can result in system calls and minor page faults.
  • 63. Improving GNU Malloc
      Detecting “bad” malloc behavior:
        FPMPI data => “excessive system time”
      Correcting “bad” malloc behavior:
        Eliminate mmap use by malloc
        Increase the threshold to release heap memory
      Use environment variables to alter malloc:
        MALLOC_MMAP_MAX_ = 0
        MALLOC_TRIM_THRESHOLD_ = 536870912
      Possible downsides:
        Heap fragmentation
        User process may call mmap directly
        User process may launch other processes
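      The same two knobs can also be set from inside the program rather than through the environment. A minimal sketch, assuming glibc's mallopt() interface with the M_MMAP_MAX and M_TRIM_THRESHOLD constants from malloc.h; this mirrors the environment-variable settings above and is not Cray-specific:

        #include <malloc.h>
        #include <stdlib.h>

        int main(void)
        {
            /* Equivalent to MALLOC_MMAP_MAX_=0: never satisfy allocations with mmap. */
            mallopt(M_MMAP_MAX, 0);

            /* Equivalent to MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB of freed
               memory on the heap instead of returning it to the kernel. */
            mallopt(M_TRIM_THRESHOLD, 536870912);

            /* ... application allocations follow; call mallopt before they start ... */
            void *p = malloc(1 << 20);
            free(p);
            return 0;
        }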
  • 64. Google TCMalloc
      Google created a replacement “malloc” library; “minimal” TCMalloc replaces GNU malloc.
      Limited testing indicates TCMalloc is as good as or better than GNU malloc, and no
      environment variables are required.
      TCMalloc is almost certainly better for allocations in OpenMP parallel regions.
      Kevin Thomas built/installed “minimal” TCMalloc. To use it or try it:
        module use /cray/iss/park_bench/lib/modulesfiles
        module load tcmalloc
        Link your app
  • 65. I/O
  • 66. I/O: Don't overdo it
      There are countless techniques for doing I/O operations, of varying difficulty and
      varying efficiency.
      All techniques suffer from the same phenomenon: eventually performance turns over.
        Limited disk bandwidth
        Limited interconnect bandwidth
        Limited filesystem parallelism
      Respect the limits or suffer the consequences.
      [Chart: “FPP Write (MB/s)” – write bandwidth versus number of writers (0–10,000),
       bandwidth axis 0–35,000 MB/s.]
  • 67. Lustre Striping
      Lustre is parallel, not paranormal.
      Striping is critical and often overlooked:
        Writing many-to-one requires a large stripe count.
        Writing many-to-many requires a single stripe.
  • 68. Lustre Striping: How to do it
      Files inherit the striping of the parent directory:
        The input directory must be striped before copying in data.
        The output directory must be striped before running.
        You may also “touch” a file using the lfs command.
      An API to stripe a file programmatically is often requested; here’s how to do it
      (call it from only one processor):
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <lustre/lustre_user.h>

        int open_striped(char *filename, int mode, int stripe_size,
                         int stripe_offset, int stripe_count)
        {
          int fd;
          struct lov_user_md opts = {0};

          opts.lmm_magic = LOV_USER_MAGIC;
          opts.lmm_stripe_size = stripe_size;
          opts.lmm_stripe_offset = stripe_offset;
          opts.lmm_stripe_count = stripe_count;

          fd = open64(filename, O_CREAT | O_EXCL | O_LOV_DELAY_CREATE | mode, 0644);
          if ( fd >= 0 )
            ioctl(fd, LL_IOC_LOV_SETSTRIPE, &opts);

          return fd;
        }
      New support in xt-mpt for striping hints: striping_factor, striping_size
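      A hedged usage sketch of the open_striped() routine above (the file name, the 4 MB stripe size, and the 16-OST stripe count are illustrative; a stripe_offset of -1 is assumed to mean “let Lustre pick the starting OST”):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>

        int open_striped(char *filename, int mode, int stripe_size,
                         int stripe_offset, int stripe_count);   /* defined above */

        int main(void)
        {
            /* Create a shared output file striped 4 MB wide across 16 OSTs.
               Only one rank should do this; the others open the file normally. */
            int fd = open_striped("output.dat", O_WRONLY, 4 * 1024 * 1024, -1, 16);
            if (fd < 0) {
                perror("open_striped");
                return 1;
            }
            /* ... write, or hand the descriptor to the I/O layer ... */
            close(fd);
            return 0;
        }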
  • 69. Lustre Striping: Picking a Stripe Size
      We know that large writes perform better, so we buffer.
      We can control our buffer size, and we can ALSO control our stripe size.
      Misaligning buffer and stripe sizes can hurt your performance.
      [Diagram: 4 MB and 5 MB write buffers laid over 4 MB stripes, illustrating aligned
       versus misaligned buffer and stripe sizes.]
  • 70. Subgrouping: MPI-IO Collective I/O
      MPI-IO provides a way to handle buffering and grouping behind the scenes.
        Advantage: little or no code changes.
        Disadvantage: little or no knowledge of what’s actually done.
      Use collective file access:
        MPI_File_write_all – specify the file view first
        MPI_File_write_at_all – calculate the offset for each write
      Set the cb_* hints:
        cb_nodes – number of I/O aggregators
        cb_buffer_size – size of the collective buffer
        romio_cb_write – enable/disable collective buffering
      No need to split communicators, gather data, etc.
      See David Knaak’s paper at http://docs.cray.com/kbase for details.
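      A minimal sketch of a collective write using these hints, assuming each rank contributes one contiguous block; the hint values, file name, and block size are illustrative only:

        #include <mpi.h>
        #include <stdlib.h>

        #define NDOUBLES 1048576   /* 8 MB of doubles per rank, for illustration */

        int main(int argc, char **argv)
        {
            int rank;
            double *buf;
            MPI_File fh;
            MPI_Info info;
            MPI_Offset offset;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            buf = malloc(NDOUBLES * sizeof(double));
            for (int i = 0; i < NDOUBLES; i++) buf[i] = (double)rank;

            /* Collective-buffering hints named on the slide; values are examples only. */
            MPI_Info_create(&info);
            MPI_Info_set(info, "romio_cb_write", "enable");
            MPI_Info_set(info, "cb_nodes", "16");            /* e.g. one aggregator per OST */
            MPI_Info_set(info, "cb_buffer_size", "4194304"); /* 4 MB collective buffer */

            MPI_File_open(MPI_COMM_WORLD, "output.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

            /* Each rank writes its own contiguous block; _at_all supplies the offset. */
            offset = (MPI_Offset)rank * NDOUBLES * sizeof(double);
            MPI_File_write_at_all(fh, offset, buf, NDOUBLES, MPI_DOUBLE, MPI_STATUS_IGNORE);

            MPI_File_close(&fh);
            MPI_Info_free(&info);
            free(buf);
            MPI_Finalize();
            return 0;
        }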
  • 71. MPIIO Collective Buffering: Outline
      All PEs must access the file via:
        MPI_Info_create(info)
        MPI_File_open(comm, filename, amode, info)
        MPI_File_write_all(handle, buffer, count, datatype)
      Specific PEs will be selected as “aggregators”:
        MPI writers share meta-data with the aggregators.
        Data is transferred to the aggregators and aligned.
        The aggregators send the data off to the OST(s).
  • 72. MPIIO Collective Buffering: What to know
      setenv MPICH_MPIIO_CB_ALIGN 0|1|2   (default is 2 if not set)
      Hints:
        romio_cb_[read|write]=automatic   enables collective buffering
        cb_nodes=<number of OSTs>         (set by lfs)
        cb_config_list=*:*                allows as many aggregators as needed on a node
        striping_factor=<number of Lustre stripes>
        striping_unit=<set equal to lfs setstripe>
      Limitation: many small files that don’t need aggregation.
      Need to set the striping pattern manually with lfs:
        Ideally 1 OST per aggregator;
        if not possible, N aggregators per OST (N a small integer).
      View statistics by setting MPICH_MPIIO_XSTATS=[0|1|2]
  • 73. Application I/O wallclock times (MPI_Wtime)
        zaiowr wallclock times (sec)   Collective Buffering   No Collective Buffering
        7 OSTs                         5.498                  23.167
        8 OSTs                         4.966                  24.989
        28 OSTs                        2.246                  17.466
        Serial I/O, 28 OSTs: 22.214
  • 74. MPI-IO Improvements: MPI-IO collective buffering
      MPICH_MPIIO_CB_ALIGN=0
        Divides the I/O workload equally among all aggregators.
        Inefficient if multiple aggregators reference the same physical I/O block.
        Default setting in MPT 3.2 and prior versions.
      MPICH_MPIIO_CB_ALIGN=1
        Divides the I/O workload among the aggregators based on physical I/O boundaries and
        the size of the I/O request.
        Allows only one aggregator access to any stripe on a single I/O call.
        Available in MPT 3.1.
      MPICH_MPIIO_CB_ALIGN=2
        Divides the I/O workload into Lustre stripe-sized groups and assigns them to
        aggregators.
        Persistent across multiple I/O calls, so each aggregator always accesses the same set
        of stripes and no other aggregator accesses those stripes.
        Minimizes Lustre file system lock contention.
        Default setting in MPT 3.3.
  • 75. IOR benchmark, 1,000,000-byte transfers
      MPI-IO API, non-power-of-2 blocks and transfers (in this case blocks and transfers both
      of 1M bytes) and a strided access pattern.
      Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments,
      96 GB file.
      [Chart: achieved bandwidth, axis 0–1800 MB/s.]
  • 76. HYCOM MPI-2 I/O
      On 5107 PEs, and by application design, a subset of the PEs (88) do the writes.
      With collective buffering, this is further reduced to 22 aggregators (cb_nodes) writing
      to 22 stripes.
      Tested on an XT5 with 5107 PEs, 8 cores/node.
      [Chart: achieved bandwidth, axis 0–4000 MB/s.]
  • 77. Subgrouping: By Hand
      Lose ease-of-use, gain control.
      Countless methods to implement:
        Simple gathering
        Serialized sends within a group
        Write token
        Double buffered
        Bucket brigade
        …
      Look for existing groups in your code; even the simplest solutions often seem to work.
      Try to keep the pipeline full: always be doing I/O.
      Now we can think about multiple shared files!
      (“I find your lack of faith in ROMIO disturbing.”)
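      A minimal sketch of the “simple gathering” variant, assuming one writer per group of GROUP_SIZE ranks; the group size, file naming, and data sizes are illustrative and error handling is omitted:

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define GROUP_SIZE 64      /* ranks per writer, an assumption for the example */
        #define NLOCAL     1024    /* doubles contributed by each rank */

        int main(int argc, char **argv)
        {
            int rank, grank, gsize;
            MPI_Comm group;
            double local[NLOCAL], *gathered = NULL;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            for (int i = 0; i < NLOCAL; i++) local[i] = (double)rank;

            /* Split COMM_WORLD into groups of GROUP_SIZE consecutive ranks. */
            MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP_SIZE, rank, &group);
            MPI_Comm_rank(group, &grank);
            MPI_Comm_size(group, &gsize);

            if (grank == 0)
                gathered = malloc((size_t)gsize * NLOCAL * sizeof(double));

            /* Gather the group's data onto its writer (group rank 0). */
            MPI_Gather(local, NLOCAL, MPI_DOUBLE,
                       gathered, NLOCAL, MPI_DOUBLE, 0, group);

            if (grank == 0) {
                /* One file per group: many-to-many, so a single stripe per file suffices. */
                char name[64];
                snprintf(name, sizeof(name), "out.%06d", rank / GROUP_SIZE);
                FILE *fp = fopen(name, "wb");
                fwrite(gathered, sizeof(double), (size_t)gsize * NLOCAL, fp);
                fclose(fp);
                free(gathered);
            }

            MPI_Comm_free(&group);
            MPI_Finalize();
            return 0;
        }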
  • 78. What about HDF5, NetCDF, etc.?
      Every code uses these very differently.
      Follow as many of the same rules as possible.
      It is very possible to get good results, but also possible to get bad ones.
      Because parallel HDF5 is written over MPI-IO, it’s possible to use hints.
      Check out ADIOS from ORNL & GT.
      [Chart: “IOR: Shared File Writes” – bandwidth versus client count (0–10,000) for POSIX,
       MPIIO, and HDF5, axis 0–8000.]
  • 79. SOME INTERESTING CASE STUDIES Provided by Lonnie Crosby (NICS/UT)