FOX: A Fault-Oblivious Extreme Scale Execution Environment

OVERVIEW

Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable, energy-constrained hardware environment with billions of threads of execution. Further, these systems must be designed with failure in mind.

FOX is a new system for the exascale which will support distributed data objects as first-class objects in the operating system itself. This memory-based data store will be named and accessed as part of the file system name space of the application. We can build many types of objects with this data store, including data-driven work queues, which will in turn support applications with inherent resilience.

Work queues are a familiar concept in many areas of computing; we will apply the work queue concept to applications. Shown below is a graphical data flow description of a quantum chemistry application, which can be executed using a work queue approach; a small sketch of such a data-driven queue follows the figure.

[Figure: data flow graph of the quantum chemistry application. Elementary operations (the G, D, F, J, S, R, Fi, and Si blocks) feed a coarse-grained operation queue, which is expanded into fine-grained queues, preload queues, and ready queues serviced by the compute resources.]
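As a rough sketch of the idea (illustrative only, not FOX code; the Task and WorkQueue types and the block names are hypothetical), a data-driven work queue holds each task until the data blocks it depends on have been published, and a task whose execution fails is simply put back on the queue, which is where the inherent resilience comes from:

// Minimal sketch of a data-driven work queue (not FOX code; all names are
// hypothetical). Tasks wait until their input blocks are available, and a
// task whose execution fails is re-enqueued instead of being lost.
#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>
#include <set>
#include <string>

struct Task {
    std::string name;                      // block produced by this task
    std::set<std::string> inputs;          // blocks the task needs before it can run
    std::function<bool()> run;             // returns false to simulate a failure
};

class WorkQueue {
    std::deque<Task> pending_;             // tasks not yet completed
    std::set<std::string> available_;      // blocks already in the data store
public:
    void publish(const std::string& block) { available_.insert(block); }
    void submit(const Task& t) { pending_.push_back(t); }

    // Run ready tasks; re-enqueue tasks that are not ready or that failed.
    void drain() {
        std::size_t attempts = 0;
        while (!pending_.empty() && attempts++ < 100) {
            Task t = pending_.front();
            pending_.pop_front();
            bool ready = true;
            for (const std::string& b : t.inputs)
                if (!available_.count(b)) { ready = false; break; }
            if (!ready || !t.run())
                pending_.push_back(t);     // inputs missing or worker failed: retry later
            else
                publish(t.name);           // the task's output becomes a new block
        }
    }
};

int main() {
    WorkQueue q;
    q.publish("D(0,0)");                   // an input block is already available
    q.submit({"F(0,0)", {"D(0,0)"},
              [] { std::cout << "F(0,0) computed\n"; return true; }});
    q.drain();
    return 0;
}

In FOX such a queue would be built on the distributed, memory-based data store described above rather than inside a single process.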




SIMULATION

SST/macro

The Structural Simulation Toolkit (SST) macroscale components provide the ability to explore the interaction of software and hardware for full scale machines using a coarse-grained simulation approach. The parallel machine is represented by models which are used to estimate the performance of processing and network components. An application can be represented by a "skeleton" code which replicates the control flow and message passing behavior of the real application without the cost of doing actual message passing or heavyweight computation.

A skeleton application models the control flow, communication pattern, and computation pattern of the application in a coarse-grained manner. This method provides a powerful approach to evaluate efficiency and scalability at extreme scales and to experiment with code reorganization or high-level refactoring without having to rewrite the numerical part of an application.

// Set up the instructions object to tell the processor model
// how many fused multiply-add instructions each compute call executes
boost::shared_ptr<sstmac::eventdata> instructions =
  sstmac::eventdata::construct();
instructions->set_event("FMA", blockrowsize*blockcolsize*blocklnksize);

// Iterate over the number of remote row and column blocks
for (int i = 0; i < nblock-1; i++) {
  std::vector<sstmac::mpiapi::mpirequest_t> reqs;

  // Begin non-blocking left shift of A blocks
  sstmac::mpiapi::mpirequest_t req;
  mpi()->isend(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myleft), sstmac::mpitag(0), world, req);
  reqs.push_back(req);
  mpi()->irecv(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myright), sstmac::mpitag(0), world, req);
  reqs.push_back(req);

  // Begin non-blocking down shift of B blocks
  mpi()->isend(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myup), sstmac::mpitag(0), world, req);
  reqs.push_back(req);
  mpi()->irecv(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(mydown), sstmac::mpitag(0), world, req);
  reqs.push_back(req);

  // Simulate computation with the current blocks
  compute_api()->compute(instructions);

  // Wait for the data needed for the next iteration
  std::vector<sstmac::mpiapi::const_mpistatus_t> statuses;
  mpi()->waitall(reqs, statuses);
}
// Simulate computation with the blocks received during the last loop iteration
compute_api()->compute(instructions);

A code fragment implementing a skeleton program for a dense matrix multiplication (restricted to square blocks).
OPERATING SYSTEM

[Figure: the FOX software stack. At the top, the user's desktop and the application workflow (App 1 ... App i) issue "% ls /data/dht_A"; the request passes through the Fault Oblivious Layers (Abstract File Interface, Programming Model Support, Task Management, Distributed Data Store, Systems Software Support) and the Fault Hiding Layers down to node 0 ... node n, each holding pieces such as dht_A.part1 ... partn, taskdata.part1 ... partn, matrix_C.part3, tasks 1 ... n, and task management data.]

The proposed framework provides a simple, uniform file system abstraction as the mechanism for parallel computation and distribution of data to computation blocks. Locating and accessing tasks from a process's work queue is accomplished through normal file system operations such as listing directory contents and opening files. Data distribution and access are performed by writing and reading files within directory hierarchies, as sketched below.

Shown above is the user view (at the top) of the file /data/dht_A and the way the view of one file maps down to resources provided by one node (at the bottom). The node contains a part of the file /data/dht_A, which is distributed across several nodes, with redundant block storage.
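As a hedged sketch of this access pattern from the application side (the /queue directories and the claim-by-rename convention are assumptions made for illustration; only the /data/dht_A path comes from the figure above), a worker can discover, claim, and service tasks using nothing but ordinary directory and file operations:

// Hedged sketch of the file-system view of a work queue (the /queue layout and
// claim-by-rename convention are illustrative assumptions; /data/dht_A is the
// distributed object from the figure above).
#include <dirent.h>                        // opendir/readdir: plain directory listing
#include <cstdio>                          // std::rename
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// List the entries of a work-queue directory such as /queue/ready.
std::vector<std::string> list_tasks(const std::string& dir) {
    std::vector<std::string> names;
    if (DIR* d = opendir(dir.c_str())) {
        while (dirent* e = readdir(d)) {
            std::string n = e->d_name;
            if (n != "." && n != "..") names.push_back(n);
        }
        closedir(d);
    }
    return names;
}

int main() {
    // Discover pending tasks by listing directory contents.
    for (const std::string& task : list_tasks("/queue/ready")) {
        // Claim a task by moving it into a running directory; only one
        // worker's rename() can succeed for a given file.
        std::string src = "/queue/ready/" + task;
        std::string dst = "/queue/running/" + task;
        if (std::rename(src.c_str(), dst.c_str()) != 0) continue;

        // Data access goes through the same namespace: read a block of the
        // distributed object /data/dht_A as an ordinary file.
        std::ifstream block("/data/dht_A", std::ios::binary);
        std::cout << "claimed " << task << (block ? " (data opened)" : "") << "\n";
    }
    return 0;
}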
                                                     Fi                                                                                                                                          R(0)                 R(0)               R(0)                                   R(0)                       R(0)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Number of processors
                                                                     Fi(d,s,i,j)

                                                                                                                                                                                                                   Fi(0,1,2,2)          Fi(0,2,2,0)            Fi(0,2,1,1)             Si(0,1,2,2)         Si(0,2,2,0)                      Fi(0,1,0,0)
                                                                                                                                                                                                                                                                                                                                                                      Using simulation to explore fault oblivious programming models                                                                                           the file /data/dht_A, which is distributed across several nodes, with redundant block storage.
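The redundant block storage mentioned above can be pictured with a small placement sketch: each block of the file gets a primary node and a replica on a different failure unit, so that the loss of one unit never loses both copies. The round-robin policy, the node and failure-unit sizes, and the names below are illustrative assumptions, not the FOX placement algorithm.

#include <cstdio>
#include <vector>

// Hypothetical placement of file blocks onto nodes with one redundant copy.
// Assumptions (for illustration only): blocks are placed round-robin, nodes
// are grouped into failure units of fixed size (e.g., nodes sharing a power
// supply), and the replica must land on a different failure unit. Assumes
// there is more than one failure unit.
struct Placement {
  int primary_node;
  int replica_node;
};

Placement place_block(int block_id, int num_nodes, int nodes_per_failure_unit) {
  Placement p;
  p.primary_node = block_id % num_nodes;
  int primary_unit = p.primary_node / nodes_per_failure_unit;
  // Start from the next node and skip anything in the same failure unit.
  int candidate = (p.primary_node + 1) % num_nodes;
  while (candidate / nodes_per_failure_unit == primary_unit) {
    candidate = (candidate + 1) % num_nodes;
  }
  p.replica_node = candidate;
  return p;
}

int main() {
  const int num_nodes = 8;              // assumed machine size
  const int nodes_per_failure_unit = 2; // e.g., two nodes per power supply
  for (int block = 0; block < 6; ++block) {
    Placement p = place_block(block, num_nodes, nodes_per_failure_unit);
    std::printf("block %d -> node %d (replica on node %d)\n",
                block, p.primary_node, p.replica_node);
  }
  return 0;
}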
The OS supports a fault oblivious abstraction to the application and user, with the fault handling and hiding infrastructure residing in the work and data distribution and systems software support layers.

The graph to the right shows the time to determine the location and distribution of parities for a given data distribution (and for different FT distribution schemes) as the process count increases. This time is shown to be small even for large processor counts, demonstrating the feasibility of this approach on future extreme scale systems. The same approach was used to compute the checksums and to recover lost blocks on failure; both actions are shown to be scalable (details in the CF'11 publication). [Figure: compute time (µs) versus number of processor cores, from 2^10 to 2^18, for the ALL-ROW, PARTIAL-ROW, ALL-SYM, and PARTIAL-SYM parity distribution schemes.]
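A minimal sketch of the checksum idea, assuming the simplest possible scheme: an element-wise sum checksum is kept for each row of data blocks, and a single lost block is rebuilt by subtracting the surviving blocks from the checksum. The scheme described above additionally supports generalized Cartesian distributions, arbitrary failure units, and colocation of parities with data blocks; the block sizes and names here are chosen only for illustration.

#include <cassert>
#include <cstdio>
#include <vector>

// Toy checksum-based recovery: keep an element-wise sum checksum for a row of
// data blocks, so any single lost block in that row can be rebuilt from the
// checksum and the surviving blocks.
using Block = std::vector<double>;

Block make_checksum(const std::vector<Block>& row) {
  Block sum(row.front().size(), 0.0);
  for (const Block& b : row)
    for (size_t i = 0; i < b.size(); ++i) sum[i] += b[i];
  return sum;
}

Block recover_block(const std::vector<Block>& row, size_t lost,
                    const Block& checksum) {
  Block rebuilt = checksum;  // start from the checksum ...
  for (size_t j = 0; j < row.size(); ++j)
    if (j != lost)           // ... and subtract every surviving block
      for (size_t i = 0; i < rebuilt.size(); ++i) rebuilt[i] -= row[j][i];
  return rebuilt;
}

int main() {
  // Three data blocks of four elements each (sizes chosen arbitrarily).
  std::vector<Block> row = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};
  Block checksum = make_checksum(row);

  // Pretend block 1 was lost and rebuild it; the surviving blocks are skipped
  // by index, which is equivalent to never reading the lost block.
  Block original = row[1];
  Block rebuilt = recover_block(row, 1, checksum);
  for (size_t i = 0; i < original.size(); ++i) assert(rebuilt[i] == original[i]);
  std::printf("lost block recovered from checksum\n");
  return 0;
}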
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   4

Emerging exascale applications will feature dynamic mapping of resources to computational nodes. The goal of the Scalable Elastic Systems Architecture (SESA) is to provide system support for these new environments.

SESA is a multi-layer organization that exploits prior results for SMP scalable software. It introduces an infrastructure for developing application-tuned libraries of elastic components that encapsulate and account for hardware change.

Our LibraryOS approach to SESA allows us to collapse deep software stacks into a single layer that is tightly coupled to the underlying hardware, increasing performance and efficiency and lowering the latency of key operations.

Using simulation to explore fault oblivious programming models

Major effort is required to rewrite an application using a new programming model, so it is highly desirable to understand both the ease of expressing algorithms in the new model and the expected performance advantages on future machines before doing so. Skeleton applications provide the basis for programming model exploration in SST/macro. The application control flow is represented by lightweight threads, and each of these threads represents one or more threads in the application, depending on the level of detail desired in the model. This approach allows programmers to specify control flow in a straightforward way.
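A minimal sketch of what such a skeleton looks like, written here against plain MPI rather than the SST/macro API: the communication pattern of a block-shifted matrix multiply is kept, while the numerical kernel is replaced by a stub that a simulator would map to a cost model (for example, a count of fused multiply-adds). The ring-shift pattern, block size, and operation count are placeholders chosen for illustration.

#include <mpi.h>
#include <vector>

// Skeleton of a block-shifted matrix multiply: the communication structure of
// the real application is kept, but the compute kernel is replaced by a stub
// that a coarse-grained simulator would turn into a modeled delay.
static void compute_stub(long fused_multiply_adds) {
  (void)fused_multiply_adds;  // in simulation this becomes a modeled delay
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int block_elems = 256 * 256;  // placeholder block size
  std::vector<double> a(block_elems), b(block_elems), recv_a(block_elems),
      recv_b(block_elems);
  const int left = (rank + size - 1) % size, right = (rank + 1) % size;

  for (int step = 0; step < size - 1; ++step) {
    MPI_Request reqs[4];
    // Non-blocking shift of the A block to the left neighbour ...
    MPI_Isend(a.data(), block_elems, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_a.data(), block_elems, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
    // ... and of the B block to the right neighbour (stand-in for a 2D shift).
    MPI_Isend(b.data(), block_elems, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Irecv(recv_b.data(), block_elems, MPI_DOUBLE, left, 1, MPI_COMM_WORLD, &reqs[3]);

    compute_stub(2L * block_elems * 256);  // overlap "compute" with the shifts
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    a.swap(recv_a);
    b.swap(recv_b);
  }
  compute_stub(2L * block_elems * 256);    // blocks received in the last step
  MPI_Finalize();
  return 0;
}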
We have simulated an application modified to use this model and, in simulation, parallel performance improves. Shown below is the comparative performance, with the traditional SPMD-style program on the top and the data-driven program on the bottom; white space indicates idle computing nodes. The second cycle begins at t=16.7 in the data-driven program, whereas it begins later, at t=16.9, in the SPMD program. In the data-driven program, utilization is higher, cycle times for an iteration are shorter, and work on the next cycle can begin well before the current cycle completes.
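The data-driven behaviour described above can be sketched as a dependency-driven task queue: a task becomes runnable the moment its last input is available, rather than at a global barrier, which is what lets work on the next cycle start before the current one finishes. The class and task names below are invented for the example and are not the FOX task-management interface.

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Minimal dependency-driven task queue: each task records how many inputs it
// is still waiting for, and becomes ready the moment the last one arrives.
struct Task {
  std::function<void()> run;
  int unmet_deps = 0;
  std::vector<int> dependents;  // tasks to notify when this one finishes
};

class TaskGraph {
 public:
  int add(std::function<void()> fn) {
    tasks_.push_back({std::move(fn), 0, {}});
    return static_cast<int>(tasks_.size()) - 1;
  }
  void add_dependency(int before, int after) {
    tasks_[before].dependents.push_back(after);
    tasks_[after].unmet_deps++;
  }
  void execute() {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(tasks_.size()); ++i)
      if (tasks_[i].unmet_deps == 0) ready.push(i);
    while (!ready.empty()) {
      int id = ready.front();
      ready.pop();
      tasks_[id].run();
      // Releasing a result may make downstream tasks runnable immediately.
      for (int next : tasks_[id].dependents)
        if (--tasks_[next].unmet_deps == 0) ready.push(next);
    }
  }
 private:
  std::vector<Task> tasks_;
};

int main() {
  TaskGraph g;
  int produce = g.add([] { std::printf("produce block\n"); });
  int consume = g.add([] { std::printf("consume block\n"); });
  int reduce  = g.add([] { std::printf("assemble result\n"); });
  g.add_dependency(produce, consume);
  g.add_dependency(consume, reduce);
  g.execute();  // runs tasks as their inputs become available
  return 0;
}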
[Figure: weak-scaling parallel efficiency versus number of cores (10^2 to 10^6) for crossbar (DS), fat-tree (DS), torus, torus (DS), degraded fat-tree (DS), and degraded torus configurations.]
Weak scaling parallel efficiency of the systolic matrix-matrix multiply algorithm under a variety of conditions: 1) full crossbar network (DS), 2) fat-tree formed from radix 36 switches (DS), 3) 2D torus network, 4) 2D torus network (DS), 5) fat-tree formed from radix 36 switches with a single degraded node (DS), 6) 2D torus with a single degraded node. DS designates use of the direct send algorithm. In the "degraded node" runs, the computational performance of one node was reduced to half that of the other nodes in the system; this type of performance reduction is found in real systems due to thermal ramp-down or resource contention.

[Figure: parallel efficiency versus number of nodes (1 to 4096) for theoretically optimal scaling, the expected efficiency without load balancing, simple dynamic load balancing, and the systolic algorithm.]
A comparison of the effect of a single degraded node on the parallel efficiency of an actor-based algorithm with simple dynamic load balancing and the traditional systolic algorithm. A fat-tree formed from radix 36 switches was simulated. In both the systolic and direct send algorithms, the full matrix is evenly decomposed onto all nodes in the parallel system. The systolic algorithm diffuses data between nodes as the calculation progresses, while the direct send (DS) algorithm fetches data directly from the neighbor node that owns the given submatrix. The actor model implementation uses the …




In addition to performance, we are studying the energy consumption of fault tolerance techniques. The graph to the left is the power signature of the HF implementation with 4Be on a rack of 224 cores on NW-ICE (with liquid cooling, i.e., local cooling and PSU included). We use time dilation to show the 11 power spikes representing the 11 iterations until convergence. The initial spike is attributed to initialization and the generation of the Schwarz matrix.

PUBLICATIONS
• Dan Schatzberg, Jonathan Appavoo, Orran Krieger, and Eric Van Hensbergen, "Scalable Elastic Systems Architecture", to appear at the RESOLVE Workshop, co-located with ASPLOS 2011, March 5, 2011.
• Dan Schatzberg, Jonathan Appavoo, Orran Krieger, and Eric Van Hensbergen, "Why Elasticity Matters", under submission to HotOS 2011.

