The University of Hertfordshire

          The Challenges facing Libraries and
              Imperative Languages from
            Massively Parallel Architectures



Jason McGuiness               Building Futures in Computer Science   Colin Egan
Computer Science              empowering people through technology   Computer Science
University of Hertfordshire                                          University of Hertfordshire
UK                                                                   UK
Presentation Structure
• Parallel processing
   – Pipeline processors, MII architectures, Multiprocessors
• Processing In Memory (PIM)
   – Cellular Architectures: Cyclops/DIMES and picoChip
• Code-generation issues arising from massive parallelism
• Possible solutions to these issues:
   – Use the compiler or some libraries
• An example implementation of a library, and the issues
• Questions?
   – Ask as we go along, but we’ll also leave time for questions at
     the end of this presentation
Parallel Processing

• How can parallel processing be achieved?
   – By exploiting:
      •   Instruction Level Parallelism (ILP)
      •   Thread Level Parallelism (TLP)
      •   Multi-processing
      •   Data Level Parallelism (DLP)
      •   Simultaneous Multi-Processing (SMP)
      •   Concurrent processing
      •   etcetera
Pipelining

• Exploits ILP by overlapping instructions in different
  stages of execution:
   – ILP is the number of operations in a computer program that
     can be performed at the same time (simultaneously)

• Improves overall program execution time by
  increasing throughput:
   – It does not improve individual instruction execution time
Pipelining

• A simple 5-stage pipeline

[Diagram: a simple 5-stage pipeline]
  1. Instruction fetch: instruction cache access; increment program counter; branch prediction
  2. Instruction decode: register operand set-up
  3. Execution in ALU
  4. Data cache access
  5. Write back
Pipelining hazards

• Pipelining introduces hazards which can severely
  impact on processor performance:
   – Data (RAW, WAW and WAR)
   – Control (conditional branch instructions)
   – Structural (hardware contention)

• To overcome such hazards complex hardware
  (dynamic scheduling) or complex software (static
  scheduling) or a combination of both is required
Pipelining hazards
             Data, Control, Structural
• Data hazards
  – Read After Write (RAW dependency)
     • The result of an operation is required as an operand of a
       subsequent operation and that result is not yet available in the
       register file
     add $1,$2,$3 //$1=$2+$3
     addi $4,$1,8 //$4=$1+8 data dependency ($1 is unavailable)
     sub $5,$2,$1 //$5=$2-$1 data dependency ($1 is unavailable)
     ld $6,8($1) //$6=Mem[$1+8]... $1 may be available
     add $7,$6,$8 //load-use dependency...data dependency ($6 is
               unavailable)
Pipelining hazards
           Data, Control, Structural
• Data hazards
  – Write After Write (WAW data dependency)
     sub $1,$2,$3
     addi $1,$4,8   //the addi cannot complete ahead of the sub


  – Write After Read (WAR data dependency)
     sub $2,$1,$3
     addi $1,$4,8   //the addi cannot complete ahead of the sub
Pipelining hazards
            Data, Control, Structural
• Data hazards
  – Read After Write (RAW data dependency) is a
    problem in single instruction issue architectures
    and Multiple Instruction Issue (MII) architectures

   – Write After Write (WAW data dependency) and
     Write After Read (WAR data dependency) are a
     problem in MII architectures, not in single instruction
     issue architectures
Pipelining hazards
              Data, Control, Structural
• Data hazards
  – Read After Write (RAW) data dependencies are True Data
    Dependencies
     • RAW are avoided:
         – in single instruction issue architectures by forwarding (bypassing),
         – in MII:
              » Superscalar (dynamic) by result broadcast
              » VLIW (static) by aggressive instruction (percolation) scheduling


  – Write After Write (WAW data dependency) and Write
    After Read (WAR data dependency) are Artificial Data
    Dependencies
     • WAW/WAR are eliminated by Register Renaming
Pipelining hazards
             Data, Control, Structural
• Control hazards
  – Are caused by (conditional) branch instructions
  – Branches are the only instructions which can alter the smooth
    flow of instruction execution
  – The next instruction to be executed is not known until the
    condition evaluation has been determined and in the case
    that the condition is to be taken then the address of the next
    instruction to be executed needs to be computed:
     • branch resolution
  beq $1,$2,label //if $1==$2 then pc=label; else pc=pc+4;
  bne $1,$2,label //if $1!=$2 then pc=label; else pc=pc+4;
Pipelining hazards
                Data, Control, Structural
• Control hazards
  – Are a major problem:
     • Between 1 in 5 and 1 in 8 instructions from the dynamic stream of a
       general purpose program are conditional branch instructions
     • Approximately 66% of conditional branches are taken (target path)
     • Are a particular problem in Superscalar Architectures where 100s of
       instructions may be in flight

  – The impact can be reduced by:
     •   Branch prediction (dynamic and static)
     •   Branch removal (scheduling - static)
     •   Delay region scheduling (static)
     •   Multithreading (dynamic and static)
Pipelining hazards
             Data, Control, Structural
• Structural hazards
   – More than one instruction attempting to access the
     same piece of hardware at the same time

   – These can be overcome by increasing the amount
     of hardware:
      • For example, use the Harvard Architecture (separate the
        I-cache from the D-cache)
Multiple Instruction Issue

• MII
  – A processor that is capable of fetching and issuing
    more than one instruction during each processor cycle
  – A program is executed in parallel, but the processor
    maintains the outward appearance of sequential
    execution
  – The program binary must therefore be regarded as a
    specification of what was done, not how it was done
   – Minimise program execution time by:
         • reducing instruction latencies
         • exploiting additional ILP
Multiple Instruction Issue

• Superscalar Processor:
   – An MII processor where the number of instructions issued
     in each clock cycle is determined dynamically by hardware
     at run-time
      • Instruction issue may be in-order or out-of-order (Tomasulo or
        equivalent).

• VLIW Processor (Very Long Instruction Word):
   – An MII processor where the number of operations
     (instructions) issued in each clock cycle is fixed and where
     the operations are selected statically by the compiler at
     compile time
Superscalars

• Fetch and Decode multiple instructions in parallel

• Dynamically predict conditional branches

• Initiate instructions for parallel execution based,
  whenever possible, on the availability of operands

• Frequently issue/execute instructions out-of-order

• Re-sequence instructions into the original program order
  after completion
Superscalars

1.       Instruction fetch strategies
           – Multiple instruction fetch
           – Dynamic branch prediction
2.       Detecting True Data Dependencies between registers
3.       Initiating or issuing multiple instructions in parallel
4.       Resources for parallel execution of instructions:
           – Multiple pipelined functional units
           – Efficient memory hierarchies
5.       Communication of data values through memory:
           – Loads and stores (RISC)
           – Allowing for dynamic, unpredictable behaviour of memory hierarchies
6.       Committing the process state in the correct order:
           – Maintaining the appearance of sequential instruction execution
Superscalars

•   Since there is very little ILP within individual basic
    blocks (i.e. between branches), branch prediction is
    essential for high performance

•   It is important to maintain an instruction window or
    window of execution from which ILP can be
    extracted
Superscalars

•   The problems include:
    –   Conditional branches
    –   Data path complexity
    –   The memory wall (widening gap between CPU and
        memory access times)
    –   Limited available parallelism within a finite sized
        instruction window

•   Is the superscalar approach at a point of diminishing
    returns?
VLIWs
• Attempts to reduce hardware complexity by extracting
  ILP at compile time instead of run time
• The processor issues a single very long instruction during
  each machine cycle
• Each very long instruction consists of a large number of
  independent operations packed into a single instruction
  word by the compiler
• Each operation requires a small, statically predictable
  number of cycles to execute
• Each operation is itself pipelined
• Execute instructions in-order
VLIWs

• Problems:
  – Code expansion:
     • The need to pad out instruction groups with NOPs can
       double the code size


  – Compatibility:
     • Code must be recompiled (scheduled) for every new instance
       of a particular VLIW architecture
Problems of MII

• To sustain multiple instruction fetch, MII
  architectures require a complex memory hierarchy:
   – Caches
       • L1, L2, stream buffers, non-blocking caches
   – Virtual Memory
      • TLB

   – Caches suffer from:
      • Compulsory misses
      • Capacity misses
      • Collisions
Memory Problems

• Compulsory misses:
  – The first time an address is requested by the processor it will
    not be in cache memory and must be fetched from a slower level
    of the memory hierarchy:
     • Hopefully main (physical) memory
     • If not from Virtual Memory
     • If not from secondary storage
  – This can result in long delay (latency) due to large access
    time(s)
Memory Problems

• Capacity misses:
   – There are more cache block requests than the size of the
     cache
• Collisions:
   – The processor makes a request to the same block but for
     different instructions/data
• For both:
   – Blocks therefore have to be replaced
   – But a block that has been replaced might be referenced
     again resulting in yet more replacements
Thread Level Parallelism

• A thread can be considered to be a ‘lightweight
  process’
   – Where a thread consists of a short sequence of code, with
     its own:
      • registers, data, state and so on
      • but shares process space


• TLP is exploited by simultaneously executing
  different threads on different processors:
   – TLP is therefore exploited by multiprocessors
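
• As a minimal illustration of the ‘lightweight process’ idea (a sketch using
  Posix threads, which appear later in this talk; the names worker, id1 and
  id2 are illustrative only): each thread has its own stack and registers,
  but both see the same process space

  #include <pthread.h>
  #include <cstdio>

  int shared_counter = 42;                  // lives in the shared process space

  void *worker(void *arg) {
      int id = *static_cast<int *>(arg);    // each thread has its own stack and registers
      std::printf("thread %d sees shared_counter = %d\n", id, shared_counter);
      return 0;
  }

  int main() {
      pthread_t t1, t2;                     // two threads within one process
      int id1 = 1, id2 = 2;
      pthread_create(&t1, 0, worker, &id1);
      pthread_create(&t2, 0, worker, &id2);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
  }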
Multiprocessors

• Should be:
  – Easily scalable
  – Fault tolerant
  – Achieve higher performance than a uni-processor

• But …
  – How many processors can we connect?
  – How do parallel processors share data?
  – How are parallel processors co-ordinated?
Multiprocessors

• Shared Memory Processors (SMP)
  – All processors share a single global memory address space

  – Communication is through shared variables in memory

  – Synchronisation is via locks (hardware) or semaphores
    (software)
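
• A minimal sketch of this style of communication and synchronisation,
  assuming Posix threads (the mutex stands in for the lock on the slide;
  the names are illustrative):

  #include <pthread.h>

  int shared_total = 0;                               // communication is through this shared variable
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   // synchronisation primitive

  void *add_ten(void *) {
      for (int i = 0; i < 10; ++i) {
          pthread_mutex_lock(&lock);                  // only one thread updates the variable at a time
          ++shared_total;
          pthread_mutex_unlock(&lock);
      }
      return 0;
  }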
Multiprocessors

• Uniform Memory Access (UMA)
  – All memory accesses take the same time
  – Do not scale well

• Non-uniform Memory Access (NUMA)
  – Each processor has a private (local) memory
  – Global memory access time can vary from processor to
    processor
  – Present more programming challenges
  – Are easier to scale
Multiprocessors

• NUMA
 – Communication and synchronization are achieved
   through message passing:
    • Processors could then, for example, communicate over
      an interconnection network
    • Processors use send and receive primitives
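
• A sketch of the send/receive style using MPI (MPI is not named on the
  slide; it is used here purely as a familiar example of message-passing
  primitives):

  #include <mpi.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, value = 42;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);                     // which processor am I?
      if (rank == 0)
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // processor 0 sends over the interconnect
      else if (rank == 1)
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                          // processor 1 receives
      MPI_Finalize();
  }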
Multiprocessors

• The difficulty is in writing effective parallel programs:
   –   Parallel programs are inherently harder to develop
   –   Programmers need to understand the underlying hardware
   –   Programs tend not to be portable
   –   Amdahl’s law: a very small part of a program that is inherently
        sequential can severely limit the attainable speedup (a worked
        example follows below)
• “It remains to be seen how many important applications
  will run on multiprocessors via parallel processing.”
• “The difficulty has been that too few important
  application programs have been written to complete tasks
  sooner on multiprocessors.”
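
• A worked illustration of Amdahl’s law (the 5% figure is hypothetical):
  if a fraction s of a program is inherently sequential, the speedup on
  N processors is at most 1 / (s + (1 - s) / N)

  #include <cstdio>

  int main() {
      double s = 0.05;        // hypothetical: 5% of the program is sequential
      double n = 1000000.0;   // one million processors, as in the Cellular architectures later
      std::printf("maximum speedup = %g\n", 1.0 / (s + (1.0 - s) / n));
      // prints roughly 20: however many processors are added,
      // the speedup cannot exceed 1/s = 20
  }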
Multiprocessors

• Multiprocessors suffer the same memory problems as
  uni-processors and in addition:
   – The problem of maintaining memory coherence
      between the processors
Cache Coherence

• A read operation must return the value of the
  latest write operation

• But in multiprocessors each processor will
  (probably) have its own private cache memory
  – There is no guarantee of data consistency between
    private (local) cache memory and shared (global)
    memory
The memory wall

• Today’s processors are 10 times faster than processors of only
  5 years ago

• But today’s processors do not complete tasks in one tenth of the
  time taken by processors from 5 years ago

• CPU speeds double approximately every eighteen months
  (Moore's Law), while main memory speeds double only about
  every ten years
   – This causes a bottle-neck in the memory subsystem, which impacts on
     overall system performance
   – The “memory wall” or the “Von Neumann bottleneck”
The memory wall

• In today’s computers RAM roughly holds 1% of the
  data stored on a hard disk drive

• And hard disk drives supply data nearly a thousand
  times slower than a processor can use it

• This means that for many data requests, a processor is
  idle while it waits for data to be found and
  transferred
The memory wall

• IBM’s vice president Mark Dean has said:
  – "What's needed today is not so much the ability to
    process each piece of data a great deal, it's the
    ability to swiftly sort through a huge amount of
    data."
The memory wall

• Mark Dean again:
  – "The key idea is that instead of focusing on
    processor micro-architecture and structure, as in
    the past, we optimize the memory system's latency
    and throughput — how fast we can access, search
    and move data …"

• Leads to the concept of Processing In Memory
  (PIM)
Processing in memory

• The idea of PIM is to overcome the bottleneck
  between the processor and main memory by
  combining a processor and memory on a single chip
• The benefits of a PIM architecture are:
   –   Reduces memory latency
   –   Increases memory bandwidth
   –   Simplifies the memory hierarchy
   –   Provides multi-processor scaling capabilities:
        • Cellular architectures
   –   Avoids the Von Neumann bottleneck
Processing in memory

• This means that:
  – Much of the expensive memory hierarchy can be
    dispensed with
  – CPU cores can be replaced with simpler designs
  – Less power is used by PIM
  – Less silicon space is used by PIM
Processing in memory

• But …
  – Processor speed is reduced
  – The amount of available memory is reduced

• However, PIM is easily scaled:
  – Multiple PIM chips connected together forming a
    network of PIM cells

  – Such scaled architectures are called Cellular
    architectures
Cellular architectures

• Cellular architectures consist of a large number of
  cells (PIM units):
   – With tens of thousands up to one million processors

   – Each cell (PIM) is small enough to achieve extremely
     large-scale parallel operations

   – To minimise communication time between cells, each cell
     is only connected to its neighbours
Cellular architectures

• Cellular architectures are fault tolerant:
   – With so many cells, it is inevitable that some processors
     will fail
   – Cellular architectures simply re-route instructions and data
     around failed cells

• Cellular architectures are ranked highly as today’s
  Supercomputers:
   – IBM BlueGene takes the top slots in the Top 500 list
Cellular architectures

• Cellular architectures are threaded:
   – Each thread unit:
      • Is independent of all other thread units
      • Serves as a single in-order issue processor
      • Shares computationally expensive hardware such as floating-point
        units


   – There can be a large number of thread units:
      • 1,000s, if not 100,000s
      • Therefore they are massively parallel architectures
Cellular architectures

• Cellular architectures are NUMA
  – Have irregular memory access:
     • Some memory is very close to the thread units and is
       extremely fast
     • Some is off-chip and slow

• Cellular architectures, therefore, use caches
  and have a memory hierarchy
Cellular architectures

• In Cellular architectures multiple thread units
  perform memory accesses independently

• This means that the memory subsystem of
  Cellular architectures does in fact require some
  form of memory access model that permits
  memory accesses to be effectively served
Cellular architectures

• Uses of Cellular architectures:
  – Games machines (simple Cellular architecture)
  – Bioinformatics (protein folding)
  – Imaging
     • Satellite
     • Medical
     • Etcetera
  – Research
  – Etcetera
Cellular architectures

• Examples:
  – BlueGene Project:
     • Cyclops (IBM) – next generation from BlueGene/P,
       called BlueGene/C
        – DIMES – a prototype implementation
  – Gilgamesh (NASA)
  – Shamrock (Notre Dame)
  – picoChip (Bath, UK)
IBM Cyclops or BlueGene/C

• Developed by IBM at the Thomas J. Watson Research
  Center

• Also called BlueGene/C in comparison with the
  earlier versions, BlueGene/L and BlueGene/P
Cyclops

• The idea of Cyclops is to provide around one million
  processors:

   – Where each processor can perform a billion operations per
     second

   – Which means that Cyclops will be capable of one petaflops
     (a thousand trillion calculations per second)
Cyclops
[Diagram: a Cyclops board. Each chip holds many processors; each processor comprises
thread units (TU), each with its own scratch-pad memory (SP), sharing a floating-point
unit (FPU) and an SR unit. The processors reach the on-chip memory banks over a
crossbar network with 4 GB/sec links; each chip also connects to off-chip memory
(4 GB/sec), to neighbouring chips over a 3D mesh (6 x 4 GB/sec in, 6 x 4 GB/sec out),
to an IDE HDD (50 MB/sec) and to a 1 Gbit/sec external link.]
DIMES

• DIMES:
  – Is the first hardware implementation of a Cellular
    architecture
  – Is a simplified ‘cut-down’ version of Cyclops
  – Is a hardware validation tool for Cellular architectures
  – Emulates Cellular architectures, in particular Cyclops,
    cycle-by-cycle
  – Is implemented on at least one FPGA
  – Has been evaluated by Jason
DIMES
• The DIMES implementation that Jason evaluated:
   – Supports a P-thread programming model
   – Is a dual processor where each processor has four thread
     units
   – Has 4K of scratch-pad (local) memory per thread unit
   – Has two banks of 64K global shared memory
   – Has different memory models:
       • Scratch pad memory obeys the program consistency model for all
         of the eight thread units
       • Global memory obeys the sequential consistency model for all of
         the eight thread units
   – Is called DIMES/P2
DIMES/P2
DIMES

• Jason’s concerns were:
  – How to manage a potentially large number of
    threads

  – How to exploit parallelism from the input source
    code in these threads

  – How to manage memory consistency
DIMES

• Jason tested his concerns by using an “embarrassingly
  parallel program which generated Mandelbrot sets”

• Jason’s approach was to distribute the work-load between
  threads and he also implemented a work-stealing
  algorithm to balance loads between threads:
   – When a thread completed its ‘work-load’, rather than
     remain idle that thread would ‘steal-work’ from another
     ‘busy’ thread
   – This meant that he maximised parallelism and improved
     thread performance and hence overall program execution
     time
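
• A minimal sketch of the work-stealing idea just described (illustrative
  C++ using Posix mutexes, not the actual DIMES code; worker, get_work and
  the use of integers as work items are all invented for the example):

  #include <deque>
  #include <pthread.h>

  struct worker {
      std::deque<int> queue;     // this thread’s units of work
      pthread_mutex_t lock;      // initialise with pthread_mutex_init before use
  };

  bool get_work(worker &self, worker &victim, int &item) {
      pthread_mutex_lock(&self.lock);            // prefer our own work-load
      if (!self.queue.empty()) {
          item = self.queue.front();
          self.queue.pop_front();
          pthread_mutex_unlock(&self.lock);
          return true;
      }
      pthread_mutex_unlock(&self.lock);
      pthread_mutex_lock(&victim.lock);          // otherwise steal from a busy thread
      if (!victim.queue.empty()) {
          item = victim.queue.back();
          victim.queue.pop_back();
          pthread_mutex_unlock(&victim.lock);
          return true;
      }
      pthread_mutex_unlock(&victim.lock);
      return false;                              // no work anywhere: genuinely idle
  }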
DIMES
[Four screenshots of the Mandelbrot set being generated: shortly after program start;
shortly before work stealing; just after work stealing; more work stealing]
picoChip
• picoChip are based in Bath
  – “… is dedicated to providing fast, flexible wireless
    solutions for next generation telecommunications
    systems.”
• picoArrayTM
  – Is a tiled architecture
  – 308 heterogeneous processors connected together
  – The interconnects consist of bus switches joined
    by picoBusTM
  – Each processor is connected to the picoBusTM
    above and below it
picoChip
• picoArrayTM
  – [Diagram of the picoArrayTM tiled architecture]
picoChip and Parallelism
• The parallelism provided by picoChip is
  synchronous
  – This avoids many of the issues raised by the other
    architectures that expose asynchronous parallelism

  – But it is at the cost of the flexibility that
    asynchronous parallelism provides
Issues arising ...
• Such massively parallel hardware creates a
  serious problem for the programmer:
  – Because it is computer science folklore that writing
    parallel programs is very hard

• Writing fast and correct massively parallel
  programs is a very major concern
Abstraction of the Parallelism
• This may be done in various ways:
  – For example within the compiler:
    • Using trace scheduling, list-based scheduling or other
      data-flow based means amongst others
  – Using language features:
    • HPF, UPC or the additions to C++ in the IBM Visual
      Age compiler and Microsoft's additions
  – Most commonly using libraries:
    • For example: Posix threads, Win32 threads, OpenMP,
      boost.thread, home-grown wrappers
Parallelism using Libraries

• Using libraries has a major advantage over
  implementing parallelism within the language:
  – It does not require the design of a new language,
    nor learning a new language
  – Novel languages are traditionally seen as a burden
    and often hinder adoption of new systems due to
    the volume of existing source-code
  – But libraries are especially prone to mis-use and
    are traditionally hard to use
Issues of Libraries: part I
• The model exposed is very diverse:
  – Message passing (e.g. MPI)
  – Loop-level parallelism (e.g. OpenMP’s “forall ...”
    constructs), which exposes very limited parallelism in
    general-purpose code
  – Is very low-level, e.g. the primitives exposed in
    Posix Threads are extremely basic: mutexes,
    condition variables, and basic thread operations
• The libraries require experience to use well, or
  even correctly
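
• To illustrate how low-level those primitives are: even a simple hand-over
  of one value between two threads needs a mutex, a condition variable and
  a flag (a sketch, not taken from any particular library):

  #include <pthread.h>

  pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
  bool ready  = false;
  int payload = 0;

  void produce(int value) {
      pthread_mutex_lock(&m);
      payload = value;
      ready = true;
      pthread_cond_signal(&cv);       // wake the consumer
      pthread_mutex_unlock(&m);
  }

  int consume() {
      pthread_mutex_lock(&m);
      while (!ready)                  // the loop guards against spurious wake-ups
          pthread_cond_wait(&cv, &m);
      int value = payload;
      pthread_mutex_unlock(&m);
      return value;
  }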
Part II: Non-composability of atomic
             operations!

 • The fact that atomic operations do not compose is a
   major concern when using libraries!

 • The composition of thread-safe data structures does
   not guarantee thread-safety:
    – At each combination of the data structures, more locks are
      required to ensure correct operation!
    – This implies more and more layers of locks, that are
      slow...
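
• A sketch of the problem: each container below is individually thread-safe,
  yet moving an item from one to the other is not atomic, so a further,
  outer lock is layered on top (the ts_queue class is hypothetical):

  #include <deque>
  #include <pthread.h>

  class ts_queue {                     // every member takes its own internal lock
      std::deque<int> q;
      pthread_mutex_t m;
  public:
      ts_queue() { pthread_mutex_init(&m, 0); }
      void push(int v) { pthread_mutex_lock(&m); q.push_back(v); pthread_mutex_unlock(&m); }
      bool pop(int &v) {
          pthread_mutex_lock(&m);
          bool ok = !q.empty();
          if (ok) { v = q.front(); q.pop_front(); }
          pthread_mutex_unlock(&m);
          return ok;
      }
  };

  pthread_mutex_t transfer_lock = PTHREAD_MUTEX_INITIALIZER;

  void move_one(ts_queue &from, ts_queue &to) {
      // pop() and push() are each atomic, but their composition is not:
      // without the outer lock another thread could observe the item in
      // neither queue, so yet another (slow) layer of locking is needed
      pthread_mutex_lock(&transfer_lock);
      int v;
      if (from.pop(v))
          to.push(v);
      pthread_mutex_unlock(&transfer_lock);
  }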
Parallelism in the Compiler

• Given the concerns with regards to libraries,
  what about parallelising compilers?
• The fact is that auto parallelising compilers
  exist, e.g. the list-based scheduling implemented
  in the EARTH-C compiler (circa 1999) has been
  proven to be optimal for that architecture
• Data-flow compilers have existed for years
   – Why aren't they used?
Industrial Parallelising Compilers
• Microsoft is introducing OpenMP-based
  constructs into their compiler, e.g. “forall”
• IBM Visual Age has similar functionality
• Java has a thread library
• C++0x: much work has been done regarding
  standardisation with respect to multiple threads
C++ as an example
• The use of C++ makes it an interesting target
  for parallelisation:
  – Although imperative, so arguably flawed for
    implementing parallelism
  – It has great market penetration, therefore there is
    much demand for using parallelism
  – Commonly used in high-performance, multi-
    threaded systems
  – General-purpose nature and quality libraries are
    increasing the appeal to super-computers
Parallelism support in C++
• Libraries exist beyond the usual C libraries:
  – boost.thread – exists now, requires standards-
    compliant compilers
  – C++0x: details of the threading support are
    becoming apparent and appear to include:
     • Atomic operations (memory consistency), exceptions,
       machine-model underpins approach
     • Threading models: thread-as-a-class, lock objects
     • Thread pools – coming later – probably
     • More details on the web, or at the ACCU - in flux
Experience using C++
• Recall DIMES:
  – Prototype of massively parallel hardware
  – Posix-threads style library implementing threads
  – C++ thread-as-a-class wrapper implemented


• Summary of experience:
  – Hardly object-orientated: no separation in the design between
    the Mandelbrot application, the work-stealing algorithm
    and the thread pool
  – The software insufficiently separated the details of the
    hardware features from the design
Further experiences using C++
• From this work and other experiences, I
  developed a more interesting thread library:
  – Traits abstract underlying OS thread API from
    wrapper library
  – Therefore has hardware abstractions too
  – Provision of higher-level threading models:
     • Primarily based on futures and thread pools
  – Use of thread pools and futures creates a singly
    rooted-tree of threads:
     • Trivially deadlock free – a holy grail!
C++0x threads now? No!
• Included in libjmmcg:
   – Relies upon non-standard behaviour and broken optimisers! For
     example:
       • Problems with global code-motion moving apparently const-objects past
         locks
       • Exception stack is unique to a thread, not global to program and currently
         unspecified
       • Implementation of std::list
   – DSEL has syntax limitations due to design of C++
   – Doesn't use boost ...
   – Has example code and test cases
   – Isn't complete! (e.g. Posix & sequential specialisations
     incomplete, some inefficiencies)
   – But get it from libjmmcg.sourceforge.net, under LGPL
Trivial example usage
struct res_t { int i; };                // the result produced by the work
struct work_type {
    typedef res_t result_type;          // required: the type the work is mutated to
    void process(result_type &) {}      // required: performs the mutation
};
pool_type pool(2);                      // a thread pool of two threads
async_work work(creator_t::it(work_type(1)));                        // wrap the work to be mutated
execution_context context(pool<<joinable()<<time_critical()<<work);  // transfer the work into the pool; a future is returned
pool.erase(context);
context->i;                             // communicate the result back via the future

• The devil is in the omitted details: the typedefs for:
– pool_type, async_work, execution_context, joinable, time_critical

• The library requires that the work to be mutated has the
  members result_type and process defined
Explanation of the example
• The concept is:
   – that asynchronous work (async_work) that should be
     mutated (process) to the specified output type (result_type)
     is transferred into a thread pool (pool_type) of some kind
• This transfer (<<) may, optionally (joinable), return a
  future (execution_context)
   – Which can be used to communicate (->) the result of the
     mutation, executed at kernel priority (time_critical), back
     to the caller
• The future also allows exceptions to be propagated
More details regarding the example
• The thread pool (pool_type) has many traits:
   –   Master-slave or work-stealing
   –   The thread API (Win32, Posix or sequential)
   –   The thread API costs, in very rough terms
   –   Implies a work schedule that is a FIFO baker’s-ticket schedule;
       an implementation of GSS(k) is in progress
• The library effectively implements a software
  simulation of data-flow
• Wrapping a function call, plus parameters, in a class
  converts Kevlin's threading model to this one
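
• A hypothetical sketch of that wrapping: an ordinary function call plus its
  parameter are packaged into a class providing the result_type and process()
  members the library expects (compute_square and square_work are invented):

  int compute_square(int x) { return x * x; }    // the Kevlin-style “run this function on a thread”

  struct squared_result { int value; };

  struct square_work {
      typedef squared_result result_type;         // required by the library
      int argument;                               // the wrapped parameter
      explicit square_work(int a) : argument(a) {}
      void process(result_type &out) {            // required by the library
          out.value = compute_square(argument);   // the wrapped call
      }
  };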
Time for controversy....
        What faces programmers...
• Large-scale parallelism is here, now:
   – Blade frames at work:
      • 4 cores x 4 CPUs x 20 frames per rack = 320 thread units, in a
        NUMA architecture
• The hardware is expensive!
• But so is the software ...
   – It must be programmed, economically
   – The programs must be maintained ...
• Or it will be an expensive failure?
Talk summary
• In this talk we have looked at parallel processing and
  justified the reasons for Processing In Memory (PIM) and
  therefore cellular architectures
• We have briefly looked at two example architectures:
   – Cyclops and picoChip
• Jason has worked on DIMES, the first implementation of a
  (cut-down) version of a cellular architecture
• The issues of programming for these massively parallel
  architectures have been described
• We focussed on the future of threading in C++
The University of Hertfordshire

          The Challenges facing Libraries and
              Imperative Languages from
            Massively Parallel Architectures

                                Questions?
Jason McGuiness               Building Futures in Computer Science   Colin Egan
Computer Science              empowering people through technology   Computer Science
University of Hertfordshire                                          University of Hertfordshire
UK                                                                   UK

Weitere ähnliche Inhalte

Was ist angesagt?

Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data Flows
SANKARSAN BOSE
 
Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
IndicThreads
 
Parallel Programming Primer
Parallel Programming PrimerParallel Programming Primer
Parallel Programming Primer
Sri Prasanna
 
Java in flames
Java in flamesJava in flames
Java in flames
Isuru Perera
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Cheng-Hsuan Li
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NET
SANKARSAN BOSE
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Förderverein Technische Fakultät
 
C++ neural networks and fuzzy logic
C++ neural networks and fuzzy logicC++ neural networks and fuzzy logic
C++ neural networks and fuzzy logic
Jamerson Ramos
 
RL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content DeliveryRL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content Delivery
Förderverein Technische Fakultät
 

Was ist angesagt? (20)

Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data Flows
 
Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Parallel Programming Primer
Parallel Programming PrimerParallel Programming Primer
Parallel Programming Primer
 
Parallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical SectionParallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical Section
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Java performance tuning
Java performance tuningJava performance tuning
Java performance tuning
 
Modern processors
Modern processorsModern processors
Modern processors
 
Java in flames
Java in flamesJava in flames
Java in flames
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
 
Software Profiling: Understanding Java Performance and how to profile in Java
Software Profiling: Understanding Java Performance and how to profile in JavaSoftware Profiling: Understanding Java Performance and how to profile in Java
Software Profiling: Understanding Java Performance and how to profile in Java
 
Final training course
Final training courseFinal training course
Final training course
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NET
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
 
C++ neural networks and fuzzy logic
C++ neural networks and fuzzy logicC++ neural networks and fuzzy logic
C++ neural networks and fuzzy logic
 
RL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content DeliveryRL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content Delivery
 

Andere mochten auch (7)

[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
Massively Parallel Architectures
Massively Parallel ArchitecturesMassively Parallel Architectures
Massively Parallel Architectures
 
Storage strategy and tsm roadmap
Storage strategy and tsm roadmapStorage strategy and tsm roadmap
Storage strategy and tsm roadmap
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 
Greenplum- an opensource
Greenplum- an opensourceGreenplum- an opensource
Greenplum- an opensource
 
Vivre en parallèle - Softshake 2013
Vivre en parallèle - Softshake 2013Vivre en parallèle - Softshake 2013
Vivre en parallèle - Softshake 2013
 
1.sma
1.sma1.sma
1.sma
 

Ähnlich wie The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
d0nn9n
 

Ähnlich wie The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures (20)

13 superscalar
13 superscalar13 superscalar
13 superscalar
 
13_Superscalar.ppt
13_Superscalar.ppt13_Superscalar.ppt
13_Superscalar.ppt
 
Next-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2msNext-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2ms
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with Pipelining
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Capital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msCapital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 ms
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
14 superscalar
14 superscalar14 superscalar
14 superscalar
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)
 
IBM MQ High Availabillity and Disaster Recovery (2017 version)
IBM MQ High Availabillity and Disaster Recovery (2017 version)IBM MQ High Availabillity and Disaster Recovery (2017 version)
IBM MQ High Availabillity and Disaster Recovery (2017 version)
 
IBM MQ - High Availability and Disaster Recovery
IBM MQ - High Availability and Disaster RecoveryIBM MQ - High Availability and Disaster Recovery
IBM MQ - High Availability and Disaster Recovery
 
VMworld 2014: Extreme Performance Series
VMworld 2014: Extreme Performance Series VMworld 2014: Extreme Performance Series
VMworld 2014: Extreme Performance Series
 
13 risc
13 risc13 risc
13 risc
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
 
13 risc
13 risc13 risc
13 risc
 
2. ILP Processors.ppt
2. ILP Processors.ppt2. ILP Processors.ppt
2. ILP Processors.ppt
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 

The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

  • 1. The University of Hertfordshire The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures Jason McGuiness Building Futures in Computer Science Colin Egan Computer Science empowering people through technology Computer Science University of Hertfordshire University of Hertfordshire UK UK
  • 2. Presentation Structure • Parallel processing – Pipeline processors, MII architectures, Multiprocessors • Processing In Memory (PIM) – Cellular Architectures: Cyclops/DIMES and picoChip • Code-generation issues arising from massive parallelism • Possible solutions to this issue: – Use the compiler or some libraries • An example implementation of a library, and the issues • Questions? – Ask as we go along, but we’ll also leave time for questions at then end of this presentation
  • 3. Parallel Processing • How can parallel processing be achieved? – By exploiting: • Instruction Level Parallelism (ILP) • Thread Level Parallelism (TLP) • Multi-processing • Data Level Parallelism (DLP) • Simultaneous Multi-Processing (SMP) • Concurrent processing • etcetera
  • 4. Pipelining • Exploits ILP by overlapping instructions in different stages of execution: – ILP is the amount of operations in a computer program that can be performed on at the same time (simultaneously) • Improves overall program execution time by increasing throughput: – It does not improve individual instruction execution time
  • 5. Pipelining • A simple 5-stage pipeline Instruction Cache Access; Instruction Increment Decode; Data Program Cache Write Back Counter; Register operand Access set up Branch Prediction Execution in ALU
  • 6. Pipelining hazards • Pipelining introduces hazards which can severely impact on processor performance: – Data (RAW, WAW and WAR) – Control (conditional branch instructions) – Structural (hardware contention) • To overcome such hazards complex hardware (dynamic scheduling) or complex software (static scheduling) or a combination of both is required
  • 7. Pipelining hazards Data, Control, Structural • Data hazards – Read After Write (RAW dependency) • The result of an operation is required as an operand of a subsequent operation and that result is not yet available in the register file add $1,$2,$3 //$1=$2+$3 addi $4,$1,8 //$4=$1+8 data dependency ($1 is unavailable) sub $5,$2,$1 //$5=$2-$1 data dependency ($1 is unavailable) ld $6,8($1) //$6=Mem[$1+8]... $1 may be available add $7,$6,$8 //load-use dependency...data dependency ($6 is unavailable)
  • 8. Pipelining hazards Data, Control, Structural • Data hazards – Write After Write (WAW data dependency) sub $1,$2,$3 addi $1,$4,8 //the addi cannot complete ahead of the sub – Write After Read (WAR data dependency) sub $2,$1,$3 addi $1,$4,8 //the addi cannot complete ahead of the sub
  • 9. Pipelining hazards Data, Control, Structural • Data hazards – Read After Write (RAW data dependency) is a problem in single instruction issue architectures and Multiple Instruction Issue (MII) architectures – Write After Write (WAW data dependency) and Write After Read (WAR data dependency) are a problem MII architectures, not single instruction issue architectures
  • 10. Pipelining hazards Data, Control, Structural • Data hazards – Read After Write (RAW data dependency) are True Data Dependencies • RAW are avoided: – in single instruction issue architectures by forwarding (bypassing), – in MII: » Superscalar (dynamic) by result broadcast » VLIW (static) by aggressive instruction (percolation) scheduling – Write After Write (WAW data dependency) and Write After Read (WAR data dependency) are Artificial Data Dependencies • WAW/WAR are eliminated by Register Renaming
  • 11. Pipelining hazards Data, Control, Structural • Control hazards – Are caused by (conditional) branch instructions – The only instruction which can alter the smooth flow of instruction execution – The next instruction to be executed is not known until the condition evaluation has been determined and in the case that the condition is to be taken then the address of the next instruction to be executed needs to be computed: • branch resolution beq $1,$2,label //if $1==$2 then pc=label; else pc=pc+4; bnq $1,$2,label //if $1!=$2 then pc=label; else pc=pc+4;
  • 12. Pipelining hazards Data, Control, Structural • Control hazards – Are a major problem: • Between 1 in 5 and 1 in 8 instructions from the dynamic stream of a general purpose program are conditional branch instructions • Approximately 66% of conditional branches are taken (target path) • Are a particular problem in Superscalar Architectures where 100s of instructions may be in flight – The impact can be reduced by: • Branch prediction (dynamic and static) • Branch removal (scheduling - static) • Delay region scheduling (static) • Multithreading (dynamic and static)
  • 13. Pipelining hazards Data, Control, Structural • Structural hazards – More than one instruction attempting to access the same piece of hardware at the same time – These can be overcome by increasing the amount of hardware: • For example, use the Harvard Architecture (separate the I-cache from the D-cache)
  • 14. Multiple Instruction Issue • MII – A processor that is capable of fetching and issuing more than one instruction during each processor cycle – A program is executed in parallel, but the processor maintains the outward appearance of sequential execution – The program binary must therefore be regarded as a specification of what was done, not how it was done – Minimise program execution time by: • by reducing instruction latencies • by exploiting additional ILP
  • 15. Multiple Instruction Issue • Superscalar Processor: – An MII processor where the number of instructions issued in each clock cycle is determined dynamically by hardware at run-time • Instruction issue may be in-order or out-of-order (Tomasulo or equivalent). • VLIW Processor (Very Long Instruction Word): – An MII processor where the number of operations (instructions) issued in each clock cycle is fixed and where the operations are selected statically by the compiler at compile time
  • 16. Superscalars • Fetch and Decode multiple instructions in parallel • Dynamically predict conditional branches • Initiate instructions for parallel execution based, whenever possible, on the availability of operands • Frequently issue/execute instructions out-of-order • Re-sequence instructions into the original program order after completion
  • 17. Superscalars 1. Instruction fetch strategies  Multiple instruction fetch  Dynamic branch prediction 2. Detecting True Data Dependencies between registers 3. Initiating or issuing multiple instructions in parallel 4. Resources for parallel execution of instructions:  Multiple pipelined functional units  Efficient memory hierarchies 5. Communication of data values through memory:  Loads and stores (RISC)  Allowing for dynamic, unpredictable behaviour of memory hierarchies 6. Committing the process state in the correct order:  Maintaining the appearance of sequential instruction execution
  • 18. Superscalars • Since there is very little ILP within individual basic blocks (i.e. between branches), branch prediction is essential for high performance • It is important to maintain an instruction window or window of execution from which ILP can be extracted
  • 19. Superscalars • The problems include: – Conditional branches – Data path complexity – The memory wall (widening gap between CPU and memory access times) – Limited available parallelism within a finite sized instruction window • Is the superscalar approach at a point of diminishing returns?
  • 20. VLIWs • Attempts to reduce hardware complexity by extracting ILP at compile time instead of run time • The processor issues a single very long instruction during each machine cycle • Each very long instruction consists of a large number of independent operations packed into a single instruction word by the compiler • Each operation requires a small, statically predictable number of cycles to execute • Each operation is itself pipelined • Execute instructions in-order
  • 21. VLIWs • Problems: – Code expansion: • The need to pad out instruction groups with NOPs can double the code size – Compatibility: • Code must be recompiled (scheduled) for every new instance of a particular VLIW architecture
  • 22. Problems of MII • To sustain multiple instruction fetch, MII architectures require a complex memory hierarchy: – Caches • l1, l2, stream buffers, non-blocking caches – Virtual Memory • TLB – Caches suffer from: • Compulsory misses • Capacity misses • Collisions
  • 23. Memory Problems • Compulsory misses: – The first time a processor address is requested it will not be in cache memory and must be fetched from a slower level of the memory hierarchy: • Hopefully main (physical) memory • If not from Virtual Memory • If not from secondary storage – This can result in long delay (latency) due to large access time(s)
  • 24. Memory Problems • Capacity misses: – There are more cache block requests than the size of the cache • Collisions: – The processor makes a request to the same block but for different instructions/data • For both: – Blocks therefore have to be replaced – But a block that has been replaced might be referenced again resulting in yet more replacements
  • 25. Thread Level Parallelism • A thread can be considered to be a ‘light weight process’ – Where a thread consists of a short sequence of code, with its own: • registers, data, state and so on • but shares process space • TLP is exploited by simultaneously executing different threads on different processors: – TLP is therefore exploited by multiprocessors
  • 26. Multiprocessors • Should be: – Easily scalable – Fault tolerant – Achieve higher performance than a uni-processor • But … – How many processors can we connect? – How do parallel processors share data? – How are parallel processors co-ordinated?
  • 27. Multiprocessors • Shared Memory Processors (SMP) – All processors share a single global memory address space – Communication is through shared variables in memory – Synchronisation is via locks (hardware) or semaphores (software)
  • 28. Multiprocessors • Uniform Memory Access (UMA) – All memory accesses take the same time – Do not scale well • Non-uniform Memory Access (NUMA) – Each processor has a private (local) memory – Global memory access time can vary from processor to processor – Present more programming challenges – Are easier to scale
  • 29. Multiprocessors • NUMA – Communication and synchronization are achieved through message passing: • Processors could then, for example, communicate over an interconnection network • Processors use send and receive primitives
  • 30. Multiprocessors • The difficulty is in writing effective parallel programs: – Parallel programs are inherently harder to develop – Programmers need to understand the underlying hardware – Programs tend not to be portable – Amdahl’s law; a very small part of a program that is inherently sequential can severely limit the attainable speedup • “It remains to be seen how many important applications will run on multiprocessors via parallel processing.” • “The difficulty has been that too few important application programs have been written to complete tasks sooner on multiprocessors.”
  • 31. Multiprocessors • Multiprocessors suffer the same memory problems as uni-processors and in addition: – The problem of maintaining memory coherence between the processors
  • 32. Cache Coherence • A read operation must return the value of the latest write operation • But in multiprocessors each processor will (probably) have its own private cache memory – There is no guarantee of data consistency between private (local) cache memory and shared (global) memory
  • 33. The memory wall • Today’s processors are 10 times faster than processors of only 5 years ago • But today’s processors do not perform at 1/10th of the time of processors from 5 years ago • CPU speeds double approximately every eighteen months (Moore's Law), while main memory speeds double only about every ten years – This causes a bottle-neck in the memory subsystem, which impacts on overall system performance – The “memory wall” or the “Von Neumann bottleneck”
  • 34. The memory wall • In today’s processors RAM roughly holds 1% of the data stored on a hard disk drive • And hard disk drives supply data nearly a thousand times slower than a processor can use it • This means that for many data requests, a processor is idle while it waits waiting for data to be found and transferred
  • 35. The memory wall • IBM’s vice president Mark Dean has said: – "What's needed today is not so much the ability to process each piece of data a great deal, it's the ability to swiftly sort through a huge amount of data."
  • 36. The memory wall • Mark Dean again: – "The key idea is that instead of focusing on processor micro-architecture and structure, as in the past, we optimize the memory system's latency and throughput — how fast we can access, search and move data …" • Leads to the concept of Processing In Memory (PIM)
  • 37. Processing in memory • The idea of PIM is to overcome the bottleneck between the processor and main memory by combining a processor and memory on a single chip • The benefits of a PIM architecture are: – Reduced memory latency – Increases memory bandwidth – Simplifies the memory hierarchy – Provides multi-processor scaling capabilities: • Cellular architectures – Avoids the Von Neumann bottleneck
  • 38. Processing in memory • This means that: – Much of the expensive memory hierarchy can be dispensed with – CPU cores can be replaced with simpler designs – Less power is used by PIM – Less silicon space is used by PIM
  • 39. Processing in memory • But … – Processor speed is reduced – The amount of available memory is reduced • However, PIM is easily scaled: – Multiple PIM chips connected together forming a network of PIM cells – Such scaled architectures are called Cellular architectures
  • 40. Cellular architectures • Cellular architectures consist of a high number of cells (PIM units): – With tens of thousands up to one million processors – Each cell (PIM) is small enough to achieve extremely large-scale parallel operations – To minimise communication time between cells, each cell is only connected to its neighbours
  • 41. Cellular architectures • Cellular architectures are fault tolerant: – With so many cells, it is inevitable that some processors will fail – Cellular architecture simply re-route instructions and data around failed cells • Cellular architectures are ranked highly as today’s Supercomputers: – IBM BlueGene takes the top slots in the Top 500 list
  • 42. Cellular architectures • Cellular architectures are threaded: – Each thread unit: • Is independent of all other thread units • Serves as a single in-order issue processor • Shares computationally expensive hardware such as floating-point units – There can be a large number of thread units: • 1,000s if not 100,000s of thousands • Therefore they are massively parallel architectures
  • 43. Cellular architectures • Cellular architectures are NUMA – Have irregular memory access: • Some memory is very close to the thread units and is extremely fast • Some is off-chip and slow • Cellular architectures, therefore, use caches and have a memory hierarchy
  • 44. Cellular architectures • In Cellular architectures multiple thread units perform memory accesses independently • This means that the memory subsystem of Cellular architectures do in fact require some form of memory access model that permits memory accesses to be effectively served
  • 45. Cellular architectures • Uses of Cellular architectures: – Games machines (simple Cellular architecture) – Bioinformatics (protein folding) – Imaging • Satellite • Medical • Etcetera – Research – Etcetera
  • 46. Cellular architectures • Examples: – BlueGene Project: • Cyclops (IBM) – next generation from BlueGene/P, called BlueGene/C – DIMES – a prototype implementation – Gilgamesh (NASA) – Shamrock (Notre Dame) – picoChip (Bath, UK)
• 47. IBM Cyclops or BlueGene/C • Developed by IBM at the Thomas J. Watson Research Center • Also called BlueGene/C, by analogy with the earlier BlueGene/L and BlueGene/P
• 48. Cyclops • The idea of Cyclops is to provide around one million processors: – Where each processor can perform a billion operations per second – Which means that Cyclops will be capable of one petaflops: a thousand trillion (10^15) calculations per second
• 49. Cyclops [Slide shows a block diagram of a Cyclops board: groups of thread units (TU), each with scratch-pad memory (SP), share a floating-point unit (FPU) and shared registers (SR); a crossbar network connects them to eight on-chip memory banks; off-chip memory is attached at 4 GB/sec, other chips are reached via a 3D mesh at 6 × 4 GB/sec, and an IDE HDD is attached at 50 MB/sec]
• 50. DIMES • DIMES: – Is the first hardware implementation of a Cellular architecture – Is a simplified ‘cut-down’ version of Cyclops – Is a hardware validation tool for Cellular architectures – Emulates Cellular architectures, in particular Cyclops, cycle-by-cycle – Is implemented on at least one FPGA – Has been evaluated by Jason
  • 51. DIMES • The DIMES implementation that Jason evaluated: – Supports a P-thread programming model – Is a dual processor where each processor has four thread units – Has 4K of scratch-pad (local) memory per thread unit – Has two banks of 64K global shared memory – Has different memory models: • Scratch pad memory obeys the program consistency model for all of the eight thread units • Global memory obeys the sequential consistency model for all of the eight thread units – Is called DIMES/P2
  • 53. DIMES • Jason’s concerns were: – How to manage a potentially large number of threads – How to exploit parallelism from the input source code in these threads – How to manage memory consistency
• 54. DIMES • Jason tested his concerns by using an “embarrassingly parallel” program which generated Mandelbrot sets • Jason’s approach was to distribute the work-load between threads and he also implemented a work-stealing algorithm to balance loads between threads: – When a thread completed its ‘work-load’, rather than remain idle that thread would ‘steal’ work from another ‘busy’ thread – This meant that he maximised parallelism, improved thread utilisation and hence reduced overall program execution time (a minimal sketch of the idea follows)
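As a minimal sketch of the work-stealing idea above — assuming modern standard C++ threads rather than the DIMES pthread-style library, with worker_queue, pop_or_steal and compute_row as purely hypothetical names — the scheme might look roughly like this:

    // A minimal sketch of work stealing, NOT the DIMES code: each worker owns a
    // queue of Mandelbrot rows and steals from a peer when its own queue empties.
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <cstdio>

    struct worker_queue {
        std::mutex m;
        std::deque<int> rows;          // row indices still to be computed
    };

    static void compute_row(int /*row*/) { /* per-pixel Mandelbrot work here */ }

    static bool pop_or_steal(std::vector<worker_queue>& qs, std::size_t self, int& row) {
        {   // try our own queue first
            std::lock_guard<std::mutex> lk(qs[self].m);
            if (!qs[self].rows.empty()) { row = qs[self].rows.front(); qs[self].rows.pop_front(); return true; }
        }
        for (std::size_t v = 0; v < qs.size(); ++v) {   // otherwise steal from a busy peer
            if (v == self) continue;
            std::lock_guard<std::mutex> lk(qs[v].m);
            if (!qs[v].rows.empty()) { row = qs[v].rows.back(); qs[v].rows.pop_back(); return true; }
        }
        return false;                   // nothing left anywhere: this worker is done
    }

    int main() {
        const std::size_t n_workers = 8;          // e.g. the eight DIMES thread units
        const int n_rows = 480;
        std::vector<worker_queue> queues(n_workers);
        for (int r = 0; r < n_rows; ++r)          // static initial distribution of rows
            queues[r % n_workers].rows.push_back(r);

        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < n_workers; ++w)
            workers.emplace_back([&queues, w] {
                int row;
                while (pop_or_steal(queues, w, row)) compute_row(row);
            });
        for (auto& t : workers) t.join();
        std::printf("all rows computed\n");
    }

Because no new work is generated after the initial distribution, a worker may safely exit once every queue is empty; in Jason's experiment this kept all thread units busy until the whole image was complete.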
• 55. DIMES [Slide shows four snapshots of the work distribution across the threads: shortly after program start; shortly before work stealing; just after work stealing; after more work stealing]
• 56. picoChip • picoChip is based in Bath – “… is dedicated to providing fast, flexible wireless solutions for next generation telecommunications systems.” • picoArrayTM – Is a tiled architecture – 308 heterogeneous processors connected together – The interconnects consist of bus switches joined by picoBusTM – Each processor is connected to the picoBusTM above and below it
  • 58. picoChip and Parallelism • The parallelism provided by picoChip is synchronous – This avoids many of the issues raised by the other architectures that expose asynchronous parallelism – But it is at the cost of the flexibility that asynchronous parallelism provides
• 59. Issues arising ... • Such massively parallel hardware creates a serious problem for the programmer: – Because it is computer science folklore that writing parallel programs is very hard • Writing fast and correct massively parallel programs is therefore a major concern
• 60. Abstraction of the Parallelism • This may be done in various ways: – For example within the compiler: • Using trace scheduling, list-based scheduling or other data-flow-based means – Using language features: • HPF, UPC or the additions to C++ in the IBM Visual Age compiler and Microsoft's additions – Most commonly using libraries: • For example: Posix threads, Win32 threads, OpenMP, boost.thread, home-grown wrappers (a minimal loop-level example follows)
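As a rough illustration of the library/pragma route listed above, a minimal OpenMP sketch of loop-level parallelism might look like this (assuming an OpenMP-capable compiler, e.g. one invoked with -fopenmp; the arrays and sizes are arbitrary):

    // Loop-level parallelism via OpenMP: the pragma asks the runtime to split
    // the iterations of the following loop across a team of threads.
    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        #pragma omp parallel for                 // each thread gets a chunk of [0, n)
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        std::printf("ran with up to %d threads\n", omp_get_max_threads());
    }

This is exactly the kind of “forall”-style construct the next slides discuss: convenient for regular loops, but it exposes only limited parallelism in general-purpose code.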
  • 61. Parallelism using Libraries • Using libraries has a major advantage over implementing parallelism within the language: – It does not require the design of a new language, nor learning a new language – Novel languages are traditionally seen as a burden and often hinder adoption of new systems due to the volume of existing source-code – But libraries are especially prone to mis-use and are traditionally hard to use
• 62. Issues of Libraries: part I • The model exposed is very diverse: – Message passing, e.g. MPI – Loop-level parallelism (e.g. OpenMP’s “parallel for”/“forall ...” constructs), which exposes very limited parallelism in general-purpose code – Very low-level primitives, e.g. those exposed by Posix Threads are extremely basic: mutexes, condition variables, and basic thread operations (see the sketch below) • The libraries require experience to use well, or even correctly
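To show just how low-level the Posix primitives are, here is a minimal sketch (not taken from any of the slides' code): handing a single integer result from a worker thread back to the caller already requires a mutex, a condition variable and a flag.

    // Handing one integer from a worker to the main thread with raw pthreads.
    #include <pthread.h>
    #include <cstdio>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
    static bool ready = false;
    static int  result = 0;

    static void* worker(void*) {
        pthread_mutex_lock(&mtx);
        result = 42;                      // the "computation"
        ready  = true;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
        return nullptr;
    }

    int main() {
        pthread_t tid;
        pthread_create(&tid, nullptr, worker, nullptr);

        pthread_mutex_lock(&mtx);
        while (!ready)                    // guard against spurious wake-ups
            pthread_cond_wait(&cv, &mtx);
        pthread_mutex_unlock(&mtx);

        pthread_join(tid, nullptr);
        std::printf("result = %d\n", result);
    }

Getting even this much right (the flag, the while-loop around the wait, the lock around the signal) is exactly the kind of experience the slide says these libraries demand.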
• 63. Part II: Non-composability of atomic operations! • The fact that atomic operations do not compose is a major concern when using libraries! • The composition of thread-safe data structures does not guarantee thread-safety: – At each combination of the data structures, more locks are required to ensure correct operation! – This implies more and more layers of locks, which are slow... (see the sketch below)
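A minimal sketch of the non-composability problem, with account, withdraw_if_sufficient_racy and transfer_lock as illustrative names only: each member function below is individually thread-safe, yet the compound read-then-update built from two such calls is not atomic, so yet another, outer lock has to be added around the combination.

    #include <mutex>

    class account {
        mutable std::mutex m_;
        int balance_ = 100;
    public:
        int balance() const {                       // thread-safe read
            std::lock_guard<std::mutex> lk(m_);
            return balance_;
        }
        void withdraw(int amount) {                 // thread-safe update
            std::lock_guard<std::mutex> lk(m_);
            balance_ -= amount;
        }
    };

    std::mutex transfer_lock;                       // the extra layer of locking

    // BROKEN: balance() and withdraw() are each atomic, their composition is not.
    void withdraw_if_sufficient_racy(account& a, int amount) {
        if (a.balance() >= amount)                  // another thread may withdraw here
            a.withdraw(amount);
    }

    // Correct only because every caller also takes this additional, outer lock.
    void withdraw_if_sufficient_locked(account& a, int amount) {
        std::lock_guard<std::mutex> lk(transfer_lock);
        if (a.balance() >= amount)
            a.withdraw(amount);
    }

Each new composition of already-safe pieces forces another such lock, which is where the ever-deeper, ever-slower layers of locking come from.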
• 64. Parallelism in the Compiler • Given the concerns with regard to libraries, what about parallelising compilers? • The fact is that auto-parallelising compilers exist, e.g. the list-based scheduling implemented in the EARTH-C compiler (circa 1999) has been proven to be optimal for that architecture • Data-flow compilers have existed for years – Why aren't they used?
  • 65. Industrial Parallelising Compilers • Microsoft is introducing OpenMP-based constructs into their compiler, e.g. “forall” • IBM Visual Age has similar functionality • Java has a thread library • C++0x: much work has been done regarding standardisation with respect to multiple threads
• 66. C++ as an example • The uses of C++ make it an interesting target for parallelisation: – Although imperative, so arguably flawed for implementing parallelism – It has great market penetration, therefore there is much demand for using parallelism – It is commonly used in high-performance, multi-threaded systems – Its general-purpose nature and quality libraries are increasing its appeal to super-computing
• 67. Parallelism support in C++ • Libraries exist beyond the usual C libraries: – boost.thread – exists now, requires standards-compliant compilers – C++0x: details of the threading support are becoming apparent and appear to include: • Atomic operations (memory consistency), exceptions, a machine model underpinning the approach • Threading models: thread-as-a-class, lock objects • Thread pools – coming later – probably • More details on the web, or at the ACCU – in flux
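As a rough sketch of where that threading support was heading — using the std::thread/std::async/std::future interfaces that eventually shipped in C++11, not the proposal text itself, and with mandelbrot_row as a stand-in for real work:

    #include <future>
    #include <thread>
    #include <cstdio>

    static int mandelbrot_row(int row) { return row * row; }  // stand-in for real work

    int main() {
        // A future: the result (or any exception thrown by the task) is
        // communicated back to the caller when get() is called.
        std::future<int> f = std::async(std::launch::async, mandelbrot_row, 7);

        // Thread-as-a-class: a std::thread object owns one thread of execution.
        std::thread t([] { std::puts("worker running"); });
        t.join();

        std::printf("row result = %d\n", f.get());   // blocks until the task finishes
    }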
• 68. Experience using C++ • Recall DIMES: – Prototype of massively parallel hardware – Posix-threads style library implementing threads – C++ thread-as-a-class wrapper implemented • Summary of experience: – Hardly object-orientated: no separation in the design of the Mandelbrot application from the work-stealing algorithm and thread pool – The software insufficiently separated the details of the hardware features from the design
  • 69. Further experiences using C++ • From this work and other experiences, I developed a more interesting thread library: – Traits abstract underlying OS thread API from wrapper library – Therefore has hardware abstractions too – Provision of higher-level threading models: • Primarily based on futures and thread pools – Use of thread pools and futures creates a singly rooted-tree of threads: • Trivially deadlock free – a holy grail!
• 70. C++0x threads now? No! • Included in libjmmcg: – Relies upon non-standard behaviour and broken optimisers! For example: • Problems with global code-motion moving apparently const objects past locks • The exception stack is unique to a thread, not global to the program, and currently unspecified • Implementation of std::list – DSEL has syntax limitations due to the design of C++ – Doesn't use boost ... – Has example code and test cases – Isn't complete! (e.g. Posix & sequential specialisations incomplete, some inefficiencies) – But get it from libjmmcg.sourceforge.net, under LGPL
• 71. Trivial example usage
    struct res_t { int i; };
    struct work_type {
        typedef res_t result_type;
        void process(result_type &) {}
    };
    pool_type pool(2);
    async_work work(creator_t::it(work_type(1)));
    execution_context context(pool<<joinable()<<time_critical()<<work);
    pool.erase(context);
    context->i;
• The devil is in the omitted details: the typedefs for: – pool_type, async_work, execution_context, joinable, time_critical • The library requires that the work to be mutated has the items shown in italics on the slide (result_type and process) defined
  • 72. Explanation of the example • The concept is: – that asynchronous work (async_work) that should be mutated (process) to the specified output type (result_type) is transferred into a thread pool (pool_type) of some kind • This transfer (<<) may, optionally (joinable), return a future (execution_context) – Which can be used to communicate (->) the result of the mutation, executed at kernel priority (time_critical), back to the caller • The future also allows exceptions to be propagated
• 73. More details regarding the example • The thread pool (pool_type) has many traits: – Master-slave or work-stealing – The thread API (Win32, Posix or sequential) – The thread API costs, in very rough terms – It implies a work schedule that is a FIFO baker's-ticket schedule; an implementation of GSS(k) is in progress • The library effectively implements a software simulation of data-flow • Wrapping a function call, plus its parameters, in a class converts Kevlin's threading model to this one (a hypothetical sketch of such a wrapper follows)
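A hypothetical sketch of such a wrapper — not part of libjmmcg, and call_wrapper/square are invented names — showing how a plain function call plus its parameter can be packaged into a class providing the result_type/process() interface that the library expects of work items:

    #include <cstdio>

    template<typename Result, typename Arg>
    class call_wrapper {
        Result (*fn_)(Arg);
        Arg arg_;
    public:
        typedef Result result_type;                       // what the pool expects
        call_wrapper(Result (*fn)(Arg), Arg arg) : fn_(fn), arg_(arg) {}
        void process(result_type& out) { out = fn_(arg_); }  // mutate into the result
    };

    static int square(int x) { return x * x; }

    int main() {
        call_wrapper<int, int> work(&square, 6);          // the wrapped call + parameter
        int result = 0;
        work.process(result);                             // the pool would invoke this
        std::printf("%d\n", result);
    }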
  • 74. Time for controversy.... What faces programmers... • Large-scale parallelism is here, now: – Blade frames at work: • 4 cores x 4 CPUs x 20 frames per rack = 320 thread units, in a NUMA architecture • The hardware is expensive! • But so is the software ... – It must be programmed, economically – The programs must be maintained ... • Or it will be an expensive failure?
• 75. Talk summary • In this talk we have looked at parallel processing and justified the reasons for Processing In Memory (PIM) and therefore cellular architectures • We have briefly looked at two example architectures: – Cyclops and picoChip • Jason has worked on DIMES, the first implementation of a (cut-down) version of a cellular architecture • The issues of programming for these massively parallel architectures have been described • We focussed on the future of threading in C++
  • 76. The University of Hertfordshire The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures Questions? Jason McGuiness Building Futures in Computer Science Colin Egan Computer Science empowering people through technology Computer Science University of Hertfordshire University of Hertfordshire UK UK