Massively Parallel Architectures




    Colin Egan, Jason McGuiness and Michael Hicks
              (Compiler Technology and
        Computer Architecture Research Group)


                                                    1
Presentation Structure

Introduction




Conclusions

Questions? (ask as we go along and I’ll also leave time for
questions at the end of this presentation)


                                                              2
The memory wall

High performance computing can be achieved
through instruction level parallelism (ILP) by:
  Pipelining,
  Multiple instruction issue (MII).


But both of these techniques can cause a
bottleneck in the memory subsystem, which may
impact overall system performance.
  The memory wall.
                                                   3
Overcoming the memory wall

We can improve data throughput and
data storage between the CPU and the
memory subsystem by introducing extra
levels into the memory hierarchy.
  L1, L2, L3 Caches.



                                    4
Overcoming the memory wall
But …
   This increases the penalty associated with a miss in the memory subsystem,
   which impacts the average cache access time,
   where:

average cache access time = hit-time + miss-rate * miss-penalty

for two levels of cache:
average access time = hit-time_L1 + miss-rate_L1 * (hit-time_L2 + miss-rate_L2 * miss-penalty_L2)
because
miss-penalty_L1 = hit-time_L2 + miss-rate_L2 * miss-penalty_L2

for three levels of cache:
average access time = hit-time_L1 + miss-rate_L1 * (hit-time_L2 + miss-rate_L2 * (hit-time_L3 +
    miss-rate_L3 * miss-penalty_L3))




                                                                                           5
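
As a concrete illustration of the formulas above, the sketch below (plain C, with purely illustrative hit times and miss rates rather than figures from any real machine) computes the average access time for two and three cache levels by applying the recurrence that the miss penalty of one level is the access time of the levels below it.

```c
#include <stdio.h>

/* Average memory access time (AMAT) for a multi-level cache, computed
 * recursively: the miss penalty of level i is the access time of the
 * remaining levels below it, ending at main memory. */
static double amat(const double *hit_time, const double *miss_rate,
                   int levels, double memory_penalty)
{
    if (levels == 0)
        return memory_penalty;          /* request reaches main memory */
    return hit_time[0] +
           miss_rate[0] * amat(hit_time + 1, miss_rate + 1,
                               levels - 1, memory_penalty);
}

int main(void)
{
    /* Illustrative numbers only (cycles and miss ratios), not measurements. */
    double hit_time[]  = { 1.0, 10.0, 30.0 };
    double miss_rate[] = { 0.05, 0.20, 0.50 };
    double mem_penalty = 200.0;

    printf("2-level AMAT = %.2f cycles\n", amat(hit_time, miss_rate, 2, mem_penalty));
    printf("3-level AMAT = %.2f cycles\n", amat(hit_time, miss_rate, 3, mem_penalty));
    return 0;
}
```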
Overcoming the memory wall

The average cache access time is therefore dependent on
three criteria: hit-time, miss-rate and miss-penalty.

Increasing the number of levels in the memory hierarchy
does not improve main memory access time.

Recovery from a cache miss causes a main memory
bottleneck, since main memory cannot keep up with
the required fetch rate.


                                                           6
Overcoming the memory wall

The overall effect:
  can limit the amount of ILP,
  can impact on processor performance,
  increases the architectural design complexity,
  increases hardware cost.




                                                   7
Overcoming the memory wall

So perhaps we should improve the
performance of the memory subsystem.

Or we could place processors in memory:
  Processing In Memory (PIM).




                                          8
Processing in memory

The idea of PIM is to overcome the bottleneck
between the processor and main memory by
combining a processor and memory on a single
chip.

The benefits of PIM are:
  reduced latency,
  a simplification of the memory hierarchy,
  high bandwidth,
  multi-processor scaling capabilities
     Cellular architectures.


                                                9
Processing in memory

Reduced latency:
   In PIM architectures accesses from the processor to memory:
      do not have to go through multiple memory levels (caches) to go off
      chip,
      do not have to travel along a printed circuit line,
      do not have to be reconverted back down to logic levels,
      do not have to be re-synchronised with a local clock.


   Currently there is about an order of magnitude reduction in latency
      thereby overcoming the memory wall.




                                                                            10
Processing in memory

Beneficial side-effects of reducing latency:
  total silicon area is reduced,
  power consumption is reduced.




                                               11
Processing in memory

But …
 processor speed is reduced,
 the amount of available memory is reduced.




                                              12
Cellular architectures

PIM is easily scaled:
  multiple PIM chips can be connected together, forming a
  network of PIM cells (Cellular architectures) to
  overcome these two problems.




                                                 13
Cellular architectures

Cellular architectures are threaded:
  each thread unit is independent of all other
  thread units,
  each thread unit serves as a single in-order issue
  processor,
  each thread unit shares computationally
  expensive hardware such as floating-point units,
  there can be a large number (1,000s) of thread
  units – therefore they are massively parallel.


                                                       14
Cellular architectures

Cellular architectures have irregular memory
access:
  some memory is very close to the thread units
  and is extremely fast,
  some is off-chip and slow.

Cellular architectures, therefore, use caches
and have a memory hierarchy.

                                                  15
Cellular architectures

In Cellular architectures multiple thread units
perform memory accesses independently.

This means that the memory subsystem of
Cellular architectures requires some form of
memory access model that permits memory
accesses to be effectively served.

                                               16
Cellular architectures

Examples:
 Bluegene Project
   Cyclops (IBM)
   DIMES
 Gilgamesh (NASA)
 Shamrock (Notre Dame)


                                17
Cyclops

Developed by IBM at the Thomas J. Watson
Research Center.
Also called Bluegene/C, in contrast with the
later Bluegene/L.



                                        18
Logical View of the Cyclops64 Chip Architecture

[Figure: board-level view of a Cyclops64 chip. Each processor contains thread
units (TU) with their own scratch-pad memories (SP), shared registers (SR) and
a shared floating-point unit (FPU). The processors, the on-chip memory banks,
the off-chip memory ports (4 GB/sec each), an IDE hard drive (50 MB/sec) and
six 4 GB/sec links to other chips in a 3D mesh are all connected through an
on-chip crossbar network.]

    Processors                 : 80 / chip
    Thread Units (TU)          : 2 / processor
    Floating-Point Unit (FPU)  : 1 / processor
    Shared Registers (SR)      : 0 / processor
    Scratch-Pad Memory (SP)    : n KB / TU
    On-chip Memory (SRAM)      : 160 x 28KB = 4480 KB
    Off-chip Memory (DDR2)     : 4 x 256MB = 1GB
    Hard Drive (IDE)           : 1 x 120GB = 120 GB

    SP is located in the memory banks and is accessed directly by the owning TU and
    via the crossbar network by the other TUs.
                                                                                  19
Slide 19

NOTE ABOUT SCRATCH-PAD MEMORIES:

           There is one memory bank (MB) per TU on the chip. One part of the MB is reserved for the SP of the corresponding TU. The size of the
           SP is customizable.

           A TU can access its own SP directly, without using the crossbar network, with a latency of 3/2 cycles for ld/st. The other TUs can access
           the SP of another TU by accessing the corresponding MB through the network, with a latency of 22/11 cycles for ld/st. Even TUs of the
           same processor must use the network to access each other's SPs.

           A memory bank can support only one access per cycle. Therefore, if a TU accesses its SP while another TU accesses the same SP
           through the network, arbitration occurs and one of the two accesses is delayed.
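
A back-of-the-envelope reading of those latencies (interpreting 3/2 and 22/11 as cycles per load and per store, and ignoring bank arbitration) is sketched below in C; the access counts are invented purely for illustration.

```c
#include <stdio.h>

/* Rough cycle estimate for a burst of scratch-pad accesses, using the
 * load/store latencies quoted in the note above:
 *   local SP              : 3 cycles per load, 2 per store
 *   remote SP via crossbar: 22 per load, 11 per store
 * Arbitration delays when two TUs hit the same memory bank are ignored. */
int main(void)
{
    const long loads = 1000, stores = 1000;   /* illustrative counts */

    long local_cycles  = loads * 3  + stores * 2;
    long remote_cycles = loads * 22 + stores * 11;

    printf("local  SP: %ld cycles\n", local_cycles);
    printf("remote SP: %ld cycles (%.1fx slower)\n",
           remote_cycles, (double)remote_cycles / local_cycles);
    return 0;
}
```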
DIMES

Delaware Interactive Multiprocessor
Emulation System (DIMES)
  under development by the Computer
  Architecture and Parallel Systems
  Laboratory (CAPSL) at the University of
  Delaware, Newark, DE, USA, under the
  leadership of Prof. Guang Gao.

                                            20
DIMES

DIMES
 is the first hardware implementation of a Cellular
 architecture,
 is a simplified ‘cut-down’ version of Cyclops,
 is a hardware validation tool for Cellular architectures,
 emulates Cellular architectures, in particular Cyclops,
 cycle by cycle,
 is implemented on at least one FPGA,
 has been evaluated by Jason McGuiness.


                                                           21
DIMES

The DIMES implementation that Jason evaluated:
  supports a P-thread programming model,
  is a dual processor
      where each processor has four thread units,
  has 4K of scratch-pad (local) memory per thread unit,
  has two banks of 64K global shared memory,
  has different memory models:
     scratch pad memory obeys the program consistency model for all of the
     eight thread units,
     global memory obeys the sequential consistency model for all of the
     eight thread units,
  is called DIMES/P2.



                                                                         22
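
For a flavour of the P-thread programming model mentioned above, here is a minimal sketch in ordinary POSIX threads: eight workers, one per DIMES/P2 thread unit, updating a shared counter in global memory. It is host-side illustration only and does not use any DIMES-specific API.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8          /* 2 processors x 4 thread units */

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&lock);
    counter += id;             /* shared (globally visible) update */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    printf("counter = %ld\n", counter);   /* 0+1+...+7 = 28 */
    return 0;
}
```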
DIMES

DIMES/P2.

[Figure: DIMES/P2 block diagram. Processor 0 and Processor 1 each contain
Thread Unit 0 to Thread Unit 3, each thread unit with its own 4K scratch pad.
Both processors connect through a network to two banks of 64K global memory.]

                                                                               23
Gilgamesh

This material is based on:
  Sterling TL and Zima HP, Gilgamesh: A
  Multithreaded Processor-In-Memory
  Architecture for Petaflops Computing, IEEE,
  2002.




                                                24
Gilgamesh

A PIM-based massively parallel architecture.
Extends existing PIM capabilities by:
  Incorporating advanced mechanisms for
  virtualizing tasks and data.
  Providing adaptive resource management for
  load balancing and latency tolerance.



                                               25
Gilgamesh System Architecture

Collection of MIND chips interconnected by
a system network
  MIND: Memory, Intelligence, Network Device
A MIND chip contains:
  Multiple banks of DRAM storage
  Processing logic
  External I/O interface
  Inter-chip communication channels

                                               26
Gilgamesh System Architecture
[Figure 1 from Sterling & Zima, 2002: Gilgamesh system architecture]




                             27
MIND Chip

The MIND chip is multithreaded.
MIND chips communicate:
  Via parcels,
  A parcel can be used to perform a conventional
  operation as well as to invoke methods on a
  remote chip.



                                                28
MIND Chip Architecture
[Figure 2 from Sterling & Zima, 2002: MIND chip architecture]




                              29
MIND Chip Architecture

Nodes provide storage, information processing and
execution control.
Local wide registers serve as thread state, instruction cache
and vector memory row buffer.
The wide ALU performs multiple operations on separate fields
simultaneously.
The system memory bus interface connects MIND to
workstation and server motherboards.
The data streaming interface handles rapid, high-bandwidth
data movement.
On-chip communication interconnects allow sharing of
pipelined FPUs among nodes.

                                                             30
MIND Node Structure
[Figure 3 from Sterling & Zima, 2002: MIND node structure]




                            31
MIND Node Structure

Integrates a memory block with processing logic
and control mechanisms.
The wide ALU is 256 bits wide, supporting multiple
simultaneous operations.
The ALU does not provide explicit floating-point
operations.
A permutation network rearranges bytes within a complete
row.

                                                    32
MIND PIM Architecture

A high degree of memory bandwidth is achieved by:
  Accessing an entire memory row at a time: data
  parallelism.
  Partitioning the total memory into multiple independently
  accessible blocks: multiple memory banks.
Therefore:
  Memory bandwidth is of square order compared with an
  off-chip architecture.
  Low latency.

                                                         33
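
The bandwidth argument can be made concrete with a rough calculation; the row width, bank count and access rate below are hypothetical, chosen only to show the arithmetic, not taken from the MIND design.

```c
#include <stdio.h>

/* Back-of-the-envelope illustration of why row-wide, multi-bank PIM access
 * yields far more bandwidth than a conventional off-chip bus.
 * All parameters are made-up, purely to demonstrate the arithmetic. */
int main(void)
{
    /* On-chip PIM: each bank delivers a whole 256-byte row per access. */
    double row_bytes  = 256.0;
    double banks      = 16.0;
    double access_mhz = 100.0;                       /* row accesses, millions/s */
    double pim_gbps   = row_bytes * banks * access_mhz * 1e6 / 1e9;

    /* Off-chip: an 8-byte-wide bus clocked at 100 MHz. */
    double bus_gbps   = 8.0 * 100.0 * 1e6 / 1e9;

    printf("PIM on-chip bandwidth : %.1f GB/s\n", pim_gbps);
    printf("Off-chip bus bandwidth: %.1f GB/s\n", bus_gbps);
    return 0;
}
```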
Parcels

A variable-length communication packet that
contains sufficient information to perform a remote
procedure invocation.
It is targeted at a virtual address, which may identify
individual variables, blocks of data, structures,
objects, threads, or I/O streams.
Parcel fields mainly contain: destination address,
parcel action specifier, parcel arguments,
continuation, and housekeeping.

                                                    34
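
The parcel fields listed above suggest a layout along the following lines; this C struct is a hypothetical sketch for illustration, with invented names and widths, not the actual MIND parcel format.

```c
#include <stdint.h>

/* Hypothetical sketch of the fields a Gilgamesh parcel is described as
 * carrying; names and widths are illustrative only. */
struct parcel {
    uint64_t dest_vaddr;    /* virtual address of the target: a variable, block
                               of data, structure, object, thread or I/O stream */
    uint32_t action;        /* parcel action specifier: basic op or method id */
    uint32_t arg_len;       /* number of valid bytes in args[] */
    uint8_t  args[64];      /* parcel arguments (variable length in reality) */
    uint64_t continuation;  /* where the result goes / what runs next */
    uint32_t housekeeping;  /* sequencing, routing and error information */
};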
Multithreaded Execution

Each thread resides in a wide register of a
MIND node.
The multithread controller determines:
  When the thread is ready to perform its next
  operation.
  Which resources will be required and allocated.
Threads can be created, terminated or
destroyed, suspended, and stored.

                                                   35
Shamrock Notre Dame

The material is based on:
  Kogge PM, Brockman JB, Sterling T, and Gao
  G, Processing in memory: chips to petaflops.
  Presented at a workshop at ISCA '97.




                                                 36
Shamrock Notre Dame

A PIM architecture developed at the
University of Notre Dame.




                                      37
Shamrock Architecture Overview

[Figure 2 from Kogge et al.: Shamrock architecture overview]




                                   38
Node Structure
[Figure 3 from Kogge et al.: node structure]




                               39
Node Structure

A node consists of CPU logic, four memory
arrays and data routing:
  Two separately addressable memories on each
  face of the CPU.
Each memory array is made up of multiple
sub-units:
  this allows all bits read from a memory in one cycle
  to be available at the same time to the CPU logic.

                                                  40
Shamrock Chip

Nodes are arranged in rows, staggered so
that the two memory arrays on one side of
one node impinge directly on the memory
arrays of two different nodes in the next row:
  Each CPU has a true shared-memory interface
  with four other CPUs.



                                                41
Shamrock Chip Architecture

A PIM architecture developed at the
University of Notre Dame.




                                      42
Shamrock Chip

Off-chip connections can:
  Connect to adjoining chips in the same tile
  topology.
  Allow a processor on the outside to view the
  chip as memory in the conventional sense.




                                                 43
Massively Parallel Architectures


 Questions?

   Colin Egan, Jason McGuiness and Michael Hicks
             (Compiler Technology and
       Computer Architecture Research Group)



                                                   44
