3. The memory wall
High performance computing can be achieved through instruction-level parallelism (ILP) by:
Pipelining,
Multiple instruction issue (MII).
But both of these techniques can cause a bottleneck in the memory subsystem, which may impact on overall system performance.
The memory wall.
4. Overcoming the memory wall
We can improve data throughput and data storage between the CPU and the memory subsystem by introducing extra levels into the memory hierarchy:
L1, L2 and L3 caches.
5. Overcoming the memory wall
But …
This increases the penalty associated with a miss in the memory subsystem, which impacts on the average cache access time.
Where:
average cache access time = hit-time + miss-rate * miss-penalty
For two levels of cache:
average access time = hit-time(L1) + miss-rate(L1) * (hit-time(L2) + miss-rate(L2) * miss-penalty(L2))
because
miss-penalty(L1) = hit-time(L2) + miss-rate(L2) * miss-penalty(L2)
For three levels of cache:
average access time = hit-time(L1) + miss-rate(L1) * (hit-time(L2) + miss-rate(L2) * (hit-time(L3) + miss-rate(L3) * miss-penalty(L3)))
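As a concrete illustration, a minimal sketch in C that evaluates the two-level formula for some assumed (purely illustrative, not measured) cache parameters:

    #include <stdio.h>

    /* Average memory access time (AMAT) for a two-level cache hierarchy.
       Times are in cycles; rates are fractions in [0, 1].
       The L1 miss penalty is itself the L2 average access time. */
    static double amat_two_level(double hit_l1, double miss_rate_l1,
                                 double hit_l2, double miss_rate_l2,
                                 double miss_penalty_l2)
    {
        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
        return hit_l1 + miss_rate_l1 * miss_penalty_l1;
    }

    int main(void)
    {
        /* Illustrative numbers only (not taken from the slides):
           1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
           20% L2 miss rate, 200-cycle main-memory access. */
        printf("AMAT = %.2f cycles\n",
               amat_two_level(1.0, 0.05, 10.0, 0.20, 200.0));
        return 0;
    }

With these assumed numbers the result is 1 + 0.05 * (10 + 0.2 * 200) = 3.5 cycles.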
6. Overcoming the memory wall
The average cache access time is therefore dependent on
three criteria: hit-time, miss-rate and miss-penalty.
Increasing the number of levels in the memory hierarchy
does not improve main memory access time.
Recovery from a cache miss causes a main memory bottleneck, since main memory cannot keep up with the required fetch rate.
7. Overcoming the memory wall
The overall effect:
can limit the amount of ILP,
can impact on processor performance,
increases architectural design complexity,
increases hardware cost.
8. Overcoming the memory wall
So perhaps we should improve the performance of the memory subsystem.
Or we could place processors in memory:
Processing In Memory (PIM).
9. Processing in memory
The idea of PIM is to overcome the bottleneck
between the processor and main memory by
combining a processor and memory on a single
chip.
The benefits of PIM are:
reduced latency,
a simplification of the memory hierarchy,
high bandwidth,
multi-processor scaling capabilities:
Cellular architectures.
10. Processing in memory
Reduced latency:
In PIM architectures, accesses from the processor to memory:
do not have to go through multiple memory levels (caches) to go off-chip,
do not have to travel along a printed circuit line,
do not have to be reconverted back down to logic levels,
do not have to be re-synchronised with a local clock.
Currently there is about an order of magnitude reduction in latency, thereby overcoming the memory wall.
11. Processing in memory
Beneficial side-effects of reducing latency:
total silicon area is reduced,
power consumption is reduced.
12. Processing in memory
But …
processor speed is reduced,
the amount of available memory is reduced.
13. Cellular architectures
PIM is easily scaled:
multiple PIM chips can be connected together to form a network of PIM cells (a Cellular architecture), overcoming these two problems.
14. Cellular architectures
Cellular architectures are threaded:
each thread unit is independent of all other
thread units,
each thread unit serves as a single in-order issue
processor,
each thread unit shares computationally expensive hardware, such as floating-point units, with the other thread units,
there can be a large number (1,000s) of thread
units – therefore they are massively parallel.
15. Cellular architectures
Cellular architectures have irregular memory
access:
some memory is very close to the thread units
and is extremely fast,
some is off-chip and slow.
Cellular architectures, therefore, use caches
and have a memory hierarchy.
16. Cellular architectures
In Cellular architectures multiple thread units
perform memory accesses independently.
This means that the memory subsystem of Cellular architectures requires some form of memory access model that permits memory accesses to be served efficiently.
18. Cyclops
Developed by IBM at the Thomas J. Watson Research Center.
Also called BlueGene/C, by analogy with the later BlueGene/L.
19. Logical View of the Cyclops64 Chip Architecture
[Figure: a Cyclops64 board. The chip contains processors (each with thread units (TU), a floating-point unit (FPU) and shared registers (SR)) connected by a crossbar network to on-chip memory banks and scratch-pad memories (SP). Four off-chip memory modules attach at 4 GB/sec each, other chips connect via a 3D mesh at 6 * 4 GB/sec, an IDE hard drive attaches at 50 MB/sec, and a 1 Gbit/sec link is also labelled.]
Processors: 80 / chip
Thread Units (TU): 2 / processor
Floating-Point Unit (FPU): 1 / processor
Shared Registers (SR): 0 / processor
Scratch-Pad Memory (SP): n KB / TU
On-chip Memory (SRAM): 160 x 28KB = 4480 KB
Off-chip Memory (DDR2): 4 x 256MB = 1GB
Hard Drive (IDE): 1 x 120GB = 120 GB
SP is located in the memory banks and is accessed directly by its own TU, and via the crossbar network by the other TUs.
20. Note on scratch-pad memories (annotation to slide 19)
There is one memory bank (MB) per TU on the chip. One part of the MB is reserved for the SP of the corresponding TU. The size of the SP is customizable.
A TU can access its own SP directly, without using the crossbar network, with a latency of 3/2 (ld/st). The other TUs can access the SP of another TU by accessing the corresponding MB through the network, with a latency of 22/11 (ld/st). Even TUs on the same processor must use the network to access each other's SPs.
A memory bank can support only one access per cycle. Therefore, if a TU accesses its SP while another TU accesses the same SP through the network, arbitration occurs and one of the two accesses is delayed.
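A minimal sketch of this latency model in C (the 3/2 and 22/11 figures come from the note above; treating them as cycle counts, and everything else here, is an illustrative assumption):

    #include <stdbool.h>
    #include <stdio.h>

    /* Latency of a scratch-pad access, following the 3/2 and 22/11
       (ld/st) figures quoted in the note above. */
    static int sp_latency(bool local, bool is_load)
    {
        if (local)
            return is_load ? 3 : 2;   /* TU accessing its own SP directly */
        return is_load ? 22 : 11;     /* any other TU, via the crossbar   */
    }

    int main(void)
    {
        printf("local load: %d cycles, remote store: %d cycles\n",
               sp_latency(true, true), sp_latency(false, false));
        return 0;
    }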
21. DIMES
Delaware Interactive Multiprocessor
Emulation System (DIMES)
under development by the Computer
Architecture and Parallel Systems
Laboratory (CAPSL) at the University of
Delaware, Newark, DE, USA, under the leadership of Prof. Guang Gao.
22. DIMES
DIMES
is the first hardware implementation of a Cellular
architecture,
is a simplified ‘cut-down’ version of Cyclops,
is a hardware validation tool for Cellular architectures,
emulates Cellular architectures, in particular Cyclops,
cycle by cycle,
is implemented on at least one FPGA,
has been evaluated by Jason McGuiness.
23. DIMES
The DIMES implementation that Jason evaluated:
supports a P-thread programming model,
is a dual processor, where each processor has four thread units,
has 4K of scratch-pad (local) memory per thread unit,
has two banks of 64K global shared memory,
has different memory models:
scratch-pad memory obeys the program consistency model for all eight thread units,
global memory obeys the sequential consistency model for all eight thread units (illustrated in the sketch after this list),
is called DIMES/P2.
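To illustrate what a sequential consistency guarantee on global memory buys, here is a minimal Dekker-style sketch using C11 atomics and Pthreads (hypothetical host code, not DIMES-specific; seq_cst atomics stand in for DIMES's sequentially consistent global memory; compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Dekker-style test: under sequential consistency, at least one
       thread must observe the other's store, so r0 == 0 && r1 == 0 is
       impossible; a weaker memory model would permit that outcome. */
    static atomic_int x, y;
    static int r0, r1;

    static void *t0(void *arg) { (void)arg; atomic_store(&x, 1); r0 = atomic_load(&y); return NULL; }
    static void *t1(void *arg) { (void)arg; atomic_store(&y, 1); r1 = atomic_load(&x); return NULL; }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r0=%d r1=%d\n", r0, r1);  /* never "r0=0 r1=0" */
        return 0;
    }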
24. DIMES
[Figure: DIMES/P2. Two processors (Processor 0 and Processor 1), each containing four thread units (Thread Unit 0-3), each thread unit with 4K of scratch-pad memory. Each processor also has a 64K bank of global memory; the two processors are connected by a network.]
25. Gilgamesh
This material is based on:
Sterling TL and Zima HP, Gilgamesh: A
Multithreaded Processor-In-Memory
Architecture for Petaflops Computing, IEEE,
2002.
26. Gilgamesh
A PIM-based massively parallel architecture.
It extends existing PIM capabilities by:
Incorporating advanced mechanisms for
virtualizing tasks and data.
Providing adaptive resource management for
load balancing and latency tolerance.
27. Gilgamesh System Architecture
Collection of MIND chips interconnected by
a system network
MIND: Memory, Intelligence, Network Device
A MIND chip contains:
Multiple banks of DRAM storage
Processing logic
External I/O interface
Inter-chip communication channels
29. MIND Chip
The MIND chip is multithreaded.
MIND chips communicate via parcels:
a parcel can be used to perform a conventional operation as well as to invoke methods on a remote chip.
31. MIND Chip Architecture
Nodes provide storage, information processing and
execution control.
Local wide registers serve as thread state, instruction cache and vector-memory row buffer.
The wide ALU performs multiple operations on separate fields simultaneously.
System memory bus interface connects MIND to
workstation and server motherboard.
Data streaming interface deals with rapid high data
bandwidth movement.
On-chip communication interconnects allow sharing of
pipelined FPUs among nodes.
33. MIND Node Structure
A MIND node integrates a memory block with processing logic and control mechanisms.
The wide ALU is 256 bits wide, supporting multiple simultaneous operations on separate fields (see the sketch below).
The ALU does not provide explicit floating-point operations.
A permutation network rearranges the bytes of a complete row.
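As a rough software analogy of such field-parallel operation, a sketch in C treating a 256-bit word as four 64-bit fields (the field width and the choice of an add are assumptions for illustration, not the MIND ISA):

    #include <stdint.h>
    #include <stdio.h>

    /* A 256-bit "wide word" modelled as four 64-bit fields. */
    typedef struct { uint64_t field[4]; } wide256;

    /* One wide-ALU-style operation: the same add applied to every
       field of the wide word (sequentially here, in software). */
    static wide256 wide_add(wide256 a, wide256 b)
    {
        wide256 r;
        for (int i = 0; i < 4; i++)
            r.field[i] = a.field[i] + b.field[i];
        return r;
    }

    int main(void)
    {
        wide256 a = {{1, 2, 3, 4}}, b = {{10, 20, 30, 40}};
        wide256 c = wide_add(a, b);
        printf("%llu %llu %llu %llu\n",
               (unsigned long long)c.field[0], (unsigned long long)c.field[1],
               (unsigned long long)c.field[2], (unsigned long long)c.field[3]);
        return 0;
    }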
34. MIND PIM Architecture
A high degree of memory bandwidth is achieved by:
accessing an entire memory row at a time: data parallelism,
partitioning the total memory into multiple independently accessible blocks: multiple memory banks (see the sketch after this list).
Therefore:
memory bandwidth is square order over an off-chip architecture (the row-wide access and the bank count multiply together),
latency is low.
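A minimal sketch of the bank-partitioning idea in C (the bank count, row size and address layout are illustrative assumptions, not MIND's actual parameters):

    #include <stdint.h>
    #include <stdio.h>

    enum { NUM_BANKS = 8, ROW_BYTES = 32 };   /* illustrative values only */

    /* Row-interleaved address decode: consecutive rows land in different
       banks, so independent accesses can proceed in parallel, and each
       access fetches an entire row at a time. */
    static unsigned bank_of(uint64_t addr) { return (addr / ROW_BYTES) % NUM_BANKS; }
    static uint64_t row_of(uint64_t addr)  { return (addr / ROW_BYTES) / NUM_BANKS; }

    int main(void)
    {
        for (uint64_t addr = 0; addr < 8 * ROW_BYTES; addr += ROW_BYTES)
            printf("addr %3llu -> bank %u, row %llu\n",
                   (unsigned long long)addr, bank_of(addr),
                   (unsigned long long)row_of(addr));
        return 0;
    }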
35. Parcels
A parcel is a variable-length communication packet that contains sufficient information to perform a remote procedure invocation.
It is targeted at a virtual address, which may identify individual variables, blocks of data, structures, objects, threads, or I/O streams.
The main parcel fields are: destination address, parcel action specifier, parcel arguments, continuation, and housekeeping (sketched below).
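As a sketch, the parcel fields listed above could be laid out as a C struct (every type and size below is an invented illustration; only the five field names come from the slide):

    #include <stdint.h>

    /* Illustrative layout of a parcel; the five fields mirror the list
       above, but all types and sizes here are assumptions. */
    typedef struct {
        uint64_t destination;   /* virtual address: variable, block,
                                   structure, object, thread or I/O stream */
        uint32_t action;        /* parcel action specifier */
        uint32_t housekeeping;  /* e.g. length, sequencing, error control */
        uint64_t continuation;  /* where control/results flow afterwards */
        uint8_t  argument[];    /* variable-length argument payload */
    } parcel;

The flexible array member at the end is what makes the packet variable-length.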
36. Multithreaded Execution
Each thread resides in a wide register of a MIND node.
The multithread controller determines:
when a thread is ready to perform its next operation,
which resources will be required and allocated.
Threads can be created, terminated or destroyed, suspended, and stored (see the sketch below).
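A tiny sketch of that lifecycle as a state machine in C (the READY and RUNNING states are assumptions; the others are the states named above):

    #include <stdbool.h>

    /* Thread lifecycle states from the slide (created, suspended,
       stored, terminated/destroyed); READY and RUNNING are assumed. */
    typedef enum { CREATED, READY, RUNNING, SUSPENDED, STORED, TERMINATED } tstate;

    /* The multithread controller's decision, reduced to its essence:
       may this thread advance to its next operation? */
    static bool ready_to_issue(tstate s, bool resources_allocated)
    {
        return s == READY && resources_allocated;
    }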
37. Shamrock Notre Dame
This material is based on:
Kogge PM, Brockman JB, Sterling T and Gao G, Processing in Memory: Chips to Petaflops. Presented at a workshop at ISCA '97.
41. Node Structure
A node consists of logic for a CPU, four arrays of memory, and data routing:
two separately addressable memories on each face of the CPU.
Each memory array is made up of multiple
sub-units:
allows all bits read from a memory in one cycle
to be available at the same time to the CPU logic.
42. Shamrock Chip
Nodes are arranged in rows, staggered so that the two memory arrays on one side of one node impinge directly on the memory arrays of two different nodes in the next row:
each CPU has a true shared-memory interface with four other CPUs.
44. Shamrock Chip
Off-chip connections can:
Connect to adjoining chips in the same tile
topology.
Allow a processor on the outside to view the
chip as memory in the conventional sense.
45. Massively Parallel Architectures
Questions?
Colin Egan, Jason McGuiness and Michael Hicks
(Compiler Technology and
Computer Architecture Research Group)