Today's computers, from the systems-on-chip in mobile phones to the most powerful supercomputers, are parallel. The scale ranges from the 8 cores of a mobile phone to the millions deployed by the largest supercomputers. The need to make the system's memory visible to every core is solved, regardless of scale, by interconnecting all the cores with adequate performance at bounded cost. In the smaller-scale systems (MPSoCs), memory is shared using on-chip networks that transport cache lines and coherence commands. A few MPSoCs are interconnected to form servers that use mechanisms to extend coherence and share memory under a CC-NUMA architecture. Tens of these servers are stacked in a rack, and a number of racks (up to hundreds) constitute a datacenter or a supercomputer. The resulting global memory cannot be shared, but its contents can be transferred by sending messages through the system network. Networks are therefore critical, foundational components of the memory architecture of computers of every class. This talk offers a reasoned view of the choices different manufacturers make when deploying the on-chip and system networks that interconnect today's computers.
Usage and Design Trends of Interconnection Networks in Parallel Computers - Prof. Ramón Beivide
Ramón Beivide
Universidad de Cantabria
Usage and Design Trends of Interconnection Networks in Parallel Computers
April 14, 2016
Universidad Complutense de Madrid
Outline
1. Introduction
2. Network Basics
3. System networks
4. On-chip networks (NoCs)
5. Some current research
1. Intro: MareNostrum
1. Intro: MareNostrum (BSC), InfiniBand FDR10 non-blocking Folded Clos (up to 40 racks)
[Figure: MareNostrum interconnect: Mellanox 648-port InfiniBand core switches connected by FDR10 links to 36-port FDR10 leaf switches across 40 iDataPlex racks (3360 dx360 M4 nodes); each leaf switch runs 3 links to most core switches and 2 to the last. Latency: 0.7 μs; bandwidth: 40 Gb/s.]
1. Intro: Multicore E5-2670 Xeon Processor
1. Intro: A row of servers in a Google DataCenter, 2012.
3. WSCs Array: Rack-mounted boards or blades + rack router
Figure 1.1: Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7' rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).
3. WSC Hierarchy
1. Intro: An Architectural Model
[Figure: an architectural model: CPUs (CPU1 … CPUn) issue loads/stores (L/S) through address translation units (ATU) and send/receive (S/R) interfaces; memories (M1 … Mn) attach likewise; one or more interconnection networks tie them together.]
1. Intro: What we need for one ExaFlop/s
Networks are pervasive and critical components in supercomputers, datacenters, servers and mobile computers.
Complexity is moving from system networks towards on-chip networks: fewer nodes, but more complex ones.
Outline
1. Introduction
2. Network Basics
Crossbars & Routers
Direct vs Indirect Networks
3. System networks
4. On-chip networks (NoCs)
5. Some current research
2. Network Basics
All networks are based on crossbar switches:
• Switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N²) (see the sketch below)
• Non-blocking: any of the N! input/output permutations can be realized
• Bidirectional, to exploit communication locality
• Minimize latency and maximize throughput
[Figure: two 8×8 crossbars, input ports 0–7 and output ports 0–7.]
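A quick, hypothetical Python sketch of the two claims above: quadratic crosspoint cost versus factorial permutation capability.

```python
from math import factorial

def crossbar_stats(n: int) -> dict:
    """Cost and capability of a radix-n crossbar: crosspoints grow as
    O(n^2), while a non-blocking crossbar realizes any of n! permutations."""
    return {"ports": n, "crosspoints": n * n, "permutations": factorial(n)}

for n in (8, 16, 32, 64):
    s = crossbar_stats(n)
    print(f"N={s['ports']:3d}  crosspoints={s['crosspoints']:5d}  "
          f"permutations={s['permutations']:.2e}")
```

The quadratic growth in crosspoints, wiring and arbitration is what prevents a single crossbar from scaling to thousands of ports, which motivates the blocking networks discussed next.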
2. Blocking vs. Non-blocking
• Cost reduction comes at the price of performance
  – Some networks are blocking: they cannot realize all N! I/O permutations
  – Contention is more likely to occur on network links
    › Paths from different sources to different destinations share one or more links
[Figure: a blocking topology, with a conflict on a shared link marked X, vs. a non-blocking topology; ports 0–7 on each side.]
2. Switch or Router Microarchitecture
[Figure: switch datapath: per-input link control and buffers (DEMUX), a routing control unit with header flit and forwarding table, a crossbar with crossbar control and an arbitration unit selecting the output port, and per-output buffers (MUX) and link control; stages 1–5 marked.]
IB (Input Buffering), RC (Route Computation), SA (Switch Arbitration), ST (Switch Traversal), OB (Output Buffering)
[Figure: pipelined traversal of a packet: the header flit flows through IB, RC, SA, ST and OB in consecutive cycles, while payload fragments follow through IB, ST and OB.]
Pipelined switch microarchitecture: the goal is to match the throughput of the internal switch datapath to the external link bandwidth.
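A minimal cycle-level sketch of that pipeline (hypothetical Python, simplifying away contention and the fact that body flits reuse the header's route computation and switch grant):

```python
# Occupancy of a 5-stage router pipeline (IB, RC, SA, ST, OB) by a 4-flit
# packet: one flit advances per stage per cycle, so after the 5-cycle header
# latency the router sustains one flit per cycle, matching link bandwidth.
STAGES = ["IB", "RC", "SA", "ST", "OB"]

def occupancy(num_flits: int) -> dict:
    """Map cycle -> {stage: flit index} for an idealized pipelined traversal."""
    table = {}
    for flit in range(num_flits):
        for s, stage in enumerate(STAGES):
            table.setdefault(flit + s, {})[stage] = flit
    return table

table = occupancy(4)
for cycle in sorted(table):
    cells = "  ".join(f"{st}=f{fl}" for st, fl in table[cycle].items())
    print(f"cycle {cycle}: {cells}")
```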
2. Network Organization
[Figure: switches and end nodes arranged as indirect (centralized) and direct (distributed) networks.]
2. Previous Myrinet core switches (Indirect, Centralized)
2. IBM BG/Q (Direct, Distributed)
2. Network Organization
[Figure: a 64-node system built from 8-port switches with concentration c = 4, vs. a 32-node system built from 8-port switches.]
• As crossbars do not scale, they must be interconnected to serve a growing number of endpoints.
• Direct (Distributed) vs. Indirect (Centralized) networks.
• Concentration can be used to reduce network costs (see the sketch below):
  – "c" end nodes connect to each switch
  – Allows larger systems to be built from fewer switches and links
  – Requires a larger switch degree
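A back-of-the-envelope sketch of the trade-off (hypothetical function name): with c end nodes per switch, N endpoints need only ⌈N/c⌉ edge switches, at the cost of c ports per switch spent on nodes rather than on the network.

```python
# Effect of concentration c on switch count: N endpoints, c nodes per switch.
def edge_switches(n_endpoints: int, c: int) -> int:
    """Switches needed when c end nodes share each switch (ceiling division)."""
    return -(-n_endpoints // c)

for c in (1, 2, 4):
    print(f"c={c}: 64 endpoints -> {edge_switches(64, c)} switches, "
          f"{c} port(s) per switch devoted to nodes")
```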
Outline
1. Introduction
2. Network Basics
3. System networks
Folded Clos
Tori
Dragonflies
4. On-chip networks (NoCs)
5. Some current research
3. MareNostrum (BSC), InfiniBand FDR10 non-blocking Folded Clos (up to 40 racks)
[Figure: the same MareNostrum Folded Clos diagram shown in the introduction: Mellanox 648-port InfiniBand core switches, 36-port FDR10 leaf switches, 40 iDataPlex racks / 3360 dx360 M4 nodes; latency 0.7 μs, bandwidth 40 Gb/s.]
3. Network Topology
Centralized Switched (Indirect) Networks
• Bidirectional MINs:
  • Increase modularity
  • Reduce hop count, d
• Folded Clos network:
  – Nodes at the tree leaves
  – Switches at the tree vertices
  – Total link bandwidth is constant across all tree levels, giving full bisection bandwidth
Folded Clos = folded Beneš ≠ fat-tree network!
[Figure: a 16-node folded Clos, end nodes 0–15 at the leaves; the network bisection is marked.]
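As a sanity check on scale, a hypothetical sketch of the standard folded-Clos sizing relation N = 2(k/2)^l for radix-k switches and l switch levels:

```python
# Maximum endpoints of a folded Clos built from radix-k switches with
# l levels, keeping full bisection bandwidth: N = 2 * (k/2)^l.
def folded_clos_endpoints(k: int, levels: int) -> int:
    return 2 * (k // 2) ** levels

for levels in (1, 2, 3):
    n = folded_clos_endpoints(36, levels)
    print(f"k=36, l={levels}: up to {n} endpoints")
# l=2 yields 648: a 648-port core switch is itself a two-level folded
# Clos of 36-port switches, as in the MareNostrum diagrams above.
```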
3. Other DIRECT System Network Topologies
Distributed Switched (Direct) Networks
[Figure: a 2D torus of 16 nodes; a hypercube of 16 nodes (16 = 2⁴, so n = 4); a 2D mesh or grid of 16 nodes; the network bisection is marked.]
Network bisection ≤ full bisection bandwidth!
3. IBM BlueGene/L/P Network
Prismatic 32x32x64 Torus (mixed-radix networks)
BlueGene/P: 32x32x72 in maximum configuration
Mixed-radix prismatic tori are also used by Cray.
3. IBM BG/Q
3. BG Network Routing
[Figure: torus wiring: X, Y and Z wires.]
Adaptive Bubble Routing (ATC-UC Research Group)
3. Fujitsu Tofu Network
3. More Recent Network Topologies
Distributed Switched (Direct) Networks
• Fully-connected network: all nodes are directly connected to all other nodes using bidirectional dedicated links
[Figure: a fully connected network of 8 nodes, 0–7.]
3. IBM PERCS
Organized as groups of routers.
Parameters:
• a: routers per group
• p: nodes per router
• h: global links per router
• Well-balanced dragonfly [1]: a = 2p = 2h
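From these parameters the standard sizing of [1] follows: a group exposes a·h global links, a fully connected inter-group graph allows g = a·h + 1 groups, and the machine hosts N = p·a·(a·h + 1) nodes. A hypothetical sketch under the balanced choice a = 2p = 2h:

```python
# Dragonfly sizing [1]: a routers/group, p nodes/router, h global links/router,
# with complete graphs both inside each group and between groups.
def dragonfly_size(h: int):
    a, p = 2 * h, h            # well-balanced dragonfly: a = 2p = 2h
    groups = a * h + 1         # a*h global links per group, one per peer group
    nodes = p * a * groups
    return a, p, groups, nodes

for h in (2, 4, 8):
    a, p, g, n = dragonfly_size(h)
    print(f"h={h}: a={a}, p={p}, groups={g}, nodes={n}")
# h=8 already yields 16,512 nodes from routers of radix p + (a-1) + h = 31.
```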
3. Dragonfly Interconnection Network
Inter-group: global links, complete graph.
Intra-group: local links, complete graph.
Minimal routing:
• Longest path 3 hops: local-global-local
• Good performance under uniform (UN) traffic
Adversarial traffic [1]:
• ADV+N: nodes in group i send traffic to group i+N
• Saturates the single global link between the two groups
3. Dragonfly Interconnection Network
[Figure: under ADV+N, traffic from source group i to destination group i+N saturates the global link between them.]
[1] J. Kim, W. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," ISCA '08.
Valiant Routing [2]:
• Randomly selects an intermediate group through which packets are misrouted
• Avoids the saturated channel
• Longest path 5 hops: local-global-local-global-local
3. Dragonfly Interconnection Network
[Figure: Valiant routing: traffic from the source node reaches the destination node through a randomly chosen intermediate group.]
[2] L. Valiant, "A scheme for fast parallel communication," SIAM Journal on Computing, vol. 11, p. 350, 1982.
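A minimal sketch of the group-level decision (hypothetical names; real routers decide per packet, often adaptively between the two options):

```python
import random

def dragonfly_groups(src: int, dst: int, n_groups: int, valiant: bool) -> list:
    """Group-level path: minimal (src -> dst) or Valiant via a random
    intermediate group, spreading ADV+N load off the saturated global link."""
    if not valiant or src == dst:
        return [src, dst]
    mid = random.choice([g for g in range(n_groups) if g not in (src, dst)])
    return [src, mid, dst]   # up to 5 router hops: l-g-l-g-l

random.seed(0)
print(dragonfly_groups(0, 3, 9, valiant=False))  # [0, 3] (minimal)
print(dragonfly_groups(0, 3, 9, valiant=True))   # e.g. [0, 7, 3]
```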
3. Cray Cascade, electrical supernode
3. Cray Cascade, system and routing
Outline
1. Introduction
2. Network Basics
3. System networks
4. On-chip networks (NoCs)
Rings
Meshes
5. Some current research
4. On-Chip local interconnects
[Figure: SEM photo of the local interconnect levels.]
4. Bumps & Balls
4. 3D (& 2.5D) Stacking & Silicon Photonics
Multiple integration with 3D stacking…
[Image credits: "3M, IBM team to develop 3D IC adhesive," EETimes India; STMicroelectronics & CEA.]
4. Rings (Direct or Indirect?)
[Figure: a folded ring lowers the maximum physical link length.]
• Bidirectional (folded) ring networks:
  – N switches (3 × 3) and N bidirectional network links
  – Simultaneous packet transport over disjoint paths
  – Packets must hop across intermediate nodes
  – The shortest direction is usually selected (N/4 hops on average; see the sketch below)
  – Bisection bandwidth???
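A small hypothetical sketch answers both points: averaging the shorter of the two directions gives roughly N/4 hops, and cutting a ring in half always severs exactly two bidirectional links, so the bisection bandwidth stays constant as N grows.

```python
# Average shortest-path distance and bisection of an N-node bidirectional ring.
def ring_stats(n: int):
    dists = [min(d, n - d) for d in range(1, n)]  # node 0 to every other node
    avg = sum(dists) / (n - 1)
    return avg, 2                                 # any even cut severs 2 links

for n in (8, 16, 64):
    avg, bb = ring_stats(n)
    print(f"N={n:2d}: avg distance {avg:.2f} (~N/4 = {n/4}), "
          f"bisection = {bb} bidirectional links")
```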
4. Meshes and Tori
Distributed Switched (Direct) Networks
[Figure: a 2D torus of 16 nodes and a 2D mesh or grid of 16 nodes; the network bisection is marked.]
4. Meshes from Tilera
4. Mesh from the Phytium Mars Architecture
These images were taken from the slides presented at Hot Chips 2015.
• L1:
– Separated L1 Icache and L1 Dcache
– 32 KB Icache
– 32 KB Dcache
• 6 outstanding loads
• 4 cycles latency from load to use
• L2:
– 16 L2 banks of 4 MB
– 32 MB of shared L2
• L3:
– 8 L3 arrays of 16 MB
– 128 MB of L3
• Memory Controllers:
– 16 DDR3-1600 channels
• 2x16-lane PCIe-3.0
• Directory-based cache coherence
  – 16 Directory Control Units (DCUs)
  – MOESI-like cache coherence protocol
4. Phytium Mars NoC
This image was taken from the slides presented at Hot Chips 2015.
• Switches with 6 bidirectional ports
• 4 physical channels for cache coherence
• 3 cycles per hop
• 384 GB/s per cell
4. Meshes from Intel Knights Landing
Intel Knights Landing – 3 options
Outline
1. Introduction
2. Network Basics
3. System networks
4. On-chip networks (NoCs)
5. Some current research
5. Some research on NUCA-based CMP Models
5. Full-system simulation including concentration
GEM5 + BookSim full-system simulation platform parameters:

ISA:                       x86
Number of cores:           64
CPU model:                 out of order
CPU frequency:             2 GHz
Cache coherence protocol:  MESI
L1 instruction cache:      32 KB
L1 data cache:             64 KB
Shared distributed L2:     256 KB per core
Memory controllers:        4
Network frequency:         1 GHz
Router pipeline stages:    4
Physical networks:         3
Buffer size:               10 flits
Link width:                64 bits
Topologies:                8x8 mesh, torus and FBFLY; 4x4 FBFLY with c = 4
Applications:              PARSEC benchmarks
5. Topology comparison
Three different topologies are considered:
Metrics compared: degree (ports, lower is better), diameter (maximum distance, lower is better; 2 for the FBFLY), average distance (lower is better) and bisection bandwidth in links (higher is better); the numeric values for N = 16 are reproduced by the sketch below.

Topology       2D Mesh                      2D Torus               2D FBFLY
Advantages     Low degree; shortest links   Low degree; symmetry;  Symmetry; best properties;
                                            better properties      larger concentration
Disadvantages  Largest distances;           Folding; deadlock      Highest costs;
               lowest BB                                           non-uniform link lengths

N = 16
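The numeric rows can be recomputed with a small BFS sketch over the three 16-router topologies (hypothetical script; it counts router-to-router links only and ignores concentration):

```python
from collections import deque
from itertools import product

K = 4  # 4x4 networks: N = 16 routers

def neighbors(x, y, kind):
    """Router-to-router links for a 2D mesh, 2D torus and 2D flattened butterfly."""
    if kind == "mesh":
        cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [(a, b) for a, b in cand if 0 <= a < K and 0 <= b < K]
    if kind == "torus":
        return [((x - 1) % K, y), ((x + 1) % K, y),
                (x, (y - 1) % K), (x, (y + 1) % K)]
    # FBFLY: every router links to all routers in its row and in its column
    return [(a, y) for a in range(K) if a != x] + \
           [(x, b) for b in range(K) if b != y]

def stats(kind):
    nodes = list(product(range(K), repeat=2))
    dists, diameter = [], 0
    for s in nodes:                       # BFS from every router
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(*u, kind):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dists += [d for node, d in dist.items() if node != s]
        diameter = max(diameter, max(dist.values()))
    degree = max(len(neighbors(x, y, kind)) for x, y in nodes)
    bisection = sum(1 for (x, y) in nodes if x < K // 2
                    for (a, b) in neighbors(x, y, kind) if a >= K // 2)
    return degree, diameter, sum(dists) / len(dists), bisection

for kind in ("mesh", "torus", "fbfly"):
    deg, diam, avg, bb = stats(kind)
    print(f"{kind:5s}: degree={deg} diameter={diam} "
          f"avg={avg:.2f} bisection={bb} links")
```

For these 16-router instances the sketch gives mesh (degree 4, diameter 6, bisection 4), torus (4, 4, 8) and FBFLY (6, 2, 16), matching the qualitative rows above.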
5. Full-system simulation
Normalized execution times and network latencies:
• Average network latency impacts the AMAT.
• High latencies can degrade execution time if the affected data are critical.
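For reference, AMAT here is the usual average memory access time; in a NUCA CMP the network round trip enters the remote-bank and memory terms. A sketch of the decomposition (symbols illustrative, not from the talk):

```latex
\[
\mathrm{AMAT} \;=\; t_{L1} \;+\; m_{L1}\bigl(t_{L2} + t_{\mathrm{net}}
          \;+\; m_{L2}\,(t_{\mathrm{mem}} + t_{\mathrm{net}})\bigr)
\]
```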
5. Router Power and Area
Router leakage power and area evaluation:
• Buffers are the most consuming part of the router.
• Crossbar and allocator power and area grow quadratically with the number of ports.
• The load in these simulations is low; hence, leakage power dominates.
5. Router Power and Area
Network leakage power evaluation:
• The FBFLY can support higher concentration thanks to its higher bisection bandwidth.
5. OmpSs vs. pThreads