SlideShare ist ein Scribd-Unternehmen logo
1 von 84
Downloaden Sie, um offline zu lesen
spcl.inf.ethz.ch
@spcl_eth
T. HOEFLER
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
HPC Advisory Council Swiss Workshop, Lugano, Switzerland
WITH HELP OF ROBERT GERSTENBERGER, MACIEJ BESTA, S. DI GIROLAMO, K. TARANOV, R. E. GRANT, R. BRIGHTWELL AND ALL OF SPCL
https://eurompi19.inf.ethz.ch
Submit papers by April 15th!
spcl.inf.ethz.ch
@spcl_eth
2
The Development of High-Performance Networking Interfaces
1980 1990 2000 2010 2020
Ethernet+TCP/IP
Scalable Coherent Interface Myrinet GM+MX
Fast Messages Quadrics QsNet
Virtual Interface Architecture
IB Verbs
OFED libfabric
Portals 4
sockets
coherent memory access
(active) message based
Cray Gemini
remote direct memory access (RDMA)
triggered operationsOS bypass
protocol offload
zero copy
businessinsider.com
95 / top-100 systems use RDMA
>285 / top-500 systems use RDMA
June 2017
spcl.inf.ethz.ch
@spcl_eth
▪ MPI-3.0 supports RMA (“MPI One Sided”)
▪ Designed to react to hardware trends
▪ Majority of HPC networks support RDMA
3
RDMA as MPI-3.0 REMOTE MEMORY ACCESS TRANSPORT
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
spcl.inf.ethz.ch
@spcl_eth
▪ MPI-3.0 supports RMA (“MPI One Sided”)
▪ Designed to react to hardware trends
▪ Majority of HPC networks support RDMA
▪ Communication is „one sided” (no involvement of destination)
▪ RMA decouples communication & synchronization
▪ Different from message passing
4
MPI-3.0 REMOTE MEMORY ACCESS
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
Proc A Proc B
send
recv
Proc A Proc B
put
two sided one sided
CommunicationCommunication
+
Synchronization Synchronizationsync
spcl.inf.ethz.ch
@spcl_eth
5
PRESENTATION OVERVIEW
5. Application evaluation
1. Overview of three
MPI-3 RMA concepts
2. MPI window creation 3. Communication
4. Synchronization6. Post-RDMA networking
Outlook!
spcl.inf.ethz.ch
@spcl_eth
6
MPI-3 RMA COMMUNICATION OVERVIEW
Process A (passive)
Memory
MPI window
Process B (active)
Process C (active)
Put
GetAtomic
Non-atomic
communication
calls (put, get)
Atomic communication calls
(Acc, Get & Acc, CAS, FAO)
Memory
MPI window
…
Process D (active)
…
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
7
MPI-3 RMA COMMUNICATION OVERVIEW
Process A (passive)
Memory
MPI window
Process B (active)
Process C (active)
Put
GetAtomic
Non-atomic
communication
calls (put, get)
Atomic communication calls
(Acc, Get & Acc, CAS, FAO)
Memory
MPI window
…
Process D (active)
…
…
…
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
8
MPI-3 RMA COMMUNICATION OVERVIEW
Process A (passive)
Memory
MPI window
Process B (active)
Process C (active)
Put
GetAtomic
Non-atomic
communication
calls (put, get)
Atomic communication calls
(Acc, Get & Acc, CAS, FAO)
Memory
MPI window
…
Process D (active)
…
…
…
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
9
MPI-3 RMA COMMUNICATION OVERVIEW
Process A (passive)
Memory
MPI window
Process B (active)
Process C (active)
Put
GetAtomic
Non-atomic
communication
calls (put, get)
Atomic communication calls
(Acc, Get & Acc, CAS, FAO)
Memory
MPI window
…
Process D (active)
…
…
…
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
10
MPI-3 RMA COMMUNICATION OVERVIEW
Process A (passive)
Memory
MPI window
Process B (active)
Process C (active)
Put
GetAtomic
Non-atomic
communication
calls (put, get)
Atomic communication calls
(Acc, Get & Acc, CAS, FAO)
Memory
MPI window
…
Process D (active)
…
…
…
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
11
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
12
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
13
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
14
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
15
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
spcl.inf.ethz.ch
@spcl_eth
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
16
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
17
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
18
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Window creation
Communication
Synchronization
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
▪ Window creation, communication and synchronization
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
▪ Window creation, communication and synchronization
▪ foMPI, a fully functional MPI-3 RMA implementation
▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems
▪ XPMEM: a portable Linux kernel module
19
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
20
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
▪ Window creation, communication and synchronization
▪ foMPI, a fully functional MPI-3 RMA implementation
▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems
▪ XPMEM: a portable Linux kernel module
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
21
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
▪ Window creation, communication and synchronization
▪ foMPI, a fully functional MPI-3 RMA implementation
▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems
▪ XPMEM: a portable Linux kernel module
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
22
PART 1: SCALABLE WINDOW CREATION
Traditional windows
backwards compatible
(MPI-2)
Time bound: 𝒪 𝑝
Memory bound: 𝒪 𝑝
𝑝 = total number
of processes
Process A
Memory
Process B
Memory
Process C
Memory
0x111
0x123
0x120
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
𝑝 = total number
of processes
23
PART 1: SCALABLE WINDOW CREATION
Allocated windows
Process A
Memory
Process B
Memory
Process C
Memory
Allows MPI
to allocate memory
Time bound: 𝒪 log 𝑝 (𝑤ℎ𝑝)
Memory bound: 𝒪 1
0x123 0x1230x123
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
Process A
Memory
Process B
Memory
Process C
Memory
𝑝 = total number
of processes
24
Local attach/detach
Most flexible
Time bound: 𝒪 𝑝
Memory bound: 𝒪 𝑝
0x129 0x129
0x111
0x123
0x120
PART 1: SCALABLE WINDOW CREATION
Dynamic windows
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Put and Get:
▪ Direct DMAPP put and get operations or
local (blocking) memcpy (XPMEM)
▪ Accumulate:
▪ DMAPP atomic operations for 64 bit types
▪ ...or fall back to remote locking protocol
▪ MPI datatype handling with MPITypes library [1]
▪ Fast path for contiguous data transfers of common
intrinsic datatypes (e.g., MPI_DOUBLE)
25
PART 2: COMMUNICATION
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM’09
Contiguous memory
MPI_Put
dmapp_put_nbi
Remote
process
…
MPI_Compare
_and_swap
dmapp_
acswap_qw_nbi
Remote
process
…
spcl.inf.ethz.ch
@spcl_eth
26
PERFORMANCE INTER-NODE: LATENCY
Put Inter-Node Get Inter-Node
20%
faster
80%
faster
Proc 0 Proc 1
put
sync memory
Half ping-pong
spcl.inf.ethz.ch
@spcl_eth
27
PERFORMANCE INTRA-NODE: LATENCY
Put/Get Intra-Node 3x
faster
Proc 0 Proc 1
put
sync memory
Half ping-pong
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
29
PERFORMANCE: MESSAGE RATE
Intra-NodeInter-Node
Proc 0 Proc 1
puts
Sync memory
...
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
30
PERFORMANCE: ATOMICS
64 bit integers
hardware-
accelerated
protocol:
fall back
protocol:
lower latency
higher
bandwidth
proprietary
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
31
PART 3: SYNCHRONIZATION
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
Node 0 Node 1
SCALABLE FENCE IMPLEMENTATION
▪ Collective call
▪ Completes all outstanding memory operations
Proc 2
put
Proc 0 Proc 1 Proc 3
int MPI_Win_fence(…) {
asm( mfence );
dmapp_gsync_wait();
MPI_Barrier(...);
return MPI_SUCCESS;
}
put
put
put
put
put
put
33Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
Node 0 Node 1
SCALABLE FENCE IMPLEMENTATION
▪ Collective call
▪ Completes all outstanding memory operations
Proc 2Proc 0 Proc 1 Proc 3
int MPI_Win_fence(…) {
asm( mfence );
dmapp_gsync_wait();
MPI_Barrier(...);
return MPI_SUCCESS;
}
put
put
put
33
Local completion
(XPMEM)
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
Node 0 Node 1
SCALABLE FENCE IMPLEMENTATION
▪ Collective call
▪ Completes all outstanding memory operations
Proc 2Proc 0 Proc 1 Proc 3
int MPI_Win_fence(…) {
asm( mfence );
dmapp_gsync_wait();
MPI_Barrier(...);
return MPI_SUCCESS;
}
33
Local completion
(DMAPP)
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
Node 0 Node 1
SCALABLE FENCE IMPLEMENTATION
▪ Collective call
▪ Completes all outstanding memory operations
Proc 2Proc 0 Proc 1 Proc 3
int MPI_Win_fence(…) {
asm( mfence );
dmapp_gsync_wait();
MPI_Barrier(...);
return MPI_SUCCESS;
}
33
Global
completion
barrier
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
36
SCALABLE FENCE PERFORMANCE
Time bound 𝒪 log 𝑝
Memory bound 𝒪 1
90%
faster
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
37
PSCW SYNCHRONIZATION
post
wait
start
complete
access
epoch
exposure
epoch
Proc 0 Proc 1
matching
algorithm
matching
algorithm
allows to
access other
processes
allows access
from other
processes
Posting
process
Puts
…
Starting
process
Puts
…
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
38
PSCW SYNCHRONIZATION
post
wait
start
complete
Proc 0 Proc 1
start
complete
start
complete
Proc 2 Proc 3
start
complete
post
wait
Proc 4 Proc 5
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ In general, there can be n posting and m
starting processes
▪ In this example there is one posting and
4 starting processes
39
PSCW SCALABLE POST/START MATCHING
Posting process
(opens its window)
j4
j1
j3
j2i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ In general, there can be n posting and m
starting processes
▪ In this example there is one posting and
4 starting processes
40
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
Starting processes
(access remote window)
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
41
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
Local list
▪ Each starting process has a local list
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
42
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
43
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
44
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
45
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
i
i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
46
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
i
i
i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
47
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
i
i
i
i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Posting process i adds its rank i to a
list at each starting process j1, . . . , j4
▪ Each starting process j waits
until the rank of the posting
process i is present in its local list
48
PSCW SCALABLE POST/START MATCHING
j4
j1
j3
j2i
i
i
i
i
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Each completing process increments a counter stored at the posting process
49
PSCW SCALABLE COMPLETE/WAIT MATCHING
ij4
j1
j3
j2
0
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Each completing process increments a counter stored at the posting process
50
PSCW SCALABLE COMPLETE/WAIT MATCHING
ij4
j1
j3
j2
1
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Each completing process increments a counter stored at the posting process
51
PSCW SCALABLE COMPLETE/WAIT MATCHING
ij4
j1
j3
j2
2
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
▪ Each completing process increments a counter stored at the posting process
52
PSCW SCALABLE COMPLETE/WAIT MATCHING
ij4
j1
j3
j2
3
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
53
PSCW SCALABLE COMPLETE/WAIT MATCHING
ij4
j1
j3
j2
4
▪ Each completing process increments a counter stored at the
posting process
▪ When the counter is equal to the number of starting processes,
the posting process returns from wait
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
54
PSCW PERFORMANCE
Time bound
𝒫𝑠𝑡𝑎𝑟𝑡 = 𝒫 𝑤𝑎𝑖𝑡 = 𝒪 1
𝒫𝑝𝑜𝑠𝑡 = 𝒫𝑐𝑜𝑚𝑝𝑙𝑒𝑡𝑒 = 𝒪 log 𝑝
Memory bound
𝒪 log 𝑝 (for scalable programs)
Ring Topology
Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
spcl.inf.ethz.ch
@spcl_eth
55
SCALABLE LOCK SYNCHRONIZATION
Active
process
Passive
process
Lock/Unlock
(shared/exclusive)
Lock All
(always shared)
P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC’16, best paper
spcl.inf.ethz.ch
@spcl_eth
▪ Guarantees remote completion
▪ Issues a remote bulk synchronization and an x86 mfence
▪ One of the most performance critical functions, we add only 78 x86 CPU
instructions to the critical path
64
FLUSH SYNCHRONIZATION
Time bound 𝒪 1
Memory bound 𝒪(1)
Process 0 Process 1
inc(counter)
0
counter:
…
inc(counter)
inc(counter)
spcl.inf.ethz.ch
@spcl_eth
▪ Guarantees remote completion
▪ Issues a remote bulk synchronization and an x86 mfence
▪ One of the most performance critical functions, we add only 78 x86 CPU
instructions to the critical path
64
FLUSH SYNCHRONIZATION
Time bound 𝒪 1
Memory bound 𝒪(1)
Process 0 Process 1
inc(counter)
counter:
…
inc(counter)
inc(counter)
3
flush
spcl.inf.ethz.ch
@spcl_eth
66
▪ Evaluation on Blue Waters System
▪ 22,640 computing Cray XE6 nodes
▪ 724,480 schedulable cores
▪ All microbenchmarks
▪ 4 applications
▪ One nearly full-scale run ☺
PERFORMANCE
spcl.inf.ethz.ch
@spcl_eth
67
PERFORMANCE: MOTIF APPLICATIONS
Key/Value Store:
Random Inserts per Second
Dynamic Sparse Data Exchange (DSDE)
with 6 neighbors
spcl.inf.ethz.ch
@spcl_eth
68
PERFORMANCE: APPLICATIONS
NAS 3D FFT [1] Performance MILC [2] Application Execution Time
Annotations represent performance gain of foMPI over Cray MPI-1.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS’12
scale
to 512k procs
scale
to 65k procs
spcl.inf.ethz.ch
@spcl_eth
69
The Development of High-Performance Networking Interfaces
1980 1990 2000 2010 2020
Ethernet+TCP/IP
Scalable Coherent Interface Myrinet GM+MX
Fast Messages Quadrics QsNet
Virtual Interface Architecture
IB Verbs
OFED libfabric
Portals 4
sockets
coherent memory access
(active) message based
Cray Gemini
remote direct memory access (RDMA)
triggered operationsOS bypass
protocol offload
zero copy
What next?
Fundamental development: CPU core speed cannot keep up with the growing
network bandwidths! Already today, cannot process packets at line rate!
spcl.inf.ethz.ch
@spcl_eth
Local Node
Main Memory
RDMA NIC
Core i7 Haswell
L3
L2
L1
Regs
PCIe bus
arriving
packets
34 cycles ~11.3ns
11 cycles ~ 3.6ns
4 cycles ~1.3ns
~ 250ns
125 cycles ~41.6ns
Data Processing in modern RDMA networks
DMA
Unit
RDMA
Processing
Remote Nodes (via network)
L2
L1
Regs
34 cycles ~11.3ns
11 cycles ~ 3.6ns
4 cycles ~1.3ns
Input buffer
Mellanox Connect-X5: 1 packet/5ns
Tomorrow (400G): 1 packet/1.2ns
70
spcl.inf.ethz.ch
@spcl_eth
71
The future of High-Performance Networking Interfaces
1980 1990 2000 2010 2020
Ethernet+TCP/IP
Scalable Coherent Interface Myrinet GM+MX
Fast Messages Quadrics QsNet
Virtual Interface Architecture
IB Verbs
OFED libfabric
Portals 4
sockets
coherent memory access
(active) message based
Cray Gemini
remote direct memory access (RDMA)
triggered operationsOS bypass
protocol offload
zero copy
95 / top-100 systems use RDMA
>285 / top-500 systems use RDMA
June 2017
fully
programmable
NIC acceleration
sPIN
Streaming Processing
In the Network
Established Principles for Compute Acceleration
PortabilityEase-of-use
ProgrammabilitySpecialization Libraries
Efficiency 4.0
spcl.inf.ethz.ch
@spcl_eth
arriving
packets
PacketScheduler
DMA Unit
manage
memory
upload
handlers
Fast shared memory
(packet input buffer)
HPU 1
HPU 3
HPU 0
HPU 2
R/W
MEM
CPU
sPIN NIC - Abstract Machine Model
72
spcl.inf.ethz.ch
@spcl_eth
RDMA vs. sPIN in action: Simple Ping Pong
73
Initiator Target
spcl.inf.ethz.ch
@spcl_eth
RDMA vs. sPIN in action: Streaming Ping Pong
74
Initiator Target
spcl.inf.ethz.ch
@spcl_eth
PacketScheduler
75
sPIN – Programming Interface
__handler int pp_header_handler(const ptl_header_t h, void *state) {
pingpong_info_t *i = state;
i->source = h.source_id;
return PROCESS_DATA; // execute payload handler to put from device
}
Header handler
__handler int pp_payload_handler(const ptl_payload_t p, void * state) {
pingpong_info_t *i = state;
PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->source, 10, 0, NULL, 0);
return SUCCESS;
}
Payload handler
__handler int pp_completion_handler(int dropped_bytes,
bool flow_control_triggered, void *state) {
return SUCCESS;
}
Completion handler
connect(peer, /* … */, &pp_header_handler, &pp_payload_handler, &pp_completion_handler);
Incoming message
Header
Payload
Tail
spcl.inf.ethz.ch
@spcl_eth
▪ sPIN is a programming abstraction, similar to CUDA or OpenCL combined with OFED or Portals 4
▪ It enables a large variety of NIC implementations!
▪ For example, massively multithreaded HPUs
Including warp-like scheduling strategies
▪ Main goal: sPIN must not obstruct line-rate
▪ Programmer must limit processing time per packet
Little’s Law: 500 instructions per handler, 2.5 GHz, IPC=1, 1 Tb/s → 25 kiB memory
▪ Relies on fast shared memory (processing in packet buffers)
Scratchpad or registers
▪ Quick (single-cycle) handler invocation on packet arrival
Pre-initialized memory & context
▪ Can be implemented in most RDMA NICs with a firmware update
▪ Or in software in programmable (Smart) NICs
76
Possible sPIN implementations
BCM58800 SoC
(Full Linux)
Innova Flex
(Kintex FPGA)
Catapult
(Virtex FPGA)
at 400G, process more than
833 million messages/s
spcl.inf.ethz.ch
@spcl_eth
Simulating a sPIN NIC – Ping Pong
77
RDMA
sPIN (stream)
▪ LogGOPSim v2 [1]: combine LogGOPSim (packet-level network)
with gem5 (cycle accurate CPU simulation)
▪ Network (LogGOPSim):
▪ Supports Portals 4 and MPI
▪ Parametrized for future InfiniBand
o=65ns (measured)
g=6.7ns (150 MM/s)
G=2.5ps (400 Gib/s)
Switch L=50ns (measured)
Wire L=33.4ns (10m cable)
▪ NIC HPU
▪ 2.5 GHz ARM Cortex A15 OOO
▪ ARMv8-A 32 bit ISA
▪ Single-cycle access SRAM (no DRAM)
▪ Header matching m=30ns, per packet 2ns
In parallel with g!
35% lower latency
32%higherBW
77
[1] S. Di Girolamo, K. Taranov, T. Schneider, E. Stalder, T. Hoefler, LogGOPSim+gem5: Simulating Network
Offload Engines Over Packet-Switched Networks. Presented at ExaMPI’17
Handlers cost:
18 instructions + 1 Put
Data Layout
Transformation
Network Group
Communication
Distributed Data
Management
spcl.inf.ethz.ch
@spcl_eth
Use Case 1: Broadcast acceleration
Message size: 8 Bytes
78
Network Group
Communication
Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004
RDMA
spcl.inf.ethz.ch
@spcl_eth
Offloaded collectives
(e.g., ConnectX-2, Portals 4)
7979
RDMA
Use Case 1: Broadcast acceleration Network Group
Communication
Underwood, K.D., et al., Enabling flexible collective communication offload with triggered operations. HOTI’11
Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004
Message size: 8 Bytes
spcl.inf.ethz.ch
@spcl_eth
sPIN
80
Handlers cost:
24 instructions + Log P Puts
Use Case 1: Broadcast acceleration Network Group
Communication
Underwood, K.D., et al., Enabling flexible collective communication offload with triggered operations. HOTI’11
Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004
Offloaded collectives
(e.g., ConnectX-2, Portals 4)
RDMA
Message size: 8 Bytes
spcl.inf.ethz.ch
@spcl_eth
Parity
Update
Use Case 2: RAID acceleration
RDMA
sPIN
81
Server Node Parity Node
Write
ACK
Parity
ACK
20% lower latency
176%higherBW
Handlers cost:
Server: 58 instructions + 1 Put
Parity: 46 instructions + 1 Put
Distributed Data
Management
Shankar D. et al., High-performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads. ICDCS’17
spcl.inf.ethz.ch
@spcl_eth
RDMA
82
4 MiB transfer with varying blocksize
stride = 2 x blocksize
11.44 GiB/s
Input buffer Destination memory
Use Case 3: MPI Datatypes acceleration Data Layout
Transformation
Gropp, W., et al., March. Improving the performance of MPI derived datatypes. MPIDC’99
spcl.inf.ethz.ch
@spcl_eth
Use Case 3: MPI Datatypes acceleration
83
Input buffer Destination memory
sPIN
43.6 GiB/s
83
Handlers cost:
54 instructions
Data Layout
Transformation
RDMA
4 MiB transfer with varying blocksize
stride = 2 x blocksize
11.44 GiB/s
Gropp, W., et al., March. Improving the performance of MPI derived datatypes. MPIDC’99
3.8x speedup
spcl.inf.ethz.ch
@spcl_eth
Further results and use-cases
84
spcl.inf.ethz.ch
@spcl_eth
Further results and use-cases
85
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
spcl.inf.ethz.ch
@spcl_eth
Use Case 5: Distributed KV Store
Further results and use-cases
Kalia, A., et al., Using RDMA efficiently for key-value services. In
ACM SIGCOMM Computer Communication Review, 2014
Network
K1.tail: 0x88
0x88
K1, V
V
K1.tail: 0x98
86
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
41% lower latency
spcl.inf.ethz.ch
@spcl_eth
Use Case 6: Conditional Read
Further results and use-cases
87
Barthels, C., et al., Designing Databases for Future High-
Performance Networks. IEEE Data Eng. Bulletin, 2017
Use Case 5: Distributed KV Store
Kalia, A., et al., Using RDMA efficiently for key-value services. In
ACM SIGCOMM Computer Communication Review, 2014
Network 20%
40%
60%
Discarded data: 80%
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
41% lower latency
spcl.inf.ethz.ch
@spcl_eth
Use Case 6: Conditional ReadUse Case 5: Distributed KV Store
Further results and use-cases
Kalia, A., et al., Using RDMA efficiently for key-value services. In
ACM SIGCOMM Computer Communication Review, 2014
Network
88
Barthels, C., et al., Designing Databases for Future High-
Performance Networks. IEEE Data Eng. Bulletin, 2017
Use Case 7: Distributed Transactions
Dragojević, A, et al., No compromises: distributed transactions
with consistency, availability, and performance. SOSP’15
Network
data pkts
log pkts
20%
40%
60%
Discarded data: 80%
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
41% lower latency
spcl.inf.ethz.ch
@spcl_eth
Use Case 6: Conditional ReadUse Case 5: Distributed KV Store
Further results and use-cases
Kalia, A., et al., Using RDMA efficiently for key-value services. In
ACM SIGCOMM Computer Communication Review, 2014
Network
89
Barthels, C., et al., Designing Databases for Future High-
Performance Networks. IEEE Data Eng. Bulletin, 2017
Use Case 7: Distributed Transactions
Dragojević, A, et al., No compromises: distributed transactions
with consistency, availability, and performance. SOSP’15
Network
Use Case 8: FT Broadcast
Bosilca, G., et al., Failure Detection and Propagation in
HPC systems. SC’16
Network
bcast pkts
redundant bcast pkts
20%
40%
60%
Discarded data: 80%
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
41% lower latency
spcl.inf.ethz.ch
@spcl_eth
Use Case 6: Conditional ReadUse Case 5: Distributed KV Store
Further results and use-cases
Kalia, A., et al., Using RDMA efficiently for key-value services. In
ACM SIGCOMM Computer Communication Review, 2014
Network
90
Barthels, C., et al., Designing Databases for Future High-
Performance Networks. IEEE Data Eng. Bulletin, 2017
Use Case 7: Distributed Transactions
Dragojević, A, et al., No compromises: distributed transactions
with consistency, availability, and performance. SOSP’15
Network
Use Case 8: FT Broadcast
Bosilca, G., et al., Failure Detection and Propagation in
HPC systems. SC’16
Use Case 4: MPI Rendezvous Protocol
MILC
POP
coMD
coMD
Cloverleaf
Cloverleaf
64
64
72
360
72
360
5.7M
772M
5.3M
28.1M
2.7M
15.3M
5.5%
3.1%
6.1%
6.5%
5.2%
5.6%
1.9%
2.4%
2.4%
2.8%
2.4%
3.2%
program p msgs ovhd ovhd
65%
22%
60%
58%
53%
42%
red
Network
Use Case 9: Distributed Consensus
20%
40%
60%
Discarded data: 80%
Consensus
István, Z., et al., Consensus in a Box: Inexpensive Coordination in
Hardware. NSDI’16
41% lower latency
The Next 700 sPIN use-cases
… just think about sPIN graph kernels ….
spcl.inf.ethz.ch
@spcl_eth
sPIN Streaming Processing in the Network for Network Acceleration
Try it out: https://spcl.inf.ethz.ch/Research/Parallel_Programming/sPIN/Full specification: https://arxiv.org/abs/1709.05483
sPIN beyond RDMA
3,800+ reads in
less than 4 hours
spcl.inf.ethz.ch
@spcl_eth
▪ More complex network interfaces
▪ sPIN – just discussed in detail
▪ How to use it in real applications?
▪ How to implement sPIN
▪ Collaboration with Luca Benini, RISC-V SoC
▪ Full FPGA/simulation environment available as open-source
▪ New application use-cases – “Deep Learning is HPC [1]”
▪ Different communication requirements and opportunities for optimization
Mainly sparsity [2] and asynchronity!
▪ Need to be a bit careful though when entering a new field
92
MPI – some new research challenges!
T. Hoefler: “Twelve ways to fool the masses when reporting performance of deep learning workloads”
(my humorous guide to floptimize deep learning)
[1]: Ben-Nun, Hoefler: “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv:1802.09941
[2]: Renggli et al.: “SparCML: High-Performance Sparse Communication for Machine Learning”, arXiv:1802.08021
https://eurompi19.inf.ethz.ch
Submit papers by April 15th!
spcl.inf.ethz.ch
@spcl_eth
Important dates:
Submission server opens: January 14th, 2019
Full paper submission: April 15th, 2019 (AOE)
Notification: July 1st, 2019
Camera-ready: August 5th, 2019https://eurompi19.inf.ethz.ch

Weitere ähnliche Inhalte

Was ist angesagt?

High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO DevOps with ActiveMQ, Camel, Fabric8, and HawtIO
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO Christian Posta
 
Kubernetes
KubernetesKubernetes
Kuberneteserialc_w
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch IntroductionHungWei Chiu
 
Openstack zun,virtual kubelet
Openstack zun,virtual kubeletOpenstack zun,virtual kubelet
Openstack zun,virtual kubeletChanyeol yoon
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?Pradeep Kumar
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph Community
 
Gorush: A push notification server written in Go
Gorush: A push notification server written in GoGorush: A push notification server written in Go
Gorush: A push notification server written in GoBo-Yi Wu
 
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce RichardsonThe 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardsonharryvanhaaren
 
Comparing Next-Generation Container Image Building Tools
 Comparing Next-Generation Container Image Building Tools Comparing Next-Generation Container Image Building Tools
Comparing Next-Generation Container Image Building ToolsAkihiro Suda
 
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기Ian Choi
 
KVM tools and enterprise usage
KVM tools and enterprise usageKVM tools and enterprise usage
KVM tools and enterprise usagevincentvdk
 
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...Vietnam Open Infrastructure User Group
 
K8s cluster autoscaler
K8s cluster autoscaler K8s cluster autoscaler
K8s cluster autoscaler k8s study
 
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsJon Haddad
 
OVN operationalization at scale at eBay
OVN operationalization at scale at eBayOVN operationalization at scale at eBay
OVN operationalization at scale at eBayAliasgar Ginwala
 
Kvm virtualization platform
Kvm virtualization platformKvm virtualization platform
Kvm virtualization platformAhmad Hafeezi
 
Overview of kubernetes network functions
Overview of kubernetes network functionsOverview of kubernetes network functions
Overview of kubernetes network functionsHungWei Chiu
 

Was ist angesagt? (20)

High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO DevOps with ActiveMQ, Camel, Fabric8, and HawtIO
DevOps with ActiveMQ, Camel, Fabric8, and HawtIO
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch Introduction
 
Openstack zun,virtual kubelet
Openstack zun,virtual kubeletOpenstack zun,virtual kubelet
Openstack zun,virtual kubelet
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
 
Gorush: A push notification server written in Go
Gorush: A push notification server written in GoGorush: A push notification server written in Go
Gorush: A push notification server written in Go
 
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce RichardsonThe 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
 
Comparing Next-Generation Container Image Building Tools
 Comparing Next-Generation Container Image Building Tools Comparing Next-Generation Container Image Building Tools
Comparing Next-Generation Container Image Building Tools
 
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기
[OpenStack] 공개 소프트웨어 오픈스택 입문 & 파헤치기
 
KVM tools and enterprise usage
KVM tools and enterprise usageKVM tools and enterprise usage
KVM tools and enterprise usage
 
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
 
K8s cluster autoscaler
K8s cluster autoscaler K8s cluster autoscaler
K8s cluster autoscaler
 
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
 
OVN operationalization at scale at eBay
OVN operationalization at scale at eBayOVN operationalization at scale at eBay
OVN operationalization at scale at eBay
 
美团技术团队 - KVM性能优化
美团技术团队 - KVM性能优化美团技术团队 - KVM性能优化
美团技术团队 - KVM性能优化
 
Kvm virtualization platform
Kvm virtualization platformKvm virtualization platform
Kvm virtualization platform
 
Kvm
KvmKvm
Kvm
 
Overview of kubernetes network functions
Overview of kubernetes network functionsOverview of kubernetes network functions
Overview of kubernetes network functions
 

Ähnlich wie RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects

Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...EUDAT
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCIJSRD
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
DECK36 - Log everything! and Realtime Datastream Analytics with Storm
DECK36 - Log everything! and Realtime Datastream Analytics with StormDECK36 - Log everything! and Realtime Datastream Analytics with Storm
DECK36 - Log everything! and Realtime Datastream Analytics with StormMike Lohmann
 
An AI accelerator ASIC architecture
An AI accelerator ASIC architectureAn AI accelerator ASIC architecture
An AI accelerator ASIC architectureKhanh Le
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Digital Media Production - Future Internet
Digital Media Production - Future InternetDigital Media Production - Future Internet
Digital Media Production - Future InternetMaarten Verwaest
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 
pythonOCC PDE2009 presentation
pythonOCC PDE2009 presentationpythonOCC PDE2009 presentation
pythonOCC PDE2009 presentationThomas Paviot
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Thomas Weise
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)Igalia
 
The road ahead for scientific computing with Python
The road ahead for scientific computing with PythonThe road ahead for scientific computing with Python
The road ahead for scientific computing with PythonRalf Gommers
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkDataWorks Summit
 

Ähnlich wie RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects (20)

Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPC
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
PowerAI Deep dive
PowerAI Deep divePowerAI Deep dive
PowerAI Deep dive
 
DECK36 - Log everything! and Realtime Datastream Analytics with Storm
DECK36 - Log everything! and Realtime Datastream Analytics with StormDECK36 - Log everything! and Realtime Datastream Analytics with Storm
DECK36 - Log everything! and Realtime Datastream Analytics with Storm
 
An AI accelerator ASIC architecture
An AI accelerator ASIC architectureAn AI accelerator ASIC architecture
An AI accelerator ASIC architecture
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Digital Media Production - Future Internet
Digital Media Production - Future InternetDigital Media Production - Future Internet
Digital Media Production - Future Internet
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Multicore
MulticoreMulticore
Multicore
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
pythonOCC PDE2009 presentation
pythonOCC PDE2009 presentationpythonOCC PDE2009 presentation
pythonOCC PDE2009 presentation
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
The road ahead for scientific computing with Python
The road ahead for scientific computing with PythonThe road ahead for scientific computing with Python
The road ahead for scientific computing with Python
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
 

Mehr von inside-BigData.com

Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

Mehr von inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 

Kürzlich hochgeladen

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Kürzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects

  • 1. spcl.inf.ethz.ch @spcl_eth T. HOEFLER RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects HPC Advisory Council Swiss Workshop, Lugano, Switzerland WITH HELP OF ROBERT GERSTENBERGER, MACIEJ BESTA, S. DI GIROLAMO, K. TARANOV, R. E. GRANT, R. BRIGHTWELL AND ALL OF SPCL https://eurompi19.inf.ethz.ch Submit papers by April 15th!
  • 2. spcl.inf.ethz.ch @spcl_eth 2 The Development of High-Performance Networking Interfaces 1980 1990 2000 2010 2020 Ethernet+TCP/IP Scalable Coherent Interface Myrinet GM+MX Fast Messages Quadrics QsNet Virtual Interface Architecture IB Verbs OFED libfabric Portals 4 sockets coherent memory access (active) message based Cray Gemini remote direct memory access (RDMA) triggered operationsOS bypass protocol offload zero copy businessinsider.com 95 / top-100 systems use RDMA >285 / top-500 systems use RDMA June 2017
  • 3. spcl.inf.ethz.ch @spcl_eth ▪ MPI-3.0 supports RMA (“MPI One Sided”) ▪ Designed to react to hardware trends ▪ Majority of HPC networks support RDMA 3 RDMA as MPI-3.0 REMOTE MEMORY ACCESS TRANSPORT [1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
  • 4. spcl.inf.ethz.ch @spcl_eth ▪ MPI-3.0 supports RMA (“MPI One Sided”) ▪ Designed to react to hardware trends ▪ Majority of HPC networks support RDMA ▪ Communication is „one sided” (no involvement of destination) ▪ RMA decouples communication & synchronization ▪ Different from message passing 4 MPI-3.0 REMOTE MEMORY ACCESS [1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf Proc A Proc B send recv Proc A Proc B put two sided one sided CommunicationCommunication + Synchronization Synchronizationsync
  • 5. spcl.inf.ethz.ch @spcl_eth 5 PRESENTATION OVERVIEW 5. Application evaluation 1. Overview of three MPI-3 RMA concepts 2. MPI window creation 3. Communication 4. Synchronization6. Post-RDMA networking Outlook!
  • 6. spcl.inf.ethz.ch @spcl_eth 6 MPI-3 RMA COMMUNICATION OVERVIEW Process A (passive) Memory MPI window Process B (active) Process C (active) Put GetAtomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO) Memory MPI window … Process D (active) … Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 7. spcl.inf.ethz.ch @spcl_eth 7 MPI-3 RMA COMMUNICATION OVERVIEW Process A (passive) Memory MPI window Process B (active) Process C (active) Put GetAtomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO) Memory MPI window … Process D (active) … … … Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 8. spcl.inf.ethz.ch @spcl_eth 8 MPI-3 RMA COMMUNICATION OVERVIEW Process A (passive) Memory MPI window Process B (active) Process C (active) Put GetAtomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO) Memory MPI window … Process D (active) … … … Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 9. spcl.inf.ethz.ch @spcl_eth 9 MPI-3 RMA COMMUNICATION OVERVIEW Process A (passive) Memory MPI window Process B (active) Process C (active) Put GetAtomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO) Memory MPI window … Process D (active) … … … Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 10. spcl.inf.ethz.ch @spcl_eth 10 MPI-3 RMA COMMUNICATION OVERVIEW Process A (passive) Memory MPI window Process B (active) Process C (active) Put GetAtomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO) Memory MPI window … Process D (active) … … … Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 11. spcl.inf.ethz.ch @spcl_eth 11 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 12. spcl.inf.ethz.ch @spcl_eth 12 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 13. spcl.inf.ethz.ch @spcl_eth 13 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 14. spcl.inf.ethz.ch @spcl_eth 14 MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 15. spcl.inf.ethz.ch @spcl_eth 15 Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation MPI-3.0 RMA SYNCHRONIZATION OVERVIEW Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June
  • 16. spcl.inf.ethz.ch @spcl_eth ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) 16 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 17. spcl.inf.ethz.ch @spcl_eth 17 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 18. spcl.inf.ethz.ch @spcl_eth 18 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION Window creation Communication Synchronization ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) ▪ Window creation, communication and synchronization Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 19. spcl.inf.ethz.ch @spcl_eth ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) ▪ Window creation, communication and synchronization ▪ foMPI, a fully functional MPI-3 RMA implementation ▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems ▪ XPMEM: a portable Linux kernel module 19 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 20. spcl.inf.ethz.ch @spcl_eth 20 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) ▪ Window creation, communication and synchronization ▪ foMPI, a fully functional MPI-3 RMA implementation ▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems ▪ XPMEM: a portable Linux kernel module Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 21. spcl.inf.ethz.ch @spcl_eth 21 SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI ▪ Scalable & generic protocols ▪ Can be used on any RDMA network (e.g., OFED/IB) ▪ Window creation, communication and synchronization ▪ foMPI, a fully functional MPI-3 RMA implementation ▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems ▪ XPMEM: a portable Linux kernel module Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 22. spcl.inf.ethz.ch @spcl_eth 22 PART 1: SCALABLE WINDOW CREATION Traditional windows backwards compatible (MPI-2) Time bound: 𝒪 𝑝 Memory bound: 𝒪 𝑝 𝑝 = total number of processes Process A Memory Process B Memory Process C Memory 0x111 0x123 0x120 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 23. spcl.inf.ethz.ch @spcl_eth 𝑝 = total number of processes 23 PART 1: SCALABLE WINDOW CREATION Allocated windows Process A Memory Process B Memory Process C Memory Allows MPI to allocate memory Time bound: 𝒪 log 𝑝 (𝑤ℎ𝑝) Memory bound: 𝒪 1 0x123 0x1230x123 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 24. spcl.inf.ethz.ch @spcl_eth Process A Memory Process B Memory Process C Memory 𝑝 = total number of processes 24 Local attach/detach Most flexible Time bound: 𝒪 𝑝 Memory bound: 𝒪 𝑝 0x129 0x129 0x111 0x123 0x120 PART 1: SCALABLE WINDOW CREATION Dynamic windows Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 25. spcl.inf.ethz.ch @spcl_eth ▪ Put and Get: ▪ Direct DMAPP put and get operations or local (blocking) memcpy (XPMEM) ▪ Accumulate: ▪ DMAPP atomic operations for 64 bit types ▪ ...or fall back to remote locking protocol ▪ MPI datatype handling with MPITypes library [1] ▪ Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE) 25 PART 2: COMMUNICATION [1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM’09 Contiguous memory MPI_Put dmapp_put_nbi Remote process … MPI_Compare _and_swap dmapp_ acswap_qw_nbi Remote process …
  • 26. spcl.inf.ethz.ch @spcl_eth 26 PERFORMANCE INTER-NODE: LATENCY Put Inter-Node Get Inter-Node 20% faster 80% faster Proc 0 Proc 1 put sync memory Half ping-pong
  • 27. spcl.inf.ethz.ch @spcl_eth 27 PERFORMANCE INTRA-NODE: LATENCY Put/Get Intra-Node 3x faster Proc 0 Proc 1 put sync memory Half ping-pong Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 28. spcl.inf.ethz.ch @spcl_eth 29 PERFORMANCE: MESSAGE RATE Intra-NodeInter-Node Proc 0 Proc 1 puts Sync memory ... Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 29. spcl.inf.ethz.ch @spcl_eth 30 PERFORMANCE: ATOMICS 64 bit integers hardware- accelerated protocol: fall back protocol: lower latency higher bandwidth proprietary Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 30. spcl.inf.ethz.ch @spcl_eth 31 PART 3: SYNCHRONIZATION Active process Passive process Synchroni- zation Passive Target Mode Lock Lock All Active Target Mode Fence Post/Start/ Complete/Wait Communi- cation Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 31. spcl.inf.ethz.ch @spcl_eth Node 0 Node 1 SCALABLE FENCE IMPLEMENTATION ▪ Collective call ▪ Completes all outstanding memory operations Proc 2 put Proc 0 Proc 1 Proc 3 int MPI_Win_fence(…) { asm( mfence ); dmapp_gsync_wait(); MPI_Barrier(...); return MPI_SUCCESS; } put put put put put put 33Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 32. spcl.inf.ethz.ch @spcl_eth Node 0 Node 1 SCALABLE FENCE IMPLEMENTATION ▪ Collective call ▪ Completes all outstanding memory operations Proc 2Proc 0 Proc 1 Proc 3 int MPI_Win_fence(…) { asm( mfence ); dmapp_gsync_wait(); MPI_Barrier(...); return MPI_SUCCESS; } put put put 33 Local completion (XPMEM) Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 33. spcl.inf.ethz.ch @spcl_eth Node 0 Node 1 SCALABLE FENCE IMPLEMENTATION ▪ Collective call ▪ Completes all outstanding memory operations Proc 2Proc 0 Proc 1 Proc 3 int MPI_Win_fence(…) { asm( mfence ); dmapp_gsync_wait(); MPI_Barrier(...); return MPI_SUCCESS; } 33 Local completion (DMAPP) Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 34. spcl.inf.ethz.ch @spcl_eth Node 0 Node 1 SCALABLE FENCE IMPLEMENTATION ▪ Collective call ▪ Completes all outstanding memory operations Proc 2Proc 0 Proc 1 Proc 3 int MPI_Win_fence(…) { asm( mfence ); dmapp_gsync_wait(); MPI_Barrier(...); return MPI_SUCCESS; } 33 Global completion barrier Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 35. spcl.inf.ethz.ch @spcl_eth 36 SCALABLE FENCE PERFORMANCE Time bound 𝒪 log 𝑝 Memory bound 𝒪 1 90% faster Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 36. spcl.inf.ethz.ch @spcl_eth 37 PSCW SYNCHRONIZATION post wait start complete access epoch exposure epoch Proc 0 Proc 1 matching algorithm matching algorithm allows to access other processes allows access from other processes Posting process Puts … Starting process Puts … Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 37. spcl.inf.ethz.ch @spcl_eth 38 PSCW SYNCHRONIZATION post wait start complete Proc 0 Proc 1 start complete start complete Proc 2 Proc 3 start complete post wait Proc 4 Proc 5 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 38. spcl.inf.ethz.ch @spcl_eth ▪ In general, there can be n posting and m starting processes ▪ In this example there is one posting and 4 starting processes 39 PSCW SCALABLE POST/START MATCHING Posting process (opens its window) j4 j1 j3 j2i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 39. spcl.inf.ethz.ch @spcl_eth ▪ In general, there can be n posting and m starting processes ▪ In this example there is one posting and 4 starting processes 40 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i Starting processes (access remote window) Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 40. spcl.inf.ethz.ch @spcl_eth 41 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i Local list ▪ Each starting process has a local list Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 41. spcl.inf.ethz.ch @spcl_eth 42 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 42. spcl.inf.ethz.ch @spcl_eth 43 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 43. spcl.inf.ethz.ch @spcl_eth ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list 44 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 44. spcl.inf.ethz.ch @spcl_eth ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list 45 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i i i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 45. spcl.inf.ethz.ch @spcl_eth ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list 46 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i i i i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 46. spcl.inf.ethz.ch @spcl_eth ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list 47 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i i i i i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 47. spcl.inf.ethz.ch @spcl_eth ▪ Posting process i adds its rank i to a list at each starting process j1, . . . , j4 ▪ Each starting process j waits until the rank of the posting process i is present in its local list 48 PSCW SCALABLE POST/START MATCHING j4 j1 j3 j2i i i i i Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 48. spcl.inf.ethz.ch @spcl_eth ▪ Each completing process increments a counter stored at the posting process 49 PSCW SCALABLE COMPLETE/WAIT MATCHING ij4 j1 j3 j2 0 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 49. spcl.inf.ethz.ch @spcl_eth ▪ Each completing process increments a counter stored at the posting process 50 PSCW SCALABLE COMPLETE/WAIT MATCHING ij4 j1 j3 j2 1 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 50. spcl.inf.ethz.ch @spcl_eth ▪ Each completing process increments a counter stored at the posting process 51 PSCW SCALABLE COMPLETE/WAIT MATCHING ij4 j1 j3 j2 2 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 51. spcl.inf.ethz.ch @spcl_eth ▪ Each completing process increments a counter stored at the posting process 52 PSCW SCALABLE COMPLETE/WAIT MATCHING ij4 j1 j3 j2 3 Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 52. spcl.inf.ethz.ch @spcl_eth 53 PSCW SCALABLE COMPLETE/WAIT MATCHING ij4 j1 j3 j2 4 ▪ Each completing process increments a counter stored at the posting process ▪ When the counter is equal to the number of starting processes, the posting process returns from wait Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 53. spcl.inf.ethz.ch @spcl_eth 54 PSCW PERFORMANCE Time bound 𝒫𝑠𝑡𝑎𝑟𝑡 = 𝒫 𝑤𝑎𝑖𝑡 = 𝒪 1 𝒫𝑝𝑜𝑠𝑡 = 𝒫𝑐𝑜𝑚𝑝𝑙𝑒𝑡𝑒 = 𝒪 log 𝑝 Memory bound 𝒪 log 𝑝 (for scalable programs) Ring Topology Gerstenberger et al.: „Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided”, CACM, Oct. 2018
  • 54. spcl.inf.ethz.ch @spcl_eth 55 SCALABLE LOCK SYNCHRONIZATION Active process Passive process Lock/Unlock (shared/exclusive) Lock All (always shared) P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC’16, best paper
  • 55. spcl.inf.ethz.ch @spcl_eth ▪ Guarantees remote completion ▪ Issues a remote bulk synchronization and an x86 mfence ▪ One of the most performance critical functions, we add only 78 x86 CPU instructions to the critical path 64 FLUSH SYNCHRONIZATION Time bound 𝒪 1 Memory bound 𝒪(1) Process 0 Process 1 inc(counter) 0 counter: … inc(counter) inc(counter)
  • 56. spcl.inf.ethz.ch @spcl_eth ▪ Guarantees remote completion ▪ Issues a remote bulk synchronization and an x86 mfence ▪ One of the most performance critical functions, we add only 78 x86 CPU instructions to the critical path 64 FLUSH SYNCHRONIZATION Time bound 𝒪 1 Memory bound 𝒪(1) Process 0 Process 1 inc(counter) counter: … inc(counter) inc(counter) 3 flush
  • 57. spcl.inf.ethz.ch @spcl_eth 66 ▪ Evaluation on Blue Waters System ▪ 22,640 computing Cray XE6 nodes ▪ 724,480 schedulable cores ▪ All microbenchmarks ▪ 4 applications ▪ One nearly full-scale run ☺ PERFORMANCE
  • 58. spcl.inf.ethz.ch @spcl_eth 67 PERFORMANCE: MOTIF APPLICATIONS Key/Value Store: Random Inserts per Second Dynamic Sparse Data Exchange (DSDE) with 6 neighbors
  • 59. spcl.inf.ethz.ch @spcl_eth 68 PERFORMANCE: APPLICATIONS NAS 3D FFT [1] Performance MILC [2] Application Execution Time Annotations represent performance gain of foMPI over Cray MPI-1. [1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09 [2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS’12 scale to 512k procs scale to 65k procs
  • 60. spcl.inf.ethz.ch @spcl_eth 69 The Development of High-Performance Networking Interfaces 1980 1990 2000 2010 2020 Ethernet+TCP/IP Scalable Coherent Interface Myrinet GM+MX Fast Messages Quadrics QsNet Virtual Interface Architecture IB Verbs OFED libfabric Portals 4 sockets coherent memory access (active) message based Cray Gemini remote direct memory access (RDMA) triggered operationsOS bypass protocol offload zero copy What next? Fundamental development: CPU core speed cannot keep up with the growing network bandwidths! Already today, cannot process packets at line rate!
  • 61. spcl.inf.ethz.ch @spcl_eth Local Node Main Memory RDMA NIC Core i7 Haswell L3 L2 L1 Regs PCIe bus arriving packets 34 cycles ~11.3ns 11 cycles ~ 3.6ns 4 cycles ~1.3ns ~ 250ns 125 cycles ~41.6ns Data Processing in modern RDMA networks DMA Unit RDMA Processing Remote Nodes (via network) L2 L1 Regs 34 cycles ~11.3ns 11 cycles ~ 3.6ns 4 cycles ~1.3ns Input buffer Mellanox Connect-X5: 1 packet/5ns Tomorrow (400G): 1 packet/1.2ns 70
  • 62. spcl.inf.ethz.ch @spcl_eth 71 The future of High-Performance Networking Interfaces 1980 1990 2000 2010 2020 Ethernet+TCP/IP Scalable Coherent Interface Myrinet GM+MX Fast Messages Quadrics QsNet Virtual Interface Architecture IB Verbs OFED libfabric Portals 4 sockets coherent memory access (active) message based Cray Gemini remote direct memory access (RDMA) triggered operationsOS bypass protocol offload zero copy 95 / top-100 systems use RDMA >285 / top-500 systems use RDMA June 2017 fully programmable NIC acceleration sPIN Streaming Processing In the Network Established Principles for Compute Acceleration PortabilityEase-of-use ProgrammabilitySpecialization Libraries Efficiency 4.0
  • 63. spcl.inf.ethz.ch @spcl_eth arriving packets PacketScheduler DMA Unit manage memory upload handlers Fast shared memory (packet input buffer) HPU 1 HPU 3 HPU 0 HPU 2 R/W MEM CPU sPIN NIC - Abstract Machine Model 72
  • 64. spcl.inf.ethz.ch @spcl_eth RDMA vs. sPIN in action: Simple Ping Pong 73 Initiator Target
  • 65. spcl.inf.ethz.ch @spcl_eth RDMA vs. sPIN in action: Streaming Ping Pong 74 Initiator Target
  • 66. spcl.inf.ethz.ch @spcl_eth PacketScheduler 75 sPIN – Programming Interface __handler int pp_header_handler(const ptl_header_t h, void *state) { pingpong_info_t *i = state; i->source = h.source_id; return PROCESS_DATA; // execute payload handler to put from device } Header handler __handler int pp_payload_handler(const ptl_payload_t p, void * state) { pingpong_info_t *i = state; PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->source, 10, 0, NULL, 0); return SUCCESS; } Payload handler __handler int pp_completion_handler(int dropped_bytes, bool flow_control_triggered, void *state) { return SUCCESS; } Completion handler connect(peer, /* … */, &pp_header_handler, &pp_payload_handler, &pp_completion_handler); Incoming message Header Payload Tail
  • 67. spcl.inf.ethz.ch @spcl_eth ▪ sPIN is a programming abstraction, similar to CUDA or OpenCL combined with OFED or Portals 4 ▪ It enables a large variety of NIC implementations! ▪ For example, massively multithreaded HPUs Including warp-like scheduling strategies ▪ Main goal: sPIN must not obstruct line-rate ▪ Programmer must limit processing time per packet Little’s Law: 500 instructions per handler, 2.5 GHz, IPC=1, 1 Tb/s → 25 kiB memory ▪ Relies on fast shared memory (processing in packet buffers) Scratchpad or registers ▪ Quick (single-cycle) handler invocation on packet arrival Pre-initialized memory & context ▪ Can be implemented in most RDMA NICs with a firmware update ▪ Or in software in programmable (Smart) NICs 76 Possible sPIN implementations BCM58800 SoC (Full Linux) Innova Flex (Kintex FPGA) Catapult (Virtex FPGA) at 400G, process more than 833 million messages/s
  • 68. spcl.inf.ethz.ch @spcl_eth Simulating a sPIN NIC – Ping Pong 77 RDMA sPIN (stream) ▪ LogGOPSim v2 [1]: combine LogGOPSim (packet-level network) with gem5 (cycle accurate CPU simulation) ▪ Network (LogGOPSim): ▪ Supports Portals 4 and MPI ▪ Parametrized for future InfiniBand o=65ns (measured) g=6.7ns (150 MM/s) G=2.5ps (400 Gib/s) Switch L=50ns (measured) Wire L=33.4ns (10m cable) ▪ NIC HPU ▪ 2.5 GHz ARM Cortex A15 OOO ▪ ARMv8-A 32 bit ISA ▪ Single-cycle access SRAM (no DRAM) ▪ Header matching m=30ns, per packet 2ns In parallel with g! 35% lower latency 32%higherBW 77 [1] S. Di Girolamo, K. Taranov, T. Schneider, E. Stalder, T. Hoefler, LogGOPSim+gem5: Simulating Network Offload Engines Over Packet-Switched Networks. Presented at ExaMPI’17 Handlers cost: 18 instructions + 1 Put Data Layout Transformation Network Group Communication Distributed Data Management
  • 69. spcl.inf.ethz.ch @spcl_eth Use Case 1: Broadcast acceleration Message size: 8 Bytes 78 Network Group Communication Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004 RDMA
  • 70. spcl.inf.ethz.ch @spcl_eth Offloaded collectives (e.g., ConnectX-2, Portals 4) 7979 RDMA Use Case 1: Broadcast acceleration Network Group Communication Underwood, K.D., et al., Enabling flexible collective communication offload with triggered operations. HOTI’11 Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004 Message size: 8 Bytes
  • 71. spcl.inf.ethz.ch @spcl_eth sPIN 80 Handlers cost: 24 instructions + Log P Puts Use Case 1: Broadcast acceleration Network Group Communication Underwood, K.D., et al., Enabling flexible collective communication offload with triggered operations. HOTI’11 Liu, J., et al., High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 2004 Offloaded collectives (e.g., ConnectX-2, Portals 4) RDMA Message size: 8 Bytes
  • 72. spcl.inf.ethz.ch @spcl_eth Parity Update Use Case 2: RAID acceleration RDMA sPIN 81 Server Node Parity Node Write ACK Parity ACK 20% lower latency 176%higherBW Handlers cost: Server: 58 instructions + 1 Put Parity: 46 instructions + 1 Put Distributed Data Management Shankar D. et al., High-performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads. ICDCS’17
  • 73. spcl.inf.ethz.ch @spcl_eth RDMA 82 4 MiB transfer with varying blocksize stride = 2 x blocksize 11.44 GiB/s Input buffer Destination memory Use Case 3: MPI Datatypes acceleration Data Layout Transformation Gropp, W., et al., March. Improving the performance of MPI derived datatypes. MPIDC’99
  • 74. spcl.inf.ethz.ch @spcl_eth Use Case 3: MPI Datatypes acceleration 83 Input buffer Destination memory sPIN 43.6 GiB/s 83 Handlers cost: 54 instructions Data Layout Transformation RDMA 4 MiB transfer with varying blocksize stride = 2 x blocksize 11.44 GiB/s Gropp, W., et al., March. Improving the performance of MPI derived datatypes. MPIDC’99 3.8x speedup
  • 76. spcl.inf.ethz.ch @spcl_eth Further results and use-cases 85 Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red
  • 77. spcl.inf.ethz.ch @spcl_eth Use Case 5: Distributed KV Store Further results and use-cases Kalia, A., et al., Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, 2014 Network K1.tail: 0x88 0x88 K1, V V K1.tail: 0x98 86 Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red 41% lower latency
  • 78. spcl.inf.ethz.ch @spcl_eth Use Case 6: Conditional Read Further results and use-cases 87 Barthels, C., et al., Designing Databases for Future High- Performance Networks. IEEE Data Eng. Bulletin, 2017 Use Case 5: Distributed KV Store Kalia, A., et al., Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, 2014 Network 20% 40% 60% Discarded data: 80% Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red 41% lower latency
  • 79. spcl.inf.ethz.ch @spcl_eth Use Case 6: Conditional ReadUse Case 5: Distributed KV Store Further results and use-cases Kalia, A., et al., Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, 2014 Network 88 Barthels, C., et al., Designing Databases for Future High- Performance Networks. IEEE Data Eng. Bulletin, 2017 Use Case 7: Distributed Transactions Dragojević, A, et al., No compromises: distributed transactions with consistency, availability, and performance. SOSP’15 Network data pkts log pkts 20% 40% 60% Discarded data: 80% Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red 41% lower latency
  • 80. spcl.inf.ethz.ch @spcl_eth Use Case 6: Conditional ReadUse Case 5: Distributed KV Store Further results and use-cases Kalia, A., et al., Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, 2014 Network 89 Barthels, C., et al., Designing Databases for Future High- Performance Networks. IEEE Data Eng. Bulletin, 2017 Use Case 7: Distributed Transactions Dragojević, A, et al., No compromises: distributed transactions with consistency, availability, and performance. SOSP’15 Network Use Case 8: FT Broadcast Bosilca, G., et al., Failure Detection and Propagation in HPC systems. SC’16 Network bcast pkts redundant bcast pkts 20% 40% 60% Discarded data: 80% Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red 41% lower latency
  • 81. spcl.inf.ethz.ch @spcl_eth Use Case 6: Conditional ReadUse Case 5: Distributed KV Store Further results and use-cases Kalia, A., et al., Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, 2014 Network 90 Barthels, C., et al., Designing Databases for Future High- Performance Networks. IEEE Data Eng. Bulletin, 2017 Use Case 7: Distributed Transactions Dragojević, A, et al., No compromises: distributed transactions with consistency, availability, and performance. SOSP’15 Network Use Case 8: FT Broadcast Bosilca, G., et al., Failure Detection and Propagation in HPC systems. SC’16 Use Case 4: MPI Rendezvous Protocol MILC POP coMD coMD Cloverleaf Cloverleaf 64 64 72 360 72 360 5.7M 772M 5.3M 28.1M 2.7M 15.3M 5.5% 3.1% 6.1% 6.5% 5.2% 5.6% 1.9% 2.4% 2.4% 2.8% 2.4% 3.2% program p msgs ovhd ovhd 65% 22% 60% 58% 53% 42% red Network Use Case 9: Distributed Consensus 20% 40% 60% Discarded data: 80% Consensus István, Z., et al., Consensus in a Box: Inexpensive Coordination in Hardware. NSDI’16 41% lower latency The Next 700 sPIN use-cases … just think about sPIN graph kernels ….
  • 82. spcl.inf.ethz.ch @spcl_eth sPIN Streaming Processing in the Network for Network Acceleration Try it out: https://spcl.inf.ethz.ch/Research/Parallel_Programming/sPIN/Full specification: https://arxiv.org/abs/1709.05483 sPIN beyond RDMA 3,800+ reads in less than 4 hours
  • 83. spcl.inf.ethz.ch @spcl_eth ▪ More complex network interfaces ▪ sPIN – just discussed in detail ▪ How to use it in real applications? ▪ How to implement sPIN ▪ Collaboration with Luca Benini, RISC-V SoC ▪ Full FPGA/simulation environment available as open-source ▪ New application use-cases – “Deep Learning is HPC [1]” ▪ Different communication requirements and opportunities for optimization Mainly sparsity [2] and asynchronity! ▪ Need to be a bit careful though when entering a new field 92 MPI – some new research challenges! T. Hoefler: “Twelve ways to fool the masses when reporting performance of deep learning workloads” (my humorous guide to floptimize deep learning) [1]: Ben-Nun, Hoefler: “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv:1802.09941 [2]: Renggli et al.: “SparCML: High-Performance Sparse Communication for Machine Learning”, arXiv:1802.08021 https://eurompi19.inf.ethz.ch Submit papers by April 15th!
  • 84. spcl.inf.ethz.ch @spcl_eth Important dates: Submission server opens: January 14th, 2019 Full paper submission: April 15th, 2019 (AOE) Notification: July 1st, 2019 Camera-ready: August 5th, 2019https://eurompi19.inf.ethz.ch