PL-4049, Cache Coherence for GPU Architectures, by Arvindh Shriraman and Tor Aamodt
1. Cache Coherence for GPU Architectures
Inderpreet Singh, Arvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt, Cache Coherence for GPU Architectures, in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19)
6. Why provide coherence?
1. Inter-workgroup communication
2. Atomic operations
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems, ISPASS 2012
3. Task queues
54. Temporal Coherence
No coherence messages
All transactions are 2-hop
Protocol complexity minimal
Supports strong and weak memory models
Enables optimized communication
(ask me later...)
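The properties listed above can be illustrated with a minimal sketch of timestamp-based coherence. It assumes a globally synchronized cycle counter, a per-block lease handed out by the L2 on every read, and L1 copies that self-invalidate once the lease expires. The class and method names (`L2Cache`, `L1Cache`, `read`, `write`, `load`) and the integer cycle values are hypothetical and chosen for illustration only; they are not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class L2Block:
    value: int = 0
    expires_at: int = 0  # latest lease granted to any L1 for this block


@dataclass
class L2Cache:
    blocks: Dict[int, L2Block] = field(default_factory=dict)

    def read(self, addr: int, now: int, lifetime: int) -> Tuple[int, int]:
        """Return (value, lease_end) and remember the longest outstanding lease."""
        blk = self.blocks.setdefault(addr, L2Block())
        lease_end = now + lifetime
        blk.expires_at = max(blk.expires_at, lease_end)
        return blk.value, lease_end

    def write(self, addr: int, value: int, now: int) -> int:
        """Apply the write and return the cycle at which no L1 can still hold
        an unexpired stale copy. No invalidation message is sent."""
        blk = self.blocks.setdefault(addr, L2Block())
        blk.value = value
        return max(now, blk.expires_at)


class L1Cache:
    def __init__(self, l2: L2Cache, lifetime: int) -> None:
        self.l2 = l2
        self.lifetime = lifetime  # hypothetical fixed lifetime, in cycles
        self.lines: Dict[int, Tuple[int, int]] = {}  # addr -> (value, lease_end)

    def load(self, addr: int, now: int) -> int:
        line = self.lines.get(addr)
        if line is not None and now < line[1]:
            return line[0]  # hit: local copy is still within its lease
        # Miss or expired lease: a single request/response pair to the L2.
        value, lease_end = self.l2.read(addr, now, self.lifetime)
        self.lines[addr] = (value, lease_end)
        return value
```

With a lifetime of 10 cycles, a load at cycle 0 leases the block until cycle 10. A write at cycle 3 is reported as globally visible at cycle 10, a second load at cycle 5 still hits the old copy, and a load at cycle 10 refetches the new value. Under these assumptions, a strong memory model could be obtained by stalling the writer until the returned cycle, while a weaker model could let the writer continue and wait only at a fence.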
55. How to set the block lifetime?
• Longer = writes may stall
• Shorter = may not exploit temporal locality
• Lifetime predictor at L2:
-Load to expired block (for temporal locality)
-Store to unexpired block (reduce write stalls)
-Eviction of unexpired block (reduce L2 eviction stalls)
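The three events above suggest a simple feedback loop, sketched below. The sketch assumes that a load to an expired block lengthens the predicted lifetime, while a store to, or eviction of, an unexpired block shortens it. The class name `LifetimePredictor`, the step sizes, the bounds, and the initial value are all hypothetical; the slides do not give them.

```python
class LifetimePredictor:
    """Single predicted lifetime, adjusted by events observed at the L2."""

    def __init__(self, initial: int = 64, minimum: int = 0, maximum: int = 1024,
                 step_up: int = 8, step_down: int = 16) -> None:
        # All numeric defaults are illustrative assumptions.
        self.lifetime = initial
        self.minimum = minimum
        self.maximum = maximum
        self.step_up = step_up
        self.step_down = step_down

    def _adjust(self, delta: int) -> int:
        # Clamp so the prediction stays within [minimum, maximum].
        self.lifetime = max(self.minimum, min(self.maximum, self.lifetime + delta))
        return self.lifetime

    def on_load_to_expired_block(self) -> int:
        # The block was reused after its lease ran out: a longer lifetime
        # would have captured this temporal locality.
        return self._adjust(+self.step_up)

    def on_store_to_unexpired_block(self) -> int:
        # A write had to wait on an outstanding lease: shorten the lifetime
        # to reduce write stalls.
        return self._adjust(-self.step_down)

    def on_eviction_of_unexpired_block(self) -> int:
        # The L2 wanted to evict a block that is still leased: shorten the
        # lifetime to reduce eviction stalls.
        return self._adjust(-self.step_down)
```

Starting from 64 cycles with these step sizes, one load to an expired block raises the prediction to 72, and a following store to an unexpired block lowers it to 56. Using a larger downward step than upward step is a design choice of this sketch, not something the slides specify.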
96. What did we learn?
• Throughput and heterogeneous architectures require a more streamlined caching framework.
• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.
• Efficient coherence protocols enable programmers to deploy accelerators for wider purposes.
99, 100. [Figure 8. Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems. Panels: (a) Inter-workgroup communication, (b) Intra-workgroup communication. Y-axis: Interconnect Traffic. Configurations: NO-COH, NO-L1, MESI, GPU-VI, GPU-VIni, TCW. Traffic classes: RCL, INV, REQ, ATO, ST, LD. Benchmarks: BH, CC, CL, DLB, STN, VPR, HSP, KMN, LPS, NDL, RG, SR, AVG.]
101. TC-Strong vs TC-Weak
[Figure: speedup over all applications, in two panels: "Fixed lifetime for all applications" and "Best lifetime for each application". Legend: TCSUO, TCSOO, TCS, TCW, TCW w/ predictor. Y-axis: Speedup. X-axis: All applications.]