The 2012 International Symposium on Parallel Architectures, Algorithms and Programming.
Global Load Instruction Aggregation Based on Code Motion
1. The 2012 International Symposium on Parallel Architectures, Algorithms and Programming.
December 18, 2012
Global Load Instruction Aggregation Based on Code Motion
11. Previous works
1. Prefetch instructions
2. Transform loop structures.
before:
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]
after:
for(i=0;i<10;i++)
  for(j=0;j<10;j++)
    ... = a[i][j]
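The loop interchange above can be written out as a minimal C sketch, assuming a row-major 10x10 array; the function names sum_before and sum_after are illustrative, not from the talk:

```c
#include <assert.h>

#define N 10

/* "Before": the outer loop walks columns, so the inner loop touches
 * a[0][j], a[1][j], ... with stride N -- poor spatial locality for a
 * row-major array. */
int sum_before(int a[N][N]) {
    int s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* "After": interchanging the loops makes the inner loop walk one row
 * with stride 1, so consecutive accesses hit the same cache line. */
int sum_after(int a[N][N]) {
    int s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both versions compute the same sum; only the access order, and therefore the cache behavior, differs.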
12. Problems
1. They are local techniques (e.g., they target only the initial load instruction, or loops only).
2. It is necessary to change the program structure.
13. How can we apply cache optimization to any program globally?
[Figure: main() executes x = a[i]; a[i] is fetched from main memory into cache memory.]
14. How can we apply cache optimization to any program globally?
[Figure: because data moves in cache-line units, the neighboring element a[i+1] is brought into cache memory together with a[i].]
15. How can we apply cache optimization to any program globally?
[Figure: main() accesses a[i], then b[i], then a[i+1]; after the first load, the line holding a[i] and a[i+1] is resident in cache memory.]
16. How can we apply cache optimization to any program globally?
[Figure: the load of b[i] brings the line holding b[i] and b[i+1] into cache memory, displacing the line holding a[i] and a[i+1].]
17. How can we apply cache optimization to any program globally?
[Figure: the later load of a[i+1] misses, because its line was displaced by b[i]'s line. Cache miss.]
18. How can we apply cache optimization to any program globally?
We can remove this cache miss by changing the order of accesses.
[Figure: the same access sequence; if a[i+1] is accessed while its line is still resident, the miss disappears.]
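The reordering idea can be sketched in C. This is a minimal sketch, assuming a[i] and a[i+1] share one cache line and that the line holding b[i] may displace it; the function names use_before and use_after are illustrative, and both orders compute the same value:

```c
/* "Before": the two accesses to a[] are separated by an access to b[],
 * so a[i]'s cache line may be evicted before a[i+1] is read. */
int use_before(const int *a, const int *b, int i) {
    int x = a[i];      /* load a[i]: brings a[i], a[i+1] into the line */
    int y = b[i];      /* load b[i]: may displace that line */
    int w = a[i + 1];  /* load a[i+1]: possible second miss on a[] */
    return x + y + w;
}

/* "After": the accesses to a[] are aggregated, so a[i+1] is read while
 * a[i]'s line is still resident. */
int use_after(const int *a, const int *b, int i) {
    int x = a[i];      /* load a[i] */
    int w = a[i + 1];  /* reuse the same cache line immediately */
    int y = b[i];      /* load b[i] afterwards */
    return x + y + w;
}
```

This is legal only when reordering preserves the program's dependences, which is exactly what the code-motion framework in the following slides checks.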
19. Code motion
x = a[i]
y = x+1
z = b[i]    <- expels a[i]'s line from cache memory
w = a[i+j]
20. Code motion
x = a[i]
w = a[i+j]
y = x+1
z = b[i]
21. Code motion
x = a[i]
w = a[i+j]  <- live range of w begins here and extends downward
y = x+1
z = b[i]
22. Code motion
x = a[i]
w = a[i+j]  <- live ranges of x and w now overlap
y = x+1
z = b[i]
23. Code motion
x = a[i]
w = a[i+j]
y = x+1     <- spill
z = b[i]
24. Code motion
x = a[i]
t = Load(j) <- reload of the spilled j
w = a[i+t]  <- changes the access order
y = x+1
z = b[i]
25. Code motion
x = a[i]
w = a[i+j]
y = x+1
z = b[i]
26. Code motion
x = a[i]
y = x+1
w = a[i+j]  <- motion delayed until after the last use of x
z = b[i]
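The sequence of slides 19-26 can be summarized as one C sketch. The function names original and delayed are illustrative assumptions; the point is that the delayed placement still groups the two a[] accesses before the b[i] access, without overlapping the live ranges of x and w:

```c
/* "Original" order: by the time w = a[i+j] executes, the load of b[i]
 * may already have expelled a[i]'s cache line. */
int original(const int *a, const int *b, int i, int j) {
    int x = a[i];
    int y = x + 1;     /* last use of x */
    int z = b[i];      /* may expel a[i]'s line from the cache */
    int w = a[i + j];  /* possible cache miss */
    return y + z + w;
}

/* "Delayed" motion: w = a[i+j] is moved up, but only to just after the
 * last use of x, so x's register is free and no spill is introduced,
 * while both a[] accesses still precede the b[i] access. */
int delayed(const int *a, const int *b, int i, int j) {
    int x = a[i];
    int y = x + 1;     /* x dies here, freeing its register */
    int w = a[i + j];  /* aggregated with the a[i] access */
    int z = b[i];
    return y + z + w;
}
```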
27. Implementation
We use Partial Redundancy Elimination (PRE),
one of the classic code optimizations.
It eliminates redundant expressions.
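What PRE does can be shown with a minimal C sketch, assuming the load of a[i] plays the role of the partially redundant expression; the function names before_pre and after_pre are illustrative:

```c
/* Before PRE: a[i] is loaded twice on the cond path (once in the branch,
 * once at the return), but only once on the other path -- the second
 * load is partially redundant. */
int before_pre(const int *a, int i, int cond) {
    int r;
    if (cond)
        r = a[i] + 1;    /* a[i] loaded here ... */
    else
        r = 0;
    return r + a[i];     /* ... and loaded again on that path */
}

/* After PRE: the load is hoisted into a temporary, so a[i] is evaluated
 * exactly once on every path. */
int after_pre(const int *a, int i, int cond) {
    int t = a[i];        /* single hoisted load */
    int r;
    if (cond)
        r = t + 1;
    else
        r = 0;
    return r + t;
}
```

Both functions are semantically equivalent; PRE only changes where the expression is evaluated.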
29. LCM
LCM determines two insertion nodes: Earliest and Latest.
• Earliest(n) denotes that node n is the closest to the start node among the nodes where the load can be inserted.
• Latest(n) denotes that node n is the closest to the nodes that contain the same load instruction.
[Figure: a control-flow graph containing x = a[i] and, further down, y = a[i].]
Knoop, J., Rüthing, O. and Steffen, B.: Lazy Code Motion, Proc. Programming Language Design and Implementation, ACM, pp. 224-234, 1992.
57. Conclusion
We proposed a new cache optimization, Global Load Instruction Aggregation (GLIA).
1. GLIA can be applied to any program.
2. GLIA improves cache efficiency.
3. GLIA takes register spills into account.
Thank you for your attention.