The 2012 International Symposium on Parallel Architectures, Algorithms and Programming.
Global Load Instruction Aggregation Based on Code Motion
1. The 2012 International Symposium on Parallel Architectures, Algorithms and Programming.
December 18, 2012
Global Load Instruction Aggregation Based on Code Motion
11. Previous works
1. Prefetch instructions
2. Transform loop structures.
before:
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]
after:
for(i=0;i<10;i++)
  for(j=0;j<10;j++)
    ... = a[i][j]
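The loop interchange above can be written out as a minimal C sketch, assuming a row-major 10x10 array; the function names sum_before and sum_after are illustrative, not from the talk:

```c
#include <assert.h>

#define N 10

/* "Before": the outer loop walks columns, so the inner loop touches
 * a[0][j], a[1][j], ... with stride N -- poor spatial locality for a
 * row-major array. */
int sum_before(int a[N][N]) {
    int s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* "After": interchanging the loops makes the inner loop walk one row
 * with stride 1, so consecutive accesses hit the same cache line. */
int sum_after(int a[N][N]) {
    int s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both versions compute the same sum; only the access order, and therefore the cache behavior, differs.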
12. Problems
1. They are local techniques (e.g., they target only the initial load instruction, or loops only).
2. It is necessary to change the program structure.
13. How can we apply cache optimization to any program globally?
[Figure: main() executes x = a[i]; a[i] is fetched from main memory into cache memory.]
14. How can we apply cache optimization to any program globally?
[Figure: because data moves in cache-line units, the neighboring element a[i+1] is brought into cache memory together with a[i].]
15. How can we apply cache optimization to any program globally?
[Figure: main() accesses a[i], then b[i], then a[i+1]; after the first load, the line holding a[i] and a[i+1] is resident in cache memory.]
16. How can we apply cache optimization to any program globally?
[Figure: the load of b[i] brings the line holding b[i] and b[i+1] into cache memory, displacing the line holding a[i] and a[i+1].]
17. How can we apply cache optimization to any program globally?
[Figure: the later load of a[i+1] misses, because its line was displaced by b[i]'s line. Cache miss.]
18. How can we apply cache optimization to any program globally?
We can remove this cache miss by changing the order of accesses.
[Figure: the same access sequence; if a[i+1] is accessed while its line is still resident, the miss disappears.]
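The reordering idea can be sketched in C. This is a minimal sketch, assuming a[i] and a[i+1] share one cache line and that the line holding b[i] may displace it; the function names use_before and use_after are illustrative, and both orders compute the same value:

```c
/* "Before": the two accesses to a[] are separated by an access to b[],
 * so a[i]'s cache line may be evicted before a[i+1] is read. */
int use_before(const int *a, const int *b, int i) {
    int x = a[i];      /* load a[i]: brings a[i], a[i+1] into the line */
    int y = b[i];      /* load b[i]: may displace that line */
    int w = a[i + 1];  /* load a[i+1]: possible second miss on a[] */
    return x + y + w;
}

/* "After": the accesses to a[] are aggregated, so a[i+1] is read while
 * a[i]'s line is still resident. */
int use_after(const int *a, const int *b, int i) {
    int x = a[i];      /* load a[i] */
    int w = a[i + 1];  /* reuse the same cache line immediately */
    int y = b[i];      /* load b[i] afterwards */
    return x + y + w;
}
```

This is legal only when reordering preserves the program's dependences, which is exactly what the code-motion framework in the following slides checks.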
19. Code motion
x = a[i]
y = x+1
z = b[i]    <- expels a[i]'s line from cache memory
w = a[i+j]
20. Code motion
x = a[i]
w = a[i+j]
y = x+1
z = b[i]
21. Code motion
x = a[i]
w = a[i+j]  <- live range of w begins here and extends downward
y = x+1
z = b[i]
22. Code motion
x = a[i]
w = a[i+j]  <- live ranges of x and w now overlap
y = x+1
z = b[i]
23. Code motion
x = a[i]
w = a[i+j]
y = x+1     <- spill
z = b[i]
24. Code motion
x = a[i]
t = Load(j) <- reload of the spilled j
w = a[i+t]  <- changes the access order
y = x+1
z = b[i]
25. Code motion
x = a[i]
w = a[i+j]
y = x+1
z = b[i]
26. Code motion
x = a[i]
y = x+1
w = a[i+j]  <- motion delayed until after the last use of x
z = b[i]
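The sequence of slides 19-26 can be summarized as one C sketch. The function names original and delayed are illustrative assumptions; the point is that the delayed placement still groups the two a[] accesses before the b[i] access, without overlapping the live ranges of x and w:

```c
/* "Original" order: by the time w = a[i+j] executes, the load of b[i]
 * may already have expelled a[i]'s cache line. */
int original(const int *a, const int *b, int i, int j) {
    int x = a[i];
    int y = x + 1;     /* last use of x */
    int z = b[i];      /* may expel a[i]'s line from the cache */
    int w = a[i + j];  /* possible cache miss */
    return y + z + w;
}

/* "Delayed" motion: w = a[i+j] is moved up, but only to just after the
 * last use of x, so x's register is free and no spill is introduced,
 * while both a[] accesses still precede the b[i] access. */
int delayed(const int *a, const int *b, int i, int j) {
    int x = a[i];
    int y = x + 1;     /* x dies here, freeing its register */
    int w = a[i + j];  /* aggregated with the a[i] access */
    int z = b[i];
    return y + z + w;
}
```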
27. Implementation
We use Partial Redundancy Elimination (PRE),
one of the classic code optimizations.
It eliminates redundant expressions.
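What PRE does can be shown with a minimal C sketch, assuming the load of a[i] plays the role of the partially redundant expression; the function names before_pre and after_pre are illustrative:

```c
/* Before PRE: a[i] is loaded twice on the cond path (once in the branch,
 * once at the return), but only once on the other path -- the second
 * load is partially redundant. */
int before_pre(const int *a, int i, int cond) {
    int r;
    if (cond)
        r = a[i] + 1;    /* a[i] loaded here ... */
    else
        r = 0;
    return r + a[i];     /* ... and loaded again on that path */
}

/* After PRE: the load is hoisted into a temporary, so a[i] is evaluated
 * exactly once on every path. */
int after_pre(const int *a, int i, int cond) {
    int t = a[i];        /* single hoisted load */
    int r;
    if (cond)
        r = t + 1;
    else
        r = 0;
    return r + t;
}
```

Both functions are semantically equivalent; PRE only changes where the expression is evaluated.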
29. LCM
LCM determines two insertion nodes: Earliest and Latest.
• Earliest(n) denotes that node n is the closest to the start node among the nodes where the load can be inserted.
• Latest(n) denotes that node n is the closest to the nodes that contain the same load instruction.
[Figure: a control-flow graph containing x = a[i] and, further down, y = a[i].]
Knoop, J., Rüthing, O. and Steffen, B.: Lazy Code Motion, Proc. Programming Language Design and Implementation, ACM, pp. 224-234, 1992.
57. Conclusion
We proposed a new cache optimization, Global Load Instruction Aggregation (GLIA).
1. GLIA can be applied to any program.
2. GLIA improves cache efficiency.
3. GLIA takes register spills into account.
Thank you for your attention.