This document discusses various compiler optimizations including constant folding, hoisting loop invariant code, scalarization, loop unswitching, peeling and sentinels, strength reduction, loop induction variable elimination, and auto-vectorization. It provides code examples and the generated assembly for each optimization. It explains that many optimizations are performed by compilers automatically at high optimization levels, while some more advanced optimizations like loop peeling and sentinels require manual intervention.
OUTLINE
Constant folding
Hoisting loop invariant code
Scalarization
Loop unswitching
Loop peeling and sentinels
Strength reduction
Loop-induction variable elimination
Auto-vectorization
Function body inlining
Built-in detection
Case study
CONSTANT FOLDING
Constant folding evaluates constant expressions at compile time.
It is one of the simplest and most widely used compiler
transformations. Expressions can be quite complicated, but the
absence of any side effects is always required.
double b = exp(sqrt(M_PI/2));
fldd d17, .L5
// other code
.L5:
.word 2911864813
.word 1074529267
May be performed in global as well as local scope.
Applicable to any code pattern.
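As a minimal additional sketch (function names are ours), a pure constant expression folds to a single immediate, while an expression involving a call with side effects does not:

```c
#include <stdlib.h>

/* 60 * 60 * 24 has no side effects, so the compiler folds it to
 * the immediate 86400 at any optimization level. */
int seconds_per_day(void)
{
    return 60 * 60 * 24;
}

/* rand() has side effects (it mutates hidden PRNG state), so this
 * expression cannot be folded at compile time. */
int not_foldable(void)
{
    return rand() % 24;
}
```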
HOISTING LOOP INVARIANT CODE
The goal of hoisting — also called loop-invariant code motion
— is to avoid recomputing loop-invariant code each time
through the body of a loop.
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt(M_PI/2));
}
... is the same as...
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt(M_PI/2));
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
SCALARIZATION
Scalarization replaces branchy code with a branchless analogue.
Usually it is performed on branches inside a loop body to
allow further optimizations. Machine-independent, local.
int branchy(int i)
{
if (i > 4 && i < 42) return 1;
return 0;
}
int branchless(int i)
{
return (((unsigned)i) - 5 <= 36);
}
$ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s
branchy:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
branchless:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
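The equivalence of the two forms can be checked directly over a sample range (a sketch; note the `<= 36` bound, which matches the `cset ls`, i.e. unsigned lower-or-same, in the generated code):

```c
/* Branchy original: two signed comparisons with a short-circuit. */
static int branchy(int i)
{
    if (i > 4 && i < 42) return 1;
    return 0;
}

/* Branchless analogue: one unsigned subtraction folds both bounds
 * into a single range check. Values below 5 wrap around to huge
 * unsigned numbers and fail the check as well. */
static int branchless(int i)
{
    return ((unsigned)i - 5) <= 36;
}
```

Looping `i` over, say, -1000..1000 and asserting `branchy(i) == branchless(i)` confirms the two agree on every input.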
LOOP UNSWITCHING
Moves loop-invariant conditionals or switches which are
independent of the loop index out of the loop. Increases ILP,
enables further optimizations.
for (int i = 0; i < len; i++)
{
if (a > 32)
arr[i] = a;
else
arr[i] = 42;
}
int i = 0;
if (a > 32)
for (; i < len; i++)
arr[i] = a;
else
for (; i < len; i++)
arr[i] = 42;
$ gcc -march=armv8-a+simd -O3 -S -o 1.s
-fno-tree-vectorize 1.c
Auto-vectorization is disabled just to make the example readable.
LOOP PEELING
Loop peeling takes a small number of iterations from the
beginning and/or the end of a loop and executes them separately,
which eliminates the if-statements. It can be performed by a
compiler at high optimization levels.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
b[0] = a[1];
for(int i=1; i<len-1; i++)
b[i] = a[i+1]+a[i-1];
b[len-1] = a[len-2];
For this example, both tested compilers (gcc 4.9 and clang 3.5) won't perform the optimization.
Make sure that the compiler was able to apply the optimization to such a code pattern, or do it manually.
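When peeling by hand, it is worth checking that the peeled form is faithful to the branchy original on a sample array (function names are ours; len >= 2 is assumed):

```c
/* Branchy original from the slide. */
static void stencil_branchy(const int *a, int *b, int len)
{
    for (int i = 0; i < len; i++) {
        if (i == 0)
            b[i] = a[i + 1];
        else if (i == len - 1)
            b[i] = a[i - 1];
        else
            b[i] = a[i + 1] + a[i - 1];
    }
}

/* Manually peeled version: the first and last iterations run
 * outside the loop, so the hot body is branch-free. */
static void stencil_peeled(const int *a, int *b, int len)
{
    b[0] = a[1];
    for (int i = 1; i < len - 1; i++)
        b[i] = a[i + 1] + a[i - 1];
    b[len - 1] = a[len - 2];
}
```

Running both on the same input and comparing the outputs element by element catches off-by-one mistakes in the peel.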
ADVANCED UNSWITCHING: SENTINELS
Sentinels are special dummy values placed in a data structure
to simplify the logic of handling boundary conditions (in
particular, loop-exit tests or elimination of out-of-bounds
checks).
This optimization cannot be performed by the compiler because
it requires changes in the use of data structures.
Let's assume that array a contains two extra elements: one at
the left (a[-1]) and one at the right (a[len]). With these
assumptions the code can be rewritten as follows.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
// setup boundaries
a[-1] = 0;
a[len] = 0;
for (int i=0; i<len; i++)
b[i] = a[i+1] + a[i-1];
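Because a[-1] and a[len] must be valid storage, the buffer has to be allocated with two extra slots. A minimal sketch of such an allocation scheme (helper names are our own):

```c
#include <stdlib.h>

/* Allocate len payload elements plus two zero-filled sentinel
 * slots, returning a pointer shifted by one so that a[-1] and
 * a[len] are valid elements. */
static int *alloc_with_sentinels(int len)
{
    int *storage = calloc((size_t)len + 2, sizeof *storage);
    return storage ? storage + 1 : NULL;
}

static void free_with_sentinels(int *a)
{
    if (a)
        free(a - 1);
}

/* The branch-free stencil from the slide: the sentinels make the
 * i == 0 and i == len-1 accesses safe. */
static void stencil(const int *a, int *b, int len)
{
    for (int i = 0; i < len; i++)
        b[i] = a[i + 1] + a[i - 1];
}
```

With a = {1, 2, 3, 4} and zero sentinels this produces b = {2, 4, 6, 3}, exactly what the branchy original computes.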
STRENGTH REDUCTION
Strength reduction replaces complex expressions with simpler equivalents.
double usePow(double x)
{
return pow(x, 2.);
}
float usePowf(float x)
{
return powf(x, 2.f);
}
Machine-independent transformation in most cases.
It may be machine-dependent if it relies on a specific
feature set implemented in the HW (e.g. built-in detection).
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
usePow:
fmdrr d16, r0, r1
fmuld d16, d16, d16
fmrrd r0, r1, d16
bx lr
usePowf:
fmsr s15, r0
fmuls s15, s15, s15
fmrs r0, s15
bx lr
Usually it is performed in a local scope
for a dependency chain.
STRENGTH REDUCTION (ADVANCED CASE)
float useManyPowf(float a, float b, float c,
float d, float e, float f, float x)
{
return
a * powf(x, 5.f) +
b * powf(x, 4.f) +
c * powf(x, 3.f) +
d * powf(x, 2.f) +
e * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
STRENGTH REDUCTION (MANUAL)
Let's further reduce latency by applying Horner's rule.
float applyHornerf(float a, float b, float c,
float d, float e, float f, float x)
{
return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
The compiler is not capable of this optimization because the
result would not be bit-exact with the original formula.
LOOP-INDUCTION VARIABLE ELIMINATION
In most cases the compiler is able to replace
address arithmetic with pointer arithmetic.
void function(int* arr, int len)
{
for (int i = 0; i < len; i++)
arr[i] = 1;
}
void function(int* arr, int len)
{
for (int* p = arr; p < arr + len; p++)
*p = 1;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
cmp r1, #0
ble .L1
mov r3,r0
add r0,r0,r1,lsl #2
movs r2, #1
.L3:
str r2, [r3], #4
cmp r3, r0
bne .L3
.L1:
bx lr
add r1,r0,r1,lsl #2
cmp r0, r1
bcs .L5
movs r3, #1
.L8:
str r3, [r0], #4
cmp r1, r0
bhi .L8
.L5:
bx lr
Most hand-written pointer optimizations do not make sense
at optimization levels higher than -O0.
AUTO-VECTORIZATION
Auto-vectorization is machine-code generation
that takes advantage of vector instructions.
Most modern architectures have vector extensions, either as
a co-processor or as dedicated pipes:
MMX, SSE, SSE2, SSE4, AVX, AVX-512
AltiVec, VSX
ASIMD (NEON), MSA
Enabled by inlining, unrolling, fusion, software pipelining,
inter-procedural optimization, and other machine-independent
transformations.
void vectorizeMe(float *a, float *b, int len)
{
int i;
for (i = 0; i < len; i++)
a[i] += b[i];
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp
> -fopt-info-missed 1.c
But NEON does not support full IEEE 754,
so gcc cannot vectorize the loop, as it tells us:
1.c:64:3: note: not vectorized:
relevant stmt not supported: _13 = _9 + _12;
Here is the generated assembly:
.L3:
fldmias r1!, {s14}
flds s15, [r0]
fadds s15, s15, s14
fstmias r0!, {s15}
cmp r0, r2
bne .L3
But armv8-a does support it, so let's check!
$ gcc -march=armv8-a+simd -O3 -S -o 1.s
> -fopt-info-all 1.c
1.c:64:3: note: loop vectorized
Here is the generated assembly and the full optimizer's report:
.L6:
ldr q1, [x3],16
add w6, w6, 1
ldr q0, [x7],16
cmp w6, w4
fadd v0.4s, v0.4s, v1.4s
str q0, [x8],16
bcc .L6
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop versioned for vectorization because of possible aliasing
1.c:66:3: note: loop peeled for vectorization to enhance alignment
1.c:66:3: note: loop with 3 iterations completely unrolled
1.c:61:6: note: loop with 3 iterations completely unrolled
The compiler versions the loop to allow optimized paths
for aligned and non-aliasing pointers.
KEYWORDS
Let's follow the advice and add some keywords.
The optimizer's report shrinks.
void vectorizeMe(float* __restrict a_, float* __restrict b_, int len)
{
float *a=__builtin_assume_aligned(a_, 16);
float *b=__builtin_assume_aligned(b_, 16);
for (int i = 0; i < len; i++)
a[i] += b[i];
}
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop with 3 iterations completely unrolled
The __restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective.
FUNCTION BODY INLINING
Replaces a function call with the function body itself.
int square(int x) { return x*x; }
for (int i = 0; i < len; i++)
arr[i] = square(i);
Becomes
for (int i = 0; i < len; i++)
arr[i] = i*i;
Enables all further optimizations.
Machine-independent, interprocedural.
CASE STUDY: FLOATING POINT
double power( double d, unsigned n)
{
double x = 1.0;
for (unsigned j = 1; j<=n; j++, x *= d) ;
return x;
}
int main ()
{
double a = 1./0x80000000U, sum = 0.0;
for (unsigned i=1; i<= 0x80000000U; i++)
sum += power( i*a, i % 8);
printf ("sum = %g\n", sum);
}
flags-dp.c
1. Compile it without optimization
2. Compile it with O1: ~3.26 speedup
$ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m24.550s
$ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m7.529s
3. Compile it with O2: ~1.24 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.069s
$ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.067s
Total speedup is ~4.05
CASE STUDY: INTEGER
int power( int d, unsigned n)
{
int x = 1;
for (unsigned j = 1; j<=n; j++, x*=d) ;
return x;
}
int main ()
{
int64_t sum = 0;
for (unsigned i=1; i<0x80000000U; i++)
sum += power( i, i % 8);
printf ("sum = %ld\n", sum);
}
flags.c
1. Compile it without optimization
2. Compile it with O1: ~2.64 speedup
$ gcc -std=c99 -Wall -O0 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m18.750s
$ gcc -std=c99 -Wall -O1 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.092s
3. Compile it with O2: ~0.97 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.300s
$ gcc -std=c99 -Wall -O3 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.082s
WHY IS THERE NO IMPROVEMENT?
HELPING A COMPILER
The compiler usually applies optimizations to inner loops. In
this example the number of iterations of the inner loop depends
on the value of an induction variable of the outer loop.
Let's help the compiler (flags-dp-tuned.c).
/*power function is not changed*/
int main ()
{
double a = 1./0x80000000U, s = -1.;
for (double i=0; i<=0x80000000U-8; i += 8)
{
s+=power((i+0)*a,0);s+=power((i+1)*a,1);
s+=power((i+2)*a,2);s+=power((i+3)*a,3);
s+=power((i+4)*a,4);s+=power((i+5)*a,5);
s+=power((i+6)*a,6);s+=power((i+7)*a,7);
}
printf ("sum = %g\n", s);
}
$ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned
$ time ./flags-dp-tuned
sum = 7.29569e+08
real 0m2.448s
Speedup is ~2.48x.
Let's try it for integers (flags-tuned.c).
/*power function is not changed*/
int main ()
{
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
sum+=power(i+0, 0); sum+=power(i+1,1);
sum+=power(i+2, 2); sum+=power(i+3,3);
sum+=power(i+4, 4); sum+=power(i+5,5);
sum+=power(i+6, 6); sum+=power(i+7,7);
}
printf ("sum = %ld\n", sum);
}
$ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned
$ time ./flags-tuned
sum = 288231861405089791
real 0m1.286s
Speedup is ~5.5x.
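To convince ourselves the manual unrolling is an identity, we can compare the unrolled sum against the straightforward one at a scaled-down bound (16 here instead of 0x80000000U, so the intermediate powers stay in range; function names are ours):

```c
#include <stdint.h>

/* power function from the slide, unchanged. */
static int power(int d, unsigned n)
{
    int x = 1;
    for (unsigned j = 1; j <= n; j++, x *= d)
        ;
    return x;
}

/* Plain sum over 1..n-1, as in flags.c. */
static int64_t sum_plain(unsigned n)
{
    int64_t sum = 0;
    for (unsigned i = 1; i < n; i++)
        sum += power(i, i % 8);
    return sum;
}

/* Unrolled-by-8 sum, as in flags-tuned.c: starts at i = 0 and
 * seeds the accumulator with -1 to cancel the extra
 * power(0, 0) == 1 term. n must be a multiple of 8. */
static int64_t sum_unrolled(unsigned n)
{
    int64_t sum = -1;
    for (unsigned i = 0; i <= n - 8; i += 8) {
        sum += power(i + 0, 0); sum += power(i + 1, 1);
        sum += power(i + 2, 2); sum += power(i + 3, 3);
        sum += power(i + 4, 4); sum += power(i + 5, 5);
        sum += power(i + 6, 6); sum += power(i + 7, 7);
    }
    return sum;
}
```

The -1 seed is the same trick the tuned sources use: the unrolled loop also visits i = 0, whose power(0, 0) contributes an extra 1.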
DETAILED ANALYSIS
This is how the compiler sees the outer loop
after inner-loop inlining:
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
int d = (int)i;
sum+=1;
sum+=(1+d);
sum+=(2+d)*(2+d);
sum+=(3+d)*(3+d)*(3+d);
sum+=(4+d)*(4+d)*(4+d)*(4+d);
sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d);
sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d);
sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d);
}
(I+7)
leal 5(%r8), %esi
movl $7, %ecx
movl $1, %edx
.L2:
imull %esi, %edx
subl $1, %ecx
jne .L2
movslq %edx, %rdx
addq %rdx, %rax
The full listing is available on GitHub.
SUMMARY
Copy propagation and other basic transformations enable
more complicated ones.
Hoisting loop-invariant code and other strength reduction
techniques minimize the number of operations.
The compiler is very conservative about hoisting floating-point math.
Scalarization and loop unswitching make loop bodies flat.
Loop peeling capabilities of compilers are still poor; use the
sentinel technique to eliminate extra checks and control flow.
Most hand-written pointer optimizations do not make sense
at optimization levels higher than -O0.
The __restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective nowadays.
Function body inlining is essential for further loop
optimizations.
The compiler will detect memcpy and memset patterns, even if
you implement them manually, and will call the library function
instead. Library functions are usually more efficient.
Use knowledge about semantics of your code to help a
compiler optimize it
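The memcpy/memset detection mentioned above can be sketched as follows (a hypothetical example; at -O2 gcc typically replaces both loops with calls to the library routines):

```c
#include <stddef.h>

/* A naive byte copy: gcc recognizes this pattern at higher
 * optimization levels and emits a call to the library memcpy. */
static void naive_copy(unsigned char *dst, const unsigned char *src,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* A naive fill: likewise recognized and turned into memset. */
static void naive_fill(unsigned char *dst, unsigned char v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = v;
}
```

Either way the semantics are unchanged, which is why calling the library function directly is usually the simpler and faster choice.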