1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
MASTERING COMPILER OPTIMIZATIONS
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architectures concepts
3
OUTLINE
Constant folding
Hoisting loop invariant code
Scalarization
Loop unswitching
Loop peeling and sentinels
Strength reduction
Loop-induction variable elimination
Auto-vectorization
Function body inlining
Built-in detection
Case study
4
CONSTANT FOLDING
Constant folding evaluates constant expressions at compile time.
It is one of the simplest and most widely used compiler
transformations. The expressions can be quite complicated, but
they must be free of any kind of side effects.
double b = exp(sqrt(M_PI/2));
fldd d17, .L5
// other code
.L5:
.word 2911864813
.word 1074529267
May be performed in global as well as local scope.
Applicable to any code pattern.
5 . 1
HOISTING LOOP INVARIANT CODE
The goal of hoisting — also called loop-invariant code motion
— is to avoid recomputing loop-invariant code each time
through the body of a loop.
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt(M_PI/2));
}
... is the same as...
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt(M_PI/2));
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
5 . 2
HOISTING LOOP INVARIANT CODE
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s \
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
fldd d17, .L5
.L3:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L3
.L1:
bx lr
.L5:
.word 2911864813
.word 1074529267
fldd d17, .L11
.L9:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L9
.L7:
bx lr
.L11:
.word 2911864813
.word 1074529267
5 . 3
HOISTING NON-CONSTANTS
Let's make the expression depend on a runtime variable
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt((double)len)/2.);
}
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt((double)len)/2.);
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s \
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
5 . 4
HOISTING NON-CONSTANTS
fmsr s15, r2 @ int
fsitod d9, s15
fsqrtd d9, d9
.L4:
fldmiad r6!, {d8}
fcpyd d16, d9
fcmpd d9, d9
fmstat
beq .L3
fmsr s15, r7 @ int
fsitod d16, s15
fmrrd r0, r1, d16
bl sqrt(PLT)
fmdrr d16, r0, r1
.L3:
fmuld d8, d8, d16
fstmiad r5!, {d8}
adds r4, r4, #1
cmp r4, r7
bne .L4
fmsr s15, r2 @ int
fsitod d17, s15
fsqrtd d17, d17
fcmpd d17, d17
fmstat
beq .L9
fsitod d16, s15
fmrrd r0, r1, d16
bl sqrt(PLT)
fmdrr d17, r0, r1
.L9:
mov r3, r5
mov r1, r4
add r0, r5, r6, lsl #3
.L11:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L11
The compiler isn't sure that floating-point flags aren't affected!
6
SCALARIZATION
Scalarization replaces branchy code with a branchless analog.
Usually it is performed on branches inside a loop body to
enable further optimizations. Machine-independent, local.
int branchy(int i)
{
if (i > 4 && i < 42) return 1;
return 0;
}
int branchless(int i)
{
return (((unsigned)i) - 5 < 37);
}
$ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s
branchy:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
branchless:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
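Why the branchless form works (a sketch with my own illustrative names): casting to unsigned makes negative values wrap around to huge numbers, so the two-sided range check collapses into a single unsigned comparison.

```c
/* The range check 5 <= i <= 41 admits 37 values. Subtracting 5 maps
 * them onto 0..36; any other i, including negative ones, wraps to a
 * large unsigned value and fails the single comparison. */
int in_range_branchy(int i)
{
    return (i > 4 && i < 42) ? 1 : 0;
}

int in_range_branchless(int i)
{
    return (unsigned)i - 5u < 37u;
}
```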
7 . 1
LOOP UNSWITCHING
Moves loop-invariant conditionals or switches which are
independent of the loop index out of the loop. Increases ILP,
enables further optimizations.
for (int i = 0; i < len; i++)
{
if (a > 32)
arr[i] = a;
else
arr[i] = 42;
}
int i = 0;
if (a > 32)
for (; i < len; i++)
arr[i] = a;
else
for (; i < len; i++)
arr[i] = 42;
$ gcc -march=armv8-a+simd -O3 -S -o 1.s \
> -fno-tree-vectorize 1.c
Auto-vectorization is disabled just to make the example readable.
7 . 2
LOOP UNSWITCHING
cmp w1, wzr
ble .L1
cmp w2, 32
bgt .L6
mov x2, 0
mov w3, 42
.L4:
str w3,[x0,x2,lsl 2]
add x2, x2, 1
cmp w1, w2
bgt .L4
.L1:
ret
.L6:
mov x3, 0
.L3:
str w2,[x0,x3,lsl 2]
add x3, x3, 1
cmp w1, w3
bgt .L3
ret
7 . 3
Can be done by a compiler under high optimization levels.
LOOP PEELING
Loop peeling takes a small number of iterations from the
beginning and/or the end of a loop out of the loop body and
executes them separately, which eliminates if-statements.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
b[0] = a[1];
for(int i=1; i<len-1; i++)
b[i] = a[i+1]+a[i-1];
b[len-1] = a[len-2];
For this example, both tested compilers (gcc 4.9 and clang 3.5) won't perform the optimization.
Make sure that the compiler was able to apply the optimization to such a code pattern, or do it manually.
7 . 4
ADVANCED UNSWITCHING: SENTINELS
Sentinels are special dummy values placed in a data structure
to simplify the logic of handling boundary conditions (in
particular, handling of loop-exit tests or elimination of
out-of-bounds checks).
This optimization cannot be performed by the compiler because
it requires changes in how the data structures are used.
7 . 5
ADVANCED UNSWITCHING: SENTINELS
Let's assume that array a contains two extra elements: one at
the left (a[-1]) and one at the right (a[len]). With these
assumptions the code can be rewritten as follows.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
// setup boundaries
a[-1] = 0;
a[len] = 0;
for (int i=0; i<len; i++)
b[i] = a[i+1] + a[i-1];
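One way to obtain valid a[-1] and a[len] slots is to over-allocate by two elements and hand out a pointer one past the start of the buffer. This setup sketch (function names are mine) is one assumed arrangement, not code from the slides.

```c
#include <stdlib.h>

/* Allocate len payload elements plus two sentinel slots so that
 * both a[-1] and a[len] refer to valid memory. */
double *alloc_with_sentinels(int len)
{
    double *storage = malloc((size_t)(len + 2) * sizeof *storage);
    return storage ? storage + 1 : NULL;
}

void free_with_sentinels(double *a)
{
    if (a)
        free(a - 1); /* undo the +1 offset before freeing */
}
```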
8 . 1
STRENGTH REDUCTION
replaces complex expressions with simpler equivalents
double usePow(double x)
{
return pow(x, 2.);
}
float usePowf(float x)
{
return powf(x, 2.f);
}
A machine-independent transformation in most cases.
It may be machine-dependent if it relies on a specific
feature set implemented in the HW (e.g. built-in detection).
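The textbook integer instance of the same idea (my example, not from the slides): a multiplication or modulo by a power of two reduces to a shift or a mask, a rewrite compilers apply automatically at -O1 and above.

```c
/* Strength reduction by hand: multiplication by 8 becomes a left
 * shift, and x % 8 for an unsigned x becomes a bit mask. Both
 * forms of each pair compute the same value. */
unsigned times8(unsigned x)    { return x * 8u; }
unsigned times8_sr(unsigned x) { return x << 3; }

unsigned mod8(unsigned x)      { return x % 8u; }
unsigned mod8_sr(unsigned x)   { return x & 7u; }
```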
8 . 2
STRENGTH REDUCTION
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
usePow:
fmdrr d16, r0, r1
fmuld d16, d16, d16
fmrrd r0, r1, d16
bx lr
usePowf:
fmsr s15, r0
fmuls s15, s15, s15
fmrs r0, s15
bx lr
Usually it is performed in a local scope
for a dependency chain.
8 . 3
STRENGTH REDUCTION (ADVANCED CASE)
float useManyPowf(float a, float b, float c,
float d, float e, float f, float x)
{
return
a * powf(x, 5.f) +
b * powf(x, 4.f) +
c * powf(x, 3.f) +
d * powf(x, 2.f) +
e * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
8 . 4
The compiler was only able to partly reduce the complexity,
generating vfma instructions where possible.
STRENGTH REDUCTION (ADVANCED)
useManyPowf:
push {r3, lr}
flds s17, [sp, #56]
fmsr s24, r1
movs r1, #0
fmsr s22, r0
movt r1, 16544
fmrs r0, s17
fmsr s21, r2
fmsr s20, r3
flds s19, [sp, #48]
flds s18, [sp, #52]
bl powf(PLT)
mov r1, #1082130432
fmsr s23, r0
fmrs r0, s17
bl powf(PLT)
movs r1, #0
movt r1, 16448
fmsr s16, r0
fmrs r0, s17
bl powf(PLT)
fmuls s16, s16, s24
vfma.f32 s16, s23, s22
fmsr s15, r0
vfma.f32 s16, s15, s21
fmuls s15, s17, s17
vfma.f32 s16, s20, s15
vfma.f32 s16, s19, s17
fadds s15, s16, s18
fldmfdd sp!, {d8-d12}
fmrs r0, s15
pop {r3, pc}
8 . 5
STRENGTH REDUCTION (MANUAL)
Let's further reduce latency by applying Horner's rule.
float applyHornerf(float a, float b, float c,
float d, float e, float f, float x)
{
return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
The compiler is not capable of this optimization because the
result would not be bit-exact with the original formula.
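Horner's rule generalizes to any degree. A minimal sketch (coefficients given highest degree first; the function name is mine):

```c
/* Evaluate c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1] using n-1
 * multiply-adds instead of separate pow calls. */
float horner(const float *c, int n, float x)
{
    float acc = c[0];
    for (int i = 1; i < n; i++)
        acc = acc * x + c[i];
    return acc;
}
```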
8 . 6
STRENGTH REDUCTION (MANUAL)
applyHornerf:
flds s15, [sp, #8]
fmsr s11, r0
fmsr s12, r1
flds s14, [sp]
vfma.f32 s12, s11, s15
fmsr s11, r2
flds s13, [sp, #4]
vfma.f32 s11, s12, s15
fcpys s12, s11
fmsr s11, r3
vfma.f32 s11, s12, s15
vfma.f32 s14, s11, s15
vfma.f32 s13, s14, s15
fmrs r0, s13
bx lr
9 . 1
LOOP-INDUCTION VARIABLE ELIMINATION
In most cases the compiler is able to replace
indexing arithmetic with pointer arithmetic.
void function(int* arr, int len)
{
for (int i = 0; i < len; i++)
arr[i] = 1;
}
void function(int* arr, int len)
{
for (int* p = arr; p < arr + len; p++)
*p = 1;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
9 . 2
LOOP-INDUCTION VARIABLE ELIMINATION
cmp r1, #0
ble .L1
mov r3,r0
add r0,r0,r1,lsl #2
movs r2, #1
.L3:
str r2, [r3], #4
cmp r3, r0
bne .L3
.L1:
bx lr
add r1,r0,r1,lsl #2
cmp r0, r1
bcs .L5
movs r3, #1
.L8:
str r3, [r0], #4
cmp r1, r0
bhi .L8
.L5:
bx lr
Most hand-written pointer optimizations do not pay off
at optimization levels higher than -O0.
10 . 1
AUTO-VECTORIZATION
Auto-vectorization is machine-code generation that
takes advantage of vector instructions.
Most modern architectures have vector extensions, either as
a co-processor or as dedicated pipes:
MMX, SSE, SSE2, SSE4, AVX, AVX-512
AltiVec, VSX
ASIMD (NEON), MSA
Enabled by inlining, unrolling, fusion, software pipelining,
inter-procedural optimization, and other machine-independent
transformations.
10 . 2
AUTO-VECTORIZATION
void vectorizeMe(float *a, float *b, int len)
{
int i;
for (i = 0; i < len; i++)
a[i] += b[i];
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 
> -fopt-info-missed 1.c
However, NEON does not fully support IEEE 754,
so gcc cannot vectorize the loop, as it tells us:
1.c:64:3: note: not vectorized:
relevant stmt not supported:_13 =_9+_12;
10 . 3
Here is the generated assembly.
AUTO-VECTORIZATION
.L3:
fldmias r1!, {s14}
flds s15, [r0]
fadds s15, s15, s14
fstmias r0!, {s15}
cmp r0, r2
bne .L3
But armv8-a does support it, so let's check!
$ gcc -march=armv8-a+simd -O3 -S -o 1.s 
> -fopt-info-all 1.c
1.c:64:3: note: loop vectorized
10 . 4
Here are the generated assembly and the full optimizer report
AUTO-VECTORIZATION
.L6:
ldr q1, [x3],16
add w6, w6, 1
ldr q0, [x7],16
cmp w6, w4
fadd v0.4s, v0.4s, v1.4s
str q0, [x8],16
bcc .L6
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop versioned for vectorization because of possible aliasing
1.c:66:3: note: loop peeled for vectorization to enhance alignment
1.c:66:3: note: loop with 3 iterations completely unrolled
1.c:61:6: note: loop with 3 iterations completely unrolled
The compiler versions the loop to provide optimized paths
for the case of aligned and non-aliasing pointers.
10 . 5
Let's follow the advice and add some keywords.
The optimizer report shrinks.
KEYWORDS
void vectorizeMe(float* __restrict a_, float* __restrict b_, int len)
{
float *a=__builtin_assume_aligned(a_, 16);
float *b=__builtin_assume_aligned(b_, 16);
for (int i = 0; i < len; i++)
a[i] += b[i];
}
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop with 3 iterations completely unrolled
__restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective
11 . 1
FUNCTION BODY INLINING
Replaces a function call with the function body itself.
int square(int x) { return x*x; }
for (int i = 0; i < len; i++)
arr[i] = square(i);
Becomes
for (int i = 0; i < len; i++)
arr[i] = i*i;
Enables all further optimizations.
Machine-independent, interprocedural.
11 . 2
FUNCTION BODY INLINING
$ gcc -march=armv8-a+nosimd -O3 -S -o 1.s 1.c
.L2:
add x2, x4, :lo12:.LANCHOR0
mov x1, 34464
mul w3, w0, w0
movk x1, 0x1, lsl 16
str w3, [x2,x0,lsl 2]
add x0, x0, 1
cmp x0, x1
bne .L2
Let's compile the code with vector extension enabled
-march=armv8-a+simd
11 . 3
FUNCTION BODY INLINING
add x0, x0, :lo12:.LANCHOR0
movi v2.4s, 0x4
ldr q0, [x1]
add x1, x0, 397312
add x1, x1, 2688
.L2:
mul v1.4s, v0.4s, v0.4s
add v0.4s, v0.4s, v2.4s
str q1, [x0],16
cmp x0, x1
bne .L2
Auto-vectorization is enabled by the possibility of
inlining the function call inside the loop.
12 . 1
BUILT-IN DETECTION
The compiler automatically substitutes the library functions memset
and memcpy to initialize and copy memory blocks
static char a[100000];
static char b[100000];
static void at(int idx, char val)
{
if (idx>=0 && idx<100000)
a[idx] = val;
}
int main()
{
for (int i=0; i<100000; ++i) a[i]=42;
for (int i=0; i<100000; ++i) at(i,-1);
for (int i=0; i<100000; ++i) b[i] = a[i];
}
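For reference, here is what the three loops boil down to once the compiler recognizes the patterns; calling the library functions directly (a sketch using the same array sizes) is equivalent and usually at least as fast.

```c
#include <string.h>

static char a[100000];
static char b[100000];

/* Explicit library calls equivalent to the three loops above. */
void fill_and_copy(void)
{
    memset(a, 42, sizeof a); /* a[i] = 42 */
    memset(a, -1, sizeof a); /* a[i] = -1, stored as byte 0xFF */
    memcpy(b, a, sizeof a);  /* b[i] = a[i] */
}
```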
12 . 2
Compiler knows what you mean
FILLING/COPYING MEMORY BLOCKS
main:
.LFB1:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $100000, %edx
movl $42, %esi
movl $a, %edi
call memset
movl $100000, %edx
movl $255, %esi
movl $a, %edi
call memset
movl $100000, %edx
movl $a, %esi
movl $b, %edi
call memcpy
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
12 . 3
The same picture if the code is compiled for an ARM target
FILLING/COPYING MEMORY BLOCKS
main:
ldr r3, .L3
mov r1, #42
stmfd sp!, {r4, lr}
add r3, pc, r3
movw r4, #34464
movt r4, 1
mov r0, r3
mov r2, r4
bl memset(PLT)
mov r2, r4
mov r1, #255
bl memset(PLT)
mov r2, r4
mov r3, r0
ldr r0, .L3+4
mov r1, r3
add r0, pc, r0
add r0, r0, #1792
bl memcpy(PLT)
ldmfd sp!, {r4, pc}
13 . 1
CASE STUDY: FLOATING POINT
double power( double d, unsigned n)
{
double x = 1.0;
for (unsigned j = 1; j<=n; j++, x *= d) ;
return x;
}
int main ()
{
double a = 1./0x80000000U, sum = 0.0;
for (unsigned i=1; i<= 0x80000000U; i++)
sum += power( i*a, i % 8);
printf ("sum = %g\n", sum);
}
flags-dp.c
13 . 2
CASE STUDY: FLOATING POINT
1. Compile it without optimization
2. Compile it with O1: ~3.26 speedup
$ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m24.550s
$ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m7.529s
13 . 3
CASE STUDY: FLOATING POINT
3. Compile it with O2: ~1.24 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.069s
$ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.067s
Total speedup is ~4.05
14 . 1
CASE STUDY: INTEGER
int power( int d, unsigned n)
{
int x = 1;
for (unsigned j = 1; j<=n; j++, x*=d) ;
return x;
}
int main ()
{
int64_t sum = 0;
for (unsigned i=1; i<0x80000000U; i++)
sum += power( i, i % 8);
printf ("sum = %ld\n", sum);
}
flags.c
14 . 2
CASE STUDY: INTEGER
1. Compile it without optimization
2. Compile it with O1: ~2.64 speedup
$ gcc -std=c99 -Wall -O0 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m18.750s
$ gcc -std=c99 -Wall -O1 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.092s
14 . 3
CASE STUDY: INTEGER
3. Compile it with O2:~0.97 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.300s
$ gcc -std=c99 -Wall -O3 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.082s
WHY IS THERE NO IMPROVEMENT?
14 . 4
EXAMPLE: INTEGER
-O1 vs -O2
THE COMPILER WENT OVERBOARD
WITH BRANCH TWIDDLING!
movl $1, %r8d
movl $0, %edx
.L9:
movl %r8d, %edi
movl %r8d, %esi
andl $7, %esi
je .L10
movl $1, %ecx
movl $1, %eax
.L8: @
addl $1, %eax @
imull %edi, %ecx @
cmpl %eax, %esi @
jae .L8
jmp .L7
.L10:
movl $1, %ecx
.L7:
movslq %ecx, %rcx
addq %rcx, %rdx
addl $1, %r8d
jns .L9
@ printing is here
movl $1, %esi
movl $1, %r8d
xorl %edx, %edx
.p2align 4,,10
.p2align 3
.L10:
movl %r8d, %edi
andl $7, %edi
je .L11
movl $1, %ecx
movl $1, %eax
.p2align 4,,10
.p2align 3
.L9:
addl $1, %eax
imull %esi, %ecx
cmpl %eax, %edi
jae .L9
movslq %ecx, %rcx
.L8:
addl $1, %r8d
addq %rcx, %rdx
testl %r8d, %r8d
movl %r8d, %esi
jns .L10
subq $8, %rsp
@ printing is here
ret
.L11:
movl $1, %ecx
jmp .L8
15 . 1
HELPING A COMPILER
The compiler usually applies optimizations to the inner loops. In
this example the number of iterations of the inner loop depends on
the value of the outer loop's induction variable.
Let's help the compiler (flags-dp-tuned.c)
15 . 2
HELPING A COMPILER
/*power function is not changed*/
int main ()
{
double a = 1./0x80000000U, s = -1.;
for (double i=0; i<=0x80000000U-8; i += 8)
{
s+=power((i+0)*a,0);s+=power((i+1)*a,1);
s+=power((i+2)*a,2);s+=power((i+3)*a,3);
s+=power((i+4)*a,4);s+=power((i+5)*a,5);
s+=power((i+6)*a,6);s+=power((i+7)*a,7);
}
printf ("sum = %g\n", s);
}
$ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned
$ time ./flags-dp-tuned
sum = 7.29569e+08
real 0m2.448s
Speedup is ~2.48x.
15 . 3
HELPING A COMPILER
Let's try it for integers (flags-tuned.c)
/*power function is not changed*/
int main ()
{
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
sum+=power(i+0, 0); sum+=power(i+1,1);
sum+=power(i+2, 2); sum+=power(i+3,3);
sum+=power(i+4, 4); sum+=power(i+5,5);
sum+=power(i+6, 6); sum+=power(i+7,7);
}
printf ("sum = %ld\n", sum);
}
$ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned
$ time ./flags-tuned
sum = 288231861405089791
real 0m1.286s
Speedup is ~5.5x.
16 . 1
DETAILED ANALYSIS
This is how the compiler sees the outer loop
after inlining the inner loop
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
int d = (int)i;
sum+=1;
sum+=(1+d);
sum+=(2+d)*(2+d);
sum+=(3+d)*(3+d)*(3+d);
sum+=(4+d)*(4+d)*(4+d)*(4+d);
sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d);
sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d);
sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d);
}
16 . 2
5 LOOP COUNTERS INSTEAD OF 1
movl $6, %r11d
movl $5, %r10d
movl $4, %ebx
movl $3, %r9d
movl $2, %r8d
movq $-1, %rax // <-accumulator
.L3:
// loop body
addl $8, %r8d
addl $8, %r9d
addl $8, %ebx
addl $8, %r10d
addl $8, %r11d
cmpl $-2147483646, %r8d
jne .L3
16 . 3
(I+1), (I+2), (I+3)
leal -1(%r8), %edx // 2+i-1
movslq %edx, %rdx // to long
leaq 1(%rax,%rdx), %rdi // sum += 1+i
movl %r8d, %edx // 2+i
imull %r8d, %edx // (2+i)**2
movslq %edx, %rdx // to long
leaq (%rdx,%rdi), %rsi // sum += (2+i)**2
movl %r9d, %ecx
imull %r9d, %ecx
imull %r9d, %ecx
movslq %ecx, %rcx
leaq (%rcx,%rsi), %rdx
16 . 4
(I+4), (I+5), (I+6)
movl %ebx, %eax
imull %ebx, %eax
imull %eax, %eax
cltq
leaq (%rax,%rdx), %rcx
movl %r10d, %eax
imull %r10d, %eax
imull %eax, %eax
imull %r10d, %eax
cltq
addq %rcx, %rax
movl %r11d, %edx
imull %r11d, %edx
imull %r11d, %edx
imull %edx, %edx
movslq %edx, %rdx
addq %rdx, %rax
16 . 5
(I+7)
leal 5(%r8), %esi
movl $7, %ecx
movl $1, %edx
.L2:
imull %esi, %edx
subl $1, %ecx
jne .L2
movslq %edx, %rdx
addq %rdx, %rax
The full listing is available on GitHub.
17 . 1
SUMMARY
Copy propagation and other basic transformations enable
more complicated ones
Hoisting loop-invariant code and other strength-reduction
techniques minimize the number of operations
The compiler is very conservative about hoisting floating-point math
Scalarization and loop unswitching make loop bodies plain
Compilers' loop-peeling capabilities are still poor; use the
sentinel technique to eliminate extra checks and control
flow
Most hand-written pointer optimizations do not pay off
at optimization levels higher than -O0
17 . 2
SUMMARY
The __restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective nowadays
Function body inlining is essential for further loop
optimizations
The compiler will detect memcpy and memset patterns even if
you implement them manually, and will call the library
function instead. Library functions are usually more efficient
Use knowledge of your code's semantics to help the
compiler optimize it
18
THE END
MARINA KOLPAKOVA / 2015-2016

Weitere ähnliche Inhalte

Was ist angesagt?

Is Is Routing Protocol
Is Is Routing ProtocolIs Is Routing Protocol
Is Is Routing Protocol
hayenas
 

Was ist angesagt? (20)

8085 alp programs
8085 alp programs8085 alp programs
8085 alp programs
 
Rotate instructions
Rotate instructionsRotate instructions
Rotate instructions
 
CS8391 Data Structures 2 mark Questions - Anna University Questions
CS8391 Data Structures 2 mark Questions - Anna University QuestionsCS8391 Data Structures 2 mark Questions - Anna University Questions
CS8391 Data Structures 2 mark Questions - Anna University Questions
 
Postfix Notation | Compiler design
Postfix Notation | Compiler designPostfix Notation | Compiler design
Postfix Notation | Compiler design
 
Data Structure Lec #1
Data Structure Lec #1Data Structure Lec #1
Data Structure Lec #1
 
Firebirdbasededatos
FirebirdbasededatosFirebirdbasededatos
Firebirdbasededatos
 
15CS44 MP &MC Module 3
15CS44 MP &MC Module 315CS44 MP &MC Module 3
15CS44 MP &MC Module 3
 
Lists
ListsLists
Lists
 
LR Parsing
LR ParsingLR Parsing
LR Parsing
 
Is Is Routing Protocol
Is Is Routing ProtocolIs Is Routing Protocol
Is Is Routing Protocol
 
Instruction Execution Cycle
Instruction Execution CycleInstruction Execution Cycle
Instruction Execution Cycle
 
IPv4 VS IPv6
IPv4 VS IPv6IPv4 VS IPv6
IPv4 VS IPv6
 
Basic BGP Configuration
Basic BGP ConfigurationBasic BGP Configuration
Basic BGP Configuration
 
Active directory
Active directoryActive directory
Active directory
 
standard template library(STL) in C++
standard template library(STL) in C++standard template library(STL) in C++
standard template library(STL) in C++
 
Doubly Linked List || Operations || Algorithms
Doubly Linked List || Operations || AlgorithmsDoubly Linked List || Operations || Algorithms
Doubly Linked List || Operations || Algorithms
 
INTERNAL STRUCTURE OF 8086 MICROPROCESSOR
INTERNAL STRUCTURE OF  8086 MICROPROCESSORINTERNAL STRUCTURE OF  8086 MICROPROCESSOR
INTERNAL STRUCTURE OF 8086 MICROPROCESSOR
 
Programming with LEX & YACC
Programming with LEX & YACCProgramming with LEX & YACC
Programming with LEX & YACC
 
overview of register transfer, micro operations and basic computer organizati...
overview of register transfer, micro operations and basic computer organizati...overview of register transfer, micro operations and basic computer organizati...
overview of register transfer, micro operations and basic computer organizati...
 
Fcaps
FcapsFcaps
Fcaps
 

Andere mochten auch

Andere mochten auch (9)

Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory Subsystem
 
Code gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionCode gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introduction
 
Code GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flowCode GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flow
 
Code GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesCode GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniques
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 
Code GPU with CUDA - SIMT
Code GPU with CUDA - SIMTCode GPU with CUDA - SIMT
Code GPU with CUDA - SIMT
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 

Ähnlich wie Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations

EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5
PRADEEP
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
Roberto Agostino Vitillo
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches
Netronome
 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)
bolovv
 

Ähnlich wie Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations (20)

EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5
 
Computer Architecture Assignment Help
Computer Architecture Assignment HelpComputer Architecture Assignment Help
Computer Architecture Assignment Help
 
OptimizingARM
OptimizingARMOptimizingARM
OptimizingARM
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches
 
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
HackLU 2018 Make ARM Shellcode Great Again
HackLU 2018 Make ARM Shellcode Great AgainHackLU 2018 Make ARM Shellcode Great Again
HackLU 2018 Make ARM Shellcode Great Again
 
Optimization in Programming languages
Optimization in Programming languagesOptimization in Programming languages
Optimization in Programming languages
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Boosting Developer Productivity with Clang
Boosting Developer Productivity with ClangBoosting Developer Productivity with Clang
Boosting Developer Productivity with Clang
 
5 - Advanced SVE.pdf
5 - Advanced SVE.pdf5 - Advanced SVE.pdf
5 - Advanced SVE.pdf
 
openMP loop parallelization
openMP loop parallelizationopenMP loop parallelization
openMP loop parallelization
 
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
 
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itEvgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
 
2.1 ### uVision Project, (C) Keil Software .docx
2.1   ### uVision Project, (C) Keil Software    .docx2.1   ### uVision Project, (C) Keil Software    .docx
2.1 ### uVision Project, (C) Keil Software .docx
 
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIWLec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loops
 

Kürzlich hochgeladen

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Kürzlich hochgeladen (20)

Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations

  • 1. 1 PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING MASTERING COMPILER OPTIMIZATIONS Created by Marina (geek) Kolpakova for UNN / 2015-2016
  • 2. 2 COURSE TOPICS Ordering optimization approaches Demystifying a compiler Mastering compiler optimizations Modern computer architectures concepts
  • 3. 3 OUTLINE Constant folding Hoisting loop invariant code Scalarization Loop unswitching Loop peeling and sentinels Strength reduction Loop-induction variable elimination Auto-vectorization Function body inlining Built-in detection Case study
  • 4. 4 CONSTANT FOLDING evaluates constant expressions at compile time. It is one of the simplest and most widely used compiler transformations. Expressions can be quite complicated, but they must be free of any side effects. double b = exp(sqrt(M_PI/2)); fldd d17, .L5 // other code .L5: .word 2911864813 .word 1074529267 May be performed in global as well as local scope. Applicable to any code pattern.
  • 5. 5 . 1 HOISTING LOOP INVARIANT CODE The goal of hoisting — also called loop-invariant code motion — is to avoid recomputing loop-invariant code each time through the body of a loop. void scale(double* x, double* y, int len) { for (int i = 0; i < len; i++) y[i] = x[i] * exp(sqrt(M_PI/2)); } ... is the same as... void scale(double* x, double *y, int len) { double factor = exp(sqrt(M_PI/2)); for (int i = 0; i < len; i++) y[i] = x[i] * factor; }
  • 6. 5 . 2 HOISTING LOOP INVARIANT CODE $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c fldd d17, .L5 .L3: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L3 .L1: bx lr .L5: .word 2911864813 .word 1074529267 fldd d17, .L11 .L9: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L9 .L7: bx lr .L11: .word 2911864813 .word 1074529267
  • 7. 5 . 3 HOISTING NON-CONSTANTS Let's make the expression variable void scale(double* x, double* y, int len) { for (int i = 0; i < len; i++) y[i] = x[i] * exp(sqrt((double)len)/2.); } void scale(double* x, double *y, int len) { double factor = exp(sqrt((double)len)/2.); for (int i = 0; i < len; i++) y[i] = x[i] * factor; } $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 8. 5 . 4 HOISTING NON-CONSTANTS fmsr s15, r2 @ int fsitod d9, s15 fsqrtd d9, d9 .L4: fldmiad r6!, {d8} fcpyd d16, d9 fcmpd d9, d9 fmstat beq .L3 fmsr s15, r7 @ int fsitod d16, s15 fmrrd r0, r1, d16 bl sqrt(PLT) fmdrr d16, r0, r1 .L3: fmuld d8, d8, d16 fstmiad r5!, {d8} adds r4, r4, #1 cmp r4, r7 bne .L4 fmsr s15, r2 @ int fsitod d17, s15 fsqrtd d17, d17 fcmpd d17, d17 fmstat beq .L9 fsitod d16, s15 fmrrd r0, r1, d16 bl sqrt(PLT) fmdrr d17, r0, r1 .L9: mov r3, r5 mov r1, r4 add r0, r5, r6, lsl #3 .L11: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L11 The compiler isn't sure that the floating point flags aren't affected!
  • 9. 6 SCALARIZATION Scalarization replaces branchy code with a branchless equivalent. Usually it is performed on branches inside a loop body to allow further optimization. Machine-independent, local. int branchy(int i) { if (i > 4 && i < 42) return 1; return 0; } int branchless(int i) { return (((unsigned)i) - 5 < 37); } $ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s branchy: sub w0, w0, #5 cmp w0, 36 cset w0, ls ret branchless: sub w0, w0, #5 cmp w0, 36 cset w0, ls ret
  • 10. 7 . 1 LOOP UNSWITCHING Moves loop-invariant conditionals or switches which are independent of the loop index out of the loop. Increases ILP, enables further optimizations. for (int i = 0; i < len; i++) { if (a > 32) arr[i] = a; else arr[i] = 42; } int i = 0; if (a > 32) for (; i < len; i++) arr[i] = a; else for (; i < len; i++) arr[i] = 42; $ gcc -march=armv8-a+simd -O3 -S -o 1.s -fno-tree-vectorize 1.c auto-vectorization is disabled just to make the example readable.
  • 11. 7 . 2 LOOP UNSWITCHING cmp w1, wzr ble .L1 cmp w2, 32 bgt .L6 mov x2, 0 mov w3, 42 .L4: str w3,[x0,x2,lsl 2] add x2, x2, 1 cmp w1, w2 bgt .L4 .L1: ret .L6: mov x3, 0 .L3: str w2,[x0,x3,lsl 2] add x3, x3, 1 cmp w1, w3 bgt .L3 ret
  • 12. 7 . 3 LOOP PEELING Loop peeling takes a small number of iterations from the beginning and/or the end of a loop and executes them separately, which eliminates the if-statements. Can be done by a compiler under high optimization levels. for(int i=0; i<len; i++) { if (i == 0) b[i] = a[i+1]; else if(i == len-1 ) b[i] = a[i-1]; else b[i] = a[i+1]+a[i-1]; } b[0] = a[1]; for(int i=1; i<len-1; i++) b[i] = a[i+1]+a[i-1]; b[len-1] = a[len-2]; For this example neither of the tested compilers, gcc 4.9 and clang 3.5, performs the optimization. Make sure that the compiler was able to apply the optimization to such a code pattern, or do it manually.
  • 13. 7 . 4 ADVANCED UNSWITCHING: SENTINELS Sentinels are special dummy values placed in a data structure to simplify the logic of handling boundary conditions (in particular, handling of loop-exit tests or elimination of out-of-bounds checks). This optimization cannot be performed by the compiler because it requires changes in the use of data structures.
  • 14. 7 . 5 ADVANCED UNSWITCHING: SENTINELS Let's assume that array a contains two extra elements: one at the left, a[-1], and one at the right, a[N]. With such assumptions the code can be rewritten as follows. for(int i=0; i<len; i++) { if (i == 0) b[i] = a[i+1]; else if(i == len-1 ) b[i] = a[i-1]; else b[i] = a[i+1]+a[i-1]; } // setup boundaries a[-1] = 0; a[len] = 0; for (int i=0; i<len; i++) b[i] = a[i+1] + a[i-1];
  • 15. 8 . 1 STRENGTH REDUCTION replaces complex expressions with simpler equivalents. double usePow(double x) { return pow(x, 2.); } float usePowf(float x) { return powf(x, 2.f); } A machine-independent transformation in most cases. It may be machine-dependent if it relies on a specific feature set implemented in the HW (e.g. built-in detection).
  • 16. 8 . 2 STRENGTH REDUCTION $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c usePow: fmdrr d16, r0, r1 fmuld d16, d16, d16 fmrrd r0, r1, d16 bx lr usePowf: fmsr s15, r0 fmuls s15, s15, s15 fmrs r0, s15 bx lr Usually it is performed in a local scope for a dependency chain.
  • 17. 8 . 3 STRENGTH REDUCTION (ADVANCED CASE) float useManyPowf(float a, float b, float c, float d, float e, float f, float x) { return a * powf(x, 5.f) + b * powf(x, 4.f) + c * powf(x, 3.f) + d * powf(x, 2.f) + e * x + f; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 18. 8 . 4 STRENGTH REDUCTION (ADVANCED) The compiler was able to only partly reduce the complexity, generating vfma where possible. useManyPowf: push {r3, lr} flds s17, [sp, #56] fmsr s24, r1 movs r1, #0 fmsr s22, r0 movt r1, 16544 fmrs r0, s17 fmsr s21, r2 fmsr s20, r3 flds s19, [sp, #48] flds s18, [sp, #52] bl powf(PLT) mov r1, #1082130432 fmsr s23, r0 fmrs r0, s17 bl powf(PLT) movs r1, #0 movt r1, 16448 fmsr s16, r0 fmrs r0, s17 bl powf(PLT) fmuls s16, s16, s24 vfma.f32 s16, s23, s22 fmsr s15, r0 vfma.f32 s16, s15, s21 fmuls s15, s17, s17 vfma.f32 s16, s20, s15 vfma.f32 s16, s19, s17 fadds s15, s16, s18 fldmfdd sp!, {d8-d12} fmrs r0, s15 pop {r3, pc}
  • 19. 8 . 5 STRENGTH REDUCTION (MANUAL) Let's further reduce latency by applying Horner's rule. float applyHornerf(float a, float b, float c, float d, float e, float f, float x) { return ((((a * x + b) * x + c) * x + d) * x + e) * x + f; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c The compiler is not capable of this optimization because the rewritten expression does not produce results bit-exact with the original formula.
  • 20. 8 . 6 STRENGTH REDUCTION (MANUAL) applyHornerf: flds s15, [sp, #8] fmsr s11, r0 fmsr s12, r1 flds s14, [sp] vfma.f32 s12, s11, s15 fmsr s11, r2 flds s13, [sp, #4] vfma.f32 s11, s12, s15 fcpys s12, s11 fmsr s11, r3 vfma.f32 s11, s12, s15 vfma.f32 s14, s11, s15 vfma.f32 s13, s14, s15 fmrs r0, s13 bx lr
  • 21. 9 . 1 LOOP-INDUCTION VARIABLE ELIMINATION In most cases the compiler is able to replace indexing arithmetic with pointer arithmetic. void function(int* arr, int len) { for (int i = 0; i < len; i++) arr[i] = 1; } void function(int* arr, int len) { for (int* p = arr; p < arr + len; p++) *p = 1; } $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 22. 9 . 2 LOOP-INDUCTION VARIABLE ELIMINATION cmp r1, #0 ble .L1 mov r3,r0 add r0,r0,r1,lsl #2 movs r2, #1 .L3: str r2, [r3], #4 cmp r3, r0 bne .L3 .L1: bx lr add r1,r0,r1,lsl #2 cmp r0, r1 bcs .L5 movs r3, #1 .L8: str r3, [r0], #4 cmp r1, r0 bhi .L8 .L5: bx lr Most hand-written pointer optimizations make no sense at optimization levels higher than O0.
  • 23. 10 . 1 AUTO-VECTORIZATION Auto-vectorization is machine code generation that takes advantage of vector instructions. Most modern architectures have vector extensions, either as a co-processor or as dedicated pipes: MMX, SSE, SSE2, SSE4, AVX, AVX-512 AltiVec, VSX ASIMD (NEON), MSA Enabled by inlining, unrolling, fusion, software pipelining, inter-procedural optimization, and other machine-independent transformations.
  • 24. 10 . 2 AUTO-VECTORIZATION void vectorizeMe(float *a, float *b, int len) { int i; for (i = 0; i < len; i++) a[i] += b[i]; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp > -fopt-info-missed 1.c But NEON does not fully support IEEE 754, so gcc cannot vectorize the loop, as it tells us: 1.c:64:3: note: not vectorized: relevant stmt not supported:_13 =_9+_12;
  • 25. 10 . 3 AUTO-VECTORIZATION Here is the generated assembly and... here we are! .L3: fldmias r1!, {s14} flds s15, [r0] fadds s15, s15, s14 fstmias r0!, {s15} cmp r0, r2 bne .L3 But armv8-a does support it, so let's check! $ gcc -march=armv8-a+simd -O3 -S -o 1.s > -fopt-info-all 1.c 1.c:64:3: note: loop vectorized
  • 26. 10 . 4 AUTO-VECTORIZATION Here is the generated assembly and the full optimizer's report. .L6: ldr q1, [x3],16 add w6, w6, 1 ldr q0, [x7],16 cmp w6, w4 fadd v0.4s, v0.4s, v1.4s str q0, [x8],16 bcc .L6 1.c:66:3: note: loop vectorized 1.c:66:3: note: loop versioned for vectorization because of possible aliasing 1.c:66:3: note: loop peeled for vectorization to enhance alignment 1.c:66:3: note: loop with 3 iterations completely unrolled 1.c:61:6: note: loop with 3 iterations completely unrolled The compiler versions the loop to allow optimized paths in case of aligned and non-aliasing pointers.
  • 27. 10 . 5 KEYWORDS Let's follow the advice and add some keywords. void vectorizeMe(float* __restrict a_, float* __restrict b_, int len) { float *a=__builtin_assume_aligned(a_, 16); float *b=__builtin_assume_aligned(b_, 16); for (int i = 0; i < len; i++) a[i] += b[i]; } The optimizer's report shrinks: 1.c:66:3: note: loop vectorized 1.c:66:3: note: loop with 3 iterations completely unrolled The __restrict and __builtin_assume_aligned keywords only eliminate some loop versioning, but are not very useful from the performance perspective.
  • 28. 11 . 1 FUNCTION BODY INLINING Replaces a function call with the function body itself. int square(int x) { return x*x; } for (int i = 0; i < len; i++) arr[i] = square(i); Becomes for (int i = 0; i < len; i++) arr[i] = i*i; Enables all further optimizations. Machine-independent, interprocedural.
  • 29. 11 . 2 FUNCTION BODY INLINING gcc -march=armv8-a+nosimd -O3 -S -o 1.s 1.c .L2: add x2, x4, :lo12:.LANCHOR0 mov x1, 34464 mul w3, w0, w0 movk x1, 0x1, lsl 16 str w3, [x2,x0,lsl 2] add x0, x0, 1 cmp x0, x1 bne .L2 Let's compile the code with vector extension enabled -march=armv8-a+simd
  • 30. 11 . 3 FUNCTION BODY INLINING add x0, x0, :lo12:.LANCHOR0 movi v2.4s, 0x4 ldr q0, [x1] add x1, x0, 397312 add x1, x1, 2688 .L2: mul v1.4s, v0.4s, v0.4s add v0.4s, v0.4s, v2.4s str q1, [x0],16 cmp x0, x1 bne .L2 Auto-vectorization is enabled because the function call inside the loop could be inlined.
  • 31. 12 . 1 BUILT-IN DETECTION The compiler automatically uses the library functions memset and memcpy to initialize and copy memory blocks. static char a[100000]; static char b[100000]; static void at(int idx, char val) { if (idx>=0 && idx<100000) a[idx] = val; } int main() { for (int i=0; i<100000; ++i) a[i]=42; for (int i=0; i<100000; ++i) at(i,-1); for (int i=0; i<100000; ++i) b[i] = a[i]; }
  • 32. 12 . 2 Compiler knows what you mean FILLING/COPYING MEMORY BLOCKS main: .LFB1: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 movl $100000, %edx movl $42, %esi movl $a, %edi call memset movl $100000, %edx movl $255, %esi movl $a, %edi call memset movl $100000, %edx movl $a, %esi movl $b, %edi call memcpy addq $8, %rsp .cfi_def_cfa_offset 8 ret
  • 33. 12 . 3 The same picture, if the code is compiled for ARM target FILLING/COPYING MEMORY BLOCKS main: ldr r3, .L3 mov r1, #42 stmfd sp!, {r4, lr} add r3, pc, r3 movw r4, #34464 movt r4, 1 mov r0, r3 mov r2, r4 bl memset(PLT) mov r2, r4 mov r1, #255 bl memset(PLT) mov r2, r4 mov r3, r0 ldr r0, .L3+4 mov r1, r3 add r0, pc, r0 add r0, r0, #1792 bl memcpy(PLT) ldmfd sp!, {r4, pc}
  • 34. 13 . 1 CASE STUDY: FLOATING POINT double power( double d, unsigned n) { double x = 1.0; for (unsigned j = 1; j<=n; j++, x *= d) ; return x; } int main () { double a = 1./0x80000000U, sum = 0.0; for (unsigned i=1; i<= 0x80000000U; i++) sum += power( i*a, i % 8); printf ("sum = %g\n", sum); } flags-dp.c
  • 35. 13 . 2 CASE STUDY: FLOATING POINT 1. Compile it without optimization 2. Compile it with O1: ~3.26 speedup $ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m24.550s $ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m7.529s
  • 36. 13 . 3 CASE STUDY: FLOATING POINT 3. Compile it with O2: ~1.24 speedup 4. Compile it with O3: ~1.00 speedup $ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m6.069s $ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m6.067s Total speedup is ~4.05
  • 37. 14 . 1 CASE STUDY: INTEGER int power( int d, unsigned n) { int x = 1; for (unsigned j = 1; j<=n; j++, x*=d) ; return x; } int main () { int64_t sum = 0; for (unsigned i=1; i<0x80000000U; i++) sum += power( i, i % 8); printf ("sum = %ld\n", sum); } flags.c
  • 38. 14 . 2 CASE STUDY: INTEGER 1. Compile it without optimization 2. Compile it with O1: ~2.64 speedup $ gcc -std=c99 -Wall -O0 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m18.750s $ gcc -std=c99 -Wall -O1 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.092s
  • 39. 14 . 3 CASE STUDY: INTEGER 3. Compile it with O2: ~0.97 speedup 4. Compile it with O3: ~1.00 speedup $ gcc -std=c99 -Wall -O2 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.300s $ gcc -std=c99 -Wall -O3 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.082s WHY IS THERE NO IMPROVEMENT?
  • 40. 14 . 4 EXAMPLE: INTEGER -O1 -O2 THE COMPILER OVERDID THE BRANCH TWIDDLING! movl $1, %r8d movl $0, %edx .L9: movl %r8d, %edi movl %r8d, %esi andl $7, %esi je .L10 movl $1, %ecx movl $1, %eax .L8: @ addl $1, %eax @ imull %edi, %ecx @ cmpl %eax, %esi @ jae .L8 jmp .L7 .L10: movl $1, %ecx .L7: movslq %ecx, %rcx addq %rcx, %rdx addl $1, %r8d jns .L9 @ printing is here movl $1, %esi movl $1, %r8d xorl %edx, %edx .p2align 4,,10 .p2align 3 .L10: movl %r8d, %edi andl $7, %edi je .L11 movl $1, %ecx movl $1, %eax .p2align 4,,10 .p2align 3 .L9: addl $1, %eax imull %esi, %ecx cmpl %eax, %edi jae .L9 movslq %ecx, %rcx .L8: addl $1, %r8d addq %rcx, %rdx testl %r8d, %r8d movl %r8d, %esi jns .L10 subq $8, %rsp @ printing is here ret .L11: movl $1, %ecx jmp .L8
  • 41. 15 . 1 HELPING A COMPILER The compiler usually applies optimization to the inner loops. In this example the number of iterations in the inner loop depends on the value of an induction variable of the outer loop. Let's help the compiler (flags-dp-tuned.c).
  • 42. 15 . 2 HELPING A COMPILER /*power function is not changed*/ int main () { double a = 1./0x80000000U, s = -1.; for (double i=0; i<=0x80000000U-8; i += 8) { s+=power((i+0)*a,0);s+=power((i+1)*a,1); s+=power((i+2)*a,2);s+=power((i+3)*a,3); s+=power((i+4)*a,4);s+=power((i+5)*a,5); s+=power((i+6)*a,6);s+=power((i+7)*a,7); } printf ("sum = %g\n", s); } $ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned $ time ./flags-dp-tuned sum = 7.29569e+08 real 0m2.448s Speedup is ~2.48x.
  • 43. 15 . 3 HELPING A COMPILER Let's try it for integers (flags-tuned.c). /*power function is not changed*/ int main () { int64_t sum = -1; for (unsigned i=0; i<=0x80000000U-8; i += 8) { sum+=power(i+0, 0); sum+=power(i+1,1); sum+=power(i+2, 2); sum+=power(i+3,3); sum+=power(i+4, 4); sum+=power(i+5,5); sum+=power(i+6, 6); sum+=power(i+7,7); } printf ("sum = %ld\n", sum); } $ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned $ time ./flags-tuned sum = 288231861405089791 real 0m1.286s Speedup is ~5.5x.
  • 44. 16 . 1 DETAILED ANALYSIS This is how the compiler sees the outer loop after inner loop inlining int64_t sum = -1; for (unsigned i=0; i<=0x80000000U-8; i += 8) { int d = (int)i; sum+=1; sum+=(1+d); sum+=(2+d)*(2+d); sum+=(3+d)*(3+d)*(3+d); sum+=(4+d)*(4+d)*(4+d)*(4+d); sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d); sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d); sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d); }
  • 45. 16 . 2 5 LOOP COUNTERS INSTEAD OF 1 movl $6, %r11d movl $5, %r10d movl $4, %ebx movl $3, %r9d movl $2, %r8d movq $-1, %rax // <-accumulator .L3: // loop body addl $8, %r8d addl $8, %r9d addl $8, %ebx addl $8, %r10d addl $8, %r11d cmpl $-2147483646, %r8d jne .L3
  • 46. 16 . 3 (I+1), (I+2), (I+3) leal -1(%r8), %edx // 2+i-1 movslq %edx, %rdx // to long leaq 1(%rax,%rdx), %rdi // sum += 1+i movl %r8d, %edx // 2+i imull %r8d, %edx // (2+i)**2 movslq %edx, %rdx // to long leaq (%rdx,%rdi), %rsi // sum += (2+i)**2 movl %r9d, %ecx imull %r9d, %ecx imull %r9d, %ecx movslq %ecx, %rcx leaq (%rcx,%rsi), %rdx
  • 47. 16 . 4 (I+4), (I+5), (I+6) movl %ebx, %eax imull %ebx, %eax imull %eax, %eax cltq leaq (%rax,%rdx), %rcx movl %r10d, %eax imull %r10d, %eax imull %eax, %eax imull %r10d, %eax cltq addq %rcx, %rax movl %r11d, %edx imull %r11d, %edx imull %r11d, %edx imull %edx, %edx movslq %edx, %rdx addq %rdx, %rax
  • 48. 16 . 5 (I+7) leal 5(%r8), %esi movl $7, %ecx movl $1, %edx .L2: imull %esi, %edx subl $1, %ecx jne .L2 movslq %edx, %rdx addq %rdx, %rax The full listing is available on GitHub.
  • 49. 17 . 1 SUMMARY Copy propagation and other basic transformations enable more complicated ones. Hoisting loop-invariant code and other strength-reduction techniques minimize the number of operations. The compiler is very conservative about hoisting floating-point math. Scalarization and loop unswitching make loop bodies flat. Compiler support for loop peeling is still poor; use the sentinel technique to eliminate extra checks and control flow. Most hand-written pointer optimizations make no sense at optimization levels higher than O0.
  • 50. 17 . 2 SUMMARY The __restrict and __builtin_assume_aligned keywords only eliminate some loop versioning, but are not very useful from the performance perspective nowadays. Function body inlining is essential for further loop optimization. The compiler will detect memcpy and memset even if you implement them manually, and will call the library function instead. Library functions are usually more efficient. Use knowledge about the semantics of your code to help the compiler optimize it.