The document describes optimizing a 9-point image-blurring algorithm on Intel Xeon processors and the Intel Xeon Phi coprocessor. Run serially, the algorithm was over 11 times faster on Xeon than on Xeon Phi. Adding OpenMP pragmas to enable vectorization narrowed the gap, with Xeon now just over 3 times faster than Xeon Phi. Further optimizations discussed include adding thread parallelism and improving data access patterns.
1. Intel® Xeon® Phi™ Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak (@modocache)
Research Student, The University of Tokyo
April 16th, 2014
8. Stencil Algorithms
typedef double real;

typedef struct {
  real center;    /* weight for the pixel itself */
  real next;      /* weight for the four edge neighbors (N, S, E, W) */
  real diagonal;  /* weight for the four corner neighbors */
} weight_t;

The fields weight.center, weight.next, and weight.diagonal hold the weights applied to the center pixel, its four edge neighbors, and its four diagonal neighbors.
A 9-Point Stencil on a 2D Matrix
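As a concrete sketch of how these weights are applied, here is a minimal single-pass 9-point blur over the interior of a row-major image. The function name and the weight values in the usage note are illustrative, not from the slides:

```c
typedef double real;

typedef struct {
  real center;    /* weight for the pixel itself */
  real next;      /* weight for the four edge neighbors */
  real diagonal;  /* weight for the four corner neighbors */
} weight_t;

/* One pass of the 9-point stencil over the interior of a
   width x height image stored row-major; border pixels are
   left untouched. (Hypothetical helper, not from the deck.) */
void blur_once(const real *fin, real *fout, int width, int height,
               weight_t w) {
  for (int y = 1; y < height - 1; ++y) {
    for (int x = 1; x < width - 1; ++x) {
      int c = y * width + x;
      fout[c] = w.center   *  fin[c]
              + w.next     * (fin[c - width] + fin[c + width] +
                              fin[c - 1]     + fin[c + 1])
              + w.diagonal * (fin[c - width - 1] + fin[c - width + 1] +
                              fin[c + width - 1] + fin[c + width + 1]);
    }
  }
}
```

With weights that sum to 1.0 (e.g. center 0.5, next 0.1, diagonal 0.025), a uniform image passes through unchanged, which makes a handy sanity check.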
39. ivdep
Tells the compiler to ignore assumed dependencies.
• In our program, the compiler cannot prove that the two pointers refer to different blocks of memory, so it assumes they overlap.
• The ivdep pragma negates this assumption.
• Proven dependencies are still honored; only assumed ones are ignored.
Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
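A portable way to make the same no-aliasing promise outside of Intel's compiler is the C99 `restrict` qualifier. The sketch below (function name assumed, not from the deck) shows both spellings side by side:

```c
/* Without aliasing information the compiler must assume dst and
   src may overlap and will not vectorize the loop. The Intel-only
   ivdep pragma and the portable C99 `restrict` qualifier both
   promise that the arrays are distinct. */
void scale_by_two(double *restrict dst, const double *restrict src, int n) {
#pragma ivdep  /* redundant next to restrict; non-Intel compilers ignore it */
  for (int i = 0; i < n; ++i)
    dst[i] = 2.0 * src[i];
}
```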
58. Optimization 1: Padded Arrays
• We can add extra, unused data to the end of each row.
• Doing so aligns heavily used memory addresses for efficient cache-line access.
Optimizing Cache Access
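One way to pick the padded row length is to round each row up to a whole number of 64-byte cache lines (8 doubles per line), so every row starts on a cache-line boundary. An illustrative helper, assuming the slides' kPaddingSize would hold this padded width:

```c
/* 64-byte cache lines hold 8 doubles each. */
enum { kCacheLineBytes = 64,
       kDoublesPerLine = kCacheLineBytes / sizeof(double) };

/* Round a row of `width` doubles up to a multiple of the cache
   line, so each row begins on a cache-line boundary. */
int padded_width(int width) {
  return (width + kDoublesPerLine - 1) / kDoublesPerLine * kDoublesPerLine;
}
```

For example, a 1000-pixel row is already a multiple of 8 and stays 1000, while a 1001-pixel row is padded out to 1008.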
71. Optimization 1: Padded Arrays
#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
Accommodating for Padding
78. Optimization 2: Streaming Stores
Read-less Writes
• By default, Intel® Xeon Phi™ processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.
79. Optimization 2: Streaming Stores
for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}
Read-less Writes with Vector Nontemporal
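The nontemporal hint is independent of the stencil itself; the smallest possible read-less write is a plain fill loop. A minimal sketch (hypothetical helper; on compilers other than icc the pragma is simply an ignored unknown pragma):

```c
/* Every element of dst is overwritten, so its previous contents
   never need to be read into cache first. Under icc the pragma
   requests streaming (nontemporal) stores; other compilers warn
   about an unknown pragma and emit ordinary stores. */
void fill(double *restrict dst, double value, int n) {
#pragma vector nontemporal
  for (int i = 0; i < n; ++i)
    dst[i] = value;
}
```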
89. Optimization 3: Huge Memory Pages
• Memory pages map virtual memory used by our program to physical memory.
• Mappings are stored in a translation look-aside buffer (TLB).
• Mappings are traversed in a "page table walk".
• malloc and _mm_malloc use 4KB memory pages by default.
• By increasing the size of each memory page, traversal time may be reduced.
96. Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads.
• Completely serial programs run faster on standard processors.
• Only properly designed programs achieve peak performance on an Intel® Xeon Phi™ coprocessor.
• Other optimizations may be used to tweak performance:
  • Data padding
  • Streaming stores
  • Huge memory pages
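Putting the takeaways together, the blur's hot loop with both thread parallelism and a vectorizable inner sweep might be sketched like this (names and weight values assumed; compiled without OpenMP the pragmas are ignored and the code simply runs serially):

```c
typedef double real;

/* One blur pass: rows are distributed across threads by the
   parallel-for, and the inner loop is a unit-stride sweep the
   compiler can vectorize. Weights are illustrative and sum to 1. */
void blur_pass(const real *fin, real *fout, int width, int height) {
  const real wc = 0.5, wn = 0.1, wd = 0.025;
#pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
#pragma omp simd
    for (int x = 1; x < width - 1; ++x) {
      int c = y * width + x;
      fout[c] = wc *  fin[c]
              + wn * (fin[c - width] + fin[c + width] +
                      fin[c - 1]     + fin[c + 1])
              + wd * (fin[c - width - 1] + fin[c - width + 1] +
                      fin[c + width - 1] + fin[c + width + 1]);
    }
  }
}
```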