Intel® Xeon® Phi Coprocessor
High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak
April 16th, 2014
Research Student, The University of Tokyo
@modocache
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
• Intel® Xeon® Dual Processor
• Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
• Worst: Completely serial
• Better: Adding loop vectorization
• Best: Supporting multiple threads
• Further optimizations
• Padding arrays for improved cache performance
• Read-less writes, i.e., streaming stores
• Using huge memory pages
Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

typedef double real;

typedef struct {
  real center;    // weight for the center pixel itself
  real next;      // weight for each of the four edge-adjacent pixels
  real diagonal;  // weight for each of the four diagonal pixels
} weight_t;
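Putting the three weights together, one output pixel is a weighted average of itself and its eight neighbors; a sketch of the full sum (later slides elide some of these terms):

// One output pixel: a weighted average of itself and its eight neighbors.
fout[center] = weight.diagonal * (fin[northwest] + fin[northeast] +
                                  fin[southwest] + fin[southeast]) +
               weight.next     * (fin[north] + fin[south] +
                                  fin[east]  + fin[west]) +
               weight.center   *  fin[center];

With the weights used later (center 0.99, next 0.00125, diagonal 0.00125), the nine coefficients sum to 1, so overall brightness is preserved.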
Image Blurring
Applying a 9-Point Stencil to a Bitmap

Halo Effect: the outermost one-pixel border is never updated, since those pixels lack a full 3 x 3 neighborhood (the loops below run from 1 to height - 2 and from 1 to width - 2).
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
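Some back-of-the-envelope figures (not on the original slides) put the workload in context: each buffer holds 5,900 × 10,000 doubles, and the 9-point stencil costs 17 floating-point operations per pixel (9 multiplies + 8 adds).

pixels per buffer: 5,900 × 10,000 = 59,000,000
buffer size:       59,000,000 × 8 bytes ≈ 472 MB (two buffers ≈ 944 MB)
total work:        17 flops × ~59,000,000 pixels × 1,000 iterations ≈ 1.0 × 10^12 flops

These figures are consistent with the wall-time and MegaFLOPS pairs in the result tables that follow.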
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

Processor                     Clock Frequency  Number of Cores  Memory Size/Type  Peak DP/SP FLOPs           Peak Memory Bandwidth
Intel® Xeon® Dual Processor   2.6 GHz          16 (8 x 2 CPUs)  63 GB / DDR3      345.6 / 691.2 GigaFLOP/s   85.3 GB/s
Intel® Xeon® Phi Coprocessor  1.091 GHz        61               8 GB / GDDR5      1.065 / 2.130 TeraFLOP/s   352 GB/s
1st Comparison: Serial Execution

void stencil_9pt(real *fin, real *fout, int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center; ++north; ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}

Assumed vector dependency: the compiler cannot prove that fin and fout never alias, so it assumes a vector dependency and does not vectorize the inner loop.
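For reference, a minimal sketch of the index setup that the comment above elides, assuming a row-major layout with unpadded rows (names match the loop body):

// Sketch (assumption: row-major layout, unpadded rows): indices for row y,
// starting at column 1.
int center    = y * width + 1;
int north     = center - width;
int south     = center + width;
int east      = center + 1;
int west      = center - 1;
int northwest = north - 1;
int northeast = north + 1;
int southwest = south - 1;
int southeast = south + 1;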
1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil            # Xeon build
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi  # Xeon Phi build

Processor                     Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor   244.178 seconds (4 minutes)       4,107.658
Intel® Xeon® Phi Coprocessor  2,838.342 seconds (47.3 minutes)  353.375

The Dual Processor is 11 times faster than the Phi.
2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}
ivdep
Tells the compiler to ignore assumed dependencies

• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.

Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
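An alternative to the pragma, not used on these slides, is C99's restrict qualifier, which makes the same no-aliasing promise in the source; a minimal sketch:

// Sketch: per-pass restrict-qualified views of the two buffers promise the
// compiler that, within one pass, reads and writes never overlap.
for (int i = 0; i < count; ++i) {
  const real *restrict in  = fin;   // read-only this pass
  real       *restrict out = fout;  // write-only this pass
  // ...stencil loops read in[] and write out[]...

  // Swap buffers for next iteration
  real *ftmp = fin;
  fin = fout;
  fout = ftmp;
}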
2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                     Elapsed Wall Time               MegaFLOPS  Speedup vs. Serial
Intel® Xeon® Dual Processor   186.585 seconds (3.1 minutes)   5,375.572  1.3 times faster
Intel® Xeon® Phi Coprocessor  623.302 seconds (10.3 minutes)  1,609.171  4.5 times faster

The Dual Processor is now only 4 times faster than the Phi.
3rd Comparison: Multithreading
Work Division Using Parallel For Loops

for (int i = 0; i < count; ++i) {
  // Parallelize over rows: iterations of the outer time-step loop depend
  // on one another (each pass reads the previous pass's output), so the
  // parallel for belongs on the y loop.
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}
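The per-processor thread counts in the results below would typically be chosen at run time through the OpenMP environment, e.g. (assuming the binary is launched natively on the card; KMP_AFFINITY is specific to Intel's OpenMP runtime):

$ export OMP_NUM_THREADS=122   # number of OpenMP threads to spawn
$ export KMP_AFFINITY=balanced # spread threads evenly across the Phi's cores
$ ./stencil_phi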
3rd Comparison: Multithreading
Results

Processor                     Elapsed Wall Time (seconds)  MegaFLOPS
Xeon® Dual Proc., 16 Threads  43.862                       22,867.185   (4x faster than vectorized)
Xeon® Dual Proc., 32 Threads  46.247                       21,688.103
Xeon® Phi, 61 Threads         11.366                       88,246.452
Xeon® Phi, 122 Threads        8.772                        114,338.399  (71x faster than vectorized)
Xeon® Phi, 183 Threads        10.546                       94,946.364
Xeon® Phi, 244 Threads        12.696                       78,999.44

The Phi is now 5 times faster than the Dual Processor.
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
Optimization 1: Padded Arrays
Optimizing Cache Access

• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for efficient cache line access
Optimization 1: Padded Arrays

static const size_t kPaddingSize = 64;  // cache-line size, in bytes

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  size_t size = sizeof(real) * width * height;
  real *fin = (real *)malloc(size);
  real *fout = (real *)malloc(size);

  weight_t weight = { .center = 0.99,
                      .next = 0.00125,
                      .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  free(fin);
  free(fout);
  return 0;
}

To pad, the row width is rounded up to a whole number of 64-byte cache lines, and the buffers are allocated cache-line-aligned (the variable name padded_width is illustrative; the slides show only the expressions):

// Round each row up to a multiple of the cache-line size, in elements
int padded_width = ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

size_t size = sizeof(real) * padded_width * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);
real *fout = (real *)_mm_malloc(size, kPaddingSize);

// ...

_mm_free(fin);
_mm_free(fout);
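Worked through with real = double (8 bytes), the rounding gives 4 extra doubles per row:

5,900 × 8 bytes        = 47,200 bytes per row
⌈47,200 / 64⌉          = 738 cache lines → 738 × 64 = 47,232 bytes
47,232 bytes / 8 bytes = 5,904 elements, i.e. 5,900 + 4 padding doubles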
Optimization 1: Padded Arrays
Accommodating for Padding

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {

  // ...calculate center, east, northwest, etc., stepping by the padded
  // row width so that every row begins on a cache-line boundary
  int center = y * padded_width + 1;
  int north = center - padded_width;
  int south = center + padded_width;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
Optimization 1: Padded Arrays
Results

Processor               Elapsed Wall Time (seconds)  MegaFLOPS
Xeon® Phi, 61 Threads   11.644                       86,138.371
Xeon® Phi, 122 Threads  8.973                        111,774.803
Xeon® Phi, 183 Threads  10.326                       97,132.546
Xeon® Phi, 244 Threads  11.469                       87,452.707
Optimization 2: Streaming Stores
Read-less Writes

• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.
Optimization 2: Streaming Stores
Read-less Writes with Vector Nontemporal

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}
Optimization 2: Streaming Stores
Results

Processor               Elapsed Wall Time (seconds)  MegaFLOPS
Xeon® Phi, 61 Threads   13.588                       73,978.915
Xeon® Phi, 122 Threads  8.491                        111,774.803
Xeon® Phi, 183 Threads  8.663                        115,773.405
Xeon® Phi, 244 Threads  9.507                        105,498.781
Optimization 3: Huge Memory Pages

• Memory pages map virtual memory used by our program to physical memory
• Mappings are stored in a translation look-aside buffer (TLB)
• Mappings are traversed in a “page table walk”
• malloc and _mm_malloc use 4KB memory pages by default
• By increasing the size of each memory page, traversal time may be reduced
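On Linux, the MAP_HUGETLB mappings used on the next slide draw from a pool of huge pages that usually has to be reserved first; a sketch of checking and sizing the pool:

$ grep Huge /proc/meminfo                         # HugePages_Total, HugePages_Free, Hugepagesize
$ echo 512 | sudo tee /proc/sys/vm/nr_hugepages   # reserve 512 huge pages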
Optimization 3: Huge Memory Pages

size_t size = sizeof(real) * padded_width * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);

becomes:

real *fin = (real *)mmap(0, size,
                         PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB,
                         -1, 0);  // fd = -1, offset = 0 for anonymous mappings
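One caveat for the mmap version: on failure (for example, when no huge pages are reserved), mmap returns MAP_FAILED rather than NULL, so a defensive sketch checks for it explicitly:

real *fin = (real *)mmap(0, size, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
if (fin == MAP_FAILED) {
  perror("mmap");  // commonly ENOMEM when the huge-page pool is empty
  return 1;
}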
Optimization 3: Huge Memory Pages
Results

Processor               Elapsed Wall Time (seconds)  MegaFLOPS
Xeon® Phi, 61 Threads   14.486                       69,239.365
Xeon® Phi, 122 Threads  8.226                        121,924.389
Xeon® Phi, 183 Threads  8.749                        114,636.799
Xeon® Phi, 244 Threads  9.466                        105,955.358
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance
  • Data padding
  • Streaming stores
  • Huge memory pages
Sources and Additional Resources
• Today’s slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm
Weitere ähnliche Inhalte

Was ist angesagt?

Sound analysis and processing with MATLAB
Sound analysis and processing with MATLABSound analysis and processing with MATLAB
Sound analysis and processing with MATLABTan Hoang Luu
 
Introduction to TensorFlow 2
Introduction to TensorFlow 2Introduction to TensorFlow 2
Introduction to TensorFlow 2Oswald Campesato
 
Introduction to TensorFlow 2
Introduction to TensorFlow 2Introduction to TensorFlow 2
Introduction to TensorFlow 2Oswald Campesato
 
Working with tf.data (TF 2)
Working with tf.data (TF 2)Working with tf.data (TF 2)
Working with tf.data (TF 2)Oswald Campesato
 
TensorFlow Tutorial
TensorFlow TutorialTensorFlow Tutorial
TensorFlow TutorialNamHyuk Ahn
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 
TensorFlow in Your Browser
TensorFlow in Your BrowserTensorFlow in Your Browser
TensorFlow in Your BrowserOswald Campesato
 
Scientific visualization with_gr
Scientific visualization with_grScientific visualization with_gr
Scientific visualization with_grJosef Heinen
 
Introduction to Tensorflow
Introduction to TensorflowIntroduction to Tensorflow
Introduction to TensorflowTzar Umang
 
Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Ganesan Narayanasamy
 
Natural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usageNatural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usagehyunyoung Lee
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowAI Frontiers
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Alessio Tonioni
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTaegyun Jeon
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert홍배 김
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowAndrew Ferlitsch
 
Tensor board
Tensor boardTensor board
Tensor boardSung Kim
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 

Was ist angesagt? (20)

Sound analysis and processing with MATLAB
Sound analysis and processing with MATLABSound analysis and processing with MATLAB
Sound analysis and processing with MATLAB
 
Introduction to TensorFlow 2
Introduction to TensorFlow 2Introduction to TensorFlow 2
Introduction to TensorFlow 2
 
Introduction to TensorFlow 2
Introduction to TensorFlow 2Introduction to TensorFlow 2
Introduction to TensorFlow 2
 
Working with tf.data (TF 2)
Working with tf.data (TF 2)Working with tf.data (TF 2)
Working with tf.data (TF 2)
 
TensorFlow Tutorial
TensorFlow TutorialTensorFlow Tutorial
TensorFlow Tutorial
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 
TensorFlow in Your Browser
TensorFlow in Your BrowserTensorFlow in Your Browser
TensorFlow in Your Browser
 
Scientific visualization with_gr
Scientific visualization with_grScientific visualization with_gr
Scientific visualization with_gr
 
Introduction to Tensorflow
Introduction to TensorflowIntroduction to Tensorflow
Introduction to Tensorflow
 
Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117
 
Natural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usageNatural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usage
 
Dive Into PyTorch
Dive Into PyTorchDive Into PyTorch
Dive Into PyTorch
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
 
TensorFlow
TensorFlowTensorFlow
TensorFlow
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to Tensorflow
 
Tensor board
Tensor boardTensor board
Tensor board
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 

Andere mochten auch

RSpec 3.0: Under the Covers
RSpec 3.0: Under the CoversRSpec 3.0: Under the Covers
RSpec 3.0: Under the CoversBrian Gesiak
 
Apple Templates Considered Harmful
Apple Templates Considered HarmfulApple Templates Considered Harmful
Apple Templates Considered HarmfulBrian Gesiak
 
iOS UI Component API Design
iOS UI Component API DesigniOS UI Component API Design
iOS UI Component API DesignBrian Gesiak
 
iOS Behavior-Driven Development
iOS Behavior-Driven DevelopmentiOS Behavior-Driven Development
iOS Behavior-Driven DevelopmentBrian Gesiak
 
アップルのテンプレートは有害と考えられる
アップルのテンプレートは有害と考えられるアップルのテンプレートは有害と考えられる
アップルのテンプレートは有害と考えられるBrian Gesiak
 
iOSビヘイビア駆動開発
iOSビヘイビア駆動開発iOSビヘイビア駆動開発
iOSビヘイビア駆動開発Brian Gesiak
 
iOS UI Component API Design
iOS UI Component API DesigniOS UI Component API Design
iOS UI Component API DesignBrian Gesiak
 

Andere mochten auch (7)

RSpec 3.0: Under the Covers
RSpec 3.0: Under the CoversRSpec 3.0: Under the Covers
RSpec 3.0: Under the Covers
 
Apple Templates Considered Harmful
Apple Templates Considered HarmfulApple Templates Considered Harmful
Apple Templates Considered Harmful
 
iOS UI Component API Design
iOS UI Component API DesigniOS UI Component API Design
iOS UI Component API Design
 
iOS Behavior-Driven Development
iOS Behavior-Driven DevelopmentiOS Behavior-Driven Development
iOS Behavior-Driven Development
 
アップルのテンプレートは有害と考えられる
アップルのテンプレートは有害と考えられるアップルのテンプレートは有害と考えられる
アップルのテンプレートは有害と考えられる
 
iOSビヘイビア駆動開発
iOSビヘイビア駆動開発iOSビヘイビア駆動開発
iOSビヘイビア駆動開発
 
iOS UI Component API Design
iOS UI Component API DesigniOS UI Component API Design
iOS UI Component API Design
 

Ähnlich wie Intel® Xeon® Phi Coprocessor High Performance Programming

#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting StartedHadziq Fabroyir
 
#OOP_D_ITS - 3rd - Pointer And References
#OOP_D_ITS - 3rd - Pointer And References#OOP_D_ITS - 3rd - Pointer And References
#OOP_D_ITS - 3rd - Pointer And ReferencesHadziq Fabroyir
 
how to reuse code
how to reuse codehow to reuse code
how to reuse codejleed1
 
C Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory managementC Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory managementSreedhar Chowdam
 
#include stdafx.h using namespace std; #include stdlib.h.docx
#include stdafx.h using namespace std; #include stdlib.h.docx#include stdafx.h using namespace std; #include stdlib.h.docx
#include stdafx.h using namespace std; #include stdlib.h.docxajoy21
 
Node.js behind: V8 and its optimizations
Node.js behind: V8 and its optimizationsNode.js behind: V8 and its optimizations
Node.js behind: V8 and its optimizationsDawid Rusnak
 
Let’s talk about microbenchmarking
Let’s talk about microbenchmarkingLet’s talk about microbenchmarking
Let’s talk about microbenchmarkingAndrey Akinshin
 
Write a function in C++ to generate an N-node random binary search t.pdf
Write a function in C++ to generate an N-node random binary search t.pdfWrite a function in C++ to generate an N-node random binary search t.pdf
Write a function in C++ to generate an N-node random binary search t.pdfinfo824691
 
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdfPlease do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdfaioils
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Yulia Tsisyk
 
State of the .Net Performance
State of the .Net PerformanceState of the .Net Performance
State of the .Net PerformanceCUSTIS
 
Go Programming Language (Golang)
Go Programming Language (Golang)Go Programming Language (Golang)
Go Programming Language (Golang)Ishin Vin
 
Introduction to programming - class 11
Introduction to programming - class 11Introduction to programming - class 11
Introduction to programming - class 11Paul Brebner
 
C++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptxC++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptxShashiShash2
 
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdf
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdfC++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdf
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdfaassecuritysystem
 

Ähnlich wie Intel® Xeon® Phi Coprocessor High Performance Programming (20)

#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started
 
#OOP_D_ITS - 3rd - Pointer And References
#OOP_D_ITS - 3rd - Pointer And References#OOP_D_ITS - 3rd - Pointer And References
#OOP_D_ITS - 3rd - Pointer And References
 
how to reuse code
how to reuse codehow to reuse code
how to reuse code
 
ch08.ppt
ch08.pptch08.ppt
ch08.ppt
 
C Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory managementC Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory management
 
#include stdafx.h using namespace std; #include stdlib.h.docx
#include stdafx.h using namespace std; #include stdlib.h.docx#include stdafx.h using namespace std; #include stdlib.h.docx
#include stdafx.h using namespace std; #include stdlib.h.docx
 
Node.js behind: V8 and its optimizations
Node.js behind: V8 and its optimizationsNode.js behind: V8 and its optimizations
Node.js behind: V8 and its optimizations
 
Let’s talk about microbenchmarking
Let’s talk about microbenchmarkingLet’s talk about microbenchmarking
Let’s talk about microbenchmarking
 
Workshop 10: ECMAScript 6
Workshop 10: ECMAScript 6Workshop 10: ECMAScript 6
Workshop 10: ECMAScript 6
 
Write a function in C++ to generate an N-node random binary search t.pdf
Write a function in C++ to generate an N-node random binary search t.pdfWrite a function in C++ to generate an N-node random binary search t.pdf
Write a function in C++ to generate an N-node random binary search t.pdf
 
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdfPlease do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
State of the .Net Performance
State of the .Net PerformanceState of the .Net Performance
State of the .Net Performance
 
Go Programming Language (Golang)
Go Programming Language (Golang)Go Programming Language (Golang)
Go Programming Language (Golang)
 
C++ Language
C++ LanguageC++ Language
C++ Language
 
Introduction to programming - class 11
Introduction to programming - class 11Introduction to programming - class 11
Introduction to programming - class 11
 
functions
functionsfunctions
functions
 
C++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptxC++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptx
 
C++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptxC++ FUNCTIONS-1.pptx
C++ FUNCTIONS-1.pptx
 
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdf
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdfC++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdf
C++ Language -- Dynamic Memory -- There are 7 files in this project- a.pdf
 

Kürzlich hochgeladen

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Intel® Xeon® Phi Coprocessor High Performance Programming

  • 1. Intel® Xeon® Phi Coprocessor High Performance Programming Parallelizing a Simple Image Blurring Algorithm Brian Gesiak April 16th, 2014 Research Student, The University of Tokyo @modocache
  • 2. Today • Image blurring with a 9-point stencil algorithm • Comparing performance • Intel® Xeon® Dual Processor • Intel® Xeon® Phi Coprocessor • Iteratively improving performance • Worst: Completely serial • Better: Adding loop vectorization • Best: Supporting multiple threads • Further optimizations • Padding arrays for improved cache performance • Read-less writes, i.e.: streaming stores • Using huge memory pages
  • 3. Stencil Algorithms A 9-Point Stencil on a 2D Matrix
  • 4. Stencil Algorithms A 9-Point Stencil on a 2D Matrix
  • 5. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; A 9-Point Stencil on a 2D Matrix
  • 6. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; A 9-Point Stencil on a 2D Matrix
  • 7. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.next; A 9-Point Stencil on a 2D Matrix
  • 8. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.diagonal; weight.next; A 9-Point Stencil on a 2D Matrix
  • 9. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 10. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 11. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 12. Halo Effect Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 13. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 14. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 15. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 16. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth
  • 17. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s
  • 18. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s Intel® Xeon® Phi Coprocessor 1.091 GHz 61 8 GB/ GDDR5 1.065/2.130 TeraFLOP/s 352 GB/s
  • 20. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
• 29. 1st Comparison: Serial Execution. The compiler flags an assumed vector dependency on the inner loop: it cannot prove that fin and fout never alias, so the loop is left unvectorized.
• 30. 1st Comparison: Serial Execution Results

  Processor                      Elapsed Wall Time                  MegaFLOPS
  Intel® Xeon® Dual Processor    244.178 seconds (4 minutes)        4,107.658
  Intel® Xeon® Phi Coprocessor   2,838.342 seconds (47.3 minutes)   353.375

  $ icc -openmp -O3 stencil.c -o stencil
  $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

  The Dual Processor is 11 times faster than the Phi.
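  An aside the slides skip: the -mmic binary runs natively on the coprocessor, not the host. Assuming the Intel MPSS tools are installed, one way to copy it over and launch it from the host is:

  $ micnativeloadex ./stencil_phi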
• 34. 2nd Comparison: Vectorization. Ignoring Assumed Vector Dependencies

  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      #pragma ivdep
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center; ++north; ++northeast;
      }
    }
    // ...
  }
• 36. ivdep tells the compiler to ignore assumed dependencies.
  • In our program, the compiler cannot determine whether the two pointers (fin and fout) refer to the same block of memory, so it conservatively assumes they do.
  • The ivdep pragma negates this assumption.
  • Proven dependencies are still honored; only assumed dependencies may be ignored.
  Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
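  A minimal sketch of an alternative (an aside, not from the slides): the same no-aliasing promise can be expressed in standard C99 with restrict-qualified local pointers, assuming the caller always passes two distinct buffers:

  for (int i = 0; i < count; ++i) {
    // restrict promises the compiler that, within this block, these two
    // pointers never alias: the same guarantee #pragma ivdep asserts.
    const real * restrict in  = fin;
    real       * restrict out = fout;

    for (int y = 1; y < height - 1; ++y) {
      // ...same index setup and inner loop as before, reading in[],
      // writing out[]...
    }

    // Swap buffers for the next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }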
• 40. 2nd Comparison: Vectorization Results

  Processor                      Elapsed Wall Time                MegaFLOPS
  Intel® Xeon® Dual Processor    186.585 seconds (3.1 minutes)    5,375.572
  Intel® Xeon® Phi Coprocessor   623.302 seconds (10.3 minutes)   1,609.171

  $ icc -openmp -O3 stencil.c -o stencil             (1.3 times faster than serial)
  $ icc -openmp -mmic -O3 stencil.c -o stencil_phi   (4.5 times faster than serial)

  The Dual Processor is now only about 3.3 times faster than the Phi (623.302 s / 186.585 s).
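  As an aside (flag spelling varies across icc versions), icc can report which loops it vectorized and which were blocked by assumed dependencies:

  $ icc -openmp -O3 -vec-report2 stencil.c -o stencil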
• 44. 3rd Comparison: Multithreading. Work Division Using Parallel For Loops

  for (int i = 0; i < count; ++i) {
    // Each stencil pass depends on the previous one (the buffers swap),
    // so the parallelism goes on the row loop, not the iteration loop.
    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      #pragma ivdep
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center; ++north; ++northeast;
      }
    }
    // ...
  }
• 47. 3rd Comparison: Multithreading Results

  Processor                      Elapsed Wall Time (seconds)   MegaFLOPS
  Xeon® Dual Proc., 16 Threads   43.862                        22,867.185    (4x its vectorized run)
  Xeon® Dual Proc., 32 Threads   46.247                        21,688.103
  Xeon® Phi, 61 Threads          11.366                        88,246.452
  Xeon® Phi, 122 Threads         8.772                         114,338.399   (71x its vectorized run)
  Xeon® Phi, 183 Threads         10.546                        94,946.364
  Xeon® Phi, 244 Threads         12.696                        78,999.44

  The Phi (122 threads) is now 5 times faster than the Dual Processor (16 threads).
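  An aside the slides don't show: with Intel's OpenMP runtime, the thread counts compared above are typically set per run through environment variables, e.g.:

  $ export OMP_NUM_THREADS=122      # two threads per core on 61 cores
  $ export KMP_AFFINITY=balanced    # spread threads evenly across cores
  $ ./stencil_phi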
• 54. Further Optimizations
  1. Padded arrays
  2. Streaming stores
  3. Huge memory pages
• 56. Optimization 1: Padded Arrays. Optimizing Cache Access
  • We can add extra, unused data to the end of each row.
  • Doing so aligns heavily used memory addresses for efficient cache line access.
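  Concretely (an aside, assuming real is double per the earlier typedef): a 5,900-pixel row occupies 5,900 × 8 = 47,200 bytes, which is not a multiple of the 64-byte cache line. Rounding up to 738 cache lines (47,232 bytes) pads each row to 5,904 elements, so every row begins on a cache-line boundary. The code on the next slide performs exactly this rounding.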
• 60. Optimization 1: Padded Arrays

  static const size_t kPaddingSize = 64;  // cache-line size, in bytes

  int main(int argc, const char **argv) {
    int height = 10000;
    int count = 1000;
    // Round each row up to a whole number of 64-byte cache lines
    // (5,900 doubles become 5,904 doubles per row):
    int width = ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));

    size_t size = sizeof(real) * width * height;
    real *fin = (real *)_mm_malloc(size, kPaddingSize);   // 64-byte-aligned
    real *fout = (real *)_mm_malloc(size, kPaddingSize);

    weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
    stencil_9pt(fin, fout, width, height, weight, count);

    // ...save results

    _mm_free(fin);
    _mm_free(fout);
    return 0;
  }
• 70. Optimization 1: Padded Arrays. Accommodating for Padding

  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc., stepping by the padded
    // row stride (width here is the padded width computed in main)
    int center = y * width + 1;
    int north = center - width;
    int south = center + width;
    int east = center + 1;
    int west = center - 1;
    int northwest = north - 1;
    int northeast = north + 1;
    int southwest = south - 1;
    int southeast = south + 1;

    #pragma ivdep
    // ...
  }
• 73. Optimization 1: Padded Arrays Results

  Processor                Elapsed Wall Time (seconds)   MegaFLOPS
  Xeon® Phi, 61 Threads    11.644                        86,138.371
  Xeon® Phi, 122 Threads   8.973                         111,774.803
  Xeon® Phi, 183 Threads   10.326                        97,132.546
  Xeon® Phi, 244 Threads   11.469                        87,452.707
• 76. Optimization 2: Streaming Stores. Read-less Writes
  • By default, Xeon® Phi processors read the value at an address before writing to that address.
  • When calculating the weighted average for a pixel, we never use the original value of the output pixel, so enabling streaming stores (writing without the prior read) should improve performance.
• 79. Optimization 2: Streaming Stores. Read-less Writes with #pragma vector nontemporal

  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.
      #pragma ivdep
      #pragma vector nontemporal
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center; ++north; ++northeast;
      }
    }
    // ...
  }
• 81. Optimization 2: Streaming Stores Results

  Processor                Elapsed Wall Time (seconds)   MegaFLOPS
  Xeon® Phi, 61 Threads    13.588                        73,978.915
  Xeon® Phi, 122 Threads   8.491                         111,774.803
  Xeon® Phi, 183 Threads   8.663                         115,773.405
  Xeon® Phi, 244 Threads   9.507                         105,498.781
• 84. Optimization 3: Huge Memory Pages
  • Memory pages map virtual memory used by our program to physical memory.
  • Mappings are stored in a translation look-aside buffer (TLB).
  • On a TLB miss, mappings are traversed in a "page table walk".
  • malloc and _mm_malloc use 4KB memory pages by default.
  • By increasing the size of each memory page, fewer mappings are needed and traversal time may be reduced.
• 90. Optimization 3: Huge Memory Pages

  // #include <sys/mman.h>
  size_t size = sizeof(real) * width * height;  // width is the padded width

  // Before: 64-byte-aligned allocation backed by default 4KB pages
  real *fin = (real *)_mm_malloc(size, kPaddingSize);

  // After: anonymous mapping backed by huge pages
  // (fd is -1 and offset is 0, since no file backs the mapping)
  real *fin = (real *)mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
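  An aside (standard Linux, not from the slides): MAP_HUGETLB fails unless the kernel has huge pages reserved. Checking and reserving them:

  $ grep Huge /proc/meminfo
  $ sudo sysctl vm.nr_hugepages=128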
• 93. Optimization 3: Huge Memory Pages Results

  Processor                Elapsed Wall Time (seconds)   MegaFLOPS
  Xeon® Phi, 61 Threads    14.486                        69,239.365
  Xeon® Phi, 122 Threads   8.226                         121,924.389
  Xeon® Phi, 183 Threads   8.749                         114,636.799
  Xeon® Phi, 244 Threads   9.466                         105,955.358
• 96. Takeaways
  • The key to achieving high performance is combining loop vectorization with multiple threads.
  • Completely serial programs run faster on standard processors.
  • Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor.
  • Further optimizations can be used to tune performance:
    • Data padding
    • Streaming stores
    • Huge memory pages
• 97. Sources and Additional Resources
  • Today's slides: http://modocache.io/xeon-phi-high-performance
  • Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders): http://www.amazon.com/dp/0124104142
  • Intel Documentation
    • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
    • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm