Deep Convolutional Network
Evaluation on the Intel Xeon Phi
Gaurav Raina
MSc Graduation Project
5-1-2016
Cameras are ubiquitous
1
Vision processing on mobile devices
• Currently most processing off-line
• High compute and energy demands
• Move to edge processing
2
Motivation
• Convolutional neural nets are very generic (support
many vision tasks)
• Traffic sign detection
• Pedestrian detection
• Face detection
• Accelerate with a power-efficient core
3
Problem statement
“Efficiently parallelize a Convolutional Neural Network
on a highly-parallel power efficient processor
platform”
4
You are here:
5
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
6
Overview
1. Convolutional Network (ConvNet) algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
7
Introduction: Neural Networks
• Artificial neuron model
8
Convolution example
9
Image credit:
deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Speed sign detection application
10
Image courtesy: Maurice Peemen
ConvNet Application in action
11
Video courtesy: Maurice Peemen
https://youtu.be/kkha3sPoU70
ConvNet Code Structure
for (r = 0; r < 6; r++) {                      // output feature maps
  for (m = 0; m < YL1; m++) {                  // output rows
    for (n = 0; n < XL1; n++) {                // output columns
      acc = bias[r];
      for (k = 0; k < 6; k++) {                // 6x6 convolution kernel
        for (l = 0; l < 6; l++) {
          acc += in_layer[m + k][n + l] * weight[r][k][l];   // compute: multiply-accumulate
        }
      }
      index = saturate_shift(acc);             // 10-bit fixed-point format
      output_layer[r][m][n] = fixact[index];   // store: activation lookup + write
    }
  }
}
"r" = output feature maps (6), "m,n" = neuron output coordinates,
"k,l" = 6x6 convolution kernel, fixact = sigmoid activation lookup table
12
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
13
Optimization Approach
• Methodology:
• Test on Core i7 (Haswell – AVX2): 1 core
• Move to Xeon Phi (Knights Corner – IMCI): many-core
• Steps:
1. Loop unrolling (see the sketch below)
2. Vectorization using SIMD intrinsics (DLP)
− Fused Multiply Add instruction
3. Parallelization using OpenMP (TLP)
14
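As a concrete illustration of step 1, the sketch below fully unrolls the 6-wide inner loop of the convolution kernel from slide 12. It is a minimal example only: the helper name, argument layout and the 8-bit input / 16-bit weight types are assumptions for illustration, not the exact thesis code.

/* Step 1 (loop unrolling), illustrated on the 6x6 kernel of slide 12.
 * Hypothetical helper: names, types and the row-stride argument are
 * illustrative assumptions, not the thesis implementation. */
static int conv6x6_unrolled(const unsigned char *in, int in_stride,
                            const short weight[6][6], int bias)
{
    int acc = bias;                                   /* 32-bit accumulator */
    for (int k = 0; k < 6; k++) {                     /* kernel rows        */
        const unsigned char *i = in + k * in_stride;  /* 6 input pixels     */
        const short         *w = weight[k];           /* 6 kernel weights   */
        acc += i[0]*w[0] + i[1]*w[1] + i[2]*w[2]      /* column loop fully  */
             + i[3]*w[3] + i[4]*w[4] + i[5]*w[5];     /* unrolled           */
    }
    return acc;
}

The unrolled body exposes six independent multiply-accumulates per kernel row, which is exactly what the SIMD intrinsics in step 2 turn into vector operations.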
SIMD Vectorization example
15
Courtesy: www.kernel.org
Intel MIC Programming models
16
Credit: Dr. Volker Weinberg,
Introduction into Intel Xeon Phi Programming LRZ, 28.4.2015
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
17
Roofline Model
18
[Figure: generic roofline model. Y-axis: attainable GFLOP/s (log scale, 0.5 to 256); X-axis: actual FLOP/Byte ratio (1/8 to 16). The sloped line is the memory-bandwidth roofline (slope = BW), the flat line is the processor compute ceiling, and two example kernels (Kernel 1, Kernel 2) each sit under whichever bound limits them.]
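For reference, the bound behind this and the following roofline plots is the standard roofline formula (nothing specific to this work):

\text{Attainable performance} = \min\bigl(\text{peak compute throughput},\; \text{peak memory bandwidth} \times \text{operational intensity}\bigr)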
Intel Core i7
• Intel Core i7 @3.5GHz
• Haswell micro-architecture
• AVX2 vector instructions
− 256-bit vectors
19
Multiply Accumulate intrinsic – AVX2
20
Calculation of Ops/Byte
• acc += in_layer[i]*weight[j]
• Intrinsics used
• add(acc, madd(in_layer,weight))
• Bytes Loaded
• in_layer[i] - 1 byte
• weight[j] - 2 bytes
• Operational Intensity
• 2 ops / 3 bytes ≈ 0.67 Ops/Byte (see the intrinsics sketch below)
21
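Below is a minimal sketch of the add(acc, madd(in_layer, weight)) pattern named above, assuming 8-bit input pixels widened to 16 bits and 16-bit fixed-point weights (the 1 + 2 bytes per multiply-accumulate used in the Ops/Byte count). The helper name and data layout are illustrative assumptions, not the exact thesis kernel.

#include <immintrin.h>

/* AVX2 sketch of acc = add(acc, madd(in_layer, weight)):
 * 16 MACs per call, 32-bit partial sums in acc. Illustrative only. */
static __m256i mac16_avx2(__m256i acc, const unsigned char *in, const short *w)
{
    /* load 16 input pixels and widen u8 -> s16 */
    __m256i vin = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)in));
    /* load 16 fixed-point weights (16-bit) */
    __m256i vw  = _mm256_loadu_si256((const __m256i *)w);
    /* madd: 16 products, pairwise-added into 8 x 32-bit sums */
    __m256i vp  = _mm256_madd_epi16(vin, vw);
    /* add into the running 32-bit accumulators */
    return _mm256_add_epi32(acc, vp);
}

On Haswell this maps to one VPMADDWD plus one VPADDD per 16 multiply-accumulates; a true fused multiply-add (FMA3) exists only for floating point, which is why the integer kernel uses the madd/add pair.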
Speedup after SIMD intrinsics
• (w.r.t non-vectorized code)
• Intel C Compiler
• Layer 1 - 4.7x
• Layer 2 - 5.7x
• Layer 3 - 4.88x
• Overall CNN – 5.6x
• GCC compiler
• Layer 1 - 4.7x
• Layer 2 - 6.8x
• Layer 3 - 6.7x
• Overall CNN – 6.3x
22
• (w.r.t auto-vectorized code)
• ICC
• Layer 1 - 4.9x
• Layer 2 - 11.3x
• Layer 3 - 4.8x
• Overall CNN - 5x
• GCC
• Layer 1 - same
• Layer 2 - same
• Layer 3 - same
• Overall CNN – 6.3x
Roofline - Core i7 - manual v/s auto
23
[Figure: single-core SIMD ops roofline, Intel Core i7-5930K @3.5 GHz. Ceilings: 56 GOps/s vector ops, 224 GB/s L1 read BW, 112 GB/s L1 write BW, 68 GB/s BW to DDR RAM, 16.6 GB/s STREAM BW. Data points at 0.67 Ops/Byte: Layer 3 hand-optimized 35.54 GOps/s, complete CNN hand-optimized 32.46 GOps/s, complete CNN auto-vectorized 5.13 GOps/s; Layer 1/2 hand-optimized and no-vectorization series (gcc and icc builds) are also plotted.]
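As a sanity check on the plot above, the roofline bound at the CNN's operational intensity can be evaluated directly, taking the L1 read bandwidth as the memory roofline:

\min(56,\; 0.67 \times 224) = \min(56,\; 150) = 56\ \text{GOps/s}

so at 0.67 Ops/Byte a single core is compute-bound, and the hand-optimized CNN at 32.46 GOps/s reaches roughly 58% of the vector ceiling.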
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Intel core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
24
Intel Xeon Phi
• Knights Corner
• Initial Many Core
Instructions (IMCI)
• Knights Landing
• AVX-512
• 57-61 cores
25
Credit: Intel
Intel Xeon Phi
26
Intel Many Integrated Core Architecture
Credit: http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/
Core Architecture Overview
• 60+ in-order, low power IA cores
• Bi-directional Ring interconnect
• Two pipelines (u & v)
• Scalar Unit based on Pentium
• 512-bit SIMD Vector Processing Unit
• 4 hardware threads
• Coherent 512KB L2 Cache per core
27
Image courtesy: Intel PRACE MIC Summer School, July 2013, CINECA
Ref.: pp. 18-19, section 2.1.2, Intel Xeon Phi Coprocessor System Software Developer's Guide
28
Going from Core i7 to Xeon Phi (AVX to KNC)
29
Going from Core i7 to Xeon Phi (AVX to IMCI)
madd()
fmadd()
• acc = acc + in_layer[m,n,l] x weight[r,k,l]
30
Fused Multiply-Add on Xeon Phi
31
Intrinsics Kernel implementation
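The slide above shows the intrinsics kernel as an image; as a stand-in, here is a minimal sketch of the fmadd() pattern on the 512-bit IMCI vector unit (16 lanes per operation). It uses single-precision floats and 64-byte-aligned loads purely for illustration; the thesis kernel itself works on fixed-point data, so treat the names and types as assumptions.

#include <immintrin.h>

/* IMCI sketch of acc = acc + in_layer * weight, 16 elements per call.
 * Illustrative float variant; KNC requires 64-byte-aligned vector loads. */
static __m512 mac16_imci(__m512 acc, const float *in, const float *w)
{
    __m512 vin = _mm512_load_ps(in);       /* 16 input values       */
    __m512 vw  = _mm512_load_ps(w);        /* 16 weights            */
    return _mm512_fmadd_ps(vin, vw, acc);  /* fused acc += vin * vw */
}

Compared with the AVX2 version, the multiply and accumulate collapse into a single fused instruction and the vector width doubles from 256 to 512 bits.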
Speedup after SIMD intrinsics
• (w.r.t non-vectorized code)
• Intel C Compiler
• Layer 1 - 5.7x
• Layer 2 - 10.2x
• Layer 3 - 12.4x
• Overall CNN – 11x
• ~0.75 frames per second on one core
− 57 cores => ~43 FPS
32
• (w.r.t auto-vectorized code)
• ICC
• Layer 1 - 5.6x
• Layer 2 - 6.3x
• Layer 3 - 10.7x
• Overall CNN – 9.2x
Roofline – Xeon Phi
33
[Figure: single-core roofline, Xeon Phi @1.1 GHz. Ceilings: 35.2 GFLOP/s vector compute, 0.48 GFLOP/s scalar compute, 70.4 GB/s L1 cache BW, 35.2 GB/s L2 cache BW, 5.8 GB/s STREAM BW to DDR RAM.]
Roofline – Xeon Phi
34
[Figure: the same single-core Xeon Phi roofline (35.2 GFLOP/s vector and 0.48 GFLOP/s scalar ceilings; 70.4 GB/s L1, 35.2 GB/s L2, 5.8 GB/s STREAM DDR BW), now with per-layer data points: Layers 1-3, hand-optimized vs. auto-vectorized.]
Roofline – Xeon Phi - Complete
35
[Figure: single-core Xeon Phi roofline for the complete CNN at 0.67 FLOP/Byte: hand-optimized 1.56 GFLOP/s vs. auto-vectorized 0.17 GFLOP/s, against the 35.2 GFLOP/s vector and 0.48 GFLOP/s scalar ceilings (70.4 GB/s L1, 35.2 GB/s L2, 5.8 GB/s DDR BW).]
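The same check as on the Core i7, with the L1 bandwidth as the memory roofline:

\min(35.2,\; 0.67 \times 70.4) = \min(35.2,\; 47.2) = 35.2\ \text{GFLOP/s}

so a single Xeon Phi core is also compute-bound at this intensity; the hand-optimized complete CNN at 1.56 GFLOP/s sits far below this single-core ceiling, leaving the headroom that thread-level parallelism over the 57+ cores (Future Work) is meant to exploit.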
Demo
• Speed sign application running on:
• The Core i7
• The Xeon Phi
36
Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
37
You are here:
38
Conclusion
• Contribution: overall ConvNet speedup from SIMD intrinsics (vs. non-vectorized code)
• Core i7 – 6.3x
• Xeon Phi – 11x
• Design trade-off:
• Developer time v/s Optimized code
• Architecture specific intrinsics v/s generic OpenMP
39
Future Work
OpenMP number of threads
• Varying number of threads per core
• 1T x 57 cores = 57T
• 4T x 57 cores = 228T
• Varying thread distribution on Cores
• KMP_AFFINITY (Environment Variable)
• Splitting work using OpenMP directives (see the sketch below)
• #pragma omp for
40
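A minimal sketch of how that work split could look, reusing the loop structure and array names from slide 12. The pragma, schedule and environment settings are examples of the options listed above, not measured configurations.

/* Distribute output feature maps and rows over OpenMP threads; each
 * thread's inner loops remain the vectorized kernel from slide 12.
 * Example environment settings (from the options above):
 *   export OMP_NUM_THREADS=228    # 4 threads x 57 cores
 *   export KMP_AFFINITY=scatter   # or compact / balanced
 */
#pragma omp parallel for collapse(2) schedule(static)
for (int r = 0; r < 6; r++) {            /* output feature maps */
    for (int m = 0; m < YL1; m++) {      /* output rows         */
        for (int n = 0; n < XL1; n++) {  /* output columns      */
            int acc = bias[r];
            for (int k = 0; k < 6; k++)
                for (int l = 0; l < 6; l++)
                    acc += in_layer[m + k][n + l] * weight[r][k][l];
            output_layer[r][m][n] = fixact[saturate_shift(acc)];
        }
    }
}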
41
                    Baseline    OpenMP Scaling   Vectorization   Peeling
Elapsed time (s):   5605.027    127.616          17.767          15.619
FLOPS (MFlops):     254.991     11199.45         80442.24        91506.41
Throughput (GB/s):  0.235       10.338           74.254          84.467
Test code on Xeon Phi
• Baseline - simulate diffusion of a solute through a volume of liquid
• OpenMP Scaling
• #pragma omp for collapse(2)
• Vectorization
• #pragma simd
Credit: Jeffers, James, and James Reinders. Intel Xeon Phi coprocessor
high-performance programming. Newnes, 2013.
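For reference, a hedged sketch of the kind of 3D diffusion stencil loop nest the cited test code optimizes, with the two pragmas named above. The array names, dimensions and coefficients are illustrative assumptions, not the book's exact code.

/* Illustrative 3D diffusion update in the style of the cited test code:
 * outer two loops collapsed over OpenMP threads, unit-stride inner loop
 * vectorized. NX/NY/NZ, arrays and coefficients are assumed for the sketch. */
#pragma omp parallel for collapse(2)
for (int z = 1; z < NZ - 1; z++) {
    for (int y = 1; y < NY - 1; y++) {
#pragma simd
        for (int x = 1; x < NX - 1; x++) {
            f_next[z][y][x] =
                  cc * f[z][y][x]
                + cw * f[z][y][x - 1] + ce * f[z][y][x + 1]
                + cs * f[z][y - 1][x] + cn * f[z][y + 1][x]
                + cb * f[z - 1][y][x] + ct * f[z + 1][y][x];
        }
    }
}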
Thank You.
Questions?