Performance Optimization of Deep
Learning Frameworks on Modern Intel
Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh,
Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel
Agenda

•  Optimization matters on modern architectures
•  Intel's recent Xeon and Xeon Phi products
•  Introduction to Deep Learning
•  Optimizing DL frameworks on IA
•  Key challenges
•  Optimization techniques
•  Performance data
•  DL scaling
  
Moore's Law Goes On!

Scaling no longer comes from increasing clock speeds; it now comes from more cores + wider SIMD (hierarchical parallelism)
  
Combined Amdahl's Law for Vector Multicores*

$$\mathrm{Speedup} = \left(\frac{1}{\mathrm{Serial}_{frac} + \frac{1-\mathrm{Serial}_{frac}}{\mathrm{NumCores}}}\right) \times \left(\frac{1}{\mathrm{Scalar}_{frac} + \frac{1-\mathrm{Scalar}_{frac}}{\mathrm{VectorLength}}}\right)$$
Goal: Reduce Serial Fraction and Reduce Scalar Fraction of Code
Ideal Speedup: NumCores*VectorLength (requires zero scalar, zero serial work)
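Worked numerically, a minimal Python sketch of the formula above; the 68-core, 16-lane machine and the serial/scalar fractions are hypothetical illustration values, not measurements from the slides:

```python
def combined_amdahl_speedup(serial_frac, scalar_frac, num_cores, vector_length):
    """Combined Amdahl's Law for vector multicores (the formula above)."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

# Hypothetical: 68 cores, 16-wide SIMD, 1% serial and 5% scalar work.
print(round(combined_amdahl_speedup(0.01, 0.05, 68, 16)))  # ~372x, far below the 68*16 = 1088x ideal
```

Even small serial and scalar fractions cost most of the ideal speedup, which is why the goal above targets both.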
Compute Bound Performance

Most kernels of ML codes are compute bound, i.e. raw FLOPS matter.

Roofline Model: Gflops/s = min(Peak Gflops/s, Stream BW * flops/byte)
[Roofline figure: attainable Gflops/s vs. compute intensity (flops/byte), with ceilings at peak "compute" Gflops/s with and without SIMD]
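A minimal sketch of the roofline bound; the peak and bandwidth numbers below are made up for illustration:

```python
def attainable_gflops(peak_gflops, stream_bw_gb_s, flops_per_byte):
    """Roofline model: performance is capped by compute or by memory bandwidth."""
    return min(peak_gflops, stream_bw_gb_s * flops_per_byte)

# Hypothetical machine: 3000 Gflops/s peak, 400 GB/s STREAM bandwidth.
for intensity in (0.5, 2.0, 10.0):                      # flops/byte
    print(intensity, attainable_gflops(3000, 400, intensity))
# 0.5 and 2.0 flops/byte are bandwidth bound (200, 800); 10 flops/byte hits the 3000 peak.
```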
  
Overview of Current Generation of
Intel Xeon and Xeon Phi Products
Current Intel® Xeon Platforms (tick-tock cadence)

•  Nehalem: TOCK, NEW Intel® Microarchitecture (Nehalem), 45nm process technology
•  Westmere: TICK, Intel Microarchitecture (Nehalem), 32nm process technology
•  Sandy Bridge: TOCK, NEW Intel Microarchitecture (Sandy Bridge), 32nm process technology
•  Ivy Bridge: TICK, Intel Microarchitecture (Sandy Bridge), 22nm process technology
•  Haswell: TOCK, NEW Intel Microarchitecture (Haswell), 22nm process technology
•  Broadwell: TICK, Intel Microarchitecture (Haswell), 14nm process technology

Latest released – Broadwell (14nm process)
•  Intel's foundation of HPC and ML performance
•  Suited for full scope of workloads
•  Industry leading performance/watt for serial & highly parallel workloads
•  Up to 22 cores/socket (Broadwell-EP) (w/ Hyper-Threading technology)

Software optimization helps maximize benefit and adoption of new features
2nd Generation Intel® Xeon Phi™ Platform
Intel® AVX Technology

SNB/IVB – 256b AVX1 (Flops/Cycle: 16 SP / 8 DP):
•  256-bit basic FP
•  16 registers
•  NDS (and AVX128)
•  Improved blend
•  MASKMOV
•  Implicit unaligned
•  Float16 (IVB 2012)

HSW/BDW – 256b AVX2 (Flops/Cycle: 32 SP / 16 DP, FMA):
•  256-bit FP FMA
•  256-bit integer
•  PERMD
•  Gather

SKX & KNL – 512b AVX512 (Flops/Cycle: 64 SP / 32 DP, FMA):
•  512-bit FP/Integer
•  32 registers
•  8 mask registers
•  Embedded rounding
•  Embedded broadcast
•  Scalar/SSE/AVX "promotions"
•  Native media additions
•  HPC additions
•  Transcendental support
•  Gather/Scatter
Overview of Deep Learning and
DL Frameworks
Deep Learning – Convolutional Neural Network

Convolution Parameters:
Number of outputs/feature-maps: < 4 >
Filter size: < 3 x 3 >
Stride: < 2 >
Pad_size (for corner case): < 1 >

[Figure: a 3 x 3 filter sliding over the padded input with stride 2, producing the feature maps]
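To make the parameters concrete, a short sketch of the standard output-size arithmetic for a convolution; the filter/stride/pad values are the ones above, while the 6x6 input is an assumed illustration value:

```python
def conv_output_size(in_size, filter_size, stride, pad):
    """Spatial output size: floor((in + 2*pad - filter) / stride) + 1."""
    return (in_size + 2 * pad - filter_size) // stride + 1

# Slide parameters (3x3 filter, stride 2, pad 1) on an assumed 6x6 input:
side = conv_output_size(6, 3, 2, 1)
print(side)   # 3 -> each of the 4 output feature maps is 3x3
```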
Deep Learning: Train Once, Use Many Times

•  Step 1: Training (over hours/days/weeks)
   Input data (e.g. an image labeled "Person") -> create deep network -> output
   classification (e.g. 90% person, 8% traffic light) -> trained model

•  Step 2: Inference (real time)
   New input from camera and sensors -> trained neural network model -> output
   classification (e.g. 97% person)
  
Deep Learning: Why Now?

•  Bigger Data – Image: 1000 KB / picture; Audio: 5000 KB / song; Video: 5,000,000 KB / movie
•  Better Hardware – Transistor density doubles every 18 months; Cost/GB in 1995: $1000.00; Cost/GB in 2015: $0.03
•  Smarter Algorithms – Advances in algorithm innovation, including neural networks, leading to better accuracy in training models
  
Intel Caffe – ML Framework Optimized for Xeon and Xeon Phi Products

•  Fork of BVLC Caffe by Intel to optimize for IA
•  Leverages Intel MKL Deep Neural Network (DNN) APIs
•  Optimized for BDW (AVX2) and KNL (MIC_AVX512)
•  https://github.com/intel/caffe

[Diagram: Intel Caffe stack – compute layer (Convolution, ReLU, Pooling) and data layer (LMDB, LevelDB, HDF5) on top of Intel MKL DNN, targeting BDW and KNL]
  
TensorFlow™: Open Source ML Framework (Google)

•  Computation is a dataflow graph with tensors
•  General mathematical computing framework – widely used for
   •  Deep neural networks
   •  Other machine learning algorithms
   •  HPC applications
•  Key computational kernels, extendable user operations
•  Core in C++, front-end wrapper in Python
•  Multi-node support using gRPC (Google Remote Procedure Calls)

(Example from Jeff Dean's presentation)
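As a minimal illustration of the dataflow-graph model (a generic sketch, not the example from that presentation), using the TensorFlow 1.x-era graph API; the constants are arbitrary:

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API (tf.Session was removed in 2.x)

# Nodes are operations, edges are tensors; nothing executes until a session runs the graph.
a = tf.constant([[1.0, 2.0]])        # 1x2 tensor
w = tf.constant([[3.0], [4.0]])      # 2x1 tensor
y = tf.matmul(a, w)                  # matmul node added to the dataflow graph

with tf.Session() as sess:
    print(sess.run(y))               # [[11.]]
```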
  	
  
Optimizing Deep Learning Frameworks
Performance Optimization on Modern Platforms

•  Utilize all the cores
   •  OpenMP, MPI, TBB…
   •  Reduce synchronization events, serial code
   •  Improve load balancing
•  Vectorize/SIMD
   •  Unit-strided access per SIMD lane (see the timing sketch below)
   •  High vector efficiency
   •  Data alignment
•  Efficient memory/cache use
   •  Blocking
   •  Data reuse
   •  Prefetching
   •  Memory allocation

Hierarchical Parallelism

•  Fine-grained parallelism / within node
   •  Sub-domain: 1) multi-level domain decomposition (ex. across layers)
   •  2) data decomposition (layer parallelism)
•  Coarse-grained / multi-node
   •  Domain decomposition
   •  Scaling: improve load balancing; reduce synchronization events, all-to-all comms
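A small numpy timing sketch of the unit-stride point above; the array size is arbitrary. On most machines the contiguous row traversal wins because it maps to unit-stride vector loads and sequential cache lines:

```python
import numpy as np
import timeit

a = np.zeros((4096, 4096), dtype=np.float32)

row = timeit.timeit(lambda: a[0, :].sum(), number=1000)  # unit stride: contiguous floats
col = timeit.timeit(lambda: a[:, 0].sum(), number=1000)  # stride of 4096 floats per element
print(f"row sum: {row:.4f}s   column sum: {col:.4f}s")
```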
  
Intel Strategy: Optimized Deep Learning Environment

•  Fuel the development of vertical solutions – Intel® Deep Learning SDK
•  Accelerate design, training, and deployment
•  Drive optimizations across open source deep learning frameworks
•  Deliver best single node and multi-node performance – Intel® Omni-Path Architecture (Intel® OPA)
•  Maximum performance on Intel architecture – Intel® Math Kernel Library (Intel® MKL) + Intel® MKL-DNN, for training and inference
Example Challenge 1: Data Layout Has Big Impact on Performance

•  Data layout impacts performance
   •  Sequential access to avoid gather/scatter
   •  Have iterations in the inner-most loop to ensure high vector utilization
   •  Maximize data reuse; e.g. weights in a convolution layer
•  Converting to/from an optimized layout is sometimes less expensive than
   operating on the unoptimized layout (a numpy sketch follows the figure below)
  
[Figure: two example matrices and their flattened in-memory layouts – plain row-major order ("21 18 … 1 .. 8 92 ..") vs. an interleaved/blocked order ("21 8 18 92 .. 1 11 ..") – one layout is better optimized for some operations]
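In numpy terms, a minimal sketch of the idea: the same activations stored plain NCHW versus a channel-blocked layout whose inner dimension matches the SIMD width. The shapes are made up; MKL-DNN's real blocked formats (e.g. nChw8c) follow this pattern but differ in detail:

```python
import numpy as np

# Hypothetical activations: batch=1, channels=16, height=4, width=4, in NCHW order.
x = np.arange(1 * 16 * 4 * 4, dtype=np.float32).reshape(1, 16, 4, 4)

def to_blocked(nchw, block=8):
    """NCHW -> (N, C//block, H, W, block): for each pixel, 'block' consecutive
    channels sit contiguously, so one 8-wide SIMD load reads them unit-strided."""
    n, c, h, w = nchw.shape
    return nchw.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2).copy()

blocked = to_blocked(x)
print(blocked.shape)        # (1, 2, 4, 4, 8)
```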
Example Challenge 2: Minimize Conversions Overhead

•  End-to-end optimization can reduce conversions
•  Staying in optimized layout as long as possible becomes one of the tuning goals
•  Minimize the number of back and forth conversions
   •  Use of graph optimization techniques (a small sketch follows the diagram below)
  
[Diagram: Convolution -> Convolution -> Max Pool chain in which each op is bracketed by "Native to MKL layout" and "MKL layout to Native" conversions – the back-to-back conversions between adjacent ops are candidates for elimination]
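A minimal sketch of such a graph pass over a toy op list; the op names ("to_mkl", "to_native") are hypothetical stand-ins, not a real framework API. Dropping a conversion that is immediately undone is what keeps data in the optimized layout between adjacent MKL ops:

```python
def drop_redundant_conversions(ops):
    """Remove adjacent ('to_native', 'to_mkl') pairs so data stays in MKL layout."""
    out = []
    for op in ops:
        if out and (out[-1], op) == ("to_native", "to_mkl"):
            out.pop()          # the two conversions cancel out
        else:
            out.append(op)
    return out

graph = ["to_mkl", "conv", "to_native", "to_mkl", "conv", "to_native", "to_mkl", "maxpool", "to_native"]
print(drop_redundant_conversions(graph))
# ['to_mkl', 'conv', 'conv', 'maxpool', 'to_native']
```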
  
Example Challenge 3: Ensuring Enough Parallelism to Leverage all Cores

•  Maximize parallelism to use all cores efficiently
   •  Intra operation/layer parallelism within operators (OpenMP),
      e.g. convolution of tiles in parallel
   •  Inter operation parallelism across operators,
      e.g. parallel execution of 1x1, 3x3 and 5x5 Conv branches feeding a concat
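In TensorFlow, these two levels map directly onto session options; a 1.x-style sketch, where the thread counts are placeholders to tune per machine:

```python
import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto(
    intra_op_parallelism_threads=44,  # threads inside one op, e.g. a large conv/matmul
    inter_op_parallelism_threads=2,   # independent ops, e.g. parallel conv branches
)
with tf.Session(config=config) as sess:
    pass  # run the graph with both levels of parallelism configured
```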
  
Example Challenge 4: Optimizing the Data Layer

•  Data layer comprises 3 major ops
   o  Read data
   o  Decode data: e.g. JPEG decode, decompression
   o  Transform data
•  Result of read, decode & transform is input to DNN layers
•  Reduce number of cores dedicated to feeding the DNN
   o  IO optimization: consider compression
   o  Decode: consider LMDB instead of JPEG
   o  Resizing/data processing: consider pre-processing
   o  Then vectorize, parallelize

[Diagram: cores C0 and C1 run data-layer "boost threads" while C2 … Cn-1 run OpenMP threads for the DNN compute]
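A minimal sketch of that split, using plain Python threads as stand-ins for the boost/OpenMP thread pools; the decode and compute bodies are hypothetical placeholders:

```python
import queue
import threading

batches = queue.Queue(maxsize=4)   # small buffer between the data layer and the DNN

def data_layer(num_batches):
    """Dedicated 'feeder' cores: read + decode + transform, then hand off."""
    for i in range(num_batches):
        batches.put("decoded batch %d" % i)  # stand-in for read/JPEG-decode/resize
    batches.put(None)                        # end-of-stream marker

def dnn_compute():
    """Remaining cores: consume ready batches and run the DNN layers."""
    while True:
        batch = batches.get()
        if batch is None:
            break
        print("training on", batch)          # stand-in for forward/backward pass

feeder = threading.Thread(target=data_layer, args=(3,))
feeder.start()
dnn_compute()
feeder.join()
```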
  
Optimizing Deep Learning Frameworks for Intel® Architecture

•  Leverage high-performance compute libraries and tools
   •  e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler etc.
•  Data format/shape:
   •  Right format/shape for max performance: blocking, gather/scatter
•  Data layout:
   •  Minimize cost of data layout conversions
•  Parallelism:
   •  Use all cores, eliminate serial sections, load imbalance
•  Other functions/primitives (un-optimized in libraries):
   •  Optimize via compiler knobs, improve existing implementations
•  Memory allocation
   •  Unique characteristics and ability to reuse buffers
•  Data layer optimizations:
   •  Parallelization, vectorization, IO
•  Optimize hyper parameters:
   •  e.g. batch size for more parallelism
   •  learning rate and optimizer to ensure accuracy/convergence
  
AlexNet Optimization Progression

[Chart: cumulative speedup over the optimization stages, baseline = 1.00x]
Broadwell:       1.00x, 2.20x, 4.18x, 6.96x, 7.72x, 9.27x, 13.36x, 13.72x
Knights Landing: 2.16x, 12.48x, 18.28x, 25.49x, 27.17x, 40.71x, 49.07x
  
VGG Optimization Progression

[Chart: cumulative speedup by optimization stage – Baseline, MKL Integration, Thread Optimization, Compiler Knobs Tuning, Matrix Transpose/Data Transformations, Memory Allocations, Conversions Optimization]
Broadwell:       1.00x, 3.15x, 5.40x, 10.18x, 13.29x, 14.65x, 19.27x
Knights Landing: 1.00x, 15.80x, 23.60x, 122.50x, 164.95x, 171.20x, 273.50x
  
Configuration details

Intel® Xeon™ processor E5-2699v4 (22 cores, 2.2 GHz), 128GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2

Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16GB MCDRAM: flat mode), 96GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2

AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks
  
	
  
Multi-Node Distributed Training

•  Model parallelism
   •  Break the model across N nodes
   •  The same data is in all the nodes
•  Data parallelism
   •  Break the dataset across N nodes
   •  The same model is in all the nodes
   •  Good for networks with few weights, e.g. GoogLeNet
•  You can use either model or data parallelism, or a hybrid of both
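A minimal numpy sketch of the data-parallel case under synchronous updates, with made-up shapes and a plain average standing in for the all-reduce of gradients across nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=4)                    # same model replicated on every node

def local_gradient(shard, w):
    """Stand-in for forward/backward on one node's shard of the global batch."""
    return shard.mean(axis=0) * w               # fake gradient, illustration only

# Global batch of 1024 samples split across 64 nodes (16 samples per node).
shards = np.array_split(rng.normal(size=(1024, 4)), 64)
grads = [local_gradient(s, weights) for s in shards]
weights -= 0.01 * np.mean(grads, axis=0)        # the "all-reduce": average, then update
print(weights)
```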
  
Scaling Efficiency: Intel® Xeon Phi™ Processor

Deep Learning Image Classification Training Performance: MULTI-NODE Scaling

[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-cores, 1.4 GHz, 16 GB) nodes at 1, 2, 4, 8, 16, 32, 64 and 128 nodes, for OverFeat, AlexNet, VGG-A and GoogLeNet; callouts mark 62% and 87% efficiency]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. *Other names and brands may be property of others.

Configurations:
•  Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework
[Figure: Time To Train (TTT) vs. batch size, with the "sweet spot" marked]
  
Multi-node Challenges

•  Need to optimize both compute (iteration) and communication (weight updates)
•  More nodes mean a higher total batch per iteration
   •  Enough work for each node
•  Optimized hyper parameters (e.g. batch size)
   •  Time to Train: increases with batch size
   •  Accuracy: batch size impacts convergence and accuracy
•  Communication overheads if the per-node batch is small
   •  e.g. total batch size = 1024
      •  1024 nodes: batch size = 1 per node – communication dominates
      •  64 nodes: batch size = 16 per node – computation dominates
  
Summary

•  Don't be fooled by performance of DL workloads when using unoptimized frameworks
•  Significant performance headroom from optimization on Xeon and Xeon Phi
   •  Close to 300x speedup in certain topologies
•  Traditional vectorization and parallelization strategies apply
•  Other unique performance challenges: hyper parameters, data layer, inter/intra layer parallelization, etc.
•  Call to action:
   •  Try Intel optimized frameworks available today, more to come soon
  
Legal Disclaimers
•  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across
different processor families: Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number
•  Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system
hardware or software design or configuration may affect actual performance.
•  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such
as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of
those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. 
•  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all
of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced
benchmarks are accurate and reflect performance of systems available for purchase.
•  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
•  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance
Evaluation Corporation. See http://www.spec.org for more information.
•  TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
•  No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Xeon® processor E7-8800/4800/2800 v2 product
families or Intel® Itanium® 9500 series-based system (or follow-on generations of either.) Built-in reliability features available on select Intel®
processors may require additional software, hardware, services and/or an internet connection. Results may vary depending upon configuration.
Consult your system manufacturer for more details.
For systems also featuring Resilient System Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires
an Intel® Run Sure Technology-enabled system, including an enabled Intel processor and enabled technology(ies). Built-in reliability features available
on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon
configuration. Consult your system manufacturer for more details.
For systems also featuring Resilient Memory Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires
an Intel® Run Sure Technology-enabled system, including an enabled Intel® processor and enabled technology(ies). built-in reliability features
available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending
upon configuration. Consult your system manufacturer for more details.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel
microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for
more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804