PORTING AND OPTIMIZING OPENMP APPLICATIONS TO APU USING CAPS TOOLS
JEAN-CHARLES VASNIER, CAPS ENTREPRISE

AGENDA
• CAPS entreprise
• OpenACC
• CAPS Compilers
• CAPS OpenMP Compiler for AMD APUs
‒ Compiler analyses and code generation
‒ Interactive report
• Experiments with benchmark applications
‒ HydroC
• Future work

PRESENTATION TITLE | November 19, 2013 | CONFIDENTIAL

CAPS OpenMP Compiler - June 2013

CAPS entreprise

COMPANY PROFILE
• Founded in 2002
‒ Extensive expertise in processor micro-architecture and code generation
‒ Spin-off of the French INRIA research lab
‒ 30 employees
• Mission: help customers leverage the performance of multi/manycore machines
‒ Consulting & engineering services
‒ CAPS OpenACC Compiler & toolchain
‒ Training
• Expanding sales worldwide
‒ Resellers in the US and APAC (Exxact, Absoft, JCC Gimmick Ltd, Nodasys, …)


www.caps-entreprise.com

CAPS ECOSYSTEM
• Customers
• Business Partners
• European R&D Projects
OpenACC

OPENACC INITIATIVE
• An initiative of CAPS, Cray, NVIDIA and PGI
• Open standard
• A directive-based approach to programming heterogeneous manycore hardware for C and Fortran applications
• http://www.openacc-standard.com

DIRECTIVE-BASED PROGRAMMING (1)
• Three ways of programming GPGPU applications:
‒ Libraries: ready-to-use acceleration
‒ Directives: quickly accelerate existing applications
‒ Programming languages: maximum performance

DIRECTIVE-BASED PROGRAMMING (2)
  
EXECUTION MODEL
• Most computations run on the CPU; some regions can be offloaded to hardware accelerators
‒ Parallel regions
‒ Kernels regions
• The host is responsible for:
‒ Allocating memory space on the accelerator
‒ Initiating data transfers
‒ Launching computations
‒ Waiting for completion
‒ Deallocating memory space
• Accelerators execute parallel regions:
‒ Use work-sharing directives
‒ Specify the level of parallelization
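
The host responsibilities listed above map fairly directly onto OpenACC directives. The following is a minimal sketch in C, not taken from the slides: the function name `scale_vector` and its parameters are hypothetical. A compiler without OpenACC support ignores the pragmas and runs the loop on the CPU.

```c
/* Hypothetical example: y[i] = a * x[i], offloaded with OpenACC. */
void scale_vector(int n, float a, const float *x, float *y)
{
    /* Data region: on entry the host allocates device memory and copies x in;
     * on exit it copies y back and releases the device memory. */
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        /* Parallel region: the host launches the computation and waits for
         * it to complete; the loop directive shares iterations on the device. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i];
        }
    }
}
```

Calling `scale_vector(4, 2.0f, x, y)` with `x = {1, 2, 3, 4}` fills `y` with the doubled values whether or not an accelerator is present.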

OPENACC EXECUTION MODEL
• Host-controlled execution
• Based on three parallelism levels
‒ Gangs – coarse grain
‒ Workers – fine grain
‒ Vectors – finest grain

Figure: a device containing several gangs; each gang contains workers, and each worker contains vectors.
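
As an illustration of the three parallelism levels, the sketch below maps a three-deep loop nest onto gangs, workers and vectors. It is an assumption-laden example rather than code from the talk: `add3d` and its arguments are hypothetical, and how the levels map to hardware depends on the compiler and target.

```c
/* Hypothetical example: c = a + b over an nx * ny * nz grid stored contiguously. */
void add3d(int nx, int ny, int nz, const float *a, const float *b, float *c)
{
    int total = nx * ny * nz;

    #pragma acc parallel copyin(a[0:total], b[0:total]) copyout(c[0:total])
    {
        /* Coarse grain: planes are distributed across gangs. */
        #pragma acc loop gang
        for (int i = 0; i < nx; ++i) {
            /* Fine grain: rows of a plane are distributed across workers. */
            #pragma acc loop worker
            for (int j = 0; j < ny; ++j) {
                /* Finest grain: contiguous elements are processed by vector lanes. */
                #pragma acc loop vector
                for (int k = 0; k < nz; ++k) {
                    int idx = (i * ny + j) * nz + k;
                    c[idx] = a[idx] + b[idx];
                }
            }
        }
    }
}
```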
CAPS Compilers

OPENACC COMPILERS (1)
CAPS Compilers:
• Source-to-source compilers
• Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
PGI Accelerator:
• Extension of the x86 PGI compiler
• Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
Cray Compilers:
• Provided with Cray systems only

  
CAPS COMPILERS (2)
Source-to-source compilers composed of three parts:
• The directives (OpenACC or OpenHMPP)
‒ Define the parts of the code to be accelerated
‒ Indicate resource allocation and communication
‒ Ensure portability
• The toolchain
‒ Helps build manycore applications
‒ Includes compilers and target code generators
‒ Insulates hardware-specific computations
‒ Uses the hardware vendor SDK
• The runtime
‒ Helps adapt to the platform configuration
‒ Manages hardware resource availability
CAPS COMPILERS (3)
• Take the original application as input and generate another application source code as output
‒ Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
• Compile the entire hybrid application
• Just prefix the original compilation line with capsmc to produce a hybrid application
$ capsmc gcc myprogram.c
$ capsmc gfortran myprogram.f90
• Compatible with:
‒ GNU
‒ Intel
‒ Open64
‒ Absoft
‒ …

CAPS COMPILERS (4)
• CAPS Compilers drive all compilation passes
• Host application compilation
‒ Calls traditional CPU compilers
‒ CAPS Runtime is linked to the host part of the application
• Device code production
‒ According to the specified target
‒ A dynamic library is built
Figure (compilation flow), boxes as labelled: C++ Frontend, Fortran Frontend, C Frontend; Extraction module; Host code and codelets (Fun #1, Fun #2, Fun #3); Instrumentation module; CUDA Code Generation and OpenCL Generation; CPU compiler (gcc, ifort, …), CUDA compilers and OpenCL compilers; Executable (mybin.exe), CAPS Runtime and HWA Code (dynamic library).
  
From OpenMP to OpenACC

CAPS OPENMP COMPILER
• Automatically turns OpenMP codes into OpenACC
• Diagnoses compatibility issues and suggests code transformations
• Builds accelerated versions based on CUDA or OpenCL
• Works with all platforms
‒ AMD and NVIDIA GPUs
‒ AMD APUs
‒ Intel Xeon Phi
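
To make the OpenMP-to-OpenACC conversion concrete, here is a hand-written sketch of the same loop under each model. It is not output from the CAPS OpenMP Compiler; the function names are hypothetical and the data clauses are one reasonable choice among several.

```c
/* OpenMP version: iterations are shared among CPU threads. */
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* OpenACC version of the same loop: x is only read, so it is copied in;
 * y is read and written, so it is copied in and back out. */
void saxpy_openacc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The main difference a converter has to handle is visible here: OpenMP assumes shared memory, while the OpenACC version must state which arrays move to and from the accelerator.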
  

CAPS OPENMP COMPILER OVERVIEW
Profiling → Analysis → Acceleration
EXTENSION OF THE CAPS OPENACC COMPILER
• Converts OpenMP codes into OpenACC
‒ Examines OpenMP loop nests and checks their OpenACC compatibility
‒ Diagnoses incompatibilities and offers advice
‒ Builds an APU version based on OpenCL
• Builds an interactive report
‒ Based on the compiler's static and dynamic analyses
‒ OpenMP-to-OpenACC kernels view
  o Performance details of each region
‒ Regions' inputs/outputs and data dependencies between regions
‒ Gives the user control over pushing kernels onto the GPU and managing data transfers

OPENMP-BASED OPTIMIZATION PROCESS
Figure (process flow): Application with OpenMP directives → Instrumentation → traceable application → Execution → profiling report → Analysis → HTML interactive report → Generation → accelerated executable.
INSTRUMENTATION AND PROFILING PHASES
• Code preprocessing and instrumentation
‒ Identify supported OpenMP regions
  ‒ parallel and parallel for constructs
‒ Instrument the code to track data and measure kernel performance
• Instrumented application execution
‒ Based on the user data set
‒ Number of times an OpenMP region is executed
‒ Region's reads and writes
‒ Range of loop iterations
‒ Region performance
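
The kind of per-region bookkeeping described above can be sketched by hand. The example below is only an illustration of the idea and does not reflect how the CAPS tool instruments code: `region_stats`, `region_record` and `scale_instrumented` are hypothetical names, and `clock()` reports CPU time summed over threads, so a real profiler would use a wall-clock timer instead.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical per-region profile record. */
typedef struct {
    long calls;          /* number of times the region was executed */
    long min_trip;       /* smallest loop trip count observed */
    long max_trip;       /* largest loop trip count observed */
    double cpu_seconds;  /* accumulated CPU time reported by clock() */
} region_stats;

#define REGION_STATS_INIT { 0, LONG_MAX, 0, 0.0 }

/* Update the record after one execution of the region. */
static void region_record(region_stats *s, long trip, clock_t start, clock_t stop)
{
    s->calls += 1;
    if (trip < s->min_trip) s->min_trip = trip;
    if (trip > s->max_trip) s->max_trip = trip;
    s->cpu_seconds += (double)(stop - start) / CLOCKS_PER_SEC;
}

/* An OpenMP region wrapped with the instrumentation calls. */
void scale_instrumented(region_stats *s, long n, double a, double *v)
{
    clock_t start = clock();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        v[i] *= a;
    }
    region_record(s, n, start, clock());
}
```

After running the application on the user data set, each record holds the execution count, the observed range of loop iterations and the accumulated time for its region.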
  

ANALYSIS PHASE
• Generates an interactive HTML report
‒ Based on the compiler's static and dynamic analyses
‒ Metrics for each OpenMP region
  ‒ OpenACC compliance check
  ‒ Computation density
  ‒ Coalescing of data accesses
  ‒ Estimated speed-up
  ‒ Memory usage
‒ Proposes GPU execution or native OpenMP execution
‒ Data usage and data dependency graph between regions
  ‒ Determines when transfers are required between kernels
‒ Lets the user modify the CPU or GPU execution and data transfer policy
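
Of the metrics above, coalescing of data accesses is the one most easily shown in code. The sketch below is a general illustration, not the tool's analysis: both hypothetical functions scale a row-major matrix, but only the second lets neighbouring vector lanes touch neighbouring memory addresses.

```c
/* Strided access: the vector loop runs over rows, so consecutive lanes
 * access elements that are `cols` floats apart (poorly coalesced). */
void scale_strided(int rows, int cols, float *m, float s)
{
    #pragma acc parallel loop gang copy(m[0:rows*cols])
    for (int j = 0; j < cols; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < rows; ++i) {
            m[i * cols + j] *= s;
        }
    }
}

/* Contiguous access: the vector loop runs over columns, so consecutive
 * lanes access consecutive elements (well coalesced). */
void scale_contiguous(int rows, int cols, float *m, float s)
{
    #pragma acc parallel loop gang copy(m[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j) {
            m[i * cols + j] *= s;
        }
    }
}
```

Both functions compute the same result; the difference lies only in the memory access pattern, which is what a coalescing diagnosis reports on.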
  

HTML INTERACTIVE REPORT (1)
• Get an overview of the regions in a snap!
• Code view: from OpenMP to OpenACC directives

HTML INTERACTIVE REPORT (2)
• Performance details of each region
• Analysis conclusions and portability diagnosis

  
HTML INTERACTIVE REPORT (3)
• Regions' inputs/outputs and data dependencies map

HTML INTERACTIVE REPORT (4)
• Take control!
‒ Manually push kernels onto accelerators
‒ Manage data transfers

CODE GENERATION PHASE
• Same as the CAPS OpenACC Compiler
‒ Based on the analysis report
‒ Generates OpenCL kernels from OpenACC
‒ Automatic data updates to ensure memory coherency
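
The data updates mentioned above correspond to what a programmer would otherwise write by hand with OpenACC `update` directives. The sketch below is a hypothetical example (`iterate` is not from the slides) of keeping host and device copies coherent when the CPU touches data between kernels.

```c
/* Hypothetical example: each step doubles v on the device, then the host
 * adds 1 to v[0]. The update directives keep both copies coherent. */
void iterate(int n, int steps, float *v)
{
    #pragma acc data copy(v[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop present(v[0:n])
            for (int i = 0; i < n; ++i) {
                v[i] *= 2.0f;
            }

            /* The host is about to read v[0]: refresh the host copy first. */
            #pragma acc update host(v[0:1])
            v[0] += 1.0f;
            /* Push the host-side change back so the next kernel sees it. */
            #pragma acc update device(v[0:1])
        }
    }
}
```

Only the single element the host touches is transferred at each step; the rest of the array stays on the accelerator for the lifetime of the data region.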
  

FEATURES
• Diagnoses
‒ OpenACC compliance
‒ Computational density
‒ Coalescing of data accesses
‒ Memory usage
‒ Estimated speed-up
• Automatic porting to AMD, NVIDIA, or Intel accelerators
• Accelerates execution or keeps the native OpenMP one
• Gives users control for manual optimizations

Application Experiments

HARDWARE AND SOFTWARE ENVIRONMENT
• Linux system
‒ AMD SDK 2.8
‒ CAPS Compiler revision 50387
‒ GCC 4.6.1
‒ OpenMPI 1.6.4
• Hardware
‒ AMD A10-5800K APU with Radeon HD Graphics

APPLICATIONS STATUS
• Main objective is proof of concept, not performance
‒ Performance limitations of the current version of the APU
• HydroC
‒ Most convincing demo
‒ x1.3 speed-up by modifying the execution and transfer policy

HYDROC HTML REPORT

Future Work

C2PO

C2PO MISSION STATEMENT
Guides you through the whole process of porting and tuning applications onto manycore parallel systems
• Combines various CAPS technologies in a modular tool chain
‒ Static and dynamic code analyzers
‒ OpenMP to OpenACC code transformers
‒ Kernel micro-bencher
‒ Plugs into third-party tools: VTune, CUDA profiler
‒ Uses CAPS Compiler at the final stage to produce the manycore application


C2PO - Oct. 2013

C2PO PHASES
1. Generation of an OpenACC skeleton from OpenMP or sequential code
2. Hotspot detection and dataflow analysis
Gives global and local advice on
‒ Data management/placement between kernels or regions
‒ First ten tips on kernel performance
‒ Data coalescing, parallelism, gridification, loop order
3. Lets you rapidly optimize the performance of kernels
‒ Extract functions, loops or annotated regions
‒ Tune kernel code following C2PO advice
‒ Replay standalone with application data and measure the performance gain
‒ Re-inject the optimized kernels into the application source code
4. Use CAPS Compilers to build for Intel Xeon Phi, NVIDIA or AMD GPUs
Figure (workflow), boxes as labelled: OpenACC skeleton generation; Dataflow analysis; Extract loops, functions, regions; Fine tune kernels; User Input.

C2PO TOOL CHAIN
Figure (tool chain diagram), boxes as labelled: Sequential Code; OpenMP Code; Code skeleton generation; OpenACC Generator; Data Movement Analyzer; OpenACC Code; Global tuning; Interactive Report; Local tuning; Kernels; µbencher; HTML Report; Performance analyzer; VTune; CUDA profiler.
C2PO OPENACC GENERATION
• From sequential or OpenMP code to a first parallelized code
‒ Instrument the application and detect hotspots
‒ Generate an OpenACC skeleton of kernels from loops
‒ Manage data transfers between kernels
• A report is generated containing
‒ Various performance metrics
  ‒ Kernel execution
  ‒ Memory reads and writes
  ‒ Potential performance gain
‒ Data dependencies and usage between kernels
‒ OpenACC code view

C2PO GLOBAL TUNING
• Dynamic tracking of data in order to optimize their movement
‒ Dynamically trace uploads and downloads at execution time
‒ Detect potentially redundant data transfers

Difficult for the compiler to detect any CPU use of data:

#openacc data region
// convergence loop
for {
    Upload data()
    Kernels' calls()
    Download data()
}
…

Possible advice: are the following parameters modified by the CPU between the downloads and uploads?
If yes, insert an OpenACC data region with the non-modified parameters.
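
The redundant-transfer pattern on this slide, and the effect of adding a data region around the convergence loop, can be sketched in C as follows. This is a hypothetical example, not C2PO output: the function names and the smoothing kernel are assumptions, and both functions expect `n >= 3`.

```c
/* Naive version: every iteration uploads u, runs the kernels and downloads
 * the result, even though the CPU never touches u between iterations. */
void relax_naive(int n, int iters, float *u, float *tmp)
{
    for (int it = 0; it < iters; ++it) {
        #pragma acc parallel loop copyin(u[0:n]) copyout(tmp[1:n-2])
        for (int i = 1; i < n - 1; ++i) {
            tmp[i] = 0.5f * (u[i - 1] + u[i + 1]);
        }
        #pragma acc parallel loop copyin(tmp[1:n-2]) copy(u[0:n])
        for (int i = 1; i < n - 1; ++i) {
            u[i] = tmp[i];
        }
    }
}

/* Tuned version: a data region around the convergence loop keeps u and tmp
 * on the device, so u moves once in each direction and tmp never moves. */
void relax_tuned(int n, int iters, float *u, float *tmp)
{
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                tmp[i] = 0.5f * (u[i - 1] + u[i + 1]);
            }
            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                u[i] = tmp[i];
            }
        }
    }
}
```

Hoisting the data region is only safe because the CPU does not read or write `u` inside the loop, which is exactly the question the advice asks the user to answer.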
C2PO TUNING PHASE
• Microbenchmarking mechanism
‒ Loops, functions and user-annotated regions are extracted into kernels
‒ Apply optimizations
‒ Replay kernels with the original data set without running the whole application
‒ Once tuned, inject the kernels into the application source code
• Apply performance analyzers from third-party tools (VTune, CUDA profiler)
‒ Synthesizes raw metrics (hardware counters) linked to the source code
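
A replay harness of the kind described above can be sketched in a few lines. The code below is an assumption-based illustration rather than the C2PO micro-bencher: `kernel_fn`, `replay_kernel` and `square_kernel` are hypothetical, and `clock()` is used only to keep the example portable.

```c
#include <string.h>
#include <time.h>

/* Signature assumed for an extracted kernel: it updates n floats in place. */
typedef void (*kernel_fn)(int n, float *data);

/* Replays `kernel` `repeats` times, each time on a fresh copy of the captured
 * input, and returns the best CPU time in seconds (or -1.0 on bad arguments).
 * The output of the last run is left in `out` so variants can be compared. */
double replay_kernel(kernel_fn kernel, const float *snapshot, float *out,
                     int n, int repeats)
{
    if (n <= 0 || repeats <= 0) {
        return -1.0;
    }
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        memcpy(out, snapshot, (size_t)n * sizeof *out);
        clock_t start = clock();
        kernel(n, out);
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || t < best) {
            best = t;
        }
    }
    return best;
}

/* A stand-in for an extracted kernel. */
void square_kernel(int n, float *d)
{
    for (int i = 0; i < n; ++i) {
        d[i] = d[i] * d[i];
    }
}
```

Because the snapshot is never modified, several tuned variants of a kernel can be replayed on identical input and compared on both timing and output before the best one is re-injected into the application.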
  

C2PO OBJECTIVES AND BENEFITS
• Keep one single OpenMP code for various parallel many-core systems (GPUs, APUs, MIC)
• Incrementally port and optimize codes in a modular way
• Use an interactive compiler: advice from dynamic and static analyses at source code level

THANK YOU FOR YOUR ATTENTION!
Vasnier Jean-Charles
Sales Engineer, CAPS entreprise
Phone: +1-865-227-6899
Email: jvasnier@caps-entreprise.com
GET PERFORMANCE IN NO TIME!
Figure (bar chart): execution time in seconds for the Hydro and Nbody benchmarks; series: Original (OpenMP), Generated (auto), Generated (tweaked). Bar labels as extracted: 63.42, 45.698, 27.539, 23.417, 12.71, 12.55.
• Hydro: x2 speed-up (after user's tuning)
• Nbody: x6 speed-up in 3 clicks (fully automatic)
‒ Measured on a dual Sandy Bridge E5-2687W with 32 GB RAM and a Kepler K20C driven by CUDA v5.0

Más contenido relacionado

Was ist angesagt?

PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...AMD Developer Central
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...AMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyAMD Developer Central
 
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...AMD Developer Central
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansAMD Developer Central
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoAMD Developer Central
 
IS-4081, Rabbit: Reinventing Video Chat, by Philippe Clavel
IS-4081, Rabbit: Reinventing Video Chat, by Philippe ClavelIS-4081, Rabbit: Reinventing Video Chat, by Philippe Clavel
IS-4081, Rabbit: Reinventing Video Chat, by Philippe ClavelAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...AMD Developer Central
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasAMD Developer Central
 
CC-4006, Deliver Hardware Accelerated Applications Using RemoteFX vGPU with W...
CC-4006, Deliver Hardware Accelerated Applications Using RemoteFX vGPU with W...CC-4006, Deliver Hardware Accelerated Applications Using RemoteFX vGPU with W...
CC-4006, Deliver Hardware Accelerated Applications Using RemoteFX vGPU with W...AMD Developer Central
 
PG-4039, RapidFire API, by Dmitry Kozlov
PG-4039, RapidFire API, by Dmitry KozlovPG-4039, RapidFire API, by Dmitry Kozlov
PG-4039, RapidFire API, by Dmitry KozlovAMD Developer Central
 
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsWT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsAMD Developer Central
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi C...
PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi C...PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi C...
PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi C...AMD Developer Central
 

Was ist angesagt? (20)

PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
IS-4081, Rabbit: Reinventing Video Chat, by Philippe Clavel
IS-4081, Rabbit: Reinventing Video Chat, by Philippe ClavelIS-4081, Rabbit: Reinventing Video Chat, by Philippe Clavel
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, by Jean-Charles Vasnier

  • 1. PORTING AND OPTIMIZING OPENMP APPLICATIONS TO APU USING CAPS TOOLS
      Jean-Charles Vasnier, CAPS entreprise
  • 2. AGENDA
      - CAPS entreprise
      - OpenACC
      - CAPS Compilers
      - CAPS OpenMP Compiler for AMD APUs
          ‒ Compiler analyses and code generation
          ‒ Interactive report
      - Experiments with benchmark applications
          ‒ HydroC
      - Future work
  • 4. COMPANY PROFILE
      - Founded in 2002
          ‒ Extensive expertise in processor micro-architecture and code generation
          ‒ Spin-off of the French INRIA research lab
          ‒ 30 employees
      - Mission: to help its customers leverage the performance of multi/manycore machines
          ‒ Consulting & engineering services
          ‒ CAPS OpenACC Compiler & toolchain
          ‒ Training
      - Expanding sales worldwide
          ‒ Resellers in the US and APAC (Exxact, Absoft, JCC Gimmick Ltd, Nodasys, …)
  • 5. CAPS ECOSYSTEM
      • Customers
      • Business Partners
      • European R&D Projects
  • 7. OPENACC INITIATIVE
      • A CAPS, Cray, NVIDIA and PGI initiative
      • Open standard
      • A directive-based approach for programming heterogeneous manycore hardware for C and Fortran applications
      • http://www.openacc-standard.com
  • 8. DIRECTIVE-BASED PROGRAMMING (1)
      • Three ways of programming GPGPU applications:
          ‒ Libraries: ready-to-use acceleration
          ‒ Directives: quickly accelerate existing applications
          ‒ Programming languages: maximum performance
  • 9. DIRECTIVE-BASED PROGRAMMING (2)
  • 10. EXECUTION MODEL
      • Among the computations executed by the CPU, some regions can be offloaded to hardware accelerators
          ‒ Parallel regions
          ‒ Kernels regions
      • The host is responsible for:
          ‒ Allocating memory space on the accelerator
          ‒ Initiating data transfers
          ‒ Launching computations
          ‒ Waiting for completion
          ‒ Deallocating memory space
      • Accelerators execute parallel regions:
          ‒ Use work-sharing directives
          ‒ Specify the level of parallelization
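The host responsibilities listed on this slide map onto OpenACC data clauses. The following is a minimal sketch, not taken from the slides: the function name `scale_vector` and its parameters are hypothetical. Built without an OpenACC-capable compiler, the pragma is ignored and the loop simply runs on the CPU.

```c
#include <stddef.h>

/* Hypothetical example (not from the slides): y[i] = a * x[i].
 * copyin(x[0:n]):  the host allocates device memory for x and uploads it.
 * copyout(y[0:n]): the host allocates device memory for y and downloads it
 *                  when the region ends; device buffers are then released.
 * Without an async clause, the host waits for the region to complete. */
void scale_vector(size_t n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i];
    }
}
```

Calling `scale_vector(4, 2.0f, x, y)` on a four-element array doubles each element, whether the loop runs on an accelerator or on the host.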
  • 11. OPENACC EXECUTION MODEL
      • Host-controlled execution
      • Based on three parallelism levels
          ‒ Gangs – coarse grain
          ‒ Workers – fine grain
          ‒ Vectors – finest grain
      [Diagram: a device holding several gangs; each gang contains workers, and each worker operates on vectors]
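The three parallelism levels can be requested explicitly on nested loops. This is a sketch under stated assumptions: the function `add3d` and the array layout are hypothetical, and how gangs, workers and vectors map onto a given device is left to the compiler.

```c
/* Hypothetical example: element-wise sum of two nz x ny x nx arrays
 * stored contiguously, with x as the fastest-varying index. */
void add3d(int nz, int ny, int nx, const float *a, const float *b, float *c)
{
    /* Outer loop: coarse-grain parallelism across gangs. */
    #pragma acc parallel loop gang copyin(a[0:nz*ny*nx], b[0:nz*ny*nx]) copyout(c[0:nz*ny*nx])
    for (int k = 0; k < nz; ++k) {
        /* Middle loop: fine-grain parallelism across workers of a gang. */
        #pragma acc loop worker
        for (int j = 0; j < ny; ++j) {
            /* Inner loop: finest grain, vector lanes over contiguous data. */
            #pragma acc loop vector
            for (int i = 0; i < nx; ++i) {
                int idx = (k * ny + j) * nx + i;
                c[idx] = a[idx] + b[idx];
            }
        }
    }
}
```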
  • 13. OPENACC COMPILERS (1)
      • CAPS Compilers:
          ‒ Source-to-source compilers
          ‒ Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
      • PGI Accelerator:
          ‒ Extension of the x86 PGI compiler
          ‒ Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
      • Cray Compilers:
          ‒ Provided with Cray systems only
  • 14. CAPS COMPILERS (2)
      CAPS Compilers are source-to-source compilers composed of three parts:
      • The directives (OpenACC or OpenHMPP)
          ‒ Define parts of code to be accelerated
          ‒ Indicate resource allocation and communication
          ‒ Ensure portability
      • The toolchain
          ‒ Helps build manycore applications
          ‒ Includes compilers and target code generators
          ‒ Isolates hardware-specific computations
          ‒ Uses the hardware vendor SDK
      • The runtime
          ‒ Helps adapt to the platform configuration
          ‒ Manages hardware resource availability
  • 15. CAPS COMPILERS (3)
      • Take the original application as input and generate another application source code as output
          ‒ Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
      • Compile the entire hybrid application
      • Just prefix the original compilation line with capsmc to produce a hybrid application
          $ capsmc gcc myprogram.c
          $ capsmc gfortran myprogram.f90
      • Compatible with:
          ‒ GNU
          ‒ Intel
          ‒ Open64
          ‒ Absoft
          ‒ …
  • 16. CAPS COMPILERS (4)
      • CAPS Compilers drive all compilation passes
      • Host application compilation
          ‒ Calls traditional CPU compilers
          ‒ CAPS Runtime is linked to the host part of the application
      • Device code production
          ‒ According to the specified target
          ‒ A dynamic library is built
      [Diagram labels: C, C++ and Fortran frontends; extraction module; host code and codelets (Fun #1, Fun #2, Fun #3); instrumentation module; CUDA code generation; OpenCL generation; CPU compiler (gcc, ifort, …); CUDA compilers; OpenCL compilers; executable (mybin.exe); CAPS Runtime; HWA code (dynamic library)]
  • 17. From OpenMP to OpenACC
  • 18. CAPS OPENMP COMPILER
      • Automatically turns OpenMP codes into OpenACC
      • Diagnoses compatibility issues and suggests code transformations
      • Builds accelerated versions based on CUDA or OpenCL
      • Works with all platforms
          ‒ AMD and NVIDIA GPUs
          ‒ AMD APUs
          ‒ Intel Xeon Phi
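To make the OpenMP-to-OpenACC correspondence concrete, here is a hand-written pair of equivalent loops. It is only an illustration of the kind of mapping involved: the function names are hypothetical and the OpenACC version is not actual output of the CAPS OpenMP Compiler.

```c
/* Original OpenMP version: iterations are shared among CPU threads. */
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hand-written OpenACC counterpart (an assumption about what a conversion
 * could look like, not generated by the tool). The data clauses make the
 * transfers explicit: x is only read, y is read and written. */
void saxpy_openacc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The main difference is that OpenMP assumes shared memory, whereas the OpenACC version must state which arrays move to and from the accelerator.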
  • 19. CAPS OPENMP COMPILER OVERVIEW
      • Profiling
      • Analysis
      • Acceleration
  • 20. EXTENSION OF THE CAPS OPENACC COMPILER
      • Converts OpenMP codes into OpenACC
          ‒ Examines OpenMP loop nests and checks their OpenACC compatibility
          ‒ Diagnoses incompatibilities and proposes advice
          ‒ Builds an APU version based on OpenCL
      • Builds an interactive report
          ‒ Based on the compiler's static and dynamic analyses
          ‒ OpenMP to OpenACC kernels view
          ‒ Performance details of each region
          ‒ Regions' inputs/outputs and data dependencies between regions
          ‒ Gives the user control over pushing kernels onto the GPU and managing data transfers
  • 21. OPENMP-BASED OPTIMIZATION PROCESS
      [Flow diagram: application with OpenMP directives → Instrumentation → traceable application → Execution → profiling report → Analysis → HTML interactive report → Generation → accelerated executable]
  • 22. INSTRUMENTATION AND PROFILING PHASES
      • Code preprocessing and instrumentation
          ‒ Identify supported OpenMP regions
              ‒ parallel, parallel for and parallel for constructs
          ‒ Instrument the code to track data and measure kernel performance
      • Instrumented application execution
          ‒ Based on the user data set
          ‒ Number of times an OpenMP region is executed
          ‒ Region's reads and writes
          ‒ Loop iteration ranges
          ‒ Region performance
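The slides do not show what the instrumented code looks like. As a rough illustration only, the sketch below records by hand the kind of per-region data listed above (execution count, loop iteration range, elapsed time); the `region_profile` structure and all names are hypothetical and do not reflect the CAPS implementation.

```c
#include <time.h>

/* Hypothetical per-region profile record (illustrative names only). */
typedef struct {
    long   calls;     /* number of times the region was executed */
    long   min_trip;  /* smallest loop trip count observed */
    long   max_trip;  /* largest loop trip count observed */
    double seconds;   /* accumulated processor time spent in the region */
} region_profile;

static region_profile prof_sum = {0, 0, 0, 0.0};

/* An OpenMP reduction region wrapped with manual bookkeeping. */
double instrumented_sum(const double *v, long n)
{
    clock_t start = clock();
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i];
    }

    /* Bookkeeping of the kind an instrumentation pass might insert. */
    if (prof_sum.calls == 0 || n < prof_sum.min_trip) prof_sum.min_trip = n;
    if (prof_sum.calls == 0 || n > prof_sum.max_trip) prof_sum.max_trip = n;
    prof_sum.calls += 1;
    prof_sum.seconds += (double)(clock() - start) / CLOCKS_PER_SEC;
    return total;
}
```

Note that `clock()` reports processor time, which adds up across threads when OpenMP is enabled; a wall-clock timer would be the better choice for real measurements.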
  • 23. ANALYSIS PHASE
      • Generates an interactive HTML report
          ‒ Based on the compiler's static and dynamic analyses
          ‒ Metrics for each OpenMP region
              ‒ OpenACC compliance check
              ‒ Computation density
              ‒ Coalescing of data accesses
              ‒ Estimated speed-up
              ‒ Memory usage
          ‒ Proposes GPU execution or native OpenMP execution
          ‒ Data usage and data dependencies graph between regions
          ‒ Determines when transfers are required between kernels
          ‒ Lets the user modify the CPU or GPU execution and data transfer policy
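The slides do not define how the coalescing metric is computed. The sketch below only illustrates the access-pattern difference such a metric refers to, using two hypothetical functions on a row-major matrix: both compute the same result, but only the first makes neighbouring vector lanes touch neighbouring addresses.

```c
/* Row-major n x m matrix: element (i, j) is stored at a[i * m + j]. */

/* Inner (vector) loop walks j: consecutive iterations access consecutive
 * addresses, a pattern a GPU can coalesce into few memory transactions. */
void scale_coalesced(int n, int m, float s, float *a)
{
    #pragma acc parallel loop gang copy(a[0:n*m])
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j) {
            a[i * m + j] *= s;
        }
    }
}

/* Inner (vector) loop walks i: consecutive iterations are m elements
 * apart, so accesses are strided and generally cannot be coalesced. */
void scale_strided(int n, int m, float s, float *a)
{
    #pragma acc parallel loop gang copy(a[0:n*m])
    for (int j = 0; j < m; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < n; ++i) {
            a[i * m + j] *= s;
        }
    }
}
```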
  • 24. HTML INTERACTIVE REPORT (1)
      • Get a regions overview in a snap!
      • Code view: from OpenMP to OpenACC directives
  • 25. HTML INTERACTIVE REPORT (2)
      • Performance details of each region
      • Analysis conclusions and portability diagnosis
  • 26. HTML INTERACTIVE REPORT (3)
      • Regions' inputs/outputs and data dependencies map
  • 27. HTML INTERACTIVE REPORT (4)
      • Take control!
          ‒ Manually push kernels onto accelerators
          ‒ Manage data transfers
  • 28. CODE GENERATION PHASE
      • Same as the CAPS OpenACC Compiler
          ‒ Based on the analysis report
          ‒ Generates OpenCL kernels from OpenACC
          ‒ Automatic data updates to ensure memory coherency
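The slide says the data updates are inserted automatically. Written by hand in OpenACC, keeping host and device copies coherent looks roughly like the sketch below; the function `device_host_device` is a hypothetical example, not generated code.

```c
/* Hypothetical example: double v on the accelerator, flip the sign of
 * v[0] on the host, then add 1 to every element on the accelerator. */
void device_host_device(int n, float *v)
{
    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i) {
            v[i] *= 2.0f;
        }

        /* The host is about to read v: refresh the host copy first. */
        #pragma acc update host(v[0:n])
        v[0] = -v[0];  /* CPU-side modification */
        /* The device copy is now stale: push the host change back. */
        #pragma acc update device(v[0:n])

        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i) {
            v[i] += 1.0f;
        }
    }
}
```

Omitting either `update` directive would leave one side working on stale values, which is the coherency problem the automatic updates are meant to prevent.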
  • 29. FEATURES
      • Diagnoses
          ‒ OpenACC compliance
          ‒ Computational density
          ‒ Coalescing of data accesses
          ‒ Memory usage
          ‒ Estimated speed-up
      • Automatic porting to AMD, NVIDIA, or Intel accelerators
      • Accelerates execution or keeps the native OpenMP one
      • Gives users control over manual optimizations
  • 31. HARDWARE AND SOFTWARE ENVIRONMENT
      • Linux system
          ‒ AMD SDK 2.8
          ‒ CAPS Compiler revision 50387
          ‒ GCC 4.6.1
          ‒ OpenMPI 1.6.4
      • Hardware
          ‒ AMD A10-5800K APU with Radeon HD Graphics
  • 32. APPLICATIONS STATUS
      • Main objective is proof of concept, not performance
          ‒ Performance limitations of the current version of the APU
      • HydroC
          ‒ Most convincing demo
          ‒ x1.3 speed-up by modifying the execution and transfer policy
  • 33. HYDROC HTML REPORT
  • 35. C2PO MISSION STATEMENT
      Guides you through the whole process of porting and tuning applications onto manycore parallel systems
      • Combines various CAPS technologies in a modular tool chain
          ‒ Static and dynamic code analyzers
          ‒ OpenMP to OpenACC code transformers
          ‒ Kernel micro-bencher
          ‒ Plugs into third-party tools: VTune, CUDA profiler
          ‒ Uses CAPS Compiler at the final stage to produce the manycore application
      C2PO - Oct. 2013
  • 36. C2PO PHASES
      1. Generation of an OpenACC skeleton from OpenMP or sequential code
          ‒ Hotspot detection and dataflow analysis
      2. Indicates global and local advice on
          ‒ Data management/placement between kernels or regions
          ‒ First ten tips on kernel performance
          ‒ Data coalescing, parallelism, gridification, loop order
      3. Lets you rapidly optimize performance of kernels
          ‒ Extracts functions, loops or annotated regions
          ‒ Tune kernel code following C2PO advice
          ‒ Replay standalone with application data and measure performance gain
          ‒ Re-inject optimized kernels into the application source code
      4. Use CAPS Compilers to build for Intel Xeon Phi, NVIDIA or AMD GPUs
      [Diagram labels: dataflow analysis; OpenACC skeleton generation; extract loops, functions, regions; fine-tune kernels; user input]
  • 37. C2PO TOOL CHAIN
      [Diagram labels: sequential code; OpenMP code; code skeleton generation; OpenACC generator; OpenACC code; global tuning; data movement analyzer; interactive report; HTML report; local tuning; kernels; ubencher; performance analyzer; CUDA profiler; VTune]
  • 38. C2PO OPENACC GENERATION
      • From sequential or OpenMP code to a first parallelized code
          ‒ Instrument the application and detect hotspots
          ‒ Generate an OpenACC skeleton of kernels from loops
          ‒ Manage data transfers between kernels
      • A report is generated containing
          ‒ Various performance metrics
              ‒ Kernel execution
              ‒ Memory reads and writes
              ‒ Potential performance gain
          ‒ Data dependencies and usage between kernels
          ‒ OpenACC code view
  • 39. C2PO GLOBAL TUNING
      • Dynamic tracking of data so as to optimize its movement
          ‒ Dynamically trace uploads and downloads at execution time
          ‒ Detect potentially redundant data transfers
      • Difficult for the compiler to detect any CPU use of data
          #openacc data region
          // convergence loop
          for {
              Upload data()
              Kernels' calls()
              Download data()
          }
          …
      • Possible advice: are the following parameters modified by the CPU between the downloads and uploads? If yes, insert an OpenACC data region with the non-modified parameters
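The advice on this slide, enclosing a convergence loop in a data region so that unmodified arrays are not re-transferred on every iteration, can be sketched in OpenACC C as follows. The smoothing kernel and the name `smooth` are hypothetical stand-ins for the application's kernels.

```c
/* Hypothetical example: repeated 1-D smoothing of u, with tmp as scratch.
 * The CPU never touches u or tmp inside the loop, so one upload before
 * the loop and one download after it are enough. */
void smooth(int n, int iters, float *u, float *tmp)
{
    /* Data region hoisted around the convergence loop. */
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            /* Kernel 1: average the two neighbours of each interior point. */
            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                tmp[i] = 0.5f * (u[i - 1] + u[i + 1]);
            }

            /* Kernel 2: copy the smoothed values back into u. */
            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                u[i] = tmp[i];
            }
        }
    }
}
```

Without the enclosing data region, each kernel would carry its own transfers, giving the per-iteration upload and download pattern that the slide flags as potentially redundant.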
  • 40. C2PO TUNING PHASE
      • Microbenchmarking mechanism
          ‒ Loops, functions and user-annotated regions are extracted into kernels
          ‒ Apply optimizations
          ‒ Replay kernels with the original data set without running the whole application
          ‒ Once tuned, inject kernels into the application source code
      • Apply performance analyzers from third-party tools (VTune, CUDA profiler)
          ‒ Synthesizes raw metrics (hardware counters) linked to the source code
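The replay idea can be illustrated with a small timing harness. This is a sketch under stated assumptions: `kernel_fn`, `replay_kernel` and `square_kernel` are hypothetical names, and the slides do not describe how the C2PO micro-bencher captures or replays data.

```c
#include <time.h>

/* Signature assumed for an extracted kernel (hypothetical). */
typedef void (*kernel_fn)(int n, const float *in, float *out);

/* Runs the kernel `repeats` times on the same captured input and
 * returns the best processor time observed, in seconds. */
double replay_kernel(kernel_fn kernel, int n, const float *in, float *out,
                     int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        clock_t t0 = clock();
        kernel(n, in, out);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}

/* Stand-in for an extracted loop: out[i] = in[i] squared. */
void square_kernel(int n, const float *in, float *out)
{
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}
```

Timing the kernel in isolation, for example with `replay_kernel(square_kernel, n, in, out, 10)`, lets a tuned variant be compared against the original without rerunning the full application.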
  • 41. C2PO OBJECTIVES AND BENEFITS
      • Keep one single OpenMP code for various parallel many-core systems (GPUs, APUs, MIC)
      • Incrementally port and optimize codes in a modular way
      • Use an interactive compiler: advice from dynamic and static analyses at source code level
  • 42. THANK YOU FOR YOUR ATTENTION!
      Vasnier Jean-Charles
      Sales Engineer, CAPS entreprise
      Phone: +1-865-227-6899
      Email: jvasnier@caps-entreprise.com
  • 43. GET PERFORMANCE IN NO TIME!
      [Bar chart: execution time (seconds) for Hydro and Nbody; series: Original (OpenMP), Generated (auto), Generated (tweaked); bar labels: 63.42, 45.698, 27.539, 23.417, 12.71, 12.55]
      • Hydro: x2 speed-up (after user's tuning)
      • Nbody: x6 speed-up in 3 clicks (fully automatic)
          ‒ Measured on a dual Sandy Bridge E5-2687W with 32 GB RAM and a Kepler K20C driven by CUDA v5.0