OpenACC on AMD GPUs and APUs
with the PGI Accelerator Compilers
Michael Wolfe

Michael.Wolfe@pgroup.com
http://www.pgroup.com

APU13
San Jose, November, 2013

- C, C++, Fortran compilers
  - Optimizing
  - Vectorizing
  - Parallelizing
- Graphical parallel tools
  - PGDBG debugger
  - PGPROF profiler
- AMD, Intel, NVIDIA processors
- PGI Unified Binary™ technology
- Linux, MacOS, Windows
- Visual Studio & Eclipse integration
- PGI Accelerator support
  - OpenACC
  - CUDA Fortran

www.pgroup.com
SMP Parallel Programming

    for( i = 0; i < n; ++i )
        a[i] = sinf(b[i]) + cosf(c[i]);

SMP Parallel Programming

    #pragma omp parallel for private(i)
    for( i = 0; i < n; ++i )
        a[i] = sinf(b[i]) + cosf(c[i]);

    % pgcc -mp x.c …
AMD Radeon Block Diagram*
- Multiple Compute Units
  - Vector Unit
  - Pipelining / Multithreading
- Device Memory
- Cache Hierarchy
- SW-managed cache (LDS)

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.
Heterogeneous Parallel Programming

    for( i = 0; i < n; ++i )
        a[i] = sinf(b[i]) + cosf(c[i]);

Heterogeneous Parallel Programming

    #pragma acc parallel loop private(i) \
        pcopyin(b[0:n], c[0:n]) \
        pcopyout(a[0:n])
    for( i = 0; i < n; ++i )
        a[i] = sinf(b[i]) + cosf(c[i]);

    % pgcc -acc -ta=radeon x.c
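As a minimal sketch of the same pattern wrapped in a function: the name `sum_of_squares` and its arithmetic body are assumptions made for illustration (squares replace `sinf`/`cosf` so the sketch builds without the math library). It uses the `copyin`/`copyout` spellings; the slide's `pcopyin`/`pcopyout` are the OpenACC 1.0 abbreviations of `present_or_copyin`/`present_or_copyout`.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): a[i] = b[i]^2 + c[i]^2. */
void sum_of_squares(float *restrict a, const float *restrict b,
                    const float *restrict c, int n)
{
    assert(n >= 0);
    /* Inputs are copied to the device, the result is copied back. */
    #pragma acc parallel loop copyin(b[0:n], c[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * b[i] + c[i] * c[i];
}
```

A compiler without OpenACC support ignores the directive and runs the loop serially on the host, so the function gives the same result either way.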
Overview
- Parallel programming
- GPU architectural highlights
- OpenACC 5-minute summary
- PGI implementation
- Performance
Abstract CPU+Accelerator Target
Accelerator Architecture Features
- Potentially separate memory (relatively small)
- High bandwidth memory interface
- Many degrees of parallelism
  - MIMD parallelism across many cores
  - SIMD parallelism within a core
  - Multithreading for latency tolerance
- Asynchronous with host
- Performance from parallelism
  - slower clock, less ILP, simpler control unit, smaller caches
OpenACC
Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that
maximize the performance and power efficiency benefits of the hybrid CPU/GPU
architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the
OpenMP Working Group on Accelerators, as well as many others. We look forward to
releasing a version of this proposal in the next release of OpenMP.”

--Michael Wong, CEO OpenMP Directives Board
OpenACC Overview
- Directive-based
- Parallel Computation
- Data Management

    #pragma acc data copyin( a[0:n] ) \
        copy( b[0:n] ) create( tmp[0:n] )
    {
        for( int i = 0; i < iters; ++i ){
            relax( a, b, tmp, n );
            relax( b, a, tmp, n );
        }
    }

    void relax(float *x,float *y,float *t,int n){
        #pragma acc data \
            present( x[0:n], y[0:n], t[0:n] )
        {
            #pragma acc parallel loop
            for( int j = 0; j < n; ++j )
                t[j] = x[j];
            #pragma acc parallel loop
            for( int j = 1; j < n-1; ++j )
                x[j] = 0.25f*(t[j-1]+t[j+1] +
                              y[n-j+1] + y[n-j-1]);
        }
    }
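The data-region pattern on this slide can be sketched in a simplified, bounds-safe form. The functions `smooth` and `smooth_iters` and the two-point average are hypothetical stand-ins rather than the slide's `relax` routine; the point is only that the outer `data` region keeps the arrays on the device across iterations while the inner `present` clause asserts they are already there.

```c
#include <assert.h>

/* Hypothetical smoothing step: save x into t, then average neighbours. */
static void smooth(float *x, float *t, int n)
{
    #pragma acc data present(x[0:n], t[0:n])
    {
        #pragma acc parallel loop
        for (int j = 0; j < n; ++j)
            t[j] = x[j];
        #pragma acc parallel loop
        for (int j = 1; j < n - 1; ++j)
            x[j] = 0.5f * (t[j - 1] + t[j + 1]);
    }
}

/* Outer data region: x is copied in and out once, t lives only on the device. */
void smooth_iters(float *x, float *t, int n, int iters)
{
    assert(n >= 2 && iters >= 0);
    #pragma acc data copy(x[0:n]) create(t[0:n])
    {
        for (int i = 0; i < iters; ++i)
            smooth(x, t, n);
    }
}
```

Without the outer region, each call would presumably have to move `x` and `t` between host and device on every iteration.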
OpenACC compared to OpenMP

OpenACC                    OpenMP
Data parallelism           Thread parallelism
Parallel per region        Fixed number of threads
Flexible || mapping        Fixed || thread mapping
Structured parallelism     Tasks and loops
Performance portability    ?
PGI OpenACC Implementation
- C, C++, Fortran
  - pgcc, pgc++, pgfortran
- Command line options
  - -acc
  - -ta=radeon
  - -ta=radeon,host
  - -ta=radeon,nvidia
- Planner
  - maps program ||ism to hardware ||ism
- Code Generator
  - OpenCL API
- Runtime
  - initialization
  - data management
  - kernel launches
Planner
- Maps parallel loops
- OpenACC abstractions
  - gang, worker, vector
- OpenCL abstractions
  - work group, work item
- Hardware abstractions
  - wavefront

    #pragma acc parallel loop gang
    for( int j = 0; j < n; ++j )
        t[j] = x[j];

    #pragma acc parallel loop gang vector
    for( int j = 0; j < n; ++j )
        t[j] = x[j];

    #pragma acc kernels loop independent
    for( int j = 0; j < n; ++j )
        t[j] = x[j];
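For a loop nest, the two levels can be assigned separately. The following is a minimal sketch with a hypothetical `copy2d` function: the outer loop is spread across gangs and the inner loop across vector lanes. Going by the abstractions listed on the slide, these would presumably correspond to OpenCL work groups and work items; the slides do not spell out the exact mapping.

```c
#include <assert.h>

/* Hypothetical example: copy a rows-by-cols matrix stored row-major. */
void copy2d(int rows, int cols, const float *src, float *dst)
{
    assert(rows >= 0 && cols >= 0);
    /* Outer loop: one or more rows per gang. */
    #pragma acc parallel loop gang \
        copyin(src[0:rows*cols]) copyout(dst[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        /* Inner loop: elements of a row across vector lanes. */
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j)
            dst[i * cols + j] = src[i * cols + j];
    }
}
```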
Code Generator
- Low-level OpenCL
  - “assembly code in C”
- SPIR interface to AMD Radeon LLVM back-end
- Uses non-standard features
  - device addresses
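The slides do not show any generated code, so the following is only a host-side emulation in plain C of what a kernel for the loop `t[j] = x[j]` might compute per work item. The names `copy_work_item` and `emulate_launch` are hypothetical; in real OpenCL the index would come from `get_global_id(0)` rather than from the two host loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-work-item body for the loop t[j] = x[j]. */
static void copy_work_item(size_t gid, size_t n, const float *x, float *t)
{
    if (gid < n)            /* guard: the launch size may be rounded up past n */
        t[gid] = x[gid];
}

/* Emulates launching enough work groups of local_size items to cover n. */
void emulate_launch(size_t n, size_t local_size, const float *x, float *t)
{
    assert(local_size > 0);
    size_t num_groups = (n + local_size - 1) / local_size;  /* round up */
    for (size_t g = 0; g < num_groups; ++g)          /* work groups */
        for (size_t l = 0; l < local_size; ++l)      /* work items  */
            copy_work_item(g * local_size + l, n, x, t);
}
```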
Runtime
- Dynamically loads OpenCL library
- Supports multiple devices
  - Multiple command queues
  - Host as a device (*)
- Memory management
  - device addresses
  - bigbuffer(s) suballocation
- Profiling support
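The idea of suballocating from a big buffer can be illustrated with a simple bump allocator. This is a sketch under stated assumptions, not the PGI runtime's implementation: the `big_buffer` type and `big_buffer_alloc` function are hypothetical, offsets stand in for device addresses, and freeing is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one large device buffer. */
typedef struct {
    size_t capacity;  /* total bytes in the big buffer */
    size_t used;      /* bytes handed out so far       */
} big_buffer;

/* Returns the offset of a new sub-allocation, or (size_t)-1 if it does
 * not fit. align must be a power of two. */
size_t big_buffer_alloc(big_buffer *bb, size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (bb->used + align - 1) & ~(align - 1);  /* round up */
    if (start > bb->capacity || bytes > bb->capacity - start)
        return (size_t)-1;
    bb->used = start + bytes;
    return start;
}
```

Handing out offsets from one large allocation avoids a separate device allocation per array, which is one plausible reason for such a scheme.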
Performance
- AMD Piledriver 5800K
  - 4.0GHz
  - 2MB cache
  - 8 cores
- Single thread/core
- OpenMP parallel
  - PGI 13.10 -fast -mp
- AMD Radeon 7970
  - Tahiti
  - 925 MHz
  - 3GB memory
  - 32 compute units
- OpenACC parallel
  - PGI 13.10 -fast -acc -ta=radeon:tahiti
Cloverleaf Mantevo Miniapp
- Lagrangian-Eulerian hydrodynamics
- compressible Euler equation solver in 2D
- 9500 lines of Fortran+C with OpenMP, OpenACC
- Accelerating Hydrocodes with OpenACC, OpenCL and CUDA, Herdman et al, 2012 SC Companion, DOI: 10.1109/SC.Companion.2012.66
Performance Results
[Bar chart; y-axis ticks 0 to 40 (axis label not recoverable). Series: Serial, OpenMP, R7970, S10000. Problem sizes: 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, 1920^2x2955.]
OpenACC on AMD GPUs and APUs
- OpenACC is designed for performance portability
- PGI Accelerator compilers provide evidence
- Target-specific tuning still underway
- Open Beta compilers available now
- Product version in January 2014
Copyright Notice

© Contents copyright 2013, NVIDIA Corp. This material may not be
reproduced in any manner without the expressed written
permission of NVIDIA Corp.

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
