Case Study:
Accelerating Full Waveform Inversion
via OpenCL™ on AMD GPUs
©2014 Acceleware Ltd. All rights reserved.
Chris Mason, Acceleware Product Manager
March 5, 2014
About Acceleware
• Software and services company specializing in HPC product development, developer training and consulting services
• OpenCL training for AMD GPUs
  – Progressive lectures and hands-on lab exercises
  – Experienced instructors
  – Delivered worldwide
• High performance consulting
  – Feasibility studies
  – Porting and optimization
  – Code commercialization
Acceleware Software
• Seismic Applications
  – Survey design and 3D modeling
  – Reverse Time Migration
• Electromagnetics
  – FDTD Solver
• Radio Frequency Heating
  – Simulation application for the RF heating of hydrocarbon reserves
Outline
• What is Full Waveform Inversion?
• The Project
• OpenCL
• Optimizations
  – Coalescing
  – Iterative kernel for stencil operations
  – Fusing kernels together to eliminate redundant memory accesses
• Key Performance Results
What is Full Waveform Inversion?
• Seismic inversion technique
• Used to build Earth models from recorded seismic data
• Uses a finite-difference solution to the acoustic wave equation
• Computationally expensive
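As a rough illustration of what a finite-difference solution involves, the sketch below advances a 1D acoustic wavefield by one time step using second-order differences in time and space. The function name, the 1D simplification and the zero-valued end points are assumptions made for brevity; the deck does not show GeoTomo's solver, and production FWI codes work on 3D grids with higher-order stencils.

```c
#include <stddef.h>

/* One explicit time step of the 1D acoustic wave equation
 *     d2p/dt2 = v^2 * d2p/dx2
 * p_prev, p_curr : wavefield at times t-dt and t
 * p_next         : output wavefield at time t+dt
 * vel            : velocity at each grid point
 * The two end points are simply held at zero (simplest possible boundary,
 * chosen for illustration only). */
void wave_step_1d(const float *p_prev, const float *p_curr, float *p_next,
                  const float *vel, size_t n, float dt, float dx)
{
    if (n == 0) return;
    p_next[0] = 0.0f;
    p_next[n - 1] = 0.0f;
    for (size_t i = 1; i + 1 < n; ++i) {
        /* second spatial derivative from the two nearest neighbours */
        float lap = (p_curr[i - 1] - 2.0f * p_curr[i] + p_curr[i + 1]) / (dx * dx);
        float c = vel[i] * dt;
        /* leapfrog update in time */
        p_next[i] = 2.0f * p_curr[i] - p_prev[i] + c * c * lap;
    }
}
```

Every output point needs several neighbouring inputs, and the update is repeated for every grid point at every time step, which is where the computational expense comes from.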
What is FWI?
From a basic starting point...
... to an accurate velocity model
FWI Algorithm
Initial Model Estimate
Forward Propagate Source → Residuals
Back Propagate Residuals → Gradient
Forward Propagation(s) → Step Length
Update Model
Increase Frequency
Loop over shots
Loop over frequencies
Loop until convergence
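The loop structure listed above can be sketched with a deliberately tiny stand-in problem. Here the "model" is a single scalar and "propagation" is a multiplication, so nothing in it resembles a real wave solver; all names are hypothetical. The nesting (frequencies outermost, then convergence iterations, then shots) is the conventional arrangement and is an assumption, since the slide's diagram does not survive in the text.

```c
#include <stddef.h>

/* Toy stand-in for FWI. Forward "propagation" of shot s through model m
 * is just m * src[s]; observed[s] plays the role of recorded data.
 * Only the loop structure mirrors the algorithm on the slide. */
double toy_fwi(double m, const double *src, const double *observed,
               size_t n_shots, size_t n_freqs, int max_iters, double tol)
{
    for (size_t f = 0; f < n_freqs; ++f) {           /* loop over frequencies */
        for (int it = 0; it < max_iters; ++it) {     /* loop until convergence */
            double gradient = 0.0;
            for (size_t s = 0; s < n_shots; ++s) {   /* loop over shots */
                double residual = m * src[s] - observed[s]; /* forward propagate */
                gradient += residual * src[s];              /* back propagate */
            }
            double mag = gradient < 0.0 ? -gradient : gradient;
            if (mag < tol) break;

            /* Step length: propagate the search direction forward again
             * and pick the step that minimises the squared residual. */
            double num = 0.0, den = 0.0;
            for (size_t s = 0; s < n_shots; ++s) {
                double d = gradient * src[s];
                num += (m * src[s] - observed[s]) * d;
                den += d * d;
            }
            m -= (num / den) * gradient;             /* update model */
        }
        /* a real code would move on to a higher frequency band here */
    }
    return m;
}
```

Even in this toy, each model update costs several passes over all shots, which hints at why the real algorithm is dominated by wave propagation time.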
FWI Compute Cost
• Clusters of tens to hundreds of CPU nodes
• Many days of runtime
• Accuracy and quality are reduced to keep runtime acceptable
The Project
• GeoTomo develops high-end geophysical software products that help geophysicists around the world to image the subsurface
• GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution
• GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly
  – GeoTomo looked to AMD GPUs to accelerate their FWI and approached Acceleware for help
Why use GPUs? Performance!
                                     AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                     59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                 ~410                  4000                5910
Peak Gflops (double)                 ~205                  1000                1480
Total Memory                         >>6 GB                6 GB                6 GB
Power Consumption                    140 W                 274 W               375 W
Gflops per Watt (single precision)   <3                    14.59               15.76
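The last row of the table is the peak single-precision rate divided by the power consumption. A minimal sketch of that arithmetic follows; the struct and function names are made up for illustration, and the numbers are copied from the table (the Opteron figure is the approximate ~410 Gflops).

```c
#include <stdio.h>

struct device_spec {
    const char *name;
    double peak_sp_gflops;  /* peak single-precision Gflops, from the table */
    double power_watts;     /* power consumption in watts, from the table */
};

static const struct device_spec DEVICES[] = {
    { "AMD Opteron 6386 SE",  410.0, 140.0 },  /* table lists ~410 */
    { "AMD FirePro W9000",   4000.0, 274.0 },
    { "AMD FirePro S10000",  5910.0, 375.0 },
};

/* Performance per watt: peak Gflops divided by power draw. */
double gflops_per_watt(const struct device_spec *d)
{
    return d->peak_sp_gflops / d->power_watts;
}

/* Prints one line per device. */
void print_efficiency(void)
{
    for (size_t i = 0; i < sizeof DEVICES / sizeof DEVICES[0]; ++i)
        printf("%-20s %6.2f Gflops/W\n",
               DEVICES[i].name, gflops_per_watt(&DEVICES[i]));
}
```

The W9000 works out to about 14.6, which the slide lists as 14.59, and the Opteron lands a little under 3, in line with the table's "<3".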
OpenCL Overview
• Parallel computing architecture standardized by the Khronos Group
• OpenCL:
  – Is a royalty-free standard
  – Provides an API to coordinate parallel computation across heterogeneous processors
    • Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
  – Defines a cross-platform programming language
  – Is used on everything from handheld/embedded devices to supercomputers
OpenCL Programming Model
• Heterogeneous model, including provisions for a host connected to one or more devices
  – Example: GPUs, CPUs
[Figure: a host connected to Device 1 (GPU), Device 2 (GPU), … Device N (GPU)]
The OpenCL Programming Model
• Data-parallel portions of an algorithm are executed on the device as kernels
  – Kernels are C functions with some restrictions and a few language extensions
  – Many (parallel) work-items execute the kernel
• The host executes serial code between device kernel launches
  – Memory management
  – Data exchange to/from device (usually)
  – Error handling
[Figure: execution alternates between host and device; each kernel launch runs as an ND range of work-groups, one shown as a 2×3 grid (Work-Group (0,0) to (1,2)) and one as a 3×2 grid (Work-Group (0,0) to (2,1))]
OpenCL Memory Model
• OpenCL kernels have access to four distinct memory regions:
  – Global
    • Allows read/write access from all work-items in all work-groups
    • Persistent across kernels
  – Local
    • Memory shared by all work-items within a work-group
  – Constant
    • Region of memory that remains constant (read-only) during the execution of a kernel
  – Private
    • Memory that is private to a work-item
• OpenCL vendors map memory regions onto physical resources
  – Local, constant and private memory usually have several orders of magnitude less capacity than global memory, but are orders of magnitude faster
OpenCL Syntax – Memory Spaces
• Host and device have separate memory spaces
  – Data is explicitly moved between them
    • Typically over the PCIe bus
• Host functions allocate, copy, and free memory on the device, e.g.
  – clCreateBuffer()
  – clEnqueueReadBuffer()
  – clEnqueueWriteBuffer()
  – clReleaseMemObject()
Putting It All Together
Operation: Cx = Ax + Bx, one work-item per element

A0 A1 A2 A3 A4 A5 A6 A7
B0 B1 B2 B3 B4 B5 B6 B7
C0 C1 C2 C3 C4 C5 C6 C7

__kernel
void VectorAdd(__global float* a,
               __global float* b,
               __global float* c)
{
    // Each work-item has a unique index, typically used to index into arrays
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
Vector Add – Host Code
void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    size_t N_BYTES = N * sizeof(float);
    size_t globalSize = N;  // one work-item per element

    // Device management code
    …

    // Allocate memory on device
    cl_mem aD = clCreateBuffer(…, N_BYTES, …);
    cl_mem bD = clCreateBuffer(…, N_BYTES, …);
    cl_mem cD = clCreateBuffer(…, N_BYTES, …);

    // Transfer input arrays to device
    clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
    clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

    // Pass kernel arguments and launch kernel
    // (the global work size is passed as a pointer to size_t)
    …
    clEnqueueNDRangeKernel(…, &globalSize, …);

    // Transfer output array to host
    clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);
}
Project Steps
1) Profiling
  – Acquired code, datasets and reference benchmarks from GeoTomo
  – Set up local machines with near-equivalent hardware, compiled the code and confirmed the reference benchmark numbers
  – Augmented the code with timers to determine time spent in parallel regions and other areas of interest
Project Steps
2) Feasibility Analysis
  – Investigated memory footprint for FWI jobs
    • GPU memory limited to 6 GB per card
  – Investigated potential speedup / time to port code
    • Maximum speedup determined by time spent in parallel regions (Amdahl's Law)
    • Time to port dependent on feature set
      – E.g. domain decomposition across multiple GPUs
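Amdahl's Law puts a number on the speedup bound mentioned above. The sketch below is a direct transcription of the formula; the function names are made up here, and any fractions plugged in are illustrative, since the deck does not report the measured parallel fraction.

```c
/* Amdahl's Law: overall speedup when a fraction `parallel_fraction` of the
 * original runtime is accelerated by a factor `kernel_speedup` and the
 * remainder runs at its original speed. */
double amdahl_speedup(double parallel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup);
}

/* Upper bound reached as the accelerated part becomes infinitely fast. */
double amdahl_limit(double parallel_fraction)
{
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, if profiling showed 95% of the runtime in parallel regions, `amdahl_limit(0.95)` caps the whole-application speedup at 20x no matter how fast the GPU kernels are, which is why the profiling step comes first.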
Project Steps
3) Implementation
  – Creating test harnesses
  – Kernel implementation
  – Resolving hardware driver issues
  – Enabling multi-GPU device support
  – Optimization iterations
4) Wrap-up
  – Delivery of port, along with installation documentation
  – Trained GeoTomo developer on OpenCL
Key GeoTomo Optimizations
1) Coalescing
  – Changing memory access patterns in the kernels to those best suited for GPUs
    • Global memory is accessed via a request for a multi-byte word
    • Combine load/store requests from consecutive work-items to reduce the number of requested words
      – Fewer requests -> less contention for global memory
    • Make one big multi-word burst request to global memory whenever possible
      – Contiguous bursts -> less global memory overhead
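To see why the access pattern matters, the sketch below counts how many aligned memory segments a group of consecutive work-items touches for a given stride. It is a simplified counting model rather than a description of any particular GPU: the function name is hypothetical, and the segment size is a parameter because it is hardware-specific.

```c
#include <stddef.h>

/* Counts the distinct aligned segments of `segment_bytes` touched when
 * `n_items` consecutive work-items each read one float located at element
 * index (item * stride). Fewer segments means fewer memory requests. */
size_t segments_touched(size_t n_items, size_t stride, size_t segment_bytes)
{
    size_t count = 0;
    size_t last_segment = (size_t)-1;   /* sentinel: no segment seen yet */
    for (size_t item = 0; item < n_items; ++item) {
        size_t byte_offset = item * stride * sizeof(float);
        size_t segment = byte_offset / segment_bytes;
        /* offsets never decrease, so comparing with the previous
         * segment is enough to detect a new one */
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With 64 work-items and an assumed 256-byte segment, unit stride touches a single segment, while a stride of 64 elements touches 64 separate segments for the same amount of useful data.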
Key GeoTomo Optimizations
2) Iterative kernel for stencil operations
[Figure: input volumes * stencil kernels]
• Outputs are weighted combinations of surrounding elements from input volumes
• Off-axis weights are zero
Acknowledgement: Paulius Micikevicius, 2009
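A plain C reference for such an axis-only stencil might look like the sketch below. The radius, the x-fastest memory layout, the function name and the symmetric weights are all assumptions for illustration; the deck does not give the actual stencil used in the FWI code.

```c
#include <stddef.h>

#define RADIUS 2  /* hypothetical stencil radius */

/* Reference 3D stencil with non-zero weights only along the x, y and z axes.
 * Layout is x-fastest: index = (z * ny + y) * nx + x.
 * w[0] is the centre weight, w[r] the weight at distance r along each axis.
 * Points closer than RADIUS to any face are left untouched. */
void stencil_naive(const float *in, float *out,
                   size_t nx, size_t ny, size_t nz, const float w[RADIUS + 1])
{
    for (size_t z = RADIUS; z + RADIUS < nz; ++z)
    for (size_t y = RADIUS; y + RADIUS < ny; ++y)
    for (size_t x = RADIUS; x + RADIUS < nx; ++x) {
        size_t c = (z * ny + y) * nx + x;
        float acc = w[0] * in[c];
        for (size_t r = 1; r <= RADIUS; ++r) {
            acc += w[r] * (in[c - r] + in[c + r]                        /* x axis */
                         + in[c - r * nx] + in[c + r * nx]              /* y axis */
                         + in[c - r * nx * ny] + in[c + r * nx * ny]);  /* z axis */
        }
        out[c] = acc;
    }
}
```

Each output reads 6 * RADIUS + 1 inputs, and neighbouring outputs share most of them.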
Key GeoTomo Optimizations
• A naïve implementation would have each work-item read all of its neighboring elements directly from global memory
  – It is possible to hit maximum GPU memory bandwidth this way, but the redundant reads hurt performance
Key GeoTomo Optimizations
• Alternative: iterating over 2D slices along the slowest dimension
  – A single work-item is responsible for a column of the output array
  – The work-group caches a 2D plane of input in local memory
  – Work-items store inputs in the direction of iteration in registers
  – Reduces the required number of global memory reads significantly
[Figure: single work-item view, showing values held in registers and values held in local memory]
Acknowledgement: Paulius Micikevicius, 2009
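The register part of this scheme can be emulated on the CPU, as in the sketch below. Each (x, y) column plays the role of one work-item and keeps a sliding window of values along z, so only one new value per step is fetched for the z direction. This is not the project's kernel: the names, the radius and the x-fastest layout are assumptions, and the in-plane neighbours, which a GPU kernel would take from a local-memory tile, are read straight from the input array.

```c
#include <stddef.h>

#define RADIUS 2  /* hypothetical stencil radius */

/* CPU emulation of marching along z with a register window.
 * Layout is x-fastest: index = (z * ny + y) * nx + x.
 * w[0] is the centre weight, w[r] the weight at distance r along each axis. */
void stencil_zmarch(const float *in, float *out,
                    size_t nx, size_t ny, size_t nz, const float w[RADIUS + 1])
{
    if (nz <= 2 * RADIUS) return;
    size_t plane = nx * ny;
    for (size_t y = RADIUS; y + RADIUS < ny; ++y)
    for (size_t x = RADIUS; x + RADIUS < nx; ++x) {
        float win[2 * RADIUS + 1];   /* "registers"; win[RADIUS] is current z */
        size_t col = y * nx + x;
        /* preload so that the first shift leaves z = 0 .. 2*RADIUS-1 in place */
        for (size_t k = 1; k <= 2 * RADIUS; ++k)
            win[k] = in[(k - 1) * plane + col];
        for (size_t z = RADIUS; z + RADIUS < nz; ++z) {
            for (size_t k = 0; k < 2 * RADIUS; ++k)   /* shift window by one */
                win[k] = win[k + 1];
            win[2 * RADIUS] = in[(z + RADIUS) * plane + col]; /* one new z read */
            size_t c = z * plane + col;
            float acc = w[0] * win[RADIUS];
            for (size_t r = 1; r <= RADIUS; ++r)
                acc += w[r] * (win[RADIUS - r] + win[RADIUS + r]  /* z: registers */
                             + in[c - r] + in[c + r]              /* x axis */
                             + in[c - r * nx] + in[c + r * nx]);  /* y axis */
            out[c] = acc;
        }
    }
}
```

Along z, every input value is fetched once per column and then reused from the window for 2 * RADIUS + 1 consecutive outputs, instead of being re-read for each of them.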
Key GeoTomo Optimizations
3) Kernel Fusion
  – Reduce redundant memory accesses by fusing kernels that operate on the same volume together
  – Improves performance by reducing redundant global memory reads
4) Kernel Fission
  – Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
  – Allows for more work-items to run concurrently on the GPU, improving masking of global memory latency
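Kernel fusion can be illustrated on the CPU with two loops standing in for two kernels. The scale-and-add operations below are made-up stand-ins, not the kernels from the project; the point is only the number of passes over the volume.

```c
#include <stddef.h>

/* Unfused: two "kernels", each streaming through the whole volume.
 * The intermediate result is written to memory and read back. */
void scale_then_add(float *p, const float *src, size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i)   /* kernel 1: read p, write p */
        p[i] = scale * p[i];
    for (size_t i = 0; i < n; ++i)   /* kernel 2: read p and src, write p */
        p[i] = p[i] + src[i];
}

/* Fused: a single pass; the intermediate value stays in a register. */
void scale_add_fused(float *p, const float *src, size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i) {
        float scaled = scale * p[i];
        p[i] = scaled + src[i];
    }
}
```

The fused version reads and writes `p` once per element instead of twice. Fission is the opposite trade: splitting a kernel costs extra passes but can lower per-work-item register use, so which direction helps has to be measured case by case.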
Performance Results
• FWI 15 Hz, 15 shots
  – GPU version 7997 seconds
  – CPU (5 cores per shot) 67086 seconds [8.4X]
  – CPU (30 cores per shot) 166948 seconds [20.9X]
• GPU: Sapphire Radeon HD 7970 GHz Edition
  – 6 GB model
Performance Results
“Using GPU’s we can use higher frequencies and more if not all
of the shots to improve the resolution and coverage.”
James Jackson, President, GeoTomo
Questions?
Contact Us
• Tel: +1 403.249.9099
• Email: services@acceleware.com
OpenCL Courses
• June 3-6, 2014, Calgary, Canada
• Private onsite classes also available
OpenCL Consulting
• Feasibility studies
• Code commercialization
• Porting and optimization
• Mentoring
