Simulation, Compilation, and Debugging of
OpenCL on the AMD Southern Islands
Dana Schaa (AMD), Rafael Ubal, and David Kaeli
(Northeastern University, Boston, MA)
Simulation Methodology
Application- OS vs. Guest Program
• Full-OS simulation
An OS runs on the simulator. The simulator implements the complete ISA and virtualizes native hardware devices, similar to a virtual machine. Simulations are accurate, but extremely slow.
[Figure: guest programs 1, 2, ... run on a full O.S., which runs on the full-system simulator core; the simulator virtualizes the complete processor ISA and the I/O hardware]

• Guest program simulation
An application runs directly on the simulator. The simulator implements the non-privileged subset of the ISA and virtualizes the system call interface (ABI). Multi2Sim falls into this category.
[Figure: guest programs 1, 2, ... run directly on the guest program simulator core; the simulator virtualizes the user-space subset of the ISA and the system call interface]
| Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands | November 13th, 2013
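Simulation Methodology
Four-Step Simulation Process

• Disassembler
─ Input: executable ELF file
─ Output: instructions dump

• Emulator
─ Input: executable file, program arguments
─ Output: program output

• Timing simulator
─ Input: executable file, program arguments, processor configuration
─ Output: performance statistics

• Visual tool
─ Input: user interaction
─ Output: cycle navigation, timing diagrams

• Interfaces between the steps
─ Emulator and disassembler: instruction bytes in, instruction fields out
─ Timing simulator and emulator: "run one instruction" request in, instruction information out
─ Timing simulator and visual tool: pipeline trace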
Simulation Methodology
Current Architecture Support

[Table: status (X, In progress, or –) of the disassembler, emulation, timing simulation, and visual tool for ARM, MIPS, x86, AMD Evergreen, AMD Southern Islands, NVIDIA Fermi, and NVIDIA Kepler]

• In our latest Multi2Sim SVN repository
─ 4 GPU + 3 CPU architectures supported or in progress
─ This presentation focuses on Southern Islands (and x86)

The x86 Emulator
Program Loading

[Figure: two panels. "Initial virtual memory image": text and initialized data at 0x08xxxxxx (base 0x08000000), heap, mmap region (0x40000000), and stack (0xc0000000) with program arguments and environment variables. "Initial values for x86 registers": eax, ecx, ebx, esp (stack pointer), eip (instruction pointer); some entries are marked "(not initialized)"]

• Emulation of x86 instructions
─ Update x86 registers
─ Update memory map if needed
─ Example: add [bp+16], 0x5

• Emulation of Linux system calls
─ Analyze system call code and arguments
─ Update memory map
─ Update register eax with return value
─ Example: read(fd, buf, count)
The x86 Emulator
Emulation Loop

1) Parse ELF executable
─ Read ELF sections and symbols
─ Initialize code and data

2) Initialize stack
─ Program headers
─ Arguments
─ Environment variables

3) Initialize registers
─ Program entry → eip
─ Stack pointer → esp

[Figure: emulation loop flowchart. Read instr. at eip (instr. bytes) → decode instruction (instr. fields) → is the instr. int 0x80? No: emulate x86 instr. Yes: emulate system call → move eip to next instr. → repeat]
OpenCL on the Host
Execution Framework

─ An OpenCL host program performs a set of OpenCL library function calls (API calls).
─ Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
─ An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
─ The GPU emulator updates its internal state based on the message received from the driver.

[Figure: software stack. User application → (API call) → runtime library → (ABI call) → device driver → (internal interface) → hardware; the user application and runtime library are user-level code, the device driver is OS-level code]

OpenCL on the Device
Execution Model
─ Work-items execute multiple instances of the same kernel code.
─ Work-groups are sets of work-items that can synchronize and communicate efficiently.
─ The ND-Range contains all work-groups, which do not communicate with each other and can execute in any order.

[Figure: an ND-Range made of work-groups, each made of work-items that run __kernel func(); global memory is associated with the ND-Range, local memory with the work-group, and private memory with the work-item]
The Southern Islands Disassembler
[Figure: tool chain overview (disassembler, emulator, timing simulator, visual tool)]

The Southern Islands Disassembler
Vector Addition Kernel

• Source code

__kernel void vector_add(
    __read_only __global int *src1,
    __read_only __global int *src2,
    __write_only __global int *dst)
{
    int id = get_global_id(0);
    dst[id] = src1[id] + src2[id];
}

[Figure: disassembled kernel, annotated with scalar instructions, vector instructions, scalar registers, vector registers, the loads, the addition, and the store]

The Southern Islands Disassembler
Conditional Statements

• Source code

__kernel void if_kernel(
    __global int *v)
{
    uint id = get_global_id(0);
    if (id < 5)
        v[id] = 10;
}

• Assembly code
[Figure: disassembled kernel, annotated in order: the comparison, save active mask, store value 10, restore active mask]
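
The four annotated steps of the conditional (comparison, save active mask, store, restore active mask) can be mimicked in Python for one wavefront. This is a sketch under stated assumptions, not a model of the actual Southern Islands instructions: the active mask is a plain 64-bit integer with one bit per work-item, and the function name and `first_global_id` parameter are invented for illustration.

```python
WAVEFRONT_SIZE = 64


def run_if_kernel(v, first_global_id=0):
    """Emulate `if (id < 5) v[id] = 10;` for one wavefront using an active mask."""
    lanes = range(WAVEFRONT_SIZE)
    exec_mask = (1 << WAVEFRONT_SIZE) - 1      # all lanes active on entry

    # The comparison: one condition bit per lane.
    cond = 0
    for lane in lanes:
        if first_global_id + lane < 5:
            cond |= 1 << lane

    saved_mask = exec_mask                     # save active mask
    exec_mask &= cond                          # keep only lanes taking the branch

    # Store value 10, predicated on the active mask.
    for lane in lanes:
        if (exec_mask >> lane) & 1:
            v[first_global_id + lane] = 10

    exec_mask = saved_mask                     # restore active mask
    return exec_mask
```

All 64 work-items step through the same code; the mask alone decides which of them actually perform the store.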

The Southern Islands Emulator
[Figure: tool chain overview (disassembler, emulator, timing simulator, visual tool)]

The Southern Islands Emulator
Program Loading

• Responsible entity
─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.
─ Natively, it writes to hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.

• Setup
─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary
─ Initial global memory image, copying global buffers from CPU to GPU memory
─ Kernel arguments
─ ND-Range topology, including number of dimensions and sizes

[Figure: software stack. User application → (API call) → runtime library → (ABI call) → device driver → (internal interface) → hardware]

The Southern Islands Emulator
Emulation Loop

• Work-group execution
─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.
─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.

• Wavefront execution
─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered.
─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.

[Figure: emulation loop flowchart. Split ND-Range into work-groups (work-group pool) → any work-groups left? No: end. Yes: grab work-group and split it into wavefronts (wavefront pool) → any wavefront left? No: back to the work-group check. Yes: for each running wavefront not stalled in a barrier: read instruction @PC, emulate (update mem. + regs.), advance PC]
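
The scheduling policy described on this slide can be sketched as a pair of nested loops. The Python below is an illustration under simplifying assumptions, not the emulator's code: a program is a list of opcode strings, "emulating" an instruction only records it in a trace, and every wavefront of a work-group is assumed to reach the same barriers.

```python
from collections import deque


def emulate_nd_range(num_work_groups, wavefronts_per_group, program):
    """Return the (work_group, wavefront, pc) order in which instructions run.

    program is a list of opcode strings; "barrier" stalls a wavefront until
    every wavefront of its work-group has reached the same barrier.
    """
    trace = []
    work_group_pool = deque(range(num_work_groups))    # increasing ID order
    while work_group_pool:                             # any work-groups left?
        wg = work_group_pool.popleft()
        pcs = [0] * wavefronts_per_group               # one PC per wavefront
        stalled = [False] * wavefronts_per_group
        while any(pc < len(program) for pc in pcs):    # any wavefront left?
            for wf in range(wavefronts_per_group):
                # Run this wavefront until it hits a barrier or finishes.
                while pcs[wf] < len(program) and not stalled[wf]:
                    opcode = program[pcs[wf]]          # read instruction @PC
                    trace.append((wg, wf, pcs[wf]))    # emulate (recorded only)
                    stalled[wf] = opcode == "barrier"
                    pcs[wf] += 1                       # advance PC
            # Every wavefront is now stalled or finished: release the barrier.
            stalled = [False] * wavefronts_per_group
    return trace
```

With two wavefronts and the program `["add", "barrier", "store"]`, both wavefronts run up to the barrier before either one executes the store.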

The Southern Islands Timing Simulator
[Figure: tool chain overview (disassembler, emulator, timing simulator, visual tool)]

The Southern Islands Timing Simulator
The GPU Architecture

─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.

[Figure: command processor → ultra-threaded dispatcher → compute units 0 to 31, each with an L1 cache → crossbar → main memory hierarchy (L2 caches, memory controllers, video memory)]

The Southern Islands Timing Simulator
The Compute Unit

─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one scalar unit, one vector-memory unit, one branch unit, and one LDS (local data share) unit.
─ There are multiple instances of SIMD units.

[Figure: compute unit with instruction memory, front-end, scalar unit, branch unit, SIMD units 0, 1, 2, ..., vector memory unit, and LDS unit; global memory and local memory are shown next to the memory units]

The Southern Islands Timing Simulator
The Front-End

─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ The fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.

[Figure: wavefront pools → fetch → fetch buffers (one per wavefront pool) → issue → issue buffers: one SIMD issue buffer matching each wavefront pool, plus the scalar unit, branch unit, vector memory unit, and LDS unit issue buffers]

The Southern Islands Timing Simulator
The SIMD Unit

─ The SIMD unit runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.

[Figure: SIMD unit pipeline. From compute unit front-end → issue buffer → decode → decode buffer → read (vector/scalar register file) → read buffer → execute (pipelined functional units) → execute buffer → write (vector register file) → write buffer → complete. SIMD lane 0 runs work-items 0, 16, 32, 48; lane 1 runs work-items 1, 17, 33, 49; ...; lane 15 runs work-items 15, 31, 47, 63]
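
The split of a wavefront into subwavefronts follows from the lane assignment shown for the SIMD unit (lane 1 runs work-items 1, 17, 33, 49). A small Python sketch of that arithmetic is given below; the function names are invented, and the assumption that subwavefront k enters the execute stage in cycle k (rather than in some other order) is ours.

```python
WAVEFRONT_SIZE = 64
NUM_LANES = 16


def lane_and_cycle(work_item):
    """Map a work-item of a wavefront to (SIMD lane, execute cycle)."""
    if not 0 <= work_item < WAVEFRONT_SIZE:
        raise ValueError("work-item index out of range")
    return work_item % NUM_LANES, work_item // NUM_LANES


def work_items_on_lane(lane):
    """List the work-items of a wavefront that share one SIMD lane."""
    return [wi for wi in range(WAVEFRONT_SIZE) if wi % NUM_LANES == lane]
```

Each lane therefore serves 64 / 16 = 4 work-items, one per cycle, which is why a full wavefront occupies the execute stage for 4 consecutive cycles.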

The Southern Islands Timing Simulator
The Scalar Unit

─ Runs both arithmetic-logic and memory scalar instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete

[Figure: scalar unit pipeline. From compute unit front-end → issue buffer → decode → decode buffer → read → read buffer → execute or memory → execute buffer → write → write buffer → complete; the vector register file and the scalar register file are shown next to the pipeline]

The Southern Islands Timing Simulator
The Vector Memory Unit

─ Runs vector memory instructions
─ Modeled with 5 stages: decode, read, memory, write, complete
─ Accesses to the global memory hierarchy happen mainly in this unit

[Figure: vector memory unit pipeline. From compute unit front-end → issue buffer → decode → decode buffer → read (vector register file) → read buffer → memory (global memory) → memory buffer → write (vector register file) → write buffer → complete]

The Southern Islands Timing Simulator
The Branch Unit

─ Runs branch instructions, which decide whether to make an entire wavefront jump to a target address depending on the scalar condition code
─ Modeled with 5 stages: decode, read, execute/memory, write, complete

[Figure: branch unit pipeline. From compute unit front-end → issue buffer → decode → decode buffer → read → read buffer → execute → execute buffer → write → write buffer → complete; the scalar register file is accessed for the condition codes and the program counter]

The Southern Islands Timing Simulator
The Local Data Share (LDS) Unit

─ Runs local memory access instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
─ The memory stage accesses the compute unit local memory for read/write

[Figure: LDS unit pipeline. From compute unit front-end → issue buffer → decode → decode buffer → read (vector register file) → read buffer → memory (local memory) → memory buffer → write (vector register file) → write buffer → complete]

The Southern Islands Timing Simulator
Global Memory Hierarchy

─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU
─ One 16KB data L1 per compute unit
─ One scalar L1 cache shared by every 4 compute units
─ Six L2 banks with a total size of 128KB, each connected to a DRAM module

[Figure: compute units CU 0 to CU 31, each with its own L1 cache, and one scalar cache per group of 4 compute units; an interconnect links the L1 and scalar caches to the L2 banks]

The Southern Islands Visual Tool
[Figure: tool chain overview (disassembler, emulator, timing simulator, visual tool)]

The Southern Islands Visual Tool
Main Window and Timing Diagram
─ The main window provides cycle-by-cycle navigation throughout the simulation.
─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.
─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.

Ongoing Projects
CPU-GPU Cache Coherence Protocol

─ MOESI protocol extended with an additional non-coherent write state: NMOESI
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses. The rest issue regular store accesses.

[Figure: ARM, x86, Evergreen (Evg.), Southern Islands (S.I.), and Fermi processing nodes connect through the NMOESI interface to L1 caches, which connect through an interconnect to the L2 banks]

Ongoing Projects
Cooperative Execution of Work-Groups

• Work-group mapping
─ Portions of the ND-Range are executed by CPU/GPU cores with different ISAs
─ Work-groups are mapped to CPU cores or GPU compute units as they become available

• Attained concurrency
─ x86 cores run both the host program and a portion of the ND-Range
─ Idle regions are removed during the execution of the ND-Range

[Figure: work-groups WG-0 to WG-N of the ND-Range are mapped to x86 cores and Southern Islands (S.I.) compute units; a timeline shows x86 cores running the host program plus the kernel, and S.I. compute units running the kernel]
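
Mapping work-groups to whichever core or compute unit becomes available first can be sketched as a greedy scheduler. The Python below is only an illustration of that idea, not the mechanism used in Multi2Sim: the device names and per-work-group run times are made up, and every work-group is assumed to take the same time on a given device.

```python
import heapq


def map_work_groups(num_work_groups, device_times):
    """Assign work-groups, in ID order, to whichever device frees up first.

    device_times maps a device name to the (made-up) time it needs to run one
    work-group. Returns (assignment, finish_time).
    """
    # Heap of (time the device becomes free, device name); all idle at time 0.
    free_at = [(0, name) for name in sorted(device_times)]
    heapq.heapify(free_at)
    assignment = {name: [] for name in device_times}
    finish_time = 0
    for wg in range(num_work_groups):
        start, name = heapq.heappop(free_at)       # earliest available device
        end = start + device_times[name]
        assignment[name].append(wg)
        finish_time = max(finish_time, end)
        heapq.heappush(free_at, (end, name))
    return assignment, finish_time
```

With a slow x86 core (4 time units per work-group) and a fast compute unit (1 time unit), five work-groups finish at time 4 instead of the 5 that the compute unit alone would need, because the x86 core is no longer idle.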

Ongoing Projects
Multi2C – An OpenCL/CUDA Kernel Compiler
─ LLVM-based compiler for OpenCL and CUDA kernels
─ Future release Multi2Sim 4.2 will include a working version
─ Diagrams show progress as per SVN Rev. 1838

• Yellow = under development
• Blue = arriving soon
• Green = supported

[Figure: compilation flow, with each stage colored by status]
─ vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm
─ vec-add.cu → CUDA to LLVM front-end → vec-add.llvm
─ vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin
─ vec-add.llvm → LLVM to Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin
─ vec-add.llvm → LLVM to Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin

Ongoing Projects
Simulation of OpenGL Pipelines

• Goal
─ Leverage our Southern Islands pipeline models to execute OpenGL vertex and fragment shaders

• Steps
─ Develop a runtime library to link with guest programs, implementing the OpenGL, GLUT, and GLEW library APIs
─ Reverse engineer AMD's OpenGL binary format to decode embedded metadata
─ Timing model of the OpenGL pipeline

• New capabilities targeted
─ Timing simulation of other critical GPU components, such as the rasterizer
─ Concurrency evaluation of compute + graphics pipelines

The Multi2Sim Community
Academic Efforts at Northeastern

• The “GPU Programming and Architecture” course
─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
─ Average attendance of ~25 students per semester.

• Undergraduate directed studies
─ Official alternative, equivalent to a 4-credit course, in which an undergraduate student can optionally enroll to collaborate on Multi2Sim development

• Graduate-level development
─ Multiple ongoing PhD theses using Multi2Sim as a support tool
─ All related development becomes openly available through the SVN repo

The Multi2Sim Community
Academic Publications

• Conference papers
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012

• Tutorials
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013

The Multi2Sim Community
www.multi2sim.org

The Multi2Sim Community
Sponsors

Thanks!
Questions?

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such
revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES,
ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE
TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN,
EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in
the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for
informational purposes only and may be trademarks of their respective owners.

Open CL For Speedup WorkshopOfer Rosenberg
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班Paul Chao
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wallugur candan
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 
AMP Kynetics - ELC 2018 Portland
AMP  Kynetics - ELC 2018 PortlandAMP  Kynetics - ELC 2018 Portland
AMP Kynetics - ELC 2018 PortlandKynetics
 
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandAsymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandNicola La Gloria
 
Bca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentBca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentRai University
 
digital logic circuits, digital component
digital logic circuits, digital componentdigital logic circuits, digital component
digital logic circuits, digital componentRai University
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Kynetics
 
B.sc cs-ii -u-1.2 digital logic circuits, digital component
B.sc cs-ii -u-1.2 digital logic circuits, digital componentB.sc cs-ii -u-1.2 digital logic circuits, digital component
B.sc cs-ii -u-1.2 digital logic circuits, digital componentRai University
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxSamsung Open Source Group
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 

Ähnlich wie PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern Islands, by David Kaeli (20)

Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 
AMP Kynetics - ELC 2018 Portland
AMP  Kynetics - ELC 2018 PortlandAMP  Kynetics - ELC 2018 Portland
AMP Kynetics - ELC 2018 Portland
 
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandAsymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
 
Bca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentBca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital component
 
digital logic circuits, digital component
digital logic circuits, digital componentdigital logic circuits, digital component
digital logic circuits, digital component
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
B.sc cs-ii -u-1.2 digital logic circuits, digital component
B.sc cs-ii -u-1.2 digital logic circuits, digital componentB.sc cs-ii -u-1.2 digital logic circuits, digital component
B.sc cs-ii -u-1.2 digital logic circuits, digital component
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on Linux
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 

Mehr von AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14AMD Developer Central
 

Mehr von AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14
 

Kürzlich hochgeladen

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern Islands, by David Kaeli

  • 1. Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands
    Dana Schaa (AMD), Rafael Ubal, and David Kaeli (Northeastern University, Boston, MA)
  • 2. Simulation Methodology: Application- OS vs. Guest Program
    • Full-OS simulation
      ─ An OS runs on the simulator. The simulator implements the complete ISA and virtualizes native hardware devices, similar to a virtual machine. Simulations are accurate but extremely slow.
      ─ [Diagram: guest programs 1, 2, ... run on a full OS on top of the full-system simulator core, which virtualizes the complete processor ISA and the I/O hardware.]
    • Guest program simulation
      ─ An application runs directly on the simulator. The simulator implements the non-privileged subset of the ISA and virtualizes the system call interface (ABI). Multi2Sim falls in this category.
      ─ [Diagram: guest programs 1, 2, ... run directly on the guest program simulator core, which virtualizes the user-space subset of the ISA and the system call interface.]
    | Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands | November 13th, 2013
  • 3. Simulation Methodology: Four-Step Simulation Process
    ─ Disassembler: takes an executable ELF file and produces an instructions dump.
    ─ Emulator: takes an executable file and program arguments and produces the program output. It sends instruction bytes to the disassembler and receives instruction fields.
    ─ Timing simulator: takes an executable file, program arguments, and a processor configuration and produces performance statistics and a pipeline trace. It asks the emulator to run one instruction and receives the instruction information.
    ─ Visual tool: takes the pipeline trace and user interaction and provides cycle navigation and timing diagrams.
  • 4. Simulation Methodology: Current Architecture Support
    [Table: rows ARM, MIPS, x86, AMD Evergreen, AMD Southern Islands, NVIDIA Fermi, NVIDIA Kepler; columns Disasm., Emulation, Timing simulation, Visual tool; each cell is marked X, –, or In progress.]
    • In our latest Multi2Sim SVN repository
      ─ 4 GPU + 3 CPU architectures supported or in progress
      ─ This presentation focuses on Southern Islands (and x86)
  • 5. The x86 Emulator: Program Loading
    [Diagram: initial virtual memory image with text and initialized data starting at 0x08000000, the heap, the mmap region at 0x40000000, and the stack (program arguments, environment variables) ending at 0xc0000000. Initial values for x86 registers: eip is the instruction pointer (0x08xxxxxx), esp is the stack pointer, and eax, ebx, ecx are not initialized.]
    • Emulation of x86 instructions
      ─ Update x86 registers
      ─ Update memory map if needed
      ─ Example: add [bp+16], 0x5
    • Emulation of Linux system calls
      ─ Analyze system call code and arguments
      ─ Update memory map
      ─ Update register eax with return value
      ─ Example: read(fd, buf, count)
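The system call steps on slide 5 can be illustrated with a minimal Python sketch. It is not Multi2Sim code: the register dictionary, the `bytearray` guest memory, and the function name `emulate_syscall` are assumptions made for illustration. The only facts relied on are the Linux i386 convention (call number in eax, arguments in ebx, ecx, edx, result in eax) and the `read` call number 3.

```python
import os

SYS_READ = 3            # Linux i386 system call number for read()
ENOSYS = 38             # Linux errno for "function not implemented"


def emulate_syscall(regs, memory):
    """Emulate one system call on a toy guest state (hypothetical helper).

    regs   -- dict of 32-bit register values, keyed by register name
    memory -- bytearray standing in for the guest's virtual memory
    """
    code = regs["eax"]                       # analyze the system call code
    if code == SYS_READ:
        fd, buf, count = regs["ebx"], regs["ecx"], regs["edx"]
        data = os.read(fd, count)            # perform the call on the host
        memory[buf:buf + len(data)] = data   # update guest memory at buf
        regs["eax"] = len(data)              # return value goes back in eax
    else:
        # Calls that this sketch does not model report -ENOSYS.
        regs["eax"] = -ENOSYS & 0xFFFFFFFF
    return regs["eax"]
```

Passing a readable host file descriptor in ebx, a guest address in ecx, and a byte count in edx copies the bytes into `memory` and leaves the count in eax, mirroring the three bullet steps.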
  • 6. The x86 Emulator: Emulation Loop
    1) Parse ELF executable
      ─ Read ELF sections and symbols
      ─ Initialize code and data
    2) Initialize stack
      ─ Program headers
      ─ Arguments
      ─ Environment variables
    3) Initialize registers
      ─ Program entry → eip
      ─ Stack pointer → esp
    [Flowchart: read instr. at eip (instr. bytes) → decode instruction (instr. fields) → is the instr. int 0x80? If no, emulate x86 instr.; if yes, emulate system call → move eip to next instr. → repeat.]
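The loop on slide 6 can be sketched in Python as follows. This is a toy under stated assumptions, not the Multi2Sim implementation: instructions are pre-decoded tuples rather than real x86 bytes, only two made-up opcodes (`mov_imm`, `add_mem_imm`) are modeled, and the only system call handled is Linux i386 `exit` (call number 1).

```python
def run(program, regs, memory):
    """Toy emulation loop over pre-decoded instructions.

    program -- dict mapping an address to (size, opcode, operands); this
               format is an assumption, real x86 decoding is omitted
    regs    -- dict of 32-bit register values
    memory  -- bytearray standing in for guest memory
    Returns the exit status if the guest calls exit(), else None.
    """
    while regs["eip"] in program:
        # Read the instruction at eip and "decode" it into fields.
        size, opcode, operands = program[regs["eip"]]
        if opcode == "int" and operands == (0x80,):
            # System call path: eax holds the call code.
            if regs["eax"] == 1:             # exit(status), status in ebx
                return regs["ebx"]
            regs["eax"] = -38 & 0xFFFFFFFF   # not modeled: -ENOSYS
        elif opcode == "mov_imm":            # mov reg, imm
            reg, imm = operands
            regs[reg] = imm & 0xFFFFFFFF
        elif opcode == "add_mem_imm":        # add [base+disp], imm
            base, disp, imm = operands
            addr = regs[base] + disp
            value = int.from_bytes(memory[addr:addr + 4], "little")
            result = (value + imm) & 0xFFFFFFFF
            memory[addr:addr + 4] = result.to_bytes(4, "little")
        else:
            raise ValueError(f"unsupported opcode {opcode!r}")
        regs["eip"] += size                  # move eip to the next instruction
    return None
```

The `add_mem_imm` case mirrors the slide 5 example `add [bp+16], 0x5`: it reads a 32-bit little-endian word from guest memory, adds the immediate, and writes the result back.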
  • 7. OpenCL on the Host: Execution Framework
    ─ An OpenCL host program performs a set of OpenCL library function calls (API calls).
    ─ Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
    ─ An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
    ─ The GPU emulator updates its internal state based on the message received from the driver.
    [Diagram: user application → API call → runtime library (user-level code) → ABI call → device driver (OS-level code) → internal interface → hardware.]
  • 8. OpenCL on the Device: Execution Model
    ─ Work-items execute multiple instances of the same kernel code.
    ─ Work-groups are sets of work-items that can synchronize and communicate efficiently.
    ─ The ND-Range contains all work-groups, which do not communicate with each other and may execute in any order.
    [Diagram: the ND-Range (global memory) is made of work-groups (local memory), each made of work-items (private memory) that run __kernel func() { }.]
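To make the hierarchy on slide 8 concrete, the Python sketch below enumerates the work-items of a one-dimensional ND-Range. The function name is hypothetical; the sketch assumes a zero global offset and a global size that is a multiple of the work-group size, and uses the standard OpenCL relation global ID = group ID × local size + local ID.

```python
def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for a 1D ND-Range.

    Assumes a zero global offset and that global_size is a multiple of
    local_size; both are simplifications made for this sketch.
    """
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    num_groups = global_size // local_size
    # Work-groups are independent, so any iteration order would be valid.
    for group_id in range(num_groups):
        for local_id in range(local_size):
            yield group_id, local_id, group_id * local_size + local_id
```

For an ND-Range of 8 work-items in work-groups of 4, the generator produces two groups, and the second work-item of the second group has global ID 5.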
  • 9. The Southern Islands Disassembler [Section divider: Disassembler, Emulator, Timing simulator, Visual tool]
  • 10. The Southern Islands Disassembler: Vector Addition Kernel
    • Source code
        __kernel void vector_add(
            __read_only __global int *src1,
            __read_only __global int *src2,
            __write_only __global int *dst)
        {
            int id = get_global_id(0);
            dst[id] = src1[id] + src2[id];
        }
    [Figure: disassembled kernel with callouts for scalar instructions, vector instructions, scalar registers, vector registers, the loads, the addition, and the store.]
  • 11. The Southern Islands Disassembler: Conditional Statements
    • Source code
        __kernel void if_kernel(
            __global int *v)
        {
            uint id = get_global_id(0);
            if (id < 5)
                v[id] = 10;
        }
    • Assembly code
    [Figure: disassembled kernel with callouts for the comparison, saving the active mask, storing value 10, and restoring the active mask.]
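The four callouts (compare, save active mask, store, restore active mask) can be mimicked in software. The sketch below is an illustration only: it uses Python lists as a per-lane mask, an 8-lane wavefront for readability, and hypothetical names (`if_kernel_wavefront`, `exec_mask`, `cond`); it does not reproduce the actual Southern Islands instructions.

```python
def if_kernel_wavefront(v, wavefront_size=8):
    """Emulate `if (id < 5) v[id] = 10;` for one wavefront using a lane mask."""
    ids = list(range(wavefront_size))
    exec_mask = [True] * wavefront_size     # all lanes start active

    cond = [i < 5 for i in ids]             # the comparison, one bit per lane
    saved_mask = exec_mask[:]               # save active mask
    exec_mask = [e and c for e, c in zip(exec_mask, cond)]

    for lane, active in enumerate(exec_mask):
        if active:                          # only active lanes store value 10
            v[ids[lane]] = 10

    exec_mask = saved_mask                  # restore active mask
    return exec_mask


buffer = [0] * 8
mask_after = if_kernel_wavefront(buffer)
```

Every lane walks through the same instructions; the mask, not a branch, decides which lanes take effect.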
  • 12. The Southern Islands Emulator [Section divider: Disassembler, Emulator, Timing simulator, Visual tool]
  • 13. The Southern Islands Emulator: Program Loading
    [Figure: software stack. User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware.]
    • Responsible entity
      ─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.
      ─ Natively, it writes to hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.
    • Setup
      ─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary
      ─ Initial global memory image, copying global buffers from CPU to GPU memory
      ─ Kernel arguments
      ─ ND-Range topology, including number of dimensions and sizes
  • 14. The Southern Islands Emulator: Emulation Loop
    • Work-group execution
      ─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.
      ─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.
    • Wavefront execution
      ─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered.
      ─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
    [Figure: emulation loop flowchart. Split ND-Range into work-groups (work-group pool) → Any work-groups left? If no, end; if yes, grab a work-group and split it into wavefronts (wavefront pool) → Any wavefront left? If no, return to the work-group check; if yes, for each running wavefront not stalled in a barrier: read instruction at PC, emulate (update memory and registers), advance PC.]
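A minimal sketch of this policy follows, assuming a one-dimensional ND-Range and modeling each wavefront as a Python generator that yields whenever it reaches a barrier. The names `run_nd_range`, `split`, and `example_kernel` are hypothetical, and the sketch assumes every wavefront in a work-group reaches the same number of barriers.

```python
def split(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_nd_range(global_size, local_size, wavefront_size, kernel):
    """Run work-groups one at a time; run each wavefront up to its next barrier."""
    trace = []
    work_groups = split(list(range(global_size)), local_size)
    for wg_id, work_group in enumerate(work_groups):      # increasing ID order
        running = [kernel(wg_id, wf_id, items, trace)
                   for wf_id, items in enumerate(split(work_group, wavefront_size))]
        while running:                                    # any wavefront left?
            still_running = []
            for wavefront in running:
                try:
                    next(wavefront)                       # run until a barrier
                    still_running.append(wavefront)
                except StopIteration:                     # wavefront finished
                    pass
            running = still_running
    return trace


def example_kernel(wg_id, wf_id, items, trace):
    """Toy kernel: one phase before a barrier and one phase after it."""
    trace.append(("pre", wg_id, wf_id))
    yield                                                 # barrier
    trace.append(("post", wg_id, wf_id))


trace = run_nd_range(global_size=8, local_size=4, wavefront_size=2,
                     kernel=example_kernel)
```

Because each pass over the pool runs every wavefront up to its next barrier, no wavefront passes a barrier before all wavefronts of its work-group have reached it.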
  • 15. The Southern Islands Timing Simulator [Section divider: Disassembler, Emulator, Timing simulator, Visual tool]
  • 16. The Southern Islands Timing Simulator: The GPU Architecture
    ─ A command processor receives and processes commands from the host.
    ─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.
    [Figure: Command Processor → Ultra-Threaded Dispatcher → Compute Unit 0 through Compute Unit 31, each with an L1 cache → Crossbar → Main memory hierarchy (L2 caches, memory controllers, video memory).]
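One way to picture the dispatcher is as a greedy assignment of pending work-groups to whichever compute-unit slot frees up first. The sketch below is an assumption-laden illustration, not the simulator's scheduler: `dispatch` is a hypothetical name, each work-group is given a made-up duration in cycles, and ties are broken by the lowest compute unit index.

```python
import heapq


def dispatch(work_group_cycles, num_compute_units, slots_per_cu=1):
    """Greedily assign work-groups to the earliest-free compute-unit slot.

    Returns a list of (work_group_id, compute_unit, start_cycle, end_cycle).
    """
    # Each heap entry is (cycle at which the slot becomes free, compute unit).
    free_slots = [(0, cu) for cu in range(num_compute_units)
                  for _ in range(slots_per_cu)]
    heapq.heapify(free_slots)
    schedule = []
    for wg_id, cycles in enumerate(work_group_cycles):
        start, cu = heapq.heappop(free_slots)       # next available slot
        schedule.append((wg_id, cu, start, start + cycles))
        heapq.heappush(free_slots, (start + cycles, cu))
    return schedule


schedule = dispatch([10, 4, 3, 5], num_compute_units=2)
```

In this example compute unit 0 is busy with the long first work-group, so the remaining three all land on compute unit 1 as its single slot frees up.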
  • 17. The Southern Islands Timing Simulator: The Compute Unit
    ─ The instruction memory of each compute unit contains the OpenCL kernel.
    ─ A front-end fetches instructions and sends them to the appropriate execution unit.
    ─ There is one scalar unit, one vector memory unit, one branch unit, and one LDS (local data share) unit.
    ─ There are multiple instances of SIMD units.
    [Figure: compute unit. Instruction memory → Front-end → Scalar unit, Branch unit, LDS unit (local memory), Vector memory unit (global memory), and SIMD units 0, 1, 2, ···.]
  • 18. The Southern Islands Timing Simulator: The Front-End
    ─ Work-groups are split into wavefronts and allocated to wavefront pools.
    ─ The fetch and issue stages operate in a round-robin fashion.
    ─ There is one SIMD unit associated with each wavefront pool.
    [Figure: front-end. Wavefront pools → Fetch → fetch buffers, one per wavefront pool → Issue → SIMD issue buffers (one matching each wavefront pool), scalar unit issue buffer, branch unit issue buffer, vector memory unit issue buffer, and LDS unit issue buffer.]
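Round-robin selection over a wavefront pool can be sketched as follows. The slide only states that fetch and issue are round-robin; the per-wavefront `ready` flags and the function name `round_robin_pick` are assumptions made for this illustration.

```python
def round_robin_pick(ready, last):
    """Return the index of the next ready wavefront after `last`, or None.

    `ready` holds one boolean per wavefront in the pool; `last` is the index
    picked in the previous cycle.
    """
    count = len(ready)
    for offset in range(1, count + 1):
        candidate = (last + offset) % count   # wrap around the pool
        if ready[candidate]:
            return candidate
    return None                               # nothing to fetch this cycle


pool_ready = [True, False, True]
first = round_robin_pick(pool_ready, last=0)
second = round_robin_pick(pool_ready, last=first)
```

Starting the search one past the previous pick gives every ready wavefront a turn before any wavefront is served twice.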
  • 19. The Southern Islands Timing Simulator: The SIMD Unit
    ─ The SIMD unit runs arithmetic-logic vector instructions.
    ─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
    ─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
    ─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
    ─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
  • 20. The Southern Islands Timing Simulator: The SIMD Unit
    [Figure: SIMD unit pipeline. From compute unit front-end → Issue buffer → Decode → Decode buffer → Read (vector/scalar register file) → Read buffer → Execute (pipelined functional units) → Execute buffer → Write (vector register file) → Write buffer → Complete. SIMD lane 0 runs work-items 0, 16, 32, 48; SIMD lane 1 runs work-items 1, 17, 33, 49; ...; SIMD lane 15 runs work-items 15, 31, 47, 63.]
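The lane assignment in the SIMD unit follows directly from the numbers on these two slides: with 16 lanes, work-item w of a wavefront runs on lane w mod 16 in cycle w div 16. A short sketch, with `simd_schedule` as a hypothetical helper name:

```python
def simd_schedule(wavefront_size=64, lanes=16):
    """Return one list per cycle holding the work-item run by each lane.

    A wavefront is split into subwavefronts of `lanes` work-items, issued in
    consecutive cycles; a partial wavefront leaves trailing lanes unused.
    """
    cycles = -(-wavefront_size // lanes)      # ceiling division
    return [
        [cycle * lanes + lane
         for lane in range(lanes)
         if cycle * lanes + lane < wavefront_size]
        for cycle in range(cycles)
    ]


schedule = simd_schedule()
lane0 = [subwavefront[0] for subwavefront in schedule]
lane15 = [subwavefront[15] for subwavefront in schedule]
```

For a full 64-work-item wavefront this yields 4 subwavefronts, and the per-lane lists match the work-item numbers shown in the figure.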
  • 21. The Southern Islands Timing Simulator: The Scalar Unit
    ─ Runs both arithmetic-logic and memory scalar instructions
    ─ Modeled with 5 stages: decode, read, execute/memory, write, complete
    [Figure: scalar unit pipeline. From compute unit front-end → Issue buffer → Decode → Decode buffer → Read → Read buffer → Execute or Memory → Execute buffer → Write → Write buffer → Complete, with accesses to the scalar and vector register files.]
  • 22. The Southern Islands Timing Simulator: The Vector Memory Unit
    ─ Runs vector memory instructions
    ─ Modeled with 5 stages: decode, read, memory, write, complete
    ─ Accesses to the global memory hierarchy happen mainly in this unit
    [Figure: vector memory unit pipeline. From compute unit front-end → Issue buffer → Decode → Decode buffer → Read (vector register file) → Read buffer → Memory (global memory) → Memory buffer → Write (vector register file) → Write buffer → Complete.]
  • 23. The Southern Islands Timing Simulator: The Branch Unit
    ─ Runs branch instructions, which decide whether to make an entire wavefront jump to a target address depending on the scalar condition code
    ─ Modeled with 5 stages: decode, read, execute/memory, write, complete
    [Figure: branch unit pipeline. From compute unit front-end → Issue buffer → Decode → Decode buffer → Read (scalar register file, condition codes) → Read buffer → Execute → Execute buffer → Write (scalar register file, program counter) → Write buffer → Complete.]
  • 24. The Southern Islands Timing Simulator: The Local Data Share (LDS) Unit
    ─ Runs local memory access instructions
    ─ Modeled with 5 stages: decode, read, execute/memory, write, complete
    ─ The memory stage accesses the compute unit's local memory for reads and writes
    [Figure: LDS unit pipeline. From compute unit front-end → Issue buffer → Decode → Decode buffer → Read (vector register file) → Read buffer → Memory (local memory) → Memory buffer → Write (vector register file) → Write buffer → Complete.]
  • 25. The Southern Islands Timing Simulator: Global Memory Hierarchy
    ─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU
    ─ One 16KB data L1 per compute unit
    ─ One scalar L1 cache shared by every 4 compute units
    ─ Six L2 banks with a total size of 128KB, each connected to a DRAM module
    [Figure: CU 0 through CU 31, each with its own L1 cache; one scalar cache per group of 4 compute units; an interconnect linking the L1 and scalar caches to L2 Bank 0, L2 Bank 1, ...]
  • 26. The Southern Islands Visual Tool [Section divider: Disassembler, Emulator, Timing simulator, Visual tool]
  • 27. The Southern Islands Visual Tool: Main Window and Timing Diagram
    ─ The main window provides cycle-by-cycle navigation throughout the simulation.
    ─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.
    ─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.
  • 28. Ongoing Projects: CPU-GPU Cache Coherence Protocol
    ─ MOESI protocol extended with an additional non-coherent write state: NMOESI
    ─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy
    ─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store
    ─ GPUs and CPUs running OpenCL kernels issue n-store write accesses. The rest issue regular store accesses.
    [Figure: ARM, x86, Evergreen (Evg.), Southern Islands (S.I.), and Fermi cores connected through the NMOESI interface to L1 caches, an interconnect, and L2 Bank 0, L2 Bank 1, ...]
  • 29. Ongoing Projects: Cooperative Execution of Work-Groups
    • Work-group mapping
      ─ Portions of ND-Range executed by CPU/GPU cores with different ISAs
      ─ Work-groups mapped to CPU cores or GPU compute units as they become available
    • Attained concurrency
      ─ x86 cores run both the host program and a portion of the ND-Range
      ─ Idle regions are removed during the execution of the ND-Range
    [Figure: work-groups WG-0, WG-1, WG-2, WG-3, ..., WG-N of an ND-Range mapped onto x86 cores and S.I. compute units; a timeline shows x86 cores running the host program plus the kernel while S.I. compute units run the kernel.]
  • 30. Ongoing Projects: Multi2C, an OpenCL/CUDA Kernel Compiler
    ─ LLVM-based compiler for OpenCL and CUDA kernels
    ─ Future release Multi2Sim 4.2 will include a working version
    ─ Diagrams show progress as per SVN Rev. 1838
    [Figure: compilation flow, color-coded by status (yellow = under development, blue = arriving soon, green = supported). vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm; vec-add.cu → CUDA to LLVM front-end → vec-add.llvm. From vec-add.llvm: LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin; LLVM to Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin; LLVM to Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin.]
  • 31. Ongoing Projects: Simulation of OpenGL Pipelines
    • Goal
      ─ Leverage our Southern Islands pipeline models to execute OpenGL vertex and fragment shaders
    • Steps
      ─ Develop a runtime library to link with guest programs, implementing the OpenGL, GLUT, and GLEW library APIs
      ─ Reverse engineer AMD's OpenGL binary format to decode embedded metadata
      ─ Timing model of the OpenGL pipeline
    • New capabilities targeted
      ─ Timing simulation of other critical GPU components, such as the rasterizer
      ─ Concurrency evaluation of compute + graphics pipelines
  • 32. The Multi2Sim Community: Academic Efforts at Northeastern
    • The “GPU Programming and Architecture” course
      ─ We started an unofficial seminar that students can attend voluntarily. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
      ─ Average attendance of ~25 students per semester.
    • Undergraduate directed studies
      ─ An official alternative, equivalent to a 4-credit course, in which an undergraduate student can optionally enroll to collaborate on Multi2Sim development
    • Graduate-level development
      ─ Multiple ongoing PhD theses using Multi2Sim as a support tool
      ─ All related development becomes openly available through the SVN repo
  • 33. The Multi2Sim Community: Academic Publications
    • Conference papers
      ─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007
      ─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012
    • Tutorials
      ─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011
      ─ Programming and Simulating Fused Devices - OpenCL and Multi2Sim, ICPE, 2012
      ─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013
      ─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013
  • 34. The Multi2Sim Community: www.multi2sim.org
  • 37. The Multi2Sim Community: Sponsors
  • 39. Disclaimer & Attribution
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    ATTRIBUTION
    © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.