PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern Islands, by David Kaeli

Simulation, Compilation, and Debugging of
OpenCL on the Southern Islands
Dana Schaa (AMD), Rafael Ubal, and David Kaeli
(Northeastern University, Boston, MA)

Simulation Methodology
Application- OS vs. Guest Program
• Full-OS simulation
─ An OS runs on the simulator. The simulator implements the complete ISA and virtualizes native hardware devices, similar to a virtual machine. Simulations are accurate, but extremely slow.
─ Figure: guest programs 1, 2, ... run on a full O.S., which runs on the full-system simulator core (virtualization of the complete processor ISA and I/O hardware).

• Guest program simulation
─ An application runs directly on the simulator. The simulator implements the non-privileged subset of the ISA and virtualizes the system call interface (ABI). Multi2Sim falls in this category.
─ Figure: guest programs 1, 2, ... run directly on the guest program simulator core (virtualization of the user-space subset of the ISA and the system call interface).

| Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands | November 13th, 2013

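Four-Step Simulation Process

• Disassembler
─ Input: executable ELF file
─ Output: instructions dump
─ Receives instruction bytes from the emulator and returns instruction fields

• Emulator
─ Input: executable file, program arguments
─ Output: program output
─ Runs one instruction at a time for the timing simulator and returns instruction information

• Timing simulator
─ Input: executable file, program arguments, processor configuration
─ Output: performance statistics
─ Produces a pipeline trace for the visual tool

• Visual tool
─ Input: user interaction
─ Output: cycle navigation, timing diagrams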


Current Architecture Support
Timing
simulation

Visual
tool

X
X
X
X

–
–
X
X
X

In progress

In progress

–

–

–
–
X
X
X
–
–

Disasm.

Emulation
In progress

NVIDIA Fermi

X
X
X
X
X
X

NVIDIA Kepler

In progress

ARM
MIPS
x86
AMD Evergreen
AMD Southern Islands

• In our latest Multi2Sim SVN repository
─ 4 GPU + 3 CPU architectures supported or in progress
─ This presentation focuses on Southern Islands (and x86)


The x86 Emulator
Program Loading

1) Parse ELF executable
─ Read ELF sections and symbols
─ Initialize code and data

2) Initialize stack
─ Program headers
─ Arguments
─ Environment variables

3) Initialize registers
─ Program entry → eip
─ Stack pointer → esp

Figure: initial virtual memory image and initial values for x86 registers. Text, initialized data, and not-initialized data start at 0x08000000, followed by the heap. The mmap region starts at 0x40000000. The stack, with the program arguments and environment variables, is at 0xc0000000. The instruction pointer eip holds an address 0x08xxxxxx, esp holds the stack pointer, and eax, ebx, and ecx are shown alongside.


The x86 Emulator
Emulation Loop

• Emulation of x86 instructions
─ Update x86 registers
─ Update memory map if needed
─ Example: add [bp+16], 0x5

• Emulation of Linux system calls
─ Analyze system call code and arguments
─ Update memory map
─ Update register eax with return value
─ Example: read(fd, buf, count)

Figure: emulation loop. Read instr. at eip → instr. bytes → decode instruction → instr. fields → is the instr. int 0x80? If yes, emulate system call. If no, emulate x86 instr. Then move eip to next instr. and repeat.
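
The x86 emulation loop on this slide (read the instruction at eip, decode it, emulate either a regular instruction or a system call on int 0x80, then move eip forward) can be sketched as follows. This is a toy model, not Multi2Sim code: the pre-decoded program table, the two system call codes, and their simplified semantics are assumptions made for illustration, and no real x86 decoding takes place.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Guest state: a few registers, captured output, and a run flag."""
    regs: dict = field(default_factory=lambda: {"eax": 0, "ebx": 0, "eip": 0})
    output: list = field(default_factory=list)
    running: bool = True


# Hypothetical pre-decoded program: address -> instruction fields.
# A real emulator reads instruction bytes at eip and decodes them.
PROGRAM = {
    0: ("mov", "ebx", 7),
    1: ("add", "ebx", 5),
    2: ("mov", "eax", 4),   # toy "write" system call code
    3: ("int", 0x80),
    4: ("mov", "eax", 1),   # toy "exit" system call code
    5: ("int", 0x80),
}


def emulate_syscall(ctx):
    """Analyze the code in eax, act on it, and put the return value in eax."""
    code = ctx.regs["eax"]
    if code == 4:                       # toy write: record the value in ebx
        ctx.output.append(ctx.regs["ebx"])
        ctx.regs["eax"] = 1
    elif code == 1:                     # toy exit: stop the loop
        ctx.running = False
    else:
        ctx.regs["eax"] = -1            # unknown call in this sketch


def emulate_instruction(ctx, inst):
    """Update registers for a regular (non system call) instruction."""
    op, dst, imm = inst
    if op == "mov":
        ctx.regs[dst] = imm
    elif op == "add":
        ctx.regs[dst] = (ctx.regs[dst] + imm) & 0xFFFFFFFF
    else:
        raise ValueError(f"unsupported instruction {op!r}")


def run(ctx, program):
    while ctx.running:
        inst = program[ctx.regs["eip"]]     # read and "decode" at eip
        if inst == ("int", 0x80):
            emulate_syscall(ctx)
        else:
            emulate_instruction(ctx, inst)
        ctx.regs["eip"] += 1                # move eip to next instruction
    return ctx
```

Calling `run(Context(), PROGRAM)` executes the six toy instructions and stops at the exit call.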


OpenCL on the Host
Execution Framework

─ An OpenCL host program performs a set of OpenCL library function calls (API calls).
─ Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
─ An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
─ The GPU emulator updates its internal state based on the message received from the driver.

Figure: software stack. User application → (API call) → runtime library, in user-level code → (ABI call) → device driver, in OS-level code → (internal interface) → hardware.
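
The layering on this slide (API call into the runtime library, ABI call into the driver through a system call code that Linux does not use, then an internal interface into the GPU emulator) can be illustrated with a small sketch. All class names, the call names `mem_create` and `mem_write`, and the value of `DRIVER_SYSCALL_CODE` are hypothetical; they are not Multi2Sim's actual interface.

```python
# Hypothetical system call code, assumed to be unused by Linux.
DRIVER_SYSCALL_CODE = 10_000


class GpuEmulator:
    """Holds device state; here just a table of global memory buffers."""

    def __init__(self):
        self.buffers = {}

    def create_buffer(self, size):
        buffer_id = len(self.buffers) + 1
        self.buffers[buffer_id] = bytearray(size)
        return buffer_id

    def write_buffer(self, buffer_id, data):
        self.buffers[buffer_id][: len(data)] = data
        return len(data)


class Driver:
    """OS-level side: decodes the ABI call and updates the emulator."""

    def __init__(self, emulator):
        self.calls = {
            "mem_create": emulator.create_buffer,
            "mem_write": emulator.write_buffer,
        }

    def handle_abi_call(self, name, *args):
        return self.calls[name](*args)


class SystemCallLayer:
    """Routes the driver's code to the driver; other codes are out of scope."""

    def __init__(self, driver):
        self.driver = driver

    def syscall(self, code, *args):
        if code == DRIVER_SYSCALL_CODE:
            return self.driver.handle_abi_call(*args)
        raise NotImplementedError(f"system call {code} not modeled here")


class Runtime:
    """User-level side: exposes API calls and forwards them as ABI calls."""

    def __init__(self, os_layer):
        self.os_layer = os_layer

    def create_buffer(self, size):
        return self.os_layer.syscall(DRIVER_SYSCALL_CODE, "mem_create", size)

    def write_buffer(self, buffer_id, data):
        return self.os_layer.syscall(
            DRIVER_SYSCALL_CODE, "mem_write", buffer_id, data)
```

The host program only ever sees `Runtime`; the emulator state changes as a side effect of the forwarded calls, which is the point of the interception scheme.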


OpenCL on the Device
Execution Model
─ Work-items execute multiple instances of the same kernel code
─ Work-groups are sets of work-items that can synchronize and communicate efficiently
─ The ND-Range contains all work-groups, which do not communicate with each other and can execute in any order

Figure: the ND-Range, with global memory, contains work-groups. Each work-group, with local memory, contains work-items. Each work-item, with private memory, runs one instance of __kernel func().
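
The relation between the ND-Range, work-groups, and work-items can be made concrete by enumerating identifiers. The sketch below assumes the usual OpenCL relation global ID = group ID × local size + local ID in each dimension, with no global offset; the function name `nd_range_ids` is introduced here for illustration.

```python
from itertools import product


def nd_range_ids(global_size, local_size):
    """Yield (group_id, local_id, global_id) tuples for every work-item."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same dimensions")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of local size")

    groups_per_dim = [g // l for g, l in zip(global_size, local_size)]
    for group in product(*(range(n) for n in groups_per_dim)):
        for local in product(*(range(l) for l in local_size)):
            global_id = tuple(
                grp * size + loc
                for grp, size, loc in zip(group, local_size, local))
            yield group, local, global_id
```

For example, `list(nd_range_ids((8,), (4,)))` lists eight work-items split into two work-groups of four.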


The Southern Islands Disassembler


Vector Addition Kernel

• Source code

__kernel void vector_add(
        __read_only __global int *src1,
        __read_only __global int *src2,
        __write_only __global int *dst)
{
        int id = get_global_id(0);
        dst[id] = src1[id] + src2[id];
}

Figure: assembly listing, with annotations marking the scalar instructions, the vector instructions, the scalar registers, the vector registers, the loads, the addition, and the store.


Conditional Statements

• Source code

__kernel void if_kernel(
        __global int *v)
{
        uint id = get_global_id(0);
        if (id < 5)
                v[id] = 10;
}

• Assembly code
─ The comparison.
─ Save active mask.
─ Store value 10.
─ Restore active mask.
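
The four annotated steps for if_kernel (the comparison, saving the active mask, the store, and restoring the active mask) can be mimicked in a few lines. This is a sketch of the masking idea only, assuming one work-item per element of `v`; it does not reproduce Southern Islands instructions or register names.

```python
def emulate_if_kernel(v):
    """Run if_kernel over len(v) work-items using an explicit active mask."""
    n = len(v)
    active = [True] * n                         # all work-items active on entry

    cond = [item_id < 5 for item_id in range(n)]        # the comparison
    saved = list(active)                                # save active mask
    active = [a and c for a, c in zip(active, cond)]    # mask off failing lanes

    for item_id in range(n):
        if active[item_id]:
            v[item_id] = 10                             # store value 10

    active = saved                                      # restore active mask
    return v, active
```

Work-items that fail the comparison still step through the store, but with their mask bit cleared the store has no effect for them.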


The Southern Islands Emulator


Program Loading

• Responsible entity
─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.
─ Natively, it writes to hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.

• Setup
─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary
─ Initial global memory image, copying global buffers from CPU to GPU memory
─ Kernel arguments
─ ND-Range topology, including number of dimensions and sizes

Figure: software stack. User application → (API call) → runtime library → (ABI call) → device driver → (internal interface) → hardware.


Emulation Loop

• Work-group execution
─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.
─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.

• Wavefront execution
─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered.
─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.

Figure: emulation loop flowchart.
─ Split the ND-Range into work-groups and place them in the work-group pool.
─ Any work-groups left? If no, end. If yes, grab a work-group and split it into wavefronts, placed in the wavefront pool.
─ Any wavefront left? If no, go back to the work-group check. If yes, for each running wavefront not stalled in a barrier: read instruction @PC, emulate (update mem. + regs.), advance PC.
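
The emulation policy described on this slide (one work-group at a time in increasing ID order, and within it one wavefront at a time until it hits a barrier) can be sketched as below. The sketch assumes a one-dimensional ND-Range, a straight-line program shared by all wavefronts, and a hypothetical `"barrier"` marker; real kernels have control flow and per-wavefront state that are not modeled here.

```python
def emulate_nd_range(num_work_groups, wavefronts_per_group, program):
    """Return the order in which (work_group, wavefront, inst) are emulated.

    `program` is a list of instruction names; the name "barrier" stalls a
    wavefront until every wavefront of its work-group has reached it.
    """
    trace = []
    for wg in range(num_work_groups):           # one work-group at a time
        pcs = [0] * wavefronts_per_group        # one PC per wavefront
        while any(pc < len(program) for pc in pcs):
            for wf in range(wavefronts_per_group):
                # Run this wavefront until it finishes or hits a barrier.
                while pcs[wf] < len(program):
                    inst = program[pcs[wf]]     # read instruction @PC
                    trace.append((wg, wf, inst))    # stand-in for emulation
                    pcs[wf] += 1                # advance PC
                    if inst == "barrier":
                        break
            # Every wavefront is now at the barrier or done, so the next
            # pass over the wavefronts resumes them past the barrier.
    return trace
```

With `program = ["a", "barrier", "b"]` and two wavefronts, both wavefronts run `a` and reach the barrier before either runs `b`.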


The Southern Islands Timing Simulator


The GPU Architecture

─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.

Figure: the command processor feeds the ultra-threaded dispatcher, which feeds compute units 0 to 31. Each compute unit has its own L1 cache, and a crossbar connects the L1 caches to the main memory hierarchy (L2 caches, memory controllers, video memory).


The Compute Unit

─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one scalar unit, one vector memory unit, one branch unit, and one LDS (local data share) unit.
─ There are multiple instances of SIMD units.

Figure: the front-end reads from the instruction memory and feeds the scalar unit, the branch unit, SIMD units 0, 1, 2, ..., the vector memory unit (connected to global memory), and the LDS unit (connected to local memory).


The Front-End

─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ The fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.

Figure: the fetch stage reads from the wavefront pools into fetch buffers, one per wavefront pool. The issue stage moves instructions from the fetch buffers into the SIMD issue buffers (each matching a wavefront pool) and into the scalar unit, branch unit, vector memory unit, and LDS unit issue buffers.


The SIMD Unit

─ The SIMD unit runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16
work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive
cycles.
─ The vector register file is accessed in the read and write stages to consume input and
produce output operands, respectively.
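
The split of a 64-work-item wavefront into 4 subwavefronts of 16 work-items, issued to the 16 lanes over 4 consecutive cycles, can be written out explicitly. The constant and function names below are introduced for illustration; the numbers come from the slide.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront (maximum)
NUM_LANES = 16        # stream cores per SIMD unit


def subwavefronts(work_items):
    """Split a wavefront into consecutive chunks of NUM_LANES work-items."""
    return [work_items[i:i + NUM_LANES]
            for i in range(0, len(work_items), NUM_LANES)]


def lane_schedule(wavefront_size=WAVEFRONT_SIZE):
    """Map each lane to the work-items it runs, one per consecutive cycle."""
    schedule = {lane: [] for lane in range(NUM_LANES)}
    for sub in subwavefronts(list(range(wavefront_size))):
        for lane, work_item in enumerate(sub):
            schedule[lane].append(work_item)
    return schedule
```

Under this mapping, work-item i runs on lane i mod 16 in cycle i div 16 of the execute stage.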



The SIMD Unit

Figure: SIMD unit pipeline, from the compute unit front-end: issue buffer → decode → decode buffer → read → read buffer → execute → execute buffer → write → write buffer → complete. The read stage accesses the vector/scalar register file and the write stage accesses the vector register file. The execute stage uses pipelined functional units in 16 SIMD lanes: lane 0 runs work-items 0, 16, 32, 48; lane 1 runs work-items 1, 17, 33, 49; ...; lane 15 runs work-items 15, 31, 47, 63.


The Scalar Unit

─ Runs both arithmetic-logic and memory scalar instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete

Figure: scalar unit pipeline, from the compute unit front-end: issue buffer → decode → decode buffer → read → read buffer → execute or memory → execute buffer → write → write buffer → complete, with the vector and scalar register files shown alongside.


The Vector Memory Unit

─ Runs vector memory instructions
─ Modeled with 5 stages: decode, read, memory, write, complete
─ Accesses to the global memory hierarchy happen mainly in this unit

Figure: vector memory unit pipeline, from the compute unit front-end: issue buffer → decode → decode buffer → read → read buffer → memory → memory buffer → write → write buffer → complete. The read and write stages access the vector register file, and the memory stage accesses global memory.


The Branch Unit

─ Runs branch instructions, which decide whether to make an entire wavefront jump to a target address depending on the scalar condition code
─ Modeled with 5 stages: decode, read, execute/memory, write, complete

Figure: branch unit pipeline, from the compute unit front-end: issue buffer → decode → decode buffer → read → read buffer → execute → execute buffer → write → write buffer → complete. The pipeline accesses the scalar register file for the condition codes and the program counter.


The Local Data Share (LDS) Unit

─ Runs local memory access instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
─ The memory stage accesses the compute unit local memory for read/write

Figure: LDS unit pipeline, from the compute unit front-end: issue buffer → decode → decode buffer → read → read buffer → memory → memory buffer → write → write buffer → complete. The read and write stages access the vector register file, and the memory stage accesses local memory.


Global Memory Hierarchy

─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU
─ One 16KB data L1 per compute unit
─ One scalar L1 cache shared by every 4 compute units
─ Six L2 banks with a total size of 128KB, each connected to a DRAM module

Figure: compute units CU 0 to CU 31, each with its own L1 cache, and one scalar cache per group of four compute units. An interconnect links the L1 and scalar caches to the L2 banks (bank 0, bank 1, ...).
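
The default topology above (32 compute units, one scalar cache per 4 compute units, six L2 banks) can be expressed as simple index arithmetic. The compute-unit-to-scalar-cache mapping follows directly from the slide; the address-to-bank mapping is an assumption (block-interleaved, with an assumed 64-byte block), since the slide does not say how addresses are distributed across banks.

```python
NUM_COMPUTE_UNITS = 32
CUS_PER_SCALAR_CACHE = 4
NUM_L2_BANKS = 6
BLOCK_SIZE = 64   # bytes; assumed value, not stated on the slide


def scalar_cache_of(compute_unit):
    """Index of the scalar L1 cache shared by this compute unit's group."""
    if not 0 <= compute_unit < NUM_COMPUTE_UNITS:
        raise ValueError("compute unit out of range")
    return compute_unit // CUS_PER_SCALAR_CACHE


def l2_bank_of(address):
    """Assumed mapping: consecutive blocks go to consecutive L2 banks."""
    return (address // BLOCK_SIZE) % NUM_L2_BANKS
```

With these defaults there are 8 scalar caches in total, one for CU 0 to 3, one for CU 4 to 7, and so on.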

The Southern Islands Visual Tool


The Southern Islands Visual Tool
Main Window and Timing Diagram
─ The main window provides cycle-by-cycle navigation throughout the simulation.
─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.
─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.


Ongoing Projects
CPU-GPU Cache Coherence Protocol

─ MOESI protocol extended with an additional non-coherent write state: NMOESI
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses. The rest issue regular store accesses.

Figure: ARM, x86, Evergreen (Evg.), Southern Islands (S.I.), and Fermi cores sit above the NMOESI interface, connected to L1 caches. An interconnect links the L1 caches to the L2 banks (bank 0, bank 1, ...).

Ongoing Projects
Cooperative Execution of Work-Groups

• Attained concurrency
─ x86 cores run both the host program and a portion of the ND-Range
─ Idle regions are removed during the execution of the ND-Range

• Work-group mapping
─ Portions of the ND-Range are executed by CPU/GPU cores with different ISAs
─ Work-groups are mapped to CPU cores or GPU compute units as they become available

Figure: work-groups WG-0 to WG-N of the ND-Range are mapped onto x86 and S.I. hardware. A timeline shows the x86 cores running the host program plus the kernel, while the S.I. compute units run the kernel.
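
Mapping work-groups to CPU cores or GPU compute units "as they become available" amounts to a greedy schedule: the next work-group goes to whichever device frees up first. The sketch below assumes each device takes a fixed, known time per work-group; the device names and timings are made up for illustration and are not measurements.

```python
import heapq


def map_work_groups(num_work_groups, time_per_work_group):
    """Greedily assign work-groups to the device that becomes free first.

    `time_per_work_group` maps a device name to the (assumed constant) time
    it needs for one work-group. Returns (assignment, finish_time).
    """
    # Min-heap of (time at which the device is free, device name).
    free_at = [(0.0, name) for name in sorted(time_per_work_group)]
    heapq.heapify(free_at)

    assignment = {name: [] for name in time_per_work_group}
    finish_time = 0.0
    for wg in range(num_work_groups):
        start, name = heapq.heappop(free_at)
        done = start + time_per_work_group[name]
        assignment[name].append(wg)
        finish_time = max(finish_time, done)
        heapq.heappush(free_at, (done, name))
    return assignment, finish_time
```

A slower device still takes a share of the work instead of sitting idle, which is the concurrency gain the slide refers to.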

Ongoing Projects
Multi2C – An OpenCL/CUDA Kernel Compiler

─ LLVM-based compiler for OpenCL and CUDA kernels
─ Future release Multi2Sim 4.2 will include a working version
─ Diagrams show progress as per SVN Rev. 1838

Diagram color legend:
• Yellow = under development
• Blue = arriving soon
• Green = supported

Figure: compilation flow.
─ vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm
─ vec-add.cu → CUDA to LLVM front-end → vec-add.llvm
─ vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin
─ vec-add.llvm → LLVM to Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin
─ vec-add.llvm → LLVM to Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin


Ongoing Projects
Simulation of OpenGL Pipelines

• Goal
─ Leverage our Southern Islands pipeline models to
execute OpenGL vertex and fragment shaders

• Steps
─ Develop a runtime library to link with guest programs, implementing the OpenGL, GLUT and GLEW library APIs
─ Reverse engineer AMD's OpenGL binary format to decode embedded metadata
─ Timing model of the OpenGL pipeline

• New capabilities targeted
─ Timing simulation of other critical GPU components, such as rasterizer
─ Concurrency evaluation of compute + graphics pipelines


The Multi2Sim Community
Academic Efforts at Northeastern

• The “GPU Programming and Architecture” course
─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers
OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
─ Average attendance of ~25 students per semester.

• Undergraduate directed studies
─ Official alternative, equivalent to a 4-credit course, in which an undergraduate student can optionally enroll to collaborate on Multi2Sim development

• Graduate-level development
─ Multiple ongoing PhD theses using Multi2Sim as support tool
─ All related development becomes openly available through the SVN repo



Academic Publications

• Conference papers
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012

• Tutorials
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013


www.multi2sim.org


Sponsors



Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such
revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES,
ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE
TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN,
EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in
the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for
informational purposes only and may be trademarks of their respective owners.
