PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern Islands, by David Kaeli
1. Simulation, Compilation, and Debugging of OpenCL on the Southern Islands
Dana Schaa (AMD), Rafael Ubal, and David Kaeli
(Northeastern University, Boston, MA)
2. Simulation Methodology
Application-OS vs. Guest Program
• Full-OS simulation
─ An OS runs on the simulator. The simulator implements the complete ISA and virtualizes native hardware devices, similar to a virtual machine. Simulations are accurate but extremely slow.
─ Diagram: guest programs 1, 2, ... run on a full O.S., which runs on the full-system simulator core. The core virtualizes the complete processor ISA and the I/O hardware.
• Guest program simulation
─ An application runs directly on the simulator. The simulator implements the non-privileged subset of the ISA and virtualizes the system call interface (ABI). Multi2Sim falls in this category.
─ Diagram: guest programs 1, 2, ... run directly on the guest program simulator core, which virtualizes the user-space subset of the ISA and the system call interface.
| Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands | November 13th, 2013
3. Simulation Methodology
Four-Step Simulation Process
• Disassembler
─ Input: executable ELF file
─ Output: instructions dump
• Emulator
─ Input: executable file, program arguments
─ Output: program output
─ Exchanges instruction bytes and instruction fields with the disassembler
• Timing simulator
─ Input: executable file, program arguments, processor configuration
─ Output: performance statistics, pipeline trace
─ Asks the emulator to run one instruction and receives the instruction information
• Visual tool
─ Input: pipeline trace, user interaction
─ Output: cycle navigation, timing diagrams
4. Simulation Methodology
Current Architecture Support
[Table: support status of each architecture (rows: ARM, MIPS, x86, AMD Evergreen, AMD Southern Islands, NVIDIA Fermi, NVIDIA Kepler) for each tool (columns: Disasm., Emulation, Timing simulation, Visual tool); each cell is X, In progress, or –]
• In our latest Multi2Sim SVN repository
─ 4 GPU + 3 CPU architectures supported or in progress
─ This presentation focuses on Southern Islands (and x86)
5. The x86 Emulator
Program Loading
• Initial virtual memory image
─ 0x08000000: text and initialized data, followed by the heap
─ 0x40000000: mmap region (not initialized)
─ 0xc0000000: stack, with program arguments and environment variables
• Initial values for x86 registers
─ eax, ebx, ecx, ...
─ eip: instruction pointer (0x08xxxxxx)
─ esp: stack pointer
• Emulation of x86 instructions
─ Update x86 registers
─ Update memory map if needed
─ Example: add [bp+16], 0x5
• Emulation of Linux system calls
─ Analyze system call code and arguments
─ Update memory map
─ Update register eax with return value
─ Example: read(fd, buf, count)
6. The x86 Emulator
Emulation Loop
1) Parse ELF executable
─ Read ELF sections and symbols
─ Initialize code and data
2) Initialize stack
─ Program headers
─ Arguments
─ Environment variables
3) Initialize registers
─ Program entry → eip
─ Stack pointer → esp
Flowchart: read instr. at eip (instr. bytes) → decode instruction (instr. fields) → is the instr. int 0x80? Yes: emulate system call. No: emulate x86 instr. → move eip to next instr. → repeat
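The read, decode, and dispatch cycle of the emulation loop can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Multi2Sim code: the three-field instruction tuples, the `emulate` function, and the handler table are hypothetical stand-ins for real x86 decoding and the Linux ABI. The only detail borrowed from real systems is that 1 and 4 are the i386 Linux codes for exit and write.

```python
# Toy emulation loop: read at eip, decode, dispatch on `int 0x80`, advance eip.
# The instruction encoding is invented for this sketch.

def sys_write(regs, output):
    """Hypothetical handler: record the value held in ebx."""
    output.append(regs["ebx"])
    return 1            # pretend one item was written

def sys_exit(regs, output):
    """Hypothetical handler: returning None asks the loop to stop."""
    return None

SYSCALLS = {1: sys_exit, 4: sys_write}

def emulate(program, regs, syscalls=SYSCALLS):
    """Run `program`, a list of (opcode, dst, src) tuples, until it exits."""
    output = []
    while regs["eip"] < len(program):
        opcode, dst, src = program[regs["eip"]]     # read and decode
        if opcode == "int" and dst == 0x80:          # system call
            result = syscalls[regs["eax"]](regs, output)
            if result is None:                       # exit requested
                break
            regs["eax"] = result                     # return value goes in eax
        elif opcode == "mov":
            regs[dst] = src
        elif opcode == "add":
            regs[dst] += src
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        regs["eip"] += 1                             # move eip to next instr.
    return output

program = [
    ("mov", "ebx", 5),
    ("add", "ebx", 10),
    ("mov", "eax", 4),      # select the write handler
    ("int", 0x80, None),
    ("mov", "eax", 1),      # select the exit handler
    ("int", 0x80, None),
]
regs = {"eax": 0, "ebx": 0, "eip": 0}
result = emulate(program, regs)
```

A real emulator decodes variable-length x86 instruction bytes and updates a virtual memory image as well as registers; the sketch keeps only the control structure of the loop.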
7. OpenCL on the Host
Execution Framework
─ An OpenCL host program performs a set of OpenCL library function calls (API calls)
─ Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
─ An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator
─ The GPU emulator updates its internal state based on the message received from the driver
Diagram: user application → (API call) → runtime library [user-level code] → (ABI call) → device driver [OS-level code] → (internal interface) → hardware
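The call path from host program to GPU emulator can be illustrated with a small Python model. Everything named here is hypothetical: the classes, the `mem_alloc` message, and the value of `PRIVATE_SYSCALL` are assumptions chosen for the sketch and do not reflect Multi2Sim's actual interface.

```python
PRIVATE_SYSCALL = 10_000   # hypothetical code, assumed unused by the guest OS

class GpuEmulator:
    """Stands in for the GPU emulator's internal state."""
    def __init__(self):
        self.buffers = {}
        self.next_id = 0

    def create_buffer(self, size):
        buffer_id = self.next_id
        self.next_id += 1
        self.buffers[buffer_id] = bytearray(size)
        return buffer_id

class Driver:
    """Simulator-side module that receives the ABI call."""
    def __init__(self, emulator):
        self.emulator = emulator

    def syscall(self, code, message, *args):
        if code != PRIVATE_SYSCALL:
            raise NotImplementedError("regular OS system call")
        if message == "mem_alloc":
            return self.emulator.create_buffer(*args)   # internal interface
        raise ValueError(f"unknown driver message {message!r}")

class Runtime:
    """Guest-side library exposing an API call to the host program."""
    def __init__(self, driver):
        self.driver = driver

    def create_buffer(self, size):                       # API call
        return self.driver.syscall(PRIVATE_SYSCALL, "mem_alloc", size)  # ABI call

emulator = GpuEmulator()
runtime = Runtime(Driver(emulator))
handle = runtime.create_buffer(64)
```

The host program only sees `Runtime.create_buffer`; the private system call code is what lets the simulator tell runtime-to-driver traffic apart from ordinary OS system calls.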
8. OpenCL on the Device
Execution Model
─ Work-items execute multiple instances of the same kernel code
─ Work-groups are sets of work-items that can synchronize and communicate efficiently
─ The ND-Range contains all work-groups, which do not communicate with each other and can execute in any order
Diagram: the ND-Range (global memory) contains work-groups (local memory); each work-group contains work-items (private memory); every work-item runs the same __kernel func() { }
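The relation between work-groups, work-items, and the ND-Range can be made concrete with a short enumeration. This is a minimal sketch: the function name `nd_range` and its tuple layout are assumptions, and the iteration order is arbitrary, since the model lets work-groups run in any order. The one relation taken as given is that a global ID equals group ID times local size plus local ID in each dimension.

```python
from itertools import product

def nd_range(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same dimensions")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of local size")
    groups = [g // l for g, l in zip(global_size, local_size)]
    for group_id in product(*(range(n) for n in groups)):
        for local_id in product(*(range(l) for l in local_size)):
            # global ID = group ID * local size + local ID, per dimension
            global_id = tuple(
                g * l + i for g, l, i in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id

items = list(nd_range((8,), (4,)))
```

With a global size of 8 and a local size of 4, the ND-Range holds two work-groups of four work-items each.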
9. The Southern Islands Disassembler
Disassembler → Emulator → Timing simulator → Visual tool
10. The Southern Islands Disassembler
Vector Addition Kernel
• Source code
__kernel void vector_add(
    __read_only __global int *src1,
    __read_only __global int *src2,
    __write_only __global int *dst)
{
    int id = get_global_id(0);
    dst[id] = src1[id] + src2[id];
}
Annotations on the assembly listing: scalar instructions, vector instructions, scalar registers, vector registers, the loads, the addition, the store
11. The Southern Islands Disassembler
Conditional Statements
• Source code
__kernel void if_kernel(
    __global int *v)
{
    uint id = get_global_id(0);
    if (id < 5)
        v[id] = 10;
}
• Assembly code
─ The comparison.
─ Save active mask.
─ Store value 10.
─ Restore active mask.
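The four annotated steps (compare, save the active mask, store, restore the mask) can be mimicked in Python for a single 64-wide wavefront. This is a sketch of the masking idea only: the function `run_if_kernel` and the use of a Python integer as a 64-bit mask are assumptions, and no real Southern Islands instruction is modeled.

```python
WAVEFRONT_SIZE = 64

def run_if_kernel(v, first_id=0):
    """Emulate `if (id < 5) v[id] = 10;` for one wavefront with an active mask."""
    exec_mask = (1 << WAVEFRONT_SIZE) - 1            # all lanes active
    ids = [first_id + lane for lane in range(WAVEFRONT_SIZE)]

    # The comparison: one condition bit per active lane.
    cond = 0
    for lane, work_item_id in enumerate(ids):
        if (exec_mask >> lane) & 1 and work_item_id < 5:
            cond |= 1 << lane

    saved_mask = exec_mask                           # save active mask
    exec_mask &= cond                                # keep lanes that take the branch

    # Store value 10, only in lanes that are still active.
    for lane, work_item_id in enumerate(ids):
        if (exec_mask >> lane) & 1:
            v[work_item_id] = 10

    exec_mask = saved_mask                           # restore active mask
    return exec_mask

v = [0] * WAVEFRONT_SIZE
final_mask = run_if_kernel(v)
```

All 64 lanes step through the same instructions; the mask decides which lanes are allowed to write, which is why no per-work-item branch is needed.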
12. The Southern Islands Emulator
Disassembler → Emulator → Timing simulator → Visual tool
13. The Southern Islands Emulator
Program Loading
• Responsible entity
─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.
─ Natively, it writes to hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.
• Setup
─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary
─ Initial global memory image, copying global buffers from CPU to GPU memory
─ Kernel arguments
─ ND-Range topology, including number of dimensions and sizes
Diagram: user application → (API call) → runtime library → (ABI call) → device driver → (internal interface) → hardware
14. The Southern Islands Emulator
Emulation Loop
• Work-group execution
─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.
─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.
• Wavefront execution
─ Wavefronts within a work-group can also execute in any order, as long as synchronization points are respected.
─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
Flowchart: split the ND-Range into work-groups (work-group pool). While any work-groups are left, grab a work-group and split it into wavefronts (wavefront pool). While any wavefront is left, for each running wavefront not stalled in a barrier: read the instruction at the PC, emulate it (update memory and registers), and advance the PC. End when no work-groups are left.
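The chosen policy (one work-group at a time, one wavefront at a time until a barrier) can be sketched as nested loops. This is an illustration under assumptions: `emulate_nd_range` is a hypothetical name, instructions are plain strings, "emulating" an instruction just records it, and all wavefronts of a work-group are assumed to reach the same barriers.

```python
def emulate_nd_range(num_work_groups, wavefronts_per_group, program):
    """Return the order in which (work_group, wavefront, instruction) run.

    `program` is a list of instruction names; "barrier" synchronizes the
    wavefronts of one work-group.
    """
    trace = []
    for wg in range(num_work_groups):                  # one work-group at a time
        pcs = [0] * wavefronts_per_group               # one PC per wavefront
        while any(pc < len(program) for pc in pcs):    # any wavefront left?
            for wf in range(wavefronts_per_group):     # one wavefront at a time
                while pcs[wf] < len(program):
                    instr = program[pcs[wf]]           # read instruction @PC
                    trace.append((wg, wf, instr))      # emulate (here: record)
                    pcs[wf] += 1                       # advance PC
                    if instr == "barrier":
                        break                          # wait for the other wavefronts
    return trace

trace = emulate_nd_range(1, 2, ["load", "barrier", "store"])
```

In the resulting trace both wavefronts reach the barrier before either one runs the store, which is the only ordering constraint the emulator has to honor.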
15. The Southern Islands Timing Simulator
Disassembler → Emulator → Timing simulator → Visual tool
16. The Southern Islands Timing Simulator
The GPU Architecture
─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.
Diagram: command processor → ultra-threaded dispatcher → compute units 0 to 31, each with an L1 cache → crossbar → main memory hierarchy (L2 caches, memory controllers, video memory)
17. The Southern Islands Timing Simulator
The Compute Unit
─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one scalar unit, one vector memory unit, one branch unit, and one LDS (local data share) unit.
─ There are multiple instances of SIMD units.
Diagram: the front-end, fed by the instruction memory, issues to the scalar unit, the branch unit, SIMD units 0, 1, 2, ..., the vector memory unit (global memory), and the LDS unit (local memory)
18. The Southern Islands Timing Simulator
The Front-End
─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ The fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.
Diagram: wavefront pools → Fetch → fetch buffers (one per wavefront pool) → Issue → issue buffers: one SIMD issue buffer per matching wavefront pool, plus scalar unit, branch unit, vector memory unit, and LDS unit issue buffers
19. The Southern Islands Timing Simulator
The SIMD Unit
─ The SIMD unit runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
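The split of a wavefront into subwavefronts is simple index arithmetic, sketched below. The function names `subwavefronts` and `lane_schedule` are assumptions made for this illustration; the sizes (64 work-items, 16 lanes) come from the slide.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront (maximum)
NUM_LANES = 16        # stream cores in one SIMD unit

def subwavefronts(wavefront_size=WAVEFRONT_SIZE, lanes=NUM_LANES):
    """Split a wavefront's work-item indices into groups of `lanes` items."""
    return [
        list(range(start, min(start + lanes, wavefront_size)))
        for start in range(0, wavefront_size, lanes)
    ]

def lane_schedule(lane, wavefront_size=WAVEFRONT_SIZE, lanes=NUM_LANES):
    """Work-items handled by one SIMD lane, one per consecutive cycle."""
    return [
        sub[lane]
        for sub in subwavefronts(wavefront_size, lanes)
        if lane < len(sub)
    ]

lane1 = lane_schedule(1)
lane15 = lane_schedule(15)
```

Each lane therefore serves one work-item from every subwavefront, so a full wavefront occupies the execute stage for 4 consecutive cycles.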
20. The Southern Islands Timing Simulator
The SIMD Unit
Diagram: from the compute unit front-end, instructions flow through issue buffer → Decode → decode buffer → Read (vector/scalar register file) → read buffer → Execute (pipelined functional units) → execute buffer → Write (vector register file) → write buffer → Complete
─ SIMD lane 0: work-items 0, 16, 32, 48
─ SIMD lane 1: work-items 1, 17, 33, 49
─ ...
─ SIMD lane 15: work-items 15, 31, 47, 63
21. The Southern Islands Timing Simulator
The Scalar Unit
─ Runs both arithmetic-logic and memory scalar instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
Diagram: from the compute unit front-end, instructions flow through issue buffer → Decode → decode buffer → Read → read buffer → Execute or Memory → execute buffer → Write → write buffer → Complete. The diagram also shows the vector register file and the scalar register file.
22. The Southern Islands Timing Simulator
The Vector Memory Unit
─ Runs vector memory instructions
─ Modeled with 5 stages: decode, read, memory, write, complete
─ Accesses to the global memory hierarchy happen mainly in this unit
Diagram: from the compute unit front-end, instructions flow through issue buffer → Decode → decode buffer → Read (vector register file) → read buffer → Memory (global memory) → memory buffer → Write (vector register file) → write buffer → Complete
23. The Southern Islands Timing Simulator
The Branch Unit
─ Runs branch instructions, which decide whether to make an entire wavefront jump to a target address depending on the scalar condition code
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
Diagram: from the compute unit front-end, instructions flow through issue buffer → Decode → decode buffer → Read → read buffer → Execute → execute buffer → Write → write buffer → Complete. The diagram also shows the scalar register file, which holds the condition codes and the program counter.
24. The Southern Islands Timing Simulator
The Local Data Share (LDS) Unit
─ Runs local memory access instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
─ The memory stage accesses the compute unit local memory for read/write
Diagram: from the compute unit front-end, instructions flow through issue buffer → Decode → decode buffer → Read (vector register file) → read buffer → Memory (local memory) → memory buffer → Write (vector register file) → write buffer → Complete
25. The Southern Islands Timing Simulator
Global Memory Hierarchy
Diagram: compute units CU 0 to CU 31, each with its own L1; one scalar cache per group of four compute units; an interconnect between these caches and the L2 banks (bank 0, bank 1, ...)
─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU
─ One 16KB data L1 per compute unit
─ One scalar L1 cache shared by every 4 compute units
─ Six L2 banks of 128KB each, with each bank connected to a DRAM module
26. The Southern Islands Visual Tool
Disassembler → Emulator → Timing simulator → Visual tool
27. The Southern Islands Visual Tool
Main Window and Timing Diagram
─ The main window provides cycle-by-cycle navigation throughout simulation.
─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.
─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.
28. Ongoing Projects
CPU-GPU Cache Coherence Protocol
─ MOESI protocol extended with an additional non-coherent write state: NMOESI
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses. The rest issue regular store accesses.
Diagram: ARM, x86, Evergreen (Evg.), Southern Islands (S.I.), and Fermi cores connect through the NMOESI interface to L1 caches, which reach L2 banks 0, 1, ... through an interconnect
29. Ongoing Projects
Cooperative Execution of Work-Groups
• Attained concurrency
─ x86 cores run both the host program and a portion of the ND-Range
─ Idle regions are removed during the execution of the ND-Range
• Work-group mapping
─ Portions of the ND-Range are executed by CPU/GPU cores with different ISAs
─ Work-groups are mapped to CPU cores or GPU compute units as they become available
Diagram: work-groups WG-0, WG-1, WG-2, WG-3, ..., WG-N of the ND-Range are mapped onto the hardware (x86 cores and S.I. compute units). On the timeline, the x86 cores run the host program plus the kernel, and the S.I. compute units run the kernel.
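Mapping work-groups to whichever core or compute unit frees up first can be sketched with a priority queue. This is an illustration under assumptions, not the scheduler used in the project: the function `map_work_groups`, the device names, and the per-work-group costs are all made up for the example.

```python
import heapq

def map_work_groups(num_work_groups, device_costs):
    """Greedily hand each work-group to the device that becomes free first.

    `device_costs` maps a device name to the time it needs for one
    work-group (arbitrary units, invented for this sketch).
    """
    ready = [(0, name) for name in sorted(device_costs)]   # (time free, device)
    heapq.heapify(ready)
    assignment = {name: [] for name in device_costs}
    finish = 0
    for wg in range(num_work_groups):
        free_at, name = heapq.heappop(ready)                # earliest available
        assignment[name].append(wg)
        done = free_at + device_costs[name]
        finish = max(finish, done)
        heapq.heappush(ready, (done, name))
    return assignment, finish

assignment, finish = map_work_groups(6, {"x86-0": 4, "si-0": 1})
```

Because no device waits while work-groups remain, idle regions disappear: in this example the slower x86 core contributes one work-group while the faster compute unit handles the other five.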
30. Ongoing Projects
Multi2C: An OpenCL/CUDA Kernel Compiler
─ LLVM-based compiler for OpenCL and CUDA kernels
─ Future release Multi2Sim 4.2 will include a working version
─ Diagrams show progress as per SVN Rev. 1838
• Color legend
─ Yellow = under development
─ Blue = arriving soon
─ Green = supported
• Compilation flow
─ vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm
─ vec-add.cu → CUDA to LLVM front-end → vec-add.llvm
─ vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin
─ vec-add.llvm → LLVM to Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin
─ vec-add.llvm → LLVM to Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin
31. Ongoing Projects
Simulation of OpenGL Pipelines
• Goal
─ Leverage our Southern Islands pipeline models to
execute OpenGL vertex and fragment shaders
• Steps
─ Develop a runtime library to link with guest programs, implementing the OpenGL, GLUT and GLEW library APIs
─ Reverse engineer AMD's OpenGL binary format to decode embedded metadata
─ Timing model of the OpenGL pipeline
• New capabilities targeted
─ Timing simulation of other critical GPU components, such as rasterizer
─ Concurrency evaluation of compute + graphics pipelines
32. The Multi2Sim Community
Academic Efforts at Northeastern
• The “GPU Programming and Architecture” course
─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
─ Average attendance of ~25 students per semester.
• Undergraduate directed studies
─ Official alternative equivalent to a 4-credit course in which an undergraduate student can optionally enroll, collaborating with Multi2Sim development
• Graduate-level development
─ Multiple ongoing PhD theses using Multi2Sim as support tool
─ All related development becomes openly available through the SVN repo
33. The Multi2Sim Community
Academic Publications
• Conference papers
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012
• Tutorials
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013
34. The Multi2Sim Community
www.multi2sim.org