ITRI
Industrial Technology
Research Institute
Heterogeneous System Architecture
(HSA) Design
王振傑 (Jay Wang)
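Division for Embedded System and SoC Technology, System Architecture Design Department (D200)
Information and Communications Research Laboratories (ICL)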
ccwang.jay@itri.org.tw
2015-04-30
2
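Embedded System Hardware Technology Department (D100)
System Architecture Design Department (D200)
Embedded System Software Technology Department (D300)
Intelligent Electronics Industry Promotion Department (D400)
System Integration and Application Department (D500)
Division for Embedded System and SoC Technology
Industrial Technology Research Institute (ITRI)
Information and Communications Research Laboratories (ICL)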
HSA Design (2015-04-30) @ NCKU, Tainan
What is HSA?
3
An intelligent computing architecture that enables CPU, GPU and other
processors to work in harmony on a single piece of silicon by seamlessly
moving the right tasks to the best suited processing element.
Three Eras of Processor Performance
4
Single-Core Era (single-thread performance over time)
• Enabled by: Moore's Observation, voltage scaling, micro-architecture
• Constrained by: power, complexity
• Programming: Assembly → C/C++ → Java …
Multi-Core Era (throughput performance over time, by number of processors)
• Enabled by: Moore's Observation, desire for throughput, 20 years of SMP architecture
• Constrained by: power, parallel software availability, scalability
• Programming: pthreads → OpenMP / TBB …
Heterogeneous Systems Era (modern application performance over time, by data-parallel exploitation)
• Enabled by: Moore's Observation, abundant data parallelism, power-efficient data-parallel processing (GPUs)
• Constrained by: programming models, communication overheads
• Programming: Shader → CUDA → OpenCL → C++ and Java
[Figure: one performance-versus-time curve per era, each marked "we are here".]
SOURCE: HSA INTRODUCTION, HSA FOUNDATION (PHIL ROGERS, AMD)
HSA Foundation
5
• Founded in June 2012
• www.hsafoundation.com
• Developing a new platform for heterogeneous systems
• Launched the official v1.0 specification set in March 2015
HSA Foundation Members (April 2015)
6
Founders
Promoters
Contributors
Academics
Supporters
HSA Platform Model
7
In an HSA system, a regular device is called an HSA agent; an HSA agent that can also run kernels is called an HSA kernel agent.
[Figure: HSA platform model. The host CPU (OS, HSA runtime) is an HSA agent that handles serial and task-parallel workloads. An HSA kernel agent contains several compute units (CUs); each CU is made up of lanes (processing elements) that run SIMD data-parallel workloads. The wavefront size is a power of 2 in the range from 1 to 256 inclusive. Jay Wang, Taiwan, 2015.03]
HSA Intermediate Language (HSAIL)
8
The HSA Foundation members are building a heterogeneous compute software ecosystem
built on open, royalty-free industry standards and open-source software: the HSA
runtimes and compilation tools are based on open-source technologies such as LLVM and
GCC. ( https://github.com/HSAFoundation )
[Figure: HSAIL software stack. Parallel programming languages (OpenCL, C++ AMP, Java, OpenMP, DSLs) and HSA runtime libraries target the HSA Intermediate Language (HSAIL), a virtual parallel ISA; CLOC compiles OpenCL kernels to HSAIL. Below HSAIL, one finalizer per vendor (Company A CPU, Company B CPU, Company C GPU, Company D GPU, Company E DSP, ...) targets that vendor's CPUs, GPU, DSP, or other hardware accelerator. Jay Wang, Taiwan, 2014.10]
HSAIL Programming Model
9
HSA Runtime Stack
10
[Figure: HSA runtime stack. On the host CPU, a user application (CPU code + HSAIL kernel code) runs as an HSA application (an HSA agent) on top of a language runtime (for example, the OpenCL runtime), the HSA runtime, the HSA finalizers (which produce the target ISA), and the HSA kernel mode driver. It reaches the HSA kernel agents (CPU, GPU, DSP) through HSA user mode queuing (Architected Queuing Language) plus HSA signaling. Jay Wang, Taiwan, 2015.04]
Kernel Execution
11
HSA Memory Consistency Model
(Relaxed Model)
Rows give the first operation, columns the second operation. The yes/no entries read consistently as "may the two operations be reordered?".

Second operation (column groups):
  (a) ld_rlx
  (b) st_rlx
  (c) atomic_rlx, atomicNoRet_rlx
  (d) atomic_acq, atomicNoRet_acq, fence_acq
  (e) atomic_rel, atomicNoRet_rel, fence_rel
  (f) atomic_ar, atomicNoRet_ar, fence_ar

First operation                          (a)  (b)  (c)  (d)  (e)  (f)
ld_rlx or st_rlx                         yes  yes  yes  yes  no   no
atomic_rlx, atomicNoRet_rlx              yes  yes  yes  no   no   no
atomic_acq, atomicNoRet_acq, fence_acq   no   no   no   no   no   no
atomic_rel, atomicNoRet_rel              yes  yes  no   no   no   no
fence_rel                                yes  no   no   no   no   no
atomic_ar, atomicNoRet_ar, fence_ar      no   no   no   no   no   no
12
Memory orders illustrated on the slide: relaxed, acquire, release, acq_rel.
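The table above can be turned into a small lookup, which makes it easier to check a given pair of operations. This is only a sketch that transcribes the slide's yes/no entries; the names `COLUMN_OF`, `ROWS`, and `table_entry` are made up for illustration, and the grouping of the column header by suffix (rlx, acq, rel, ar) is an assumption about how the slide's header was laid out.

```python
# Column groups of the "second operation" header, in slide order (assumed grouping).
COLUMNS = ["ld_rlx", "st_rlx", "atomic_rlx", "acq", "rel", "ar"]

# Map every operation name on the slide to its column group.
COLUMN_OF = {
    "ld_rlx": "ld_rlx", "st_rlx": "st_rlx",
    "atomic_rlx": "atomic_rlx", "atomicNoRet_rlx": "atomic_rlx",
    "atomic_acq": "acq", "atomicNoRet_acq": "acq", "fence_acq": "acq",
    "atomic_rel": "rel", "atomicNoRet_rel": "rel", "fence_rel": "rel",
    "atomic_ar": "ar", "atomicNoRet_ar": "ar", "fence_ar": "ar",
}

# One entry per table row: the first operations it covers and its six cells.
ROWS = {
    ("ld_rlx", "st_rlx"): "yes yes yes yes no no",
    ("atomic_rlx", "atomicNoRet_rlx"): "yes yes yes no no no",
    ("atomic_acq", "atomicNoRet_acq", "fence_acq"): "no no no no no no",
    ("atomic_rel", "atomicNoRet_rel"): "yes yes no no no no",
    ("fence_rel",): "yes no no no no no",
    ("atomic_ar", "atomicNoRet_ar", "fence_ar"): "no no no no no no",
}

# Expand the rows into {first_op: {column_group: bool}}.
TABLE = {
    op: dict(zip(COLUMNS, (cell == "yes" for cell in cells.split())))
    for ops, cells in ROWS.items()
    for op in ops
}


def table_entry(first, second):
    """Return True where the slide's table says "yes" for (first, second)."""
    return TABLE[first][COLUMN_OF[second]]
```

For example, `table_entry("fence_rel", "st_rlx")` looks up the fence_rel row under the st_rlx column.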
System Arch. Requirements
1. Shared Virtual Memory
2. Cache Coherency Domains
3. Flat Addressing
4. Endianness
5. Signaling and Synchronization
6. Atomic Memory Operations
7. HSA System Timestamp
8. User Mode Queuing
9. Architected Queuing Language (AQL)
10. Agent Scheduling
11. Kernel Agent Context Switching
12. IEEE754-2008 Floating Point Exceptions
13. Kernel Agent Hardware Debug Infrastructure
14. HSA Platform Topology Discovery
15. Images
13
@ HSA PLATFORM SYSTEM ARCHITECTURE SPECIFICATION, VERSION 1.0 FINAL (2015-03-16)
Legacy GPU Compute
• Multiple memory pools and address spaces
• Data copies before/after GPU compute
14
[Figure: legacy GPU compute. The host CPUs (HSA agent) use system memory through virtual memory #1, while the GPU (HSA kernel agent) uses GPU memory through virtual memory #2. Numbered arrows 1, 2, and 3 run between the two sides. Jay Wang, Taiwan, 2015.04]
Shared Virtual Memory (HSA)
15
[Figure: the host CPUs (HSA agent) and the GPU (HSA kernel agent) share one virtual memory that covers system memory and GPU memory. The figure also shows the MMU, the IOMMU, and the OS page table. A 32-bit HSA system uses 32-bit virtual addresses; a 64-bit HSA system uses virtual addresses of at least 48 bits. Jay Wang, Taiwan, 2015.04]
HSA Memory Hierarchy
16
Segments:
1) Global
2) Group
3) Private
4) Kernarg
5) Readonly
6) Spill
7) Arg
Local registers per work-item:
• 1-bit control registers (c registers): $c0 to $c7
• 32-bit registers (s registers): $s0 to $s127
• 64-bit registers (d registers): $d0 to $d63
• 128-bit registers (q registers): $q0 to $q31
[Figure: a kernel dispatch grid is divided into work-groups of work-items (WI). Each work-item has a private segment, each work-group has a group segment, and the HSA agent has a global segment; the private, group, and global segments all lie within the flat address space. A virtual address range reservation maps to system memory or device local memory. Jay Wang, Taiwan, 2014.10]
Cache Coherency Domains
17
[Figure: the host CPUs and the HSA kernel agent reach system memory through caches, with coherency maintained between them. The segment layout (private, group, and global segments within the flat address space of a kernel dispatch grid) is the same as in the memory hierarchy figure. Jay Wang, Taiwan, 2015.04]
Signaling and Synchronization
• The required mechanisms for HSAIL and the HSA runtime are:
  • Allocate/destroy an HSA signal
  • Read the current HSA signal value
  • Wait on an HSA signal to meet a specified condition (with a maximum wait duration requested)
  • Send an HSA signal value
  • Atomic read-modify-write of an HSA signal value
19
[Figure: an HSA signal consists of a signal handle (hsa_signal_t), a signal value (hsa_signal_value_t; Sig32 or Sig64), and implementation-defined data. The host CPU accesses it through HSA runtime APIs and the HSA kernel agent through HSAIL instructions. For comparison, the slide lists the POSIX calls sem_init(), sem_wait(), sem_post(), sem_destroy() and pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_mutex_destroy(). Jay Wang, Taiwan, 2015.04]
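The five required signal mechanisms (allocate/destroy, read, wait on a condition with a maximum duration, send, atomic read-modify-write) can be mimicked on a host with ordinary threads. The sketch below is a toy model in Python, not the HSA runtime API: the class `Signal`, its method names, and `run_demo` are hypothetical, and a condition variable stands in for whatever hardware mechanism a real implementation would use.

```python
import threading


class Signal:
    """Toy stand-in for an HSA signal: a shared integer plus wait/notify."""

    def __init__(self, initial_value):
        self._value = initial_value
        self._cond = threading.Condition()

    def load(self):
        # Read the current signal value.
        with self._cond:
            return self._value

    def store(self, value):
        # Send a new signal value and wake any waiters.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def add(self, delta):
        # Atomic read-modify-write: the lock makes read+update indivisible.
        with self._cond:
            self._value += delta
            self._cond.notify_all()
            return self._value

    def wait(self, predicate, timeout=None):
        # Block until predicate(value) holds or the timeout expires,
        # then return the value observed at that point.
        with self._cond:
            self._cond.wait_for(lambda: predicate(self._value), timeout)
            return self._value


def run_demo():
    # A "completion" signal starts at 1; the worker decrements it when done.
    completion = Signal(1)
    worker = threading.Thread(target=lambda: completion.add(-1))
    worker.start()
    observed = completion.wait(lambda v: v == 0, timeout=5.0)
    worker.join()
    return observed
```

The wait returns the observed value rather than a success flag, so a caller can tell a timeout apart from a satisfied condition by re-checking the predicate.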
HSA Runtime APIs for Signaling
20
HSA Runtime APIs ( for HSA application )
• hsa_signal_create ( )
• hsa_signal_destroy ( )
• hsa_signal_load_{acquire, relaxed} ( )
• hsa_signal_store_{relaxed, release} ( )
• hsa_signal_exchange_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_cas_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_add_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_subtract_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_and_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_or_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_xor_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_wait_{acquire, relaxed} ( )
HSA Runtime Programmer’s Reference Manual (v1.0)
2.4 Signals
HSAIL Instructions for Signaling
21
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model,
Compiler Writer’s Guide, and Object Format (BRIG) (v1.0)
6.8 Notification (signal) Instructions
Atomic Memory Operations
• HSA requires the following standard atomic memory operations to be supported by HSA Kernel Agents (other HSA Agents only need to support the subset of these operations required by their role in the system):
  • Load from memory
  • Store to memory
  • Fetch from memory, apply a logic operation (bitwise AND/OR/XOR) with one additional operand, and store back
  • Fetch from memory, apply an integer arithmetic operation (add, subtract, increment, decrement, minimum, maximum) with one additional operand, and store back
  • Exchange memory location with operand
  • Compare-and-swap (CAS): load the memory location, compare it with the first operand, and if equal, store the second operand back to the memory location
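The compare-and-swap semantics can be illustrated with a small model. Python has no hardware atomics, so the sketch below uses a lock to make each operation indivisible; `AtomicCell` and `atomic_max` are hypothetical names, and the CAS loop shows one common way to build another read-modify-write operation (here, maximum) on top of CAS.

```python
import threading


class AtomicCell:
    """A memory location whose operations are made indivisible by a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, desired):
        # Load the location, compare with `expected`, and store `desired`
        # only if they are equal. Returns the value that was observed.
        with self._lock:
            observed = self._value
            if observed == expected:
                self._value = desired
            return observed


def atomic_max(cell, operand):
    """Raise the cell to at least `operand` with a CAS retry loop.

    Returns the value the cell held just before this call took effect.
    """
    while True:
        current = cell.load()
        if operand <= current:
            return current  # nothing to do, the cell is already large enough
        if cell.compare_and_swap(current, operand) == current:
            return current  # our swap won
        # Another thread changed the cell in between: retry with the new value.
```

A failed CAS leaves the location unchanged, which is what makes the retry loop safe.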
22
HSA System Timestamp
• The HSA system provides a low-overhead mechanism for determining the passing of time.
• A system timestamp is required that can be read from HSAIL or through the HSA runtime.
• The system timestamp frequency can also be determined through the HSA runtime.
[Figure: a 64-bit timestamp with a timestamp frequency of 1 to 400 MHz. The host CPU reads the timestamp and its frequency through HSA runtime APIs; the HSA kernel agent reads the timestamp with the HSAIL clock instruction. Jay Wang, Taiwan, 2015.04]
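Given two timestamp readings and the timestamp frequency, elapsed time is simply the tick difference divided by the frequency. A minimal sketch, assuming a 64-bit counter that may wrap around; the function name `ticks_to_seconds` and its parameters are made up for illustration and are not part of the HSA runtime.

```python
TIMESTAMP_BITS = 64  # the slide shows a 64-bit timestamp


def ticks_to_seconds(start_ticks, end_ticks, frequency_hz):
    """Convert two timestamp readings into elapsed seconds.

    The difference is taken modulo 2**64 so that a single counter
    wraparound between the two readings is still handled.
    """
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    elapsed_ticks = (end_ticks - start_ticks) % (1 << TIMESTAMP_BITS)
    return elapsed_ticks / frequency_hz
```

With a 100 MHz timestamp, for instance, 100,000 ticks correspond to one millisecond.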
23
User Mode Queuing
• Multiple user-level command queues
• Runtime-allocated
• Architected Queuing Language (AQL)
25
[Figure: user mode queues in the HSA runtime stack. On the CPU, the user application, language runtime (for example, the OpenCL runtime), HSA runtime, HSA finalizers, and HSA kernel mode driver form the HSA application (an HSA agent). AQL kernel dispatch queues carry kernel dispatch packets (K) and AQL agent dispatch queues carry agent dispatch packets (A) between the CPU and the HSA kernel agents (CPU, GPU). Jay Wang, Taiwan, 2015.04]
HSA Packet Processor
26
User mode queue structure (hsa_queue_t), fields shown at offsets 0x00 to 0x24: type, features, base_address, doorbell_signal, size, reserved (must be 0), id. The queue also has a 64-bit write_index and a 64-bit read_index.
• The queue is a ring buffer of AQL packets (64 bytes each).
• Read position: base_address + ((read_index % size) * AQL packet size)
• Write position: base_address + ((write_index % size) * AQL packet size)
• Supports single or multiple producers
• Supports KERNEL_DISPATCH and/or AGENT_DISPATCH packets
Jay Wang, Taiwan, 2015.03
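The ring buffer addressing on this slide, base_address + ((index % size) * AQL packet size), is easy to check with a few numbers. The sketch below is illustrative only: `packet_address` follows the slide's formula, while `is_full` is an added assumption that read_index and write_index only ever increase, so their difference is the number of packets still in the queue.

```python
AQL_PACKET_SIZE = 64  # bytes, as stated on the slide


def packet_address(base_address, index, queue_size):
    """Address of the slot that a read or write index currently maps to."""
    return base_address + (index % queue_size) * AQL_PACKET_SIZE


def is_full(read_index, write_index, queue_size):
    """True when every slot holds a packet that has not been consumed yet.

    Assumes both indices grow monotonically and never wrap in practice.
    """
    return write_index - read_index >= queue_size
```

With a queue of 8 packets at base address 0x1000, index 9 wraps around to slot 1, which is at 0x1040.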
User Mode Queue Operations
[Figure: the HSA application on the CPU (user application, language runtime such as the OpenCL runtime, and HSA runtime; an HSA agent) and the GPU (HSA kernel agent) both operate on kernel dispatch (K) and agent dispatch (A) queues. Jay Wang, Taiwan, 2015.04]
HSA Runtime APIs ( for HSA application )
• hsa_queue_create ( )
• hsa_soft_queue_create ( )
• hsa_queue_destroy ( )
• hsa_queue_inactivate ( )
• hsa_queue_load_write_index_{acquire, relaxed} ( )
• hsa_queue_store_write_index_{relaxed, release} ( )
• hsa_queue_cas_write_index_{acq_rel, acquire, relaxed, release} ( )
• hsa_queue_add_write_index_{acq_rel, acquire, relaxed, release} ( )
• hsa_queue_load_read_index_{acquire, relaxed} ( )
• hsa_queue_store_read_index_{relaxed, release} ( )
27
HSAIL Instructions ( for HSA Kernel Agent)
• queueid_u32 dest
• queueptr_uLength dest
• ldqueuewriteindex_segment_order_u64 dest, address
• stqueuewriteindex_segment_order_u64 address, src
• casqueuewriteindex_segment_order_u64 dest, address, src0, src1
• addqueuewriteindex_segment_order_u64 dest, address, src
• ldqueuereadindex_segment_order_u64 dest, address
• stqueuereadindex_segment_order_u64 address, src
AQL Packet Types
28
• Kernel Dispatch Packet
• Agent Dispatch Packet
• Barrier-AND / Barrier-OR Packet
Each packet occupies offsets 0x00 to 0x3C and starts with a 16-bit header; the individual layouts are covered on the following slides.
• completion_signal: HSA signaling object handle used to indicate completion of the job.
Packet header (16 bits):
  bits 0-7    format (8-bit)
  bit 8       barrier (1-bit)
  bits 9-10   acquire_fence_scope (2-bit)
  bits 11-12  release_fence_scope (2-bit)
  bits 13-15  reserved (3-bit)
AQL_FORMAT values:
  0 VENDOR_SPECIFIC
  1 INVALID
  2 KERNEL_DISPATCH
  3 BARRIER_AND
  4 AGENT_DISPATCH
  5 BARRIER_OR
Jay Wang, Taiwan, 2015.03
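The 16-bit packet header can be packed and unpacked with plain shifts and masks. The sketch below assumes the field order shown on the slide, starting from bit 0: format (8 bits), barrier (1 bit), acquire_fence_scope (2 bits), release_fence_scope (2 bits), reserved (3 bits). The helper names `pack_header` and `unpack_header` are hypothetical.

```python
AQL_FORMAT = {
    "VENDOR_SPECIFIC": 0,
    "INVALID": 1,
    "KERNEL_DISPATCH": 2,
    "BARRIER_AND": 3,
    "AGENT_DISPATCH": 4,
    "BARRIER_OR": 5,
}


def pack_header(packet_format, barrier, acquire_scope, release_scope):
    """Build the 16-bit AQL header; the reserved bits 13-15 stay zero."""
    for scope in (acquire_scope, release_scope):
        if not 0 <= scope <= 3:
            raise ValueError("fence scopes are 2-bit fields (0..3)")
    return (
        AQL_FORMAT[packet_format]          # bits 0-7
        | (int(bool(barrier)) << 8)        # bit 8
        | (acquire_scope << 9)             # bits 9-10
        | (release_scope << 11)            # bits 11-12
    )


def unpack_header(header):
    """Split a 16-bit header back into its fields."""
    return {
        "format": header & 0xFF,
        "barrier": (header >> 8) & 0x1,
        "acquire_fence_scope": (header >> 9) & 0x3,
        "release_fence_scope": (header >> 11) & 0x3,
    }
```

Packing a KERNEL_DISPATCH header with the barrier bit set and both fence scopes equal to 2 gives 0x1502.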
Kernel Dispatch Packet
29
[Figure: kernel dispatch packet layout, offsets 0x00 to 0x3C, 32 bits per row. Jay Wang, Taiwan, 2015.03]
Fields:
• header
• dimensions (2-bit)
• workgroup_size_x, workgroup_size_y, workgroup_size_z: work-group size
• grid_size_x, grid_size_y, grid_size_z: grid size
• private_segment_size_bytes, group_segment_size_bytes: segment size
• kernel_object: pointer to the kernel
• kernarg_address: pointer to the arguments
• completion_signal
• reserved fields
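The grid size and work-group size fields of the kernel dispatch packet together determine how many work-groups a dispatch creates. A minimal sketch, assuming grid sizes are given in work-items and that a grid size that is not a multiple of the work-group size yields a partial last work-group; the function names are made up for illustration.

```python
def workgroups_per_dimension(grid_size, workgroup_size):
    """Number of work-groups along x, y, z (ceiling division per dimension)."""
    return tuple(-(-g // w) for g, w in zip(grid_size, workgroup_size))


def total_workgroups(grid_size, workgroup_size):
    """Total number of work-groups in the dispatch."""
    total = 1
    for count in workgroups_per_dimension(grid_size, workgroup_size):
        total *= count
    return total
```

A 1024 x 768 x 1 grid with 16 x 16 x 1 work-groups, for example, splits into 64 x 48 x 1 work-groups.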
Agent Dispatch Packet
30
[Figure: agent dispatch packet layout, offsets 0x00 to 0x3C, 32 bits per row. Jay Wang, Taiwan, 2015.03]
Fields:
• header
• type: the function to be performed by the destination agent. The function codes are application defined.
• return_address: pointer to the location in which to store the function return value(s)
• arg0, arg1, arg2, arg3: 64-bit direct or indirect arguments
• completion_signal
• reserved fields
Barrier-AND / Barrier-OR Packet
• The Barrier packet defines dependencies for the HSA Packet Processor to monitor.
• The HSA Packet Processor will not launch any further packets until the Barrier-AND / Barrier-OR packet is complete.
31
[Figure: Barrier-AND / Barrier-OR packet layout, offsets 0x00 to 0x3C, 32 bits per row. Jay Wang, Taiwan, 2015.03]
Fields:
• header
• dep_signal0 to dep_signal4: handles for dependent signaling objects to be evaluated by the packet processor
• completion_signal
• reserved fields
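How the packet processor might evaluate the dependent signals of a barrier packet can be sketched in a few lines. The sketch rests on assumptions that the slide does not state: a dependency counts as satisfied when its signal value is 0 (consistent with completion signals being decremented on completion), unused dep_signal slots are represented as None and ignored, and a packet with no dependencies at all is treated as satisfied. The function name is hypothetical.

```python
def barrier_satisfied(kind, dep_values):
    """Evaluate a Barrier-AND or Barrier-OR condition.

    kind        -- "AND" or "OR"
    dep_values  -- current values of dep_signal0..dep_signal4;
                   None marks an unused slot (assumption)
    """
    active = [value for value in dep_values if value is not None]
    if not active:
        return True  # assumption: no dependencies means nothing to wait for
    if kind == "AND":
        return all(value == 0 for value in active)
    if kind == "OR":
        return any(value == 0 for value in active)
    raise ValueError("kind must be 'AND' or 'OR'")
```

A Barrier-AND packet therefore waits for the slowest dependency, while a Barrier-OR packet proceeds as soon as the first one completes.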
Packet Process Flow
32
Launch Phase
• All preceding packets in the queue must have completed their launch phase.
• If the barrier bit in the packet header is set, then all preceding packets in the queue must have completed.
• An acquire memory fence is applied for Kernel/Agent Dispatch packets before the packet enters the active phase.
Active Phase
• Kernel Dispatch packets and Agent Dispatch packets execute on the Kernel Agent/Agent, and the active phase ends when the task completes.
• Barrier-AND and Barrier-OR packets remain in the active phase until their condition is met.
Completion Phase
• If the packet is a Barrier-AND or Barrier-OR packet, then an acquire memory fence is applied as the first step.
• After execution of the acquire fence, the memory release fence is applied.
• After the memory release fence completes, the signal specified by the completion_signal field in the AQL packet is signaled with a decrementing atomic operation.
Barrier-bit Example
33
• If the barrier bit is set, then processing of the packet will only begin when all preceding packets are complete.
[Figure: AQL packets move through the queue from enqueue to dequeue and then through the launch, active, and completion phases, ending with the completionSignal; one packet has barrier bit = 1. Jay Wang, Taiwan, 2015.04]
Barrier-AND Packet Example
34
Agent Scheduling
36
[Figure: agent scheduling. Applications #1 to #3 submit AQL packets (agent/kernel dispatch packets or Barrier-AND/OR packets) to AQL queues; together with a non-HSA task pool, these queues feed the agent scheduling of the HSA (kernel) agent. Scheduling is poked by three events: (1) task execution completed, (2) new AQL packet submission, (3) barrier packet completed. Jay Wang, Taiwan, 2015.04]
Kernel Agent Context Switching
37
1. Switch (required)
2. Preempt (required as soon as possible)
3. Terminate and context reset (terminated as fast as possible)
[Figure: AQL queues from applications #1 to #3 and a non-HSA task pool feed HSA agent scheduling. Inside the HSA kernel agent, context switching moves kernel programs and their work-groups (WG) on and off the compute units (CUs). Jay Wang, Taiwan, 2015.04]
FP Exception Reporting
• A Kernel Agent shall report certain defined exceptions related to the execution of the HSAIL code to the HSA Runtime.
39
Exception codes (IEEE754-2008):
  0 Invalid operation
  1 Divide-by-zero
  2 Overflow
  3 Underflow
  4 Inexact
Control directives:
• enabledetectexceptions #EC (DETECT policy)
• enablebreakexceptions #EC (BREAK policy)
HSAIL instructions:
• cleardetectexcept_u32
• getdetectexcept_u32
• setdetectexcept_u32
[Figure: work-groups 0 to X are split into wavefronts 0 to Y; a wavefront runs on a compute unit (CU) in SIMD (Single Instruction, Multiple Data) style, one work-item per lane (lanes 0 to N-1) under a shared PC. An exception module in the HSA kernel agent holds BreakEn bits, DetectEn bits, and status bits, and signals the exception handler in the HSA runtime on the host CPU. Jay Wang, Taiwan, 2015.04]
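The five exception codes lend themselves to a bit mask, which is presumably what the u32 operand of the detect-exception instructions carries. The sketch below assumes that exception code N corresponds to bit N of the mask; that mapping, and the helper names `mask_for` and `decode_mask`, are assumptions for illustration rather than something the slide states.

```python
EXCEPTION_NAMES = {
    0: "Invalid operation",
    1: "Divide-by-zero",
    2: "Overflow",
    3: "Underflow",
    4: "Inexact",
}


def mask_for(*codes):
    """Build a bit mask with one bit per exception code (bit N = code N)."""
    mask = 0
    for code in codes:
        if code not in EXCEPTION_NAMES:
            raise ValueError(f"unknown exception code: {code}")
        mask |= 1 << code
    return mask


def decode_mask(mask):
    """List the exception names whose bits are set, in code order."""
    return [
        name
        for code, name in sorted(EXCEPTION_NAMES.items())
        if mask & (1 << code)
    ]
```

For example, a mask with codes 1 and 2 set decodes to divide-by-zero and overflow.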
Debug Infrastructure
• The Kernel Agent shall provide mechanisms to allow system software and some select application software (for example, debuggers and profilers) to set breakpoints and collect throughput information for profiling.
40
[Figure: debuggers and profilers on the host CPU (HSA agent) connect through the HSA kernel agent debug interface to a debug module in the HSA kernel agent, which supports instruction breakpoints, memory breakpoints, and conditional breakpoints. The grid is split into work-groups and wavefronts; a wavefront runs on a compute unit in SIMD (Single Instruction, Multiple Data) style, one work-item per lane (lanes 0 to N-1) under a shared PC. Jay Wang, Taiwan, 2015.04]
Execution Environment
42
You have 2 OpenCL platform(s)
----------------------------------------------
Platform[0].Name = NVIDIA CUDA
Platform[0].Vendor = NVIDIA Corporation
Platform[0].Version = OpenCL 1.1 CUDA 4.2.1
Platform[0].Profile = FULL_PROFILE
----------------------------------------------
Platform[1].Name = Intel(R) OpenCL
Platform[1].Vendor = Intel(R) Corporation
Platform[1].Version = OpenCL 1.2
Platform[1].Profile = FULL_PROFILE
----------------------------------------------
Platform[0] has 1 device(s)
----------------------------------------------
Device[0].Type = CL_DEVICE_TYPE_GPU
Device[0].Name = GeForce GT 625
Device[0].Vendor = NVIDIA Corporation
Device[0].Version = OpenCL 1.1 CUDA
Device[0].DriverVersion = 320.49
Device[0].Profile = FULL_PROFILE
Device[0].OpenCL_C = OpenCL C 1.1
Device[0].MaxComputeUnits = 1
Device[0].MaxWiDimensions = 3
Device[0].MaxWiSize = (1024,1024,64)
Device[0].MaxWgSize = 1024
Device[0].MaxClkFrequency = 1747 MHz
Device[0].AddrSpaceSize = 32 bits
Platform[1] has 1 device(s)
----------------------------------------------
Device[0].Type = CL_DEVICE_TYPE_CPU
Device[0].Name = Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
Device[0].Vendor = Intel(R) Corporation
Device[0].Version = OpenCL 1.2 (Build 80752)
Device[0].DriverVersion = 3.0.1.15216
Device[0].Profile = FULL_PROFILE
Device[0].OpenCL_C = OpenCL C 1.2
Device[0].MaxComputeUnits = 4
Device[0].MaxWiDimensions = 3
Device[0].MaxWiSize = (1024,1024,1024)
Device[0].MaxWgSize = 1024
Device[0].MaxClkFrequency = 3100 MHz
Device[0].AddrSpaceSize = 32 bits
OpenCL APIs
HSA Platform Topology Discovery
• HSA platform resources: Agent, Memory, Compute Properties, Caches, and I/O
43
[Figure: example HSA platform topology with nodes 0 to 4. Two nodes are HSA APUs, each combining CPU cores and GPU H-CUs behind an HSA MMU, with system memory (cacheable coherent and non-cacheable non-coherent) and SBIOS/UEFI firmware; the APUs are linked by a socket interconnect. The other nodes are HSA discrete GPUs on optional add-in boards, each with H-CUs, device local memory (coherent and non-coherent), and VBIOS/UEFI GOP firmware, attached over PCIe, in one case through a PCIe bridge.]
Images
• A graphics feature that can sometimes be useful in data-parallel computing
• Used to store one-, two-, or three-dimensional images
• Predefined image formats
• Image memory is a special kind of memory access
• Dedicated hardware to speed up image operations
45
• The OpenCL™ Specification Version 2.1: 5.3 Image Objects
  https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
[Figure: an image object consists of an image handle (hsa_ext_image_handle_t), an image descriptor (image channel type, image channel order, image geometry, image data size), and image data (1D, 2D, or 3D images) held in the global segment. The figure places the image object between the HSA runtime and the HSA kernel agent, with the HSAIL instructions rdimage, ldimage, and stimage on the kernel agent side. Jay Wang, Taiwan, 2015.04]
Summary
• Programming model issues
  • HSA Intermediate Language (HSAIL) + HSA Runtime
  • Architected Queuing Language (AQL) + Signaling
  • Debug infrastructure
• Communication overhead issues
  • Cache coherent shared virtual memory (CC-SVM)
  • Architected Queuing Language (AQL) for user mode queuing
  • Hardware-assisted signaling and atomic operations for synchronization
46
[Figure: CPUs, GPU, DSP, ... sit on top of common layers: HSAIL, unified coherent memory, the HSA runtime, and AQL. Jay Wang, Taiwan, 2015.04]
[Figure: the HSA runtime stack (user application with CPU code + HSAIL kernel code, language runtime such as the OpenCL runtime, HSA runtime, HSA finalizers, HSA kernel mode driver, and the CPU, GPU, and DSP kernel agents connected by HSA user mode queuing (Architected Queuing Language) + HSA signaling), annotated with four roles: parallel application designer, HSA system software designer, HSA system architecture designer, and HSA kernel agent designer. A callout on the figure reads "Mom, I'm here!". Jay Wang, Taiwan, 2015.04]
47
• OpenCL Standards ( https://www.khronos.org/opencl/ )
• HSA Standards ( http://www.hsafoundation.com/html/HSA_Library.htm )
  • HSA Platform System Architecture Specification v1.0
  • HSA Programmer Reference Manual Specification v1.0
  • HSA Runtime Specification v1.0
• HSA Foundation Github ( https://github.com/HSAFoundation )
Taiwan HSA Group @ Facebook
48

 
Software used in Electronics and Communication
Software used in Electronics and CommunicationSoftware used in Electronics and Communication
Software used in Electronics and Communication
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
 
PowerAI Deep dive
PowerAI Deep divePowerAI Deep dive
PowerAI Deep dive
 
Speeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCCSpeeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCC
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
Intel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology OverviewIntel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology Overview
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
HSA Introduction
HSA IntroductionHSA Introduction
HSA Introduction
 

Kürzlich hochgeladen

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Kürzlich hochgeladen (20)

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 

HSA Design (2015-04-30)

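  • 1. ITRI (Industrial Technology Research Institute). Heterogeneous System Architecture (HSA) Design. 王振傑 (Jay Wang), Division for Embedded System and SoC Technology, System Architecture Design Department (D200), Information and Communications Research Laboratories (ICL). ccwang.jay@itri.org.tw. 2015-04-30
  • 2. Industrial Technology Research Institute, Information and Communications Research Laboratories. Division for Embedded System and SoC Technology: Embedded System Hardware Technology Department (D100); System Architecture Design Department (D200); Embedded System Software Technology Department (D300); Smart Electronics Industry Promotion Department (D400); System Integration and Application Department (D500).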
  • 3. HSA Design (2015-04-30) @ NCKU, Tainan. What is HSA? An intelligent computing architecture that enables the CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element.
  • 4. Three Eras of Processor Performance
      Single-Core Era (single-thread performance over time). Enabled by: Moore's Observation, voltage scaling, micro-architecture. Constrained by: power, complexity. Programming: Assembly → C/C++ → Java …
      Multi-Core Era (throughput performance over number of processors). Enabled by: Moore's Observation, desire for throughput, 20 years of SMP architecture. Constrained by: power, parallel software availability, scalability. Programming: pthreads → OpenMP / TBB …
      Heterogeneous Systems Era (modern application performance over data-parallel exploitation). Enabled by: Moore's Observation, abundant data parallelism, power-efficient data-parallel processing (GPUs). Constrained by: programming models, communication overheads. Programming: Shader → CUDA, OpenCL → C++ and Java.
      Each chart carries a "we are here" marker.
      Source: HSA Introduction, HSA Foundation (Phil Rogers, AMD).
  • 5. HSA Foundation: founded in June 2012 (www.hsafoundation.com); developing a new platform for heterogeneous systems; launched the official v1.0 specification set in March 2015.
  • 6. HSA Foundation Members (April 2015). Membership categories: Founders, Promoters, Contributors, Academics, Supporters.
  • 7. HSA Platform Model. In an HSA system, a regular device is called an HSA agent; if the HSA agent can run kernels, it is also an HSA kernel agent. Diagram labels: host CPU (OS, HSA runtime) as an HSA agent, for serial and task-parallel workloads; HSA kernel agent built from compute units (CUs), each with lanes (processing elements), for SIMD data-parallel workloads. The wavefront size is a power of 2 in the range 1 to 256 inclusive. (Jay Wang, Taiwan, 2015.03)
  • 8. HSA Intermediate Language (HSAIL). The HSA Foundation members are building a heterogeneous compute software ecosystem on open, royalty-free industry standards and open-source software: the HSA runtimes and compilation tools are based on open-source technologies such as LLVM and GCC (https://github.com/HSAFoundation). Diagram labels: parallel programming languages (OpenCL, C++ AMP, Java, OpenMP, DSL); HSA runtime libraries; HSAIL as a virtual parallel ISA; CLOC (compiles OpenCL kernels to HSAIL); finalizers (Company A CPU, Company B CPU, Company C GPU, Company D GPU, Company E DSP, ...); target hardware (Company A CPUs, Company B CPUs, Company C GPU, Company D GPU, Company E DSP, other hardware accelerators). (Jay Wang, Taiwan, 2014.10)
  • 9. HSAIL Programming Model
  • 10. HSA Runtime Stack. Diagram labels: user application (CPU code + HSAIL kernel code); language runtime (e.g. OpenCL runtime); HSA runtime; HSA finalizers (target ISA); HSA application (HSA agent) on the host CPU; HSA kernel mode driver; HSA user mode queuing (Architected Queuing Language) + HSA signaling; HSA kernel agents (CPU, GPU, DSP). (Jay Wang, Taiwan, 2015.04)
  • 11. Kernel Execution
  • 12. HSA Memory Consistency Model (Relaxed Model). Rows give the first operation, columns the second operation.
      Second-operation columns: (A) ld_rlx, st_rlx; (B) atomic_rlx, atomicNoRet_rlx; (C) atomic_acq, atomicNoRet_acq, fence_acq; (D) atomic_rel, atomicNoRet_rel; (E) fence_rel; (F) atomic_ar, atomicNoRet_ar, fence_ar.
      First operation                           A    B    C    D    E    F
      ld_rlx or st_rlx                          yes  yes  yes  yes  no   no
      atomic_rlx, atomicNoRet_rlx               yes  yes  yes  no   no   no
      atomic_acq, atomicNoRet_acq, fence_acq    no   no   no   no   no   no
      atomic_rel, atomicNoRet_rel               yes  yes  no   no   no   no
      fence_rel                                 yes  no   no   no   no   no
      atomic_ar, atomicNoRet_ar, fence_ar       no   no   no   no   no   no
      Memory orders illustrated on the slide: relaxed, acquire, release, acq_rel.
  • 13. System Arch. Requirements: 1. Shared Virtual Memory; 2. Cache Coherency Domains; 3. Flat Addressing; 4. Endianness; 5. Signaling and Synchronization; 6. Atomic Memory Operations; 7. HSA System Timestamp; 8. User Mode Queuing; 9. Architected Queuing Language (AQL); 10. Agent Scheduling; 11. Kernel Agent Context Switching; 12. IEEE754-2008 Floating Point Exceptions; 13. Kernel Agent Hardware Debug Infrastructure; 14. HSA Platform Topology Discovery; 15. Images. Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16).
  • 14. Legacy GPU Compute: multiple memory pools and address spaces; data copies before and after GPU compute. Diagram labels: host CPUs (HSA agent) with system memory and virtual memory #1; GPU (HSA kernel agent) with GPU memory and virtual memory #2; steps 1, 2 and 3 marked between them. (Jay Wang, Taiwan, 2015.04)
  • 15. Shared Virtual Memory (HSA). Diagram labels: host CPUs (HSA agent); GPU (HSA kernel agent); shared virtual memory; system memory; GPU memory; MMU; IOMMU; OS page table. A 32-bit HSA system uses 32-bit virtual addresses; a 64-bit HSA system uses virtual addresses of at least 48 bits. (Jay Wang, Taiwan, 2015.04)
  • 16. HSA Memory Hierarchy
      Segments: 1) Global 2) Group 3) Private 4) Kernarg 5) Readonly 6) Spill 7) Arg.
      Local registers per work-item: 32-bit s registers ($s0 to $s127); 64-bit d registers ($d0 to $d63); 128-bit q registers ($q0 to $q31); 1-bit control c registers ($c0 to $c7).
      Diagram labels: HSA agent; kernel dispatch grid made of work-groups; work-items (WI), each with a private segment; one group segment per work-group; global segment; the private, group and global segments all lie within the flat address space; virtual address range reservation (system memory or device local memory). (Jay Wang, Taiwan, 2014.10)
  • 17. Cache Coherency Domains. Diagram labels: host CPUs and HSA kernel agent, each with a cache; cache coherency between the caches; system memory; kernel dispatch grid, work-groups and work-items (WI) with private, group and global segments within the flat address space. (Jay Wang, Taiwan, 2015.04)
  • 18. System Arch. Requirements: 1. Shared Virtual Memory; 2. Cache Coherency Domains; 3. Flat Addressing; 4. Endianness; 5. Signaling and Synchronization; 6. Atomic Memory Operations; 7. HSA System Timestamp; 8. User Mode Queuing; 9. Architected Queuing Language (AQL); 10. Agent Scheduling; 11. Kernel Agent Context Switching; 12. IEEE754-2008 Floating Point Exceptions; 13. Kernel Agent Hardware Debug Infrastructure; 14. HSA Platform Topology Discovery; 15. Images. Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16).
  • 19. Signaling and Synchronization. The required mechanisms for HSAIL and the HSA runtime are:
      - Allocate/destroy an HSA signal
      - Read the current HSA signal value
      - Wait on an HSA signal to meet a specified condition (with a maximum wait duration requested)
      - Send an HSA signal value
      - Atomic read-modify-write of an HSA signal value
      POSIX analogues shown for comparison: sem_init(), sem_wait(), sem_post(), sem_destroy(); pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_mutex_destroy().
      Diagram labels: signal handle (hsa_signal_t); signal value (hsa_signal_value_t), Sig32 or Sig64; implementation-defined data; accessed from the host CPU through HSA runtime APIs and from the HSA kernel agent through HSAIL instructions. (Jay Wang, Taiwan, 2015.04)
  • 20. HSA Runtime APIs for Signaling (for the HSA application). Source: HSA Runtime Programmer's Reference Manual (v1.0), 2.4 Signals.
      hsa_signal_create()
      hsa_signal_destroy()
      hsa_signal_load_{acquire, relaxed}()
      hsa_signal_store_{relaxed, release}()
      hsa_signal_exchange_{acq_rel, acquire, relaxed, release}()
      hsa_signal_cas_{acq_rel, acquire, relaxed, release}()
      hsa_signal_add_{acq_rel, acquire, relaxed, release}()
      hsa_signal_subtract_{acq_rel, acquire, relaxed, release}()
      hsa_signal_and_{acq_rel, acquire, relaxed, release}()
      hsa_signal_or_{acq_rel, acquire, relaxed, release}()
      hsa_signal_xor_{acq_rel, acquire, relaxed, release}()
      hsa_signal_wait_{acquire, relaxed}()
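The signal operations on slides 19 and 20 (load, store, atomic subtract, wait on a condition with a maximum duration) can be mimicked on a host with ordinary threading primitives. The sketch below is only an emulation in Python, not the HSA runtime: the `Signal` class, its method names, and the condition names `eq`/`ne`/`lt`/`gte` are hypothetical and chosen for illustration.

```python
import operator
import threading


class Signal:
    """Toy stand-in for an HSA signal: an integer value that can be waited on."""

    # Hypothetical condition names used only by this sketch.
    CONDITIONS = {"eq": operator.eq, "ne": operator.ne,
                  "lt": operator.lt, "gte": operator.ge}

    def __init__(self, initial):
        self._value = initial
        self._cond = threading.Condition()

    def load(self):
        with self._cond:
            return self._value

    def store(self, value):
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def subtract(self, amount):
        # Read-modify-write under the lock, so the update is indivisible.
        with self._cond:
            self._value -= amount
            self._cond.notify_all()

    def wait(self, condition, compare_value, timeout_s):
        """Block until `value <condition> compare_value` holds or the timeout
        expires; return the last observed value (which may not satisfy it)."""
        test = self.CONDITIONS[condition]
        with self._cond:
            self._cond.wait_for(lambda: test(self._value, compare_value),
                                timeout=timeout_s)
            return self._value


# A "completion signal" starts at 1; a worker decrements it when done.
done = Signal(1)
threading.Timer(0.05, done.subtract, args=(1,)).start()
observed = done.wait("eq", 0, timeout_s=2.0)
```

The wait returns the observed value rather than a success flag, so the caller has to re-check the condition after a timeout.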
  • 21. HSAIL Instructions for Signaling. Source: HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG) (v1.0), 6.8 Notification (signal) Instructions.
  • 22. Atomic Memory Operations. HSA requires the following standard atomic memory operations to be supported by HSA kernel agents (other HSA agents need to support only the subset of these operations required by their role in the system):
      - Load from memory.
      - Store to memory.
      - Fetch from memory, apply a logic operation (bitwise AND/OR/XOR) with one additional operand, and store back.
      - Fetch from memory, apply an integer arithmetic operation (add, subtract, increment, decrement, minimum, maximum) with one additional operand, and store back.
      - Exchange memory location with operand.
      - Compare-and-swap (CAS): load memory location, compare with the first operand and, if equal, store the second operand back to the memory location.
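The compare-and-swap description above is easier to follow as code. This is a minimal Python sketch in which a lock stands in for hardware atomicity; `AtomicCell` and `atomic_max` are hypothetical names for illustration and do not correspond to any HSA interface. It also shows how a fetch-and-maximum can be built from CAS with a retry loop.

```python
import threading


class AtomicCell:
    """One memory location whose operations are made indivisible by a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def exchange(self, new):
        # Swap in `new` and hand back what was there before.
        with self._lock:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, desired):
        """Store `desired` only if the cell holds `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def atomic_max(cell, candidate):
    """Fetch-and-maximum built from CAS; returns the value seen before the update."""
    while True:
        old = cell.load()
        if candidate <= old:
            return old  # nothing to store, the cell is already large enough
        if cell.compare_and_swap(old, candidate) == old:
            return old  # our CAS won; otherwise another writer got in first, retry
```

A caller learns whether a CAS succeeded by comparing the returned value with `expected`.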
  • 23. HSA System Timestamp
      - The HSA system provides a low-overhead mechanism for determining the passing of time.
      - A system timestamp is required that can be read from HSAIL or through the HSA runtime.
      - It is also possible to determine the system timestamp frequency through the HSA runtime.
      Diagram labels: timestamp (64-bit); timestamp frequency (1 to 400 MHz); read on the host CPU through HSA runtime APIs and on the HSA kernel agent through the HSAIL clock instruction. (Jay Wang, Taiwan, 2015.04)
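Turning two timestamp readings into elapsed time only needs the frequency. The helper below is a hedged sketch: the function name is hypothetical, the 1 to 400 MHz bounds are taken from the slide, and the modulo step assumes the 64-bit counter may wrap between the two readings.

```python
def elapsed_seconds(start_ticks, end_ticks, frequency_hz):
    """Convert two 64-bit timestamp readings into elapsed seconds."""
    if not 1_000_000 <= frequency_hz <= 400_000_000:
        raise ValueError("timestamp frequency outside the 1 to 400 MHz range")
    # Modulo 2**64 keeps the difference correct across a single counter wrap.
    ticks = (end_ticks - start_ticks) % (1 << 64)
    return ticks / frequency_hz
```

For example, 100,000,000 ticks at 100 MHz correspond to one second.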
  • 24. System Arch. Requirements: 1. Shared Virtual Memory; 2. Cache Coherency Domains; 3. Flat Addressing; 4. Endianness; 5. Signaling and Synchronization; 6. Atomic Memory Operations; 7. HSA System Timestamp; 8. User Mode Queuing; 9. Architected Queuing Language (AQL); 10. Agent Scheduling; 11. Kernel Agent Context Switching; 12. IEEE754-2008 Floating Point Exceptions; 13. Kernel Agent Hardware Debug Infrastructure; 14. HSA Platform Topology Discovery; 15. Images. Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16).
  • 25. User Mode Queuing: multiple user-level command queues; runtime-allocated; Architected Queuing Language (AQL). Diagram labels: user application; language runtime (e.g. OpenCL runtime); HSA runtime; HSA finalizers; HSA application (HSA agent) on the CPU; HSA kernel mode driver; HSA kernel agents (CPU, GPU). Legend: K = AQL kernel dispatch queue; A = AQL agent dispatch queue. (Jay Wang, Taiwan, 2015.04)
  • 26. HSA Packet Processor
      User mode queue structure (hsa_queue_t):
        0x00  type
        0x04  features
        0x08  base_address
        0x10  doorbell_signal
        0x18  size
        0x1C  reserved (must be 0)
        0x20  id
      The queue is a ring buffer of AQL packets (64 bytes each) with a 64-bit read_index and a 64-bit write_index:
        next packet to read:   base_address + ((read_index % size) * AQL packet size)
        next packet to write:  base_address + ((write_index % size) * AQL packet size)
      A queue supports single or multiple producers, and KERNEL_DISPATCH and/or AGENT_DISPATCH packets. (Jay Wang, Taiwan, 2015.03)
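The two address formulas on this slide translate directly into code. In this small Python sketch `packet_address` follows the slide, while `free_slots` is an added assumption: it presumes, as is usual for ring buffers of this kind, that both indices only ever increase and are wrapped by the modulo at access time.

```python
AQL_PACKET_SIZE = 64  # bytes per AQL packet, as stated on the slide


def packet_address(base_address, index, queue_size):
    """Address of the ring-buffer slot for a monotonically increasing index."""
    return base_address + (index % queue_size) * AQL_PACKET_SIZE


def free_slots(read_index, write_index, queue_size):
    """Slots a producer may still fill (assumes indices never decrease)."""
    return queue_size - (write_index - read_index)


# A 4-entry queue at 0x1000: index 5 wraps around to slot 1.
slot = packet_address(0x1000, 5, 4)
```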
  • 27. User Mode Queue Operations
      HSA Runtime APIs (for the HSA application):
        hsa_queue_create()
        hsa_soft_queue_create()
        hsa_queue_destroy()
        hsa_queue_inactivate()
        hsa_queue_load_write_index_{acquire, relaxed}()
        hsa_queue_store_write_index_{relaxed, release}()
        hsa_queue_cas_write_index_{acq_rel, acquire, relaxed, release}()
        hsa_queue_add_write_index_{acq_rel, acquire, relaxed, release}()
        hsa_queue_load_read_index_{acquire, relaxed}()
        hsa_queue_store_read_index_{relaxed, release}()
      HSAIL instructions (for the HSA kernel agent):
        queueid_u32 dest
        queueptr_uLength dest
        ldqueuewriteindex_segment_order_u64 dest, address
        stqueuewriteindex_segment_order_u64 address, src
        casqueuewriteindex_segment_order_u64 dest, address, src0, src1
        addqueuewriteindex_segment_order_u64 dest, address, src
        ldqueuereadindex_segment_order_u64 dest, address
        stqueuereadindex_segment_order_u64 address, src
      Diagram labels: user application; language runtime (e.g. OpenCL runtime); HSA runtime; HSA application (HSA agent) on the CPU; HSA kernel agent (GPU); K and A queues. (Jay Wang, Taiwan, 2015.04)
  • 28. AQL Packet Types. Each packet is 64 bytes (offsets 0x00 to 0x3C, drawn as 32-bit rows, bits 31..0).
      Kernel Dispatch Packet fields: header, dimensions (2-bit), workgroup_size_x, workgroup_size_y, workgroup_size_z, reserved, grid_size_x, grid_size_y, grid_size_z, private_segment_size_bytes, group_segment_size_bytes, kernel_object, kernarg_address, reserved, completion_signal.
      Agent Dispatch Packet fields: header, type, reserved, return_address, arg0, arg1, arg2, arg3, reserved, completion_signal.
      Barrier-AND / Barrier-OR Packet fields: header, reserved, reserved, dep_signal0, dep_signal1, dep_signal2, dep_signal3, dep_signal4, reserved, completion_signal.
      completion_signal: HSA signaling object handle used to indicate completion of the job.
      Header (16 bits): bits 7..0 format (8-bit); bit 8 barrier (1-bit); bits 10..9 acquire_fence_scope (2-bit); bits 12..11 release_fence_scope (2-bit); bits 15..13 reserved (3-bit).
      AQL_FORMAT values: 0 VENDOR_SPECIFIC; 1 INVALID; 2 KERNEL_DISPATCH; 3 BARRIER_AND; 4 AGENT_DISPATCH; 5 BARRIER_OR.
      (Jay Wang, Taiwan, 2015.03)
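The 16-bit header layout on this slide can be packed and unpacked with shifts and masks. The sketch below assumes the bit positions read off the slide (format in bits 7..0, barrier in bit 8, acquire scope in bits 10..9, release scope in bits 12..11); the function names are hypothetical and the fence scopes are treated as plain 2-bit numbers.

```python
AQL_FORMAT = {
    "VENDOR_SPECIFIC": 0, "INVALID": 1, "KERNEL_DISPATCH": 2,
    "BARRIER_AND": 3, "AGENT_DISPATCH": 4, "BARRIER_OR": 5,
}

BARRIER_SHIFT = 8
ACQUIRE_SHIFT = 9
RELEASE_SHIFT = 11


def pack_header(packet_format, barrier, acquire_scope, release_scope):
    """Build the 16-bit AQL packet header; reserved bits 15..13 stay zero."""
    for scope in (acquire_scope, release_scope):
        if not 0 <= scope <= 3:
            raise ValueError("fence scopes are 2-bit fields")
    return (AQL_FORMAT[packet_format]
            | (int(bool(barrier)) << BARRIER_SHIFT)
            | (acquire_scope << ACQUIRE_SHIFT)
            | (release_scope << RELEASE_SHIFT))


def unpack_header(header):
    """Split a 16-bit header back into its four fields."""
    return {
        "format": header & 0xFF,
        "barrier": (header >> BARRIER_SHIFT) & 0x1,
        "acquire_fence_scope": (header >> ACQUIRE_SHIFT) & 0x3,
        "release_fence_scope": (header >> RELEASE_SHIFT) & 0x3,
    }


header = pack_header("KERNEL_DISPATCH", True, 2, 2)  # 0x1502
```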
  • 29. Kernel Dispatch Packet (64 bytes, offsets 0x00 to 0x3C, 32-bit rows). Fields: header; dimensions (2-bit); workgroup_size_x, workgroup_size_y, workgroup_size_z (work-group size); reserved; grid_size_x, grid_size_y, grid_size_z (grid size); private_segment_size_bytes, group_segment_size_bytes (segment size); kernel_object (pointer to the kernel); kernarg_address (pointer to the arguments); reserved; completion_signal. (Jay Wang, Taiwan, 2015.03)
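To show how the listed fields fill exactly 64 bytes, here is a Python `struct` sketch. The field order and widths are an assumption based on the offsets drawn on the slide and on my understanding of the v1.0 packet layout (six 16-bit fields, five 32-bit fields, four 64-bit fields); little-endian byte order and the helper name `pack_kernel_dispatch` are also assumptions.

```python
import struct

# Assumed layout: header, setup (dimensions in the low 2 bits),
# workgroup x/y/z, reserved | grid x/y/z, private size, group size |
# kernel_object, kernarg_address, reserved, completion_signal.
KERNEL_DISPATCH = struct.Struct("<HHHHHHIIIIIQQQQ")


def pack_kernel_dispatch(header, dimensions, workgroup_size, grid_size,
                         private_segment_size, group_segment_size,
                         kernel_object, kernarg_address, completion_signal):
    """Serialize one kernel dispatch packet into 64 bytes."""
    wx, wy, wz = workgroup_size
    gx, gy, gz = grid_size
    return KERNEL_DISPATCH.pack(
        header, dimensions & 0x3, wx, wy, wz, 0,
        gx, gy, gz, private_segment_size, group_segment_size,
        kernel_object, kernarg_address, 0, completion_signal)


# A 1-D dispatch of 1024 work-items in work-groups of 256.
packet = pack_kernel_dispatch(
    header=0x1502, dimensions=1,
    workgroup_size=(256, 1, 1), grid_size=(1024, 1, 1),
    private_segment_size=0, group_segment_size=0,
    kernel_object=0xDEAD0000, kernarg_address=0xBEEF0000,
    completion_signal=0x42)
```

With this layout, kernel_object lands at offset 0x20 and completion_signal at offset 0x38.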
  • 30. Agent Dispatch Packet (64 bytes, offsets 0x00 to 0x3C, 32-bit rows). Fields: header; type (the function to be performed by the destination agent; the function codes are application-defined); reserved; return_address (pointer to the location in which to store the function return value(s)); arg0, arg1, arg2, arg3 (64-bit direct or indirect arguments); reserved; completion_signal. (Jay Wang, Taiwan, 2015.03)
  • 31. Barrier-AND / Barrier-OR Packet (64 bytes, offsets 0x00 to 0x3C, 32-bit rows). Fields: header; reserved; reserved; dep_signal0 to dep_signal4 (handles for dependent signaling objects to be evaluated by the packet processor); reserved; completion_signal.
      - The Barrier packet defines dependencies for the HSA Packet Processor to monitor.
      - The HSA Packet Processor will not launch any further packets until the Barrier-AND / Barrier-OR packet is complete.
      (Jay Wang, Taiwan, 2015.03)
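The slide does not spell out when a barrier packet's condition is met. The sketch below assumes the rule as I understand it from the HSA v1.0 specification: a dependency counts as satisfied once its signal value is 0, Barrier-AND needs all of them satisfied, Barrier-OR needs any one. Unused slots are modelled as `None`; the function name is hypothetical, and the case of a Barrier-OR packet with no signals at all is not modelled faithfully.

```python
def barrier_satisfied(kind, dep_signal_values):
    """Evaluate a barrier packet's condition.

    kind: "and" or "or"; dep_signal_values: up to five current signal values,
    with None for an unused dep_signal slot.
    """
    if len(dep_signal_values) > 5:
        raise ValueError("a barrier packet carries at most five dep_signal slots")
    active = [v for v in dep_signal_values if v is not None]
    if kind == "and":
        return all(v == 0 for v in active)
    if kind == "or":
        return any(v == 0 for v in active)
    raise ValueError("kind must be 'and' or 'or'")
```

A packet processor would re-evaluate this each time one of the dependent signals changes.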
  • 32. Packet Process Flow
      Launch phase:
      - All preceding packets in the queue must have completed their launch phase.
      - If the barrier bit in the packet header is set, then all preceding packets in the queue must have completed.
      - An acquire memory fence is applied for Kernel/Agent Dispatch packets before the packet enters the active phase.
      Active phase:
      - Kernel Dispatch packets and Agent Dispatch packets execute on the kernel agent/agent, and the active phase ends when the task completes.
      - Barrier-AND and Barrier-OR packets remain in the active phase until their condition is met.
      Completion phase:
      - If the packet is a Barrier-AND or Barrier-OR packet, then an acquire memory fence is applied as the first step.
      - After execution of the acquire fence, the memory release fence is applied.
      - After the memory release fence completes, the signal specified by the completion_signal field in the AQL packet is signaled with a decrementing atomic operation.
  • 33. Barrier-bit Example. If the barrier bit is set, then processing of the packet will only begin when all preceding packets are complete. Diagram labels: AQL packet with barrier bit = 1; enqueue; dequeue; launch phase; active phase; completion phase; completionSignal. (Jay Wang, Taiwan, 2015.04)
  • 34. Barrier-AND Packet Example
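The effect of the barrier bit on ordering can be seen in a tiny timing model. This is an idealized Python sketch, not a description of any real packet processor: it assumes unlimited execution resources, an instantaneous launch phase, and a hypothetical `schedule` helper that takes `(duration, barrier_bit)` pairs in queue order.

```python
def schedule(packets):
    """Return (start, end) times for packets given as (duration, barrier_bit).

    A packet launches once every preceding packet has launched; with the
    barrier bit set it also waits until every preceding packet has finished.
    """
    timeline = []
    launch = 0  # time at which the most recent packet launched
    for duration, barrier_bit in packets:
        if barrier_bit and timeline:
            launch = max(launch, max(end for _, end in timeline))
        timeline.append((launch, launch + duration))
    return timeline


# Third packet has the barrier bit set, so it waits for the first two.
times = schedule([(5, False), (2, False), (3, True), (1, False)])
# times == [(0, 5), (0, 2), (5, 8), (5, 6)]
```

The fourth packet has no barrier bit, so it may launch as soon as the third has launched and can finish before it.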
  • 35. System Arch. Requirements: 1. Shared Virtual Memory; 2. Cache Coherency Domains; 3. Flat Addressing; 4. Endianness; 5. Signaling and Synchronization; 6. Atomic Memory Operations; 7. HSA System Timestamp; 8. User Mode Queuing; 9. Architected Queuing Language (AQL); 10. Agent Scheduling; 11. Kernel Agent Context Switching; 12. IEEE754-2008 Floating Point Exceptions; 13. Kernel Agent Hardware Debug Infrastructure; 14. HSA Platform Topology Discovery; 15. Images. Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16).
  • 36. Agent Scheduling. Diagram labels: AQL queues belonging to applications #1, #2 and #3; non-HSA task pool; agent scheduling on the HSA (kernel) agent; AQL packet (Agent/Kernel Dispatch packet or Barrier-AND/OR packet). Events that poke the scheduler: (1) task execution completed; (2) new AQL packet submission; (3) barrier packet completed. (Jay Wang, Taiwan, 2015.04)
  • 37. Kernel Agent Context Switching. Diagram labels: AQL queues of applications #1, #2 and #3; non-HSA task pool; HSA agent scheduling; HSA kernel agent with compute units (CUs) running kernel programs and work-groups (WG); context switching. Operations: 1. Switch (required); 2. Preempt (required, as soon as possible); 3. Terminate and context reset (terminated as fast as possible). (Jay Wang, Taiwan, 2015.04)
  • 38. System Arch. Requirements: 1. Shared Virtual Memory; 2. Cache Coherency Domains; 3. Flat Addressing; 4. Endianness; 5. Signaling and Synchronization; 6. Atomic Memory Operations; 7. HSA System Timestamp; 8. User Mode Queuing; 9. Architected Queuing Language (AQL); 10. Agent Scheduling; 11. Kernel Agent Context Switching; 12. IEEE754-2008 Floating Point Exceptions; 13. Kernel Agent Hardware Debug Infrastructure; 14. HSA Platform Topology Discovery; 15. Images. Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16).
  • 39. FP Exception Reporting. A kernel agent shall report certain defined exceptions related to the execution of the HSAIL code to the HSA runtime.
      Exception codes (IEEE754-2008): 0 invalid operation; 1 divide-by-zero; 2 overflow; 3 underflow; 4 inexact.
      Control directives: enablebreakexceptions #EC (BREAK policy, BreakEn bits); enabledetectexceptions #EC (DETECT policy, DetectEn bits, status bits).
      HSAIL instructions: cleardetectexcept_u32, getdetectexcept_u32, setdetectexcept_u32.
      Diagram labels: work-groups 0 to X; wavefronts 0 to Y; work-items mapped to lanes 0 to N-1 of a compute unit (CU) with a shared PC, SIMD (Single Instruction, Multiple Data) style; exception module in the HSA kernel agent signaling the exception handler in the HSA runtime on the host CPU. (Jay Wang, Taiwan, 2015.04)
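The five exception codes lend themselves to a bit-mask representation. The Python sketch below assumes each code is a bit position in a 5-bit mask (code 0 is bit 0, and so on), which is how I understand the `#EC` operand and the `*detectexcept_u32` values to be encoded; the helper names are hypothetical.

```python
EXCEPTION_BITS = {
    0: "invalid operation",
    1: "divide-by-zero",
    2: "overflow",
    3: "underflow",
    4: "inexact",
}


def decode_exceptions(mask):
    """List the exception names whose bits are set in `mask`."""
    return [name for bit, name in EXCEPTION_BITS.items() if mask & (1 << bit)]


def encode_exceptions(names):
    """Build the bit mask for a collection of exception names."""
    by_name = {name: bit for bit, name in EXCEPTION_BITS.items()}
    mask = 0
    for name in names:
        mask |= 1 << by_name[name]
    return mask
```

For instance, a mask of 0b00110 decodes to divide-by-zero and overflow.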
  • 40. Debug Infrastructure. The kernel agent shall provide mechanisms to allow system software and some select application software (for example, debuggers and profilers) to set breakpoints and collect throughput information for profiling.
      Diagram labels: grid; work-groups 0 to 2; wavefronts 0 to 3; work-items mapped to lanes 0 to N-1 of a compute unit with a shared PC, SIMD (Single Instruction, Multiple Data) style; debug module in the HSA kernel agent (instruction breakpoint, memory breakpoint, conditional breakpoint); HSA kernel agent debug interface; debuggers and profilers on the host CPU (HSA agent). (Jay Wang, Taiwan, 2015.04)
  • 41. System Arch. Requirements: 1. Shared Virtual Memory 2. Cache Coherency Domains 3. Flat Addressing 4. Endianness 5. Signaling and Synchronization 6. Atomic Memory Operations 7. HSA System Timestamp 8. User Mode Queuing 9. Architected Queuing Language (AQL) 10. Agent Scheduling 11. Kernel Agent Context Switching 12. IEEE 754-2008 Floating Point Exceptions 13. Kernel Agent Hardware Debug Infrastructure 14. HSA Platform Topology Discovery 15. Images (Source: HSA Platform System Architecture Specification, Version 1.0 Final, 2015-03-16)
  • 42. Execution Environment (report obtained through the OpenCL APIs)
      You have 2 OpenCL platform(s)
      ----------------------------------------------
      Platform[0].Name = NVIDIA CUDA
      Platform[0].Vendor = NVIDIA Corporation
      Platform[0].Version = OpenCL 1.1 CUDA 4.2.1
      Platform[0].Profile = FULL_PROFILE
      ----------------------------------------------
      Platform[1].Name = Intel(R) OpenCL
      Platform[1].Vendor = Intel(R) Corporation
      Platform[1].Version = OpenCL 1.2
      Platform[1].Profile = FULL_PROFILE
      ----------------------------------------------
      Platform[0] has 1 device(s)
      ----------------------------------------------
      Device[0].Type = CL_DEVICE_TYPE_GPU
      Device[0].Name = GeForce GT 625
      Device[0].Vendor = NVIDIA Corporation
      Device[0].Version = OpenCL 1.1 CUDA
      Device[0].DriverVersion = 320.49
      Device[0].Profile = FULL_PROFILE
      Device[0].OpenCL_C = OpenCL C 1.1
      Device[0].MaxComputeUnits = 1
      Device[0].MaxWiDimensions = 3
      Device[0].MaxWiSize = (1024,1024,64)
      Device[0].MaxWgSize = 1024
      Device[0].MaxClkFrequency = 1747 MHz
      Device[0].AddrSpaceSize = 32 bits
      Platform[1] has 1 device(s)
      ----------------------------------------------
      Device[0].Type = CL_DEVICE_TYPE_CPU
      Device[0].Name = Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
      Device[0].Vendor = Intel(R) Corporation
      Device[0].Version = OpenCL 1.2 (Build 80752)
      Device[0].DriverVersion = 3.0.1.15216
      Device[0].Profile = FULL_PROFILE
      Device[0].OpenCL_C = OpenCL C 1.2
      Device[0].MaxComputeUnits = 4
      Device[0].MaxWiDimensions = 3
      Device[0].MaxWiSize = (1024,1024,1024)
      Device[0].MaxWgSize = 1024
      Device[0].MaxClkFrequency = 3100 MHz
      Device[0].AddrSpaceSize = 32 bits
  • 43. HSA Platform Topology Discovery
      - HSA platform resources: Agent, Memory, Compute Properties, Caches, and I/O
      [Figure: an HSA platform made up of Node 0 to Node 4. The labels show HSA APUs (CPU cores, a GPU with H-CUs, an HSA MMU, and system memory with coherent/cacheable and non-coherent/non-cacheable regions) joined by a socket interconnect, and HSA discrete GPUs on optional add-in boards (a GPU with H-CUs and device local memory with coherent and non-coherent regions) attached over PCIe, one of them behind a PCIe bridge. Firmware labels: SBIOS UEFI on the APU side, VBIOS UEFI GOP on the add-in boards.]
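One way to think about topology discovery is as a walk over a graph of nodes, each carrying its agents and memory, connected by I/O links. The sketch below is a simplified software picture only: the `Node` class, the `discover` and `kernel_agent_nodes` helpers, and the four-node example platform are all hypothetical and merely loosely inspired by the figure. They are not the data structures or calls of the HSA Runtime.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical topology node: its agents, memory banks, and I/O links."""
    node_id: int
    agents: list                                # e.g. ["CPU", "GPU"]
    memory: list                                # e.g. ["system memory"]
    links: list = field(default_factory=list)   # IDs of directly connected nodes


def discover(platform, root=0):
    """Breadth-first walk over the I/O links; returns node IDs in visit order."""
    seen, order, todo = {root}, [], deque([root])
    while todo:
        node_id = todo.popleft()
        order.append(node_id)
        for peer in platform[node_id].links:
            if peer not in seen:
                seen.add(peer)
                todo.append(peer)
    return order


def kernel_agent_nodes(platform):
    """IDs of the nodes that host a GPU, used here as a stand-in for kernel agents."""
    return sorted(n.node_id for n in platform.values() if "GPU" in n.agents)


# Assumed example: two APU nodes, each with one discrete GPU attached.
platform = {
    0: Node(0, ["CPU", "GPU"], ["system memory"], links=[1, 2]),
    1: Node(1, ["CPU", "GPU"], ["system memory"], links=[0, 3]),
    2: Node(2, ["GPU"], ["device local memory"], links=[0]),
    3: Node(3, ["GPU"], ["device local memory"], links=[1]),
}

visit_order = discover(platform)
gpu_nodes = kernel_agent_nodes(platform)
```

A runtime or scheduler could use such a graph to prefer agents that sit close to the memory holding the data, which is the kind of decision the discovery requirement is meant to support.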
  • 44. System Arch. Requirements: 1. Shared Virtual Memory 2. Cache Coherency Domains 3. Flat Addressing 4. Endianness 5. Signaling and Synchronization 6. Atomic Memory Operations 7. HSA System Timestamp 8. User Mode Queuing 9. Architected Queuing Language (AQL) 10. Agent Scheduling 11. Kernel Agent Context Switching 12. IEEE 754-2008 Floating Point Exceptions 13. Kernel Agent Hardware Debug Infrastructure 14. HSA Platform Topology Discovery 15. Images (Source: HSA Platform System Architecture Specification, Version 1.0 Final, 2015-03-16)
  • 45. Images
      - A graphics feature that can sometimes be useful in data-parallel computing
      - Used to store one-, two-, or three-dimensional images
      - Predefined image formats
      - Image memory is accessed through a special kind of memory access
      - Dedicated hardware speeds up image operations
      - See: The OpenCL™ Specification Version 2.1, Section 5.3 Image Objects, https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
      [Figure (Jay Wang, Taiwan, 2015.04): an image object in the HSA Runtime consists of an image handle (hsa_ext_image_handle_t), an image descriptor (image geometry, image data size, image channel type, image channel order), and image data (1D, 2D, or 3D) stored in the global segment; the HSA kernel agent accesses it with the HSAIL instructions rdimage, ldimage, and stimage.]
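The relationship between the descriptor fields (geometry, channel order, channel type) and the image data size can be illustrated with a short calculation. This is a sketch under assumptions: the `ImageDescriptor` class is hypothetical, the two lookup tables list only a small illustrative subset of formats, and the data is assumed to be tightly packed. Real implementations may add padding or use an opaque, implementation-defined layout, so the result should be read as a lower-bound style estimate.

```python
from dataclasses import dataclass

# Illustrative subset only; the tables are assumptions for this sketch.
CHANNELS_PER_ORDER = {"r": 1, "rg": 2, "rgba": 4}
BYTES_PER_CHANNEL = {"unorm_int8": 1, "unorm_int16": 2, "float": 4}
DIMENSIONS = {"1d": 1, "2d": 2, "3d": 3}


@dataclass(frozen=True)
class ImageDescriptor:
    """Hypothetical descriptor mirroring the fields named on the slide."""
    geometry: str                      # "1d", "2d", or "3d"
    width: int
    height: int = 1
    depth: int = 1
    channel_order: str = "rgba"
    channel_type: str = "unorm_int8"

    def data_size(self):
        """Bytes needed for tightly packed image data (no padding assumed)."""
        extent = (self.width, self.height, self.depth)[:DIMENSIONS[self.geometry]]
        texels = 1
        for length in extent:
            texels *= length
        bytes_per_texel = (CHANNELS_PER_ORDER[self.channel_order]
                           * BYTES_PER_CHANNEL[self.channel_type])
        return texels * bytes_per_texel


photo = ImageDescriptor("2d", width=640, height=480)
samples = ImageDescriptor("1d", width=16, channel_order="r", channel_type="float")
```

Keeping the format in a descriptor rather than in the kernel code is what lets dedicated image hardware handle addressing and format conversion on behalf of the kernel.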
  • 46. Summary
      - Programming model issues
          - HSA Intermediate Language (HSAIL) + HSA Runtime
          - Architected Queuing Language (AQL) + Signaling
          - Debug infrastructure
      - Communication overhead issues
          - Cache coherent shared virtual memory (CC-SVM)
          - Architected Queuing Language (AQL) for user mode queuing
          - Hardware-assisted signaling and atomic operations for synchronization
      [Figure (Jay Wang, Taiwan, 2015.04): CPUs, GPU, DSP, and other processors sharing HSAIL, the HSA Runtime, AQL, and unified coherent memory.]
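User mode queuing with signaling can be sketched as a ring buffer that the application fills directly, plus a doorbell signal that tells the packet processor how far it may read, and a completion signal that is decremented when a packet finishes. The model below is a single-threaded illustration only: `Signal` and `UserModeQueue` are hypothetical classes, packets are plain Python tuples rather than AQL packets, and no HSA Runtime call is used.

```python
class Signal:
    """Hypothetical signal: a shared integer that agents store to or subtract from."""

    def __init__(self, value=0):
        self.value = value

    def store(self, value):
        self.value = value

    def subtract(self, amount=1):
        self.value -= amount


class UserModeQueue:
    """Toy ring buffer with monotonically increasing read and write indices."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("queue size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.write_index = 0
        self.read_index = 0
        self.doorbell = Signal(-1)   # -1 means nothing has been submitted yet

    def enqueue(self, kernel, args, completion):
        """Producer side: runs in the application, with no system call modeled."""
        if self.write_index - self.read_index >= self.size:
            raise BufferError("queue full")
        index = self.write_index
        self.slots[index & (self.size - 1)] = (kernel, args, completion)
        self.write_index += 1
        self.doorbell.store(index)   # ring the doorbell with the packet index
        return index

    def process(self):
        """Consumer side: stands in for the kernel agent's packet processor."""
        while self.read_index <= self.doorbell.value:
            kernel, args, completion = self.slots[self.read_index & (self.size - 1)]
            kernel(*args)
            self.read_index += 1
            completion.subtract(1)   # signal completion to whoever is waiting


results = []
done = Signal(1)                     # the waiter expects this to reach 0
queue = UserModeQueue(size=4)
queue.enqueue(lambda a, b: results.append(a + b), (2, 3), done)
queue.process()
```

The point of the design is that both ends only touch shared memory and signals, so submitting work avoids a round trip through the operating system kernel, which is how the slide's communication overhead argument is usually framed.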
  • 47. [Figure (Jay Wang, Taiwan, 2015.04): the HSA software stack. An HSA application (user application: CPU code + HSAIL kernel code) runs on the host CPU (HSA agent) on top of a language runtime (for example, the OpenCL runtime), the HSA Runtime, the HSA finalizers, and the HSA kernel mode driver. Work reaches the HSA kernel agents (CPU, GPU, DSP) through HSA user mode queuing (Architected Queuing Language) and HSA signaling. Four roles are annotated: Parallel Application Designer, HSA System Software Designer, HSA System Architecture Designer, and HSA Kernel Agent Designer, with a marker reading "Mom, I'm here!".]
      - OpenCL Standards (https://www.khronos.org/opencl/)
      - HSA Standards (http://www.hsafoundation.com/html/HSA_Library.htm)
          - HSA Platform System Architecture Specification v1.0
          - HSA Programmer Reference Manual Specification v1.0
          - HSA Runtime Specification v1.0
      - HSA Foundation Github (https://github.com/HSAFoundation)
  • 48. Taiwan HSA Group @ Facebook