This was presented by Peng Fei GOU (IBM China) at OpenPOWER summit EU 2019. The original one is uploaded at:
https://static.sched.com/hosted_files/opeu19/68/NVDLA%20on%20OpenCAPI.pdf
1. AI Inference Acceleration with
components all in Open Hardware:
OpenCAPI and NVDLA
Deep Learning Inference Engine for CAPI/OpenCAPI
October 27, 2019
IBM China System Lab
Peng Fei GOU (shgoupf@cn.ibm.com)
2. Motivation
Path to hardware acceleration for AI
✓ Deep learning inference acceleration is hot everywhere, from edge to cloud
✓ POWER9 needs a hardware acceleration solution for AI
OpenCAPI-NVDLA: a demonstration on the POWER9 heterogeneous computing platform
✓ Aligns with the Open Hardware strategy
✓ Fast, simple acceleration deployment on servers with FPGA and OpenCAPI
NVDLA: an OPEN SOURCE inference engine from NVIDIA
✓ NVIDIA Deep Learning Accelerator
✓ High quality: production-level open-source RTL
✓ Flexibility: configurable architecture to fulfill different business needs
3. Open Hardware Ecosystem
NVDLA
✓ Open hardware design
✓ Part of NVIDIA's Xavier SoC
✓ Open-source compiler
✓ SiFive + NVDLA collaboration
✓ Dozens of startups are starting to leverage NVDLA
✓ Active community
OpenPOWER
✓ Open ISA
✓ Open reference design
✓ Encourages more open innovation in hardware
✓ Rich ecosystem and partners, from software and system hardware to chips
7. Software Stack
Flow (reconstructed from the figure): Model → Loadable → ioctl() → Reg Writes → CAPI NVDLA hardware.
✓ DL training frameworks (publicly available: Caffe, etc.) produce a trained model; applications include image recognition, etc.
✓ Parser, Compiler and Optimizer transform a trained network into NVDLA loadables. This step runs offline, not necessarily on POWER. NVIDIA open-sourced these components around September 2019; they are enough to support early-stage evaluation.
✓ User-mode driver: applications/workloads run with the OpenCAPI-NVDLA user-mode driver on POWER9 platforms. NVIDIA open-sourced both user- and kernel-mode drivers; this port changes them to CAPI user mode.
✓ Kernel-mode driver: changed and eliminated to adapt to CAPI mode; register writes go directly to the CAPI NVDLA hardware over OpenCAPI.
(Example network in the original figure: Conv1-Conv5 with MaxPooling stages, followed by FC1-FC3.)
8. Driver Changes for OpenCAPI
Original stack (reconstructed from the figure): the user-mode driver issues ioctl() calls (DRM_IOCTL_NVDLA_GEM_CREATE, DRM_IOCTL_NVDLA_GEM_MMAP, DRM_IOCTL_NVDLA_DESTROY, DRM_IOCTL_NVDLA_SUBMIT) into the NVDLA DRM driver (nvdla_gem_create() → drm_gem_handle_create(), nvdla_gem_map_offset() → drm_gem_create_mmap_offset(), nvdla_gem_destroy() → drm_gem_dumb_destroy(), nvdla_submit() → nvdla_task_submit()). The driver invokes the NVDLA firmware (dla_submit_operation(), *_reg_read(), *_reg_write()), which drives the hardware engines (DLA_OP_BDMA, DLA_OP_CONV, DLA_OP_SDP, DLA_OP_PDP, DLA_OP_CDP, DLA_OP_RUBIK).
Changes for the CAPI port: IOCTLs removed, DRM and GEM dependencies removed, firmware changed to user-mode calls made as direct function calls.
Easy Memory Management
✓ All kernel-mode DRM/GEM code removed
✓ User-mode malloc() manages memory
No IOCTL Calls
✓ IOCTL calls from UMD to KMD changed to user-level function calls
Firmware Works in User Mode / No Linux Kernel Dependency
✓ No dependency on DRM/GEM drivers
✓ No dependency on Linux kernel versions
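The UMD-to-KMD change described above can be sketched in C. This is a minimal illustration, not the actual ported driver code: the struct layout and function bodies are assumptions, but they mirror the described change of replacing the DRM_IOCTL_NVDLA_GEM_CREATE / DRM_IOCTL_NVDLA_DESTROY ioctls with plain user-mode allocation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle for an NVDLA buffer in the CAPI port: CAPI lets the
 * accelerator use user-space virtual addresses, so no GEM object is needed. */
struct nvdla_mem_handle {
    void   *va;    /* user-space virtual address, visible to the accelerator */
    size_t  size;
};

/* Before: ioctl(fd, DRM_IOCTL_NVDLA_GEM_CREATE, &args) into the KMD.
 * After: a direct user-mode allocation. Returns 0 on success, -1 on failure. */
static int nvdla_gem_create(struct nvdla_mem_handle *h, size_t size)
{
    h->va = malloc(size);
    if (h->va == NULL)
        return -1;
    memset(h->va, 0, size);
    h->size = size;
    return 0;
}

/* Before: ioctl(fd, DRM_IOCTL_NVDLA_DESTROY, &args). After: free(). */
static void nvdla_gem_destroy(struct nvdla_mem_handle *h)
{
    free(h->va);
    h->va = NULL;
    h->size = 0;
}
```

The same pattern applies to the submit path: dla_submit_operation() becomes an ordinary function call in the same process instead of an ioctl round trip.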
9. Functional Validation
Running in the IBM Austin Lab: Mihawk (POWER9) + AlphaData AD9H7 FPGA card
✓ Large config (2048 MACs) running @ 200 MHz
✓ Functional tests PASSED
✓ AlexNet running real-image inference
✓ Results not 100% accurate due to model inaccuracy
10. Performance Evaluation and Projection

Hardware     No. MACs  Clock    FPGA   I/O bandwidth  FC batch size  AlexNet perf (frames/second)
Current      2048      200 MHz  VU37P  1 GB/s         1              10.417
Projected    2048      250 MHz  VU37P  20 GB/s        16             741

Current performance
✓ AlexNet: 10.42 frames/second
✓ Performance under tuning; expect better performance once issues in the compiler are resolved and tuned
Projected performance
✓ AlexNet: 741 frames/second
✓ ResNet-50: 113 frames/second
✓ Projected performance calculated with the analytical model from NVDLA:
https://github.com/nvdla/hw/tree/master/perf
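As a sanity check on the configurations above, peak INT8 throughput follows from MAC count and clock alone (each MAC contributes a multiply and an add per cycle). The helper below is an illustrative back-of-envelope calculation, not part of the NVDLA analytical model linked above; it reproduces the "~1 TOPS @ INT8" figure quoted later in the deck for the 2048-MAC config.

```c
/* Peak throughput in TOPS for an NVDLA-style config:
 * macs MAC units, 2 ops (multiply + add) per MAC per cycle. */
static double peak_tops(double macs, double clock_hz)
{
    return macs * 2.0 * clock_hz / 1e12;
}
```

For the current config, peak_tops(2048, 200e6) gives about 0.82 TOPS, i.e. "~1 TOPS"; the projected 250 MHz config reaches about 1.02 TOPS. Real AlexNet throughput is far below this peak because of I/O bandwidth and utilization, which is why the table shows projections depending on bandwidth and batch size.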
12. Summary
Why NVDLA on OpenCAPI
✓ Open hardware collaboration
✓ Inference engines are a foreseeable hot topic in servers, data centers and clouds
✓ NVIDIA is serious about open-sourcing DLA; the quality of DLA is production level
✓ We don't want to reinvent the wheel
What's next?
✓ Larger configurations (4096 MACs and/or FP16 support)
✓ Parser and compiler adaptation
✓ Performance tuning and real-workload adaptation (key to business)
Open Source
✓ Important to cultivate the open hardware ecosystem
13. Pointers to Materials
Modified CAPI/SNAP framework for NVDLA
✓ https://github.com/shgoupf/snap/tree/nvdla (public GitHub)
Modified NVDLA software for CAPI
✓ https://github.com/shgoupf/nvdla-sw/tree/capi (public GitHub)
Modified NVDLA IP, including RTL and unit testbench
✓ https://github.ibm.com/shgoupf/nvdla-capi (IBM enterprise GitHub)
14. References
Hot Chips 30
✓ http://www.hotchips.org/
Xilinx xfDNN (CHaiDNN)
✓ https://github.com/Xilinx/CHaiDNN
SNAP
✓ https://github.com/open-power/snap
NVDLA
✓ http://nvdla.org/
Original NVDLA hardware
✓ http://github.com/nvdla/hw
Original NVDLA software
✓ http://github.com/nvdla/sw
Original NVDLA virtual platform
✓ http://github.com/nvdla/vp
Community-contributed NVDLA compiler source
✓ https://github.com/icubecorp/nvdla_compiler
16. Quick Facts
What is NVDLA
✓ NVIDIA Deep Learning Accelerator
✓ Open source, production-level RTL
✓ Hardware configurable
✓ Accelerates convolutional neural networks
What is OpenCAPI-NVDLA
✓ Brings NVDLA to OpenCAPI on FPGA
✓ Explores the possibility of AI acceleration on CAPI/OpenCAPI
✓ Aligns with POWER's heterogeneous computing strategy
Current Development Status
✓ NVDLA hardware ported to OpenCAPI
✓ NVDLA software (drivers) ported to CAPI
✓ Hardware running 2048 MACs @ 200 MHz, with AlexNet
✓ Running on Mihawk + AlphaData AD9H7
Potential Use Cases
✓ AI acceleration solution on POWER9
✓ Cloud image recognition service
✓ Face recognition for large-scale video surveillance servers
✓ FPGA-based AI acceleration in the cloud
Performance
✓ ~1 TOPS @ INT8
✓ Current perf: 10.42 FPS for AlexNet
✓ Projected perf: 813.49 FPS for AlexNet
Other Highlights
✓ Production-level unit verification environment with full regression enabled
✓ Open-source compilers
✓ Larger hardware configurations in development
17. NVDLA Changes for FPGA Implementation
FPGA Implementation Timing Closure — WNS in ps after each successive step (reconstructed from the chart for the NV_SMALL and NV_LARGE configs):
✓ NV_SMALL: INIT -2260 → +use DSP -2000 → +disable clock gating +25 (timing closed)
✓ NV_LARGE: +disable clock gating -1900 → +add pipeline in MAC -400 → +add pipeline in SDP -130 → +set max fanout -14
Methods Used
✓ INIT: the initial NVDLA RTL from GitHub
✓ +Use DSP: replace all NVDLA MAC operators with Xilinx DSP IP
✓ +Disable clock gating: disable the clock gating intended for the ASIC design
✓ +Add pipelines in MAC: add pipeline stages in the MAC for an FPGA-oriented design; RTL changes verified with the unit testbench
✓ +Add pipelines in SDP: add pipeline stages in the SDP for an FPGA-oriented design; RTL changes verified with the unit testbench
✓ +Set max fanout: set max fanout for critical registers in NVDLA
18. NVDLA Changes for SNAP Integration

Config Register Definition (address 0x400, 32-bit, RW):
[31:9] RO: reserved
[8]    RW: selector between the SNAP and NVDLA register spaces (0 = SNAP, 1 = NVDLA)
[7:0]  RW: extension of the NVDLA address, used together with paddr[9:2]

Indirect Register Accessing: the NVDLA register space is larger than the SNAP action register space, so paddr[9:0] first selects between the SNAP action registers and the NVDLA registers via config bit [8], and the effective NVDLA register index is formed as {config_reg[7:0], paddr[9:2]}.

Control Path Adaption
✓ Add an AXI4-lite-to-APB bridge and an APB-to-CSB bridge
✓ Indirect register accessing (NVDLA register space is larger than the SNAP action register space)
✓ Interrupt enablement
Data Path Adaption
✓ AXI4 data bus width converter between NVDLA (256-bit data bus) and SNAP (512-bit data bus)
✓ Other changes to facilitate AXI4 bus signals
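The indirect addressing scheme above can be sketched as a small C helper. This is an illustrative model only (the macro and function names are assumptions, and real accesses would go through MMIO, not plain arithmetic); it just shows how the NVDLA register index is composed from the config register's extension byte and paddr[9:2].

```c
#include <stdint.h>

/* Config register at 0x400: bit [8] selects NVDLA (1) vs SNAP (0),
 * bits [7:0] extend the NVDLA address. Names here are illustrative. */
#define CONFIG_REG_ADDR  0x400u
#define SEL_NVDLA        (1u << 8)

/* Effective NVDLA register index, formed as {config_reg[7:0], paddr[9:2]}:
 * the extension byte supplies the high 8 bits, paddr[9:2] the low 8 bits. */
static uint32_t nvdla_reg_index(uint8_t cfg_ext, uint32_t paddr)
{
    return ((uint32_t)cfg_ext << 8) | ((paddr >> 2) & 0xFFu);
}
```

In use, software would first write SEL_NVDLA plus the extension byte into the config register, then read or write the SNAP action register at paddr; the bridge forwards the access to the NVDLA CSB register at the composed index.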
19. Unit Sim Test Plan and Testbench
Unit-Sim Testbench
✓ Trace generator: generates traces
✓ Trace player: drives traces into the DUT, checks DUT behavior and collects coverage
Test Plan
✓ Test levels: Level 0, Level 1, ... Level 10, Level 20, ...
✓ Associating tests: a method named add_test associates tests with the test plan
Testcases
✓ Direct (trace) tests: pdp_8x8x32_1x1_int8_0, ...
✓ Python tests: nvdla_reg_accessing, ...
✓ Random (UVM) tests: cc_in_width_ctest, ...
Simulation Environment Components
✓ All RTL changes protected
✓ Added AXI-lite adapters and scoreboards
✓ Added checkers for SNAP action registers
✓ Simulator changed from VCS to Xcelium
✓ Simulating with Xilinx FPGA IPs
✓ Regression running on a Jenkins server
✓ Production-level verification environment
20. Business Trends

NAME                  USAGE                     COMPANY    DATE  FEATURE
NVDLA                 Inferencing               NVIDIA     2017  Free open-source inference engine
Zynq UltraScale+      Training and inferencing  Xilinx     2015  HBM, CCIX, frameworks, INT8
Xilinx DNN Processor  Inferencing               Xilinx     2018  On-server inference solution
ARM ML Processor      Inferencing               ARM        2018  Flexible architecture, scaling from edge to cloud
Brainwave             Training and inferencing  Microsoft  2017  Deployed on the Microsoft cloud
Cambricon 1M          Inferencing               Cambricon  2018  Specialized AI ISA
Ascend 910            Training and inferencing  Huawei     2018  Huawei's new architecture for AI

Facebook, Alibaba (Ali-NPU) and Baidu (XPU) are racing toward hardware acceleration for AI.
Aliyun, Tencent Cloud, Baidu Cloud and Huawei Cloud have started to provide FPGA cloud services.
Highlighted Trends
✓ Software + hardware full-stack solutions
✓ Internet giants are starting on AI chips
✓ Hardware providers optimize their AI-related libraries and try to move into software's domain
✓ FPGAs have been widely deployed on public clouds
✓ Focusing on energy-performance, I/O efficiency and system optimization