3. Current Computing Landscape
The pace of CPU technology advances has slowed, and with it the historical cost/performance
improvements seen over the last several decades => New CPU chips alone cannot handle current challenges!
[Diagram: binary data streaming into several CPUs, labeled Computation and Data Access]
Over-burdened CPUs
+ Slow/complex algorithms and functions
+ Network and data access rates
= Current technology and processing overload!
Bad news: it's only going to get worse!
4. Next Set Of Challenges Is Here!
• Exponential data growth
• Compute-intensive algorithms
• Diverse data structures and types
• Decreasing time to results: hours .. minutes .. seconds .. real time
Compute
• AI, Machine / Deep Learning
• Video Processing
• Database / Big Data Analytics
Storage
• Scale-out Storage
• Petabytes of new data
• Intelligent / Compute SSDs
Networking
• Network Security
• Low-latency Networking
• Open vSwitch offload
• Software Defined Networking Acceleration
5. Next Challenges Affect All Computing Fields
Bank / Finance
• Risk analysis / Faster trading: Monte Carlo libraries
• Credit card fraud detection
• Blockchain acceleration
Video / Analytics
• Smart video surveillance from multiple video feeds
• 3D video stream from multi-angle video streams
• Image search / Object tracking / Scene recreation
• Multi-JPEG compression
Machine Learning / Deep Learning
• Machine learning inference
• Accelerate frequently used ML / DL algorithms
Algorithm acceleration
• Compression on network path or storage
• Encryption on the fly to various memory types
• String match
6. Options: Software or Hardware?
• Software:
  • Advantages:
    • More rapid development leading to faster time to market
    • Lower non-recurring engineering costs; software can be reused easily
    • Heightened portability
    • Ease of updating features or patching bugs
  • Disadvantages:
    • Slower run time
• Hardware:
  • Advantages:
    • Much faster execution of functions
    • Reduced power consumption
    • Lower latency
    • Increased parallelism and bandwidth
    • Better utilization of area and functional components available on an integrated circuit (IC)
  • Disadvantages:
    • Lower ability to update designs once etched onto silicon
    • Difficult to share Verilog/VHDL source code between different hardware platforms
    • Higher costs of functional verification
    • Longer development process and time to market
But what if you could have the best of both worlds?
7. So, what's the solution?
Hardware Acceleration
The use of computer hardware specially designed to perform functions more
efficiently than is possible in software alone running on a general-purpose CPU.
Two Options
• GPU: thousands of tiny CPUs using high parallelization
  => compute-intensive applications
• FPGA (Field Programmable Gate Array): logic + I/Os are customized exactly for the application's needs
  => very low and predictable latency applications
8. The Better Choice?
FPGA Acceleration
Due to the inherent logic and I/O flexibility, speed, and
predictably low latency, FPGAs have a clear advantage.
FPGA = Field Programmable Gate Array
Historically programmed using Verilog/VHDL => compiled => mapped to FPGA HW logic
9. What is an FPGA?
FPGA = Field Programmable Gate Array
• A re-programmable computer chip with many configurable logic
  elements based on lookup tables (LUTs)
• Programmable switch matrix routing
• Configurable I/O and high-speed serial links
• Integrated hard IP (multiply/add, SRAM, PLL, PCIe, Ethernet, DRAM, ...)
• Advantages in flexibility, speed, and low latency due to:
  • Limited instruction set
  • High parallelism
  • Deep pipelines
[Diagram: logical view of an FPGA, showing programmable logic elements connected by programmable switches]
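To make the lookup-table idea concrete, here is a minimal software sketch of how a 4-input LUT can stand in for any Boolean function of its inputs: the function is precomputed into a 16-entry table (the "configuration"), and evaluation is just an index into that table. The names `make_lut` and `lut_eval` are illustrative only, and the 4-input width is an assumption; real devices differ in LUT size and surrounding logic.

```python
from itertools import product


def make_lut(func, num_inputs=4):
    """Precompute the truth table of func; this plays the role of the LUT's configuration bits."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=num_inputs)]


def lut_eval(table, *bits):
    """Evaluate the LUT: the input bits form an index into the stored table."""
    index = 0
    for b in bits:
        index = (index << 1) | (b & 1)  # first input is the most significant bit
    return table[index]


# "Reprogramming" the same element means loading a different table.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
and_lut = make_lut(lambda a, b, c, d: a & b & c & d)

print(lut_eval(parity_lut, 1, 0, 1, 1))
print(lut_eval(and_lut, 1, 1, 1, 1))
```

The same evaluation routine serves both functions; only the table contents change, which is the sense in which the logic element is configurable.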
10. FPGA Example (Bittware 250-SoC)
Bittware 250-SoC
Multipurpose Converged Network / Storage
• Xilinx Zynq UltraScale+ FPGA ZU19EG (64-bit Cortex-A53 ARM core)
• Two 4GB DDR4 (for FPGA and ARM)
• PCIe Gen3 x16 / Gen4 x8 => CAPI2
• Up to 4 x8 OCuLink ports supporting NVMe, 100GbE and OpenCAPI
• 2x 100GbE QSFP28 cages
• Half Height - Half Length format
11. Basics of HW Acceleration
Standard CPU Setup (No Acceleration)
[Diagram: application and function running on a CPU attached to host memory]
• CPU manages all data, memory access, functions, and flows
• With increased data, computing, storage, and network challenges:
  • Over-burdened CPU
  • Slow functions
  • Congested memory and output card access
12. Basics of HW Acceleration
Standard CPU Setup (No Acceleration)
[Diagram: application and function running on a CPU attached to host memory]
=> CPU manages all data, memory access, functions, and flows
• Over-burdened CPU
• Slow functions
• Congested memory and output card access
13. HW Acceleration with FPGA
Standard CPU Setup (No Acceleration)
[Diagram: application and function running on a CPU attached to host memory]
=> CPU manages all data, memory access, functions, and flows
• Over-burdened CPU
• Congested memory and output card access
• Slow functions
Classic Acceleration with FPGA
[Diagram: application on the CPU with host memory; function moved to the FPGA]
• Faster functions on FPGA
• Only the function itself is offloaded from the CPU
• CPU still handles FPGA memory access and data copying
• No data coherency
• Historically programmed using Verilog/VHDL
14. HW Acceleration with FPGA
Standard CPU Setup (No Acceleration)
[Diagram: application and function running on a CPU attached to host memory]
=> CPU manages all data, memory access, functions, and flows
• Over-burdened CPU
• Congested memory and output card access
• Slow functions
Classic Acceleration with FPGA
[Diagram: application on the CPU with host memory; function moved to the FPGA]
=> CPU is used to manage FPGA memory access
• No data coherency (host memory copied to FPGA)
• FPGA historically programmed using Verilog/VHDL
• CPU still handles all memory and data access
15. Addressing Classic FPGA Acceleration Issues
• What is OpenCAPI?
  • Open Coherent Accelerator Processor Interface
  • OpenCAPI is an open interface architecture that allows any microprocessor to attach to:
    • Coherent user-level accelerators and I/O devices
    • Advanced memories accessible via read/write or user-level direct memory access (DMA) semantics
  • Agnostic to processor architecture
• What is OC-Accel?
  • OpenCAPI Acceleration Framework to program FPGAs using C/C++ instead of Verilog or VHDL
OpenCAPI specifications (OpenCAPI 3.0, OC 3.1) are downloadable from www.opencapi.org
16. HW Acceleration with FPGA + OpenCAPI
Classic Acceleration with FPGA
[Diagram: application on the CPU with host memory; function on the FPGA]
=> CPU is used to manage FPGA memory access
• No data coherency (host memory copied to FPGA)
• FPGA historically programmed using Verilog/VHDL
• CPU still handles all memory and data access
Acceleration with FPGA + OpenCAPI
[Diagram: application on the CPU with host memory; function on the FPGA, attached through OpenCAPI]
=> OpenCAPI I/O interface on FPGA accesses host memory directly
=> Function accesses only the host memory data it needs
=> Data coherency (data does not need to be copied to FPGA)
=> Address translation (@function = @application)
=> FPGA programmed with C/C++ using OC-Accel framework
17. FPGAs + OpenCAPI + OC-Accel Address All Issues
Each software or hardware trade-off (left) and how FPGA + OpenCAPI + OC-Accel addresses it (right):
• Software
  • Advantages:
    • More rapid development => application engineers write C/C++ functions (OC-Accel)
    • Lower non-recurring engineering costs => C/C++ code is reusable
    • Heightened portability => C/C++ code is portable
    • Ease of updating features or patching bugs => FPGA reconfigurable with C/C++ updates
  • Disadvantages:
    • Slower run time => function executed faster on FPGA + CPU relief
• Hardware
  • Advantages:
    • Much faster execution of functions => using FPGA instead of CPU
    • Reduced power consumption => FPGA is function specific only
    • Lower latency => FPGA is fast + OpenCAPI direct memory access
    • Increased parallelism and bandwidth => FPGA can have parallel logic
    • Better IC area and function utilization => FPGA uses function logic only
  • Disadvantages:
    • Lower ability to update design hardware => FPGA easily reconfigurable with C/C++ updates
    • Difficult to share source code between FPGAs => C/C++ easily recompiled for different FPGAs
    • Higher costs of functional verification => C/C++ code simulated and debugged
    • Longer development process and time to market => C/C++ code can be easier to write and upload
18. Ex: Monte-Carlo (FPGA Accelerated)
Monte Carlo analysis is a risk management technique used in the financial and
insurance industries for conducting a quantitative analysis of risks.
Running 1 million iterations
Results: at least 50x faster with CAPI and FPGA technology on a POWER server
By using CAPI with an FPGA, the C/C++ code was reduced by 40x on the application
side and 33% of memory and CPU was freed up (versus a non-CAPI FPGA).
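For readers unfamiliar with the technique, here is a minimal software-only sketch of a Monte Carlo risk estimate. It is not the accelerated code from this example: the normal return model, the parameter values, and the function names are all assumptions chosen for illustration, and it runs 100,000 iterations rather than 1 million to keep the run short.

```python
import random


def simulate_losses(n_iterations, mean_return=0.0005, volatility=0.02,
                    portfolio_value=1_000_000.0, seed=42):
    """Draw one-day portfolio losses from a normal return model (an assumption)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_iterations):
        daily_return = rng.gauss(mean_return, volatility)
        losses.append(-portfolio_value * daily_return)  # a negative return is a loss
    return losses


def value_at_risk(losses, confidence=0.99):
    """Loss threshold not exceeded in `confidence` of the simulated runs."""
    ordered = sorted(losses)
    index = min(len(ordered) - 1, int(confidence * len(ordered)))
    return ordered[index]


losses = simulate_losses(100_000)
print(f"95% one-day VaR: {value_at_risk(losses, 0.95):,.0f}")
print(f"99% one-day VaR: {value_at_risk(losses, 0.99):,.0f}")
```

Every iteration is independent of the others, which is the property that lets this kind of workload be spread across parallel hardware.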
19. Ex: PostgreSQL regex Matching Accelerated
PostgreSQL is a powerful, open source object-relational database system. SQL (Structured Query Language)
is used to communicate with a database.
PostgreSQL + OpenCAPI shows a compelling "regex" performance increase by leveraging the bandwidth and virtual
addressing of OpenCAPI technology. Accelerating the SQL with OpenCAPI-regex can be 4x to 10x faster than the
best PostgreSQL built-in functions (CPU multi-threading enabled).
Command example:
SELECT * FROM table WHERE pkt ~ pattern;
Basically: search the database for all pkt values that match pattern
Actual example single-search run times:
• CPU parallel Seq Scan: ~698 ms
• Custom Scan (PFCAPI): ~161 ms
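As a rough software analogue of the command example, the sketch below applies a regular expression to one column of every row and keeps the rows that match, which is what a sequential scan with the `~` operator has to do. The rows, the pattern, and the `select_matching` helper are hypothetical, and Python's `re` syntax is not identical to PostgreSQL's POSIX regular expressions, so treat it only as an illustration of the work being offloaded.

```python
import re


def select_matching(rows, pattern, column="pkt"):
    """Return rows whose `column` matches `pattern` anywhere in the value (unanchored)."""
    compiled = re.compile(pattern)
    return [row for row in rows if compiled.search(row[column])]


# Hypothetical table contents, for illustration only.
rows = [
    {"id": 1, "pkt": "GET /index.html"},
    {"id": 2, "pkt": "POST /login"},
    {"id": 3, "pkt": "GET /img/logo.png"},
]

matches = select_matching(rows, r"^GET .*\.(html|png)$")
print([row["id"] for row in matches])
```

The cost grows with the number of rows scanned and the complexity of the pattern, which is why the scan is a natural candidate for acceleration.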
20. Ex: Ultra Fast Data Acquisition (X-Ray Crystallography)
Goal: Real-time mapping of biological structure by examining molecule scatter plots of a protein crystal struck by x-rays
GPU + PCIe Configuration (Today)
• Digital camera sensors: 4 MPixels @ 1.1 kHz
• Raw data: 9 GBps
Processing steps:
1. Data acquisition
2. Raw data to real image conversion
3. Decimate / sort images
4. Data compression
[Diagram: protein molecule => digital camera sensors => raw data over PCIe => GPU => mapped real image and compressed data]
21. Ex: Ultra Fast Data Acquisition (X-Ray Crystallography)
Goal: Real-time mapping of biological structure by examining molecule scatter plots of a protein crystal struck by x-rays
FPGA w/ OpenCAPI (Goal)
• Digital camera sensors: 10 MPixels @ 2.2 kHz
• Raw data: 22 GBps
• Dual FPGAs in parallel, attached via OpenCAPI 3.0: 22 GBps
• Image data (unfiltered image, filtered image): 22 GBps
• GPU or FPGA or both
• Host with NX-gzip embedded HW accelerator => compressed data
Processing steps:
1. Data acquisition
2. Raw data to real image conversion
3. Decimate / sort images
4. Data compression
OpenCAPI breaks the 9 GBps PCIe bottleneck!
[Diagram: protein molecule => digital camera sensors => dual FPGAs => host => mapped real image and compressed data]
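The data rates quoted for the two crystallography configurations can be sanity-checked with simple arithmetic. The sketch below assumes 2 bytes per pixel and an even split of the load across the two FPGAs; neither is stated on the slides, so both are labeled as assumptions.

```python
BYTES_PER_PIXEL = 2  # assumption: 16-bit pixels (not stated in the slides)


def raw_rate_bytes_per_s(pixels, frames_per_s, bytes_per_pixel=BYTES_PER_PIXEL):
    """Raw sensor data rate = pixels per frame * frames per second * bytes per pixel."""
    return pixels * frames_per_s * bytes_per_pixel


today = raw_rate_bytes_per_s(4_000_000, 1_100)    # 4 MPixels @ 1.1 kHz
goal = raw_rate_bytes_per_s(10_000_000, 2_200)    # 10 MPixels @ 2.2 kHz

print(f"Today: {today / 1e9:.1f} GB/s")
print(f"Goal:  {goal / 1e9:.1f} GB/s total, "
      f"{goal / 2 / 1e9:.1f} GB/s per FPGA if split evenly (assumption)")
```

Under these assumptions the first figure comes to 8.8 GB/s, in line with the quoted 9 GBps, and the second to 44 GB/s in total, or 22 GB/s per FPGA when shared between two devices.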
22. Ex: Pull Quote
The benefit of using POWER interfaces, i.e., NVLink and OpenCAPI, is
not only bandwidth, but these interfaces allow also for coherent
memory access. FPGA board connected via OpenCAPI or GPGPU
connected via NVLink sees host (CPU) virtual memory space exactly like
the process running on the CPU, reducing the burden of writing
reliable and secure applications. Memory coherency can be also
available for PCIe FPGA accelerators installed in POWER9 servers via
OpenCAPI predecessor, the Coherent Accelerator Processor Interface
(CAPI). IBM also provides optimized software to benefit from the
architecture, including the CAPI Storage, Network, and Analytics
Programming (SNAP) framework [51,52] that simplifies the integration of
FPGA designs with POWER9, as well as optimized ML and data analysis
routines for GPGPUs or FPGAs [53].
Structural Dynamics 7, 014305 (2020); https://doi.org/10.1063/1.5143480
23. Ex: Memory Coherency
Scenario: 2 MB of data scattered in host memory is processed in an FPGA.
"Classic" PCIe FPGA card
[Diagram: server with application and host memory blocks; function on the FPGA]
1. Gathering data (SW memcopy)
2. 1 transaction of a big amount of data to FPGA (2 MB)
"CAPI-enabled" FPGA card
[Diagram: server with application and host memory blocks; function on the FPGA, attached through CAPI]
1. 1 transaction of 8 kB for AddrSet from host memory to FPGA
2. 1024 transactions of 2 kB from host memory to FPGA. Directly reads required data at random addresses.
Results: CAPI-enabled was 2-3x faster than the classic method
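The two data paths in this scenario can be modeled in a few lines of software. This is only a bookkeeping sketch, not an OpenCAPI or PCIe implementation: host memory is a Python `bytearray`, the function names are hypothetical, and the 8-byte address size is an assumption chosen because 1024 addresses of 8 bytes give the 8 kB AddrSet mentioned above.

```python
import random

BLOCK_SIZE = 2 * 1024   # 2 kB per block, from the scenario
NUM_BLOCKS = 1024       # 1024 blocks * 2 kB = 2 MB, from the scenario
ADDR_BYTES = 8          # assumption: 64-bit addresses, so 1024 * 8 B = 8 kB


def make_host_memory(seed=0):
    """Build a fake host memory with NUM_BLOCKS data blocks at scattered offsets."""
    rng = random.Random(seed)
    slots = NUM_BLOCKS * 4                       # memory is 4x larger than the payload
    memory = bytearray(slots * BLOCK_SIZE)
    offsets = [s * BLOCK_SIZE for s in rng.sample(range(slots), NUM_BLOCKS)]
    for i, off in enumerate(offsets):
        memory[off:off + BLOCK_SIZE] = bytes([i % 256]) * BLOCK_SIZE
    return memory, offsets


def classic_transfer(memory, offsets):
    """CPU gathers the blocks into a staging buffer, then sends one big transaction."""
    staging = bytearray()
    for off in offsets:                          # step 1: software memcopy on the host
        staging += memory[off:off + BLOCK_SIZE]
    transactions = [len(staging)]                # step 2: one 2 MB transfer
    return bytes(staging), transactions, len(staging)


def coherent_transfer(memory, offsets):
    """Host sends only the address set; the device reads each block directly."""
    transactions = [len(offsets) * ADDR_BYTES]   # step 1: one AddrSet transfer
    device_buffer = bytearray()
    for off in offsets:                          # step 2: one direct read per block
        device_buffer += memory[off:off + BLOCK_SIZE]
        transactions.append(BLOCK_SIZE)
    return bytes(device_buffer), transactions, 0  # no payload bytes copied by the CPU


memory, offsets = make_host_memory()
for name, fn in (("classic", classic_transfer), ("coherent", coherent_transfer)):
    data, transactions, cpu_copied = fn(memory, offsets)
    print(f"{name}: {len(transactions)} transactions, "
          f"{sum(transactions)} bytes moved, {cpu_copied} bytes copied by CPU")
```

Both paths deliver the same bytes to the function. The difference the model highlights is where the gathering work happens: on the CPU in the classic case, and in the device's own reads in the coherent case.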
24. Ease of FPGA Programming (OC-Accel)
Benefits:
• Faster Time To Market: Port a function to an FPGA in days, not months
• No Obsolescence: Simply recompile unchanged C/C++ code for a different FPGA
• No Link Constraint: Moving from a CAPI (over PCIe) link to OpenCAPI is just a matter of recompiling,
  with no code change
• No Specific Hardware Skills Needed: The C/C++ coder can focus on functionality, as all the resources are
  managed by the framework.
• Open-Source Framework: The code can be modified and improved by any user.
Example:
• Customer ported and optimized SHA3 C code within 10 days using the SNAP* framework, versus
  4 months in VHDL without SNAP
* Note: SNAP is the predecessor to OC-Accel; overall flow and performance are equivalent.
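When a function such as SHA3 is ported, one common way to gain confidence in the port is to compare its output against a trusted software reference on a set of inputs. The sketch below shows that idea using Python's standard `hashlib` as the reference; the `candidate` callable stands in for whatever wrapper exposes the ported function and is hypothetical, as is the choice of the SHA3-256 variant. The source does not describe how the customer actually verified their port.

```python
import hashlib


def reference_sha3_256(data: bytes) -> str:
    """Trusted software reference: SHA3-256 from the Python standard library."""
    return hashlib.sha3_256(data).hexdigest()


def find_mismatches(candidate, test_inputs):
    """Return the inputs for which `candidate` disagrees with the reference."""
    return [data for data in test_inputs if candidate(data) != reference_sha3_256(data)]


# A few inputs of different sizes; a real campaign would use many more.
test_inputs = [b"", b"abc", bytes(range(256)) * 10]

# Here the reference is checked against itself, as a stand-in for a ported function.
print(len(find_mismatches(reference_sha3_256, test_inputs)), "mismatches")
```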
Development Plans:
• OC-Accel with OpenCAPI today, OC-Accel with other emerging standards like CXL tomorrow!
25. FPGAs + OpenCAPI + OC-Accel Has It All
• Very high bandwidth
• Faster development and time to market with OC-Accel
26. Spreading the CAPI Love (OC-Accel)
Developers Aren't Where We Need Them
[Chart: layers of the software stack]
• Scripting
• Interpreted App (Python / Rails / Java)
• Non-Interpreted App (C++ / Java JRE)
• Procedural App (C / C++)
• High Level OS (C / C++)
• Firmware
• HW API (C, ASM)
• Kernel (C, ASM)
• HDL
Chart content courtesy of Aaron Sullivan @Rackspace
27. Spreading the CAPI Love (OC-Accel)
Developers Where We Need Them
[Chart: the same software stack, annotated with Application, New Abstraction, and Soft-Hardware layers]
• Scripting
• Interpreted App (Python / Rails / Java)
• Non-Interpreted App (C++ / Java JRE)
• Procedural App (C / C++)
• High Level OS (C / C++)
• Kernel (C, ASM)
• HW API (C, ASM)
• Firmware
• HDL
Chart content courtesy of Aaron Sullivan @Rackspace
28. Contact us
Would you like to:
- Know more about accelerators?
- See a live demonstration?
- Do a benchmark?
- Get answers to your questions?
alexandre.castellane@fr.ibm.com
bruno.mesnet@fr.ibm.com
fabrice_moyen@fr.ibm.com
luyong@cn.ibm.com
shgoupf@cn.ibm.com