2. • The number of ML publications is growing exponentially, at a faster rate than Moore's law!
[Chart: Machine Learning arXiv papers per year]
• Moore's law: the number of transistors in a dense integrated circuit doubles about every two years.
Source: https://data-mining.philippe-fournier-viger.com/too-many-machine-learning-papers/
3. “AI is the new electricity” – Andrew Ng
Artificial intelligence is everywhere! Just as electricity revolutionized lives 100 years ago, AI is changing our lives completely today. Google, Netflix, face detection, predictive search, recommendations, maps, and autonomous cars, to name a few, all use some form of AI to make our lives better.
4. Accelerator
• An AI accelerator is a class of specialized hardware accelerator designed to speed up artificial intelligence and machine learning applications, including artificial neural networks and machine vision.
• Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently than software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.
5. When to Use Hardware Acceleration
• Computer graphics via Graphics Processing Unit (GPU)
• Digital signal processing via Digital Signal Processor
• Analog signal processing via Field-Programmable Analog Array
• Sound processing via sound card
• Computer networking via network processor and network interface controller
• Cryptography via cryptographic accelerator and secure cryptoprocessor
• Artificial Intelligence via AI accelerator
• In-memory processing via network on a chip and systolic array
• Any given computing task via Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits
(ASICs), Complex Programmable Logic Devices (CPLD), and Systems-on-Chip (SoC)
6. Most common hardware accelerators:
• Graphics Processing Units (GPUs): originally designed for rendering and manipulating images, GPUs are now used for calculations involving massive amounts of data, accelerating portions of an application while the rest continues to run on the CPU (see the sketch after this list). The massive parallelism of modern GPUs lets users process billions of records in parallel.
• Field Programmable Gate Arrays (FPGAs): a hardware description language (HDL)-specified
semiconductor integrated circuit designed to allow the user to configure a large majority of the
electrical functionality. FPGAs can be used to accelerate parts of an algorithm, sharing part of the
computation between the FPGA and a general-purpose processor.
• Application-Specific Integrated Circuits (ASICs): an integrated circuit customized specifically for a
particular purpose or application, improving overall speed as it focuses solely on performing its
one function. Maximum complexity in modern ASICs has grown to over 100 million logic gates.
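To make the GPU bullet concrete, here is a minimal PyTorch sketch of the offload pattern: one hot computation runs on the GPU while the rest of the program stays on the CPU. The matrix sizes are arbitrary assumptions for illustration.

```python
import torch

# Pick the accelerator if present; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two large matrices: the multiplication is the "hot" portion we offload.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b          # runs on the GPU if one was found

# Bring the result back to the CPU for the rest of the application.
result = c.cpu()
print(result.shape, device)
```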
7. Hardware Accelerator Objectives
• Reduce the number of times values are moved from sources that have a high energy cost, such as DRAM or large on-chip buffers;
• Reduce the cost of moving each value;
• Allocate work to as many processing elements (PEs) as possible so that they can operate in parallel; and
• Minimize the number of idle cycles per PE by ensuring that there is sufficient memory bandwidth to deliver the data that needs to be processed, that the data is delivered before it is needed, and that workload imbalance among the parallel PEs is minimized.
Reuse: input feature map reuse, filter reuse, convolutional reuse (quantified in the sketch below).
Accuracy, throughput, latency, power consumption, hardware cost, flexibility, and scalability all need to be considered when choosing suitable hardware and models.
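The three reuse patterns can be quantified by simple counting. The sketch below estimates per-value reuse for one convolutional layer with stride 1 and "same" padding; all layer dimensions are made-up assumptions for illustration.

```python
# Reuse factors for one conv layer, stride 1, "same" padding.
# All sizes below are illustrative assumptions, not from the slides.
H, W = 56, 56        # output feature map height / width
K = 3                # kernel is K x K
C = 64               # input channels
M = 128              # output channels (number of filters)

filter_reuse = H * W             # each weight is applied at every output position
ifmap_reuse  = K * K * M         # each input value feeds K*K window positions of all M filters
macs = H * W * K * K * C * M     # total multiply-accumulates in the layer

print(f"each filter weight reused {filter_reuse} times")
print(f"each input activation reused ~{ifmap_reuse} times")
print(f"total MACs: {macs:,}")
```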
13. Deep Learning in Pulmonology
The RegNet architecture for non-rigid registration of pulmonary CT follow-up scans.
14. Computer-Assisted Decision Support System for Pulmonary Cancer Detection and Stage Classification on CT Images
Overview of the decision support system.
15. Predicting pregnancy test results after embryo transfer by image feature extraction and analysis using Machine Learning
16. Automation of early-stage human embryo development detection – Deep Learning
Embryo image classification based on AlexNet and VGG16 architectures
21. Frameworks: open-source libraries of software for building and training DNNs
• Caffe (UC Berkeley, 2014): supports C, C++, Python, and MATLAB.
• TensorFlow (Google, 2015): supports C++ and Python, runs on multiple CPUs and GPUs, and has more flexibility than Caffe.
• Torch (Facebook and NYU): supports C, C++, and Lua; PyTorch is its successor and is built in Python (see the sketch below).
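To show how little code a modern framework needs, here is a minimal PyTorch sketch that defines and runs a two-layer network; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# A minimal two-layer DNN in PyTorch (sizes chosen arbitrarily).
model = nn.Sequential(
    nn.Linear(784, 128),   # e.g., a flattened 28x28 input
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 output classes
)

x = torch.randn(32, 784)   # a dummy batch of 32 inputs
logits = model(x)
print(logits.shape)        # torch.Size([32, 10])
```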
24. Popular Datasets For Classification
• The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples (loaded in the sketch below).
http://yann.lecun.com/exdb/mnist/
• The most widely used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012-2017 image classification and localization dataset. It spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images.
https://image-net.org/download.php
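Both datasets are readily available in common frameworks. Below is a minimal sketch that fetches MNIST through torchvision, assuming torchvision is installed; the ./data root directory is an arbitrary choice.

```python
import torch
from torchvision import datasets, transforms

# Download MNIST (60,000 training / 10,000 test images) via torchvision.
transform = transforms.ToTensor()
train_set = datasets.MNIST(root="./data", train=True,  download=True, transform=transform)
test_set  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(len(train_set), len(test_set), images.shape)  # 60000 10000 torch.Size([64, 1, 28, 28])
```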
26. Edge Computing
• Edge computing uses a low-latency network to process data and return it to the request sender faster. In edge computing, users have direct access to, and control over, the data processing.
• In cloud computing, by contrast, users send requests and let the cloud servers do the rest of the work. The difference may be milliseconds, but time is important in this day and age.
30. Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based around a matrix
of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be
reprogrammed to desired application or functionality requirements after manufacturing.
31. Zynq 7000 Architecture
• Tightly integrated programmable logic
  - Used to extend the processing system
  - Scalable density and performance
• Complete ARM®-based processing system
  - Application Processor Unit (APU): dual ARM Cortex™-A9
  - Caches and support blocks
  - Fully integrated memory controllers
  - I/O peripherals
• Flexible array of I/O
  - Wide range of external multi-standard I/O
  - High-performance integrated serial transceivers
  - Analog-to-digital converter inputs
32. Parallelism on FPGA
• Data comes in at the rate of the camera output (62 MHz).
• It is then split into 8 parallel processes, so the data rate is now 496 MOPS (million operations per second), i.e., 8 × 62.
• It then passes through the different stages of the processing pipeline, now processing 2480 MOPS in total (2480 / 496 = 5 concurrent stages).
• The parallelism is then removed, and the output is written to memory at a steady rate of 62 MHz.
(The arithmetic is reproduced in the snippet below.)
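The rates on this slide follow from simple multiplication; the snippet below just reproduces the arithmetic. Note that the five-stage count is inferred from 2480 / 496 and is not stated explicitly on the slide.

```python
camera_rate_mhz = 62     # pixel rate from the camera, in millions per second
parallel_lanes  = 8      # the stream is split into 8 parallel processes
pipeline_stages = 5      # inferred assumption: 2480 / 496 = 5

after_split = camera_rate_mhz * parallel_lanes    # 496 MOPS
in_pipeline = after_split * pipeline_stages       # 2480 MOPS
print(after_split, in_pipeline)                   # 496 2480
```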
34. CNN Architecture
A general architecture for the convolution layer (kernel size 3×3) with three different levels of parallelism (illustrated in the sketch below).
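As a software analogue of that architecture, the plain-Python sketch below writes the 3×3 convolution as a loop nest and marks, in comments, three loop levels an FPGA design can unroll into parallel multipliers. All sizes are illustrative assumptions.

```python
# Naive 3x3 convolution as a loop nest (illustrative sizes, stride 1, no padding).
# Comments mark loops an FPGA design can unroll into parallel hardware.
H, W, C, M, K = 8, 8, 4, 6, 3

ifmap   = [[[1.0] * C for _ in range(W)] for _ in range(H)]
weights = [[[[0.1] * C for _ in range(K)] for _ in range(K)] for _ in range(M)]
ofmap   = [[[0.0] * M for _ in range(W - K + 1)] for _ in range(H - K + 1)]

for y in range(H - K + 1):                 # output rows (sequential)
    for x in range(W - K + 1):             # output columns (sequential)
        for m in range(M):                 # level 3: parallel across output channels
            acc = 0.0
            for ky in range(K):            # level 1: unroll the 3x3 kernel window
                for kx in range(K):        #          (9 multipliers in parallel)
                    for c in range(C):     # level 2: parallel across input channels
                        acc += ifmap[y + ky][x + kx][c] * weights[m][ky][kx][c]
            ofmap[y][x][m] = acc

print(ofmap[0][0])   # one output pixel across all M channels
```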
35. Xilinx® Vitis™ AI
• Xilinx® Vitis™ AI is a development stack for AI inference on
Xilinx hardware platforms, including both edge devices and
Alveo cards.
• It consists of optimized IP, tools, libraries, models, and
example designs. It is designed with high efficiency and ease
of use in mind, unleashing the full potential of AI acceleration
on Xilinx FPGA and ACAP.
36. Vitis AI is composed of the following key components:
• AI Model Zoo: a comprehensive set of pre-optimized models that are ready to deploy on Xilinx devices.
• AI Optimizer: an optional model optimizer that can prune a model by up to 90%. It is available separately with commercial licenses.
• AI Quantizer: a powerful quantizer that supports model quantization, calibration, and fine-tuning (see the sketch after this list).
• AI Compiler: compiles the quantized model to a highly efficient instruction set and data flow.
• AI Profiler: performs an in-depth analysis of the efficiency and utilization of the AI inference implementation.
• AI Library: offers high-level yet optimized C++ APIs for AI applications from edge to cloud.
• DPU: efficient and scalable IP cores that can be customized to meet the needs of many different applications.
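A minimal sketch of the AI Quantizer's PyTorch flow (vai_q_pytorch) follows. The call names follow the Vitis AI user guide, but exact arguments differ across releases, and the tiny model and random calibration batches here are stand-ins for a real trained model and dataset.

```python
# Sketch of the Vitis AI PyTorch quantizer (vai_q_pytorch) calibration flow.
# pytorch_nndct ships in the Vitis AI docker; names may vary by release.
import torch
import torch.nn as nn
from pytorch_nndct.apis import torch_quantizer

# A tiny stand-in float model; a real flow would load trained weights.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 224, 224)         # one representative input shape

# "calib" mode inserts quantization nodes and gathers activation statistics.
quantizer = torch_quantizer("calib", model, (dummy,))
quant_model = quantizer.quant_model

for _ in range(8):                          # stand-in for a real calibration loader
    quant_model(torch.randn(1, 3, 224, 224))

quantizer.export_quant_config()             # write the calibration results

# A second pass with quant_mode="test" evaluates accuracy and exports the
# .xmodel that the AI Compiler consumes: quantizer.export_xmodel()
```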
37. Vitis AI Workflow
• Model development: train models or get them from the Vitis AI Model Zoo, then use the Vitis AI Optimizer (optional), Quantizer, and Compiler to convert float models into DPU instruction files.
• HW development: use the Vitis tool to integrate the DPU IP and other kernels with the platform and generate the board boot files.
• SW development: implement model-deployment code using VART or the Vitis AI Library, finish application-level SW development, and generate the executable that runs on the board (a minimal VART sketch follows).
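For the SW-development step, here is a minimal VART inference sketch in Python. The "model.xmodel" path is a placeholder, and the subgraph selection and buffer dtypes follow the patterns in the Vitis AI runtime examples; real models may need different dtypes and pre/post-processing.

```python
# Minimal VART inference sketch; run inside a Vitis AI environment.
import numpy as np
import xir
import vart

graph = xir.Graph.deserialize("model.xmodel")     # placeholder path
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_sg = [s for s in subgraphs
          if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

runner = vart.Runner.create_runner(dpu_sg, "run")
in_t  = runner.get_input_tensors()[0]
out_t = runner.get_output_tensors()[0]

# One dummy batch shaped like the model expects; quantized DPU models
# typically use int8 buffers (check the tensor dtype in a real app).
input_data  = [np.zeros(tuple(in_t.dims), dtype=np.int8)]
output_data = [np.zeros(tuple(out_t.dims), dtype=np.int8)]

job_id = runner.execute_async(input_data, output_data)
runner.wait(job_id)
print(output_data[0].shape)
```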
38. System Requirements
• FPGA: Alveo U50, U50LV, U200, U250, U280 cards; Zynq UltraScale+ MPSoC ZCU102 and ZCU104 boards; Versal ACAP VCK190 and VCK5000 boards; KV260
• Motherboard: PCI Express 3.0-compliant x16, single or dual slot
• System power supply: 225 W
• Operating system: Ubuntu 16.04, 18.04, 20.04; CentOS 7.6, 7.7, 7.8, 8.1; RHEL 7.6, 7.7, 7.8, 8.1
• CPU: Intel i3/i5/i7/i9/Xeon 64-bit, or AMD EPYC 7F52 64-bit
• GPU (optional, to accelerate quantization): NVIDIA GPU supporting CUDA 9.0 or higher, such as the P100 or V100
• CUDA driver (optional, to accelerate quantization): driver compatible with the CUDA version; NVIDIA-384 or higher for CUDA 9.0, NVIDIA-410 or higher for CUDA 10.0
• Docker: version 19.03 or higher