2. Executive Summary :)
DL requires a lot of computations:
● Currently GPUs (mostly NVIDIA) are the most popular choice
● The only alternative right now is Google TPU gen3 (ASIC, cloud).
● More FPGAs/ASICs are coming into this field (Alibaba, Bitmain Sophon, Intel
Nervana?). The situation resembles the path of Bitcoin mining
● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic, memristors, etc)
● Quantum computing can benefit machine learning as well (but it probably won’t be
a desktop or in-house server solution)
4. The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
https://arxiv.org/abs/1911.05289
5. Typically multi-core even on the desktop market:
● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs
● up to 18 cores/36 threads in high-end Intel CPUs
(i9–7980XE/9980XE/10980XE) [https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors]
● up to 32 cores/64 threads in AMD Ryzen Threadripper
(seems to be the same for the 3rd gen
https://www.anandtech.com/show/14994/first-details-about-3rd-generation-ryzen-threadripper-32-cores-280-w)
x86: Desktops
6. On the server market:
● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 Processor)
● AMD EPYC: up to 64 cores/128 threads (EPYC 7742)
● usually having more cores than desktop processors and some other useful
capabilities (supporting more RAM, multi-processor configurations, ECC, etc)
x86: Servers
7. Intel x86 manycore processors (up to 72 cores with up to 288 threads), supporting
AVX-512 instructions.
The line seems to be dead now: https://www.extremetech.com/extreme/290963-intel-quietly-kills-off-xeon-phi
Waiting for Intel’s Xe GPU:
https://wccftech.com/intel-ponte-vecchio-xe-hpc-gpu-detailed-1000-eus-hbm2-rambo-cache-clx/
x86: Xeon Phi
8. AVX-512: 512-bit SIMD extensions with Fused Multiply-Add (FMA) core instructions;
later extensions enable lower-precision operations. List of CPUs with AVX-512 support:
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
VNNI (Vector Neural Network Instructions): multiply-and-add for integers, etc.,
designed to accelerate convolutional neural network-based algorithms.
https://en.wikichip.org/wiki/x86/avx512vnni
DL Boost: AVX512-VNNI (mainly for inference acceleration) + Brain floating-point
format (bfloat16, mainly for training acceleration).
https://en.wikichip.org/wiki/brain_floating-point_format
x86: ML instructions (SIMD)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
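A quick way to see whether a given x86 CPU actually exposes these instruction sets is to look at its feature flags. A minimal sketch, assuming Linux (where the flags are listed in /proc/cpuinfo):

```python
# List the AVX-512 / VNNI / bfloat16 feature flags reported by a Linux x86 CPU.
# Flag names follow the Linux kernel's /proc/cpuinfo naming.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx2", "avx512f", "avx512_vnni", "avx512_bf16"):
    print(f"{feature:12s} {'yes' if feature in flags else 'no'}")
```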
9. ● BigDL: distributed deep learning library for Apache Spark
https://github.com/intel-analytics/BigDL
● Deep Neural Network Library (DNNL): an open-source performance library
for deep learning applications. Layer primitives, etc.
https://intel.github.io/mkl-dnn/
● PlaidML: advanced and portable tensor compiler for enabling deep learning
on laptops, embedded devices, or other devices.
Supports Keras, ONNX, and nGraph.
https://github.com/plaidml/plaidml
● OpenVINO Toolkit: inference optimization and deployment, primarily for computer vision
https://docs.openvinotoolkit.org/
x86: Optimized ML Libraries
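As an example of how lightweight some of these integrations are, PlaidML plugs into Keras as a backend. A minimal sketch, assuming the plaidml and plaidml-keras packages are installed:

```python
# Install PlaidML as the Keras backend *before* importing Keras
# (a sketch; assumes the plaidml and plaidml-keras packages are installed).
import plaidml.keras
plaidml.keras.install_backend()

import keras
from keras import layers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()  # runs on whatever OpenCL/Metal device PlaidML selects
```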
10. Some CPU-optimized DL libraries:
● Caffe Con Troll (research project, latest commit in 2016)
https://github.com/HazyResearch/CaffeConTroll
● Intel Caffe (optimized for Xeon):
https://github.com/intel/caffe
● Intel DL Boost can be used in many popular frameworks:
TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe
https://www.intel.ai/increasing-ai-performance-intel-dlboost/
x86: Optimized ML Libraries
11. ● nGraph: an open-source C++ library, compiler and runtime for deep learning.
Frameworks using the nGraph Compiler stack to execute workloads have shown
up to a 45X performance boost compared to native framework
implementations. https://www.ngraph.ai/
Graph compilers
14. You need to transfer data between CPU host memory and GPU memory.
In most x86 systems this is done over the PCI Express (PCIe) bus.
x86: PCIe
15. A typical GPU card works in x16 mode at full speed, but may also work in x8 or x4
mode at lower speed.
PCIe v.3 allows for 985 MB/s per lane, so 15.75 GB/s for x16 links.
PCIe v.4 is twice as fast, so 31.51 GB/s for x16 (supported by AMD’s X570 chipset
and Radeon cards).
PCIe v.5 is twice as fast again (the spec is released, no products expected before
2020), so 63 GB/s for x16. (See the quick calculation below.)
x86: PCIe bandwidth
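The x16 numbers above are just the per-lane figure scaled up; a back-of-envelope check in Python:

```python
# Reproduce the per-generation x16 bandwidth figures from the slide,
# starting from PCIe 3.0's ~985 MB/s per lane.
lane_gen3_mb_s = 985
for gen, scale in (("PCIe 3.0", 1), ("PCIe 4.0", 2), ("PCIe 5.0", 4)):
    x16_gb_s = lane_gen3_mb_s * scale * 16 / 1000
    print(f"{gen}: ~{x16_gb_s:.2f} GB/s at x16")
# -> ~15.76, ~31.52 and ~63.04 GB/s, matching the numbers above
```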
16. A typical Intel mainstream desktop processor has 16 PCIe lanes (e.g. the i7-7700K,
i7-8700K or even the i9-9900K).
A high-end desktop (HEDT) processor has 28 to 44 lanes (e.g. the i7-7820X has 28,
the rather old i7-6850K has 40, the i9-9980XE has 44; the upcoming i9-10940X and higher
will have 48).
Xeons have up to 64 lanes (PCIe v.3).
AMD Ryzen Threadripper has 64 PCIe lanes, EPYC has 128 lanes (PCIe v.4).
Be careful: Intel sometimes quotes “Platform PCIe lanes”, which counts CPU+PCH lanes,
but the PCH lanes share a single uplink to the CPU!
https://www.anandtech.com/show/11839/intel-core-i9-7980xe-and-core-i9-7960x-review/4
Check specs at https://ark.intel.com/
x86: PCIe bandwidth / CPU side
17. With a CPU that has few PCIe lanes you can’t use two GPUs at their highest
speed (x16). In some cases you can’t even use a single GPU at x16.
But in practice the difference between v.3/v.4 or x8/x16 can be insignificant;
bottlenecks may exist in other places (see the measurement sketch below).
x86: PCIe bandwidth / CPU side
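If you suspect the link (or its x8/x4 mode) is the bottleneck, it is easy to measure the achieved host-to-GPU bandwidth directly. A rough sketch, assuming PyTorch with a CUDA GPU:

```python
# Time a 1 GiB pinned host->GPU copy to estimate effective PCIe bandwidth.
import time
import torch

x = torch.empty(1024, 1024, 256, dtype=torch.float32).pin_memory()  # 1 GiB, pinned
torch.cuda.synchronize()
start = time.perf_counter()
x_gpu = x.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
gib = x.numel() * x.element_size() / 2**30
print(f"Host->GPU: {gib / elapsed:.1f} GiB/s")
```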
21. ● Single-board computers: Raspberry Pi, the CPU side of the Jetson
Nano and the Google Coral Dev Board.
● Mobile: Qualcomm, Apple A11, etc
● Server: Marvell ThunderX, Ampere eMAG,
Amazon A1 instance, etc; NVIDIA announced
GPU-accelerated Arm-based servers.
● Laptops: Microsoft Surface Pro X
● ARM also has the Ethos ML/AI NPUs and Mali GPUs
ARM
22. ● ARM announced the Neoverse N1 platform (scales up to 128 cores)
https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html
● Qualcomm manufactured an ARM server processor for cloud applications called Centriq 2400 (48 single-threaded cores,
2.2 GHz). The project has been stopped.
https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html
● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag
● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz)
https://www.marvell.com/server-processors/thunderx-arm-processors/
Supports NVIDIA GPU:
https://www.electronicsweekly.com/news/nvidia-cuda-x-ai-hpc-software-stack-marvell-thunderx-platforms-2019-11/
● Amazon Graviton ARM processor (16 cores, 2.3GHz)
https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400
https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/
● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
ARM: Servers
23. Current architecture is POWER9:
● 12 cores x 8 threads or 24 cores x 4 threads (96 threads).
● PCIe v.4, 48 PCIe lanes
● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection
● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC)
IBM POWER
25. The current fastest supercomputer in the world, Summit, is based on POWER9,
while also using Nvidia's Volta GPUs as accelerators.
POWER10 is expected in 2020-2021:
● 48 cores
● PCIe v.5
● NVLink 3.0
● OpenCAPI 4.0
● ...
IBM POWER
26. An open-source hardware instruction set architecture.
Examples:
● SiFive U5, U7 and U8 cores
https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
● Alibaba's RISC-V processor design – the Xuantie 910 (XT 910)
12nm 64-bit 16 cores clocked at up to 2.5GHz, the fastest RISC-V
processor to date
https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/
● Western Digital SweRV Core designed for embedded devices
supporting data-intensive edge applications.
https://www.westerndigital.com/company/innovations/risc-v
● Esperanto Technologies is building an AI chip with 1k+ cores
https://www.esperanto.ai/technology/
RISC-V
27. ● KiloCore project with 1000 independent programmable processors
https://www.ucdavis.edu/news/worlds-first-1000-processor-chip
● Cerebras Systems Wafer Scale Engine (WSE), an AI chip that measures
8.46x8.46 inches, making it almost the size of an iPad and more than 50
times larger than a CPU or GPU.
The WSE has 1.2 trillion transistors, 400,000 compute cores and 18
gigabytes of on-chip memory.
https://www.nextplatform.com/2019/08/21/machine-learning-chip-breaks-new-ground-with-waferscale-integration/
Others
31. ● Peak performance (GFLOPS) at FP32/16/...
● #Cores (+Tensor Cores)
● Memory size
● Memory speed/bandwidth
● Precision support
● Can connect using NVLink?
● Power usage (Watts)
● Price
● GFLOPS/USD
● GFLOPS/Watt
● Form factor (for desktop or server?)
● ECC memory
● Legal restrictions (e.g. GeForce cards are not allowed to be used in datacenters)
Important dimensions
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
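The derived metrics are straightforward to compute once you have the datasheet numbers. A small sketch (the card names and numbers below are placeholders, not real datasheet values):

```python
# Compare hypothetical cards on GFLOPS/USD and GFLOPS/Watt.
cards = {
    "card A": {"gflops_fp32": 13_000, "price_usd": 700, "watts": 250},
    "card B": {"gflops_fp32": 8_000, "price_usd": 400, "watts": 175},
}
for name, c in cards.items():
    print(f"{name}: {c['gflops_fp32'] / c['price_usd']:.1f} GFLOPS/USD, "
          f"{c['gflops_fp32'] / c['watts']:.1f} GFLOPS/Watt")
```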
33. ● FP64 (64-bit float), not used for DL
● FP32 — the most commonly used for training
● FP16 or Mixed precision (FP32+FP16) — becoming the new default
● INT8 — usually for inference
● INT4, INT1 — experimental modes for inference
Precision
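In practice the FP32+FP16 mixed-precision mode is usually enabled through the framework rather than by hand. A minimal training-loop sketch, assuming PyTorch (1.6+) with automatic mixed precision on a CUDA GPU:

```python
# Mixed-precision (FP32+FP16) training loop with PyTorch AMP.
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # loss scaling avoids FP16 gradient underflow
loss_fn = torch.nn.CrossEntropyLoss()

for _ in range(10):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # matmuls/convs run in FP16, reductions stay in FP32
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```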
34. bfloat16 isn’t supported on current GPUs (but it is supported on TPU gen3, and will be
supported on AMD GPUs and Intel CPUs/NNPs).
Precision: bfloat16
https://www.nextplatform.com/2019/07/15/intel-prepares-to-graft-googles-bfloat16-onto-processors/
https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware
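The point of bfloat16 is that it keeps FP32’s 8-bit exponent (and hence its dynamic range) while dropping mantissa bits, so values that overflow or underflow in FP16 survive the cast. A small illustration using TensorFlow’s software bfloat16 type:

```python
# FP16 vs bfloat16 dynamic range (TensorFlow provides a bfloat16 dtype).
import tensorflow as tf

x = tf.constant([1e30, 1e-30, 3.141592], dtype=tf.float32)
print(tf.cast(x, tf.float16))   # 1e30 overflows to inf, 1e-30 underflows to 0
print(tf.cast(x, tf.bfloat16))  # both survive, at reduced precision
```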
39. Not only FLOPS: Roofline Performance Model
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
40. Roofline Performance Model: Example
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
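The roofline bound itself is a one-liner: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. A sketch with illustrative (not datasheet) numbers:

```python
# Roofline model: performance is bounded by compute or by memory traffic.
def roofline_gflops(peak_gflops, mem_bw_gb_s, flops_per_byte):
    return min(peak_gflops, mem_bw_gb_s * flops_per_byte)

peak, bw = 14_000.0, 900.0  # GFLOPS (FP32) and GB/s, illustrative numbers only
for intensity in (1, 4, 16, 64):
    print(f"AI = {intensity:3d} FLOP/byte -> {roofline_gflops(peak, bw, intensity):8.0f} GFLOPS")
```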
41. Separate cards can be joined using NVLink; SLI is not relevant for DL, it’s for
graphics.
NVSwitch: The Fully Connected NVLink
NCCL 1: multi-GPU collective communication primitives library
NVIDIA: Single-machine Multi-GPU
42. Distributed training is now a commodity (but scaling is sublinear).
NCCL 2: multi-node collective communication primitives library
NVIDIA: Distributed Multi-GPU
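A typical way to use NCCL 2 from a framework is PyTorch’s DistributedDataParallel with the NCCL backend. A minimal sketch (one process per GPU, launched e.g. with torchrun / torch.distributed.launch, which set the rank environment variables):

```python
# Data-parallel training with NCCL collectives via PyTorch DDP.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL handles the all-reduces
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                # gradients are averaged across ranks via NCCL
optimizer.step()
```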
43. ● AMD has powerful GPUs, but they are
mostly unsupported by DL frameworks
● Intel has its own integrated GPUs on the
processor (HD Graphics)
● Some Intel CPUs were equipped with AMD
GPUs (Kaby Lake-G, say, i7-8809G)
● Intel plans to release its first discrete GPU
in 2020 (Xe architecture)
GPU: AMD, Intel
45. Problems
Serious problems with the current processors (CPU/GPU) are:
● Energy efficiency:
○ The version of AlphaGo playing against Lee Sedol used 1,920 CPUs and
280 GPUs (https://en.wikipedia.org/wiki/AlphaGo)
○ The estimated power consumption was approximately 1 MW (200 W per
CPU and 200 W per GPU), compared to only 20 watts used by the human
brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/)
● Architecture:
○ good for matrix multiplication (still the essence of DL)
○ but not well-suited for brain-like computations
47. FPGA
● FPGA (field-programmable gate array) is an integrated circuit designed to be
configured by a customer or a designer after manufacturing
● Both FPGAs and ASICs (see later) are usually much more energy-efficient than
general purpose processors (so more productive with respect to GFLOPS per
Watt). FPGAs are usually used for inference, not training.
● OpenCL can be used as the development language for FPGAs (C/C++ as well),
and some ML/DL libraries use OpenCL too (for example, Caffe). So an easy
way to do low-level ML on FPGAs may appear.
● For high-level ML there are vendor tools and graph compilers (inference only).
● You can use FPGAs in the cloud!
● See also MLIR (mentioned earlier).
● The learning curve for FPGAs is still too steep :(
48. FPGA in production
There is some interesting movement towards FPGAs:
● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/
● Alibaba has FPGA F3 instances in the cloud
https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
● Yandex uses FPGAs for its own DL inference.
● Microsoft ran Project Catapult (starting in 2015), which uses clusters of FPGAs
https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/
● Microsoft Azure allows deploying pretrained models on FPGA (!).
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service
● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html
● ...
49. FPGA chips
Two main manufacturers: Intel (formerly Altera) and Xilinx.
The ‘world’s largest’ FPGA chip (Xilinx Virtex UltraScale+ VU19P)
contains 9M system logic cells (35B transistors)
https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html
Intel has a hybrid Xeon+FPGA chip
https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/
Intel has FPGA acceleration cards
https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html
More info:
https://www.intel.com/content/www/us/en/products/programmable/fpga.html
https://www.xilinx.com/products/silicon-devices/fpga.html
50. Adaptive compute acceleration platform (ACAP)
Xilinx Versal ACAP, a fully software-programmable,
heterogeneous compute platform that combines Scalar Engines,
Adaptable Engines, and Intelligent Engines.
The Intelligent Engines are an array of VLIW and SIMD
processing engines and memories, all interconnected with 100s
of terabits per second of interconnect and memory bandwidth.
These permit 5X–10X performance improvement for ML and
DSP applications.
https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html
https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
51. FPGA: Xilinx DL Tools
Xilinx ML Suite: tools to develop and deploy ML apps for real-time inference.
● xfDNN: graph compiler with optimizations and a quantizer
● xDNN: high-performance CNN processing engine
https://github.com/Xilinx/ml-suite
52. FPGA: Intel DL Tools
https://simplecore.intel.com/nervana/wp-content/uploads/sites/53/2018/05/IntelAIDC18_Macias_Mainstage_5_23_Final.pdf
54. ASIC custom chips
ASIC (application-specific integrated circuit) is an integrated circuit customized for a
particular use, rather than intended for general-purpose use.
There is a lot of movement to ASIC right now:
● Google has Tensor Processing Units (TPU) in the cloud.
● Intel has just demonstrated its Nervana processors for training and inference.
● Mobileye (Intel) chips with specially developed ASIC cores are used in BMW, Tesla,
Volvo, etc.
● Movidius (acquired by Intel) Myriad X VPU - a dedicated hardware accelerator for deep
neural network inference. https://www.movidius.com/myriadx
● Alibaba Hanguang 800
● Huawei Ascend 310, 910
● Bitmain Sophon
● ...
58. ASIC: Intel (Nervana) NNP-T
Processor for training. PODs can be built from it (e.g. a 10-rack POD with 480 NNP-Ts)
● 24 Tensor Processing Clusters (TPCs)
● PCIe Gen 4 x16 accelerator card, 300W
● OCP Accelerator Module, 375W
● 119 TOPS bfloat16
● 32 GB HBM2
https://www.intel.ai/nervana-nnp/nnpt/
https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
59. ASIC: Intel (Nervana) NNP-I
Processor for inference using mixed precision math, with a special emphasis on low-precision
computations using INT8.
● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI)
● M.2 form factor (1 chip): 12W, up to 50 TOPS.
● PCIe card (2 chips): 75W, up to 170 TOPS.
https://www.intel.ai/nervana-nnp/nnpi
https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
60. ASIC: Bitmain Sophon
Tensor Computing Processor BM1684 is a
third generation TPU.
● 1024 processing units
● 17.6 TOPS INT8, 2.2 TFLOPS FP32 (?)
Deep Learning Acceleration Card SC3:
● 1x BM1682 (2nd gen TPU), 3 TFLOPS FP32,
3 GB, PCIe 3.0 x8
● Max Power: 65W
● Caffe/TensorFlow/Pytorch/Mxnet
https://www.sophon.ai/product/introduce/bm1684.html
https://www.sophon.ai/product/introduce/sc3.html
61. ASIC: Alibaba Hanguang 800
(October 28, 2019) Announcing Hanguang 800:
Alibaba's First AI-Inference Chip
“Hanguang 800 is the world's most powerful AI inference
chip. In the Resnet-50 industry test, the peak performance
of the new chip reached a whopping 78,563 images per
second, which is four times higher than the second best AI chip
in the world. The peak efficiency of the chip also reached 500 IPS/W, which is
3.3 times higher than the second best option.”
“A Hanguang 800 chip can offer the computing power equivalent to 10 traditional
GPUs.“
https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip_595482
62. ASIC: Baidu Kunlun
(July 3, 2018) Baidu unveils Kunlun AI
chip for edge and cloud computing
“Kunlun is made to handle AI models for edge
computing on devices and in the cloud via
datacenters. The Kunlun 818-300 model will be used for training AI, and the
818-100 for inference.”
https://venturebeat.com/2018/07/03/baidu-unveils-kunlun-ai-chip-for-edge-and-cloud-computing/
“The Kunlun-powered server’s computing power is 30 times higher than
FPGA-based AI accelerators, according to Yin Shiming, Baidu vice president.”
https://technode.com/2019/08/29/baidu-unveils-kunlun-powered-cloud-server-at-waic/
63. ASIC: Huawei Ascend 910
(Aug 23, 2019) Huawei launches Ascend 910,
the world's most powerful AI processor
● 256 TFLOPS (FP16), 512 TOPS (INT8)
● 310 W of max power
● HCCS, PCIe 4.0, and RoCE v2 build scale-up and
scale-out systems both flexibly and efficiently. HCCS is Huawei's in-house
high-speed interface interconnecting Ascend 910s. On-chip RoCE
interconnects nodes directly. PCIe 4.0 doubles the throughput of the
previous generation.
● Ascend 910 2048-node cluster can deliver up to 512 PFLOPS.
https://www.huawei.com/en/press-events/news/2019/8/Huawei-Ascend-910-most-powerful-AI-processor
https://e.huawei.com/us/products/cloud-computing-dc/atlas/ascend-910
https://www.servethehome.com/huawei-ascend-910-provides-a-nvidia-ai-training-alternative/
65. ASIC: Graphcore IPU
Graphcore IPU: for both training and inference.
Allows new and emerging machine intelligence
workloads to be realized.
Colossus IPU:
● 23.6B transistors, 1216 independent processor cores, 300 MB of in-processor memory,
125 TFLOPS mixed precision
● 45TB/s memory bandwidth, 8TB/s on-chip exchange between cores
● C2 IPU Processor card: 2x Colossus, PCIe 4.0 x16 (64GB/s bidir), card-to-card
IPU-Links (2.5 TBps), 300W
https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-pcie-card-at-dell-tech-world/
https://www.graphcore.ai/products/ipu
IPU on Azure
https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
66. ASIC: Others
● Qualcomm Cloud AI 100 (inference)
https://www.qualcomm.com/news/releases/2019/04/09/qualcomm-brings-power-efficient-artificial-intelligence-inference
● ARM ML inference NPU Ethos-N77
https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor
● Intel eASIC: an intermediary technology between FPGAs and standard-cell
ASICs with lower unit-cost and faster time-to-market
https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html
● ...
67. ASIC: Summary
● Very diverse field!
● It is hard to directly compare different solutions based on their characteristics (the
architectures can be too different).
● You can use a common benchmark like https://mlperf.org/
● DL framework support is usually limited, some solutions use their own
frameworks/libraries.
69. AI at the edge
● NVIDIA Jetson TK1/TX1/TX2/Xavier/Nano
○ 192/256/256/512/128 CUDA cores
○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB memory
● Tablets, Smartphones
○ Qualcomm Snapdragon 845/855, Apple A11/12/Bionic, Huawei Kirin 970/980/990 etc
● Raspberry Pi 4 (1.5 GHz 4-core, 4 GB memory)
● Movidius Neural Compute Stick, Stick 2
● Google Edge TPU
70. “The Qualcomm Neural Processing SDK for AI, our
software-accelerated runtime for the execution of deep
neural networks, lets you program the Qualcomm AI
Engine. Together, the engine and the SDK allow you to
squeeze up to 7 TOPS of AI processing out of the
Snapdragon 855, with massive acceleration for your on-device AI applications.
The engine provides high capacity for matrix multiplication on both the Qualcomm
Hexagon Vector eXtensions (HVX) and the Hexagon Tensor Accelerator
(HTA). With enough on-device processing power to run more than 140 inferences
per second on the Inception-v3 neural network, your app could classify or detect
dozens of objects in just a few milliseconds and with high confidence.”
https://developer.qualcomm.com/blog/accelerate-your-device-ai-qualcomm-artificial-intelligence-ai-engine-snapdragon
https://www.qualcomm.com/products/snapdragon-855-mobile-platform
Mobile AI: Qualcomm SD 855 (DSP+)
71. “HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency,
stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low
consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face
recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core.
With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic
of the future.”
https://consumer.huawei.com/en/campaign/kirin-990-series/
“Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8
and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core.
There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power
efficiency above all else, and it can be used for polling or other applications where performance isn’t
particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin
990 (LTE) has one big core and one tiny core.”
https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem
Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
72. (Oct 2, 2019) Inside Apple’s A13 Bionic system-on-chip
“This year, the CPU has a new trick: A set of “machine
learning accelerators” that perform matrix multiplication
operations up to six times faster than the CPU alone.
The Neural Engine (8-core), like everything else in the chip, tops out
at 20 percent faster than before (it’s as if the designs are
relatively unchanged, and the new 7nm+ process allows for
20 percent higher clock speeds).
There’s a machine learning controller in the chip that automatically schedules
machine learning operations between the CPU, GPU, and Neural Engine so
developers don’t have to balance out the load themselves.”
https://www.macworld.com/article/3442716/inside-apples-a13-bionic-system-on-chip.html
Mobile AI: Apple (Neural Engine)
73. (Aug 7, 2019) Samsung’s first 7-nanometer EUV processor
will power the Galaxy Note 10
“The Exynos 9825 features an integrated Neural Processing
Unit (NPU) designed for the next generation of mobile
experiences from AI-powered photography to augmented reality. With fast,
efficient AI processing, the NPU brings new possibilities for on-device AI from
object recognition for optimized photos, to a suite of performance enhancing
intelligence features such as usage pattern recognition and faster app
pre-loading.”
https://www.samsung.com/semiconductor/minisite/exynos/products/mobileprocessor/exynos-9825/
Mobile AI: Samsung (NPU)
74. (Nov 26, 2019) MediaTek Announces Dimensity 1000 ARM
Chip With Integrated 5G Modem
“The Dimensity 1000 doesn’t just bring new branding; it’s also
sporting four Cortex A77 CPU cores and four Cortex A55 CPU
cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core
ISP, and a 6-core AI processor.
The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It houses
six AI processors (two big cores, three small cores and a single tiny core).
The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.”
https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem
https://i.mediatek.com/mediatek-5g
Mobile AI: MediaTek (APU)
75. AI at the Edge: Jetson Nano
Price: $99
NVIDIA Jetson Nano Developer Kit is a small, powerful
computer that lets you run multiple neural networks in parallel
for applications like image classification, object detection, segmentation,
and speech processing. All in an easy-to-use platform that runs in as little as 5
watts.
● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS
● 4 GB 64-bit LPDDR4 25.6 GB/s
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
76. Neural Compute Stick 2 (~$70)
The latest generation of Intel® VPUs includes 16
powerful processing cores (called SHAVE cores) and
a dedicated deep neural network hardware accelerator for high-performance
vision and AI inference applications—all at low power.
● Supports convolutional neural networks (CNNs)
● Support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and
PaddlePaddle via an ONNX conversion
● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU)
● Connectivity: USB 3.0 Type-A
https://software.intel.com/en-us/neural-compute-stick
AI at the Edge: Movidius
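Inference on the stick goes through OpenVINO’s Inference Engine. A rough sketch of the Python flow (exact API names vary across OpenVINO releases, and model.xml/model.bin is a hypothetical IR produced by the Model Optimizer):

```python
# Load an OpenVINO IR and run it on the Myriad X VPU ("MYRIAD" device).
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

input_name = next(iter(net.input_info))
image = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder input
result = exec_net.infer({input_name: image})
print({name: blob.shape for name, blob in result.items()})
```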
77. AI at the Edge: Google Edge TPU
The Edge TPU is a small ASIC designed by Google that provides
high performance ML inferencing for low-power devices. For
example, it can execute state-of-the-art mobile vision models such
as MobileNet V2 at 400 FPS, in a power efficient manner.
The on-board Edge TPU coprocessor is capable of performing 4 TOPS
using 0.5 watts for each TOPS (2 TOPS per watt).
TensorFlow Lite models can be compiled to run on the Edge TPU.
USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board
https://cloud.google.com/edge-tpu/
https://coral.ai/products/
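Running a model on the Edge TPU means compiling a quantized TensorFlow Lite model with the Edge TPU compiler and then loading it through the Edge TPU delegate. A sketch, assuming the tflite_runtime package and the Edge TPU runtime library are installed, and model_edgetpu.tflite is an already-compiled model:

```python
# Run an Edge TPU-compiled TFLite model via the libedgetpu delegate.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```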
78. ● Sophon Neural Network Stick (NNS)
https://www.sophon.ai/product/introduce/nns.html
● Xilinx Edge AI (FPGA!)
https://www.xilinx.com/applications/industrial/analytics-machine-learning.html
● (Nov 13, 2019) Azure Data Box Edge is a physical network appliance,
shipped by Microsoft, that sends data in and out of
Azure. Data Box Edge is additionally equipped with
AI-enabled edge computing capabilities that help
you analyze, process, and transform the on-premises
data before uploading it to the cloud.
https://azure.microsoft.com/en-us/updates/announcing-azure-data-box-edge/
● ...
AI at the Edge: Others
80. Problems
Even with FPGA/ASIC and edge devices:
● Energy efficiency:
○ Better than CPU/GPU, but still far from 20 watts used by the human brain
● Architecture:
○ Even more specialized for ML/DL computations, but...
○ Still far from brain-like computations
82. Neuromorphic chips
● Neuromorphic computing - brain-inspired computing - has emerged as a new
technology to enable information processing at very low energy cost using
electronic devices that emulate the electrical behaviour of (biological) neural
networks.
● Neuromorphic chips attempt to model in silicon the massively parallel way the
brain processes information as billions of neurons and trillions of synapses
respond to sensory inputs such as visual and auditory stimuli.
● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic
Scalable Electronics)
● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain
Project SpiNNaker and HICANN.
https://www.technologyreview.com/s/526506/neuromorphic-chips/
83. Neuromorphic chips: IBM TrueNorth
● 1M neurons, 256M synapses, 4096 neurosynaptic
cores on a chip, est. 46B synaptic ops per sec per W
● Uses 70 mW; power density is 20 milliwatts per
cm^2, almost 1/10,000th that of most modern
microprocessors
● “Our sights are now set high on the ambitious goal of
integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while
consuming ~4kW of power”.
● Currently IBM is making plans to commercialize it.
● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips
(16M neurons, 4B synapses; for context, the human brain has 86B neurons).
When running flat out, the entire cluster will consume a grand total of 2.5 watts.
http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
84. Neuromorphic chips: IBM TrueNorth
● (03.2016) IBM Research demonstrated convolutional neural nets with close to
state of the art performance:
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
85. Neuromorphic chips: Intel Loihi
● Fully asynchronous neuromorphic many core mesh that
supports a wide range of sparse, hierarchical and recurrent
neural network topologies
● Each neuromorphic core includes a learning engine that can
be programmed to adapt network parameters during
operation, supporting supervised, unsupervised,
reinforcement and other learning paradigms.
● Fabrication on Intel’s 14 nm process technology.
● A total of 130,000 neurons and 130 million synapses.
● Development and testing of several algorithms with high
algorithmic efficiency for problems including path planning,
constraint satisfaction, sparse coding, dictionary learning,
and dynamic pattern learning and adaptation.
https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/
https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/
https://ieeexplore.ieee.org/document/8259423
https://en.wikichip.org/wiki/intel/loihi
86. Neuromorphic chips: Intel Loihi
“Intel researchers have recently been testing the Loihi chip by
training it on tasks such as recognizing a small set of objects
within seconds. The company has not yet pushed the capabilities
of the neuromorphic chip to its limit, Mayberry [Michael Mayberry,
corporate vice president and managing director of Intel Labs]
says. Still, he anticipates neuromorphic computing products
potentially hitting the market within 2 to 4 years, if customers
can run their applications on the Loihi chip without requiring
additional hardware modifications.”
“Neither quantum nor neuromorphic computing are going to
replace general purpose computing,” Mayberry says. “But
they can enhance it.”
https://spectrum.ieee.org/tech-talk/computing/hardware/intels-49qubit-chip-aims-for-quantum-supremacy
87. Neuromorphic chips: Intel Loihi
“Using Intel's Loihi neuromorphic research chip and
ABR's Nengo Deep Learning toolkit, we analyze the
inference speed, dynamic power consumption, and
energy cost per inference of a two-layer neural
network keyword spotter trained to recognize a single
phrase. We perform comparative analyses of this
keyword spotter running on more conventional
hardware devices including a CPU, a GPU, Nvidia's
Jetson TX1, and the Movidius Neural Compute
Stick.”
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware
https://arxiv.org/abs/1812.01739
88. Neuromorphic chips: Intel Pohoiki Beach
(Jul 15, 2019) “Intel announced that an 8 million-neuron
neuromorphic system comprising 64 Loihi research chips
— codenamed Pohoiki Beach — is now available to the broader
research community. With Pohoiki Beach, researchers can
experiment with Intel’s brain-inspired research chip, Loihi, which
applies the principles found in biological brains to computer
architectures. ”
https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
89. Neuromorphic chips: Tianjic
Tianjic’s unified function core (FCore) combines the essential
building blocks of both artificial neural networks and biologically
inspired networks: axon, synapse, dendrite and soma blocks. The 28-nm
chip consists of 156 FCores, containing approximately 40,000
neurons and 10 million synapses in an area of 3.8×3.8 mm².
Tianjic delivers an internal memory bandwidth of more than 610 GB
per second, and a peak performance of 1.28 TOPS per watt for
running artificial neural networks. In the biologically-inspired spiking
neural network mode, Tianjic achieves a peak performance of about
650 giga synaptic operations per second (GSOPS) per watt.
https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113
https://www.nature.com/articles/s41586-019-1424-8
90. Neuromorphic chips: Others
● SpiNNaker (1,036,800 ARM9 cores)
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/
● SpiNNaker-2
https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf
https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain
Simulation and Machine Learning”
● BrainScaleS, HICANN: 20x 8-inch silicon wafers, each incorporating 50×10⁶
plastic synapses and 200,000 biologically realistic neurons.
https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/
● Akida NSoC: 1.2 million neurons and 10 billion synapses
https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip
https://en.wikichip.org/wiki/brainchip/akida
● Neurogrid: can model a slab of cortex with up to 16x256x256
neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
93. Other approaches
● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html
● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/
● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/
● Unconventional computing: cellular automata, reservoir computing, using
biological cells/neurons, chemical computation, membrane computing, slime
mold computing and much more https://www.springer.com/gp/book/9781493968824
● ...
94. References:
Hardware for Deep Learning series of posts:
https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc
● Part 1: Introduction and Executive summary
● Part 2: CPU
● Part 3: GPU
● Part 4: FPGA
● Part 5: ASIC
● Part 6: Mobile AI
● Part 7: Neuromorphic computing
● Part 8: Quantum computing