Deep Learning:
Hardware Landscape
Grigory Sapunov
YaTalks/30.11.2019
gs@inten.to
Executive Summary :)
DL requires a lot of computations:
● Currently GPUs (mostly NVIDIA) are the most popular choice
● The only alternative right now is Google TPU gen3 (ASIC, cloud).
● More FPGAs/ASICs are coming into this field (Alibaba, Bitmain Sophon, Intel
Nervana?). The situation resembles the path Bitcoin mining took.
● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic, memristors, etc)
● Quantum computing can benefit machine learning as well (but it probably won't be
a desktop or in-house server solution)
CPU
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
https://arxiv.org/abs/1911.05289
Typically multi-core even on the desktop market:
● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs
● up to 18 cores/36 threads in high-end Intel CPUs
(i9–7980XE/9980XE/10980XE) [https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors]
● up to 32 cores/64 threads in AMD Ryzen Threadripper
(seems to be the same for the 3rd gen
https://www.anandtech.com/show/14994/first-details-about-3rd-generation-ryzen-threadripper-32-cores-280-w)
x86: Desktops
On the server market:
● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 Processor)
● AMD EPYC: up to 64 cores/128 threads (EPYC 7742)
● usually having more cores than desktop processors and some other useful
capabilities (supporting more RAM, multi-processor configurations, ECC, etc)
x86: Servers
Intel x86 manycore (up to 72 cores with up to 288 threads) processors, supporting
AVX-512 instructions.
Seems to be dead now: https://www.extremetech.com/extreme/290963-intel-quietly-kills-off-xeon-phi
Waiting for Intel’s Xe GPU:
https://wccftech.com/intel-ponte-vecchio-xe-hpc-gpu-detailed-1000-eus-hbm2-rambo-cache-clx/
x86: Xeon Phi
AVX-512: Fused Multiply-Add (FMA) core instructions enabling lower-precision
operations. List of CPUs with AVX-512 support:
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
VNNI (Vector Neural Network Instructions): multiply-and-add for integers, etc.,
designed to accelerate convolutional neural network-based algorithms.
https://en.wikichip.org/wiki/x86/avx512vnni
DL Boost: AVX512-VNNI + Brain floating-point format (bfloat16)
designed for inference acceleration.
https://en.wikichip.org/wiki/brain_floating-point_format
x86: ML instructions (SIMD)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
● BigDL: distributed deep learning library for Apache Spark
https://github.com/intel-analytics/BigDL
● Deep Neural Network Library (DNNL): an open-source performance library
for deep learning applications. Layer primitives, etc.
https://intel.github.io/mkl-dnn/
● PlaidML: advanced and portable tensor compiler for enabling deep learning
on laptops, embedded devices, and other devices (see the usage sketch after this slide).
Supports Keras, ONNX, and nGraph.
https://github.com/plaidml/plaidml
● OpenVINO Toolkit: for computer vision
https://docs.openvinotoolkit.org/
x86: Optimized ML Libraries
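As an illustration of the PlaidML item above, here is a minimal sketch of using PlaidML as the Keras backend. It assumes `plaidml-keras` is installed and `plaidml-setup` has been run; the tiny model is only a placeholder.

```python
# Minimal sketch: PlaidML as the Keras backend (assumes `pip install plaidml-keras`
# and `plaidml-setup` have already been run; the model below is only a placeholder).
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"  # must be set before importing keras

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()  # ops now run on whatever device plaidml-setup selected (CPU, iGPU, ...)
```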
Some CPU-optimized DL libraries:
● Caffe Con Troll (research project, latest commit in 2016)
https://github.com/HazyResearch/CaffeConTroll
● Intel Caffe (optimized for Xeon):
https://github.com/intel/caffe
● Intel DL Boost can be used in many popular frameworks:
TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe
https://www.intel.ai/increasing-ai-performance-intel-dlboost/
x86: Optimized ML Libraries
● nGraph: open source C++ library, compiler and runtime for Deep Learning.
Frameworks using nGraph Compiler stack to execute workloads have shown
up to 45X performance boost when compared to native framework
implementations. https://www.ngraph.ai/
Graph compilers
Graph compilers: wider view
https://medium.com/tensorflow/mlir-a-new-intermediate-representation-and-compiler-framework-beba999ed18d
Graph compilers: watch for MLIR!
https://www.tensorflow.org/mlir/overview
You need to transfer data between CPU host memory and GPU memory.
In most x86 systems this is done over the PCI Express (PCIe) bus.
x86: PCIe
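To make the host-to-device traffic above concrete, here is a hedged PyTorch sketch (assuming a CUDA GPU is present) showing an explicit copy over PCIe, and how pinned host memory enables asynchronous transfers that frameworks use to hide PCIe latency.

```python
# Minimal PyTorch sketch (assumes a CUDA GPU): an explicit host-to-device copy over
# PCIe, and a pinned (page-locked) buffer that allows an asynchronous transfer.
import torch

x_cpu = torch.randn(1024, 1024)       # ordinary pageable host memory
x_pinned = x_cpu.pin_memory()         # page-locked host memory

# Synchronous copy from pageable memory: the CPU waits for the PCIe transfer.
x_gpu = x_cpu.to("cuda")

# Asynchronous copy from pinned memory: returns immediately, the DMA engine
# streams data over PCIe while other work can proceed.
y_gpu = x_pinned.to("cuda", non_blocking=True)

torch.cuda.synchronize()              # wait for outstanding transfers
print(x_gpu.device, y_gpu.device)
```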
A typical GPU card works at full speed in x16 mode, but may also work in x8 or x4
mode at lower speed.
PCIe v.3 allows for 985 MB/s per lane, so 15.75 GB/s for x16 links.
PCIe v.4 is twice as fast, so 31.51 GB/s for x16 (supported in the AMD X570 chipset
and Radeon cards).
PCIe v.5 is twice as fast again (spec is released, no products expected before
2020), so 63 GB/s for x16.
x86: PCIe bandwidth
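A back-of-the-envelope sketch reproducing the link speeds above from the ~985 MB/s-per-lane figure for PCIe v.3, with each later generation roughly doubling it:

```python
# Back-of-the-envelope check of the PCIe numbers above: ~985 MB/s per lane for
# PCIe 3.0, doubling with each generation, multiplied by the link width.
PCIE3_PER_LANE_GBS = 0.985  # GB/s per lane (x1), PCIe 3.0

def pcie_bandwidth(gen: int, lanes: int) -> float:
    """Approximate one-directional bandwidth in GB/s for a PCIe link."""
    return PCIE3_PER_LANE_GBS * (2 ** (gen - 3)) * lanes

for gen, lanes in [(3, 8), (3, 16), (4, 16), (5, 16)]:
    print(f"PCIe v.{gen} x{lanes}: {pcie_bandwidth(gen, lanes):.2f} GB/s")
# -> ~7.9 GB/s (v.3 x8), ~15.8 GB/s (v.3 x16), ~31.5 GB/s (v.4 x16), ~63 GB/s (v.5 x16)
```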
A typical Intel mainstream desktop processor has 16 PCIe lanes (e.g. the i7-7700K,
i7-8700K or even the i9-9900K).
High-end desktop (HEDT) processors have 28 to 44 lanes (e.g. the i7-7820X has 28,
the rather old i7-6850K has 40, the i9-9980XE has 44; the upcoming i9-10940X and
higher will have 48).
Xeons have up to 64 lanes (PCIe v.3).
AMD Ryzen Threadripper has 64 PCIe lanes, EPYC has 128 lanes (PCIe v.4).
Be careful: Intel sometimes quotes "platform PCIe lanes", which are CPU+PCH lanes,
but the PCH lanes share a single uplink to the CPU!
https://www.anandtech.com/show/11839/intel-core-i9-7980xe-and-core-i9-7960x-review/4
Check specs at https://ark.intel.com/
x86: PCIe bandwidth / CPU side
With a CPU that has few PCIe lanes you can't use two GPUs at their highest
speed (x16). In some cases you can't even use a single GPU at x16.
But in reality the difference between v.3/v.4 or x8/x16 can be insignificant;
bottlenecks may exist in other places.
x86: PCIe bandwidth / CPU side
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
Memory speed can also be important (+multi-channel mode).
DDR4 data transfer rates (PCIe v.3/v.4 shown for comparison; see the calculation
sketch after this slide):
● PCIe v.3 x4: 3.94 GB/s
● PCIe v.3 x8: 7.88 GB/s
● DDR4-1600: 12.8 GB/s
● DDR4-1866: 14.93 GB/s
● PCIe v.3 x16: 15.75 GB/s
● PCIe v.4 x8: 15.75 GB/s
● DDR4-2133: 17 GB/s
● DDR4-2400: 19.2 GB/s
● DDR4-2666: 21.3 GB/s
● DDR4-3200: 25.6 GB/s
● PCIe v.4 x16: 31.51 GB/s
x86: Memory
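A quick sketch showing where the DDR4 numbers above come from: transfer rate (MT/s) times 8 bytes per transfer for a single 64-bit channel; multi-channel configurations scale accordingly.

```python
# The DDR4 numbers above follow from: transfer rate (MT/s) x 8 bytes per transfer
# (one 64-bit channel). Multi-channel setups multiply this accordingly.
def ddr4_bandwidth_gbs(rate_mt_s: int, channels: int = 1) -> float:
    return rate_mt_s * 8 * channels / 1000.0  # GB/s

for rate in [1600, 1866, 2133, 2400, 2666, 3200]:
    single = ddr4_bandwidth_gbs(rate)
    dual = ddr4_bandwidth_gbs(rate, channels=2)
    print(f"DDR4-{rate}: {single:.1f} GB/s (dual-channel: {dual:.1f} GB/s)")
# DDR4-1600 -> 12.8 GB/s, DDR4-3200 -> 25.6 GB/s, matching the list above.
```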
● Single-board computers: Raspberry Pi, the CPU part of the Jetson
Nano, and the Google Coral Dev Board.
● Mobile: Qualcomm, Apple A11, etc
● Server: Marvell ThunderX, Ampere eMAG,
Amazon A1 instance, etc; NVIDIA announced
GPU-accelerated Arm-based servers.
● Laptops: Microsoft Surface Pro X
● ARM also has ML/AI Ethos NPU and Mali GPU
ARM
● ARM announced the Neoverse N1 platform (scales up to 128 cores)
https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html
● Qualcomm manufactured an ARM server processor for cloud applications called Centriq 2400 (48 single-thread cores,
2.2GHz). The project has since been stopped.
https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html
● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag
● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz)
https://www.marvell.com/server-processors/thunderx-arm-processors/
Supports NVIDIA GPU:
https://www.electronicsweekly.com/news/nvidia-cuda-x-ai-hpc-software-stack-marvell-thunderx-platforms-2019-11/
● Amazon Graviton ARM processor (16 cores, 2.3GHz)
https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400
https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/
● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
ARM: Servers
Current architecture is POWER9:
● 12 cores x 8 threads or 24 cores x 4 threads (96 threads).
● PCIe v.4, 48 PCIe lanes
● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection
● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC)
IBM POWER
IBM POWER9 + NVLink 2.0
The current fastest supercomputer in the world, Summit, is based on POWER9,
while also using Nvidia's Volta GPUs as accelerators.
POWER10 is expected in 2020-2021:
● 48 cores
● PCIe v.5
● NVLink 3.0
● OpenCAPI 4.0
● ...
IBM POWER
An open-source hardware instruction set architecture.
Examples:
● SiFive U5, U7 and U8 cores
https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
● Alibaba's RISC-V processor design, the Xuantie 910 (XT 910):
12nm, 64-bit, 16 cores clocked at up to 2.5GHz, the fastest RISC-V
processor to date
https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/
● Western Digital SweRV Core designed for embedded devices
supporting data-intensive edge applications.
https://www.westerndigital.com/company/innovations/risc-v
● Esperanto Technologies is building an AI chip with 1k+ cores
https://www.esperanto.ai/technology/
RISC-V
● KiloCore project with 1000 independent programmable processors
https://www.ucdavis.edu/news/worlds-first-1000-processor-chip
● Cerebras Systems Wafer Scale Engine (WSE), an AI chip that measures
8.46x8.46 inches, making it almost the size of an iPad and more than 50
times larger than a CPU or GPU.
The WSE chip has 1.2 trillion transistors, 400,000 computing cores and 18
gigabytes of on-chip memory.
https://www.nextplatform.com/2019/08/21/machine-learning-chip-breaks-new-ground-with-waferscale-integration/
Others
GPU
NVIDIA slides: http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
… → Kepler → Maxwell → Pascal → Volta → Turing → Ampere → ...
NVIDIA Architectures
● Peak performance (GFLOPS) at FP32/16/...
● #Cores (+Tensor Cores)
● Memory size
● Memory speed/bandwidth
● Precision support
● Can connect using NVLink?
● Power usage (Watts)
● Price
● GFLOPS/USD
● GFLOPS/Watt
● Form factor (for desktop or server?)
● ECC memory
● Legal restrictions (e.g. GeForce cards are not allowed to be used in datacenters)
Important dimensions
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
● FP64 (64-bit float), not used for DL
● FP32 — the most commonly used for training
● FP16 or mixed precision (FP32+FP16): becoming the new default (see the training sketch after this slide)
● INT8 — usually for inference
● INT4, INT1 — experimental modes for inference
Precision
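As an illustration of mixed-precision (FP32+FP16) training on NVIDIA GPUs, here is a minimal sketch using NVIDIA's Apex AMP, the common tool at the time of this talk (Apex is not mentioned above, so treat it as one possible option). It assumes `apex` is installed and a Tensor Core GPU (Volta/Turing) is available; the model and data are placeholders.

```python
# Minimal mixed-precision training sketch with NVIDIA Apex AMP. Assumes `apex` is
# installed and a Tensor Core GPU is available; model and data are placeholders.
import torch
from apex import amp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "O1" patches selected ops to FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

loss = torch.nn.functional.mse_loss(model(x), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling for FP16
    scaled_loss.backward()
optimizer.step()
```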
bfloat16 isn’t supported on GPU (but is supported on TPU gen3, and will be
supported on AMD GPU and Intel CPU/NNP).
Precision: bfloat16
https://www.nextplatform.com/2019/07/15/intel-prepares-to-graft-googles-bfloat16-onto-processors/
https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware
Precision: many caveats
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
Not only FLOPS: Roofline Performance Model
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
Roofline Performance Model: Example
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
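The roofline model caps attainable performance at min(peak compute, memory bandwidth x arithmetic intensity). A small sketch with purely illustrative numbers (not the specs of any particular card):

```python
# Roofline model sketch: attainable FLOP/s is limited either by peak compute or by
# memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# The numbers below are illustrative, not the specs of any particular GPU.
def roofline(peak_tflops: float, mem_bw_gbs: float, intensity_flop_per_byte: float) -> float:
    """Attainable performance in TFLOPS for a kernel of given arithmetic intensity."""
    memory_bound_tflops = mem_bw_gbs * intensity_flop_per_byte / 1000.0
    return min(peak_tflops, memory_bound_tflops)

PEAK_TFLOPS = 14.0   # hypothetical FP32 peak
MEM_BW_GBS = 900.0   # hypothetical memory bandwidth

for ai in [1, 4, 16, 64]:  # FLOPs per byte
    print(f"AI={ai:>3} FLOP/byte -> {roofline(PEAK_TFLOPS, MEM_BW_GBS, ai):.2f} TFLOPS")
# Low-intensity kernels (elementwise ops) end up memory-bound; large matmuls are compute-bound.
```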
Separate cards can be joined using NVLink; SLI is not relevant for DL, it's for
graphics.
NVSwitch: The Fully Connected NVLink
NCCL 1: multi-GPU collective communication primitives library
NVIDIA: Single-machine Multi-GPU
Distributed training is now a commodity (but scaling is sublinear).
NCCL 2: multi-node collective communication primitives library
NVIDIA: Distributed Multi-GPU
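A minimal sketch of distributed data-parallel training on top of NCCL, using PyTorch's DistributedDataParallel with the NCCL backend. It assumes one process per GPU started by a launcher (e.g. torch.distributed.launch) that sets the rendezvous environment variables; the tiny model and batch are placeholders.

```python
# Minimal sketch: multi-GPU / multi-node data-parallel training on top of NCCL in
# PyTorch. Assumes one process per GPU, started by a launcher that sets
# MASTER_ADDR/RANK/WORLD_SIZE and a local rank.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # NCCL handles the all-reduces
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # or parse --local_rank from the launcher
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024).cuda(local_rank)         # placeholder batch
y = torch.randint(0, 10, (32,)).cuda(local_rank)

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                    # gradients averaged across GPUs via NCCL
optimizer.step()
```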
● AMD has powerful GPUs but they are
mostly unsupported in DL frameworks
● Intel has its own GPUs on the processor
(HD Graphics)
● Some Intel CPUs were equipped with AMD
GPUs (Kaby Lake-G, say, i7-8809G)
● Intel plans to release its first discrete GPU
in 2020 (Xe architecture)
GPU: AMD, Intel
Is everything OK?
Problems
Serious problems with the current processors (CPU/GPU) are:
● Energy efficiency:
○ The version of AlphaGo playing against Lee Sedol used 1,920 CPUs and
280 GPUs (https://en.wikipedia.org/wiki/AlphaGo)
○ The estimated power consumption was approximately 1 MW (200 W per
CPU and 200 W per GPU), compared to only 20 watts used by the human
brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/)
● Architecture:
○ good for matrix multiplication (still the essence of DL)
○ but not well suited for brain-like computations
FPGA
FPGA
● FPGA (field-programmable gate array) is an integrated circuit designed to be
configured by a customer or a designer after manufacturing
● Both FPGAs and ASICs (see later) are usually much more energy-efficient than
general purpose processors (so more productive with respect to GFLOPS per
Watt). FPGAs are usually used for inference, not training.
● OpenCL can be used as the development language for FPGAs (C/C++ as well),
and some ML/DL libraries support OpenCL too (for example, Caffe). So an easy
way to do low-level ML on FPGAs may emerge (see the host-side sketch after this list).
● For high-level ML there are vendor tools and graph compilers (inference only).
● Can use FPGA in the cloud!
● See also MLIR (mentioned earlier).
● The learning curve for FPGAs is still too steep :(
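For the OpenCL point above, a minimal host-side sketch via pyopencl (vector add). On an FPGA the kernel would normally be compiled offline into a bitstream with the vendor toolchain, but the host-code structure is similar. Assumes pyopencl is installed and an OpenCL platform/device is visible.

```python
# Minimal OpenCL host-side sketch with pyopencl (vector add). On an FPGA the kernel
# is normally compiled offline with the vendor toolchain, but the host flow is similar.
import numpy as np
import pyopencl as cl

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program = cl.Program(ctx, """
__kernel void vadd(__global const float *a, __global const float *b, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

program.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)
print(np.allclose(out, a + b))
```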
FPGA in production
There is some interesting movement to FPGA:
● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/
● Alibaba has FPGA F3 instances in the cloud
https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
● Yandex uses FPGAs for its own DL inference.
● Microsoft ran Project Catapult (in 2015), which uses clusters of FPGAs
https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/
● Microsoft Azure allows deploying pretrained models on FPGA (!).
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service
● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html
● ...
FPGA chips
Two main manufacturers: Intel (formerly Altera) and Xilinx.
The ‘world’s largest’ FPGA chip (Xilinx Virtex UltraScale+ VU19P)
contains 9M system logic cells (35B transistors)
https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html
Intel has a hybrid Xeon+FPGA chip
https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/
Intel has FPGA acceleration cards
https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html
More info:
https://www.intel.com/content/www/us/en/products/programmable/fpga.html
https://www.xilinx.com/products/silicon-devices/fpga.html
Adaptive compute acceleration platform (ACAP)
Xilinx Versal ACAP, a fully software-programmable,
heterogeneous compute platform that combines Scalar Engines,
Adaptable Engines, and Intelligent Engines.
The Intelligent Engines are an array of VLIW and SIMD
processing engines and memories, all interconnected with 100s
of terabits per second of interconnect and memory bandwidth.
These permit 5X–10X performance improvement for ML and
DSP applications.
https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html
https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
FPGA: Xilinx DL Tools
Xilinx ML Suite: tools to develop and deploy ML apps for real-time inference.
● xfDNN: graph compiler with optimization and a quantizer
● xDNN: high-performance CNN processing engine.
https://github.com/Xilinx/ml-suite
FPGA: Intel DL Tools
https://simplecore.intel.com/nervana/wp-content/uploads/sites/53/2018/05/IntelAIDC18_Macias_Mainstage_5_23_Final.pdf
ASIC
ASIC custom chips
ASIC (application-specific integrated circuit) is an integrated circuit customized for a
particular use, rather than intended for general-purpose use.
There is a lot of movement to ASIC right now:
● Google has Tensor Processing Units (TPU) in the cloud.
● Intel just demonstrated their Nervana processors for training and inference.
● Mobileye (Intel) chips with specially developed ASIC cores are used in BMW, Tesla,
Volvo, etc.
● Movidius (acquired by Intel) Myriad X VPU, a dedicated hardware accelerator for deep
neural network inference. https://www.movidius.com/myriadx
● Alibaba Hanguang 800
● Huawei Ascend 310, 910
● Bitmain Sophon
● ...
Case: AlphaGo Zero
https://deepmind.com/blog/alphago-zero-learning-scratch/
ASIC: Google TPU
TPU v2
● 180 TFLOPS (bfloat16)
● 64 GB HBM
● $4.50 / TPU hour
https://cloud.google.com/tpu/
https://cloud.google.com/tpu/docs/tpus
https://cloud.google.com/tpu/docs/system-architecture
TPU v3
● 420 TFLOPS (bfloat16)
● 128 GB HBM
● $8.00 / TPU hour
A “TPU v3 pod”: 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network (see the usage sketch below)
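A minimal sketch of using a Cloud TPU from TensorFlow 2.x via TPUStrategy. The TPU name "my-tpu" and the model are placeholders, and exact API details vary slightly between TF releases.

```python
# Sketch of using a Cloud TPU from TensorFlow 2.x via TPUStrategy. "my-tpu" is a
# placeholder for your Cloud TPU name; the tiny model is only a placeholder.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():                   # variables are placed on the TPU cores
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(train_dataset, epochs=3)     # train_dataset: a tf.data.Dataset
```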
ASIC: Intel (Nervana) NNP-T
Processor for training. Can build PODs (say 10-rack POD with 480 NNP-T)
● 24 Tensor Processing Cluster (TPC)
● PCIe Gen 4 x16 accelerator card, 300W
● OCP Accelerator Module, 375W
● 119 TOPS bfloat16
● 32 GB HBM2
https://www.intel.ai/nervana-nnp/nnpt/
https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
ASIC: Intel (Nervana) NNP-I
Processor for inference using mixed precision math, with a special emphasis on low-precision
computations using INT8.
● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI)
● M.2 form factor (1 chip): 12W, up to 50 TOPS.
● PCIe card (2 chips): 75W, up to 170 TOPS.
https://www.intel.ai/nervana-nnp/nnpi
https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
ASIC: Bitmain Sophon
Tensor Computing Processor BM1684 is a
third generation TPU.
● 1024 processing units
● 17.6 TOPS INT8, 2.2 TFLOPS FP32 (?)
Deep Learning Acceleration Card SC3:
● 1x BM1682 (2nd gen TPU), 3 TFLOPS FP32,
3 GB, PCIe 3.0 x8
● Max Power: 65W
● Caffe/TensorFlow/PyTorch/MXNet
https://www.sophon.ai/product/introduce/bm1684.html
https://www.sophon.ai/product/introduce/sc3.html
ASIC: Alibaba Hanguang 800
(October 28, 2019) Announcing Hanguang 800:
Alibaba's First AI-Inference Chip
“Hanguang 800 is the world's most powerful AI inference
chip. In the Resnet-50 industry test, the peak performance
of the new chip reached a whopping 78,563 images per
second, which is four times higher than the second best AI chip
in the world. The peak efficiency of the chip also reached 500 IPS/W, which is
3.3 times higher than the second best option.”
“A Hanguang 800 chip can offer the computing power equivalent to 10 traditional
GPUs.“
https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip_595482
ASIC: Baidu Kunlun
(July 3, 2018) Baidu unveils Kunlun AI
chip for edge and cloud computing
“Kunlun is made to handle AI models for edge
computing on devices and in the cloud via
datacenters. The Kunlun 818-300 model will be used for training AI, and the
818-100 for inference.”
https://venturebeat.com/2018/07/03/baidu-unveils-kunlun-ai-chip-for-edge-and-cloud-computing/
“The Kunlun-powered server’s computing power is 30 times higher than
FPGA-based AI accelerators, according to Yin Shiming, Baidu vice president.”
https://technode.com/2019/08/29/baidu-unveils-kunlun-powered-cloud-server-at-waic/
ASIC: Huawei Ascend 910
(Aug 23, 2019) Huawei launches Ascend 910,
the world's most powerful AI processor
● 256 TFLOPS (FP16), 512 TOPS (INT8)
● 310 W of max power
● HCCS, PCIe 4.0, and RoCE v2 build scale-up and
scale-out systems both flexibly and efficiently. HCCS is Huawei's in-house
high-speed interface interconnecting Ascend 910s. On-chip RoCE
interconnects nodes directly. PCIe 4.0 doubles the throughput of the
previous generation.
● Ascend 910 2048-node cluster can deliver up to 512 PFLOPS.
https://www.huawei.com/en/press-events/news/2019/8/Huawei-Ascend-910-most-powerful-AI-processor
https://e.huawei.com/us/products/cloud-computing-dc/atlas/ascend-910
https://www.servethehome.com/huawei-ascend-910-provides-a-nvidia-ai-training-alternative/
ASIC: Habana
Gaudi: training chip. Designed to scale well.
● PCIe 4.0 x16, 32 GB HBM2, 1Tb/s, ECC, RDMA
● 200-300W
● FP32, BF16, INT/UINT 32, 16, 8
https://habana.ai/training/
https://habana.ai/wp-content/uploads/2019/06/Gaudi-Datasheet.pdf
Goya: inference chip
● PCIe 4.0 x16, 4/8/16 GB DDR4, ECC, 200W
● FP32, INT/UINT 32, 16, 8
https://habana.ai/inference/
https://habana.ai/wp-content/uploads/2019/06/Goya-Datasheet-HL-10x.pdf
https://www.nextplatform.com/2019/08/26/habana-takes-training-and-inference-down-different-paths/
ASIC: Graphcore IPU
Graphcore IPU: for both training and inference.
Allows new and emerging machine intelligence
workloads to be realized.
Colossus IPU:
● 23.6B transistors, 1216 independent processor cores, 300MB in-processor memory,
125 TFLOPS mixed precision
● 45TB/s memory bandwidth, 8TB/s on-chip exchange between cores
● C2 IPU Processor card: 2x Colossus, PCIe 4.0 x16 (64GB/s bidir), card-to-card
IPU-Links (2.5 TBps), 300W
https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-pcie-card-at-dell-tech-world/
https://www.graphcore.ai/products/ipu
IPU on Azure
https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
ASIC: Others
● Qualcomm Cloud AI 100 (inference)
https://www.qualcomm.com/news/releases/2019/04/09/qualcomm-brings-power-efficient-artificial-intellig
ence-inference
● ARM ML inference NPU Ethos-N77
https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor
● Intel eASIC: an intermediary technology between FPGAs and standard-cell
ASICs with lower unit-cost and faster time-to-market
https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html
● ...
ASIC: Summary
● Very diverse field!
● Hard to directly compare different solutions based on their characteristics (the
architectures can be too different).
● You can use a common benchmark like https://mlperf.org/
● DL framework support is usually limited; some solutions use their own
frameworks/libraries.
Mobile and Edge
AI at the edge
● NVidia Jetson TK1/TX1/TX2/Xavier/Nano
○ 192/256/256/512/128 CUDA Cores
○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB memory
● Tablets, Smartphones
○ Qualcomm Snapdragon 845/855, Apple A11/12/Bionic, Huawei Kirin 970/980/990 etc
● Raspberry Pi 4 (1.5 GHz 4-core, 4 GB memory)
● Movidius Neural Compute Stick, Stick 2
● Google Edge TPU
“The Qualcomm Neural Processing SDK for AI, our
software-accelerated runtime for the execution of deep
neural networks, lets you program the Qualcomm AI
Engine. Together, the engine and the SDK allow you to
squeeze up to 7 TOPS of AI processing out of the
Snapdragon 855, with massive acceleration for your on-device AI applications.
The engine provides high capacity for matrix multiplication on both the Qualcomm
Hexagon Vector eXtensions (HVX) and the Hexagon Tensor Accelerator
(HTA). With enough on-device processing power to run more than 140 inferences
per second on the Inception-v3 neural network, your app could classify or detect
dozens of objects in just a few milliseconds and with high confidence.”
https://developer.qualcomm.com/blog/accelerate-your-device-ai-qualcomm-artificial-intelligence-ai-engine-
snapdragon
https://www.qualcomm.com/products/snapdragon-855-mobile-platform
Mobile AI: Qualcomm SD 855 (DSP+)
“HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency,
stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low
consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face
recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core.
With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic
of the future.”
https://consumer.huawei.com/en/campaign/kirin-990-series/
“Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8
and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core.
There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power
efficiency above all else, and it can be used for polling or other applications where performance isn’t
particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin
990 (LTE) has one big core and one tiny core.”
https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem
Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
(Oct 2, 2019) Inside Apple’s A13 Bionic system-on-chip
“This year, the CPU has a new trick: A set of “machine
learning accelerators” that perform matrix multiplication
operations up to six times faster than the CPU alone.
The Neural Engine (8-core), like everything else in the chip, tops out
at 20 percent faster than before (it’s as if the designs are
relatively unchanged, and the new 7nm+ process allows for
20 percent higher clock speeds).
There’s a machine learning controller in the chip that automatically schedules
machine learning operations between the CPU, GPU, and Neural Engine so
developers don’t have to balance out the load themselves.”
https://www.macworld.com/article/3442716/inside-apples-a13-bionic-system-on-chip.html
Mobile AI: Apple (Neural Engine)
(Aug 7, 2019) Samsung’s first 7-nanometer EUV processor
will power the Galaxy Note 10
“The Exynos 9825 features an integrated Neural Processing
Unit (NPU) designed for the next generation of mobile
experiences from AI-powered photography to augmented reality. With fast,
efficient AI processing, the NPU brings new possibilities for on-device AI from
object recognition for optimized photos, to a suite of performance enhancing
intelligence features such as usage pattern recognition and faster app
pre-loading.”
https://www.samsung.com/semiconductor/minisite/exynos/products/mobileprocessor/exynos-9825/
Mobile AI: Samsung (NPU)
(Nov 26, 2019) MediaTek Announces Dimensity 1000 ARM
Chip With Integrated 5G Modem
“The Dimensity 1000 doesn’t just bring new branding; it’s also
sporting four Cortex A77 CPU cores and four Cortex A55 CPU
cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core
ISP, and a 6-core AI processor.
The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It houses
six AI processors (two big cores, three small cores and a single tiny core).
The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.”
https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integr
ated-5g-modem
https://i.mediatek.com/mediatek-5g
Mobile AI: MediaTek (APU)
AI at the Edge: Jetson Nano
Price: $99
NVIDIA Jetson Nano Developer Kit is a small, powerful
computer that lets you run multiple neural networks in parallel
for applications like image classification, object detection, segmentation,
and speech processing. All in an easy-to-use platform that runs in as little as 5
watts.
● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS
● 4 GB 64-bit LPDDR4 25.6 GB/s
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
Neural Compute Stick 2 (~$70)
The latest generation of Intel® VPUs includes 16
powerful processing cores (called SHAVE cores) and
a dedicated deep neural network hardware accelerator for high-performance
vision and AI inference applications—all at low power.
● Supports Convolutional Neural Networks (CNNs)
● Framework support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and
PaddlePaddle via an ONNX conversion (see the inference sketch after this slide)
● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU)
● Connectivity: USB 3.0 Type-A
https://software.intel.com/en-us/neural-compute-stick
AI at the Edge: Movidius
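A hedged sketch of running a model on the Neural Compute Stick 2 through the OpenVINO Inference Engine Python API (the MYRIAD device). The model must first be converted to IR with the Model Optimizer; the file names and input shape below are placeholders, and attribute names differ slightly between OpenVINO releases.

```python
# Hedged sketch: OpenVINO Inference Engine + Neural Compute Stick 2 (MYRIAD device).
# "model.xml"/"model.bin" are placeholders produced by the Model Optimizer; attribute
# names (input_info vs inputs) differ slightly between OpenVINO releases.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")   # run on the stick

input_name = next(iter(net.input_info))    # single input blob; `net.inputs` in older releases
output_name = next(iter(net.outputs))

image = np.random.rand(1, 3, 224, 224).astype(np.float32)       # placeholder NCHW input
result = exec_net.infer({input_name: image})
print(result[output_name].shape)
```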
AI at the Edge: Google Edge TPU
The Edge TPU is a small ASIC designed by Google that provides
high performance ML inferencing for low-power devices. For
example, it can execute state-of-the-art mobile vision models such
as MobileNet V2 at 400 FPS, in a power-efficient manner.
The on-board Edge TPU coprocessor is capable of performing 4 TOPS
using 0.5 watts for each TOPS (2 TOPS per watt).
TensorFlow Lite models can be compiled to run on the Edge TPU (see the sketch below).
USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board
https://cloud.google.com/edge-tpu/
https://coral.ai/products/
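A minimal sketch of Edge TPU inference with the TFLite runtime and the Edge TPU delegate, following Coral's documented flow. The model path is a placeholder (a quantized model compiled with the edgetpu_compiler), and "libedgetpu.so.1" is the Linux delegate library name.

```python
# Minimal sketch: TFLite runtime + Edge TPU delegate (Coral's documented flow).
# "model_edgetpu.tflite" is a placeholder compiled with edgetpu_compiler;
# "libedgetpu.so.1" is the Linux delegate library name.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Edge TPU models are quantized, so the input is typically uint8.
dummy_input = np.random.randint(0, 256, size=input_details[0]["shape"], dtype=np.uint8)
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```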
● Sophon Neural Network Stick (NNS)
https://www.sophon.ai/product/introduce/nns.html
● Xilinx Edge AI (FPGA!)
https://www.xilinx.com/applications/industrial/analytics-machine-learning.html
● (Nov 13, 2019) Azure Data Box Edge is a physical network appliance,
shipped by Microsoft, that sends data in and out of
Azure. Data Box Edge is additionally equipped with
AI-enabled edge computing capabilities that help
you analyze, process, and transform the on-premises
data before uploading it to the cloud.
https://azure.microsoft.com/en-us/updates/announcing-azure-data-box-edge/
● ...
AI at the Edge: Others
Now OK?
Problems
Even with FPGA/ASIC and edge devices:
● Energy efficiency:
○ Better than CPU/GPU, but still far from 20 watts used by the human brain
● Architecture:
○ Even more specialized for ML/DL computations, but...
○ Still far from brain-like computations
Neuromorphic Chips
Neuromorphic chips
● Neuromorphic computing - brain-inspired computing - has emerged as a new
technology to enable information processing at very low energy cost using
electronic devices that emulate the electrical behaviour of (biological) neural
networks.
● Neuromorphic chips attempt to model in silicon the massively parallel way the
brain processes information as billions of neurons and trillions of synapses
respond to sensory inputs such as visual and auditory stimuli.
● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic
Scalable Electronics)
● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain
Project SpiNNaker and HICANN.
https://www.technologyreview.com/s/526506/neuromorphic-chips/
Neuromorphic chips: IBM TrueNorth
● 1M neurons, 256M synapses, 4096 neurosynaptic
cores on a chip, est. 46B synaptic ops per sec per W
● Uses 70mW; power density is 20 milliwatts per
cm^2, almost 1/10,000th that of most modern
microprocessors
● “Our sights are now set high on the ambitious goal of
integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while
consuming ~4kW of power”.
● Currently IBM is making plans to commercialize it.
● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips
(16M neurons, 4B synapses, for context, the human brain has 86B neurons).
When running flat out, the entire cluster will consume a grand total of 2.5 watts.
http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
Neuromorphic chips: IBM TrueNorth
● (03.2016) IBM Research demonstrated convolutional neural nets with close to
state of the art performance:
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
Neuromorphic chips: Intel Loihi
● Fully asynchronous neuromorphic many core mesh that
supports a wide range of sparse, hierarchical and recurrent
neural network topologies
● Each neuromorphic core includes a learning engine that can
be programmed to adapt network parameters during
operation, supporting supervised, unsupervised,
reinforcement and other learning paradigms.
● Fabrication on Intel’s 14 nm process technology.
● A total of 130,000 neurons and 130 million synapses.
● Development and testing of several algorithms with high
algorithmic efficiency for problems including path planning,
constraint satisfaction, sparse coding, dictionary learning,
and dynamic pattern learning and adaptation.
https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/
https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/
https://ieeexplore.ieee.org/document/8259423
https://en.wikichip.org/wiki/intel/loihi
Neuromorphic chips: Intel Loihi
“Intel researchers have recently been testing the Loihi chip by
training it on tasks such as recognizing a small set of objects
within seconds. The company has not yet pushed the capabilities
of the neuromorphic chip to its limit, Mayberry [Michael Mayberry,
corporate vice president and managing director of Intel Labs]
says. Still, he anticipates neuromorphic computing products
potentially hitting the market within 2 to 4 years, if customers
can run their applications on the Loihi chip without requiring
additional hardware modifications.”
“Neither quantum nor neuromorphic computing are going to
replace general purpose computing,” Mayberry says. “But
they can enhance it.”
https://spectrum.ieee.org/tech-talk/computing/hardware/intels-49qubit-chip-aims-for-quantum-supremacy
Neuromorphic chips: Intel Loihi
“Using Intel's Loihi neuromorphic research chip and
ABR's Nengo Deep Learning toolkit, we analyze the
inference speed, dynamic power consumption, and
energy cost per inference of a two-layer neural
network keyword spotter trained to recognize a single
phrase. We perform comparative analyses of this
keyword spotter running on more conventional
hardware devices including a CPU, a GPU, Nvidia's
Jetson TX1, and the Movidius Neural Compute
Stick.”
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware
https://arxiv.org/abs/1812.01739
Neuromorphic chips: Intel Pohoiki Beach
(Jul 15, 2019) “Intel announced that an 8 million-neuron
neuromorphic system comprising 64 Loihi research chips
— codenamed Pohoiki Beach — is now available to the broader
research community. With Pohoiki Beach, researchers can
experiment with Intel’s brain-inspired research chip, Loihi, which
applies the principles found in biological brains to computer
architectures. ”
https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
Neuromorphic chips: Tianjic
Tianjic’s unified function core (FCore) which combines essential
building blocks for both artificial neural networks and biologically inspired
networks: axon, synapse, dendrite and soma blocks. The 28-nm
chip consists of 156 FCores, containing approximately 40,000
neurons and 10 million synapses in an area of 3.8×3.8 mm2.
Tianjic delivers an internal memory bandwidth of more than 610 GB
per second, and a peak performance of 1.28 TOPS per watt for
running artificial neural networks. In the biologically-inspired spiking
neural network mode, Tianjic achieves a peak performance of about
650 giga synaptic operations per second (GSOPS) per watt.
https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-lear
ning-and-neuroscience-in-f1c3e8a03113
https://www.nature.com/articles/s41586-019-1424-8
Neuromorphic chips: Others
● SpiNNaker (1,036,800 ARM9 cores)
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/
● SpiNNaker-2
https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf
https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain
Simulation and Machine Learning”
● BrainScaleS, HICANN: 20x 8-inch silicon wafers, each incorporating 50 million
(50 x 10^6) plastic synapses and 200,000 biologically realistic neurons.
https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/
● Akida NSoC: 1.2 million neurons and 10 billion synapses
https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip
https://en.wikichip.org/wiki/brainchip/akida
● Neurogrid: can model a slab of cortex with up to 16x256x256
neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
From: https://d1io3yog0oux5.cloudfront.net/_51d5497ffa729abd180ed52c4234217f/brainchipinc/db/217/1582/pdf/Akida+Launch+Presentation.pdf
Anything else?
Other approaches
● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html
● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/
● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/
● Unconventional computing: cellular automata, reservoir computing, using
biological cells/neurons, chemical computation, membrane computing, slime
mold computing and much more https://www.springer.com/gp/book/9781493968824
● ...
References:
Hardware for Deep Learning series of posts:
https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc
● Part 1: Introduction and Executive summary
● Part 2: CPU
● Part 3: GPU
● Part 4: FPGA
● Part 5: ASIC
● Part 6: Mobile AI
● Part 7: Neuromorphic computing
● Part 8: Quantum computing
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!

 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Deep learning: Hardware Landscape

  • 11. ● nGraph: open source C++ library, compiler and runtime for Deep Learning. Frameworks using nGraph Compiler stack to execute workloads have shown up to 45X performance boost when compared to native framework implementations. https://www.ngraph.ai/ Graph compilers
  • 12. Graph compilers: wider view https://medium.com/tensorflow/mlir-a-new-intermediate-representation-and-compiler-framework-beba999ed18d
  • 13. Graph compilers: watch for MLIR! https://www.tensorflow.org/mlir/overview
  • 14. You need to transfer data between CPU host memory and GPU memory. In most x86 systems this is done over the PCI Express (PCIe) bus. x86: PCIe
  • 15. A typical GPU card works in x16 mode at full speed, but may also work in x8 or x4 mode at lower speed. PCIe v.3 allows for 985 MB/s per lane, so 15.75 GB/s for x16 links. PCIe v.4 is twice as fast, so 31.5 GB/s for x16 (supported by AMD's X570 chipset and Radeon cards). PCIe v.5 doubles that again (the spec is released, but no products are expected before 2020), so 63 GB/s for x16. x86: PCIe bandwidth
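These per-link figures follow directly from the per-lane rate; a minimal sketch (plain Python, using the per-lane numbers quoted above) that reproduces them:

```python
# Rough PCIe bandwidth calculator; per-lane rates (GB/s) as quoted on the slide.
PER_LANE_GBPS = {3: 0.985, 4: 1.969, 5: 3.938}  # PCIe v.3 / v.4 / v.5

def link_bandwidth(gen, lanes):
    """Approximate unidirectional bandwidth of a PCIe link in GB/s."""
    return PER_LANE_GBPS[gen] * lanes

for gen in (3, 4, 5):
    for lanes in (4, 8, 16):
        print(f"PCIe v.{gen} x{lanes}: {link_bandwidth(gen, lanes):.2f} GB/s")
# PCIe v.3 x16 ≈ 15.75 GB/s, v.4 x16 ≈ 31.5 GB/s, v.5 x16 ≈ 63 GB/s
```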
  • 16. A typical Intel mainstream desktop processor has 16 PCIe lanes (e.g. the i7-7700K, i7-8700K or even the i9-9900K). A high-end desktop (HEDT) processor has 28 to 44 lanes (e.g. the i7-7820X has 28, the rather old i7-6850K has 40, the i9-9980XE has 44, and the upcoming i9-10940X and higher will have 48). Xeons have up to 64 lanes (PCIe v.3). AMD Ryzen Threadripper has 64 PCIe lanes; EPYC has 128 (PCIe v.4). Be careful: Intel sometimes quotes “Platform PCIe lanes”, which counts CPU+PCH lanes, but the PCH lanes share a single uplink to the CPU! https://www.anandtech.com/show/11839/intel-core-i9-7980xe-and-core-i9-7960x-review/4 Check specs at https://ark.intel.com/ x86: PCIe bandwidth / CPU side
  • 17. With a CPU that has few PCIe lanes you can't run two GPUs at their full x16 speed, and in some cases you can't even run a single GPU at x16. In practice, though, the difference between v.3/v.4 or x8/x16 is often insignificant, and the bottleneck may be elsewhere (see the estimate below). x86: PCIe bandwidth / CPU side
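As a quick sanity check of the claim above, here is a back-of-the-envelope estimate (plain Python, with illustrative batch and model assumptions, not measurements) of how long a typical training batch takes to cross the PCIe link:

```python
# Back-of-the-envelope check of whether host-to-GPU transfer dominates a training step.
# All numbers below are illustrative assumptions, not measurements.
batch = 256                            # images per step
bytes_per_image = 224 * 224 * 3 * 4    # float32 ImageNet-sized input
transfer_bytes = batch * bytes_per_image

for name, gbps in [("PCIe v.3 x8", 7.88), ("PCIe v.3 x16", 15.75)]:
    transfer_ms = transfer_bytes / (gbps * 1e9) * 1e3
    print(f"{name}: ~{transfer_ms:.1f} ms to move the batch")

# If one training step takes tens of milliseconds of GPU compute (typical for a
# ResNet-50-sized model), the x8 vs x16 gap is usually hidden by prefetching and
# by overlapping host-to-device copies with compute.
```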
  • 20. Memory speed can also be important (plus multi-channel mode). DDR4 data transfer rates (PCIe v.3/v.4 shown for comparison):
  ● PCIe v.3 x4: 3.94 GB/s
  ● PCIe v.3 x8: 7.88 GB/s
  ● DDR4-1600: 12.8 GB/s
  ● DDR4-1866: 14.93 GB/s
  ● PCIe v.3 x16 / PCIe v.4 x8: 15.75 GB/s
  ● DDR4-2133: 17 GB/s
  ● DDR4-2400: 19.2 GB/s
  ● DDR4-2666: 21.3 GB/s
  ● DDR4-3200: 25.6 GB/s
  ● PCIe v.4 x16: 31.51 GB/s
  x86: Memory
  • 21. ● Single-board computers: Raspberry Pi, the CPU side of the Jetson Nano, and the Google Coral Dev Board ● Mobile: Qualcomm Snapdragon, Apple A11, etc. ● Server: Marvell ThunderX, Ampere eMAG, Amazon A1 instances, etc.; NVIDIA announced GPU-accelerated Arm-based servers ● Laptops: Microsoft Surface Pro X ● ARM also has the Ethos ML/AI NPU and Mali GPU ARM
  • 22. ● ARM announced the Neoverse N1 platform (scales up to 128 cores) https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html ● Qualcomm built an ARM server processor for cloud applications, the Centriq 2400 (48 single-threaded cores, 2.2 GHz); the project has been stopped. https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html ● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz) https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag ● Marvell ThunderX ARM processors (up to 48 cores, up to 2.5 GHz) https://www.marvell.com/server-processors/thunderx-arm-processors/ Support NVIDIA GPUs: https://www.electronicsweekly.com/news/nvidia-cuda-x-ai-hpc-software-stack-marvell-thunderx-platforms-2019-11/ ● Amazon Graviton ARM processor (16 cores, 2.3 GHz) https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400 https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/ ● Huawei Kunpeng 920 ARM server CPU (64 cores, 2.6 GHz) https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu ARM: Servers
  • 23. Current architecture is POWER9: ● 12 cores x 8 threads or 24 cores x 4 threads (96 threads). ● PCIe v.4, 48 PCIe lanes ● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection ● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC) IBM POWER
  • 24. IBM POWER9 + NVLink 2.0
  • 25. The current fastest supercomputer in the world, Summit, is based on POWER9, while also using Nvidia's Volta GPUs as accelerators. POWER10 is expected in 2020-2021: ● 48 cores ● PCIe v.5 ● NVLink 3.0 ● OpenCAPI 4.0 ● ... IBM POWER
  • 26. RISC-V is an open-source hardware instruction set architecture. Examples: ● SiFive U5, U7 and U8 cores https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip ● Alibaba's RISC-V processor design, the Xuantie 910 (XT 910): a 12nm, 64-bit design with 16 cores clocked at up to 2.5 GHz, the fastest RISC-V processor to date https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/ ● Western Digital SweRV Core, designed for embedded devices supporting data-intensive edge applications https://www.westerndigital.com/company/innovations/risc-v ● Esperanto Technologies is building an AI chip with 1k+ cores https://www.esperanto.ai/technology/ RISC-V
  • 27. ● KiloCore project with 1000 independent programmable processors https://www.ucdavis.edu/news/worlds-first-1000-processor-chip ● Cerebras Systems Wafer Scale Engine (WSE), an AI chip that measures 8.46x8.46 inches, making it almost the size of an iPad and more than 50 times larger than a CPU or GPU. WSE chip has 1.2 trillion transistors, 400,000 computing cores and 18 gigabytes of memory. https://www.nextplatform.com/2019/08/21/machine-learning-chip-breaks-new-ground-with-waferscale-integration/ Others
  • 28. GPU
  • 30. … → Kepler → Maxwell → Pascal → Volta → Turing → Ampere → ... NVIDIA Architectures
  • 31. ● Peak performance (GFLOPS) at FP32/16/... ● #Cores (+Tensor Cores) ● Memory size ● Memory speed/bandwidth ● Precision support ● Can connect using NVLink? ● Power usage (Watts) ● Price ● GFLOPS/USD ● GFLOPS/Watt ● Form factor (for desktop or server?) ● ECC memory ● Legal restrictions (e.g. GeForce is not allowed to use in datacenters) Important dimensions https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
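To make the GFLOPS/USD and GFLOPS/Watt dimensions concrete, here is a tiny comparison sketch; the card names, prices and throughput numbers are placeholders rather than vendor specifications:

```python
# Ranking GPUs along two of the "important dimensions" above.
# FP32 GFLOPS, price (USD) and TDP (W) are placeholder values for illustration only.
cards = {
    "card_A": {"gflops": 13000, "usd": 1200, "watt": 250},
    "card_B": {"gflops": 8000,  "usd": 500,  "watt": 180},
}

for name, c in cards.items():
    print(f"{name}: {c['gflops'] / c['usd']:.1f} GFLOPS/USD, "
          f"{c['gflops'] / c['watt']:.1f} GFLOPS/Watt")
```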
  • 33. ● FP64 (64-bit float), not used for DL ● FP32 — the most commonly used for training ● FP16 or Mixed precision (FP32+FP16) — becoming the new default ● INT8 — usually for inference ● INT4, INT1 — experimental modes for inference Precision
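For reference, this is roughly what FP32+FP16 mixed-precision training looks like with PyTorch's automatic mixed precision utilities (torch.cuda.amp in recent PyTorch releases; older setups used NVIDIA Apex). A minimal sketch, assuming a CUDA GPU with FP16 support:

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()           # keeps FP16 gradients from underflowing

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # ops run in FP16 where safe, FP32 elsewhere
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()               # scale the loss, backprop in mixed precision
    scaler.step(optimizer)
    scaler.update()
```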
  • 34. bfloat16 isn’t supported on GPU (but is supported on TPU gen3, and will be supported on AMD GPU and Intel CPU/NNP). Precision: bfloat16 https://www.nextplatform.com/2019/07/15/intel-prepares-to-graft-googles-bfloat16-onto-processors/ https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware
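bfloat16 keeps float32's 8 exponent bits and truncates the mantissa to 7 bits, so it is essentially the top 16 bits of a float32. A small illustration in plain Python (real hardware rounds rather than truncates, but truncation shows the format's behaviour):

```python
import struct

def to_bfloat16(x):
    """Truncate a float32 to bfloat16 (keep the top 16 bits) and expand back to float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

for v in (3.141592653589793, 1e-30, 65504.0, 1e38):
    print(f"{v!r:>12} -> bfloat16 {to_bfloat16(v)!r}")
# Same dynamic range as FP32 (8 exponent bits) but only ~3 significant decimal digits,
# which is why it suits DL training without the loss-scaling tricks FP16 needs.
```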
  • 39. Not only FLOPS: Roofline Performance Model https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
  • 40. Roofline Performance Model: Example https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
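The roofline model itself is a one-liner: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. A small sketch with placeholder hardware numbers (not any specific GPU's spec sheet):

```python
# Roofline model: attainable FLOP/s = min(peak compute, arithmetic intensity * memory bandwidth).
PEAK_GFLOPS = 14000     # peak FP32 throughput, GFLOPS (placeholder)
MEM_BW_GBS = 900        # memory bandwidth, GB/s (placeholder)

def attainable_gflops(flops, bytes_moved):
    intensity = flops / bytes_moved                     # FLOP per byte moved from memory
    return min(PEAK_GFLOPS, intensity * MEM_BW_GBS), intensity

# A large matrix multiply (compute-bound) vs an element-wise op (memory-bound):
for name, flops, nbytes in [("matmul 4096^3", 2 * 4096**3, 3 * 4096**2 * 4),
                            ("elementwise add", 4096**2, 3 * 4096**2 * 4)]:
    perf, ai = attainable_gflops(flops, nbytes)
    print(f"{name}: intensity {ai:.1f} FLOP/B -> at most {perf:.0f} GFLOPS")
```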
  • 41. Separate cards can be joined using NVLink; SLI is not relevant for DL, it is for graphics. NVSwitch: the fully connected NVLink. NCCL 1: a multi-GPU collective communication primitives library. NVIDIA: Single-machine Multi-GPU
  • 42. Distributed training is now a commodity (but scaling is sublinear). NCCL 2: multi-node collective communication primitives library NVIDIA: Distributed Multi-GPU
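For orientation, a minimal sketch of multi-GPU data-parallel training over NCCL in PyTorch, assuming one process per GPU started by a launcher such as torchrun that sets the LOCAL_RANK environment variable:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; NCCL handles the all-reduce of gradients between them.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()          # gradients are averaged across GPUs via NCCL here
optimizer.step()
```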
  • 43. ● AMD has powerful GPUs, but they are still mostly unsupported by DL frameworks ● Intel has its own integrated GPUs on its processors (HD Graphics) ● Some Intel CPUs were shipped with AMD GPUs on package (Kaby Lake-G, e.g. the i7-8809G) ● Intel plans to release its first discrete GPU in 2020 (Xe architecture) GPU: AMD, Intel
  • 45. Problems Serious problems with the current processors (CPU/GPU) are: ● Energy efficiency: ○ The version of AlphaGo that played against Lee Sedol used 1,920 CPUs and 280 GPUs (https://en.wikipedia.org/wiki/AlphaGo) ○ Its estimated power consumption was approximately 1 MW (200 W per CPU and 200 W per GPU), compared to only 20 watts used by the human brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/) ● Architecture: ○ good for matrix multiplication (still the essence of DL) ○ but not well suited to brain-like computations
  • 46. FPGA
  • 47. FPGA ● An FPGA (field-programmable gate array) is an integrated circuit designed to be configured by a customer or a designer after manufacturing ● Both FPGAs and ASICs (see later) are usually much more energy-efficient than general-purpose processors (so more productive in terms of GFLOPS per Watt). FPGAs are usually used for inference, not training. ● OpenCL can be used to develop for FPGAs (as can C/C++), and some ML/DL libraries already use OpenCL (for example, Caffe), so an easier path to low-level ML on FPGAs may emerge. ● For high-level ML there are vendor tools and graph compilers (inference only). ● You can use FPGAs in the cloud! ● See also MLIR (mentioned earlier). ● The learning curve for FPGAs is still steep :(
  • 48. FPGA in production There is some interesting movement towards FPGA: ● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/ ● Alibaba has FPGA F3 instances in the cloud https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057 ● Yandex uses FPGAs for its own DL inference. ● Microsoft ran Project Catapult (2015), which uses clusters of FPGAs https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/ ● Microsoft Azure allows deploying pretrained models on FPGA (!). https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service ● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html ● ...
  • 49. FPGA chips Two main manufacturers: Intel (ex. Altera) and Xilinx. The ‘world’s largest’ FPGA chip (Xilinx Virtex UltraScale+ VU19P) contains 9M system logic cells (35B transistors) https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html Intel has a hybrid Xeon+FPGA chip https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/ Intel has FPGA acceleration cards https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html More info: https://www.intel.com/content/www/us/en/products/programmable/fpga.html https://www.xilinx.com/products/silicon-devices/fpga.html
  • 50. Adaptive compute acceleration platform (ACAP) Xilinx Versal ACAP, a fully software-programmable, heterogeneous compute platform that combines Scalar Engines, Adaptable Engines, and Intelligent Engines. The Intelligent Engines are an array of VLIW and SIMD processing engines and memories, all interconnected with 100s of terabits per second of interconnect and memory bandwidth. These permit 5X–10X performance improvement for ML and DSP applications. https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
  • 51. FPGA: Xilinx DL Tools Xilinx ML Suite: tools to develop and deploy ML apps for real-time inference. ● xfDNN: graph compiler with optimizations and a quantizer ● xDNN: high-performance CNN processing engine. https://github.com/Xilinx/ml-suite
  • 52. FPGA: Intel DL Tools https://simplecore.intel.com/nervana/wp-content/uploads/sites/53/2018/05/IntelAIDC18_Macias_Mainstage_5_23_Final.pdf
  • 53. ASIC
  • 54. ASIC custom chips An ASIC (application-specific integrated circuit) is an integrated circuit customized for a particular use, rather than intended for general-purpose use. There is a lot of movement towards ASICs right now: ● Google has Tensor Processing Units (TPU) in the cloud. ● Intel has just demonstrated its Nervana processors for training and inference. ● Mobileye (Intel) chips with specially developed ASIC cores are used in BMW, Tesla, Volvo, etc. ● Movidius (acquired by Intel) Myriad X VPU, a dedicated hardware accelerator for deep neural network inference. https://www.movidius.com/myriadx ● Alibaba Hanguang 800 ● Huawei Ascend 310, 910 ● Bitmain Sophon ● ...
  • 56. ASIC: Google TPU TPU v2 ● 180 TFLOPS (bfloat16) ● 64 GB HBM ● $4.50 / TPU hour https://cloud.google.com/tpu/ https://cloud.google.com/tpu/docs/tpus https://cloud.google.com/tpu/docs/system-architecture TPU v3 ● 420 TFLOPS (bfloat16) ● 128 GB HBM ● $8.00 / TPU hour
  • 57. A “TPU v3 pod” 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network
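Using a Cloud TPU from TensorFlow 2.x is mostly a matter of pointing a TPUStrategy at the TPU; a minimal sketch, where "TPU_NAME" is a placeholder for your TPU's name or address (API names are from the TF 2.x era and have shifted slightly between releases):

```python
import tensorflow as tf

# Assumes a Cloud TPU is reachable under the name/address given in "TPU_NAME".
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="TPU_NAME")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():                       # variables are placed on the TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(1024,))])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```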
  • 58. ASIC: Intel (Nervana) NNP-T Processor for training. Can build PODs (say 10-rack POD with 480 NNP-T) ● 24 Tensor Processing Cluster (TPC) ● PCIe Gen 4 x16 accelerator card, 300W ● OCP Accelerator Module, 375W ● 119 TOPS bfloat16 ● 32 GB HBM2 https://www.intel.ai/nervana-nnp/nnpt/ https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
  • 59. ASIC: Intel (Nervana) NNP-I Processor for inference using mixed precision math, with a special emphasis on low-precision computations using INT8. ● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI) ● M.2 form factor (1 chip): 12W, up to 50 TOPS. ● PCIe card (2 chips): 75W, up to 170 TOPS. https://www.intel.ai/nervana-nnp/nnpi https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
  • 60. ASIC: Bitmain Sophon Tensor Computing Processor BM1684 is a third-generation TPU. ● 1024 processing units ● 17.6 TOPS INT8, 2.2 TFLOPS FP32 (?) Deep Learning Acceleration Card SC3: ● 1x BM1682 (2nd-gen TPU), 3 TFLOPS FP32, 3 GB, PCIe 3.0 x8 ● Max power: 65W ● Caffe/TensorFlow/PyTorch/MXNet https://www.sophon.ai/product/introduce/bm1684.html https://www.sophon.ai/product/introduce/sc3.html
  • 61. ASIC: Alibaba Hanguang 800 (October 28, 2019) Announcing Hanguang 800: Alibaba's First AI-Inference Chip “Hanguang 800 is the world's most powerful AI inference chip. In the Resnet-50 industry test, the peak performance of the new chip reached a whopping 78,563 images per second, which is four times higher than the second best AI chip in the world. The peak efficiency of the chip also reached 500 IPS/W, which is 3.3 times higher than the second best option.” “A Hanguang 800 chip can offer the computing power equivalent to 10 traditional GPUs.“ https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip_595482
  • 62. ASIC: Baidu Kunlun (July 3, 2018) Baidu unveils Kunlun AI chip for edge and cloud computing “Kunlun is made to handle AI models for edge computing on devices and in the cloud via datacenters. The Kunlun 818-300 model will be used for training AI, and the 818-100 for inference..“ https://venturebeat.com/2018/07/03/baidu-unveils-kunlun-ai-chip-for-edge-and-cloud-computing/ “The Kunlun-powered server’s computing power is 30 times higher than FPGA-based AI accelerators, according to Yin Shiming, Baidu vice president.” https://technode.com/2019/08/29/baidu-unveils-kunlun-powered-cloud-server-at-waic/
  • 63. ASIC: Huawei Ascend 910 (Aug 23, 2019) Huawei launches Ascend 910, the world's most powerful AI processor ● 256 TFLOPS (FP16), 512 TOPS (INT8) ● 310 W of max power ● HCCS, PCIe 4.0, and RoCE v2 build scale-up and scale-out systems both flexibly and efficiently. HCCS is Huawei's in-house high-speed interface interconnecting Ascend 910s. On-chip RoCE interconnects nodes directly. PCIe 4.0 doubles the throughput of the previous generation. ● Ascend 910 2048-node cluster can deliver up to 512 PFLOPS. https://www.huawei.com/en/press-events/news/2019/8/Huawei-Ascend-910-most-powerful-AI-processor https://e.huawei.com/us/products/cloud-computing-dc/atlas/ascend-910 https://www.servethehome.com/huawei-ascend-910-provides-a-nvidia-ai-training-alternative/
  • 64. ASIC: Habana Gaudi: training chip, designed to scale well. ● PCIe 4.0 x16, 32 GB HBM2, 1 Tb/s, ECC, RDMA ● 200-300W ● FP32, BF16, INT/UINT 32, 16, 8 https://habana.ai/training/ https://habana.ai/wp-content/uploads/2019/06/Gaudi-Datasheet.pdf Goya: inference chip ● PCIe 4.0 x16, 4/8/16 GB DDR4, ECC, 200W ● FP32, INT/UINT 32, 16, 8 https://habana.ai/inference/ https://habana.ai/wp-content/uploads/2019/06/Goya-Datasheet-HL-10x.pdf https://www.nextplatform.com/2019/08/26/habana-takes-training-and-inference-down-different-paths/
  • 65. ASIC: Graphcore IPU Graphcore IPU: for both training and inference. Allows new and emerging machine intelligence workloads to be realized. Colossus IPU: ● 23.6B transistors, 1216 independent processor cores, 300 MB of in-processor memory, 125 TFLOPS mixed precision ● 45 TB/s memory bandwidth, 8 TB/s on-chip exchange between cores ● C2 IPU Processor card: 2x Colossus, PCIe 4.0 x16 (64 GB/s bidirectional), card-to-card IPU-Links (2.5 TB/s), 300W https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-pcie-card-at-dell-tech-world/ https://www.graphcore.ai/products/ipu IPU on Azure https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
  • 66. ASIC: Others ● Qualcomm Cloud AI 100 (inference) https://www.qualcomm.com/news/releases/2019/04/09/qualcomm-brings-power-efficient-artificial-intelligence-inference ● ARM ML inference NPU Ethos-N77 https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor ● Intel eASIC: an intermediary technology between FPGAs and standard-cell ASICs with lower unit cost and faster time-to-market https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html ● ...
  • 67. ASIC: Summary ● Very diverse field! ● Hard to directly compare different solutions based on their characteristics (can be too different architectures). ● You can use a common benchmark like https://mlperf.org/ ● DL framework support is usually limited, some solutions use their own frameworks/libraries.
  • 69. AI at the edge ● NVIDIA Jetson TK1/TX1/TX2/Xavier/Nano ○ 192/256/256/512/128 CUDA cores ○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB memory ● Tablets, smartphones ○ Qualcomm Snapdragon 845/855, Apple A11/A12 Bionic, Huawei Kirin 970/980/990, etc. ● Raspberry Pi 4 (1.5 GHz 4-core, 4 GB memory) ● Movidius Neural Compute Stick, Stick 2 ● Google Edge TPU
  • 70. “The Qualcomm Neural Processing SDK for AI, our software-accelerated runtime for the execution of deep neural networks, lets you program the Qualcomm AI Engine. Together, the engine and the SDK allow you to squeeze up to 7 TOPS of AI processing out of the Snapdragon 855, with massive acceleration for your on-device AI applications. The engine provides high capacity for matrix multiplication on both the Qualcomm Hexagon Vector eXtensions (HVX) and the Hexagon Tensor Accelerator (HTA). With enough on-device processing power to run more than 140 inferences per second on the Inception-v3 neural network, your app could classify or detect dozens of objects in just a few milliseconds and with high confidence.” https://developer.qualcomm.com/blog/accelerate-your-device-ai-qualcomm-artificial-intelligence-ai-engine-snapdragon https://www.qualcomm.com/products/snapdragon-855-mobile-platform Mobile AI: Qualcomm SD 855 (DSP+)
  • 71. “HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency, stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core. With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic of the future.” https://consumer.huawei.com/en/campaign/kirin-990-series/ “Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8 and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core. There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power efficiency above all else, and it can be used for polling or other applications where performance isn’t particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin 990 (LTE) has one big core and one tiny core.” https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
  • 72. (Oct 2, 2019) Inside Apple’s A13 Bionic system-on-chip “This year, the CPU has a new trick: A set of “machine learning accelerators” that perform matrix multiplication operations up to six times faster than the CPU alone. The Neural Engine (8-core), like everything else in the chip, tops out at 20 percent faster than before (it’s as if the designs are relatively unchanged, and the new 7nm+ process allows for 20 percent higher clock speeds). There’s a machine learning controller in the chip that automatically schedules machine learning operations between the CPU, GPU, and Neural Engine so developers don’t have to balance out the load themselves.” https://www.macworld.com/article/3442716/inside-apples-a13-bionic-system-on-chip.html Mobile AI: Apple (Neural Engine)
  • 73. (Aug 7, 2019) Samsung’s first 7-nanometer EUV processor will power the Galaxy Note 10 “The Exynos 9825 features an integrated Neural Processing Unit (NPU) designed for the next generation of mobile experiences from AI-powered photography to augmented reality. With fast, efficient AI processing, the NPU brings new possibilities for on-device AI from object recognition for optimized photos, to a suite of performance enhancing intelligence features such as usage pattern recognition and faster app pre-loading.” https://www.samsung.com/semiconductor/minisite/exynos/products/mobileprocessor/exynos-9825/ Mobile AI: Samsung (NPU)
  • 74. (Nov 26, 2019) MediaTek Announces Dimensity 1000 ARM Chip With Integrated 5G Modem “The Dimensity 1000 doesn’t just bring new branding; it’s also sporting four Cortex A77 CPU cores and four Cortex A55 CPU cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core ISP, and a 6-core AI processor. The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It houses six AI processors (two big cores, three small cores and a single tiny core). The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.” https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem https://i.mediatek.com/mediatek-5g Mobile AI: MediaTek (APU)
  • 75. AI at the Edge: Jetson Nano Price: $99 NVIDIA Jetson Nano Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing. All in an easy-to-use platform that runs in as little as 5 watts. ● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS ● 4 GB 64-bit LPDDR4 25.6 GB/s https://developer.nvidia.com/embedded/jetson-nano-developer-kit See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
  • 76. Neural Compute Stick 2 (~$70) The latest generation of Intel® VPUs includes 16 powerful processing cores (called SHAVE cores) and a dedicated deep neural network hardware accelerator for high-performance vision and AI inference applications—all at low power. ● Supports Convolutional Neural Network (CNN) ● Support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and PaddlePaddle via an ONNX conversion ● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU) ● Connectivity: USB 3.0 Type-A https://software.intel.com/en-us/neural-compute-stick AI at the Edge: Movidius
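Running a model on the stick goes through OpenVINO's Inference Engine with the "MYRIAD" device plugin; a minimal sketch (the IR file names and the input layer name are placeholders, and attribute names vary slightly between OpenVINO releases):

```python
from openvino.inference_engine import IECore
import numpy as np

# Assumes a model already converted to OpenVINO IR ("model.xml"/"model.bin" are placeholders)
# and a Neural Compute Stick 2 attached; "MYRIAD" selects the Myriad X VPU plugin.
ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

frame = np.random.rand(1, 3, 224, 224).astype(np.float32)   # dummy NCHW input
# The dict key must match the network's input layer name from the IR.
result = exec_net.infer(inputs={"input": frame})
```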
  • 77. AI at the Edge: Google Edge TPU The Edge TPU is a small ASIC designed by Google that provides high performance ML inferencing for low-power devices. For example, it can execute state-of-the-art mobile vision models such as MobileNet V2 at 400 FPS, in a power efficient manner. The on-board Edge TPU coprocessor is capable of performing 4 TOPS using 0.5 watts for each TOPS (2 TOPS per watt). TensorFlow Lite models can be compiled to run on the Edge TPU. USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board https://cloud.google.com/edge-tpu/ https://coral.ai/products/
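On the Coral side, inference uses the TensorFlow Lite interpreter with the Edge TPU delegate; a minimal sketch, assuming a model already compiled with the edgetpu_compiler (the file name is a placeholder):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Assumes a model compiled for the Edge TPU ("model_edgetpu.tflite" is a placeholder)
# and libedgetpu installed on the host.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"],
                       np.zeros(inp["shape"], dtype=inp["dtype"]))  # dummy input
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```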
  • 78. ● Sophon Neural Network Stick (NNS) https://www.sophon.ai/product/introduce/nns.html ● Xilinx Edge AI (FPGA!) https://www.xilinx.com/applications/industrial/analytics-machine-learning.html ● (Nov 13, 2019) Azure Data Box Edge is a physical network appliance, shipped by Microsoft, that sends data in and out of Azure. Data Box Edge is additionally equipped with AI-enabled edge computing capabilities that help you analyze, process, and transform the on-premises data before uploading it to the cloud. https://azure.microsoft.com/en-us/updates/announcing-azure-data-box-edge/ ● ... AI at the Edge: Others
  • 80. Problems Even with FPGA/ASIC and edge devices: ● Energy efficiency: ○ Better than CPU/GPU, but still far from 20 watts used by the human brain ● Architecture: ○ Even more specialized for ML/DL computations, but... ○ Still far from brain-like computations
  • 82. Neuromorphic chips ● Neuromorphic computing - brain-inspired computing - has emerged as a new technology to enable information processing at very low energy cost using electronic devices that emulate the electrical behaviour of (biological) neural networks. ● Neuromorphic chips attempt to model in silicon the massively parallel way the brain processes information as billions of neurons and trillions of synapses respond to sensory inputs such as visual and auditory stimuli. ● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic Scalable Electronics) ● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain Project SpiNNaker and HICANN. https://www.technologyreview.com/s/526506/neuromorphic-chips/
  • 83. Neuromorphic chips: IBM TrueNorth ● 1M neurons, 256M synapses, 4096 neurosynaptic cores on a chip, est. 46B synaptic ops per second per W ● Uses 70 mW; the power density is 20 milliwatts per cm², almost 1/10,000th that of most modern microprocessors ● “Our sights are now set high on the ambitious goal of integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while consuming ~4kW of power”. ● Currently IBM is making plans to commercialize it. ● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips (16M neurons, 4B synapses; for context, the human brain has 86B neurons). When running flat out, the entire cluster will consume a grand total of 2.5 watts. http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
  • 84. Neuromorphic chips: IBM TrueNorth ● (03.2016) IBM Research demonstrated convolutional neural nets with close to state of the art performance: “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
  • 85. Neuromorphic chips: Intel Loihi ● Fully asynchronous neuromorphic many core mesh that supports a wide range of sparse, hierarchical and recurrent neural network topologies ● Each neuromorphic core includes a learning engine that can be programmed to adapt network parameters during operation, supporting supervised, unsupervised, reinforcement and other learning paradigms. ● Fabrication on Intel’s 14 nm process technology. ● A total of 130,000 neurons and 130 million synapses. ● Development and testing of several algorithms with high algorithmic efficiency for problems including path planning, constraint satisfaction, sparse coding, dictionary learning, and dynamic pattern learning and adaptation. https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/ https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/ https://ieeexplore.ieee.org/document/8259423 https://en.wikichip.org/wiki/intel/loihi
  • 86. Neuromorphic chips: Intel Loihi “Intel researchers have recently been testing the Loihi chip by training it on tasks such as recognizing a small set of objects within seconds. The company has not yet pushed the capabilities of the neuromorphic chip to its limit, Mayberry [Michael Mayberry, corporate vice president and managing director of Intel Labs] says. Still, he anticipates neuromorphic computing products potentially hitting the market within 2 to 4 years, if customers can run their applications on the Loihi chip without requiring additional hardware modifications.” “Neither quantum nor neuromorphic computing are going to replace general purpose computing,” Mayberry says. “But they can enhance it.” https://spectrum.ieee.org/tech-talk/computing/hardware/intels-49qubit-chip-aims-for-quantum-supremacy
  • 87. Neuromorphic chips: Intel Loihi “Using Intel's Loihi neuromorphic research chip and ABR's Nengo Deep Learning toolkit, we analyze the inference speed, dynamic power consumption, and energy cost per inference of a two-layer neural network keyword spotter trained to recognize a single phrase. We perform comparative analyses of this keyword spotter running on more conventional hardware devices including a CPU, a GPU, Nvidia's Jetson TX1, and the Movidius Neural Compute Stick.” Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware https://arxiv.org/abs/1812.01739
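For a flavour of the Nengo toolkit mentioned in the paper, here is a toy spiking network; the same model definition can be targeted at Loihi via nengo_loihi's Simulator. This is only a structural sketch, not the keyword spotter from the paper:

```python
import numpy as np
import nengo

# A toy spiking network: a sine-wave input decoded from 100 spiking LIF neurons.
with nengo.Network() as model:
    stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))
    ens = nengo.Ensemble(n_neurons=100, dimensions=1)
    nengo.Connection(stim, ens)
    probe = nengo.Probe(ens, synapse=0.01)      # filtered, decoded output

with nengo.Simulator(model) as sim:             # nengo_loihi.Simulator would target Loihi
    sim.run(1.0)
print(sim.data[probe].shape)                     # decoded signal over time
```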
  • 88. Neuromorphic chips: Intel Pohoiki Beach (Jul 15, 2019) “Intel announced that an 8 million-neuron neuromorphic system comprising 64 Loihi research chips — codenamed Pohoiki Beach — is now available to the broader research community. With Pohoiki Beach, researchers can experiment with Intel’s brain-inspired research chip, Loihi, which applies the principles found in biological brains to computer architectures. ” https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
  • 89. Neuromorphic chips: Tianjic Tianjic’s unified function core (FCore) combines essential building blocks for both artificial neural networks and biological networks: axon, synapse, dendrite and soma blocks. The 28-nm chip consists of 156 FCores, containing approximately 40,000 neurons and 10 million synapses in an area of 3.8×3.8 mm². Tianjic delivers an internal memory bandwidth of more than 610 GB per second, and a peak performance of 1.28 TOPS per watt for running artificial neural networks. In the biologically inspired spiking neural network mode, Tianjic achieves a peak performance of about 650 giga synaptic operations per second (GSOPS) per watt. https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113 https://www.nature.com/articles/s41586-019-1424-8
  • 90. Neuromorphic chips: Others ● SpiNNaker (1,036,800 ARM9 cores) http://apt.cs.manchester.ac.uk/projects/SpiNNaker/ ● SpiNNaker-2 https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain Simulation and Machine Learning” ● BrainScaleS, HICANN: 20 eight-inch silicon wafers, each incorporating 50 million plastic synapses and 200,000 biologically realistic neurons. https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/ ● Akida NSoC: 1.2 million neurons and 10 billion synapses https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip https://en.wikichip.org/wiki/brainchip/akida ● Neurogrid: can model a slab of cortex with up to 16×256×256 neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
  • 93. Other approaches ● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor ● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html ● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/ ● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/ ● Unconventional computing: cellular automata, reservoir computing, using biological cells/neurons, chemical computation, membrane computing, slime mold computing and much more https://www.springer.com/gp/book/9781493968824 ● ...
  • 94. References: Hardware for Deep Learning series of posts: https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc ● Part 1: Introduction and Executive summary ● Part 2: CPU ● Part 3: GPU ● Part 4: FPGA ● Part 5: ASIC ● Part 6: Mobile AI ● Part 7: Neuromorphic computing ● Part 8: Quantum computing