AI:
Hardware Landscape
Grigory Sapunov
OpenTalks.AI / 2021.02.05
gs@inten.to
Executive Summary :)
Most hardware focused on DL, which requires a lot of computations:
● There’s much more diversity in CPUs now, not only x86.
● GPUs (mostly NVIDIA) are the most popular choice. Intel and AMD may offer
interesting alternatives this year.
● There are some available ASIC alternatives: Google TPU (in cloud only),
Graphcore, Huawei Ascend.
● More ASICs are coming into this field: Cerebras, Habana, etc.
● Some companies are adopting FPGAs and offer them in the cloud
(Microsoft, AWS).
● Edge AI is everywhere already! More to come!
● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic,
memristors, etc)
● Quantum computing can potentially benefit machine learning as well
(but it probably won’t be a desktop or in-house server solution)
CPU
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
https://arxiv.org/abs/1911.05289
Typically multi-core even on the desktop market:
● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs
● up to 18 cores/36 threads in high-end Intel CPUs (i9-7980XE/9980XE/10980XE) [https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors]
● up to 64 cores/128 threads in AMD Ryzen Threadripper
(Ryzen Threadripper 3990X, Ryzen Threadripper Pro 3995WX)
x86: Desktops
On the server market:
● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 Processor)
● AMD EPYC: up to 64 cores/128 threads (EPYC 7702/7742)
● server CPUs usually have more cores than desktop processors, plus other useful
capabilities (support for more RAM, multi-processor configurations, ECC, etc)
x86: Servers
AVX-512: Fused Multiply Add (FMA) core instructions for enabling lower-precision
operations. List of CPUs with AVX-512 support:
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
VNNI (Vector Neural Network Instructions): multiply-and-add instructions for integers,
designed to accelerate convolutional neural network-based algorithms.
https://en.wikichip.org/wiki/x86/avx512vnni
DL Boost: AVX512-VNNI + Brain floating-point format (bfloat16)
designed for inference acceleration.
https://en.wikichip.org/wiki/brain_floating-point_format
x86: ML instructions (SIMD)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
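A quick way to check whether your CPU exposes these instructions (a Linux-only sketch; the flag names are the ones reported in /proc/cpuinfo):

```python
# Minimal sketch: check whether the CPU advertises AVX-512, VNNI and
# bfloat16 support by parsing the flags line of /proc/cpuinfo (Linux).
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":")[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512f", "avx512_vnni", "avx512_bf16"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```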
● BigDL: distributed deep learning library for Apache Spark
https://github.com/intel-analytics/BigDL
● Deep Neural Network Library (DNNL): an open-source performance library
for deep learning applications. Layer primitives, etc.
https://intel.github.io/mkl-dnn/
● PlaidML: an advanced and portable tensor compiler for enabling deep learning
on laptops, embedded devices, and other hardware.
Supports Keras, ONNX, and nGraph.
https://github.com/plaidml/plaidml
● OpenVINO Toolkit: for computer vision
https://docs.openvinotoolkit.org/
x86: Optimized ML Libraries
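As an illustration, a minimal OpenVINO CPU-inference sketch using the Inference Engine Python API from the 2020/2021-era releases; model.xml/model.bin are placeholder IR files produced by the Model Optimizer:

```python
# Hedged sketch: OpenVINO Inference Engine Python API (2020/2021-era).
# "model.xml"/"model.bin" are hypothetical IR files from the Model Optimizer.
from openvino.inference_engine import IECore
import numpy as np

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")  # or "MYRIAD", "FPGA"

input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape  # e.g. [1, 3, 224, 224]
result = exec_net.infer({input_name: np.random.rand(*shape).astype(np.float32)})
```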
Some CPU-optimized DL libraries:
● Caffe Con Troll (research project, latest commit in 2016)
https://github.com/HazyResearch/CaffeConTroll
● Intel Caffe (optimized for Xeon):
https://github.com/intel/caffe
● Intel DL Boost can be used in many popular frameworks:
TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe
https://www.intel.ai/increasing-ai-performance-intel-dlboost/
x86: Optimized ML Libraries
● nGraph: open source C++ library, compiler and runtime for Deep Learning.
Frameworks using the nGraph Compiler stack to execute workloads have shown
up to a 45X performance boost compared to native framework
implementations. https://www.ngraph.ai/
Graph compilers
Graph compilers: watch for MLIR!
https://www.tensorflow.org/mlir/overview
● #Cores
● PCIe bandwidth
● PCIe generation (gen3, gen4)
● PCIe lanes (x16, x8, etc) at the processor/chipset side
● Memory type (DDR4, DDR3, etc)
● Memory speed (2133, 2666, 3200, etc)
● Memory channels (1, 2, 4, …)
● Memory size
● Memory speed/bandwidth
● ECC support
● Power usage (Watts)
● Price
● ...
Important dimensions
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
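As a worked example for the memory items above, theoretical peak DRAM bandwidth follows directly from channels × transfer rate × 8 bytes per 64-bit transfer:

```python
# Worked example: theoretical peak DRAM bandwidth from the dimensions above.
# channels x transfer rate (MT/s) x 8 bytes per transfer (64-bit bus).
def peak_bandwidth_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

print(peak_bandwidth_gbs(2, 3200))  # dual-channel DDR4-3200 -> 51.2 GB/s
print(peak_bandwidth_gbs(8, 3200))  # 8-channel server (e.g. EPYC) -> 204.8 GB/s
```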
● Single-board computers: Raspberry Pi, the ARM cores in Jetson
Nano, and Google Coral Dev Board.
● Mobile: Qualcomm, Apple A11, etc
● Server: Marvell ThunderX, Ampere eMAG,
Amazon A1 instance, etc; NVIDIA announced
GPU-accelerated Arm-based servers.
● Laptops: Apple M1, Microsoft Surface Pro X
● ARM also offers the Ethos NPU for ML/AI and the Mali GPU
ARM
● ARM announces Neoverse N1 platform (scales up to 128 cores)
https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html
● Qualcomm manufactured an ARM server processor for cloud applications called Centriq 2400 (48 single-thread cores,
2.2 GHz). The project was stopped.
https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html
● Ampere Altra is the first 80-core ARM-based server processor
https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/
● Ampere announces 128-core Arm server processor
https://www.networkworld.com/article/3564514/ampere-announces-128-core-arm-server-processor.html
● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag
● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz)
https://www.marvell.com/server-processors/thunderx-arm-processors/
● Amazon Graviton ARM processor (16 cores, 2.3GHz)
https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400
https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/
● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
ARM: Servers
NVIDIA to Acquire Arm for $40 Billion
Current architecture is POWER9:
● 12 cores x 8 threads or 24 cores x 4 threads (96 threads).
● PCIe v.4, 48 PCIe lanes
● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection
● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC)
IBM POWER
An open-source hardware instruction set architecture.
Examples:
● SiFive U5, U7 and U8 cores
https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
● Alibaba's RISC-V processor Xuantie 910 with a vector engine for AI
acceleration: 12 nm, 64-bit, 16 cores clocked at up to 2.5 GHz;
the fastest RISC-V processor to date
https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/
● Western Digital SweRV Core designed for embedded devices
supporting data-intensive edge applications.
https://www.westerndigital.com/company/innovations/risc-v
● Manticore: A 4096-core RISC-V Chiplet Architecture for
Ultra-efficient Floating-point Computing https://arxiv.org/abs/2008.06502
● Esperanto Technologies is building an AI chip with 1k+ cores
https://www.esperanto.ai/technology/
RISC-V
GPU
NVIDIA slides: http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
… → Kepler → Maxwell → Pascal → Volta → Turing → Ampere → ...
NVIDIA Architectures
● Peak performance (GFLOPS) at FP32/16/...
● #Cores (+Tensor Cores)
● Memory size
● Memory speed/bandwidth
● Precision support
● Can connect using NVLink?
● Power usage (Watts)
● Price
● GFLOPS/USD
● GFLOPS/Watt
● Form factor (for desktop or server?)
● ECC memory
● Legal restrictions (e.g. GeForce is not licensed for datacenter use)
Important dimensions
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
● FP64 (64-bit float), not used for DL
● FP32 — the most commonly used for training
● FP16 or Mixed precision (FP32+FP16) — becoming the new default
● INT8 — usually for inference
● INT4, INT1 — experimental modes for inference
Precision
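A minimal sketch of FP32+FP16 mixed-precision training with PyTorch's torch.cuda.amp module (available since PyTorch 1.6); the model and data here are toy placeholders:

```python
# Sketch: mixed-precision (FP32+FP16) training with torch.cuda.amp.
import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # loss scaling avoids FP16 gradient underflow

for _ in range(3):  # toy training loop on random data
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # ops run in FP16 where safe, FP32 elsewhere
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```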
bfloat16 is now supported on Ampere GPUs and TPU gen3, and will be
supported on AMD GPUs and Intel CPUs.
Precision: bfloat16
https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
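A quick way to see the tradeoff: bfloat16 keeps FP32's 8-bit exponent (range) but drops mantissa bits (precision), so it survives values that overflow FP16. A minimal PyTorch demo:

```python
import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf: FP16 max is ~65504
print(x.to(torch.bfloat16))  # ~70144: no overflow, but rounded,
                             # since bfloat16 has only ~8 mantissa bits
```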
Precision: many caveats
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
Not only FLOPS: Roofline Performance Model
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
Roofline Performance Model: Example
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
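The model reduces to one formula: attainable FLOP/s = min(peak compute, memory bandwidth × arithmetic intensity). A minimal sketch (the 15 TFLOPS / 900 GB/s accelerator here is hypothetical):

```python
# Roofline model sketch: performance is capped either by peak compute
# or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).
def roofline_gflops(peak_gflops: float, bw_gbs: float, intensity: float) -> float:
    return min(peak_gflops, bw_gbs * intensity)

# Hypothetical accelerator: 15 TFLOPS peak, 900 GB/s memory bandwidth.
for ai in (1, 10, 16.7, 100):  # arithmetic intensity, FLOPs/byte
    print(ai, roofline_gflops(15000, 900, ai))
# Below the "ridge point" (~16.7 FLOPs/byte here) a kernel is memory-bound.
```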
Separate cards can be joined using NVLink; SLI is for graphics and not relevant
for DL.
NVSwitch: The Fully Connected NVLink
NCCL 1: multi-GPU collective communication primitives library
NVIDIA: Single-machine Multi-GPU
Distributed training is now a commodity (but scaling is sublinear).
NCCL 2: multi-node collective communication primitives library
NVIDIA: Distributed Multi-GPU
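A minimal sketch of what this looks like in PyTorch with the NCCL backend; it assumes one process per GPU started by a launcher (e.g. torchrun) that sets LOCAL_RANK and the rendezvous variables:

```python
# Sketch: data-parallel training over NCCL, one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # NCCL for GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher (assumption)
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).cuda(), device_ids=[local_rank])
# Gradients are all-reduced across processes during backward():
loss = model(torch.randn(32, 128, device="cuda")).sum()
loss.backward()
```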
Intel offered the following performance numbers, given as peak GFLOPs of FP32
math using the OpenCL-based CLPeak benchmark.
GPU: Intel Xe
https://www.anandtech.com/show/16018/intel-xe-hp-graphics-early-samples-offer-42-tflops-of-fp32-performance
Peak Performance:
● 46.1 TFLOPS Single Precision Matrix (FP32)
● 23.1 TFLOPS Single Precision (FP32)
● 184.6 TFLOPS Half Precision (FP16)
● 11.5 TFLOPS Double Precision (FP64)
● 92.3 TFLOPS bfloat16
● 184.6 TOPS INT8 and INT4
32 GB HBM2, up to 1228.8 GB/s
300W
Announced support for TensorFlow, PyTorch, etc!
AMD Instinct MI100
https://www.amd.com/en/products/server-accelerators/instinct-mi100
There should be:
● Intel Xe: 42.2 TFLOPS (FP32)
● AMD Instinct MI100: 46.1 TFLOPS (FP32)
Is everything OK?
Problems
Serious problems with the current processors (CPU/GPU) are:
● Energy efficiency:
○ The version of AlphaGo playing against Lee Sedol used 1,920 CPUs and
280 GPUs (https://en.wikipedia.org/wiki/AlphaGo)
○ Estimated power consumption was approximately 1 MW (200 W per
CPU and 200 W per GPU), compared to only 20 watts used by the human
brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/)
● Architecture:
○ good for matrix multiplication (still the essence of DL)
○ but not well suited for brain-like computations
FPGA
FPGA
● FPGA (field-programmable gate array) is an integrated circuit designed to be
configured by a customer or a designer after manufacturing
● Both FPGAs and ASICs (see later) are usually much more energy-efficient than
general-purpose processors (i.e., they deliver more GFLOPS per
Watt). FPGAs are usually used for inference, not training.
● OpenCL can be used as a development language for FPGAs (as can C/C++),
and some ML/DL libraries use OpenCL too (for example, Caffe). So an easy
way to do low-level ML on FPGAs may emerge.
● For high-level ML there are vendor tools and graph compilers (inference only).
● You can use FPGAs in the cloud!
● See also MLIR (mentioned earlier).
● The learning curve for FPGAs is still too steep :(
FPGA in production
There is some interesting movement toward FPGAs:
● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/
● Alibaba has FPGA F3 instances in the cloud https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
● Yandex uses FPGAs for its own DL inference.
● Microsoft ran (in 2015) Project Catapult, which uses clusters of FPGAs
https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/
https://www.microsoft.com/en-us/research/project/project-catapult/
● Microsoft Project Brainwave: AI inference on FPGA
https://www.microsoft.com/en-us/research/project/project-brainwave/
● Microsoft Azure allows deploying pretrained models on FPGA (!).
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service
● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html
● ...
Two main manufacturers: Intel (formerly Altera) and Xilinx.
The ‘world’s largest’ FPGA chips
● Intel Stratix 10 GX 10M
>10.2 million logic cells, 43.3B transistors
https://www.techpowerup.com/260906/intel-unveils-worlds-largest-fpga
● Xilinx Virtex UltraScale+ VU19P
9M system logic cells, 35B transistors
https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html
Intel has a hybrid Xeon+FPGA chip https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/
Intel has FPGA acceleration cards
https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html
FPGA chips
Adaptive compute acceleration platform (ACAP)
Xilinx Versal ACAP, a fully software-programmable,
heterogeneous compute platform that combines Scalar Engines,
Adaptable Engines, and Intelligent Engines.
The Intelligent Engines are an array of VLIW and SIMD
processing engines and memories, all interconnected with 100s
of terabits per second of interconnect and memory bandwidth.
These permit 5X–10X performance improvement for ML and
DSP applications.
https://www.xilinx.com/products/silicon-devices/acap/versal-premium.html
https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html
https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
FPGA: Xilinx Vitis AI
Vitis AI is Xilinx’s development stack
for AI inference on Xilinx hardware
platforms, including both edge devices
and Alveo cards.
It consists of optimized IP, tools,
libraries, models, and example
designs.
Xilinx ML Suite is now deprecated.
https://github.com/Xilinx/Vitis-AI
FPGA: Intel OpenVINO toolkit
The OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit
offers software developers a single toolkit to accelerate their solutions across
multiple hardware platforms including FPGAs
https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html
ASIC
ASIC custom chips
ASIC (application-specific integrated circuit) is an integrated circuit customized for a
particular use, rather than intended for general-purpose use.
There is a lot of movement to ASIC right now:
● Google has Tensor Processing Units (TPU v2/v3) in the cloud, v4 exists too.
● Intel acquired Habana, Mobileye, Movidius, Nervana and has processors for training
and inference.
● Graphcore has its second generation IPU.
● AWS has its own chips for training and inference
● Alibaba Hanguang 800
● Huawei Ascend 310, 910
● Bitmain Sophon, Cerebras, Groq, and many many others …
Many ASICs are built for multi-chip and supercomputer configurations!
https://blog.inten.to/hardware-for-deep-learning-part-4-asic-96a542fe6a81
Case: AlphaGo Zero
https://deepmind.com/blog/alphago-zero-learning-scratch/
ASIC: Google TPU
TPU v2
● 180 TFLOPS (bfloat16)
● 64 GB HBM
● $4.50 / TPU hour
https://cloud.google.com/tpu/
https://cloud.google.com/tpu/docs/tpus
https://cloud.google.com/tpu/docs/system-architecture
TPU v3
● 420 TFLOPS (bfloat16)
● 128 GB HBM
● $8.00 / TPU hour
A “TPU v3 pod”: 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network
Many ASICs are built for multi-chip configurations
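For reference, a minimal sketch of how a Cloud TPU is used from TensorFlow 2.x; the TPU name "my-tpu" is a placeholder for your own Cloud TPU resource:

```python
# Sketch: Keras training on a Cloud TPU with tf.distribute.TPUStrategy.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # placeholder
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():  # variables are created and replicated on the TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) then runs the training steps on the TPU.
```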
ASIC: Intel (Nervana) NNP-T [discontinued]
Processor for training. Can be used to build PODs (e.g., a 10-rack POD with 480 NNP-Ts)
● 24 Tensor Processing Clusters (TPC)
● PCIe Gen 4 x16 accelerator card, 300W
● OCP Accelerator Module, 375W
● 119 TOPS bfloat16
● 32 GB HBM2
https://www.intel.ai/nervana-nnp/nnpt/
https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
ASIC: Intel (Nervana) NNP-I [discontinued]
Processor for inference using mixed precision math, with a special emphasis on low-precision
computations using INT8.
● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI)
● M.2 form factor (1 chip): 12W, up to 50 TOPS.
● PCIe card (2 chips): 75W, up to 170 TOPS.
https://www.intel.ai/nervana-nnp/nnpi
https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
ASIC: Habana
Gaudi: training chip HL-2000. Designed to scale well.
● PCIe 4.0 x16, 32 GB HBM2 at 1 TB/s, ECC, RDMA
● 200-300W
● FP32, BF16, INT/UINT 32, 16, 8
https://habana.ai/training/
Goya: inference chip HL-1000
● PCIe 4.0 x16, 4/8/16 GB DDR4, ECC, 200W
● FP32, INT/UINT 32, 16, 8
https://habana.ai/inference/
ASIC: Graphcore IPU
Graphcore IPU: for both training and inference.
Allows new and emerging machine intelligence
workloads to be realized.
Colossus MK2 GC200 IPU:
● 59.4B transistors, 1472 independent processor cores running 8832 independent
parallel program threads
● 250 TFLOPS mixed precision
● 900MB in-processor mem, 47.5TB/s memory bandwidth
● 8TB/s on-chip exchange between cores, 320GB/s chip-to-chip bandwidth
● IPU-M2000 systems with 4xIPU (1 PFLOPS FP16)
IPU on Azure
https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
● Cerebras Systems Wafer Scale Engine (WSE), an
AI chip that measures 8.46x8.46 inches, making it
almost the size of an iPad.
● The WSE chip has 1.2 trillion transistors. For
comparison, NVIDIA’s A100 GPU contains 54
billion transistors, 22x fewer!
● 400,000 computing cores and 18 gigabytes of
memory with 9 PB/s memory bandwidth.
● Cerebras CS-1 is a system built on WSE.
● WSE 2nd gen is announced: 850,000 AI-optimized
cores, 2.6 trillion transistors
https://cerebras.net/
Cerebras
ASIC: AWS Inferentia Chips
AWS Inferentia chips are designed to accelerate inference.
● 64 TOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data.
● 128 TOPS on 8-bit integer (INT8) data.
● Up to 16 chips in the largest instance (inf1.24xlarge)
https://aws.amazon.com/machine-learning/inferentia/
https://github.com/aws/aws-neuron-sdk
https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/
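A hedged sketch of the typical Inferentia workflow with the Neuron SDK's torch-neuron package: trace (compile) the model ahead of time, then load the saved artifact on an Inf1 instance; exact details may differ across Neuron SDK versions:

```python
# Hedged sketch: compiling a PyTorch model for Inferentia with torch-neuron.
import torch
import torch_neuron  # registers the torch.neuron namespace

model = torch.nn.Sequential(torch.nn.Linear(128, 10)).eval()  # toy model
example = torch.randn(1, 128)

model_neuron = torch.neuron.trace(model, example_inputs=[example])  # compile for NeuronCores
model_neuron.save("model_neuron.pt")  # load on an Inf1 instance via torch.jit.load
```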
ASIC: AWS Trainium
(December 1st, 2020) Amazon announced its AWS Trainium chip; it will be available in 2021
https://aws.amazon.com/machine-learning/trainium/
ASIC: Huawei Ascend 310
● 22 TOPS INT8
● 11 TFLOPS FP16
● 8W of power consumption.
Atlas 300I Inference Card:
● 32 GB LPDDR4X with a bandwidth of 204.8 GB/s
● PCIe x16 Gen3.0 device, max 67 W
● A single card provides up to 88 TOPS INT8
ASIC: Huawei Ascend 910
● 32 built-in Da Vinci AI Cores and 16 TaiShan Cores
● 320 TFLOPS (FP16), 640 TOPS (INT8). It’s pretty
close to NVIDIA’s A100 BF16 peak performance of
312 TFLOPS
Atlas 300T Training Card
● 32 GB HBM or 16GB DDR4 2933
● PCIe x16 Gen4.0
● up to 300W power consumption
ASIC: Bitmain Sophon
Tensor Computing Processors:
● BM1680 (1st gen, 2 TFLOPS FP32, 32MB SRAM, 25W)
● BM1682 (2nd gen, 3 TFLOPS FP32, 16MB SRAM)
● BM1684 (3rd gen, 2.2TFLOPS FP32, 17.6 TOPS INT8, 32 MB SRAM)
● BM1880 (1 TOPS INT8).
There are Deep Learning Acceleration PCIe Cards:
● SC3 with a BM1682 chip (8 GB DDR memory, 65W)
● SC5 and SC5H with a BM1684 chip and 12 GB RAM
(up to 16 GB) with 30W max power consumption
● SC5+ with 3x BM1684 and 36 GB memory (up to 48 GB)
with 75W max power consumption.
https://www.sophon.ai/product/introduce/bm1684.html
ASIC: Alibaba Hanguang 800
AI-Inference Chip
Its performance is independent of the batch size.
ASIC: Baidu Kunlun
● 14nm chip
● 16GB HBM memory with
512 GB/s bandwidth
● up to 260 TOPS INT8
(that’s twice the INT8 performance of NVIDIA TESLA T4)
● 64 TFLOPS INT16/FP16 at 150W.
● This chip looks like an inference chip. The processor can be
accessed via Baidu Cloud.
September 2020, Baidu announced Kunlun 2. The new chip uses 7
nm process technology and its computational capability is over three
times that of the previous generation. The mass production of the
chip is expected to begin in early 2021.
ASIC: Groq TSP
Groq develops its own Tensor Streaming Processor.
Jonathan Ross, Groq’s CEO, previously co-founded
Google’s first TPU project.
● 14nm chip, 26.8B transistors
● 220MB SRAM with 80TB/s on-die memory bandwidth
● no external DRAM on the board
● PCIe Gen4 x16 with 31.5 GB/s in each direction.
● up to 1000 TOPS INT8 and 250 TFLOPS FP16 (with FP32 acc).
For comparison, NVIDIA A100 has 312 TFLOPS on dense FP16
calculations with FP32 acc, and 624 TOPS INT8. It’s even larger
than the 825 TOPS INT8 of Alibaba’s Hanguang 800.
ASIC: Others
● Qualcomm Cloud AI 100 (inference)
https://www.qualcomm.com/products/cloud-artificial-intelligence/cloud-ai-100
● Wave Computing Dataflow Processing Unit (DPU)
https://wavecomp.ai/products/
● ARM ML inference NPU Ethos-N78
https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor
● SambaNova came out of stealth-mode in December 2020 with their
Reconfigurable Dataflow Architecture (RDA) delivering “100s of TFLOPS”.
https://sambanova.ai/
● Mythic focuses on Compute-in-Memory, Dataflow Architecture, and Analog
Computing https://www.mythic-ai.com/technology/
● Intel eASIC: an intermediary technology between FPGAs and standard-cell
ASICs with lower unit-cost and faster time-to-market
https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html
● ...
ASIC: Summary
● Very diverse field!
● Hard to directly compare different solutions based on their characteristics (the
architectures can be too different).
● You can use a common benchmark like https://mlperf.org/
● DL framework support is usually limited; some solutions use their own
frameworks/libraries.
Mobile and Edge
AI at the edge
● NVidia Jetson TK1/TX1/TX2/Xavier/Nano
○ 192/256/256/512/128 CUDA Cores
○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB mem
● Tablets, Smartphones
○ Qualcomm Snapdragon 845/855, Apple A11/12/Bionic, Huawei Kirin 970/980/990 etc
● Raspberry Pi 4 (1.5 GHz 4-core, 4 GB mem)
● Movidius Neural Compute Stick, Stick 2
● Google Edge TPU
(Nov 25, 2020) “Our brand-new 6th gen Qualcomm AI Engine includes the
Qualcomm® Hexagon™ 780 Processor with a fused AI-accelerator architecture,
plus the Tensor Accelerator with 2 times the compute capacity. This Qualcomm AI
Engine astonishes with up to 26 TOPS performance.”
https://www.qualcomm.com/products/snapdragon-888-5g-mobile-platform
Mobile AI: Qualcomm SnapDragon 888
“HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency,
stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low
consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face
recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core.
With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic
of the future.”
https://consumer.huawei.com/en/campaign/kirin-990-series/
“Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8
and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core.
There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power
efficiency above all else, and it can be used for polling or other applications where performance isn’t
particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin
990 (LTE) has one big core and one tiny core.”
https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem
Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
(Sep 15, 2020) Apple unveils A14 Bionic processor with
40% faster CPU and 11.8 billion transistors
“The chip has a 16-core neural engine that can execute 11 trillion
AI operations per second. The neural engine core count is twice
the previous chip, and can perform machine learning computations
10 times faster. The A14 has six CPU cores and four graphics
processing unit (GPU) cores.”
https://venturebeat.com/2020/09/15/apple-unveils-a14-bionic-processor-with-40-faster-cpu-and-11-8-billion-transistors/
Mobile AI: Apple (Neural Engine)
(January 12, 2021) Samsung sets new standard for
flagship mobile processors with Exynos 2100
“AI capabilities will also enjoy a significant boost with the Exynos
2100. The newly-designed tri-core NPU has architectural enhancements such as
minimizing unnecessary operations for high effective utilization and support for
feature-map and weight compression. Exynos 2100 can perform up to 26-trillion-operations-per-second (TOPS) with more than twice the power efficiency than
the previous generation. With on-device AI processing and support for advanced
neural networks, users will be able to enjoy more interactive and smart features as
well as enhanced computer vision performance in applications such as imaging.”
https://www.samsung.com/semiconductor/minisite/exynos/newsroom/pressrelease/samsung-sets-new-standard-for-flagship-mobile-processors-with-exynos-2100/
Mobile AI: Samsung (NPU)
(Aug 7, 2019) MediaTek Announces Dimensity 1000 ARM
Chip With Integrated 5G Modem
“The Dimensity 1000 doesn’t just bring new branding; it’s also
sporting four Cortex A77 CPU cores and four Cortex A55 CPU
cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core
ISP, and a 6-core AI processor.
The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It
houses six AI processors (two big cores, three small cores and a single tiny core).
The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.”
https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem
https://i.mediatek.com/mediatek-5g
Mobile AI: MediaTek (APU)
AI at the Edge: Jetson Nano
Price: $99 ($59 for the 2 GB version)
NVIDIA Jetson Nano Developer Kit is a small, powerful
computer that lets you run multiple neural networks in parallel
for applications like image classification, object detection, segmentation,
and speech processing. All in an easy-to-use platform that runs in as little as 5
watts.
● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS
● 4 GB 64-bit LPDDR4 25.6 GB/s
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
Neural Compute Stick 2 (~$70)
The latest generation of Intel® VPUs includes 16
powerful processing cores (called SHAVE cores) and
a dedicated deep neural network hardware accelerator for high-performance
vision and AI inference applications—all at low power.
● Supports Convolutional Neural Network (CNN)
● Support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and
PaddlePaddle via an ONNX conversion
● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU)
● Connectivity: USB 3.0 Type-A
https://software.intel.com/en-us/neural-compute-stick
AI at the Edge: Movidius
AI at the Edge: Google Edge TPU
The Edge TPU is a small ASIC designed by Google that provides
high performance ML inferencing for low-power devices. For
example, it can execute state-of-the-art mobile vision models such
as MobileNet V2 at 400 FPS, in a power efficient manner.
The on-board Edge TPU coprocessor is capable of performing 4 TOPS
using 0.5 watts for each TOPS (2 TOPS per watt).
TensorFlow Lite models can be compiled to run on the Edge TPU.
USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board
https://cloud.google.com/edge-tpu/
https://coral.ai/products/
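A minimal sketch of Edge TPU inference with the tflite_runtime package and the Edge TPU delegate; the model path is a placeholder for a model compiled with the Edge TPU compiler:

```python
# Sketch: running an Edge TPU-compiled TensorFlow Lite model via the Coral delegate.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",  # placeholder compiled model
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]).shape)
```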
● Sophon Neural Network Stick (NNS)
https://www.sophon.ai/product/introduce/nns.html
● Xilinx Edge AI (FPGA!)
https://www.xilinx.com/applications/industrial/analytics-machine-learning.html
● The Hailo-8 M.2 Module
https://hailo.ai/product-hailo/hailo-8-m2-module/
● More:
https://github.com/crespum/edge-ai
AI at the Edge: Others
Now OK?
Problems
Even with FPGA/ASIC and edge devices:
● Energy efficiency:
○ Better than CPU/GPU, but still far from 20 watts used by the human brain
● Architecture:
○ Even more specialized for ML/DL computations, but...
○ Still far from brain-like computations
Neuromorphic Chips
Neuromorphic chips
● Neuromorphic computing - brain-inspired computing - has emerged as a new
technology to enable information processing at very low energy cost using
electronic devices that emulate the electrical behaviour of (biological) neural
networks.
● Neuromorphic chips attempt to model in silicon the massively parallel way the
brain processes information as billions of neurons and trillions of synapses
respond to sensory inputs such as visual and auditory stimuli.
● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic
Scalable Electronics)
● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain
Project SpiNNaker and HICANN.
https://www.technologyreview.com/s/526506/neuromorphic-chips/
Neuromorphic chips: IBM TrueNorth
● 1M neurons, 256M synapses, 4096 neurosynaptic
cores on a chip, est. 46B synaptic ops per sec per W
● Uses 70mW; power density is 20 milliwatts per
cm^2, almost 1/10,000th that of most modern
microprocessors
● “Our sights are now set high on the ambitious goal of
integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while
consuming ~4kW of power”.
● Currently IBM is making plans to commercialize it.
● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips
(16M neurons, 4B synapses; for context, the human brain has 86B neurons).
When running flat out, the entire cluster consumes a grand total of 2.5 watts.
http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
Neuromorphic chips: IBM TrueNorth
● (03.2016) IBM Research demonstrated convolutional neural nets with close to
state-of-the-art performance:
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
Neuromorphic chips: Intel Loihi
● Fully asynchronous neuromorphic many core mesh that
supports a wide range of sparse, hierarchical and recurrent
neural network topologies
● Each neuromorphic core includes a learning engine that can
be programmed to adapt network parameters during
operation, supporting supervised, unsupervised,
reinforcement and other learning paradigms.
● Fabricated on Intel’s 14 nm process technology.
● A total of 130,000 neurons and 130 million synapses.
● Development and testing of several algorithms with high
algorithmic efficiency for problems including path planning,
constraint satisfaction, sparse coding, dictionary learning,
and dynamic pattern learning and adaptation.
https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/
https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/
https://ieeexplore.ieee.org/document/8259423
https://en.wikichip.org/wiki/intel/loihi
Neuromorphic chips: Intel Pohoiki Beach
(Jul 15, 2019) “Intel announced that an 8 million-neuron
neuromorphic system comprising 64 Loihi research chips
— codenamed Pohoiki Beach — is now available to the broader
research community. With Pohoiki Beach, researchers can
experiment with Intel’s brain-inspired research chip, Loihi, which
applies the principles found in biological brains to computer
architectures. ”
https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
Neuromorphic chips: Intel Pohoiki Springs
https://www.nextplatform.com/2020/03/19/intel-smells-neuromorphic-opportunity/
Neuromorphic chips: Intel Loihi
“Using Intel's Loihi neuromorphic research chip and
ABR's Nengo Deep Learning toolkit, we analyze the
inference speed, dynamic power consumption, and
energy cost per inference of a two-layer neural
network keyword spotter trained to recognize a single
phrase. We perform comparative analyses of this
keyword spotter running on more conventional
hardware devices including a CPU, a GPU, Nvidia's
Jetson TX1, and the Movidius Neural Compute
Stick.”
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware
https://arxiv.org/abs/1812.01739
Neuromorphic chips: Intel Loihi
Intel Benchmarks for Loihi Neuromorphic Computing Chip
https://www.eetasia.com/intel-benchmarks-for-loihi-neuromorphic-computing-chip/
Neuromorphic chips: Intel Loihi
https://newsroom.intel.com/wp-content/uploads/sites/11/2020/12/Neuromorphic-Computing-slides-B.pdf
NxTF: a Keras-like API for SNNs on Loihi
“NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi”
https://arxiv.org/abs/2101.04261
https://github.com/intel-nrc-ecosystem/models/tree/master/nxsdk_modules_ncl/dnn
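Below is a heavily hedged sketch of what an NxTF model definition looks like, based on the Keras-like API described in the paper above; the class names follow the paper and the linked repository, but the exact signatures (and the NxFlatten layer) are assumptions:

```python
# Heavily hedged sketch of the Keras-like NxTF API from the paper above.
# Class names follow the paper; signatures and NxFlatten are assumptions.
from nxsdk_modules_ncl.dnn.src.dnn_layers import (
    NxModel, NxInputLayer, NxConv2D, NxDense, NxFlatten)

inp = NxInputLayer((28, 28, 1))                       # spiking input, MNIST-sized
x = NxConv2D(16, (3, 3), activation="relu")(inp.input)
x = NxFlatten()(x)                                    # assumed flatten layer
out = NxDense(10, activation="softmax")(x)

model = NxModel(inp.input, out)  # maps/partitions the SNN onto Loihi cores
```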
Neuromorphic chips: Tianjic
Tianjic’s unified function core (FCore) combines essential
building blocks for both artificial neural networks and biologically
inspired networks: axon, synapse, dendrite and soma blocks. The 28-nm
chip consists of 156 FCores, containing approximately 40,000
neurons and 10 million synapses in an area of 3.8×3.8 mm^2.
Tianjic delivers an internal memory bandwidth of more than 610 GB
per second, and a peak performance of 1.28 TOPS per watt for
running artificial neural networks. In the biologically-inspired spiking
neural network mode, Tianjic achieves a peak performance of about
650 giga synaptic operations per second (GSOPS) per watt.
https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113
https://www.nature.com/articles/s41586-019-1424-8
Neuromorphic chips: Others
● SpiNNaker (1,036,800 ARM9 cores)
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/
● SpiNNaker-2
https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf
https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain
Simulation and Machine Learning”
● BrainScaleS, HICANN: 20x 8-inch silicon wafers, each incorporating 50×10^6
plastic synapses and 200,000 biologically realistic neurons.
https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/
● Akida NSoC: 1.2 million neurons and 10 billion synapses
https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip
https://www.nextplatform.com/2020/01/30/neuromorphic-chip-maker-takes-aim-at-the-edge/
https://en.wikichip.org/wiki/brainchip/akida
● Neurogrid: can model a slab of cortex with up to 16×256×256
neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
From: https://d1io3yog0oux5.cloudfront.net/_51d5497ffa729abd180ed52c4234217f/brainchipinc/db/217/1582/pdf/Akida+Launch+Presentation.pdf
Anything else?
Other approaches
● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html
● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/
● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/
● Unconventional computing: cellular automata, reservoir computing, using
biological cells/neurons, chemical computation, membrane computing, slime
mold computing and much more https://www.springer.com/gp/book/9781493968824
● ...
References:
Hardware for Deep Learning series of posts:
https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc
● Part 1: Introduction and Executive summary
● Part 2: CPU
● Part 3: GPU
● Part 4: ASIC
● Part 5: FPGA
● Part 6: Mobile AI
● Part 7: Neuromorphic computing
● Part 8: Quantum computing
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT   TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
 
On-device ML with TFLite
On-device ML with TFLiteOn-device ML with TFLite
On-device ML with TFLite
 
NVIDIA @ AI FEST
NVIDIA @ AI FESTNVIDIA @ AI FEST
NVIDIA @ AI FEST
 
ONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep LearningONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep Learning
 
GTC 2022 Keynote
GTC 2022 KeynoteGTC 2022 Keynote
GTC 2022 Keynote
 
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super AffordableSupermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable
 
TFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU DelegatesTFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU Delegates
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Embedded Hypervisor for ARM
Embedded Hypervisor for ARMEmbedded Hypervisor for ARM
Embedded Hypervisor for ARM
 
𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬
𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬
𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Intro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionIntro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer Vision
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPU
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
 
Onnx and onnx runtime
Onnx and onnx runtimeOnnx and onnx runtime
Onnx and onnx runtime
 

Ähnlich wie AI Hardware Landscape 2021

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 

Ähnlich wie AI Hardware Landscape 2021 (20)

FPGAs for Supercomputing: The Why and How
FPGAs for Supercomputing: The Why and HowFPGAs for Supercomputing: The Why and How
FPGAs for Supercomputing: The Why and How
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
08 Supercomputer Fugaku
08 Supercomputer Fugaku08 Supercomputer Fugaku
08 Supercomputer Fugaku
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
 
Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace Architectures
 
Developping drivers on small machines
Developping drivers on small machinesDevelopping drivers on small machines
Developping drivers on small machines
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
High performance computing with accelarators
High performance computing with accelaratorsHigh performance computing with accelarators
High performance computing with accelarators
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Getting started with Intel IoT Developer Kit
Getting started with Intel IoT Developer KitGetting started with Intel IoT Developer Kit
Getting started with Intel IoT Developer Kit
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 

Mehr von Grigory Sapunov

Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​
Grigory Sapunov
 

Mehr von Grigory Sapunov (20)

Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
NLP in 2020
NLP in 2020NLP in 2020
NLP in 2020
 
What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)
 
Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]
 
Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)
 
Transformer Zoo
Transformer ZooTransformer Zoo
Transformer Zoo
 
BERTology meets Biology
BERTology meets BiologyBERTology meets Biology
BERTology meets Biology
 
Modern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 versionModern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 version
 
AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)
 
Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018
 
Sequence learning and modern RNNs
Sequence learning and modern RNNsSequence learning and modern RNNs
Sequence learning and modern RNNs
 
Введение в Deep Learning
Введение в Deep LearningВведение в Deep Learning
Введение в Deep Learning
 
Введение в машинное обучение
Введение в машинное обучениеВведение в машинное обучение
Введение в машинное обучение
 
Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016
 
Artificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and FutureArtificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and Future
 
Multidimensional RNN
Multidimensional RNNMultidimensional RNN
Multidimensional RNN
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 
Deep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image ProcessingDeep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image Processing
 
Computer Vision and Deep Learning
Computer Vision and Deep LearningComputer Vision and Deep Learning
Computer Vision and Deep Learning
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

AI Hardware Landscape 2021

  • 2. Executive Summary :) Most hardware focused on DL, which requires a lot of computations: ● There’s much more diversity in CPUs now, not only x86. ● GPUs (mostly NVIDIA) are the most popular choice. Intel and AMD can propose interesting alternatives this year. ● There are some available ASIC alternatives: Google TPU (in cloud only), Graphcore, Huawei Ascend. ● More ASICs are coming into this field: Cerebras, Habana, etc. ● Some companies try to use FPGAs and allow to use them in the cloud (Microsoft, AWS). ● Edge AI is everywhere already! More to come! ● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic, memristors, etc) ● Quantum computing can potentially benefit machine learning as well (but probably it won’t be a desktop or in-house server solutions)
  • 3. CPU
  • 4. The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design https://arxiv.org/abs/1911.05289
  • 5. Typically multi-core even on the desktop market: ● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs ● up to 18 cores/36 threads in high-end Intel CPUs (i9– 7980XE/9980XE/10980XE) [https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors] ● up to 64 cores/128 threads in AMD Ryzen Threadripper (Ryzen Threadripper 3990X, Ryzen Threadripper Pro 3995WX) x86: Desktops
  • 6. On the server market: ● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 Processor) ● AMD EPYC: up to 64 cores/128 threads (EPYC 7702/7742) ● usually having more cores than desktop processors and some other useful capabilities (supporting more RAM, multi-processor configurations, ECC, etc) x86: Servers
  • 7. AVX-512: Fused Multiply Add (FMA) core instructions for enabling lower-precision operations. List of CPUs with AVX-512 support: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 VNNI (Vector Neural Network Instructions): Multiply and add for integers, etc. designed to accelerate convolutional neural network-based algorithms. https://en.wikichip.org/wiki/x86/avx512vnni DL Boost: AVX512-VNNI + Brain floating-point format (bfloat16) designed for inference acceleration. https://en.wikichip.org/wiki/brain_floating-point_format x86: ML instructions (SIMD) https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
  • 8. ● BigDL: distributed deep learning library for Apache Spark https://github.com/intel-analytics/BigDL ● Deep Neural Network Library (DNNL): an open-source performance library for deep learning applications. Layer primitives, etc. https://intel.github.io/mkl-dnn/ ● PlaidML: advanced and portable tensor compiler for enabling deep learning on laptops, embedded devices, or other devices. Supports Keras, ONNX, and nGraph. https://github.com/plaidml/plaidml ● OpenVINO Toolkit: for computer vision https://docs.openvinotoolkit.org/ x86: Optimized ML Libraries
  • 9. Some CPU-optimized DL libraries: ● Caffe Con Troll (research project, latest commit in 2016) https://github.com/HazyResearch/CaffeConTroll ● Intel Caffe (optimized for Xeon): https://github.com/intel/caffe ● Intel DL Boost can be used in many popular frameworks: TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe https://www.intel.ai/increasing-ai-performance-intel-dlboost/ x86: Optimized ML Libraries
  • 10. ● nGraph: open source C++ library, compiler and runtime for Deep Learning. Frameworks using nGraph Compiler stack to execute workloads have shown up to 45X performance boost when compared to native framework implementations. https://www.ngraph.ai/ Graph compilers
  • 11. Graph compilers: watch for MLIR! https://www.tensorflow.org/mlir/overview
  • 12. ● #Cores ● PCIe bandwidth ● PCIe generation (gen3, gen4) ● PCIe lanes (x16, x8, etc) at the processor/chipset side ● Memory type (DDR4, DDR3, etc) ● Memory speed (2133, 2666, 3200, etc) ● Memory channels (1, 2, 4, …) ● Memory size ● Memory speed/bandwidth ● ECC support ● Power usage (Watts) ● Price ● ... Important dimensions https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
  • 13. ● Single-board computers: Raspberry Pi, part of Jetson Nano, and Google Coral Dev Board. ● Mobile: Qualcomm, Apple A11, etc ● Server: Marvell ThunderX, Ampere eMAG, Amazon A1 instance, etc; NVIDIA announced GPU-accelerated Arm-based servers. ● Laptops: Apple M1, Microsoft Surface Pro X ● ARM also has ML/AI Ethos NPU and Mali GPU ARM
  • 14. ● ARM announces Neoverse N1 platform (scales up to 128 cores) https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html ● Qualcomm manufactured ARM server processor for cloud applications called Centriq 2400 (48 single-thread cores, 2.2GHz). Project stopped. https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html ● Ampere Altra is the first 80-core ARM-based server processor https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/ ● Ampere announces 128-core Arm server processor https://www.networkworld.com/article/3564514/ampere-announces-128-core-arm-server-processor.html ● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz) https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag ● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz) https://www.marvell.com/server-processors/thunderx-arm-processors/ ● Amazon Graviton ARM processor (16 cores, 2.3GHz) https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400 https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/ ● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz) https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu ARM: Servers
  • 15. NVIDIA to Acquire Arm for $40 Billion
  • 16. Current architecture is POWER9: ● 12 cores x 8 threads or 24 cores x 4 threads (96 threads). ● PCIe v.4, 48 PCIe lanes ● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection ● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC) IBM POWER
  • 17.
  • 18. An open-source hardware instruction set architecture. Examples: ● SiFive U5, U7 and U8 cores https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip ● Alibaba's RISC-V processor Xuantie 910 with Vector Engine for AI Acceleration 12nm 64-bit 16 cores clocked at up to 2.5GHz, the fastest RISC-V processor to date https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/ ● Western Digital SweRV Core designed for embedded devices supporting data-intensive edge applications. https://www.westerndigital.com/company/innovations/risc-v ● Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing https://arxiv.org/abs/2008.06502 ● Esperanto Technologies is building AI chip with 1k+ cores https://www.esperanto.ai/technology/ RISC-V
  • 19. GPU
  • 21. … → Kepler → Maxwell → Pascal → Volta → Turing → Ampere → ... NVIDIA Architectures
  • 22. ● Peak performance (GFLOPS) at FP32/16/... ● #Cores (+Tensor Cores) ● Memory size ● Memory speed/bandwidth ● Precision support ● Can connect using NVLink? ● Power usage (Watts) ● Price ● GFLOPS/USD ● GFLOPS/Watt ● Form factor (for desktop or server?) ● ECC memory ● Legal restrictions (e.g. GeForce is not allowed to use in datacenters) Important dimensions https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
  • 24. ● FP64 (64-bit float), not used for DL ● FP32 — the most commonly used for training ● FP16 or Mixed precision (FP32+FP16) — becoming the new default ● INT8 — usually for inference ● INT4, INT1 — experimental modes for inference Precision
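A minimal sketch of what mixed precision looks like in practice, using PyTorch's torch.cuda.amp (PyTorch 1.6+; a CUDA GPU is assumed, and the model and data are toy placeholders):

```python
# Mixed-precision (FP32+FP16) training step with automatic mixed precision.
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling against FP16 underflow

for _ in range(10):
    x = torch.randn(64, 512, device='cuda')
    y = torch.randint(0, 10, (64,), device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # matmuls run in FP16, reductions stay FP32
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```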
  • 25. bfloat16 is now supported on Ampere GPUs and on TPU gen3, and will be supported on AMD GPUs and Intel CPUs. Precision: bfloat16 https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
  • 27. Not only FLOPS: Roofline Performance Model https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
  • 28. Roofline Performance Model: Example https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
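The roofline model reduces to one formula: attainable FLOP/s = min(peak compute, memory bandwidth × arithmetic intensity). A toy calculation with illustrative, roughly V100-class numbers:

```python
# Roofline: performance is capped by compute or by memory traffic,
# whichever bound is hit first at a given arithmetic intensity.
def roofline(peak_flops: float, peak_bw: float, intensity: float) -> float:
    """intensity = FLOPs performed per byte moved to/from memory."""
    return min(peak_flops, peak_bw * intensity)

PEAK = 15.7e12  # FP32 FLOP/s (illustrative, ~V100)
BW = 900e9      # bytes/s (illustrative, HBM2)
for ai in [1, 4, 17.4, 100]:  # 17.4 FLOP/byte is the ridge point: PEAK / BW
    print(f"{ai:6.1f} FLOP/byte -> {roofline(PEAK, BW, ai) / 1e12:5.1f} TFLOP/s")
```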
  • 29. Separate cards can be joined using NVLink; SLI is not relevant for DL, it's for graphics. NVSwitch: the fully connected NVLink. NCCL 1: multi-GPU collective communication primitives library NVIDIA: Single-machine Multi-GPU
  • 30. Distributed training is now a commodity (but scaling is sublinear). NCCL 2: multi-node collective communication primitives library NVIDIA: Distributed Multi-GPU
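A minimal sketch of NCCL-backed data-parallel training with PyTorch DistributedDataParallel, assuming a launch such as `python -m torch.distributed.launch --use_env --nproc_per_node=4 train.py` (the launcher sets the environment variables read below):

```python
# train.py: each process drives one GPU; NCCL handles the gradient all-reduce.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')   # rank/world size come from the launcher
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(512, 10).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 512, device='cuda')
loss = model(x).sum()
loss.backward()                           # gradients are all-reduced over NCCL here
opt.step()
```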
  • 31. Intel offered the following performance numbers, given as peak GFLOPs of FP32 math using the OpenCL-based CLPeak benchmark. GPU: Intel Xe https://www.anandtech.com/show/16018/intel-xe-hp-graphics-early-samples-offer-42-tflops-of-fp32-performance
  • 32. Peak Performance: ● 46.1 TFLOPs Single Precision Matrix (FP32) ● 23.1 TFLOPs Single Precision (FP32) ● 184.6 TFLOPs Half Precision (FP16) ● 11.5 TFLOPs Double Precision (FP64) ● 92.3 TFLOPs bfloat16 ● 184.6 TOPs INT8 and INT4 32 GB HBM2, Up to 1228.8 GB/s 300W Announced support for TensorFlow, PyTorch, etc! AMD Instinct MI100 https://www.amd.com/en/products/server-accelerators/instinct-mi100
  • 33. Expected contenders (peak FP32): Intel Xe 42.2 TFLOPS; AMD Instinct MI100 46.1 TFLOPS (FP32 matrix).
  • 35. Problems Serious problems with the current processors (CPU/GPU) are: ● Energy efficiency: ○ The version of AlphaGo playing against Lee Sedol used 1,920 CPUs and 280 GPUs (https://en.wikipedia.org/wiki/AlphaGo) ○ Estimated power consumption was approximately 1 MW (200 W per CPU and 200 W per GPU), compared to only 20 watts used by the human brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/) ● Architecture: ○ good for matrix multiplication (still the essence of DL) ○ but not well suited to brain-like computations
  • 36. FPGA
  • 37. FPGA ● FPGA (field-programmable gate array) is an integrated circuit designed to be configured by a customer or a designer after manufacturing ● Both FPGAs and ASICs (see later) are usually much more energy-efficient than general-purpose processors (more productive in terms of GFLOPS per watt). FPGAs are usually used for inference, not training. ● OpenCL can be used as a development language for FPGAs (as can C/C++), and some ML/DL libraries support OpenCL too (for example, Caffe), so an easy path to low-level ML on FPGAs may emerge; see the minimal host-side sketch below. ● For high-level ML there are vendor tools and graph compilers (inference only). ● You can use FPGAs in the cloud! ● See also MLIR (mentioned earlier). ● The learning curve for FPGAs is still steep :(
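A minimal host-side OpenCL sketch via PyOpenCL, shown for a generic OpenCL device; on FPGAs the host code looks much the same, but the kernel is normally compiled offline with the vendor toolchain (Intel/Xilinx) rather than built at runtime as done here:

```python
# OpenCL vector add: host code in Python, kernel in OpenCL C.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()     # picks any available OpenCL device
queue = cl.CommandQueue(ctx)

a = np.random.rand(1 << 20).astype(np.float32)
b = np.random.rand(1 << 20).astype(np.float32)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a, __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()
prg.vadd(queue, a.shape, None, a_g, b_g, out_g)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_g)
assert np.allclose(out, a + b)
```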
  • 38. FPGA in production There is some interesting movement to FPGA: ● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/ ● Alibaba has FPGA F3 instances in the cloud https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057 ● Yandex uses FPGAs for its own DL inference. ● Microsoft ran Project Catapult (2015), which uses clusters of FPGAs https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/ https://www.microsoft.com/en-us/research/project/project-catapult/ ● Microsoft Project Brainwave: AI inference on FPGA https://www.microsoft.com/en-us/research/project/project-brainwave/ ● Microsoft Azure allows deploying pretrained models on FPGA (!). https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service ● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html ● ...
  • 39. Two main manufacturers: Intel (formerly Altera) and Xilinx. The ‘world’s largest’ FPGA chips: ● Intel Stratix 10 GX 10M: >10.2 million logic cells, 43.3B transistors https://www.techpowerup.com/260906/intel-unveils-worlds-largest-fpga ● Xilinx Virtex UltraScale+ VU19P: 9M system logic cells, 35B transistors https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html Intel has a hybrid Xeon+FPGA chip https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/ Intel has FPGA acceleration cards https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html FPGA chips
  • 40. Adaptive compute acceleration platform (ACAP) Xilinx Versal ACAP, a fully software-programmable, heterogeneous compute platform that combines Scalar Engines, Adaptable Engines, and Intelligent Engines. The Intelligent Engines are an array of VLIW and SIMD processing engines and memories, all interconnected with 100s of terabits per second of interconnect and memory bandwidth. These permit 5X–10X performance improvement for ML and DSP applications. https://www.xilinx.com/products/silicon-devices/acap/versal-premium.html https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
  • 41. FPGA: Xilinx Vitis AI Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards. It consists of optimized IP, tools, libraries, models, and example designs. Xilinx ML Suite is now deprecated. https://github.com/Xilinx/Vitis-AI
  • 42. FPGA: Intel OpenVINO toolkit The OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit offers software developers a single toolkit to accelerate their solutions across multiple hardware platforms including FPGAs https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html
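A hedged sketch of OpenVINO inference with the 2020/2021-era Inference Engine Python API (IECore); 'model.xml'/'model.bin' are a hypothetical IR produced by the Model Optimizer, and the device string could be "CPU", "MYRIAD", or a HETERO/FPGA target depending on the installation:

```python
# Load an OpenVINO IR model and run one inference.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model='model.xml', weights='model.bin')
exec_net = ie.load_network(network=net, device_name='CPU')  # or a HETERO/FPGA target

input_name = next(iter(net.input_info))
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input
result = exec_net.infer({input_name: x})
```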
  • 43. ASIC
  • 44. ASIC custom chips ASIC (application-specific integrated circuit) is an integrated circuit customized for a particular use, rather than intended for general-purpose use. There is a lot of movement to ASIC right now: ● Google has Tensor Processing Units (TPU v2/v3) in the cloud; v4 exists too. ● Intel acquired Habana, Mobileye, Movidius, Nervana and has processors for training and inference. ● Graphcore has its second-generation IPU. ● AWS has its own chips for training and inference ● Alibaba Hanguang 800 ● Huawei Ascend 310, 910 ● Bitmain Sophon, Cerebras, Groq, and many, many others… Many ASICs are built for multi-chip and supercomputer configurations! https://blog.inten.to/hardware-for-deep-learning-part-4-asic-96a542fe6a81
  • 46. ASIC: Google TPU TPU v2 ● 180 TFLOPS (bfloat16) ● 64 GB HBM ● $4.50 / TPU hour https://cloud.google.com/tpu/ https://cloud.google.com/tpu/docs/tpus https://cloud.google.com/tpu/docs/system-architecture TPU v3 ● 420 TFLOPS (bfloat16) ● 128 GB HBM ● $8.00 / TPU hour
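A hedged sketch of targeting a Cloud TPU from TensorFlow 2.3+ with TPUStrategy; the TPU name 'my-tpu' is a placeholder from your own Cloud TPU setup:

```python
# Connect to a Cloud TPU and place a Keras model under TPUStrategy.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():  # variables and compute are placed on the TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer='adam', loss='mse')
```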
  • 47. A “TPU v3 pod” 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network Many ASICs are built for multi-chip configurations
  • 48. ASIC: Intel (Nervana) NNP-T [discontinued] Processor for training. PODs can be built (e.g., a 10-rack POD with 480 NNP-Ts) ● 24 Tensor Processing Clusters (TPC) ● PCIe Gen 4 x16 accelerator card, 300W ● OCP Accelerator Module, 375W ● 119 TOPS bfloat16 ● 32 GB HBM2 https://www.intel.ai/nervana-nnp/nnpt/ https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
  • 49. ASIC: Intel (Nervana) NNP-I [discontinued] Processor for inference using mixed precision math, with a special emphasis on low-precision computations using INT8. ● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI) ● M.2 form factor (1 chip): 12W, up to 50 TOPS. ● PCIe card (2 chips): 75W, up to 170 TOPS. https://www.intel.ai/nervana-nnp/nnpi https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
  • 50. ASIC: Habana Gaudi: training chip HL-2000. Designed to scale well. ● PCIe 4.0 x16, 32 GB HBM2 at 1 TB/s, ECC, RDMA ● 200-300W ● FP32, BF16, INT/UINT 32, 16, 8 https://habana.ai/training/ Goya: inference chip HL-1000 ● PCIe 4.0 x16, 4/8/16 GB DDR4, ECC, 200W ● FP32, INT/UINT 32, 16, 8 https://habana.ai/inference/
  • 51. ASIC: Graphcore IPU Graphcore IPU: for both training and inference. Allows new and emerging machine intelligence workloads to be realized. Colossus MK2 GC200 IPU: ● 59.4B transistors, 1472 independent processor cores running 8832 independent parallel program threads ● 250 TFLOPS mixed precision ● 900MB in-processor mem, 47.5TB/s memory bandwidth ● 8TB/s on-chip exchange between cores, 320GB/s chip-to-chip bandwidth ● IPU-M2000 systems with 4xIPU (1 PFLOPS FP16) IPU on Azure https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
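A hedged sketch of IPU inference with Graphcore's PopTorch (part of the Poplar SDK); the API names follow the PopTorch documentation and may differ across SDK versions:

```python
# Compile a PyTorch model for the IPU and run inference on it.
import torch
import poptorch

model = torch.nn.Sequential(torch.nn.Linear(512, 10), torch.nn.Softmax(dim=1))
ipu_model = poptorch.inferenceModel(model)  # compiles the graph for the IPU

x = torch.randn(16, 512)
y = ipu_model(x)                            # executes on IPU hardware
```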
  • 52. ● Cerebras Systems Wafer Scale Engine (WSE), an AI chip that measures 8.46x8.46 inches, making it almost the size of an iPad. ● The WSE has 1.2 trillion transistors. For comparison, NVIDIA's A100 GPU contains 54 billion transistors, 22× fewer! ● 400,000 compute cores and 18 gigabytes of on-chip memory with 9 PB/s memory bandwidth. ● Cerebras CS-1 is a system built on the WSE. ● The 2nd-gen WSE is announced: 850,000 AI-optimized cores, 2.6 trillion transistors https://cerebras.net/ Cerebras
  • 53. ASIC: AWS Inferentia Chips AWS Inferentia chips are designed to accelerate the inference. ● 64 TOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data. ● 128 TOPS on 8-bit integer (INT8) data. ● Up to 16 chips in the largest instance (inf1.24xlarge) https://aws.amazon.com/machine-learning/inferentia/ https://github.com/aws/aws-neuron-sdk https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/
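A hedged sketch of compiling a PyTorch model for Inferentia with the AWS Neuron SDK (the torch-neuron package); the trace-and-save pattern follows the aws-neuron-sdk docs of the time and may have changed since:

```python
# Compile a model for Inferentia, then load the saved artifact on an inf1 instance.
import torch
import torch_neuron  # registers the torch.neuron namespace (torch-neuron package)

model = torch.nn.Sequential(torch.nn.Linear(512, 10)).eval()
example = torch.randn(1, 512)
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save('model_neuron.pt')  # hypothetical output path for serving
```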
  • 54. ASIC: AWS Trainium (December 1st, 2020) Amazon announced its AWS Trainium training chip; it will be available in 2021. https://aws.amazon.com/machine-learning/trainium/
  • 55. ASIC: Huawei Ascend 310 ● 22 TOPS INT8 ● 11 TFLOPS FP16 ● 8W of power consumption. Atlas 300I Inference Card: ● 32 GB LPDDR4X with a bandwidth of 204.8 GB/s ● PCIe x16 Gen3.0 device, max 67 W ● A single card provides up to 88 TOPS INT8
  • 56. ASIC: Huawei Ascend 910 ● 32 built-in Da Vinci AI Cores and 16 TaiShan Cores ● 320 TFLOPS (FP16), 640 TOPS (INT8). It’s pretty close to NVIDIA’s A100 BF16 peak performance of 312 TFLOPS Atlas 300T Training Card ● 32 GB HBM or 16GB DDR4 2933 ● PCIe x16 Gen4.0 ● up to 300W power consumption
  • 57. ASIC: Bitmain Sophon Tensor Computing Processors: ● BM1680 (1st gen, 2 TFLOPS FP32, 32MB SRAM, 25W) ● BM1682 (2nd gen, 3 TFLOPS FP32, 16MB SRAM) ● BM1684 (3rd gen, 2.2TFLOPS FP32, 17.6 TOPS INT8, 32 MB SRAM) ● BM1880 (1 TOPS INT8). There are Deep Learning Acceleration PCIe Cards: ● SC3 with a BM1682 chip (8 GB DDR memory, 65W) ● SC5 and SC5H with a BM1684 chip and 12 GB RAM (up to 16 GB) with 30W max power consumption ● SC5+ with 3x BM1684 and 36 GB memory (up to 48 GB) with 75W max power consumption. https://www.sophon.ai/product/introduce/bm1684.html
  • 58. ASIC: Alibaba Hanguang 800 AI-Inference Chip Its performance is independent of the batch size.
  • 59. ASIC: Baidu Kunlun ● 14nm chip ● 16GB HBM memory with 512 GB/s bandwidth ● up to 260 TOPS INT8 (twice the INT8 performance of NVIDIA Tesla T4) ● 64 TFLOPS INT16/FP16 at 150W ● This looks like an inference chip. The processor can be accessed via Baidu Cloud. In September 2020, Baidu announced Kunlun 2. The new chip uses 7 nm process technology, and its computational capability is over three times that of the previous generation. Mass production is expected to begin in early 2021.
  • 60. ASIC: Groq TSP Groq develops its own Tensor Streaming Processor. Jonathan Ross, Groq's CEO, previously co-founded Google's first TPU project. ● 14nm chip, 26.8B transistors ● 220MB SRAM with 80TB/s on-die memory bandwidth ● no on-board DRAM ● PCIe Gen4 x16 with 31.5 GB/s in each direction ● up to 1000 TOPS INT8 and 250 TFLOPS FP16 (with FP32 accumulation). For comparison, NVIDIA A100 delivers 312 TFLOPS on dense FP16 with FP32 accumulation, and 624 TOPS INT8. It even exceeds the 825 TOPS INT8 of Alibaba's Hanguang 800.
  • 61. ASIC: Others ● Qualcomm Cloud AI 100 (inference) https://www.qualcomm.com/products/cloud-artificial-intelligence/cloud-ai-100 ● Wave Computing Dataflow Processing Unit (DPU) https://wavecomp.ai/products/ ● ARM ML inference NPU Ethos-N78 https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor ● SambaNova came out of stealth-mode in December 2020 with their Reconfigurable Dataflow Architecture (RDA) delivering “100s of TFLOPS”. https://sambanova.ai/ ● Mythic focuses on Compute-in-Memory, Dataflow Architecture, and Analog Computing https://www.mythic-ai.com/technology/ ● Intel eASIC: an intermediary technology between FPGAs and standard-cell ASICs with lower unit-cost and faster time-to-market https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html ● ...
  • 62. ASIC: Summary ● Very diverse field! ● Hard to directly compare different solutions based on their characteristics (can be too different architectures). ● You can use a common benchmark like https://mlperf.org/ ● DL framework support is usually limited, some solutions use their own frameworks/libraries.
  • 64. AI at the edge ● NVIDIA Jetson TK1/TX1/TX2/Xavier/Nano ○ 192/256/256/512/128 CUDA cores ○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB mem ● Tablets, smartphones ○ Qualcomm Snapdragon 845/855, Apple A11/A12 Bionic, Huawei Kirin 970/980/990, etc. ● Raspberry Pi 4 (1.5 GHz 4-core, 4GB mem) ● Movidius Neural Compute Stick, Stick 2 ● Google Edge TPU
  • 65. (Nov 25, 2020) “Our brand-new 6th gen Qualcomm AI Engine includes the Qualcomm® Hexagon™ 780 Processor with a fused AI-accelerator architecture, plus the Tensor Accelerator with 2 times the compute capacity. This Qualcomm AI Engine astonishes with up to 26 TOPS performance.” https://www.qualcomm.com/products/snapdragon-888-5g-mobile-platform Mobile AI: Qualcomm Snapdragon 888
  • 66. “HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency, stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core. With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic of the future.” https://consumer.huawei.com/en/campaign/kirin-990-series/ “Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8 and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core. There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power efficiency above all else, and it can be used for polling or other applications where performance isn’t particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin 990 (LTE) has one big core and one tiny core.” https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
  • 67. (Sep 15, 2020) Apple unveils A14 Bionic processor with 40% faster CPU and 11.8 billion transistors “The chip has a 16-core neural engine that can execute 11 trillion AI operations per second. The neural engine core count is twice the previous chip, and can perform machine learning computations 10 times faster. The A14 has six CPU cores and four graphics processing unit (GPU) cores.” https://venturebeat.com/2020/09/15/apple-unveils-a14-bionic-processor-with-40-faster-cpu-and-11-8-billion-transistors/ Mobile AI: Apple (Neural Engine)
  • 68. (January 12, 2021) Samsung sets new standard for flagship mobile processors with Exynos 2100 “AI capabilities will also enjoy a significant boost with the Exynos 2100. The newly-designed tri-core NPU has architectural enhancements such as minimizing unnecessary operations for high effective utilization and support for feature-map and weight compression. Exynos 2100 can perform up to 26-trillion-operations-per-second (TOPS) with more than twice the power efficiency than the previous generation. With on-device AI processing and support for advanced neural networks, users will be able to enjoy more interactive and smart features as well as enhanced computer vision performance in applications such as imaging.” https://www.samsung.com/semiconductor/minisite/exynos/newsroom/pressrelease/samsung-sets-new-standard-for-flagship-mobile-processors-with-exynos-2100/ Mobile AI: Samsung (NPU)
  • 69. (Aug 7, 2019) MediaTek Announces Dimensity 1000 ARM Chip With Integrated 5G Modem “The Dimensity 1000 doesn’t just bring new branding; it’s also sporting four Cortex A77 CPU cores and four Cortex A55 CPU cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core ISP, and a 6-core AI processor. The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It houses six AI processors (two big cores, three small cores and a single tiny core). The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.” https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem https://i.mediatek.com/mediatek-5g Mobile AI: MediaTek (APU)
  • 70. AI at the Edge: Jetson Nano Price: $99 ($59 for 2Gb) NVIDIA Jetson Nano Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing. All in an easy-to-use platform that runs in as little as 5 watts. ● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS ● 4 GB 64-bit LPDDR4 25.6 GB/s https://developer.nvidia.com/embedded/jetson-nano-developer-kit See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
  • 71. Neural Compute Stick 2 (~$70) The latest generation of Intel® VPUs includes 16 powerful processing cores (called SHAVE cores) and a dedicated deep neural network hardware accelerator for high-performance vision and AI inference applications—all at low power. ● Supports Convolutional Neural Network (CNN) ● Support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and PaddlePaddle via an ONNX conversion ● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU) ● Connectivity: USB 3.0 Type-A https://software.intel.com/en-us/neural-compute-stick AI at the Edge: Movidius
  • 72. AI at the Edge: Google Edge TPU The Edge TPU is a small ASIC designed by Google that provides high performance ML inferencing for low-power devices. For example, it can execute state-of-the-art mobile vision models such as MobileNet V2 at 400 FPS, in a power efficient manner. The on-board Edge TPU coprocessor is capable of performing 4 TOPS using 0.5 watts for each TOPS (2 TOPS per watt). TensorFlow Lite models can be compiled to run on the Edge TPU. USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board https://cloud.google.com/edge-tpu/ https://coral.ai/products/
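Running a compiled *_edgetpu.tflite model through the Edge TPU delegate with tflite_runtime, following the Coral documentation; the model path is a placeholder:

```python
# Edge TPU inference: load the delegate, feed a tensor, read scores.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path='mobilenet_v2_edgetpu.tflite',  # placeholder compiled model
    experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp['index'], np.zeros(inp['shape'], dtype=inp['dtype']))
interpreter.invoke()
scores = interpreter.get_tensor(out['index'])
```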
  • 73. ● Sophon Neural Network Stick (NNS) https://www.sophon.ai/product/introduce/nns.html ● Xilinx Edge AI (FPGA!) https://www.xilinx.com/applications/industrial/analytics-machine-learning.html ● The Hailo-8 M.2 Module https://hailo.ai/product-hailo/hailo-8-m2-module/ ● More: https://github.com/crespum/edge-ai AI at the Edge: Others
  • 75. Problems Even with FPGA/ASIC and edge devices: ● Energy efficiency: ○ Better than CPU/GPU, but still far from 20 watts used by the human brain ● Architecture: ○ Even more specialized for ML/DL computations, but... ○ Still far from brain-like computations
  • 77. Neuromorphic chips ● Neuromorphic computing - brain-inspired computing - has emerged as a new technology to enable information processing at very low energy cost using electronic devices that emulate the electrical behaviour of (biological) neural networks. ● Neuromorphic chips attempt to model in silicon the massively parallel way the brain processes information as billions of neurons and trillions of synapses respond to sensory inputs such as visual and auditory stimuli. ● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic Scalable Electronics) ● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain Project SpiNNaker and HICANN. https://www.technologyreview.com/s/526506/neuromorphic-chips/
  • 78. Neuromorphic chips: IBM TrueNorth ● 1M neurons, 256M synapses, 4096 neurosynaptic cores on a chip, est. 46B synaptic ops per sec per W ● Uses 70mW; power density is 20 milliwatts per cm², almost 1/10,000th the power of most modern microprocessors ● “Our sights are now set high on the ambitious goal of integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while consuming ~4kW of power”. ● Currently IBM is making plans to commercialize it. ● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips (16M neurons, 4B synapses; for context, the human brain has 86B neurons). When running flat out, the entire cluster will consume a grand total of 2.5 watts. http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
  • 79. Neuromorphic chips: IBM TrueNorth ● (03.2016) IBM Research demonstrated convolutional neural nets with close to state of the art performance: “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
  • 80. Neuromorphic chips: Intel Loihi ● Fully asynchronous neuromorphic many core mesh that supports a wide range of sparse, hierarchical and recurrent neural network topologies ● Each neuromorphic core includes a learning engine that can be programmed to adapt network parameters during operation, supporting supervised, unsupervised, reinforcement and other learning paradigms. ● Fabrication on Intel’s 14 nm process technology. ● A total of 130,000 neurons and 130 million synapses. ● Development and testing of several algorithms with high algorithmic efficiency for problems including path planning, constraint satisfaction, sparse coding, dictionary learning, and dynamic pattern learning and adaptation. https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/ https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/ https://ieeexplore.ieee.org/document/8259423 https://en.wikichip.org/wiki/intel/loihi
  • 81. Neuromorphic chips: Intel Pohoiki Beach (Jul 15, 2019) “Intel announced that an 8 million-neuron neuromorphic system comprising 64 Loihi research chips — codenamed Pohoiki Beach — is now available to the broader research community. With Pohoiki Beach, researchers can experiment with Intel’s brain-inspired research chip, Loihi, which applies the principles found in biological brains to computer architectures. ” https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
  • 82. Neuromorphic chips: Intel Pohoiki Springs https://www.nextplatform.com/2020/03/19/intel-smells-neuromorphic-opportunity/
  • 83. Neuromorphic chips: Intel Loihi “Using Intel's Loihi neuromorphic research chip and ABR's Nengo Deep Learning toolkit, we analyze the inference speed, dynamic power consumption, and energy cost per inference of a two-layer neural network keyword spotter trained to recognize a single phrase. We perform comparative analyses of this keyword spotter running on more conventional hardware devices including a CPU, a GPU, Nvidia's Jetson TX1, and the Movidius Neural Compute Stick.” Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware https://arxiv.org/abs/1812.01739
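For a feel of the workflow behind such benchmarks, here is a tiny spiking network in Nengo; with the nengo-loihi backend installed, the same model can target Loihi by swapping the simulator (this is a sketch of the tooling, not the paper's keyword spotter):

```python
# A 100-neuron spiking ensemble tracking a sine wave in Nengo.
import numpy as np
import nengo

with nengo.Network() as net:
    stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))
    ens = nengo.Ensemble(n_neurons=100, dimensions=1)  # spiking LIF neurons
    nengo.Connection(stim, ens)
    probe = nengo.Probe(ens, synapse=0.01)

with nengo.Simulator(net) as sim:  # or nengo_loihi.Simulator(net) on Loihi
    sim.run(1.0)
print(sim.data[probe].shape)       # decoded output over 1000 timesteps
```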
  • 84. Neuromorphic chips: Intel Loihi Intel Benchmarks for Loihi Neuromorphic Computing Chip https://www.eetasia.com/intel-benchmarks-for-loihi-neuromorphic-computing-chip/
  • 85. Neuromorphic chips: Intel Loihi https://newsroom.intel.com/wp-content/uploads/sites/11/2020/12/Neuromorphic-Computing-slides-B.pdf
  • 86. NxTF: a Keras-like API for SNNs on Loihi “NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi” https://arxiv.org/abs/2101.04261 https://github.com/intel-nrc-ecosystem/models/tree/master/nxsdk_modules_ncl/dnn
  • 89. Neuromorphic chips: Tianjic Tianjic's unified function core (FCore) combines essential building blocks for both artificial neural networks and biological networks: axon, synapse, dendrite and soma blocks. The 28-nm chip consists of 156 FCores, containing approximately 40,000 neurons and 10 million synapses in an area of 3.8×3.8 mm². Tianjic delivers an internal memory bandwidth of more than 610 GB per second, and a peak performance of 1.28 TOPS per watt for running artificial neural networks. In the biologically-inspired spiking neural network mode, Tianjic achieves a peak performance of about 650 giga synaptic operations per second (GSOPS) per watt. https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113 https://www.nature.com/articles/s41586-019-1424-8
  • 90. Neuromorphic chips: Others ● SpiNNaker (1,036,800 ARM9 cores) http://apt.cs.manchester.ac.uk/projects/SpiNNaker/ ● SpiNNaker-2 https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain Simulation and Machine Learning” ● BrainScaleS, HICANN: 20 8-inch silicon wafers, each incorporating 50×10⁶ plastic synapses and 200,000 biologically realistic neurons. https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/ ● Akida NSoC: 1.2 million neurons and 10 billion synapses https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip https://www.nextplatform.com/2020/01/30/neuromorphic-chip-maker-takes-aim-at-the-edge/ https://en.wikichip.org/wiki/brainchip/akida ● Neurogrid: can model a slab of cortex with up to 16×256×256 neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
  • 93. Other approaches ● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor ● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html ● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/ ● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/ ● Unconventional computing: cellular automata, reservoir computing, using biological cells/neurons, chemical computation, membrane computing, slime mold computing and much more https://www.springer.com/gp/book/9781493968824 ● ...
  • 94. References: Hardware for Deep Learning series of posts: https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc ● Part 1: Introduction and Executive summary ● Part 2: CPU ● Part 3: GPU ● Part 4: ASIC ● Part 5: FPGA ● Part 6: Mobile AI ● Part 7: Neuromorphic computing ● Part 8: Quantum computing

Editor's notes

  1. https://www.toptal.com/back-end/arm-servers-armv8-for-datacentres https://www.arm.com/products/silicon-ip-cpu/machine-learning/ethos-n77 https://www.arm.com/company/news/2019/11/paving-the-way-for-ai-enabled-supercomputing https://www.arm.com/products/silicon-ip-cpu/machine-learning/project-trillium
  2. https://www.toptal.com/back-end/arm-servers-armv8-for-datacentres
  3. http://blog.ginsudo.com/
  4. https://www.enterpriseai.news/2017/11/13/can-boost-acceleration-opencapi-today/ https://mipt.ru/upload/pr/POWER9.pdf
  5. http://blog.ginsudo.com/ https://spectrum.ieee.org/tech-talk/semiconductors/processors/nvidia-chip-takes-deep-learning-to-the-extremes
  6. https://wccftech.com/nvidias-ampere-gpu-launching-in-2020-will-be-based-on-samsungs-7nm-euv-process/
  7. http://blog.ginsudo.com/
  8. https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware
  9. https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware
  10. https://arstechnica.com/gadgets/2018/01/kaby-lake-g-unveiled-intel-cpu-amd-gpu-nvidia-beating-performance/ https://www.engadget.com/2019/10/09/intel-discontinues-kaby-lake-g-with-amd/ https://www.engadget.com/2018/06/12/intel-discrete-gpu-2020/ https://www.zdnet.com/article/intel-xe-gpus-are-here-with-ponte-vecchio-launched-at-sc19/ https://www.notebookcheck.net/New-Intel-Xe-GPU-info-Gen-12-mobility-iGPUs-designed-to-deliver-1080p-60fps-gaming-discrete-GPUs-to-et-ray-tracing-support.437548.0.html https://www.pcgamesn.com/intel/xe-gpu-release-date-specs-performance
  11. https://arstechnica.com/gadgets/2018/01/kaby-lake-g-unveiled-intel-cpu-amd-gpu-nvidia-beating-performance/ https://www.engadget.com/2019/10/09/intel-discontinues-kaby-lake-g-with-amd/ https://www.engadget.com/2018/06/12/intel-discrete-gpu-2020/ https://www.zdnet.com/article/intel-xe-gpus-are-here-with-ponte-vecchio-launched-at-sc19/ https://www.notebookcheck.net/New-Intel-Xe-GPU-info-Gen-12-mobility-iGPUs-designed-to-deliver-1080p-60fps-gaming-discrete-GPUs-to-et-ray-tracing-support.437548.0.html https://www.pcgamesn.com/intel/xe-gpu-release-date-specs-performance
  12. http://blog.ginsudo.com/
  13. https://www.nextplatform.com/2019/11/07/what-is-the-next-fpga-platform/
  14. https://www.techrepublic.com/article/feniks-microsofts-cloud-scale-fpga-operating-system/
  15. More info: https://www.intel.com/content/www/us/en/products/programmable/fpga.html https://www.xilinx.com/products/silicon-devices/fpga.html https://www.techpowerup.com/258678/xilinx-announces-virtex-ultrascale-the-worlds-largest-fpga https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/sg/product-catalog.pdf https://www.intel.com/content/www/us/en/products/programmable/fpga/arria-10/features.html
  16. https://www.hpcwire.com/off-the-wire/alpha-data-releases-xilinx-versal-acap-accelerator-board/ https://www.xilinx.com/products/boards-and-kits/vck190.html https://www.tomshardware.com/news/xilinx-shipping-versal-acap-ai-prime,39661.html https://www.forbes.com/sites/davealtavilla/2019/06/18/xilinx-ships-versal-a-new-breed-of-adaptable-processing-engines-for-ai-cloud-5g-and-more/#47c3c23b418d https://venturebeat.com/2019/06/18/xilinx-ships-first-versal-acap-chips-that-adapt-to-ai-programs/
  17. https://www.nextplatform.com/2018/08/27/xilinx-unveils-xdnn-fpga-architecture-for-ai-inference/ https://www.xilinx.com/applications/megatrends/machine-learning.html
  18. https://www.nextplatform.com/2018/08/27/xilinx-unveils-xdnn-fpga-architecture-for-ai-inference/
  19. https://www.extremetech.com/computing/296990-intel-nervana-nnp-i-nnp-t-a-training-inference https://www.nextplatform.com/2019/11/13/intel-throws-down-ai-gauntlet-with-neural-network-chips/ https://www.tomshardware.com/news/intel-nervana-nueral-net-processor-nnt-p,40185.html
  20. https://www.eetimes.com/document.asp?doc_id=1334578 https://cdn2.hubspot.net/hubfs/729091/NIPS2017/NIPS%2017%20-%20IPU.pdf
  21. https://www.huawei.com/en/press-events/news/2018/11/Huawei-Ascend-310-world-internet-conference-2018-Award https://e.huawei.com/ru/products/cloud-computing-dc/atlas/ascend-310 https://www.cnbc.com/2019/08/23/huawei-launches-ai-chip-ascend-910-pitting-it-against-nvidia-qualcomm.html https://www.huaweicentral.com/huawei-will-launch-ascend-910-ai-chip-and-computing-framework-tomorrow/ https://www.zdnet.com/article/huawei-unleashes-ai-chip-touting-more-compute-power-than-competitors/ https://www.huawei.com/en/press-events/news/2019/8/Huawei-Ascend-910-most-powerful-AI-processor
  22. https://www.huawei.com/en/press-events/news/2018/11/Huawei-Ascend-310-world-internet-conference-2018-Award https://e.huawei.com/ru/products/cloud-computing-dc/atlas/ascend-310 https://www.cnbc.com/2019/08/23/huawei-launches-ai-chip-ascend-910-pitting-it-against-nvidia-qualcomm.html https://www.huaweicentral.com/huawei-will-launch-ascend-910-ai-chip-and-computing-framework-tomorrow/ https://www.zdnet.com/article/huawei-unleashes-ai-chip-touting-more-compute-power-than-competitors/ https://www.huawei.com/en/press-events/news/2019/8/Huawei-Ascend-910-most-powerful-AI-processor
  23. https://appleinsider.com/articles/19/10/22/editorial-why-the-apple-a13-bionic-blows-past-qualcomm-snapdragon-855-plus
  24. https://www.datacenterknowledge.com/edge-computing/why-microsoft-betting-fpgas-machine-learning-edge
  25. https://www.iis.fraunhofer.de/en/ff/kom/iot/embedded-ml/neuromorphic.html https://www.rambus.com/blogs/will-neuromorphic-chips-outpace-ai-processors/ https://niceworkshop.org/nice-2019/presentations-videos/ https://www.nextplatform.com/2019/03/05/one-step-closer-to-deep-learning-on-neuromorphic-hardware/
  26. https://www.intel.com/content/www/us/en/research/neuromorphic-computing.html
  27. https://newsroom.intel.com/news/accenture-airbus-ge-hitachi-join-intel-neuromorphic-research-community/ https://www.intel.com/content/www/us/en/research/neuromorphic-community.html https://www.nature.com/articles/s42256-019-0097-1.epdf?author_access_token=DVi3hfPefL86m1i_Bts9NNRgN0jAjWel9jnR3ZoTv0MK4s3LnSWqkwc7kKuB1k_WC_R3JNRCJJch_gU0hHY_9FpiDFWj3Bi4OVrtVlUI-s_xyYAfOb01qXSqb9EKnvfNmjH6Ho2XGoTM0IBPIn_q1w%3D%3D
  28. https://newsroom.intel.com/news/accenture-airbus-ge-hitachi-join-intel-neuromorphic-research-community/ https://www.intel.com/content/www/us/en/research/neuromorphic-community.html https://www.nature.com/articles/s42256-019-0097-1.epdf?author_access_token=DVi3hfPefL86m1i_Bts9NNRgN0jAjWel9jnR3ZoTv0MK4s3LnSWqkwc7kKuB1k_WC_R3JNRCJJch_gU0hHY_9FpiDFWj3Bi4OVrtVlUI-s_xyYAfOb01qXSqb9EKnvfNmjH6Ho2XGoTM0IBPIn_q1w%3D%3D
  29. https://www.hpcwire.com/2018/08/21/neuromorphic-platform-spinnaker-takes-another-step-forward/ https://www.frontiersin.org/articles/10.3389/fnins.2018.00840/full https://niceworkshop.org/wp-content/uploads/2019/04/NICE-2019-Day-1c-Kris-Carlson.pdf http://avlsi.csl.yale.edu/bio.php https://newatlas.com/neuromorphic-chips/28586/
  30. https://www.hpcwire.com/2018/08/21/neuromorphic-platform-spinnaker-takes-another-step-forward/ https://www.frontiersin.org/articles/10.3389/fnins.2018.00840/full https://niceworkshop.org/wp-content/uploads/2019/04/NICE-2019-Day-1c-Kris-Carlson.pdf
  31. https://phys.org/news/2019-04-slime-mold-absorbs-substances.html https://qz.com/301402/what-yellow-slime-yes-slime-can-teach-your-organization/