2.
2020 RISC-V Summit
Esperanto: Doubling Down on RISC Principles and RISC-V
David R. Ditzel and David A. Patterson, co-authors of "The Case for the Reduced Instruction Set Computer," together at the 7th RISC-V Workshop
Using RISC-V as the basis for our AI processor strategy was key for us!
SIMPLE INSTRUCTION SET USES FEWER GATES
- Less complex designs
- Smaller die size and lower costs
- Reduced dynamic and static power consumption
RISC-V ENABLES HARDWARE INNOVATION
- Machine-learning-specific instruction set extensions
- Custom microarchitecture
- Proprietary low-power design techniques
BROAD ECOSYSTEM EASES DEVELOPMENT TASKS
- Development tools
- Operating systems and software stacks
- 3rd-party IP
3.
Esperanto: Highly Scalable RISC-V AI Chip Solutions
Esperanto ET-SoC-1 die plot: 1000+ custom RISC-V cores and 23.8B transistors on the TSMC 7nm manufacturing node. The initial product targets datacenter inferencing.
Esperanto's Tiled AI Solution is Designed to Scale from Hundreds to Thousands of CPU Cores
SUPERIOR PERFORMANCE (1)
- Up to 50x better performance on recommendation networks
- Up to 30x better performance on image classification
BEST-IN-CLASS EFFICIENCY
- 100x better energy efficiency (inferences/watt) on key workloads
- Huge reduction in energy costs for datacenter customers
FUTURE-PROOF SOLUTION
- Fully programmable to handle future AI models
- Leverages a large, open programming-software ecosystem
- Industry-leading roadmap of hardware and software solutions
(1) Comparing Esperanto full-chip emulation results with measured inference benchmark results for
incumbent competitors. Characterized silicon results coming soon.
4.
[ET-Minion floorplan annotations: four data-cache banks with data-cache control (including D-tags), front end, transcendental ROMs, 32-bit & 16-bit FMA units with bypass, TIMA blocks, and VPU register files for hardware threads T0/T1]
ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit
ET-MINION IS A CUSTOM-BUILT 64-BIT RISC-V PROCESSOR
- In-order pipeline with few gates per stage to improve MHz at low voltages
- Architecture and circuits optimized to enable low-voltage operation
- 2 hardware threads of execution
- Software-configurable L1 data cache and/or scratchpad
VECTOR/TENSOR UNIT OPTIMIZED FOR MACHINE LEARNING
- New multi-cycle tensor instructions
- 256-bit-wide floating point per cycle
  - 16 32-bit single-precision operations per cycle
  - 32 16-bit half-precision operations per cycle
- 512-bit-wide integer per cycle
  - 128 8-bit integer operations per cycle
- Vector transcendental instructions
[ET-Minion block diagram: RISC-V integer pipeline, L1 data cache/scratchpad, and a vector/tensor unit with a 256b floating-point vector register file and a 512b Int8 datapath; the core and vector/tensor unit are optimized for low-voltage operation to improve energy efficiency]
Optimized for energy-efficient ML operations: each ET-Minion can deliver a peak of 128 INT8 GOPS per GHz.
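The per-cycle throughput numbers above follow directly from the datapath widths. A quick sanity check, assuming (as the totals imply) that a fused multiply-add counts as 2 operations per lane:

```python
# Peak per-cycle math throughput of the ET-Minion vector/tensor unit,
# derived from the datapath widths on this slide. Counting an FMA as
# 2 operations per lane is an assumption consistent with the stated totals.
def ops_per_cycle(datapath_bits, element_bits, ops_per_lane=2):
    lanes = datapath_bits // element_bits
    return lanes * ops_per_lane

fp32 = ops_per_cycle(256, 32)   # 8 lanes  x 2 = 16 FP32 ops/cycle
fp16 = ops_per_cycle(256, 16)   # 16 lanes x 2 = 32 FP16 ops/cycle
int8 = ops_per_cycle(512, 8)    # 64 lanes x 2 = 128 INT8 ops/cycle

# At 1 GHz, 128 INT8 ops/cycle works out to 128 GOPS per ET-Minion.
```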
5.
32 ET-Minion CPUs and 4 MB of memory form a "Minion Shire"
32 ET-MINION RISC-V CORES PER MINION SHIRE
- Arranged in four 8-core neighborhoods
MEMORY HIERARCHY IS SOFTWARE-CONFIGURABLE
- L1 SRAM can be configured as data cache or scratchpad
- 4 MB L2 SRAM can be configured as private L2, shared L3, or scratchpad
MESH-CONNECTED SHIRES
MULTIPLE SYNCHRONIZATION PRIMITIVES
- Fast local atomics
- Fast local barriers
- Fast local credit counters
- IPI support
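A "fast local barrier" lets all cores in one 8-core neighborhood wait until every core has reached the same point before any proceeds. A minimal software analogy using Python threads (not Esperanto's hardware API, which the slides do not detail):

```python
# Software analogy for a neighborhood barrier: 8 "cores" each compute a
# partial result, then synchronize before publishing it. This sketch uses
# Python's threading.Barrier, purely to illustrate barrier semantics.
import threading

NEIGHBORHOOD_CORES = 8
barrier = threading.Barrier(NEIGHBORHOOD_CORES)
results = []
lock = threading.Lock()

def worker(core_id):
    partial = core_id * core_id   # stand-in for per-core work
    barrier.wait()                # no core proceeds until all 8 arrive
    with lock:
        results.append(partial)

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(NEIGHBORHOOD_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In hardware, the point of making this primitive *local* to a shire is that the round trip never leaves the tile, so synchronization latency stays small even as the chip scales to a thousand cores.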
[Minion Shire diagram: 32 minion cores (m0–m31) in four 8-core neighborhoods, a 4x4 crossbar, a mesh stop on the mesh interconnect, 4 MB of banked SRAM cache/scratchpad in four 1 MB banks (Bank0–Bank3), local sync primitives, and separate low-voltage and nominal-voltage domains]
6.
More RISC-Vs on a Chip: 1089 ET-Minions & 4 ET-Maxions in 7nm
[Chip block diagram: two LPDDR4x DRAM controllers, a PCIe interface, and four ET-Maxions surrounding the Minion Shire array]
34 Minion Shires
- 1088 ET-Minion processors
- 136 MB of on-die memory, software-configurable as L2, L3, or scratchpad
- Shared global address space
Service Processor
- 1 ET-Minion processor
4 ET-Maxion Processors
- High-performance out-of-order CPU
- Up to 5 RV64GC instructions issued per clock
- 4 MB private L2
x8 PCIe Gen4
Secure Root of Trust
LPDDR4x DRAM Controllers
- Up to 32 GB DRAM
- 137 GB/sec memory bandwidth
- 256-bit-wide interface
Block diagram of Esperanto's energy-efficient ET-SoC-1 chip. Typical operating point under 20 watts.
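The headline core and memory counts are simple products of the numbers on this slide; a quick check:

```python
# Sanity-check the ET-SoC-1 totals stated on this slide.
shires = 34
minions_per_shire = 32
service_processor = 1      # one additional ET-Minion
maxions = 4

compute_minions = shires * minions_per_shire           # 1088
et_minions = compute_minions + service_processor       # 1089 (title count)
total_riscv = et_minions + maxions                     # 1093 RISC-V cores
on_die_mb = shires * 4                                 # 136 MB of shire SRAM
```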
8.
Up to Six ET-SoC-1 Chips on a Glacier Point v2 Card
Note: The Glacier Point v2 board design has been open-sourced through the Open Compute Project and is available for purchase. Three Esperanto Dual M.2 modules can mount on the top side and three on the bottom.
Peak performance of > 800 Tera-Ops (8-bit) per second with ET-Minions operating at 1 GHz
ONE CARD WITH UP TO:
- 6558 RISC-V cores
- 192 GB of DRAM
- 822 GB/s DRAM bandwidth
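The card totals are six times the per-chip figures, and the >800 Tera-Ops claim follows from the per-Minion peak at 1 GHz:

```python
# Aggregate figures for six ET-SoC-1 chips on one Glacier Point v2 card,
# derived from the per-chip numbers stated in the preceding slides.
chips = 6
cores_per_chip = 1093          # 1089 ET-Minions + 4 ET-Maxions
dram_gb_per_chip = 32
bw_gbs_per_chip = 137

cores = chips * cores_per_chip        # 6558 RISC-V cores
dram = chips * dram_gb_per_chip       # 192 GB
bw = chips * bw_gbs_per_chip          # 822 GB/s

# Peak INT8 throughput: 1088 compute ET-Minions x 128 GOPS each at 1 GHz
tops_per_chip = 1088 * 128 / 1000     # ~139 Tera-Ops/s per chip
card_tops = chips * tops_per_chip     # ~836, i.e. > 800 Tera-Ops/s
```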
10.
Software: Esperanto Supports C++ / PyTorch and Common ML Frameworks
GLOW runs on the x86 host; ML models run across multiple ET-SoC-1 chips.
GLOW Frontend:
- GLOW = Graph LOWering, open-sourced by Facebook
- Hardware-independent optimizations
- Divides work across n chips
GLOW Backend:
- Does hardware-dependent optimizations
- Backend modified by Esperanto to generate instructions for the ET-SoC-1 chip
[Software-stack diagram: ML model frameworks (PyTorch, MS CNTK, ONNX models, C++, ...) feed the GLOW compiler (Facebook open-source project), which lowers through GLOW IR to the ET GLOW backend, emitting ET-SoC-1 instructions; below sit the ET runtime and ET device driver, alongside development tools and management utilities (console/debugger, performance monitor, diagnostics, firmware updater)]
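The frontend's "divides work across n chips" step can be pictured as partitioning an ordered list of model layers into contiguous chunks, one per chip. The layer names and the contiguous round-down policy below are illustrative assumptions, not GLOW's actual IR or partitioner:

```python
# Hedged sketch of hardware-independent work division in the spirit of
# GLOW's frontend: split a model's layer sequence across n chips.
# `partition_layers` and the toy layer names are illustrative only.
def partition_layers(layers, n_chips):
    """Split an ordered layer list into up to n_chips contiguous chunks."""
    per_chip = -(-len(layers) // n_chips)   # ceiling division
    return [layers[i:i + per_chip] for i in range(0, len(layers), per_chip)]

model = ["conv1", "relu1", "conv2", "relu2", "fc1", "softmax"]
plan = partition_layers(model, 3)
# plan -> [['conv1', 'relu1'], ['conv2', 'relu2'], ['fc1', 'softmax']]
```

The backend then performs its hardware-dependent optimizations on each chunk and emits ET-SoC-1 instructions for the chip that owns it.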
11.
Balanced Architecture for Evolving Machine Learning Workloads
- Models range from compute-intensive to memory-intensive, with both dense and sparse matrix representations
- "Should not over-design hardware for GEMMs and Convolutions" *
Key Hyperscaler ML Workload Categories

Workload Use Case           | Model Examples                | Current Approach                            | Attributes
--------------------------- | ----------------------------- | ------------------------------------------- | ------------------------------------
Recommendation              | DLRM, Wide&Deep, NCF          | Large embedding tables; MLP-based compute   | Mix of memory-intensive and compute
Computer Vision             | ResNets, ResNext, Yolo, M2Det | CNN                                         | Convolution
Natural Language Processing | BERT, GPT3                    | Multi-headed self-attention                 | Matrix compute

Relative importance* spans 100X / 10X / 1X across these categories.

* Misha Smelyanskiy, Facebook, Linley Fall Processor Conference 2019, "Challenges and Opportunities of Architecting AI Systems at Datacenter Scale"
Esperanto provides a balanced solution for both dense compute and large, sparsely-accessed memory
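To see why recommendation workloads sit at the memory-intensive end of the table, compare a DLRM-style embedding table against the dense MLP that follows it. The sizes below are illustrative assumptions, not measurements of any specific model:

```python
# Why recommendation models are memory-bound: one embedding table dwarfs
# the dense MLP weights. All sizes here are illustrative assumptions.
rows, dim = 100_000_000, 64              # one embedding table (hypothetical)
emb_bytes = rows * dim * 4               # FP32 entries -> 25.6 GB

mlp_params = 512 * 256 + 256 * 128 + 128 # a small dense MLP (hypothetical)
mlp_bytes = mlp_params * 4               # well under 1 MB

ratio = emb_bytes / mlp_bytes            # table is ~4 orders of magnitude larger
```

Each inference touches only a handful of rows out of gigabytes (sparse, bandwidth-bound lookups), while the MLP is a small dense GEMM (compute-bound), which is the balance the slide's tagline refers to.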
12.
Esperanto Meets Hyperscaler AI Inferencing Challenges
AI processing challenges include delivering AI-based services while reducing cost and complexity.
Today most hyperscaler AI inferencing workloads run on chips with legacy architectures, whose performance, energy use, and programmability do not meet demanding hyperscaler requirements.
Esperanto's custom RISC-V-based solutions deliver the required performance and power efficiency, are "future proof," and don't lock hyperscalers into legacy suppliers.
Esperanto's energy-efficient, high-performance architecture will scale from hyperscale datacenters to edge AI!
13.
Some of our Key Development Partners
Thanks to all our partners for their help in bringing our vision into reality! Sorry we can't name everyone!