2.
2020 RISC-V Summit
Esperanto: Doubling Down on RISC Principles and RISC-V
David R. Ditzel and David A. Patterson, co-authors of "The Case for the Reduced Instruction Set Computer," together at the 7th RISC-V Workshop
Using RISC-V as the basis for our AI processor strategy was key for us!
SIMPLE INSTRUCTION SET USES FEWER GATES
- Less complex designs
- Smaller die size and lower costs
- Reduced dynamic and static power consumption
RISC-V ENABLES HARDWARE INNOVATION
- Machine-learning-specific instruction set extensions
- Custom microarchitecture
- Proprietary low-power design techniques
BROAD ECOSYSTEM EASES DEVELOPMENT TASKS
- Development tools
- Operating systems and software stacks
- 3rd-party IP
3.
Esperanto: Highly Scalable RISC-V AI Chip Solutions
Esperanto ET-SoC-1 die plot: 1000+ custom RISC-V cores and 23.8B transistors on the TSMC 7nm manufacturing node. The initial product targets datacenter inferencing.
Esperanto's Tiled AI Solution is Designed to Scale from Hundreds to Thousands of CPU Cores
SUPERIOR PERFORMANCE (1)
- Up to 50x better performance on recommendation networks
- Up to 30x better performance on image classification
BEST-IN-CLASS EFFICIENCY
- 100x better energy efficiency (inferences/watt) on key workloads
- Huge reduction in energy costs for datacenter customers
FUTURE-PROOF SOLUTION
- Fully programmable to handle future AI models
- Leverages a large, open programming-software ecosystem
- Industry-leading roadmap of hardware and software solutions
(1) Comparing Esperanto full-chip emulation results with measured inference benchmark results for
incumbent competitors. Characterized silicon results coming soon.
4.
[ET-Minion floorplan annotations: four data-cache banks with data-cache control (including D-tags), front end, transcendental ROMs, 32-bit & 16-bit FMA units with bypass, TIMA blocks, and VPU register files for hardware threads T0/T1]
ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit
ET-MINION IS A CUSTOM-BUILT 64-BIT RISC-V PROCESSOR
- In-order pipeline with few gates per stage to improve MHz at low voltages
- Architecture and circuits optimized to enable low-voltage operation
- 2 hardware threads of execution
- Software-configurable L1 data cache and/or scratchpad
VECTOR/TENSOR UNIT OPTIMIZED FOR MACHINE LEARNING
- New multi-cycle tensor instructions
- 256-bit-wide floating point per cycle
  - 16 32-bit single-precision operations per cycle
  - 32 16-bit half-precision operations per cycle
- 512-bit-wide integer per cycle
  - 128 8-bit integer operations per cycle
- Vector transcendental instructions
[ET-Minion block diagram: RISC-V integer pipeline, L1 data cache/scratchpad, and a vector/tensor unit with a 256b floating-point vector register file and a 512b Int8 datapath; the core and vector/tensor unit are optimized for low-voltage operation to improve energy efficiency]
Optimized for energy-efficient ML operations: each ET-Minion can deliver a peak of 128 INT8 GOPS per GHz.
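The per-cycle throughput numbers above follow directly from the datapath widths. A quick sanity check, assuming (as the totals imply) that a fused multiply-add counts as 2 operations per lane:

```python
# Peak per-cycle math throughput of the ET-Minion vector/tensor unit,
# derived from the datapath widths on this slide. Counting an FMA as
# 2 operations per lane is an assumption consistent with the stated totals.
def ops_per_cycle(datapath_bits, element_bits, ops_per_lane=2):
    lanes = datapath_bits // element_bits
    return lanes * ops_per_lane

fp32 = ops_per_cycle(256, 32)   # 8 lanes  x 2 = 16 FP32 ops/cycle
fp16 = ops_per_cycle(256, 16)   # 16 lanes x 2 = 32 FP16 ops/cycle
int8 = ops_per_cycle(512, 8)    # 64 lanes x 2 = 128 INT8 ops/cycle

# At 1 GHz, 128 INT8 ops/cycle works out to 128 GOPS per ET-Minion.
```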
5.
32 ET-Minion CPUs and 4 MB of memory form a "Minion Shire"
32 ET-MINION RISC-V CORES PER MINION SHIRE
- Arranged in four 8-core neighborhoods
MEMORY HIERARCHY IS SOFTWARE-CONFIGURABLE
- L1 SRAM can be configured as data cache or scratchpad
- 4 MB L2 SRAM can be configured as private L2, shared L3, or scratchpad
MESH-CONNECTED SHIRES
MULTIPLE SYNCHRONIZATION PRIMITIVES
- Fast local atomics
- Fast local barriers
- Fast local credit counters
- IPI support
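A "fast local barrier" lets all cores in one 8-core neighborhood wait until every core has reached the same point before any proceeds. A minimal software analogy using Python threads (not Esperanto's hardware API, which the slides do not detail):

```python
# Software analogy for a neighborhood barrier: 8 "cores" each compute a
# partial result, then synchronize before publishing it. This sketch uses
# Python's threading.Barrier, purely to illustrate barrier semantics.
import threading

NEIGHBORHOOD_CORES = 8
barrier = threading.Barrier(NEIGHBORHOOD_CORES)
results = []
lock = threading.Lock()

def worker(core_id):
    partial = core_id * core_id   # stand-in for per-core work
    barrier.wait()                # no core proceeds until all 8 arrive
    with lock:
        results.append(partial)

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(NEIGHBORHOOD_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In hardware, the point of making this primitive *local* to a shire is that the round trip never leaves the tile, so synchronization latency stays small even as the chip scales to a thousand cores.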
[Minion Shire diagram: 32 minion cores (m0–m31) in four 8-core neighborhoods, a 4x4 crossbar, a mesh stop on the mesh interconnect, 4 MB of banked SRAM cache/scratchpad in four 1 MB banks (Bank0–Bank3), local sync primitives, and separate low-voltage and nominal-voltage domains]
6.
More RISC-Vs on a Chip: 1089 ET-Minions & 4 ET-Maxions in 7nm
[Chip block diagram: two LPDDR4x DRAM controllers, a PCIe interface, and four ET-Maxions surrounding the Minion Shire array]
34 Minion Shires
- 1088 ET-Minion processors
- 136 MB of on-die memory, software-configurable as L2, L3, or scratchpad
- Shared global address space
Service Processor
- 1 ET-Minion processor
4 ET-Maxion Processors
- High-performance out-of-order CPU
- Up to 5 RV64GC instructions issued per clock
- 4 MB private L2
x8 PCIe Gen4
Secure Root of Trust
LPDDR4x DRAM Controllers
- Up to 32 GB DRAM
- 137 GB/sec memory bandwidth
- 256-bit-wide interface
Block diagram of Esperanto's energy-efficient ET-SoC-1 chip. Typical operating point under 20 watts.
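The headline core and memory counts are simple products of the numbers on this slide; a quick check:

```python
# Sanity-check the ET-SoC-1 totals stated on this slide.
shires = 34
minions_per_shire = 32
service_processor = 1      # one additional ET-Minion
maxions = 4

compute_minions = shires * minions_per_shire           # 1088
et_minions = compute_minions + service_processor       # 1089 (title count)
total_riscv = et_minions + maxions                     # 1093 RISC-V cores
on_die_mb = shires * 4                                 # 136 MB of shire SRAM
```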
8.
Up to Six ET-SoC-1 Chips on a Glacier Point v2 Card
Note: The Glacier Point v2 board design has been open-sourced through the Open Compute Project and is available for purchase. Three Esperanto Dual M.2 modules can mount on the top side and three on the bottom.
Peak performance of > 800 Tera-Ops (8-bit) per second with ET-Minions operating at 1 GHz
ONE CARD WITH UP TO:
- 6558 RISC-V cores
- 192 GB of DRAM
- 822 GB/s DRAM bandwidth
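The card totals are six times the per-chip figures, and the >800 Tera-Ops claim follows from the per-Minion peak at 1 GHz:

```python
# Aggregate figures for six ET-SoC-1 chips on one Glacier Point v2 card,
# derived from the per-chip numbers stated in the preceding slides.
chips = 6
cores_per_chip = 1093          # 1089 ET-Minions + 4 ET-Maxions
dram_gb_per_chip = 32
bw_gbs_per_chip = 137

cores = chips * cores_per_chip        # 6558 RISC-V cores
dram = chips * dram_gb_per_chip       # 192 GB
bw = chips * bw_gbs_per_chip          # 822 GB/s

# Peak INT8 throughput: 1088 compute ET-Minions x 128 GOPS each at 1 GHz
tops_per_chip = 1088 * 128 / 1000     # ~139 Tera-Ops/s per chip
card_tops = chips * tops_per_chip     # ~836, i.e. > 800 Tera-Ops/s
```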
10.
Software: Esperanto Supports C++ / PyTorch and Common ML Frameworks
GLOW runs on the x86 host; ML models run across multiple ET-SoC-1 chips.
GLOW Frontend:
- GLOW = Graph LOWering, open-sourced by Facebook
- Hardware-independent optimizations
- Divides work across n chips
GLOW Backend:
- Does hardware-dependent optimizations
- Backend modified by Esperanto to generate instructions for the ET-SoC-1 chip
[Software-stack diagram: ML model frameworks (PyTorch, MS CNTK, ONNX models, C++, ...) feed the GLOW compiler (Facebook open-source project), which lowers through GLOW IR to the ET GLOW backend, emitting ET-SoC-1 instructions; below sit the ET runtime and ET device driver, alongside development tools and management utilities (console/debugger, performance monitor, diagnostics, firmware updater)]
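The frontend's "divides work across n chips" step can be pictured as partitioning an ordered list of model layers into contiguous chunks, one per chip. The layer names and the contiguous round-down policy below are illustrative assumptions, not GLOW's actual IR or partitioner:

```python
# Hedged sketch of hardware-independent work division in the spirit of
# GLOW's frontend: split a model's layer sequence across n chips.
# `partition_layers` and the toy layer names are illustrative only.
def partition_layers(layers, n_chips):
    """Split an ordered layer list into up to n_chips contiguous chunks."""
    per_chip = -(-len(layers) // n_chips)   # ceiling division
    return [layers[i:i + per_chip] for i in range(0, len(layers), per_chip)]

model = ["conv1", "relu1", "conv2", "relu2", "fc1", "softmax"]
plan = partition_layers(model, 3)
# plan -> [['conv1', 'relu1'], ['conv2', 'relu2'], ['fc1', 'softmax']]
```

The backend then performs its hardware-dependent optimizations on each chunk and emits ET-SoC-1 instructions for the chip that owns it.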
11.
Balanced Architecture for Evolving Machine Learning Workloads
- Models range from compute-intensive to memory-intensive, with both dense and sparse matrix representations
- "Should not over-design hardware for GEMMs and Convolutions" *
Key Hyperscaler ML Workload Categories

Workload Use Case           | Model Examples                | Current Approach                            | Attributes
--------------------------- | ----------------------------- | ------------------------------------------- | ------------------------------------
Recommendation              | DLRM, Wide&Deep, NCF          | Large embedding tables; MLP-based compute   | Mix of memory-intensive and compute
Computer Vision             | ResNets, ResNext, Yolo, M2Det | CNN                                         | Convolution
Natural Language Processing | BERT, GPT3                    | Multi-headed self-attention                 | Matrix compute

Relative importance* spans 100X / 10X / 1X across these categories.

* Misha Smelyanskiy, Facebook, Linley Fall Processor Conference 2019, "Challenges and Opportunities of Architecting AI Systems at Datacenter Scale"
Esperanto provides a balanced solution for both dense compute and large, sparsely-accessed memory
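To see why recommendation workloads sit at the memory-intensive end of the table, compare a DLRM-style embedding table against the dense MLP that follows it. The sizes below are illustrative assumptions, not measurements of any specific model:

```python
# Why recommendation models are memory-bound: one embedding table dwarfs
# the dense MLP weights. All sizes here are illustrative assumptions.
rows, dim = 100_000_000, 64              # one embedding table (hypothetical)
emb_bytes = rows * dim * 4               # FP32 entries -> 25.6 GB

mlp_params = 512 * 256 + 256 * 128 + 128 # a small dense MLP (hypothetical)
mlp_bytes = mlp_params * 4               # well under 1 MB

ratio = emb_bytes / mlp_bytes            # table is ~4 orders of magnitude larger
```

Each inference touches only a handful of rows out of gigabytes (sparse, bandwidth-bound lookups), while the MLP is a small dense GEMM (compute-bound), which is the balance the slide's tagline refers to.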
12.
Esperanto Meets Hyperscaler AI Inferencing Challenges
AI processing challenges include delivering AI-based services while reducing cost and complexity.
Today most hyperscaler AI inferencing workloads run on chips with legacy architectures, whose performance, energy use, and programmability do not meet demanding hyperscaler requirements.
Esperanto's custom RISC-V-based solutions deliver the required performance and power efficiency, are "future proof," and don't lock hyperscalers into legacy suppliers.
Esperanto's energy-efficient, high-performance architecture will scale from hyperscale datacenters to edge AI!
13.
Some of our Key Development Partners
Thanks to all our partners for their help in bringing our vision into reality! Sorry we can't name everyone!