Xilinx Data Center Strategy and CCIX
- 3. © Copyright 2019 Xilinx
CCIX Consortium Effort
Advance IO Interconnect to enable seamless expansion of compute
and memory resources beyond processor SoCs
• Accelerator SoCs behave like a NUMA node from a data-sharing perspective.
• Two use cases in focus
• UC1 - Virtualized, Coherent Accelerators – Seamless acceleration
• UC2 - System Memory Expansion – Memory expansion
Higher Speed and lower Latency Transport
- 4. © Copyright 2019 Xilinx
CCIX Multichip Connectivity
˃ High performance, low latency
CCIX 1.0 defines 25GT/s (3x performance*)
Examining 32GT/s, 56GT/s (7x performance*) and beyond
Enabling low latency via light transaction layer
˃ Flexible, scalable interconnect topologies
Flexible point-to-point, daisy chained and switched topologies
˃ Seamless integration
Runs on existing PCIe transport layer and management stack
Supports all major instruction set architectures (ISA)
[Figure: CCIX links connecting Processor, Accelerator, Smart Network, Persistent Memory and Switch]
* Comparison vs PCIe Gen3
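The multipliers quoted on this slide can be checked with simple arithmetic. A minimal sketch, assuming the comparison is on raw per-lane signaling rate against PCIe Gen3 at 8 GT/s; encoding overhead and protocol efficiency are ignored, and the function name is illustrative only:

```python
PCIE_GEN3_GTS = 8.0  # PCIe Gen3 per-lane signaling rate in GT/s


def speedup_vs_gen3(rate_gts):
    """Return the raw per-lane signaling ratio relative to PCIe Gen3."""
    return rate_gts / PCIE_GEN3_GTS


# Rates named on the slide: CCIX 1.0 (25 GT/s) and the rates under examination.
for rate in (25, 32, 56):
    print(f"{rate} GT/s -> {speedup_vs_gen3(rate):.3f}x PCIe Gen3")
```

On this assumption, 25 GT/s comes out at roughly 3x and 56 GT/s at 7x, which matches the slide's rounded figures.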
- 6. © Copyright 2019 Xilinx
ARM Neoverse
Press release (February 21, 2019)
Cadence optimizes EDA tools and IP for Arm's new Neoverse N1 platform
https://www.nikkei.com/article/DGXLRSP503227_R20C19A2000000/
https://www.anandtech.com/show/13959/arm-announces-neoverse-n1-platform/7
Arm Announces Neoverse N1 & E1 Platforms & CPUs:
Enabling A Huge Jump In Infrastructure Performance
- 7. © Copyright 2019 Xilinx
Huawei Kunpeng 920
Huawei Kunpeng 920 64-Core Arm Server CPU with CCIX and PCIe Gen4 Launched
(January 7, 2019)
Today Huawei announced a new 64-Core Arm Server CPU. The Huawei Kunpeng 920 is being billed as the fastest Arm
CPU to date. One thing is for sure, there is going to be interest in this CPU not just for the 64 custom Arm cores, but also
the I/O that the chip has onboard.
Beyond the CPU cores and 8 channel DDR4-2933 memory controller, the company is also introducing PCIe Gen4 and
CCIX support.
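For a sense of scale, the peak theoretical bandwidth of the memory configuration quoted above can be estimated. A rough sketch, assuming a standard 64-bit (8-byte) data bus per DDR4 channel; the result is a theoretical ceiling, not a measured Kunpeng 920 figure, and the function name is illustrative:

```python
def ddr_peak_gb_per_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Peak theoretical DDR bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


per_channel = ddr_peak_gb_per_s(2933, channels=1)  # one DDR4-2933 channel
total = ddr_peak_gb_per_s(2933, channels=8)        # all eight channels
print(f"per channel: {per_channel:.1f} GB/s, 8 channels: {total:.1f} GB/s")
```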
- 8. © Copyright 2019 Xilinx
Use Case 1: Virtualized, Coherent Accelerators
˃ Reduced data transfer latency
˃ Improved fine grain data sharing
˃ Simplified software development; eliminates difficult debug issues
˃ Seamless offload of threads from general-purpose processors to accelerators
Preserves shared data structures between the host and accelerator
No need to re-architect any shared data structures
Improved efficiency with true peer processing
- 9. © Copyright 2019 Xilinx
Use Case 2: Memory Expansion
˃ Multiple use cases evolving for external interconnect attached memory
Larger DRAM/SCM capacity within a “box”
LD/ST to remote memory via bridging to a scale-out fabric
Opportunity for value-add functionality via external card solutions for remote memory
Over time, there is a need for choices in the scale-out fabric for carrying native LD/ST
Supports Memory Atomics over CCIX interface
- 10. © Copyright 2019 Xilinx
CCIX Roadmap
˃ CCIX 1.1
Supports 32GT/s, can use PCIe Gen5 switch for fan-out and other CCIX topologies.
Protocol enhancements to increase performance and reduce latency further
˃ CCIX 2.0
Expands seamless coherent data sharing and load/store access across multiple nodes
Supports 56GT/s and higher
- 12. © Copyright 2019 Xilinx
CCIX Layered Architecture
Protocol Layer
• Coherency protocol, memory read & write flows
• Full feature protocol
• Port aggregation for higher BW
Link Layer
• Formats CCIX messages for target transport
• Adds ability to pack and chain multiple messages to achieve higher efficiency
Transaction Layer
• Adds optimized packets, manages credit-based flow control
Physical Layer
• Dual mode PHY to support extended data rates
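[Figure: CCIX layer stack. The CCIX Protocol Layer connects through the CCIX Port (CCIX Link Layer). CCIX messages pass through the CCIX Transaction Layer and PCIe TLPs through the PCIe Transaction Layer; both sit on the shared PCIe Data Link Layer and the CCIX/PCIe Physical Layer (Tx/Rx).]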
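The credit-based flow control mentioned for the transaction layer can be illustrated generically. The sketch below is a toy model, not the CCIX or PCIe packet format: the class names, the one-credit-per-message accounting and the buffer size are all assumptions chosen for illustration. The idea it shows is that a sender transmits only while it holds credits, and the receiver hands credits back as it frees buffer slots.

```python
from collections import deque


class Receiver:
    """Receiver with a fixed number of buffer slots (hypothetical model)."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def accept(self, message):
        # A sender that respects its credits never overruns the buffer.
        assert len(self.buffer) < self.slots, "buffer overrun"
        self.buffer.append(message)

    def drain(self, count):
        """Consume up to `count` messages; return how many slots were freed."""
        freed = min(count, len(self.buffer))
        for _ in range(freed):
            self.buffer.popleft()
        return freed


class Sender:
    """Sender that transmits only while it holds credits."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.slots  # initial credits = advertised slots
        self.pending = deque()

    def submit(self, message):
        self.pending.append(message)

    def pump(self):
        """Send queued messages until credits or messages run out."""
        sent = 0
        while self.pending and self.credits > 0:
            self.receiver.accept(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent

    def return_credits(self, count):
        self.credits += count


rx = Receiver(slots=2)
tx = Sender(rx)
for name in ("msg0", "msg1", "msg2"):
    tx.submit(name)
first = tx.pump()               # limited by the two initial credits
tx.return_credits(rx.drain(1))  # receiver frees one slot, returns one credit
second = tx.pump()              # the stalled message can now be sent
print(first, second, len(tx.pending))
```

Because the sender stalls instead of overrunning the receiver, no message is dropped and no retry is needed at this level.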
- 14. © Copyright 2019 Xilinx
System Topology Examples
[Figure: example systems built from Processor, Accelerator, Memory and CCIX Switch nodes connected by CCIX links, with some processor-to-accelerator connections over PCIe]
Direct attached, daisy chain, mesh and switched topologies
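One way to compare these topologies is by the number of links a request crosses between a processor and the furthest accelerator. A minimal sketch using breadth-first search; the node names and the three small example graphs are hypothetical and are not taken from the slide's diagrams:

```python
from collections import deque


def hops(links, start, goal):
    """Breadth-first search over undirected links; returns the hop count."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is not reachable from start


# Hypothetical node names; each pair stands for one CCIX link.
direct = [("cpu", "acc0")]
daisy_chain = [("cpu", "acc0"), ("acc0", "acc1"), ("acc1", "acc2")]
switched = [("cpu", "switch"), ("switch", "acc0"),
            ("switch", "acc1"), ("switch", "acc2")]

print(hops(direct, "cpu", "acc0"),
      hops(daisy_chain, "cpu", "acc2"),
      hops(switched, "cpu", "acc2"))
```

In these examples the daisy chain needs no switch but its hop count grows with chain length, while the switched layout keeps every accelerator two hops from the processor.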
- 19. © Copyright 2019 Xilinx
Successful Hardware Demos
https://www.ccixconsortium.com/library/video/
- 21. © Copyright 2019 Xilinx
Xilinx Product Families: A Broad Portfolio
Cost-Optimized: Lowest Power & Cost
Mid-Range: Price/Performance/Watt
High-End: Performance & Capacity
SoC Portfolio: Cost-Optimized → Mid-Range, Mid-Range → High-End
Zynq-7000
- 23. © Copyright 2019 Xilinx
Obtaining Superior Bandwidth-per-Watt
DDR4 DIMM: Standard commodity memory used in servers and PCs
RLDRAM-3: Low-latency DRAM for packet-buffering applications
HMC (Hybrid Memory Cube): Serial DRAM
HBM (High Bandwidth Memory): DRAM integrated into the FPGA package

             DDR4 DIMM    RLDRAM-3     HMC          HBM
Bandwidth    21.3 GB/s    12.8 GB/s    160 GB/s     460 GB/s
Depth        16 GB        2 GB         4 GB         8 GB
Cost / GB    $            $$           $$$          $$
PCB Req      High         High         Med          None
pJ / bit     ~27          ~40          ~30          ~7
Latency      Med          Low          High         Med

* Configurations compared: single DDR4 DIMM, two x36 RLDRAM-3, single HMC device, single FPGA with HBM
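The bandwidth-per-watt claim in the slide title follows from the pJ/bit row. A rough sketch, assuming the quoted energy per bit applies at the quoted peak bandwidth and covers the memory interface only; the function names are illustrative:

```python
# (bandwidth in GB/s, energy in pJ/bit), taken from the comparison above
memories = {
    "DDR4 DIMM": (21.3, 27),
    "RLDRAM-3": (12.8, 40),
    "HMC": (160, 30),
    "HBM": (460, 7),
}


def interface_power_w(gb_per_s, pj_per_bit):
    """Power at full bandwidth: bits per second times joules per bit."""
    return gb_per_s * 1e9 * 8 * pj_per_bit * 1e-12


def gb_per_s_per_watt(pj_per_bit):
    """Bandwidth per watt depends only on the energy per bit."""
    return 1.0 / (8 * pj_per_bit * 1e-3)


for name, (bw, pj) in memories.items():
    print(f"{name}: {interface_power_w(bw, pj):.1f} W at peak, "
          f"{gb_per_s_per_watt(pj):.1f} GB/s per W")
```

Under these assumptions HBM delivers about 27 / 7 ≈ 3.9 times the bandwidth per watt of a DDR4 DIMM, and the raw bandwidth ratio is 460 / 21.3 ≈ 21.6.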
- 24. © Copyright 2019 Xilinx
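Announcing Virtex UltraScale+ devices with HBM
Delivers 20x the bandwidth of a DDR4 DIMM
DRAM stacks integrated using SSI technology
Hardened dedicated interface to HBM to ensure maximum bandwidth
Based on the proven Virtex UltraScale+ FPGA platform
The memory controller uses an AXI interface and integrates easily in Vivado IPI
HBM Gen2, offering the highest DRAM bandwidth currently available
Hardened CCIX (Cache Coherent Interconnect) ports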
https://www.xilinx.com/products/technology/memory.html
- 25. © Copyright 2019 Xilinx
Virtex® UltraScale+™ HBM FPGAs
Device Name VU31P VU33P VU35P VU37P
Logic
System Logic Cells (K) 970 970 1,915 2,860
CLB Flip-Flops (K) 887 887 1,751 2,615
CLB LUTs (K) 444 444 876 1,308
Memory
Max. Distributed RAM (Mb) 12.5 12.5 24.6 36.7
Total Block RAM (Mb) 23.6 23.6 47.3 70.9
UltraRAM (Mb) 90 90 180 270
HBM DRAM (Gb) 32 64 64 64
HBM AXI Ports 32 32 32 32
Clocking Clock Management Tiles (CMTs) 4 4 8 12
Integrated IP
DSP Slices 2,880 2,880 5,952 9,024
PCIe® Gen3 x16 / Gen4 x8 4 4 5 6
CCIX Ports(2) 4 4 4 4
150G Interlaken 0 0 2 4
100G Ethernet w/ RS-FEC 2 2 5 8
I/O
Max. Single-Ended HP I/Os 208 208 416 624
GTY 32.75Gb/s Transceivers 32 32 64 96
Speed Grades Extended(1)
-1, -2L, -3 -1, -2L, -3 -1, -2L, -3 -1, -2L, -3
Footprint(1)
Dimensions (mm) HP I/O, GTY 32.75Gb/s
Packaging
H1924 45x45 208, 32
H2104 47.5x47.5 208, 32 416, 64
H2892 55x55 416, 64 624, 96
Notes:
1. All packages are 1.0mm ball pitch.
2. A CCIX port requires the use of a PCIe Gen3 x16 / Gen4 x8 block
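Note 2 means that CCIX ports and plain PCIe endpoints draw on the same pool of hard blocks. A small budgeting sketch, assuming each CCIX port occupies exactly one PCIe block (the note does not state the ratio explicitly); the helper name is illustrative:

```python
# PCIe block and CCIX port counts from the device table above
devices = {
    "VU31P": {"pcie_blocks": 4, "ccix_ports": 4},
    "VU33P": {"pcie_blocks": 4, "ccix_ports": 4},
    "VU35P": {"pcie_blocks": 5, "ccix_ports": 4},
    "VU37P": {"pcie_blocks": 6, "ccix_ports": 4},
}


def pcie_blocks_left(device, ccix_ports_used):
    """PCIe blocks still free after reserving one block per CCIX port."""
    info = devices[device]
    if not 0 <= ccix_ports_used <= info["ccix_ports"]:
        raise ValueError("CCIX port count is out of range for this device")
    # Assumption: one PCIe block is consumed per CCIX port in use.
    return info["pcie_blocks"] - ccix_ports_used


for name in devices:
    print(name, pcie_blocks_left(name, 4))
```

On that assumption, using all four CCIX ports leaves no plain PCIe blocks on the VU31P and VU33P, one on the VU35P and two on the VU37P.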
- 26. © Copyright 2019 Xilinx
Virtex UltraScale+ HBM VCU128 FPGA Evaluation Kit
Key Features & Benefits
˃ 8GB of on-chip High Bandwidth Memory (HBM)
˃ Multiple external memory interfaces
(RLDRAM3, QDR-IV, DDR4)
˃ Quad 32Gbps QSFP28 Interfaces
˃ PCIe Gen3 x16 & Gen4 x8
˃ VITA 57.4 FMC+ Interface
https://www.xilinx.com/products/boards-and-kits/vcu128-es1.html#overview
- 27. © Copyright 2019 Xilinx
U250: 38TB/s internal SRAM bandwidth, 54MB internal SRAM capacity, 1,341K LUTs, 4100img/s CNN throughput*
U280: 30TB/s internal SRAM bandwidth, 41MB internal SRAM capacity, 1,079K LUTs, 460GB/s HBM2 memory bandwidth
U200: 31TB/s internal SRAM bandwidth, 35MB internal SRAM capacity, 892K LUTs, 3100img/s CNN throughput*
* Low-latency GoogLeNet v1
- 29. © Copyright 2018 Xilinx
Fastest Accelerator Cards for Data Center and AI
Database Search & Analysis: 90x
Financial Computing: 89x
Machine Learning: 20x
Video: 12x
HPC & Life Science: 10x
- 30. © Copyright 2018 Xilinx
Alveo vs other solutions
Throughput: Alveo U250 delivers 20x a CPU (vCPU x 72) and 4.7x a GPU (V100)
Ref: WP504
- 31. © Copyright 2018 Xilinx
Alveo vs other solutions
Power efficiency: Alveo U250 delivers 4x a GPU (V100)
Ref: WP504
- 32. © Copyright 2018 Xilinx
1/3 Latency vs GPU
[Figure: bar chart of CNN+BLSTM Speech-to-Text Latency (ms), on a 0 to 50 ms axis, for CPU + Alveo and CPU + GPU]
FPGA latency is less than 1/3 of GPU latency
Significantly shorter and more stable latency than the GPU
GPU: NVIDIA P4
- 33. © Copyright 2018 Xilinx
Most Effective for Latency Critical use cases
[Figure: two data paths from inputs (sensor, market info.) over the network to outputs (actuator, transactions). GPU accelerator path: Network Interface Card (NIC), server CPU and GPU accelerator, each with external memory. Latency = msec. FPGA Smart NIC path: FPGA Smart NIC with external memory in front of the server CPU. Latency = 1/10,000 msec (order of nsec).]
FPGA Smart NIC
Network processing - TCP/IP, TLS, OVS
Computation for making decisions faster
Faster outputs
Much lower latency and power consumption
Server CPU
Smart NIC settings
Log data
- 34. © Copyright 2018 Xilinx
Data Center Acceleration Platform
Domains: Computing, Storage, Network
Stack layers: Hardware, Firmware, Library, Middleware, Software, Framework, IDE
Hardware: Accelerator Cards (16nm FPGA, 16nm MPSoC, 7nm Versal ACAP)
Firmware: Compute Shell Firmware & Runtime, Storage Shell Firmware & Runtime, Network Shell Firmware & Runtime
Libraries and middleware: ML, Video, Compression, KVS, Crypto, OVS, SDN / NFV, RoCE, OpenStack, TensorFlow, FFmpeg, NVMe-oF, ...
Framework, API, Python/Java/C++ Programmability
- 35. © Copyright 2018 Xilinx
Applications and eco-partners
Applications in total: 15 (2017), 50 (2018)
Categories: Compression, Video, Security, Life Science, Financial Computing, Image Processing, Data Analytics, ML, Tools
https://japan.xilinx.com/products/design-tools/acceleration-zone.html#libraries
- 40. © Copyright 2019 Xilinx
Summary
˃ CCIX enables broader use of acceleration technologies
˃ CCIX Base specification is available
˃ CCIX is supported by a broad ecosystem: both host and accelerator devices are under development, with ES becoming available in the near future
˃ Active work is underway to enable the SW ecosystem and showcase use cases
˃ Go to www.ccixconsortium.com to learn more about CCIX and to join the CCIX ecosystem.
https://twitter.com/ccixconsortium https://twitter.com/XilinxJapan