SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Harnessing light to power new possibilities
Advantages of Optical CXL
for Disaggregated Compute Architectures
Ron Swartzentruber
Director of Engineering
2
Agenda
 Memory centric shift in the data center
 AI Large Language Model growth
 Need for optical CXL technology
 Case study: OPT inference benefits using optical CXL
© Lightelligence, Inc.
3
Physical
Machine
0
Virtual
Machine
0
Virtual
Machine
1
Stranded
resource
Physical
Machine
1
Stranded
resource
Virtual
Machine
2
Virtual
Machine
3 FLEXIBLE MANAGEABLE ECONOMICAL OPEN
Physical
Machine
1
Physical
Machine
0
Physical
Machine
2
……
……
……
Virtual
Machine
0
Virtual
Machine
1
Virtual
Machine
2
Virtual
Machine
3
Disaggregation is the Future for Datacenter
Virtual
Machine
4
CPU cores DRAM Accelerators
© Lightelligence, Inc.
4
AI trends
 AI and Large Language Models will continue to grow and consume more compute
 Disaggregated memory architectures are required in order to continue to scale
 Optical interconnects are required to extend reach
Source: https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8 Source: https://hc34.hotchips.org/assets/program/tutorials/CXL/Hot%20Chips%202022%20CXL%20MemoryChallenges.pdf
© Lightelligence, Inc.
5
Optical Interconnect Latency
© Lightelligence, Inc.
100s of ns
100s of 𝜇s
6
CXL is the PredominantStandard for Disaggregation
Cache-
coherence
Latency
Memory
decouple
CXL Yes ~100ns Supported
RDMA (ethernet) No ~3μs Not supported
CXL 2.0 Switch
Standardized Fabric Manager
H1 H2 H3 H4 H#
……
CXL2.0 CXL2.0 CXL2.0 CXL2.0 CXL2.0
CXL1.0 CXL2.0 CXL2.0 CXL2.0 CXL2.0
D1 D2 D3 D4 …… D#
© Lightelligence, Inc.
7
OpticalCXL is Required forScaling
ATTENUATION
(DB)
0
-10
-20
-30
-40
-50
PROPAGATION DISTANCE (M)
1m 10m
0 -0.003
-4
-40
Copper Optics
Assuming AWG26 wire, PCIe 5.0 signal
32 cables with diameter
> 6mm (CAT8)
16 fibers with diameter
of 0.125mm
…
…
6mm
> 30 mm
Copper
<1mm
Optics
Supporting 4x PCIe 5.0 x16
© Lightelligence, Inc.
8
OpticalCXL in the Datacenter
Compute
Break Through the Rack!
Memory Banks
© Lightelligence, Inc.
9
Case study: LLM Inference
CXL Memory
Expander
CXL Memory
Expander
Server
2x CXL 1.1 CPUs
 2U Supermicro server
 2x AMD Genoa CXL 1.1 CPUs
 MemVerge Memory Tiering and
Pooling Software
 2x Micron 256GB Memory Expanders each with
CXL/PCIe Gen5x8 link
Memory
Expansion
Module
Photowave
Card
 Nvidia GPU running LLM inference
 All VMs access to CXL memory
 Secure application, encrypted data
Photowave Card
© Lightelligence, Inc.
10
LLM Model List
Model Weight Memory(float16)
KV-Cache per
sample(float16)
Activation per
sample(float16)
Context length
OPT-1.3B 2.4 GB 0.095 GB 0.002 GB 512
OPT-13B 23.921 GB 0.397 GB 0.005 GB 512
OPT-30B 55.803 GB 0.667 GB 0.007 GB 512
OPT-66B 122.375 GB 1.143 GB 0.009 GB 512
OPT-175B 325GB 2.285GB 0.012GB 512
KV-cache Size: data_type * dimension* num_layers* batch_size * Context_len * 2
e.g., for opt-1.3B, FP16 -> 2Bytes * 2048 * 24 * 1 * 512 * 2 = 100,663,296 Bytes
Activation Size: data_type * dimension * batch_size * Context_len
Entire OPT-66B model fits within one 128GB CXL memory expander
© Lightelligence, Inc.

CXL: 882MB/s, System Memory 857MB/s, Disk: 582MB/s, MemVerge: 493MB/s

CXL: 2365MB/s, System Memory: 2609MB/s, Disk: 1887MB/s, MemVerge: 2173MB/s
11
Results
~2.4x
© Lightelligence, Inc.
OPT-66B
model results
Disk
(NVMe)
CXL
Memory
System
Memory
MemVerge
60:40Policy
Decode
Throughput
(Tokens/s)
1.984 4.868 6.216 6.237
Decode
Latency(s)
338.7 138.2 108.1 107.7
12
PHOTOWAVETM OPTICALCXL MEMORY EXPANDER
© Lightelligence, Inc.
PHOTOWAVETM OPTICALCXL MEMORY EXPANDER
CXL
GPU
UTILIZATION
GPU MEM.
UTILIZATION
CPU
UTILIZATION
MEM.
UTILIZATION
CXL MEM.
UTILIZATION
DECODE
THROUGHPUT
GENOA
AMD
CPU
SAMSUNG
CXL 128GB
NVIDIA
GPU: 1xA10 24GB
OPT-66B MODEL PROGRESS 99%
TOKENS/S
PARAMETERS
INFERENCE ENGINE: FLEXGEN
KV CACHE: 109.688GB
RUN MODE: CXL
WEIGHTS: 122.375GB
95%
77% 51%
27%
77%
CXL DIS
K
✔️ ✔️
13
© Lightelligence, Inc.
NVMe
Summary of Results
CXL memory offloading is efficient and
beneficial
 LLM inference case study
 Allows use of lower cost memory
Similar performance compared to pure
system memory
1.9xTCO improvement with
inexpensiveGPUs at similar
throughput
2.4x performance advantage compared
to SSD/NVMe disk offloading
14
© Lightelligence, Inc.
PhotowaveTM Form Factors
 CXL 2.0/PCIe Gen5 x16
 Jitter reduction, SI cleanup
 Sideband signals over optics
 x8, x4 or x2 bifurcation
 End-to-end latency:
 Card: under 20ns + TOF
 AOC: 1ns + TOF
Low ProfilePCIeCard OCP3.0SFFCard ActiveOpticalCables
ProductSuite Features
15
© Lightelligence, Inc.
Endnotes
Hardware
configuration
Super Micro Server
 AMD EPYC 9124 16-Core
CPU
 Samsung DDR5 4800 MT/s
 MEM0 size: 256GB
 MEM1 size: 256GB
 Bandwidth: 307GB/s
Nvidia GPU
 Gen4x16, DMEM size: 24GB
 Bandwidth: 32GB/s
Samsung NVME
 Gen4x4, MEM size: 1.92TB
 Bandwidth: 8GB/s
Micron CXL Memory
 Gen5x8, MEM size: 256GB
 Bandwidth: 32GB/s
 LLM: OPT-66B
 Batch size = 24
 Context length = 512
 Output length = 8
 FlexGen
Algorithm&Software
16
© Lightelligence, Inc.

Weitere ähnliche Inhalte

Ähnlich wie Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute Architectures ​

Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red_Hat_Storage
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bk
IBM Switzerland
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & Analysis
NomanSiddiqui41
 

Ähnlich wie Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute Architectures ​ (20)

Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
 
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallQ1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
 
IBM Power Systems: Designed for Data
IBM Power Systems: Designed for DataIBM Power Systems: Designed for Data
IBM Power Systems: Designed for Data
 
Micron CXL product and architecture update
Micron CXL product and architecture updateMicron CXL product and architecture update
Micron CXL product and architecture update
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
Full scan frenzy at amadeus
Full scan frenzy at amadeusFull scan frenzy at amadeus
Full scan frenzy at amadeus
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power Systems
 
11540800.ppt
11540800.ppt11540800.ppt
11540800.ppt
 
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
Performance of State-of-the-Art Cryptography on ARM-based MicroprocessorsPerformance of State-of-the-Art Cryptography on ARM-based Microprocessors
Performance of State-of-the-Art Cryptography on ARM-based Microprocessors
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bk
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & Analysis
 
MemVerge - The Dawn of Big Memory
MemVerge - The Dawn of Big MemoryMemVerge - The Dawn of Big Memory
MemVerge - The Dawn of Big Memory
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 

Mehr von Memory Fabric Forum

Mehr von Memory Fabric Forum (20)

H3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptxH3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptx
 
Q1 Memory Fabric Forum: ZeroPoint. Remove the waste. Release the power.
Q1 Memory Fabric Forum: ZeroPoint. Remove the waste. Release the power.Q1 Memory Fabric Forum: ZeroPoint. Remove the waste. Release the power.
Q1 Memory Fabric Forum: ZeroPoint. Remove the waste. Release the power.
 
Q1 Memory Fabric Forum: Building Fast and Secure Chips with CXL IP
Q1 Memory Fabric Forum: Building Fast and Secure Chips with CXL IPQ1 Memory Fabric Forum: Building Fast and Secure Chips with CXL IP
Q1 Memory Fabric Forum: Building Fast and Secure Chips with CXL IP
 
Q1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and Devices
Q1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and DevicesQ1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and Devices
Q1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and Devices
 
Q1 Memory Fabric Forum: About MindShare Training
Q1 Memory Fabric Forum: About MindShare TrainingQ1 Memory Fabric Forum: About MindShare Training
Q1 Memory Fabric Forum: About MindShare Training
 
Q1 Memory Fabric Forum: CXL-Related Activities within OCP
Q1 Memory Fabric Forum: CXL-Related Activities within OCPQ1 Memory Fabric Forum: CXL-Related Activities within OCP
Q1 Memory Fabric Forum: CXL-Related Activities within OCP
 
Q1 Memory Fabric Forum: CXL Controller by Montage Technology
Q1 Memory Fabric Forum: CXL Controller by Montage TechnologyQ1 Memory Fabric Forum: CXL Controller by Montage Technology
Q1 Memory Fabric Forum: CXL Controller by Montage Technology
 
Q1 Memory Fabric Forum: Teledyne LeCroy | Austin Labs
Q1 Memory Fabric Forum: Teledyne LeCroy | Austin LabsQ1 Memory Fabric Forum: Teledyne LeCroy | Austin Labs
Q1 Memory Fabric Forum: Teledyne LeCroy | Austin Labs
 
Q1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product LineupQ1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product Lineup
 
Q1 Memory Fabric Forum: CXL Form Factor Primer
Q1 Memory Fabric Forum: CXL Form Factor PrimerQ1 Memory Fabric Forum: CXL Form Factor Primer
Q1 Memory Fabric Forum: CXL Form Factor Primer
 
Q1 Memory Fabric Forum: Memory Fabric in a Composable System
Q1 Memory Fabric Forum: Memory Fabric in a Composable SystemQ1 Memory Fabric Forum: Memory Fabric in a Composable System
Q1 Memory Fabric Forum: Memory Fabric in a Composable System
 
Q1 Memory Fabric Forum: Big Memory Computing for AI
Q1 Memory Fabric Forum: Big Memory Computing for AIQ1 Memory Fabric Forum: Big Memory Computing for AI
Q1 Memory Fabric Forum: Big Memory Computing for AI
 
Q1 Memory Fabric Forum: Micron CXL-Compatible Memory Modules
Q1 Memory Fabric Forum: Micron CXL-Compatible Memory ModulesQ1 Memory Fabric Forum: Micron CXL-Compatible Memory Modules
Q1 Memory Fabric Forum: Micron CXL-Compatible Memory Modules
 
Q1 Memory Fabric Forum: Compute Express Link (CXL) 3.1 Update
Q1 Memory Fabric Forum: Compute Express Link (CXL) 3.1 UpdateQ1 Memory Fabric Forum: Compute Express Link (CXL) 3.1 Update
Q1 Memory Fabric Forum: Compute Express Link (CXL) 3.1 Update
 
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
 
Q1 Memory Fabric Forum: XConn CXL Switches for AI
Q1 Memory Fabric Forum: XConn CXL Switches for AIQ1 Memory Fabric Forum: XConn CXL Switches for AI
Q1 Memory Fabric Forum: XConn CXL Switches for AI
 
Q1 Memory Fabric Forum: VMware Memory Vision
Q1 Memory Fabric Forum: VMware Memory VisionQ1 Memory Fabric Forum: VMware Memory Vision
Q1 Memory Fabric Forum: VMware Memory Vision
 
MemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the Budget
 
Micron - CXL Enabling New Pliability in the Modern Data Center.pptx
Micron - CXL Enabling New Pliability in the Modern Data Center.pptxMicron - CXL Enabling New Pliability in the Modern Data Center.pptx
Micron - CXL Enabling New Pliability in the Modern Data Center.pptx
 
MemVerge: Past Present and Future of CXL
MemVerge: Past Present and Future of CXLMemVerge: Past Present and Future of CXL
MemVerge: Past Present and Future of CXL
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 

Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute Architectures ​

  • 1. Harnessing light to power new possibilities Advantages of Optical CXL for Disaggregated Compute Architectures Ron Swartzentruber Director of Engineering
  • 2. 2 Agenda  Memory centric shift in the data center  AI Large Language Model growth  Need for optical CXL technology  Case study: OPT inference benefits using optical CXL © Lightelligence, Inc.
  • 3. 3 Physical Machine 0 Virtual Machine 0 Virtual Machine 1 Stranded resource Physical Machine 1 Stranded resource Virtual Machine 2 Virtual Machine 3 FLEXIBLE MANAGEABLE ECONOMICAL OPEN Physical Machine 1 Physical Machine 0 Physical Machine 2 …… …… …… Virtual Machine 0 Virtual Machine 1 Virtual Machine 2 Virtual Machine 3 Disaggregation is the Future for Datacenter Virtual Machine 4 CPU cores DRAM Accelerators © Lightelligence, Inc.
  • 4. 4 AI trends  AI and Large Language Models will continue to grow and consume more compute  Disaggregated memory architectures are required in order to continue to scale  Optical interconnects are required to extend reach Source: https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8 Source: https://hc34.hotchips.org/assets/program/tutorials/CXL/Hot%20Chips%202022%20CXL%20MemoryChallenges.pdf © Lightelligence, Inc.
  • 5. 5 Optical Interconnect Latency © Lightelligence, Inc. 100s of ns 100s of 𝜇s
  • 6. 6 CXL is the PredominantStandard for Disaggregation Cache- coherence Latency Memory decouple CXL Yes ~100ns Supported RDMA (ethernet) No ~3μs Not supported CXL 2.0 Switch Standardized Fabric Manager H1 H2 H3 H4 H# …… CXL2.0 CXL2.0 CXL2.0 CXL2.0 CXL2.0 CXL1.0 CXL2.0 CXL2.0 CXL2.0 CXL2.0 D1 D2 D3 D4 …… D# © Lightelligence, Inc.
  • 7. 7 OpticalCXL is Required forScaling ATTENUATION (DB) 0 -10 -20 -30 -40 -50 PROPAGATION DISTANCE (M) 1m 10m 0 -0.003 -4 -40 Copper Optics Assuming AWG26 wire, PCIe 5.0 signal 32 cables with diameter > 6mm (CAT8) 16 fibers with diameter of 0.125mm … … 6mm > 30 mm Copper <1mm Optics Supporting 4x PCIe 5.0 x16 © Lightelligence, Inc.
  • 8. 8 OpticalCXL in the Datacenter Compute Break Through the Rack! Memory Banks © Lightelligence, Inc.
  • 9. 9 Case study: LLM Inference CXL Memory Expander CXL Memory Expander Server 2x CXL 1.1 CPUs  2U Supermicro server  2x AMD Genoa CXL 1.1 CPUs  MemVerge Memory Tiering and Pooling Software  2x Micron 256GB Memory Expanders each with CXL/PCIe Gen5x8 link Memory Expansion Module Photowave Card  Nvidia GPU running LLM inference  All VMs access to CXL memory  Secure application, encrypted data Photowave Card © Lightelligence, Inc.
  • 10. 10 LLM Model List Model Weight Memory(float16) KV-Cache per sample(float16) Activation per sample(float16) Context length OPT-1.3B 2.4 GB 0.095 GB 0.002 GB 512 OPT-13B 23.921 GB 0.397 GB 0.005 GB 512 OPT-30B 55.803 GB 0.667 GB 0.007 GB 512 OPT-66B 122.375 GB 1.143 GB 0.009 GB 512 OPT-175B 325GB 2.285GB 0.012GB 512 KV-cache Size: data_type * dimension* num_layers* batch_size * Context_len * 2 e.g., for opt-1.3B, FP16 -> 2Bytes * 2048 * 24 * 1 * 512 * 2 = 100,663,296 Bytes Activation Size: data_type * dimension * batch_size * Context_len Entire OPT-66B model fits within one 128GB CXL memory expander © Lightelligence, Inc.
  • 11.  CXL: 882MB/s, System Memory 857MB/s, Disk: 582MB/s, MemVerge: 493MB/s  CXL: 2365MB/s, System Memory: 2609MB/s, Disk: 1887MB/s, MemVerge: 2173MB/s 11 Results ~2.4x © Lightelligence, Inc. OPT-66B model results Disk (NVMe) CXL Memory System Memory MemVerge 60:40Policy Decode Throughput (Tokens/s) 1.984 4.868 6.216 6.237 Decode Latency(s) 338.7 138.2 108.1 107.7
  • 12. 12 PHOTOWAVETM OPTICALCXL MEMORY EXPANDER © Lightelligence, Inc.
  • 13. PHOTOWAVETM OPTICALCXL MEMORY EXPANDER CXL GPU UTILIZATION GPU MEM. UTILIZATION CPU UTILIZATION MEM. UTILIZATION CXL MEM. UTILIZATION DECODE THROUGHPUT GENOA AMD CPU SAMSUNG CXL 128GB NVIDIA GPU: 1xA10 24GB OPT-66B MODEL PROGRESS 99% TOKENS/S PARAMETERS INFERENCE ENGINE: FLEXGEN KV CACHE: 109.688GB RUN MODE: CXL WEIGHTS: 122.375GB 95% 77% 51% 27% 77% CXL DIS K ✔️ ✔️ 13 © Lightelligence, Inc. NVMe
  • 14. Summary of Results CXL memory offloading is efficient and beneficial  LLM inference case study  Allows use of lower cost memory Similar performance compared to pure system memory 1.9xTCO improvement with inexpensiveGPUs at similar throughput 2.4x performance advantage compared to SSD/NVMe disk offloading 14 © Lightelligence, Inc.
  • 15. PhotowaveTM Form Factors  CXL 2.0/PCIe Gen5 x16  Jitter reduction, SI cleanup  Sideband signals over optics  x8, x4 or x2 bifurcation  End-to-end latency:  Card: under 20ns + TOF  AOC: 1ns + TOF Low ProfilePCIeCard OCP3.0SFFCard ActiveOpticalCables ProductSuite Features 15 © Lightelligence, Inc.
  • 16. Endnotes Hardware configuration Super Micro Server  AMD EPYC 9124 16-Core CPU  Samsung DDR5 4800 MT/s  MEM0 size: 256GB  MEM1 size: 256GB  Bandwidth: 307GB/s Nvidia GPU  Gen4x16, DMEM size: 24GB  Bandwidth: 32GB/s Samsung NVME  Gen4x4, MEM size: 1.92TB  Bandwidth: 8GB/s Micron CXL Memory  Gen5x8, MEM size: 256GB  Bandwidth: 32GB/s  LLM: OPT-66B  Batch size = 24  Context length = 512  Output length = 8  FlexGen Algorithm&Software 16 © Lightelligence, Inc.

Hinweis der Redaktion

  1. Key message: CXL is industry consensus for disaggregation
  2. MemVerge policy: System memory 60%, CXL Memory 40%
  3. What is CPU% ******************************************************************************** CPU% mem 29.525 cxl 31.431000000000004 disk 81.46 main.py:36: MatplotlibDeprecationWarning: Calling gca() with keyword arguments was deprecated in Matplotlib 3.4. Starting two minor releases later, gca() will take no keyword arguments. The gca() function should only be used to get the current axes, or if no axes exist, create new axes with default keyword arguments. To create a new axes with non-default arguments, use plt.axes() or plt.subplot(). ax = plt.gca(facecolor='black') ******************************************************************************** MEM% mem 27.2 cxl 27.2 disk 11.722000000000001 ******************************************************************************** GPU% mem 99.49 cxl 97.05 disk 53.01 ******************************************************************************** CXLMEM% mem 0.0016306192454823602 cxl 77.11430249904593 disk 0.1663918208702139 ******************************************************************************** GPUMEM% mem 45.17 cxl 49.0 disk 34.71 ******************************************************************************** GPUMEM_USED_MB mem 9213.4375 cxl 9213.4375 disk 8979.4375 ******************************************************************************** PCI_TX_MBps mem 274.2578125 cxl 191.357421875 disk 81.064453125 ******************************************************************************** PCI_RX_MBps mem 2007.3828125 cxl 1422.421875 disk 1158.056640625 (tfpy38) hussainazhar@Hussains-MacBook-Air T4gpu % python main.py ******************************************************************************** CPU% mem 29.525 cxl 31.431000000000004 disk 81.46 main.py:36: MatplotlibDeprecationWarning: Calling gca() with keyword arguments was deprecated in Matplotlib 3.4. Starting two minor releases later, gca() will take no keyword arguments. The gca() function should only be used to get the current axes, or if no axes exist, create new axes with default keyword arguments. To create a new axes with non-default arguments, use plt.axes() or plt.subplot(). ax = plt.gca(facecolor='black') ******************************************************************************** MEM% mem 27.2 cxl 27.2 disk 11.722000000000001 ******************************************************************************** GPU% mem 99.49 cxl 97.05 disk 53.01 ******************************************************************************** CXLMEM% mem 0.0016306192454823602 cxl 77.11430249904593 disk 0.1663918208702139 ******************************************************************************** GPUMEM% mem 45.17 cxl 49.0 disk 34.71 ******************************************************************************** GPUMEM_USED_MB mem 9213.4375 cxl 9213.4375 disk 8979.4375 ******************************************************************************** PCI_TX_MBps mem 274.2578125 cxl 191.357421875 disk 81.064453125 ******************************************************************************** PCI_RX_MBps mem 2007.3828125 cxl 1422.421875 disk 1158.056640625