SlideShare a Scribd company logo
1 of 22
Download to read offline
Mellanox Technologies
Technion - Israel Institute of Technology
Compiler, Architecture and Tools Conference (CATC)
Dec 4, 2017, Haifa, Israel
Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro,
Liran Liss, Sagi Grimberg, Muli Ben-Yehuda, Nadav Amit, Dan Tsafrir
1
Direct IO
 Provide applications / Guest OS direct
access to HW
 Examples
 SRIOV and PCIe pass-through
 RDMA: Remote Direct Memory Access
 User-level packet processing (DPDK)
 Eliminate overheads:
 Context switch / Hypercall
 Memory copies
 Customization: users can make their
own tradeoffs
 Latency vs. throughput
 CPU vs memory
 Custom API
2*Peter et. al., Arrakis: The Operating System is the Control Plane, OSDI’14
*
IOuser and IOprovider
HW
HypervisorOS
VM
OS
Driver
APP
User
DriverIOuser
IOprovider
3
Address translation
 Direct I/O requires
address translation
 Provide the same
isolation needed in CPU
 Most devices do not
tolerate page faults
4
IOuser
Page faults are
allowed
Page faults are
not allowed
DirectIO mandates pinning
Static pinning:
 Hypervisors pin guest memory for PCI pass-through
 RDMA registration pins memory
 DPDK buffers are pinned
 Benefits:
 Easier to implement
 Predictable IOuser performance
5
The price of pinning
 No canonical memory optimizations, achieved using
virtual memory and page faults:
Demand paging Over
commitment
Page migration
Delayed allocation Swapping NUMA migration
Mmap-ed files Deduplication Compaction
Calloc with zero
page
Copy on write Transparent huge
pages
Increase startup
time
Increase
memory usage
Decrease
performance
6
Partial solutions
 Fine-grained pinning: pin before use, release after
 Overhead voids benefits of DirectIO
 Course-grained pinning: pin-down cache
 Common in high-performance RDMA setups
 Cons:
 Complicates programming model
 Untrusted IOuser makes policy decisions
 Copying data to/from a static pinned buffer
 Time wasted on copying
7
We need page faults!
 Allow devices to access non-pinned buffers
 Similarly to how page faults enable CPU virtual-memory
 Any valid IOuser buffer could be used
Have the cake and eat it:
 Provide the benefits of DirectIO
 Allow virtual memory based optimizations
 Keep memory management policy in the IOprovider
8
IO paging today
 State of the art IO page faults
 PCIe ATS/PRI (Page Request Interface)
 Intel VT-d SVM (Shared Virtual Memory)
 AMD PPR (Peripheral Page Request)
 NVIDIA Unified Memory
 IBM CAPI
 Target accelerators and GPUs:
 Execution is suspended until page fault is
resolved
 Blocking a thread
 Suspending a GPU kernel
 Networking is different…
9
Not widely deployed yet –
we used the NIC’s internal
IOMMU
Resolve
page
fault
The receive page fault challenge
 Incoming network data cannot be stopped!
 Buffer?
 Expensive and rarely used (e.g. 100 Gbps x 10 ms = 125 MB)
 Back-pressure the network?
 Adversely affect unrelated flows
 Drop?
 Retransmission timeout reduces performance
10
NIC Memory
RDMA receive page faults
 Network and transport
properties
 Lossless network
 Typically large retransmission
timeouts
 Provides reliable point-to-
point connections in HW
Solution: pause sender
 Page fault is detected on the
device, which implements
the transport
 We modified the
Connect-IB firmware to:
 Tell sender to pause
 Ask driver to resolve page
fault
 Drop subsequent packets
NPF
Dropped
traffic
RNR
delay
2
3
1
4
5
6
4
3
5
6
Initiator Responder
12
Ethernet receive page faults
 Network properties
 Lossy network
 Typically SW transport, at
IOuser
 Transport layer cannot
intervene during page fault
 Just dropping packets and
using TCP retransmission
doesn’t cut it!
13
Solution: backup ring
 Direct-IO performance
as long as there are no faults
 Fallback due to page
faults, similar to para-
virtualization:
 Packets are stored in a
pinned ring on the
IOprovider
 Copied to the IOuser,
after the PF is resolved
IOuser
IOprovider
NIC pf
14
15
Evaluation
 Implementation:
 Page fault support
for RDMA in
Connect-IB
firmware
 Emulated Ethernet
backup-ring by
duplicating packets
to both rings
16
Periodic IOPF
Major/min
or doesn’t
matter
90% efficiency at
2-15 page fault
frequency
17
 Standard iSER initiator
(iSCSI Extensions for RDMA)
 Stock Linux kernel
 fio benchmark
 Random 64KB reads
 Modified iSER target
 Based on open-source
tgt target implementation
 Staging buffer: 1GB of 512 KB
chunks.
Storage evaluation
18
TGT target
Staging buffer
Linux
Page Cache
Initiator Initiator Initiator…
Compete
on physical
memory
Storage evaluation
19
HPC evaluation
 Added MPI page fault support
 Implemented in the MXM communication library
 Obsoleted ~10K’s of LOC of pin-down cache
 Application benchmarks
 IMB (Intel MPI benchmark) application suite
 B_eff (effective bandwidth) benchmark
20
HPC Evaluation
Intel MPI Benchmarks
B_eff effective bandwidth (MB/sec)
21
Conclusions
 With network page-
faults you can have:
 Direct IO performance
 Virtual memory
canonical optimizations
 Simplified
programming model
 Receive page faults pose
a unique challenge
 Future work
 Better integration of IO
page faults
 Dirty and access bits
 Eviction policy
22
Questions?

More Related Content

Similar to Page Fault Support for Network Controllers (CATC'17)

HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...The Linux Foundation
 
Offloading for Databases - Deep Dive
Offloading for Databases - Deep DiveOffloading for Databases - Deep Dive
Offloading for Databases - Deep DiveUniFabric
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterpriseAnas Kanzoua
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...Jim St. Leger
 
Persistent Memory Development Kit (PMDK) Essentials: Part 2
Persistent Memory Development Kit (PMDK) Essentials: Part 2Persistent Memory Development Kit (PMDK) Essentials: Part 2
Persistent Memory Development Kit (PMDK) Essentials: Part 2Intel® Software
 
Persistent Memory Development Kit (PMDK) Essentials: Part 1
Persistent Memory Development Kit (PMDK) Essentials: Part 1Persistent Memory Development Kit (PMDK) Essentials: Part 1
Persistent Memory Development Kit (PMDK) Essentials: Part 1Intel® Software
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangCeph Community
 
Sharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual MachinesSharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual Machinesinside-BigData.com
 
Virtualization overheads
Virtualization overheadsVirtualization overheads
Virtualization overheadsSandeep Joshi
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community6WIND
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfanil0878
 
Realizing Exabyte-scale PM Centric Architectures and Memory Fabrics
Realizing Exabyte-scale PM Centric Architectures and Memory FabricsRealizing Exabyte-scale PM Centric Architectures and Memory Fabrics
Realizing Exabyte-scale PM Centric Architectures and Memory Fabricsinside-BigData.com
 

Similar to Page Fault Support for Network Controllers (CATC'17) (20)

NIO.pdf
NIO.pdfNIO.pdf
NIO.pdf
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
 
Offloading for Databases - Deep Dive
Offloading for Databases - Deep DiveOffloading for Databases - Deep Dive
Offloading for Databases - Deep Dive
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
 
slides
slidesslides
slides
 
Persistent Memory Development Kit (PMDK) Essentials: Part 2
Persistent Memory Development Kit (PMDK) Essentials: Part 2Persistent Memory Development Kit (PMDK) Essentials: Part 2
Persistent Memory Development Kit (PMDK) Essentials: Part 2
 
Persistent Memory Development Kit (PMDK) Essentials: Part 1
Persistent Memory Development Kit (PMDK) Essentials: Part 1Persistent Memory Development Kit (PMDK) Essentials: Part 1
Persistent Memory Development Kit (PMDK) Essentials: Part 1
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
509 512
509 512509 512
509 512
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
 
Sharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual MachinesSharing High-Performance Interconnects Across Multiple Virtual Machines
Sharing High-Performance Interconnects Across Multiple Virtual Machines
 
Virtualization overheads
Virtualization overheadsVirtualization overheads
Virtualization overheads
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community
 
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 
Realizing Exabyte-scale PM Centric Architectures and Memory Fabrics
Realizing Exabyte-scale PM Centric Architectures and Memory FabricsRealizing Exabyte-scale PM Centric Architectures and Memory Fabrics
Realizing Exabyte-scale PM Centric Architectures and Memory Fabrics
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Recently uploaded (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Page Fault Support for Network Controllers (CATC'17)

  • 1. Mellanox Technologies Technion - Israel Institute of Technology Compiler, Architecture and Tools Conference (CATC) Dec 4, 2017, Haifa, Israel Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Liran Liss, Sagi Grimberg, Muli Ben-Yehuda, Nadav Amit, Dan Tsafrir 1
  • 2. Direct IO  Provide applications / Guest OS direct access to HW  Examples  SRIOV and PCIe pass-through  RDMA: Remote Direct Memory Access  User-level packet processing (DPDK)  Eliminate overheads:  Context switch / Hypercall  Memory copies  Customization: users can make their own tradeoffs  Latency vs. throughput  CPU vs memory  Custom API 2*Peter et. al., Arrakis: The Operating System is the Control Plane, OSDI’14 *
  • 4. Address translation  Direct I/O requires address translation  Provide the same isolation needed in CPU  Most devices do not tolerate page faults 4 IOuser Page faults are allowed Page faults are not allowed
  • 5. DirectIO mandates pinning Static pinning:  Hypervisors pin guest memory for PCI pass-through  RDMA registration pins memory  DPDK buffers are pinned  Benefits:  Easier to implement  Predictable IOuser performance 5
  • 6. The price of pinning  No canonical memory optimizations, achieved using virtual memory and page faults: Demand paging Over commitment Page migration Delayed allocation Swapping NUMA migration Mmap-ed files Deduplication Compaction Calloc with zero page Copy on write Transparent huge pages Increase startup time Increase memory usage Decrease performance 6
  • 7. Partial solutions  Fine-grained pinning: pin before use, release after  Overhead voids benefits of DirectIO  Course-grained pinning: pin-down cache  Common in high-performance RDMA setups  Cons:  Complicates programming model  Untrusted IOuser makes policy decisions  Copying data to/from a static pinned buffer  Time wasted on copying 7
  • 8. We need page faults!  Allow devices to access non-pinned buffers  Similarly to how page faults enable CPU virtual-memory  Any valid IOuser buffer could be used Have the cake and eat it:  Provide the benefits of DirectIO  Allow virtual memory based optimizations  Keep memory management policy in the IOprovider 8
  • 9. IO paging today  State of the art IO page faults  PCIe ATS/PRI (Page Request Interface)  Intel VT-d SVM (Shared Virtual Memory)  AMD PPR (Peripheral Page Request)  NVIDIA Unified Memory  IBM CAPI  Target accelerators and GPUs:  Execution is suspended until page fault is resolved  Blocking a thread  Suspending a GPU kernel  Networking is different… 9 Not widely deployed yet – we used the NIC’s internal IOMMU Resolve page fault
  • 10. The receive page fault challenge  Incoming network data cannot be stopped!  Buffer?  Expensive and rarely used (e.g. 100 Gbps x 10 ms = 125 MB)  Back-pressure the network?  Adversely affect unrelated flows  Drop?  Retransmission timeout reduces performance 10 NIC Memory
  • 11. RDMA receive page faults  Network and transport properties  Lossless network  Typically large retransmission timeouts  Provides reliable point-to- point connections in HW
  • 12. Solution: pause sender  Page fault is detected on the device, which implements the transport  We modified the Connect-IB firmware to:  Tell sender to pause  Ask driver to resolve page fault  Drop subsequent packets NPF Dropped traffic RNR delay 2 3 1 4 5 6 4 3 5 6 Initiator Responder 12
  • 13. Ethernet receive page faults  Network properties  Lossy network  Typically SW transport, at IOuser  Transport layer cannot intervene during page fault  Just dropping packets and using TCP retransmission doesn’t cut it! 13
  • 14. Solution: backup ring  Direct-IO performance as long as there are no faults  Fallback due to page faults, similar to para- virtualization:  Packets are stored in a pinned ring on the IOprovider  Copied to the IOuser, after the PF is resolved IOuser IOprovider NIC pf 14
  • 15. 15
  • 16. Evaluation  Implementation:  Page fault support for RDMA in Connect-IB firmware  Emulated Ethernet backup-ring by duplicating packets to both rings 16
  • 17. Periodic IOPF Major/min or doesn’t matter 90% efficiency at 2-15 page fault frequency 17
  • 18.  Standard iSER initiator (iSCSI Extensions for RDMA)  Stock Linux kernel  fio benchmark  Random 64KB reads  Modified iSER target  Based on open-source tgt target implementation  Staging buffer: 1GB of 512 KB chunks. Storage evaluation 18 TGT target Staging buffer Linux Page Cache Initiator Initiator Initiator… Compete on physical memory
  • 20. HPC evaluation  Added MPI page fault support  Implemented in the MXM communication library  Obsoleted ~10K’s of LOC of pin-down cache  Application benchmarks  IMB (Intel MPI benchmark) application suite  B_eff (effective bandwidth) benchmark 20
  • 21. HPC Evaluation Intel MPI Benchmarks B_eff effective bandwidth (MB/sec) 21
  • 22. Conclusions  With network page- faults you can have:  Direct IO performance  Virtual memory canonical optimizations  Simplified programming model  Receive page faults pose a unique challenge  Future work  Better integration of IO page faults  Dirty and access bits  Eviction policy 22 Questions?