In this session I will use a simple HTTP benchmark to compare the performance of the Linux kernel networking stack with userspace networking powered by DPDK (kernel-bypass).
Kernel-bypass technologies are often said to avoid the kernel because it is "slow", but in reality much of the performance advantage they bring comes from enforcing certain constraints.
As it turns out, many of these constraints can be enforced without bypassing the kernel. If the system is tuned just right, one can achieve performance that approaches kernel-bypass speeds, while still benefiting from the kernel's battle-tested compatibility and rich ecosystem of tools.
Linux Kernel vs DPDK: HTTP Performance Showdown
1. Brought to you by
Linux Kernel vs DPDK: HTTP Performance Showdown
Marc Richards
Performance Engineer at Amazon Web Services
2. Marc Richards
Performance Engineer at Amazon Web Services (formerly Talawah Solutions)
● Recently moved from Kingston to Toronto
● DevOps Engineer turned Performance Engineer
● Interested in exploring performance in the cloud
4. What is kernel-bypass?
● Bypass the Linux networking stack: data goes straight from the NIC/driver to the userspace application
● It is up to the application to implement (or not) the features that the kernel normally provides. Ideal when performance is more important than certain features, e.g. ISPs, CDNs, HFT
● It can also be used to build HTTP servers, but the application would need to implement a TCP/IP stack
5. In Defense of the Kernel
● Most kernel vs bypass comparisons are done without much optimization on the kernel side
● The kernel is multi-purpose, so it isn't perfectly optimized for high-speed networking by default
● I wanted to know what the performance gap would look like when a finely tuned kernel goes head to head with kernel-bypass
6. It isn't all about bypass
● Much of the "kernel-bypass" performance is not from bypassing the kernel, but from enforcing certain constraints
● These constraints can be replicated with the kernel as well
  ● (Semi) busy polling
  ● Perfect locality
  ● Simplified TCP/IP subsystem
7. Seastar and DPDK
● DPDK is a kernel-bypass project created by Intel, now run by The Linux Foundation
● Seastar is an open-source C++ framework for building high-performance server applications, sponsored by ScyllaDB
● Seastar supports building applications that use either the Linux kernel or DPDK for networking, and implements its own TCP/IP stack
8. Benchmark Setup
● Cloud: AWS
● Hardware: 4 vCPU c5n.xlarge (server) / 16 vCPU c5n.4xlarge (client)
● Software
  ● Amazon Linux 2022 (kernel 5.15)
  ● Seastar from GitHub w/ DPDK 19.05
  ● Simple JSON benchmark from Techempower
  ● Fake HTTP server called tcp_httpd
9. Blog Post with More Details
https://talawah.io/blog/linux-kernel-vs-dpdk-http-performance-showdown/
11. DPDK on AWS
● A lot of trial and error at first, but the ENA/DPDK docs have gotten much better
● Seastar uses an older version of DPDK that needs a specific fix backported to address a conflict with the ENA driver
● AWS also has some ENA patches for older versions of DPDK
● https://github.com/talawahtech/dpdk/tree/http-performance
12. DPDK on AWS
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 204.00us
90.00% 252.00us
99.00% 297.00us
99.99% 403.00us
5954189 requests in 5.00s, 0.86GB read
Requests/sec: 1,190,822.80
14. DPDK Optimization
● On newer EC2 instances the network driver supports an LLQ (Low Latency Queue) mode for improved performance
● You need to enable the write-combining feature of the VFIO kernel module, otherwise performance will suffer
● The VFIO module doesn't support write combining by default, but the ENA team has a patch to add it
15. DPDK Optimization
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 152.00us
90.00% 195.00us
99.00% 233.00us
99.99% 352.00us
7575198 requests in 5.00s, 1.09GB read
Requests/sec: 1,515,010.51
18. Kernel networking stack
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 696.00us
90.00% 0.85ms
99.00% 0.96ms
99.99% 1.10ms
1789658 requests in 5.00s, 264.55MB read
Requests/sec: 357,927.16
19. OS Level Optimizations
● Disable Speculative Execution Mitigations
● Configure RSS and XPS for perfect locality
● Interrupt Moderation and Busy Polling
● Disable Raw/Packet Sockets
● GRO and Congestion Control
● A few kernel 5.15 specific optimizations
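Busy polling can also be requested per socket rather than system-wide. A minimal sketch, assuming Linux: `SO_BUSY_POLL` (option number 46 in the Linux UAPI headers) is not exported by Python's `socket` module, so the constant is hard-coded here, and the kernel refuses values above `net.core.busy_read` without CAP_NET_ADMIN, which the helper reports rather than raising. This is an illustration, not the configuration used in the talk.

```python
import errno
import socket

# SO_BUSY_POLL is not exported by Python's socket module; 46 is its value
# in the Linux UAPI headers (assumption: Linux only).
SO_BUSY_POLL = 46

def enable_busy_poll(sock: socket.socket, usec: int = 50) -> bool:
    """Ask the kernel to busy-poll this socket's receive queue for up to
    `usec` microseconds before sleeping. Returns False (instead of raising)
    when the kernel refuses: EPERM without CAP_NET_ADMIN, or ENOPROTOOPT
    on kernels/platforms without the option."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usec)
        return True
    except OSError as e:
        if e.errno in (errno.EPERM, errno.ENOPROTOOPT):
            return False
        raise

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("busy polling enabled:", enable_busy_poll(s))
    s.close()
```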
20. OS Level Optimizations
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 347.00us
90.00% 455.00us
99.00% 564.00us
99.99% 758.00us
3630818 requests in 5.00s, 536.71MB read
Requests/sec: 726,153.58
24. Context Switching
sar -w 1

libreactor:
01:13:50 AM proc/s cswch/s
01:13:57 AM 0.00 277.00
01:13:58 AM 0.00 229.00
01:13:59 AM 0.00 290.00

tcp_httpd:
01:03:03 AM proc/s cswch/s
01:03:04 AM 0.00 17132.00
01:03:05 AM 0.00 17060.00
01:03:06 AM 0.00 17048.00
25. Context Switching
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 257.00us
90.00% 296.00us
99.00% 337.00us
99.99% 557.00us
4820680 requests in 5.00s, 712.59MB read
Requests/sec: 964,121.54
26. It is better to RECV and Remember to Flush
● The recv syscall is slightly faster than the read syscall
● batch_flushes = false
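The recv-vs-read point is easy to see from userspace: on a connected socket both syscalls return the same bytes, recv just skips a little generic file-layer indirection in the kernel. A quick illustration using a socketpair (not the benchmark code itself):

```python
import os
import socket

def compare_recv_read():
    """Read the same 4-byte message once via recv(2) and once via read(2)
    to show the two syscalls are interchangeable on a connected socket."""
    a, b = socket.socketpair()
    try:
        a.sendall(b"ping")
        via_recv = b.recv(4)               # recv(2) path
        a.sendall(b"ping")
        via_read = os.read(b.fileno(), 4)  # read(2) path
    finally:
        a.close()
        b.close()
    return via_recv, via_read
```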
27. It is better to RECV and Remember to Flush
Running 5s test @ http://172.31.XX.XX:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 246.00us
90.00% 288.00us
99.00% 333.00us
99.99% 436.00us
5038933 requests in 5.00s, 744.85MB read
Requests/sec: 1,007,771.89
29. DPDK Caveats
● Niche technology
● Bypassing the kernel's time-tested networking stack and ecosystem
● Poll-mode processing = higher CPU usage
● It is important to make sure you balance your priorities
30. Conclusion
● I see that 51% gap as an opportunity!
● To what extent can the Linux kernel be further optimized for thread-per-core applications without compromising its general-purpose nature?
● Syscall overhead is an area of interest; io_uring may be the answer
31. Brought to you by
Marc Richards
https://talawah.io/contact
@talawahtech
AWS Benchmarking is hiring!