SlideShare ist ein Scribd-Unternehmen logo
1 von 199
Downloaden Sie, um offline zu lesen
SAMSUNG OPEN SOURCE CONFERENCE 2019
SOSCON
Faster Packet Processing in Linux: XDP
Kosslab | SoftwareEngineer | 이호연
2019.10.17
1
이호연 @Daniel T. Lee
- Kosslab Software Engineer
- Opensource Developer
(Linux Kernel – BPF, XDP, uftrace, etc.)
2
• Packet Processing in Linux
• How fast are we talking?
• What is XDP?
• How to use XDP?
• More about XDP
Today’s agenda
3
Packet Processing in Linux
From basic path to processing hooks
4
Packet Path in Kernel
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
5
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
6
Packet Forwarding Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
7
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
8
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
9
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
10
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
11
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
12
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
13
$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0
10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0
10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1
192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0
_gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
Packet Receiving Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
14
Packet Forwarding Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
15
Packet Forwarding Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
16
Packet Forwarding Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
17
Packet Forwarding Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
18
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
19
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
20
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
21
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
22
Packet Sending Path
DEVICE DRIVER
UPPER LAYER
Ingress
PROTO HANDLER ROUTING FORWARDING
OUTPUT
INPUT
NEIGH
ROUTING
Egress
L4~
L3
L2
23
Packet processing?
24
Packet Processing
• Filtering (Drop, Pass)
• Forwarding (Routing)
• NAT (Masquerade, Source, Destination)
• Packet Tunneling (Encapsulation)
• Packet Mangling (ToS, TTL, mark)
25
Hooks to process packet?
26
Hooks to process packet?
L2
L3
L7 Userspace
IP (Netfliter)
Traffic Control
DD (XDP)
NIC
27
Linux Socket
Packet processing by receive message
Userspace
28
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Packet-Processing Path
29
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Userspace Packet Path
30
Userspace
31
Any kind of packet processing is possible!
int fd = socket(AF_INET, SOCK_DGRAM, 0);
struct sockaddr_in in;
int res, pkts, bytes;
char buf[MTU_SIZE];
memset(&in, 0, sizeof(in));
in.sin_family = AF_INET;
in.sin_addr.s_addr = INADDR_ANY;
in.sin_port = htons(1234);
if (bind(fd, (struct sockaddr*)&in, sizeof(in)) < 0)
exit(EXIT_FAILURE);
while (1) {
int res = read(fd, buf, MTU_SIZE);
if (res <= 0)
return 0;
pkts += 1;
bytes += r;
}
Linux Firewall Framework
set of hooks inside the Network stack
Netfilter
32
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Packet Path
33
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Packet Path
modulemodule
module
module
module
By registering kernel modules to hooks,
Intercepts network traffic
ex) iptables…
34
Netfilter
35
By using iptables, nftables, etc..
Packet Processing is available
- Packet filtering
- NAT (network address translation)
- Packet Mangling.
- Stateless/Stateful Firewalling
- ETC..
Netfilter Example 1 – Packet Filtering
Netfilter (iptables) packet drop
DROP UDP 10.1.0.2:1234
$ iptables -A INPUT -d 10.1.0.2 -p udp --dport 1234 -j DROP
36
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 1 – Packet Filtering
ROUTING
packetpacketpacket
10.1.0.2:1234
37
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 1 – Packet Filtering
DROP
Source IP Source Port Dest. IP Dest.Port Action
1 10.0.0.5 -- -- -- ALLOW
2 -- -- -- 22 DROP
3 -- -- 10.1.0.2 1234 DROP
ROUTING
packetpacketpacket
10.1.0.2:1234
38
Netfilter Example 2 – NAT
Netfilter (iptables) Destination NAT
NAT UDP 10.1.0.2:1234 -> 10.1.1.2:1234
$ iptables -t nat -A PREROUTING 
-d 10.1.0.2 -p udp --dport 1234 
-j DNAT --to-destination 10.1.1.2:1234
39
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 2 – NAT
ROUTING
packetpacketpacket
10.1.0.2:1234
40
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 2 – NAT
packetpacketpacket
ROUTING
$ iptables -t nat -A PREROUTING -d 10.1.0.2 -p udp --dport 1234 
-j DNAT --to-destination 10.1.1.2:1234
$ iptables -t nat -L PREROUTING
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DNAT udp -- anywhere 10.1.0.2 udp dpt:1234 to:10.1.1.2:1234
10.1.0.2:1234
41
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 2 – NAT
Source IP Source Port Dest. IP Dest.Port to
1 10.0.0.5 -- -- -- 10.0.0.2:1234
2 -- -- 10.1.0.255 80 10.1.0.1:443
3 -- -- 10.1.0.2 1234 10.1.1.2:1234
DNAT
packetpacketpacket
packet
10.1.0.2:1234
10.1.1.2:1234
42
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 2 – NAT
ROUTING
packetpacketpacket
packet
10.1.0.2:1234
10.1.1.2:1234
43
$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0
10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0
10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1
192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0
_gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Example 2 – NAT
packetpacketpacket
10.1.0.2:1234 packet
10.1.1.2:1234
44
ROUTING
Traffic Control – Packet Scheduler
Control network traffic with Queuing policy
TC
45
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Packet Path
Qdisc Qdisc
46
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Packet Path
Qdisc Qdisc
47
With Packet Scheduler (Qdisc),
can determine how to receive & transmit packets
ex) priority, delay, dropping
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Packet Path
Qdisc Qdisc
48
By attaching Filter to Qdisc,
User can take action with matched packet
TC
49
With filters attached to Qdisc
Packet Processing is available
- Packet filtering
- NAT (network address translation)
- Packet Mangling (skb_edit)
- Packet Forwarding (mirroring)
- QoS
- ETC..
TC Example 1 – Packet Filtering
TC Ingress packet drop
DROP UDP 10.1.0.2:1234
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action drop
50
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 1 – Packet Filtering
packet
packet
packet
ROUTING
$ tc qdisc add dev eth0 ingress
10.1.0.2:1234
51
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 1 – Packet Filtering
packet
packet
packet
ROUTING
10.1.0.2:1234
52
$ tc qdisc add dev eth0 ingress
Qdisc
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 1 – Packet Filtering
packet
packet
packet
ROUTING
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action drop
`
53
10.1.0.2:1234
Qdisc
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 1 – Packet Filtering
packet
packet
packet
ROUTING
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action drop
`
54
10.1.0.2:1234
Qdisc
DROP
TC Example 2 – NAT
TC Ingress Destination NAT
NAT UDP 10.1.0.2:1234 -> 10.1.1.2:1234
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe 
action mirred egress redirect dev eth1
55
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 2 – NAT
packet
packet
packet
ROUTING
$ tc qdisc add dev eth0 ingress
10.1.0.2:1234
56
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 2 – NAT
packet
packet
packet
ROUTING
10.1.0.2:1234
57
$ tc qdisc add dev eth0 ingress
Qdisc
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 2 – NAT
packet
packet
packet
ROUTING
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe 
action mirred egress redirect dev eth1
`
58
10.1.0.2:1234
Qdisc
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 2 – NAT
packet
packet
packet
ROUTING
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe 
action mirred egress redirect dev eth1
`
59
10.1.0.2:1234
Qdisc
DNAT
packet
10.1.1.2:1234
UPPER LAYER
XDP
TC Ingress
FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Example 2 – NAT
packet
packet
packet
ROUTING
60
10.1.0.2:1234
ROUTING
Qdisc Qdiscpacket
10.1.1.2:1234
action mirred egress redirect dev eth1
Redirect
XDP
eBPF based fast data-path
61
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
XDP Packet Path
62
eBPF based fast data-path
XDP
63
How FAST are we talking?
64
Packet Drop
From zero to 14 Mpps
65
packet
Test environment
NIC
Host A
User
IP
TC
DD
NIC
Host B10 Gbe
…
packet
packet
10.1.0.1 10.1.0.2:1234
66
Connected with 10Gbe Ethernet
packet
Test environment
NIC
Host A
User
IP
TC
DD
NIC
Host B10 Gbe
Pktgen : Send UDP packet
(linux/samples/pktgen)
$ ./pktgen.sh -i eth0 -m $DST_MAC -d 10.1.0.2 -p 1234
Result device: eth0
Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
...
dst_min: 10.1.0.2
dst_mac: $DST_MAC
udp_dst: 1234
Current: pkts-sofar: 100000 errors: 0
started: 7802657us stopped: 7862969us idle: 55us
cur_saddr: 10.1.0.1 cur_daddr: 10.1.0.2
cur_udp_dst: 1234 cur_udp_src: 109
cur_queue_map: 0 flows: 0
Result: OK: 60312(c60257+d55) usec, 100000 (60byte,0frags)
1658039pps 795Mb/sec (795858720bps) errors: 0
…
packet
packet
10.1.0.1 10.1.0.2:1234
67
Test environment
10.1.0.1
NIC
Host A
User
IP
TC
DD
NIC
Host B
10.1.0.2:1234
10 Gbe
DROP
Packet Drop / Single Core?
packet
10 Gbe
…
packet
packet
DROP UDP 10.1.0.2:1234
68
10Gbit = (10 * 10^9 bit) / (84 * 8bit)
= 14,880,952 pps (14Mpps)
Preamble
Ethernet Frame
Inter
Frame GapMAC.
Destination
MAC.
Source
Type PAYLOAD
(IP/IPv6/ARP…)
CRC
8 Bytes 6 Bytes 6 Bytes
2
Bytes 46-1500 Bytes 4 Bytes 12 Bytes
8 64 12
84
Theoretical speed of 10Gbe?
69
Hooks to process packet?
L2
L3
L7 Userspace
IP (Netfliter)
Traffic Control
DD (XDP)
NIC
70
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Userspace Packet Drop
DROP
71
char buf[MTU_SIZE];
while (1) {
int res = read(fd, buf, MTU_SIZE);
if (res <= 0)
return 0;
pkts += 1;
bytes += res;
}
user-drop.c
Userspace Packet Drop
UDP socket server
packet drop by reading socket
NIC
Host A
User
IP
TC
DD
NIC
Host B
DROP
10 Gbe
10.1.0.1 10.1.0.2:1234
72
Userspace Packet Drop
$ gcc –o user-drop user-drop.c
$ sudo ./user-drop
packets=778148 bytes=14006664
packets=782171 bytes=14079078
packets=784792 bytes=14126256
packets=786466 bytes=14156388
packets=784163 bytes=14114934
packets=782500 bytes=14085000
packets=783085 bytes=14095530
packets=783172 bytes=14097096
Average Packet Drop
783,063pps/core
≈ 530Mbit/s
73
Userspace Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
packet_rcv() {
skb_push();
consume_skb();
}
}
ip_rcv() {
ip_rcv_core.isra.25();
ip_rcv_finish() {
ip_rcv_finish_core.isra.23() {
udp_v4_early_demux() {
ip_check_mc_rcu();
ip_mc_sf_allow();
ipv4_dst_check();
ip_mc_validate_source() {
fib_validate_source() {
__fib_validate_source();
}
}
}
ip_local_deliver() {
nf_hook_slow() {
iptable_filter_hook [iptable_filter]() {
ipt_do_table [ip_tables]() {
__local_bh_enable_ip();
}
}
}
ip_local_deliver_finish() {
ip_protocol_deliver_rcu() {
raw_local_deliver();
udp_rcv() {
__udp4_lib_rcv() {
udp_unicast_rcv_skb.isra.64() {
udp_queue_rcv_skb() {
udp_queue_rcv_one_skb() {
__udp_enqueue_schedule_skb();
. . .
userspace DROP
74
≈ 530Mbit/s
Can it be faster?
75
YES!
By using Netfilter
76
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Packet Drop
77
DROP
$ iptables -A INPUT -d 10.1.0.2 -p udp --dport 1234 -j DROP
$ iptables -L INPUT
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP udp -- anywhere 10.1.0.2 udp dpt:1234
Netfilter Packet Drop
NIC
Host A
User
IP
TC
DD
NIC
Host B
DROP
10 Gbe
Netfilter (iptables)
packet drop at INPUT chain
10.1.0.1 10.1.0.2:1234
78
Netfilter Packet Drop
$ iptables -vxnL INPUT
Chain INPUT (policy ACCEPT 79 packets, 4318 bytes)
pkts bytes target prot opt in out source destination
2927763 134677098 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
4235054 194812484 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
5541468 254907528 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
6397986 294307356 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
7706050 354478300 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
9013710 414630660 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
10320569 474746174 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
11629331 534949226 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
12937107 595106922 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
14243987 655223402 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
15553372 715455112 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
16861793 775642478 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234
Average Packet Drop
1,266,730pps/core
≈ 860Mbit/s
79
Netfilter Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
packet_rcv() {
skb_push();
consume_skb();
}
}
ip_rcv() {
ip_rcv_core.isra.25();
ip_rcv_finish() {
ip_rcv_finish_core.isra.23() {
udp_v4_early_demux() {
ip_check_mc_rcu();
ip_mc_sf_allow();
ipv4_dst_check();
ip_mc_validate_source() {
fib_validate_source() {
__fib_validate_source();
}
}
}
ip_local_deliver() {
nf_hook_slow() {
iptable_filter_hook [iptable_filter]() {
ipt_do_table [ip_tables]() {
udp_mt();
__local_bh_enable_ip();
}
}
kfree_skb();
80
DROP
Netfilter Packet Drop
ip_local_deliver() {
nf_hook_slow() {
iptable_filter_hook [iptable_filter]() {
ipt_do_table [ip_tables]() {
__local_bh_enable_ip();
}
}
}
ip_local_deliver_finish() {
ip_protocol_deliver_rcu() {
raw_local_deliver();
udp_rcv() {
__udp4_lib_rcv() {
udp_unicast_rcv_skb.isra.64() {
udp_queue_rcv_skb() {
udp_queue_rcv_one_skb() {
__udp_enqueue_schedule_skb();
. . .
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
packet_rcv() {
skb_push();
consume_skb();
}
}
ip_rcv() {
ip_rcv_core.isra.25();
ip_rcv_finish() {
ip_rcv_finish_core.isra.23() {
udp_v4_early_demux() {
ip_check_mc_rcu();
ip_mc_sf_allow();
ipv4_dst_check();
ip_mc_validate_source() {
fib_validate_source() {
__fib_validate_source();
}
}
}
ip_local_deliver() {
nf_hook_slow() {
iptable_filter_hook [iptable_filter]() {
ipt_do_table [ip_tables]() {
udp_mt();
__local_bh_enable_ip();
}
}
kfree_skb();
UserspaceNetfliter
userspace
DROP
DROP
81
≈ 530Mbit/s
≈ 860Mbit/s
Is it fast enough?
82
83
NO!
With TC, it can be faster
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Ingress Packet Drop
Qdisc Qdisc
DROP
84
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action drop
$ tc filter show ingress dev eth0
...
match 0a010002/ffffffff at 16
match 000004d2/0000ffff at 20
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1
TC Ingress Packet Drop
NIC
Host A
User
IP
TC
DD
NIC
Host B
DROP
10 Gbe
10.1.0.1 10.1.0.2:1234
85
TC Ingress
packet drop with filter action
TC Ingress Packet Drop
$ tc -s filter show ingress dev eth0
…
match 0a010002/ffffffff at 16
match 000004d2/0000ffff at 20
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1 installed 62 sec used 62 sec
Action statistics:
Sent 154709776 bytes 3363256 pkt (dropped 3363260, ...)
Sent 345955052 bytes 7520762 pkt (dropped 7520767, ...)
Sent 537288326 bytes 11680181 pkt (dropped 11680186, ...)
Sent 728875750 bytes 15845125 pkt (dropped 15845130, ...)
Sent 920311328 bytes 20006768 pkt (dropped 20006772, ...)
Sent 1112155754 bytes 24177299 pkt (dropped 24177304, ...)
Sent 1276514214 bytes 27750309 pkt (dropped 27750309, ...)
Sent 1453930004 bytes 31607174 pkt (dropped 31607179, ...)
Sent 1645855758 bytes 35779473 pkt (dropped 35779477, ...)
Sent 1838232266 bytes 39961571 pkt (dropped 39961577, ...)
Sent 2029302696 bytes 44115276 pkt (dropped 44115282, ...)
Sent 2221122512 bytes 48285272 pkt (dropped 48285278, ...)
Sent 2412335864 bytes 52442084 pkt (dropped 52442089, ...)
Sent 2603632476 bytes 56600706 pkt (dropped 56600711, ...)
Average Packet Drop
4,083,820pps/core
≈ 2.75Gbit/s
86
TC Ingress Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
tcf_classify() {
u32_classify [cls_u32]() {
tcf_action_exec() {
tcf_gact_act [act_gact]();
}
}
}
kfree_skb(); DROP
87
TC Ingress Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
packet_rcv() {
skb_push();
consume_skb();
}
}
ip_rcv() {
ip_rcv_core.isra.25();
ip_rcv_finish() {
ip_rcv_finish_core.isra.23() {
udp_v4_early_demux() {
ip_check_mc_rcu();
ip_mc_sf_allow();
ipv4_dst_check();
ip_mc_validate_source() {
fib_validate_source() {
__fib_validate_source();
}
}
}
ip_local_deliver() {
nf_hook_slow() {
iptable_filter_hook [iptable_filter]() {
ipt_do_table [ip_tables]() {
udp_mt();
__local_bh_enable_ip();
}
}
kfree_skb();
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
tcf_classify() {
u32_classify [cls_u32]() {
tcf_action_exec() {
tcf_gact_act [act_gact]();
}
}
}
kfree_skb();
NetfliterTC
DROP
DROP
88
≈ 2.75Gbit/s
≈ 860Mbit/s
More faster?
89
90
Sure thing!
It's faster with XDP
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
XDP Packet Drop
DROP
91
XDP Packet Drop
DROP UDP 10.1.0.2:1234
SEC("xdp1")
int xdp_prog1(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
...
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
...
if (iph->daddr == _htonl(0xa010002)) {
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
...
if (uh->dest == htons(1234))
return XDP_DROP;
}
}
}
return XDP_PASS;
}
92
$ bpftool prog load ./xdp-drop.o /sys/fs/bpf/drop
$ bpftool prog show
...
24: xdp name xdp_prog1 tag 6f8c2e06dfa2abcb gpl
loaded_at 2019-10-11T17:17:33+0900 uid 0
xlated 544B jited 344B memlock 4096B map_ids 17
$ bpftool net attach xdp id 24 dev eth0
$ bpftool net
xdp:
eth0(8) driver id 24
XDP Packet Drop
NIC
Host A
User
IP
TC
DD
NIC
Host B
DROP
10 Gbe
10.1.0.1 10.1.0.2:1234
93
XDP
packet drop with XDP_DROP
$ ethtool -S eth0 | grep rx_xdp_drop
rx_xdp_drop: 6161010
rx_xdp_drop: 16114329
rx_xdp_drop: 26079532
rx_xdp_drop: 36025920
rx_xdp_drop: 45958488
rx_xdp_drop: 55869376
rx_xdp_drop: 65835136
rx_xdp_drop: 75748898
rx_xdp_drop: 85670845
rx_xdp_drop: 95635591
rx_xdp_drop: 105569230
rx_xdp_drop: 115515714
rx_xdp_drop: 125480832
rx_xdp_drop: 135432211
rx_xdp_drop: 1455190421
XDP Packet Drop
Average Packet Drop
9,941,337pps/core
≈ 6.69Gbit/s
94
XDP Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
} DROP
95
XDP Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
napi_gro_receive() {
netif_receive_skb_internal() {
skb_defer_rx_timestamp();
__netif_receive_skb() {
__netif_receive_skb_one_core() {
__netif_receive_skb_core() {
tcf_classify() {
u32_classify [cls_u32]() {
tcf_action_exec() {
tcf_gact_act [act_gact]();
}
}
}
kfree_skb();
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
}
TCXDP
DROP
DROP
96
≈ 2.75Gbit/s
≈ 6.69Gbit/s
Packet Drop Results
783,063
1,266,730
4,083,820
9,941,337
0
2
4
6
8
10
12
userspace netfilter tc xdp
Mpps
97
≈ 6.69Gbit/s
XDP? Super FAST!
98
What is XDP?
The definition of XDP
99
What is XDP?
eBPF based fast data-path
100
eBPF based fast data-path
What is XDP?
101
in the kernel is similar with
V8 in Chrome browser
What is BPF?
102
In kernel Virtual Machine
What is BPF?
103
An virtual machine?
What does BPF look like?
What is BPF?
104
You may already used it!
What is BPF?
105
You may already used it!
What is BPF?
An tcpdump!
106
What is BPF?
$ tcpdump -i eth0 'udp and dst 10.1.0.2'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
00:19:42.244270 IP 192.168.10.1.221 > 10.1.0.2.1234: UDP, length 18
00:19:42.244270 IP 192.168.20.1.813 > 10.1.0.2.1234: UDP, length 18
00:19:42.244270 IP 192.168.30.1.856 > 10.1.0.2.1234: UDP, length 18
00:19:42.244270 IP 192.168.40.1.959 > 10.1.0.2.1234: UDP, length 18
00:19:42.244271 IP 192.168.10.1.160 > 10.1.0.2.1234: UDP, length 18
00:19:42.244271 IP 192.168.20.1.102 > 10.1.0.2.1234: UDP, length 18
00:19:42.244276 IP 192.168.30.1.114 > 10.1.0.2.1234: UDP, length 18
7 packets captured
10 packets received by filter
3 packets dropped by kernel
107
Packet filter by BPF!
What is BPF?
108
What is BPF?
$ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’
tcpdump with –d option
109
What is BPF?
$ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 7
(002) ldb [23]
(003) jeq #0x11 jt 4 jf 7
(004) ld [30]
(005) jeq #0xa010002 jt 6 jf 7
(006) ret #262144
(007) ret #0
tcpdump with –d option
110
What is BPF?
$ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 7
(002) ldb [23]
(003) jeq #0x11 jt 4 jf 7
(004) ld [30]
(005) jeq #0xa010002 jt 6 jf 7
(006) ret #262144
(007) ret #0
Ethernet Frame
MAC.
Destination
MAC.
Source
Type PAYLOAD
(IP/IPv6/ARP…)
CRC
6 Bytes 6 Bytes
2
Bytes 46-1500 Bytes 4 Bytes
64
111
What is BPF?
$ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 7
(002) ldb [23]
(003) jeq #0x11 jt 4 jf 7
(004) ld [30]
(005) jeq #0xa010002 jt 6 jf 7
(006) ret #262144
(007) ret #0
Ver. IHL TOS Total Len.
Identification Flags Frag. Offset
TTL. Protocol Header Checksum
Source Address
Destination Address
Options(optional)
Ethernet Frame
MAC.
Destination
MAC.
Source
Type PAYLOAD
(IP/IPv6/ARP…)
CRC
6 Bytes 6 Bytes
2
Bytes 46-1500 Bytes 4 Bytes
64
112
What is BPF?
Ver. IHL TOS Total Len.
Identification Flags Frag. Offset
TTL. Protocol Header Checksum
Source Address
Destination Address
Options(optional)
Ethernet Frame
MAC.
Destination
MAC.
Source
Type PAYLOAD
(IP/IPv6/ARP…)
CRC
6 Bytes 6 Bytes
2
Bytes 46-1500 Bytes 4 Bytes
64
$ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 7
(002) ldb [23]
(003) jeq #0x11 jt 4 jf 7
(004) ld [30]
(005) jeq #0xa010002 jt 6 jf 7
(006) ret #262144
(007) ret #0
113
In kernel Virtual Machine
“Linux kernel code execution engine”
What is BPF?
114
“Run code in the kernel”
What is BPF?
115
“Run code in the kernel”
$ readelf -h bpf-prog.o | grep Machine
Machine: Advanced Micro Devices X86-64
What is BPF?
116
“Run code in the kernel”
$ readelf -h bpf-prog.o | grep Machine
Machine: Linux BPF
Machine: Advanced Micro Devices X86-64
writing C program
clang / llc
BPF instruction
What is BPF?
117
“Run code in the kernel”
$ readelf -h bpf-prog.o | grep Machine
Machine: Linux BPF
Machine: Advanced Micro Devices X86-64
but, restricted
writing C program
clang / llc
BPF instruction
What is BPF?
118
“Run BPF code in the kernel”
$ readelf -h bpf-prog.o | grep Machine
Machine: Linux BPF
Machine: Advanced Micro Devices X86-64
but, restricted
writing C program
clang / llc
BPF instruction
What is BPF?
119
What is XDP?
eBPF based fast data-path
120
What is XDP?
eBPF based fast data-path
121
L2
L3
L7 Userspace
IP (Netfliter)
Traffic Control
DD (XDP)
NIC
What is XDP?
122
L2
L3
L7 Userspace
IP (Netfliter)
Traffic Control
DD (XDP)
NIC
Earliest Hook in RX path
of the kernel
What is XDP?
123
XDP Call Path
__do_softirq() {
net_rx_action() {
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
}
ixgbe_build_skb();
ixgbe_rx_skb() {
napi_gro_receive() {
netif_receive_skb_internal();
Intel Device Driver
RX function Call stack
124
XDP Call Path
Right After
Interrupt Processing
Before any memory allocation
(Expensive operation)
__do_softirq() {
net_rx_action() {
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
}
ixgbe_build_skb();
ixgbe_rx_skb() {
napi_gro_receive() {
netif_receive_skb_internal();
125
XDP Call Path
Right After
Interrupt Processing
Before any memory allocation
(Expensive operation)
__do_softirq() {
net_rx_action() {
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
}
ixgbe_build_skb();
ixgbe_rx_skb() {
napi_gro_receive() {
netif_receive_skb_internal();
Decide the fate of the packet
126
* User written XDP program
XDP Actions
XDP_DROP
XDP_ABORT
XDP_PASS
XDP_TX
XDP_REDIRECT
127
XDP Actions
XDP_DROP - Very fast drop by recycling
XDP_ABORT - Also drop, but with tracepoint
XDP_PASS - Toss packet to network stack
XDP_TX - Send packet back to same interface
XDP_REDIRECT - Transmit out other NICs
128
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
XDP Actions
XDP_PASS
XDP_DROP
XDP_REDIRECT
XDP_TX
129
XDP Actions
XDP
XDP_PASS
XDP_DROP XDP_REDIRECT
XDP_TX
User
NICRX TX
NIC
130
xdp_buff
NO Memory Allocation?
NO sk_buff?
131
xdp_buff
struct xdp_buff {
void *data;
void *data_end;
void *data_meta;
void *data_hard_start;
unsigned long handle;
struct xdp_rxq_info *rxq;
}; xdp_buff->data_hard_start
xdb_buff->data_meta
xdp_buff->data
xdp_buff->data_end
HEADROOM TAIL/TAILROOM
132
xdp_buff
struct xdp_buff {
void *data;
void *data_end;
void *data_meta;
void *data_hard_start;
unsigned long handle;
struct xdp_rxq_info *rxq;
};
133
HEADROOM TAIL/TAILROOM
xdp_buff->data
xdp_buff->data_end
xdp_buff
struct xdp_buff {
void *data;
void *data_end;
void *data_meta;
void *data_hard_start;
unsigned long handle;
struct xdp_rxq_info *rxq;
};
134
HEADROOM TAIL/TAILROOM
xdp_buff->data
xdp_buff->data_end
MAC IP UDP
Packet Data
How to use XDP?
135
Packet Forward
With understanding XDP code
136
DEMO
137
packet
Test environment
10.1.0.1
NIC
Host A
User
IP
TC
DD
NIC
Host B
10.1.0.2:1234
10 Gbe
…
packet
packet
10.1.1.2:1234
NIC
Host C10 Gbe
…
packet
packet
packet
138
1. Destination NAT
2. Packet Forward
10.1.0.2 -> 10.1.1.2
139
XDP Packet Forward
140
2. Packet Forward
10.1.0.2 -> 10.1.1.2
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data = (void *)(long)xdp->data;
struct bpf_fib_lookup fib;
struct ethhdr *eth = data;
struct iphdr *iph;
...
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph->daddr == _htonl(0xa010002))
iph->daddr = _htonl(0xa010102);
...
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
1. Destination NAT
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
141
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
142
-> ELF Section where XDP program will be located
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
143
-> Name of the XDP program
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
144
HEADROOM TAIL/TAILROOM
xdp_buff
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
145
MAC
HEADROOM TAIL/TAILROOM
xdp_buff
-> Cast to Ethernet header
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
146
MAC
HEADROOM TAIL/TAILROOM
xdp_buff
-> Validate the Ethernet header
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
147
MAC IP
HEADROOM TAIL/TAILROOM
xdp_buff
-> If eth->protocol is IP, Cast to IP header
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
148
MAC IP
HEADROOM TAIL/TAILROOM
xdp_buff
-> Validate the IP header
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
149
MAC IP
HEADROOM TAIL/TAILROOM
xdp_buff
-> Check destination address 10.1.0.2
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
150
MAC IP UDP
HEADROOM TAIL/TAILROOM
xdp_buff
-> If iph->protocol is UDP, Cast to UDP header
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
151
HEADROOM TAIL/TAILROOM
xdp_buff
-> Validate the UDP header
MAC IP UDP
xdp_buff->data
xdp_buff->data_end
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
struct ethhdr *eth = data;
struct iphdr *iph;
struct udphdr *uh;
…
if (eth + 1 > data_end)
return XDP_DROP;
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->daddr == _htonl(0xa010002))
if (iph->protocol == IPPROTO_UDP) {
uh = data + sizeof(*eth) + sizeof(*iph);
if (uh + 1 > data_end)
return XDP_DROP;
if (uh->dest == htons(1234))
iph->daddr = _htonl(0xa010102);
}
…
XDP Packet Forward
152
HEADROOM TAIL/TAILROOM
xdp_buff
-> if dst port is 1234,
Change dst address to 10.1.1.2
MAC IP UDP
xdp_buff->data
xdp_buff->data_end
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
153
-> Prior to routing table lookup,
prepare bpf_fib_lookup struct
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
154
-> Fill the struct for query
ex) src / dst address
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
155
-> Make sure where packet comes from
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
156
-> Query routing table for redirect interface
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
157
-> Query routing table for redirect interface
$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0
10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0
10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1
192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0
_gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
158
-> If success, decrease ttl
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
159
-> Change Mac address for src, dst
struct bpf_fib_lookup fib;
int rc;
...
__builtin_memset(&fib, 0, sizeof(fib));
fib.family = AF_INET;
fib.tos = iph->tos;
fib.l4_protocol = iph->protocol;
fib.tot_len = ntohs(iph->tot_len);
fib.ipv4_src = iph->saddr;
fib.ipv4_dst = iph->daddr;
fib.ifindex = xdp->ingress_ifindex;
} else {
return XDP_PASS;
}
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
ip_decrease_ttl(iph); /* from include/net/ip.h */
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
XDP Packet Forward
160
-> Packet Forward with bpf_redirect!
(returns with XDP_REDIRECT)
Forward
$ clang -O2 -Wall -target bpf -c xdp-fwd.c -o xdp-fwd.o
$ bpftool prog load ./xdp-fwd.o /sys/fs/bpf/fwd
$ bpftool prog show
...
44: xdp name xdp_fwd_prog tag 1aa0135c2e55b38d gpl
loaded_at 2019-10-11T20:03:01+0900 uid 0
xlated 760B jited 442B memlock 4096B
$ bpftool net attach xdp id 44 dev eth0​
$ bpftool net​
xdp:​
eth0(8) driver id 44
XDP Packet Forward
161
Compile with BPF target
Load program with bpftool
Attach to interface
Benchmark
162
packet
Test environment
10.1.0.1
NIC
Host A
User
IP
TC
DD
NIC
Host B
10.1.0.2:1234
10 Gbe
…
packet
packet
10.1.1.2:1234
NIC
Host C10 Gbe
…
packet
packet
packet
163
1. Destination NAT
2. Packet Forward
10.1.0.2 -> 10.1.1.2
Hooks to process packet?
L2
L3
L7 Userspace
IP (Netfliter)
Traffic Control
DD (XDP)
NIC
164
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Userspace Packet Forward
Forward
165
Userspace Packet Forward
NIC
Host A Host B
NIC
Host C
User
IP
TC
DD
NIC
Forwardstream {
server {
listen 1234 udp;
proxy_pass udp_target;
}
upstream udp_target {
server 10.1.1.2:1234;
}
}
/etc/nginx/nginx.conf
10.1.0.1 10.1.0.2:1234 10.1.1.2:1234
166
UDP Load Balancer
packet forward with DNAT
$ ethtool -S eth0 | grep rx_xdp_drop
rx_xdp_drop: 370841
rx_xdp_drop: 754718
rx_xdp_drop: 1118515
rx_xdp_drop: 1483198
rx_xdp_drop: 1858322
rx_xdp_drop: 2237080
rx_xdp_drop: 2609916
rx_xdp_drop: 2985062
rx_xdp_drop: 3342212
rx_xdp_drop: 3725703
rx_xdp_drop: 4105888
Average Packet Forward
373,263pps/core
Userspace Packet Forward
≈ 260Mbit/s
167
How about Netfilter?
168
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
Netfilter Packet Forward
169
Forward
$ iptables -t nat -A PREROUTING -d 10.1.0.2 -p udp --dport 1234 
-j DNAT --to-destination 10.1.1.2:1234
$ iptables -t nat -L PREROUTING
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DNAT udp -- anywhere 10.1.0.2 udp dpt:1234 to:10.1.1.2:1234
Netfilter Packet Forward
10.1.0.1
NIC
Host A Host B
10.1.0.2:1234 10.1.1.2:1234
NIC
Host C
User
IP
TC
DD
NIC
Forward
170
Netfilter (iptables)
packet forward with DNAT
Netfilter Packet Forward
$ ethtool -S eth0 | grep rx_xdp_drop
rx_xdp_drop: 686991
rx_xdp_drop: 1377146
rx_xdp_drop: 2064571
rx_xdp_drop: 2755620
rx_xdp_drop: 3418211
rx_xdp_drop: 4050284
rx_xdp_drop: 4682626
rx_xdp_drop: 5366274
rx_xdp_drop: 6054439
rx_xdp_drop: 6743962
rx_xdp_drop: 7434054
rx_xdp_drop: 8003538
rx_xdp_drop: 8693454
Average Packet Forward
668,728pps/core
≈ 960Mbit/s
171
What about TC?
172
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
TC Ingress Packet Forward
ROUTING
Qdisc Qdisc
173
Forward
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip u32 
match ip dst 10.1.0.2 match ip dport 1234 0xffff 
action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe 
action mirred egress redirect dev eth1
$ tc filter show ingress dev eth0
...
match 0a010002/ffffffff at 16
match 000004d2/0000ffff at 20
action order 1: nat ingress 10.1.0.2/32 10.1.1.2 pipe
index 1 ref 1 bind 1
action order 2: mirred (Egress Redirect to eth1) stolen
index 1 ref 1 bind 1
TC Ingress Packet Forward
NIC
Host A Host B
NIC
Host C
User
IP
TC
DD
NIC
Forward
10.1.0.1 10.1.0.2:1234 10.1.1.2:1234
174
TC Ingress
packet forward with DNAT (redirect)
TC Ingress Packet Forward
$ ethtool -S eth0 | grep rx_xdp_drop
rx_xdp_drop: 1407789
rx_xdp_drop: 2848601
rx_xdp_drop: 4274539
rx_xdp_drop: 5695547
rx_xdp_drop: 7132990
rx_xdp_drop: 8533688
rx_xdp_drop: 9943697
rx_xdp_drop: 11367744
rx_xdp_drop: 12808213
rx_xdp_drop: 14234151
rx_xdp_drop: 15685314
Average Packet Forward
1,425,938pps/core
175
≈ 1Gbit/s
And… XDP?
176
UPPER LAYER
XDP
TC Ingress
ROUTING FORWARDING
OUTPUT
TC Egress
ROUTING
L4~
L3
L2
PREROUTING
INPUT
POSTROUTING
Rx Tx
XDP Packet Forward
ROUTING
177
Forward
XDP Packet Forward
178
2. Packet Forward
10.1.0.2 -> 10.1.1.2
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *xdp) {
void *data = (void *)(long)xdp->data;
struct bpf_fib_lookup fib;
struct ethhdr *eth = data;
struct iphdr *iph;
...
if (eth->h_proto == htons(ETH_P_IP)) {
iph = data + sizeof(*eth);
if (iph->daddr == _htonl(0xa010002))
iph->daddr = _htonl(0xa010102);
...
rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0);
if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
memcpy(eth->h_source, fib.smac, ETH_ALEN);
return bpf_redirect(fib.ifindex, 0);
}
return XDP_PASS;
}
1. Destination NAT
$ bpftool prog load ./xdp-fwd.o /sys/fs/bpf/fwd
$ bpftool prog show
...
44: xdp name xdp_fwd_prog tag 1aa0135c2e55b38d gpl
loaded_at 2019-10-11T20:03:01+0900 uid 0
xlated 760B jited 442B memlock 4096B
$ bpftool net attach xdp id 44 dev eth0​
$ bpftool net​
xdp:​
eth0(8) driver id 44
XDP Packet Forward
NIC
Host A Host B
NIC
Host C
User
IP
TC
DD
NIC
Forward
10.1.0.1 10.1.0.2:1234 10.1.1.2:1234
179
XDP
packet forward with DNAT (XDP_FORWARD)
$ ethtool -S eth0 | grep rx_xdp_drop
rx_xdp_drop: 3031344
rx_xdp_drop: 6057680
rx_xdp_drop: 9088916
rx_xdp_drop: 12118560
rx_xdp_drop: 15150319
rx_xdp_drop: 18180559
rx_xdp_drop: 21210383
rx_xdp_drop: 24240271
rx_xdp_drop: 27270543
rx_xdp_drop: 30300559
rx_xdp_drop: 33329483
rx_xdp_drop: 36356788
XDP Packet Forward
Average Packet Forward
3,029,764pps/core
≈ 2.04Gbit/s
180
Packet Forward Results
181
373,263
668,728
1,425,938
3,029,764
0
1
1
2
2
3
3
4
userspace netfilter tc xdp
Mpps
≈ 2.04Gbit/s
More about XDP?
XDP Offload and Further usages
182
XDP Modes
183
XDP Modes
Generic XDP
Native XDP
Offloaded XDP
184
XDP Modes
Packet NIC
XDP
Driver
XDP
Generic
XDP
Netfilter …TC
XDP can happen here
Network
Hardware
Network
Driver
Linux Kernel
Network Stack
185
NativeOffloaded Generic
XDP Modes
Packet NIC
XDP
Driver
XDP
Generic
XDP
Netfilter …TC
Network
Hardware
Network
Driver
Linux Kernel
Network Stack
186
NativeOffloaded Generic
=
Normally, XDP?
XDP Modes
Generic XDP - No driver support needed
Native XDP
- Intel (ixgbe, ixgbevf, i40e)
- Mellanox (mlx4, mlx5)
- Broadcom (bnxt)
- Qlogic (qede)
- Netronome (nfp)
- Others (virtio, tun)
(Most of the 10Gbe Driver)
Offloaded XDP - Netronome (nfp)
187
XDP OFFLOAD
188
What is XDP OFFLOAD?
0: r0 = 1
1: r2 = *(u32 *)(r1 + 4)
2: r1 = *(u32 *)(r1 + 0)
3: r3 = r1
4: r3 += 14
5: if r3 > r2 goto +18 <LBB0_8>
6: r3 = *(u8 *)(r1 + 12)
7: r4 = *(u8 *)(r1 + 13)
8: r4 <<= 8
9: r4 |= r3
10: if r4 != 8 goto +12 <LBB0_7>
11: r3 = r1
12: r3 += 34
13: if r3 > r2 goto +10 <LBB0_8>
14: r3 = *(u32 *)(r1 + 30)
15: if r3 != 33554698 goto +7 <LBB0_7>
16: r3 = *(u8 *)(r1 + 23)
17: if r3 != 17 goto +5 <LBB0_7>
18: r3 = r1
19: r3 += 42
...
XDP offload
BPF Program
189
What is XDP OFFLOAD?
0: r0 = 1
1: r2 = *(u32 *)(r1 + 4)
2: r1 = *(u32 *)(r1 + 0)
3: r3 = r1
4: r3 += 14
5: if r3 > r2 goto +18 <LBB0_8>
6: r3 = *(u8 *)(r1 + 12)
7: r4 = *(u8 *)(r1 + 13)
8: r4 <<= 8
9: r4 |= r3
10: if r4 != 8 goto +12 <LBB0_7>
11: r3 = r1
12: r3 += 34
13: if r3 > r2 goto +10 <LBB0_8>
14: r3 = *(u32 *)(r1 + 30)
15: if r3 != 33554698 goto +7 <LBB0_7>
16: r3 = *(u8 *)(r1 + 23)
17: if r3 != 17 goto +5 <LBB0_7>
18: r3 = r1
19: r3 += 42
...
XDP offload
BPF Program
190
Runs more earlier
No CPU usage
Packet Drop Results
783,063
1,266,730
4,083,820
9,941,337
0
2
4
6
8
10
12
userspace netfilter tc xdp
Mpps
191
≈ 6.69Gbit/s
Out of 10Gbit/s
$ bpftool prog load ./xdp-drop.o /sys/fs/bpf/drop
$ bpftool prog show
...
24: xdp name xdp_prog1 tag 6f8c2e06dfa2abcb gpl
loaded_at 2019-10-11T17:17:33+0900 uid 0
xlated 544B jited 344B memlock 4096B map_ids 17
$ bpftool net attach xdpoffload id 18 dev eth0
$ bpftool net
xdp:
eth0(8) offload id 18
XDP Offload Packet Drop
10.1.0.1
NIC
Host A
User
IP
TC
DD
NIC
Host B
DROP
10 Gbe
10.1.0.2:1234
192
XDP offload
packet drop with XDP_DROP
XDP Offload Packet Drop
$ ethtool -S eth0 | grep bpf_app1_pkts
bpf_app1_pkts: 9340887
bpf_app1_pkts: 14274730
bpf_app1_pkts: 29233693
bpf_app1_pkts: 44165498
bpf_app1_pkts: 59097632
bpf_app1_pkts: 74057835
bpf_app1_pkts: 88990848
bpf_app1_pkts: 103947963
bpf_app1_pkts: 118881530
bpf_app1_pkts: 133838008
bpf_app1_pkts: 148770851
bpf_app1_pkts: 163705144
bpf_app1_pkts: 178664224
bpf_app1_pkts: 193596262
bpf_app1_pkts: 208554907
bpf_app1_pkts: 223488168
Average Packet Drop
14,883,153pps/core
≈ 10.00Gbit/s
193
Packet Drop Results with XDP Offload
783,063
1,266,730
4,083,820
9,941,337
14,883,153
0
2
4
6
8
10
12
14
16
userspace netfilter tc xdp xdp-offload
Mpps
194
Out of 10Gbit/s
≈ 10.00Gbit/s
6.69Gbit/s
XDP Offload Packet Drop
ixgbe_poll() {
ixgbe_clean_rx_irq() {
ixgbe_get_rx_buffer();
ixgbe_run_xdp() {
bpf_prog_run_xdp();
}
XDPXDP Offload
DROP
DROP
NONE!
195
XDP Offload Packet Drop
XDP XDP Offload
196
XDP Use Cases
• Load Balancing
• Packet Tunneling (Encapsulation)
• DDoS attack mitigation
• Network monitoring
• ETC..
197
Sample code, test results can be found:
github.com/DanielTimLee/soscon19_XDP
198
SAMSUNG OPEN SOURCE CONFERENCE 2019
THANK YOU
199

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDPlcplcp1
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsHisaki Ohara
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareC4Media
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumMichal Rostecki
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelDivye Kapoor
 
Kernel Recipes 2019 - XDP closer integration with network stack
Kernel Recipes 2019 -  XDP closer integration with network stackKernel Recipes 2019 -  XDP closer integration with network stack
Kernel Recipes 2019 - XDP closer integration with network stackAnne Nicolas
 
Tutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting routerTutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting routerShu Sugimoto
 
EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux NetworkingPLUMgrid
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptablesKernel TLV
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPThomas Graf
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingViller Hsiao
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack monad bobo
 
DevConf 2014 Kernel Networking Walkthrough
DevConf 2014   Kernel Networking WalkthroughDevConf 2014   Kernel Networking Walkthrough
DevConf 2014 Kernel Networking WalkthroughThomas Graf
 
Overview of Distributed Virtual Router (DVR) in Openstack/Neutron
Overview of Distributed Virtual Router (DVR) in Openstack/NeutronOverview of Distributed Virtual Router (DVR) in Openstack/Neutron
Overview of Distributed Virtual Router (DVR) in Openstack/Neutronvivekkonnect
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch IntroductionHungWei Chiu
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKMarian Marinov
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 

Was ist angesagt? (20)

Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @Cloudflare
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with Cilium
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
Kernel Recipes 2019 - XDP closer integration with network stack
Kernel Recipes 2019 -  XDP closer integration with network stackKernel Recipes 2019 -  XDP closer integration with network stack
Kernel Recipes 2019 - XDP closer integration with network stack
 
Tutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting routerTutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting router
 
EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux Networking
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptables
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
DevConf 2014 Kernel Networking Walkthrough
DevConf 2014   Kernel Networking WalkthroughDevConf 2014   Kernel Networking Walkthrough
DevConf 2014 Kernel Networking Walkthrough
 
Overview of Distributed Virtual Router (DVR) in Openstack/Neutron
Overview of Distributed Virtual Router (DVR) in Openstack/NeutronOverview of Distributed Virtual Router (DVR) in Openstack/Neutron
Overview of Distributed Virtual Router (DVR) in Openstack/Neutron
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch Introduction
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDK
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 

Ähnlich wie Faster packet processing in Linux: XDP

2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful ServicesThomas Graf
 
Efficient Topology Discovery in Software Defined Networks
Efficient Topology Discovery in Software Defined NetworksEfficient Topology Discovery in Software Defined Networks
Efficient Topology Discovery in Software Defined NetworksFarzaneh Pakzad
 
How to convert your Linux box into Security Gateway - Part 1
How to convert your Linux box into Security Gateway - Part 1How to convert your Linux box into Security Gateway - Part 1
How to convert your Linux box into Security Gateway - Part 1n|u - The Open Security Community
 
[ko] Kernel Networking Stack 진입 장벽 허물기
[ko] Kernel Networking Stack 진입 장벽 허물기[ko] Kernel Networking Stack 진입 장벽 허물기
[ko] Kernel Networking Stack 진입 장벽 허물기Juhee Kang
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Thomas Graf
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафикаSkillFactory
 
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThe Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThomas Graf
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK
 
Packet Framework - Cristian Dumitrescu
Packet Framework - Cristian DumitrescuPacket Framework - Cristian Dumitrescu
Packet Framework - Cristian Dumitrescuharryvanhaaren
 
Course on TCP Dynamic Performance
Course on TCP Dynamic PerformanceCourse on TCP Dynamic Performance
Course on TCP Dynamic PerformanceJavier Arauz
 
An introduction to MPLS networks and applications
An introduction to MPLS networks and applicationsAn introduction to MPLS networks and applications
An introduction to MPLS networks and applicationsShawn Zandi
 
Pyretic - A new programmer friendly language for SDN
Pyretic - A new programmer friendly language for SDNPyretic - A new programmer friendly language for SDN
Pyretic - A new programmer friendly language for SDNnvirters
 
DDS over Low Bandwidth Data Links - Connext Conf London October 2014
DDS over Low Bandwidth Data Links - Connext Conf London October 2014DDS over Low Bandwidth Data Links - Connext Conf London October 2014
DDS over Low Bandwidth Data Links - Connext Conf London October 2014Jaime Martin Losa
 
Ch 02 --- sdn and openflow architecture
Ch 02 --- sdn and openflow architectureCh 02 --- sdn and openflow architecture
Ch 02 --- sdn and openflow architectureYoram Orzach
 
Protocol T50: Five months later... So what?
Protocol T50: Five months later... So what?Protocol T50: Five months later... So what?
Protocol T50: Five months later... So what?Nelson Brito
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterAnne Nicolas
 
MPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - BasicMPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - BasicEricsson
 

Ähnlich wie Faster packet processing in Linux: XDP (20)

2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services
 
Efficient Topology Discovery in Software Defined Networks
Efficient Topology Discovery in Software Defined NetworksEfficient Topology Discovery in Software Defined Networks
Efficient Topology Discovery in Software Defined Networks
 
How to convert your Linux box into Security Gateway - Part 1
How to convert your Linux box into Security Gateway - Part 1How to convert your Linux box into Security Gateway - Part 1
How to convert your Linux box into Security Gateway - Part 1
 
[ko] Kernel Networking Stack 진입 장벽 허물기
[ko] Kernel Networking Stack 진입 장벽 허물기[ko] Kernel Networking Stack 진입 장벽 허물기
[ko] Kernel Networking Stack 진입 장벽 허물기
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафика
 
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThe Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
 
7. protocols
7. protocols7. protocols
7. protocols
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
 
Packet Framework - Cristian Dumitrescu
Packet Framework - Cristian DumitrescuPacket Framework - Cristian Dumitrescu
Packet Framework - Cristian Dumitrescu
 
Course on TCP Dynamic Performance
Course on TCP Dynamic PerformanceCourse on TCP Dynamic Performance
Course on TCP Dynamic Performance
 
An introduction to MPLS networks and applications
An introduction to MPLS networks and applicationsAn introduction to MPLS networks and applications
An introduction to MPLS networks and applications
 
Pyretic - A new programmer friendly language for SDN
Pyretic - A new programmer friendly language for SDNPyretic - A new programmer friendly language for SDN
Pyretic - A new programmer friendly language for SDN
 
DDS over Low Bandwidth Data Links - Connext Conf London October 2014
DDS over Low Bandwidth Data Links - Connext Conf London October 2014DDS over Low Bandwidth Data Links - Connext Conf London October 2014
DDS over Low Bandwidth Data Links - Connext Conf London October 2014
 
Ch 02 --- sdn and openflow architecture
Ch 02 --- sdn and openflow architectureCh 02 --- sdn and openflow architecture
Ch 02 --- sdn and openflow architecture
 
Protocol Independence
Protocol IndependenceProtocol Independence
Protocol Independence
 
Protocol T50: Five months later... So what?
Protocol T50: Five months later... So what?Protocol T50: Five months later... So what?
Protocol T50: Five months later... So what?
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
 
MPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - BasicMPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - Basic
 

Kürzlich hochgeladen

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 

Kürzlich hochgeladen (20)

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 

Faster packet processing in Linux: XDP

  • 1. SAMSUNG OPEN SOURCE CONFERENCE 2019 SOSCON Faster Packet Processing in Linux: XDP Kosslab | SoftwareEngineer | 이호연 2019.10.17 1
  • 2. 이호연 @Daniel T. Lee - Kosslab Software Engineer - Opensource Developer (Linux Kernel – BPF, XDP, uftrace, etc.) 2
  • 3. • Packet Processing in Linux • How fast are we talking? • What is XDP? • How to use XDP? • More about XDP Today’s agenda 3
  • 4. Packet Processing in Linux From basic path to processing hooks 4
  • 5. Packet Path in Kernel DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 5
  • 6. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 6
  • 7. Packet Forwarding Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 7
  • 8. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 8
  • 9. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 9
  • 10. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 10
  • 11. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 11
  • 12. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 12
  • 13. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 13 $ route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0 10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0 10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1 192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0 _gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
  • 14. Packet Receiving Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 14
  • 15. Packet Forwarding Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 15
  • 16. Packet Forwarding Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 16
  • 17. Packet Forwarding Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 17
  • 18. Packet Forwarding Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 18
  • 19. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 19
  • 20. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 20
  • 21. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 21
  • 22. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 22
  • 23. Packet Sending Path DEVICE DRIVER UPPER LAYER Ingress PROTO HANDLER ROUTING FORWARDING OUTPUT INPUT NEIGH ROUTING Egress L4~ L3 L2 23
  • 25. Packet Processing • Filtering (Drop, Pass) • Forwarding (Routing) • NAT (Masquerade, Source, Destination) • Packet Tunneling (Encapsulation) • Packet Mangling (ToS, TTL, mark) 25
  • 26. Hooks to process packet? 26
  • 27. Hooks to process packet? L2 L3 L7 Userspace IP (Netfliter) Traffic Control DD (XDP) NIC 27
  • 28. Linux Socket Packet processing by receive message Userspace 28
  • 29. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Packet-Processing Path 29
  • 30. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Userspace Packet Path 30
  • 31. Userspace 31 Any kind of packet processing is possible! int fd = socket(AF_INET, SOCK_DGRAM, 0); struct sockaddr_in in; int res, pkts, bytes; char buf[MTU_SIZE]; memset(&in, 0, sizeof(in)); in.sin_family = AF_INET; in.sin_addr.s_addr = INADDR_ANY; in.sin_port = htons(1234); if (bind(fd, (struct sockaddr*)&in, sizeof(in)) < 0) exit(EXIT_FAILURE); while (1) { int res = read(fd, buf, MTU_SIZE); if (res <= 0) return 0; pkts += 1; bytes += r; }
  • 32. Linux Firewall Framework set of hooks inside the Network stack Netfilter 32
  • 33. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Packet Path 33
  • 34. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Packet Path modulemodule module module module By registering kernel modules to hooks, Intercepts network traffic ex) iptables… 34
  • 35. Netfilter 35 By using iptables, nftables, etc.. Packet Processing is available - Packet filtering - NAT (network address translation) - Packet Mangling. - Stateless/Stateful Firewalling - ETC..
  • 36. Netfilter Example 1 – Packet Filtering Netfilter (iptables) packet drop DROP UDP 10.1.0.2:1234 $ iptables -A INPUT -d 10.1.0.2 -p udp --dport 1234 -j DROP 36
  • 37. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 1 – Packet Filtering ROUTING packetpacketpacket 10.1.0.2:1234 37
  • 38. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 1 – Packet Filtering DROP Source IP Source Port Dest. IP Dest.Port Action 1 10.0.0.5 -- -- -- ALLOW 2 -- -- -- 22 DROP 3 -- -- 10.1.0.2 1234 DROP ROUTING packetpacketpacket 10.1.0.2:1234 38
  • 39. Netfilter Example 2 – NAT Netfilter (iptables) Destination NAT NAT UDP 10.1.0.2:1234 -> 10.1.1.2:1234 $ iptables -t nat -A PREROUTING -d 10.1.0.2 -p udp --dport 1234 -j DNAT --to-destination 10.1.1.2:1234 39
  • 40. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 2 – NAT ROUTING packetpacketpacket 10.1.0.2:1234 40
  • 41. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 2 – NAT packetpacketpacket ROUTING $ iptables -t nat -A PREROUTING -d 10.1.0.2 -p udp --dport 1234 -j DNAT --to-destination 10.1.1.2:1234 $ iptables -t nat -L PREROUTING Chain PREROUTING (policy ACCEPT) target prot opt source destination DNAT udp -- anywhere 10.1.0.2 udp dpt:1234 to:10.1.1.2:1234 10.1.0.2:1234 41
  • 42. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 2 – NAT Source IP Source Port Dest. IP Dest.Port to 1 10.0.0.5 -- -- -- 10.0.0.2:1234 2 -- -- 10.1.0.255 80 10.1.0.1:443 3 -- -- 10.1.0.2 1234 10.1.1.2:1234 DNAT packetpacketpacket packet 10.1.0.2:1234 10.1.1.2:1234 42
  • 43. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 2 – NAT ROUTING packetpacketpacket packet 10.1.0.2:1234 10.1.1.2:1234 43 $ route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0 10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0 10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1 192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0 _gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
  • 44. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Example 2 – NAT packetpacketpacket 10.1.0.2:1234 packet 10.1.1.2:1234 44 ROUTING
  • 45. Traffic Control – Packet Scheduler Control network traffic with Queuing policy TC 45
  • 46. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Packet Path Qdisc Qdisc 46
  • 47. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Packet Path Qdisc Qdisc 47 With Packet Scheduler (Qdisc), can determine how to receive & transmit packets ex) priority, delay, dropping
  • 48. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Packet Path Qdisc Qdisc 48 By attaching Filter to Qdisc, User can take action with matched packet
  • 49. TC 49 With filters attached to Qdisc Packet Processing is available - Packet filtering - NAT (network address translation) - Packet Mangling (skb_edit) - Packet Forwarding (mirroring) - QoS - ETC..
  • 50. TC Example 1 – Packet Filtering TC Ingress packet drop DROP UDP 10.1.0.2:1234 $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action drop 50
  • 51. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 1 – Packet Filtering packet packet packet ROUTING $ tc qdisc add dev eth0 ingress 10.1.0.2:1234 51
  • 52. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 1 – Packet Filtering packet packet packet ROUTING 10.1.0.2:1234 52 $ tc qdisc add dev eth0 ingress Qdisc
  • 53. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 1 – Packet Filtering packet packet packet ROUTING $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action drop ` 53 10.1.0.2:1234 Qdisc
  • 54. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 1 – Packet Filtering packet packet packet ROUTING $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action drop ` 54 10.1.0.2:1234 Qdisc DROP
  • 55. TC Example 2 – NAT TC Ingress Destination NAT NAT UDP 10.1.0.2:1234 -> 10.1.1.2:1234 $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe action mirred egress redirect dev eth1 55
  • 56. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 2 – NAT packet packet packet ROUTING $ tc qdisc add dev eth0 ingress 10.1.0.2:1234 56
  • 57. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 2 – NAT packet packet packet ROUTING 10.1.0.2:1234 57 $ tc qdisc add dev eth0 ingress Qdisc
  • 58. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 2 – NAT packet packet packet ROUTING $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe action mirred egress redirect dev eth1 ` 58 10.1.0.2:1234 Qdisc
  • 59. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 2 – NAT packet packet packet ROUTING $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe action mirred egress redirect dev eth1 ` 59 10.1.0.2:1234 Qdisc DNAT packet 10.1.1.2:1234
  • 60. UPPER LAYER XDP TC Ingress FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Example 2 – NAT packet packet packet ROUTING 60 10.1.0.2:1234 ROUTING Qdisc Qdiscpacket 10.1.1.2:1234 action mirred egress redirect dev eth1 Redirect
  • 61. XDP eBPF based fast data-path 61
  • 62. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx XDP Packet Path 62
  • 63. eBPF based fast data-path XDP 63
  • 64. How FAST are we talking? 64
  • 65. Packet Drop From zero to 14 Mpps 65
  • 66. packet Test environment NIC Host A User IP TC DD NIC Host B10 Gbe … packet packet 10.1.0.1 10.1.0.2:1234 66 Connected with 10Gbe Ethernet
  • 67. packet Test environment NIC Host A User IP TC DD NIC Host B10 Gbe Pktgen : Send UDP packet (linux/samples/pktgen) $ ./pktgen.sh -i eth0 -m $DST_MAC -d 10.1.0.2 -p 1234 Result device: eth0 Params: count 100000 min_pkt_size: 60 max_pkt_size: 60 ... dst_min: 10.1.0.2 dst_mac: $DST_MAC udp_dst: 1234 Current: pkts-sofar: 100000 errors: 0 started: 7802657us stopped: 7862969us idle: 55us cur_saddr: 10.1.0.1 cur_daddr: 10.1.0.2 cur_udp_dst: 1234 cur_udp_src: 109 cur_queue_map: 0 flows: 0 Result: OK: 60312(c60257+d55) usec, 100000 (60byte,0frags) 1658039pps 795Mb/sec (795858720bps) errors: 0 … packet packet 10.1.0.1 10.1.0.2:1234 67
  • 68. Test environment 10.1.0.1 NIC Host A User IP TC DD NIC Host B 10.1.0.2:1234 10 Gbe DROP Packet Drop / Single Core? packet 10 Gbe … packet packet DROP UDP 10.1.0.2:1234 68
  • 69. 10Gbit = (10 * 10^9 bit) / (84 * 8bit) = 14,880,952 pps (14Mpps) Preamble Ethernet Frame Inter Frame GapMAC. Destination MAC. Source Type PAYLOAD (IP/IPv6/ARP…) CRC 8 Bytes 6 Bytes 6 Bytes 2 Bytes 46-1500 Bytes 4 Bytes 12 Bytes 8 64 12 84 Theoretical speed of 10Gbe? 69
  • 70. Hooks to process packet? L2 L3 L7 Userspace IP (Netfliter) Traffic Control DD (XDP) NIC 70
  • 71. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Userspace Packet Drop DROP 71
  • 72. char buf[MTU_SIZE]; while (1) { int res = read(fd, buf, MTU_SIZE); if (res <= 0) return 0; pkts += 1; bytes += res; } user-drop.c Userspace Packet Drop UDP socket server packet drop by reading socket NIC Host A User IP TC DD NIC Host B DROP 10 Gbe 10.1.0.1 10.1.0.2:1234 72
  • 73. Userspace Packet Drop $ gcc –o user-drop user-drop.c $ sudo ./user-drop packets=778148 bytes=14006664 packets=782171 bytes=14079078 packets=784792 bytes=14126256 packets=786466 bytes=14156388 packets=784163 bytes=14114934 packets=782500 bytes=14085000 packets=783085 bytes=14095530 packets=783172 bytes=14097096 Average Packet Drop 783,063pps/core ≈ 530Mbit/s 73
  • 74. Userspace Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { packet_rcv() { skb_push(); consume_skb(); } } ip_rcv() { ip_rcv_core.isra.25(); ip_rcv_finish() { ip_rcv_finish_core.isra.23() { udp_v4_early_demux() { ip_check_mc_rcu(); ip_mc_sf_allow(); ipv4_dst_check(); ip_mc_validate_source() { fib_validate_source() { __fib_validate_source(); } } } ip_local_deliver() { nf_hook_slow() { iptable_filter_hook [iptable_filter]() { ipt_do_table [ip_tables]() { __local_bh_enable_ip(); } } } ip_local_deliver_finish() { ip_protocol_deliver_rcu() { raw_local_deliver(); udp_rcv() { __udp4_lib_rcv() { udp_unicast_rcv_skb.isra.64() { udp_queue_rcv_skb() { udp_queue_rcv_one_skb() { __udp_enqueue_schedule_skb(); . . . userspace DROP 74 ≈ 530Mbit/s
  • 75. Can it be faster? 75
  • 77. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Packet Drop 77 DROP
  • 78. $ iptables -A INPUT -d 10.1.0.2 -p udp --dport 1234 -j DROP $ iptables -L INPUT Chain INPUT (policy ACCEPT) target prot opt source destination DROP udp -- anywhere 10.1.0.2 udp dpt:1234 Netfilter Packet Drop NIC Host A User IP TC DD NIC Host B DROP 10 Gbe Netfilter (iptables) packet drop at INPUT chain 10.1.0.1 10.1.0.2:1234 78
  • 79. Netfilter Packet Drop $ iptables -vxnL INPUT Chain INPUT (policy ACCEPT 79 packets, 4318 bytes) pkts bytes target prot opt in out source destination 2927763 134677098 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 4235054 194812484 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 5541468 254907528 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 6397986 294307356 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 7706050 354478300 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 9013710 414630660 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 10320569 474746174 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 11629331 534949226 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 12937107 595106922 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 14243987 655223402 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 15553372 715455112 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 16861793 775642478 DROP udp -- * * 0.0.0.0/0 10.1.0.2 udp dpt:1234 Average Packet Drop 1,266,730pps/core ≈ 860Mbit/s 79
  • 80. Netfilter Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { packet_rcv() { skb_push(); consume_skb(); } } ip_rcv() { ip_rcv_core.isra.25(); ip_rcv_finish() { ip_rcv_finish_core.isra.23() { udp_v4_early_demux() { ip_check_mc_rcu(); ip_mc_sf_allow(); ipv4_dst_check(); ip_mc_validate_source() { fib_validate_source() { __fib_validate_source(); } } } ip_local_deliver() { nf_hook_slow() { iptable_filter_hook [iptable_filter]() { ipt_do_table [ip_tables]() { udp_mt(); __local_bh_enable_ip(); } } kfree_skb(); 80 DROP
  • 81. Netfilter Packet Drop ip_local_deliver() { nf_hook_slow() { iptable_filter_hook [iptable_filter]() { ipt_do_table [ip_tables]() { __local_bh_enable_ip(); } } } ip_local_deliver_finish() { ip_protocol_deliver_rcu() { raw_local_deliver(); udp_rcv() { __udp4_lib_rcv() { udp_unicast_rcv_skb.isra.64() { udp_queue_rcv_skb() { udp_queue_rcv_one_skb() { __udp_enqueue_schedule_skb(); . . . ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { packet_rcv() { skb_push(); consume_skb(); } } ip_rcv() { ip_rcv_core.isra.25(); ip_rcv_finish() { ip_rcv_finish_core.isra.23() { udp_v4_early_demux() { ip_check_mc_rcu(); ip_mc_sf_allow(); ipv4_dst_check(); ip_mc_validate_source() { fib_validate_source() { __fib_validate_source(); } } } ip_local_deliver() { nf_hook_slow() { iptable_filter_hook [iptable_filter]() { ipt_do_table [ip_tables]() { udp_mt(); __local_bh_enable_ip(); } } kfree_skb(); UserspaceNetfliter userspace DROP DROP 81 ≈ 530Mbit/s ≈ 860Mbit/s
  • 82. Is it fast enough? 82
  • 83. 83 NO! With TC, it can be faster
  • 84. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Ingress Packet Drop Qdisc Qdisc DROP 84
  • 85. $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action drop $ tc filter show ingress dev eth0 ... match 0a010002/ffffffff at 16 match 000004d2/0000ffff at 20 action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 TC Ingress Packet Drop NIC Host A User IP TC DD NIC Host B DROP 10 Gbe 10.1.0.1 10.1.0.2:1234 85 TC Ingress packet drop with filter action
  • 86. TC Ingress Packet Drop $ tc -s filter show ingress dev eth0 … match 0a010002/ffffffff at 16 match 000004d2/0000ffff at 20 action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 installed 62 sec used 62 sec Action statistics: Sent 154709776 bytes 3363256 pkt (dropped 3363260, ...) Sent 345955052 bytes 7520762 pkt (dropped 7520767, ...) Sent 537288326 bytes 11680181 pkt (dropped 11680186, ...) Sent 728875750 bytes 15845125 pkt (dropped 15845130, ...) Sent 920311328 bytes 20006768 pkt (dropped 20006772, ...) Sent 1112155754 bytes 24177299 pkt (dropped 24177304, ...) Sent 1276514214 bytes 27750309 pkt (dropped 27750309, ...) Sent 1453930004 bytes 31607174 pkt (dropped 31607179, ...) Sent 1645855758 bytes 35779473 pkt (dropped 35779477, ...) Sent 1838232266 bytes 39961571 pkt (dropped 39961577, ...) Sent 2029302696 bytes 44115276 pkt (dropped 44115282, ...) Sent 2221122512 bytes 48285272 pkt (dropped 48285278, ...) Sent 2412335864 bytes 52442084 pkt (dropped 52442089, ...) Sent 2603632476 bytes 56600706 pkt (dropped 56600711, ...) Average Packet Drop 4,083,820pps/core ≈ 2.75Gbit/s 86
  • 87. TC Ingress Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { tcf_classify() { u32_classify [cls_u32]() { tcf_action_exec() { tcf_gact_act [act_gact](); } } } kfree_skb(); DROP 87
  • 88. TC Ingress Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { packet_rcv() { skb_push(); consume_skb(); } } ip_rcv() { ip_rcv_core.isra.25(); ip_rcv_finish() { ip_rcv_finish_core.isra.23() { udp_v4_early_demux() { ip_check_mc_rcu(); ip_mc_sf_allow(); ipv4_dst_check(); ip_mc_validate_source() { fib_validate_source() { __fib_validate_source(); } } } ip_local_deliver() { nf_hook_slow() { iptable_filter_hook [iptable_filter]() { ipt_do_table [ip_tables]() { udp_mt(); __local_bh_enable_ip(); } } kfree_skb(); ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { tcf_classify() { u32_classify [cls_u32]() { tcf_action_exec() { tcf_gact_act [act_gact](); } } } kfree_skb(); NetfliterTC DROP DROP 88 ≈ 2.75Gbit/s ≈ 860Mbit/s
  • 91. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx XDP Packet Drop DROP 91
  • 92. XDP Packet Drop DROP UDP 10.1.0.2:1234 SEC("xdp1") int xdp_prog1(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; ... if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); ... if (iph->daddr == _htonl(0xa010002)) { if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); ... if (uh->dest == htons(1234)) return XDP_DROP; } } } return XDP_PASS; } 92
  • 93. $ bpftool prog load ./xdp-drop.o /sys/fs/bpf/drop $ bpftool prog show ... 24: xdp name xdp_prog1 tag 6f8c2e06dfa2abcb gpl loaded_at 2019-10-11T17:17:33+0900 uid 0 xlated 544B jited 344B memlock 4096B map_ids 17 $ bpftool net attach xdp id 24 dev eth0 $ bpftool net xdp: eth0(8) driver id 24 XDP Packet Drop NIC Host A User IP TC DD NIC Host B DROP 10 Gbe 10.1.0.1 10.1.0.2:1234 93 XDP packet drop with XDP_DROP
  • 94. $ ethtool -S eth0 | grep rx_xdp_drop rx_xdp_drop: 6161010 rx_xdp_drop: 16114329 rx_xdp_drop: 26079532 rx_xdp_drop: 36025920 rx_xdp_drop: 45958488 rx_xdp_drop: 55869376 rx_xdp_drop: 65835136 rx_xdp_drop: 75748898 rx_xdp_drop: 85670845 rx_xdp_drop: 95635591 rx_xdp_drop: 105569230 rx_xdp_drop: 115515714 rx_xdp_drop: 125480832 rx_xdp_drop: 135432211 rx_xdp_drop: 1455190421 XDP Packet Drop Average Packet Drop 9,941,337pps/core ≈ 6.69Gbit/s 94
  • 95. XDP Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } DROP 95
  • 96. XDP Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { napi_gro_receive() { netif_receive_skb_internal() { skb_defer_rx_timestamp(); __netif_receive_skb() { __netif_receive_skb_one_core() { __netif_receive_skb_core() { tcf_classify() { u32_classify [cls_u32]() { tcf_action_exec() { tcf_gact_act [act_gact](); } } } kfree_skb(); ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } TCXDP DROP DROP 96 ≈ 2.75Gbit/s ≈ 6.69Gbit/s
  • 99. What is XDP? The definition of XDP 99
  • 100. What is XDP? eBPF based fast data-path 100
  • 101. eBPF based fast data-path What is XDP? 101
  • 102. in the kernel is similar with V8 in Chrome browser What is BPF? 102
  • 103. In kernel Virtual Machine What is BPF? 103
  • 104. An virtual machine? What does BPF look like? What is BPF? 104
  • 105. You may already used it! What is BPF? 105
  • 106. You may already used it! What is BPF? An tcpdump! 106
  • 107. What is BPF? $ tcpdump -i eth0 'udp and dst 10.1.0.2' tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes 00:19:42.244270 IP 192.168.10.1.221 > 10.1.0.2.1234: UDP, length 18 00:19:42.244270 IP 192.168.20.1.813 > 10.1.0.2.1234: UDP, length 18 00:19:42.244270 IP 192.168.30.1.856 > 10.1.0.2.1234: UDP, length 18 00:19:42.244270 IP 192.168.40.1.959 > 10.1.0.2.1234: UDP, length 18 00:19:42.244271 IP 192.168.10.1.160 > 10.1.0.2.1234: UDP, length 18 00:19:42.244271 IP 192.168.20.1.102 > 10.1.0.2.1234: UDP, length 18 00:19:42.244276 IP 192.168.30.1.114 > 10.1.0.2.1234: UDP, length 18 7 packets captured 10 packets received by filter 3 packets dropped by kernel 107
  • 108. Packet filter by BPF! What is BPF? 108
  • 109. What is BPF? $ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’ tcpdump with –d option 109
  • 110. What is BPF? $ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’ (000) ldh [12] (001) jeq #0x800 jt 2 jf 7 (002) ldb [23] (003) jeq #0x11 jt 4 jf 7 (004) ld [30] (005) jeq #0xa010002 jt 6 jf 7 (006) ret #262144 (007) ret #0 tcpdump with –d option 110
  • 111. What is BPF? $ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’ (000) ldh [12] (001) jeq #0x800 jt 2 jf 7 (002) ldb [23] (003) jeq #0x11 jt 4 jf 7 (004) ld [30] (005) jeq #0xa010002 jt 6 jf 7 (006) ret #262144 (007) ret #0 Ethernet Frame MAC. Destination MAC. Source Type PAYLOAD (IP/IPv6/ARP…) CRC 6 Bytes 6 Bytes 2 Bytes 46-1500 Bytes 4 Bytes 64 111
  • 112. What is BPF? $ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’ (000) ldh [12] (001) jeq #0x800 jt 2 jf 7 (002) ldb [23] (003) jeq #0x11 jt 4 jf 7 (004) ld [30] (005) jeq #0xa010002 jt 6 jf 7 (006) ret #262144 (007) ret #0 Ver. IHL TOS Total Len. Identification Flags Frag. Offset TTL. Protocol Header Checksum Source Address Destination Address Options(optional) Ethernet Frame MAC. Destination MAC. Source Type PAYLOAD (IP/IPv6/ARP…) CRC 6 Bytes 6 Bytes 2 Bytes 46-1500 Bytes 4 Bytes 64 112
  • 113. What is BPF? Ver. IHL TOS Total Len. Identification Flags Frag. Offset TTL. Protocol Header Checksum Source Address Destination Address Options(optional) Ethernet Frame MAC. Destination MAC. Source Type PAYLOAD (IP/IPv6/ARP…) CRC 6 Bytes 6 Bytes 2 Bytes 46-1500 Bytes 4 Bytes 64 $ tcpdump -i eth0 -d 'udp and dst 10.1.0.2’ (000) ldh [12] (001) jeq #0x800 jt 2 jf 7 (002) ldb [23] (003) jeq #0x11 jt 4 jf 7 (004) ld [30] (005) jeq #0xa010002 jt 6 jf 7 (006) ret #262144 (007) ret #0 113
  • 114. In kernel Virtual Machine “Linux kernel code execution engine” What is BPF? 114
  • 115. “Run code in the kernel” What is BPF? 115
  • 116. “Run code in the kernel” $ readelf -h bpf-prog.o | grep Machine Machine: Advanced Micro Devices X86-64 What is BPF? 116
  • 117. “Run code in the kernel” $ readelf -h bpf-prog.o | grep Machine Machine: Linux BPF Machine: Advanced Micro Devices X86-64 writing C program clang / llc BPF instruction What is BPF? 117
  • 118. “Run code in the kernel” $ readelf -h bpf-prog.o | grep Machine Machine: Linux BPF Machine: Advanced Micro Devices X86-64 but, restricted writing C program clang / llc BPF instruction What is BPF? 118
  • 119. “Run BPF code in the kernel” $ readelf -h bpf-prog.o | grep Machine Machine: Linux BPF Machine: Advanced Micro Devices X86-64 but, restricted writing C program clang / llc BPF instruction What is BPF? 119
  • 120. What is XDP? eBPF based fast data-path 120
  • 121. What is XDP? eBPF based fast data-path 121
  • 122. L2 L3 L7 Userspace IP (Netfliter) Traffic Control DD (XDP) NIC What is XDP? 122
  • 123. L2 L3 L7 Userspace IP (Netfliter) Traffic Control DD (XDP) NIC Earliest Hook in RX path of the kernel What is XDP? 123
  • 124. XDP Call Path __do_softirq() { net_rx_action() { ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } ixgbe_build_skb(); ixgbe_rx_skb() { napi_gro_receive() { netif_receive_skb_internal(); Intel Device Driver RX function Call stack 124
  • 125. XDP Call Path Right After Interrupt Processing Before any memory allocation (Expensive operation) __do_softirq() { net_rx_action() { ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } ixgbe_build_skb(); ixgbe_rx_skb() { napi_gro_receive() { netif_receive_skb_internal(); 125
  • 126. XDP Call Path Right After Interrupt Processing Before any memory allocation (Expensive operation) __do_softirq() { net_rx_action() { ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } ixgbe_build_skb(); ixgbe_rx_skb() { napi_gro_receive() { netif_receive_skb_internal(); Decide the fate of the packet 126 * User written XDP program
  • 128. XDP Actions XDP_DROP - Very fast drop by recycling XDP_ABORT - Also drop, but with tracepoint XDP_PASS - Toss packet to network stack XDP_TX - Send packet back to same interface XDP_REDIRECT - Transmit out other NICs 128
  • 129. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx XDP Actions XDP_PASS XDP_DROP XDP_REDIRECT XDP_TX 129
  • 132. xdp_buff struct xdp_buff { void *data; void *data_end; void *data_meta; void *data_hard_start; unsigned long handle; struct xdp_rxq_info *rxq; }; xdp_buff->data_hard_start xdb_buff->data_meta xdp_buff->data xdp_buff->data_end HEADROOM TAIL/TAILROOM 132
  • 133. xdp_buff struct xdp_buff { void *data; void *data_end; void *data_meta; void *data_hard_start; unsigned long handle; struct xdp_rxq_info *rxq; }; 133 HEADROOM TAIL/TAILROOM xdp_buff->data xdp_buff->data_end
  • 134. xdp_buff struct xdp_buff { void *data; void *data_end; void *data_meta; void *data_hard_start; unsigned long handle; struct xdp_rxq_info *rxq; }; 134 HEADROOM TAIL/TAILROOM xdp_buff->data xdp_buff->data_end MAC IP UDP Packet Data
  • 135. How to use XDP? 135
  • 138. packet Test environment 10.1.0.1 NIC Host A User IP TC DD NIC Host B 10.1.0.2:1234 10 Gbe … packet packet 10.1.1.2:1234 NIC Host C10 Gbe … packet packet packet 138 1. Destination NAT 2. Packet Forward 10.1.0.2 -> 10.1.1.2
  • 139. 139
  • 140. XDP Packet Forward 140 2. Packet Forward 10.1.0.2 -> 10.1.1.2 SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data = (void *)(long)xdp->data; struct bpf_fib_lookup fib; struct ethhdr *eth = data; struct iphdr *iph; ... if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph->daddr == _htonl(0xa010002)) iph->daddr = _htonl(0xa010102); ... rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } 1. Destination NAT
  • 141. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 141
  • 142. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 142 -> ELF Section where XDP program will be located
  • 143. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 143 -> Name of the XDP program
  • 144. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 144 HEADROOM TAIL/TAILROOM xdp_buff xdp_buff->data xdp_buff->data_end
  • 145. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 145 MAC HEADROOM TAIL/TAILROOM xdp_buff -> Cast to Ethernet header xdp_buff->data xdp_buff->data_end
  • 146. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 146 MAC HEADROOM TAIL/TAILROOM xdp_buff -> Validate the Ethernet header xdp_buff->data xdp_buff->data_end
  • 147. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 147 MAC IP HEADROOM TAIL/TAILROOM xdp_buff -> If eth->protocol is IP, Cast to IP header xdp_buff->data xdp_buff->data_end
  • 148. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 148 MAC IP HEADROOM TAIL/TAILROOM xdp_buff -> Validate the IP header xdp_buff->data xdp_buff->data_end
  • 149. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 149 MAC IP HEADROOM TAIL/TAILROOM xdp_buff -> Check destination address 10.1.0.2 xdp_buff->data xdp_buff->data_end
  • 150. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 150 MAC IP UDP HEADROOM TAIL/TAILROOM xdp_buff -> If iph->protocol is UDP, Cast to UDP header xdp_buff->data xdp_buff->data_end
  • 151. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 151 HEADROOM TAIL/TAILROOM xdp_buff -> Validate the UDP header MAC IP UDP xdp_buff->data xdp_buff->data_end
  • 152. SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data_end = (void *)(long)xdp->data_end; void *data = (void *)(long)xdp->data; struct ethhdr *eth = data; struct iphdr *iph; struct udphdr *uh; … if (eth + 1 > data_end) return XDP_DROP; if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph + 1 > data_end) return XDP_DROP; if (iph->daddr == _htonl(0xa010002)) if (iph->protocol == IPPROTO_UDP) { uh = data + sizeof(*eth) + sizeof(*iph); if (uh + 1 > data_end) return XDP_DROP; if (uh->dest == htons(1234)) iph->daddr = _htonl(0xa010102); } … XDP Packet Forward 152 HEADROOM TAIL/TAILROOM xdp_buff -> if dst port is 1234, Change dst address to 10.1.1.2 MAC IP UDP xdp_buff->data xdp_buff->data_end
  • 153. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 153 -> Prior to routing table lookup, prepare bpf_fib_lookup struct
  • 154. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 154 -> Fill the struct for query ex) src / dst address
  • 155. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 155 -> Make sure where packet comes from
  • 156. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 156 -> Query routing table for redirect interface
  • 157. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 157 -> Query routing table for redirect interface $ route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 _gateway 0.0.0.0 UG 101 0 0 enp6s0 10.1.0.0 0.0.0.0 255.255.255.0 U 111 0 0 eth0 10.1.1.0 0.0.0.0 255.255.255.0 U 110 0 0 eth1 192.168.0.0 0.0.0.0 255.255.255.0 U 101 0 0 enp6s0 _gateway 0.0.0.0 255.255.255.255 UH 101 0 0 enp6s0
  • 158. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 158 -> If success, decrease ttl
  • 159. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 159 -> Change Mac address for src, dst
  • 160. struct bpf_fib_lookup fib; int rc; ... __builtin_memset(&fib, 0, sizeof(fib)); fib.family = AF_INET; fib.tos = iph->tos; fib.l4_protocol = iph->protocol; fib.tot_len = ntohs(iph->tot_len); fib.ipv4_src = iph->saddr; fib.ipv4_dst = iph->daddr; fib.ifindex = xdp->ingress_ifindex; } else { return XDP_PASS; } rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { ip_decrease_ttl(iph); /* from include/net/ip.h */ memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } XDP Packet Forward 160 -> Packet Forward with bpf_redirect! (returns with XDP_REDIRECT) Forward
  • 161. $ clang -O2 -Wall -target bpf -c xdp-fwd.c -o xdp-fwd.o $ bpftool prog load ./xdp-fwd.o /sys/fs/bpf/fwd $ bpftool prog show ... 44: xdp name xdp_fwd_prog tag 1aa0135c2e55b38d gpl loaded_at 2019-10-11T20:03:01+0900 uid 0 xlated 760B jited 442B memlock 4096B $ bpftool net attach xdp id 44 dev eth0​ $ bpftool net​ xdp:​ eth0(8) driver id 44 XDP Packet Forward 161 Compile with BPF target Load program with bpftool Attach to interface
  • 163. packet Test environment 10.1.0.1 NIC Host A User IP TC DD NIC Host B 10.1.0.2:1234 10 Gbe … packet packet 10.1.1.2:1234 NIC Host C10 Gbe … packet packet packet 163 1. Destination NAT 2. Packet Forward 10.1.0.2 -> 10.1.1.2
  • 164. Hooks to process packet? L2 L3 L7 Userspace IP (Netfliter) Traffic Control DD (XDP) NIC 164
  • 165. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Userspace Packet Forward Forward 165
  • 166. Userspace Packet Forward NIC Host A Host B NIC Host C User IP TC DD NIC Forwardstream { server { listen 1234 udp; proxy_pass udp_target; } upstream udp_target { server 10.1.1.2:1234; } } /etc/nginx/nginx.conf 10.1.0.1 10.1.0.2:1234 10.1.1.2:1234 166 UDP Load Balancer packet forward with DNAT
  • 167. $ ethtool -S eth0 | grep rx_xdp_drop rx_xdp_drop: 370841 rx_xdp_drop: 754718 rx_xdp_drop: 1118515 rx_xdp_drop: 1483198 rx_xdp_drop: 1858322 rx_xdp_drop: 2237080 rx_xdp_drop: 2609916 rx_xdp_drop: 2985062 rx_xdp_drop: 3342212 rx_xdp_drop: 3725703 rx_xdp_drop: 4105888 Average Packet Forward 373,263pps/core Userspace Packet Forward ≈ 260Mbit/s 167
  • 169. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx Netfilter Packet Forward 169 Forward
  • 170. $ iptables -t nat -A PREROUTING -d 10.1.0.2 -p udp --dport 1234 -j DNAT --to-destination 10.1.1.2:1234 $ iptables -t nat -L PREROUTING Chain PREROUTING (policy ACCEPT) target prot opt source destination DNAT udp -- anywhere 10.1.0.2 udp dpt:1234 to:10.1.1.2:1234 Netfilter Packet Forward 10.1.0.1 NIC Host A Host B 10.1.0.2:1234 10.1.1.2:1234 NIC Host C User IP TC DD NIC Forward 170 Netfilter (iptables) packet forward with DNAT
  • 171. Netfilter Packet Forward $ ethtool -S eth0 | grep rx_xdp_drop rx_xdp_drop: 686991 rx_xdp_drop: 1377146 rx_xdp_drop: 2064571 rx_xdp_drop: 2755620 rx_xdp_drop: 3418211 rx_xdp_drop: 4050284 rx_xdp_drop: 4682626 rx_xdp_drop: 5366274 rx_xdp_drop: 6054439 rx_xdp_drop: 6743962 rx_xdp_drop: 7434054 rx_xdp_drop: 8003538 rx_xdp_drop: 8693454 Average Packet Forward 668,728pps/core ≈ 960Mbit/s 171
  • 173. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx TC Ingress Packet Forward ROUTING Qdisc Qdisc 173 Forward
  • 174. $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 parent ffff: protocol ip u32 match ip dst 10.1.0.2 match ip dport 1234 0xffff action nat ingress 10.1.0.2/32 10.1.1.2/32 pipe action mirred egress redirect dev eth1 $ tc filter show ingress dev eth0 ... match 0a010002/ffffffff at 16 match 000004d2/0000ffff at 20 action order 1: nat ingress 10.1.0.2/32 10.1.1.2 pipe index 1 ref 1 bind 1 action order 2: mirred (Egress Redirect to eth1) stolen index 1 ref 1 bind 1 TC Ingress Packet Forward NIC Host A Host B NIC Host C User IP TC DD NIC Forward 10.1.0.1 10.1.0.2:1234 10.1.1.2:1234 174 TC Ingress packet forward with DNAT (redirect)
  • 175. TC Ingress Packet Forward $ ethtool -S eth0 | grep rx_xdp_drop rx_xdp_drop: 1407789 rx_xdp_drop: 2848601 rx_xdp_drop: 4274539 rx_xdp_drop: 5695547 rx_xdp_drop: 7132990 rx_xdp_drop: 8533688 rx_xdp_drop: 9943697 rx_xdp_drop: 11367744 rx_xdp_drop: 12808213 rx_xdp_drop: 14234151 rx_xdp_drop: 15685314 Average Packet Forward 1,425,938pps/core 175 ≈ 1Gbit/s
  • 177. UPPER LAYER XDP TC Ingress ROUTING FORWARDING OUTPUT TC Egress ROUTING L4~ L3 L2 PREROUTING INPUT POSTROUTING Rx Tx XDP Packet Forward ROUTING 177 Forward
  • 178. XDP Packet Forward 178 2. Packet Forward 10.1.0.2 -> 10.1.1.2 SEC("xdp_fwd") int xdp_fwd_prog(struct xdp_md *xdp) { void *data = (void *)(long)xdp->data; struct bpf_fib_lookup fib; struct ethhdr *eth = data; struct iphdr *iph; ... if (eth->h_proto == htons(ETH_P_IP)) { iph = data + sizeof(*eth); if (iph->daddr == _htonl(0xa010002)) iph->daddr = _htonl(0xa010102); ... rc = bpf_fib_lookup(xdp, &fib, sizeof(fib), 0); if (rc == BPF_FIB_LKUP_RET_SUCCESS) { memcpy(eth->h_dest, fib.dmac, ETH_ALEN); memcpy(eth->h_source, fib.smac, ETH_ALEN); return bpf_redirect(fib.ifindex, 0); } return XDP_PASS; } 1. Destination NAT
  • 179. $ bpftool prog load ./xdp-fwd.o /sys/fs/bpf/fwd $ bpftool prog show ... 44: xdp name xdp_fwd_prog tag 1aa0135c2e55b38d gpl loaded_at 2019-10-11T20:03:01+0900 uid 0 xlated 760B jited 442B memlock 4096B $ bpftool net attach xdp id 44 dev eth0​ $ bpftool net​ xdp:​ eth0(8) driver id 44 XDP Packet Forward NIC Host A Host B NIC Host C User IP TC DD NIC Forward 10.1.0.1 10.1.0.2:1234 10.1.1.2:1234 179 XDP packet forward with DNAT (XDP_FORWARD)
  • 180. $ ethtool -S eth0 | grep rx_xdp_drop rx_xdp_drop: 3031344 rx_xdp_drop: 6057680 rx_xdp_drop: 9088916 rx_xdp_drop: 12118560 rx_xdp_drop: 15150319 rx_xdp_drop: 18180559 rx_xdp_drop: 21210383 rx_xdp_drop: 24240271 rx_xdp_drop: 27270543 rx_xdp_drop: 30300559 rx_xdp_drop: 33329483 rx_xdp_drop: 36356788 XDP Packet Forward Average Packet Forward 3,029,764pps/core ≈ 2.04Gbit/s 180
  • 182. More about XDP? XDP Offload and Further usages 182
  • 184. XDP Modes Generic XDP Native XDP Offloaded XDP 184
  • 185. XDP Modes Packet NIC XDP Driver XDP Generic XDP Netfilter …TC XDP can happen here Network Hardware Network Driver Linux Kernel Network Stack 185 NativeOffloaded Generic
  • 186. XDP Modes Packet NIC XDP Driver XDP Generic XDP Netfilter …TC Network Hardware Network Driver Linux Kernel Network Stack 186 NativeOffloaded Generic = Normally, XDP?
  • 187. XDP Modes Generic XDP - No driver support needed Native XDP - Intel (ixgbe, ixgbevf, i40e) - Mellanox (mlx4, mlx5) - Broadcom (bnxt) - Qlogic (qede) - Netronome (nfp) - Others (virtio, tun) (Most of the 10Gbe Driver) Offloaded XDP - Netronome (nfp) 187
  • 189. What is XDP OFFLOAD? 0: r0 = 1 1: r2 = *(u32 *)(r1 + 4) 2: r1 = *(u32 *)(r1 + 0) 3: r3 = r1 4: r3 += 14 5: if r3 > r2 goto +18 <LBB0_8> 6: r3 = *(u8 *)(r1 + 12) 7: r4 = *(u8 *)(r1 + 13) 8: r4 <<= 8 9: r4 |= r3 10: if r4 != 8 goto +12 <LBB0_7> 11: r3 = r1 12: r3 += 34 13: if r3 > r2 goto +10 <LBB0_8> 14: r3 = *(u32 *)(r1 + 30) 15: if r3 != 33554698 goto +7 <LBB0_7> 16: r3 = *(u8 *)(r1 + 23) 17: if r3 != 17 goto +5 <LBB0_7> 18: r3 = r1 19: r3 += 42 ... XDP offload BPF Program 189
  • 190. What is XDP OFFLOAD? 0: r0 = 1 1: r2 = *(u32 *)(r1 + 4) 2: r1 = *(u32 *)(r1 + 0) 3: r3 = r1 4: r3 += 14 5: if r3 > r2 goto +18 <LBB0_8> 6: r3 = *(u8 *)(r1 + 12) 7: r4 = *(u8 *)(r1 + 13) 8: r4 <<= 8 9: r4 |= r3 10: if r4 != 8 goto +12 <LBB0_7> 11: r3 = r1 12: r3 += 34 13: if r3 > r2 goto +10 <LBB0_8> 14: r3 = *(u32 *)(r1 + 30) 15: if r3 != 33554698 goto +7 <LBB0_7> 16: r3 = *(u8 *)(r1 + 23) 17: if r3 != 17 goto +5 <LBB0_7> 18: r3 = r1 19: r3 += 42 ... XDP offload BPF Program 190 Runs more earlier No CPU usage
  • 191. Packet Drop Results 783,063 1,266,730 4,083,820 9,941,337 0 2 4 6 8 10 12 userspace netfilter tc xdp Mpps 191 ≈ 6.69Gbit/s Out of 10Gbit/s
  • 192. $ bpftool prog load ./xdp-drop.o /sys/fs/bpf/drop $ bpftool prog show ... 24: xdp name xdp_prog1 tag 6f8c2e06dfa2abcb gpl loaded_at 2019-10-11T17:17:33+0900 uid 0 xlated 544B jited 344B memlock 4096B map_ids 17 $ bpftool net attach xdpoffload id 18 dev eth0 $ bpftool net xdp: eth0(8) offload id 18 XDP Offload Packet Drop 10.1.0.1 NIC Host A User IP TC DD NIC Host B DROP 10 Gbe 10.1.0.2:1234 192 XDP offload packet drop with XDP_DROP
  • 193. XDP Offload Packet Drop $ ethtool -S eth0 | grep bpf_app1_pkts bpf_app1_pkts: 9340887 bpf_app1_pkts: 14274730 bpf_app1_pkts: 29233693 bpf_app1_pkts: 44165498 bpf_app1_pkts: 59097632 bpf_app1_pkts: 74057835 bpf_app1_pkts: 88990848 bpf_app1_pkts: 103947963 bpf_app1_pkts: 118881530 bpf_app1_pkts: 133838008 bpf_app1_pkts: 148770851 bpf_app1_pkts: 163705144 bpf_app1_pkts: 178664224 bpf_app1_pkts: 193596262 bpf_app1_pkts: 208554907 bpf_app1_pkts: 223488168 Average Packet Drop 14,883,153pps/core ≈ 10.00Gbit/s 193
  • 194. Packet Drop Results with XDP Offload 783,063 1,266,730 4,083,820 9,941,337 14,883,153 0 2 4 6 8 10 12 14 16 userspace netfilter tc xdp xdp-offload Mpps 194 Out of 10Gbit/s ≈ 10.00Gbit/s 6.69Gbit/s
  • 195. XDP Offload Packet Drop ixgbe_poll() { ixgbe_clean_rx_irq() { ixgbe_get_rx_buffer(); ixgbe_run_xdp() { bpf_prog_run_xdp(); } XDPXDP Offload DROP DROP NONE! 195
  • 196. XDP Offload Packet Drop XDP XDP Offload 196
  • 197. XDP Use Cases • Load Balancing • Packet Tunneling (Encapsulation) • DDoS attack mitigation • Network monitoring • ETC.. 197
  • 198. Sample code, test results can be found: github.com/DanielTimLee/soscon19_XDP 198
  • 199. SAMSUNG OPEN SOURCE CONFERENCE 2019 THANK YOU 199