Agenda:
In this session, Shmulik Ladkani discusses the kernel's net_device abstraction, its interfaces, and how net-devices interact with the network stack. The talk covers many of the software network devices that exist in the Linux kernel, the functionalities they provide and some interesting use cases.
Speaker:
Shmulik Ladkani is a Tech Lead at Ravello Systems.
Shmulik started his career at Jungo (acquired by NDS/Cisco) implementing residential gateway software, focusing on embedded Linux, Linux kernel, networking and hardware/software integration.
51966 coffees and billions of forwarded packets later, with millions of homes running his software, Shmulik left his position as Jungo’s lead architect and joined Ravello Systems (acquired by Oracle) as tech lead, developing a virtual data center as a cloud service. He's now focused on virtualization systems, network virtualization and SDN.
1. Fun with Network Interfaces
Shmulik Ladkani
March 2016
This work is licensed under a Creative Commons Attribution 4.0 International License.
2. On the Menu
● Linux network stack, a (quick) intro
○ What’s this net_device anyway?
○ Programming interfaces
○ Frame reception and transmission
● Logical network interfaces
○ What?
○ Why?
○ Examples
○ Examples
○ Examples
3. Agenda
● Goals
○ Strengthen foundations
○ Explain interaction of main network stack components
○ Familiarize with building blocks of virtual networks
○ Ease further research
● Non-Goals
○ Mastering device driver programming
○ How network gear operates in detail
○ Specific component deep dive
11. Network Core
● Generic functionalities of a network device
● RX
○ Processing of incoming frames
○ Delivery to upper protocols
● TX
○ Queuing
○ Final processing
○ Hand-over to driver’s transmit method
12. Struct net_device
● Represents a network interface
● One for each network device in the system
○ Either physical device or logical (software) one
13. Struct net_device
Common properties
● Identified by a ‘name’ and ‘ifindex’
○ Unique within a network namespace
● Has BSD-like ‘flags’
IFF_UP, IFF_LOOPBACK, IFF_POINTOPOINT, IFF_NOARP, IFF_PROMISC...
● Has ‘features’
NETIF_F_SG_BIT, NETIF_F_HW_CSUM_BIT, NETIF_F_GSO_BIT,
NETIF_F_GRO_BIT, NETIF_F_LRO_BIT, NETIF_F_RXHASH_BIT,
NETIF_F_RXCSUM_BIT...
● Has many other fields...
● Holds associated device operations
const struct net_device_ops *netdev_ops;
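An abbreviated sketch of the struct (a field subset from include/linux/netdevice.h; exact layout varies across kernel versions):
/* Abbreviated excerpt; the real struct has dozens more fields */
struct net_device {
    char                        name[IFNAMSIZ];  /* e.g. "eth0" */
    int                         ifindex;         /* unique within a netns */
    unsigned int                flags;           /* BSD-like IFF_* flags */
    netdev_features_t           features;        /* NETIF_F_* capability bits */
    const struct net_device_ops *netdev_ops;     /* device methods (next slide) */
    /* ... many other fields ... */
};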
14. Struct net_device_ops
● Interface. Defines all device methods
○ Driver implements
■ E.g. e1000e_netdev_ops, bcmgenet_netdev_ops…
○ Network core uses
● Fat interface…
○ 44 methods in v3.4
○ 59 methods in v3.14
○ 68 methods in v4.4
○ A few methods are #ifdef protected
○ Some are optional
15. Struct net_device_ops
Common methods
● ndo_open()
○ Upon device transition to UP state
● ndo_stop()
○ Upon device transition to DOWN state
● ndo_start_xmit()
○ When a packet needs to be transmitted
● ndo_set_features()
○ Update device configuration to new features
● ndo_get_stats()
○ Get device usage statistics
● ndo_set_mac_address()
○ When MAC needs to be changed
● Many more...
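For illustration, a minimal driver publishes its methods roughly like this (the mydev_* names are hypothetical; unimplemented methods are simply left NULL):
/* Hypothetical driver: wire the common methods into the interface */
static const struct net_device_ops mydev_netdev_ops = {
    .ndo_open       = mydev_open,       /* device brought UP */
    .ndo_stop       = mydev_stop,       /* device brought DOWN */
    .ndo_start_xmit = mydev_start_xmit, /* transmit one skb */
    .ndo_get_stats  = mydev_get_stats,  /* usage statistics */
};

/* ... at device setup time: */
dev->netdev_ops = &mydev_netdev_ops;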
16. Stack’s core interfaces
For device implementers
● napi_schedule()
○ Schedule driver’s poll routine to be called
● netif_receive_skb()
○ Pass a received buffer to network core processing
○ A few similar interfaces exist (e.g. netif_rx)
● netif_stop_queue()
○ Stop upper layer from calling device’s ndo_start_xmit
● netif_wake_queue()
○ Allow upper layer to call device’s ndo_start_xmit
● More...
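A rough sketch of where these calls sit in a NAPI driver (the mydev_* helpers are hypothetical):
/* RX: the IRQ handler defers work, the poll routine feeds the stack */
static irqreturn_t mydev_irq(int irq, void *data)
{
    struct mydev_priv *priv = data;

    mydev_disable_rx_irq(priv);   /* hypothetical HW helper */
    napi_schedule(&priv->napi);   /* schedule driver's poll routine */
    return IRQ_HANDLED;
}

static int mydev_poll(struct napi_struct *napi, int budget)
{
    struct sk_buff *skb;
    int done = 0;

    while (done < budget && (skb = mydev_rx_frame(napi))) {
        netif_receive_skb(skb);   /* pass buffer to network core */
        done++;
    }
    return done;
}

/* TX: flow control around ndo_start_xmit */
static netdev_tx_t mydev_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    mydev_post_to_hw_ring(dev, skb);  /* hypothetical */
    if (mydev_ring_full(dev))
        netif_stop_queue(dev);        /* stop upper layer's xmit calls */
    return NETDEV_TX_OK;
}
/* ... on TX completion, netif_wake_queue(dev) re-enables xmit */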
17. Frame Reception
__netif_receive_skb_core()
● Deliver to network taps (protocol sniffers)
● Ingress classification and filtering
● VLAN packet handling
● Invoke a specially registered ‘rx_handler’
○ May consume packet
● Deliver to the registered L3 protocol handler
○ No handler? Drop
18. net_device->rx_handler
● Per device registered function
○ Called internally from ‘__netif_receive_skb_core’
○ Prior to delivery to protocol handlers
● Allows special L2 processing during RX
● Semantics
○ At most one registered ‘rx_handler’ per device
○ May consume the packet
■ ‘netif_receive_skb’ will not process it further
○ May instruct ‘netif_receive_skb’ to do “another round”
● Notable users
○ bridge, openvswitch, bonding, team, macvlan, macvtap
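The shape of an rx_handler, sketched (the lookup helper is hypothetical; the registration call and RX_HANDLER_* results are the real API):
static rx_handler_result_t my_handle_frame(struct sk_buff **pskb)
{
    struct sk_buff *skb = *pskb;
    struct net_device *target = my_lookup_target(skb); /* hypothetical */

    if (!target)
        return RX_HANDLER_PASS;      /* continue normal processing */

    skb->dev = target;               /* retarget to the logical device */
    return RX_HANDLER_ANOTHER;       /* ask for "another round" */
    /* returning RX_HANDLER_CONSUMED would mean the packet was taken over */
}

/* At most one handler per device; registration requires RTNL: */
err = netdev_rx_handler_register(lower_dev, my_handle_frame, NULL);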
19. Frame Transmission
dev_queue_xmit()
● Well, the packet is set up for transmission
○ Yay! Let’s pass it to the driver’s ndo_start_xmit()!
○ Wait a minute… literally
● Device has no queue?
○ Final preps & xmit
● Device has a queue?
○ Enqueue the packet
■ Using the device’s queueing discipline
○ Kick the queue
○ Will eventually get to “final preps & xmit”
■ Synchronously or asynchronously
■ According to the discipline
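In pseudo-C, the decision looks roughly like this (heavily simplified from net/core/dev.c; locking, multiqueue and helper names are illustrative, not the actual kernel code):
/* Simplified pseudo-C sketch */
q = tx_queue_of(dev, skb)->qdisc;     /* illustrative name */
if (q->enqueue) {
    q->enqueue(skb, q);               /* device has a queue: enqueue per qdisc */
    qdisc_run(q);                     /* kick it; the qdisc decides when frames
                                         reach "final preps & xmit" */
} else {
    dev_hard_start_xmit(skb, dev, ...); /* queueless: final preps & xmit now */
}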
21. Software Net Device
● Not associated with a physical NIC
● Provides logical rx/tx functionality
○ By implementing the net_device interface
● Allows special-purpose packet processing
○ Without altering the network stack
22. Variants of Logical Devices
● Directly operate on specified net device(s)
○ Protocols (vlan, pppoe…)
○ Logical constructs (bridge, bonding, veth, macvlan...)
● Interact with higher network layers
○ IP based tunnels (ipip, gre, sit, l2tp…)
○ UDP based tunnels (vxlan, geneve, l2tp-udp…)
● Other constructs
○ May or may not interact with other net devices
○ lo, ppp, tun/tap, ifb...
24. lo:
Loopback interface
static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                 struct net_device *dev)
{
    ...
    netif_rx(skb); // eventually gets to netif_receive_skb
    ...
    return NETDEV_TX_OK;
}
Every transmitted packet is bounced back for reception
○ Using the same device
[Diagram: lo attached to the network core; frames it transmits re-enter RX and are delivered up to ipv4, ipv6, etc.]
29. TUN use cases
● Usermode VPN applications
○ Routing to the VPN subnet is directed to the tun device
■ E.g. 192.168.50.0/24 dev tun0
○ read(tun_fd, buf)
○ encrypt(buf)
○ encapsulate(buf)
○ send(tcp_sock, buf)
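A minimal userspace sketch of the pattern above (error handling omitted):
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int tun_open(const char *name)           /* e.g. "tun0" */
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI; /* L3 packets, no prepended header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ioctl(fd, TUNSETIFF, &ifr);          /* attach this fd to the device */
    return fd;
}
/* then loop: read(tun_fd, buf, sizeof(buf)) -> encrypt -> encapsulate -> send */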
30. TAP use cases
● VM networking
○ Emulator exposes a tap for each VM NIC
○ Emulator traps VM xmit
○ Issues write(tap_fd)
○ Packet arrives at the host’s net stack via the tap device
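The emulator side differs only in the device type; sketched (tap_fd obtained as in the TUN example, eth_frame/frame_len hypothetical):
ifr.ifr_flags = IFF_TAP | IFF_NO_PI;  /* L2 ethernet frames instead of L3 */
ioctl(tap_fd, TUNSETIFF, &ifr);
/* on each trapped VM transmit: */
write(tap_fd, eth_frame, frame_len);  /* frame enters the host's RX path */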
32. veth: (circa 2.6.24)
Virtual Ethernet Pair
● Local ethernet “wire”
● Comes as pair of virtual ethernet interfaces
● veth TX --> peer veth RX
○ And vice versa
[Diagram: veth0 and veth1 attached to the network core; a frame transmitted on one emerges as RX on the other]
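The core of the transmit path, simplified from drivers/net/veth.c (RCU locking and accounting trimmed):
static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct veth_priv *priv = netdev_priv(dev);
    struct net_device *rcv = rcu_dereference(priv->peer);

    /* inject the frame directly into the peer device's RX path */
    if (dev_forward_skb(rcv, skb) != NET_RX_SUCCESS)
        dev->stats.tx_dropped++;   /* accounting simplified */
    return NETDEV_TX_OK;
}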
33. veth use cases
● Container networking
○ First veth in host’s net namespace
○ Peer veth in container’s net namespace
● Local links of a virtual network
● Network emulation
39. MacVLAN: (circa 2.6.32)
MAC Address based VLANs
● Network segmentation based on destination MAC
● Macvlan devices have an underlying “lower” device
○ Macvlans on same link have unique MAC addresses
● Macvlan xmit
○ Calls ‘dev_queue_xmit’ on lower device
● rx_handler registered on lower device
○ Look for a macvlan device based on packet’s dest-MAC
○ None found? Return “Pass” (normal processing)
○ Found? Change skb->dev to the macvlan dev, return “Another round”
[Diagram: mvlan0, mvlan1, mvlan2 stacked over lower device eth0; eth0’s rx_handler() demuxes by destination MAC]
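Sketched, the rx_handler logic above (simplified from drivers/net/macvlan.c; the lookup helper name is approximate):
static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
{
    struct sk_buff *skb = *pskb;
    const struct ethhdr *eth = eth_hdr(skb);
    struct macvlan_dev *vlan = macvlan_lookup(skb->dev, eth->h_dest); /* approx. */

    if (!vlan)
        return RX_HANDLER_PASS;   /* no match: normal processing */

    skb->dev = vlan->dev;         /* retarget to the macvlan device */
    return RX_HANDLER_ANOTHER;    /* re-run RX for the new device */
}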
40. MacVLAN use cases
● Network segmentation
○ Where 802.1q VLAN can’t be used
[Diagram: eth0 carrying macvlans eth0_data (54.6.30.15/24, Internet Access Service) and eth0_voip (10.0.5.94/16, VoIP Service)]
41. MacVLAN use cases
● Lightweight virtual network
○ For containers / VMs
○ Various operation modes
● MacVTAP (circa 2.6.34)
○ Each device has a tap-like FD interface
[Diagram: mvlan0, mvlan1, mvlan2 over eth0, one per Container X, Y, Z]
43. ip_gre: (circa 2.2)
GRE Tunnel, IP Based
● Global initialization
○ Register the IPPROTO_GRE transport protocol handler
[Diagram: the gre stack registered next to the ipv4/ipv6 stacks, above the network core]
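Roughly as done in the kernel's GRE code: register a transport handler so every IP packet with protocol 47 (GRE) reaches gre_rcv():
static const struct net_protocol gre_protocol = {
    .handler = gre_rcv,           /* invoked by the IP stack on RX */
};

static int __init gre_init(void)
{
    return inet_add_protocol(&gre_protocol, IPPROTO_GRE);
}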
44. ip_gre: (circa 2.2)
GRE Tunnel, IP Based
● Per device initialization
○ Store tunnel instance parameters
○ E.g. encapsulating iph.saddr, iph.daddr
[Diagram: tunnel devices gre0 (54.90.24.7) and gre1 (81.104.12.5), each an instance of the gre stack above the network core]
45. ip_gre device
Transmit method
● Routing to remote subnet directed to gre device
○ E.g. 192.168.50.0/24 dev gre0
● Install the GRE header
● Consult IP routing for output decision
○ Based on tunnel parameters
● Install the encapsulating IP header
● Pass to IP stack for local output
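The transmit steps, as pseudo-C (heavily simplified; the real path is net/ipv4/ip_gre.c plus ip_tunnel.c, and the helper names here are illustrative):
/* Pseudo-C sketch; not the actual kernel code */
static netdev_tx_t gre_dev_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct ip_tunnel *tunnel = netdev_priv(dev);

    push_gre_header(skb, tunnel);               /* install the GRE header */
    rt = route_output(tunnel->parms.iph.daddr); /* consult IP routing */
    push_outer_ip_header(skb, tunnel, rt);      /* encapsulating IP header */
    ip_local_out(skb);                          /* hand to IP stack for output */
    return NETDEV_TX_OK;
}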
46. ip_gre device
Receive path
● Encapsulating packet arrives on a net device
● IP stack invokes the registered transport handler
● GRE handler looks up a matching tunnel instance
○ Based on encapsulating IP header fields
● Changes skb->dev to the matched tunnel device
● skb now points to the inner packet
● Re-submit to network’s core RX path
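And the receive steps, again as pseudo-C (tunnel-lookup and header-length details omitted; helper names illustrative):
/* Pseudo-C sketch; not the actual kernel code */
static int gre_rcv(struct sk_buff *skb)
{
    const struct iphdr *iph = ip_hdr(skb);  /* encapsulating IP header */
    struct ip_tunnel *tunnel;

    tunnel = tunnel_lookup(iph->saddr, iph->daddr); /* illustrative */
    if (!tunnel)
        return -1;                   /* no matching instance */

    skb->dev = tunnel->dev;          /* retarget to the tunnel device */
    skb_pull(skb, gre_hdr_len(skb)); /* skb now points at the inner packet */
    netif_rx(skb);                   /* re-submit to the core RX path */
    return 0;
}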
48. bond: (circa 2.2)
Link Aggregation
● Aggregates multiple interfaces into a single “bond”
○ Various operating modes:
Round-robin, active-backup, broadcast, 802.3ad...
● Bond device has multiple “slave” devices
● Bond xmit
○ Calls ‘dev_queue_xmit’ on slave device(s)
● rx_handler registered on slave devices
○ Changes skb->dev to the bond device, returns “Another round”
[Diagram: bond0 on top of slaves eth0 and eth1; TX via dev_queue_xmit() on a slave, RX via the slaves’ rx_handler()]
49. bond use cases
● Bandwidth aggregation
● Fault tolerance / HA
● L2 based Load Balancing
● See the similar ‘team’ driver (circa 3.3)