The Docker overlay network driver relies on several technologies: network namespaces, VXLAN, Netlink, and a distributed key-value store. This talk presents each of these mechanisms one by one along with their userland tools and shows, hands-on, how they interact when setting up an overlay to connect containers.
The talk will continue with a demo showing how to build your own simple overlay using these technologies.
9. How does it work? Let's look inside containers
docker0:~$ docker exec C0 ip addr show
58: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
inet 192.168.0.100/24 scope global eth0
docker0:~$ docker exec C0 ip -details link show dev eth0
58: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
veth
11. Where is the other end of the veth?
docker0:~$ ip link show >> Nothing, it must be in another Namespace
docker0:~$ sudo ls -l /var/run/docker/netns
8-c4305b67cd
docker0:~$ docker network inspect linuxcon -f {{.Id}}
c4305b67cda46c2ed96ef797e37aed14501944a1fe0096dacd1ddd8e05341381
docker0:~$ overns=/var/run/docker/netns/8-c4305b67cd
docker0:~$ sudo nsenter --net=$overns ip -d link show
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
bridge
62: vxlan1: <..> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
vxlan id 256 srcport 10240 65535 dstport 4789 proxy l2miss l3miss ageing 300
59: veth2: <...> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
13. What is VXLAN?
• Tunneling technology over UDP (L2 in UDP)
• Developed for cloud SDN to create multi-tenancy
• Without the need for L2 connectivity
• Without the normal VLAN limit (4096 VLAN Ids)
• Easy to encrypt: IPSEC
• Overhead: 50 bytes
• In Linux
• Started with Open vSwitch
• Native since kernel 3.7 (3.16 for namespace support)
VXLAN: Virtual eXtensible LAN
VNI: VXLAN Network Identifier
VTEP: VXLAN Tunnel Endpoint
Packet format: Outer IP packet | UDP (dst 4789) | VXLAN header | Original L2 frame
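The 50-byte figure is simply the sum of the encapsulation headers added to each frame: 14 bytes (outer Ethernet) + 20 bytes (outer IPv4) + 8 bytes (UDP) + 8 bytes (VXLAN header) = 50 bytes, which is also why the container interfaces shown earlier have an MTU of 1450 (1500 - 50).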
14. Let's have a look
docker0:~$ sudo tcpdump -nn -i eth0 "port 4789"
docker1:~$ docker run -it --rm --net linuxcon debian ping 192.168.0.100
PING 192.168.0.100 (192.168.0.100): 56 data bytes
64 bytes from 192.168.0.100: seq=0 ttl=64 time=1.153 ms
64 bytes from 192.168.0.100: seq=1 ttl=64 time=0.807 ms
docker0:~$
13:35:12.796941 IP 10.0.0.11.60916 > 10.0.0.10.4789: VXLAN, flags [I] (0x08), vni 256
IP 192.168.0.2 > 192.168.0.100: ICMP echo request, id 1, seq 0, length 64
13:35:12.797035 IP 10.0.0.10.54953 > 10.0.0.11.4789: VXLAN, flags [I] (0x08), vni 256
IP 192.168.0.100 > 192.168.0.2: ICMP echo reply, id 1, seq 0, length 64
15. Full connectivity with VXLAN
[Diagram: consul plus two hosts. On docker0 (10.0.0.10), container C0 (192.168.0.100) connects through eth0 / veth to br0 and the vxlan interface in the overlay namespace; docker1 (10.0.0.11) has the same layout for C1 (192.168.0.Y). A ping from C1 to C0 is encapsulated as: outer IP src 10.0.0.11, dst 10.0.0.10 | UDP src X, dst 4789 | VXLAN header | original L2 frame src 192.168.0.Y, dst 192.168.0.100.]
16. How do containers find each other?
• VXLAN Data plane
– Sending data between hosts
– Tunneling using UDP
• VXLAN Control plane
– Distribution of VXLAN endpoints ("VTEPs")
– Distribution of MAC-to-VTEP mappings
– ARP offloading (optional, but required if ARP traffic is not carried over the overlay)
17. VXLAN Control Plane - Option 1: Multicast
[Diagram: three hosts, each with a vxlan interface, all joined to a multicast group 239.x.x.x.
ARP: Who has 192.168.0.2?
L2 discovery: where is 02:42:c0:a8:00:02?]
Use a multicast group to send traffic for unknown L3/L2 addresses
• PROS: simple and efficient
• CONS: Multicast connectivity not always available (on public clouds for instance)
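As a sketch (this is not how Docker sets it up), a multicast-backed VXLAN interface can be created directly with iproute2; the group address and VNI below are arbitrary examples:
ip link add dev vxlan42 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789
Unknown destinations and broadcast frames (ARP included) are flooded to the group, and the kernel learns MAC-to-VTEP mappings from the return traffic.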
18. VXLAN Control Plane- Option 2: Point-to-point
[Diagram: two hosts, each with a vxlan interface pointing at the other host's IP (remote IP, point-to-point): everything for unknown addresses is sent to that remote IP.]
Configure a remote IP address where to send traffic for unknown addresses
• PROS: simple, no need for multicast, very good for two hosts
• CONS: difficult to manage with more than 2 hosts
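Again as a sketch with arbitrary values, the point-to-point variant is the same iproute2 command with a unicast remote address instead of a multicast group:
ip link add dev vxlan42 type vxlan id 42 remote 10.0.0.11 dstport 4789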
19. VXLAN Control Plane- Option 3: User-land
[Diagram: each host runs a vxlan interface plus a daemon that manually populates the ARP and FDB tables.
ARP: MAC address of 192.168.0.2
L2: VTEP (host) for 02:42:c0:a8:00:02]
Do nothing, provide ARP / FDB information from outside
• PROS: very flexible
• CONS: requires a daemon and a centralized database of addresses
20. How is it done by Docker?
docker0:~$ sudo nsenter --net=$overns ip neighbor show
docker0:~$ sudo nsenter --net=$overns bridge fdb show
docker1:~$ docker run -d --ip 192.168.0.200 --net linuxcon --name C1 debian sleep infinity
docker0:~$ sudo nsenter --net=$overns ip neighbor show
192.168.0.200 dev vxlan0 lladdr 02:42:c0:a8:00:c8 PERMANENT
docker0:~$ sudo nsenter --net=$overns bridge fdb show
02:42:c0:a8:00:c8 dev vxlan0 dst 10.0.0.11 self permanent
21. Where is this information stored?
docker0:~$ net=$(docker network inspect linuxcon -f {{.Id}})
docker0:~$ curl -s http://consul:8500/v1/kv/docker/network/v1.0/network/${net}/
docker0:~$ python/dump_endpoints.py
Endpoint Name: C1
IP address: 192.168.0.200/24
MAC address: 02:42:c0:a8:00:c8
Locator: 10.0.0.11
Endpoint Name: C0
IP address: 192.168.0.100/24
MAC address: 02:42:c0:a8:00:64
Locator: 10.0.0.10
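The values returned by Consul's KV API are base64-encoded (libnetwork stores them as JSON blobs), which is what dump_endpoints.py decodes; assuming jq is installed, a quick manual peek looks something like:
docker0:~$ curl -s "http://consul:8500/v1/kv/docker/network/v1.0/network/${net}/?recurse" | jq -r '.[0].Value' | base64 -d | jq .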
22. How is it distributed?
docker0:~$ serf agent -join 10.0.0.10:7946 -node demo -event-handler=./serf.sh
docker1:~$ docker run -d --net linuxcon debian sleep infinity
docker1:~$ docker rm -f $(docker ps -aq)
docker0:~$
New event: user
join 192.168.0.2 255.255.255.0 02:42:c0:a8:00:02
New event: user
leave 192.168.0.2 255.255.255.0 02:42:c0:a8:00:02
New event: user
leave 192.168.0.200 255.255.255.0 02:42:c0:a8:00:c8
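The serf.sh handler is in the github repo; a minimal handler producing output like the above could look like this (Serf passes the event type in SERF_EVENT, the user event name in SERF_USER_EVENT, and the payload on stdin):
#!/bin/sh
# Print the event type, then the user event name followed by its payload
echo "New event: ${SERF_EVENT}"
while read -r payload; do
  echo "${SERF_USER_EVENT} ${payload}"
done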
28. Creating the Overlay Namespace
ip netns add overns                                            # create overlay NS
ip netns exec overns ip link add dev br42 type bridge          # create bridge in NS
ip netns exec overns ip addr add dev br42 192.168.0.1/24
ip link add dev vxlan42 type vxlan id 42 proxy dstport 4789    # create VXLAN interface
ip link set vxlan42 netns overns                               # move it to NS
ip netns exec overns ip link set vxlan42 master br42           # add it to bridge
ip netns exec overns ip link set vxlan42 up                    # bring all interfaces up
ip netns exec overns ip link set br42 up
(setup_vxlan script)
30. Create containers and attach them
docker0
docker run -d --net=none --name=demo debian sleep infinity                        # create container without net
ctn_ns_path=$(docker inspect --format="{{ .NetworkSettings.SandboxKey}}" demo)    # get NS for container
ctn_ns=${ctn_ns_path##*/}
ln -sf $ctn_ns_path /var/run/netns/$ctn_ns                                        # make the container NS visible to "ip netns"
ip link add dev veth1 mtu 1450 type veth peer name veth2 mtu 1450                 # create veth
ip link set dev veth1 netns overns                                                # send veth1 to overlay NS
ip netns exec overns ip link set veth1 master br42                                # attach it to overlay bridge
ip netns exec overns ip link set veth1 up
ip link set dev veth2 netns $ctn_ns                                               # send veth2 to container
ip netns exec $ctn_ns ip link set dev veth2 name eth0 address 02:42:c0:a8:00:10   # rename & configure
ip netns exec $ctn_ns ip addr add dev eth0 192.168.0.10/24
ip netns exec $ctn_ns ip link set dev eth0 up
docker1
Same with 192.168.0.20 / 02:42:c0:a8:00:20
(plumb script)
31. Does it ping?
docker0:~$ docker exec -it demo ping 192.168.0.20
PING 192.168.0.20 (192.168.0.20): 56 data bytes
92 bytes from 192.168.0.10: Destination Host Unreachable
docker0:~$ sudo ip netns exec overns ip neighbor show
docker0:~$ sudo ip netns exec overns ip neighbor add 192.168.0.20 lladdr 02:42:c0:a8:00:20 dev vxlan42
docker0:~$ sudo ip netns exec overns bridge fdb add 02:42:c0:a8:00:20 dev vxlan42 self dst 10.0.0.11 vni 42 port 4789
docker1: Same with 192.168.0.10, 02:42:c0:a8:00:10 and 10.0.0.10
34. Catching network events: NETLINK
• Kernel interface for communication between Kernel and userspace
• Designed to transfer networking info (used by iproute2)
• Several protocols
– NETLINK_ROUTE
– NETLINK_FIREWALL
• Several notification types, for NETLINK_ROUTE for instance:
– LINK
– NEIGHBOR
• Many events
– LINK: NEWLINK, GETLINK
– NEIGHBOR: GETNEIGH <= information on ARP, L2 discovery queries
35. Using ip monitor
docker0:~$ ip monitor link
docker0:~$ sudo ip link add dev veth1 type veth peer name veth2
32: veth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
link/ether b6:95:d6:b4:21:e9 brd ff:ff:ff:ff:ff:ff
33: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
link/ether a6:e0:7a:da:a9:ea brd ff:ff:ff:ff:ff:ff
docker0:~$ ip monitor route
docker0:~$ sudo ip route add 8.8.8.8 via 10.0.0.1
8.8.8.8 via 10.0.0.1 dev eth0
36. What about neighbor events?
docker0:~$ echo 1 | sudo tee -a /proc/sys/net/ipv4/neigh/eth0/app_solicit
docker0:~$ ip monitor neigh
docker0:~$ ping 10.0.0.100
10.0.0.100 dev eth0 FAILED
app_solicit: generate a Netlink message on L2/L3 miss
37. With containers
docker0:~$ sudo ip netns del overns
docker0:~$ sudo ./setup_vxlan 42 overns proxy l2miss l3miss dstport 4789
docker0:~$ sudo ./plumb br42@overns demo 192.168.0.10/24 02:42:c0:a8:00:10
docker0:~$ docker exec demo ip monitor neigh
docker0:~$ docker exec demo ping 192.168.0.20
192.168.0.20 dev eth0 FAILED
Retest from the overns namespace:
docker0:~$ sudo ip netns exec overns ip monitor neigh
miss 192.168.0.20 dev vxlan42 STALE
Add the ARP entry:
miss dev vxlan42 lladdr 02:42:c0:a8:00:20 STALE
Add the FDB entry => ping OK
l2miss/l3miss generate Netlink messages from the VXLAN interface
38. Using Netlink & Consul to dynamically find containers
• Listen to Netlink events in the overns namespace
• Only act on GETNEIGH events
• If l3miss: look up the ARP entry in Consul and add the neighbor info
• If l2miss: look up the MAC location in Consul and add the FDB info
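The actual implementation is python/arpd-consul.py (in the github repo), which uses pyroute2 to receive GETNEIGH messages. As a rough shell approximation of the same logic, with hypothetical Consul keys (demo/arp/<ip> and demo/mac/<mac>) standing in for the real schema:
ip netns exec overns ip monitor neigh | while read -r line; do
  case "$line" in
    "miss dev vxlan42 lladdr "*)            # l2miss: we have the MAC, need the VTEP (host) IP
      mac=$(echo "$line" | awk '{print $5}')
      host=$(curl -s "http://consul:8500/v1/kv/demo/mac/$mac?raw")
      [ -n "$host" ] && ip netns exec overns bridge fdb add "$mac" dev vxlan42 self dst "$host" vni 42 port 4789
      ;;
    "miss "*" dev vxlan42"*)                # l3miss: we have the IP, need the MAC
      ip=$(echo "$line" | awk '{print $2}')
      mac=$(curl -s "http://consul:8500/v1/kv/demo/arp/$ip?raw")
      [ -n "$mac" ] && ip netns exec overns ip neighbor replace "$ip" lladdr "$mac" dev vxlan42 nud reachable
      ;;
  esac
done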
39. Let's try!
Clean up
docker0:~$ sudo ip netns del overns
docker0:~$ sudo ./setup_vxlan 42 overns proxy l2miss l3miss dstport 4789
docker0:~$ sudo ./plumb br42@overns demo 192.168.0.10/24 02:42:c0:a8:00:10
Add data to consul
Test
docker0:~$ sudo python/arpd-consul.py
docker0:~$ docker exec -it demo ping 192.168.0.20
INFO Starting new HTTP connection (1): consul1
INFO L3Miss on vxlan42: Who has IP: 192.168.0.20?
INFO Populating ARP table from Consul: IP 192.168.0.20 is 02:42:c0:a8:00:20
INFO L2Miss on vxlan42: Who has Mac Address: 02:42:c0:a8:00:20?
INFO Populating FIB table from Consul: MAC 02:42:c0:a8:00:20 is on host 10.0.0.11
41. Thank you! Questions?
• Commands / code on github
https://github.com/lbernail/dockercon2017
• Recorded at Dockercon Austin (a few improvements today)
• Detailed blog post
http://techblog.d2-si.eu/2017/04/25/deep-dive-into-docker-overlay-networks-part-1.html
• Do not hesitate to ping me on twitter
@lbernail