This document discusses service discovery and load balancing in Kubernetes. It begins by defining service discovery and explaining why it is important. It then demonstrates how Kubernetes implements service discovery using Deployments, Services, and Endpoints. It explains how kube-proxy performs load balancing using different modes like iptables and IPVS. It also covers topics like hairpin traffic, persistence, and alternatives to kube-proxy. Overall, the document provides an in-depth look at how service discovery and load balancing work under the hood in Kubernetes.
4. Service Discovery
“Service discovery is the automatic detection of devices and services offered by these devices on a computer network”
https://en.wikipedia.org/wiki/Service_discovery
Why has this topic become so important?
5. Service discovery in Kubernetes
Creating a deployment and a service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echodeploy
  labels:
    app: echo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - name: echopod
        image: lbernail/echo:0.5
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  labels:
    app: echo
spec:
  type: ClusterIP
  selector:
    app: echo
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 5000
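A quick way to try these manifests, assuming both are saved together as echo.yaml (the file name is illustrative):

$ kubectl apply -f echo.yaml
deployment.apps/echodeploy created
service/echo created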
6. Created Kubernetes objects
[Diagram: the Deployment creates a ReplicaSet, which creates three pods labeled app=echo; the Service selects pods with the label app=echo.]
kubectl get all
NAME                AGE
deploy/echodeploy   16s

NAME                       AGE
rs/echodeploy-75dddcf5f6   16s

NAME                             READY
po/echodeploy-75dddcf5f6-jtjts   1/1
po/echodeploy-75dddcf5f6-r7nmk   1/1
po/echodeploy-75dddcf5f6-zvqhv   1/1

NAME       TYPE        CLUSTER-IP
svc/echo   ClusterIP   10.200.246.139
7. The Endpoints object
[Diagram: the Service (selector app=echo) is backed by an Endpoints object listing the addresses of the three pods: 10.150.4.10, 10.150.6.16, 10.150.7.10.]
kubectl describe endpoints echo
Name:         echo
Namespace:    datadog
Labels:       app=echo
Annotations:  <none>
Subsets:
  Addresses:          10.150.4.10,10.150.6.16,10.150.7.10
  NotReadyAddresses:  <none>
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  5000  TCP
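The same data can be read in raw form; an illustrative excerpt matching the addresses above (metadata trimmed):

$ kubectl get endpoints echo -o yaml
...
subsets:
- addresses:
  - ip: 10.150.4.10
  - ip: 10.150.6.16
  - ip: 10.150.7.10
  ports:
  - name: http
    port: 5000
    protocol: TCP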
8. Pod readiness
● A pod can be started but not ready to serve requests
○ Initialization
○ Connection to backends
● Kubernetes provides an abstraction for this: Readiness Probes

readinessProbe:
  httpGet:
    path: /ready
    port: 5000
  periodSeconds: 2
  successThreshold: 2
  failureThreshold: 2
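When a pod fails its readiness probe, the endpoint controller moves its IP out of the ready set. A sketch of what this looks like, assuming the second pod above started failing /ready (addresses illustrative):

$ kubectl describe endpoints echo
Name:         echo
...
  Addresses:          10.150.4.10,10.150.7.10
  NotReadyAddresses:  10.150.6.16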
11. How does this all work?
[Diagram: on each node, the kubelet runs health checks (HC) against its local pods and sends status updates to the API server, which stores pod state in etcd.]
12. How does this all work?
[Diagram: the controller manager watches pods and services through the API server; its endpoint controller syncs endpoints by listing the pods matching each service selector and adding their IPs to the Endpoints object, stored in etcd alongside pods and services.]
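A minimal way to observe the endpoint controller at work (assuming the echo objects above): watch the Endpoints object while scaling the deployment; the new pod's IP appears as soon as the pod becomes ready.

$ kubectl get endpoints echo --watch &
$ kubectl scale deployment echodeploy --replicas=4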
14. Load-balancing solutions
DNS Round Robin
● Service has a DNS record with one entry per endpoint
● Many clients will only use the first IP
● Many clients will perform resolution only at startup
Virtual IP + IP-based load-balancing
● Service has a single VIP
● Traffic sent to this VIP is load-balanced to the endpoint IPs
=> Requires a “process” to perform and configure this load-balancing
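Kubernetes exposes both options: a headless service (clusterIP: None) gives DNS Round Robin with one A record per endpoint, while the default ClusterIP service gives a single VIP. An illustrative comparison, assuming a hypothetical headless variant named echo-headless alongside the echo service from earlier:

$ dig +short echo.datadog.svc.cluster.local
10.200.246.139
$ dig +short echo-headless.datadog.svc.cluster.local
10.150.4.10
10.150.6.16
10.150.7.10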
15. Load-balancing in Kubernetes
[Diagram: as before, the endpoint controller in the controller manager watches pods and services and syncs the Endpoints objects in etcd; in addition, kube-proxy on each node watches services and endpoints through the API server and configures a local proxier.]
16. Load-balancing in Kubernetes
[Diagram: a client on a node sends traffic to the service VIP; the proxier programmed by kube-proxy load-balances it to pod 1 on Node B or pod 2 on Node C.]
17. Kube-proxy modes
● userspace
Original implementation
Userland TCP/UDP proxy
● iptables
Default since Kubernetes 1.2
Uses iptables to load-balance traffic
Faster than userspace
● ipvs
Uses kernel load-balancing
Still relies on iptables for some NAT rules
Faster than iptables, scales better with a large number of services/endpoints
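The mode is selected when kube-proxy starts; a minimal sketch (the flag and its values are real, the rest of the command line is elided):

$ kube-proxy --proxy-mode=iptables ...
$ kube-proxy --proxy-mode=ipvs ...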
19. iptables overview
[Diagram: kube-proxy on Node A programs local iptables rules; a client on Node A reaches pod 1 on Node B or pod 2 on Node C through them.]
Outgoing traffic
1. Client to Service IP
2. DNAT: Client to Pod1 IP
Reverse path
1. Pod1 IP to Client
2. Reverse NAT: Service IP to Client
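There is no explicit rule for the reverse path: the kernel's conntrack table remembers the DNAT and un-NATs replies automatically. One way to observe it, assuming the conntrack tool is installed (entry trimmed; the client IP 10.150.9.2 is hypothetical):

$ sudo conntrack -L -d 10.200.246.139
tcp ... ESTABLISHED src=10.150.9.2 dst=10.200.246.139 ... src=10.150.4.10 dst=10.150.9.2 ...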
21. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
Global service chain: identifies the service and jumps to the appropriate service chain
22. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
Service chain (one per service)
Uses the statistic iptables module (probability of a rule being applied)
Rules are evaluated sequentially, hence the 33%, 50%, 100%: each of the three endpoints ends up with a 1/3 chance overall (1/3, then 2/3 × 1/2, then 2/3 × 1/2 × 1)
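What kube-proxy generates looks like this in iptables-save form (illustrative excerpt; real chain suffixes are hashes, not AAA/BBB/CCC):

$ sudo iptables-save -t nat | grep KUBE-SVC-XXX
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-AAA
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BBB
-A KUBE-SVC-XXX -j KUBE-SEP-CCC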
23. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
Endpoint chain (one per endpoint)
Marks hairpin traffic (client = target) for SNAT
DNATs to the endpoint
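Again in iptables-save form (illustrative, reusing the first pod IP from earlier):

$ sudo iptables-save -t nat | grep KUBE-SEP-AAA
-A KUBE-SEP-AAA -s 10.150.4.10/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-AAA -p tcp -m tcp -j DNAT --to-destination 10.150.4.10:5000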
24. Edge case: Hairpin traffic
[Diagram: pod 1 on Node A sends traffic to the service VIP and is itself selected as the backend, so the traffic loops back to pod 1.]
The client can also be a destination. After DNAT: Src IP = Pod1, Dst IP = Pod1, so no reverse NAT is possible => SNAT on the host for this traffic.
Outgoing traffic
1. Pod1 IP => SVC IP
2. SNAT: Host IP => SVC IP
3. DNAT: Host IP => Pod1 IP
Reverse path
1. Pod1 IP => Host IP
2. Reverse NAT: SVC IP => Pod1 IP
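The SNAT is driven by the mark set in the endpoint chain; these two rules show the mechanism (illustrative excerpt, but the shape matches what kube-proxy installs):

$ sudo iptables-save -t nat | grep -e KUBE-MARK-MASQ -e KUBE-POSTROUTING
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE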
26. Persistence
Load-balancing rules
KUBE-SVC-XXX
any / any recent: rcheck set KUBE-SEP-AAA => KUBE-SEP-AAA
any / any recent: rcheck set KUBE-SEP-BBB => KUBE-SEP-BBB
any / any recent: rcheck set KUBE-SEP-CCC => KUBE-SEP-CCC
Uses the recent module: if the source IP is in the set named KUBE-SEP-AAA, jump to KUBE-SEP-AAA
Endpoint chain
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent: set rsource KUBE-SEP-AAA
Uses the recent module: adds the source IP to the set named KUBE-SEP-AAA
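These rules appear when a service asks for persistence through its sessionAffinity field; a minimal sketch of enabling it on the echo service:

$ kubectl patch service echo -p '{"spec":{"sessionAffinity":"ClientIP"}}'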
30. proxy-mode = ipvs
● L4 load-balancer built into the Linux kernel
● Many load-balancing algorithms
● Very fast
● Still relies on iptables for some use cases (SNAT in particular)
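In this mode kube-proxy programs the kernel's IPVS tables directly; they can be inspected with ipvsadm. Illustrative output, reusing the VIP and pod IPs from earlier (rr is the round-robin scheduler):

$ sudo ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port    Forward Weight ActiveConn InActConn
TCP  10.200.246.139:80 rr
  -> 10.150.4.10:5000      Masq    1      0          0
  -> 10.150.6.16:5000      Masq    1      0          0
  -> 10.150.7.10:5000      Masq    1      0          0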
32. IPVS Hairpin traffic
$ sudo iptables -t nat -L KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
target      prot opt source    destination
MASQUERADE  all  --  anywhere  anywhere     mark match 0x4000/0x4000
MASQUERADE  all  --  anywhere  anywhere     match-set KUBE-LOOP-BACK dst,dst,src

$ sudo ipset -L KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Members:
10.1.243.2,tcp:5000,10.1.243.2
10.1.242.2,tcp:5000,10.1.242.2

Same approach as iptables mode, but using an ipset: when src and dst are both an endpoint IP => SNAT
IP sets are much faster than long lists of simple iptables rules
34. IPVS status
Not considered stable yet
Much better performance
● No chain traversal: faster DNAT
● No full reload to add an endpoint/service: much faster updates
● See “Scale Kubernetes to support 50,000 services”, Haibin Michael Xie (LinuxCon China)
Definitely the future of kube-proxy
35. Alternatives to kube-proxy
Kube-router
● https://github.com/cloudnativelabs/kube-router
● Pod networking with BGP
● Network policies
● IPVS-based service proxy
Cilium
● Relies on eBPF to implement service proxying
● Implements security policies with eBPF
● Really promising
Other
● Very dynamic area, expect to see other solutions
36. What about DNS?
[Diagram: a DNS client pod on Node A resolves names through the DNS service VIP, which kube-proxy programs into iptables like any other service; the DNS pods run on Nodes B and C.]
DNS is just another Kube service
The DNS pods get DNS info from the API server
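Since the DNS server is an ordinary service, it shows up like any other; a hedged example assuming a kube-dns deployment (values illustrative):

$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.200.0.10   <none>        53/UDP,53/TCP   90d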
37. Access services from outside kube
● Run kube-proxy on an external VM
● Requires routable pod IPs
● DNS
38. Access services from outside kube
[Diagram: a client on an external VM runs kube-proxy, which watches the API server and programs local iptables rules; traffic to service VIPs is routed directly to the service pods on the cluster nodes.]
39. Access services from outside kube
[Diagram: same setup, plus a dnsmasq instance on the VM that forwards cluster DNS queries from the client to the DNS pods, so service names also resolve from outside the cluster.]
42. Key takeaways
Complicated under the hood
● Helps to know where to look when debugging complex setups
Service discovery
● Challenge: integrating with hosts outside of Kubernetes
Load-balancing
● L4 is still very dynamic (IPVS, eBPF)
● L7 is only getting started, expect to see a lot