This document discusses service discovery and load balancing in Kubernetes. It begins by defining service discovery and explaining why it is important. It then demonstrates how Kubernetes implements service discovery using Deployments, Services, and Endpoints. It explains how kube-proxy performs load balancing using different modes like iptables and IPVS. It also covers topics like hairpin traffic, persistence, and alternatives to kube-proxy. Overall, the document provides an in-depth look at how service discovery and load balancing work under the hood in Kubernetes.
4. Service Discovery
“Service discovery is the automatic detection of devices and services offered by these devices on a computer network”
https://en.wikipedia.org/wiki/Service_discovery
Why has this topic become so important?
5. Service discovery in Kubernetes
Creating a deployment and a service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echodeploy
  labels:
    app: echo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - name: echopod
        image: lbernail/echo:0.5
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  labels:
    app: echo
spec:
  type: ClusterIP
  selector:
    app: echo
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 5000
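A quick way to try these manifests, assuming both are saved together as echo.yaml (the file name is illustrative):

$ kubectl apply -f echo.yaml
deployment.apps/echodeploy created
service/echo created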
6. Created Kubernetes objects
[Diagram: the Deployment creates a ReplicaSet, which creates three pods labeled app=echo; the Service selects pods with the label app=echo.]
kubectl get all
NAME                AGE
deploy/echodeploy   16s

NAME                       AGE
rs/echodeploy-75dddcf5f6   16s

NAME                             READY
po/echodeploy-75dddcf5f6-jtjts   1/1
po/echodeploy-75dddcf5f6-r7nmk   1/1
po/echodeploy-75dddcf5f6-zvqhv   1/1

NAME       TYPE        CLUSTER-IP
svc/echo   ClusterIP   10.200.246.139
7. The Endpoints object
[Diagram: the Service (selector app=echo) is backed by an Endpoints object listing the addresses of the three pods: 10.150.4.10, 10.150.6.16, 10.150.7.10.]
kubectl describe endpoints echo
Name:         echo
Namespace:    datadog
Labels:       app=echo
Annotations:  <none>
Subsets:
  Addresses:          10.150.4.10,10.150.6.16,10.150.7.10
  NotReadyAddresses:  <none>
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  5000  TCP
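The same data can be read in raw form; an illustrative excerpt matching the addresses above (metadata trimmed):

$ kubectl get endpoints echo -o yaml
...
subsets:
- addresses:
  - ip: 10.150.4.10
  - ip: 10.150.6.16
  - ip: 10.150.7.10
  ports:
  - name: http
    port: 5000
    protocol: TCP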
8. Pod readiness
● A pod can be started but not ready to serve requests
○ Initialization
○ Connection to backends
● Kubernetes provides an abstraction for this: Readiness Probes

readinessProbe:
  httpGet:
    path: /ready
    port: 5000
  periodSeconds: 2
  successThreshold: 2
  failureThreshold: 2
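When a pod fails its readiness probe, the endpoint controller moves its IP out of the ready set. A sketch of what this looks like, assuming the second pod above started failing /ready (addresses illustrative):

$ kubectl describe endpoints echo
Name:         echo
...
  Addresses:          10.150.4.10,10.150.7.10
  NotReadyAddresses:  10.150.6.16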
11. How does this all work?
[Diagram: on each node, the kubelet runs health checks (HC) against its local pods and sends status updates to the API server, which stores pod state in etcd.]
12. How does this all work?
[Diagram: the controller manager watches pods and services through the API server; its endpoint controller syncs endpoints by listing the pods matching each service selector and adding their IPs to the Endpoints object, stored in etcd alongside pods and services.]
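A minimal way to observe the endpoint controller at work (assuming the echo objects above): watch the Endpoints object while scaling the deployment; the new pod's IP appears as soon as the pod becomes ready.

$ kubectl get endpoints echo --watch &
$ kubectl scale deployment echodeploy --replicas=4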
14. Load-balancing solutions
DNS Round Robin
● Service has a DNS record with one entry per endpoint
● Many clients will only use the first IP
● Many clients will perform resolution only at startup
Virtual IP + IP-based load-balancing
● Service has a single VIP
● Traffic sent to this VIP is load-balanced to the endpoint IPs
=> Requires a “process” to perform and configure this load-balancing
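Kubernetes exposes both options: a headless service (clusterIP: None) gives DNS Round Robin with one A record per endpoint, while the default ClusterIP service gives a single VIP. An illustrative comparison, assuming a hypothetical headless variant named echo-headless alongside the echo service from earlier:

$ dig +short echo.datadog.svc.cluster.local
10.200.246.139
$ dig +short echo-headless.datadog.svc.cluster.local
10.150.4.10
10.150.6.16
10.150.7.10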
15. Load-balancing in Kubernetes
[Diagram: as before, the endpoint controller in the controller manager watches pods and services and syncs the Endpoints objects in etcd; in addition, kube-proxy on each node watches services and endpoints through the API server and configures a local proxier.]
16. Load-balancing in Kubernetes
[Diagram: a client on a node sends traffic to the service VIP; the proxier programmed by kube-proxy load-balances it to pod 1 on Node B or pod 2 on Node C.]
17. Kube-proxy modes
● userspace
Original implementation
Userland TCP/UDP proxy
● iptables
Default since Kubernetes 1.2
Uses iptables to load-balance traffic
Faster than userspace
● ipvs
Uses kernel load-balancing
Still relies on iptables for some NAT rules
Faster than iptables, scales better with a large number of services/endpoints
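The mode is selected when kube-proxy starts; a minimal sketch (the flag and its values are real, the rest of the command line is elided):

$ kube-proxy --proxy-mode=iptables ...
$ kube-proxy --proxy-mode=ipvs ...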
19. iptables overview
[Diagram: kube-proxy on Node A programs local iptables rules; a client on Node A reaches pod 1 on Node B or pod 2 on Node C through them.]
Outgoing traffic
1. Client to Service IP
2. DNAT: Client to Pod1 IP
Reverse path
1. Pod1 IP to Client
2. Reverse NAT: Service IP to Client
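There is no explicit rule for the reverse path: the kernel's conntrack table remembers the DNAT and un-NATs replies automatically. One way to observe it, assuming the conntrack tool is installed (entry trimmed; the client IP 10.150.9.2 is hypothetical):

$ sudo conntrack -L -d 10.200.246.139
tcp ... ESTABLISHED src=10.150.9.2 dst=10.200.246.139 ... src=10.150.4.10 dst=10.150.9.2 ...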
21. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
Global service chain: identifies the service and jumps to the appropriate service chain
22. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
Service chain (one per service)
Uses the statistic iptables module (probability of a rule being applied)
Rules are evaluated sequentially, hence the 33%, 50%, 100%: each of the three endpoints ends up with a 1/3 chance overall (1/3, then 2/3 × 1/2, then 2/3 × 1/2 × 1)
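What kube-proxy generates looks like this in iptables-save form (illustrative excerpt; real chain suffixes are hashes, not AAA/BBB/CCC):

$ sudo iptables-save -t nat | grep KUBE-SVC-XXX
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-AAA
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BBB
-A KUBE-SVC-XXX -j KUBE-SEP-CCC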
23. proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
Endpoint chain (one per endpoint)
Marks hairpin traffic (client = target) for SNAT
DNATs to the endpoint
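Again in iptables-save form (illustrative, reusing the first pod IP from earlier):

$ sudo iptables-save -t nat | grep KUBE-SEP-AAA
-A KUBE-SEP-AAA -s 10.150.4.10/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-AAA -p tcp -m tcp -j DNAT --to-destination 10.150.4.10:5000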
24. Edge case: Hairpin traffic
[Diagram: pod 1 on Node A sends traffic to the service VIP and is itself selected as the backend, so the traffic loops back to pod 1.]
The client can also be a destination. After DNAT: Src IP = Pod1, Dst IP = Pod1, so no reverse NAT is possible => SNAT on the host for this traffic.
Outgoing traffic
1. Pod1 IP => SVC IP
2. SNAT: Host IP => SVC IP
3. DNAT: Host IP => Pod1 IP
Reverse path
1. Pod1 IP => Host IP
2. Reverse NAT: SVC IP => Pod1 IP
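The SNAT is driven by the mark set in the endpoint chain; these two rules show the mechanism (illustrative excerpt, but the shape matches what kube-proxy installs):

$ sudo iptables-save -t nat | grep -e KUBE-MARK-MASQ -e KUBE-POSTROUTING
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE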
26. Persistence
Load-balancing rules
KUBE-SVC-XXX
any / any recent: rcheck set KUBE-SEP-AAA => KUBE-SEP-AAA
any / any recent: rcheck set KUBE-SEP-BBB => KUBE-SEP-BBB
any / any recent: rcheck set KUBE-SEP-CCC => KUBE-SEP-CCC
Uses the recent module: if the source IP is in the set named KUBE-SEP-AAA, jump to KUBE-SEP-AAA
Endpoint chain
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent: set rsource KUBE-SEP-AAA
Uses the recent module: adds the source IP to the set named KUBE-SEP-AAA
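These rules appear when a service asks for persistence through its sessionAffinity field; a minimal sketch of enabling it on the echo service:

$ kubectl patch service echo -p '{"spec":{"sessionAffinity":"ClientIP"}}'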
30. proxy-mode = ipvs
● L4 load-balancer built into the Linux kernel
● Many load-balancing algorithms
● Very fast
● Still relies on iptables for some use cases (SNAT in particular)
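In this mode kube-proxy programs the kernel's IPVS tables directly; they can be inspected with ipvsadm. Illustrative output, reusing the VIP and pod IPs from earlier (rr is the round-robin scheduler):

$ sudo ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port    Forward Weight ActiveConn InActConn
TCP  10.200.246.139:80 rr
  -> 10.150.4.10:5000      Masq    1      0          0
  -> 10.150.6.16:5000      Masq    1      0          0
  -> 10.150.7.10:5000      Masq    1      0          0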
32. IPVS Hairpin traffic
$ sudo iptables -t nat -L KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
target      prot opt source    destination
MASQUERADE  all  --  anywhere  anywhere     mark match 0x4000/0x4000
MASQUERADE  all  --  anywhere  anywhere     match-set KUBE-LOOP-BACK dst,dst,src

$ sudo ipset -L KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Members:
10.1.243.2,tcp:5000,10.1.243.2
10.1.242.2,tcp:5000,10.1.242.2

Same approach as iptables mode, but using an ipset: when src and dst are both an endpoint IP => SNAT
IP sets are much faster than long lists of simple iptables rules
34. IPVS status
Not considered stable yet
Much better performance
● No chain traversal: faster DNAT
● No full reload to add an endpoint/service: much faster updates
● See “Scale Kubernetes to support 50,000 services”, Haibin Michael Xie (LinuxCon China)
Definitely the future of kube-proxy
35. Alternatives to kube-proxy
Kube-router
● https://github.com/cloudnativelabs/kube-router
● Pod networking with BGP
● Network policies
● IPVS-based service proxy
Cilium
● Relies on eBPF to implement service proxying
● Implements security policies with eBPF
● Really promising
Other
● Very dynamic area, expect to see other solutions
36. What about DNS?
[Diagram: a DNS client pod on Node A resolves names through the DNS service VIP, which kube-proxy programs into iptables like any other service; the DNS pods run on Nodes B and C.]
DNS is just another Kube service
The DNS pods get DNS info from the API server
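Since the DNS server is an ordinary service, it shows up like any other; a hedged example assuming a kube-dns deployment (values illustrative):

$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.200.0.10   <none>        53/UDP,53/TCP   90d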
37. Access services from outside kube
● Run kube-proxy on an external VM
● Requires routable pod IPs
● DNS
38. Access services from outside kube
[Diagram: a client on an external VM runs kube-proxy, which watches the API server and programs local iptables rules; traffic to service VIPs is routed directly to the service pods on the cluster nodes.]
39. Access services from outside kube
[Diagram: same setup, plus a dnsmasq instance on the VM that forwards cluster DNS queries from the client to the DNS pods, so service names also resolve from outside the cluster.]
42. Key takeaways
Complicated under the hood
● Helps to know where to look when debugging complex setups
Service discovery
● Challenge: integrating with hosts outside of Kubernetes
Load-balancing
● L4 is still very dynamic (IPVS, eBPF)
● L7 is only getting started, expect to see a lot