You’ve played around with containers? You feel you can handle the adrenaline rush of publishing your containers to production? Well, hold on, because there are some aspects you need to consider before you start rushing to production. How will you handle auto-scaling? What about updates and upgrades? Downtime of your app? Version 1 versus version 2? CI/CD? And so on.
This session is about deploying your services in containers using the Azure Kubernetes Service (AKS) managed offering. You will learn what problems you might encounter during your deployment journey and how to handle them, and we will cover the main features of Kubernetes and how they can be of use to you.
Agenda
• AKS in Production
• Which ingress?
• Scaling issues
• Holy s*it moments
• Dropped dead, now what?
Running in production – General
• It’s not a simple click-and-run operation
• Size your nodes carefully.
• ALWAYS set resource constraints on pods (see the sketch below)
–If you don’t, bad things happen.
–Really, really bad things.
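A minimal sketch of what that looks like in practice, using an illustrative deployment name (myapp) and illustrative numbers:
  # Hypothetical example: set requests and limits on every container of an existing deployment
  kubectl set resources deployment myapp \
    --requests=cpu=250m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi
Without limits, a single runaway pod can starve every other workload on the node.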
Running in production – Monitoring
• Never monitor only the cluster or inside the cluster
• Validate everything from outside as well
• When in doubt, monitor and alert on everything… for now
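A minimal sketch of an outside-in check, run from somewhere that is not the cluster (the endpoint and path are illustrative):
  # Hypothetical synthetic probe, e.g. from a cron job or pipeline agent outside the cluster
  curl -fsS --max-time 10 https://myapp.example.com/healthz || echo "ALERT: app unreachable from outside"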
Running in Production – CI/CD
• kubectl is not the tool of choice for this
–Use Helm charts – not perfect, but getting there
• Use whatever CI/CD tool you want, as long as you version everything
• Never use the :latest tag for the container(s)
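A minimal sketch of a release step under those rules, assuming a hypothetical chart in ./charts/myapp and an image tag produced by the build:
  # Hypothetical CI/CD step: every deployment pins an explicit, versioned image tag – never :latest
  helm upgrade --install myapp ./charts/myapp \
    --namespace production \
    --set image.tag=1.4.2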
N number of ingresses, which one?
• Nginx vs HAProxy vs Traefik vs …
• You can use multiple ingress controllers in a cluster
• The most mature products are Nginx and HAProxy
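Running several controllers side by side works because each Ingress resource picks its controller via the ingress class; a minimal sketch (the Ingress name is illustrative):
  # Hypothetical: route this Ingress through the Nginx controller; another Ingress can target Traefik instead
  kubectl annotate ingress myapp-ingress kubernetes.io/ingress.class=nginx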
Nginx Ingress
• Most common ingress controller
• Easy to install, boring to configure
• Stable and reliable
• But
–No dynamic service discovery
–Lacks a status page and a monitoring page
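A minimal install sketch, assuming Helm 2 and the stable chart repository that was current at the time:
  # Hypothetical: install the Nginx ingress controller with two replicas into its own namespace
  helm install stable/nginx-ingress --name nginx-ingress \
    --namespace ingress-nginx --set controller.replicaCount=2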
HAProxy
• Extremely stable and fast
• Fast as in… fast
–Can handle 100k+ connections
–Can saturate high-speed NICs (40 Gbps+)
• But
–Pure load balancer
Traefik
• Newcomer, still fresh
• Supports dynamic configurations
• Has service discovery
• Lots of features and more to come
• Comes with a dashboard for monitoring
• Out-of-the-box Let’s Encrypt integration
• But
– Doesn’t support hitless reloads
– Doesn’t support TCP – only HTTP(S)
HTTP Application Routing
• Backed by Nginx
• External-DNS is set up automatically
• Great for dev/test or the Azure Dev Spaces feature
• But
– A bit complicated to configure
– Crashes are consistent when they happen
• Personal recommendation…
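For reference, the add-on is switched on per cluster; a minimal sketch with illustrative resource names:
  # Hypothetical: enable HTTP application routing on an existing AKS cluster
  az aks enable-addons --resource-group myResourceGroup --name myAKSCluster \
    --addons http_application_routing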
Why not? #Repeat
• Its primary purpose is dev/test or small applications
• Hard to manage
• Hard to debug / not worth it / YAML export to redeploy
App is running hot, where’s scaling?
• Auto-Scaling depends a lot on metrics
–No metrics, no scaling
• Pods require resource constraints for efficient scaling
• Node auto-scaling is a preview feature in Azure
• Cluster Autoscaler is dependent on HPA
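A minimal sketch of both halves, assuming a deployment named myapp that already has CPU requests set, plus illustrative Azure resource names:
  # Hypothetical HPA: keep myapp between 2 and 10 replicas, targeting ~70% CPU
  kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
  # Hypothetical node auto-scaling (still a preview feature at the time of this talk)
  az aks update --resource-group myResourceGroup --name myAKSCluster \
    --enable-cluster-autoscaler --min-count 3 --max-count 10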
What comes up, must come down
• Sometimes HPA doesn’t do its job of scaling pods back down
• Monitor your stuff or “bad things happen” – CFO 2019™
• Ever done a load test on 100 VMs and forgotten to delete them? Cluster Autoscaler and HPA don’t solve that problem.
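A quick sketch of keeping an eye on scale-down behaviour (the HPA name is illustrative):
  # Watch desired vs. current replicas; if the count never drops after the load is gone, investigate
  kubectl get hpa --all-namespaces
  # The events explain why scale-down is delayed or blocked
  kubectl describe hpa myapp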
What comes up, must come down
• Cluster Autoscaler will not evict nodes (a.k.a. delete stuff) if
–Pods cannot be moved because of node selector / affinity rules
–Pods have local storage – not talking about PVs
• Yes, I have seen this in prod.
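One escape hatch worth knowing about is the Cluster Autoscaler safe-to-evict annotation; a minimal sketch (the pod name is illustrative, and normally you would set this on the pod template rather than a single pod):
  # Hypothetical: tell Cluster Autoscaler it may evict this pod even though it uses local storage
  kubectl annotate pod mypod cluster-autoscaler.kubernetes.io/safe-to-evict=true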
“Our systems look fine”
• Terminating, pending and evicted pods are clear signals of a potential disaster
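A one-liner sketch for surfacing those signals across the whole cluster:
  # Anything that is Pending, Terminating, Evicted or crash-looping shows up here
  kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'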
“Our systems look fine”
• There are times when everything looks fine but the app is dead
–Logging shows pods up
–Metrics show “some traffic”
–Critical K8s systems are fine
• RCA? Dev error / problem – not my monkey, not my circus
Holy…s*it.
• Dev teams didn’t push code. Nobody tampered with the cluster
• Preliminary RCA – pod-to-pod communication was down. Solution?
– Restart kube-proxy
– Restart kube-dns
– Restart something!
Holy…s*it.
Restarting certain system pods can fix the problem
• First, see what is running: kubectl -n kube-system get pod,svc,ep,deploy,ds -o wide
– kubectl delete po -l component=kube-proxy -n kube-system
– kubectl delete po -l component=kube-svc-redirect -n kube-system
– kubectl delete po -l component=tunnel -n kube-system
– kubectl delete po -l k8s-app=kube-dns -n kube-system
Mayday, It’s dead.
• When nothing works anymore, it’s time to reboot the cluster nodes
• It’s not a best practice, but it’s a must-do to regain operations
Mayday, It’s dead.
• You have two options
–The brutal way
–The “nice” way
• https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b
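The “nice” way generally means taking nodes out of rotation one at a time before rebooting them; a minimal sketch with an illustrative node name:
  # Hypothetical graceful reboot of a single node
  kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data
  # ...reboot the underlying VM and wait for it to report Ready again...
  kubectl uncordon aks-nodepool1-12345678-0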
How to perform RCAs? – Seriously
• After the whole system has recovered, it’s time to roll up your sleeves
–You need to SSH into each node and gather some logs.
• Gather the following logs:
–/var/log/azure-vnet*
–/var/run/azure-vnet*
• Run the following command:
–journalctl -u kubelet* --no-pager --since "2019-06-06 08:00:00" --until "2019-06-06 22:45:00" > nodenumber.log
How to perform RCAs?
• Tools to use:
– Transfer.sh for moving files off the nodes
– CMTrace – System Center tools
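A sketch of getting a collected log off a node with transfer.sh, assuming the node has outbound internet access:
  # Upload the kubelet log gathered earlier and get back a temporary download URL
  curl --upload-file ./nodenumber.log https://transfer.sh/nodenumber.log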
Some best practices
• Most issues can be solved with proper resource management
• Patch management is “still” required
–Use a tool like Kured to apply patches
• Monitoring the cluster and apps is a must
–Use Prometheus / Grafana and Log Analytics for enhanced monitoring
• If the cluster needs drivers – never, ever use the :latest tag
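Log Analytics (Azure Monitor for containers) can be switched on per cluster; a minimal sketch with illustrative resource names:
  # Hypothetical: enable the monitoring add-on on an existing AKS cluster
  az aks enable-addons --resource-group myResourceGroup --name myAKSCluster --addons monitoring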
Resources
• Kured - https://docs.microsoft.com/bs-cyrl-ba/azure/aks/node-updates-kured
• SSH into AKS nodes - https://docs.microsoft.com/en-us/azure/aks/ssh
• Ingress Controllers - https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
• Ingress comparison - https://kubedex.com/ingress/
• AKS Reboot Gracefully - https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b