You’ve played around with containers? You feel you can handle the adrenaline rush of publishing your containers to production? Well, hold on, because there are some aspects you need to consider before you start rushing to production. How will you handle auto-scaling? What about updates and upgrades? Downtime of your app? Version 1 versus version 2? CI/CD? And so on.
This session is about deploying your services in containers using the Azure Kubernetes Service (AKS) managed offering. You will learn what problems you might encounter during your deployment journey and how to handle them, and we will cover the main features of Kubernetes and how they can be of use to you.
Agenda
• AKS in Production
• Which ingress?
• Scaling issues
• Holy s*it moments
• Dropped dead, now what?
Running in production – General
• It’s not a simple click-and-run operation
• Size your nodes carefully.
• ALWAYS set resource constraints on pods (see the sketch below)
–If you don’t, bad things happen.
–Really, really bad things.
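A minimal sketch of what that looks like in practice, using an illustrative deployment name (myapp) and illustrative numbers:
  # Hypothetical example: set requests and limits on every container of an existing deployment
  kubectl set resources deployment myapp \
    --requests=cpu=250m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi
Without limits, a single runaway pod can starve every other workload on the node.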
Running in production – Monitoring
• Never monitor only the cluster or inside the cluster
• Validate everything from outside as well
• When in doubt, monitor and alert on everything… for now
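A minimal sketch of an outside-in check, run from somewhere that is not the cluster (the endpoint and path are illustrative):
  # Hypothetical synthetic probe, e.g. from a cron job or pipeline agent outside the cluster
  curl -fsS --max-time 10 https://myapp.example.com/healthz || echo "ALERT: app unreachable from outside"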
Running in Production – CI/CD
• kubectl is not the tool of choice for this
–Use Helm charts – not perfect, but getting there
• Use whatever CI/CD tool you want, as long as you version everything
• Never use the :latest tag for the container(s)
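A minimal sketch of a release step under those rules, assuming a hypothetical chart in ./charts/myapp and an image tag produced by the build:
  # Hypothetical CI/CD step: every deployment pins an explicit, versioned image tag – never :latest
  helm upgrade --install myapp ./charts/myapp \
    --namespace production \
    --set image.tag=1.4.2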
N number of ingresses, which one?
• Nginx vs HAProxy vs Traefik vs …
• You can use multiple ingress controllers in a cluster
• The most mature products are Nginx and HAProxy
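Running several controllers side by side works because each Ingress resource picks its controller via the ingress class; a minimal sketch (the Ingress name is illustrative):
  # Hypothetical: route this Ingress through the Nginx controller; another Ingress can target Traefik instead
  kubectl annotate ingress myapp-ingress kubernetes.io/ingress.class=nginx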
Nginx Ingress
• Most common ingress controller
• Easy to install, boring to configure
• Stable and reliable
• But
–No dynamic service discovery
–Lacks a status page and a monitoring page
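A minimal install sketch, assuming Helm 2 and the stable chart repository that was current at the time:
  # Hypothetical: install the Nginx ingress controller with two replicas into its own namespace
  helm install stable/nginx-ingress --name nginx-ingress \
    --namespace ingress-nginx --set controller.replicaCount=2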
HAProxy
• Extremely stable and fast
• Fast as in… fast
–Can handle 100k+ connections
–Can saturate high-speed NICs (40 Gbps+)
• But
–Pure load balancer
Traefik
• Newcomer, still fresh
• Supports dynamic configurations
• Has service discovery
• Lots of features and more to come
• Comes with a dashboard for monitoring
• Out-of-the-box Let’s Encrypt integration
• But
– Doesn’t support hitless reloads
– Doesn’t support TCP – only HTTP(S)
HTTP Application Routing
• Backed by Nginx
• External-DNS is set up automatically
• Great for dev/test or the Azure Dev Spaces feature
• But
– A bit complicated to configure
– Crashes are consistent when they happen
• Personal recommendation…
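For reference, the add-on is switched on per cluster; a minimal sketch with illustrative resource names:
  # Hypothetical: enable HTTP application routing on an existing AKS cluster
  az aks enable-addons --resource-group myResourceGroup --name myAKSCluster \
    --addons http_application_routing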
Why not? #Repeat
• Its primary purpose is dev/test or small applications
• Hard to manage
• Hard to debug / not worth it / YAML export to redeploy
App is running hot, where’s scaling?
• Auto-Scaling depends a lot on metrics
–No metrics, no scaling
• Pods require resource constraints for efficient scaling
• Node auto-scaling is a preview feature in Azure
• Cluster Autoscaler is dependent on HPA
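A minimal sketch of both halves, assuming a deployment named myapp that already has CPU requests set, plus illustrative Azure resource names:
  # Hypothetical HPA: keep myapp between 2 and 10 replicas, targeting ~70% CPU
  kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
  # Hypothetical node auto-scaling (still a preview feature at the time of this talk)
  az aks update --resource-group myResourceGroup --name myAKSCluster \
    --enable-cluster-autoscaler --min-count 3 --max-count 10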
What comes up, must come down
• Sometimes HPA doesn’t do its job of scaling pods back down
• Monitor your stuff or “bad things happen” – CFO 2019™
• Ever done a load test on 100 VMs and forgotten to delete them? Cluster Autoscaler and HPA don’t solve that problem.
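A quick sketch of keeping an eye on scale-down behaviour (the HPA name is illustrative):
  # Watch desired vs. current replicas; if the count never drops after the load is gone, investigate
  kubectl get hpa --all-namespaces
  # The events explain why scale-down is delayed or blocked
  kubectl describe hpa myapp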
What comes up, must come down
• Cluster Autoscaler will not evict nodes (a.k.a. delete stuff) if
–Pods cannot be moved because of node selector / affinity rules
–Pods have local storage – not talking about PVs
• Yes, I have seen this in prod.
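One escape hatch worth knowing about is the Cluster Autoscaler safe-to-evict annotation; a minimal sketch (the pod name is illustrative, and normally you would set this on the pod template rather than a single pod):
  # Hypothetical: tell Cluster Autoscaler it may evict this pod even though it uses local storage
  kubectl annotate pod mypod cluster-autoscaler.kubernetes.io/safe-to-evict=true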
“Our systems look fine”
• Terminating, pending and evicted pods are clear signals of a potential disaster
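A one-liner sketch for surfacing those signals across the whole cluster:
  # Anything that is Pending, Terminating, Evicted or crash-looping shows up here
  kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'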
“Our systems look fine”
• There are times when everything looks fine but the app is dead
–Logging shows pods up
–Metrics show “some traffic”
–Critical K8s systems are fine
• RCA? Dev error / problem – not my monkey, not my circus
Holy…s*it.
• Dev teams didn’t push code. Nobody tampered with the cluster
• Preliminary RCA – pod-to-pod communication was down. Solution?
– Restart kube-proxy
– Restart kube-dns
– Restart something!
Holy…s*it.
Restarting certain system pods can fix the problem
• First, see what is running: kubectl -n kube-system get pod,svc,ep,deploy,ds -o wide
– kubectl delete po -l component=kube-proxy -n kube-system
– kubectl delete po -l component=kube-svc-redirect -n kube-system
– kubectl delete po -l component=tunnel -n kube-system
– kubectl delete po -l k8s-app=kube-dns -n kube-system
Mayday, It’s dead.
• When nothing works anymore, it’s time to reboot the cluster nodes
• It’s not a best practice, but it’s a must-do to regain operations
Mayday, It’s dead.
• You have two options
–The brutal way
–The “nice” way
• https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b
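The “nice” way generally means taking nodes out of rotation one at a time before rebooting them; a minimal sketch with an illustrative node name:
  # Hypothetical graceful reboot of a single node
  kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data
  # ...reboot the underlying VM and wait for it to report Ready again...
  kubectl uncordon aks-nodepool1-12345678-0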
How to perform RCAs? – Seriously
• After the whole system has recovered, it’s time to roll up your sleeves
–You need to SSH into each node and gather some logs.
• Gather the following logs:
–/var/log/azure-vnet*
–/var/run/azure-vnet*
• Run the following command:
–journalctl -u kubelet* --no-pager --since "2019-06-06 08:00:00" --until "2019-06-06 22:45:00" > nodenumber.log
How to perform RCAs?
• Tools to use:
– Transfer.sh for moving files off the nodes
– CMTrace – System Center tools
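A sketch of getting a collected log off a node with transfer.sh, assuming the node has outbound internet access:
  # Upload the kubelet log gathered earlier and get back a temporary download URL
  curl --upload-file ./nodenumber.log https://transfer.sh/nodenumber.log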
Some best practices
• Most issues can be solved with proper resource management
• Patch management is “still” required
–Use a tool like Kured to apply patches
• Monitoring the cluster and apps is a must
–Use Prometheus / Grafana and Log Analytics for enhanced monitoring
• If the cluster needs drivers – never, ever use the :latest tag
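Log Analytics (Azure Monitor for containers) can be switched on per cluster; a minimal sketch with illustrative resource names:
  # Hypothetical: enable the monitoring add-on on an existing AKS cluster
  az aks enable-addons --resource-group myResourceGroup --name myAKSCluster --addons monitoring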
Resources
• Kured - https://docs.microsoft.com/bs-cyrl-ba/azure/aks/node-updates-kured
• SSH into AKS nodes - https://docs.microsoft.com/en-us/azure/aks/ssh
• Ingress Controllers - https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
• Ingress comparison - https://kubedex.com/ingress/
• AKS Reboot Gracefully - https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b