@ITCAMPRO #ITCAMP19 Community Conference for IT Professionals
Azure Kubernetes in Production -
Field notes and pain points
Florin Loghiade
Cloud Solutions Architect @ Avaelgo
Azure MVP
Web: florinloghiade.ro / Twitter: @florinloghiade

Agenda
• AKS in production
• Which ingress?
• Scaling issues
• Holy s*it moments
• Dropped dead, now what?

Running in production – General
• It’s not a simple click-and-run operation
• Size your nodes carefully
• ALWAYS set resource constraints on pods
– If you don’t, bad things happen.
– Really, really bad things.
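
As a minimal sketch of the "always set resource constraints" rule, the Python below scans a pod spec (already loaded as a dict, for example from a Deployment manifest) and lists every container that is missing a CPU or memory request or limit. The function name `missing_constraints` and the sample spec are hypothetical; only the field layout (`containers[].resources.requests` / `limits`) comes from the standard Kubernetes pod spec.

```python
REQUIRED = ("cpu", "memory")


def missing_constraints(pod_spec):
    """Return (container, section, resource) tuples for every unset request or limit."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for resource in REQUIRED:
                if resource not in values:
                    name = container.get("name", "<unnamed>")
                    problems.append((name, section, resource))
    return problems


# Hypothetical pod spec: "api" is fully constrained, "sidecar" is not.
example_spec = {
    "containers": [
        {
            "name": "api",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
    ]
}

for name, section, resource in missing_constraints(example_spec):
    print(f"{name}: no {section}.{resource}")
```

A check like this could run in CI before anything is deployed, so an unconstrained pod never reaches the cluster.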

Running in production – Monitoring
• Never monitor only the cluster, or only from inside the cluster
• Validate everything from outside as well
• When in doubt, monitor and alert on everything… for now
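
One way to "validate from outside" is a small probe that runs on a machine outside the cluster and hits the public endpoint. The sketch below uses only the Python standard library. The function name `probe` and the URL in the usage note are hypothetical, and a real setup would more likely rely on a hosted availability test than on a hand-rolled script.

```python
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the endpoint answers with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError, refused connections and timeouts;
        # ValueError covers malformed URLs.
        return False
```

Calling `probe("https://myapp.example.com/healthz")` from a scheduler and alerting when it returns False gives a signal that does not depend on the cluster's own view of itself.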

Running in production – CI/CD
• kubectl is not the tool of choice for this
– Use Helm charts: not perfect, but getting there
• Use whatever CI/CD tool you want, as long as you version everything
• Never use the :latest tag for your container(s)
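
The ":latest" rule is easy to enforce mechanically. The sketch below is one possible check, assuming image references follow the usual `registry/repository:tag` or `repository@digest` shapes. The function name `is_pinned` is hypothetical. It treats a missing tag the same as `:latest`, because that is what the container runtime defaults to.

```python
def is_pinned(image):
    """True if the image has an explicit tag other than 'latest', or a digest."""
    if "@" in image:
        # Digest reference such as repo/app@sha256:..., always immutable.
        return True
    # Only look at the last path segment so a registry port
    # (localhost:5000/app) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all means :latest is implied
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"


for image in ("nginx", "nginx:latest", "myregistry.azurecr.io/app:1.4.2"):
    print(image, "->", "ok" if is_pinned(image) else "NOT pinned")
```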

N number of ingresses, which one?
• Nginx vs HAProxy vs Traefik vs …
• You can use multiple ingress controllers in a cluster
• The most mature products are Nginx and HAProxy

Nginx Ingress
• Most common ingress controller
• Easy to install, boring to configure
• Stable and reliable
• But
– No dynamic service discovery
– Lacks a status page and a monitoring page

HAProxy
• Extremely stable and fast
• Fast as in… fast
– Can handle 100k+ connections
– Can saturate high-speed NICs (40 Gbps+)
• But
– Pure load balancer

Traefik
• Newcomer, still fresh
• Supports dynamic configuration
• Has service discovery
• Lots of features and more to come
• Comes with a dashboard for monitoring
• Out-of-the-box Let’s Encrypt integration
• But
– Doesn’t support hitless reloads
– Doesn’t support TCP, only HTTP(S)

HTTP Application Routing
• Backed by nginx
• External-DNS is set up automatically
• Great for dev/test or the Azure Dev Spaces feature
• But
– A bit complicated to configure
– Crashes are consistent when they happen
• Personal recommendation…
Never in production.

Why not? #Repeat
• Its primary purpose is dev/test or small applications
• Hard to manage
• Hard to debug / not worth it / YAML export to redeploy

App is running hot, where’s scaling?
• Auto-scaling depends a lot on metrics
– No metrics, no scaling
• Pods require resource constraints for efficient scaling
• Node auto-scaling is a preview feature in Azure
• Cluster Autoscaler is dependent on HPA
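
To make "no metrics, no scaling" concrete, here is a sketch of the replica calculation described in the Kubernetes HPA documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the min/max defaults are hypothetical. For CPU utilization the metric is a percentage of the pod's CPU request, which is why pods without requests cannot be scaled on it.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 50% target.
print(desired_replicas(4, 90, 50))
```

If the metric never arrives, `current_metric` has no value and there is nothing to compute, so the replica count simply stays where it is.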

Cluster Autoscaler

What comes up must come down
• Sometimes HPA doesn’t do its job of scaling pods down
• Monitor your stuff or “bad things happen” (CFO 2019™)
• Ever done a load test on 100 VMs and forgotten to delete them? Cluster Autoscaler and HPA don’t solve that problem.
• Cluster Autoscaler will not remove nodes (AKA delete stuff) if
– Pods cannot be moved because of node selector / affinity rules
– Pods have local storage (not talking about PVs)
• Yes, I have seen this in prod.
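
The two blockers above can be spotted ahead of time. The sketch below is a heuristic, not the autoscaler's own logic: it flags pods that use `emptyDir` or `hostPath` volumes (unless they carry the `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` annotation) and pods pinned by a node selector or node affinity. The function name, the sample pod and the `agentpool: bigmem` label are hypothetical.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def potential_scale_down_blockers(pod):
    """List reasons why this pod might keep its node from being removed."""
    meta = pod.get("metadata") or {}
    spec = pod.get("spec") or {}
    annotations = meta.get("annotations") or {}
    reasons = []
    # Local storage only blocks scale-down when the pod is not marked safe to evict.
    if annotations.get(SAFE_TO_EVICT) != "true":
        for volume in spec.get("volumes") or []:
            if "emptyDir" in volume or "hostPath" in volume:
                reasons.append("local storage: " + volume.get("name", "<unnamed>"))
    # Placement rules only block if no other node matches; flag them for review.
    if spec.get("nodeSelector"):
        reasons.append("nodeSelector")
    if (spec.get("affinity") or {}).get("nodeAffinity"):
        reasons.append("nodeAffinity")
    return reasons


# Hypothetical pod: scratch space on emptyDir, data on a PVC, pinned to a node pool.
example_pod = {
    "metadata": {"name": "cache-0"},
    "spec": {
        "nodeSelector": {"agentpool": "bigmem"},
        "volumes": [
            {"name": "scratch", "emptyDir": {}},
            {"name": "data", "persistentVolumeClaim": {"claimName": "data-pvc"}},
        ],
    },
}

print(potential_scale_down_blockers(example_pod))
```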

“Our systems look fine”
• Terminating, Pending and Evicted are clear signals of a potential disaster
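
A rough way to watch for those three states is to count them in the output of `kubectl get pods --all-namespaces -o json`. The sketch below assumes that JSON has already been parsed into a dict. The function names and the sample data are hypothetical; the fields used (`metadata.deletionTimestamp`, `status.phase`, `status.reason`) are standard pod fields.

```python
from collections import Counter


def classify(pod):
    """Map a pod object to the state name kubectl would roughly display."""
    status = pod.get("status") or {}
    if (pod.get("metadata") or {}).get("deletionTimestamp"):
        return "Terminating"  # deletion requested but the pod is still around
    if status.get("reason") == "Evicted":
        return "Evicted"      # evicted pods stay behind in phase Failed
    return status.get("phase", "Unknown")


def warning_counts(pod_list):
    """Count only the states that signal trouble."""
    counts = Counter(classify(pod) for pod in pod_list.get("items", []))
    watched = ("Terminating", "Pending", "Evicted")
    return {state: counts[state] for state in watched if counts[state]}


# Hypothetical, heavily trimmed pod list.
example_pods = {
    "items": [
        {"metadata": {"name": "a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "b", "deletionTimestamp": "2019-06-06T08:15:00Z"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "c"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "d"}, "status": {"phase": "Failed", "reason": "Evicted"}},
    ]
}

print(warning_counts(example_pods))
```

A non-empty result that persists across several checks is worth an alert, even when every dashboard is green.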

DEMO

“Our systems look fine”
• There are times when everything looks fine but the app is dead
– Logging shows pods up
– Metrics show “some traffic”
– Critical K8s systems are fine
• RCA? Dev error / problem – not my monkey, not my circus

Holy… s*it.
• Everything shows as “working as intended™”
• Application is dead; not working
• Dev teams didn’t push code. Nobody tampered with the cluster
• Preliminary RCA: pod-to-pod communication was down. Solution?
– Restart kube-proxy
– Restart kube-dns
– Restart something!

Restarting certain system pods can fix the problem
• kubectl -n kube-system get pod,svc,ep,deploy,ds -o wide
– kubectl delete po -l component=kube-proxy -n kube-system
– kubectl delete po -l component=kube-svc-redirect -n kube-system
– kubectl delete po -l component=tunnel -n kube-system
– kubectl delete po -l k8s-app=kube-dns -n kube-system

Mayday, it’s dead.
• When nothing works any more, it’s time to reboot the cluster nodes
• It’s not a best practice, but it’s a must-do to regain operations
• You have two options
– The brutal way
– The “nice” way
• https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b

DEMO

How to perform RCAs? – Seriously
• After the whole system has recovered, it’s time to roll up your sleeves
– You need to SSH into each node and gather some logs.
• Gather the following logs:
– /var/log/azure-vnet*
– /var/run/azure-vnet*
• Run the following command:
– journalctl -u "kubelet*" --no-pager --since "2019-06-06 08:00:00" --until "2019-06-06 22:45:00" > nodenumber.log

How to perform RCAs?
• Tools to use:
– transfer.sh for moving files off the nodes
– CMTrace, from the System Center tools

Some best practices
• Most issues can be solved with proper resource management
• Patch management is “still” required
– Use a tool like Kured to do patches
• Monitoring the cluster and apps is a must
– Use Prometheus / Grafana and Log Analytics for enhanced monitoring
• If the cluster needs drivers, never ever use the :latest tag

Resources
• Kured - https://docs.microsoft.com/bs-cyrl-ba/azure/aks/node-updates-kured
• SSH into AKS nodes - https://docs.microsoft.com/en-us/azure/aks/ssh
• Ingress controllers - https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
• Ingress comparison - https://kubedex.com/ingress/
• AKS reboot gracefully - https://gist.github.com/tomasaschan/9dbc9180d313ad8cae57f62ce229610b

Q & A

ITCamp 2019 - Florin Loghiade - Azure Kubernetes in Production - Field notes and pain points
