A Million Ways
to Crash Your Cluster
CONTAINER CAMP UK
HENNING JACOBS
@try_except_
2018-09-07
4
ZALANDO AT A GLANCE
• ~4.5 billion EUR revenue (2017)
• > 200 million visits per month
• > 15,000 employees in Europe
• > 70% of visits via mobile devices
• > 23 million active customers
• > 300,000 product choices
• ~2,000 brands
• 15 countries
5
SCALE
95 Clusters
378 Accounts
INCIDENTS ARE FINE
7
INCIDENT #1: CUSTOMER IMPACT
8
INCIDENT #1: IAM RETURNING 404
9
INCIDENT #1: NUMBER OF PODS
10
LIFE OF A REQUEST (INGRESS)
DNS: my-app.example.org (ALIAS record)
ALB: aws-1234-lb.eu-central-1.elb.amazonaws.com
SKIPPER: 172.31.1.1:9999, 172.31.2.1:9999, 172.31.3.1:9999, 172.31.4.1:9999
SERVICE: 10.3.0.216
DEPLOYMENT
POD: 10.2.0.1, 10.2.1.1, 10.2.2.1, 10.2.3.1
11
INCIDENT #1: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
    application: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            application: "foobar"
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
            ...
12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
    application: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        metadata:
          labels:
            application: "foobar"
        spec:
          restartPolicy: Never
          containers:
            ...
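Comparing the two manifests, the main difference is placement: in the fixed version `concurrencyPolicy` and the two history limits come right after `schedule`, at the CronJob `spec` level, and `activeDeadlineSeconds` is set on the job. A minimal sketch of a check that flags such misplaced fields, assuming the manifest has already been parsed into a Python dict. The function name `misplaced_cronjob_fields` is hypothetical, and the nesting of both sample dicts is inferred from the order of the lines on the slides:

```python
# Fields that belong to the CronJob spec itself, not to the pod template.
CRONJOB_LEVEL_FIELDS = {
    "concurrencyPolicy",
    "successfulJobsHistoryLimit",
    "failedJobsHistoryLimit",
}


def misplaced_cronjob_fields(manifest):
    """Return CronJob-level fields that were found inside the pod spec."""
    pod_spec = (
        manifest.get("spec", {})
        .get("jobTemplate", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    return sorted(CRONJOB_LEVEL_FIELDS.intersection(pod_spec))


# Assumed nesting of the first manifest: the policy fields sit in the pod spec.
innocent = {
    "kind": "CronJob",
    "spec": {
        "schedule": "*/15 9-19 * * Mon-Fri",
        "jobTemplate": {"spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "concurrencyPolicy": "Forbid",
            "successfulJobsHistoryLimit": 1,
            "failedJobsHistoryLimit": 1,
        }}}},
    },
}

# Assumed nesting of the fixed manifest: the same fields sit under the CronJob spec.
fixed = {
    "kind": "CronJob",
    "spec": {
        "schedule": "7 8-18 * * Mon-Fri",
        "concurrencyPolicy": "Forbid",
        "successfulJobsHistoryLimit": 1,
        "failedJobsHistoryLimit": 1,
        "jobTemplate": {"spec": {
            "activeDeadlineSeconds": 120,
            "template": {"spec": {"restartPolicy": "Never"}},
        }},
    },
}

print(misplaced_cronjob_fields(innocent))
print(misplaced_cronjob_fields(fixed))
```

A check like this could run in CI before a manifest is applied; it only covers the three fields named here.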
13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
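The two Skipper lessons amount to failing static: when the API server cannot be reached, keep serving the routes from the last successful sync and keep reporting healthy. A minimal sketch of that idea in Python. This is not Skipper's actual implementation; `RouteCache`, `flaky_fetch_factory`, and the route format are hypothetical names used only for illustration:

```python
class RouteCache:
    """Keeps the last successfully fetched routing table."""

    def __init__(self, fetch_routes):
        self._fetch_routes = fetch_routes
        self.routes = {}
        self.last_sync_ok = False

    def refresh(self):
        """Try to sync routes; on failure keep the last known set."""
        try:
            self.routes = self._fetch_routes()
            self.last_sync_ok = True
        except Exception:
            # API server problem: do not drop the routes we already have.
            self.last_sync_ok = False
        return self.routes

    def healthy(self):
        """Health depends on having routes to serve, not on the last sync."""
        return bool(self.routes)


def flaky_fetch_factory():
    """Build a fetch function that succeeds once and then always fails."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        if calls["n"] > 1:
            raise ConnectionError("API server unavailable")
        return {"my-app.example.org": "10.3.0.216"}

    return fetch


cache = RouteCache(flaky_fetch_factory())
cache.refresh()  # first sync succeeds
cache.refresh()  # second sync fails, routes are retained
print(cache.routes, cache.healthy(), cache.last_sync_ok)
```

The trade-off is that stale routes may point at pods that no longer exist, so a real router would likely also want to expose the age of the last successful sync.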
14
INCIDENT #2: CLUSTER DOWN
15
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
16
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
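The help line shows the problem: the second positional argument is a range end, not a keyword. Assuming etcd's half-open range semantics, where a ranged delete covers every key `k` with `key <= k < range_end` in byte order, the literal word `prefix` becomes the upper bound. A small Python simulation over a few made-up keys (all key names except the one from the slide are hypothetical):

```python
def keys_in_range(keys, key, range_end):
    """Keys a ranged delete would cover: key <= k < range_end, compared as bytes."""
    lo, hi = key.encode(), range_end.encode()
    return [k for k in sorted(keys) if lo <= k.encode() < hi]


# Hypothetical sample of keys; a real cluster stores far more.
store = [
    "/registry-kube-1/apiextensions/foo",
    "/registry-kube-1/certificatesigningrequest/csr-1",
    "/registry-kube-1/deployments/default/my-app",
    "/registry-kube-1/pods/default/my-app-1",
    "/registry-kube-1/secrets/default/token",
]

deleted = keys_in_range(
    store, "/registry-kube-1/certificatesigningrequest", "prefix"
)
for k in deleted:
    print("would delete:", k)
```

Because `/` sorts before `p`, every key from the given one to the end of the `/registry-kube-1/` namespace falls inside the range, not just the certificate signing requests.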
17
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
19
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
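Monitoring the snapshots can be as simple as alerting when the newest backup is older than expected. A minimal sketch, assuming the snapshot timestamps have already been listed from the backup bucket. The function name, the one-hour threshold, and the sample timestamps are illustrative assumptions, not values from the talk:

```python
from datetime import datetime, timedelta, timezone


def snapshots_stale(snapshot_times, now, max_age=timedelta(hours=1)):
    """Return True if there is no snapshot or the newest one is older than max_age."""
    if not snapshot_times:
        return True
    return now - max(snapshot_times) > max_age


# Hypothetical timestamps for illustration.
now = datetime(2018, 9, 7, 12, 0, tzinfo=timezone.utc)
recent = [now - timedelta(hours=2), now - timedelta(minutes=30)]
old = [now - timedelta(hours=3), now - timedelta(hours=2)]

print(snapshots_stale(recent, now))
print(snapshots_stale(old, now))
print(snapshots_stale([], now))
```

Treating an empty list as stale matters: a backup job that never ran should alert just like one that stopped running.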
20
INCIDENT #3: LATENCY SPIKES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
# Watchdog: restart the local etcd member whenever the API server stops answering.
SLEEPTIME=60
while true; do
    echo "sleep for $SLEEPTIME seconds"
    sleep $SLEEPTIME
    # Probe the API server on its local port; give up after 5 seconds.
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done
22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
23
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
24
INCIDENT #4: IMPACT
25
INCIDENT #4: CLUSTER DOWN?
26
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
28
CLUSTER UPGRADE FLOW
29
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
30
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel   Description                                                      Clusters
dev       Development and playground clusters.                             3
alpha     Main infrastructure cluster (important to us).                   1
beta      Product clusters for the rest of the organization (prod/test).   90+
31
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
32
RUNNING E2E TESTS (BEFORE)
Test cluster (control plane + nodes), branch: dev
Create Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
33
RUNNING E2E TESTS (NOW)
Test cluster (control plane + nodes), branch: alpha (base) → test cluster (control plane + nodes), branch: dev (head)
Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
34
INCIDENT #4: LESSONS LEARNED
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
  • Bootstrap new cluster with the previous configuration
  • Apply new configuration
  • Run end-to-end & conformance tests
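The steps above can be sketched as a small test driver. This is an illustration only: the `cluster` object and its methods are hypothetical stand-ins, not the Cluster Lifecycle Manager's API, and `FakeCluster` exists just to show the order of operations:

```python
def run_upgrade_test(cluster, base_config, head_config):
    """Bootstrap with the previous config, apply the new one, then run the tests."""
    cluster.create(base_config)
    try:
        cluster.update(head_config)
        return cluster.run_e2e_tests() and cluster.run_conformance_tests()
    finally:
        # Always clean up the test cluster, even if an update or test fails.
        cluster.delete()


class FakeCluster:
    """Records calls so the order of operations can be inspected."""

    def __init__(self):
        self.calls = []

    def create(self, config):
        self.calls.append(("create", config))

    def update(self, config):
        self.calls.append(("update", config))

    def run_e2e_tests(self):
        self.calls.append(("e2e", None))
        return True

    def run_conformance_tests(self):
        self.calls.append(("conformance", None))
        return True

    def delete(self):
        self.calls.append(("delete", None))


fake = FakeCluster()
passed = run_upgrade_test(fake, base_config="alpha", head_config="dev")
print(passed, [name for name, _ in fake.calls])
```

Creating the cluster from the base branch first is what exercises the migration path; creating it directly from the head branch would only test a fresh install.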
35
INCIDENT #5: IMPACT
[4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied
one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
[5:01 PM] Alice: Now it does not start the build step at all
[5:02 PM] John: +1
[5:02 PM] John: Failed to create builder pod: …
[5:02 PM] Pedro: +1
[5:04 PM] Damien: +1
[5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which
results in many problems…
...
36
INCIDENT #5: IMPACT
37
INCIDENT #5: A VERY INNOCENT PULL REQUEST
38
INCIDENT #5: WHAT HAPPENED
• Deployment caused rebuild with the latest stable Go version
• Library for signature verification was incompatible with Go 1.10,
causing all verification checks to fail during runtime.
• Lack of unit/smoke tests and alerting for one component
• "Near miss": outage could have had large impact
39
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s),
switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have
issues with timeouts
• 502's during cluster updates: race condition during network setup
40
MORE TOPICS
• Graceful Pod shutdown and
race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
41
RACE CONDITIONS..
github.com/zalando-incubator/kubernetes-on-aws
42
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
WELCOME TO
CLOUD NATIVE!
44
45
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
46
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
https://github.com/hjacobs/kube-ops-view
48
OTHER TALKS
• Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
• Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
• Inside Kubernetes Resource Management (QoS) - KubeCon 2018
We need more failure talks!
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k

Weitere ähnliche Inhalte

Was ist angesagt?

Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueHenning Jacobs
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHenning Jacobs
 
Developer Experience at Zalando - CNCF End User SIG-DX
Developer Experience at Zalando - CNCF End User SIG-DXDeveloper Experience at Zalando - CNCF End User SIG-DX
Developer Experience at Zalando - CNCF End User SIG-DXHenning Jacobs
 
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...Zalando adtech lab
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Henning Jacobs
 
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWKubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWHenning Jacobs
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Henning Jacobs
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Henning Jacobs
 
Kubernetes at Datadog Scale
Kubernetes at Datadog ScaleKubernetes at Datadog Scale
Kubernetes at Datadog ScaleDocker, Inc.
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDocker, Inc.
 
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...On-Demand Image Resizing from Part of the monolith to Containerized Microserv...
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...Docker, Inc.
 
Making the most out of kubernetes audit logs
Making the most out of kubernetes audit logsMaking the most out of kubernetes audit logs
Making the most out of kubernetes audit logsLaurent Bernaille
 
Securing Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXSecuring Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXDocker, Inc.
 
Deep dive in container service discovery
Deep dive in container service discoveryDeep dive in container service discovery
Deep dive in container service discoveryDocker, Inc.
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSDocker, Inc.
 
Kubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesKubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesLaurent Bernaille
 
Browser Testing with Docker - Craig Huber
Browser Testing with Docker - Craig HuberBrowser Testing with Docker - Craig Huber
Browser Testing with Docker - Craig HuberDocker, Inc.
 
Minikube Workshop Handout
Minikube Workshop HandoutMinikube Workshop Handout
Minikube Workshop HandoutAlfie Chen
 
Kubernetes: Beyond Baby Steps
Kubernetes: Beyond Baby StepsKubernetes: Beyond Baby Steps
Kubernetes: Beyond Baby StepsDigitalOcean
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Laurent Bernaille
 

Was ist angesagt? (20)

Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native Prague
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
 
Developer Experience at Zalando - CNCF End User SIG-DX
Developer Experience at Zalando - CNCF End User SIG-DXDeveloper Experience at Zalando - CNCF End User SIG-DX
Developer Experience at Zalando - CNCF End User SIG-DX
 
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...
05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KU...
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
 
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWKubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
 
Kubernetes at Datadog Scale
Kubernetes at Datadog ScaleKubernetes at Datadog Scale
Kubernetes at Datadog Scale
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
 
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...On-Demand Image Resizing from Part of the monolith to Containerized Microserv...
On-Demand Image Resizing from Part of the monolith to Containerized Microserv...
 
Making the most out of kubernetes audit logs
Making the most out of kubernetes audit logsMaking the most out of kubernetes audit logs
Making the most out of kubernetes audit logs
 
Securing Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXSecuring Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINX
 
Deep dive in container service discovery
Deep dive in container service discoveryDeep dive in container service discovery
Deep dive in container service discovery
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVS
 
Kubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesKubernetes DNS Horror Stories
Kubernetes DNS Horror Stories
 
Browser Testing with Docker - Craig Huber
Browser Testing with Docker - Craig HuberBrowser Testing with Docker - Craig Huber
Browser Testing with Docker - Craig Huber
 
Minikube Workshop Handout
Minikube Workshop HandoutMinikube Workshop Handout
Minikube Workshop Handout
 
Kubernetes: Beyond Baby Steps
Kubernetes: Beyond Baby StepsKubernetes: Beyond Baby Steps
Kubernetes: Beyond Baby Steps
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019
 

Ähnlich wie Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK

Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...Henning Jacobs
 
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupHenning Jacobs
 
Cloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesCloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesQAware GmbH
 
Production Grade Kubernetes Applications
Production Grade Kubernetes ApplicationsProduction Grade Kubernetes Applications
Production Grade Kubernetes ApplicationsNarayanan Krishnamurthy
 
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 Copenhagen
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 CopenhagenContinuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 Copenhagen
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 CopenhagenMikkelOscarLyderikLa
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackQAware GmbH
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17Mario-Leander Reimer
 
Migration Effort in the Cloud - The Case of Cloud Platforms
Migration Effort in the Cloud - The Case of Cloud PlatformsMigration Effort in the Cloud - The Case of Cloud Platforms
Migration Effort in the Cloud - The Case of Cloud PlatformsStefan Kolb
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechMichael Dürgner
 
Need to-know patterns building microservices - java one
Need to-know patterns building microservices - java oneNeed to-know patterns building microservices - java one
Need to-know patterns building microservices - java oneVincent Kok
 
Top Performance Problems in Distributed Architectures
Top Performance Problems in Distributed ArchitecturesTop Performance Problems in Distributed Architectures
Top Performance Problems in Distributed ArchitecturesAndreas Grabner
 
CA Spectrum® Just Keeps Getting Better and Better
CA Spectrum® Just Keeps Getting Better and BetterCA Spectrum® Just Keeps Getting Better and Better
CA Spectrum® Just Keeps Getting Better and BetterCA Technologies
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPSACA IT-Solutions
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITStijn Wijndaele
 
Service-Level Objective for Serverless Applications
Service-Level Objective for Serverless ApplicationsService-Level Objective for Serverless Applications
Service-Level Objective for Serverless Applicationsalekn
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECHZalando adtech lab
 
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...Henning Jacobs
 
DCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDocker, Inc.
 
Cloud Run - the rise of serverless and containerization
Cloud Run - the rise of serverless and containerizationCloud Run - the rise of serverless and containerization
Cloud Run - the rise of serverless and containerizationMárton Kodok
 

Ähnlich wie Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK (20)

Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - A...
 
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
 
Cloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesCloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit Kubernetes
 
Production Grade Kubernetes Applications
Production Grade Kubernetes ApplicationsProduction Grade Kubernetes Applications
Production Grade Kubernetes Applications
 
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 Copenhagen
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 CopenhagenContinuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 Copenhagen
Continuously Deliver Your Kubernetes Infrastructure - KubeCon 2018 Copenhagen
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stack
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
 
Migration Effort in the Cloud - The Case of Cloud Platforms
Migration Effort in the Cloud - The Case of Cloud PlatformsMigration Effort in the Cloud - The Case of Cloud Platforms
Migration Effort in the Cloud - The Case of Cloud Platforms
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando Tech
 
Need to-know patterns building microservices - java one
Need to-know patterns building microservices - java oneNeed to-know patterns building microservices - java one
Need to-know patterns building microservices - java one
 
Into The Box 2018 Ortus Keynote
Into The Box 2018 Ortus KeynoteInto The Box 2018 Ortus Keynote
Into The Box 2018 Ortus Keynote
 
Top Performance Problems in Distributed Architectures
Top Performance Problems in Distributed ArchitecturesTop Performance Problems in Distributed Architectures
Top Performance Problems in Distributed Architectures
 
CA Spectrum® Just Keeps Getting Better and Better
CA Spectrum® Just Keeps Getting Better and BetterCA Spectrum® Just Keeps Getting Better and Better
CA Spectrum® Just Keeps Getting Better and Better
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPS
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
 
Service-Level Objective for Serverless Applications
Service-Level Objective for Serverless ApplicationsService-Level Objective for Serverless Applications
Service-Level Objective for Serverless Applications
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
 
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
 
DCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDCEU 18: Docker Container Networking
DCEU 18: Docker Container Networking
 
Cloud Run - the rise of serverless and containerization
Cloud Run - the rise of serverless and containerizationCloud Run - the rise of serverless and containerization
Cloud Run - the rise of serverless and containerization
 

Mehr von Henning Jacobs

Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Henning Jacobs
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Henning Jacobs
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Henning Jacobs
 
API First with Connexion - PyConWeb 2018
API First with Connexion - PyConWeb 2018API First with Connexion - PyConWeb 2018
API First with Connexion - PyConWeb 2018Henning Jacobs
 
Developer Journey at Zalando - Idea to Production with Containers in the Clou...
Developer Journey at Zalando - Idea to Production with Containers in the Clou...Developer Journey at Zalando - Idea to Production with Containers in the Clou...
Developer Journey at Zalando - Idea to Production with Containers in the Clou...Henning Jacobs
 
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Henning Jacobs
 
Kubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationKubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationHenning Jacobs
 
Kubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformKubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformHenning Jacobs
 
Plan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthPlan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthHenning Jacobs
 
Docker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroDocker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroHenning Jacobs
 
STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015Henning Jacobs
 
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Henning Jacobs
 
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...Henning Jacobs
 

Mehr von Henning Jacobs (14)

Open Source at Zalando - OSB Open Source Day 2019
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK

  • 1. A Million Ways to Crash Your Cluster CONTAINER CAMP UK HENNING JACOBS @try_except_ 2018-09-07
  • 4. ZALANDO AT A GLANCE: ~ 4.5 billion EUR revenue (2017); > 200 million visits per month; > 15,000 employees in Europe; > 70% of visits via mobile devices; > 23 million active customers; > 300,000 product choices; ~ 2,000 brands; 15 countries
  • 8. INCIDENT #1: IAM RETURNING 404
  • 10. LIFE OF A REQUEST (INGRESS): DNS my-app.example.org; ALB aws-1234-lb.eu-central-1.elb.amazonaws.com; SERVICE 10.3.0.216; DEPLOYMENT; PODS 10.2.0.1, 10.2.1.1, 10.2.2.1, 10.2.3.1; SKIPPER 172.31.1.1:9999, 172.31.2.1:9999, 172.31.3.1:9999, 172.31.4.1:9999; ALIAS Record
  • 11. INCIDENT #1: INNOCENT MANIFEST
        apiVersion: batch/v2alpha1
        kind: CronJob
        metadata:
          name: "foobar"
          labels:
            application: "foobar"
        spec:
          schedule: "*/15 9-19 * * Mon-Fri"
          jobTemplate:
            spec:
              template:
                metadata:
                  labels:
                    application: "foobar"
                spec:
                  restartPolicy: Never
                  concurrencyPolicy: Forbid
                  successfulJobsHistoryLimit: 1
                  failedJobsHistoryLimit: 1
                  containers:
                  ...
  • 12. INCIDENT #1: FIXED CRON JOB
        apiVersion: batch/v2alpha1
        kind: CronJob
        metadata:
          name: "foobar"
          labels:
            application: "foobar"
        spec:
          schedule: "7 8-18 * * Mon-Fri"
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              activeDeadlineSeconds: 120
              template:
                metadata:
                  labels:
                    application: "foobar"
                spec:
                  restartPolicy: Never
                  containers:
                  ...
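Judging by the two manifests, the fix moves `concurrencyPolicy` and the two history limits from the pod spec up to the CronJob `spec` (and adds `activeDeadlineSeconds`). A minimal sketch of a check for that misplacement follows; the function name, the field set, and the trimmed manifests are hypothetical illustrations, not part of Kubernetes or of the talk.

```python
# Hypothetical lint helper; the names below are illustrative, not a Kubernetes API.
CRONJOB_SPEC_FIELDS = (
    "concurrencyPolicy",
    "failedJobsHistoryLimit",
    "successfulJobsHistoryLimit",
)


def misplaced_cronjob_fields(manifest: dict) -> list:
    """Return CronJob-level fields that were nested inside the pod spec."""
    pod_spec = (
        manifest.get("spec", {})
        .get("jobTemplate", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    return [field for field in CRONJOB_SPEC_FIELDS if field in pod_spec]


# Trimmed versions of the two manifests from the slides, as plain dicts.
innocent = {
    "kind": "CronJob",
    "spec": {
        "schedule": "*/15 9-19 * * Mon-Fri",
        "jobTemplate": {"spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "concurrencyPolicy": "Forbid",
            "successfulJobsHistoryLimit": 1,
            "failedJobsHistoryLimit": 1,
        }}}},
    },
}

fixed = {
    "kind": "CronJob",
    "spec": {
        "schedule": "7 8-18 * * Mon-Fri",
        "concurrencyPolicy": "Forbid",
        "successfulJobsHistoryLimit": 1,
        "failedJobsHistoryLimit": 1,
        "jobTemplate": {"spec": {
            "activeDeadlineSeconds": 120,
            "template": {"spec": {"restartPolicy": "Never"}},
        }},
    },
}

print(misplaced_cronjob_fields(innocent))
print(misplaced_cronjob_fields(fixed))
```

A check like this could run in CI before a manifest is applied; it only inspects structure and does not replace server-side validation.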
  • 13. INCIDENT #1: LESSONS LEARNED
      • ALB routes traffic to ALL hosts if all hosts report “unhealthy”
      • Fix Skipper Ingress to stay “healthy” during API server problems
      • Fix Skipper Ingress to retain last known set of routes
      • Use quota for number of pods
        apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: compute-resources
        spec:
          hard:
            pods: "1500"
  • 15. INCIDENT #2: MANUAL OPERATION
        % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  • 16. INCIDENT #2: RTFM
        % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
        help: etcdctl del [options] <key> [range_end]
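The help line shows that the second positional argument is `range_end`, so the word `prefix` is read as an upper bound rather than as an option. Assuming the usual etcd v3 range semantics (the half-open interval [key, range_end), compared in byte order), the sketch below models which keys such a range would cover. It models only the comparison, not `etcdctl` itself, and the key names in `store` are hypothetical.

```python
def keys_in_range(keys, key, range_end):
    """Return keys k with key <= k < range_end, compared as bytes."""
    lo, hi = key.encode(), range_end.encode()
    return [k for k in keys if lo <= k.encode() < hi]


# Hypothetical key names, for illustration only.
store = [
    "/registry-kube-1/apiservices/v1",
    "/registry-kube-1/certificatesigningrequest/csr-1",
    "/registry-kube-1/deployments/default/foobar",
    "/registry-kube-1/pods/default/foobar-1",
]

deleted = keys_in_range(
    store, "/registry-kube-1/certificatesigningrequest", "prefix"
)
for k in deleted:
    print(k)
```

Because every key starting with `/` sorts before the letter `p`, the range reaches far beyond the certificate signing requests: in this toy store only the `apiservices` key, which sorts before the start key, would survive.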
  • 17. Junior Engineers are Features, not Bugs: https://www.youtube.com/watch?v=cQta4G3ge44
  • 19. INCIDENT #2: LESSONS LEARNED
      • Disaster Recovery Plan?
      • Backup etcd to S3
      • Monitor the snapshots
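"Monitor the snapshots" can be as simple as alerting when the newest backup is too old. The sketch below assumes the snapshot timestamps have already been listed from the backup location; the function name and the two-hour threshold are hypothetical choices, not values from the talk.

```python
from datetime import datetime, timedelta, timezone


def snapshot_is_stale(snapshot_times, now, max_age=timedelta(hours=2)):
    """True if there is no snapshot, or the newest one is older than max_age."""
    if not snapshot_times:
        return True
    return now - max(snapshot_times) > max_age


# Hypothetical timestamps, for illustration only.
now = datetime(2018, 9, 7, 12, 0, tzinfo=timezone.utc)
snapshots = [
    datetime(2018, 9, 7, 9, 0, tzinfo=timezone.utc),
    datetime(2018, 9, 7, 11, 30, tzinfo=timezone.utc),
]

print(snapshot_is_stale(snapshots, now))
print(snapshot_is_stale(snapshots[:1], now))
```

Treating "no snapshots at all" as stale matters: a backup job that silently never ran is exactly the case the alert should catch.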
  • 20. INCIDENT #3: LATENCY SPIKES
        ... Kubernetes worker and master nodes sporadically fail to connect to etcd causing timeouts in the APIserver and disconnects in the pod network. ...
  • 21. INCIDENT #3: STOP THE BLEEDING
        #!/bin/bash
        SLEEPTIME=60
        while true; do
            echo "sleep for $SLEEPTIME seconds"
            sleep $SLEEPTIME
            timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
            if [ $? -eq 0 ]; then
                echo "all fine, no need to restart etcd member"
                continue
            else
                echo "restarting etcd-member"
                systemctl restart etcd-member
            fi
        done
  • 22. INCIDENT #3: CONFIRMATION FROM AWS
        [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  • 23. INCIDENT #3: LESSONS LEARNED
      • It's never the AWS infrastructure until it is
      • Treat t2 instances with care
      • Kubernetes components are not necessarily "cloud native"
        Cloud Native? Declarative, dynamic, resilient, and scalable
  • 29. CLUSTER LIFECYCLE MANAGER (CLM): github.com/zalando-incubator/cluster-lifecycle-manager
  • 30. CLUSTER CHANNELS: github.com/zalando-incubator/kubernetes-on-aws
        Channel  Description                                                      Clusters
        dev      Development and playground clusters.                             3
        alpha    Main infrastructure cluster (important to us).                   1
        beta     Product clusters for the rest of the organization (prod/test).   90+
  • 31. E2E TESTS ON EVERY PR: github.com/zalando-incubator/kubernetes-on-aws
  • 32. RUNNING E2E TESTS (BEFORE) [diagram] Branch: dev. Steps: Create Cluster, Run e2e tests, Delete Cluster. Caption: Testing dev to alpha upgrade
  • 33. RUNNING E2E TESTS (NOW) [diagram] Branches: alpha (base), dev (head). Steps: Create Cluster, Update Cluster, Run e2e tests, Delete Cluster. Caption: Testing dev to alpha upgrade
  • 34. INCIDENT #4: LESSONS LEARNED
      • Automated end-to-end tests are pretty good, but not enough
      • Test the diff/migration automatically
      • Bootstrap new cluster with the previous configuration
      • Apply new configuration
      • Run end-to-end & conformance tests
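The migration test described in these lessons can be sketched as a small driver: bootstrap with the previous configuration, apply the new one, run the tests, and always clean up. Everything here is a hypothetical illustration; the `create`, `update`, `run_tests` and `delete` callables stand in for whatever tooling actually provisions and tests a cluster.

```python
def run_upgrade_test(base, head, create, update, run_tests, delete):
    """Create a cluster from `base`, upgrade it to `head`, then test it."""
    cluster = create(base)           # bootstrap with the previous configuration
    try:
        update(cluster, head)        # apply the new configuration
        return run_tests(cluster)    # end-to-end and conformance tests
    finally:
        delete(cluster)              # clean up even if a step fails


# Fake implementations that only record the order of calls.
calls = []


def fake_create(config):
    calls.append(("create", config))
    return "test-cluster"


def fake_update(cluster, config):
    calls.append(("update", cluster, config))


def fake_run_tests(cluster):
    calls.append(("test", cluster))
    return True


def fake_delete(cluster):
    calls.append(("delete", cluster))


result = run_upgrade_test(
    "alpha", "dev", fake_create, fake_update, fake_run_tests, fake_delete
)
print(result, calls)
```

The `try`/`finally` keeps test clusters from leaking when the update or the tests raise, which is the failure case this kind of test exists to find.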
  • 35. INCIDENT #5: IMPACT
        [4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
        [5:01 PM] Alice: Now it does not start the build step at all
        [5:02 PM] John: +1
        [5:02 PM] John: Failed to create builder pod: …
        [5:02 PM] Pedro: +1
        [5:04 PM] Damien: +1
        [5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which results in many problems… ...
  • 37. INCIDENT #5: A VERY INNOCENT PULL REQUEST
  • 38. INCIDENT #5: WHAT HAPPENED
      • Deployment caused rebuild with the latest stable Go version
      • Library for signature verification was incompatible with Go 1.10, causing all verification checks to fail during runtime.
      • Lack of unit/smoke tests and alerting for one component
      • "Near miss": outage could have had large impact
  • 39. A MILLION WAYS TO CRASH YOUR CLUSTER?
      • Switch to latest Docker to fix issues with Docker daemon freezing
      • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS
      • Disabling CPU throttling (CFS quota) to avoid latency issues
      • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
      • 502's during cluster updates: race condition during network setup
  • 40. MORE TOPICS
      • Graceful Pod shutdown and race conditions (endpoints, Ingress)
      • Incompatible Kubernetes changes
      • CoreOS ContainerLinux "stable" won't boot
      • Kubernetes EBS volume handling
      • Docker
  • 41. RACE CONDITIONS..
      • Switch to the latest Docker version available to fix the issues with Docker daemon freezing
      • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
      • Disabling CPU throttling (CFS quota) to avoid latency issues
      • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
      • 502's during cluster updates: race condition
      • github.com/zalando-incubator/kubernetes-on-aws
  • 42. TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws
  • 45. OPEN SOURCE
      • Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
      • AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
      • Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
      • External DNS: github.com/kubernetes-incubator/external-dns
      • Postgres Operator: github.com/zalando-incubator/postgres-operator
      • Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
      • Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
  • 48. OTHER TALKS
      • Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
      • Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
      • Inside Kubernetes Resource Management (QoS) - KubeCon 2018
        We need more failure talks!
  • 49. QUESTIONS? HENNING JACOBS, HEAD OF DEVELOPER PRODUCTIVITY, henning@zalando.de, @try_except_. Illustrations by @01k