SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
DEVOPS NRW
2018-02-21
HENNING JACOBS
@try_except_
Kubernetes on AWS
@ZalandoTech
2
ZALANDO IN NUMBERS
> 4.5billion EUR
2017
> 200
million
visits
per
month
> 14,000
employees in
Europe
> 70%
of visits via
mobile devices
> 22
million
active customers
> 250,000
product choices
~ 2,000
brands
15
countries
3
OUR FOOTPRINT AROUND EUROPE
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
4
OUR FOOTPRINT AROUND EUROPE
TECH
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
INCIDENTS ARE FINE
ON CALL
7
INCIDENT #1: CUSTOMER IMPACT
8
INCIDENT #1: IAM RETURNING 404
9
INCIDENT #1: NUMBER OF PODS
10
LIFE OF A REQUEST (INGRESS)
DNS
my-app.example.org
ALB
aws-1234-lb.eu-central-1.elb.amazonaws.com
SERVICE
10.3.0.216
DEPLOYMENT
POD
10.2.0.1
POD
10.2.1.1
POD
10.2.2.1
POD
10.2.3.1
SKIPPER
172.31.1.1:9999
SKIPPER
172.31.2.1:9999
SKIPPER
172.31.3.1:9999
SKIPPER
172.31.4.1:9999
ALIAS Record
11
INCIDENT #1: UNASSUMING MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "foobar"
spec:
schedule: "*/15 9-19 * * Mon-Fri"
jobTemplate:
spec:
template:
metadata:
labels:
application: "foobar"
spec:
restartPolicy: Never
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
containers:
...
12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "foobar"
spec:
schedule: "7 8-18 * * Mon-Fri"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
jobTemplate:
spec:
activeDeadlineSeconds: 120
template:
metadata:
labels:
application: "foobar"
spec:
restartPolicy: Never
containers:
...
13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "1500"
14
STABILITY: LIMIT RANGE
$ kubectl describe limitrange
Name: limits
Namespace: default
Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio
---- -------- --- --- ----------- ------------- -----------------------
Container memory - 64Gi 100Mi 1Gi -
Container cpu - 16 100m 3 -
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
⇒ Mitigate errors on OSI layer 8 ;-)
15
INCIDENT #2: CLUSTER DOWN
16
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
17
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
https://www.outcome-eng.com/human-error-never-root-cause/
19
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
20
INCIDENT #3: LATENCY SPIKES
21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
echo "sleep for $SLEEPTIME seconds"
sleep $SLEEPTIME
timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
if [ $? -eq 0 ]; then
echo "all fine, no need to restart etcd member"
continue
else
echo "restarting etcd-member"
systemctl restart etcd-member
fi
done
22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
23
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
24
INCIDENT #4: IMPACT
25
INCIDENT #4: CLUSTER DOWN?
26
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
28
CLUSTER
UPDATES
29
CLUSTER LIFECYCLE MANAGER
30
INCIDENT #4: LESSONS LEARNED
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with the previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
31
TRAFFIC SWITCHING
Default deployment: rolling update
32
TRAFFIC SWITCHING
33
MANUAL TRAFFIC SWITCHING
$ zkubectl traffic <ingress-name>
SERVICE WEIGHT
<service-backend-1> 30%
<service-backend-2> 70%
$ zkubectl traffic <ingress-name> <service-backend-2> 100
SERVICE WEIGHT
<service-backend-1> 0%
<service-backend-2> 100%
34
DOCUMENTATION
"Documentation is hard to find"
"Documentation is not comprehensive enough"
"Remove unnecessary complexity and obstacles."
"Get the documentation up to date and prepare
use cases"
"More and more clear documentation"
"More detailed docs, example repos with more
complicated deployments."
35
DOCUMENTATION
• Restructure following
https://www.divio.com/en/blog/documentation/
• Concepts
• How Tos
• Tutorials
• Reference
• Global Search
• Weekly Health Check: Support → Documentation
36
ONBOARDING
• Many new concepts to grasp vs. 200+ teams
• Kubernetes Training (2h)
• Documentation
• Recorded Friday Demos
• Support Channels (chat, mail)
37
CLUSTER SCOPE
38
DOES ANYTHING EVEN WORK?
• Kubernetes API as the primary interface
• Ingress Controller + External DNS
• kube2iam
• Cluster lifecycle management
• Zalando IAM/OAuth integration via CRD
• PostgreSQL Operator
39
DEPLOYMENT
40
DEPLOYMENT CONFIGURATION
.
├── deploy/apply
│ ├── deployment.yaml # K8s Deployment
│ ├── credentials.yaml # K8s TPR
│ ├── ingress.yaml # K8s Ingress
│ └── service.yaml # K8s Service
└── delivery.yaml # pipeline config
41
INGRESS.YAML
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: "..."
spec:
rules:
# DNS name your application should be exposed on
- host: "myapp.foo.example.org"
http:
paths:
- backend:
serviceName: "myapp"
servicePort: 80
42
CONTINUOUS DELIVERY PLATFORM
43
CDP: APPLY
44
CDP: OPTIONAL APPROVAL
45
POSTGRES OPERATOR
Application to manage PostgreSQL clusters
Observes “postgres” manifests (CRDs)
Spawns and modifies new clusters
Syncs and provisions roles
Handles volume resize, incl. Resize2fs
Also responsible for updating Docker images
https://github.com/hjacobs/kube-ops-view
47
LINKS
Running Kubernetes in Production on AWS
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
Kube AWS Ingress Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
https://github.com/kubernetes-incubator/external-dns
PostgreSQL Operator
https://github.com/zalando-incubator/postgres-operator
Zalando Cluster Configuration
https://github.com/zalando-incubator/kubernetes-on-aws
List of Organizations using Kubernetes on AWS
https://github.com/hjacobs/kubernetes-on-aws-users
https://goo.gl/t2zNc8
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k

Weitere ähnliche Inhalte

Was ist angesagt?

From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupHenning Jacobs
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Henning Jacobs
 
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Henning Jacobs
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDocker, Inc.
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSDocker, Inc.
 
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Henning Jacobs
 
Kubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationKubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationHenning Jacobs
 
Docker and Windows: The State of the Union
Docker and Windows: The State of the UnionDocker and Windows: The State of the Union
Docker and Windows: The State of the UnionElton Stoneman
 
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeAcademy
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson LinHanLing Shen
 
KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeAcademy
 
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Publicis Sapient Engineering
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Henning Jacobs
 
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeAcademy
 
KubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeAcademy
 
Docker 進階實務班
Docker 進階實務班Docker 進階實務班
Docker 進階實務班Philip Zheng
 
Docker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsDocker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsChris Tankersley
 
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesJava Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesAntons Kranga
 
Pod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimPod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimVictor Morales
 
Docker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersDocker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersAnthony Chu
 

Was ist angesagt? (20)

From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
 
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVS
 
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
 
Kubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationKubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee Presentation
 
Docker and Windows: The State of the Union
Docker and Windows: The State of the UnionDocker and Windows: The State of the Union
Docker and Windows: The State of the Union
 
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin
 
KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government
 
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
 
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
 
KubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to Kubernetes
 
Docker 進階實務班
Docker 進階實務班Docker 進階實務班
Docker 進階實務班
 
Docker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsDocker for PHP Developers - Jetbrains
Docker for PHP Developers - Jetbrains
 
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesJava Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
 
Pod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimPod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from Dockershim
 
Docker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersDocker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server Containers
 

Ähnlich wie Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Henning Jacobs
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECHZalando adtech lab
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Henning Jacobs
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Henning Jacobs
 
Automatic Ingress in Kubernetes
Automatic Ingress in KubernetesAutomatic Ingress in Kubernetes
Automatic Ingress in KubernetesRodrigo Reis
 
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinHenning Jacobs
 
Kubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachKubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachRodrigo Reis
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHenning Jacobs
 
Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueHenning Jacobs
 
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...Agile Testing Alliance
 
Minikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMinikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMartin Schmidt
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackQAware GmbH
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17Mario-Leander Reimer
 
Consul administration at scale
Consul administration at scaleConsul administration at scale
Consul administration at scalePierre Souchay
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechMichael Dürgner
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPSACA IT-Solutions
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITStijn Wijndaele
 
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Threestackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick ThreeNETWAYS
 

Ähnlich wie Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW (20)

Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
 
Automatic Ingress in Kubernetes
Automatic Ingress in KubernetesAutomatic Ingress in Kubernetes
Automatic Ingress in Kubernetes
 
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
 
Kubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachKubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" Approach
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
 
Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native Prague
 
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
 
Minikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMinikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setup
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stack
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
 
Consul administration at scale
Consul administration at scaleConsul administration at scale
Consul administration at scale
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando Tech
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPS
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
 
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Threestackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
 

Mehr von Henning Jacobs

Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Henning Jacobs
 
Kubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaKubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaHenning Jacobs
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Kubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformKubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformHenning Jacobs
 
Plan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthPlan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthHenning Jacobs
 
Docker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroDocker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroHenning Jacobs
 
STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015Henning Jacobs
 
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Henning Jacobs
 
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...Henning Jacobs
 

Mehr von Henning Jacobs (10)

Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019
 
Kubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaKubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe Barcelona
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Kubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformKubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion Platform
 
Plan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthPlan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuth
 
Docker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroDocker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando Intro
 
STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015
 
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
 
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
 

Kürzlich hochgeladen

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Kürzlich hochgeladen (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

  • 2. 2 ZALANDO IN NUMBERS > 4.5billion EUR 2017 > 200 million visits per month > 14,000 employees in Europe > 70% of visits via mobile devices > 22 million active customers > 250,000 product choices ~ 2,000 brands 15 countries
  • 3. 3 OUR FOOTPRINT AROUND EUROPE as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  • 4. 4 OUR FOOTPRINT AROUND EUROPE TECH as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  • 8. 8 INCIDENT #1: IAM RETURNING 404
  • 10. 10 LIFE OF A REQUEST (INGRESS) DNS my-app.example.org ALB aws-1234-lb.eu-central-1.elb.amazonaws.com SERVICE 10.3.0.216 DEPLOYMENT POD 10.2.0.1 POD 10.2.1.1 POD 10.2.2.1 POD 10.2.3.1 SKIPPER 172.31.1.1:9999 SKIPPER 172.31.2.1:9999 SKIPPER 172.31.3.1:9999 SKIPPER 172.31.4.1:9999 ALIAS Record
  • 11. 11 INCIDENT #1: UNASSUMING MANIFEST apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "*/15 9-19 * * Mon-Fri" jobTemplate: spec: template: metadata: labels: application: "foobar" spec: restartPolicy: Never concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 containers: ...
  • 12. 12 INCIDENT #1: FIXED CRON JOB apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "7 8-18 * * Mon-Fri" concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 jobTemplate: spec: activeDeadlineSeconds: 120 template: metadata: labels: application: "foobar" spec: restartPolicy: Never containers: ...
  • 13. 13 INCIDENT #1: LESSONS LEARNED • ALB routes traffic to ALL hosts if all hosts report “unhealthy” • Fix Skipper Ingress to stay “healthy” during API server problems • Fix Skipper Ingress to retain last known set of routes • Use quota for number of pods apiVersion: v1 kind: ResourceQuota metadata: name: compute-resources spec: hard: pods: "1500"
  • 14. 14 STABILITY: LIMIT RANGE $ kubectl describe limitrange Name: limits Namespace: default Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio ---- -------- --- --- ----------- ------------- ----------------------- Container memory - 64Gi 100Mi 1Gi - Container cpu - 16 100m 3 - http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources ⇒ Mitigate errors on OSI layer 8 ;-)
  • 16. 16 INCIDENT #2: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  • 17. 17 INCIDENT #2: RTFM % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end]
  • 19. 19 INCIDENT #2: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
  • 21. 21 INCIDENT #3: STOP THE BLEEDING #!/bin/bash SLEEPTIME=60 while true; do echo "sleep for $SLEEPTIME seconds" sleep $SLEEPTIME timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null if [ $? -eq 0 ]; then echo "all fine, no need to restart etcd member" continue else echo "restarting etcd-member" systemctl restart etcd-member fi done
  • 22. 22 INCIDENT #3: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  • 23. 23 INCIDENT #3: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  • 30. 30 INCIDENT #4: LESSONS LEARNED • Automated end-to-end tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with the previous configuration • Apply new configuration • Run end-to-end & conformance tests
  • 33. 33 MANUAL TRAFFIC SWITCHING $ zkubectl traffic <ingress-name> SERVICE WEIGHT <service-backend-1> 30% <service-backend-2> 70% $ zkubectl traffic <ingress-name> <service-backend-2> 100 SERVICE WEIGHT <service-backend-1> 0% <service-backend-2> 100%
  • 34. 34 DOCUMENTATION "Documentation is hard to find" "Documentation is not comprehensive enough" "Remove unnecessary complexity and obstacles." "Get the documentation up to date and prepare use cases" "More and more clear documentation" "More detailed docs, example repos with more complicated deployments."
  • 35. 35 DOCUMENTATION • Restructure following https://www.divio.com/en/blog/documentation/ • Concepts • How Tos • Tutorials • Reference • Global Search • Weekly Health Check: Support → Documentation
  • 36. 36 ONBOARDING • Many new concepts to grasp vs. 200+ teams • Kubernetes Training (2h) • Documentation • Recorded Friday Demos • Support Channels (chat, mail)
  • 38. 38 DOES ANYTHING EVEN WORK? • Kubernetes API as the primary interface • Ingress Controller + External DNS • kube2iam • Cluster lifecycle management • Zalando IAM/OAuth integration via CRD • PostgreSQL Operator
  • 40. 40 DEPLOYMENT CONFIGURATION . ├── deploy/apply │ ├── deployment.yaml # K8s Deployment │ ├── credentials.yaml # K8s TPR │ ├── ingress.yaml # K8s Ingress │ └── service.yaml # K8s Service └── delivery.yaml # pipeline config
  • 41. 41 INGRESS.YAML apiVersion: extensions/v1beta1 kind: Ingress metadata: name: "..." spec: rules: # DNS name your application should be exposed on - host: "myapp.foo.example.org" http: paths: - backend: serviceName: "myapp" servicePort: 80
  • 45. 45 POSTGRES OPERATOR Application to manage PostgreSQL clusters Observes “postgres” manifests (CRDs) Spawns and modifies new clusters Syncs and provisions roles Handles volume resize, incl. Resize2fs Also responsible for updating Docker images
  • 47. 47 LINKS Running Kubernetes in Production on AWS http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html Kube AWS Ingress Controller https://github.com/zalando-incubator/kube-ingress-aws-controller External DNS https://github.com/kubernetes-incubator/external-dns PostgreSQL Operator https://github.com/zalando-incubator/postgres-operator Zalando Cluster Configuration https://github.com/zalando-incubator/kubernetes-on-aws List of Organizations using Kubernetes on AWS https://github.com/hjacobs/kubernetes-on-aws-users
  • 49. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k