2. %> whoami
@stephanlindauer
▪ Platform Engineer @ reBuy
▪ Formerly a full-stack and game developer
▪ Got interested in container technologies in
2014 and jumped on K8s in 2015
▪ Finished my master's thesis with a case study
on migrating infrastructure to K8s on AWS in
2017
3. %> whoarewe
rebuy.de
▪ reCommerce market leader in Europe
▪ Germany, Austria, France, UK and the
Netherlands
▪ ~500 employees
▪ Growing and expanding
▪ We’re hiring! (tech.rebuy.com)
4. agenda
● Our two year journey
● Stack
● Architecture
● Lifecycle
● Struggles
● What we learned
6. Why migrate?
▪ Fixed number of machines at a
server-housing company
▪ TV spots/ads meant downtime, or
permanently adding new
machines
▪ "Manual" provisioning
● Scalability
● Reliability
● Automation
● Infrastructure as Code
● Observability
● Cattle, not pets
● Immutability
● ...
7. Timeline
May 2016
● Researching
● Fiddling around
● Prototyping
● Decided on a stack
November 2016
● First production workloads
on Kubernetes
● Ongoing migration efforts
May 2017
● Got off the legacy infrastructure
● Final migration step took one
night
8. Timeline
Today
● ~600 containers running in production
● 27 m5.2xlarge worker instances
● 60-100 deployments per working day
● Ongoing projects:
○ Open Tracing
○ Cluster Auto Scaling
○ Service Mesh
○ Staging Platform for Application Development
23. SPIN UP
➔ Scripts to provision AWS and
Kubernetes
➔ Change configurations for
non-production clusters through
◆ Terraform .tfvars files
◆ github.com/rebuy-de/kubernetes-deployment
to apply Go templating to .yaml files
TOOLING
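As a sketch, a per-cluster .tfvars override for a non-production cluster might look like this (the variable names are hypothetical, not reBuy's actual ones):

```hcl
# test.tfvars -- hypothetical overrides for a non-production cluster
cluster_name         = "k8s-test"
environment          = "test"
worker_instance_type = "m5.large"   # smaller than production's m5.2xlarge
worker_count         = 3
```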
27. TOIL ALARM!
● Manually cordon and drain old nodes
● Remove that exact node from the ASG
➔ Human error
➔ Boring
➔ Takes time
➔ … all of it × nodeCount
28. THE SOLUTION!
● In-cluster operator
● Talks to AWS
● Talks to Kubernetes
● Less manual labour
● Can work together with Cluster-Autoscaler
● …
● Profit! github.com/rebuy-de/node-drainer
29. HOW DOES IT WORK?
Participants over time (t): ASG → Amazon Simple Queue Service (SQS) → node-drainer Pod → Kubernetes API
1. A scale-in event triggers the ASG's "Scale-in" lifecycle hook, which puts a scale-in message on SQS.
2. The node-drainer Pod picks up the message and replies to the ASG: "not so fast, mate!" (it keeps the lifecycle hook pending).
3. It starts draining the affected node via the Kubernetes API.
4. Once draining is done, it tells the ASG to "go ahead and terminate" and removes the message from the queue.
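The flow above can be sketched as a small control loop. This is not the actual rebuy-de/node-drainer implementation; the client objects and method names are stand-ins for the AWS and Kubernetes API calls involved:

```python
# Sketch of the node-drainer control loop (hypothetical client interfaces,
# not the real rebuy-de/node-drainer code).

def handle_scale_in(msg, asg, k8s, queue):
    """Process one SQS scale-in lifecycle message."""
    instance_id = msg["EC2InstanceId"]
    node = k8s.node_for_instance(instance_id)

    # "not so fast, mate!" -- keep the lifecycle hook pending while we drain
    asg.record_lifecycle_action_heartbeat(instance_id)

    k8s.cordon(node)   # stop new Pods from being scheduled on the node
    k8s.drain(node)    # evict running Pods

    # "go ahead and terminate" -- let the ASG finish the scale-in
    asg.complete_lifecycle_action(instance_id, result="CONTINUE")
    queue.delete_message(msg)
```

The key design point is the lifecycle hook: the ASG will not terminate the instance until the operator confirms, so Pods are always evicted before the node disappears.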
32. ● Every once in a while, about ⅓ of our workers in different AZs went
"NotReady" because their heartbeats didn't reach the masters
● Things exploded
● It happened again a few weeks later, out of nowhere
● and again…
● and again…
ELB + Masters
41. ● Some Pods reported not being able to resolve hostnames (via kube-dns)
● After some time we noticed missing routes and flannel complaining
● Tweaking sysctl didn’t help
Missing routes in flannel
● Initial taint on new nodes
● A DaemonSet (instance-health-checker) checks whether routes are set up properly
● Removes the taint if so
● Otherwise triggers an alert via a Prometheus endpoint
● Run flannel as a DaemonSet instead of a systemd unit
Missing routes in flannel
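The startup-taint approach above might look roughly like this in YAML. The taint key and image name are hypothetical; the point is that the health checker tolerates the taint so it can run on still-tainted nodes:

```yaml
# Kubelet registers new nodes with a NoSchedule taint until verified, e.g.:
#   --register-with-taints=node.example.com/unverified=true:NoSchedule
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: instance-health-checker
spec:
  selector:
    matchLabels: {app: instance-health-checker}
  template:
    metadata:
      labels: {app: instance-health-checker}
    spec:
      tolerations:
        # must run on still-tainted nodes to verify their routes
        - key: node.example.com/unverified
          operator: Exists
          effect: NoSchedule
      containers:
        - name: checker
          image: instance-health-checker:latest  # hypothetical image
```

Regular workloads carry no such toleration, so nothing else lands on a node until the checker removes the taint.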
44. Costs
● Track your costs: github.com/rebuy-de/cost-exporter
● Use Reservations and Spot Instances where you can
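One way to put part of the worker fleet on Spot capacity, as a hedged Terraform sketch (all values and variable names are placeholders):

```hcl
# Hypothetical sketch: a Spot-backed worker launch configuration
resource "aws_launch_configuration" "spot_workers" {
  name_prefix   = "k8s-spot-worker-"
  image_id      = var.worker_ami   # assumed variable holding the worker AMI
  instance_type = "m5.2xlarge"
  spot_price    = "0.20"           # maximum hourly bid in USD
}
```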
45. Terraform + K8s
● Turn on versioning of S3 Terraform .tfstate files and use DynamoDB for locking
● Apply Terraform changes in small batches
● Use the same CI/CD strategies on your infrastructure code as on all other code
● Use testing clusters to experiment with risky changes
● Put pets and cattle into separate ASGs
● Running self managed Kubernetes is possible and gives you
○ More insight into how things work
○ More flexibility
○ Surprises :)
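The state-versioning and locking advice above can be sketched as a Terraform backend block (bucket, key, and table names are placeholders):

```hcl
# S3 backend with versioned state and DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # enable S3 versioning on this bucket
    key            = "k8s/production.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"          # table with a "LockID" string hash key
    encrypt        = true
  }
}
```

Versioned state lets you roll back a corrupted .tfstate; the DynamoDB lock stops two concurrent applies from racing each other.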
46. Workflow
● Don’t be afraid to break stuff (in testing clusters)
● Know what context you are working in
● Automate everything
● Hang your dashboards up on a wall