2. %> whoami
@stephanlindauer
▪ Platform Engineer @ reBuy
▪ Formerly a full-stack and game developer
▪ Got interested in container technologies in
2014 and jumped on K8s in 2015
▪ Finished my master's thesis with a case study
on migrating infrastructure to K8s on AWS in
2017
3. %> whoarewe
rebuy.de
▪ reCommerce market leader in Europe
▪ Germany, Austria, France, UK and the
Netherlands
▪ ~500 employees
▪ Growing and expanding
▪ We’re hiring! (tech.rebuy.com)
4. agenda
● Our two year journey
● Stack
● Architecture
● Lifecycle
● Struggles
● What we learned
6. Why migrate?
▪ Fixed number of machines at a
server-housing company
▪ TV spots/ads meant downtime, or
permanently adding new
machines
▪ "Manual" provisioning
● Scalability
● Reliability
● Automation
● Infrastructure as Code
● Observability
● Cattle, not pets
● Immutability
● ...
7. Timeline
May 2016
● Researching
● Fiddling around
● Prototyping
● Decided on a stack
November 2016
● First production workloads
on Kubernetes
● Ongoing migration efforts
May 2017
● Got off the legacy infrastructure
● Final migration step took one
night
8. Timeline
Today
● ~600 containers running in production
● 27 m5.2xlarge worker instances
● 60-100 deployments per working day
● Ongoing projects:
○ Open Tracing
○ Cluster Auto Scaling
○ Service Mesh
○ Staging Platform for Application Development
23. SPIN UP
➔ Scripts to provision AWS and
Kubernetes
➔ Change configurations for
non-production clusters through
◆ Terraform .tfvars files
◆ github.com/rebuy-de/kubernetes-deployment
to apply Go templating to .yaml files
TOOLING
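As a sketch, a per-cluster .tfvars override for a non-production cluster might look like this (the variable names are hypothetical, not reBuy's actual ones):

```hcl
# test.tfvars -- hypothetical overrides for a non-production cluster
cluster_name         = "k8s-test"
environment          = "test"
worker_instance_type = "m5.large"   # smaller than production's m5.2xlarge
worker_count         = 3
```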
27. TOIL ALARM!
● Manually cordon and drain old nodes
● Remove that exact node from the ASG
➔ Human error
➔ Boring
➔ Takes time
➔ … all of it × nodeCount
28. THE SOLUTION!
● In-cluster operator
● Talks to AWS
● Talks to Kubernetes
● Less manual labour
● Can work together with Cluster-Autoscaler
● …
● Profit! github.com/rebuy-de/node-drainer
29. HOW DOES IT WORK?
Participants over time (t): ASG → Amazon Simple Queue Service (SQS) → node-drainer Pod → Kubernetes API
1. A scale-in event triggers the ASG's "Scale-in" lifecycle hook, which puts a scale-in message on SQS.
2. The node-drainer Pod picks up the message and replies to the ASG: "not so fast, mate!" (it keeps the lifecycle hook pending).
3. It starts draining the affected node via the Kubernetes API.
4. Once draining is done, it tells the ASG to "go ahead and terminate" and removes the message from the queue.
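The flow above can be sketched as a small control loop. This is not the actual rebuy-de/node-drainer implementation; the client objects and method names are stand-ins for the AWS and Kubernetes API calls involved:

```python
# Sketch of the node-drainer control loop (hypothetical client interfaces,
# not the real rebuy-de/node-drainer code).

def handle_scale_in(msg, asg, k8s, queue):
    """Process one SQS scale-in lifecycle message."""
    instance_id = msg["EC2InstanceId"]
    node = k8s.node_for_instance(instance_id)

    # "not so fast, mate!" -- keep the lifecycle hook pending while we drain
    asg.record_lifecycle_action_heartbeat(instance_id)

    k8s.cordon(node)   # stop new Pods from being scheduled on the node
    k8s.drain(node)    # evict running Pods

    # "go ahead and terminate" -- let the ASG finish the scale-in
    asg.complete_lifecycle_action(instance_id, result="CONTINUE")
    queue.delete_message(msg)
```

The key design point is the lifecycle hook: the ASG will not terminate the instance until the operator confirms, so Pods are always evicted before the node disappears.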
32. ● Every once in a while, about ⅓ of our workers in different AZs went
"NotReady" because their heartbeats didn't reach the masters
● Things exploded
● It happened again a few weeks later, out of nowhere
● and again…
● and again…
ELB + Masters
41. ● Some Pods reported not being able to resolve hostnames (via kube-dns)
● After some time we noticed missing routes and flannel complaining
● Tweaking sysctl didn’t help
Missing routes in flannel
● Initial taint on new nodes
● A DaemonSet (instance-health-checker) checks whether routes are set up properly
● Removes the taint if so
● Otherwise triggers an alert via a Prometheus endpoint
● Run flannel as a DaemonSet instead of a systemd unit
Missing routes in flannel
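The startup-taint approach above might look roughly like this in YAML. The taint key and image name are hypothetical; the point is that the health checker tolerates the taint so it can run on still-tainted nodes:

```yaml
# Kubelet registers new nodes with a NoSchedule taint until verified, e.g.:
#   --register-with-taints=node.example.com/unverified=true:NoSchedule
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: instance-health-checker
spec:
  selector:
    matchLabels: {app: instance-health-checker}
  template:
    metadata:
      labels: {app: instance-health-checker}
    spec:
      tolerations:
        # must run on still-tainted nodes to verify their routes
        - key: node.example.com/unverified
          operator: Exists
          effect: NoSchedule
      containers:
        - name: checker
          image: instance-health-checker:latest  # hypothetical image
```

Regular workloads carry no such toleration, so nothing else lands on a node until the checker removes the taint.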
44. Costs
● Track your costs: github.com/rebuy-de/cost-exporter
● Use Reservations and Spot Instances where you can
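One way to put part of the worker fleet on Spot capacity, as a hedged Terraform sketch (all values and variable names are placeholders):

```hcl
# Hypothetical sketch: a Spot-backed worker launch configuration
resource "aws_launch_configuration" "spot_workers" {
  name_prefix   = "k8s-spot-worker-"
  image_id      = var.worker_ami   # assumed variable holding the worker AMI
  instance_type = "m5.2xlarge"
  spot_price    = "0.20"           # maximum hourly bid in USD
}
```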
45. Terraform + K8s
● Turn on versioning of S3 Terraform .tfstate files and use DynamoDB for locking
● Apply Terraform changes in small batches
● Use the same CI/CD strategies on your infrastructure code as on all other code
● Use testing clusters to experiment with risky changes
● Put pets and cattle into separate ASGs
● Running self managed Kubernetes is possible and gives you
○ More insight into how things work
○ More flexibility
○ Surprises :)
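The state-versioning and locking advice above can be sketched as a Terraform backend block (bucket, key, and table names are placeholders):

```hcl
# S3 backend with versioned state and DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # enable S3 versioning on this bucket
    key            = "k8s/production.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"          # table with a "LockID" string hash key
    encrypt        = true
  }
}
```

Versioned state lets you roll back a corrupted .tfstate; the DynamoDB lock stops two concurrent applies from racing each other.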
46. Workflow
● Don’t be afraid to break stuff (in testing clusters)
● Know what context you are working in
● Automate everything
● Hang your dashboards up on a wall