Video URL: https://youtu.be/GjI6MUz7AyE
This is the slide deck for my Percona Live Online 2020 talk, given in May 2020: https://www.percona.com/resources/videos/orchestrating-cassandra-kubernetes-operator-and-yelp-paasta-percona-live-online
The talk delves into the architecture of our Cassandra Kubernetes Operator, the multi-region, multi-AZ clusters it manages, and the strategies we use for safe rollouts and zero-downtime migrations.
6. “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
- Leslie Lamport
7. Desired Traits of Distributed Systems
Reliability
Scalability
Maintainability
8. Distributed Systems Fallacies
● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn't change
● There is one administrator
● Transport cost is zero
● The network is homogeneous
12. C* at Yelp
● Both primary and derived data
● Use cases
● Deployed on Amazon Web Services (AWS)
○ EBS for Storage
● Smartstack for service discovery
● Automated schema management
● ZooKeeper-based cluster coordination
20. Kubernetes / k8s
● Popular open-source container orchestration platform
● Actively developed
● Stateful and stateless applications
● Well-defined building blocks for distributed systems
● Integrates into our PaaS
○ k8s: generic but extensible
● Organizes containers into pods
21. PaaSTA: Kubernetes at Yelp
● Yelp PaaSTA: Stateless and Stateful Microservices on Kubernetes
● A few thousand microservices deployed and growing
● Hundreds of deployments every day
● Handles compute, storage and network abstractions
● Why PaaSTA
○ Uniform interface - deployment, restarts, rollbacks ...
○ Clusterman
○ Spot and statically-reserved fleet
25. C* Operator: Intro
● Developed by the DRE team
● Controller Loop for Reconciliation
● Defines a custom resource for k8s
○ Statefulset, Container spec, Storage, Secrets and more
● “Big Red Button”
○ Stop for human takeover
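The controller loop and the “Big Red Button” can be illustrated with a minimal sketch. This is not the operator's actual code: the `ClusterSpec` and `ClusterStatus` types, the `paused` flag, the action names, and the choice to change one replica per iteration are all assumptions made for illustration. The idea it shows is that each pass compares desired state (from the custom resource) with observed state and takes at most one step toward convergence, unless a human has asked the operator to stop.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Desired state, as it might be declared in a custom resource (hypothetical fields)."""
    replicas: int
    paused: bool = False  # stand-in for the "Big Red Button": operator takes no action


@dataclass
class ClusterStatus:
    """Observed state of the running cluster (hypothetical fields)."""
    ready_replicas: int


def reconcile(spec: ClusterSpec, status: ClusterStatus) -> str:
    """Return the single next action that moves observed state toward desired state."""
    if spec.paused:
        return "noop: paused for human takeover"
    if status.ready_replicas < spec.replicas:
        return "scale_up"
    if status.ready_replicas > spec.replicas:
        return "scale_down"
    return "noop: in sync"


def run_until_converged(spec: ClusterSpec, status: ClusterStatus,
                        max_iterations: int = 100) -> list:
    """Drive the loop against an in-memory status; a real operator would call the k8s API."""
    actions = []
    for _ in range(max_iterations):
        action = reconcile(spec, status)
        actions.append(action)
        if action == "scale_up":
            status.ready_replicas += 1  # assumption: one node at a time
        elif action == "scale_down":
            status.ready_replicas -= 1
        else:
            break  # paused or already in sync
    return actions
```

For example, `run_until_converged(ClusterSpec(replicas=3), ClusterStatus(ready_replicas=1))` yields two `scale_up` steps followed by a final in-sync no-op, while a spec with `paused=True` yields only the pause no-op and leaves the cluster untouched.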
26. C* Operator: Responsibilities
● Creating cluster from specs
● Scaling the cluster up and down
● Safe and Reliable Change Deployments
● Lifecycle Management
● Multi-region coordination
● Credential management
● Balance resource utilization
29. Cassandra Pod
● What is a pod
● Cassandra container + Sidecars
● Sidecar containers
○ HAcheck for Smartstack
○ Cron
○ Sensu
○ Change Data Capture (CDC) publisher
○ Metrics exporters
30. Storage aka State
● EBS for Cassandra
○ Clear separation of stateful and stateless
○ Quick healing upon underlying node failure
○ Dynamic Provisioning
○ “Compute follows Data”
○ Stripe cluster across AZs
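Striping a cluster across AZs can be sketched as a simple round-robin assignment. This is an illustrative assumption about how striping might be computed, not the operator's actual placement logic; the pod names (`cassandra-0`, ...) and zone names are hypothetical. Because an EBS volume lives in a single AZ, a pod that is rescheduled has to land in the same AZ as its volume, which is one way to read “Compute follows Data”.

```python
from collections import Counter


def stripe_across_azs(replicas: int, zones: list) -> dict:
    """Assign each replica to a zone round-robin so zone sizes differ by at most one."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    # Hypothetical pod naming; index i goes to zone i mod len(zones).
    return {f"cassandra-{i}": zones[i % len(zones)] for i in range(replicas)}


def zone_counts(placement: dict) -> Counter:
    """Count how many replicas ended up in each zone."""
    return Counter(placement.values())
```

With five replicas and three zones, `zone_counts(stripe_across_azs(5, ["us-west-2a", "us-west-2b", "us-west-2c"]))` gives two replicas in each of the first two zones and one in the third, so losing any single zone removes at most two replicas.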
53. Distributed Systems Fallacies
● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn't change
● There is one administrator
● Transport cost is zero
● The network is homogeneous