Link: https://youtu.be/3aNEhHLHZok
https://go.dok.community/slack
https://dok.community/
From the DoK Day EU 2022 (https://youtu.be/Xi-h4XNd5tE)
Running a database on Kubernetes with persistent storage is relatively easy, but its performance won't match local NVMes. This talk shows how to set up local NVMes for Kubernetes, how to handle the application and cluster lifecycle safely, and shares our experience running ScyllaDB with local NVMes on different Kubernetes cloud providers.
-----
Tomas leads the development of Scylla Operator (https://github.com/scylladb/scylla-operator), a Kubernetes operator to manage ScyllaDB. Previously, he worked on a self-hosted, auto-upgrading Kubernetes control plane for Red Hat OpenShift. Tomas is an Emeritus Kubernetes SIG-Apps approver.
-----
Maciej is a Go and C++ enthusiast. He is a software engineer working on ScyllaDB management tools. Previously, he worked at networking companies, where he delivered multiple features for SDN solutions and LTE networks.
1. Running a database on local NVMes
Tomáš Nožička
Principal Software Engineer, ScyllaDB
Maciej Zimnoch
Senior Software Engineer, ScyllaDB
https://github.com/scylladb/scylla-operator
3. Local vs. Network Attached
Benefits
+ Higher throughput
+ Lower latency
+ Higher IOPS
+ No double replication of data (NAS + DB)
+ Lower cost
Downsides
- Harder to manage
4. Performance on Kubernetes
+ Using local storage allowed us to get Scylla on Kubernetes very close to VM performance
+ Also uses other techniques like CPU pinning, the operator tuning devices, sysctls, …
+ Still some way to go (new dev versions are even closer)

Scylla 4.4.5 - 60k IOPS (3x i3.4xlarge)

       Read latency (p99)   Write latency (p99)   Mixed latency (p99)
AWS    2.2                  2.7                   3.7
EKS    2.9                  4.3                   4.5
5. Local Persistent Volumes
+ GA in Kubernetes 1.14
+ “Better HostPath”
+ Scheduler is aware (Pods always land on the same node)
+ Secure (PVC + PV)
+ Needs special handling when the node dies
+ StorageClass has to use the WaitForFirstConsumer volume binding mode
+ No dynamic provisioning
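A StorageClass and statically created PV of the kind described above look roughly like this; all names, sizes, and paths are illustrative assumptions:

```yaml
# StorageClass for local PVs: there is no dynamic provisioner, and volume
# binding is delayed until a Pod is scheduled, so the scheduler can pick
# the node that actually holds the volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme          # illustrative name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# A matching statically created PV pointing at a local NVMe mount,
# pinned to a single node via node affinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node-a   # illustrative name
spec:
  capacity:
    storage: 1700Gi         # illustrative size
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0  # illustrative path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-a"]
```

Because of WaitForFirstConsumer, a PVC against this class stays Pending until a Pod references it, at which point it binds to a PV on the node the Pod was scheduled to.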
6. Local Volume Static Provisioner
+ https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner
+ Static provisioning only
+ Out of tree
+ Needs manual deployment and configuration
+ Watches a discovery directory and creates a PersistentVolume for each entry
+ Handles the PV lifecycle (e.g. shredding)
+ Needs an extra DaemonSet to pre-configure the node’s devices into the discovery directory
+ Mount static disks/partitions
+ Bind-mount pre-created directories (no quota)
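The discovery-directory mapping from the bullets above is configured through a ConfigMap; a sketch with assumed names and paths (see the provisioner repository for the authoritative format):

```yaml
# Each entry under storageClassMap maps a StorageClass to a discovery
# directory. The provisioner's DaemonSet watches hostDir on every node
# and creates one PersistentVolume per mount point it finds there.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config   # illustrative name
  namespace: kube-system
data:
  storageClassMap: |
    local-nvme:                    # illustrative StorageClass name
      hostDir: /mnt/disks          # where disks are mounted on the host
      mountDir: /mnt/disks         # same path inside the provisioner Pod
      volumeMode: Filesystem
      fsType: xfs                  # assumed FS type
```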
8. GKE
+ Upgrades replace VMs, NVMes get thrown away
+ You can prevent downtime to your application with PodDisruptionBudgets
+ GKE auto-upgrades decide to break PDBs after 1 hour :(
+ Streaming takes time
+ Requires manual GKE upgrades to prevent quorum loss and data loss
+ Your DB needs to survive node loss anyway
+ Upgrade frequency >> expected node-loss frequency
+ GKE provisions the disks with a filesystem, which may collide with the FS type required by your database
+ Needs some hacks, as GKE tries to format them back to ext4 on every boot
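A PodDisruptionBudget of the kind mentioned above might look like this; the name and labels are assumptions, not the Scylla Operator's actual manifests:

```yaml
# With maxUnavailable: 1, voluntary evictions (e.g. node drains during
# upgrades) can take down at most one matching Pod at a time, which
# lets a replicated database keep quorum during a rolling node upgrade.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: scylla-pdb                 # illustrative name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: scylla   # assumed Pod label
```

Note the GKE caveat above: auto-upgrades only honor the PDB for a limited time before draining the node anyway, which is why manual upgrades are recommended there.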
9. EKS
+ Upgrades replace VMs, NVMes get thrown away
+ Similar eviction-timeout issues as GKE
+ Upgrades need to be handled manually
11. OpenShift / OKD
+ In-place upgrades!
+ Keeps the underlying VM and the NVMes
+ rpm-ostree based (immutable FS, atomic)
+ No need to stream the data to a new node
+ Supports multiple clouds as well as bare metal + SaaS
+ Same Kubernetes distribution everywhere
+ You own the master nodes
+ The platform itself is managed by operators
+ https://github.com/openshift/okd
13. Available provisioners
Local storage provisioners
+ kubernetes-sigs/sig-storage-local-static-provisioner
  + Static provisioning
+ kubernetes-csi/csi-driver-host-path
  + Dynamic provisioning
  + Modern CSI approach
  + Not feature complete
  + Only single node
  + Not production ready
14. Container Storage Interface (CSI)
+ Exposes block and file storage systems
+ Pluggable storage systems
+ Container Orchestration agnostic
Building blocks from Kubernetes:
+ external-provisioner
+ node-driver-registrar
+ external-resizer
+ livenessprobe
+ more…
15. Network attached drives
[Diagram: a leader-elected group of controller plugins manages the volume lifetime through the external storage API, while a node plugin on each node (A, B, C) mounts the volume for the Pod]
16. Local drives - distributed provisioning
[Diagram: every node (A, B, C) runs its own controller plugin alongside the node plugin; each controller manages only that node's local drives, and the node plugin mounts them for the Pod]
17. Storage capacity
[Diagram: a Pod with a 50 GB PVC can only be scheduled onto nodes with enough free capacity (e.g. 80 GB or 500 GB, but not 10 GB)]
+ Different nodes have different storage capacity
+ Scheduler needs to validate whether a given Pod will fit onto a node
+ CSIStorageCapacity objects
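A CSIStorageCapacity object, as mentioned above, roughly takes this shape: one object per storage class and topology segment, published by the CSI driver so the scheduler can rule out nodes that cannot fit the PVC. Names and sizes here are illustrative (older clusters used the storage.k8s.io/v1beta1 API):

```yaml
# Reports how much capacity the driver has available for the given
# StorageClass within the topology segment (here: a single node).
apiVersion: storage.k8s.io/v1
kind: CSIStorageCapacity
metadata:
  name: node-a-local-nvme        # illustrative name
  namespace: kube-system
storageClassName: local-nvme     # illustrative StorageClass
capacity: 80Gi                   # free capacity on this node
nodeTopology:
  matchLabels:
    kubernetes.io/hostname: node-a
```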
18. Node ranking
[Diagram: a scheduler extender assigns weights to nodes (e.g. 11 vs. 17) and the Pod lands on the highest-ranked node]
+ Influence scheduler decisions based on an internal weight function
+ Multiple coefficients:
  + Capacity
  + Provisioned IOPS
  + Colocated read/write workloads
  + …
19. Dynamic Provisioning Status
+ No out-of-the-box solution in Kubernetes
+ Building blocks are available:
  + CSI drivers (GA in Kubernetes 1.13)
  + Storage topology (GA in Kubernetes 1.17)
  + Distributed provisioning (external-provisioner 2.1.0)
  + Storage capacity tracking (beta in Kubernetes 1.21)
+ Users need to deploy their own drivers
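Opting a driver into the storage capacity tracking mentioned above is done on its CSIDriver object; a sketch with an assumed driver name:

```yaml
# With storageCapacity: true, the scheduler consults CSIStorageCapacity
# objects published for this driver before binding WaitForFirstConsumer
# volumes. The driver name is an illustrative assumption.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: local.csi.example.com
spec:
  storageCapacity: true
  volumeLifecycleModes: ["Persistent"]
```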
20. WE ARE HIRING!
Thank You!
scylladb-users.slack.com
#scylla-operator
Tomáš Nožička
tomas.nozicka (at) scylladb.com
Maciej Zimnoch
maciej (at) scylladb.com