Join this workshop and accelerate your journey to production-ready Kubernetes by learning practical techniques for reliably operating your software lifecycle using the GitOps pattern. The Weaveworks team will be running a full-day workshop, sharing their expertise as users of and contributors to Kubernetes and Prometheus, as well as practitioners of GitOps (operations by pull request).
Using a combination of instructor-led demonstrations and hands-on exercises, the workshop will enable attendees to go into detail on the following topics:
• Developing and operating your Kubernetes microservices at scale
• DevOps best practices and the movement towards a “GitOps” approach
• Building with Kubernetes in production: caring for your apps, implementing CI/CD best practices, and utilizing the right metrics, monitoring tools, and automated alerts
• Operating Kubernetes in production: Upgrading and managing Kubernetes, managing incident response, and adhering to security best practices for Kubernetes
3. Hi
We work for Weaveworks as customer success engineers.
You can find Weaveworks at https://www.weave.works or @weaveworks.
The team at Weaveworks is behind the GitOps model.
You can find us online at @fractallambda and @c_r_w.
4. About Weaveworks
● Building cloud-native OSS since 2014 (Weave Net, Moby, Kubernetes, Prometheus)
● Founding member of the CNCF
● Alexis Richardson (Weaveworks CEO) is chair of the CNCF Technical Oversight Committee
● Weave Cloud has run on Kubernetes since 2015
10. Agenda
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
11. Some assumptions
➔ You can use the command line.
➔ You can use Git.
➔ You know what Kubernetes Pods, Deployments, and Services are.
➔ You have a modern web browser.
12. Kubernetes need-to-know
Containers – Run Docker images: an immutable copy of your application code and all its dependencies, in an isolated environment.
Pods – A set of containers, co-scheduled on one machine. Ephemeral. Has a unique IP. Has labels.
Deployment – Ensures a certain number of replicas of a pod are running across the cluster.
Service – Gets a virtual IP, mapped to endpoints via labels. Named in DNS.
Namespace – Resource names are scoped to a Namespace. Policy boundary.
15. Log in to your cluster – Weave Cloud & C9
1. Go to tinyurl.com/kubecon18-cluster
2. Add your name and email
3. Check your email for links to your environment and your password
(This may take a little while. Be patient while Craig invites you)
21. GitOps hands-on 1/10: Kick the tires on your cluster 💻
1. Start with a simple command:
➤ kubectl version
2. Look at what’s running on the cluster with
Weave Cloud Explore
29. Liveness and Readiness probes
What: Endpoints for Kubernetes to monitor your application lifecycle
Why: Allows Kubernetes to restart a pod, or stop routing traffic to it
Options: –
● A liveness failure tells Kubernetes to restart the pod
● A readiness failure is transient and tells Kubernetes to route traffic elsewhere
● Readiness failures are useful during startup and for load management
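For reference, probes are declared per container in the pod spec. A minimal sketch (the /healthz and /readyz paths and port 9898 are illustrative, not from the workshop materials):

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # restart the container if this stops answering
    port: 9898
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz         # stop routing traffic to the pod while this fails
    port: 9898
  periodSeconds: 5
```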
30. Metric instrumentation
What: Code and libraries used in your code to expose metrics
Why: Allows measuring the operation of the application, and enables many more advanced use cases
Options: Prometheus, New Relic, Datadog, many others
● Basic metrics are not optional
● Prometheus is a fantastic fit for Kubernetes in most cases
32. Playbooks / Runbooks
What: Rich guides for your engineers on how to operate the system and fault-find when things go wrong
Why: Nobody is at their sharpest at 03:00 AM; knowledge deteriorates over time
Options: Confluence, Markdown files, Weave Cloud Notebooks
● Absolutely vital knowledge repository
● Avoids the bus factor
● First port of call for operational issues
● Significantly speeds up new engineer induction
● Requires continuous work to maintain
33. Limits and requests
What: Explicit resource allocation for pods
Why: Allows Kubernetes to make good scheduling decisions
Options: –
● Requests are used when scheduling
● Limits prevent workloads from causing cascading failures
● Limits are a valuable safety net
● Also available at the namespace level (see ResourceQuotas)
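As a sketch, requests and limits are set per container; the values below are purely illustrative:

```yaml
resources:
  requests:            # used by the scheduler to place the pod
    cpu: 100m
    memory: 128Mi
  limits:              # throttled (CPU) or OOM-killed (memory) above these
    cpu: 500m
    memory: 256Mi
```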
34. Limits – Official docs
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
35. Labels and annotations
What: Metadata held by Kubernetes
Why: Makes workload management easier and allows other tools to work with standard Kubernetes definitions
Options: –
● Useful to have a simple plan
● Labels can be used as filters in kubectl arguments
● Annotations are a good way of layering functionality without the overhead of Custom Resource Definitions
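A minimal sketch of labels and annotations in a manifest (the names and values are illustrative):

```yaml
metadata:
  labels:                          # queryable, used by selectors
    app: podinfo
    tier: backend
  annotations:                     # free-form metadata read by tools
    prometheus.io/scrape: "true"
```

Labels can then drive kubectl filters, e.g. kubectl get pods -l app=podinfo.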
36. Alerts
What: Automated notifications on defined triggers
Why: You need to know when your service degrades
Options: Prometheus & Alertmanager (many other options)
37. Structured Logging
What: Output logs in a machine-readable format to facilitate searching & indexing
Why: Trace what went wrong when something does
Options: ELK stack (Elasticsearch, Logstash, and Kibana); many commercial offerings
● Avoid logging to files
● Must have timestamps and basic levels (i.e. info, error, fatal)
● JSON logs/events are love-or-hate
● KV formats are more human-friendly
38. Tracing Instrumentation
What: Instrumentation to send request-processing details to a collection service
Why: Sometimes the only way of figuring out where latency is coming from
Options: Zipkin, LightStep, Appdash, Tracer, Jaeger
● Trigger tracing from your gateway API
● Sample traces, don’t trace everything
● Costly to set up, but the only meaningful way of debugging some latency issues
● Use something that supports the OpenTracing standard
39. Graceful shutdowns
What: Applications respond to SIGTERM correctly
Why: This is how Kubernetes tells your application to stop
Options: –
● End transactions cleanly
● The default terminationGracePeriodSeconds is quite long, and can be shortened
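A sketch of tuning the grace period in a pod spec (the 15-second value, image name, and preStop hook are illustrative assumptions):

```yaml
spec:
  terminationGracePeriodSeconds: 15   # default is 30s
  containers:
  - name: app
    image: example/app:1.0.0          # hypothetical image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]  # let load balancers drain before SIGTERM arrives
```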
40. Graceful dependencies
What: Applications don’t assume dependencies are available; they wait for other services before reporting ready
Why: Avoid the headaches that come with a service-ordering requirement
Options: –
● Nice apps don’t crash-loop
● This is what the readiness probe was built for
41. ConfigMaps
What: Define a configuration file for your application in Kubernetes using ConfigMaps
Why: Easy to reconfigure an app without rebuilding; allows config to be versioned
Options: –
● Mounting the ConfigMap as a volume is the easiest option
● Environment variables are an alternative for simpler config
● Setting a file watch, or polling, means your application will take new config into account immediately
42. ConfigMap Example
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-cfg
data:
  .env: |
    APP_NAME=my-app
    APP_ENV=stg
    APP_KEY="base64:gFf47FZi6F9xDJiZiEmmKlePurMaXECKs1cA9hscIVc="
    APP_DEBUG=true
    APP_LOG_LEVEL=debug
    APP_URL=http://localhost
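The my-app-cfg ConfigMap could then be mounted as a volume, making .env appear as a file in the container. A sketch (the container name, image, and mount path are illustrative):

```yaml
spec:
  containers:
  - name: my-app
    image: example/my-app:latest     # hypothetical image
    volumeMounts:
    - name: config
      mountPath: /app/config         # .env appears as /app/config/.env
      readOnly: true
  volumes:
  - name: config
    configMap:
      name: my-app-cfg
```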
45. Labeled images using commit SHA
What: Tag your Docker images with the code commit SHA
Why: Makes tracing an image back to code trivial
Options: –
● Important to be able to trace back from a running application to its origin code
● If you reliably build your images with ${branch}-${short_git_hash} names, that might be enough
46. Locked-down runtime context
What: Use a deliberately secure configuration for the application runtime context
Why: Reduces attack surface, makes privileges explicit
Options: –
● If your app writes temporary files, be sure to use an emptyDir volume
● If your app has to initialise some data, do it with initContainers
● Avoid installing packages or fetching files from unreliable locations
● If you can, try to use readOnlyRootFilesystem: true
● runAsUser, fsGroup and allowPrivilegeEscalation: false let you control the runtime context further
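Pulled together, those settings might look like this sketch of a hardened pod spec (the names, IDs, and image are illustrative):

```yaml
spec:
  securityContext:
    runAsUser: 1000                  # don't run as root
    fsGroup: 2000
  containers:
  - name: app
    image: example/app:1.0.0         # hypothetical image
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
    volumeMounts:
    - name: tmp
      mountPath: /tmp                # writable scratch space despite the read-only root
  volumes:
  - name: tmp
    emptyDir: {}
```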
48. Build pipeline
What: Builds your code, runs your tests
Why: –
Options: –
● You have one already
● You should be able to use it
● Make sure artefacts are tagged with the Git commit SHA
49. Deployment pipeline
What: Takes build artefacts and puts them in the cluster
Why: –
Options: –
● Note this is a separate concern from your build pipeline
● This is where your approval process lives
● This is where GitOps lives – more later today
50. Image registry
What: Stores build artefacts
Why: Keeps versioned artefacts available
Options: Roll your own; commercial: Docker Hub, Quay.io, GCP Container Registry
● Key security point
● Great options available both on-prem and online
● Credentials need to be available to CI for push, and to the cluster for pull
51. Monitoring infrastructure
What: Collects and stores metrics
Why: Understand your running system; get alerts when something goes wrong
Options: OSS: Prometheus, Cortex, Thanos; commercial: Datadog, Grafana Cloud, Weave Cloud
● The flip side of metrics instrumentation
52. Shared Storage
What: Store persistent state of your application beyond the pod lifetime
Why: Stateless is a unicorn
Options: Many; will depend on the platform
● Seen by your application as a directory
● Volumes and Volume Claims are different things
● May be read-only
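A sketch of claiming shared storage (the size, access mode, and name are illustrative; the storage class depends on your platform):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce        # single-node read-write; other modes may be read-only
  resources:
    requests:
      storage: 10Gi
```

The pod then references the claim via a persistentVolumeClaim volume and sees it as a directory.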
53. Secrets Management
What: How your applications access secret credentials securely
Why: Secrets are needed to use external services
Options: Bitnami Sealed Secrets, HashiCorp Vault
54. Ingress controller
What: Common routing point for inbound traffic
Why: Easier to manage authentication and logging
Options: Platform controllers (AWS ELB); GCE & NGINX (by Kubernetes); Kong, Traefik, HAProxy, Istio, Envoy
55. API Gateway
What: Single point for incoming requests; a higher-layer ingress controller
Why: Can route at the HTTP level; enables common, centralised tooling for tracing, logging, and authentication
Options: Ambassador (Envoy), roll your own
● Can replace the ingress controller
● Ambassador is Kubernetes-native
56. Service mesh
What: Additional layer on top of Kubernetes to manage routing
Why: Enables complex use cases and adds useful features
Options: Linkerd, Istio
● May not be needed
● Can provide tracing without instrumentation
● Runs as a sidecar on services
● Other features: service-to-service TLS, load balancing, fine-grained traffic policies, service discovery, service monitoring
57. Service catalogue / broker
What: Enables easy dependencies on services, and service discovery, for your team
Why: Simplifies deploying applications
Options: –
● Kubernetes’ own Service Catalog API is worth mentioning: https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/
● Fits in really well with the role of service meshes
● Ease of use for developers can also be achieved with a central repository of service configurations
● Still early days
58. Network policies
What: Rules on allowed connections
Why: Prevent unauthorised access, improve security, segregate namespaces
Options: Weave Net, Calico
● Node-level (kernel) controls and restrictions on traffic
● Needs a CNI plugin
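A sketch of a NetworkPolicy that only admits traffic from labelled pods (the labels, namespace, and port are illustrative assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: podinfo-ingress
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: podinfo          # the policy applies to these pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # only these pods may connect
    ports:
    - protocol: TCP
      port: 9898
```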
59. Authorisation integration
What: API-level integration into the Kubernetes auth flow
Why: Use existing SSO, reduce the number of accounts, and centralise account management
Options: –
● Will require some custom integration work
● Many hooks into the auth API
● Possible to integrate with almost any auth provider
60. Image scanning
What: Automated scanning for vulnerabilities in your container images
Why: Because CVEs happen
Options: Docker, Snyk, Twistlock, Sonatype, Clair (OSS)
● Definitely worth integrating into your CI pipeline
● Tools can be integrated with your PR process to provide comments on commits
61. Log Aggregation
What: Bring all application logs into one searchable place
Why: Logs are the best source of information on what went wrong
Options: Lots and lots; Fluentd or the ELK (Elasticsearch, Logstash, Kibana) stack are good bets for roll-your-own
93. Prometheus in Practice 💻
1. Create the namespace we will use for this exercise:
kubectl create namespace dev
Shortly, the Deploy agent will notice this change and sync the Deployment and Service files.
2. Watch for this happening in Weave Cloud, or via:
watch kubectl -n dev get all
The podinfo application should now be running in your cluster in the dev namespace.
94. 1 – Inspect the raw metrics directly 💻
From your Cloud9 IDE console, run:
curl http://podinfo.dev:9898/metrics | less
and try to find the metrics that show:
● the number of open file descriptors
● the number of HTTP requests the pod has received
95. Answers
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{status="200"} 136
96. Understanding metrics
● Each line on that page is either a comment or a time series
● A time series has a name, optional labels, and a series of values
● A collection of time series with the same name is a metric
97. Understanding metrics
For example:
http_requests_total{status="200"} 136
● the name is http_requests_total
● there is one label, status, with label value "200"
● the value is 136
Since the pod launched, we've received 136 HTTP requests with status 200.
98. 2 – Query the metrics 💻
On your Weave Cloud instance:
● Go to "Monitoring"
● Create a notebook
● Call it "Monitoring in Practice"
● Enter http_requests_total and then click "Run as Table"
101. Some labels are added automatically
Our pod has grown instance, job, _weave_namespace, and _weave_service labels.
These were added at the point of scraping, so time series don't clash with each other and you can find the source of your data.
103. Label schema is flexible
Some pods use slightly different labels (e.g. code instead of status).
This highlights that Prometheus doesn't impose a schema on labels—they are free-form.
It's highly recommended that you adopt a consistent standard across your key applications.
104. 3 – Query the metrics using labels 💻
Filtering metrics
What if we only want to see the data from our service? In a new cell, run the following query as a table:
http_requests_total{_weave_service="podinfo"}
This only shows the time series whose labels exactly match those above. PromQL also supports not-equals (!=) and regular expression matching (=~ and !~).
106. 4 – Aggregate metrics using functions 💻
Aggregating metrics
What if we want to get the total requests for our whole cluster? In a new cell, enter the following:
sum(http_requests_total)
This adds up all the requests to give us a single value.
108. 5 – Aggregate metrics by labels 💻
Aggregating metrics
If you look at our original query, you'll see there are separate lines for each replica, and multiple rows refer to kube-dns or kubelet. How do we aggregate those metrics together? In a new cell, run the following query:
sum(http_requests_total) by (_weave_namespace, _weave_service)
Note that only the labels in our by clause are preserved.
110. Differentiating metrics
Look at the graph view of our first query. What's the deal with these lines going up all the time?
http_requests_total is a counter. It goes up by one every time there's an HTTP request. It never goes down.
What if we wanted to see requests per second?
111. 6 – Derive a gauge from a counter 💻
Differentiating metrics
In a new cell, run:
rate(http_requests_total[1m])
and make sure to see the graph view. What do you see?
Try changing the time interval from 1m to other values (5m, 2h, 10s). What do you think is happening there?
113. 7 – Create a custom query 💻
Putting it all together
We now know enough to get a graph of HTTP requests per second for dev/podinfo that will work regardless of how many replicas it has.
Create a query that results in a graph of the HTTP request rate for dev/podinfo.
115. Generating traffic 💻
That graph is a bit boring. Let's make it more interesting by generating some traffic.
Open a Weave Cloud terminal window into this container and run:
hey -z 2m http://podinfo.dev:9898/error
This will run for 2 minutes, sending many, many requests to the error endpoint on podinfo.
117. Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think about metrics is which domain they're in
● It's trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring) and ad-hoc querying (observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
122. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
123. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
124. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
(Although Weaveworks can help with how)
125. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
A way to speed up your team
128. 1 The entire system is described declaratively.
Beyond code, data ⇒
Implementation independent
Easy to abstract in simple ways
Easy to validate for correctness
Easy to generate & manipulate from code
136. 2 The canonical desired system state is versioned (with Git)
Canonical Source of Truth (DRY)
With declarative definitions, trivialises rollbacks
Excellent security guarantees for auditing
Sophisticated approval processes (& existing workflows)
Great Software ↔ Human collaboration point
137. 3 Changes to the desired state are automatically applied to the system
138. 3 Approved changes to the desired state are automatically applied to the system
Significant velocity gains
Privileged operators don’t cross security boundaries
Separates What and How
140. 4 Software agents ensure correctness and alert on divergence
Continuously checking that the desired state is met
The system can self-heal
Recovers from errors without intervention (PEBKAC)
It’s the control loop for your operations
141. 1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Approved changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
142. GitOps is Functional Reactive Programming…
...for your infrastructure.
Like React, but for servers and applications.
157. GitOps Hands-On 1/12 💻
[Only do this step if you didn’t do it in your cluster earlier]
Create the namespace we will use for this exercise:
kubectl create namespace dev
Shortly, the Deploy agent will notice this change and sync the Deployment and Service files.
Watch for this happening in Weave Cloud, or via:
watch kubectl -n dev get all
158. GitOps Hands-On 2/12 💻
We’re going to make a code change, see it flow through CI, then deploy that change.
Call the version endpoint on the service to see what is running:
curl podinfo.dev:9898/version
159. GitOps Hands-On 3/12 💻
In the editor, open podinfo/pkg/version/version.go, increment the version number and save:
var VERSION = "0.3.1"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.1 to dev" .
git push
160. GitOps Hands-On 4/12 💻
The CI pipeline will create an image tagged the same as the Git commit. Git said something like [master 89b8396]; the tag will be like master-89b8396.
Check by listing image tags (replace USER with your username):
gcloud container images list-tags gcr.io/dx-training/USER-podinfo
USER should be of the form “training-user-<number>”.
161. GitOps Hands-On 5/12 💻
Navigate in the editor to workspace/cluster/un-workshop/dev and open podinfo-dep.yaml. Where it says image:
replace quay.io/stefanprodan/podinfo with gcr.io/dx-training/USER-podinfo
replace the tag 0.3.0 with your tag master-TAG
Save the file, commit your changes, and push to master:
cd ../cluster/un-workshop/dev
git commit -m "my first deploy" .
git push
162. GitOps Hands-On 6/12 💻
Check in Weave Cloud to see when it has synced the Deployment.
Call the version endpoint on the service to see if it changed:
curl podinfo.dev:9898/version
164. GitOps Hands-On 7/12 💻
In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click Automate.
This creates a commit, because Git is our single source of truth. To keep things in sync, bring it into your workspace:
git pull
165. GitOps Hands-On 8/12 💻
Open podinfo/pkg/version/version.go, increment the version number again, and save:
var VERSION = "0.3.2"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.2" .
git push
166. GitOps Hands-On 9/12 💻
Watch for CI/CD to upgrade the app to 0.3.2:
watch curl podinfo.dev:9898/version
167. GitOps Hands-On 10/12 💻
Suppose we don’t like the latest version: we want to roll back.
1. In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click Deautomate.
2. The UI shows a list of images – select the one you want and click Release, then again to confirm.
3. Check again which version is running:
watch curl podinfo.dev:9898/version
168. GitOps Hands-On 11/12 💻
We can flag that the latest build should not be deployed.
1. In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click 🔒Lock.
2. Give a reason, then click Lock again to confirm.
3. Each of these actions creates a Git commit. Sync your workspace:
git pull
4. Reload /workspace/cluster/dev/podinfo-dep.yaml in the editor to see how the lock is applied.
169. GitOps Hands-On 12/12 💻
We can flow the version number through the pipeline with a Git tag, to show more meaningful versions.
Create and push a Git tag:
cd /workspace/podinfo
git tag 0.3.2
git push origin 0.3.2
This will trigger another CI build; when it finishes you should have an image tagged 0.3.2.
170. Recap: GitOps CI/CD
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollbacks and locks are all done via Git
● Git is our single source of truth
179. Rolling Deployment Strategy
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # new pods added at a time
      maxUnavailable: 0
  minReadySeconds: 10
180. Rolling Deployment Strategy
Pros
● Low risk due to readiness checks
● Gradual rollout with no downtime
Cons
● Needs backwards compatibility between API versions and
database migrations
● No control over the traffic during the rollout
Suitable for
● Stateful applications & Stateless microservices
183. Blue/Green Deployment Strategy
Kubernetes-native deployment strategy
apiVersion: v1
kind: Service
spec:
  selector:
    app: podinfo
    version: v1  # switch the traffic from blue to green by changing the version to v2
184. Blue/Green Deployment Strategy
Suitable for
● Monolithic legacy applications
● Autonomous microservices
Pros
● Avoids versioning issues
● Instant rollout and rollback (while the blue deployment still exists)
Cons
● Requires resource duplication
● Data synchronisation between the two environments can lead to partial service interruption
188. Canary Deployment Strategy
Suitable for
● User facing applications
● Stateless microservices
Pros
● Low impact as the new version is released only to a subset of users
● Controlled rollout with no downtime
● Fast rollback
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
189. Istio Canary Deployment – Initial State
All traffic is routed to the GA deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
http:
- route:
  - destination:
      name: podinfo
      subset: canary
    weight: 0
  - destination:
      name: podinfo
      subset: ga
    weight: 100
198. A/B Testing
Suitable for
● User facing applications
● Stateless microservices
Pros
● Allows advanced customer behaviour analysis
● Performance testing of different configurations in parallel
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
203. Blue/Green + Dark Traffic Deployment Strategy
Suitable for
● API based applications
● Autonomous microservices
Pros
● Test the green deployment without any impact for the end-user
● Uses real traffic minimising the risk of a faulty release
Cons
● Requires resource duplication
● Needs a traffic management solution / service mesh (Envoy, Istio)
206. Operational practices for Kubernetes
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane
207. Kubernetes component architecture
(Diagram from https://speakerdeck.com/luxas, with permission: the control plane — API Server (REST API), Controller Manager (controller loops), Scheduler (binds Pods to Nodes), and etcd (key-value DB, the single source of truth) — manages the nodes, each running an OS, a container runtime, the Kubelet, and networking, connected via the CNI, CRI, OCI, Protobuf, gRPC, and JSON interfaces.)
208. High-availability Kubernetes
(Diagram from https://speakerdeck.com/luxas, with permission: an HA etcd cluster backs three master nodes — each running an API Server, Controller Manager, and Scheduler with shared certificates — behind an external load balancer or DNS-based API server resolution, serving the worker-node Kubelets.)
209. Site Reliability Engineering
● How Google runs production systems
● SREs:
○ Have the skillset necessary to automate tasks
○ Do the same work as an operations team, but with automation instead of manual labor
● The SRE team is responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning
210. Worker node maintenance
When you need to reboot a worker node to install OS updates or do hardware maintenance without disrupting your workloads, perform the following operations:
● Evict all running pods except DaemonSets and StatefulSets
● Mark the node as unschedulable
● Perform maintenance work on the node
● Restart the node
● Make the node schedulable again
kubectl drain $NODE → reboot $NODE → kubectl uncordon $NODE
211. Worker node maintenance
● If all workloads are replicated, draining a node before rebooting is not necessary. A node reboot that comes back in less than 5 minutes will not trigger any pod rescheduling.
● A drain operation can target multiple nodes. To protect clustered applications, create Pod Disruption Budgets to ensure the number of running replicas is never brought below the minimum needed for a quorum.
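A Pod Disruption Budget for the scenario above might look like this sketch (the replica count, name, and labels are illustrative; on Kubernetes 1.21+ the API group is policy/v1):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: podinfo-pdb
  namespace: dev
spec:
  minAvailable: 2          # a drain waits rather than dropping below quorum
  selector:
    matchLabels:
      app: podinfo
```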
212. Control plane maintenance
Operations:
● Master node OS updates and hardware maintenance (low risk)
○ A master node reboot will not disrupt any running workloads
○ While the master node is offline, no scheduling operations will happen
○ Kured can help with this (https://github.com/weaveworks/kured)
● Control plane upgrades (high risk)
○ For in-place master node upgrades, a full backup (such as LVM snapshots) is recommended
○ Only one minor version upgrade is supported: you can upgrade from 1.9 to 1.10, but not from 1.8 to 1.10
○ Test the upgrade procedure on a staging cluster before running it in production
213. Control plane upgrades
Kubeadm master node upgrade procedure:
● Download the most recent version of kubeadm using curl (do not upgrade the kubeadm OS package)
● Run kubeadm upgrade plan to check if your cluster is upgradeable
● Pick a version to upgrade to and run kubeadm upgrade apply v1.10.2
● Upgrade your CNI by applying the new DaemonSet definition
● Drain the master node with kubectl drain $MASTER --ignore-daemonsets
● Upgrade the Kubernetes packages with apt-get update && apt-get upgrade
● Bring the master node back online with kubectl uncordon $MASTER
217. Agenda
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
220. Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think of metrics is which domain they’re in
● It’s trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring) and ad-hoc querying
(observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
221. 1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
222. Recap: GitOps CI/CD
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollbacks and locks are all done via Git
● Git is our single source of truth
224. Operational practices for Kubernetes
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane