Join this workshop and accelerate your journey to production-ready Kubernetes by learning practical techniques for reliably operating your software lifecycle using the GitOps pattern. The Weaveworks team will be running a full-day workshop, sharing their expertise as users of and contributors to Kubernetes and Prometheus, as well as practitioners of GitOps (operations by pull request).
Using a combination of instructor-led demonstrations and hands-on exercises, the workshop will enable attendees to go into detail on the following topics:
• Developing and operating your Kubernetes microservices at scale
• DevOps best practices and the movement towards a “GitOps” approach
• Building with Kubernetes in production: caring for your apps, implementing CI/CD best practices, and utilizing the right metrics, monitoring tools, and automated alerts
• Operating Kubernetes in production: Upgrading and managing Kubernetes, managing incident response, and adhering to security best practices for Kubernetes
3. Hi
We work for Weaveworks as customer success engineers.
You can find Weaveworks at https://www.weave.works or @weaveworks.
The team at Weaveworks is behind the GitOps model.
You can find us online at @fractallambda and @c_r_w.
4. About Weaveworks
● Building cloud-native OSS since 2014 (Weave Net, Moby, Kubernetes, Prometheus)
● Founding member of the CNCF
● Alexis Richardson (Weaveworks CEO) is chair of the CNCF Technical Oversight Committee
● Weave Cloud has run on Kubernetes since 2015
10. Agenda
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
11. Some assumptions
➔ You can use the command line.
➔ You can use Git.
➔ You know what Kubernetes Pods, Deployments, and Services are.
➔ You have a modern web browser.
12. Kubernetes need-to-know
Containers – Run Docker images: an immutable copy of your application code and all its dependencies, in an isolated environment.
Pods – A set of containers, co-scheduled on one machine. Ephemeral. Has a unique IP. Has labels.
Deployment – Ensures a certain number of replicas of a pod are running across the cluster.
Service – Gets a virtual IP, mapped to endpoints via labels. Named in DNS.
Namespace – Resource names are scoped to a Namespace. Policy boundary.
15. Log in to your cluster – Weave Cloud & C9
1. Go to tinyurl.com/kubecon18-cluster
2. Add your name and email
3. Check your email for links to your environment and your password
(This may take a little while. Be patient while Craig invites you)
21. GitOps hands-on 1/10: Kick the tires on your cluster 💻
1. Start with a simple command:
➤ kubectl version
2. Look at what’s running on the cluster with
Weave Cloud Explore
29. Liveness and Readiness probes
What: Endpoints for Kubernetes to monitor your application lifecycle
Why: Allows Kubernetes to restart a pod, or stop routing traffic to it
Options: –
● A liveness failure tells Kubernetes to restart the pod
● A readiness failure is transient and tells Kubernetes to route traffic elsewhere
● Readiness failures are useful during startup and for load management
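For reference, probes are declared per container in the pod spec. A minimal sketch (the /healthz and /readyz paths and port 9898 are illustrative, not from the workshop materials):

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # restart the container if this stops answering
    port: 9898
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz         # stop routing traffic to the pod while this fails
    port: 9898
  periodSeconds: 5
```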
30. Metric instrumentation
What: Code and libraries used in your code to expose metrics
Why: Allows measuring the operation of the application, and enables many more advanced use cases
Options: Prometheus, New Relic, Datadog, many others
● Basic metrics are not optional
● Prometheus is a fantastic fit for Kubernetes in most cases
32. Playbooks / Runbooks
What: Rich guides for your engineers on how to operate the system and fault-find when things go wrong
Why: Nobody is at their sharpest at 03:00 AM; knowledge deteriorates over time
Options: Confluence, Markdown files, Weave Cloud Notebooks
● Absolutely vital knowledge repository
● Avoids the bus factor
● First port of call for operational issues
● Significantly speeds up new engineer induction
● Requires continuous work to maintain
33. Limits and requests
What: Explicit resource allocation for pods
Why: Allows Kubernetes to make good scheduling decisions
Options: –
● Requests are used when scheduling
● Limits prevent workloads from causing cascading failures
● Limits are a valuable safety net
● Also available at the namespace level (see ResourceQuotas)
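As a sketch, requests and limits are set per container; the values below are purely illustrative:

```yaml
resources:
  requests:            # used by the scheduler to place the pod
    cpu: 100m
    memory: 128Mi
  limits:              # throttled (CPU) or OOM-killed (memory) above these
    cpu: 500m
    memory: 256Mi
```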
34. Limits – Official docs
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
35. Labels and annotations
What: Metadata held by Kubernetes
Why: Makes workload management easier and allows other tools to work with standard Kubernetes definitions
Options: –
● Useful to have a simple plan
● Labels can be used as filters in kubectl arguments
● Annotations are a good way of layering functionality without the overhead of Custom Resource Definitions
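A minimal sketch of labels and annotations in a manifest (the names and values are illustrative):

```yaml
metadata:
  labels:                          # queryable, used by selectors
    app: podinfo
    tier: backend
  annotations:                     # free-form metadata read by tools
    prometheus.io/scrape: "true"
```

Labels can then drive kubectl filters, e.g. kubectl get pods -l app=podinfo.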
36. Alerts
What: Automated notifications on defined triggers
Why: You need to know when your service degrades
Options: Prometheus & Alertmanager (many other options)
37. Structured Logging
What: Output logs in a machine-readable format to facilitate searching & indexing
Why: Trace what went wrong when something does
Options: ELK stack (Elasticsearch, Logstash, and Kibana); many commercial offerings
● Avoid logging to files
● Must have timestamps and basic levels (i.e. info, error, fatal)
● JSON logs/events are love-or-hate
● KV formats are more human-friendly
38. Tracing Instrumentation
What: Instrumentation to send request-processing details to a collection service
Why: Sometimes the only way of figuring out where latency is coming from
Options: Zipkin, LightStep, Appdash, Tracer, Jaeger
● Trigger tracing from your gateway API
● Sample traces, don’t trace everything
● Costly to set up, but the only meaningful way of debugging some latency issues
● Use something that supports the OpenTracing standard
39. Graceful shutdowns
What: Applications respond to SIGTERM correctly
Why: This is how Kubernetes tells your application to stop
Options: –
● End transactions cleanly
● The default terminationGracePeriodSeconds is quite long, and can be shortened
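A sketch of tuning the grace period in a pod spec (the 15-second value, image name, and preStop hook are illustrative assumptions):

```yaml
spec:
  terminationGracePeriodSeconds: 15   # default is 30s
  containers:
  - name: app
    image: example/app:1.0.0          # hypothetical image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]  # let load balancers drain before SIGTERM arrives
```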
40. Graceful dependencies
What: Applications don’t assume dependencies are available; they wait for other services before reporting ready
Why: Avoid the headaches that come with a service-ordering requirement
Options: –
● Nice apps don’t crash-loop
● This is what the readiness probe was built for
41. ConfigMaps
What: Define a configuration file for your application in Kubernetes using ConfigMaps
Why: Easy to reconfigure an app without rebuilding; allows config to be versioned
Options: –
● Mounting the ConfigMap as a volume is the easiest option
● Environment variables are an alternative for simpler config
● Setting a file watch, or polling, means your application will take new config into account immediately
42. ConfigMap Example
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-cfg
data:
  .env: |
    APP_NAME=my-app
    APP_ENV=stg
    APP_KEY="base64:gFf47FZi6F9xDJiZiEmmKlePurMaXECKs1cA9hscIVc="
    APP_DEBUG=true
    APP_LOG_LEVEL=debug
    APP_URL=http://localhost
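The my-app-cfg ConfigMap could then be mounted as a volume, making .env appear as a file in the container. A sketch (the container name, image, and mount path are illustrative):

```yaml
spec:
  containers:
  - name: my-app
    image: example/my-app:latest     # hypothetical image
    volumeMounts:
    - name: config
      mountPath: /app/config         # .env appears as /app/config/.env
      readOnly: true
  volumes:
  - name: config
    configMap:
      name: my-app-cfg
```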
45. Labeled images using commit SHA
What: Tag your Docker images with the code commit SHA
Why: Makes tracing an image back to code trivial
Options: –
● Important to be able to trace back from a running application to its origin code
● If you reliably build your images with ${branch}-${short_git_hash} names, that might be enough
46. Locked-down runtime context
What: Use a deliberately secure configuration for the application runtime context
Why: Reduces attack surface, makes privileges explicit
Options: –
● If your app writes temporary files, be sure to use an emptyDir volume
● If your app has to initialise some data, do it with initContainers
● Avoid installing packages or fetching files from unreliable locations
● If you can, try to use readOnlyRootFilesystem: true
● runAsUser, fsGroup and allowPrivilegeEscalation: false let you control the runtime context further
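Pulled together, those settings might look like this sketch of a hardened pod spec (the names, IDs, and image are illustrative):

```yaml
spec:
  securityContext:
    runAsUser: 1000                  # don't run as root
    fsGroup: 2000
  containers:
  - name: app
    image: example/app:1.0.0         # hypothetical image
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
    volumeMounts:
    - name: tmp
      mountPath: /tmp                # writable scratch space despite the read-only root
  volumes:
  - name: tmp
    emptyDir: {}
```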
48. Build pipeline
What: Builds your code, runs your tests
Why: –
Options: –
● You have one already
● You should be able to use it
● Make sure artefacts are tagged with the Git commit SHA
49. Deployment pipeline
What: Takes build artefacts and puts them in the cluster
Why: –
Options: –
● Note this is a separate concern from your build pipeline
● This is where your approval process lives
● This is where GitOps lives – more later today
50. Image registry
What: Stores build artefacts
Why: Keeps versioned artefacts available
Options: Roll your own; commercial: Docker Hub, Quay.io, GCP Container Registry
● Key security point
● Great options available both on-prem and online
● Credentials need to be available to CI for push, and to the cluster for pull
51. Monitoring infrastructure
What: Collects and stores metrics
Why: Understand your running system; get alerts when something goes wrong
Options: OSS: Prometheus, Cortex, Thanos; commercial: Datadog, Grafana Cloud, Weave Cloud
● The flip side of metrics instrumentation
52. Shared Storage
What: Store persistent state of your application beyond the pod lifetime
Why: Stateless is a unicorn
Options: Many; will depend on the platform
● Seen by your application as a directory
● Volumes and Volume Claims are different things
● May be read-only
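A sketch of claiming shared storage (the size, access mode, and name are illustrative; the storage class depends on your platform):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce        # single-node read-write; other modes may be read-only
  resources:
    requests:
      storage: 10Gi
```

The pod then references the claim via a persistentVolumeClaim volume and sees it as a directory.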
53. Secrets Management
What: How your applications access secret credentials securely
Why: Secrets are needed to use external services
Options: Bitnami Sealed Secrets, HashiCorp Vault
54. Ingress controller
What: Common routing point for inbound traffic
Why: Easier to manage authentication and logging
Options: Platform controllers (AWS ELB); GCE & NGINX (by Kubernetes); Kong, Traefik, HAProxy, Istio, Envoy
55. API Gateway
What: Single point for incoming requests; a higher-layer ingress controller
Why: Can route at the HTTP level; enables common, centralised tooling for tracing, logging, and authentication
Options: Ambassador (Envoy), roll your own
● Can replace the ingress controller
● Ambassador is Kubernetes-native
56. Service mesh
What: Additional layer on top of Kubernetes to manage routing
Why: Enables complex use cases and adds useful features
Options: Linkerd, Istio
● May not be needed
● Can provide tracing without instrumentation
● Runs as a sidecar on services
● Other features: service-to-service TLS, load balancing, fine-grained traffic policies, service discovery, service monitoring
57. Service catalogue / broker
What: Enables easy dependencies on services, and service discovery, for your team
Why: Simplifies deploying applications
Options: –
● Kubernetes’ own Service Catalog API is worth mentioning: https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/
● Fits in really well with the role of service meshes
● Ease of use for developers can also be achieved with a central repository of service configurations
● Still early days
58. Network policies
What: Rules on allowed connections
Why: Prevent unauthorised access, improve security, segregate namespaces
Options: Weave Net, Calico
● Node-level (kernel) controls and restrictions on traffic
● Needs a CNI plugin
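A sketch of a NetworkPolicy that only admits traffic from labelled pods (the labels, namespace, and port are illustrative assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: podinfo-ingress
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: podinfo          # the policy applies to these pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # only these pods may connect
    ports:
    - protocol: TCP
      port: 9898
```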
59. Authorisation integration
What: API-level integration into the Kubernetes auth flow
Why: Use existing SSO, reduce the number of accounts, and centralise account management
Options: –
● Will require some custom integration work
● Many hooks into the auth API
● Possible to integrate with almost any auth provider
60. Image scanning
What: Automated scanning for vulnerabilities in your container images
Why: Because CVEs happen
Options: Docker, Snyk, Twistlock, Sonatype, Clair (OSS)
● Definitely worth integrating into your CI pipeline
● Tools can be integrated with your PR process to provide comments on commits
61. Log Aggregation
What: Bring all application logs into one searchable place
Why: Logs are the best source of information on what went wrong
Options: Lots and lots; Fluentd or the ELK (Elasticsearch, Logstash, Kibana) stack are good bets for roll-your-own
93. Prometheus in Practice 💻
1. Create the namespace we will use for this exercise:
kubectl create namespace dev
Shortly, the Deploy agent will notice this change and sync the Deployment and Service files.
2. Watch for this happening in Weave Cloud, or via:
watch kubectl -n dev get all
The podinfo application should now be running in your cluster in the dev namespace.
94. 1 – Inspect the raw metrics directly 💻
From your Cloud9 IDE console, run:
curl http://podinfo.dev:9898/metrics | less
and try to find the metrics that show:
● the number of open file descriptors
● the number of HTTP requests the pod has received
95. Answers
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{status="200"} 136
96. Understanding metrics
● Each line on that page is either a comment or a time series
● A time series has a name, optional labels, and a series of values
● A collection of time series with the same name is a metric
97. Understanding metrics
For example:
http_requests_total{status="200"} 136
● the name is http_requests_total
● there is one label, status, with label value "200"
● the value is 136
Since the pod launched, we've received 136 HTTP requests with status 200.
98. 2 – Query the metrics 💻
On your Weave Cloud instance:
● Go to "Monitoring"
● Create a notebook
● Call it "Monitoring in Practice"
● Enter http_requests_total and then click "Run as Table"
101. Some labels are added automatically
Our pod has grown instance, job, _weave_namespace, and _weave_service labels.
These were added at the point of scraping, so time series don't clash with each other and you can find the source of your data.
103. Label schema is flexible
Some pods use slightly different labels (e.g. code instead of status).
This highlights that Prometheus doesn't impose a schema on labels—they are free-form.
It's highly recommended that you adopt a consistent standard across your key applications.
104. 3 – Query the metrics using labels 💻
Filtering metrics
What if we only want to see the data from our service? In a new cell, run the following query as a table:
http_requests_total{_weave_service="podinfo"}
This only shows the time series whose labels exactly match those above. PromQL also supports not-equals (!=) and regular expression matching (=~ and !~).
106. 4 – Aggregate metrics using functions 💻
Aggregating metrics
What if we want to get the total requests for our whole cluster? In a new cell, enter the following:
sum(http_requests_total)
This adds up all the requests to give us a single value.
108. 5 – Aggregate metrics by labels 💻
Aggregating metrics
If you look at our original query, you'll see there are separate lines for each replica, and multiple rows refer to kube-dns or kubelet. How do we aggregate those metrics together? In a new cell, run the following query:
sum(http_requests_total) by (_weave_namespace, _weave_service)
Note that only the labels in our by clause are preserved.
110. Differentiating metrics
Look at the graph view of our first query. What's the deal with these lines going up all the time?
http_requests_total is a counter. It goes up by one every time there's an HTTP request. It never goes down.
What if we wanted to see requests per second?
111. 6 – Derive a gauge from a counter 💻
Differentiating metrics
In a new cell, run:
rate(http_requests_total[1m])
and make sure to see the graph view. What do you see?
Try changing the time interval from 1m to other values (5m, 2h, 10s). What do you think is happening there?
113. 7 – Create a custom query 💻
Putting it all together
We now know enough to get a graph of HTTP requests per second for dev/podinfo that will work regardless of how many replicas it has.
Create a query that results in a graph of the HTTP request rate for dev/podinfo.
115. Generating traffic 💻
That graph is a bit boring. Let's make it more interesting by generating some traffic.
Open a Weave Cloud terminal window into this container and run:
hey -z 2m http://podinfo.dev:9898/error
This will run for 2 minutes, sending many, many requests to the error endpoint on podinfo.
117. Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think about metrics is which domain they're in
● It's trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring) and ad-hoc querying (observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
122. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
123. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
124. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
(Although Weaveworks can help with how)
125. GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
A way to speed up your team
128. 1 The entire system is described declaratively.
Beyond code, data ⇒
Implementation independent
Easy to abstract in simple ways
Easy to validate for correctness
Easy to generate & manipulate from code
136. 2 The canonical desired system state is versioned (with Git)
Canonical Source of Truth (DRY)
With declarative definitions, trivialises rollbacks
Excellent security guarantees for auditing
Sophisticated approval processes (& existing workflows)
Great Software ↔ Human collaboration point
137. 3 Changes to the desired state are automatically applied to the system
138. 3 Approved changes to the desired state are automatically applied to the system
Significant velocity gains
Privileged operators don’t cross security boundaries
Separates What and How
140. 4 Software agents ensure correctness and alert on divergence
Continuously checking that the desired state is met
The system can self-heal
Recovers from errors without intervention (PEBKAC)
It’s the control loop for your operations
141. 1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Approved changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
142. GitOps is Functional Reactive Programming…
...for your infrastructure.
Like React, but for servers and applications.
157. GitOps Hands-On 1/12 💻
[Only do this step if you didn’t do it in your cluster earlier]
Create the namespace we will use for this exercise:
kubectl create namespace dev
Shortly, the Deploy agent will notice this change and sync the Deployment and Service files.
Watch for this happening in Weave Cloud, or via:
watch kubectl -n dev get all
158. GitOps Hands-On 2/12 💻
We’re going to make a code change, see it flow through CI, then deploy that change.
Call the version endpoint on the service to see what is running:
curl podinfo.dev:9898/version
159. GitOps Hands-On 3/12 💻
In the editor, open podinfo/pkg/version/version.go, increment the version number and save:
var VERSION = "0.3.1"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.1 to dev" .
git push
160. GitOps Hands-On 4/12 💻
The CI pipeline will create an image tagged the same as the Git commit. Git said something like [master 89b8396]; the tag will be like master-89b8396.
Check by listing image tags (replace USER with your username):
gcloud container images list-tags gcr.io/dx-training/USER-podinfo
USER should be of the form “training-user-<number>”.
161. GitOps Hands-On 5/12 💻
Navigate in the editor to workspace/cluster/un-workshop/dev and open podinfo-dep.yaml. Where it says image:
replace quay.io/stefanprodan/podinfo with gcr.io/dx-training/USER-podinfo
replace the tag 0.3.0 with your tag master-TAG
Save the file, commit your changes, and push to master:
cd ../cluster/un-workshop/dev
git commit -m "my first deploy" .
git push
162. GitOps Hands-On 6/12 💻
Check in Weave Cloud to see when it has synced the Deployment.
Call the version endpoint on the service to see if it changed:
curl podinfo.dev:9898/version
164. GitOps Hands-On 7/12 💻
In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click Automate.
This creates a commit, because Git is our single source of truth. To keep things in sync, bring it into your workspace:
git pull
165. GitOps Hands-On 8/12 💻
Open podinfo/pkg/version/version.go, increment the version number again, and save:
var VERSION = "0.3.2"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.2" .
git push
166. GitOps Hands-On 9/12 💻
Watch for CI/CD to upgrade the app to 0.3.2:
watch curl podinfo.dev:9898/version
167. GitOps Hands-On 10/12 💻
Suppose we don’t like the latest version: we want to roll back.
1. In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click Deautomate.
2. The UI shows a list of images – select the one you want and click Release, then again to confirm.
3. Check again which version is running:
watch curl podinfo.dev:9898/version
168. GitOps Hands-On 11/12 💻
We can flag that the latest build should not be deployed.
1. In Weave Cloud Deploy, find the podinfo Deployment in the dev Namespace. Click 🔒Lock.
2. Give a reason, then click Lock again to confirm.
3. Each of these actions creates a Git commit. Sync your workspace:
git pull
4. Reload /workspace/cluster/dev/podinfo-dep.yaml in the editor to see how the lock is applied.
169. GitOps Hands-On 12/12 💻
We can flow the version number through the pipeline with a Git tag, to show more meaningful versions.
Create and push a Git tag:
cd /workspace/podinfo
git tag 0.3.2
git push origin 0.3.2
This will trigger another CI build; when it finishes you should have an image tagged 0.3.2.
170. Recap: GitOps CI/CD
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollbacks and locks are all done via Git
● Git is our single source of truth
179. Rolling Deployment Strategy
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # new pods added at a time
      maxUnavailable: 0
  minReadySeconds: 10
180. Rolling Deployment Strategy
Pros
● Low risk due to readiness checks
● Gradual rollout with no downtime
Cons
● Needs backwards compatibility between API versions and
database migrations
● No control over the traffic during the rollout
Suitable for
● Stateful applications & Stateless microservices
183. Blue/Green Deployment Strategy
Kubernetes-native deployment strategy
apiVersion: v1
kind: Service
spec:
  selector:
    app: podinfo
    version: v1  # switch the traffic from blue to green by changing the version to v2
184. Blue/Green Deployment Strategy
Suitable for
● Monolithic legacy applications
● Autonomous microservices
Pros
● Avoids versioning issues
● Instant rollout and rollback (while the blue deployment still exists)
Cons
● Requires resource duplication
● Data synchronisation between the two environments can lead to partial service interruption
188. Canary Deployment Strategy
Suitable for
● User facing applications
● Stateless microservices
Pros
● Low impact as the new version is released only to a subset of users
● Controlled rollout with no downtime
● Fast rollback
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
189. Istio Canary Deployment – Initial State
All traffic is routed to the GA deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
http:
- route:
  - destination:
      name: podinfo
      subset: canary
    weight: 0
  - destination:
      name: podinfo
      subset: ga
    weight: 100
198. A/B Testing
Suitable for
● User facing applications
● Stateless microservices
Pros
● Allows advanced customer behaviour analysis
● Performance testing of different configurations in parallel
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
203. Blue/Green + Dark Traffic Deployment Strategy
Suitable for
● API based applications
● Autonomous microservices
Pros
● Test the green deployment without any impact for the end-user
● Uses real traffic minimising the risk of a faulty release
Cons
● Requires resource duplication
● Needs a traffic management solution / service mesh (Envoy, Istio)
206. Operational practices for Kubernetes
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane
207. Kubernetes component architecture
(Diagram from https://speakerdeck.com/luxas, with permission: the control plane — API Server (REST API), Controller Manager (controller loops), Scheduler (binds Pods to Nodes), and etcd (key-value DB, the single source of truth) — manages the nodes, each running an OS, a container runtime, the Kubelet, and networking, connected via the CNI, CRI, OCI, Protobuf, gRPC, and JSON interfaces.)
208. High-availability Kubernetes
(Diagram from https://speakerdeck.com/luxas, with permission: an HA etcd cluster backs three master nodes — each running an API Server, Controller Manager, and Scheduler with shared certificates — behind an external load balancer or DNS-based API server resolution, serving the worker-node Kubelets.)
209. Site Reliability Engineering
● How Google runs production systems
● SREs:
○ Have the skillset necessary to automate tasks
○ Do the same work as an operations team, but with automation instead of manual labor
● The SRE team is responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning
210. Worker node maintenance
When you need to reboot a worker node to install OS updates or do hardware maintenance without disrupting your workloads, perform the following operations:
● Evict all running pods except DaemonSets and StatefulSets
● Mark the node as unschedulable
● Perform maintenance work on the node
● Restart the node
● Make the node schedulable again
kubectl drain $NODE → reboot $NODE → kubectl uncordon $NODE
211. Worker node maintenance
● If all workloads are replicated, draining a node before rebooting is not necessary. A node reboot that comes back in less than 5 minutes will not trigger any pod rescheduling.
● A drain operation can target multiple nodes. To protect clustered applications, create Pod Disruption Budgets to ensure the number of running replicas is never brought below the minimum needed for a quorum.
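A Pod Disruption Budget for the scenario above might look like this sketch (the replica count, name, and labels are illustrative; on Kubernetes 1.21+ the API group is policy/v1):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: podinfo-pdb
  namespace: dev
spec:
  minAvailable: 2          # a drain waits rather than dropping below quorum
  selector:
    matchLabels:
      app: podinfo
```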
212. Control plane maintenance
Operations:
● Master node OS updates and hardware maintenance (low risk)
○ A master node reboot will not disrupt any running workloads
○ While the master node is offline, no scheduling operations will happen
○ Kured can help with this (https://github.com/weaveworks/kured)
● Control plane upgrades (high risk)
○ For in-place master node upgrades, a full backup (such as LVM snapshots) is recommended
○ Only one minor version upgrade is supported: you can upgrade from 1.9 to 1.10, but not from 1.8 to 1.10
○ Test the upgrade procedure on a staging cluster before running it in production
213. Control plane upgrades
Kubeadm master node upgrade procedure:
● Download the most recent version of kubeadm using curl (do not upgrade the kubeadm OS package)
● Run kubeadm upgrade plan to check if your cluster is upgradeable
● Pick a version to upgrade to and run kubeadm upgrade apply v1.10.2
● Upgrade your CNI by applying the new DaemonSet definition
● Drain the master node with kubectl drain $MASTER --ignore-daemonsets
● Upgrade the Kubernetes packages with apt-get update && apt-get upgrade
● Bring the master node back online with kubectl uncordon $MASTER
217. Agenda
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
220. Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think of metrics is which domain they’re in
● It’s trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring) and ad-hoc querying
(observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
221. 1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
222. Recap: GitOps CI/CD
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollbacks and locks are all done via Git
● Git is our single source of truth
224. Operational practices for Kubernetes
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane