Your path to production
ready Kubernetes
Weaveworks – https://weave.works – @weaveworks
Kubecon Seattle – December 2018
Brice Fernandes – brice@weave.works – @fractallambda
Craig Wright – craig@weave.works – @c_r_w
1
Welcome and Introduction
2
3
Hi
We work for Weaveworks as customer success
engineers
You can find Weaveworks at https://www.weave.works
or @weaveworks
The team at Weaveworks is behind the GitOps model
You can find us online at @fractallambda and @c_r_w
● Building cloud-native OSS since 2014
(Weave Net, Moby, Kubernetes, Prometheus)
● Founding member of CNCF
● Alexis Richardson (Weaveworks CEO) is chair of
the CNCF Technical Oversight Committee
● Weave Cloud has run on Kubernetes since 2015
4
About Weaveworks
Weave Cloud
5
6
7
8
9
Agenda
10
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
Some assumptions
11
➔ You can use the command line.
➔ You can use Git.
➔ You know what Kubernetes Pods, Deployments, and Services are.
➔ You have a modern web browser.
Kubernetes need-to-know
12
Pods
containers
Deployments
Containers - Run Docker images, an immutable copy of your application code and all
code dependencies in an isolated environment.
Pods - A set of containers, co-scheduled on one machine. Ephemeral. Has unique IP. Has
labels.
Deployment - Ensures a certain number of replicas of a pod are running across the
cluster.
Service - Gets virtual IP, mapped to endpoints via labels. Named in DNS.
Namespace - Resource names are scoped to a Namespace. Policy boundary.
13
Today’s slides:
https://tinyurl.com/production-k8s-2018
Getting started with your environment
14
15
Login to your cluster – Weave Cloud & C9
1. Go to tinyurl.com/kubecon18-cluster
2. Add your name and email
3. Check your email for links to your environment and your password
(This may take a little while. Be patient while Craig invites you)
16
17
Application code
18
Cluster shell
Your Cluster
19
pod
Icon by Freepik from www.flaticon.com
Your Cluster
20
pod
Cloud Source Repositories
Container Builder
Cloud Registry
Kick the tires on your cluster 💻
1. Start with a simple command:
➤ kubectl version
2. Look at what’s running on the cluster with
Weave Cloud Explore
“DevOps Console”
Tooling for deployment,
visualisation and
observability
Weave Cloud
22
23
Weave Cloud Explore
24
Weave Cloud Monitor
Ask Kubernetes what’s running on the cluster:
➤ kubectl get pods --all-namespaces
Query Kubernetes 💻
What is “Production Ready”?
26
Bear with me while I go through this list.
There will be a kitten at the end.
Production Ready checklists
27
❏ Readiness check
❏ Liveness check
❏ Metric instrumentation
❏ Dashboards
❏ Playbook
❏ Limits and requests
❏ Labels and annotations
The Application checklist
28
❏ Alerts
❏ Structured logging output
❏ Tracing instrumentation
❏ Graceful shutdowns
❏ Graceful dependency (w. readiness check)
❏ Configmaps
❏ Labeled images using commit sha
❏ Locked down runtime context
Liveness and Readiness probes
29
What? Why? Options
Endpoints for Kubernetes to
monitor your application
lifecycle
Allows Kubernetes to restart
or stop traffic to a pod
-
● Liveness failure is for telling Kubernetes to restart the pod
● Readiness failure is transient and tells Kubernetes to route traffic elsewhere
● Readiness failure is useful for startup and load management
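A minimal sketch of how these probes are declared in a pod spec (the /healthz and /readyz paths and the port are assumptions, not the workshop app's actual manifest):
containers:
- name: app
  livenessProbe:             # failure tells Kubernetes to restart the container
    httpGet:
      path: /healthz
      port: 9898
    initialDelaySeconds: 5
    periodSeconds: 10
  readinessProbe:            # failure removes the pod from Service endpoints
    httpGet:
      path: /readyz
      port: 9898
    periodSeconds: 5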
Metric instrumentation
30
What? Why? Options
Code and libraries used in
your code to expose metrics
Allows measuring operation of
application and enables many
more advanced use cases
Prometheus, Newrelic,
Datadog, many others
● Basic metrics are not optional
● Prometheus is a fantastic fit for Kubernetes in most cases
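A common pattern is annotation-driven scraping. This sketch assumes a Prometheus scrape config that honours the prometheus.io/* pod annotations (a convention, not a Kubernetes built-in):
metadata:
  annotations:
    prometheus.io/scrape: "true"    # only meaningful if your relabelling rules look for it
    prometheus.io/port: "9898"
    prometheus.io/path: "/metrics"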
Dashboards
31
What? Why? Options
View of metrics
Metrics are just data. They must be consumable by humans as well.
Grafana
Many commercial options
Playbooks / Runbooks
32
What? Why? Options
Rich guides for your engineers
on how-to operate the system
and fault find when things go
wrong.
Nobody is at their sharpest at
03:00 AM
Knowledge deteriorates over
time
Confluence
Markdown files
Weave Cloud Notebooks
● Absolutely vital knowledge repository.
● Avoids the bus factor
● First point of call for operational issues
● Significantly speeds up new engineer induction
● Requires continuous work to maintain
Limits and requests
33
What? Why? Options
Explicit resource allocation for
pods
Allows Kubernetes to make
good scheduling decisions
-
● Requests are used when scheduling
● Limits will avoid workloads from causing cascading failures
● Limits are a valuable safety net
● Available at the namespace level as well (see ResourceQuotas)
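An illustrative container spec (the numbers are placeholders, not recommendations):
containers:
- name: app
  resources:
    requests:
      cpu: 100m        # used by the scheduler when placing the pod
      memory: 64Mi
    limits:
      cpu: 500m        # throttled above this
      memory: 128Mi    # OOM-killed above this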
34
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will
restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the
node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time.
However, it will not be killed for excessive CPU usage
Limits - Official docs
Labels and annotations
35
What? Why? Options
Metadata held by Kubernetes
Makes workload management easier and allows other tools to work with standard Kubernetes definitions
-
● Useful to have a simple plan
● Labels can be used in kubectl arguments as filters
● Annotations are a good way of layering functionality without the overhead of
Custom Resource Definitions
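A simple plan might look like this sketch (the keys shown are illustrative, not a required schema):
metadata:
  labels:
    app: podinfo
    environment: dev
  annotations:
    owner: team-payments          # free-form metadata, not used for selection
Labels can then drive kubectl filters, e.g. kubectl get pods -l app=podinfo,environment=dev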
Alerts
36
What? Why? Options
Automated notifications on
defined trigger
You need to know when your
service degrades
Prometheus & Alertmanager
(Many other options)
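A minimal Prometheus alerting rule as a sketch (the metric selector, threshold, and severity label are assumptions; Alertmanager routing is configured separately):
groups:
- name: podinfo-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status="500"}[5m])) > 1
    for: 5m
    labels:
      severity: page
    annotations:
      summary: podinfo is serving a high rate of 5xx responses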
Structured Logging
37
What? Why? Options
Output logs in a machine
readable format to facilitate
searching & indexing
Trace what went wrong when
something does
ELK stack (Elasticsearch,
Logstash and Kibana)
Many commercial offerings
● Avoid logging to files
● Must have timestamps and basic levels (e.g. info, error, fatal)
● JSON logs/events are a love-or-hate choice
● KV formats are more human-friendly
Tracing Instrumentation
38
What? Why? Options
Instrumentation to send
request processing details to a
collection service.
Sometimes the only way of
figuring out where latency is
coming from
Zipkin, Lightstep, Appdash,
Tracer, Jaeger
● Trigger tracing from your gateway API
● Sample traces, don’t trace everything
● Costly to set up, but often the only meaningful way of debugging some latency issues
● Use something that supports the OpenTracing API
Graceful shutdowns
39
What? Why? Options
Applications respond to
SIGTERM correctly
This is how Kubernetes will tell
your application to end
-
● End in-flight transactions cleanly
● The default terminationGracePeriodSeconds (30 seconds) is quite long, and can be shortened
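A sketch of tuning this in a pod spec (the preStop sleep to let in-flight requests drain and the shortened grace period are illustrative choices):
spec:
  terminationGracePeriodSeconds: 15   # default is 30 seconds
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]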
Graceful dependencies
40
What? Why? Options
Applications don’t assume
dependencies are available.
Wait for other services before
reporting ready
Avoid headaches that come
with a service order
requirement
-
● Nice apps don’t crash-loop
● This is what the readiness probe was built for
ConfigMaps
41
What? Why? Options
Define a configuration file for
your application in Kubernetes
using configmaps
Easy to reconfigure an app
without rebuilding, allows
config to be versioned
-
● Mounting the configmap as a volume is the easiest option
● Environment variables are an alternative for simpler config
● Setting a file watch or polling means your application will take new config into account immediately
ConfigMap Example
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-cfg
data:
  .env: |
    APP_NAME=my-app
    APP_ENV=stg
    APP_KEY="base64:gFf47FZi6F9xDJiZiEmmKlePurMaXECKs1cA9hscIVc="
    APP_DEBUG=true
    APP_LOG_LEVEL=debug
    APP_URL=http://localhost
(Laravel PHP framework config as an example)
ConfigMap Mount Example
containers:
- name: my-php-app
  volumeMounts:
  - name: env
    mountPath: /var/www/.env
    subPath: .env
volumes:
- name: env
  configMap:
    name: my-app-cfg
ConfigMap Environment Variable Example
spec:
  containers:
  - name: test-container
    env:
    - name: SPECIAL_LEVEL_KEY
      valueFrom:
        configMapKeyRef:
          name: special-config
          key: special.how
Labeled images using commit sha
45
What? Why? Options
Label the docker images with
the code commit SHA
Makes tracing image to code
trivial
-
● Important to be able to trace back from a running application to the originating code
● If you reliably build your images with ${branch}-${short_git_hash} names, that might be enough
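Using the ${branch}-${short_git_hash} convention from the hands-on later today, an image reference might look like this (the registry path mirrors the training registry; the tag is an example):
containers:
- name: podinfo
  image: gcr.io/dx-training/user-podinfo:master-89b8396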
Locked down runtime context
46
What? Why? Options
Use deliberately secure
configuration for application
runtime context
Reduces attack surface,
makes privileges explicit
-
● if your app writes temporary files, be sure to use an emptyDir volume
● if your app has to initialise some data, do it with initContainers
● avoid installing packages or fetching files from unreliable locations
● if you can, try to use readOnlyRootFilesystem:true
● runAsUser, fsGroup and allowPrivilegeEscalation:false allow you to
control runtime context further
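A sketch of a deliberately restricted runtime context (the UID/GID values are assumptions; use whatever your image actually runs as):
spec:
  securityContext:
    fsGroup: 1000                      # pod-level group for mounted volumes
  containers:
  - name: app
    securityContext:
      runAsUser: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp                  # writable scratch space via emptyDir
  volumes:
  - name: tmp
    emptyDir: {}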
The Cluster checklist
47
❏ API Gateway
❏ Service Mesh
❏ Service catalogue / Broker
❏ Network policies
❏ Authorisation integration
❏ Image scanning
❏ Log aggregation
❏ Build pipeline
❏ Deployment pipeline
❏ Image registry
❏ Monitoring infrastructure
❏ Shared storage
❏ Secrets management
❏ Ingress controller
Build pipeline
48
What? Why? Options
Builds your code,
runs your tests
- -
● You have one already
● You should be able to use it
● Make sure artefacts are tagged with the Git commit SHA
Deployment pipeline
49
What? Why? Options
Takes build artefacts and puts
them in the cluster
- -
● Note this is a separate concern from your build pipeline
● This is where your approval process lives
● This is where GitOps lives – more on this later today
Image registry
50
What? Why? Options
Stores build artefacts
Keep versioned artefacts available
Roll your own
Commercial: Docker hub,
Quay.io, GCP Registry
● Key security point
● Great options available both on-prem and online
● Credentials need to be available to CI for push, and cluster for pull
Monitoring infrastructure
51
What? Why? Options
Collects and stores metrics
Understand your running system
Get alerts when something goes wrong
OSS: Prometheus, Cortex,
Thanos
Commercial: Datadog,
Grafana Cloud, Weave Cloud
● Flip side of metrics instrumentation
Shared Storage
52
What? Why? Options
Store persistent state of your
application beyond pod
lifetime
Stateless is a unicorn
Many. Will depend on platform.
● Seen by your application as a directory
● Volumes and Volume claims are different things
● May be read-only.
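A minimal PersistentVolumeClaim sketch (storage class, access mode and size depend entirely on your platform):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
The claim is then mounted into the pod like any other volume.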
Secrets Management
53
What? Why? Options
How your applications access secret credentials securely
Secrets are needed to use
external services
Bitnami Sealed Secrets
Hashicorp Vault
Ingress controller
54
What? Why? Options
Common routing point for
inbound traffic
Easier to manage
authentication and logging
Platform controller (AWS ELB)
GCE & NGinx (by Kubernetes)
Kong, Traefik, HAProxy, Istio,
Envoy
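A sketch of the Ingress resource such a controller acts on (the apiVersion is the one current at the time of this workshop; hostname and backend are illustrative):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: podinfo
spec:
  rules:
  - host: podinfo.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: podinfo
          servicePort: 9898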
API Gateway
55
What? Why? Options
Single point for incoming
requests. Higher layer ingress
controller.
Can route at HTTP level.
Enables common and
centralised tooling for tracing,
logging, authentication.
Ambassador (Envoy),
roll-your-own
● Can replace the ingress controller
● Ambassador is Kubernetes native
Service mesh
56
What? Why? Options
Additional layer on top of
Kubernetes to manage routing
Enables complex use cases
and adds useful features
Linkerd, Istio
● May not be needed
● Can provide tracing without instrumentation
● Will run as sidecar on services
● Other features: Service to service TLS; Load balancing; Fine-grained traffic
policies; Service discovery; Service monitoring
Service catalogue / broker
57
What? Why? Options
Enables easy dependencies
on services and service
discovery for your team
Simplifies deploying
applications
-
● Kubernetes’ own service catalog API is worth mentioning
https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/
● Fits in really well with the role of service meshes
● Ease of use for developers can also be achieved with a central repository of
service configurations
● Still early days
Network policies
58
What? Why? Options
Rules on allowed connections
Prevent unauthorised access, improve security, segregate namespaces
Weave Net, Calico
● Node level (kernel) controls and restrictions of traffic
● Need a CNI plugin
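A sketch, assuming illustrative labels and the dev namespace: allow ingress to all pods in the namespace only from pods labelled app=frontend (this only takes effect with a CNI plugin that enforces NetworkPolicy).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: dev
spec:
  podSelector: {}            # applies to every pod in the namespace
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend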
Authorisation integration
59
What? Why? Options
API level integration into the
Kubernetes auth flow.
Use existing SSO, reduce the number of accounts, and centralise account management
-
● Will require some custom integration work
● Many hooks into the auth API
● Possible to integrate with almost any auth provider
Image scanning
60
What? Why? Options
Automated scanning of
vulnerability in your container
images
Because CVEs happen
Docker, Snyk, Twistlock, Sonatype, Clair (OSS)
● Definitely worth implementing into your CI pipeline.
● Tools can be integrated with your PR process to provide comments on commits
Log Aggregation
61
What? Why? Options
Bring all logs from application
into a searchable place
Logs are the best source of
information on what went
wrong
Lots and lots and lots
Fluentd or ELK (Elasticsearch,
Logstash, Kibana) stack are
good bets for roll-your-own
62
63
Coffee
Time
Prometheus primer
64
65
Instrumenting your application
66
1. Tell Prometheus to scrape the service endpoint
67
2. Install an exporter or client library to expose metrics
68
3. Add custom metrics
69
Types of metrics
70
Counters
The only way is up
71
Counters
The only way is up
Gauges
What goes up must come down
72
Counters
The only way is up
Gauges
What goes up must come down
Histograms
Distribution of values
73
Metrics that matter
74
R.E.D. Metrics
Requests
Errors
Delays
75
Worked PromQL Example
76
http://chaotic-flow.com/media/saas-metrics-guide-to-saas-financial-performance.pdf
Joel York’s SaaS Metrics
http://chaotic-flow.com
77
78
79
80
81
We’ll need
subscribe_count
unsubscribe_count
(Both are counters)
82
ΔC_cancel = rate(unsubscribe_count[1m])
89
C = (total_signups offset 1m) -
(total_cancels offset 1m)
90
ChurnRate_month =
rate(total_cancels[1m]) /
((total_signups offset 1m) - (total_cancels offset 1m))
Prometheus in practice
92
1. Create the namespace we will use for this exercise
kubectl create namespace dev
Shortly, the Deploy agent will notice this change, and sync the
Deployment and Service files.
2. Watch for this happening in Weave Cloud or via:
watch kubectl -n dev get all
The podinfo application should be running in your cluster in the dev
namespace
Prometheus in Practice 💻
From your Cloud9 IDE console, run:
curl http://podinfo.dev:9898/metrics | less
And try to find these metrics that show:
● the number of open file descriptors
● the number of HTTP requests the pod has received
94
1 - Inspect the raw metrics directly 💻
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{status="200"} 136
Answers
95
● Each line on that page is either a comment or a time series
● A time series has a name, optional labels, and a series of values
● A collection of time series with the same name is a metric
Understanding metrics
96
For example:
http_requests_total{status="200"} 136
● name is http_requests_total
● one label, status, with label value "200"
● value is 136
Since the pod launched, we've received 136 HTTP requests with status
200.
Understanding metrics
97
On your Weave Cloud instance:
● Go to "Monitoring"
● Create a notebook
● Call it "Monitoring in Practice"
● Enter http_requests_total and then click "Run as Table"
Exercise: query metrics
98
2 - Query the metrics 💻
http_requests_total
99
100
Where did those extra labels come
from?
Our pod has grown instance, job, _weave_namespace, and
_weave_service labels.
These were added at the point of scraping so time series don't clash
with each other and so you can find the source of your data.
Some labels added automatically
101
102
Why do some lines say code and some
status?
Some pods use slightly different labels (e.g. code instead of status).
This highlights that Prometheus doesn't impose a schema on
labels—they are free-form.
Highly recommended that you form a consistent standard across your
key applications.
Label schema is flexible
103
What if we only want to see the data from our service?
In a new cell, run the following query as a table:
http_requests_total{_weave_service="podinfo"}
This only shows the time series which have labels that exactly match
those above. PromQL also support not equals (!=) and regular
expression matching (=~ and !~).
Filtering metrics
104
3 - Query the metrics using labels 💻
http_requests_total{_weave_namespace="dev", _weave_service="podinfo"}
105
What if we want to get the total requests for our whole cluster?
In a new cell, enter the following:
sum(http_requests_total)
This adds up all the requests to give us a single value.
Aggregating metrics
106
4 - Aggregate metrics using functions 💻
sum(http_requests_total)
107
If you look at our original query, you see there are separate lines for
each replica. Multiple rows refer to kube-dns or kubelet. How do we
aggregate those metrics together?
In a new cell, run the following query:
sum(http_requests_total) by (_weave_namespace, _weave_service)
Note only the labels in our by clause are preserved.
Aggregating metrics
108
5 - Aggregate metrics by labels 💻
sum(http_requests_total) by (_weave_namespace, _weave_service)
109
Differentiating metrics
Look at the graph view of our first
query. What's the deal with these lines
going up all the time?
http_requests_total is a counter. It
goes up by one every time there's an
HTTP request. It never goes down.
What if we wanted to see requests per
second?
110
In a new cell, run:
rate(http_requests_total[1m])
and make sure to see the graph view. What do you see?
Try changing the time interval from 1m to other values (5m, 2h, 10s).
What do you think is happening there?
Differentiating metrics
111
6 - Derive a gauge from a counter 💻
rate(http_requests_total[1m])
112
We now know enough to get a graph of HTTP requests per second for
dev/podinfo that will work regardless of how many replicas it has.
Create a query that results in a graph of HTTP request rate for
dev/podinfo. It will look like the below.
Putting it all together
113
7 - Create a custom query 💻
sum(
rate(
http_requests_total{
_weave_namespace="dev",
_weave_service="podinfo"}[1m]))
Answer
114
That graph is a bit boring. Let's make it more interesting by
generating some traffic.
Open a Weave Cloud terminal window into this container and run:
hey -z 2m http://podinfo.dev:9898/error
This will run for 2 minutes, sending many many requests to the error
endpoint on podinfo.
Generating traffic
115
Overall status with error spike
116
117
Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think of metrics is which domain they’re in
● It’s trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring)
and ad-hoc querying (observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
DevOps and GitOps
118
119
GitOps is...
12
0
GitOps is...
An operation model
12
1
GitOps is...
An operation model
Derived from CS and operation knowledge
12
2
GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
12
3
GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
12
4
GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
Although
Weaveworks
can help
with how
12
5
GitOps is...
An operation model
Derived from CS and operation knowledge
Technology agnostic (name notwithstanding)
A set of principles (Why instead of How)
A way to speed up your team
Principles of GitOps
12
6
12
7
1 The entire system is described declaratively.
12
8
1 The entire system is described declaratively.
Beyond code, data ⇒
Implementation independent
Easy to abstract in simple ways
Easy to validate for correctness
Easy to generate & manipulate from code
12
9
1 The entire system is described declaratively.
Beyond code, data ⇒
Implementation independent
Easy to abstract in simple ways
Easy to validate for correctness
Easy to generate & manipulate from code
13
0
How is that different from
Infrastructure as code?
13
1
How is that different from
Infrastructure as code?
It’s about consistency in the
failure case.
13
2
It’s about consistency in the
failure case.
When imperative systems
fail, the system ends up in
an unknown, inconsistent
state.
13
3
fail, the system ends up in
an unknown, inconsistent
state.
Declarative changes let you
think of changes as
transactions.
13
4
Declarative changes let you
think of changes as
transactions.
This is a very good thing.
13
5
The canonical desired system state is versioned
(with Git)
2
13
6
The canonical desired system state is versioned
(with Git)
Canonical Source of Truth (DRY)
With declarative definition, trivialises rollbacks
Excellent security guarantees for auditing
Sophisticated approval processes (& existing workflows)
Great Software ↔ Human collaboration point
2
13
7
Changes to the desired state are
automatically applied to the system
3
13
8
Approved changes to the desired state are
automatically applied to the system
Significant velocity gains
Privileged operators don’t cross security boundaries
Separates What and How.
3
13
9
Software agents ensure correctness
and alert on divergence
4
14
0
Software agents ensure correctness
and alert on divergence
4
Continuously checking that desired state is met
System can self heal
Recovers from errors without intervention (PEBKAC)
It’s the control loop for your operations
14
1
1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Approved changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
Gitops is Functional Reactive Programming…
...for your infrastructure.
Like React, but for servers and applications.
What should be GitOps’ed?
14
3
What should be GitOps’ed?
14
4
145
Canonical
source of truth
People
Software
Agents
Software
Agents
146
?
Dashboards
Alerts
Playbook
Kubernetes Manifests
Application configuration
Provisioning scripts
147
Application checklists
Recording Rules
Sealed Secrets
148
14
9
Declare
Implement
Monitor /
Observe
Modify
15
0
Declare
Implement
Modify
Continuous
Deployment
Default
dashboards
Automated by
software
agents
Monitor /
Observe
15
1
Declare
Implement
Modify
Continuous
Deployment
Default
dashboards
Automated by
software
agents
Monitor /
Observe
Software
making
commits
15
2
Declare
Implement
Modify
Continuous
Deployment
Default
dashboards
Automated by
software
agents
Monitor /
Observe
Safe and
reversible
changes
15
3
Declare
Implement
Modify
Continuous
Deployment
Default
dashboards
Automated by
software
agents
Monitor /
Observe
Automated,
templated
dashboards
15
4
Feedback loop.
This is what matters.
15
5
Lunch
Time
Gitops in practice
15
6
[Only do this step if you didn’t do it in your cluster earlier]
Create the namespace we will use for this exercise:
kubectl create namespace dev
Shortly, the Deploy agent will notice this change, and sync the Deployment and
Service files.
Watch for this happening in Weave Cloud or via:
watch kubectl -n dev get all
Gitops Hands On 1/12 💻
We’re going to make a code change and see it flow through CI, then
deploy that change.
Call the version endpoint on the service to see what is running:
curl podinfo.dev:9898/version
Gitops Hands On 2/12 💻
In the editor, open podinfo/pkg/version/version.go, increment the
version number and save:
var VERSION = "0.3.1"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.1 to dev" .
git push
Gitops Hands On 3/12 💻
The CI pipeline will create an image tagged the same as the git commit
Git said something like [master 89b8396]; the tag will be like
master-89b8396
Check by listing image tags (replace user with your username):
gcloud container images list-tags gcr.io/dx-training/USER-podinfo
USER should be of the form “training-user-<number>”.
Gitops Hands On 4/12 💻
Navigate in the editor to workspace/cluster/un-workshop/dev and open
podinfo-dep.yaml.
Where it says image:
replace quay.io/stefanprodan/podinfo with gcr.io/dx-training/USER-podinfo
replace the tag 0.3.0 with your tag master-TAG
Save the file and commit your changes and push to master:
cd ../cluster/un-workshop/dev
git commit -m "my first deploy" .
git push
Gitops Hands On 5/12 💻
Check in Weave Cloud to see when it has synced the Deployment.
Call the version endpoint on the service to see if it changed:
curl podinfo.dev:9898/version
Gitops Hands On 6/12 💻
Editing the YAML file was tedious.
Let’s automate it!
163
In Weave Cloud Deploy, find the podinfo Deployment in dev Namespace.
Click Automate.
This creates a commit, because git is our single source of truth.
To keep things in sync, bring it into your workspace:
git pull
Gitops Hands On 7/12 💻
Open podinfo/pkg/version/version.go, increment the version number
again, and save:
var VERSION = "0.3.2"
Commit your changes and push to master:
cd /workspace/podinfo
git commit -m "release v0.3.2" .
git push
Gitops Hands On 8/12 💻
Watch for the CI/CD to upgrade the app to 0.3.2:
watch curl podinfo.dev:9898/version
Gitops Hands On 9/12 💻
Suppose we don’t like the latest version: we want to roll back.
1. In Weave Cloud Deploy, find the podinfo Deployment in dev
Namespace. Click Deautomate.
2. The UI shows a list of images - select the one you want and click
Release, then again to confirm.
3. Check again which version is running:
watch curl podinfo.dev:9898/version
Gitops Hands On 10/12 💻
We can flag that the latest build should not be deployed
1. In Weave Cloud Deploy, find the podinfo Deployment in dev
Namespace. Click 🔒Lock.
2. Give a reason, then click Lock again to confirm.
3. Each of these actions creates a git commit. Sync your workspace:
git pull
4. Reload /workspace/cluster/dev/podinfo-dep.yaml in the editor to see
how the lock is applied.
Gitops Hands On 11/12 💻
We can flow the version number through the pipeline with a git tag, to
show more meaningful versions
Create and push a git tag:
cd /workspace/podinfo
git tag 0.3.2
git push origin 0.3.2
This will trigger another CI build, and when that is finished you should
have an image tagged 0.3.2
Gitops Hands On 12/12 💻
170
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollback and lock are all done via git
● Git is our single source of truth.
Recap: GitOps CI/CD
Advanced Deployment patterns
17
1
Deployment Strategies
Kubernetes native
● Recreate
● Rolling update
● Blue/Green
Service mesh
● Canary
● A/B Testing
● Blue/Green + Dark Traffic
Recreate deployments
17
Recreate Deployment Strategy
A
C
B
D
Recreate Deployment Strategy
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  strategy:
    type: Recreate
Recreate Deployment Strategy
Pros
● Avoids versioning issues
● Avoids database schema incompatibilities
Cons
● Involves downtime between v1 complete shutdown and v2 startup
Suitable for
● Monolithic legacy applications
● Non production environments
Rolling Deployments
17
Rolling Deployment Strategy
A
C
B
D
Rolling Deployment Strategy
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # new pods added at a time
      maxUnavailable: 0
  minReadySeconds: 10
Rolling Deployment Strategy
Pros
● Low risk due to readiness checks
● Gradual rollout with no downtime
Cons
● Needs backwards compatibility between API versions and
database migrations
● No control over the traffic during the rollout
Suitable for
● Stateful applications & Stateless microservices
Blue/Green deployments
18
Blue/Green Deployment Strategy
Blue/Green Deployment Strategy
Kubernetes native deployment strategy
apiVersion: v1
kind: Service
spec:
  selector:
    app: podinfo
    version: v1  # switch the traffic from blue to green by changing the version to v2
Blue/Green Deployment Strategy
Suitable for
● Monolithic legacy applications
● Autonomous microservices
Pros
● Avoids versioning issues
● Instant rollout and rollback (while the blue deployment still exists)
Cons
● Requires resource duplication
● Data synchronisation between the two environments can lead to partial service interruption
Canary Deployments
18
Canary Deployment Strategy
Canary Deployment Strategy
Istio traffic management
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
spec:
  http:
  - match:
    - headers:
        x-user:
          exact: insider
    route:
    - destination:
        name: podinfo.prod.svc.cluster.local
        subset: canary
  - route:
    - destination:
        name: podinfo.prod.svc.cluster.local
        subset: ga
Canary Deployment Strategy
Suitable for
● User facing applications
● Stateless microservices
Pros
● Low impact as the new version is released only to a subset of users
● Controlled rollout with no downtime
● Fast rollback
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
Istio Canary Deployment - Initial State
18
9
All traffic is routed to the GA deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
http:
- route:
  - destination:
      name: podinfo
      subset: canary
    weight: 0
  - destination:
      name: podinfo
      subset: ga
    weight: 100
Istio Canary Deployment - Initial State
19
0
Istio Canary Deployment - Warm-Up
19
1
Route 10% of the traffic to the canary
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
http:
- route:
  - destination:
      name: podinfo
      subset: canary
    weight: 10
  - destination:
      name: podinfo
      subset: ga
    weight: 90
Istio Canary Deployment - Warm-Up
19
2
Istio Canary Deployment - Increase load
19
3
Istio Canary Deployment - Latency Monitoring
19
4
Istio Canary Deployment - CD Overview
19
5
19
6
https://flagger.app
A/B Testing
19
7
A/B Testing
Suitable for
● User facing applications
● Stateless microservices
Pros
● Allows advanced customer behaviour analysis
● Performance testing of different configurations in parallel
Cons
● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd)
● Needs backwards compatibility between API versions and database migrations
Canary + A/B Testing
Blue/Green and Dark traffic
20
Blue/Green + Dark Traffic Deployment Strategy
Blue/Green + Dark Traffic Deployment Strategy
Istio traffic mirroring
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
spec:
  destination:
    name: podinfo
  precedence: 2
  route:
  - labels:
      version: v1
    weight: 100
  - labels:
      version: v2
    weight: 0
  mirror:
    name: podinfo
    labels:
      version: v2
Blue/Green + Dark Traffic Deployment Strategy
Suitable for
● API based applications
● Autonomous microservices
Pros
● Test the green deployment without any impact for the end-user
● Uses real traffic minimising the risk of a faulty release
Cons
● Requires resource duplication
● Needs a traffic management solution / service mesh (Envoy, Istio)
20
4
Coffee
Time
Operating Kubernetes
20
5
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane
Operational practices for Kubernetes
206
20
7
Kubernetes component architecture
Diagram from https://speakerdeck.com/luxas with permission
Nodes
Control plane
Node 3
OS
Container
Runtime
Kubelet
Networking
Node 2
OS
Container
Runtime
Kubelet
Networking
Node 1
OS
Container
Runtime
Kubelet
Networking
API Server (REST API)
Controller Manager
(Controller Loops)
Scheduler
(Bind Pod to Node)
etcd (key-value DB, SSOT)
User
Legend:
CNI
CRI
OCI
Protobuf
gRPC
JSON
HA etcd cluster
External Load Balancer or DNS-based API server resolving
High-availability Kubernetes
Master A
API Server
Controller Manager
Scheduler
Shared certificates
etcd
etcd
etcd
Master B
API Server
Controller Manager
Scheduler
Shared certificates
Master C
API Server
Controller Manager
Scheduler
Shared certificates
Nodes
Kubelet 1
Kubelet 2
Kubelet 3
Kubelet 4
Kubelet 5
Diagram from https://speakerdeck.com/luxas with permission
208
● How Google Runs Production Systems
● SREs:
○ Have the skillset necessary to automate tasks
○ Do the same work as an operations team, but with
automation instead of manual labor
● SRE team responsible for latency, performance,
efficiency, change management, monitoring,
emergency response, and capacity planning
Site Reliability Engineering
209
When you need to reboot a worker node to install OS updates or do
hardware maintenance without disrupting your workloads you need
to perform the following operations:
● Evict all running pods except DaemonSets and StatefulSets
● Mark the node as unschedulable
● Perform maintenance work on the node
● Restart the node
● Make the node schedulable again
kubectl drain $NODE -> reboot $NODE -> kubectl uncordon $NODE
Worker nodes maintenance
210
● If all workloads are replicated, draining a node before rebooting is
not necessary. A node reboot that comes back in less than 5
minutes will not trigger any pod rescheduling.
● A drain operation can target multiple nodes, in order to protect
clustered applications you need to create Pod Disruption Budgets
to ensure the number of replicas running is never brought below
the minimum number needed for a quorum
Worker nodes maintenance
211
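A Pod Disruption Budget sketch that keeps at least two replicas of a hypothetical three-replica quorum up during drains (name, selector and minAvailable are illustrative):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-quorum-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-quorum-app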
Operations:
● Master node OS updates and hardware maintenance (low risk)
○ A master node reboot will not disrupt any running workloads
○ While the master node is offline, no scheduling operations will happen
○ Kured can help with this. (https://github.com/weaveworks/kured)
● Control plane upgrades (high risk)
○ For master nodes in-place upgrades, a full backup is recommended such as LVM
snapshots
○ Only one minor version upgrade is supported, you can only upgrade from 1.9 to 1.10, not
from 1.8 to 1.10
○ Test the upgrade procedure on a staging cluster before running it in production
Control plane maintenance
212
Kubeadm master nodes upgrade procedure:
● Download the most recent version of kubeadm using curl (do not
upgrade the kubeadm OS package)
● Run kubeadm upgrade plan to check if your cluster is upgradeable
● Pick a version to upgrade to and run kubeadm upgrade apply v1.10.2
● Upgrade your CNI by applying the new DaemonSet definition
● Drain the master node with kubectl drain $HOST --ignore-daemonsets
● Upgrade Kubernetes packages with apt-get update && apt-get upgrade
● Bring the master node back online with kubectl uncordon $MASTER
Control plane upgrades
213
Kubernetes Security
21
4
21
5
TWISTLOCK PRESENTATION
End of Day Recap
21
6
Agenda
21
7
9:00a Welcome & introduction
9:30a Getting started with your environment
10:00a What is “Production Ready?”
10:30a Break (15 minutes)
10:45a Monitoring a production cluster
11:45a Declarative infrastructure in practice
12:15p Lunch (1 hour)
1:15p Devops and GitOps in practice
2:15p Advanced Deployment Patterns
3:15p Break (15 minutes)
3:30p Operational practice for Kubernetes
4:00p Securing a Kubernetes cluster (by Twistlock)
5:00p Review and recap
❏ Readiness check
❏ Liveness check
❏ Metric instrumentation
❏ Dashboards
❏ Playbook
❏ Limits and requests
❏ Labels and annotations
The Application checklist
21
8
❏ Alerts
❏ Structured logging output
❏ Tracing instrumentation
❏ Graceful shutdowns
❏ Graceful dependency (w. readiness check)
❏ Configmaps
❏ Labeled images using commit sha
❏ Locked down runtime context
The Cluster checklist
21
9
❏ API Gateway
❏ Service Mesh
❏ Service catalogue / Broker
❏ Network policies
❏ Authorisation integration
❏ Image scanning
❏ Log aggregation
❏ Build pipeline
❏ Deployment pipeline
❏ Image registry
❏ Monitoring infrastructure
❏ Shared storage
❏ Secrets management
❏ Ingress controller
22
0
Recap: Monitoring In Practice
● There are different kinds of metrics
● A good way to think of metrics is which domain they’re in
● It’s trivial to instrument your applications
● Prometheus can be used for both metrics (monitoring) and ad-hoc querying
(observability)
● Simple instrumentation can yield deep insights
● PromQL deals with scalar and vector series
● PromQL has gauges, histograms and counters
● PromQL has many useful functions available
22
1
1 The entire system is described declaratively.
2 The canonical desired system state is versioned
(with Git)
3 Changes to the desired state are
automatically applied to the system
4 Software agents ensure correctness
and alert on divergence
222
● Having separate pipelines for CI and CD enables better security
● It’s also easier to deal with if a deployment goes wrong
● We built a few versions of a simple app, using a demo CI pipeline
● Deployed those versions to Kubernetes using Weave Cloud
● Automated the deployment
● Deployments, rollback and lock are all done via git
● Git is our single source of truth.
Recap: GitOps CI/CD
Deployment Strategies
Kubernetes native
● Recreate
● Rolling update
● Blue/Green
Service mesh
● Canary
● A/B Testing
● Blue/Green + Dark Traffic
● Kubernetes internal architecture
● High-availability Kubernetes
● Draining and cordoning a node for reboot
● Backing up and upgrading the Kubernetes control plane
Operational practices for Kubernetes
224
THANK YOU!
22
5
Craig Wright
craig@weave.works
brice@weave.works
@fractallambda
@weaveworks
https://weave.works
22
6
Home
Time

Weitere ähnliche Inhalte

Was ist angesagt?

The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
Weaveworks
 
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison DowdneySetting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
Weaveworks
 

Was ist angesagt? (20)

Docker New York City: From GitOps to a scalable CI/CD Pattern for Kubernetes
Docker New York City: From GitOps to a scalable CI/CD Pattern for KubernetesDocker New York City: From GitOps to a scalable CI/CD Pattern for Kubernetes
Docker New York City: From GitOps to a scalable CI/CD Pattern for Kubernetes
 
The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
 
Openshift argo cd_v1_2
Openshift argo cd_v1_2Openshift argo cd_v1_2
Openshift argo cd_v1_2
 
The journey to GitOps
The journey to GitOpsThe journey to GitOps
The journey to GitOps
 
GitOps is the best modern practice for CD with Kubernetes
GitOps is the best modern practice for CD with KubernetesGitOps is the best modern practice for CD with Kubernetes
GitOps is the best modern practice for CD with Kubernetes
 
GitOps Toolkit (Cloud Native Nordics Tech Talk)
GitOps Toolkit (Cloud Native Nordics Tech Talk)GitOps Toolkit (Cloud Native Nordics Tech Talk)
GitOps Toolkit (Cloud Native Nordics Tech Talk)
 
GitOps, Driving NGN Operations Teams 211127 #kcdgt 2021
GitOps, Driving NGN Operations Teams 211127 #kcdgt 2021GitOps, Driving NGN Operations Teams 211127 #kcdgt 2021
GitOps, Driving NGN Operations Teams 211127 #kcdgt 2021
 
Meetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOpsMeetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOps
 
Cloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOpsCloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOps
 
Terraform GitOps on Codefresh
Terraform GitOps on CodefreshTerraform GitOps on Codefresh
Terraform GitOps on Codefresh
 
Webinar: High velocity deployment with google cloud and weave cloud
Webinar: High velocity deployment with google cloud and weave cloudWebinar: High velocity deployment with google cloud and weave cloud
Webinar: High velocity deployment with google cloud and weave cloud
 
Speeding up your team with GitOps
Speeding up your team with GitOpsSpeeding up your team with GitOps
Speeding up your team with GitOps
 
Continuous Delivery the Hard Way with Kubernetes
Continuous Delivery the Hard Way with Kubernetes Continuous Delivery the Hard Way with Kubernetes
Continuous Delivery the Hard Way with Kubernetes
 
Is your kubernetes negative or positive
Is your kubernetes negative or positive Is your kubernetes negative or positive
Is your kubernetes negative or positive
 
Kubernetes GitOps featuring GitHub, Kustomize and ArgoCD
Kubernetes GitOps featuring GitHub, Kustomize and ArgoCDKubernetes GitOps featuring GitHub, Kustomize and ArgoCD
Kubernetes GitOps featuring GitHub, Kustomize and ArgoCD
 
Continuous Delivery NYC: From GitOps to an adaptable CI/CD Pattern for Kubern...
Continuous Delivery NYC: From GitOps to an adaptable CI/CD Pattern for Kubern...Continuous Delivery NYC: From GitOps to an adaptable CI/CD Pattern for Kubern...
Continuous Delivery NYC: From GitOps to an adaptable CI/CD Pattern for Kubern...
 
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison DowdneySetting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
Setting up Notifications, Alerts & Webhooks with Flux v2 by Alison Dowdney
 
Gitops: a new paradigm for software defined operations
Gitops: a new paradigm for software defined operationsGitops: a new paradigm for software defined operations
Gitops: a new paradigm for software defined operations
 
How we can do Multi-Tenancy on Kubernetes
How we can do Multi-Tenancy on KubernetesHow we can do Multi-Tenancy on Kubernetes
How we can do Multi-Tenancy on Kubernetes
 
Gitops: the kubernetes way
Gitops: the kubernetes wayGitops: the kubernetes way
Gitops: the kubernetes way
 

Ähnlich wie Kubecon seattle 2018 workshop slides

Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
QAware GmbH
 

Ähnlich wie Kubecon seattle 2018 workshop slides (20)

Where should I run my code? Serverless, Containers, Virtual Machines and more
Where should I run my code? Serverless, Containers, Virtual Machines and moreWhere should I run my code? Serverless, Containers, Virtual Machines and more
Where should I run my code? Serverless, Containers, Virtual Machines and more
 
Using eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster HealthUsing eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster Health
 
Intro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps WorkshopIntro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps Workshop
 
reBuy on Kubernetes
reBuy on KubernetesreBuy on Kubernetes
reBuy on Kubernetes
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
 
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
 
What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Kubernetes for java developers - Tutorial at Oracle Code One 2018
Kubernetes for java developers - Tutorial at Oracle Code One 2018Kubernetes for java developers - Tutorial at Oracle Code One 2018
Kubernetes for java developers - Tutorial at Oracle Code One 2018
 
Cloud Native Camel Design Patterns
Cloud Native Camel Design PatternsCloud Native Camel Design Patterns
Cloud Native Camel Design Patterns
 
Cloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesCloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit Kubernetes
 
Containers explained as for cook and a mecanics
 Containers explained as for cook and a mecanics  Containers explained as for cook and a mecanics
Containers explained as for cook and a mecanics
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
Phil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage makerPhil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage maker
 

Mehr von Weaveworks

SRE and GitOps for Building Robust Kubernetes Platforms.pdf
SRE and GitOps for Building Robust Kubernetes Platforms.pdfSRE and GitOps for Building Robust Kubernetes Platforms.pdf
SRE and GitOps for Building Robust Kubernetes Platforms.pdf
Weaveworks
 
How to Avoid Kubernetes Multi-tenancy Catastrophes
How to Avoid Kubernetes Multi-tenancy CatastrophesHow to Avoid Kubernetes Multi-tenancy Catastrophes
How to Avoid Kubernetes Multi-tenancy Catastrophes
Weaveworks
 

Mehr von Weaveworks (20)

Weave AI Controllers (Weave GitOps Office Hours)
Weave AI Controllers (Weave GitOps Office Hours)Weave AI Controllers (Weave GitOps Office Hours)
Weave AI Controllers (Weave GitOps Office Hours)
 
Flamingo: Expand ArgoCD with Flux (Office Hours)
Flamingo: Expand ArgoCD with Flux (Office Hours)Flamingo: Expand ArgoCD with Flux (Office Hours)
Flamingo: Expand ArgoCD with Flux (Office Hours)
 
Webinar: Capabilities, Confidence and Community – What Flux GA Means for You
Webinar: Capabilities, Confidence and Community – What Flux GA Means for YouWebinar: Capabilities, Confidence and Community – What Flux GA Means for You
Webinar: Capabilities, Confidence and Community – What Flux GA Means for You
 
Six Signs You Need Platform Engineering
Six Signs You Need Platform EngineeringSix Signs You Need Platform Engineering
Six Signs You Need Platform Engineering
 
SRE and GitOps for Building Robust Kubernetes Platforms.pdf
SRE and GitOps for Building Robust Kubernetes Platforms.pdfSRE and GitOps for Building Robust Kubernetes Platforms.pdf
SRE and GitOps for Building Robust Kubernetes Platforms.pdf
 
Webinar: End to End Security & Operations with Chainguard and Weave GitOps
Webinar: End to End Security & Operations with Chainguard and Weave GitOpsWebinar: End to End Security & Operations with Chainguard and Weave GitOps
Webinar: End to End Security & Operations with Chainguard and Weave GitOps
 
Flux Beyond Git Harnessing the Power of OCI
Flux Beyond Git Harnessing the Power of OCIFlux Beyond Git Harnessing the Power of OCI
Flux Beyond Git Harnessing the Power of OCI
 
Automated Provisioning, Management & Cost Control for Kubernetes Clusters
Automated Provisioning, Management & Cost Control for Kubernetes ClustersAutomated Provisioning, Management & Cost Control for Kubernetes Clusters
Automated Provisioning, Management & Cost Control for Kubernetes Clusters
 
How to Avoid Kubernetes Multi-tenancy Catastrophes
How to Avoid Kubernetes Multi-tenancy CatastrophesHow to Avoid Kubernetes Multi-tenancy Catastrophes
How to Avoid Kubernetes Multi-tenancy Catastrophes
 
Building internal developer platform with EKS and GitOps
Building internal developer platform with EKS and GitOpsBuilding internal developer platform with EKS and GitOps
Building internal developer platform with EKS and GitOps
 
GitOps Testing in Kubernetes with Flux and Testkube.pdf
GitOps Testing in Kubernetes with Flux and Testkube.pdfGitOps Testing in Kubernetes with Flux and Testkube.pdf
GitOps Testing in Kubernetes with Flux and Testkube.pdf
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and Linkerd
 
Implementing Flux for Scale with Soft Multi-tenancy
Implementing Flux for Scale with Soft Multi-tenancyImplementing Flux for Scale with Soft Multi-tenancy
Implementing Flux for Scale with Soft Multi-tenancy
 
Accelerating Hybrid Multistage Delivery with Weave GitOps on EKS
Accelerating Hybrid Multistage Delivery with Weave GitOps on EKSAccelerating Hybrid Multistage Delivery with Weave GitOps on EKS
Accelerating Hybrid Multistage Delivery with Weave GitOps on EKS
 
The Story of Flux Reaching Graduation in the CNCF
The Story of Flux Reaching Graduation in the CNCFThe Story of Flux Reaching Graduation in the CNCF
The Story of Flux Reaching Graduation in the CNCF
 
Shift Deployment Security Left with Weave GitOps & Upbound’s Universal Crossp...
Shift Deployment Security Left with Weave GitOps & Upbound’s Universal Crossp...Shift Deployment Security Left with Weave GitOps & Upbound’s Universal Crossp...
Shift Deployment Security Left with Weave GitOps & Upbound’s Universal Crossp...
 
Securing Your App Deployments with Tunnels, OIDC, RBAC, and Progressive Deliv...
Securing Your App Deployments with Tunnels, OIDC, RBAC, and Progressive Deliv...Securing Your App Deployments with Tunnels, OIDC, RBAC, and Progressive Deliv...
Securing Your App Deployments with Tunnels, OIDC, RBAC, and Progressive Deliv...
 
Flux’s Security & Scalability with OCI & Helm Slides.pdf
Flux’s Security & Scalability with OCI & Helm Slides.pdfFlux’s Security & Scalability with OCI & Helm Slides.pdf
Flux’s Security & Scalability with OCI & Helm Slides.pdf
 
Flux Security & Scalability using VS Code GitOps Extension
Flux Security & Scalability using VS Code GitOps Extension Flux Security & Scalability using VS Code GitOps Extension
Flux Security & Scalability using VS Code GitOps Extension
 
Deploying Stateful Applications Securely & Confidently with Ondat & Weave GitOps
Deploying Stateful Applications Securely & Confidently with Ondat & Weave GitOpsDeploying Stateful Applications Securely & Confidently with Ondat & Weave GitOps
Deploying Stateful Applications Securely & Confidently with Ondat & Weave GitOps
 

Kürzlich hochgeladen

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
Kubecon Seattle 2018 workshop slides

  • 1. Your path to production ready Kubernetes Weaveworks – https://weave.works – @weaveworks Kubecon Seattle – December 2018 Brice Fernandes – brice@weave.works – @fractallambda Craig Wright – craig@weave.works – @c_r_w 1
  • 3. 3 Hi We work for Weaveworks as customer success engineers You can find Weaveworks at https://www.weave.works or @weaveworks The team at Weaveworks is behind the GitOps model You can find us online at @fractallambda and @c_r_w
  • 4. ● Building cloud-native OSS since 2014 (Weave Net, Moby, Kubernetes, Prometheus) ● Founding member of CNCF ● Alexis Richardson (Weaveworks CEO) is chair of the CNCF Technical Oversight Committee ● Weave Cloud runs on Kubernetes since 2015 4 About Weaveworks
  • 6. 6
  • 7. 7
  • 8. 8
  • 9. 9
  • 10. Agenda 10 9:00a Welcome & introduction 9:30a Getting started with your environment 10:00a What is “Production Ready?” 10:30a Break (15 minutes) 10:45a Monitoring a production cluster 11:45a Declarative infrastructure in practice 12:15p Lunch (1 hour) 1:15p Devops and GitOps in practice 2:15p Advanced Deployment Patterns 3:15p Break (15 minutes) 3:30p Operational practice for Kubernetes 4:00p Securing a Kubernetes cluster (by Twistlock) 5:00p Review and recap
  • 11. Some assumptions 11 ➔ You can use the command line. ➔ You can use Git. ➔ You know what Kubernetes Pods, Deployment, and Services are. ➔ You have a modern web browser.
  • 12. Kubernetes need to know 12 Pods containers Deployments Containers - Run Docker images, an immutable copy of your application code and all code dependencies in an isolated environment. Pods - A set of containers, co-scheduled on one machine. Ephemeral. Has unique IP. Has labels. Deployment - Ensures a certain number of replicas of a pod are running across the cluster. Service - Gets virtual IP, mapped to endpoints via labels. Named in DNS. Namespace - Resource names are scoped to a Namespace. Policy boundary.
  • 14. Getting started with your environment 14
  • 15. 15 Login to your cluster – Weave Cloud & C9 1. Go to tinyurl.com/kubecon18-cluster 2. Add your name and email 3. Check your email for links to your environment and your password (This may take a little while. Be patient while Craig invites you)
  • 16. 16
  • 19. Your Cluster 19 pod Icon by Freepik from www.flaticon.com
  • 20. Your Cluster 20 pod Cloud Source Repositories Container Builder Cloud Registry
  • 21. GitOps hands-on 1/10Kick the tires on your cluster 💻 1. Start with a simple command: ➤ kubectl version 2. Look at what’s running on the cluster with Weave Cloud Explore
  • 22. “DevOps Console” Tooling for deployment, visualisation and observability Weave Cloud 22
  • 25. GitOps hands-on 1/10 Ask Kubernetes what’s running on the cluster: ➤ kubectl get pods --all-namespaces Query Kubernetes 💻
  • 26. What is “Production Ready”? 26
  • 27. Bear with me while I go through this list. There will be a kitten at the end. Production Ready checklists 27
  • 28. ❏ Readiness check ❏ Liveness check ❏ Metric instrumentation ❏ Dashboards ❏ Playbook ❏ Limits and requests ❏ Labels and annotations The Application checklist 28 ❏ Alerts ❏ Structured logging output ❏ Tracing instrumentation ❏ Graceful shutdowns ❏ Graceful dependency (w. readiness check) ❏ Configmaps ❏ Labeled images using commit sha ❏ Locked down runtime context
  • 29. Liveness and Readiness probes 29 What? Why? Options Endpoints for Kubernetes to monitor your application lifecycle Allows Kubernetes to restart or stop traffic to a pod - ● Liveness failure is for telling Kubernetes to restart the pod ● Readiness failure is transient and tells Kubernetes to route traffic elsewhere ● Readiness failure is useful for startup and load management
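A minimal probe sketch for a container spec follows; the /healthz and /readyz paths, the port and the timings are illustrative assumptions, not taken from the workshop application.

    containers:
    - name: my-app
      livenessProbe:              # restart the container if this keeps failing
        httpGet:
          path: /healthz          # assumed health endpoint
          port: 9898
        initialDelaySeconds: 5
        periodSeconds: 10
      readinessProbe:             # stop routing traffic while this fails
        httpGet:
          path: /readyz           # assumed readiness endpoint; failures are treated as transient
          port: 9898
        periodSeconds: 5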
  • 30. Metric instrumentation 30 What? Why? Options Code and libraries used in your code to expose metrics Allows measuring operation of application and enables many more advanced use cases Prometheus, Newrelic, Datadog, many others ● Basic metrics are not optional ● Prometheus is a fantastic fit for Kubernetes in most cases
  • 31. Dashboards 31 What? Why? Options View of metrics Metrics are just data. They must be consumable by humans as well. Grafana Many commercial options
  • 32. Playbooks / Runbooks 32 What? Why? Options Rich guides for your engineers on how to operate the system and fault-find when things go wrong. Nobody is at their sharpest at 03:00 AM Knowledge deteriorates over time Confluence Markdown files Weave Cloud Notebooks ● Absolutely vital knowledge repository. ● Avoids the bus factor ● First port of call for operational issues ● Significantly speeds up new engineer induction ● Requires continuous work to maintain
  • 33. Limits and requests 33 What? Why? Options Explicit resource allocation for pods Allows Kubernetes to make good scheduling decisions - ● Requests are used when scheduling ● Limits prevent workloads from causing cascading failures ● Limits are a valuable safety net ● Available at the namespace level as well (see ResourceQuotas)
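As a sketch, requests and limits are declared per container; the values below are placeholders rather than recommendations.

    containers:
    - name: my-app
      resources:
        requests:                 # what the scheduler uses to place the pod
          cpu: 100m
          memory: 128Mi
        limits:                   # hard ceiling; exceeding the memory limit can get the container killed
          cpu: 500m
          memory: 256Mi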
  • 34. 34 If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure. If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage Limits - Official docs
  • 35. Labels and annotations 35 What? Why? Options Metadata held by Kubernetes Makes workload management easier and allows other tools to work with standard Kubernetes definitions - ● Useful to have a simple labelling plan ● Labels can be used in kubectl arguments as filters ● Annotations are a good way of layering functionality without the overhead of Custom Resource Definitions
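For illustration, a simple labelling plan might look like the snippet below; the label keys and the annotation are hypothetical examples of a convention, not required names. Labels like these can then drive filters such as kubectl get pods -l app=podinfo,env=dev.

    metadata:
      name: podinfo
      labels:
        app: podinfo              # selectable with kubectl get pods -l app=podinfo
        env: dev
        team: platform            # hypothetical ownership label
      annotations:
        example.com/runbook: "https://wiki.example.com/podinfo"   # hypothetical annotation consumed by tooling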
  • 36. Alerts 36 What? Why? Options Automated notifications on defined trigger You need to know when your service degrades Prometheus & Alertmanager (Many other options)
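A sketch of an alerting rule in Prometheus' rule-file format; the metric, threshold, duration and labels are illustrative only.

    groups:
    - name: podinfo.rules
      rules:
      - alert: HighErrorRate                                        # hypothetical alert name
        expr: sum(rate(http_requests_total{status="500"}[5m])) > 1  # fire when the 5xx rate exceeds 1 req/s
        for: 5m                                                     # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "podinfo is returning 5xx responses"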
  • 37. Structured Logging 37 What? Why? Options Output logs in a machine-readable format to facilitate searching & indexing Trace what went wrong when something does ELK stack (Elasticsearch, Logstash and Kibana) Many commercial offerings ● Avoid logging to files ● Must have timestamps and basic levels (e.g. info, error, fatal) ● JSON logs/events are a love-it-or-hate-it format ● KV formats are more human-friendly
  • 38. Tracing Instrumentation 38 What? Why? Options Instrumentation to send request processing details to a collection service. Sometimes the only way of figuring out where latency is coming from Zipkin, Lightstep, Appdash, Tracer, Jaeger ● Trigger tracing from your gateway API ● Sample traces, don’t trace everything ● Costly to set up, but the only meaningful way of debugging some latency issues. ● Use something that supports the OpenTracing API
  • 39. Graceful shutdowns 39 What? Why? Options Applications respond to SIGTERM correctly This is how Kubernetes will tell your application to stop - ● Finish in-flight transactions before exiting ● Default terminationGracePeriodSeconds is quite long, and can be shortened
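On the Kubernetes side the shutdown behaviour is shaped by the pod spec; a sketch with assumed values, alongside the application handling SIGTERM itself:

    spec:
      terminationGracePeriodSeconds: 20          # default is 30s; shorten it if your app stops faster
      containers:
      - name: my-app
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 5"]   # assumed: small delay so endpoint removal propagates before shutdown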
  • 40. Graceful dependencies 40 What? Why? Options Applications don’t assume dependencies are available. Wait for other services before reporting ready Avoids the headaches that come with a required service start-up order - ● Nice apps don’t crash-loop ● This is what the readiness probe was built for
  • 41. ConfigMaps 41 What? Why? Options Define a configuration file for your application in Kubernetes using configmaps Easy to reconfigure an app without rebuilding, allows config to be versioned - ● Mounting the ConfigMap as a volume is the easiest option ● Environment variables are an alternative for simpler config ● Setting a file watch or polling means your application picks up new config immediately
  • 42. ConfigMap Example apiVersion: v1 kind: ConfigMap metadata: name: my-app-cfg data: .env: | APP_NAME=my-app APP_ENV=stg APP_KEY="base64:gFf47FZi6F9xDJiZiEmmKlePurMaXECKs1cA9hscIVc=" APP_DEBUG=true APP_LOG_LEVEL=debug APP_URL=http://localhost
  • 43. ConfigMap Mount Example containers: - name: my-php-app volumeMounts: - name: env mountPath: /var/www/.env subPath: .env volumes: - name: env configMap: name: my-app-cfg
  • 44. ConfigMap Environment Variable Example spec: containers: - name: test-container env: - name: SPECIAL_LEVEL_KEY valueFrom: configMapKeyRef: name: special-config key: special.how
  • 45. Labeled images using commit sha 45 What? Why? Options Label the docker images with the code commit SHA Makes tracing image to code trivial - ● Important to be able to trace back from a running application to the originating code ● If you reliably build your images with ${branch}-${short_git_hash} names, that might be enough
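In a CI step this might look like the following; the registry path and image name are placeholders.

    GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
    GIT_SHA=$(git rev-parse --short HEAD)
    docker build -t gcr.io/my-project/podinfo:${GIT_BRANCH}-${GIT_SHA} .
    docker push gcr.io/my-project/podinfo:${GIT_BRANCH}-${GIT_SHA}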
  • 46. Locked down runtime context 46 What? Why? Options Use deliberately secure configuration for application runtime context Reduces attack surface, makes privileges explicit - ● if your app writes temporary files, be sure to use an emptyDir volume ● if your app has to initialise some data, do it with initContainers ● avoid installing packages or fetching files from unreliable locations ● if you can, try to use readOnlyRootFilesystem:true ● runAsUser, fsGroup and allowPrivilegeEscalation:false allow you to control the runtime context further
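A sketch of a locked-down context; the UID/GID values are arbitrary examples.

    spec:
      securityContext:
        runAsUser: 10001                     # run as a non-root user (example UID)
        fsGroup: 10001
      containers:
      - name: my-app
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: tmp
          mountPath: /tmp                    # writable scratch space since the root filesystem is read-only
      volumes:
      - name: tmp
        emptyDir: {}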
  • 47. The Cluster checklist 47 ❏ API Gateway ❏ Service Mesh ❏ Service catalogue / Broker ❏ Network policies ❏ Authorisation integration ❏ Image scanning ❏ Log aggregation ❏ Build pipeline ❏ Deployment pipeline ❏ Image registry ❏ Monitoring infrastructure ❏ Shared storage ❏ Secrets management ❏ Ingress controller
  • 48. Build pipeline 48 What? Why? Options Builds your code, runs your tests - - ● You have one already ● You should be able to use it ● Make sure artefacts are tagged with the Git commit SHA
  • 49. Deployment pipeline 49 What? Why? Options Takes build artefacts and puts them in the cluster - - ● Note this is separate concern from your build pipeline. ● Where you have your approval process ● This is where Gitops lives – More later today
  • 50. Image registry 50 What? Why? Options Stores build artefacts Keep versioned artefacts available Roll your own Commercial: Docker hub, Quay.io, GCP Registry ● Key security point ● Great options available both on-prem and online ● Credentials need to be available to CI for push, and cluster for pull
  • 51. Monitoring infrastructure 51 What? Why? Options Collects and stores metrics Understand your running system Get alerts when something goes wrong OSS: Prometheus, Cortex, Thanos Commercial: Datadog, Grafana Cloud, Weave Cloud ● Flip side of metrics instrumentation
  • 52. Shared Storage 52 What? Why? Options Store persistent state of your application beyond pod lifetime Stateless is a unicorn Many. Will depend on platform. ● Seen by your application as a directory ● Volumes and Volume claims are different things ● May be read-only.
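A minimal PersistentVolumeClaim sketch; the name, size and access mode are illustrative.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-app-data
    spec:
      accessModes:
      - ReadWriteOnce               # single-node read-write; other modes depend on the storage backend
      resources:
        requests:
          storage: 10Gi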
  • 53. Secrets Management 53 What? Why? Options How your applications access secret credentials securely Secrets are needed to use external services Bitnami Sealed Secrets Hashicorp Vault
  • 54. Ingress controller 54 What? Why? Options Common routing point for inbound traffic Easier to manage authentication and logging Platform controller (AWS ELB) GCE & NGINX (by Kubernetes) Kong, Traefik, HAProxy, Istio, Envoy
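For reference, a minimal Ingress object routed by whichever controller you run; the hostname and service are made up, and the API group shown is the one current at the time of this workshop (newer clusters use networking.k8s.io/v1).

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: podinfo
    spec:
      rules:
      - host: podinfo.example.com   # placeholder hostname
        http:
          paths:
          - path: /
            backend:
              serviceName: podinfo  # routes to the podinfo Service
              servicePort: 9898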
  • 55. API Gateway 55 What? Why? Options Single point for incoming requests. A higher-layer ingress controller. Can route at the HTTP level. Enables common and centralised tooling for tracing, logging, authentication. Ambassador (Envoy), roll-your-own ● Can replace the ingress controller ● Ambassador is Kubernetes native
  • 56. Service mesh 56 What? Why? Options Additional layer on top of Kubernetes to manage routing Enables complex use cases and adds useful features Linkerd, Istio ● May not be needed ● Can provide tracing without instrumentation ● Will run as sidecar on services ● Other features: Service to service TLS; Load balancing; Fine-grained traffic policies; Service discovery; Service monitoring
  • 57. Service catalogue / broker 57 What? Why? Options Enables easy dependencies on services and service discovery for your team Simplifies deploying applications - ● Kubernetes’ own service catalog API is worth mentioning https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/ ● Fits in really well with the role of service meshes ● Ease of use for developers can also be achieved with a central repository of service configurations ● Still early days
  • 58. Network policies 58 What? Why? Options Rules on allowed connections Prevent unauthorised access, improve security, segregate namespaces Weave Net, Calico ● Node level (kernel) controls and restrictions of traffic ● Need a CNI plugin
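As an example, a policy that only lets pods labelled app=frontend reach podinfo in the dev namespace; the labels are assumptions.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: podinfo-ingress
      namespace: dev
    spec:
      podSelector:
        matchLabels:
          app: podinfo              # the pods this policy protects
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: frontend         # assumed label on the allowed clients
        ports:
        - protocol: TCP
          port: 9898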
  • 59. Authorisation integration 59 What? Why? Options API-level integration into the Kubernetes auth flow. Use existing SSO, reduce the number of accounts and centralise account management - ● Will require some custom integration work ● Many hooks into the auth API ● Possible to integrate with almost any auth provider
  • 60. Image scanning 60 What? Why? Options Automated scanning of vulnerability in your container images Because CVEs happen Docker, Snyk, Twistlock, Sonatype, Clair (OSS) ● Definitely worth implementing into your CI pipeline. ● Tools can be integrated with your PR process to provide comments on commits
  • 61. Log Aggregation 61 What? Why? Options Bring all logs from your applications into a searchable place Logs are the best source of information on what went wrong Lots and lots and lots Fluentd or the ELK (Elasticsearch, Logstash, Kibana) stack are good bets for roll-your-own
  • 62. 62
  • 66. 66
  • 67. 67
  • 68. 68
  • 72. 72 Counters, Gauges, Histograms
  • 77. 77
  • 78. 78
  • 79. 79
  • 80. 80
  • 89. 89 C = (total_signups offset 1m) - (total_cancels offset 1m)
  • 93. GitOps hands-on 4/10 1. Create the namespace we will use for this exercise kubectl create namespace dev Shortly, the Deploy agent will notice this change, and sync the Deployment and Service files. 2. Watch for this happening in Weave Cloud or via: watch kubectl -n dev get all The podinfo application should be running in your cluster in the dev namespace Prometheus in Practice 💻
  • 94. From your Cloud9 IDE console, run: curl http://podinfo.dev:9898/metrics | less And try to find these metrics that show: ● the number of open file descriptors ● the number of HTTP requests the pod has received 94 1 - Inspect the raw metrics directly 💻
  • 95. # HELP process_open_fds Number of open file descriptors. # TYPE process_open_fds gauge process_open_fds 7 # HELP http_requests_total The total number of HTTP requests. # TYPE http_requests_total counter http_requests_total{status="200"} 136 Answers 95
  • 96. ● Each line on that page is either a comment or a time series ● A time series has a name, optional labels, and a series of values ● A collection of time series with the same name is a metric Understanding metrics 96
  • 97. For example: http_requests_total{status="200"} 136 ● name is http_requests_total ● one label, status, with label value "200" ● value is 136 Since the pod launched, we've received 136 HTTP requests with status 200. Understanding metrics 97
  • 98. On your Weave Cloud instance: ● Go to "Monitoring" ● Create a notebook ● Call it "Monitoring in Practice" ● Enter http_requests_total and then click "Run as Table" Exercise: query metrics 98 2 - Query the metrics 💻
  • 100. 100 Where did those extra labels come from?
  • 101. Our pod has grown instance, job, _weave_namespace, and _weave_service labels. These were added at the point of scraping so time series don't clash with each other and so you can find the source of your data. Some labels added automatically 101
  • 102. 102 Why do some lines say code and some status?
  • 103. Some pods use slightly different labels (e.g. code instead of status). This highlights that Prometheus doesn't impose a schema on labels—they are free-form. Highly recommended that you form a consistent standard across your key applications. Label schema is flexible 103
  • 104. What if we only want to see the data from our service? In a new cell, run the following query as a table: http_requests_total{_weave_service="podinfo"} This only shows the time series which have labels that exactly match those above. PromQL also support not equals (!=) and regular expression matching (=~ and !~). Filtering metrics 104 3 - Query the metrics using labels 💻
  • 106. What if we want to get the total requests for our whole cluster? In a new cell, enter the following: sum(http_requests_total) This adds up all the requests to give us a single value. Aggregating metrics 106 4 - Aggregate metrics using functions 💻
  • 108. If you look at our original query, you see there are separate lines for each replica. Multiple rows refer to kube-dns or kubelet. How do we aggregate those metrics together? In a new cell, run the following query: sum(http_requests_total) by (_weave_namespace, _weave_service) Note only the labels in our by clause are preserved. Aggregating metrics 108 5 - Aggregate metrics by labels 💻
  • 110. Differentiating metrics Look at the graph view of our first query. What's the deal with these lines going up all the time? http_requests_total is a counter. It goes up by one every time there's an HTTP request. It never goes down. What if we wanted to see requests per second? 11 0
  • 111. In a new cell, run: rate(http_requests_total[1m]) and make sure to see the graph view. What do you see? Try changing the time interval from 1m to other values (5m, 2h, 10s). What do you think is happening there? Differentiating metrics 111 6 - Derive a gauge from a counter 💻
  • 113. We now know enough to get a graph of HTTP requests per second for dev/podinfo that will work regardless of how many replicas it has. Create a query that results in a graph of HTTP request rate for dev/podinfo. It will look like the below. Putting it all together 113 7 - Create a custom query 💻
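One way to write it, combining the rate and sum-by patterns from the previous steps (the 1m window is an arbitrary choice):

    sum(rate(http_requests_total{_weave_namespace="dev", _weave_service="podinfo"}[1m]))

This gives the per-second request rate across all podinfo replicas, regardless of how many there are.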
  • 115. That graph is a bit boring. Let's make it more interesting by generating some traffic. Open a Weave Cloud terminal window into this container and run: hey -z 2m http://podinfo.dev:9898/error This will run for 2 minutes, sending many many requests to the error endpoint on podinfo. Generating traffic 115
  • 116. Overall status with error spike 11 6
  • 117. 117 Recap: Monitoring In Practice ● There are different kinds of metrics ● A good way to think of metrics is which domain they’re in ● It’s trivial to instrument your applications ● Prometheus can be used for both metrics (monitoring) and ad-hoc querying (observability) ● Simple instrumentation can yield deep insights ● PromQL deals with scalar and vector series ● PromQL has gauges, histograms and counters ● PromQL has many useful functions available
  • 121. 12 1 GitOps is... An operation model Derived from CS and operation knowledge
  • 122. 12 2 GitOps is... An operation model Derived from CS and operation knowledge Technology agnostic (name notwithstanding)
  • 123. 12 3 GitOps is... An operation model Derived from CS and operation knowledge Technology agnostic (name notwithstanding) A set of principles (Why instead of How)
  • 124. 12 4 GitOps is... An operation model Derived from CS and operation knowledge Technology agnostic (name notwithstanding) A set of principles (Why instead of How) Although Weaveworks can help with how
  • 125. 12 5 GitOps is... An operation model Derived from CS and operation knowledge Technology agnostic (name notwithstanding) A set of principles (Why instead of How) A way to speed up your team
  • 127. 12 7 1 The entire system is described declaratively.
  • 128. 12 8 1 The entire system is described declaratively. Beyond code, data ⇒ Implementation independent Easy to abstract in simple ways Easy to validate for correctness Easy to generate & manipulate from code
  • 129. 12 9 1 The entire system is described declaratively. Beyond code, data ⇒ Implementation independent Easy to abstract in simple ways Easy to validate for correctness Easy to generate & manipulate from code
  • 130. 13 0 How is that different from Infrastructure as code?
  • 131. 13 1 How is that different from Infrastructure as code? It’s about consistency in the failure case.
  • 132. 13 2 It’s about consistency in the failure case. When imperative systems fail, the system ends up in an unknown, inconsistent state.
  • 133. 13 3 fail, the system ends up in an unknown, inconsistent state. Declarative changes let you think of changes as transactions.
  • 134. 13 4 Declarative changes let you think of changes as transactions. This is a very good thing.
  • 135. 13 5 The canonical desired system state is versioned (with Git) 2
  • 136. 13 6 The canonical desired system state is versioned (with Git) Canonical Source of Truth (DRY) With declarative definition, trivialises rollbacks Excellent security guarantees for auditing Sophisticated approval processes (& existing workflows) Great Software ↔ Human collaboration point 2
  • 137. 13 7 Changes to the desired state are automatically applied to the system 3
  • 138. 13 8 Approved changes to the desired state are automatically applied to the system Significant velocity gains Privileged operators don’t cross security boundaries Separates What and How. 3
  • 139. 13 9 Software agents ensure correctness and alert on divergence 4
  • 140. 14 0 Software agents ensure correctness and alert on divergence 4 Continuously checking that desired state is met System can self heal Recovers from errors without intervention (PEBKAC) It’s the control loop for your operations
  • 141. 14 1 1 The entire system is described declaratively. 2 The canonical desired system state is versioned (with Git) 3 Approved changes to the desired state are automatically applied to the system 4 Software agents ensure correctness and alert on divergence
  • 142. Gitops is Functional Reactive Programming… ...for your infrastructure. Like React, but for servers and applications.
  • 143. What should be GitOps’ed? 14 3
  • 144. What should be GitOps’ed? 144
  • 146. 146 ?
  • 147. Dashboards Alerts Playbook Kubernetes Manifests Application configuration Provisioning scripts 147 Application checklists Recording Rules Sealed Secrets
  • 148. 148
  • 154. 15 4 Feedback loop. This is what matters.
  • 157. GitOps hands-on 4/10 [Only do this step if you didn’t do it in your cluster earlier] Create the namespace we will use for this exercise: kubectl create namespace dev Shortly, the Deploy agent will notice this change, and sync the Deployment and Service files. Watch for this happening in Weave Cloud or via: watch kubectl -n dev get all Gitops Hands On 1/12 💻
  • 158. GitOps hands-on 5/10 We’re going to make a code change and see it flow through CI, then deploy that change. Call the version endpoint on the service to see what is running: curl podinfo.dev:9898/version Gitops Hands On 2/12 💻
  • 159. GitOps hands-on 7/10 In the editor, open podinfo/pkg/version/version.go, increment the version number and save: var VERSION = "0.3.1" Commit your changes and push to master: cd /workspace/podinfo git commit -m "release v0.3.1 to dev" . git push Gitops Hands On 3/12 💻
  • 160. GitOps hands-on 2/10 The CI pipeline will create an image tagged the same as the git commit Git said something like [master 89b8396]; the tag will be like master-89b8396 Check by listing image tags (replace user with your username): gcloud container images list-tags gcr.io/dx-training/USER-podinfo USER should be of the form “training-user-<number>”. Gitops Hands On 4/12 💻
  • 161. GitOps hands-on 3/10 Navigate in the editor to workspace/cluster/un-workshop/dev and open podinfo-dep.yaml. Where it says image: replace quay.io/stefanprodan/podinfo with gcr.io/dx-training/USER-podinfo replace the tag 0.3.0 with your tag master-TAG Save the file and commit your changes and push to master: cd ../cluster/un-workshop/dev git commit -m "my first deploy" . git push Gitops Hands On 5/12 💻
  • 162. GitOps hands-on 5/10 Check in Weave Cloud to see when it has synced the Deployment. Call the version endpoint on the service to see if it changed: curl podinfo.dev:9898/version Gitops Hands On 6/12 💻
  • 163. Editing the YAML file was tedious. Let’s automate it! 163
  • 164. GitOps hands-on 6/10 In Weave Cloud Deploy, find the podinfo Deployment in dev Namespace. Click Automate. This creates a commit, because git is our single source of truth. To keep things in sync, bring it into your workspace: git pull Gitops Hands On 7/12 💻
  • 165. GitOps hands-on 7/10 Open podinfo/pkg/version/version.go, increment the version number again, and save: var VERSION = "0.3.2" Commit your changes and push to master: cd /workspace/podinfo git commit -m "release v0.3.2" . git push Gitops Hands On 8/12 💻
  • 166. GitOps hands-on 8/10 Watch for the CI/CD to upgrade the app to 0.3.2: watch curl podinfo.dev:9898/version Gitops Hands On 9/12 💻
  • 167. GitOps hands-on 8/10 Suppose we don’t like the latest version: we want to roll back. 1. In Weave Cloud Deploy, find the podinfo Deployment in dev Namespace. Click Deautomate. 2. The UI shows a list of images - select the one you want and click Release, then again to confirm. 3. Check again which version is running: watch curl podinfo.dev:9898/version Gitops Hands On 10/12 💻
  • 168. GitOps hands-on 8/10 We can flag that the latest build should not be deployed 1. In Weave Cloud Deploy, find the podinfo Deployment in dev Namespace. Click 🔒Lock. 2. Give a reason, then click Lock again to confirm. 3. Each of these actions creates a git commit. Sync your workspace: git pull 4. Reload /workspace/cluster/dev/podinfo-dep.yaml in the editor to see how the lock is applied. Gitops Hands On 11/12 💻
  • 169. GitOps hands-on 7/10 We can flow the version number through the pipeline with a git tag, to show more meaningful versions Create and push a git tag: cd /workspace/podinfo git tag 0.3.2 git push origin 0.3.2 This will trigger another CI build, and when that is finished you should have an image tagged 0.3.2 Gitops Hands On 12/12 💻
  • 170. 170 ● Having separate pipelines for CI and CD enables better security ● It’s also easier to deal with if a deployment goes wrong ● We built a few versions of a simple app, using a demo CI pipeline ● Deployed those versions to Kubernetes using Weave Cloud ● Automated the deployment ● Deployments, rollback and lock are all done via git ● Git is our single source of truth. Recap: GitOps CI/CD
  • 172. Deployment Strategies Kubernetes native ● Recreate ● Rolling update ● Blue/Green Service mesh ● Canary ● A/B Testing ● Blue/Green + Dark Traffic
  • 175. Recreate Deployment Strategy apiVersion: apps/v1 kind: Deployment spec: replicas: 2 strategy: type: Recreate
  • 176. Recreate Deployment Strategy Pros ● Avoids versioning issues ● Avoids database schema incompatibilities Cons ● Involves downtime between the complete shutdown of v1 and the startup of v2 Suitable for ● Monolithic legacy applications ● Non-production environments
  • 179. Rolling Deployment Strategy apiVersion: apps/v1 kind: Deployment spec: replicas: 2 strategy: type: RollingUpdate rollingUpdate: maxSurge: 1 # new pods added at a time maxUnavailable: 0 minReadySeconds: 10
  • 180. Rolling Deployment Strategy Pros ● Low risk due to readiness checks ● Gradual rollout with no downtime Cons ● Needs backwards compatibility between API versions and database migrations ● No control over the traffic during the rollout Suitable for ● Stateful applications & Stateless microservices
  • 183. Blue/Green Deployment Strategy Kubernetes native deployment strategy apiVersion: v1 kind: Service spec: selector: app: podinfo version: v1 #switch the traffic from blue to green by changing the version to v2
  • 184. Blue/Green Deployment Strategy Suitable for ● Monolithic legacy applications ● Autonomous microservices Pros ● Avoids versioning issues ● Instant rollout and rollback (while the blue deployment still exists) Cons ● Requires resource duplication ● Data synchronisation between the two environments can lead to partial service interruption
  • 187. Canary Deployment Strategy Istio traffic management apiVersion: networking.istio.io/v1alpha3 kind: VirtualService spec: http: - match: - headers: x-user: exact: insider route: - destination: name: podinfo.prod.svc.cluster.local subset: canary - route: - destination: name: podinfo.prod.svc.cluster.local subset: ga
  • 188. Canary Deployment Strategy Suitable for ● User facing applications ● Stateless microservices Pros ● Low impact as the new version is released only to a subset of users ● Controlled rollout with no downtime ● Fast rollback Cons ● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd) ● Needs backwards compatibility between API versions and database migrations
  • 189. Istio Canary Deployment - Initial State 18 9 All traffic is routed to the GA deployment apiVersion: networking.istio.io/v1alpha3 kind: VirtualService http: - route: - destination: name: podinfo subset: canary weight: 0 - destination: name: podinfo subset: ga weight: 100
  • 190. Istio Canary Deployment - Initial State 19 0
  • 191. Istio Canary Deployment - Warm-Up 19 1 Route 10% of the traffic to the canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService http: - route: - destination: name: podinfo subset: canary weight: 10 - destination: name: podinfo subset: ga weight: 90
  • 192. Istio Canary Deployment - Warm-Up 19 2
  • 193. Istio Canary Deployment - Increase load 19 3
  • 194. Istio Canary Deployment - Latency Monitoring 19 4
  • 195. Istio Canary Deployment - CD Overview 19 5
  • 198. A/B Testing Suitable for ● User facing applications ● Stateless microservices Pros ● Allows advanced customer behaviour analysis ● Performance testing of different configurations in parallel Cons ● Needs a traffic management solution / service mesh (Envoy, Istio, Linkerd) ● Needs backwards compatibility between API versions and database migrations
  • 199. Canary + A/B Testing
  • 200. Blue/Green and Dark traffic 20
  • 201. Blue/Green + Dark Traffic Deployment Strategy
  • 202. Blue/Green + Dark Traffic Deployment Strategy Istio traffic mirroring apiVersion: config.istio.io/v1alpha2 kind: RouteRule spec: destination: name: podinfo precedence: 2 route: - labels: version: v1 weight: 100 - labels: version: v2 weight: 0 mirror: name: podinfo labels: version: v2
  • 203. Blue/Green + Dark Traffic Deployment Strategy Suitable for ● API based applications ● Autonomous microservices Pros ● Test the green deployment without any impact for the end-user ● Uses real traffic minimising the risk of a faulty release Cons ● Requires resource duplication ● Needs a traffic management solution / service mesh (Envoy, Istio)
  • 206. ● Kubernetes internal architecture ● High-availability Kubernetes ● Draining and cordoning a node for reboot ● Backing up and upgrading the Kubernetes control plane Operational practices for Kubernetes 206
  • 207. 20 7 Kubernetes component architecture Diagram from https://speakerdeck.com/luxas with permission Nodes Control plane Node 3 OS Container Runtime Kubelet Networking Node 2 OS Container Runtime Kubelet Networking Node 1 OS Container Runtime Kubelet Networking API Server (REST API) Controller Manager (Controller Loops) Scheduler (Bind Pod to Node) etcd (key-value DB, SSOT) User Legend: CNI CRI OCI Protobuf gRPC JSON
  • 208. HA etcd cluster External Load Balancer or DNS-based API server resolving High-availability Kubernetes Master A API Server Controller Manager Scheduler Shared certificates etcd etcd etcd Master B API Server Controller Manager Scheduler Shared certificates Master C API Server Controller Manager Scheduler Shared certificates Nodes Kubelet 1 Kubelet 2 Kubelet 3 Kubelet 4 Kubelet 5 Diagram from https://speakerdeck.com/luxas with permission 208
  • 209. ● How Google Runs Production Systems ● SREs: ○ Have the skillset necessary to automate tasks ○ Do the same work as an operations team, but with automation instead of manual labor ● SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning Site Reliability Engineering 209
  • 210. When you need to reboot a worker node to install OS updates or do hardware maintenance without disrupting your workloads, you need to perform the following operations: ● Evict all running pods (except those managed by DaemonSets) ● Mark the node as unschedulable ● Perform maintenance work on the node ● Restart the node ● Make the node schedulable again kubectl drain $NODE -> reboot $NODE -> kubectl uncordon $NODE Worker node maintenance 210
  • 211. ● If all workloads are replicated, draining a node before rebooting is not necessary. A node that comes back within 5 minutes of a reboot will not trigger any pod rescheduling. ● A drain operation can target multiple nodes; to protect clustered applications you need to create Pod Disruption Budgets to ensure the number of running replicas is never brought below the minimum needed for a quorum Worker node maintenance 211
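A PodDisruptionBudget sketch for a hypothetical three-replica workload; the API version shown is the one from this Kubernetes era (newer clusters use policy/v1).

    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: podinfo-pdb
    spec:
      minAvailable: 2               # a drain will not evict pods below 2 running replicas
      selector:
        matchLabels:
          app: podinfo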
  • 212. Operations: ● Master node OS updates and hardware maintenance (low risk) ○ A master node reboot will not disrupt any running workloads ○ While the master node is offline, no scheduling operations will happen ○ Kured can help with this. (https://github.com/weaveworks/kured) ● Control plane upgrades (high risk) ○ For in-place upgrades of master nodes, a full backup (such as LVM snapshots) is recommended ○ Only one minor version upgrade at a time is supported: you can upgrade from 1.9 to 1.10, but not from 1.8 straight to 1.10 ○ Test the upgrade procedure on a staging cluster before running it in production Control plane maintenance 212
  • 213. Kubeadm master node upgrade procedure: ● Download the most recent version of kubeadm using curl (do not upgrade the kubeadm OS package) ● Run kubeadm upgrade plan to check if your cluster is upgradeable ● Pick a version to upgrade to and run kubeadm upgrade apply v1.10.2 ● Upgrade your CNI by applying the new DaemonSet definition ● Drain the master node with kubectl drain $HOST --ignore-daemonsets ● Upgrade Kubernetes packages with apt-get update && apt-get upgrade ● Bring the master node back online with kubectl uncordon $MASTER Control plane upgrades 213
  • 216. End of Day Recap 21 6
  • 217. Agenda 21 7 9:00a Welcome & introduction 9:30a Getting started with your environment 10:00a What is “Production Ready?” 10:30a Break (15 minutes) 10:45a Monitoring a production cluster 11:45a Declarative infrastructure in practice 12:15p Lunch (1 hour) 1:15p Devops and GitOps in practice 2:15p Advanced Deployment Patterns 3:15p Break (15 minutes) 3:30p Operational practice for Kubernetes 4:00p Securing a Kubernetes cluster (by Twistlock) 5:00p Review and recap
  • 218. ❏ Readiness check ❏ Liveness check ❏ Metric instrumentation ❏ Dashboards ❏ Playbook ❏ Limits and requests ❏ Labels and annotations The Application checklist 21 8 ❏ Alerts ❏ Structured logging output ❏ Tracing instrumentation ❏ Graceful shutdowns ❏ Graceful dependency (w. readiness check) ❏ Configmaps ❏ Labeled images using commit sha ❏ Locked down runtime context
  • 219. The Cluster checklist 21 9 ❏ API Gateway ❏ Service Mesh ❏ Service catalogue / Broker ❏ Network policies ❏ Authorisation integration ❏ Image scanning ❏ Log aggregation ❏ Build pipeline ❏ Deployment pipeline ❏ Image registry ❏ Monitoring infrastructure ❏ Shared storage ❏ Secrets management ❏ Ingress controller
  • 220. 220 Recap: Monitoring In Practice ● There are different kinds of metrics ● A good way to think of metrics is which domain they’re in ● It’s trivial to instrument your applications ● Prometheus can be used for both metrics (monitoring) and ad-hoc querying (observability) ● Simple instrumentation can yield deep insights ● PromQL deals with scalar and vector series ● PromQL has gauges, histograms and counters ● PromQL has many useful functions available
  • 221. 22 1 1 The entire system is described declaratively. 2 The canonical desired system state is versioned (with Git) 3 Changes to the desired state are automatically applied to the system 4 Software agents ensure correctness and alert on divergence
  • 222. 222 ● Having separate pipelines for CI and CD enables better security ● It’s also easier to deal with if a deployment goes wrong ● We built a few versions of a simple app, using a demo CI pipeline ● Deployed those versions to Kubernetes using Weave Cloud ● Automated the deployment ● Deployments, rollback and lock are all done via git ● Git is our single source of truth. Recap: GitOps CI/CD
  • 223. Deployment Strategies Kubernetes native ● Recreate ● Rolling update ● Blue/Green Service mesh ● Canary ● A/B Testing ● Blue/Green + Dark Traffic
  • 224. ● Kubernetes internal architecture ● High-availability Kubernetes ● Draining and cordoning a node for reboot ● Backing up and upgrading the Kubernetes control plane Operational practices for Kubernetes 224