stackconf 2023 | Measuring Reliability in Production by Thomas Voss.pdf

NETWAYS
Proprietary
Today’s speakers
Thomas Voß
Staff SRE@Google
Measuring Reliability in Production
Step By Step SLO Creation in Cloud Operations
stackconf ’23, 2023-09-13
The Most Important Feature of Any System is its Reliability
SRE is what you get when you treat operations as a software problem.
What level of reliability do we need?
Terminology
● CUJs (Critical User Journeys): your most important user journeys, in which a user interacts with a service to achieve a goal
● SLIs: metrics that describe users' experiences
● SLOs: targets for the overall health of a service
● SLAs: contractual obligations
Alignment throughout the Product Life Cycle
[Figure: life-cycle stages Concept, Business, Development, Operations, and Market along the business process, aligned through SLOs]
Creating SLIs and SLOs Step By Step
Cloud Operations Sandbox
A click-to-deploy, open-source learning experience that helps practitioners understand
how to use Cloud Operations tools and apply SRE practices in an isolated cloud
environment with synthetic traffic that resembles real production.
● A “playground environment” for evaluating Cloud Operations as close as possible to
real production
● Includes: demo service, one-click deployment script, interactive walkthrough,
synthetic load generator, SRE Recipes, etc.
● Start here:
github.com/GoogleCloudPlatform/cloud-ops-sandbox/
Online Boutique
*github.com/GoogleCloudPlatform/microservices-demo#architecture
1. SLO Process - CUJ
List out critical user journeys and order them by business impact:
1. Check out
2. Add to cart
3. Browse products
Critical User Journeys
As a shopper, I want to purchase (check out) items in the store.
SLO Process - SLI Creation
Determine which metrics to use as service-level indicators (SLIs) to most accurately track
the user experience.
SLO Process - SLI Creation
1. SLI Type:
○ Request/response interactions in a user journey: availability, latency, and quality.
○ Data processing: freshness, coverage, correctness and throughput.
○ Storage: throughput and latency.
2. SLI Specification: an assessment of service outcome that you think matters to users
○ For availability: The proportion of valid events served successfully
○ For latency: The proportion of valid events served faster than a threshold
3. SLI Implementation: a way to measure the SLI specification
○ Includes: event + success criteria + where/how you record the SLI.
○ Measurement strategies: application-level metrics, logs processing, front-end infra metrics, synthetic clients/data, client-side instrumentation
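Both SLI specifications above share one shape: good valid events divided by all valid events. A minimal sketch of that ratio follows; the `sli` helper, the sample request tuples, and the 400 ms threshold are illustrative assumptions, not part of the talk or of any Cloud Operations API.

```python
from typing import Callable, Iterable, Optional, TypeVar

E = TypeVar("E")


def sli(events: Iterable[E],
        is_valid: Callable[[E], bool],
        is_good: Callable[[E], bool]) -> Optional[float]:
    """Return (good valid events) / (valid events), or None without valid events."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None
    good = sum(1 for e in valid if is_good(e))
    return good / len(valid)


# Hypothetical request records: (HTTP status, latency in milliseconds).
requests = [(200, 120.0), (200, 480.0), (500, 90.0), (200, 950.0)]

# Availability: proportion of valid events served successfully.
availability = sli(requests,
                   is_valid=lambda r: True,
                   is_good=lambda r: r[0] < 500)

# Latency: proportion of valid events served faster than a threshold
# (400 ms is an assumed value chosen only for this example).
latency = sli(requests,
              is_valid=lambda r: True,
              is_good=lambda r: r[1] < 400.0)
```

Only the success criterion changes between the two SLIs, which is why the specification and the implementation can be chosen separately.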
SLO Process - Availability SLI Creation
SLI Type: availability
SLI Specification: The proportion of valid checkout events served successfully.
● Requests to the CheckoutService that return HTTP response code 2xx, 3xx, or 4xx (excl. 429)
SLI Implementation: The proportion of HTTP GET requests for /checkout_service/response_counts
that do not have 5XX status (3XX and 4XX excluded) measured at the Istio service mesh.
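As a rough illustration, the availability SLI can be computed from per-status response counts. This sketch follows the specification bullet (2xx, 3xx, and 4xx except 429 count as successful). The function names and the sample counts are hypothetical, and how the Istio-based implementation treats 429 is an assumption here.

```python
from typing import Dict, Optional


def is_good_status(status: int) -> bool:
    """2xx, 3xx and 4xx count as served successfully, except 429."""
    return 200 <= status < 500 and status != 429


def checkout_availability(status_counts: Dict[int, int]) -> Optional[float]:
    """Proportion of checkout requests that were served successfully.

    status_counts maps an HTTP status code to the number of responses
    observed with that code. Every request is assumed to be a valid event.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    good = sum(n for status, n in status_counts.items()
               if is_good_status(status))
    return good / total


# Hypothetical counts, e.g. aggregated from service mesh metrics.
sample_counts = {200: 9900, 302: 40, 404: 30, 429: 10, 500: 15, 503: 5}
sample_availability = checkout_availability(sample_counts)
```

With these sample counts, 9,970 of 10,000 requests are good, so the SLI is 99.7%.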
SLO Process - SLO
1. Determine SLO target goals
2. Determine SLO measurement period
An SLO should include a target and a measurement window:
● 99.9% of Checkout requests in the past 28 days are successful
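A target and a window together imply an error budget: the share of requests that may fail before the SLO is missed. The sketch below works through that arithmetic for the 99.9% over 28 days example; the function names and the request volumes are assumptions made for illustration.

```python
from typing import Optional


def error_budget_fraction(target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - target


def allowed_bad_events(target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates within the window."""
    return error_budget_fraction(target) * total_events


def budget_remaining(target: float, total_events: int,
                     bad_events: int) -> Optional[float]:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed = allowed_bad_events(target, total_events)
    if allowed == 0:
        return None  # no traffic, or a 100% target with no budget at all
    return 1.0 - bad_events / allowed


def full_outage_minutes(target: float, window_days: int) -> float:
    """Minutes of total outage that would use up the budget, assuming even traffic."""
    return error_budget_fraction(target) * window_days * 24 * 60


# Assumed volume: 1,000,000 checkout requests in the 28-day window, 250 failed.
allowed = allowed_bad_events(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
outage = full_outage_minutes(0.999, 28)
```

Under these assumptions about 1,000 failed requests are tolerated, 250 failures leave roughly three quarters of the budget, and a complete outage of about 40 minutes would spend all of it.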
SLO Process
1. List out critical user journeys and order them by business impact.
2. Determine which metrics to use as service-level indicators (SLIs) to most
accurately track the user experience.
3. Determine SLO target goals and the SLO measurement period.
4. Configure SLI, SLO, and error budget consoles.
5. Configure SLO alerts.
Measuring Reliability
on GCP
Setup Guide in 4 easy steps
1. Define Service: select or define a service to monitor
2. Define SLI: identify a behaviour for your service to observe
3. Define SLO: set a target for the service in a time window
4. Define Alert: configure alerts on the service health & burn rate
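Burn rate, which the alert step refers to, is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the full window. The following is a minimal sketch of that calculation; the function names, the one-hour sample, and the thresholds are assumptions, and it does not call any Cloud Monitoring API.

```python
from typing import Optional


def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error rate in a lookback period relative to the allowed rate."""
    budget = 1.0 - target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    if total_events == 0:
        return 0.0  # no traffic in the lookback period, nothing burned
    return (bad_events / total_events) / budget


def should_alert(rate: float, threshold: float) -> bool:
    """Fire when the budget is being spent faster than the chosen threshold."""
    return rate > threshold


def days_to_exhaustion(rate: float, window_days: int) -> Optional[float]:
    """Days until a full, untouched budget is spent at this constant burn rate."""
    if rate <= 0:
        return None
    return window_days / rate


# Assumed lookback sample: 10,000 requests in the last hour, 50 failed,
# measured against the 99.9% / 28-day SLO from the earlier slide.
rate = burn_rate(50, 10_000, 0.999)
fires_at_2x = should_alert(rate, 2.0)    # hypothetical threshold
fires_at_10x = should_alert(rate, 10.0)  # hypothetical threshold
days_left = days_to_exhaustion(rate, 28)
```

Here the service burns budget about five times faster than sustainable, so a full budget would last under six days instead of 28. Suitable thresholds and lookback periods depend on the service.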
Demo
1. Services Overview
2. Service Definition
3. SLI Creation
4. SLO Creation
5. SLO Alerts Creation
How can you get started?
Resources
● Cloud Operations Sandbox one-click cluster:
github.com/GoogleCloudPlatform/cloud-ops-sandbox/
● Collection of public resources: bit.ly/Public_SRE_Resources
● Detailed step by step guide: Measuring Reliability in GCP: Step By Step SLO creation guide using
Cloud Operation Sandbox.
● [Qwiklabs] Cloud operations for GKE
*Cover images used with permission. These books can be found on shop.oreilly.com.
Google's Public Resources
● Coursera: for leaders, Developing a Google SRE Culture; for engineers, Site
Reliability Engineering: Measuring and Managing Reliability
● Art of SLOs classroom: The Art Of SLOs
● Blogs: DevOps & SRE
● Google Professional Services SRE packages
● The books
Follow us on Twitter: @googlesre. Find Google SRE publications—including the SRE
Books, articles, trainings, and more—for free at sre.google/resources.
Book covers copyright O’Reilly Media. Used with permission.
Q&A?
Thank you!