Pierre Vincent gives a presentation on increasing the visibility of distributed systems in production. He discusses the hierarchy of service reliability, designing systems to recover from failure, and how distributing a system also distributes the places where failures can occur. He covers strategies for health checks and for collecting system, application, and business metrics using tools like Prometheus. Vincent emphasizes making metrics and logs usable and limiting alerts to user-impacting issues. Tracing is discussed as a way to correlate errors across services. In closing, Vincent notes that visibility enables operability and justifiable decisions, and builds trust in systems.
12. Usability of metrics tooling is key to adoption
Instrument code → Query metrics → Create dashboards → Define rules & thresholds
13. Limit alerting to user-impacting symptoms
Expose dashboards to diagnose causes
14. Overlaying changes with production metrics
Source: Ian Malpass (Etsy), Measure Anything, Measure Everything
https://codeascraft.com/2011/02/15/measure-anything-measure-everything
15. Making sense of logs
Centralise logs · Common searchable format · Correlation IDs
16. Tracing
[Diagram: a request fans out across services A–J, all tagged with the same trace id a1b2c3]
ERROR [svc=H][trace=a1b2c3] Failed to save order
Cause: Cassandra timeout exception
ERROR [svc=F][trace=a1b2c3] Failed to complete order
Cause: Shipping service responded with 500
ERROR [svc=A][trace=a1b2c3] Failed to process order
Cause: Order process manager responded with 500
INFO [svc=G][trace=a1b2c3] Items verified in stock
21. "If you can't monitor a service, you don't know what's happening, and if you're blind to what's happening, you can't be reliable."
N. Murphy, J. Petoff, C. Jones, B. Beyer, Site Reliability Engineering
Maslow's hierarchy of needs: food > safety > love > esteem > fulfilment
Reliability:
monitoring: see how things are working and get notified when they're not
incident response: once we're notified, how do we mitigate (turn off a feature / add capacity)
postmortem/root-cause analysis: what went wrong, how do we fix it durably
testing/release procedures: test what tends to go wrong, to catch things before they ship
capacity planning: understanding load, dynamically balancing load, circuit breaking etc.
development: design the system for reliability, knowing where things tend to be brittle
product: the fulfilment of a reliable product
“Monitoring enables service owners to make rational decisions about the impact of changes to the service, apply the scientific method to incident response, and of course ensure their reason for existence: to measure the service’s alignment with business goals”
We used to spend most of our time coding and not testing; then came TDD, and now not unit testing is the widely agreed outlier
But are we still spending most of our time developing?
Apps that haven't reached production = just playing around
Production is the real deal, but we treat it as the finish line, when it's actually the opposite
Things will never run perfectly.
If nothing goes wrong in your system either:
Nobody is using it
You just don’t know about it
There is only so much we can think about.
Diminishing returns in designing/coding perfection - much more value for money in admitting that things will go wrong, and that in these cases our focus is to:
Know about it asap
Have as much info as possible to find the source of the problem
What I mean by distributed systems:
No longer 1 web app supported by 1 db
A number of separate parts, responsible for different things, talking to each other over the network
Independently scalable, independently deployable, independently failing
Clusters of databases, message queues
Multiple servers, DCs, clouds
Nobody knew distributing would be so complicated! ;)
Strategies for health checks:
Simple network broadcast
Registration
Heartbeat in an HA store (etcd, ZooKeeper…)
Expose health to others, e.g. an HTTP endpoint (see the sketch after this list)
Requires some form of service discovery
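A minimal sketch of the HTTP-endpoint approach, using only Python's standard library; the /health path and the check_database dependency check are illustrative, not from the talk:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def check_database() -> bool:
    # Hypothetical dependency check; replace with a real ping/query.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            healthy = check_database()
            body = json.dumps({"status": "UP" if healthy else "DOWN"}).encode()
            # 200 when healthy, 503 when not, so callers can rely on the
            # status code alone.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

A service-discovery layer or load balancer can then poll this endpoint and route traffic away from unhealthy instances.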
All levels of metrics are important
Different teams might be responsible for different things
Not exclusive levels
Need ability to correlate different levels
If adding metrics is simple, every developer will do it
One-line instrumentation with tools like Prometheus or StatsD (see the sketch below)
Integration with graphing tools, alerting tools
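As a rough illustration of how little code that one line can be, here is a sketch using the official prometheus_client Python library; the metric names and the /order endpoint label are made up for the example:

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# System/application-level metric: request latency per endpoint.
REQUEST_LATENCY = Histogram("http_request_duration_seconds",
                            "Request latency", ["endpoint"])
# Business-level metric; exposed by the client as orders_placed_total.
ORDERS_PLACED = Counter("orders_placed", "Orders placed")

@REQUEST_LATENCY.labels(endpoint="/order").time()  # the one-line instrumentation
def handle_order():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    ORDERS_PLACED.inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_order()
```

The same metric names then drive the dashboards and alerting rules downstream, which is what makes the low-friction instrumentation worth it.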
Not going to expand on alerting - entire (multiple) talks required!
Alerting on symptoms reduces noise
> 1st action is to mitigate effects, then track down the cause
Use dashboards to troubleshoot
> 1st place to go to validate theories
Problems are mostly caused by changes
Overlay production changes on time-series graphs
Deployment/config change > correlate with changes in system behaviour (see the sketch below)
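One common way to make that overlay possible (an assumption here, not something the talk prescribes) is to publish an info-style metric at deploy time, so graphing tools can draw a marker wherever the label set changes:

```python
import time
from prometheus_client import Gauge, start_http_server

# Info-style gauge: the value is always 1; the information lives in the labels.
DEPLOY_INFO = Gauge("app_deployment_info", "Currently deployed version",
                    ["version", "git_sha"])

def mark_deployment(version: str, git_sha: str) -> None:
    # Call once at startup. On a dashboard, the point where the label set
    # changes is the moment a new version went live.
    DEPLOY_INFO.labels(version=version, git_sha=git_sha).set(1)

if __name__ == "__main__":
    start_http_server(8001)               # expose /metrics for scraping
    mark_deployment("1.4.2", "a1b2c3d")   # hypothetical version identifiers
    time.sleep(300)                       # keep the demo process alive
```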
Aggregated logs are just more logs in one place
Need to make sense of them
Correlation IDs, tracing (see the logging sketch after these notes)
Search for a trace / timing of traces
This is profiling on a live environment!
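A minimal sketch of propagating a correlation id through log lines, matching the [svc=…][trace=…] format from the tracing slide; the service name and the header convention are assumptions:

```python
import logging
import uuid

# Format mirrors the slide's log lines: LEVEL [svc=…][trace=…] message
logging.basicConfig(
    format="%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s",
    level=logging.INFO)
log = logging.getLogger("orders")

def handle_request(trace_id: str | None = None) -> None:
    # Reuse the caller's id if one arrived (e.g. in an X-Request-ID header);
    # mint a new one at the edge of the system otherwise.
    trace_id = trace_id or uuid.uuid4().hex[:6]
    extra = {"svc": "F", "trace": trace_id}
    log.info("Processing order", extra=extra)
    # ... pass trace_id downstream on every outgoing call ...
    log.error("Failed to complete order", extra=extra)

handle_request("a1b2c3")
```

Because every service stamps the same id on its lines, a single search in the centralised logs reconstructs the whole request path.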
Example of a DNS issue tracked down:
- Error rate of a peer dependency went up
- Tracked down to a breach of our SLO on the API
- Requests to a particular dependency were slow, but there was no evidence of that dependency being slow to respond
- Monitoring disproved the theory that the dependency was slow to respond
- Pointed at something between the 2 services
- Added internal Zipkin tracing inside the calling service
- Tracked it down to slow DNS lookups caused by a bad resolv configuration
Having fuller picture = less guess work
Impossible to reason about a system when flying blind
Monitoring lets us take a scientific approach to explaining production systems
> find evidence of problems
> make hypotheses on issues
> correlate issues with recent changes
> prove/disprove hypotheses
Shining a light on your system gives you the real picture
Internal changes backed only by guesswork are an anti-pattern
They make things more complicated without evidence to back them up
No way to quantify whether things got better or worse
Encourage “information radiators”
= No hiding (from others and from ourselves)
> Needs a culture of safety
Distilled dashboards and status pages for other parts of the business
= spread visibility to those higher up (e.g. support)
= build confidence and trust of stakeholders