Monitoring and Observability

                           /   in Complex Architectures

Tuesday, October 2, 12
Hi! I’m @postwait




                         I founded @OmniTI
                               and @MessageSystems
                               and @Circonus




Hi! I’m @postwait




                         I am very active in @TheOfficialACM
                         participating in @ACMQueue
                         and the practitioners board.




Hi! I’m @postwait




                         I (regrettably) build complex systems.




Why we are here




                         We’re here to talk about
                         coping with breakage




Rule #1




                         Direct observation of failure
                         leads to quicker rectification.




Rule #2




                         You cannot correct
                         what you cannot measure.




Solution Approach #1



                         Debugging failures requires either
                         visibility into the
                         precipitating state




Precipitating State



                         Single threaded applications



                         ✓ Easy

Precipitating State



                         Multi-threaded applications



                         ✓ Challenging

Precipitating State



                         Distributed applications




                              here there be dragons




Solution Approach #2



                         or
                         direct observation of a
                         failing transaction
                         (and likely very many of them)




Direct Observation




                         Observing something fail...
                         is priceless.




Direct Observation




                         Observation leads to
                         intelligent questioning.




Direct Observation




                         Questioning leads to answers...
                         but only through more observation.


                                    and herein lies the rub.


Leaning Towards Scientific Process



                         In production you don’t have
                           • repeatability
                           • control groups
                           • external verification

                                              ... or do you?

What’s monitoring got to do with it?




                         Monitoring is all about the
                         passive observation of
                         telemetry data.




Monitoring Telemetry



                         cannot pinpoint problems


                         can provide evidence of
                         the existence of a problem




Monitoring




                         Gives you evidence that
                         there is a problem




Monitoring




                         Gives you evidence that
                         you have fixed a problem
                         (or at least the symptoms)




Monitoring Tactically




                         If it could be of interest,
                         measure it and
                         expose the measurement
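
One way to "measure it and expose the measurement" is to emit a counter and a timer whenever something of interest happens. The following is a minimal sketch, assuming a StatsD-style collector listening on UDP 127.0.0.1:8125. The address, the metric names, and the helper functions are hypothetical; only the `name:value|type` line format comes from StatsD.

```python
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed collector address, not from the talk


def statsd_line(name, value, kind):
    """Format one metric in the StatsD text format: name:value|type."""
    return f"{name}:{value}|{kind}"


def emit(sock, line):
    """Fire-and-forget UDP send; a failed send must never break the application."""
    try:
        sock.sendto(line.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass


def timed_call(sock, name, func, *args):
    """Run func, then expose a call counter and the duration in milliseconds."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        emit(sock, statsd_line(f"{name}.calls", 1, "c"))
        emit(sock, statsd_line(f"{name}.time", elapsed_ms, "ms"))


if __name__ == "__main__":
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # "api.lookup" is a hypothetical metric name; sum() stands in for real work.
    print(timed_call(udp, "api.lookup", sum, [1, 2, 3]))
    udp.close()
```

UDP is used here so that instrumentation stays cheap and cannot block the code path being measured.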




Monitoring: embedded
                  statsd                               metrics
                  https://github.com/etsy/statsd       https://github.com/codahale/metrics



                  resmon                               folsom
                  http://labs.omniti.com/labs/resmon   https://github.com/boundary/folsom



                                                       metrics.js
                                                       https://github.com/mikejihbe/metrics



                                                       metrics-net
                                                       https://github.com/danielcrenna/metrics-net




Monitoring: collection
                  reconnoiter                               circonus
                  http://labs.omniti.com/labs/reconnoiter   http://circonus.com/



                  graphite                                  librato
                  http://graphite.wikidot.com/              https://metrics.librato.com/



                  OpenTSDB
                  http://opentsdb.net/




Monitoring: Bling
                         visualizing an architecture rollout




Monitoring: Bling
                     visualizing the impact on service times




average API service time latency




actual API service time latency




                  http://www.slideshare.net/postwait/atldevops
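
The gap between an "average" latency graph and the "actual" latencies comes from the mean hiding the shape of the distribution. A minimal sketch with made-up numbers follows; the sample values and the nearest-rank `percentile` helper are illustrative assumptions, not data from the talk.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical service times in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [10.0] * 95 + [2000.0] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

With these samples the mean is 109.5 ms, a service time that no request actually experienced: most requests took 10 ms and a few took 2000 ms.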



Repeatability is a Pipe Dream


                         Your production problem is a
                         (hopefully pathological)
                         outcome of circumstance.


                         A circumstance which often
                         cannot be repeated.



Control Groups



                         Control groups can
                         compensate for the
                         inability to
                         precisely repeat an experiment.




Control Groups




                         Most architectures have redundancy.




Control Groups




                         With the right design,
                         you can turn that redundancy
                         into a debugging environment.


                  [1] http://omniti.com/surge/2012/sessions/xtreme-deployment




Control Groups: Simple Example



                         I have 10 web servers
                         I fix 1
                         I verify 1 is fixed
                         I verify 9 are still broken
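
The ten-server example can be sketched as a comparison between the one fixed host and the nine untouched hosts over the same time window. The host names and the error and request counts below are hypothetical; a real comparison would also need enough traffic per host for the difference to be meaningful.

```python
def pooled_error_rate(hosts, stats):
    """Error rate across a group of hosts, pooling their errors and requests."""
    errors = sum(stats[h][0] for h in hosts)
    requests = sum(stats[h][1] for h in hosts)
    return errors / requests if requests else 0.0


# Hypothetical (errors, requests) per host, all taken from the same time window.
stats = {f"web{n:02d}": (50, 1000) for n in range(2, 11)}  # 9 untouched hosts
stats["web01"] = (1, 1000)                                 # the host that got the fix

treated = ["web01"]
control = sorted(h for h in stats if h not in treated)

treated_rate = pooled_error_rate(treated, stats)
control_rate = pooled_error_rate(control, stats)

print(f"fixed host:   {treated_rate:.3%} errors")
print(f"control (x{len(control)}): {control_rate:.3%} errors")
```

The control group matters because it rules out the alternative explanation that the problem simply went away on its own: if the nine untouched hosts had also recovered, the fix would have proven nothing.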




Control Groups: Seems Easy



                         Web servers tend to be:
                           • homogeneous
                           • share-(nothing|little)
                           • independent




Control Groups: Not So Easy



                         Most other services aren’t so
                         homogeneous and equal:
                         databases, batch processes (think
                         billings), orchestration middleware,
                         message queues



Observability


                         Some might claim that
                         seeing telemetry data is
                         observation...


                         It is doubly indirect at best.



Observability



                         I want to
                         directly see
                         the
                         errant behaviour




Observability is forgiving



                         In complex, multi-component
                         architectures, errors can be
                         observed as errant behaviour in
                         many junction points.




Observing the network




                         tcpdump / snoop
                         wireshark




Observing the network



                         Looking at just the
                         arrival of new connections

                         tcpdump -nnq -tttt -s384 \
                           'tcp port 80 and (tcp[13] & (2|16) == 2)'
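
The filter expression reads byte 13 of the TCP header, which holds the flag bits, and keeps packets where SYN (2) is set and ACK (16) is clear, that is, the first packet of a new connection. A small sketch of the same bit test follows; the function and constant names are illustrative.

```python
TCP_SYN = 0x02  # bit value 2 in the TCP flags byte
TCP_ACK = 0x10  # bit value 16 in the TCP flags byte


def is_new_connection(flags_byte):
    """Mirror of 'tcp[13] & (2|16) == 2': SYN set, ACK clear."""
    return flags_byte & (TCP_SYN | TCP_ACK) == TCP_SYN


for label, flags in [("SYN", 0x02), ("SYN+ACK", 0x12), ("ACK", 0x10), ("PSH+ACK", 0x18)]:
    print(f"{label:8s} -> {is_new_connection(flags)}")
```

Masking with both bits before comparing is what excludes the server's SYN+ACK reply, so each connection attempt is counted once.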




Observing the network


                         Looking at just the data
                         arrival and departure times
                         tcpdump -nnq -tt -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

                         snoop -rq -ta -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
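
The long filter expression computes the TCP payload length (IP total length, minus IP header length, minus TCP header length) and keeps only packets that carry data. The same arithmetic is sketched below over raw IPv4 bytes; `tcp_payload_length` and `build_packet` are illustrative helpers, and the synthetic packets leave every other header field zeroed.

```python
def tcp_payload_length(ip_packet):
    """Payload bytes in an IPv4/TCP packet, using the same terms as the filter."""
    ip_header_len = (ip_packet[0] & 0x0F) * 4                   # (ip[0]&0xf)*4
    total_len = int.from_bytes(ip_packet[2:4], "big")           # ip[2:2]
    tcp_header_len = (ip_packet[ip_header_len + 12] & 0xF0) // 4  # (tcp[12]&0xf0)/4
    return total_len - ip_header_len - tcp_header_len


def build_packet(payload):
    """Synthetic packet for illustration: 20-byte IP and TCP headers, no options."""
    ip_header = bytearray(20)
    ip_header[0] = 0x45                                   # IPv4, header length 5 words
    ip_header[2:4] = (40 + len(payload)).to_bytes(2, "big")
    tcp_header = bytearray(20)
    tcp_header[12] = 0x50                                 # data offset 5 words
    return bytes(ip_header + tcp_header) + payload


print(tcp_payload_length(build_packet(b"")))                       # bare ACK
print(tcp_payload_length(build_packet(b"GET / HTTP/1.0\r\n\r\n")))  # request data
```

Dividing the masked data-offset byte by 4 works because the offset sits in the upper nibble and counts 32-bit words: shifting right by 4 and multiplying by 4 is the same as dividing by 4.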




Observing the network
                         Finding the difference between
                         a client’s question and
                         a server’s answer
                         (tcpdump | awk filter).
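                         {
                             # Give each port its own field and drop ":" / "=" punctuation.
                             gsub(".[0-9]+(: | >)"," & ");
                             gsub("[:=]"," ");
                             # EP is the client endpoint (address + port), whichever way the packet flows.
                             EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

                             # Server packet right after a client packet: print the elapsed time.
                             if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

                             # Remember who spoke last for this endpoint, and when.
                             S[EP]= ($4==".80")?"S":"C";
                             L[EP]= $1;
                         }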
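
The pairing logic in the awk filter can also be sketched in Python, assuming the tcpdump lines have already been parsed into `(timestamp, source, destination)` tuples with endpoints written as `address.port`. The tuple layout, the function name, and the sample addresses are assumptions for illustration.

```python
def response_delays(packets, server_port=80):
    """Yield (delay, client_endpoint) each time a server packet follows a client packet."""
    last_speaker = {}  # client endpoint -> "C" or "S"
    last_time = {}     # client endpoint -> timestamp of the last packet seen
    suffix = f".{server_port}"
    for ts, src, dst in packets:
        from_server = src.endswith(suffix)
        client = dst if from_server else src
        if from_server and last_speaker.get(client) == "C":
            yield ts - last_time[client], client
        last_speaker[client] = "S" if from_server else "C"
        last_time[client] = ts


# Hypothetical capture: one request, then two response packets.
sample = [
    (1.00, "10.0.0.1.5000", "10.0.0.2.80"),
    (1.25, "10.0.0.2.80", "10.0.0.1.5000"),
    (1.30, "10.0.0.2.80", "10.0.0.1.5000"),
]

for delay, endpoint in response_delays(sample):
    print(f"{delay:f} {endpoint}")
```

Only the first server packet after a client packet produces a line, so the printed value approximates the time the server took to start answering, not the time to finish.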



Observing user-space



                         strace[1] / truss
                         gstack / pstack
                         gcore + gdb / dbx / mdb[2]


                  [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
                  [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf




System call tracing




                         Watching sshd
                         is a good way to get familiar.
                         truss -f -p `pgrep sshd`




System call tracing




                         An active web server is going to be
                         like a firehose.
                         truss -f -p `pgrep httpd`




Observing the system



                         DTrace


                         Live production demo or GTFO.




Thank You




                         Questions?





Monitoring and observability

  • 1. Monitoring and Observability / in Complex Architectures Tuesday, October 2, 12
  • 2. Hi! I’m @postwait. I founded @OmniTI and @MessageSystems and @Circonus.
  • 3. Hi! I’m @postwait. I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.
  • 4. Hi! I’m @postwait. I (regrettably) build complex systems.
  • 5. Why we are here: We’re here to talk about coping with breakage.
  • 6. Rule #1: Direct observation of failure leads to quicker rectification.
  • 7. Rule #2: You cannot correct what you cannot measure.
  • 8. Solution Approach #1: Debugging failures requires either visibility into the precipitating state
  • 9. Precipitating State: Single-threaded applications ✓ Easy
  • 10. Precipitating State: Multi-threaded applications ✓ Challenging
  • 11. Precipitating State: Distributed applications; here there be dragons
  • 12. Solution Approach #2: or direct observation of a (and likely very many) failing transaction
  • 13. Direct Observation: Observing something fail... is priceless.
  • 14. Direct Observation: Observation leads to intelligent questioning.
  • 15. Direct Observation: Questioning leads to answers... but only through more observation.
  • 16. Direct Observation: Questioning leads to answers... but only through more observation. And herein lies the rub.
  • 17. Leaning Towards Scientific Process: In production you don’t have repeatability, control groups, or external verification
  • 18. Leaning Towards Scientific Process: In production you don’t have repeatability, control groups, or external verification ... or do you?
  • 19. What’s monitoring got to do with it? Monitoring is all about the passive observation of telemetry data.
  • 20. Monitoring: Telemetry cannot pinpoint problems; it can provide evidence of the existence of a problem.
  • 21. Monitoring: Gives you evidence that there is a problem
  • 22. Monitoring: Gives you evidence that you have fixed a problem (or at least the symptoms)
  • 23. Monitoring Tactically: If it could be of interest, measure it and expose the measurement.
  • 24. Monitoring: embedded. statsd (https://github.com/etsy/statsd); metrics (https://github.com/codahale/metrics); resmon (http://labs.omniti.com/labs/resmon); folsom (https://github.com/boundary/folsom); metrics.js (https://github.com/mikejihbe/metrics); metrics-net (https://github.com/danielcrenna/metrics-net)
  • 25. Monitoring: collection. reconnoiter (http://labs.omniti.com/labs/reconnoiter); circonus (http://circonus.com/); graphite (http://graphite.wikidot.com/); librato (https://metrics.librato.com/); OpenTSDB (http://opentsdb.net/)
  • 26. Monitoring: Bling. Visualizing an architecture rollout
  • 27. Monitoring: Bling. Visualizing the impact on service times
  • 28. average API service time latency
  • 29. actual API service time latency (http://www.slideshare.net/postwait/atldevops)
  • 31. Repeatability is a Pipe Dream: Your production problem is a (hopefully pathological) outcome of circumstance. A circumstance which often cannot be repeated.
  • 32. Control Groups: Control groups can compensate for the inability to precisely repeat an experiment.
  • 33. Control Groups: Most architectures have redundancy.
  • 34. Control Groups: With the right design, you can turn that redundancy into a debugging environment. [1] http://omniti.com/surge/2012/sessions/xtreme-deployment
  • 35. Control Groups: Simple Example. I have 10 web servers. I fix 1. I verify 1 is fixed. I verify 9 are still broken.
  • 36. Control Groups: Seems Easy. Web servers tend to be: homogeneous, share-(nothing|little), independent
  • 37. Control Groups: Not So Easy. Most other services aren’t so homogeneous and equal: databases, batch processes (think billings), orchestration middleware, message queues
  • 38. Observability: Some might claim that seeing telemetry data is observation... It is doubly indirect at best.
  • 39. Observability: I want to directly see the errant behaviour
  • 40. Observability is forgiving: In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
  • 41. Observing the network: tcpdump / snoop, wireshark
  • 42. Observing the network: Looking at just the arrival of new connections
       tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
  • 43. Observing the network: Looking at just the data arrival and departure times
       tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
       snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
  • 44. Observing the network: Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).
       {
         gsub(".[0-9]+(: | >)"," & ");
         gsub("[:=]"," ");
         EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
         if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
         S[EP]= ($4==".80")?"S":"C";
         L[EP]= $1;
       }
  • 47. Observing user-space: strace[1] / truss; gstack / pstack; gcore + gdb / dbx / mdb[2]
       [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
       [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
  • 48. System call tracing: Watching sshd is a good way to get familiar.
       truss -f -p `pgrep sshd`
  • 49. System call tracing: An active web server is going to be like a firehose.
       truss -f -p `pgrep httpd`
  • 50. Observing the system: DTrace. Live production demo or GTFO.
  • 51. Thank You. Questions?
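
The pairing logic of the tcpdump | awk filter on slide 44 can be sketched in Python as a rough illustration: remember, per client endpoint, who sent the last packet and when, and report the gap whenever a server packet directly follows a client packet. The function name `request_latencies`, the regular expression, and the assumed shape of a `tcpdump -nnq -tt` line (IPv4 only) are assumptions made for this sketch and do not come from the slides.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumed shape of one `tcpdump -nnq -tt` output line (IPv4 only):
#   1349190000.100000 IP 10.0.0.1.54321 > 10.0.0.2.80: tcp 120
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+) IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+):"
)


def request_latencies(
    lines: Iterable[str], server_port: int = 80
) -> Iterator[Tuple[float, str]]:
    """Yield (seconds, client_endpoint) each time a server packet
    directly follows a client packet on the same client endpoint."""
    last_sender: Dict[str, str] = {}  # endpoint -> "C" (client) or "S" (server)
    last_time: Dict[str, float] = {}  # endpoint -> timestamp of the last packet

    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a recognised packet line

        ts = float(match.group("ts"))
        from_server = int(match.group("sport")) == server_port

        # Always key the state on the client side of the conversation.
        if from_server:
            endpoint = f"{match.group('dst')}.{match.group('dport')}"
        else:
            endpoint = f"{match.group('src')}.{match.group('sport')}"

        # Client asked last and the server is now answering: report the gap.
        if from_server and last_sender.get(endpoint) == "C":
            yield ts - last_time[endpoint], endpoint

        last_sender[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts
```

Feeding the function the lines produced by the slide 43 tcpdump command (for example by iterating over `sys.stdin`) would give one latency per question and answer pair. Only the first server packet after a client packet is reported, which matches the awk filter's behaviour.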