Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12
Hi! I’m @postwait




                         I founded @OmniTI
                               and @MessageSystems
                               and @Circonus




Hi! I’m @postwait




                         I am very active in @TheOfficialACM
                         participating in @ACMQueue
                         and the practitioners board.




Hi! I’m @postwait




                         I (regrettably) build complex systems.




Why we are here




                         We’re here to talk about
                         coping with breakage




Rule #1




                         Direct observation of failure
                         leads to quicker rectification.




Rule #2




                         You cannot correct
                         what you cannot measure.




Solution Approach #1



                         Debugging failures requires either
                         visibility into the
                         precipitating state




Precipitating State



                         Single threaded applications



                         ✓ Easy

Precipitating State



                         Multi-threaded applications



                         ✓ Challenging

Precipitating State



                         Distributed applications




                              here there be dragons




Solution Approach #2



                         or
                         direct observation of a
                         failing transaction
                         (and likely very many of them)




Direct Observation




                         Observing something fail...
                         is priceless.




Direct Observation




                         Observation leads to
                         intelligent questioning.




Direct Observation




                         Questioning leads to answers...
                         but only through more observation.


                                    and herein lies the rub.


Leaning Towards Scientific Process



                         In production you don’t have
                           • repeatability
                           • control groups
                           • external verification

                                              ... or do you?

What’s monitoring got to do with it?




                         Monitoring is all about the
                         passive observation of
                         telemetry data.




Monitoring Telemetry



                         cannot pinpoint problems


                         can provide evidence of
                         the existence of a problem




Monitoring




                         Gives you evidence that
                         there is a problem




Monitoring




                         Gives you evidence that
                         you have fixed a problem
                         (or at least the symptoms)




Monitoring Tactically




                         If it could be of interest,
                         measure it and
                         expose the measurement
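
A minimal sketch of "measure it and expose the measurement", assuming a statsd-style collector listening for UDP on 127.0.0.1:8125. The metric names and the helper functions format_metric and emit are hypothetical; only the "name:value|type" line format comes from statsd itself.

```python
import socket

def format_metric(name: str, value: float, kind: str) -> bytes:
    """Render one metric in the statsd plain-text line format: name:value|type."""
    return f"{name}:{value}|{kind}".encode("ascii")

def emit(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; a missing collector must never break the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(payload, (host, port))
        except OSError:
            pass  # telemetry is best-effort

# Hypothetical metric names, for illustration only.
emit(format_metric("api.requests", 1, "c"))        # counter increment
emit(format_metric("api.service_time", 42, "ms"))  # timer, in milliseconds
```

UDP is used here so that instrumenting a hot code path costs one datagram and never blocks on the collector.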




Monitoring: embedded
                  statsd        https://github.com/etsy/statsd
                  resmon        http://labs.omniti.com/labs/resmon
                  metrics       https://github.com/codahale/metrics
                  folsom        https://github.com/boundary/folsom
                  metrics.js    https://github.com/mikejihbe/metrics
                  metrics-net   https://github.com/danielcrenna/metrics-net




Monitoring: collection
                  reconnoiter   http://labs.omniti.com/labs/reconnoiter
                  graphite      http://graphite.wikidot.com/
                  OpenTSDB      http://opentsdb.net/
                  circonus      http://circonus.com/
                  librato       https://metrics.librato.com/




Monitoring: Bling
                         visualizing an architecture rollout




Monitoring: Bling
                     visualizing the impact on service times




average API service time latency




actual API service time latency




                  http://www.slideshare.net/postwait/atldevops
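
The graphs from these two slides did not survive, so here is a small worked example of how an "average" can differ from what requests actually experience. The numbers are made up for illustration and are not taken from the talk.

```python
import statistics

# Made-up service times in milliseconds: 95 fast requests, 5 very slow ones.
latencies_ms = [10] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)      # pulled far upward by 5 outliers
median_ms = statistics.median(latencies_ms)  # what a typical request sees
worst_ms = max(latencies_ms)
slow_share = sum(1 for t in latencies_ms if t > 1000) / len(latencies_ms)

print(f"mean={mean_ms} median={median_ms} max={worst_ms} slow={slow_share:.0%}")
```

No request in this sample took anywhere near the mean: most were about ten times faster and a few were nearly twenty times slower.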



Repeatability is a Pipe Dream


                         Your production problem is a
                         (hopefully pathological)
                         outcome of circumstance.


                         A circumstance which often
                         cannot be repeated.



Control Groups



                         Control groups can
                         compensate for the
                         inability to
                         precisely repeat an experiment.




Control Groups




                         Most architectures have redundancy.




Control Groups




                         With the right design,
                         you can turn that redundancy
                         into a debugging environment.


                  [1] http://omniti.com/surge/2012/sessions/xtreme-deployment




Control Groups: Simple Example



                         I have 10 web servers
                         I fix 1
                         I verify 1 is fixed
                         I verify 9 are still broken
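
The four steps above can be sketched as a check over per-host error rates. The host names, counts, and thresholds are hypothetical and chosen only for illustration.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0

# Hypothetical (errors, requests) per host, gathered after patching web01 only.
observed = {"web01": (2, 1000)}
observed.update({f"web{i:02d}": (150, 1000) for i in range(2, 11)})

treated = {"web01"}
treated_rate = max(error_rate(*observed[h]) for h in treated)
control_rate = min(error_rate(*observed[h]) for h in observed if h not in treated)

# The fix is credible only if the patched host improved AND the controls did not.
fix_confirmed = treated_rate < 0.01 and control_rate > 0.10
print(fix_confirmed)
```

Checking the nine untouched servers matters as much as checking the fixed one: if they recovered too, something other than the fix changed.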




Control Groups: Seems Easy



                         Web servers tend to be:
                           • homogeneous
                           • share-(nothing|little)
                           • independent




Control Groups: Not So Easy



                         Most other services aren’t so
                         homogeneous and equal:
                         databases, batch processes (think
                         billings), orchestration middleware,
                         message queues



Observability


                         Some might claim that
                         seeing telemetry data is
                         observation...


                         It is doubly indirect at best.



Observability



                         I want to
                         directly see
                         the
                         errant behaviour




Observability is forgiving



                         In complex, multi-component
                         architectures, errors can be
                         observed as errant behaviour in
                         many junction points.




Observing the network




                         tcpdump / snoop
                         wireshark




Observing the network



                         Looking at just the
                         arrival of new connections

                         tcpdump -nnq -tttt -s384
                         'tcp port 80 and (tcp[13] & (2|16) == 2)'
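
For readers unfamiliar with the filter expression: tcp[13] is the TCP flags byte, 2 is the SYN bit, and 16 is the ACK bit. A small sketch of the same test, with a hypothetical function name:

```python
TCP_FIN, TCP_SYN, TCP_RST, TCP_PSH, TCP_ACK = 0x01, 0x02, 0x04, 0x08, 0x10

def is_new_connection(flags: int) -> bool:
    """Mirror of the filter 'tcp[13] & (2|16) == 2': SYN set, ACK clear."""
    return flags & (TCP_SYN | TCP_ACK) == TCP_SYN

print(is_new_connection(TCP_SYN))            # client's opening SYN
print(is_new_connection(TCP_SYN | TCP_ACK))  # server's SYN-ACK reply
```

Masking out ACK is what keeps the server's SYN-ACK out of the capture, so each new connection shows up once.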




Observing the network


                         Looking at just the data
                         arrival and departure times
                         tcpdump -nnq -tt
                         -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

                         snoop -rq -ta
                         -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
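
The filter above keeps only packets that carry data: IP total length, minus the IP header length, minus the TCP header length, must be non-zero. The same arithmetic as a sketch over a raw IPv4 packet; the function names and the fake packet builder are hypothetical.

```python
def tcp_payload_length(ip_packet: bytes) -> int:
    """Bytes of TCP payload in a raw IPv4 packet, using the filter's arithmetic."""
    ip_header_len = (ip_packet[0] & 0x0F) * 4            # (ip[0] & 0xf) * 4
    total_len = int.from_bytes(ip_packet[2:4], "big")    # ip[2:2]
    tcp_header_len = (ip_packet[ip_header_len + 12] & 0xF0) // 4  # (tcp[12] & 0xf0) / 4
    return total_len - ip_header_len - tcp_header_len

def fake_packet(payload: bytes) -> bytes:
    """Build a minimal IPv4+TCP packet with 20-byte headers and no options."""
    ip = bytearray(20)
    ip[0] = 0x45                                         # version 4, header 5 words
    ip[2:4] = (40 + len(payload)).to_bytes(2, "big")     # total length field
    tcp = bytearray(20)
    tcp[12] = 0x50                                       # data offset 5 words
    return bytes(ip + tcp) + payload

print(tcp_payload_length(fake_packet(b"")))       # bare ACK: filtered out
print(tcp_payload_length(fake_packet(b"hello")))  # data packet: kept
```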




Observing the network
                         Finding the difference between
                         a client’s question and
                         a server’s answer
                         (tcpdump | awk filter).
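                         {
                             # split the port off each address so it becomes its own field
                             gsub("[.][0-9]+(: | >)"," & ");
                             gsub("[:=]"," ");
                             # EP is the client endpoint (address and port), whichever side is speaking
                             EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

                             # server speaks after the client spoke last: print the elapsed time
                             if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

                             S[EP]= ($4==".80")?"S":"C";
                             L[EP]= $1;
                         }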
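
The same state machine, sketched in Python for readers who do not read awk. It assumes packets have already been parsed into (timestamp, source, destination) tuples rather than tcpdump text; the function name and that input shape are hypothetical.

```python
def response_delays(packets, server_port=80):
    """Yield (delay_seconds, client) each time the server answers a waiting client."""
    last_speaker = {}   # client endpoint -> "C" or "S"
    last_time = {}      # client endpoint -> timestamp of its previous data packet
    for ts, (src_ip, src_port), (dst_ip, dst_port) in packets:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        if from_server and last_speaker.get(client) == "C":
            yield ts - last_time[client], client
        last_speaker[client] = "S" if from_server else "C"
        last_time[client] = ts

# Made-up capture: one request, then two response packets.
sample = [
    (1.0,  ("10.0.0.1", 5000), ("10.0.0.2", 80)),
    (1.25, ("10.0.0.2", 80),   ("10.0.0.1", 5000)),
    (1.5,  ("10.0.0.2", 80),   ("10.0.0.1", 5000)),
]
delays = list(response_delays(sample))
print(delays)
```

Only the first server packet after a client packet produces a measurement, so a multi-packet response is counted once.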



Observing user-space



                         strace[1] / truss
                         gstack / pstack
                         gcore + gdb / dbx / mdb[2]


                  [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
                  [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf




System call tracing




                         Watching sshd
                         is a good way to get familiar.
                         truss -f -p `pgrep sshd`




System call tracing




                         An active web server is going to be
                         like a firehose.
                         truss -f -p `pgrep httpd`




Observing the system



                         DTrace


                         Live production demo or GTFO.




Thank You




                         Questions?




Tuesday, October 2, 12

Weitere ähnliche Inhalte

Was ist angesagt?

Observability, what, why and how
Observability, what, why and howObservability, what, why and how
Observability, what, why and howNeeraj Bagga
 
Observability at Scale
Observability at Scale Observability at Scale
Observability at Scale Knoldus Inc.
 
Observability vs APM vs Monitoring Comparison
Observability vs APM vs  Monitoring ComparisonObservability vs APM vs  Monitoring Comparison
Observability vs APM vs Monitoring Comparisonjeetendra mandal
 
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...DevOps.com
 
Observability – the good, the bad, and the ugly
Observability – the good, the bad, and the uglyObservability – the good, the bad, and the ugly
Observability – the good, the bad, and the uglyTimetrix
 
Logging and observability
Logging and observabilityLogging and observability
Logging and observabilityAnton Drukh
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observabilityDanylenko Max
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) ObservabilityChristoph Engelbert
 
Observability; a gentle introduction
Observability; a gentle introductionObservability; a gentle introduction
Observability; a gentle introductionBram Vogelaar
 
Do You Really Need to Evolve From Monitoring to Observability?
Do You Really Need to Evolve From Monitoring to Observability?Do You Really Need to Evolve From Monitoring to Observability?
Do You Really Need to Evolve From Monitoring to Observability?Splunk
 
Intro To Observability-March-2023.pdf
Intro To Observability-March-2023.pdfIntro To Observability-March-2023.pdf
Intro To Observability-March-2023.pdfPremDomingo
 
Combining Logs, Metrics, and Traces for Unified Observability
Combining Logs, Metrics, and Traces for Unified ObservabilityCombining Logs, Metrics, and Traces for Unified Observability
Combining Logs, Metrics, and Traces for Unified ObservabilityElasticsearch
 
Building a centralized observability platform
Building a centralized observability platformBuilding a centralized observability platform
Building a centralized observability platformElasticsearch
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityElasticsearch
 
Observability and its application
Observability and its applicationObservability and its application
Observability and its applicationThao Huynh Quang
 

Was ist angesagt? (20)

Observability, what, why and how
Observability, what, why and howObservability, what, why and how
Observability, what, why and how
 
Observability
ObservabilityObservability
Observability
 
Observability
ObservabilityObservability
Observability
 
Observability at Scale
Observability at Scale Observability at Scale
Observability at Scale
 
Observability vs APM vs Monitoring Comparison
Observability vs APM vs  Monitoring ComparisonObservability vs APM vs  Monitoring Comparison
Observability vs APM vs Monitoring Comparison
 
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
 
Observability – the good, the bad, and the ugly
Observability – the good, the bad, and the uglyObservability – the good, the bad, and the ugly
Observability – the good, the bad, and the ugly
 
Shift left Observability
Shift left ObservabilityShift left Observability
Shift left Observability
 
Logging and observability
Logging and observabilityLogging and observability
Logging and observability
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
 
Observability & Datadog
Observability & DatadogObservability & Datadog
Observability & Datadog
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) Observability
 
Observability; a gentle introduction
Observability; a gentle introductionObservability; a gentle introduction
Observability; a gentle introduction
 
Do You Really Need to Evolve From Monitoring to Observability?
Do You Really Need to Evolve From Monitoring to Observability?Do You Really Need to Evolve From Monitoring to Observability?
Do You Really Need to Evolve From Monitoring to Observability?
 
Observability
Observability Observability
Observability
 
Intro To Observability-March-2023.pdf
Intro To Observability-March-2023.pdfIntro To Observability-March-2023.pdf
Intro To Observability-March-2023.pdf
 
Combining Logs, Metrics, and Traces for Unified Observability
Combining Logs, Metrics, and Traces for Unified ObservabilityCombining Logs, Metrics, and Traces for Unified Observability
Combining Logs, Metrics, and Traces for Unified Observability
 
Building a centralized observability platform
Building a centralized observability platformBuilding a centralized observability platform
Building a centralized observability platform
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
 
Observability and its application
Observability and its applicationObservability and its application
Observability and its application
 

Andere mochten auch

The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.Theo Schlossnagle
 
Data viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueData viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueMakoto Inoue
 
Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?pdyball
 
Performance and Metrics at Lonely Planet
Performance and Metrics at Lonely PlanetPerformance and Metrics at Lonely Planet
Performance and Metrics at Lonely PlanetMark Jennings
 
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012Tim Morrow
 
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack CloudsIn-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack CloudsPLUMgrid
 
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?Andy Davies
 
Bring the Noise
Bring the NoiseBring the Noise
Bring the NoiseJon Cowie
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks
 
Velocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and youVelocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and youPatrick Meenan
 
Integrating multiple CDNs at Etsy
Integrating multiple CDNs at EtsyIntegrating multiple CDNs at Etsy
Integrating multiple CDNs at EtsyLaurie Denness
 
Getting 100B Metrics to Disk
Getting 100B Metrics to DiskGetting 100B Metrics to Disk
Getting 100B Metrics to Diskjthurman42
 
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...James Wickett
 
Linux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene PirogovLinux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene PirogovPivorak MeetUp
 

Andere mochten auch (20)

The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
 
Nonlinear observer design
Nonlinear observer designNonlinear observer design
Nonlinear observer design
 
Data viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueData viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoue
 
Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?
 
Performance and Metrics at Lonely Planet
Performance and Metrics at Lonely PlanetPerformance and Metrics at Lonely Planet
Performance and Metrics at Lonely Planet
 
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
 
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack CloudsIn-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
 
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
 
Bring the Noise
Bring the NoiseBring the Noise
Bring the Noise
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
 
Velocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and youVelocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and you
 
Integrating multiple CDNs at Etsy
Integrating multiple CDNs at EtsyIntegrating multiple CDNs at Etsy
Integrating multiple CDNs at Etsy
 
Getting 100B Metrics to Disk
Getting 100B Metrics to DiskGetting 100B Metrics to Disk
Getting 100B Metrics to Disk
 
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
 
Atldevops
AtldevopsAtldevops
Atldevops
 
Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
 
What's in a number?
What's in a number?What's in a number?
What's in a number?
 
Xtreme Deployment
Xtreme DeploymentXtreme Deployment
Xtreme Deployment
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
 
Linux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene PirogovLinux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene Pirogov
 

Ähnlich wie Monitoring and observability

Productivity, Productivity, Productivity
Productivity, Productivity, ProductivityProductivity, Productivity, Productivity
Productivity, Productivity, ProductivityFabian Alcantara
 
Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012Ryan Weald
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)packetloop
 
Optimizing the Mobile Search Experience
Optimizing the Mobile Search ExperienceOptimizing the Mobile Search Experience
Optimizing the Mobile Search ExperienceMonetate
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersChris Dagdigian
 
Optimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerceOptimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerceKellan
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentationJustin Dorfman
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentationjames tong
 
Big Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyBig Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyJason Davis
 
Automatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from TwitterAutomatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from TwitterMarieke van Erp
 
The Web Designers Toolkit
The Web Designers ToolkitThe Web Designers Toolkit
The Web Designers ToolkitR/GA
 

Ähnlich wie Monitoring and observability (14)

Productivity, Productivity, Productivity
Productivity, Productivity, ProductivityProductivity, Productivity, Productivity
Productivity, Productivity, Productivity
 
Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012
 
Continous delivery
Continous deliveryContinous delivery
Continous delivery
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)
 
Optimizing the Mobile Search Experience
Optimizing the Mobile Search ExperienceOptimizing the Mobile Search Experience
Optimizing the Mobile Search Experience
 
Twitter Storm
Twitter StormTwitter Storm
Twitter Storm
 
Measuring
MeasuringMeasuring
Measuring
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility Managers
 
Optimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerceOptimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerce
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
Big Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyBig Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at Etsy
 
Automatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from TwitterAutomatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from Twitter
 
The Web Designers Toolkit
The Web Designers ToolkitThe Web Designers Toolkit
The Web Designers Toolkit
 

Mehr von Theo Schlossnagle

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to ComplexityTheo Schlossnagle
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwareTheo Schlossnagle
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or NotTheo Schlossnagle
 
Applying SRE techniques to micro service design
Applying SRE techniques to micro service designApplying SRE techniques to micro service design
Applying SRE techniques to micro service designTheo Schlossnagle
 
A Coherent Discussion About Performance
A Coherent Discussion About PerformanceA Coherent Discussion About Performance
A Coherent Discussion About PerformanceTheo Schlossnagle
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012Theo Schlossnagle
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentationTheo Schlossnagle
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoringTheo Schlossnagle
 
Building Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachTheo Schlossnagle
 

Mehr von Theo Schlossnagle (20)

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
 
Monitoring 101
Monitoring 101Monitoring 101
Monitoring 101
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or Not
 
Applying SRE techniques to micro service design
Applying SRE techniques to micro service designApplying SRE techniques to micro service design
Applying SRE techniques to micro service design
 
Craftsmanship
CraftsmanshipCraftsmanship
Craftsmanship
 
Commandments of scale
Commandments of scaleCommandments of scale
Commandments of scale
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
 
Project reality
Project realityProject reality
Project reality
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
 
Operational Software Design
Operational Software DesignOperational Software Design
Operational Software Design
 
A Coherent Discussion About Performance
A Coherent Discussion About PerformanceA Coherent Discussion About Performance
A Coherent Discussion About Performance
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012
 
Omnios and unix
Omnios and unixOmnios and unix
Omnios and unix
 
It's all about telemetry
It's all about telemetryIt's all about telemetry
It's all about telemetry
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentation
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoring
 
Building Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approach
 
Webops dashboards
Webops dashboardsWebops dashboards
Webops dashboards
 



Monitoring and observability

  • 1. Monitoring and Observability in Complex Architectures Tuesday, October 2, 12
  • 2. Hi! I’m @postwait I founded @OmniTI and @MessageSystems and @Circonus
  • 3. Hi! I’m @postwait I am very active in @TheOfficialACM participating in @ACMQueue and the practitioners board.
  • 4. Hi! I’m @postwait I (regrettably) build complex systems.
  • 5. Why we are here We’re here to talk about coping with breakage
  • 6. Rule #1 Direct observation of failure leads to quicker rectification.
  • 7. Rule #2 You cannot correct what you cannot measure.
  • 8. Solution Approach #1 Debugging failures requires either visibility into the precipitating state
  • 9. Precipitating State Single threaded applications ✓ Easy
  • 10. Precipitating State Multi-threaded applications ✓ Challenging
  • 11. Precipitating State Distributed applications here there be dragons
  • 12. Solution Approach #2 or direct observation of a (and likely very many) failing transaction
  • 13. Direct Observation Observing something fail... is priceless.
  • 14. Direct Observation Observation leads to intelligent questioning.
  • 15. Direct Observation Questioning leads to answers... but only through more observation.
  • 16. Direct Observation Questioning leads to answers... but only through more observation. and herein lies the rub.
  • 17. Leaning Towards Scientific Process In production you don’t have • repeatability • control groups • external verification
  • 18. Leaning Towards Scientific Process In production you don’t have • repeatability • control groups • external verification ... or do you?
  • 19. What’s monitoring got to do with it? Monitoring is all about the passive observation of telemetry data.
  • 20. Monitoring Telemetry cannot pinpoint problems, but it can provide evidence of the existence of a problem
  • 21. Monitoring Gives you evidence that there is a problem
  • 22. Monitoring Gives you evidence that you have fixed a problem (or at least the symptoms)
  • 23. Monitoring Tactically If it could be of interest, measure it and expose the measurement
  • 24. Monitoring: embedded statsd (https://github.com/etsy/statsd), metrics (https://github.com/codahale/metrics), resmon (http://labs.omniti.com/labs/resmon), folsom (https://github.com/boundary/folsom), metrics.js (https://github.com/mikejihbe/metrics), metrics-net (https://github.com/danielcrenna/metrics-net)
  • 25. Monitoring: collection reconnoiter (http://labs.omniti.com/labs/reconnoiter), circonus (http://circonus.com/), graphite (http://graphite.wikidot.com/), librato (https://metrics.librato.com/), OpenTSDB (http://opentsdb.net/)
  • 26. Monitoring: Bling visualizing an architecture rollout
  • 27. Monitoring: Bling visualizing the impact on service times
  • 28. average API service time latency
  • 29. actual API service time latency http://www.slideshare.net/postwait/atldevops
  • 31. Repeatability is a Pipe Dream Your production problem is a (hopefully pathological) outcome of circumstance. A circumstance which often cannot be repeated.
  • 32. Control Groups Control groups can compensate for the inability to precisely repeat an experiment.
  • 33. Control Groups Most architectures have redundancy.
  • 34. Control Groups With the right design, you can turn that redundancy into a debugging environment. [1] http://omniti.com/surge/2012/sessions/xtreme-deployment
  • 35. Control Groups: Simple Example I have 10 web servers I fix 1 I verify 1 is fixed I verify 9 are still broken
  • 36. Control Groups: Seems Easy Web servers tend to be: • homogeneous • share-(nothing|little) • independent
  • 37. Control Groups: Not So Easy Most other services aren’t so homogeneous and equal: databases, batch processes (think billings), orchestration middleware, message queues
  • 38. Observability Some might claim that seeing telemetry data is observation... It is doubly indirect at best.
  • 39. Observability I want to directly see the errant behaviour
  • 40. Observability is forgiving In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
  • 41. Observing the network tcpdump / snoop wireshark
  • 42. Observing the network Looking at just the arrival of new connections tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
  • 43. Observing the network Looking at just the data arrival and departure times tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)' snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
  • 44. Observing the network Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter). { gsub(".[0-9]+(: | >)"," & "); gsub("[:=]"," "); EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4); if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); } S[EP]= ($4==".80")?"S":"C"; L[EP]= $1; }
  • 47. Observing user-space strace[1] / truss gstack / pstack gcore + gdb / dbx / mdb[2] [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
  • 48. System call tracing Watching sshd is a good way to get familiar. truss -f -p `pgrep sshd`
  • 49. System call tracing An active web server is going to be like a firehose. truss -f -p `pgrep httpd`
  • 50. Observing the system DTrace Live production demo or GTFO.
  • 51. Thank You Questions?
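
The packet filters and awk script on slides 42 to 44 are dense, so here is a rough Python sketch of the same logic, written for readability rather than as a replacement for the originals. It assumes `tcpdump -nnq -tt` prints lines shaped like `<epoch> IP <src>.<port> > <dst>.<port>: tcp <len>`; that layout, the regular expression, and the function names `is_new_connection`, `payload_length` and `response_latencies` are illustrative assumptions, not part of the talk.

```python
import re

# Assumed tcpdump -nnq -tt line layout (hypothetical sample):
#   1349190000.100000 IP 10.0.0.1.54321 > 10.0.0.2.80: tcp 120
LINE = re.compile(
    r"^(?P<ts>\d+\.\d+) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): "
)


def is_new_connection(tcp_flags_byte):
    """Same test as tcp[13] & (2|16) == 2: SYN (0x02) set, ACK (0x10) clear."""
    return tcp_flags_byte & (0x02 | 0x10) == 0x02


def payload_length(ip_total_len, ip_first_byte, tcp_offset_byte):
    """Same arithmetic as the slide 43 filter: bytes left after both headers."""
    ip_header = (ip_first_byte & 0x0F) * 4       # IHL is counted in 32-bit words
    tcp_header = (tcp_offset_byte & 0xF0) // 4   # equals (byte >> 4) * 4
    return ip_total_len - ip_header - tcp_header


def response_latencies(lines, server_port=80):
    """Pair each client packet with the next server packet on that endpoint.

    Returns (client_endpoint, seconds) tuples, like the awk script prints.
    """
    last_sender = {}  # client endpoint -> "C" or "S" (the awk array S)
    last_time = {}    # client endpoint -> timestamp   (the awk array L)
    results = []
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip anything that does not look like a packet line
        ts = float(m["ts"])
        from_server = int(m["sport"]) == server_port
        # The key is always the client side of the conversation.
        if from_server:
            endpoint = f'{m["dst"]}.{m["dport"]}'
        else:
            endpoint = f'{m["src"]}.{m["sport"]}'
        if from_server and last_sender.get(endpoint) == "C":
            results.append((endpoint, ts - last_time[endpoint]))
        last_sender[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts
    return results
```

Feeding the output of the slide 43 command into `response_latencies` line by line yields one latency per client question that was followed by a server answer. As in the awk version, only the first server packet after a client packet is counted, so a multi-packet response produces a single measurement.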