4. Synthetics
Stephen Falken: Uh, uh, General, what you see on these screens up
here is a fantasy; a computer-enhanced hallucination. Those blips
are not real missiles. They're phantoms. (War Games, 1983)
9. “Alert me if requests take longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
10. “Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting
11. ‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of terms
S = the sum of the numbers in the set
Math Refresher
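The refresher above, A = S/N, as a one-liner, using the samples from slide 10 as a check:

```python
def mean(samples):
    # A = S / N: the sum of the terms divided by the number of terms
    return sum(samples) / len(samples)

# Slide 10's samples: four slow requests diluted by two fast ones.
print(mean([10, 10, 210, 210, 210, 210]))  # 143.33..., the "143" on the slide
```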
12. median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value:    111 222 333 444 555 666 777 888 999
Sample #:   1   2   3   4   5   6   7   8   9
Math Refresher
13. 90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value:    111 222 333 444 555 666 777 888 999 1,000 1,111
Sample #:   1   2   3   4   5   6   7   8   9    10    11
Math Refresher
14. 100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value:    111 222 333 444 555 666 777 888 999 1,000 1,111
Sample #:   1   2   3   4   5   6   7   8   9    10    11
Math Refresher
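The slide values can be reproduced with a "lower-interpolation" quantile: index q*(n-1) into the sorted samples. This is one common convention, not the only one; other tools interpolate between neighboring samples and will give slightly different answers.

```python
def quantile(samples, q):
    """Lower-interpolation quantile: pick the element at index q*(n-1)
    of the sorted samples. This matches the values on slides 12-14."""
    s = sorted(samples)
    return s[int(q * (len(s) - 1))]

nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

print(quantile(nine, 0.5))    # 555  -- the median, q(0.5), from slide 12
print(quantile(eleven, 0.9))  # 1000 -- the 90th percentile, q(0.9), slide 13
print(quantile(eleven, 1.0))  # 1111 -- the 100th percentile (maximum), slide 14
```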
22. Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.
25. Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, i.e. 300
MB transferred over 5 minutes = 1 MB/s rate billing
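The billing recipe above can be sketched in a few lines. The sample values here are hypothetical; a real month has 8,640 five-minute intervals, but the mechanics are the same:

```python
def billable_rate_mbps(samples_mb, interval_seconds=300):
    """95th-percentile billing: sort the 5-minute usage samples, throw
    out the highest 5%, and bill at the highest remaining sample,
    converted to a per-second rate."""
    s = sorted(samples_mb)
    kept = s[:int(len(s) * 0.95)]  # drop the top 5% of samples
    return max(kept) / interval_seconds

# 20 hypothetical intervals: one 900 MB burst is forgiven, and the
# next-highest sample (300 MB over 300 s) sets the bill at 1 MB/s.
usage = [100] * 18 + [300, 900]
print(billable_rate_mbps(usage))  # 1.0
```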
26. Practical Percentiles
If I measure 95th percentile per 5 minutes all
month long,
I CANNOT calculate 95th percentile over the
month.
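A quick demonstration of why per-interval percentiles cannot be averaged. The two windows below are hypothetical, and the quantile convention is lower-interpolation, but any convention shows the same failure:

```python
def p95(samples):
    # lower-interpolation 95th percentile (one common convention)
    s = sorted(samples)
    return s[int(0.95 * (len(s) - 1))]

# Two hypothetical 5-minute windows of request latencies (ms):
slow = [10] * 10 + [1000] * 10   # p95 of this window alone: 1000
fast = [10] * 20                 # p95 of this window alone: 10

avg_of_p95s = (p95(slow) + p95(fast)) / 2  # 505.0 -- a meaningless number
true_p95 = p95(slow + fast)                # 1000  -- from the raw data
print(avg_of_p95s, true_p95)
```

The averaged figure (505) understates the true 95th percentile (1000) by half; only the raw samples can answer the question.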
29. “Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
30. “Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
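Slides 29 and 30 can be checked in code. Note that slide 30's "~270" suggests linear interpolation (which gives 265 here); the lower-interpolation sketch below returns 250 instead. Either way the value clears the 200 ms threshold and the alert fires, while the single-outlier case stays quiet:

```python
def q(samples, p):
    # lower-interpolation quantile; reproduces slide 29's q(0.9) == 10
    s = sorted(samples)
    return s[int(p * (len(s) - 1))]

THRESHOLD_MS = 200

one_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 5000]  # slide 29
many_slow = [10, 10, 10, 10, 10, 10, 250, 300]        # slide 30

print(q(one_outlier, 0.9) > THRESHOLD_MS)  # False -- don't wake anyone up
print(q(many_slow, 0.9) > THRESHOLD_MS)    # True  -- real user pain
```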
A synthetic is basically a bot check against your system. One of the benefits (perhaps the only benefit) of the synthetic is that it’s more highly available than the application you are monitoring.
The responses from synthetic requests don't tell you anything meaningful about how actual users experience your application.
What am I looking at here? This is a time series graph of response times from synthetic login checks against a website. The results are remarkably consistent, as they should be.
It gives you the viewpoint of one user: a computer somewhere dispatches a request over the same network route to your server. It records several metrics about how your application responds; time to establish the SSL connection, time to the first byte served, average request time, and so on.
Those metrics are not only useless (unless anyone here runs a service for just one user… in that case, kudos), they lie to you. These are LIES. They falsely represent the health of your application. All a synthetic really answers is a binary: is the service up, or is the service down? That’s all you get.
Your user base will likely have a distribution of ages, genders, devices, network connections.
The synthetic check used an external user agent, but you can use collection tools like statsd or log analysis to record request times for real users. This is better than only using a synthetic check, but this technique still has a number of shortcomings. The first is that collection data is averaged over an interval (generally 10 seconds to a minute).
So if Cyndi, Bobby, and Mike are all shopping at your website at the same time, you only see the average of their request times over a given interval. Mike might be having a decent experience on his 100-megabit office network, Bobby a great one on gig-e, and Cyndi a miserable one on 10 megabit, but the average collapses all three into a single middling value that reflects none of their actual experiences.
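For reference, recording a real-user request time into statsd is just a UDP datagram in the `<name>:<value>|ms` timer format. A minimal sketch; the host, port, and metric name here are placeholder assumptions, adjust for your setup:

```python
import socket

def statsd_timing(name, ms, host="127.0.0.1", port=8125):
    """Send a statsd timer datagram ("<name>:<value>|ms").
    Host/port and metric name are assumptions, not a fixed convention."""
    msg = f"{name}:{ms}|ms".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg, (host, port))
    sock.close()
    return msg  # returned so the formatted datagram can be inspected

statsd_timing("web.request_time", 142)
```

Remember that whatever aggregates these timers downstream will typically average them per flush interval, which is exactly the problem described above.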
The second shortcoming of a time series average value graph is spike erosion, a side effect of downsampling. When a graph covers a wide time range, each plotted point is an average over an interval much longer than the actual collection interval, so spikes get averaged away. As you zoom in, the data is averaged over intervals closer to the collection interval and the spikes reappear. As you can see on this graph, when we zoom into a 2 hour view of the graph we just looked at, the maximum value we see is 2,000 milliseconds instead of 500 milliseconds: four times what the zoomed-out view suggested.
If you alert based on values you get from the graphs I’ve shown, what value do you alert on? As you’ve seen, avoiding both false positives and false negatives is impossible.
A single outlier sample will trigger this alert, so people switch to alerting on an average instead.
Anything over 200 ms is too slow, so we alert on the average instead. Here 66% of the samples are over 200 ms, yet no alert is thrown. This averaging is the workaround people use to avoid the outlier in the previous slide, and it fails in the opposite direction.
0th quantile is first element
A histogram is one of the seven basic tools of quality. The Y axis indicates the number of samples, and the X axis indicates the sample value. One use of a histogram that you may have seen is plotting human height against the number of people who are that tall.
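A histogram is just bucketed counting. A minimal text-mode sketch, with hypothetical request times; bucket width and values are illustrative only:

```python
from collections import Counter

def histogram(samples, bucket_width):
    """Bucket samples and print an ASCII histogram: bar length is the
    sample count (the Y axis), each row is a value bucket (the X axis)."""
    counts = Counter((v // bucket_width) * bucket_width for v in samples)
    for bucket in sorted(counts):
        print(f"{bucket:>5} | {'#' * counts[bucket]}")
    return counts

# Hypothetical request times (ms), bucketed into 50 ms bins.
histogram([23, 25, 27, 140, 145, 150, 150, 155, 160, 900], 50)
```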
Human height follows what is called a normal distribution (also known as a Gaussian distribution). The majority of the population groups around one value and tapers off at the high and low sample values. With a perfect normal distribution, the arithmetic mean (the average) and the median are one and the same.
The mode is also equal to the median. You’ve most likely heard the term standard deviation before. With a normal distribution, 68% of the values lie within one standard deviation on either side of the mean, 95% within 2 standard deviations, and 99.7% within 3 sigma. The smaller the standard deviation, the closer the data is to the mean; the larger one sigma is, the farther the data spreads from the mean. It is important to note that these metrics only make sense for a normal distribution, where there is a single mode.
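The 68-95-99.7 rule is easy to verify empirically on synthetic normal data; a quick sketch with seeded random samples:

```python
import random

# Empirical check of the 68-95-99.7 rule on synthetic Gaussian data.
random.seed(42)  # fixed seed so the illustration is repeatable
data = [random.gauss(0, 1) for _ in range(100_000)]

for k in (1, 2, 3):
    frac = sum(abs(x) <= k for x in data) / len(data)
    print(f"within {k} sigma: {frac:.3f}")  # roughly 0.683, 0.954, 0.997
```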
This is a non-normal distribution. In this example, large numbers of samples are grouped at the highest and lowest sample values. Because there are two distinct peaks, this is called a bimodal distribution (a special case of a multi-modal distribution). In a multimodal distribution like this, standard deviation and multi-sigma values are useless.
This is another non-normal distribution. As you can see, it only has one mode, and is a skewed distribution. Standard deviation has little to no meaning here.
Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution - notice the grouping between the spike at ~150 milliseconds, and the long tail past there. There’s another smaller spike at ~25 ms, so this is mostly a bimodal distribution.
In terms of website performance, people will generally get angry if request times take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users.
People on left side are having a great experience, people on right side are leaving the site.
Note that this is for a time slice, say 5 minutes. What does this look like if we integrate over time?
Heat maps are visual representations of histograms over time windows. They give you a visualization of data distributions over time.
With heat maps, you can add percentile overlays to show the 50th, 95th, or any other percentile over time slices.
A percentile is a cut point: for the 95th percentile, 95% of the samples lie to its left and the remaining 5% to its right. There is a caveat when the cut point lands exactly on a data point. If you count that boundary sample with the right side, at least 95% of the whole data set is to the left; if you count it with the left side, at most 95% is. If you have exactly two samples, any value between them is a valid median. Samples sitting on the boundary are effectively counted twice when you divide the data set into two sets.
There are some bespoke edge cases of histograms and quantiles you probably didn’t know about; for the purpose of our examples, we’ll avoid them. If you see a histogram where the ⅓ quantile and the ⅔ quantile are equal in value, the fractions on either side add up to more than 100%. A histogram of one value is one example (everything is measured twice). Compare the data sets {1, 2} and {1, 2, 3}.
Percentiles cannot be averaged. You have to calculate them from the raw usage data. There are several monitoring solutions out there that will let you average percentiles - this is flat out WRONG
What’s your SLA? If you set your 95th percentile target at 250 ms and you meet your SLA, you’re still pissing off 5% of your users. They’re going to your competitor. Let’s try to calculate how many users you are screwing.
Take the number of requests that fall outside your 95th percentile target (the inverse quantile gives you that count), integrate it over time to get a cumulative number of users you’ve screwed, then multiply by the dollar value of each lost request: that’s how much money you’re losing.
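The back-of-the-envelope calculation above, sketched in code. The traffic volume, SLA, and per-request dollar value are hypothetical placeholders:

```python
def lost_revenue_per_hour(latencies_ms, sla_ms, value_per_request):
    """Count every request slower than the SLA as a lost user, and
    multiply by an assumed dollar value per lost request.
    (Assumes latencies_ms covers one hour of traffic.)"""
    screwed = sum(1 for ms in latencies_ms if ms > sla_ms)
    return screwed * value_per_request

# 1,000 requests in an hour, 50 of them over the 250 ms SLA,
# at an assumed $2.00 per lost request:
hour = [100] * 950 + [400] * 50
print(lost_revenue_per_hour(hour, 250, 2.00))  # 100.0 dollars/hour
```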
Circonus.com allows you to set percentile-based alerts, so that you’ll be alerted when users start getting pissed off. Here is a percentile-based alert; you can expand it to alert based on the number of users pissed off per hour, or even translate that to a dollar value using CAQL (the Circonus Analytics Query Language). So you can say ‘alert me if we are losing more than $500 worth of users per hour’. This is something you’ll never be able to do with threshold-based alerting, and the limit is essentially normalized to traffic loads, say holiday sale surges.