2. How much do outages cost us?
Facebook - $500k in just 30 min of outage in 2014
Amazon - $66k/min
Industry average - $300k/hour
Industry total lost revenue - $26.5B
3. What is monitoring?
The process of becoming aware of the state of a system.
Is my website up and accessible?
Does all the important functionality work?
Is each server up?
Are all the applications we deployed up?
What’s my CPU usage per machine? Disk? Memory? Swap?
4. Start simple
Basic monitoring systems that you can try straight away:
● Google Analytics (Android, iOS, Unity, HTTP, analytics.js)
● Fabric (Crashlytics integration for Android and iOS)
You can also check this detailed comparison table of different monitoring systems.
5. What does monitoring help with?
● Early problem detection
● Decision making
● Automation
6. Early problem detection
Performance
● Monitoring anomalies in the behavior of the system helps to detect resource
saturation and rare defects (hard to spot by QA)
● Particular types of bugs related to heavy system load are hard to detect in test
environments, but can be consistently reproduced in production
Availability
● Downtime usually translates directly to losses in revenue and credibility
● 99.99% availability is the industry standard (about 53 min of downtime per year)
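The downtime budget behind an availability target is simple arithmetic, and a short sketch makes it easy to compare targets. The function name and the 365.25-day year are assumptions chosen for illustration.

```python
# Minutes in an average year (365.25 days); an assumption for this sketch.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

With these assumptions, 99.99% works out to roughly 52.6 minutes of downtime per year.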
7. Decision making
Baselining
● Know the normal, average state of your system (baseline)
● Data-backed Service-Level Agreements (SLAs)
● In-depth performance analysis, saving costs
Predictions
● Help predict what normal traffic levels are during peaks of activity, like
holidays, social events and such (capacity planning)
● Close interaction with monitoring may help predict business trends
8. Automation
Allows the system to adapt automatically to high-load situations.
Bursts of input may saturate a system’s capacity, forcing it to drop
some traffic. To prevent a uniformly bad experience for all users, the
system deliberately rejects a portion of the inputs. This is commonly known as
admission control.
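One possible way to reject a portion of inputs is probabilistic load shedding: once the measured load exceeds capacity, each request is admitted with probability capacity / load. This is a minimal sketch, not a production design; the class name, its parameters, and the idea of passing the current load in from a monitoring system are all assumptions.

```python
import random
from typing import Optional


class AdmissionController:
    """Hypothetical load shedder: admits everything under capacity, a fraction above it."""

    def __init__(self, capacity_per_second: float, rng: Optional[random.Random] = None):
        self.capacity = capacity_per_second
        self.rng = rng or random.Random()

    def accept_probability(self, current_load: float) -> float:
        """Fraction of requests to admit at the given load (requests per second)."""
        if current_load <= self.capacity:
            return 1.0
        return self.capacity / current_load

    def admit(self, current_load: float) -> bool:
        """Decide for a single request whether to serve it or reject it."""
        return self.rng.random() < self.accept_probability(current_load)


if __name__ == "__main__":
    controller = AdmissionController(capacity_per_second=100)
    print(controller.accept_probability(50))   # under capacity: admit all
    print(controller.accept_probability(200))  # 2x capacity: admit about half
```

The admitted requests keep a normal experience, while the rejected ones fail fast instead of everyone timing out.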
10. Data collection
The sources of data are logs, device statistics, and system measurements:
● Logging network request failure rates (4xx, 5xx)
● Tracking performance of calls to individual
remote services
● Database calls and response time
● Disk and CPU usage
● Logging mobile client analytics events
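As an illustration of the first bullet, request failure rates can be derived from logged HTTP status codes. The function name and the input shape (a plain list of integer status codes) are assumptions for this sketch.

```python
from collections import Counter


def failure_rates(status_codes):
    """Return the share of 4xx and 5xx responses in a batch of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group codes by their class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


if __name__ == "__main__":
    print(failure_rates([200, 200, 404, 500]))
```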
11. Data aggregation and storage
● Incoming data inputs are grouped by their properties and stored as timeseries
● The resulting timeseries are submitted to an alarm evaluation engine, which
generates alarms if anomalies are detected (anomaly detection).
One such system is Graphite.
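The grouping step can be sketched as bucketing raw samples by metric name and time window, then averaging each bucket. The function names, the sample tuple shape, and the metric name are assumptions; the output lines follow the "<path> <value> <timestamp>" layout of Graphite's plaintext protocol, but nothing is sent over the network here.

```python
from collections import defaultdict


def aggregate(samples, bucket_seconds=60):
    """Group (metric, unix_timestamp, value) samples into time buckets and average each one."""
    buckets = defaultdict(list)
    for metric, timestamp, value in samples:
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        buckets[(metric, bucket_start)].append(value)

    series = defaultdict(list)
    for (metric, bucket_start), values in sorted(buckets.items()):
        series[metric].append((bucket_start, sum(values) / len(values)))
    return dict(series)


def to_graphite_lines(series):
    """Render each point as a '<path> <value> <timestamp>' line."""
    return [
        f"{metric} {value} {timestamp}"
        for metric, points in series.items()
        for timestamp, value in points
    ]


if __name__ == "__main__":
    raw = [("web.latency_ms", 0, 100.0), ("web.latency_ms", 30, 200.0),
           ("web.latency_ms", 60, 50.0)]
    for line in to_graphite_lines(aggregate(raw)):
        print(line)
```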
12. Presentation
Allows visualisation of the real-time state of the system. When a fault is identified
and fixed, the correction should be immediately visible.
One powerful tool for dashboarding is Grafana:
● Integrates with Graphite, InfluxDB, OpenTSDB, and KairosDB
● Introduction and basic concepts can be found here
● Useful video on how to set up your first dashboard
● Give it a try
13. Alerting
Alerting is the capability of a
monitoring system to detect and notify
the engineer about meaningful events.
14. Levels of alert urgency
● Alerts as records - anomalies that do not impact the service functionality.
● Alerts as notifications - do not need immediate attention.
● Alerts as pages - high severity, response time enforced by internal SLAs.
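The three levels can be expressed as a small routing rule. This is only a sketch: the two boolean inputs and the mapping from them to an urgency level are assumptions, and a real system would derive them from alert rules rather than pass them in by hand.

```python
from enum import Enum


class Urgency(Enum):
    RECORD = "record"              # stored for later investigation
    NOTIFICATION = "notification"  # seen by a human, but not urgent
    PAGE = "page"                  # interrupts the on-call engineer


def classify(impacts_users_now: bool, needs_action_soon: bool) -> Urgency:
    """Hypothetical rule mapping an event to one of the three urgency levels."""
    if impacts_users_now:
        return Urgency.PAGE
    if needs_action_soon:
        return Urgency.NOTIFICATION
    return Urgency.RECORD


if __name__ == "__main__":
    # A store running low on disk: no user impact yet, but action is needed within days.
    print(classify(impacts_users_now=False, needs_action_soon=True))
```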
16. Anomaly detection
The identification of items, events or observations which do not conform to an
expected pattern or other items in a dataset.
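A very simple instance of this idea is a z-score rule: flag any point that lies more than a chosen number of standard deviations from the mean. This is a generic textbook technique shown under assumed names and an assumed threshold of 3, not the method of any particular company or tool.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return the indices of points more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers under this rule
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    latencies_ms = [10] * 20 + [100]
    print(find_anomalies(latencies_ms))
```

A rule this simple ignores trend and seasonality, which is why production systems usually compare against a baseline for the same time of day or week.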
Let’s see how Uber does it.
17. Issue is detected and fixed, now what?
Detecting and fixing an issue are only the first steps. We need to make sure that the
issue does not happen again.
Use of postmortems is one interesting approach.
19. Conclusion
● Get in the habit of measuring: you cannot manage what you cannot measure
● Monitor extensively
● Alarm selectively
● Work smart, not hard: learn from the experience of others
● Have a tactic
Further reading: Effective Monitoring and Alerting
Today we will discuss what we love the most in engineering: being woken up at 4am because of a bug!
Talk about how to detect problems with your application and how to fix them as soon as possible
Has anybody used these tools?
The ability to predict demands and then match them based on seasonality translates directly into revenue gains
When a data store that supports a user-facing service starts serving queries much slower than usual, but not slow enough to make an appreciable difference in the overall service’s response time, that should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work
the data store is running low on disk space and should be scaled out in the next several days
Pics, charts, examples, how much time it takes to set up a system, conclusion, pitfalls
Baselining: “nothing endures but change”
Coverage: systems evolve, so should the coverage