Monitoring & alerting presentation sabin&mustafa

Monitoring & Alerting
Quick dive

How much do outages cost us?
Facebook - $500k in just 30 min of outage in 2014
Amazon - $66k/min
Industry average - $300k/hour
Industry total lost revenue - $26.5B

What is monitoring?
The process of becoming aware of the state of a system.
Is my website up and accessible?
Does all the important functionality work?
Is each server up?
Are all the applications we deployed up?
What’s my CPU, disk, memory and swap usage per machine?

Start simple
Basic monitoring systems that you can try straight away:
● Google Analytics (Android, iOS, Unity, HTTP, analytics.js)
● Fabric (Crashlytics integration for Android and iOS)
You can also check this detailed comparison table of different monitoring systems.

What does monitoring help with?
● Early problem detection
● Decision making
● Automation

Early problem detection
Performance
● Monitoring anomalies in the behavior of the system helps to detect resource
saturation and rare defects (hard to spot by QA)
● Particular types of bugs related to heavy system load are hard to detect in test
environments, but can be consistently reproduced in production
Availability
● Downtime usually translates directly to losses in revenue and credibility
● 99.99% availability is a common industry target (about 53 min of downtime per year)
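The downtime behind an availability figure is simple arithmetic. The sketch below is illustrative only: it assumes a 365-day year, and the function name `downtime_budget_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

Under this assumption, 99.99% works out to roughly 52.6 minutes per year, and each extra "nine" divides the budget by ten.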

Decision making
Baselining
● Know the normal, average state of your system (baseline)
● Data-backed Service-Level Agreements (SLAs)
● In-depth performance analysis, saving costs
Predictions
● Help predict what normal traffic levels are during peaks of activity, like
holidays, social events and such (capacity planning)
● Close interaction with monitoring may help predict business trends
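As a rough illustration of baselining, historical samples of a metric (for example response times) can be summarised by their mean and 95th percentile. This is a minimal sketch, not a prescribed method: the function name `baseline` and the nearest-rank percentile are assumptions chosen for simplicity.

```python
import statistics


def baseline(samples):
    """Summarise historical samples as a mean and a nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest rank: smallest position covering at least 95% of the samples.
    # Integer arithmetic avoids floating-point rounding in the rank.
    rank = (95 * len(ordered) + 99) // 100
    return {"mean": statistics.fmean(ordered), "p95": ordered[rank - 1]}
```

A figure such as the 95th percentile measured over a representative period can then back an SLA with data rather than a guess.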

Automation
Allows the system to adapt automatically to high-load situations.
Bursts of input may saturate a system’s capacity, forcing it to drop
some traffic. To avoid a uniformly bad experience for all users, the
system deliberately rejects a portion of inputs so that the rest are
served well. This is commonly known as admission control.
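One simple form of admission control is random load shedding driven by monitored load. The sketch below assumes the monitoring system supplies the current request rate and a known capacity. The names `reject_fraction` and `admit` are hypothetical, and real systems often use other schemes such as token buckets or priority queues.

```python
import random


def reject_fraction(incoming_rps: float, capacity_rps: float) -> float:
    """Fraction of requests to shed so that the admitted load fits the capacity."""
    if incoming_rps <= capacity_rps:
        return 0.0
    return 1.0 - capacity_rps / incoming_rps


def admit(incoming_rps: float, capacity_rps: float, rng=random) -> bool:
    """Decide whether to admit one request, shedding the excess at random."""
    return rng.random() >= reject_fraction(incoming_rps, capacity_rps)
```

With 400 requests per second arriving and capacity for 100, about three quarters of requests are rejected quickly, and the admitted quarter is served at normal speed.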

Monitoring system architecture
● Data collection
● Data aggregation and storage
● Presentation

Data collection
The sources of data are logs, device statistics, and system measurements:
● Logging network request failure rates (4xx, 5xx)
● Tracking performance of calls to individual
remote services
● Database calls and response time
● Disk and CPU usage
● Logging analytics events from mobile clients
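As an example of the first item, failure rates can be derived from the HTTP status codes already present in request logs. This is a minimal sketch that assumes the codes have been parsed into integers; the function name `failure_rates` is illustrative.

```python
from collections import Counter


def failure_rates(status_codes):
    """Return the share of 4xx and 5xx responses among the given HTTP status codes."""
    # Integer division by 100 maps 404 -> 4, 503 -> 5, and so on.
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}
```

For instance, `failure_rates([200, 200, 404, 500])` reports a quarter of requests as client errors and a quarter as server errors.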

Data aggregation and storage
● Incoming data inputs are grouped by their properties and stored as timeseries
● The resulting timeseries are submitted to an alarm evaluation engine, which
generates alarms if anomalies are detected (anomaly detection).
One system for storing timeseries is Graphite.
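The grouping step can be sketched as bucketing raw events by metric name and time window. The example below is an assumption-laden illustration: the event tuple layout, the one-minute bucket width, and the helper names are all hypothetical. The output lines follow Graphite's plaintext format of `<metric path> <value> <timestamp>`; actually sending them to a Graphite server is left out.

```python
from collections import defaultdict


def aggregate(events, bucket_seconds=60):
    """Sum (metric, unix_timestamp, value) events into fixed-width time buckets."""
    series = defaultdict(float)
    for metric, timestamp, value in events:
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        series[(metric, bucket_start)] += value
    return dict(series)


def to_graphite_lines(series):
    """Render aggregated buckets as '<metric path> <value> <timestamp>' lines."""
    return [f"{metric} {value} {ts}" for (metric, ts), value in sorted(series.items())]
```

Each (metric, bucket) pair becomes one data point, so a stream of individual error events turns into an errors-per-minute timeseries that an alarm engine can evaluate.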

Presentation
Allows visualisation of the real-time state of the system. When a fault is identified
and fixed, the correction should be immediately visible.
One powerful tool for dashboarding is Grafana:
● Integrates with Graphite, InfluxDB, OpenTSDB, and KairosDB
● Introduction and basic concepts can be found here
● Useful video on how to set up your first dashboard
● Give it a try

Alerting
Alerting is the capability of a
monitoring system to detect and notify
the engineer about meaningful events.

Levels of alert urgency
● Alerts as records - anomalies that do not impact service functionality.
● Alerts as notifications - issues that do not need immediate attention.
● Alerts as pages - high severity, response time enforced by internal SLAs.
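The three levels can be expressed as a small routing rule. The sketch below is one possible reading of the list above, not a standard: the two boolean inputs and the names `Urgency` and `classify` are assumptions made for illustration.

```python
from enum import Enum


class Urgency(Enum):
    RECORD = "record"              # stored for later analysis, nobody is notified
    NOTIFICATION = "notification"  # sent to a queue or channel, handled in working hours
    PAGE = "page"                  # wakes up the on-call engineer


def classify(impacts_service: bool, needs_immediate_attention: bool) -> Urgency:
    """Map an alert's impact and urgency to one of the three alert levels."""
    if not impacts_service:
        return Urgency.RECORD
    if needs_immediate_attention:
        return Urgency.PAGE
    return Urgency.NOTIFICATION
```

Keeping the rule explicit makes it easier to review which alerts are allowed to page someone.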

Tools
● PagerDuty
● OpsGenie
● VictorOps

Anomaly detection
The identification of items, events or observations which do not conform to an
expected pattern or other items in a dataset.
Let’s see how Uber does it.
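Independently of any particular production system, a very simple detector flags a value that lies too many standard deviations away from recent history (a z-score test). This is a minimal sketch for illustration only; the function name `is_anomaly` and the default threshold of 3 are assumptions, and real detectors usually also account for seasonality and trends.

```python
import statistics


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it deviates from `history` by more than `threshold` standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A perfectly flat history: any different value is unexpected.
        return value != mean
    return abs(value - mean) / stdev > threshold
```

With a history of `[100, 102, 98, 101, 99]`, a new value of 101 passes while 150 is flagged.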

Issue is detected and fixed, now what?
Detecting and fixing an issue are only the first steps. We need to make sure that the
issue does not happen again.
Writing postmortems is one useful approach.

Challenges
● Baselining
● Coverage
● Manageability
● Accuracy
● Context
● Human nature

Conclusion
● Get in the habit of measuring: you cannot manage what you cannot measure
● Monitor extensively
● Alarm selectively
● Work smart, not hard: learn from the experience of others
● Have a tactic
Further reading: Effective Monitoring and Alerting

Thank you!
Contact:
sabin.roman@gmail.com
https://nl.linkedin.com/in/sabinroman
