Ever since the #monitoringsucks trend kicked off a conversation about the state of monitoring tools in 2011, there has been a flurry of activity resulting in new solutions, improved tools, and applications generating tons of data. However, we are still faced with the same issues almost 4 years later. Alerts still generate far too much noise to be useful. Dashboards aren’t actionable and require human interpretation. The volume of log, time series, and other data makes it difficult to collate, visualize, and interpret in the mythical single pane of glass. How do we definitively solve these problems?
Data science. Using advances in data science and machine learning that are already being applied to “sexy” problems at companies around the globe, we can finally reach a tipping point when it comes to #monitoringsucks issues. New data science tools can pinpoint problems before they hit a static threshold, group alerts from a variety of sources into a single logical error, and prevent eye strain from studying hundreds of graphs. In this talk, I will be discussing the virtues – and pitfalls – of new monitoring entrants like Kale from Etsy, Bosun from Stack Exchange, and Twitter’s open source R package AnomalyDetection.
1. Data Science: The Solution to
#monitoringsucks
#DevOpsDays Amsterdam 2015
5. Who am I?
● Operations Engineer @ STYLIGHT GmbH
● 11+ years as a System Administrator
● Tired of being woken up at 2am by false
positives
● It all started with a 14.4Kbps modem
8. #monitoringsucks: A Brief History
● Started in 2011
● Loosely-organized movement to address the
shortcomings with monitoring tools
● Spawned an IRC channel and GitHub repo
linking to available tools
13. Why Does Monitoring Still Suck?
● Alerts generate far too much noise to be
useful
● Dashboards aren’t actionable and require
human interpretation
● Volume of data makes it difficult to collate,
visualize, and interpret
14. How does data science help us?
● Finding relationships and patterns in data
● Predictive analysis
● Anomaly detection in large datasets
● Natural Language Processing can process
and understand unstructured data
17. What does data science mean to me?
● Pinpoint problems before they hit a static
threshold
● Group alerts from a variety of sources into a
single logical event
● Prevent eye strain from studying hundreds of
graphs
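As an illustration of the second point above (grouping alerts from several sources into a single logical event), here is a minimal Python sketch. It is not taken from the talk or from any of the tools named in it. The `Alert` and `Event` types, the source names, and the 5-minute window are all assumptions made for the example; a real system would likely group on more than service name and time.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # which tool raised it (hypothetical names below)
    service: str      # the service the alert is about
    message: str


@dataclass
class Event:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        # Alerts are appended in time order, so the last one is the newest.
        return self.alerts[-1].timestamp


def group_alerts(alerts, window_seconds=300):
    """Merge alerts for the same service that arrive close together in time."""
    events = []
    latest = {}  # service -> most recent Event for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.alerts.append(alert)  # same incident, another symptom
        else:
            current = Event(service=alert.service, alerts=[alert])
            latest[alert.service] = current
            events.append(current)
    return events


alerts = [
    Alert(0, "nagios", "checkout", "HTTP 500"),
    Alert(60, "logs", "checkout", "error rate up"),
    Alert(90, "graphite", "checkout", "latency high"),
    Alert(100, "nagios", "search", "disk full"),
    Alert(1000, "nagios", "checkout", "HTTP 500"),
]
for event in group_alerts(alerts):
    print(event.service, [a.source for a in event.alerts])
```

With this input, the three checkout alerts in the first 90 seconds collapse into one event, so the on-call engineer would see three events instead of five alerts.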
18. What are the tools of the future?
● Kale - Etsy
● Bosun - Stack Exchange
● AnomalyDetection - Open source R package
from Twitter
26. Bosun
● Monitoring and alerting system by Stack
Exchange
● Domain Specific Language for alerts and
notifications
● Backtest your alerts against historical data
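To make the idea of backtesting concrete, here is a small Python sketch that replays an alert rule over historical data and reports when it would have fired. It is a generic illustration only: it does not use Bosun's DSL or API, and the `backtest` and `above` helpers and the sample data are hypothetical.

```python
def backtest(history, rule):
    """Replay a rule over historical (timestamp, value) points.

    Returns the timestamps at which the alert would have started firing.
    """
    fired_at = []
    firing = False
    for timestamp, value in history:
        now_firing = rule(value)
        if now_firing and not firing:
            fired_at.append(timestamp)  # transition from OK to firing
        firing = now_firing
    return fired_at


def above(threshold):
    """Build a simple static-threshold rule."""
    return lambda value: value > threshold


# Hypothetical CPU utilisation samples: (minute, percent)
history = [(0, 40), (1, 55), (2, 91), (3, 95), (4, 60), (5, 92), (6, 45)]

print(backtest(history, above(90)))  # [2, 5]
print(backtest(history, above(50)))  # [1]
```

Running candidate rules against the same history shows how often each would have paged someone, before the rule ever goes live.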
29. AnomalyDetection
● Open-source R package created by Twitter
● Detects anomalies in time series data and
numerical vectors
● Provides visualization support
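For a feel of what detecting anomalies in a numerical vector involves, here is a deliberately simple Python sketch based on the median and the median absolute deviation (MAD). It is not the AnomalyDetection package or its algorithm, which is an R library and considerably more sophisticated. The function name, the 3.5 cut-off, and the sample data are assumptions made for this example.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # More than half the values are identical, so there is no spread
        # to measure against; this simple sketch gives up in that case.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data is roughly normally distributed.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


requests_per_second = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
print(mad_anomalies(requests_per_second))  # [7]
```

The median and MAD are used instead of the mean and standard deviation because a single large spike barely moves them, so the spike cannot hide itself by inflating the baseline.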
33. Let’s Get In Touch
@patrickroelke
@codetailors
patrick.roelke@stylight.com
patrickroelke.com
Editor's Notes
Shopping business with CPC + CPO
Inspiration business with magazine advertising
headquartered in Munich (and offices in London and NYC)
we do business in 14 countries around the world - recently launched Belgium and Norway
we have co-workers from 25 nations!
sold $500 million worth of products for our partners
Thresholds for monitoring are arbitrary
Not every alert needs a response
Tuning the signal/noise ratio is difficult
There are spikes in the top charts. Something probably happened here, but what?
There is a big dip in a few charts. That's probably not a good thing, but what caused it? Or maybe it is good?
In this dashboard we have a little more information but it still is hard to actually interpret and act on this data.
Terabytes of log, time series, application, and other metrics are being generated.
Lots of specialized tools to manage each set of data
How do we cross reference, collate, visualize, and interpret this data? Data Science
Finding relationships in data - LinkedIn and Facebook try to connect you to people based on data
Predictive Analysis - Netflix and Amazon use predictive analysis to understand customer behavior
Twitter uses Anomaly Detection to find users abusing the service.
IBM’s Watson uses NLP to understand massive amounts of unstructured data
Amazon is trying to build a machine-learning model to predict an employee's access needs in its offices based on the employee's role
Kale was released in 2013; v2 is about to be released. It is an open-source software suite for pattern mining and anomaly detection.
Skyline - It shows all the current metrics that have been determined to be anomalous
Oculus - Once you’ve identified an interesting or anomalous metric, Oculus will find all of the other metrics in your systems which look similar
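The idea of finding metrics that "look similar" can be sketched in Python by ranking every metric by its Pearson correlation with the one you are interested in. This is only an illustration of the concept and not how Oculus itself is implemented. The metric names, the sample values, and the `most_similar` helper are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, in the range -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no shape to compare
    return cov / sqrt(var_x * var_y)


def most_similar(target, metrics, top_n=3):
    """Rank metrics by how closely their shape follows the target series."""
    scored = [(name, pearson(target, series)) for name, series in metrics.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


anomalous = [1, 2, 8, 2, 1]  # the metric flagged as interesting
metrics = {
    "web.latency": [10, 20, 80, 20, 10],
    "db.connections": [5, 5, 5, 5, 5],
    "cache.hits": [8, 7, 1, 7, 8],
}
for name, score in most_similar(anomalous, metrics):
    print(f"{name}: {score:.2f}")
```

Here `web.latency` ranks first because it spikes at the same moment, while `cache.hits` ranks last because it moves in the opposite direction. Sorting by absolute correlation instead would surface such inverse relationships too.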