Anomaly Detection - New York Machine Learning

Anomaly detection is the art of finding what you don't know how to ask for. In this talk, I walk through the why and how of building probabilistic models for a variety of problems including continuous signals and web traffic. This talk blends theory and practice in a highly approachable way.

  1. Anomaly Detection: How to Find What You Didn’t Know to Look For (October 14, 2014). © 2014 MapR Technologies; © MapR Technologies, confidential
  2. Anomaly Detection: How To Find What You Didn’t Know to Look For. Ted Dunning, Chief Applications Architect, MapR Technologies (email tdunning@mapr.com, tdunning@apache.org; Twitter @Ted_Dunning). Ellen Friedman, Consultant and Commentator (email ellenf@apache.org; Twitter @Ellen_Friedman)
  3. A New Look at Anomaly Detection, by Ted Dunning and Ellen Friedman, © June 2014 (published by O’Reilly). E-book available courtesy of MapR: http://bit.ly/1jQ9QuL
  4. Practical Machine Learning series (O’Reilly)
     • Machine learning is becoming mainstream
     • Need pragmatic approaches that take into account real-world business settings:
       – Time to value
       – Limited resources
       – Availability of data
       – Expertise and cost of team to develop and to maintain system
     • Look for approaches with big benefits for the effort expended
  5. Anomaly Detection
  6. Who Needs Anomaly Detection? Utility providers using smart meters
  7. Who Needs Anomaly Detection? Feedback from manufacturing assembly lines
  8. Who Needs Anomaly Detection? Monitoring data traffic on communication networks
  9. What is Anomaly Detection?
     • The goal is to discover rare events
       – especially those that shouldn’t have happened
     • Find a problem before other people see it
       – especially before it causes a problem for customers
     • Why is this a challenge?
       – I don’t know what an anomaly looks like (yet)
  10. Spot the Anomaly
  11. Spot the Anomaly: Looks pretty anomalous to me
  12. Spot the Anomaly: Will the real anomaly please stand up?
  13. Basic idea: Find “normal” first
  14. Steps in Anomaly Detection
     • Build a model: collect and process data for training a model
     • Use the machine learning model to determine what the normal pattern is
     • Decide how far from this normal pattern a value must be before you consider it anomalous
     • Use the anomaly detection (AD) model to detect anomalies in new data
       – Methods such as clustering for discovery can be helpful
  15. How hard is it to set an alert for anomalies? Grey data is from normal events; x’s are anomalies. Where would you set the threshold?
  16. Basic idea: Set adaptive thresholds
  17. What Are We Really Doing?
     • We want action when something breaks (dies/falls over/otherwise gets in trouble)
     • But action is expensive
     • So we don’t want too many false alarms
     • And we don’t want too many false negatives
     • What’s the right threshold to set for alerts?
       – We need to trade off costs
  18. A Second Look
  19. A Second Look: 99.9%-ile
  20. New algorithm: t-digest
  21. How Hard Can it Be? [Diagram: each input x feeds an online summarizer that estimates the 99.9%-ile threshold t; an alarm is raised when x > t]
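
The detector on slide 21 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's implementation: the class `ExactQuantileSummarizer`, the function `detect`, and the `warmup` parameter are hypothetical names, and the summarizer is a plain sorted list standing in for t-digest (it keeps every value, so memory grows with the stream, which is exactly what t-digest avoids).

```python
import bisect
import random


class ExactQuantileSummarizer:
    """Stand-in for t-digest (assumption): stores every value in sorted order."""

    def __init__(self):
        self._sorted = []

    def add(self, x):
        bisect.insort(self._sorted, x)

    def quantile(self, q):
        if not self._sorted:
            raise ValueError("no data yet")
        idx = min(len(self._sorted) - 1, int(q * len(self._sorted)))
        return self._sorted[idx]


def detect(stream, q=0.999, warmup=1000):
    """Yield (index, x, t) whenever x exceeds the current q-quantile t."""
    summarizer = ExactQuantileSummarizer()
    for i, x in enumerate(stream):
        if i >= warmup:  # wait until the quantile estimate is meaningful
            t = summarizer.quantile(q)
            if x > t:
                yield i, x, t
        summarizer.add(x)  # the threshold keeps adapting to new data


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
data[4000] = 12.0  # injected anomaly (synthetic test data)
alarms = list(detect(data))
```

Because the threshold is a quantile of the data seen so far, a small fraction of ordinary points will also raise alarms by construction; choosing the quantile is how the cost trade-off from slide 17 gets expressed.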
  22. Detecting Anomalies in Sporadic Events [Plot: counts[order(centroids)] against pnorm(centroids[order(centroids)])]
  23. Using t-Digest
     • Apache Mahout uses t-digest as an on-line percentile estimator
       – very high accuracy for extreme tails
       – new in Mahout v0.9
     • t-digest also available elsewhere
       – in streamlib (open source library on GitHub)
       – standalone (GitHub and Maven Central)
     • What’s the big deal with anomaly detection?
     • This looks like a solved problem
  24. Already Done? Etsy Skyline?
  25. What About This? [Plot: offset + noise + pulse1 + pulse2, with points marked A and B]
  26. Model Delta Anomaly Detection [Diagram: the model output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t]
  27. The Real Inside Scoop
     • The model-delta anomaly detector is really just a sum of random variables
       – the model we know about already
       – and a normally distributed error
     • The output (delta) is (roughly) the log probability of the sum distribution (really δ²)
     • Thinking about probability distributions is good
     • But how do you handle AD in systems with sporadic events?
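
One way to read the "really δ²" remark on slide 27 is to write out the density of a normally distributed error. The standard deviation σ below is a symbol introduced here for illustration; it does not appear on the slide.

```latex
p(\delta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\delta^{2}}{2\sigma^{2}}\right)
\quad\Longrightarrow\quad
-\log p(\delta) = \frac{\delta^{2}}{2\sigma^{2}} + \log\!\left(\sigma\sqrt{2\pi}\right)
```

Under that assumption, −log p differs from δ² only by a positive scale factor and an additive constant, so a quantile threshold on δ² (or on |δ|) flags the same points as a threshold on −log p.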
  28. Spot the Anomaly: Anomaly?
  29. Maybe not!
  30. Where’s Waldo? This is the real anomaly
  31. Normal Isn’t Just Normal
     • What we want is a model of what is normal
     • What doesn’t fit the model is the anomaly
     • For simple signals, the model can be simple … x ~ N(0, ε)
     • The real world is rarely so accommodating
  32-46. We Do Windows (the same title appears on each of these 15 slides)
  47. Windows on the World
     • The set of windowed signals is a nice model of our original signal
     • Clustering can find the prototypes
       – Fancier techniques available using sparse coding
     • The result is a dictionary of shapes
     • New signals can be encoded by shifting, scaling and adding shapes from the dictionary
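
The windowing and clustering idea on slide 47 can be sketched as follows. This is a simplified illustration, not the method used in the talk: all function names are hypothetical, the clustering is plain Lloyd's k-means with evenly spaced initial centers, and the encoder only scales one best-matching shape (no shifting or adding of several shapes, and no sparse coding). The test signal is a synthetic sine wave, not EKG data.

```python
import math


def windows(signal, width, step):
    """Cut a signal into fixed-width windows taken every `step` samples."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the k centers form the dictionary of shapes."""
    stride = max(1, len(points) // k)
    centers = [list(points[j * stride]) for j in range(k)]  # evenly spaced init
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def reconstruction_error(window, dictionary):
    """Error left after approximating the window by one scaled dictionary shape."""
    best = None
    for shape in dictionary:
        energy = sum(s * s for s in shape)
        scale = sum(w * s for w, s in zip(window, shape)) / energy if energy else 0.0
        err = math.sqrt(sq_dist(window, [scale * s for s in shape]))
        if best is None or err < best:
            best = err
    return best


# Synthetic "normal" signal: a sine wave with a period of 32 samples.
signal = [math.sin(2 * math.pi * i / 32) for i in range(2048)]
dictionary = kmeans(windows(signal, width=32, step=4), k=8)

normal = signal[100:132]
odd = [0.0] * 32
odd[16] = 3.0  # a spike the dictionary has never seen
normal_error = reconstruction_error(normal, dictionary)
odd_error = reconstruction_error(odd, dictionary)
```

The reconstruction error is the quantity that would then be fed to the one-dimensional threshold detector: windows that the dictionary explains well give small errors, and shapes it has never seen give large ones.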
  48. Most Common Shapes (for EKG)
  49. Reconstructed signal [Plot: original signal, reconstructed signal, and reconstruction error; < 1 bit / sample]
  50. An Anomaly: Original technique for finding 1-d anomaly works against reconstruction error
  51. Close-up of anomaly: Not what you want your heart to do. And not what the model expects it to do.
  52. A Different Kind of Anomaly
  53. Model Delta Anomaly Detection [Diagram: the model output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t]
  54. The Real Inside Scoop
     • The model-delta anomaly detector is really just a sum of random variables
       – the model we know about already
       – and a normally distributed error
     • The output (delta) is (roughly) the log probability of the sum distribution (really δ²)
     • Thinking about probability distributions is good
  55. Anomalies among sporadic events
  56. Sporadic Web Traffic to an e-Business Site: It’s important to know if traffic is stopped or delayed because of a problem… But visits to the site normally come at varying intervals. How long after the last event should you begin to worry?
  57. Sporadic Web Traffic to an e-Business Site: It’s important to know if traffic is stopped or delayed because of a problem… But visits to the site normally come at varying intervals. And how do you let your CEO sleep through the night?
  58. Basic idea: The time interval between events is what turns sporadic events into something useful you can measure
  59. Sporadic Events: Finding Normal and Anomalous Patterns
     • Time between events is much more usable than absolute times
     • Counts don’t link as directly to probability models
     • Time interval is log ρ
     • This is a big deal
  60. Event Stream (timing)
     • Events of various types arrive at irregular intervals
       – we can assume Poisson distribution
     • The key question is whether frequency has changed relative to expected values
       – This shows up as a change in interval
     • Want alert as soon as possible
  61. Converting Event Times to Anomaly [Plot with 99.9%-ile and 99.99%-ile levels marked]
  62. But in the real world, event rates often change
  63. Time Intervals Are Key to Modeling Sporadic Events
  64. Model-Scaled Intervals Solve the Problem
  65. Model Delta Anomaly Detection [Diagram: the model output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t; the diagram also carries the label “log p”]
  66. Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i pass through an n-event delay (Δn); a rate predictor, fed by the rate history, supplies λ; the scaled interval δ = λ(t_i − t_{i−n}) feeds a t-digest that estimates the 99.97%-ile threshold t; an alarm is raised when δ > t]
  67. Detecting Anomalies in Sporadic Events [same diagram as slide 66]
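
The model-scaled interval from slides 64 to 67 can be illustrated with a short simulation. This is a sketch under assumptions: `scaled_intervals` and `true_rate` are hypothetical names, the rate predictor is simply the known rate of the simulated process (a real predictor would be estimated from rate history), and the rate is taken at the time of the current event.

```python
import random


def scaled_intervals(times, rate_at, n=1):
    """delta_i = lambda(t_i) * (t_i - t_{i-n}) for every event that has n predecessors."""
    return [rate_at(times[i]) * (times[i] - times[i - n]) for i in range(n, len(times))]


def true_rate(t):
    # Hypothetical rate predictor: the rate jumps from 1 to 10 events
    # per unit time at t = 1000.
    return 1.0 if t < 1000.0 else 10.0


# Simulate a Poisson process whose rate changes half-way through.
random.seed(1)
times, t = [], 0.0
while t < 2000.0:
    t += random.expovariate(true_rate(t))
    times.append(t)

raw = [b - a for a, b in zip(times, times[1:])]
scaled = scaled_intervals(times, true_rate, n=1)

# Average scaled interval before and after the rate change.
slow = [d for ti, d in zip(times[1:], scaled) if ti < 1000.0]
fast = [d for ti, d in zip(times[1:], scaled) if ti >= 1000.0]
mean_slow = sum(slow) / len(slow)
mean_fast = sum(fast) / len(fast)
```

The raw intervals shrink by about a factor of ten when the rate rises, so a fixed threshold on them would either miss stalls in the busy period or fire constantly in the quiet one. The scaled intervals have roughly the same distribution in both periods, which is what lets a single quantile threshold work.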
  68. Slipped Week: Simple Rate Predictor [Plot: Main Page Traffic, hits (x 1000) by date, Nov 02 to Dec 02, with points marked A, B, C, D]
  69. Poisson Distribution
     • Time between events is exponentially distributed: $\Delta t \sim \lambda e^{-\lambda t}$
     • This means that long delays are exponentially rare: $P(\Delta t > T) = e^{-\lambda T}$, so $-\log P(\Delta t > T) = \lambda T$
     • If we know λ we can select a good threshold
       – or we can pick a threshold empirically
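
The threshold selection on slide 69 follows directly from inverting the tail probability: setting e^(−λT) equal to an acceptable false-alarm probability p gives T = −ln(p) / λ. A minimal sketch, with hypothetical function and parameter names:

```python
import math


def interval_threshold(rate, false_alarm_prob):
    """Gap length T such that P(gap > T) = false_alarm_prob at the given event rate."""
    return -math.log(false_alarm_prob) / rate


def anomaly_score(gap, rate):
    """-log P(gap' > gap) = rate * gap under the exponential model."""
    return rate * gap


# Example: 2 events per unit time, alarm on gaps that occur with probability 0.001.
T = interval_threshold(rate=2.0, false_alarm_prob=0.001)
```

The score λT is the same model-scaled interval used in the sporadic-event detector, so choosing a probability and choosing a threshold on the scaled interval are two views of the same decision.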
  70. Seasonality Poses a Challenge [Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
  71. Something more is needed … [Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
  72. We need a better rate predictor… [Diagram: incoming event times t_i pass through an n-event delay (Δn); a rate predictor, fed by the rate history, supplies λ; the scaled interval δ = λ(t_i − t_{i−n}) feeds a t-digest that estimates the 99.97%-ile threshold t; an alarm is raised when δ > t]
  73. A New Rate Predictor for Sporadic Events
  74. Improved Prediction with Adaptive Modeling [Plot: Christmas Prediction, hits (x 1000) by date, Dec 17 to Dec 29]
  75. Anomaly Detection + Classification: Useful Pair
     • Use the AD model to detect anomalies in new data
       – Methods such as clustering for discovery can be helpful
     • Once you have well-defined models in your system, you may also want to use classification to tag those
     • Continue to use the AD model to find new anomalies
  76. Recap (out of order)
     • Anomaly detection is best done with a probability model
     • −log p is a good way to convert to an anomaly measure
     • Adaptive quantile estimation (t-digest) works for auto-setting thresholds
  77. Recap
     • Different systems require different models
     • Continuous time-series
       – sparse coding to build signal model
     • Events in time
       – rate model based on variable-rate Poisson
       – segregated rate model
     • Events with labels
       – language modeling
       – hidden Markov models
  78. Why Use Anomaly Detection?
  79. Keep in mind…
     • Model normal, then find anomalies
     • t-digest for adaptive threshold
     • Probabilistic models for complex patterns
     [Plot: offset + noise + pulse1 + pulse2, with points marked A and B]
  80. Keep in mind…
     • Time intervals are key for sporadic events
     • Complex time shift to predict rate with seasonality
     • Sequence of events reveals phishing attack
     [Plot: Christmas Prediction, hits (x 1000) by date, Dec 17 to Dec 29]
  81. A New Look at Anomaly Detection, by Ted Dunning and Ellen Friedman, © June 2014 (published by O’Reilly). E-book available courtesy of MapR: http://bit.ly/1jQ9QuL
  82. Coming in October: Time Series Databases, by Ted Dunning and Ellen Friedman, © Oct 2014 (published by O’Reilly)
  83. Thank you for coming today!
  85. Sandbox
