Finding Changes in Real Data

This talk shows practical methods for finding changes in a variety of kinds of data and gives real-world examples from finance, telecom, systems monitoring and natural language processing.

Published in: Data & Analytics

  1. Detecting Change (© 2017 MapR Technologies)
  2. Contact Information
     • Ted Dunning, PhD
     • Chief Application Architect, MapR Technologies
     • Board member, Apache Software Foundation
     • O’Reilly author
     • Email: tdunning@mapr.com, tdunning@apache.org
     • Twitter: @ted_dunning
  3. Who We Are
     • MapR Technologies
       – We make a kick-ass platform for big data computing
       – Support many workloads including Hadoop / Spark / HPC / Other
       – Extended to allow streams and tables in basic platform
       – Free for academic research / training
     • Apache Software Foundation
       – Culture hub for building open source communities
       – Shared values around openness for contribution as well as use
       – Many major projects are part of Apache
       – Even more minor ones!
  4. Basic Outline
     • Goal Setting
     • Basic Ideas
       – LLR (finding changes in counts)
       – Poisson rate change detection (finding changes in event timing)
       – Distribution estimation / visualization
       – Labeled events and adding labels
     • Free Improvisation on Themes
  7. Why Is This Practically Important
     • The novice came to the master and said, “Something is broken”
     • The master replied, “What has changed?”
     • And the student was enlightened
  10. The Second Student
     • Another student said to the master, “I see something has changed … something may have broken”
     • The master replied, “You have no question to ask. You have no need of enlightenment”
     • And thus the student was enlightened
  11. There are some very powerful techniques available, some only very recently, that can make the detection of change much easier than you might think. I will describe the practical use of several of these techniques including t-digest, non-linear histograms, variable rate Poisson models and combinations of these.
  12. Comparing Counts
     • Suppose we have two situations A and B, each with many observations, nA and nB
     • And some event x occurred n1A and n1B times in each situation

              x      other
         A    n1A    nA - n1A
         B    n1B    nB - n1B

  13. Comparing Counts
     • Have we seen a change in the frequency of x?
     • Frequency ratios?
       – Breaks with small counts
     • χ² test?
       – Breaks with small counts
  14. Log-Likelihood Ratio Test (Root LLR)
     • In R

         entropy = function(k) {
           -sum(k*log((k==0)+(k/sum(k))))
         }
         llr = function(k) {
           (entropy(rowSums(k))+entropy(colSums(k))-entropy(k))*2
         }

     • Like mutual information * 2 N
  15. Spot the Anomaly
     • Root LLR is roughly like standard deviations

         A and B   not A and B   A and not B   not A and not B   root LLR
         13        1000          1000          100,000           0.89
         1         0             0             2                 1.95
         1         0             0             10,000            4.51
         10        0             0             100,000           14.29
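The R snippet on slide 14 can be checked against the four tables on slide 15. What follows is a minimal Python sketch of the same computation, not the author's code: `entropy` and `llr` mirror the R functions, while `root_llr` is a helper name assumed here for the square root that the slides call "root LLR".

```python
import math

def entropy(counts):
    """Unnormalized entropy, -sum(k * log(k / N)); zero counts contribute nothing."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(table):
    """Log-likelihood ratio (G statistic) for a contingency table given as rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    cells = [k for r in table for k in r]
    return 2.0 * (entropy(rows) + entropy(cols) - entropy(cells))

def root_llr(table):
    """Square root of the LLR; max() guards against tiny negative rounding error."""
    return math.sqrt(max(llr(table), 0.0))

# The four 2x2 tables from slide 15, rows (B, not B), columns (A, not A).
tables = [
    [[13, 1000], [1000, 100000]],
    [[1, 0], [0, 2]],
    [[1, 0], [0, 10000]],
    [[10, 0], [0, 100000]],
]
scores = [root_llr(t) for t in tables]
for t, s in zip(tables, scores):
    print(t, round(s, 2))
```

The scores should land close to the values printed on the slide. The slide's figures may be truncated rather than rounded, so small differences in the last digit are to be expected.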
  16. How Does it Work
     • Empirical fit to asymptotic distribution is very good
  17. How Does it Work?
  18. OK
     • We can detect changes in counts
  19. Real-life Example
     • Query: “Paco de Lucia”
     • Conventional meta-data search results:
       – “hombres de paco” times 400
       – not much else
     • Recommendation based search:
       – Flamenco guitar and dancers
       – Spanish and classical guitar
       – Van Halen doing a classical/flamenco riff
  20. Real-life Example
  21. Example 2 - Common Point of Compromise
     • Scenario:
       – Merchant 0 is compromised, leaks account data during compromise
       – Fraud committed elsewhere during exploit
       – High background level of fraud
       – Limited detection rate for exploits
     • Goal:
       – Find merchant 0
     • Meta-goal:
       – Screen algorithms for this task without leaking sensitive data
  22. Example 2 - Common Point of Compromise
     • [Diagram: data is skimmed at Merchant 0 (skim) and the skimmed data is used at Merchant n (exploit)]
     • Card data is stolen from Merchant 0
     • That data is used in frauds at other merchants
  23. Simulation Setup
     • [Figure: daily counts of compromises and frauds over 100 days, with the compromise period and the exploit period marked; axes: day, count]
  24. Detection Strategy
     • Select histories that precede non-fraud
     • And histories that precede fraud detection
     • Analyze 2x2 cooccurrence of merchant n versus fraud detection
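The detection strategy on slide 24 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the simulation used in the talk: the history format (a set of merchants visited plus a fraud flag), the function names `g_score` and `merchant_scores`, and the toy data are all hypothetical.

```python
import math

def g_score(k11, k12, k21, k22):
    """G statistic (LLR) for a 2x2 table: 2 * sum(observed * log(observed / expected))."""
    n = k11 + k12 + k21 + k22
    cells = (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    )
    g = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:
            g += observed * math.log(observed * n / (row_total * col_total))
    return 2.0 * g

def merchant_scores(histories):
    """histories: list of (set_of_merchants_visited, fraud_detected_afterwards)."""
    n_fraud = sum(1 for _, fraud in histories if fraud)
    n_clean = len(histories) - n_fraud
    merchants = set().union(*(visited for visited, _ in histories))
    scores = {}
    for m in merchants:
        fraud_with_m = sum(1 for v, f in histories if f and m in v)
        clean_with_m = sum(1 for v, f in histories if not f and m in v)
        scores[m] = g_score(fraud_with_m, n_fraud - fraud_with_m,
                            clean_with_m, n_clean - clean_with_m)
    return scores

# Toy data (hypothetical): merchant 0 appears in most fraud histories, rarely elsewhere.
histories = []
for i in range(200):                       # histories that precede non-fraud
    visited = {1 + i % 5, 6 + i % 7}
    if i % 20 == 0:
        visited.add(0)
    histories.append((visited, False))
for i in range(35):                        # histories that precede fraud detection
    visited = {1 + i % 5, 6 + i % 7}
    if i < 30:
        visited.add(0)
    histories.append((visited, True))

scores = merchant_scores(histories)
suspect = max(scores, key=scores.get)
print(suspect, round(scores[suspect], 1))
```

Note that the G statistic is unsigned: a merchant that appears unusually rarely before fraud would also score high, so a real screen would probably also check the direction of the deviation.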
  26. What about the real world?
  27. LLR score for real data
     • [Figure: scatter plot with axes labeled Number of Merchants and BreachScore (LLR); the outlying points are annotated “Really truly bad guys”]
  28. What about time?
  29. Finding Changes in Timing
     • Suppose our input is events embedded in time
     • Suppose we want to find changes in our input in real-time
     • Waiting and counting is fine if we don’t have to react now
     • We can do much better
  30. Poisson Event Rate Change
     • Detection of fallout
       – Time since last is very sensitive for complete failure
     • Detection of change relative to reference
       – Time since n-th most recent
       – LLR with time
     • Have to trade detection speed versus false positive rate and size of change
     • Can run multiple detectors at once
  31. Basic idea: Time interval is better than counts
  32. Sporadic Events: Finding Normal and Anomalous Patterns
     • Time between intervals is much more usable than absolute times
     • Counts don’t link as directly to probability models
     • Time interval is log ρ
     • This is a big deal
  33. Event Stream (timing)
     • Events of various types arrive at irregular intervals
       – we can assume Poisson distribution
     • The key question is whether frequency has changed relative to expected values
       – This shows up as a change in interval
     • Want alert as soon as possible
  34. Converting Event Times to Anomaly
     • [Figure: event intervals with 99.9%-ile and 99.99%-ile threshold lines]
  35. In the real world, event rates often vary
  36. Time Intervals Are Key to Modeling Sporadic Events
     • [Figure: interval dt (min) plotted against t (days)]
  38. Poisson Distribution
     • Time between events is exponentially distributed
     • This means that long delays are exponentially rare
     • If we know λ we can select a good threshold
       – or we can pick a threshold empirically
     • $\Delta t \sim \lambda e^{-\lambda t}$
     • $P(\Delta t > T) = e^{-\lambda T}$
     • $-\log P(\Delta t > T) = \lambda T$
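The exponential tail on slide 38 turns directly into a threshold rule. A minimal Python sketch, assuming the rate λ is known and constant; the function names are hypothetical:

```python
import math

def interval_threshold(rate, p_false_alarm):
    """Interval T such that P(dt > T) = p_false_alarm for a Poisson process with this rate.

    Solves exp(-rate * T) = p_false_alarm for T.
    """
    return -math.log(p_false_alarm) / rate

def interval_anomaly(rate, dt):
    """Anomaly score -log P(dt' > dt) = rate * dt for an observed gap dt."""
    return rate * dt

# With one event per unit time, a gap exceeded only 0.1% of the time:
t_999 = interval_threshold(1.0, 0.001)
print(round(t_999, 3))
```

Doubling the rate halves the threshold, which is why the later slides divide intervals by the predicted rate before comparing them with a fixed percentile.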
  39. After Rate Correction
     • [Figure: dt/rate plotted against t (days), with 99.9%-ile and 99.99%-ile threshold lines]
  40. Detecting Anomalies in Sporadic Events
     • [Diagram: incoming event times $t_i$ pass through an n-th order difference (Δn); a rate predictor fed by the rate history supplies λ; the scaled interval $\delta = \lambda (t_i - t_{i-n})$ is compared with a threshold $t$ taken from the 99.97%-ile of a t-digest; an alarm is raised when $\delta > t$]
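The detector in the diagram can be approximated in a few lines. The Python sketch below makes two simplifying assumptions that differ from the slides: the rate λ is a known constant rather than the output of a rate predictor, and the alarm threshold comes from the exact Erlang tail probability instead of an empirical t-digest percentile. The names `erlang_tail` and `IntervalDetector` are hypothetical.

```python
import math
from collections import deque

def erlang_tail(n, x):
    """P(sum of n unit-rate exponential intervals > x) = exp(-x) * sum_{k<n} x^k / k!."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return math.exp(-x) * total

class IntervalDetector:
    """Alarm when the scaled time since the n-th most recent event is improbably long."""

    def __init__(self, rate, n=1, p_alarm=0.0003):   # 0.0003 matches a 99.97%-ile cut
        self.rate, self.n, self.p_alarm = rate, n, p_alarm
        self.times = deque(maxlen=n)                 # the last n event times

    def delta(self, now):
        """delta = rate * (now - t_{i-n}), or None until n events have been seen."""
        if len(self.times) < self.n:
            return None
        return self.rate * (now - self.times[0])

    def check(self, now):
        """Can be called at any time, not only when an event arrives."""
        d = self.delta(now)
        return d is not None and erlang_tail(self.n, d) < self.p_alarm

    def observe(self, t):
        """Record an event at time t; returns True if the gap before it was anomalous."""
        alarm = self.check(t)
        self.times.append(t)
        return alarm

first_order = IntervalDetector(rate=1.0, n=1)
first_order.observe(0.0)
first_order.observe(1.0)
print(first_order.check(5.0), first_order.check(10.0))
```

This sketch only looks at the upper tail, so it detects slowdowns and outages. Finding speedups, as slide 51 describes for higher-order differences, would also need a test on the lower tail.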
  42. Seasonality Poses a Challenge
     • [Figure: Christmas Traffic, Hits/1000 by date, Nov 17 to Dec 27]
  43. Something more is needed …
     • [Figure: Christmas Traffic, Hits/1000 by date, Nov 17 to Dec 27]
  44. We need a better rate predictor…
     • [Diagram: the detector pipeline from slide 40]
  47. Idea: Predict log(rate) from lagged log(rate)
     • Predict log because
       – Peak to valley ratio
       – Traffic grew by 30%
       – All rates are positive
       – Just because I said so
     • Let model see many lagged values
     • Use L1 regularized linear model to pick important historical values
       – We would have moved to something fancier if this hadn’t worked
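To make the idea concrete, here is a small Python sketch of an L1-regularized linear model on lagged log(rate) values. Everything in it is an assumption for illustration: the synthetic hourly series with a 24-hour cycle, the 48 lags, the penalty `alpha`, and the plain coordinate-descent solver, which stands in for whatever Lasso implementation was actually used in the talk.

```python
import math
import random

def lagged_design(series, max_lag):
    """Rows [x[t-1], ..., x[t-max_lag]] paired with the target x[t]."""
    X = [[series[t - k] for k in range(1, max_lag + 1)]
         for t in range(max_lag, len(series))]
    y = [series[t] for t in range(max_lag, len(series))]
    return X, y

def soft_threshold(v, a):
    return math.copysign(max(abs(v) - a, 0.0), v)

def lasso_cd(X, y, alpha, sweeps=50):
    """Minimize (1/2n)||y - b - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - col_mean[j] for row in X] for j in range(p)]   # centered columns
    resid = [v - y_mean for v in y]                                  # residual at w = 0
    norm = [sum(c * c for c in col) / n for col in cols]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            if norm[j] == 0.0:
                continue
            col = cols[j]
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm[j] * w[j]
            new_w = soft_threshold(rho, alpha) / norm[j]
            step = new_w - w[j]
            if step != 0.0:
                resid = [r - step * c for r, c in zip(resid, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * m for wj, m in zip(w, col_mean))
    return w, intercept

# Synthetic log(rate): a daily cycle plus noise, sampled hourly for 20 days.
rng = random.Random(0)
log_rate = [1.0 + 0.5 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.05)
            for t in range(24 * 20)]
X, y = lagged_design(log_rate, max_lag=48)
w, b = lasso_cd(X, y, alpha=0.005)

pred = [b + sum(wj * xj for wj, xj in zip(w, row)) for row in X]
mse = sum((p_ - v) ** 2 for p_, v in zip(pred, y)) / len(y)
y_bar = sum(y) / len(y)
baseline = sum((v - y_bar) ** 2 for v in y) / len(y)
selected = [j + 1 for j, wj in enumerate(w) if wj != 0.0]   # lags with non-zero weight
print(len(selected), round(mse / baseline, 3))
```

Which lags survive depends on the penalty, the number of sweeps and the data. The point of the L1 penalty is only that most lags can end up with exactly zero weight, which leaves a short list of historical values for the predictor to use.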
  48. A New Rate Predictor for Sporadic Events
  49. Improved Prediction with Adaptive Modeling
     • [Figure: Christmas Prediction, Hits (x1000) by date, Dec 17 to Dec 29]
  50. Some days the magic works
     • Some days ... we use slightly different magic
  51. Detecting More Subtle Changes
     • Time-since-last finds complete failures well
     • Nth order time finds more subtle rate changes
     • But that subtlety delays detection of complete failure
       – First order delay has 99.9% confidence at 6.5 units
       – 10th order delay has 99.9% confidence at 12.5 units
     • But 10th order delay can find speedups, first order cannot
  52. 10th order difference of Poisson distribution
  53. Finding Changes in Time Series
     • So far, we only have times
     • What about when we have times and measurements together?
       – These are called time-series!
     • First step can be to discretize the measurement
       – Quintiles or deciles are good candidates
       – Multi-scale discretization is a fine thing to do
     • That gives us arrival times for measurements in each bin
       – And this is susceptible to the rate model on previous slides
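The discretization step on slide 53 can be sketched like this. It is a minimal Python illustration with hypothetical names; the quantile edges are taken from a reference sample, and each bin then yields its own list of arrival times that an interval-based rate model could consume.

```python
def quantile_edges(reference, bins):
    """Interior bin edges that split the reference sample into equal-count bins."""
    ordered = sorted(reference)
    return [ordered[(len(ordered) * i) // bins] for i in range(1, bins)]

def bin_of(value, edges):
    """Index of the bin containing value (0 .. len(edges))."""
    return sum(1 for e in edges if value >= e)

def arrival_times_by_bin(series, edges):
    """series: iterable of (time, measurement). Returns {bin index: [arrival times]}."""
    arrivals = {b: [] for b in range(len(edges) + 1)}
    for t, value in series:
        arrivals[bin_of(value, edges)].append(t)
    return arrivals

# Quintile edges from a reference sample, then a short time series split by bin.
edges = quantile_edges(range(100), bins=5)
series = [(0.0, 5), (1.0, 95), (2.5, 7), (4.0, 50)]
arrivals = arrival_times_by_bin(series, edges)
print(edges, arrivals)
```

A shift in the measurement distribution then shows up as some bins receiving events faster than expected and others more slowly.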
  54. Finding Changes in Time Series
     • Comprehensive approaches also possible (for counts)
     • Time aware variant of G-test is possible
     • Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993). http://bit.ly/surprise-and-coincidence
  55. Propagation Anomalies
     • What happens when something shadows part of the coverage field for mobile telecom?
       – Can happen in urban areas with a construction crane
     • Can solve heuristically
       – Subtract from reference image composed by long term averages
       – Doesn’t deal well with weak signal regions and low S/N
     • Can solve probabilistically
       – Compute anomaly for each measurement, use mean of log(p)
  58. Variable Signal/Noise Makes Heuristic Tricky
     • Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
  59. Other Issues
     • Finding changes in coverage area is similarly tricky
     • Coverage area is roughly where tower signal strength is higher than neighbors
     • Except for fuzziness due to hand-off delays
     • Except for bias due to large-scale caller motions
       – Rush hour
       – Event mobs
  60. Simple Answer for Propagation Anomalies
     • Cluster signal strength reports
     • Cluster locations using k-means, large k
     • Model report rate anomaly using discrete event models
     • Model signal strength anomaly using percentile model
     • Trade larger k against higher report rates, faster detection
     • Overall anomaly is sum of individual log(p) anomalies
  61. Tower Coverage Areas
  62. Just One Tower
  63. Cluster Reports for That Tower
  64. Cluster Reports for That Tower
     • [Figure: reports grouped into clusters numbered 1 to 9]
     • Can also sub-divide each cluster into signal strength ranges
     • Multiple scales of clustering can also be used to trade off geographic versus temporal resolution
  65. Example
     • [Figure: three panels showing dt for individual clusters]
     • Each cluster gives us a sequence of events.
     • Individual anomaly scores can be scaled and added to get composite anomaly score
     • Optimality of combined signal derives from optimality of components.
  66. Characterizing Distributions
     • What about sequences of values from arbitrary distributions
       – Can we find changes in the distribution?
       – For instance, what about latencies?
     • Non-linear histogram – FloatHistogram
     • Fully adaptive histogram – t-digest
  68. FloatHistogram
     • Assume all measurements lie in a known, bounded range
     • Divide this range into power of 2 sub-ranges
     • Sub-divide each sub-range evenly with a fixed number of steps
     • Relative error is bounded in measurement space
     • Bin index can be computed using FP representation!
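The binning scheme on the FloatHistogram slides can be illustrated with a short Python sketch. This is not the FloatHistogram code from the t-digest library. It is a hypothetical `LogLinearHistogram` class that uses `math.frexp` to read the exponent (the power-of-2 sub-range) and the mantissa (the position within that sub-range), which is the same information a bit-level implementation would take from the floating-point representation.

```python
import math

class LogLinearHistogram:
    """Power-of-2 sub-ranges, each split evenly into `steps` bins."""

    def __init__(self, lo, hi, steps=8):
        self.lo, self.hi, self.steps = lo, hi, steps
        self.e0 = math.frexp(lo)[1]                  # exponent of the lowest sub-range
        self.counts = [0] * (self.index(hi) + 1)

    def index(self, x):
        """Bin index from exponent and mantissa; values outside [lo, hi] are clamped."""
        x = min(max(x, self.lo), self.hi)
        mantissa, exponent = math.frexp(x)           # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
        return (exponent - self.e0) * self.steps + int((2.0 * mantissa - 1.0) * self.steps)

    def lower_edge(self, i):
        """Smallest value that falls into bin i."""
        octave, sub = divmod(i, self.steps)
        return math.ldexp(1.0 + sub / self.steps, self.e0 + octave - 1)

    def add(self, x):
        self.counts[self.index(x)] += 1

h = LogLinearHistogram(0.001, 1000.0, steps=8)
for latency in (0.0037, 0.5, 1.5, 3.3, 250.0):
    h.add(latency)
print(len(h.counts), h.index(1.5), h.lower_edge(h.index(1.5)))
```

Because each bin spans at most 1/steps of its lower edge, the relative error of reporting a bin edge instead of the true value is bounded by 1/steps across the whole range.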
  71. T-digest
     • Or we can talk about small errors in q
     • Accumulate samples, sort, merge
     • Merge if k-size < 1
     • Interpolate using centroids in x
     • Very good near extremes, no dynamic allocation
     • [Figure: k (0 to 10) plotted against q (0 to 1)]
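The accumulate, sort, merge loop can be sketched in Python as below. This is a simplified illustration, not the t-digest library: `TinyDigest` and its parameters are hypothetical, and the arcsine scale function `k_scale` is assumed here as one common choice for mapping q to k. Unlike the real implementation, this sketch allocates freely and makes no attempt at efficiency.

```python
import math

def k_scale(q, delta):
    """Assumed scale function: steep near q = 0 and q = 1, so tail centroids stay small."""
    return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)

def merge_centroids(centroids, delta):
    """Sort (mean, weight) pairs and merge neighbours while the merged k-size stays below 1."""
    centroids = sorted(centroids)
    total = sum(w for _, w in centroids)
    merged = []
    mean, weight = centroids[0]
    w_before = 0.0                                   # weight to the left of the current centroid
    for m, w in centroids[1:]:
        q_left = w_before / total
        q_right = min(1.0, (w_before + weight + w) / total)
        if k_scale(q_right, delta) - k_scale(q_left, delta) < 1.0:
            weight += w
            mean += (m - mean) * w / weight          # weighted running mean
        else:
            merged.append((mean, weight))
            w_before += weight
            mean, weight = m, w
    merged.append((mean, weight))
    return merged

class TinyDigest:
    def __init__(self, delta=100.0, buffer_size=500):
        self.delta, self.buffer_size = delta, buffer_size
        self.centroids, self.buffer = [], []

    def add(self, x):
        self.buffer.append((float(x), 1.0))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.centroids = merge_centroids(self.centroids + self.buffer, self.delta)
            self.buffer = []

    def quantile(self, q):
        """Interpolate between centroid means at the centroids' cumulative mid-weights."""
        self.flush()
        if not self.centroids:
            raise ValueError("empty digest")
        target = q * sum(w for _, w in self.centroids)
        cum, prev = 0.0, None
        for mean, w in self.centroids:
            mid = cum + w / 2.0
            if target <= mid:
                if prev is None:
                    return mean
                prev_mid, prev_mean = prev
                return prev_mean + (target - prev_mid) / (mid - prev_mid) * (mean - prev_mean)
            prev = (mid, mean)
            cum += w
        return self.centroids[-1][0]
```

A digest built this way keeps the number of centroids bounded by a small multiple of `delta` regardless of how many samples are added, while the centroids near q = 0 and q = 1 stay small enough to resolve extreme percentiles such as the 99.97%-ile used by the detector on slide 40.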
  72. Finding Change with Histograms
     • With fixed bins, we can simply count and compare counts for different bins
     • Thus, histogram change reduces to count change
     • Or to changes in event times
  73. Visualizing Histograms
     • We want to detect small changes
       – Consider log-scale for Y
     • Non-linear bin spacing is really good for increasing counts
       – Reweight by bin-width
       – Changing x axis changes y axis
  74. Good Results
  75. Bad Results
  76. Bad Results
  77. With Better Scaling
  78. Bad Results
  80. With FloatHistogram
  81. Summary
     • Counts
       – LLR
     • Events
       – Poisson + nth-order diffs
     • Decimate in space
     • Decimate in measurement space
       – t-digest, FloatHistogram
     • Don’t forget visualization
  82. Q & A