Finding Changes in Real Data

© 2017 MapR Technologies
Detecting Change

Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @ted_dunning

Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!

Basic Outline
• Goal Setting
• Basic Ideas
– LLR (finding changes in counts)
– Poisson rate change detection (finding changes in event timing)
– Distribution estimation / visualization
– Labeled events and adding labels
• Free Improvisation on Themes

Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened

The Second Student
• Another student said to the master, “I see something has
changed … something may have broken”

• The master replied, “You have no question to ask. You have no
need of enlightenment”
• And thus the student was enlightened

• There are some very powerful techniques available, some only
very recently, that can make the detection of change much
easier than you might think. I will describe the practical use of
several of these techniques including t-digest, non-linear
histograms, variable rate Poisson models and combinations of
these.

Comparing Counts
• Suppose we have two situations A and B, each with many
observations, nA and nB
• And some event x occurred n1A and n1B times in each situation
       x      other
A      n1A    nA - n1A
B      n1B    nB - n1B

Comparing Counts
• Have we seen a change in the frequency of x?
• Frequency ratios?
– Breaks with small counts
• χ² test?
– Breaks with small counts

Log-Likelihood Ratio Test (Root LLR)
• In R
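# Unnormalized Shannon entropy of a table of counts k.
# The (k==0) term makes empty cells contribute 0 instead of NaN.
entropy = function(k) {
  -sum(k * log((k == 0) + (k / sum(k))))
}
# Log-likelihood ratio (G statistic) for a contingency table k.
llr = function(k) {
  2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))
}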
• Like mutual information * 2 N

Spot the Anomaly
• Root LLR is roughly like standard deviations
Table 1: root LLR = 0.89
         A       not A
B        13      1000
not B    1000    100,000

Table 2: root LLR = 1.95
         A       not A
B        1       0
not B    0       2

Table 3: root LLR = 4.51
         A       not A
B        1       0
not B    0       10,000

Table 4: root LLR = 14.29
         A       not A
B        10      0
not B    0       100,000
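The four scores on this slide can be checked against the four tables with the entropy-based LLR function given earlier. This is a small sketch rather than anything from the original deck; the list name `tables` and the `pmax` guard against tiny negative rounding errors are additions for illustration.

```r
# Entropy-based LLR, as defined on the earlier slide
entropy <- function(k) -sum(k * log((k == 0) + k / sum(k)))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

# The four 2x2 tables from the slide (rows: B, not B; columns: A, not A)
tables <- list(
  matrix(c(13, 1000, 1000, 100000), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 2), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 10000), nrow = 2, byrow = TRUE),
  matrix(c(10, 0, 0, 100000), nrow = 2, byrow = TRUE)
)

# Root LLR; pmax guards against a slightly negative LLR from rounding
root_llr <- sapply(tables, function(k) sqrt(pmax(llr(k), 0)))
print(root_llr)
```

The results agree with the slide's figures to within about 0.01. The table with a single co-occurrence out of three observations scores under 2, while ten co-occurrences out of 100,010 score above 14.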

How Does it Work
Empirical fit to asymptotic
distribution is very good

How Does it Work?

OK
We can detect changes in counts

Real-life Example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Real-life Example

Example 2 - Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data

Example 2 - Common Point of Compromise
[Diagram: a skim exploit steals card data from Merchant 0; the skimmed data is then used in frauds at other merchants (Merchant n).]

Simulation Setup
[Figure: daily counts (0 to 500) of compromises and frauds over days 0 to 100, with the compromise period and the exploit period marked.]

Detection Strategy
• Select histories that precede non-fraud
• And histories that precede fraud detection
• Analyze 2x2 cooccurrence of merchant n versus fraud
detection
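The strategy above can be sketched as one 2x2 table per merchant: did the merchant appear in a history, crossed with whether that history preceded a detected fraud. All counts and merchant names below are hypothetical, and the sign convention in `signed_root_llr` is an assumption added so that over-represented merchants score positive.

```r
entropy <- function(k) -sum(k * log((k == 0) + k / sum(k)))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

# Hypothetical totals: histories preceding fraud and preceding non-fraud
n_fraud <- 1000
n_clean <- 100000

# Hypothetical per-merchant counts of histories that contain the merchant
merchants <- data.frame(
  merchant = c("merchant_0", "merchant_1", "merchant_2"),
  in_fraud = c(120, 40, 5),
  in_clean = c(300, 4000, 450)
)

# Root LLR, signed positive when the merchant is more common before fraud
signed_root_llr <- function(a, b) {
  k <- matrix(c(a, n_fraud - a, b, n_clean - b), nrow = 2, byrow = TRUE)
  s <- sqrt(max(llr(k), 0))
  if (a / n_fraud < b / n_clean) -s else s
}

merchants$score <- mapply(signed_root_llr, merchants$in_fraud, merchants$in_clean)
print(merchants[order(-merchants$score), ])
```

In this made-up data, merchant_1 appears often but at the same rate in both kinds of history, so it scores near zero, while merchant_0 stands out.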

What about the
real world?

[Figure: LLR score for real data. Axes: breach score (LLR), 0 to 80, and number of merchants on a log scale from 10^0 to 10^6. A few high-scoring outliers are labeled “Really truly bad guys”.]

What about time?

Finding Changes in Timing
• Suppose our input is events embedded in time
• Suppose we want to find changes in our input in real-time
• Waiting and counting is fine if we don’t have to react now
• We can do much better

Poisson Event Rate Change
• Detection of fallout
– Time since last is very sensitive for complete failure
• Detection of change relative to reference
– Time since n-th most recent
– LLR with time
• Have to trade detection speed versus false positive rate and
size of change
• Can run multiple detectors at once

Basic idea:
Time interval is better than counts

Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute
times
• Counts don’t link as directly to probability models
• Time interval is proportional to −log p (for Poisson events, −log P(Δt > T) = λT)
• This is a big deal

Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible

Converting Event Times to Anomaly
[Figure: event intervals converted to anomaly scores, with the 99.9th and 99.99th percentile levels marked.]

In the real world,
event rates often vary

Time Intervals Are Key to Modeling Sporadic Events
[Figure: interval between events, dt (min, 0 to 8), versus time t (days, 0 to 4).]

Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
$\Delta t \sim \lambda e^{-\lambda t}$
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
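Both ways of choosing a threshold can be sketched in a few lines. The rate `lambda` and the false-alarm probability `p_false` below are hypothetical values chosen for illustration, and the simulated gaps stand in for real reference data.

```r
lambda <- 2        # hypothetical event rate (events per minute)
p_false <- 0.001   # acceptable chance that a normal gap exceeds the threshold

# Known lambda: solve exp(-lambda * T) = p_false for T
threshold <- -log(p_false) / lambda

# Empirical alternative: take a high quantile of observed gaps
set.seed(42)
gaps <- rexp(100000, rate = lambda)   # simulated reference intervals
empirical <- as.numeric(quantile(gaps, 1 - p_false))

print(c(analytic = threshold, empirical = empirical))
```

The analytic threshold is the same value that `qexp(1 - p_false, rate = lambda)` returns. The empirical version needs no model, but it does need enough reference data to populate the tail.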

After Rate Correction
[Figure: rate-corrected interval, dt/rate (0 to 10), versus t (days, 0 to 4), with the 99.9th and 99.99th percentile levels marked.]

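Detecting Anomalies in Sporadic Events
[Diagram: incoming events at times t_i pass through an n-th order difference (Δn) and are scaled by the predicted rate λ to give δ = λ(t_i − t_{i−n}). The rate predictor is driven by the rate history. A t-digest tracks the 99.97th percentile of δ as the threshold t, and an alarm is raised when δ > t.]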
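A minimal sketch of this detector, under several assumptions: the rate λ is taken as a known constant instead of coming from a rate predictor, a plain `quantile()` over reference data stands in for the t-digest, and the event streams are simulated. The function name `nth_order_delta` is hypothetical.

```r
# Scaled n-th order interval: delta_i = lambda * (t_i - t_{i-n})
nth_order_delta <- function(times, lambda, n = 1) {
  i <- seq(n + 1, length(times))
  lambda * (times[i] - times[i - n])
}

set.seed(7)
lambda <- 5   # assumed known rate (events per unit time)

# Reference period: learn the 99.97th percentile of delta
reference <- cumsum(rexp(20000, rate = lambda))
threshold <- as.numeric(quantile(nth_order_delta(reference, lambda), 0.9997))

# Live period: 200 normal gaps, then the rate drops to a tenth
live <- cumsum(c(rexp(200, rate = lambda), rexp(200, rate = lambda / 10)))
delta <- nth_order_delta(live, lambda)
alarms <- which(delta > threshold)
print(head(alarms))
```

A production detector would also compare the time elapsed since the most recent event against the threshold, so that a complete outage raises an alarm without waiting for the next event to arrive.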

Seasonality Poses a Challenge
[Figure: Christmas traffic, hits/1000 (0 to 8) by date, Nov 17 to Dec 27.]

Something more is needed …
[Figure: Christmas traffic, hits/1000 (0 to 8) by date, Nov 17 to Dec 27.]

We need a better rate predictor…
[Diagram: the sporadic-event detector pipeline (incoming events, Δn, rate predictor and rate history, t-digest, alarm when δ > t).]

Idea: Predict log(rate) from lagged log(rate)
• Predict log because
– Peak to valley ratio
– Traffic grew by 30 %
– All rates are positive
– Just because I said so
• Let model see many lagged values
• Use L1 regularized linear model to pick important historical
values
– We would have moved to something fancier if this hadn’t worked
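The idea can be sketched in base R on synthetic data. Everything here is an assumption made for illustration: the hourly series with a daily cycle, the 48-hour lag window, the penalty value, and the hand-rolled coordinate-descent lasso `lasso_cd`, which stands in for whatever L1 solver was actually used (in practice a package such as glmnet would be the usual choice).

```r
set.seed(1)
hours <- 1:(24 * 28)
# Synthetic log(rate): a daily cycle plus noise
log_rate <- log(50) + sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 0.1)

# Column 1 of embed() is the current value, columns 2.. are lags 1..max_lag
max_lag <- 48
E <- embed(log_rate, max_lag + 1)
y <- E[, 1]
X <- scale(E[, -1], center = TRUE, scale = FALSE)

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimizes (1/2n)||y - mean(y) - X b||^2 + penalty * ||b||_1
lasso_cd <- function(X, y, penalty, sweeps = 100) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  r <- y - mean(y)                      # residual for beta = 0
  col_ss <- colSums(X^2) / n
  for (s in seq_len(sweeps)) {
    for (j in seq_len(ncol(X))) {
      r <- r + X[, j] * beta[j]         # remove column j's contribution
      beta[j] <- soft(sum(X[, j] * r) / n, penalty) / col_ss[j]
      r <- r - X[, j] * beta[j]         # add it back with the new weight
    }
  }
  beta
}

beta <- lasso_cd(X, y, penalty = 0.01)
pred <- mean(y) + as.vector(X %*% beta)   # predicted log(rate)
mse <- mean((y - pred)^2)
selected <- which(beta != 0)              # lags the model chose to keep
predicted_rate <- exp(pred)               # always positive
print(selected)
```

Predicting in log space keeps every predicted rate positive after exponentiation, and the L1 penalty leaves most lag coefficients at exactly zero.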

A New Rate Predictor for Sporadic Events

Improved Prediction with Adaptive Modeling
[Figure: Christmas prediction, hits (x1000, 0 to 8) by date, Dec 17 to Dec 29.]

Some days the magic works
Some days ...
We use slightly different magic

Detecting More Subtle Changes
• Time-since-last finds complete failures well
• Nth order time finds more subtle rate changes
• But that subtlety delays detection of complete failure
– First order delay has 99.9% confidence at 6.5 units
– 10th order delay has 99.9% confidence at 12.5 units
• But 10th order delay can find speedups, first order cannot
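The point about speedups can be illustrated with the gamma distribution: for Poisson events at unit rate, the time spanned by n consecutive intervals is gamma distributed with shape n. The sketch below looks only at the lower tail; the false-alarm level `p` and the tenfold `speedup` are hypothetical choices.

```r
p <- 0.001      # false-alarm probability for the "too fast" test
speedup <- 10   # hypothetical tenfold increase in event rate

# Lower thresholds at unit rate: alarm if the span is shorter than this
lower_first <- qgamma(p, shape = 1, rate = 1)    # single interval
lower_tenth <- qgamma(p, shape = 10, rate = 1)   # span of 10 intervals

# Probability of raising the alarm once the rate has really sped up
power_first <- pgamma(lower_first, shape = 1, rate = speedup)
power_tenth <- pgamma(lower_tenth, shape = 10, rate = speedup)

print(c(lower_first = lower_first, lower_tenth = lower_tenth,
        power_first = power_first, power_tenth = power_tenth))
```

A single interval has to be almost zero before it counts as surprisingly short, so even a tenfold speedup rarely trips the first-order test, while the tenth-order span catches it almost every time.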

10th order difference of
Poisson distribution

Finding Changes in Time Series
• So far, we only have times
• What about when we have times and measurements together?
– These are called time-series!
• First step can be to discretize the measurement
– Quintiles or deciles are good candidates
– Multi-scale discretization is a fine thing to do
• That gives us arrival times for measurements in each bin
– And this is susceptible to the rate model on previous slides

Finding Changes in Time Series
• Comprehensive approaches also possible (for counts)
• Time aware variant of G-test is possible
Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), March 1993.
http://bit.ly/surprise-and-coincidence

Propagation Anomalies
• What happens when something shadows part of the coverage
field for mobile telecom?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)

Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.

Other Issues
• Finding changes in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs

Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
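The recipe above can be sketched on simulated reports. This covers only the signal-strength half (k-means on location, an empirical percentile model per cluster, summed −log p); the report-rate model is left out. The propagation model in `simulate`, the 15 dB shadow, and all function names are assumptions made up for the example.

```r
set.seed(11)
# Hypothetical reports: signal falls off with distance from a tower at (0.5, 0.5)
simulate <- function(n, shadow = FALSE) {
  x <- runif(n); y <- runif(n)
  d <- sqrt((x - 0.5)^2 + (y - 0.5)^2)
  signal <- -60 - 30 * d + rnorm(n, sd = 3)
  if (shadow) signal <- signal - 15 * (x > 0.7)   # obstruction east of the tower
  data.frame(x, y, signal)
}

ref <- simulate(5000)
km <- kmeans(ref[, c("x", "y")], centers = 25, iter.max = 100, nstart = 3)

# Index of the nearest cluster center for each report location
nearest <- function(pts) {
  apply(as.matrix(pts), 1, function(p) which.min(colSums((t(km$centers) - p)^2)))
}

# Sum of -log(p), with p the lower-tail percentile within the report's cluster
anomaly <- function(reports) {
  cl <- nearest(reports[, c("x", "y")])
  p <- mapply(function(k, s) {
    r <- ref$signal[km$cluster == k]
    (sum(r <= s) + 1) / (length(r) + 2)   # smoothed so p is never 0
  }, cl, reports$signal)
  sum(-log(p))
}

score_normal <- anomaly(simulate(300))
score_shadow <- anomaly(simulate(300, shadow = TRUE))
print(c(normal = score_normal, shadow = score_shadow))
```

Because each report is judged against the reference distribution of its own cluster, weak-signal regions far from the tower are compared with their own history and not with a global average.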

Tower Coverage Areas

Just One Tower

Cluster Reports for That Tower
[Figure: reports for one tower grouped into nine numbered clusters.]
• Can also sub-divide each cluster into signal strength ranges
• Multiple scales of clustering can also be used to trade off geographic versus temporal resolution

Example
[Figure: three panels showing dt for individual clusters.]
• Each cluster gives us a sequence of events.
• Individual anomaly scores can be scaled and added to get a composite anomaly score.
• Optimality of the combined signal derives from optimality of the components.

Characterizing Distributions
• What about sequences of values from arbitrary distributions
– Can we find changes in the distribution?
– For instance, what about latencies?
• Non-linear histogram - FloatHistogram
• Fully Adaptive histogram – t-digest

FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
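The binning scheme can be sketched arithmetically. R does not expose the floating-point bits conveniently, so this version uses `log2` to find the power-of-2 sub-range where a bit-level implementation would read the exponent and leading mantissa bits. The function names, `min_value`, and `steps = 8` are hypothetical and are not the API of any library.

```r
# Zero-based bin index: power-of-2 sub-range e, then an even step inside it
float_bin <- function(x, min_value = 1, steps = 8) {
  e <- floor(log2(x / min_value))       # which power-of-2 sub-range
  frac <- x / (min_value * 2^e) - 1     # position within it, in [0, 1)
  e * steps + floor(frac * steps)
}

# Lower edge of a bin, the inverse of the mapping above
bin_lower <- function(b, min_value = 1, steps = 8) {
  e <- b %/% steps
  s <- b %% steps
  min_value * 2^e * (1 + s / steps)
}

x <- c(1.1, 7.3, 100, 12345)
b <- float_bin(x)
lo <- bin_lower(b)
hi <- bin_lower(b + 1)
print(data.frame(x, bin = b, lo, hi, rel_width = (hi - lo) / lo))
```

Every bin's width is at most 1/steps of its lower edge, which is the bounded relative error mentioned above.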

T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
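A minimal sketch of the accumulate, sort, merge step. It assumes an arcsine-shaped scale function k(q) and a compression parameter `delta`; both the constant in `k_scale` and the function names are illustrative and should not be read as the reference implementation. Samples are merged into the current centroid as long as the centroid's span in k stays at or below 1.

```r
# Assumed scale function: steep near q = 0 and q = 1, flat in the middle
k_scale <- function(q, delta) delta / (2 * pi) * asin(2 * q - 1)

tdigest_merge <- function(x, delta = 100) {
  x <- sort(x)
  n <- length(x)
  means <- numeric(0); weights <- numeric(0)
  cur_sum <- x[1]; cur_w <- 1; w_before <- 0
  for (i in seq_along(x)[-1]) {
    q_left <- w_before / n
    q_right <- (w_before + cur_w + 1) / n   # if x[i] joins the centroid
    if (k_scale(q_right, delta) - k_scale(q_left, delta) <= 1) {
      cur_sum <- cur_sum + x[i]; cur_w <- cur_w + 1
    } else {
      means <- c(means, cur_sum / cur_w); weights <- c(weights, cur_w)
      w_before <- w_before + cur_w
      cur_sum <- x[i]; cur_w <- 1
    }
  }
  list(means = c(means, cur_sum / cur_w), weights = c(weights, cur_w))
}

# Quantile estimate by interpolating between centroid centers
tdigest_quantile <- function(d, q) {
  centers <- (cumsum(d$weights) - d$weights / 2) / sum(d$weights)
  approx(centers, d$means, xout = q, rule = 2)$y
}

set.seed(9)
x <- rnorm(10000)
d <- tdigest_merge(x)
print(length(d$means))
print(tdigest_quantile(d, c(0.5, 0.999)))
```

Because k(q) changes quickly near q = 0 and q = 1, centroids there hold only a few samples, which is why extreme quantiles stay accurate even though the whole sample is summarized by roughly a hundred centroids.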

T-digest
[Figure: scale function k (0 to 10) versus quantile q (0 to 1).]

T-digest
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation

Finding Change with Histograms
• With fixed bins, we can simply count and compare counts for
different bins
• Thus, histogram change reduces to count change
• Or to changes in event times
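The reduction from histogram change to count change can be sketched by binning two samples with the same fixed breaks and applying the LLR test to the resulting counts. The latency data, the power-of-2 breaks, and the 5% slow tail are all simulated assumptions.

```r
entropy <- function(k) -sum(k * log((k == 0) + k / sum(k)))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

breaks <- 2^(0:10)   # fixed bins from 1 to 1024 (say, milliseconds)
bin_counts <- function(x) {
  x <- pmin(pmax(x, 1), 1023)   # clamp into the histogram range
  as.vector(table(cut(x, breaks, include.lowest = TRUE)))
}

set.seed(5)
before <- rlnorm(5000, meanlog = 3, sdlog = 0.5)
# After: 5% of requests move into a slow mode
after <- c(rlnorm(4750, meanlog = 3, sdlog = 0.5),
           rlnorm(250, meanlog = 5.5, sdlog = 0.3))

k <- rbind(before = bin_counts(before), after = bin_counts(after))
overall <- llr(k)   # one score for the whole 2 x bins table

# Per-bin score: this bin versus all other bins, before versus after
per_bin <- sapply(seq_len(ncol(k)), function(j) {
  kk <- cbind(k[, j], rowSums(k) - k[, j])
  sqrt(max(llr(kk), 0))
})
print(round(per_bin, 1))
```

The per-bin scores point at the bins where the slow mode landed, even though the bulk of the distribution is unchanged.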

Visualizing Histograms
• We want to detect small changes
– Consider log-scale for Y
• Non-linear bin spacing is really good for increasing counts
– Reweight by bin-width
– Changing x axis changes y axis

Good Results

Bad Results

With Better Scaling

Bad Results

With FloatHistogram

Summary
• Counts – LLR
• Events – Poisson + nth-order diffs
• Decimate in space
• Decimate in measurement space
– t-digest, FloatHistogram
• Don’t forget visualization

Q & A

