Cluster Analysis and Anomaly Detection (Unsupervised I) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
3. BigML, Inc #DutchMLSchool
What is Clustering?
• An unsupervised learning technique
• No labels necessary
• Useful for finding similar instances
• Smart sampling/labelling
• Finds "self-similar" groups of instances
• Customer: groups with similar behavior
• Medical: patients with similar diagnostic measurements
• Defines each group by a "centroid"
• Geometric center of the group
• Represents the "average" member
• Number of centroids (k) can be specified or determined
5. Cluster Centroids
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135     <- similar
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94      <- similar
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83      <- similar
Thr   Sally     6788     sign  food     26339  51
Same:
auth = pin
amount ~ $100
Different:
date: Mon != Wed
customer: Sally != Bob
account: 6788 != 3421
class: clothes != gas
zip: 26339 != 46140
Centroid:
date = Wed (2 out of 3)
customer = Bob
account = 3421
auth = pin
class = gas
zip = 46140
amount = $104
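The centroid values on this slide can be reproduced with a short sketch. It assumes the centroid takes the mean of each numeric field and the most common value of each categorical field; the function name `centroid` and the row layout are illustrative, not part of any BigML API.

```python
from collections import Counter
from statistics import mean

FIELDS = ["date", "customer", "account", "auth", "class", "zip", "amount"]

# The three transactions marked as similar (pin, amount around $100).
cluster = [
    ("Mon", "Bob", "3421", "pin", "clothes", "46140", 135),
    ("Wed", "Sally", "6788", "pin", "gas", "26339", 94),
    ("Wed", "Bob", "3421", "pin", "gas", "46140", 83),
]

def centroid(rows, fields):
    """Mean for numeric fields, most common value for categorical fields."""
    result = {}
    for i, name in enumerate(fields):
        column = [row[i] for row in rows]
        if all(isinstance(v, (int, float)) for v in column):
            result[name] = mean(column)
        else:
            result[name] = Counter(column).most_common(1)[0][0]
    return result

c = centroid(cluster, FIELDS)
print(c)
```

Account and zip are stored as strings here so that they are treated as categories rather than averaged.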
6. Use Cases
• Customer segmentation
• Which customers are similar?
• How many natural groups are there?
• Item discovery
• What other items are similar to this one?
• Similarity
• What other instances share a specific property?
• Recommender (almost)
• If you like this item, what other items might you like?
• Active learning
• Labelling unlabelled data efficiently
7. Customer Segmentation
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.
• Assumption: Usage correlates to LTV.
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
[Figure: clusters labeled 0%, 3%, and 1%]
8. Similarity
• Dataset of Lending Club loans
• Mark any loan that is currently late or has ever been late as "trouble"
GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster's population.
[Figure: clusters labeled 0%, 3%, 7%, and 1%]
9. Active Learning
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
10. Active Learning
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud, or a million images that need to be labelled as cat/not-cat.
11. Item Discovery
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
[Figure labels: Smoky, Fruity]
15. Human Expert
• Jesa used prior knowledge to select possible features that separated the objects.
• "round", "skinny", "edges", "hard", etc.
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure that:
• the criterion of K=3 was met
• groups were sufficiently "distant"
• there was no crossover
16. Human Expert
Create features that capture these object differences:
• Length/Width
• greater than 1 => "skinny"
• equal to 1 => "round"
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require "edges" which have corners
• easier to count
18. Plot by Features
[Plot: box, block, eraser, knob, penny, dime, bead, key, battery, and screw plotted by Num Surfaces and Length / Width; K=3]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space.
19. Plot by Features
[Plot: box, block, eraser, knob, penny, dime, bead, key, battery, and screw plotted by Num Surfaces and Length / Width]
K-Means: Find the "best" (minimum distance) circles that include all points.
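As a rough illustration of finding clusters by distance in feature space, here is a minimal sketch of the standard K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. The feature values in `points` are hypothetical stand-ins for (length/width, number of surfaces) pairs, and this is not BigML's implementation.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means on tuples of numbers; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive start: k random instances
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (length/width, number of surfaces) values for a few objects.
points = [(1.0, 6.0), (1.1, 6.0), (0.9, 6.0), (1.0, 2.0), (1.0, 2.1),
          (5.0, 3.0), (5.5, 3.0), (6.0, 3.0)]
centroids, clusters = kmeans(points, k=3)
print(centroids)
```

Because the starting centroids are random, different seeds can converge to different groupings, which is why the choice of starting points matters.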
26. Starting Points
• Random points or instances in n-dimensional space
• Might start "too close"
• Risk of sub-optimal convergence
• Choose points "farthest" away from each other
• but this is sensitive to outliers
• k++ (k-means++)
• the first point is chosen randomly from the instances
• each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center
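The k++ seeding rule described above can be sketched as follows. This is a minimal illustration under the assumption that instances are tuples of numbers and that there are at least k distinct instances; the function name `kmeans_pp_init` is made up for this example.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centers using the k-means++ rule."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Weight of each point: squared distance to its closest chosen center.
        # Already-chosen points get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(1.0, 6.0), (1.1, 6.0), (0.9, 6.0), (1.0, 2.0), (1.0, 2.1),
          (5.0, 3.0), (5.5, 3.0), (6.0, 3.0)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Far-away points are favored but not guaranteed, which makes this less sensitive to a single outlier than always taking the farthest point.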
31. Other Tricks
• What is the distance to a "missing value"?
• What is the distance between categorical values?
• How far is "red" from "green"?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown ideal number of clusters, "K"?
32. Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Or ignore instances with missing values
33. Distance to Categorical?
• Define a special distance function: for two instances x and y and the categorical field a:
• if x_a = y_a then distance(x, y) = 0, else distance(x, y) = 1 (or the field's scaling value)
Approach: similar to "k-prototypes"
34. Distance to Categorical?
animal  favorite toy  toy color
cat     ball          red
cat     ball          green
d=0     d=0           d=1        => D = 1
cat     laser         red
dog     squeaky       red
d=1     d=1           d=0        => D = √2
Then compute the Euclidean distance between the vectors of per-field distances (D above).
Note: the centroid is assigned the most common category of the member instances
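The two worked examples on this slide can be checked with a small sketch. It assumes a per-field distance of 0 for matching categories and 1 otherwise (no field scaling), combined with a Euclidean norm; the function name is illustrative.

```python
import math

def categorical_distance(x, y):
    """Euclidean distance over 0/1 per-field mismatch indicators."""
    per_field = [0 if a == b else 1 for a, b in zip(x, y)]
    return math.sqrt(sum(d * d for d in per_field))

d_first = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d_second = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
print(d_first, d_second)
```

The first pair differs in one field and the second pair in two, so the distances come out as 1 and the square root of 2, matching the slide.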
35. Text Vectors
               "hippo"  "safari"  "zebra"  …
Text Field #1  1        0         1        …
Text Field #2  1        1         0        …
               0        1         1        …
Features (thousands)
[Figure: Cosine Similarity scale from 1 through 0 to -1]
• Cosine Similarity
• cos() of the angle between two vectors
• 1 if collinear, 0 if orthogonal
• only positive vectors: 0 ≤ CS ≤ 1
• Cosine Distance = 1 - Cosine Similarity
• CD(TF1, TF2) = 0.5
42. Summary
• Cluster Purpose
• Unsupervised technique for finding self-similar groups of instances
• Number of centroids (k) can be specified or computed
• Outputs a list of centroids
• Configuration:
• Algorithm: K-means / G-means
• Cluster Parameter: k or critical value
• Default missing / Summary fields / Scales / Weights
• Model Clusters
• Centroid / Batch centroids
44. What is Anomaly Detection?
• An unsupervised learning technique
• No labels necessary
• Useful for finding unusual instances
• Filtering, finding mistakes, 1-class classifiers
• Finds instances that do not match
• Customer: big or small spender for profile
• Medical: healthy patient despite indicative diagnostics
• Defines each unusual instance by an "anomaly score"
• in BigML: 0 = normal, 1 = unusual, and 0.7 ≫ 0.6 0.5
• Standard deviation, distributions, etc.
46. Clusters
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135     <- similar
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94      <- similar
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83      <- similar
Thr   Sally     6788     sign  food     26339  51
48. Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459    <- anomaly
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
• Amount $2,459 is higher than all other transactions
• It is the only transaction
• in zip 21350
• for the purchase class "tech"
49. Use Cases
• Unusual instance discovery - "exploration"
• Intrusion Detection - "looking for unusual usage patterns"
• Fraud - "looking for unusual behavior"
• Identify Incorrect Data - "looking for mistakes"
• Remove Outliers - "improve model quality"
• Model Competence / Input Data Drift
50. Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use an anomaly detector to identify the most anomalous points and then remove them before modeling.
[Workflow: DATASET -> ANOMALY DETECTOR -> FILTERED DATASET -> CLEAN MODEL]
51. Diabetes Anomalies
[Workflow diagram with nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
53. Intrusion Detection
• Dataset of command line history for users
• Data for each user consists of commands, flags, working directories, etc.
• Assumption: Users typically issue the same flag patterns and work in certain directories
GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion.
[Figure labels: Per User, Per Dir, All User, All Dir]
54. Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels.
[Figure labels: Card Level, User Level, Similar User Level]
55. Model Competence
• After putting a model into production, the data being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted.
[Diagram: At Training Time, the DATASET trains both the MODEL and the ANOMALY DETECTOR; At Prediction Time, each prediction is checked as in the table]
Prediction     T       T
Confidence     0.86    0.84
Anomaly Score  0.5367  0.7124
Competent?     Y       N
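The competence check in the table can be sketched as a simple threshold on the anomaly score. The cutoff of 0.6 is a hypothetical value chosen only because it separates the two example scores; the slides do not state a threshold.

```python
ANOMALY_THRESHOLD = 0.6  # hypothetical cutoff, not taken from the slides

def is_competent(anomaly_score, threshold=ANOMALY_THRESHOLD):
    """Trust the model only when the input looks like the training data."""
    return anomaly_score < threshold

predictions = [
    {"prediction": "T", "confidence": 0.86, "anomaly_score": 0.5367},
    {"prediction": "T", "confidence": 0.84, "anomaly_score": 0.7124},
]
competence = [is_competent(p["anomaly_score"]) for p in predictions]
print(competence)
```

Note that both predictions have similar confidence, so confidence alone would not flag the second one.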
56. Benford's Law
• In real-life numeric sets the small digits occur disproportionately often as leading significant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API
57. Univariate Approach
• Single variable: heights, test scores, etc.
• Assume the value is distributed "normally"
• Compute the standard deviation
• a measure of how "spread out" the numbers are
• the square root of the variance (the average of the squared differences from the mean)
• Depending on the number of instances, choose a "multiple" of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~3 outliers.
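A minimal sketch of this univariate rule, using the population standard deviation as defined above. The sample values are made up for illustration.

```python
from statistics import mean, pstdev

def sigma_outliers(values, multiple=3.0):
    """Return values farther than `multiple` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # square root of the average squared difference
    return [v for v in values if abs(v - mu) > multiple * sigma]

# Illustrative data: 30 values between 50 and 54 plus one extreme value.
values = [50 + (i % 5) for i in range(30)] + [200]
outliers = sigma_outliers(values)
print(outliers)
```

For normally distributed data about 0.27% of values fall outside three standard deviations, which is where the figure of roughly 3 per 1000 instances comes from.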
62. Human Expert
[Figure labels: "Round", "Skinny", "Corners"; "Skinny" but not "smooth"; No "Corners"; Not "Round"; Most unusual]
Key Insight: The "most unusual" object is different in some way from every partition of the features.
63. Human Expert
• Human used prior knowledge to select possible features that separated the objects.
• "round", "skinny", "smooth", "corners"
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
64. Human Expert
Create features that capture these object differences:
• Length/Width
• greater than 1 => "skinny"
• equal to 1 => "round"
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require "edges" which have corners
• easier to count
• Smooth - true or false
66. Random Splits
[Tree diagram of True/False splits: length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2; leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw]
Know that "splits" matter - don't know the order
67. Isolation Forest
Grow a random decision tree until each instance from a sample is in its own leaf.
[Figure: Depth axis; shallow leaves are "easy" to isolate, deep leaves are "hard" to isolate]
Now repeat the process several times and use the average depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
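The tree-growing and depth-averaging steps can be sketched as below. This is a simplified illustration, not BigML's implementation: it handles numeric features only, picks a random feature and a random threshold at each split, and grows every tree on all points rather than on a sample. The function names and the data are hypothetical.

```python
import random

def grow_tree(rows, rng, depth=0, out=None):
    """Randomly split rows until each is isolated; record each row's leaf depth.

    rows is a list of (row_id, feature_tuple) pairs.
    """
    if out is None:
        out = {}
    # Only features on which the remaining rows still differ can be split.
    splittable = [f for f in range(len(rows[0][1]))
                  if min(r[1][f] for r in rows) < max(r[1][f] for r in rows)]
    if len(rows) <= 1 or not splittable:
        for row_id, _ in rows:
            out[row_id] = depth
        return out
    f = rng.choice(splittable)
    lo = min(r[1][f] for r in rows)
    hi = max(r[1][f] for r in rows)
    threshold = rng.uniform(lo, hi)
    while threshold == lo:  # make sure both sides of the split are non-empty
        threshold = rng.uniform(lo, hi)
    grow_tree([r for r in rows if r[1][f] < threshold], rng, depth + 1, out)
    grow_tree([r for r in rows if r[1][f] >= threshold], rng, depth + 1, out)
    return out

def average_depths(points, n_trees=100, seed=0):
    """Average isolation depth of each point over several random trees."""
    rng = random.Random(seed)
    rows = list(enumerate(points))
    totals = [0.0] * len(points)
    for _ in range(n_trees):
        for row_id, d in grow_tree(rows, rng).items():
            totals[row_id] += d
    return [t / n_trees for t in totals]

# Illustrative data: a 5x5 grid of similar points plus one far-away point.
points = [(x, y) for x in range(5) for y in range(5)] + [(50, 50)]
depths = average_depths(points)
print(depths[-1], min(depths[:-1]))
```

The far-away point is usually separated by the very first split, so its average depth is much smaller than that of the grid points.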
68. Isolation Forest Scoring
    f1    f2   f3
i1  red   cat  ball
i2  red   cat  ball
i3  red   cat  box
i4  blue  dog  pen
For the instance i2, find the depth in each tree: D = 3, D = 6, D = 2
Map the average depth to the final score: S = 0.45
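One common way to map an average depth to a score between 0 and 1 is the normalization from the original Isolation Forest paper: score = 2^(-avg_depth / c(n)), where c(n) = 2H(n-1) - 2(n-1)/n is the expected path length for a sample of n instances and H is the harmonic number. Whether BigML uses exactly this mapping is an assumption, so the sketch below is not expected to reproduce the S = 0.45 shown on the slide; the sample size of 256 is also a hypothetical choice.

```python
def average_path_length(n):
    """c(n): expected isolation depth for a sample of n instances (n >= 2)."""
    harmonic = sum(1.0 / i for i in range(1, n))  # H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(depths, sample_size):
    """Map the average depth over all trees to a score in (0, 1)."""
    avg_depth = sum(depths) / len(depths)
    return 2.0 ** (-avg_depth / average_path_length(sample_size))

score = anomaly_score([3, 6, 2], sample_size=256)  # depths from the slide
print(score)
```

Under this mapping, shallow average depths give scores near 1, and an average depth equal to c(n) gives exactly 0.5.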
69. Model Competence
• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means you should not trust the model.
[Diagram: the CLOSED LOAN MODEL and the CLOSED LOAN ANOMALY DETECTOR are applied to OPEN LOANS to give a PREDICTION and an ANOMALY SCORE]
Prediction     T       T
Confidence     0.86    0.84
Anomaly Score  0.5367  0.7124
Competent?     Y       N
71. 1-Class Classifier?
• You place an advertisement in a local newspaper
• You collect demographic information about all responders
• Now you want to market in a new locality with direct letters
• To optimize mailing costs, you need to predict who will respond
• But you cannot distinguish "not interested" from "didn't see the ad"
• Train an anomaly detector on the 1-class data
• Pick the households with the lowest scores for mailing:
• If a household has a low anomaly score, then they are "similar" to enough of your positive responders and therefore may respond as well
• If a household has a high anomaly score, then they are dissimilar from all previous responders and therefore are less likely to respond.
72. Summary
• Anomaly detection is the process of finding unusual instances
• Some techniques and how they work:
• Univariate: standard deviation
• Benford's law
• Isolation Forest
• Applications
• Filtering to improve models
• Finding mistakes, fraud, and intruders
• Knowing when to retrain a model (competence)
• 1-class classifiers
• In general… unsupervised learning techniques:
• Require more finesse and interpretation
• Are more commonly part of a multistep workflow