The document discusses building an anomaly detector model to identify unusual transactions in a dataset. It describes loading transaction data with 31 features into the BigML platform and creating an anomaly detector model. The model scores new data and identifies the most anomalous fields to help detect fraud. Creating the anomaly detector involves interpreting the data, exploring the dataset distribution, and setting a threshold score to define what is considered anomalous.
3. BigML, Inc #DutchMLSchool 3
Detecting the unusual
Unusual things can be easy to spot at first sight if there's:
• a small number of properties that make the difference
• a small number of instances to compare
What if there are lots of instances and properties?
4. Could we use an anomaly detector?
To decide about unusual things, we need to know how unusual they are
New data arrives → The model scores it → We decide the action
5. Anomaly example
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
• Amount $2,459 is higher than all other transactions
• It is the only transaction
  • in zip 21350
  • for the purchase class “tech”
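The observations on this slide can be checked by hand with a few lines of Python. This is only a sketch of the reasoning in the bullets, not how an anomaly detector computes its scores; the `TRANSACTIONS` list simply retypes the table above, and the `unusual_properties` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# The seven transactions from the slide's table, retyped as dictionaries.
TRANSACTIONS = [
    {"date": "Mon", "customer": "Bob", "account": 3421, "auth": "pin", "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Tue", "customer": "Bob", "account": 3421, "auth": "sign", "class": "food", "zip": "46140", "amount": 401},
    {"date": "Tue", "customer": "Alice", "account": 2456, "auth": "pin", "class": "food", "zip": "12222", "amount": 234},
    {"date": "Wed", "customer": "Sally", "account": 6788, "auth": "pin", "class": "gas", "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob", "account": 3421, "auth": "pin", "class": "tech", "zip": "21350", "amount": 2459},
    {"date": "Wed", "customer": "Bob", "account": 3421, "auth": "pin", "class": "gas", "zip": "46140", "amount": 83},
    {"date": "Thr", "customer": "Sally", "account": 6788, "auth": "sign", "class": "food", "zip": "26339", "amount": 51},
]


def unusual_properties(rows, target):
    """List the ways `target` stands out from the other rows (hypothetical helper)."""
    reasons = []
    others = [row for row in rows if row is not target]
    # Is the amount strictly higher than every other transaction?
    if target["amount"] > max(row["amount"] for row in others):
        reasons.append("amount is higher than all other transactions")
    # Is this the only transaction with this zip or purchase class?
    for field in ("zip", "class"):
        counts = Counter(row[field] for row in rows)
        if counts[target[field]] == 1:
            reasons.append(f"only transaction with {field} {target[field]}")
    return reasons


for reason in unusual_properties(TRANSACTIONS, TRANSACTIONS[4]):
    print(reason)
```

With seven rows and three checks this is easy; the point of the earlier slides is that such hand-written checks stop scaling once there are many instances and properties.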
9. The First Decision
https://bigml.com/accounts/register
4Gb / 8 parallel tasks
DUTCHMLSCHOOL
10. The Data Dictionary
31 features
Field    Description
Time     Number of seconds elapsed between this transaction and the first transaction in the dataset
V1-V28   May be the result of a PCA dimensionality reduction to protect user identities and sensitive features
Amount   Transaction amount
Class    Label: 0 (normal) / 1 (fraud)
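Before uploading a file, it can help to confirm that its header matches the data dictionary. The sketch below builds the 31 expected field names from the table above and compares them with a CSV header. The column order and the `check_header` helper are assumptions made for illustration, and the sample text stands in for a real file.

```python
import csv
import io

# Field names taken from the data dictionary: Time, V1..V28, Amount, Class.
# The order shown here is an assumption; only the names come from the slide.
EXPECTED_FIELDS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_header(csv_text):
    """Return (missing, extra) field names for the first row of `csv_text`."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    extra = [name for name in header if name not in EXPECTED_FIELDS]
    return missing, extra


# A stand-in for a real file: a header line with every expected field.
sample = ",".join(EXPECTED_FIELDS) + "\n"
missing, extra = check_header(sample)
print(len(EXPECTED_FIELDS), missing, extra)
```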
21. Defining the anomalous class
Taking advantage of labeled instances, we can decide the anomaly threshold
threshold 1: 0.43
threshold 2: 0.5
threshold 3: 0.56
Setting a higher threshold flags fewer instances as anomalous, which tends to improve precision but reduce recall.
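The trade-off can be illustrated with a small Python sketch that computes precision and recall at the three thresholds listed on the slide. The scored instances below are made-up toy values, not results from the actual dataset, and the sketch assumes that an instance is flagged when its anomaly score is greater than or equal to the threshold.

```python
# Toy (score, label) pairs: label 1 = fraud, 0 = normal.
# These values are invented for illustration only.
SCORED = [
    (0.35, 0), (0.40, 0), (0.44, 0), (0.45, 0), (0.48, 1),
    (0.52, 0), (0.55, 1), (0.58, 1), (0.62, 1), (0.70, 1),
]


def precision_recall(scored, threshold):
    """Precision and recall when instances with score >= threshold are flagged."""
    flagged = [label for score, label in scored if score >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(label for _, label in scored)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


for threshold in (0.43, 0.5, 0.56):
    precision, recall = precision_recall(SCORED, threshold)
    print(f"threshold {threshold}: precision={precision:.3f} recall={recall:.3f}")
```

On this toy data precision rises from 0.625 to 0.8 to 1.0 as the threshold increases, while recall falls from 1.0 to 0.8 to 0.6. Real data will not necessarily move this cleanly, which is why the labeled instances are used to pick the threshold.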