Worldwide, billions of euros are lost every year to credit card fraud. Fraud has increasingly diversified across digital channels, including mobile and online payments, creating new challenges as novel fraud patterns emerge. Hence, finding effective methods of mitigating fraud remains challenging. Existing solutions include simple if-then rules and classical machine learning algorithms. Credit card fraud is by definition an example-dependent, cost-sensitive classification problem, in which the costs due to misclassification vary between examples and not only between classes; i.e., misclassifying a fraudulent transaction may have a financial impact ranging from a few euros to thousands. In this paper, we propose an extension to the cost-sensitive decision tree algorithm: we create an ensemble of such trees and combine them using a stacking approach with a cost-sensitive logistic regression. We compare our method with standard machine learning algorithms and state-of-the-art cost-sensitive classification methods on a real credit card fraud dataset provided by a large European card processing company. The results show that our method achieves savings of up to 73.3%, more than 2 percentage points higher than a single cost-sensitive decision tree.
Fraud Detection by Stacking Cost-Sensitive Decision Trees
1. Fraud Detection by Stacking Cost-Sensitive Decision Trees
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
2. Who am I?
Chief Data Scientist at Easy Solutions
Industrial Engineer
PhD in Machine Learning from the University of Luxembourg
Scikit-Learn contributor
Organizer of Data Science Bogota Meetups
3. About Easy Solutions®
A leading global provider of electronic fraud prevention for financial institutions and enterprise customers
• 430+ customers in 30 countries
• 115 million users protected
• 30 billion online connections monitored
Industry recognition
5. Risk-Based Authentication · Phishing URL Classification · Phishing Brand ID · Fraud Detection · HTML Injection · Biometrics
[Figure: risk-based authentication scores varying by time of day, e.g. risk = 10 at 19h vs. risk = 95 at 9h]
6. Research / Data Science Spectrum
• Basic Research: "Maybe someday, someone can use this"
• Applied Research: "I might be able to use this"
• Working Prototype: "I can use this (sometimes)"
• Quality Code: "Software engineers can use this"
• Tool or Service: "People can use this"
The spectrum runs from innovation to practicality.
8. Credit Card Fraud Detection
Estimate the probability of a transaction being fraudulent based on analyzing customer patterns and recent fraudulent behavior.
Issues when constructing a fraud detection system:
• Skewness of the data
• Cost-sensitivity
• Short response time required of the system
• Dimensionality of the search space
• Feature preprocessing
• Model selection
10. Data
• Large European card processing company
• 2012 & 2013 card-present transactions
• 20MM transactions
• 40,000 frauds
• 0.467% fraud rate
• ~2MM EUR lost to fraud on the test dataset
[Figure: Jan–Dec timeline showing the split of the transactions into train and test periods]
11. Features
Raw features

Attribute name     Description
Transaction ID     Transaction identification number
Time               Date and time of the transaction
Account number     Identification number of the customer
Card number        Identification of the credit card
Transaction type   e.g., Internet, ATM, POS, ...
Entry mode         e.g., chip and PIN, magnetic stripe, ...
Amount             Amount of the transaction in euros
Merchant code      Identification of the merchant type
Merchant group     Merchant group identification
Country            Country of the transaction
Country 2          Country of residence
Type of card       e.g., Visa debit, Mastercard, American Express, ...
Gender             Gender of the card holder
Age                Card holder age
Bank               Issuer bank of the card
12. Financial Evaluation
Credit card fraud detection is a cost-sensitive problem, as the cost of a false positive differs from the cost of a false negative.
• False positives: when a transaction is predicted as fraudulent but is in fact legitimate, the financial institution incurs an administrative cost.
• False negatives: when a fraud goes undetected, the amount of that transaction is lost.
Moreover, it is not enough to assume a constant cost difference between false positives and false negatives, as the transaction amounts vary quite significantly.
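A minimal sketch of this example-dependent evaluation, assuming the common fraud cost matrix: a flagged transaction (true or false positive) incurs a fixed administrative cost, a missed fraud loses the transaction amount, and a true negative costs nothing. The function names, the admin cost value, and the all-legitimate baseline used for the savings score are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def transaction_cost(y_true, y_pred, amounts, admin_cost=2.5):
    """Total example-dependent cost: flagging a transaction (TP or FP)
    incurs the admin cost; a missed fraud (FN) loses its amount."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amounts = np.asarray(amounts, dtype=float)
    per_example = np.where(y_pred == 1, admin_cost,            # flagged
                           np.where(y_true == 1, amounts, 0))  # missed fraud
    return per_example.sum()

def savings(y_true, y_pred, amounts, admin_cost=2.5):
    """Fraction of cost saved vs. the baseline of never flagging anything
    (assumes the dataset contains at least one fraud)."""
    base = transaction_cost(y_true, np.zeros_like(np.asarray(y_true)),
                            amounts, admin_cost)
    return 1 - transaction_cost(y_true, y_pred, amounts, admin_cost) / base
```

Because the cost of a false negative is the transaction amount itself, two models with identical confusion matrices can have very different savings, which is the point of this slide.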
18. Bayes Minimum Risk
A decision model based on quantifying tradeoffs between various decisions using probabilities and the costs that accompany such decisions.
Risk of classification:
R(c_i = 0 | x_i) = C_TN_i (1 − p_i) + C_FN_i · p_i
R(c_i = 1 | x_i) = C_FP_i (1 − p_i) + C_TP_i · p_i
Using these risks, the prediction is made based on the following condition:
c_i = 0 if R(c_i = 0 | x_i) ≤ R(c_i = 1 | x_i), and c_i = 1 otherwise
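The risk formulas above translate directly into a thresholding rule on the estimated fraud probability. A minimal sketch, assuming per-example costs are given as arrays or scalars (the function name and default zero costs for correct classifications are mine):

```python
import numpy as np

def bayes_minimum_risk(p, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Predict fraud (1) only when the risk of predicting legitimate
    exceeds the risk of predicting fraud."""
    p = np.asarray(p, dtype=float)
    risk_0 = c_tn * (1 - p) + c_fn * p   # R(c_i = 0 | x_i)
    risk_1 = c_fp * (1 - p) + c_tp * p   # R(c_i = 1 | x_i)
    return (risk_0 > risk_1).astype(int)
```

Because C_FN_i is the transaction amount, a large transaction can be flagged even at a low fraud probability, while a small one with the same probability is let through.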
19. Cost-Sensitive Logistic Regression
• Logistic regression model
• Cost function
• Cost analysis
20. Cost-Sensitive Logistic Regression (cont.)
• Actual costs
• Cost-sensitive function
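The bullets above name the cost-sensitive objective without showing it. One common formulation of an example-dependent cost-sensitive logistic regression replaces the log-loss with the expected per-example cost of the sigmoid output; a sketch under that assumption (function names and signatures are mine, not the slide's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cs_logistic_cost(theta, X, y, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Example-dependent cost objective: each example contributes its
    own expected misclassification cost instead of the log-loss."""
    h = sigmoid(X @ theta)  # predicted fraud probability per example
    cost = (y * (h * c_tp + (1 - h) * c_fn)            # actual frauds
            + (1 - y) * (h * c_fp + (1 - h) * c_tn))   # legitimate ones
    return cost.mean()
```

Minimizing this objective (e.g. with any gradient-based optimizer) pushes the model toward the decisions that are cheap for this cost matrix, rather than toward calibrated probabilities.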
21. Cost-Sensitive Decision Trees
Proposed cost-based impurity measure
A splitting rule (x^j, l^j_m) partitions the set S into:
S_l = { x_i ∈ S | x_i^j ≤ l^j_m }    S_r = { x_i ∈ S | x_i^j > l^j_m }
• The impurity of each leaf is calculated using:
I_c(S) = min( Cost(f_0(S)), Cost(f_1(S)) )
f(S) = 0 if Cost(f_0(S)) ≤ Cost(f_1(S)), and 1 otherwise
• Afterwards, the gain of applying a given rule to the set S is:
Gain_c(x^j, l^j_m) = I_c(π_1) − ( I_c(π_1^l) + I_c(π_1^r) )
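In words: a node's impurity is the cost of the cheaper of the two constant predictions on it, and a split's gain is the parent impurity minus the sum of the child impurities. A minimal sketch under the same fraud cost matrix as before (admin cost per flagged transaction, amount lost per missed fraud; function names and the admin cost value are my assumptions):

```python
import numpy as np

def leaf_cost(y, amounts, label, admin_cost=2.5):
    """Cost of predicting a constant `label` for every example in a node:
    predicting 1 pays the admin fee for each example, predicting 0 loses
    the amount of every fraud in the node."""
    if label == 1:
        return admin_cost * len(y)
    return amounts[y == 1].sum()

def cost_impurity(y, amounts, admin_cost=2.5):
    """I_c(S): cost of the cheaper constant prediction on the node."""
    return min(leaf_cost(y, amounts, 0, admin_cost),
               leaf_cost(y, amounts, 1, admin_cost))

def split_gain(y, amounts, left_mask, admin_cost=2.5):
    """Gain_c of a binary split, given a boolean mask for the left child."""
    parent = cost_impurity(y, amounts, admin_cost)
    left = cost_impurity(y[left_mask], amounts[left_mask], admin_cost)
    right = cost_impurity(y[~left_mask], amounts[~left_mask], admin_cost)
    return parent - (left + right)
```

Unlike Gini or entropy, this impurity depends on the transaction amounts, so a split that isolates a few large frauds can beat one that purifies many small transactions.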
22. Cost-Sensitive Decision Trees (cont.)
Decision tree construction
• The rule that maximizes the gain is selected:
(best_x, best_l) = argmax_{j,m} Gain(x^j, l^j_m)
• The process is repeated recursively on each child node until a stopping criterion is met.
23. Cost-Sensitive Decision Trees (cont.)
Proposed cost-sensitive pruning criterion
• Calculate the savings of the tree and of each pruned tree:
PC_c = ( Cost(f(S, Tree)) − Cost(f(S, EB(Tree, branch))) ) / ( |Tree| − |EB(Tree, branch)| )
where EB(Tree, branch) is the tree with the given branch eliminated.
• After calculating the pruning criterion for all possible branches, the pruning with the maximum improvement is selected and the tree is pruned.
• The process is then repeated until there is no further improvement.
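The criterion above is simply the cost improvement per node removed when a branch is collapsed. A one-function sketch, assuming the tree and pruned-tree costs and sizes are already computed (the function name is mine):

```python
def pruning_criterion(tree_cost, pruned_cost, tree_size, pruned_size):
    """PC_c: cost improvement per node removed when the branch is
    eliminated (EB). Larger values mean cheaper and smaller trees,
    so the branch with the maximum PC_c is pruned first."""
    return (tree_cost - pruned_cost) / (tree_size - pruned_size)
```

Normalizing by the number of removed nodes favors pruning large branches that contribute little, which keeps the tree compact without sacrificing savings.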
24. Ensemble of Cost-Sensitive Decision Trees
A typical ensemble is made by combining T different base classifiers. Each base classifier is trained by applying algorithm M to a random subset S_j of the training data:
M_j ← M(S_j)  ∀ j ∈ {1, …, T}
26. Ensemble of Cost-Sensitive Decision Trees (cont.)
After the base classifiers are constructed, they are typically combined using one of the following methods:
• Majority voting:
H(S) = f_mv(S, M) = argmax_{c ∈ {0,1}} Σ_{j=1}^{T} 1_c( M_j(S) )
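A minimal sketch of the majority-voting combiner for binary base classifiers, assuming their predictions are stacked into a (T, n_samples) array of 0/1 labels (the function name is mine). The paper's proposed combiner instead stacks the base predictions and feeds them to a cost-sensitive logistic regression; voting is the baseline it improves on.

```python
import numpy as np

def majority_vote(predictions):
    """H(S): the label a strict majority of the T base classifiers
    agree on. `predictions` has shape (T, n_samples), entries in {0, 1};
    with an odd T there are no ties."""
    predictions = np.asarray(predictions)
    votes = predictions.sum(axis=0)               # number of fraud votes
    return (votes > len(predictions) / 2).astype(int)
```

Replacing this fixed rule with a learned cost-sensitive combiner lets the ensemble weight base trees by how much money their votes actually save.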
31. Conclusions
• A new framework for stacking example-dependent cost-sensitive decision trees
• Models should be evaluated taking into account the real financial costs of the application
• Algorithms should be developed to incorporate those financial costs
32. Any questions or comments, please let me know.
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Thank you!