SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Fraud Detection by
Stacking Cost-Sensitive
Decision Trees
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Who am I?
Chief Data Scientist at Easy Solutions
Industrial Engineer
PhD in Machine Learning from Luxembourg University
Scikit-Learn contributor
Organizer of Science Bogota Meetups
2
AboutEasySolutions®
3
A leading global provider of electronic fraud
prevention for financial institutions and
enterprise customers
430+ customers
In 30 countries
115 million
Users protected
30 billion
Online connections monitored
Industry recognition
TotalFraudProtection
Discuss what makes a data science project successful
4
5
Risk Based Authentication Phishing URL Classification Phishing Brand ID
Fraud Detection
19h risk = 10
9h risk = 95
HTML Injection Biometrics
Research/ DataScienceSpectrum
6
• Maybe someday, someone can use this
Basic
Research
• I might be able to use this
Applied
Research
• I can use this (sometimes)
Working
Prototype
• Software engineers can use thisQuality Code
• People can use this
Tool or
Service
Innovation
practicality
Credit Card Fraud Detection
7
Estimate the probability of a transaction being fraud based on analyzing
customer patterns and recent fraudulent behavior
Issues when constructing a fraud detection system:
• Skewness of the data
• Cost-sensitivity
• Short time response of the system
• Dimensionality of the search space
• Feature preprocessing
• Model selection
Creditcardfrauddetection
8
Network
Fraud??
9
• Larger European card processing
company
• 2012 & 2013 card present
transactions
• 20MM Transactions
• 40,000 Frauds
• 0.467% Fraud rate
• ~ 2MM EUR lost due to fraud on
test dataset
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
Test
Train
Data
Raw features
11
Attribute name Description
Transaction ID Transaction identification number
Time Date and time of the transaction
Account number Identification number of the customer
Card number Identification of the credit card
Transaction type ie. Internet, ATM, POS, ...
Entry mode ie. Chip and pin, magnetic stripe, ...
Amount Amount of the transaction in Euros
Merchant code Identification of the merchant type
Merchant group Merchant group identification
Country Country of trx
Country 2 Country of residence
Type of card ie. Visa debit, Mastercard, American Express...
Gender Gender of the card holder
Age Card holder age
Bank Issuer bank of the card
Features
Credit card fraud detection is a cost-sensitive problem. As the cost due to a
false positive is different than the cost of a false negative.
• False positives: When predicting a transaction as fraudulent, when in
fact it is not a fraud, there is an administrative cost that is incurred by
the financial institution.
• False negatives: Failing to detect a fraud, the amount of that transaction
is lost.
Moreover, it is not enough to assume a constant cost difference between
false positives and false negatives, as the amount of the transactions varies
quite significantly.
12
FinancialEvaluation
Cost matrix
𝐶𝑜𝑠𝑡 𝑓 𝑆 =
𝑖=1
𝑁
𝑦𝑖 𝑐𝑖 𝐶 𝑇𝑃 𝑖
+ 1 − 𝑐𝑖 𝐶 𝐹𝑁 𝑖
+ 1 − 𝑦𝑖 𝑐𝑖 𝐶 𝐹𝑃 𝑖
+ 1 − 𝑐𝑖 𝐶 𝑇𝑁 𝑖
13
Actual Positive
𝒚𝒊 = 𝟏
Actual Negative
𝒚𝒊 = 𝟎
Predicted Positive
𝒄𝒊 = 𝟏
𝐶 𝑇𝑃 𝑖
= 𝐶 𝑎 𝐶 𝐹𝑃 𝑖
= 𝐶 𝑎
Predicted Negative
𝒄𝒊 = 𝟎
𝐶 𝐹𝑁 𝑖
= 𝐴𝑚𝑡𝑖 𝐶 𝑇𝑁 𝑖
= 0
FinancialEvaluation
• Cost Proportionate Sampling
• Bayes minimum risk
• Cost-sensitive logistic regression
• Cost-sensitive decision trees
• Stacking Cost-sensitive decision trees
CostSensitiveAlgorithms
14
CostProportionateSampling
Normalized Cost weight
𝑤𝑖 =
𝐶 𝐹𝑃 𝑖 𝑖𝑓 𝑦𝑖 = 0
𝐶 𝐹𝑁 𝑖 𝑖𝑓 𝑦𝑖 = 1
𝑤𝑖 =
𝑤𝑖
max
𝑗
𝑤𝑗
CostProportionateSampling
Cost Proportionate Over Sampling
Example 𝑦𝑖 𝑤𝑖
1 0 1
2 1 10
3 0 2
4 1 20
5 0 1
Initial
Dataset
(1,0,1)
(2,1,10)
(3,0,2)
(4,1,20)
(5,0,1)
Cost Proportionate Dataset
(1,0,1)
(2,1,1), (2,1,1), …, (2,1,1)
(3,0,2), (3,0,2)
(4,1,1), (4,1,1), (4,1,1), …,
(4,1,1), (4,1,1)
(5,0,1)
*Elkan, C. (2001). The Foundations of Cost-Sensitive Learning.
CostProportionateSampling
Cost Proportionate Rejection Sampling
Example 𝑦𝑖 𝑤𝑖
1 0 1
2 1 10
3 0 2
4 1 20
5 0 1
Cost
Proportion
ate Dataset
(2,1,1)
(4,1,1)
(4,1,1)
(5,0,1)
*Zadrozny et al. (2003). Cost-sensitive learning by cost-proportionate example weighting.
𝑤𝑖/max( 𝑤𝑖)
0.05
0.5
0.1
1
0.05
Initial
Dataset
(1,0,1)
(2,1,10)
(3,0,2)
(4,1,20)
(5,0,1)
Decision model based on quantifying tradeoffs between various decisions
using probabilities and the costs that accompany such decisions
Risk of classification
𝑅 𝑐𝑖 = 0|𝑥𝑖 = 𝐶 𝑇𝑁 𝑖 1 − 𝑝𝑖 + 𝐶 𝐹𝑁 𝑖 ∙ 𝑝𝑖
𝑅 𝑐𝑖 = 1|𝑥𝑖 = 𝐶 𝐹𝑃 𝑖 1 − 𝑝𝑖 + 𝐶 𝑇𝑃 𝑖 ∙ 𝑝𝑖
Using the different risks the prediction is made based on the following
condition:
𝑐𝑖 =
0 𝑅 𝑐𝑖 = 0|𝑥𝑖 ≤ 𝑅 𝑐𝑖 = 1|𝑥𝑖
1 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
18
BayesMinimumRisk
• Logistic Regression Model
• Cost Function
• Cost Analysis
Cost-Sensitive- LogisticRegression
• Actual Costs
• Cost-Sensitive Function
Cost-Sensitive- LogisticRegression
21
Proposed Cost based impurity measure
𝑆 𝑙
= 𝑥|𝑥𝑖 ∈ 𝑆 ∧ 𝑥𝑖
𝑗
≤ 𝑙 𝑚
𝑗
𝑆 𝑟
= 𝑥|𝑥𝑖 ∈ 𝑆 ∧ 𝑥𝑖
𝑗
> 𝑙 𝑚
𝑗
• The impurity of each leaf is calculated using:
𝐼𝑐 𝑆 = min 𝐶𝑜𝑠𝑡 𝑓0 𝑆 , 𝐶𝑜𝑠𝑡 𝑓1 𝑆
𝑓 𝑆 =
0 𝑖𝑓 𝐶𝑜𝑠𝑡 𝑓0 𝑆 ≤ 𝐶𝑜𝑠𝑡 𝑓1 𝑆
1 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
• Afterwards the gain of applying a given rule to the set 𝑆 is:
𝐺𝑎𝑖𝑛 𝑐 𝑥 𝑗, 𝑙 𝑚
𝑗
= 𝐼𝑐 𝜋1 − 𝐼𝑐 𝜋1
𝑙
+ 𝐼𝑐 𝜋1
𝑟
S
S S
𝑥 𝑗, 𝑙 𝑚
𝑗
Cost-SensitiveDecisionTrees
22
Decision trees construction
• The rule that maximizes the gain is selected
𝑏𝑒𝑠𝑡 𝑥, 𝑏𝑒𝑠𝑡𝑙 = 𝑎𝑟𝑔 max
𝑗,𝑚
𝐺𝑎𝑖𝑛 𝑥 𝑗, 𝑙 𝑚
𝑗
S
S S
S S S S
S S S S
• The process is repeated until a stopping criteria is met:
Cost-SensitiveDecisionTrees
23
Proposed cost-sensitive pruning criteria
• Calculation of the Tree savings and pruned Tree savings
S
S S
S S S S
S S S S
𝑃𝐶𝑐 =
𝐶𝑜𝑠𝑡 𝑓 𝑆, 𝑇𝑟𝑒𝑒 − 𝐶𝑜𝑠𝑡 𝑓 𝑆, 𝐸𝐵 𝑇𝑟𝑒𝑒, 𝑏𝑟𝑎𝑛𝑐ℎ
𝑇𝑟𝑒𝑒 − 𝐸𝐵 𝑇𝑟𝑒𝑒, 𝑏𝑟𝑎𝑛𝑐ℎ
• After calculating the pruning criteria for all possible trees. The maximum
improvement is selected and the Tree is pruned.
• Later the process is repeated until there is no further improvement.
S
S S
S S S S
S S
S
S S
S S
Cost-SensitiveDecisionTrees
Typical ensemble is made by combining T different base classifiers. Each
base classifiers is trained by applying algorithm M in a random subset
24
𝑀𝑗 ← 𝑀 𝑆𝑗 ∀𝑗 ∈ 1, … , 𝑇
EnsembleCost-SensitiveDecisionTrees
25
1
2
3
4
5
6
7
8
8
6
2
5
2
1
3
6
7
1
2
3
8
1
5
8
1
4
4
2
1
9
4
6
1
1
5
8
1
4
4
2
1
1
5
8
1
4
4
2
1
1
5
8
1
4
4
2
1
Bagging Pasting Random forest Random patches
Training set
EnsembleCost-SensitiveDecisionTrees
After the base classifiers are constructed they are typically combined using
one of the following methods:
• Majority voting
𝐻 𝑆 = 𝑓𝑚𝑣 𝑆, 𝑀 = 𝑎𝑟𝑔 max
𝑐∈ 0,1
𝑗=1
𝑇
1 𝑐 𝑀𝑗 𝑆
26
EnsembleCost-SensitiveDecisionTrees
• Proposed cost-sensitive stacking
𝐻 𝑆 = 𝑓𝑠 𝑆, 𝑀, 𝛽 =
1
1 + 𝑒
− 𝑗=1
𝑇
𝛽 𝑗 𝑀 𝑗 𝑆
Using the cost-sensitive logistic regression [Correa et. al, 2014] model:
𝐽 𝑆, 𝑀, 𝛽 =
𝑖=1
𝑁
𝑦𝑖 𝑓𝑠 𝑆, 𝑀, 𝛽 𝐶 𝑇𝑃 𝑖
− 𝐶 𝐹𝑁𝑖
+ 𝐶 𝐹𝑁 𝑖
+
1 − 𝑦𝑖 𝑓𝑠 𝑆, 𝑀, 𝛽 𝐶 𝐹𝑃 𝑖
− 𝐶 𝑇𝑁 𝑖
+ 𝐶 𝑇𝑁 𝑖
Then the weights are estimated using
𝛽 = 𝑎𝑟𝑔 min
𝛽
𝐽 𝑆, 𝑀, 𝛽
27
EnsembleCost-SensitiveDecisionTrees
0%
10%
20%
30%
40%
50%
60%
70%
80%
Expert
Rules
Random
Forests
RF CS
Sampling
CS Logistic
Regression
CS
Decision
Tree
Ensemble
CSDT
Majority
Ensemble
CSDT
Stacking
% Savings F1-Score
Results
28
Costcla- Library
29
Costcla- Library
30
• New framework for stacking of example
dependent cost-sensitive decision trees
• Models should be evaluated taking into
account real financial costs of the application
• Algorithms should be developed to
incorporate those financial costs
Conclusions
31
Any questions or comments, please let me know.
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Creditscore
CreditscoreCreditscore
Creditscore
kevinlan
 
CreditCardDefaultModel
CreditCardDefaultModelCreditCardDefaultModel
CreditCardDefaultModel
Andrew Rogala
 
Statistical Models for Proportional Outcomes
Statistical Models for Proportional OutcomesStatistical Models for Proportional Outcomes
Statistical Models for Proportional Outcomes
WenSui Liu
 
TransactionBasedAnalytics2010
TransactionBasedAnalytics2010TransactionBasedAnalytics2010
TransactionBasedAnalytics2010
Vijay Desai
 

Was ist angesagt? (20)

Maximizing a churn campaign’s profitability with cost sensitive predictive an...
Maximizing a churn campaign’s profitability with cost sensitive predictive an...Maximizing a churn campaign’s profitability with cost sensitive predictive an...
Maximizing a churn campaign’s profitability with cost sensitive predictive an...
 
Credit risk scoring model final
Credit risk scoring model finalCredit risk scoring model final
Credit risk scoring model final
 
Creditscore
CreditscoreCreditscore
Creditscore
 
Big Data solution for multi-national Bank
Big Data solution for multi-national BankBig Data solution for multi-national Bank
Big Data solution for multi-national Bank
 
CreditCardDefaultModel
CreditCardDefaultModelCreditCardDefaultModel
CreditCardDefaultModel
 
Credit card fraud detection through machine learning
Credit card fraud detection through machine learningCredit card fraud detection through machine learning
Credit card fraud detection through machine learning
 
Consumer credit-risk3440
Consumer credit-risk3440Consumer credit-risk3440
Consumer credit-risk3440
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients
 
Default Credit Card Prediction
Default Credit Card PredictionDefault Credit Card Prediction
Default Credit Card Prediction
 
Credit Risk Analytics
Credit Risk AnalyticsCredit Risk Analytics
Credit Risk Analytics
 
Statistical Models for Proportional Outcomes
Statistical Models for Proportional OutcomesStatistical Models for Proportional Outcomes
Statistical Models for Proportional Outcomes
 
Taiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detectionTaiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detection
 
TransactionBasedAnalytics2010
TransactionBasedAnalytics2010TransactionBasedAnalytics2010
TransactionBasedAnalytics2010
 
CECL Project Overview
CECL Project OverviewCECL Project Overview
CECL Project Overview
 
Computational Finance Introductory Lecture
Computational Finance Introductory LectureComputational Finance Introductory Lecture
Computational Finance Introductory Lecture
 
Challenges in Computational Finance
Challenges in Computational FinanceChallenges in Computational Finance
Challenges in Computational Finance
 
Decision tree example problem
Decision tree example problemDecision tree example problem
Decision tree example problem
 
A Review on Credit Card Default Modelling using Data Science
A Review on Credit Card Default Modelling using Data ScienceA Review on Credit Card Default Modelling using Data Science
A Review on Credit Card Default Modelling using Data Science
 
An introduction to decision trees
An introduction to decision treesAn introduction to decision trees
An introduction to decision trees
 
Mining Credit Card Defults
Mining Credit Card DefultsMining Credit Card Defults
Mining Credit Card Defults
 

Ähnlich wie Fraud Detection by Stacking Cost-Sensitive Decision Trees

An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
butest
 
A General Framework for Accurate and Fast Regression by Data Summarization in...
A General Framework for Accurate and Fast Regression by Data Summarization in...A General Framework for Accurate and Fast Regression by Data Summarization in...
A General Framework for Accurate and Fast Regression by Data Summarization in...
Yao Wu
 

Ähnlich wie Fraud Detection by Stacking Cost-Sensitive Decision Trees (20)

Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Accurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification AlgorithmsAccurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification Algorithms
 
Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922
 
Accurate Campaign Targeting Using Classification - Poster
Accurate Campaign Targeting Using Classification - PosterAccurate Campaign Targeting Using Classification - Poster
Accurate Campaign Targeting Using Classification - Poster
 
Big Data Analytics.pptx
Big Data Analytics.pptxBig Data Analytics.pptx
Big Data Analytics.pptx
 
Statistics in real life engineering
Statistics in real life engineeringStatistics in real life engineering
Statistics in real life engineering
 
Data Driven Risk Management
Data Driven Risk ManagementData Driven Risk Management
Data Driven Risk Management
 
Primer on major data mining algorithms
Primer on major data mining algorithmsPrimer on major data mining algorithms
Primer on major data mining algorithms
 
Mini datathon
Mini datathonMini datathon
Mini datathon
 
Feature selection with imbalanced data in agriculture
Feature selection with  imbalanced data in agricultureFeature selection with  imbalanced data in agriculture
Feature selection with imbalanced data in agriculture
 
Machine Learning techniques used in AI.
Machine Learning  techniques used in AI.Machine Learning  techniques used in AI.
Machine Learning techniques used in AI.
 
A predictive system for detection of bankruptcy using machine learning techni...
A predictive system for detection of bankruptcy using machine learning techni...A predictive system for detection of bankruptcy using machine learning techni...
A predictive system for detection of bankruptcy using machine learning techni...
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Machine learning techniques in fraud prevention
Machine learning techniques in fraud preventionMachine learning techniques in fraud prevention
Machine learning techniques in fraud prevention
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
 
A high level overview of all that is Analytics
A high level overview of all that is AnalyticsA high level overview of all that is Analytics
A high level overview of all that is Analytics
 
A General Framework for Accurate and Fast Regression by Data Summarization in...
A General Framework for Accurate and Fast Regression by Data Summarization in...A General Framework for Accurate and Fast Regression by Data Summarization in...
A General Framework for Accurate and Fast Regression by Data Summarization in...
 
Decision theory
Decision theoryDecision theory
Decision theory
 

Mehr von Alejandro Correa Bahnsen, PhD

Fraud analytics detección y prevención de fraudes en la era del big data sl...
Fraud analytics detección y prevención de fraudes en la era del big data   sl...Fraud analytics detección y prevención de fraudes en la era del big data   sl...
Fraud analytics detección y prevención de fraudes en la era del big data sl...
Alejandro Correa Bahnsen, PhD
 
Analytics - compitiendo en la era de la informacion
Analytics - compitiendo en la era de la informacionAnalytics - compitiendo en la era de la informacion
Analytics - compitiendo en la era de la informacion
Alejandro Correa Bahnsen, PhD
 

Mehr von Alejandro Correa Bahnsen, PhD (11)

black hat deephish
black hat deephishblack hat deephish
black hat deephish
 
DeepPhish: Simulating malicious AI
DeepPhish: Simulating malicious AIDeepPhish: Simulating malicious AI
DeepPhish: Simulating malicious AI
 
AI vs. AI: Can Predictive Models Stop the Tide of Hacker AI?
AI vs. AI: Can Predictive Models Stop the Tide of Hacker AI?AI vs. AI: Can Predictive Models Stop the Tide of Hacker AI?
AI vs. AI: Can Predictive Models Stop the Tide of Hacker AI?
 
How I Learned to Stop Worrying and Love Building Data Products
How I Learned to Stop Worrying and Love Building Data ProductsHow I Learned to Stop Worrying and Love Building Data Products
How I Learned to Stop Worrying and Love Building Data Products
 
Classifying Phishing URLs Using Recurrent Neural Networks
Classifying Phishing URLs Using Recurrent Neural NetworksClassifying Phishing URLs Using Recurrent Neural Networks
Classifying Phishing URLs Using Recurrent Neural Networks
 
Demystifying machine learning using lime
Demystifying machine learning using limeDemystifying machine learning using lime
Demystifying machine learning using lime
 
Modern Data Science
Modern Data ScienceModern Data Science
Modern Data Science
 
Fraud analytics detección y prevención de fraudes en la era del big data sl...
Fraud analytics detección y prevención de fraudes en la era del big data   sl...Fraud analytics detección y prevención de fraudes en la era del big data   sl...
Fraud analytics detección y prevención de fraudes en la era del big data sl...
 
Analytics - compitiendo en la era de la informacion
Analytics - compitiendo en la era de la informacionAnalytics - compitiendo en la era de la informacion
Analytics - compitiendo en la era de la informacion
 
2013 credit card fraud detection why theory dosent adjust to practice
2013 credit card fraud detection why theory dosent adjust to practice2013 credit card fraud detection why theory dosent adjust to practice
2013 credit card fraud detection why theory dosent adjust to practice
 
2011 advanced analytics through the credit cycle
2011 advanced analytics through the credit cycle2011 advanced analytics through the credit cycle
2011 advanced analytics through the credit cycle
 

Kürzlich hochgeladen

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 

Kürzlich hochgeladen (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Fraud Detection by Stacking Cost-Sensitive Decision Trees

  • 1. Fraud Detection by Stacking Cost-Sensitive Decision Trees Alejandro Correa Bahnsen, PhD Chief Data Scientist & Head of Research acorrea@easysol.net
  • 2. Who am I? Chief Data Scientist at Easy Solutions Industrial Engineer PhD in Machine Learning from Luxembourg University Scikit-Learn contributor Organizer of Science Bogota Meetups 2
  • 3. AboutEasySolutions® 3 A leading global provider of electronic fraud prevention for financial institutions and enterprise customers 430+ customers In 30 countries 115 million Users protected 30 billion Online connections monitored Industry recognition
  • 4. TotalFraudProtection Discuss what makes a data science project successful 4
  • 5. 5 Risk Based Authentication Phishing URL Classification Phishing Brand ID Fraud Detection 19h risk = 10 9h risk = 95 HTML Injection Biometrics
  • 6. Research/ DataScienceSpectrum 6 • Maybe someday, someone can use this Basic Research • I might be able to use this Applied Research • I can use this (sometimes) Working Prototype • Software engineers can use thisQuality Code • People can use this Tool or Service Innovation practicality
  • 7. Credit Card Fraud Detection 7
  • 8. Estimate the probability of a transaction being fraud based on analyzing customer patterns and recent fraudulent behavior Issues when constructing a fraud detection system: • Skewness of the data • Cost-sensitivity • Short time response of the system • Dimensionality of the search space • Feature preprocessing • Model selection Creditcardfrauddetection 8
  • 10. • Larger European card processing company • 2012 & 2013 card present transactions • 20MM Transactions • 40,000 Frauds • 0.467% Fraud rate • ~ 2MM EUR lost due to fraud on test dataset Dec Nov Oct Sep Aug Jul Jun May Apr Mar Feb Jan Test Train Data
  • 11. Raw features 11 Attribute name Description Transaction ID Transaction identification number Time Date and time of the transaction Account number Identification number of the customer Card number Identification of the credit card Transaction type ie. Internet, ATM, POS, ... Entry mode ie. Chip and pin, magnetic stripe, ... Amount Amount of the transaction in Euros Merchant code Identification of the merchant type Merchant group Merchant group identification Country Country of trx Country 2 Country of residence Type of card ie. Visa debit, Mastercard, American Express... Gender Gender of the card holder Age Card holder age Bank Issuer bank of the card Features
  • 12. Credit card fraud detection is a cost-sensitive problem. As the cost due to a false positive is different than the cost of a false negative. • False positives: When predicting a transaction as fraudulent, when in fact it is not a fraud, there is an administrative cost that is incurred by the financial institution. • False negatives: Failing to detect a fraud, the amount of that transaction is lost. Moreover, it is not enough to assume a constant cost difference between false positives and false negatives, as the amount of the transactions varies quite significantly. 12 FinancialEvaluation
  • 13. Cost matrix 𝐶𝑜𝑠𝑡 𝑓 𝑆 = 𝑖=1 𝑁 𝑦𝑖 𝑐𝑖 𝐶 𝑇𝑃 𝑖 + 1 − 𝑐𝑖 𝐶 𝐹𝑁 𝑖 + 1 − 𝑦𝑖 𝑐𝑖 𝐶 𝐹𝑃 𝑖 + 1 − 𝑐𝑖 𝐶 𝑇𝑁 𝑖 13 Actual Positive 𝒚𝒊 = 𝟏 Actual Negative 𝒚𝒊 = 𝟎 Predicted Positive 𝒄𝒊 = 𝟏 𝐶 𝑇𝑃 𝑖 = 𝐶 𝑎 𝐶 𝐹𝑃 𝑖 = 𝐶 𝑎 Predicted Negative 𝒄𝒊 = 𝟎 𝐶 𝐹𝑁 𝑖 = 𝐴𝑚𝑡𝑖 𝐶 𝑇𝑁 𝑖 = 0 FinancialEvaluation
  • 14. • Cost Proportionate Sampling • Bayes minimum risk • Cost-sensitive logistic regression • Cost-sensitive decision trees • Stacking Cost-sensitive decision trees CostSensitiveAlgorithms 14
  • 15. CostProportionateSampling Normalized Cost weight 𝑤𝑖 = 𝐶 𝐹𝑃 𝑖 𝑖𝑓 𝑦𝑖 = 0 𝐶 𝐹𝑁 𝑖 𝑖𝑓 𝑦𝑖 = 1 𝑤𝑖 = 𝑤𝑖 max 𝑗 𝑤𝑗
  • 16. CostProportionateSampling Cost Proportionate Over Sampling Example 𝑦𝑖 𝑤𝑖 1 0 1 2 1 10 3 0 2 4 1 20 5 0 1 Initial Dataset (1,0,1) (2,1,10) (3,0,2) (4,1,20) (5,0,1) Cost Proportionate Dataset (1,0,1) (2,1,1), (2,1,1), …, (2,1,1) (3,0,2), (3,0,2) (4,1,1), (4,1,1), (4,1,1), …, (4,1,1), (4,1,1) (5,0,1) *Elkan, C. (2001). The Foundations of Cost-Sensitive Learning.
  • 17. CostProportionateSampling Cost Proportionate Rejection Sampling Example 𝑦𝑖 𝑤𝑖 1 0 1 2 1 10 3 0 2 4 1 20 5 0 1 Cost Proportion ate Dataset (2,1,1) (4,1,1) (4,1,1) (5,0,1) *Zadrozny et al. (2003). Cost-sensitive learning by cost-proportionate example weighting. 𝑤𝑖/max( 𝑤𝑖) 0.05 0.5 0.1 1 0.05 Initial Dataset (1,0,1) (2,1,10) (3,0,2) (4,1,20) (5,0,1)
  • 18. Decision model based on quantifying tradeoffs between various decisions using probabilities and the costs that accompany such decisions Risk of classification 𝑅 𝑐𝑖 = 0|𝑥𝑖 = 𝐶 𝑇𝑁 𝑖 1 − 𝑝𝑖 + 𝐶 𝐹𝑁 𝑖 ∙ 𝑝𝑖 𝑅 𝑐𝑖 = 1|𝑥𝑖 = 𝐶 𝐹𝑃 𝑖 1 − 𝑝𝑖 + 𝐶 𝑇𝑃 𝑖 ∙ 𝑝𝑖 Using the different risks the prediction is made based on the following condition: 𝑐𝑖 = 0 𝑅 𝑐𝑖 = 0|𝑥𝑖 ≤ 𝑅 𝑐𝑖 = 1|𝑥𝑖 1 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 18 BayesMinimumRisk
  • 19. • Logistic Regression Model • Cost Function • Cost Analysis Cost-Sensitive- LogisticRegression
  • 20. • Actual Costs • Cost-Sensitive Function Cost-Sensitive- LogisticRegression
  • 21. 21 Proposed Cost based impurity measure 𝑆 𝑙 = 𝑥|𝑥𝑖 ∈ 𝑆 ∧ 𝑥𝑖 𝑗 ≤ 𝑙 𝑚 𝑗 𝑆 𝑟 = 𝑥|𝑥𝑖 ∈ 𝑆 ∧ 𝑥𝑖 𝑗 > 𝑙 𝑚 𝑗 • The impurity of each leaf is calculated using: 𝐼𝑐 𝑆 = min 𝐶𝑜𝑠𝑡 𝑓0 𝑆 , 𝐶𝑜𝑠𝑡 𝑓1 𝑆 𝑓 𝑆 = 0 𝑖𝑓 𝐶𝑜𝑠𝑡 𝑓0 𝑆 ≤ 𝐶𝑜𝑠𝑡 𝑓1 𝑆 1 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 • Afterwards the gain of applying a given rule to the set 𝑆 is: 𝐺𝑎𝑖𝑛 𝑐 𝑥 𝑗, 𝑙 𝑚 𝑗 = 𝐼𝑐 𝜋1 − 𝐼𝑐 𝜋1 𝑙 + 𝐼𝑐 𝜋1 𝑟 S S S 𝑥 𝑗, 𝑙 𝑚 𝑗 Cost-SensitiveDecisionTrees
  • 22. 22 Decision trees construction • The rule that maximizes the gain is selected 𝑏𝑒𝑠𝑡 𝑥, 𝑏𝑒𝑠𝑡𝑙 = 𝑎𝑟𝑔 max 𝑗,𝑚 𝐺𝑎𝑖𝑛 𝑥 𝑗, 𝑙 𝑚 𝑗 S S S S S S S S S S S • The process is repeated until a stopping criteria is met: Cost-SensitiveDecisionTrees
  • 23. 23 Proposed cost-sensitive pruning criteria • Calculation of the Tree savings and pruned Tree savings S S S S S S S S S S S 𝑃𝐶𝑐 = 𝐶𝑜𝑠𝑡 𝑓 𝑆, 𝑇𝑟𝑒𝑒 − 𝐶𝑜𝑠𝑡 𝑓 𝑆, 𝐸𝐵 𝑇𝑟𝑒𝑒, 𝑏𝑟𝑎𝑛𝑐ℎ 𝑇𝑟𝑒𝑒 − 𝐸𝐵 𝑇𝑟𝑒𝑒, 𝑏𝑟𝑎𝑛𝑐ℎ • After calculating the pruning criteria for all possible trees. The maximum improvement is selected and the Tree is pruned. • Later the process is repeated until there is no further improvement. S S S S S S S S S S S S S S Cost-SensitiveDecisionTrees
  • 24. Typical ensemble is made by combining T different base classifiers. Each base classifiers is trained by applying algorithm M in a random subset 24 𝑀𝑗 ← 𝑀 𝑆𝑗 ∀𝑗 ∈ 1, … , 𝑇 EnsembleCost-SensitiveDecisionTrees
  • 26. After the base classifiers are constructed they are typically combined using one of the following methods: • Majority voting 𝐻 𝑆 = 𝑓𝑚𝑣 𝑆, 𝑀 = 𝑎𝑟𝑔 max 𝑐∈ 0,1 𝑗=1 𝑇 1 𝑐 𝑀𝑗 𝑆 26 EnsembleCost-SensitiveDecisionTrees
  • 27. • Proposed cost-sensitive stacking 𝐻 𝑆 = 𝑓𝑠 𝑆, 𝑀, 𝛽 = 1 1 + 𝑒 − 𝑗=1 𝑇 𝛽 𝑗 𝑀 𝑗 𝑆 Using the cost-sensitive logistic regression [Correa et. al, 2014] model: 𝐽 𝑆, 𝑀, 𝛽 = 𝑖=1 𝑁 𝑦𝑖 𝑓𝑠 𝑆, 𝑀, 𝛽 𝐶 𝑇𝑃 𝑖 − 𝐶 𝐹𝑁𝑖 + 𝐶 𝐹𝑁 𝑖 + 1 − 𝑦𝑖 𝑓𝑠 𝑆, 𝑀, 𝛽 𝐶 𝐹𝑃 𝑖 − 𝐶 𝑇𝑁 𝑖 + 𝐶 𝑇𝑁 𝑖 Then the weights are estimated using 𝛽 = 𝑎𝑟𝑔 min 𝛽 𝐽 𝑆, 𝑀, 𝛽 27 EnsembleCost-SensitiveDecisionTrees
  • 31. • New framework for stacking of example dependent cost-sensitive decision trees • Models should be evaluated taking into account real financial costs of the application • Algorithms should be developed to incorporate those financial costs Conclusions 31
  • 32. Any questions or comments, please let me know. Alejandro Correa Bahnsen, PhD Chief Data Scientist & Head of Research acorrea@easysol.net Thank you!