UNBALANCED DATA: SAME ALGORITHMS, DIFFERENT TECHNIQUES
Eric Martin

UNBALANCED DATA
• Fraud
• Illness detection
• Anomalies
[Figure: class distribution, Y = 0 vs. Y = 1]

ALGORITHMS POINT OF VIEW
▪ Accuracy
▪ 1,000,000 total transactions (TRX)
▪ 10 fraud TRX
▪ Predicting "no fraud" for every TRX = 99.999% accuracy
▪ Alternatives: recall, F1 score, detection probability

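A minimal Python sketch of why accuracy misleads here, using the slide's numbers (1,000,000 TRX, 10 of them fraud). The "model" that never flags anything is an assumption added for illustration; it is not taken from the talk.

```python
# Numbers from the slide: 1,000,000 transactions, 10 of them fraud.
total = 1_000_000
fraud = 10

# Ground truth: 1 = fraud, 0 = legitimate.
y_true = [1] * fraud + [0] * (total - fraud)
# Illustrative assumption: a "model" that predicts 0 for every transaction.
y_pred = [0] * total

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / total   # share of all TRX classified correctly
recall = true_pos / fraud    # share of fraud TRX that were detected

print(f"accuracy = {accuracy:.3%}")
print(f"recall   = {recall:.0%}")
```

Under these assumptions accuracy is 99.999% while recall is 0%: every fraud TRX is missed, which is why the slide points to recall and related metrics.
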
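UNDERSTANDING THE PROBLEM
▪ Confusion matrix:
[Figure: 2x2 confusion matrix, rows Real 0 / Real 1, columns Pred. 0 / Pred. 1]
LESS ACCURACY!
▪ Trading vs. illness detection:
[Figure: one confusion matrix per problem, rows Real 0 / Real 1, columns Pred. 0 / Pred. 1]
IT DEPENDS ON THE PROBLEM!!
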
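A small sketch of how the four cells of a binary confusion matrix can be counted, and how precision and recall follow from them. The labels are made up for illustration, and the helper name `confusion_counts` is hypothetical. One plausible reading of "it depends on the problem" is that precision matters most when false alarms are costly, and recall when misses are costly; the talk itself does not spell this out.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    cells = Counter(zip(y_true, y_pred))
    return cells[(0, 0)], cells[(0, 1)], cells[(1, 0)], cells[(1, 1)]

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the rows predicted 1, how many are really 1
recall = tp / (tp + fn)     # of the rows that are really 1, how many were found

print("         Pred. 0  Pred. 1")
print(f"Real 0   {tn:7d}  {fp:7d}")
print(f"Real 1   {fn:7d}  {tp:7d}")
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```
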
MOST COMMON PRACTICES
▪ Resampling:
▫ SMOTE
▫ Synthetic sample creation
[Figure: class distribution (Y = 0, Y = 1) before and after resampling]

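A simplified, SMOTE-style sketch of synthetic sample creation: each new minority point is interpolated between an existing minority point and one of its nearest minority neighbours. This is an illustration under stated assumptions, not the reference SMOTE implementation; the function name `smote_like`, the parameter `k`, and the sample points are all hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Made-up 2-D minority samples (class Y = 1), for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.
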
SAME ALGORITHMS, DIFFERENT TECHNIQUES
▪ If you expect different results, you have to do different things
▪ Exploit all the data you have
▪ Bagging algorithms. First step: Random Forest

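For readers new to bagging, a minimal sketch of the bootstrap step: each tree is trained on a sample drawn with replacement from the training rows. The row ids, the number of bags, and the helper name `bootstrap_sample` are assumptions for illustration only.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bag for one tree)."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
rows = list(range(1, 17))  # row ids 1..16 as stand-ins for training rows
bags = [bootstrap_sample(rows, rng) for _ in range(3)]  # one bag per tree

for i, bag in enumerate(bags, start=1):
    print(f"bag {i}: {sorted(bag)}")
```

Some rows appear several times in a bag and others not at all, so each tree sees a slightly different view of the data.
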
RANDOM FOREST
 #  F1   F2  F3     …  FN     Y
 1  1.2  25  True   …  0.185  1
 2  3.4  55  False  …  0.211  1
 3  2.2  58  True   …  0.171  0
 4  4.0  34  True   …  0.132  1
 5  1.1  63  True   …  0.652  0
 6  0.7  61  False  …  0.153  0
 7  3.3  12  False  …  0.477  1
 8  3.1  23  True   …  0.311  1
 9  1.2  29  False  …  0.171  1
10  3.4  45  True   …  0.132  0
11  2.1  55  True   …  0.652  1
12  1.7  19  False  …  0.189  0
13  3.3  12  False  …  0.477  1
14  3.1  23  True   …  0.311  1
15  1.2  29  False  …  0.171  1
16  2.2  58  True   …  0.171  0

RANDOM FOREST
New row to classify:
F1   F2  F3     …  FN     Y
1.5  25  False  …  0.185  ???
Tree votes: 1, 1, 0
MAJORITY VOTE: 1

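The majority vote on this slide can be sketched in a few lines of Python. The three votes are the ones shown on the slide; the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the tree votes."""
    return Counter(votes).most_common(1)[0][0]

tree_votes = [1, 1, 0]  # the three tree outputs shown on the slide
print(majority_vote(tree_votes))  # prints 1
```
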
EM FOREST
[Same training table as on the RANDOM FOREST slide: rows 1 to 16, features F1 … FN, label Y]

EM FOREST: Transforming the problem
Each tree's prediction for a training row becomes a feature of a new table:
 #  Tree1  Tree2  Tree3  Y
 1  1      1      0      1
Rows 2 to 18 are still empty at this step.

EM FOREST: The new problem
 #  Tree1  Tree2  Tree3  Y
 1  1      1      0      1
 2  1      0      1      1
 3  1      1      1      0
 4  0      1      0      1
 5  0      0      0      0
 6  1      0      1      0
 7  0      1      0      1
 8  0      1      0      1
 9  1      0      1      1
10  1      1      0      0
11  0      1      0      1
12  0      0      1      0
13  1      0      1      1
14  1      1      0      1
15  1      1      0      1
16  0      0      1      0
17  0      1      0      1
18  1      0      0      0

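A hedged sketch of what treating the tree votes as a new learning problem could look like, using the 18 rows of this table. The second-stage learner here is a simple lookup (for each vote pattern, predict the label seen most often with it). That learner is an assumption chosen for brevity; the slides do not say which model EM Forest actually trains on the votes.

```python
from collections import Counter, defaultdict

# (Tree1, Tree2, Tree3, Y) for the 18 rows of the slide's table.
rows = [
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1), (0, 0, 0, 0), (1, 0, 1, 0),
    (0, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0), (0, 1, 0, 1), (0, 0, 1, 0),
    (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 0, 1), (0, 0, 1, 0), (0, 1, 0, 1), (1, 0, 0, 0),
]

def majority(votes):
    """Baseline: plain majority vote over the three trees."""
    return 1 if sum(votes) >= 2 else 0

# Second stage (assumed stand-in learner): for each vote pattern, predict
# the label seen most often with that pattern in the table.
labels_by_pattern = defaultdict(Counter)
for *votes, y in rows:
    labels_by_pattern[tuple(votes)][y] += 1
second_stage = {p: c.most_common(1)[0][0] for p, c in labels_by_pattern.items()}

majority_hits = sum(majority(v) == y for *v, y in rows)
second_hits = sum(second_stage[tuple(v)] == y for *v, y in rows)
print(f"majority vote: {majority_hits}/{len(rows)} correct")
print(f"second stage : {second_hits}/{len(rows)} correct")
```

On this toy table the lookup learns, for example, that the pattern (0, 1, 0) goes with Y = 1 even though the majority of trees voted 0. These are scores on the same 18 rows the lookup was built from, so they illustrate the idea only and say nothing about performance on unseen data.
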
EM FOREST: The new possibilities
▪ Vector vs. Aggregated
Vector (one feature per tree):
 #  Tree1  Tree2  Tree3  Y
 1  1      1      0      1
 2  1      0      1      1
 3  1      1      1      0
 4  0      1      0      1
 5  0      0      0      0
 6  1      0      1      0
 7  0      1      0      1
 8  0      1      0      1
Aggregated (Agg = Tree1 + Tree2 + Tree3):
 #  Agg  Y
 1  2    1
 2  2    1
 3  3    0
 4  1    1
 5  0    0
 6  2    0
 7  1    1
 8  1    1

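A short sketch of the two representations, assuming the aggregated feature is the sum of the three tree votes (an interpretation; the slide does not define Agg). It also shows what aggregation loses: different vote vectors can collapse to the same value.

```python
# (Tree1, Tree2, Tree3) for the first 8 rows of the slide's table.
votes = [
    (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0),
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 0),
]

# Aggregated representation: assumed here to be the number of trees voting 1.
agg = [sum(v) for v in votes]

# Group the distinct vote vectors by the aggregate they collapse to.
vectors_by_agg = {}
for v in votes:
    vectors_by_agg.setdefault(sum(v), set()).add(v)

print("agg:", agg)
for value in sorted(vectors_by_agg):
    print(value, "<-", sorted(vectors_by_agg[value]))
```

The vector form keeps which tree voted what, so a second-stage model can weight trees differently; the aggregated form is a single, more compact feature.
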
EM FOREST: The new results
▪ Result improvement: a better score than Random Forest, or at least the same
▪ Result flexibility: better on both balanced and unbalanced data (trading and illness detection)

EM FOREST: Advantages
▪ Open source
▪ Scalability
▪ More possibilities

EM FOREST: Use cases
▪ Real projects:
▫ Credit card usage trends
▪ Demo projects:
▫ Bank fraud
▫ Alcohol in students dataset

THANKS!
Any questions?
You can find me at:
Eric Martin
ericmartinct@gmail.com

Unbalanced data: Same algorithms different techniques by Eric Martín at Big Data Spain 2017
