13. Fraud Detection in the payment flow
[Flow diagram: 150,000 active sellers per day → Risk ML (fraud detection) flags ~2000 suspect sellers → Risk Ops (transaction review) → Bank clears for settlement]
15. Feature Generation
- Card not present: Yes
- Pan Diversity: 0.05
- Use iPhone: No
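A minimal sketch of what feature generation might look like for the three example features above. The transaction keys (`pan`, `card_present`, `device`) are hypothetical, and "Pan Diversity" is read here as distinct cards over total transactions; the deck does not describe the real pipeline.

```python
def generate_features(txns):
    """Build a feature vector from a seller's recent transactions.

    The keys 'pan', 'card_present', and 'device' are hypothetical;
    the real feature pipeline is not described in the deck.
    """
    n = len(txns)
    distinct_pans = len({t["pan"] for t in txns})
    return {
        # one plausible reading of "Pan Diversity": distinct cards / transactions
        "pan_diversity": distinct_pans / n if n else 0.0,
        "card_not_present": any(not t["card_present"] for t in txns),
        "use_iphone": any(t["device"] == "iPhone" for t in txns),
    }

sample = [
    {"pan": "4111-xxxx", "card_present": False, "device": "Android"},
    {"pan": "4111-xxxx", "card_present": True, "device": "Android"},
]
print(generate_features(sample))
```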
16. Decision Tree Model
- Easy to interpret
- Dimension reduction
- Very powerful in ensemble
[Example tree: Decline Rate >= 0.1? → Amount <= $10000? → Business Type = Auto repair? → leaf scores 0.9 / 0.6]
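The "easy to interpret" point is easiest to see in code: the slide's example tree is just nested if-statements. Only the three splits and the 0.9 / 0.6 leaf scores come from the slide; the split wiring and the remaining leaf value are guesses for illustration.

```python
def tree_score(txn):
    """Score a transaction with a tree wired like the slide's example.

    Only the splits and the 0.9 / 0.6 leaf scores are from the slide;
    the branch order and the fallback leaf are assumptions.
    """
    if txn["decline_rate"] >= 0.1:            # Decline Rate >= 0.1?
        if txn["amount"] <= 10000:            # Amount <= $10000?
            if txn["business_type"] == "Auto repair":
                return 0.9                    # leaf score from the slide
            return 0.6                        # leaf score from the slide
    return 0.1  # placeholder for the leaves the slide does not label

print(tree_score({"decline_rate": 0.2, "amount": 5000,
                  "business_type": "Auto repair"}))
```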
17. Random Forests: Decision Tree Ensemble
Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32.
[Diagram: Trees 1..N, each splitting on different features — e.g. Decline Rate <= 0.1 / Amount <= $10000 / Business Type = Auto repair (leaves 0.9, 0.6); Success Rate <= 0.2 / Age >= 20 / Amount <= $1000 (leaves 0.4, 0.7); Decline Rate <= 0.3 / Amount <= $20000 / Age <= 22 (leaves 0.8, 0.6). Per-tree outputs: Bad 0.9, Good 0.4, Bad 0.6]
- Mode for classification = Bad
- Average for regression = 0.63
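The two aggregation rules above, applied to the slide's per-tree outputs, can be sketched with the standard library:

```python
from statistics import mean, mode

# Per-tree outputs from the slide: a class label and a score per tree.
tree_outputs = [("Bad", 0.9), ("Good", 0.4), ("Bad", 0.6)]

labels = [label for label, _ in tree_outputs]
scores = [score for _, score in tree_outputs]

classification = mode(labels)        # majority vote across trees
regression = round(mean(scores), 2)  # average of the tree scores
print(classification, regression)    # Bad 0.63
```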
19. Random Forests - Build each Tree
Step 1: Draw a random sample from all data.
20. Random Forests - Build each Tree
Step 2: From the full feature set (Dollar Amount, Connected with bad user, Business Type, Decline Rate, Time of Day, Location), randomly select sqrt(n) features.
21. Random Forests - Build each Tree
Step 3: Find the best split: feature and value (e.g. Decline Rate <= 0.1, with branch scores 0.4 / 0.6).
22. Random Forests - Build each Tree
Step 4: Grow the tree on each branch, repeating the feature-sampling and best-split steps.
23. Random Forests - Build each Tree
Step 5: STOP when the sample size is small.
24. Random Forests - Build each Tree
Step 6: Repeat these steps multiple times to create a forest.
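The six steps above can be sketched end to end. This is a toy regression-style implementation under stated assumptions (squared-error split criterion, averaged labels at the leaves, bootstrap resampling), not the deck's actual learner:

```python
import math
import random

def leaf(sample):
    # Leaf value: average label of the remaining sample.
    return sum(r["label"] for r in sample) / len(sample)

def best_split(sample, features):
    """Best split (feature and value): minimize within-branch squared error."""
    best, best_err = None, float("inf")
    for f in features:
        for t in sorted({row[f] for row in sample}):
            left = [r["label"] for r in sample if r[f] <= t]
            right = [r["label"] for r in sample if r[f] > t]
            if not left or not right:
                continue
            err = sum((v - sum(left) / len(left)) ** 2 for v in left)
            err += sum((v - sum(right) / len(right)) ** 2 for v in right)
            if err < best_err:
                best, best_err = (f, t), err
    return best

def grow_tree(sample, all_features, min_size=2):
    if len(sample) <= min_size:              # STOP when sample size is small
        return leaf(sample)
    k = max(1, int(math.sqrt(len(all_features))))
    feats = random.sample(all_features, k)   # randomly select sqrt(n) features
    split = best_split(sample, feats)
    if split is None:                        # nothing left to split on
        return leaf(sample)
    f, t = split
    return {"split": (f, t),
            "yes": grow_tree([r for r in sample if r[f] <= t], all_features, min_size),
            "no": grow_tree([r for r in sample if r[f] > t], all_features, min_size)}

def build_forest(data, features, n_trees=10):
    forest = []
    for _ in range(n_trees):                 # repeat to create a forest
        sample = [random.choice(data) for _ in data]  # draw samples from all data
        forest.append(grow_tree(sample, features))
    return forest
```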
27. Boosting Trees
[Diagram: Tree 1 → Tree 2 (helps Tree 1) → Tree 3 (helps Trees 1, 2) → Tree 4 (helps Trees 1, 2, 3); stop when no help is needed. The first tree is trained on all samples; each later tree corrects the errors of the trees before it.]
28. Boosting Trees
[Diagram: Trees 1–4 as before; the final score is the sum of the trees' outputs: 7.5 = 8.0 + (-2.0) + 1.0 + 0.5]
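The additive combination on this slide is just a running sum of per-tree corrections (values from the slide):

```python
from itertools import accumulate

# Each boosted tree adds a correction on top of the trees before it;
# the ensemble's score is the running sum of the tree outputs.
tree_outputs = [8.0, -2.0, 1.0, 0.5]

print(list(accumulate(tree_outputs)))  # [8.0, 6.0, 7.0, 7.5]
print(sum(tree_outputs))               # 7.5
```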
29. Boosting Trees - Algorithm
Objective function: [loss formula not recovered from the slide]
Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." 1999
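The slide's objective-function image did not survive extraction. Schematically, the gradient boosting machine of the cited Friedman paper fits an additive model stagewise, each stage minimizing a loss $L$ over the training data (the notation below is standard, not the slide's):

```latex
F_M(x) = \sum_{m=1}^{M} \gamma_m h_m(x),
\qquad
(\gamma_m, h_m) = \arg\min_{\gamma,\,h}
\sum_{i=1}^{N} L\!\left(y_i,\; F_{m-1}(x_i) + \gamma\, h(x_i)\right)
```

Each $h_m$ is a tree fit to the negative gradient of the loss at the current model $F_{m-1}$, which is what "Help Tree 1, 2, …" depicts on the previous slides.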
30. Results - Precision (at a fixed recall level)

Model           | April  | May   | June
Random Forest   | 76%    | 77%   | 80%
Boosting Trees  | 85%    | 82%   | 88%
Improvement     | +11.8% | +6.5% | +10%
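A minimal sketch of the evaluation metric named in the title: rank transactions by model score, then take the precision at the smallest cutoff whose recall reaches the target. The deck does not state which recall level it fixed, so `target_recall` here is an assumption.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the top-k cutoff where recall first reaches target_recall.

    scores: model scores (higher = more suspicious); labels: 1 = fraud.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        if tp / total_pos >= target_recall:
            return tp / k        # precision among the top k
    return tp / len(ranked)

print(precision_at_recall([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], 0.5))
```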
31. Results - Fraud Detection Recall
[Chart: Fraud $ prevented vs. # payments to reject, with the curve divided into Easy, Medium, and Hard regions]
32. Data Sampling
Highly biased label distribution: fewer than 1 in 1000 samples are positive.
Weighted training:
- Higher weights on positive samples => oscillation
- Lower weights on negative samples => no real gain
Solution:
- Keep the negative:positive ratio between 3:1 and 10:1
- Scale the final model if calibration is needed
Less data requires fewer resources to train.
Observed a +10% improvement moving from 20:1 to 3:1.
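A minimal sketch of the sampling step described above, assuming a simple `{"label": 0/1}` record shape: keep every positive and randomly keep only enough negatives to hit the target ratio.

```python
import random

def downsample(data, neg_pos_ratio=3, seed=0):
    """Keep all positives; keep negatives at roughly neg_pos_ratio : 1.

    The 3:1 default matches the low end of the deck's recommended range.
    """
    rng = random.Random(seed)
    pos = [x for x in data if x["label"] == 1]
    neg = [x for x in data if x["label"] == 0]
    kept = rng.sample(neg, min(len(neg), neg_pos_ratio * len(pos)))
    return pos + kept
```

On the scaling point: if only a fraction f of negatives is kept, the trained model overstates the odds of fraud by roughly 1/f, so one standard calibration correction is to multiply the predicted odds back by f before thresholding.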
40. Square Random Forest

RF Learner                                   | Implementation                                  | Time (Train / Test)
RiskML Random Forest (built on Scikit-Learn) | C / Cython / Python (open source + Square code) | 72 minutes
WiseRF                                       | C++ (proprietary)                               | 23 minutes
Square Random Forest                         | Java (Square code)                              | 15 minutes

Note: times reported on 3M training and 15M testing examples.
41. Learning Management System
‣ Supports non-sophisticated users
‣ Fast ad-hoc analytics
‣ Accessible to everyone for easy model generation and evaluation
‣ Tracks results so that different models can be compared