Alex Korbonits is a Data Scientist at Remitly, Inc., where he works extensively on feature extraction and putting machine learning models into production. Outside of work, he loves Kaggle competitions, is diving deep into topological data analysis, and is exploring machine learning on GPUs. Alex is a graduate of the University of Chicago with degrees in Mathematics and Economics.
Abstract
Applications of machine learning and ensemble methods to risk rule optimization:
At Remitly, risk management involves a combination of manually created and curated risk rules as well as black-box inputs from machine learning models. Currently, domain experts manage risk rules in production using logical conjunctions of statements about input features. In order to scale this process, we’ve developed a tool and framework for risk rule optimization that generates risk rules from data and optimizes rule sets by ensembling rules from multiple models according to a particular objective function. In this talk, I will describe how we currently manage risk rules, how we learn rules from data, how we determine optimal rule sets, and the importance of smart input features extracted from complex machine learning models.
3.
Introduction
• Risk management and risk rules
• Generating rules from machine learning models
• Incremental rule ranking
• Model ensembling
• Rule inclusion/exclusion criteria
• Why this matters to Remitly
Agenda
4. A spectre is haunting risk management — the spectre of…
8.
Risk rules, how do they work?
• Rules are typically managed via a GUI. Dropdown menus, etc.
• Rules are logical conjunctions of expressions of input data, e.g.:
(x < 10) AND (y > 20) AND (z < 100)
• Rule conditions are based on transaction and customer attributes.
• Collectively, all rules form a logical disjunction, e.g.:
rule1 OR rule2 OR rule3
• When one rule triggers, we queue a transaction for review.
• Easy to integrate rules we’ve learned from data into this framework.
Risk management and risk rules
9.
FOILed again
• FOIL (first order inductive learner)
• Accepts binary features only
• A rule is a simple conjunction of binary features
• Learns rules via separate-and-conquer
• Decision tree
• Accepts continuous and categorical features
• A single rule is a root-to-leaf path
• Learns via divide-and-conquer
Generating rules from machine learning models
10.
Separate-and-conquer
• FOIL takes as its input sequences of features and a ground truth. We map all of our input features to a boolean space.
• Different strategies for continuous features, e.g., binning.
• FOIL learns Horn Clause programs from examples
Implication form: (p ∧ q ∧ ... ∧ t) → u
Disjunction form: ¬p ∨ ¬q ∨ ... ∨ ¬t ∨ u
• Learns Horn Clause programs from positive class examples.
• Examples are removed from training data at each step.
• FOIL rules are simply lists of features.
• We map rules we learn from FOIL into human-readable rules that we can implement in our risk rule management system.
FOIL (First Order Inductive Learner)
11.
Divide-and-conquer
• Decision trees are interpretable
• A rule is a root-to-leaf path.
• Like a FOIL rule, a decision tree rule is a conjunction.
• Use DFS to extract all rules from a decision tree
• Easy to evaluate together with FOIL rules
• Easily implementable in our risk rule management system
Decision Trees
12.
SQL to the rescue
• We synthesize hand-crafted rule performance with SQL
• For each transaction, we know if a rule triggers or not.
• We can use this to synthesize new handcrafted rules that aren’t yet in production.
• We can derive precision/recall easily from this data.
• We can rank productionized rules alone to look at rules we can immediately eliminate from production (i.e., remove redundancy).
• We can rank productionized rules alone to establish a baseline level of performance for risk rule management.
Synthesizing Production Rules
13.
You are the weakest rule, goodbye!
• Today, there are hundreds of rules live in production.
• A single decision tree or FOIL model can represent thousands of rules.
• Can we find a strict subset of those rules that recalls the exact same amount of fraud?
• First we measure the performance of each rule individually on a test set.
• With each step, we get the (next) best rule and remove the fraud from our test set that our (next) best rule catches.
• We repeat this process until our rules no longer catch any uncaught fraud, whereupon the process terminates.
Incremental Rule Ranking
14.
Will it blend?
• Ensembling rules gives us a lot of lift
• We ensemble:
• Synthesized production rules
• FOIL rules
• Decision tree rules
• We rank a list of candidate rules from each model class.
• Our output is a classifier of ensembled rules
• We’re seeing an 8% jump in recall and a 1% increase in precision
Model ensembling
15.
To include or not to include, that is the question
• Risk rule optimization is a constraint optimization problem
• Optimal rule sets must satisfy business constraints
• We must balance catching fraud with insulting customers
• Constraints can be nonlinear, e.g., with tradeoffs between precision and recall.
• With each ranking step, we evaluate the whole classifier
• We include a rule when our classifier fits our criteria
• We discard rules when our classifier violates our criteria
Rule inclusion/exclusion criteria
16.
It’s a rule in a black-box!
• The most informative rule features are derived from black-box models.
• Rule sets with these features as conditions are a kind of model stacking.
• Risk rules are limited to conjunctions, but their inputs are not.
• Adding more black-box inputs improves the rules we learn.
• Better black-box inputs reduce the complexity of rules (i.e., they have fewer conditions).
Black box input features
17.
How did we do this?
• Redshift
• Python
• S3
• EC2 p2.xlarge with deep learning AMI
• GPU instance gives us a ~17x speedup in training/inference compared to a laptop
• TensorFlow/Keras
• Scalding
Technologies used
18.
Citing our sources
Bibliography
Fürnkranz, Johannes. "Separate-and-conquer rule learning." Artificial Intelligence Review 13, no. 1 (1999): 3-54.
Mooney, Raymond J., and Mary Elaine Califf. "Induction of first-order decision lists: Results on learning the past tense of English verbs." JAIR 3 (1995): 1-24.
Quinlan, J. Ross. "Induction of decision trees." Machine learning 1, no. 1 (1986): 81-106.
Quinlan, J. Ross. "Learning logical definitions from relations." Machine learning 5, no. 3 (1990): 239-266.
Quinlan, J. R. "Determinate literals in inductive logic programming." In Proceedings of the Eighth International Workshop on Machine Learning, pp. 442-446. 1991.
Quinlan, J., and R. Cameron-Jones. "FOIL: A midterm report." In Machine Learning: ECML-93, pp. 1-20. Springer Berlin/Heidelberg, 1993.
Quinlan, J. Ross, and R. Mike Cameron-Jones. "Induction of logic programs: FOIL and related systems." New Generation Computing 13, no. 3-4 (1995): 287-312.
Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014.
19.
What we talked about
• Risk management and risk rules
• Generating rules from machine learning models
• Incremental rule ranking
• Model ensembling
• Rule inclusion/exclusion criteria
• Why this matters to Remitly
Summary
20.
Remitly’s Data Science team uses ML for a variety of purposes.
ML applications are core to our business – therefore our business must be core to our ML applications.
Machine learning at Remitly
Hi everyone.
My name is Alex Korbonits, and I am a data scientist at Remitly.
This talk is broadly about applying machine learning to legacy risk rule systems.
Before we dive in, here’s a little bit about Remitly and me.
Remitly was founded in 2011 to forever change the way people send money to their loved ones.
Worldwide, remittances represent over 660 billion dollars annually, roughly 4x the amount of foreign aid.
We’re the largest independent digital remittance company in the U.S.
We’re sending over 2 billion dollars annually and growing quickly.
I'm Remitly's first data scientist, and our team is growing.
Right now my principal focus is FRAUD CLASSIFICATION
Previously, I was a data scientist at a startup called Nuiku, focusing on NLP.
First, a quick background on risk management systems and how risk rules are used in industry.
Almost always these rules are hand-crafted by domain experts. Why not generate rules from machine learning models?
Once we’ve generated rules, we’ll consider how to measure their effectiveness via ranking them, as single models in isolation or ensembled together with rules from other models.
Importantly, we’re able to evaluate rules we’ve generated from machine learning models together with existing hand-crafted risk rules.
Industrial settings require thinking beyond status quo model evaluation metrics: today we’ll consider tying model and rule selection to business costs and impact.
That makes sense, and dollars and cents.
Internally, we’ve developed a tool that can do all of this end-to-end. It’s being used by fraud domain experts to optimize our current risk rules in production.
A Spectre is haunting risk management... the spectre of...
COMMUNISM.
Wait, hold on a second, that’s another talk…
The spectre of… BIG DATA
Typically, risk rules are handcrafted by domain experts.
They're usually bucketed into different categories and their overall effect is orchestrated with different priorities and workflows.
Risk rules come in many flavors. In the case of fraud, the majority of risk rules are targeted toward common MOs of fraudsters.
We also use risk rules to comply with company policy and governmental policy. Not all risk rules have to do with fraud. Plenty are for KYC, or, know-your-customer purposes. Last, risk rule management systems are used to detect suspicious or illegal activity that isn't fraud, for example, for money laundering.
Policy rules make sense to implement by hand because they are a direct reflection of those policies. However, when it comes to fraud rules or rules for suspicious activity, all too often rules are created in a reactionary manner to cover slightly generalized patterns of examples of known fraud that were previously undetected.
The spectre of big data renders this process impractical, inefficient, and expensive.
It’s imperative that we begin to scale out our production of new rules so that we can keep up with managing existing and new risks we face every day.
We want to use machine learning to change risk rule management from a reactionary to a predictive practice.
Last, we don't just want to manage our risk rules. We want to optimize them, with some constraints.
Risk rules, how do they work?
Typically, risk rules are managed in a GUI. Dropdown menus, clicking boxes, etc.
The complexity of a single rule is usually that of a logical conjunction. Foo AND bar AND baz
We use customer and transaction inputs as features to our rules based on policy and domain knowledge.
They’re what you’d expect. Recency, frequency, and magnitude features are very common, as are count features.
Our risk classifier as a whole is a logical disjunction of all of our rules.
Rule1 OR rule2 OR rule3
At its simplest, when a rule fires we queue a transaction for manual risk review.
We can easily integrate rules generated from machine learning models into this system if they can be represented as conjunctions.
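To make this concrete, here’s a minimal sketch of that representation; the feature names and thresholds are hypothetical, not our production rules:

```python
import operator

# A condition is (feature_name, comparison, threshold); a rule is a conjunction of conditions.
OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def rule_fires(rule, txn):
    """A rule fires only if every condition holds (logical conjunction)."""
    return all(OPS[op](txn[feat], thresh) for feat, op, thresh in rule)

def classifier_fires(rules, txn):
    """The overall classifier is the disjunction of all rules: any one firing queues the transaction for review."""
    return any(rule_fires(rule, txn) for rule in rules)

# Hypothetical example mirroring (x < 10) AND (y > 20) AND (z < 100):
rule1 = [("x", "<", 10), ("y", ">", 20), ("z", "<", 100)]
rule2 = [("amount", ">", 500)]
txn = {"x": 5, "y": 25, "z": 50, "amount": 100}
print(classifier_fires([rule1, rule2], txn))  # True: rule1 fires
```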
We need to train machine learning models on our dataset and extract rules from them.
How do we do that?
For our MVP, we start with two simple model classes.
Single decision trees and FOIL models.
FOIL stands for First Order Inductive Learner.
It learns rules differently than decision trees: via separate-and-conquer vs. divide-and-conquer.
For example, a transaction can only follow a single path through a decision tree.
However, a single transaction can trigger multiple FOIL rules at inference time, even though during training each transaction is covered by only one FOIL rule before it is discarded.
Feature engineering is important with FOIL, where splits for continuous features need to be pre-specified.
Rules extracted from decision trees are nice since the splits are learned during training.
What is FOIL? Again, FOIL stands for First Order Inductive Learner.
First, to prep our data for FOIL, we take sequences of input data and our label and map them to a feature space of booleans.
For categorical or sparse features this is straightforward but for continuous features we have more flexibility.
Binning is a pretty simple option. What's good is that there is always room to improve this feature engineering process.
Deciding how to do this for continuous variables properly is extremely important, especially when certain continuous variables are skewed.
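As a rough illustration of one such strategy, here’s a quantile-binning sketch; the feature name, values, and bin count are made up. Quantile edges adapt to skewed distributions better than equal-width bins:

```python
import numpy as np

def quantile_bin_features(values, name, n_bins=4):
    """Map a continuous feature to boolean indicator features using quantile bin edges."""
    # Interior edges only: n_bins bins need n_bins - 1 cut points.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    features = {}
    for v in values:
        bin_idx = int(np.searchsorted(edges, v))
        for b in range(n_bins):
            # One boolean feature per bin; exactly one is True per value.
            features.setdefault(f"{name}_bin_{b}", []).append(bin_idx == b)
    return features

# Hypothetical skewed "amount" feature:
amounts = [1, 2, 3, 4, 5, 10, 50, 1000]
booleans = quantile_bin_features(amounts, "amount")
```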
A FOIL model learns Horn clause programs from examples.
What's a Horn clause program?
Effectively, a Horn clause program is a conjunction of boolean statements about your data which imply a particular class.
There are two big ways that FOIL is different from a decision tree.
One, they learn differently. FOIL models look for positive examples first, and learn a very precise boolean box around those examples.
Then, those positive examples are removed from the training data. Subsequent rules are learned from the remaining training data.
This process continues to produce highly targeted/precise rules for us.
With decision trees, the gain of a given split is evaluated globally.
Second, in a decision tree, nearby leaf nodes share a LOT of ancestors together. FOIL rules cover the space of examples differently.
This is one of the reasons we chose to use FOIL, as its different hypotheses about our data act as a nice form of regularization.
When training has completed, we map rules we’ve learned via FOIL into human-readable rules that we can implement in our risk rule management system.
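The separate-and-conquer loop described above can be sketched roughly as follows. This is a simplification, not Quinlan’s FOIL: literals are scored here by naive precision rather than FOIL’s information-gain criterion, and the data shapes are hypothetical:

```python
def precision_if_added(covered, rule, lit):
    """Precision of the rule on the covered examples if we add one more literal."""
    new = rule | {lit}
    hits = [(x, y) for x, y in covered if new <= x]
    return (sum(y for _, y in hits) / len(hits)) if hits else 0.0

def separate_and_conquer(examples, labels, candidate_literals, max_rules=10):
    """Greedy covering: learn one precise rule, remove the positives it covers, repeat.

    Each example is a set of true boolean literals; a rule is a frozenset of
    literals that must all hold (a conjunction).
    """
    data = list(zip(examples, labels))
    rules = []
    while any(y for _, y in data) and len(rules) < max_rules:
        rule, covered = set(), data
        # Grow the rule literal-by-literal until it covers only positives.
        while any(not y for _, y in covered):
            best = max(candidate_literals - rule,
                       key=lambda lit: precision_if_added(covered, rule, lit))
            rule.add(best)
            covered = [(x, y) for x, y in covered if rule <= x]
            if not covered:
                break
        rules.append(frozenset(rule))
        # "Separate": drop the positives this rule covers; keep conquering the rest.
        data = [(x, y) for x, y in data if not (y and rule <= x)]
    return rules
```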
So that’s FOIL.
We also chose to derive sets of rules from decision tree models since they’re also interpretable.
Like a FOIL rule, a decision tree rule is simply a conjunction of conditions at each branch of the tree. A single condition is a feature, a threshold, and an inequality.
Rule extraction is easy: just find all root-to-leaf paths via depth-first-search.
We use a common framework to evaluate FOIL rules and decision tree rules together with hand-crafted rules that are already in production
These rules are easy to implement in our risk rule management system.
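A sketch of that extraction using scikit-learn’s tree internals (the dataset here is synthetic, just to have a fitted tree to walk):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def extract_rules(tree, feature_names):
    """Depth-first search over a fitted sklearn tree; each root-to-leaf path is one rule.

    A condition is (feature, inequality, threshold); a rule is the conjunction
    of the conditions along the path.
    """
    t = tree.tree_
    rules = []

    def dfs(node, path):
        if t.children_left[node] == -1:  # leaf: the accumulated path is a complete rule
            rules.append(path)
            return
        feat, thresh = feature_names[t.feature[node]], t.threshold[node]
        dfs(t.children_left[node], path + [(feat, "<=", thresh)])
        dfs(t.children_right[node], path + [(feat, ">", thresh)])

    dfs(0, [])
    return rules

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = extract_rules(clf, [f"f{i}" for i in range(4)])
# One rule per leaf; each is a conjunction like [("f2", "<=", 0.1), ("f0", ">", -1.3)]
```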
To evaluate rules learned from machine learning models together with hand-crafted rules, we need to synthesize our rules in the data.
We don’t just want historical performance of our rules. We want to consider synthesizing performance of new handcrafted rules, too, before they’re in production.
Here we write SQL to see, for every transaction, whether or not our production rules would have triggered.
We can derive all of the same metrics with these rules as we can when we evaluate decision tree or FOIL rules.
If we evaluate the performance of all of our hand-crafted rules, we can immediately see where we can eliminate some of them that aren’t a value-add, i.e., redundant rules that may be causing unnecessary manual reviews.
We can also look at our hand-crafted rules alone to establish a baseline level of performance. We have a minimum bar that we can augment with rules learned from data.
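Once we have per-transaction trigger data, precision and recall per rule fall out directly. A toy sketch with made-up trigger data:

```python
# Each row: (transaction_id, fraud_label, {rule_name: triggered}) -- toy synthesized data.
rows = [
    (1, True,  {"rule_a": True,  "rule_b": False}),
    (2, True,  {"rule_a": True,  "rule_b": True}),
    (3, False, {"rule_a": False, "rule_b": True}),
    (4, False, {"rule_a": False, "rule_b": False}),
    (5, True,  {"rule_a": False, "rule_b": False}),
]

def precision_recall(rows, rule):
    """Standard precision/recall from a rule's triggers against the fraud label."""
    tp = sum(1 for _, fraud, trig in rows if trig[rule] and fraud)
    fp = sum(1 for _, fraud, trig in rows if trig[rule] and not fraud)
    fn = sum(1 for _, fraud, trig in rows if not trig[rule] and fraud)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(rows, "rule_a"))  # (1.0, 0.666...): precise but misses txn 5
```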
How do we do this evaluation? Next, we turn to a process I’m calling incremental rule ranking.
Now we have Incremental rule ranking
This algorithm allows us to properly assess and compare rules, regardless of whether we manually create them or learn them from data.
We do so by ranking rules *incrementally*, according to each rule’s F-beta score.
We use an F-beta score that directly ties the overall performance of our rules to our internal goals.
Ranking incrementally is a multi-step process that begins with measuring the performance of each rule individually on a test set.
With each step, we get the (next) best rule and remove the fraud from our test set that our (next) best rule catches. We repeat this process until our rules no longer catch any uncaught fraud, whereupon the process terminates.
This is a slight variant on the separate and conquer strategy that FOIL employs during model training.
Here we’re not learning new rules, but we ARE finding a subset of rules that obtains the same recall as ALL of the rules combined. We’re getting a more precise overall classifier.
We can do this for a single model class or source of rules. We can ensemble rules together from different sources or model classes.
If we synthesize our production rules, we can measure their effectiveness as a baseline.
Holding everything else constant, increasing *beta* will result in fewer rules and lower overall precision, whereas decreasing *beta* will result in more rules and higher overall precision.
For a given beta and set of candidate rules, this algorithm does not increase the amount of fraud that the candidate rules catch. It gives us the most efficient subset of candidate rules that catch the maximum amount of fraud, drastically reducing the number of rules needed to do so and improving overall precision.
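The ranking loop just described can be sketched as follows, with each rule represented simply as the set of transaction ids it flags; this is a simplified stand-in for our real pipeline, not its actual code:

```python
def fbeta(tp, fp, fn, beta):
    """F-beta: beta > 1 weights recall more heavily; beta < 1 weights precision."""
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0

def rank_incrementally(rules, fraud_ids, beta=1.0):
    """Greedily pick the rule with the best F-beta against *remaining* fraud,
    remove the fraud it catches, and repeat until no rule catches uncaught fraud."""
    remaining = set(fraud_ids)
    available = dict(rules)  # name -> set of transaction ids the rule flags
    ranked = []
    while remaining and available:
        def score(name):
            flagged = available[name]
            tp = len(flagged & remaining)
            fp = len(flagged - set(fraud_ids))  # non-fraud reviews the rule causes
            return fbeta(tp, fp, len(remaining) - tp, beta)
        best = max(available, key=score)
        caught = available[best] & remaining
        if not caught:  # no rule catches any uncaught fraud: terminate
            break
        ranked.append(best)
        remaining -= caught  # "remove the fraud our best rule catches"
        del available[best]
    return ranked
```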
Ensembling rules gives us a lot of lift. We ensemble synthesized production rules, FOIL rules, and decision tree rules: we rank a list of candidate rules from each model class, and our output is a classifier of ensembled rules. We’re seeing an 8% jump in recall and a 1% increase in precision.
Risk rule optimization is a constraint optimization problem
We can’t just maximize the overall precision/recall of our classifier. That won’t do.
Fraud is very expensive for us. We want to catch as much of it as possible. However, we don’t want to review every transaction.
Our economic constraints weight the cost of false negatives so heavily that extremely high recall is required for us to keep the lights on.
On the other hand, if we were to review every single transaction, we’d have great recall, but we’d insult customers, increase friction, and also have to take the time to do all of those reviews!
So, we need to evaluate the classifier during the ranking process to ensure that we’re not putting ourselves out of business.
Said another way, we want to make sure our classifier is doing its job and represents a viable set of rules to put into production.
When we are considering adding a rule to our classifier, we evaluate the classifier as a whole to make sure our constraints are satisfied.
If so, we add the rule and continue. If not, we discard the rule and consider other rule candidates.
The rule inclusion/exclusion process is directly tied to our ranking process.
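That inclusion/exclusion step can be sketched as a check on the whole classifier after tentatively adding each candidate; the precision floor here is a made-up number, not our real constraint:

```python
def ensemble_metrics(selected, rules, fraud_ids):
    """Precision/recall of the disjunction of all selected rules."""
    flagged = set().union(*(rules[name] for name in selected)) if selected else set()
    tp = len(flagged & fraud_ids)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(fraud_ids) if fraud_ids else 0.0
    return precision, recall

def select_rules(ranked, rules, fraud_ids, min_precision=0.30):
    """Walk the ranked candidates; include a rule only if the classifier as a whole
    still satisfies the precision floor, otherwise discard it and move on."""
    selected = []
    for name in ranked:
        precision, _ = ensemble_metrics(selected + [name], rules, fraud_ids)
        if precision >= min_precision:
            selected.append(name)
    return selected
```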
On the right we have a precision recall plot and 3 curves. Each curve here represents a source of rules that have been ranked. Green and blue represent rules sourced from single model classes. Red is from ensembling these rules together prior to ranking.
Each point, going from left to right, represents the cumulative precision and recall of a classifier after N rules have been ranked.
In this example, we represent our constraints by these black lines. We want a classifier that is in the upper-right-hand quadrant defined by these two constraints.
The horizontal line is one we look at during each step of ranking. We don't want our classifier to be this imprecise. The vertical line is more of a goal rather than a constraint. We want our classifier to eventually get past this line. It means we're really kicking fraud's butt.
The most informative rule features are derived from black-box models. Rule sets with these features as conditions are a kind of model stacking. Risk rules are limited to conjunctions, but their inputs are not. Adding more black-box inputs improves the rules we learn, and better black-box inputs reduce the complexity of rules, i.e., they have fewer conditions.
I’d like to add that risk rule optimization as a tool can be used for multiple things. On the one hand, if we don’t engineer new features before learning new rules, we can immediately put them into production.
On the flipside, we can engineer and test new features, and then see how useful they are to justify putting them into production.
So how did we do this?
We used a bunch of technologies to do risk rule optimization.
Redshift – we use Redshift extensively for our data warehousing and for synthesizing our hand-crafted production rules
Python – we used python extensively for building our risk-rule-optimization machine learning pipeline
S3 – we used S3 a lot for pushing/pulling input data, outputs, and all sorts of stuff.
EC2 – I used EC2 instances for this process. The deep learning AMI was great for speeding up the training and inference of models used for building black-box input features.
A GPU instance gives us a ~17x speedup in training/inference compared to a laptop.
I generally used r3 and r4 instances for the rest of the pipeline as well as for training and testing FOIL and decision tree models.
TensorFlow/Keras – I used tensorflow/keras for building black-box models as inputs.
Scalding – we used scalding for ETL to turn raw sources of production data into data ready to use by our machine learning models and for synthesizing our production rule performance
And here is a bibliography that I certainly won’t begin to go through but which I’ll leave here for those who want to dive deeper.
Hand-tailored risk management systems and risk rules are the status quo.
Machine learning models help us learn new risk rules from data and improve upon rules we already have in production.
Since we’re learning rules from data it’s imperative that we carefully assess our rules one by one and as part of a whole classifier.
We need to evaluate rules we’ve generated in concert with existing hand-crafted rules currently in production. Some of them are enormously valuable. We can’t simply cast them aside because they’re hand-crafted.
Ranking rules helps us quantify their effectiveness and gauge how well our rules can generalize to unseen data.
Tying in business constraints and objectives helps us choose what to implement and what to avoid – even if a rule recalls a lot of fraud for us, it could be extremely crude and subject an unreasonable and unnecessary number of customers to our risk investigation process.
We are just getting started. We’re thinking of ways to incorporate black-box models further into risk rule optimization, both in terms of building smarter input features and for extracting rules. There are all sorts of other avenues: we can learn rules on top of rules to simulate rules with more complicated first-order logic, for example.
What does machine learning at Remitly look like?
Understanding:
Fraud classification
Risk rule optimization
Anomaly detection
Customer segmentation and customer lifetime value
Pricing optimization
We're hiring!
Email me at alex@remitly.com.
That’s all, folks!
THANKS