SlideShare a Scribd company logo
1 of 51
How We Use FP to
Find the Bad Guys
Richard Minerich, Director of R&D at
@Rickasaurus
British
Columbia
Rizzuto Crime Family
Jimmy “Cosmo”
“Superman”
Cournoyer
Quebec
New York/NYC
Bonanno Crime Family
John “Big Man”
Venizelos
Reinvested in Cocaine
California
Flow of Drugs
Hells Angels
El Chapo
Sinaloa Cartel
Citation Network (Safe View)
Relationship Network (Safe View)
Jorge HankRhon
Family & Friends
$100s Millions
Citibank, CH
Brother Murdered
Database
Onboarding
Anatomy of
Anti-Money
Laundering
Branch
Branch
Branch
Branch
Bank
Other
Bank
Other
Bank
Other
Bank
Other
Bank
Negative
News
Watch
Lists
PEPs
Real time
Lookup
(Efficient Search)
Transaction
Monitoring
(Sparse
Information)
Batch Scanning
(O(N*M) Result Space)
Risk
Calculation
Onboarding
▪ When a customer opens an account at a bank an agent does a search
▪ As it is done by a human, errors and missing information are common
▪ Low risk process as bad guys may be caught in the batch scanning later
▪ Blocking data structure is kept loaded into memory and queried against
▪ Results above some probability threshold are returned to the user ordered by
probability and risk
Onboarding
Branch
Branch
Branch
Branch
Bank
Transaction Monitoring
▪ SWIFT messages are passed on the internet of money
▪ Banks must process huge numbers of these
▪ Account information is often not accessible
▪ Messages are low information compared to accounts
▪ Messages must leave within 24h of being received
▪ Similar to Onboarding (but huge numbers and time constraints):
▪ Blocking data structure is kept loaded into memory and queried against
▪ Results below some probability/risk threshold are discarded
▪ Hits are manually reviewed in order by probability, risk and timeliness
Bank
Other
Bank
Other
Bank
Other
Bank
Other
Bank
Batch Scanning
▪ Initially, all customer records vs all bad guy records.
▪ Often hundreds of millions of customer records vs ~3 million bad guy records
▪ Incrementally what we call Diff-Diff
▪ All Customer records vs changed bad-guy records
▪ Changed Customer records vs all bad-guy records
▪ Computation is distributed across many beefy machines
▪ ~1TB of ram, 32 Cores
▪ Results are viewed in order by probability and risk with some thresholding on very
lower probability or very low risk
Bank
Entity Resolution
in Theory
𝑅1
𝑅2
𝑅3
𝑅4
𝑇1
𝑇2
𝑇3
𝑇4
Bank Records Bad Guys
𝑃(𝑅 𝑛 𝑟𝑒𝑝𝑟𝑒𝑛𝑡𝑠 𝑡ℎ𝑒 𝑠𝑎𝑚𝑒 𝑒𝑛𝑡𝑖𝑡𝑦 𝑎𝑠 𝑇 𝑚)
Example Variations:
- Aggregating Products
- Finding Medical Records
- Resolving Paper Authors
- Census
- Finding Bad Guys
- Database Deduping
Different tradeoffs per domain
The Pairwise Entity Resolution Process
Blocking
• Two Datasets (Customer Data and Bad Guys Data)
• Pairs of Somehow Similar Records
Scoring
• Pairs of Records
• Score/Probability of Representing Same Entity
Review
• Records, Probability, Similarity Features
• True/False Labels (Mostly by Hand)
Why Blocking?
▪ 100 Million x 100 Million = 10 quadrillion pairs
▪ 86,400,000 milliseconds per day
▪ One pair per ms: ~116,000,000 days to compute (~317K years)
How do we beat N*M?
Blocking Algorithms. “Blocks”:
Candidate Pairs
or Clusters
𝑅1
𝑅2
𝑅3
𝑅4
𝑇1
𝑇2
𝑇3
𝑇4
Blocking
Algorithm(s) 𝐵𝑖 ∈ (𝑅 × 𝑇)
Input:
- Source Records R
- Target Records T
Output:
- Blocks of Similar Records
𝐵𝑖 ∈ (𝑅 × 𝑇)
Simplest: Key-based Blocking
Table: Peter Christen - Data Matching 2012
• Nothing’s easier than a table lookup!
• Many ways to key, choosing is hard
• Small errors can cause misses
• What about missing data?
Hash Tables
• ~O(1) Time, O(n) Space
Bloom Filter
• O(k) Time, O(bn) Space
Suffix Trees/Arrays
Images care of: http://alexbowe.com/fm-index/
Input length: n, Search length: m
• Construction in O(n) KS[2003]
• Search in O(m) AKO[2004]
• Space is 4n Bytes Naively
• Compressed Space O(n*H(T)) + o(n)
Where T is the input text
Newer: Compressed Compact SA
How to Introduce Fuzziness?
Canopy Clustering
1. Start with a set (S) of all records
- Some cheap distance metric f(r1,r2) : {0,1}
- Some upper bound (u) < 1
- Some lower bound (l) < 1
2. Take one out (c) and put it in a new cluster (C)
3. For each record still in (S) compare it to (c) via function (f)
- if it’s higher than (u), add it to (C) and remove it from (S)
- if higher than (l), add it to (C) and leave it in (S)
4. If (S) is not empty, go to 2
Blocking: Industry Concerns
▪ Can we predict what will and won’t block with absolute certainty?
▪ Can it find matches across different fields, or in blobs of text?
▪ Can we improve the process if we find counterexamples?
▪ Will standard human errors mess up blocking?
▪ Does it scale to large data sizes on reasonable local hardware?
Many ways to Block
▪ Sorted Neighborhood
▪ Metric Space Embedding
▪ Semantic Hashing
▪ Cluster-based approaches like Swoosh
Most are terrible in their own way. We use a mix.
Blocking: Industry Concerns
▪ Can we predict what will and won’t block with absolute certainty?
▪ Can it find matches across different fields, or in blobs of text?
▪ Can we improve the process if we find counterexamples?
▪ Will standard human errors mess up blocking?
▪ Does it scale to large data sizes on reasonable local hardware?
The Pairwise Entity Resolution Process
Blocking
• Two Datasets (Customer Data and Bad Guys Data)
• Pairs of Somehow Similar Records
Scoring
• Pairs of Records
• Score/Probability of Representing Same Entity
Review
• Records, Probability, Similarity Features
• True/False Labels (Mostly by Hand)
The Basics of Pairwise Similarity Scoring
Institutional
Knowledge
Data
Analysis
Model
Evaluation
Model
Generation
Features Labels
Past
Decisions
Current
Observation
• Smart Features and Clean Labels
are Most Important
• Understandability is Key
• Inference Algorithm is Secondary
Feature Function:
Jaro-Winkler
m = # of matching characters
t = # of character transpositions
|s1| = length of the first string
|s2| = length of the second string
l = # of characters that match at the start of the string over the
number considered
p = proportion of the score given to the initial character matching
We use further tweaks on top of this for improved effectiveness.
Simplest: Empirical
Summed Similarity
▪ F: feature functions (0 .. m) : (a,b) -> [0, 1]
▪ W: feature weights (0 .. m) : {0+}
▪ 𝑆𝑖𝑚𝑆𝑢𝑚(𝑎, 𝑏) = 𝑖=0
𝑚
𝑓𝑖 𝑤𝑖
Thresholds such that:
Match: SimSum(a,b) >= Upper
Review: Lower <= SimSum(a,b) <= Upper
Discard: SimSum(a,b) <= Lower
Image Via: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf
Inference for learning feature weights
▪ Logistic Regression
▪ Support Vector Machines
▪ Bayesian Networks/Probabilistic Graphical Models
▪ Neural Networks
▪ Random Forests
Complex models are harder to explain
than complex features
Pairwise Probability Distribution
Tiny Bump
937Upper
Threshold
161
161,358
The Pairwise Entity Resolution Process
Blocking
• Two Datasets (Customer Data and Bad Guys Data)
• Pairs of Somehow Similar Records
Scoring
• Pairs of Records
• Score/Probability of Representing Same Entity
Review
• Records, Probability, Similarity Features
• True/False Labels (Mostly by Hand)
Thresholds in Context
ReviewDiscard Automatic Match
Overall Model:
Risk vs Probability
Money Laundering
Risk
Same Person
Probability
Page Rank: The Easy Parts (With Math.NET)
Page Rank: The Hard Parts
▪ Domains, Websites, Pages in Context
▪ Determining Initial Risk for Sources
▪ 27 Pages of Data Transformation Code
▪ Fluctuation with no changes
▪ Prediction and Explainability
Not Hard: The Algorithm
Normalizing Page Rank for Humans
Power Law Picture: Donato et. al., 2004
Combining Ranking and Probability: Big Picture
Typed Functional Programming is a natural fit
▪ Small components (i.e. functions) can be independently tested and reused
▪ Changes are unlikely to break other parts of the system (~3 bugs in 5+ years)
▪ Code locality eases understanding of complex components
▪ Huge code reduction over standard object-oriented approaches
▪ Math reads like math, with proper operators and order of operation
Cleaning Blocking Featurization Scoring Slicing
Stages as (simplified) functions
Stage Function Type
Cleaning (map) Rec -> Rec
Blocking (half mapped) PRec sequence -> CRec sequence -> (CRec, PRec) sequence
Featurization (map) (CRec, PRec) -> float Vector
Scoring (map) float Vector -> Probability
Slicing (map) (Risk, Probability) -> Class Label
Cleaning Blocking Featurization Scoring Slicing
Disgustingly Bad but Fairly Large Datasets
▪ Both Wide (many fields) and Tall (many records)
▪ From different systems (with different encodings)
▪ Missing data
▪ Poorly merged data
▪ Extra data
▪ Non-unique IDs
Every client is awful in a
completely different way.
NAME LARRY O BRIAN
STATE CANADA
CITY 121 Buffalo Drive, Montreal,
Quebec H3G 1Z2
ADDRESS NULL
ZIP 12345
DOB 10/24/80; 1/1/1979
Functions on
Record Tree
Structure
CustRecord
Names
DOBs?
Countries?
States?
Cities?
…
ListRecord
Names
DOBs?
Countries?
States?
Cities?
…
Blocked Pair
• stripAccents
• stripCharacters
• replaceSubstring
• oneToManyFromFile
• isLocalCountry
Hit.Cust.Names
|> stripAccents
|> oneToMany “nicknames.csv”
Rebuilding
with Fungible
CustRecord
Names
DOBs?
Countries?
States?
Cities?
…
ListRecord
Names
DOBs?
Countries?
States?
Cities?
…
Blocked
Pair
CustRecord
Names
DOBs?
Countries?
States?
Cities?
…
ListRecord
Names
DOBs?
Countries?
States?
Cities?
…
Blocked
Pair
https://github.com
/BayardRock
/Fungible
Fighting Bad Data with Configurable
Functional Subsystems
CustRecord
Names
DOBs?
Countries?
States?
Cities?
…
ListRecord
Names
DOBs?
Countries?
States?
Cities?
…
Blocked Pair
Barb, a simple .net record query language
(We use it for data cleaning and features)
Name.Contains "John“ and (Age > 20 or Weight > 200)
https://github.com/Rickasaurus/Barb
Barb for Cleaning, Queries, and Features
on the Fly
Thank You! Questions?
You can read more on my blog at:
http://richardminerich.com
Contact me on twitter:
@Rickasaurus
Email me with questions:
rick@bayardrock.com
Check out the NYC F# User Group:
http://www.meetup.com/nyc-fsharp
Code on Github:
http://github.com/BayardRock
http://github.com/Rickasaurus
Rotational Token Alignment
▪ Less forgiving than Gale-Shapley
but also less prone to egregious
errors.
▪ Also known as cyclic suborders of
size k of a cyclic order of size n
▪ O(k(n choose k))
Richard Thomas Minerich
Minerich Richard Thomas
Thomas Minerich Richard
Rotational Alignment (cont.)
▪ Pre-calculate matrix of f(x,y) values
▪ Gosper’s Hack for Fast Rotations
Gosper’s Hack via: http://programmers.stackexchange.com/questions/67065/whats-your-favorite-bit-wise-technique
Algorithms for Awful Data: String Matching
▪ Goal: Robust and Forgiving with the Fewest Possible Assumptions
Somewhat Reasonable Data: Rotational Alignment
Extremely Awful Data: Gale-Shapley
Gale-Shapley for Stable Marriages: O(n^2)
Input: beau tokens m in M, belle tokens w in W, comparison function f
UM as the unattached beau, UW as unattached belle, P as the pair set (empty)
1) Select a beau m from UM
2) m selects the w in W s.t. f(m,w) is maximized and not previous selected by m
3) If w is in UW, remove m from UM and w from UW and add (m,w) to P
if a pair (m’, w) exists and f(w,m) > f(w,m’) then
remove (m’,w) from P, add m’ to UM, add (m,w) to P
4) If UM is not empty, go to 1
Expectation Maximization for log odds
via Fellegi-Sunter
Pros:
▪ Robust to missing data
▪ Easy enough to understand and well known
Cons:
▪ Needs careful sampling due to class imbalance.
▪ Starting probabilities need to be chosen carefully (local
optimization).
Expectation
Maximization
Batch Scanning Result Chart
Directions for Future Research
▪ Record pair population estimation
▪ Safe partial inference for tuning
▪ Prediction of future risk
▪ Collective entity resolution
▪ Mixed entity resolution-fraud detection models

More Related Content

Similar to How we use functional programming to find the bad guys @ Build Stuff LT and UA 2015

Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Comparative study of various approaches for transaction Fraud Detection using...
Comparative study of various approaches for transaction Fraud Detection using...Comparative study of various approaches for transaction Fraud Detection using...
Comparative study of various approaches for transaction Fraud Detection using...Pratibha Singh
 
Synthetic Data Generation using exponential random Graph modeling
Synthetic Data Generation using exponential random Graph modelingSynthetic Data Generation using exponential random Graph modeling
Synthetic Data Generation using exponential random Graph modelingGraph-TA
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured predictionzukun
 
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...NETWAYS
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Fighting Malware with Graph Analytics: An End-to-End Case Study
Fighting Malware with Graph Analytics: An End-to-End Case StudyFighting Malware with Graph Analytics: An End-to-End Case Study
Fighting Malware with Graph Analytics: An End-to-End Case StudyPriyanka Aash
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesDatabricks
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
C++ Notes PPT.ppt
C++ Notes PPT.pptC++ Notes PPT.ppt
C++ Notes PPT.pptAlpha474815
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.pptSagarDR5
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 

Similar to How we use functional programming to find the bad guys @ Build Stuff LT and UA 2015 (20)

How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
The R of War
The R of WarThe R of War
The R of War
 
Comparative study of various approaches for transaction Fraud Detection using...
Comparative study of various approaches for transaction Fraud Detection using...Comparative study of various approaches for transaction Fraud Detection using...
Comparative study of various approaches for transaction Fraud Detection using...
 
Synthetic Data Generation using exponential random Graph modeling
Synthetic Data Generation using exponential random Graph modelingSynthetic Data Generation using exponential random Graph modeling
Synthetic Data Generation using exponential random Graph modeling
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
 
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus Nagi...
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Fighting Malware with Graph Analytics: An End-to-End Case Study
Fighting Malware with Graph Analytics: An End-to-End Case StudyFighting Malware with Graph Analytics: An End-to-End Case Study
Fighting Malware with Graph Analytics: An End-to-End Case Study
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
C++ Notes PPT.ppt
C++ Notes PPT.pptC++ Notes PPT.ppt
C++ Notes PPT.ppt
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.ppt
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 

More from Richard Minerich

GHCi: More Awesome Than You Thought
GHCi: More Awesome Than You ThoughtGHCi: More Awesome Than You Thought
GHCi: More Awesome Than You ThoughtRichard Minerich
 
Functional Ideas for a Cloudy Future
Functional Ideas for a Cloudy FutureFunctional Ideas for a Cloudy Future
Functional Ideas for a Cloudy FutureRichard Minerich
 
Getting the MVVM Kicked Out of Your F#'n Monads
Getting the MVVM Kicked Out of Your F#'n MonadsGetting the MVVM Kicked Out of Your F#'n Monads
Getting the MVVM Kicked Out of Your F#'n MonadsRichard Minerich
 
How you can get started with F# today
How you can get started with F# todayHow you can get started with F# today
How you can get started with F# todayRichard Minerich
 
F# in the Classroom and the Lab
F# in the Classroom and the LabF# in the Classroom and the Lab
F# in the Classroom and the LabRichard Minerich
 

More from Richard Minerich (7)

GHCi: More Awesome Than You Thought
GHCi: More Awesome Than You ThoughtGHCi: More Awesome Than You Thought
GHCi: More Awesome Than You Thought
 
Functional Ideas for a Cloudy Future
Functional Ideas for a Cloudy FutureFunctional Ideas for a Cloudy Future
Functional Ideas for a Cloudy Future
 
F# and the DLR
F# and the DLRF# and the DLR
F# and the DLR
 
Fun and Games in F#
Fun and Games in F#Fun and Games in F#
Fun and Games in F#
 
Getting the MVVM Kicked Out of Your F#'n Monads
Getting the MVVM Kicked Out of Your F#'n MonadsGetting the MVVM Kicked Out of Your F#'n Monads
Getting the MVVM Kicked Out of Your F#'n Monads
 
How you can get started with F# today
How you can get started with F# todayHow you can get started with F# today
How you can get started with F# today
 
F# in the Classroom and the Lab
F# in the Classroom and the LabF# in the Classroom and the Lab
F# in the Classroom and the Lab
 

Recently uploaded

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

How we use functional programming to find the bad guys @ Build Stuff LT and UA 2015

  • 1. How We Use FP to Find the Bad Guys Richard Minerich, Director of R&D at @Rickasaurus
  • 2.
  • 3. British Columbia Rizzuto Crime Family Jimmy “Cosmo” “Superman” Cournoyer Quebec New York/NYC Bonanno Crime Family John “Big Man” Venizelos Reinvested in Cocaine California Flow of Drugs Hells Angels El Chapo Sinaloa Cartel
  • 4.
  • 7. Jorge HankRhon Family & Friends $100s Millions Citibank, CH Brother Murdered
  • 8.
  • 10. Onboarding ▪ When a customer opens an account at a bank an agent does a search ▪ As it is done by a human, errors and missing information are common ▪ Low risk process as bad guys may be caught in the batch scanning later ▪ Blocking data structure is kept loaded into memory and queried against ▪ Results above some probability threshold are returned to the user ordered by probability and risk Onboarding Branch Branch Branch Branch Bank
  • 11. Transaction Monitoring ▪ SWIFT messages are passed on the internet of money ▪ Banks must process huge numbers of these ▪ Account information is often not accessible ▪ Messages are low information compared to accounts ▪ Messages must leave within 24h of being received ▪ Similar to Onboarding (but huge numbers and time constraints): ▪ Blocking data structure is kept loaded into memory and queried against ▪ Results below some probability/risk threshold are discarded ▪ Hits are manually reviewed in order by probability, risk and timeliness Bank Other Bank Other Bank Other Bank Other Bank
  • 12. Batch Scanning ▪ Initially, all customer records vs all bad guy records. ▪ Often hundreds of millions of customer records vs ~3 million bad guy records ▪ Incrementally what we call Diff-Diff ▪ All Customer records vs changed bad-guy records ▪ Changed Customer records vs all bad-guy records ▪ Computation is distributed across many beefy machines ▪ ~1TB of ram, 32 Cores ▪ Results are viewed in order by probability and risk with some thresholding on very lower probability or very low risk Bank
  • 13. Entity Resolution in Theory 𝑅1 𝑅2 𝑅3 𝑅4 𝑇1 𝑇2 𝑇3 𝑇4 Bank Records Bad Guys 𝑃(𝑅 𝑛 𝑟𝑒𝑝𝑟𝑒𝑛𝑡𝑠 𝑡ℎ𝑒 𝑠𝑎𝑚𝑒 𝑒𝑛𝑡𝑖𝑡𝑦 𝑎𝑠 𝑇 𝑚) Example Variations: - Aggregating Products - Finding Medical Records - Resolving Paper Authors - Census - Finding Bad Guys - Database Deduping Different tradeoffs per domain
  • 14. The Pairwise Entity Resolution Process Blocking • Two Datasets (Customer Data and Bad Guys Data) • Pairs of Somehow Similar Records Scoring • Pairs of Records • Score/Probability of Representing Same Entity Review • Records, Probability, Similarity Features • True/False Labels (Mostly by Hand)
  • 15. Why Blocking? ▪ 100 Million x 100 Million = 10 quadrillion pairs ▪ 86,400,000 milliseconds per day ▪ One pair per ms: ~116,000,000 days to compute (~317K years)
  • 16. How do we beat N*M? Blocking Algorithms. “Blocks”: Candidate Pairs or Clusters 𝑅1 𝑅2 𝑅3 𝑅4 𝑇1 𝑇2 𝑇3 𝑇4 Blocking Algorithm(s) 𝐵𝑖 ∈ (𝑅 × 𝑇) Input: - Source Records R - Target Records T Output: - Blocks of Similar Records 𝐵𝑖 ∈ (𝑅 × 𝑇)
  • 17. Simplest: Key-based Blocking Table: Peter Christen - Data Matching 2012 • Nothing’s easier than a table lookup! • Many ways to key, choosing is hard • Small errors can cause misses • What about missing data? Hash Tables • ~O(1) Time, O(n) Space Bloom Filter • O(k) Time, O(bn) Space
  • 18. Suffix Trees/Arrays Images care of: http://alexbowe.com/fm-index/ Input length: n, Search length: m • Construction in O(n) KS[2003] • Search in O(m) AKO[2004] • Space is 4n Bytes Naively • Compressed Space O(n*H(T)) + o(n) Where T is the input text Newer: Compressed Compact SA How to Introduce Fuzziness?
  • 19. Canopy Clustering 1. Start with a set (S) of all records - Some cheap distance metric f(r1,r2) : {0,1} - Some upper bound (u) < 1 - Some lower bound (l) < 1 2. Take one out (c) and put it in a new cluster (C) 3. For each record still in (S) compare it to (c) via function (f) - if it’s higher than (u), add it to (C) and remove it from (S) - if higher than (l), add it to (C) and leave it in (S) 4. If (S) is not empty, go to 2
  • 20. Blocking: Industry Concerns ▪ Can we predict what will and won’t block with absolute certainty? ▪ Can it find matches across different fields, or in blobs of text? ▪ Can we improve the process if we find counterexamples? ▪ Will standard human errors mess up blocking? ▪ Does it scale to large data sizes on reasonable local hardware?
  • 21. Many ways to Block ▪ Sorted Neighborhood ▪ Metric Space Embedding ▪ Semantic Hashing ▪ Cluster-based approaches like Swoosh Most are terrible in their own way. We use a mix.
  • 22. Blocking: Industry Concerns ▪ Can we predict what will and won’t block with absolute certainty? ▪ Can it find matches across different fields, or in blobs of text? ▪ Can we improve the process if we find counterexamples? ▪ Will standard human errors mess up blocking? ▪ Does it scale to large data sizes on reasonable local hardware?
  • 23. The Pairwise Entity Resolution Process Blocking • Two Datasets (Customer Data and Bad Guys Data) • Pairs of Somehow Similar Records Scoring • Pairs of Records • Score/Probability of Representing Same Entity Review • Records, Probability, Similarity Features • True/False Labels (Mostly by Hand)
  • 24. The Basics of Pairwise Similarity Scoring Institutional Knowledge Data Analysis Model Evaluation Model Generation Features Labels Past Decisions Current Observation • Smart Features and Clean Labels are Most Important • Understandability is Key • Inference Algorithm is Secondary
  • 25. Feature Function: Jaro-Winkler m = # of matching characters t = # of character transpositions |s1| = length of the first string |s2| = length of the second string l = # of characters that match at the start of the string over the number considered p = proportion of the score given to the initial character matching We use further tweaks on top of this for improved effectiveness.
  • 26. Simplest: Empirical Summed Similarity ▪ F: feature functions (0 .. m) : (a,b) -> [0, 1] ▪ W: feature weights (0 .. m) : {0+} ▪ 𝑆𝑖𝑚𝑆𝑢𝑚(𝑎, 𝑏) = 𝑖=0 𝑚 𝑓𝑖 𝑤𝑖 Thresholds such that: Match: SimSum(a,b) >= Upper Review: Lower <= SimSum(a,b) <= Upper Discard: SimSum(a,b) <= Lower Image Via: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf
  • 27. Inference for learning feature weights ▪ Logistic Regression ▪ Support Vector Machines ▪ Bayesian Networks/Probabilistic Graphical Models ▪ Neural Networks ▪ Random Forests Complex models are harder to explain than complex features
  • 28. Pairwise Probability Distribution Tiny Bump 937Upper Threshold 161 161,358
  • 29. The Pairwise Entity Resolution Process Blocking • Two Datasets (Customer Data and Bad Guys Data) • Pairs of Somehow Similar Records Scoring • Pairs of Records • Score/Probability of Representing Same Entity Review • Records, Probability, Similarity Features • True/False Labels (Mostly by Hand)
  • 31. Overall Model: Risk vs Probability Money Laundering Risk Same Person Probability
  • 32. Page Rank: The Easy Parts (With Math.NET)
  • 33. Page Rank: The Hard Parts ▪ Domains, Websites, Pages in Context ▪ Determining Initial Risk for Sources ▪ 27 Pages of Data Transformation Code ▪ Fluctuation with no changes ▪ Prediction and Explainability Not Hard: The Algorithm
  • 34. Normalizing Page Rank for Humans Power Law Picture: Donato et. al., 2004
  • 35. Combining Ranking and Probability: Big Picture
  • 36. Typed Functional Programming is a natural fit ▪ Small components (i.e. functions) can be independently tested and reused ▪ Changes are unlikely to break other parts of the system (~3 bugs in 5+ years) ▪ Code locality eases understanding of complex components ▪ Huge code reduction over standard object-oriented approaches ▪ Math reads like math, with proper operators and order of operation Cleaning Blocking Featurization Scoring Slicing
  • 37. Stages as (simplified) functions Stage Function Type Cleaning (map) Rec -> Rec Blocking (half mapped) PRec sequence -> CRec sequence -> (CRec, PRec) sequence Featurization (map) (CRec, PRec) -> float Vector Scoring (map) float Vector -> Probability Slicing (map) (Risk, Probability) -> Class Label Cleaning Blocking Featurization Scoring Slicing
  • 38. Disgustingly Bad but Fairly Large Datasets ▪ Both Wide (many fields) and Tall (many records) ▪ From different systems (with different encodings) ▪ Missing data ▪ Poorly merged data ▪ Extra data ▪ Non-unique IDs Every client is awful in a completely different way. NAME LARRY O BRIAN STATE CANADA CITY 121 Buffalo Drive, Montreal, Quebec H3G 1Z2 ADDRESS NULL ZIP 12345 DOB 10/24/80; 1/1/1979
  • 39. Functions on Record Tree Structure CustRecord Names DOBs? Countries? States? Cities? … ListRecord Names DOBs? Countries? States? Cities? … Blocked Pair • stripAccents • stripCharacters • replaceSubstring • oneToManyFromFile • isLocalCountry Hit.Cust.Names |> stripAccents |> oneToMany “nicknames.csv”
  • 41. Fighting Bad Data with Configurable Functional Subsystems CustRecord Names DOBs? Countries? States? Cities? … ListRecord Names DOBs? Countries? States? Cities? … Blocked Pair
  • 42. Barb, a simple .net record query language (We use it for data cleaning and features) Name.Contains "John“ and (Age > 20 or Weight > 200) https://github.com/Rickasaurus/Barb
  • 43. Barb for Cleaning, Queries, and Features on the Fly
  • 44. Thank You! Questions? You can read more on my blog at: http://richardminerich.com Contact me on twitter: @Rickasaurus Email me with questions: rick@bayardrock.com Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp Code on Github: http://github.com/BayardRock http://github.com/Rickasaurus
  • 45. Rotational Token Alignment ▪ Less forgiving than Gale-Shapley but also less prone to egregious errors. ▪ Also known as cyclic suborders of size k of a cyclic order of size n ▪ O(k(n choose k)) Richard Thomas Minerich Minerich Richard Thomas Thomas Minerich Richard
  • 46. Rotational Alignment (cont.) ▪ Pre-calculate matrix of f(x,y) values ▪ Gosper’s Hack for Fast Rotations Gosper’s Hack via: http://programmers.stackexchange.com/questions/67065/whats-your-favorite-bit-wise-technique
  • 47. Algorithms for Awful Data: String Matching ▪ Goal: Robust and Forgiving with the Fewest Possible Assumptions Somewhat Reasonable Data: Rotational Alignment Extremely Awful Data: Gale-Shapley
  • 48. Gale-Shapley for Stable Marriages: O(n^2) Input: beau tokens m in M, belle tokens w in W, comparison function f UM as the unattached beau, UW as unattached belle, P as the pair set (empty) 1) Select a beau m from UM 2) m selects the w in W s.t. f(m,w) is maximized and not previous selected by m 3) If w is in UW, remove m from UM and w from UW and add (m,w) to P if a pair (m’, w) exists and f(w,m) > f(w,m’) then remove (m’,w) from P, add m’ to UM, add (m,w) to P 4) If UM is not empty, go to 1
  • 49. Expectation Maximization for log odds via Fellegi-Sunter Pros: ▪ Robust to missing data ▪ Easy enough to understand and well known Cons: ▪ Needs careful sampling due to class imbalance. ▪ Starting probabilities need to be chosen carefully (local optimization). Expectation Maximization
  • 51. Directions for Future Research ▪ Record pair population estimation ▪ Safe partial inference for tuning ▪ Prediction of future risk ▪ Collective entity resolution ▪ Mixed entity resolution-fraud detection models