Architecting a Predictive, Petabyte-Scale, Self-Learning Fraud Detection System

Fraud detection is a classic adversarial analytics challenge: as soon as an automated system learns to stop one scheme, fraudsters move on to another line of attack. Each scheme requires looking for different signals (i.e., features); each is relatively rare (one in millions in finance or e-commerce); and a single case may take months to investigate (in healthcare or tax, for example), which makes quality training data scarce.
This talk covers, via live demo and code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll look for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodels, NLTK, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing.
We will iteratively build a machine-learned hybrid model that combines features from different data sources and algorithmic approaches to catch diverse aspects of suspect behavior:
- Natural language processing: finding keywords in relevant context within unstructured text
- Statistical NLP: sentiment analysis via supervised machine learning
- Time series analysis: understanding daily/weekly cycles and changes in habitual behavior
- Graph analysis: finding actions outside the usual or expected network of people
- Heuristic rules: finding suspect actions based on past schemes or external datasets
- Topic modeling: highlighting use of keywords outside an expected context
- Anomaly detection: fully unsupervised ranking of unusual behavior
This talk assumes basic understanding of these data science tools, so we can focus on their applicability for this use case and on how they complement each other.
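
To make the idea of a hybrid model concrete, here is a minimal sketch in plain Python of how signals from several of the approaches listed above could be combined into one unsupervised ranking. It is not the talk's implementation: the messages, the `SUSPECT_TERMS` watch list, the three scoring functions, and the equal-weight sum are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

# Made-up messages: (sender, recipient, hour_of_day, text).
MESSAGES = [
    ("alice", "bob", 10, "quarterly report attached"),
    ("alice", "bob", 11, "lunch at noon?"),
    ("alice", "carol", 9, "meeting notes from monday"),
    ("bob", "alice", 14, "thanks, will review the report"),
    ("alice", "bob", 10, "updated the report figures"),
    ("alice", "mallory", 3, "wire the funds offshore before the audit"),
]

SUSPECT_TERMS = {"wire", "offshore", "audit"}  # hypothetical watch list


def keyword_score(text):
    """Heuristic/NLP stand-in: fraction of tokens found on the watch list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in SUSPECT_TERMS for token in tokens) / len(tokens)


def timing_score(hour, sender_hours):
    """Time-series stand-in: |z-score| of the hour against the sender's habits.

    Simplification: ignores that hours wrap around midnight.
    """
    spread = pstdev(sender_hours)
    if spread == 0:
        return 0.0
    return abs(hour - mean(sender_hours)) / spread


def network_score(pair, pair_counts):
    """Graph stand-in: rarely used sender-to-recipient edges score higher."""
    return 1.0 / pair_counts[pair]


def min_max(values):
    """Rescale a feature column to [0, 1] so the signals are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def rank_messages(messages):
    """Return (message_index, combined_score) pairs, most unusual first."""
    pair_counts = Counter((s, r) for s, r, _, _ in messages)
    hours_by_sender = {}
    for sender, _, hour, _ in messages:
        hours_by_sender.setdefault(sender, []).append(hour)

    columns = [
        [keyword_score(text) for _, _, _, text in messages],
        [timing_score(h, hours_by_sender[s]) for s, _, h, _ in messages],
        [network_score((s, r), pair_counts) for s, r, _, _ in messages],
    ]
    normalized = [min_max(column) for column in columns]
    totals = [sum(col[i] for col in normalized) for i in range(len(messages))]
    return sorted(enumerate(totals), key=lambda item: item[1], reverse=True)


ranking = rank_messages(MESSAGES)
print(ranking[0])  # index and score of the most unusual message
```

In this toy data the 3 a.m. message to a new recipient tops the ranking because all three signals agree. No single signal would separate it as cleanly on realistic data, which is the argument for combining them.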

Apache Spark is used to run these models at scale – in batch mode for model training and with Spark Streaming for production use. We’ll discuss the data model, computation, and feedback workflows, as well as some tools and libraries built on top of the open-source components to enable faster experimentation, optimization, and productization of the models.
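
The split between batch training and streaming use with a feedback loop can be sketched without Spark. The example below is plain Python, not Spark or Spark Streaming code, and the `OnlineLogisticModel` class, the feature vectors, and the `analyst_feedback` callback are hypothetical names invented for this illustration. It trains a small logistic model on historical labeled cases, then scores a stream of events and updates the model whenever an analyst returns a verdict on an alert.

```python
import math


class OnlineLogisticModel:
    """Logistic regression trained one example at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        """One gradient step on the log loss for a single labeled example."""
        error = self.predict_proba(x) - label
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def train_batch(model, labeled_examples, epochs=200):
    """Batch phase: repeated passes over historical, investigated cases."""
    for _ in range(epochs):
        for x, label in labeled_examples:
            model.update(x, label)


def score_stream(model, events, threshold=0.5, get_feedback=None):
    """Streaming phase: score each event and learn from analyst verdicts."""
    alerts = []
    for x in events:
        if model.predict_proba(x) >= threshold:
            alerts.append(x)
            if get_feedback is not None:
                model.update(x, get_feedback(x))
    return alerts


# Made-up feature vectors: [keyword_score, timing_score].
history = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1),
    ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.1, 0.1], 0),
]
model = OnlineLogisticModel(n_features=2)
train_batch(model, history)


def analyst_feedback(x):
    """Hypothetical callback: the analyst confirms every alert as fraud."""
    return 1


stream = [[0.1, 0.2], [0.85, 0.9], [0.15, 0.1]]
alerts = score_stream(model, stream, get_feedback=analyst_feedback)
print(alerts)
```

Updating only on alerts that receive a verdict mirrors the scarcity of labels described above: the model sees feedback only for cases someone had time to investigate.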

Published in: Software

  1. www.globalbigdataconference.com, Twitter: @bigdataconf
  2. Architecting a predictive, petabyte-scale, self-learning fraud detection system
  4. WHAT WE’RE UP AGAINST: 50+ schemes (and counting); 99.9999% ‘good’ messages; 6+ months per case; Needle in a haystack; Hybrid analytics; No training data; Semi-supervised learning; Adversarial learning; Online feedback
  5. WHY HYBRID ANALYTICS? Ignore more rules; Unusual timing of events; Unusual personal network; Teamwork & scale; Think & talk differently
  6. (BITS OF) THE TOOLBOX: Rule Inference; Time Series Analysis; Link Analysis; Ensemble Learning; Natural Language
  7. THE CODE, PLEASE: Freely available Jupyter notebooks; Open source libraries & open data; Github.com/atigeo/hunting_criminals_demo
  9. STREAM PROCESSING: Kafka; Email Stream; Account transactions Stream; Email NLP Features; People graph; Transactions time series
  11. SAMPLE NATURAL LANGUAGE ANNOTATORS: Understand vocabulary (jargon, code words, multi-lingual); Understand grammar (Who are we talking about? Past, present or future? Compound sentences); Understand context (Email: Re:, Fwd:, attachments; SMS & IM have their own grammar)
  12. SAMPLE GRAPH FEATURES: Standard algorithms like KMeans don’t work on “haystacks”
  13. SAMPLE GRAPH FEATURES: Bregman Bubble Clustering
  14. USER ANALYSIS ITERATION: Email NLP Features; User graph; Transactions time series; Graph Features; Time Series Features; NLP Features; Agent Feedback; Train/Test; Classifier
  17. SUMMARY: CHALLENGES OF LEARNING CRIMINALS. Needle in a very large haystack: actually needs a petabyte-scale platform. Multi-modal, no single trick works: hybrid analytics. No labeled data: semi-supervised learning, cold start problem. Sparse & high-dimensional: graph-based features & change over time. Adversarial: feedback & online learning.
  18. @davidtalby
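
The "needle in a haystack" challenge from the slides can be made concrete with base-rate arithmetic. The sketch below takes the 99.9999% share of good messages from slide 4 as the prevalence. The detector's recall and false-positive rate are assumed values picked for illustration, not figures from the talk.

```python
def alert_precision(prevalence, recall, false_positive_rate):
    """Share of alerts that are real fraud, by Bayes' rule."""
    true_alerts = prevalence * recall
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


prevalence = 1e-6  # 99.9999% of messages are good (slide 4)

# Assumed detector quality: catches 99% of fraud, flags 0.1% of good messages.
precision = alert_precision(prevalence, recall=0.99, false_positive_rate=0.001)
print(f"{precision:.4%} of alerts are real fraud")
```

Under these assumptions only about one alert in a thousand is real, even though the detector looks accurate on paper. This is why investigator feedback on alerts is such a valuable and scarce source of labels.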