Real-time Fraud Detection at Scale - Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Benyue (Emma) Liu, TigerGraph Inc.
#UnifiedDataAnalytics #SparkAISummit
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph is HOW WE THINK
Common TigerGraph Use Cases
Increase Revenue: analyze all interactions in real-time to sell more
• Recommendation Engine
• Real-time Customer 360/MDM
• Product & Service Marketing
Reduce Costs & Manage Risks: reduce costs and assess and monitor risks effectively
• Fraud Detection
• Anti-Money Laundering (AML)
• Risk Assessment & Monitoring
• Cyber Security
Improve Operational Efficiency: manage resources for maximum output
• Enterprise Knowledge Graph
• Network, IT and Cloud Resource Optimization
• Energy Management System
• Supply Chain Analysis
Foundational Use Cases: Geospatial Analysis, Time Series Analysis, AI and Machine Learning
7 Key Data Science Capabilities Powered By a Native Parallel Graph
1. Deep Link Analysis: from a set of entities (e.g. devices, customers, accounts, doctors), show all links or connections
2. Multi-dimensional Entity & Pattern Matching: given a pattern (e.g. referring business to a relative), find similar patterns in the graph
3. Relational Commonality Discovery and Computation: given 2 entities (e.g. customers, businesses), follow their relationships to find commonality
4. Hub & Community Detection: find the most influential members of a group (customers, doctors, citizens) and detect the community around them
5. Geospatial Graph Analysis: analyze changes in entities & relationships with location data
6. Temporal (Time-Series) Graph Analysis: analyze changes in entities & relationships over time
7. Machine Learning Feature Generation & Explainable AI: extract graph-based features to feed as training data for machine learning; power Explainable AI
Power Explainable AI with TigerGraph
Why Spark + TigerGraph?
Spark + TigerGraph Data Pipeline
Typical Spark + TigerGraph Integration
● Data Preparation and Integration (TigerGraph/Spark)
● Unsupervised Learning (TigerGraph)
● Feature Extraction for Supervised Learning (TigerGraph/Spark)
● Model Training (Spark)
● Validate and Apply Model (TigerGraph)
● Visualize and Explore Interconnected Data (TigerGraph)
Machine Learning with TigerGraph
China Mobile Anti-Fraud/Scam Detection
Real-Time Phone-Based Fraud Detection
Massive, Worldwide Problem
● 18 Billion robocalls in US in 2017 (hiya.com)
● Spam/Scam - agile, spoofed numbers
Customer:
● 600M subscribers
● 300M calls/day, peak 10K calls/sec
● Need: Real-time detection of various
types of phone-based fraud
Real-Time Phone Anti-Spam/Scam Detection
TigerGraph Solution: Real-time graph-based machine learning and
decision system
Graph Analytics
● Real-time machine learning
○ 118 graph features per call
○ Retrained periodically with
2M calls
● Real-time decisions
○ Call recipient sees alert if
ML system says call is
suspicious
● In production since Dec 2016
Graph Database
● 600M phone numbers
(inside and outside network)
● 15B phone-phone call edges
(2 month sliding window)
○ Time
○ Duration
● Real-time graph updates
Peak 10K+ calls/sec
● 118 graph features per phone
Examples of Graph Features for Machine Learning
Good Phone Features
(1) High call back phone
(2) Stable group
(3) Long term phone
(4) Many in-group connections
(5) 3-step friend relation
Bad Phone Features
(1) Short term call duration
(2) Empty stable group
(3) No call back phone
(4) Many rejected calls
(5) Average distance > 3
China Mobile - Detecting Phone-Based Fraud by
Analyzing Network or Graph Pattern Features
• Each phone node has a fraud flag indicating whether it is a good phone or a bad phone and, for a bad phone, the type of fraud: scam, harassment, or advertisement
• Run real-time GSQL query for each call:
○ Collect 118 features
○ Compute composite score
○ Update fraud flag
○ Return fraud type
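The four query steps above can be sketched in Python. This is only an illustration under stated assumptions: the production system runs these steps inside one GSQL query over 118 features, while the feature names, weights, threshold, and the weighted-sum composite score below are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical per-fraud-type weights; the real system uses 118 graph features.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "scam": {"rejected_call_ratio": 2.0, "callback_ratio": -1.5},
    "harassment": {"rejected_call_ratio": 1.0, "short_call_ratio": 1.5},
}
THRESHOLD = 1.0  # assumed decision threshold


@dataclass
class Phone:
    number: str
    fraud_flag: Optional[str] = None  # None means "good phone"


def score_call(
    caller: Phone,
    collect_features: Callable[[str], Dict[str, float]],
) -> Optional[str]:
    # Step 1: collect the caller's graph features.
    features = collect_features(caller.number)
    # Step 2: compute one composite score per fraud type (weighted sum).
    scores = {
        fraud_type: sum(w * features.get(name, 0.0) for name, w in weights.items())
        for fraud_type, weights in WEIGHTS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Step 3: update the fraud flag on the phone.
    caller.fraud_flag = best_type if best_score >= THRESHOLD else None
    # Step 4: return the fraud type (None for a good phone).
    return caller.fraud_flag
```

`collect_features` is passed in as a callable so that the graph lookup, which the slides perform in GSQL, stays outside this sketch.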
[Diagram: a Real-Time Call Event (Caller, Callee, Time) goes to the Query, which returns the Fraud Type; Call Detail Records (Caller, Callee, Time, Duration) drive the Continuous Graph Update]
Phone Fraud Real-Time Detection System
phone vertex
- fraud flag
- expiration time
phone_phone edge
- num of call
- total duration
- call date list
- num of rejection
● 600 Million Vertices
● 15+ Billion Edges
● 300 Million Daily Updates
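As a rough illustration, the vertex and edge attributes listed on this slide could be modeled as Python dataclasses. The field names and types are assumptions derived from the slide labels (the actual GSQL schema is not shown), and `record_call` is a hypothetical helper that mimics one real-time update.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional


@dataclass
class PhoneVertex:
    phone_id: str
    fraud_flag: Optional[str] = None            # e.g. "scam"; None = not flagged
    expiration_time: Optional[datetime] = None  # when the flag stops being valid


@dataclass
class PhonePhoneEdge:
    caller: str
    callee: str
    num_calls: int = 0
    total_duration: int = 0                     # seconds (assumed unit)
    call_dates: List[date] = field(default_factory=list)
    num_rejections: int = 0

    def record_call(self, when: datetime, duration: int, rejected: bool = False) -> None:
        """Fold one call detail record into the aggregated edge attributes."""
        self.num_calls += 1
        self.total_duration += duration
        self.call_dates.append(when.date())
        if rejected:
            self.num_rejections += 1
```

Keeping aggregates on the edge, rather than one edge per call, is what lets a feature query read call counts and durations without scanning raw call records.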
Case 1: Call type was recently flagged
[Flow diagram elements: Real-time Call Event (Call Time, Caller ID, Callee ID); check "If caller was recently flagged as “bad”"; Query: Real-time Collect Caller’s Graph Features; Classifier; check "If Caller is classified as “bad”"; Update]
Case 2: Call needs to be classified
[Same flow diagram elements as Case 1]
Input: list of calls with phone pairs and call time (batch)
Output: 1. Call fraud type; 2. Scoring and feature vector of fraud calls as supporting evidence (Explainable AI)
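One way to read the two cases is as a cache check followed by classification. The sketch below assumes that a "recent" flag is one whose expiration time has not yet passed, and that a fixed 24-hour lifetime is attached when a phone is flagged; both are assumptions, and `handle_call` and `FlagStore` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional, Tuple

FLAG_TTL = timedelta(hours=24)  # assumed lifetime of a "bad" flag

# caller id -> (fraud type, expiration time); stands in for the
# fraud flag and expiration time stored on the phone vertex.
FlagStore = Dict[str, Tuple[str, datetime]]


def handle_call(
    caller_id: str,
    call_time: datetime,
    flags: FlagStore,
    classify: Callable[[str], Optional[str]],
) -> Optional[str]:
    cached = flags.get(caller_id)
    if cached is not None and cached[1] > call_time:
        # Case 1: caller was recently flagged, so reuse the stored fraud type.
        return cached[0]
    # Case 2: collect features and classify (delegated to `classify`).
    fraud_type = classify(caller_id)
    if fraud_type is not None:
        flags[caller_id] = (fraud_type, call_time + FLAG_TTL)
    else:
        flags.pop(caller_id, None)  # clear any expired flag
    return fraud_type
```

The short-circuit in Case 1 avoids repeating feature collection for callers that were classified a short time ago.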
China Mobile Machine Learning Workflow
1. Data labels come from police reports and online third-party sources
2. A total of 118 graph features are analyzed to build the fraud detection model
3. All 118 graph features are collected by one GSQL query
4. Features for the training data are collected in GSQL by batch processing and stored as CSV files for future model training
5. TigerGraph performs fraud scoring with multiple machine learning models in real time
6. Machine learning models are trained offline, and the model parameters are stored as configuration files for GSQL to use for real-time scoring
(Future: Training ML models in Spark)
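Step 6 of the workflow above (offline training, parameters shipped as configuration for real-time scoring) might look like the following sketch. The slides do not say which model families or file format are used, so the JSON layout, the logistic regression model, and the feature names here are all assumptions for illustration.

```python
import json
import math
from typing import Any, Dict

# Hypothetical configuration file content written by an offline training job.
EXAMPLE_CONFIG = """
{
  "model": "logistic_regression",
  "intercept": -2.0,
  "coefficients": {"rejected_call_ratio": 4.0, "callback_ratio": -3.0}
}
"""


def load_model(config_text: str) -> Dict[str, Any]:
    """Parse model parameters from configuration text."""
    params = json.loads(config_text)
    if params.get("model") != "logistic_regression":
        raise ValueError("this sketch only handles logistic regression")
    return params


def fraud_probability(params: Dict[str, Any], features: Dict[str, float]) -> float:
    """Score one feature vector; missing features count as 0.0."""
    z = params["intercept"] + sum(
        coef * features.get(name, 0.0)
        for name, coef in params["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Because scoring only needs the stored coefficients, retraining replaces the configuration file without changing the scoring code.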
Machine Learning with TigerGraph
Real-time Scoring with Multiple ML models in GSQL
Fast: real-time response for both feature collection and scoring
Efficient: aggregation during traversal - multiple features in one
Easy: collect complex features without multiple RDBMS joins
China Mobile Anti-Fraud Results
from TigerGraph Machine Learning Solutions
• 3.2 million fraud notifications in Shandong Province (Dec 2016 – July 2019)
• Potential losses prevented: ~39.86 million RMB (~6 million US dollars)
Why Spark + TigerGraph?
Why TigerGraph + Spark For Machine Learning?
● Parallel processing, distributed systems in training, ETL & feature collections
● Capture business moments with real-time response with explainable AI
● Enrich machine learning with complex graph features
Each of these: AT SCALE!
Spark and TigerGraph Data Pipeline
[Pipeline diagram labels: Static Data Sources, Streaming Data Sources, TigerGraph, JDBC Driver]
JDBC Driver (v1.2)
● Type 4 driver
● Supports bi-directional data flow: reads from and writes to TigerGraph
● Read: converts ResultSet to DataFrame
● Write: loads DataFrames and files to vertices/edges in TigerGraph
● Supports REST endpoints of built-in, compiled, and interpreted GSQL queries from TigerGraph
● Open Source: https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
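A read through the driver from PySpark might be wired up roughly as below. The `spark.read.format("jdbc").options(...).load()` chain is standard PySpark; the driver class name, the `jdbc:tg:` URL scheme, and the `username`, `graph`, and `dbtable` option names are assumptions recalled from the driver's README and should be checked against the repository before use. The query name in the usage note is hypothetical.

```python
from typing import Dict


def tigergraph_jdbc_options(
    host: str,
    port: int,
    graph: str,
    username: str,
    password: str,
    dbtable: str,
) -> Dict[str, str]:
    """Build the option map for Spark's JDBC reader (option names are assumed)."""
    return {
        "driver": "com.tigergraph.jdbc.Driver",  # assumed driver class name
        "url": f"jdbc:tg:http://{host}:{port}",  # assumed URL scheme
        "username": username,
        "password": password,
        "graph": graph,
        "dbtable": dbtable,                      # e.g. "vertex <type>" or "query <name>(...)"
    }


def read_features(spark, options: Dict[str, str]):
    """Load a DataFrame; `spark` is an existing pyspark.sql.SparkSession."""
    return spark.read.format("jdbc").options(**options).load()
```

For example, `read_features(spark, tigergraph_jdbc_options("127.0.0.1", 14240, "phone_graph", "tigergraph", "secret", "query phone_features(phone=123)"))` would request the result of an installed query as a DataFrame, assuming the driver JAR is on the Spark classpath.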
DEMO
Graph Feature Extraction from TigerGraph to Spark Via TigerGraph’s JDBC Driver
Graph Features: Stable Group & InGroup Connection
• Stable Group: phones in the target group that have regular calls
(stable connection) with source phone
• Stable InGroup Connections: phones in the target group that have
regular calls (stable connection) among themselves
Stable Connection defined as
● Has both call and callback
● Num of calls is larger than a given limit
● Total duration is larger than a given limit
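The Stable Connection definition and the Stable Group feature can be sketched in plain Python. The thresholds are placeholders for the "given limit" values, and requiring both directions to pass the thresholds is one interpretation of the three bullets; the talk computes this feature in GSQL, not Python.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

MIN_CALLS = 3        # assumed value for the call-count limit
MIN_DURATION = 120   # assumed value for the total-duration limit, in seconds


@dataclass
class CallStats:
    num_calls: int
    total_duration: int


# Directed call edges: (caller, callee) -> aggregated statistics.
Edges = Dict[Tuple[str, str], CallStats]


def is_stable_edge(stats: CallStats) -> bool:
    """One direction is stable if both limits are exceeded."""
    return stats.num_calls > MIN_CALLS and stats.total_duration > MIN_DURATION


def is_stable_connection(edges: Edges, a: str, b: str) -> bool:
    """Stable connection: a call and a callback exist, and both are stable."""
    out_edge = edges.get((a, b))
    back_edge = edges.get((b, a))
    return (
        out_edge is not None
        and back_edge is not None
        and is_stable_edge(out_edge)
        and is_stable_edge(back_edge)
    )


def stable_group(edges: Edges, source: str) -> Set[str]:
    """Phones called by `source` that have a stable connection with it."""
    neighbors = {callee for (caller, callee) in edges if caller == source}
    return {t for t in neighbors if is_stable_connection(edges, source, t)}
```

An empty result from `stable_group` corresponds to the "Empty stable group" feature listed among the bad phone features.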
Resources
• TigerGraph Cloud Machine Learning Starter Kit
a. Register at tgcloud.us
• JDBC Driver (Open Source)
a. https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
• Contact me at emma.liu@tigergraph.com
More … TigerGraph & Neural Network
Training data: https://www.coursera.org/learn/machine-learning
Watch Graph Guru Episode 19
https://info.tigergraph.com/graph-gurus-19
Contact Me:
emma.liu@tigergraph.com
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Real-time deep-link graph analytics at scale is the differentiator for your machine learning pipeline!
Backup Slides
Stable Group Pseudocode
Step 1: start from a given phone vertex,
find its 1-step neighbors
Step 2: check if a target has both stable
outgoing (phone_phone) and stable
incoming edges (phone_phone_reversed)
[Diagram: a source phone vertex linked to target1 through target4 by phone_phone and phone_phone_reversed edges; each edge carries num of call, total duration, call date list, and num of rejection]
Stable InGroup Connections Pseudocode
Step 1: starting from a given phone vertex,
find its 1-step neighbors (target group)
Step 2: for each vertex in the target group,
find its 1-step neighbors and check for
stable connections
Step 3: check the stable target for each
vertex in the target group
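The three steps above could be written as follows. This is a sketch under assumptions: edges are a plain dictionary rather than a graph database, "1-step neighbors" means phones reached by outgoing calls, the thresholds are placeholder limits, and the result is returned as a set of unordered phone pairs.

```python
from typing import Dict, FrozenSet, Set, Tuple

# (caller, callee) -> (number of calls, total duration in seconds)
Edges = Dict[Tuple[str, str], Tuple[int, int]]

MIN_CALLS = 3        # assumed call-count limit
MIN_DURATION = 120   # assumed total-duration limit, in seconds


def _stable(edges: Edges, a: str, b: str) -> bool:
    """Both the call and the callback must exist and exceed the limits."""
    for key in ((a, b), (b, a)):
        stats = edges.get(key)
        if stats is None or stats[0] <= MIN_CALLS or stats[1] <= MIN_DURATION:
            return False
    return True


def in_group_connections(edges: Edges, source: str) -> Set[FrozenSet[str]]:
    # Step 1: the target group is the source's 1-step neighbors.
    group = {callee for (caller, callee) in edges if caller == source}
    pairs: Set[FrozenSet[str]] = set()
    for member in group:
        # Step 2: look at each group member's own 1-step neighbors.
        neighbors = {callee for (caller, callee) in edges if caller == member}
        for other in neighbors:
            # Step 3: keep stable connections whose other end is in the group.
            if other in group and other != member and _stable(edges, member, other):
                pairs.add(frozenset((member, other)))
    return pairs
```

The number of pairs returned is one possible value for the "Many in-group connections" feature.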
[Diagram: a source phone vertex linked to target1 through target4 by phone_phone and phone_phone_reversed edges; each edge carries num of call, total duration, call date list, and num of rejection]

  • 3. “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
  • 4. Graph is HOW WE THINK
  • 5. Common TigerGraph Use Cases. Increase Revenue (analyze all interactions in real time to sell more): Recommendation Engine; Real-time Customer 360/MDM; Product & Service Marketing. Reduce Costs & Manage Risks (reduce costs and assess and monitor risks effectively): Fraud Detection; Anti-Money Laundering (AML); Risk Assessment & Monitoring; Cyber Security. Improve Operational Efficiency (manage resources for maximum output): Enterprise Knowledge Graph; Network, IT and Cloud Resource Optimization; Energy Management System; Supply Chain Analysis. Foundational Use Cases: Geospatial Analysis, Time Series Analysis, AI and Machine Learning.
  • 6. 7 Key Data Science Capabilities Powered By a Native Parallel Graph. Deep Link Analysis: from a set of entities (e.g. devices, customers, accounts, doctors), show all links or connections. Relational Commonality Discovery and Computation: given 2 entities (e.g. customers, businesses), follow their relationships to find commonality. Multi-dimensional Entity & Pattern Matching: given a pattern (e.g. referring business to a relative), find similar patterns in the graph. Hub & Community Detection: find the most influential members of a group (customers, doctors, citizens) and detect the community around them. Geospatial Graph Analysis: analyze changes in entities & relationships with location data. Temporal (Time-Series) Graph Analysis: analyze changes in entities & relationships over time. Machine Learning Feature Generation & Explainable AI: extract graph-based features to feed as training data for machine learning; power Explainable AI.
  • 7. Power Explainable AI with TigerGraph
  • 8. Why Spark + TigerGraph?
  • 9. Spark + TigerGraph Data Pipeline
  • 10. Typical Spark + TigerGraph Integration ● Data Preparation and Integration (TigerGraph/Spark) ● Unsupervised Learning (TigerGraph) ● Feature Extraction for Supervised Learning (TigerGraph/Spark) ● Model Training (Spark) ● Validate and Apply Model (TigerGraph) ● Visualize and Explore Interconnected Data (TigerGraph)
  • 11. Machine Learning with TigerGraph: China Mobile Anti-Fraud/Scam Detection
  • 12. Real-Time Phone-Based Fraud Detection. Massive, worldwide problem: ● 18 billion robocalls in the US in 2017 (hiya.com) ● Spam/scam: agile, spoofed numbers. Customer: ● 600M subscribers ● 300M calls/day, peak 10K calls/sec ● Need: real-time detection of various types of phone-based fraud
  • 13. Real-Time Phone Anti-Spam/Scam Detection. TigerGraph Solution: real-time graph-based machine learning and decision system. Graph Analytics: ● Real-time machine learning ○ 118 graph features per call ○ Retrained periodically with 2M calls ● Real-time decisions ○ Call recipient sees alert if ML system says call is suspicious ● In production since Dec 2016. Graph Database: ● 600M phone numbers (inside and outside network) ● 15B phone-phone call edges (2-month sliding window) ○ Time ○ Duration ● Real-time graph updates, peak 10K+ calls/sec ● 118 graph features per phone
  • 14. Examples of Graph Features for Machine Learning. Good Phone Features: (1) High call back phone (2) Stable group (3) Long term phone (4) Many in-group connections (5) 3-step friend relation. Bad Phone Features: (1) Short term call duration (2) Empty stable group (3) No call back phone (4) Many rejected calls (5) Average distance > 3.
  • 15. China Mobile: Detecting Phone-Based Fraud by Analyzing Network or Graph Pattern Features • Each phone node has a fraud flag indicating whether it is a good phone or a bad phone and what type of fraud: scam, harassment, or advertisement • Run a real-time GSQL query for each call: ○ Collect 118 features ○ Compute composite score ○ Update fraud flag ○ Return fraud type. Diagram labels: Real-Time Call Event (Caller, Callee, Time); Query; Fraud Type; Call Detail Records (Caller, Callee, Time, Duration); Continuous Graph Update.
  • 16. Phone Fraud Real-Time Detection System. Schema diagram: phone vertex (fraud flag, expiration time); phone_phone edge (num of call, total duration, call date list, num of rejection). ● 600 Million Vertices ● 15+ Billion Edges ● 300 Million Daily Updates
  • 17. Case 1: Call type was recently flagged. Diagram elements: Real-time Call Event (Call Time, Caller ID, Callee ID); If caller was recently flagged as “bad”; Real-time Query: Collect Caller’s Graph Features; Classifier; If Caller is classified as “bad”; Update.
  • 18. Case 2: Call needs to be classified. Diagram elements: Real-time Call Event (Call Time, Caller ID, Callee ID); If caller was recently flagged as “bad”; Real-time Query: Collect Caller’s Graph Features; Classifier; If Caller is classified as “bad”; Update. Input: list of calls with phone pairs and call time (batch). Output: 1. Call fraud type; 2. Scoring and feature vector of fraud calls for supporting evidence (Explainable AI).
  • 19. China Mobile Machine Learning Workflow 1. Data labels from police reports and online third party sources 2. A total of 118 graph features analyzed to build fraud detection model 3. All 118 graph features collected by one GSQL query 4. Training data’s features collected in GSQL in batch processing and stored as CSV file for future model training 5. TigerGraph performs fraud scoring with multiple Machine Learning models in real-time 6. Machine Learning models are trained offline and model parameters stored as configuration files for GSQL to use for real-time scoring (Future: Training ML models in Spark)
  • 20. Machine Learning with TigerGraph: Real-time Scoring with Multiple ML Models in GSQL. Fast: real-time response for both feature collection and scoring. Efficient: aggregation during traversal, multiple features in one. Easy: collect complex features without multiple RDBMS joins.
  • 21. China Mobile Anti-Fraud Results from TigerGraph Machine Learning Solutions • 3.2 million fraud notifications in Shandong Province (Dec 2016 – July 2019) • Potential loss saved: ~39.86 million RMB (~6 million US dollars)
  • 22. Why Spark + TigerGraph?
  • 23. Why TigerGraph + Spark for Machine Learning? ● Parallel processing and distributed systems in training, ETL & feature collection, AT SCALE! ● Capture business moments with real-time response and explainable AI, AT SCALE! ● Enrich machine learning with complex graph features, AT SCALE!
  • 24. Spark and TigerGraph Data Pipeline. Diagram labels: Static Data Sources; Streaming Data Sources; TigerGraph JDBC Driver.
  • 25. JDBC Driver (v1.2) ● Type 4 driver ● Supports bi-directional data flow (read and write) with TigerGraph ● Read: converts ResultSet to DataFrame ● Write: loads DataFrames and files into vertices/edges in TigerGraph ● Supports REST endpoints of built-in, compiled, and interpreted GSQL queries from TigerGraph ● Open source: https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
  • 26. DEMO: Graph Feature Extraction from TigerGraph to Spark via TigerGraph’s JDBC Driver
  • 27. Examples of Graph Features for Machine Learning. Good Phone Features: (1) High call back phone (2) Stable group (3) Long term phone (4) Many in-group connections (5) 3-step friend relation. Bad Phone Features: (1) Short term call duration (2) Empty stable group (3) No call back phone (4) Many rejected calls (5) Average distance > 3.
  • 28. Graph Features: Stable Group & InGroup Connection • Stable Group: phones in the target group that have regular calls (a stable connection) with the source phone • Stable InGroup Connections: phones in the target group that have regular calls (a stable connection) among themselves. A Stable Connection is defined as: ● Has both call and callback ● Number of calls is larger than a given limit ● Total duration is larger than a given limit
  • 29. Resources • TigerGraph Cloud Machine Learning Starter Kit: register at tgcloud.us • JDBC Driver (open source): https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver • Contact me at emma.liu@tigergraph.com
  • 30. More: TigerGraph & Neural Network. Training data: https://www.coursera.org/learn/machine-learning Watch Graph Guru Episode 19: https://info.tigergraph.com/graph-gurus-19 Contact me: emma.liu@tigergraph.com
  • 31. “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.” Real-time deep-link graph analytics at scale is the differentiator for your machine learning pipeline!
  • 34. Stable Group Pseudocode. Step 1: start from a given phone vertex and find its 1-step neighbors. Step 2: check whether a target has both a stable outgoing edge (phone_phone) and a stable incoming edge (phone_phone_reversed). Edge attributes shown in the diagram: num of call, total duration, call date list, num of rejection. A Stable Connection is defined as: ● Has both call and callback ● Number of calls is larger than a given limit ● Total duration is larger than a given limit
  • 35. Stable InGroup Connections Pseudocode. Step 1: starting from a given phone vertex, find its 1-step neighbors (the target group). Step 2: for each vertex in the target group, find its 1-step neighbors and check for stable connections. Step 3: check the stable target for each vertex in the target group. Edge attributes shown in the diagram: num of call, total duration, call date list, num of rejection. A Stable Connection is defined as: ● Has both call and callback ● Number of calls is larger than a given limit ● Total duration is larger than a given limit
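
The two pseudocode slides above can be sketched in plain Python over a small in-memory call graph. This is only an illustration of the steps, not the GSQL query used in the talk. The names `CallEdge`, `MIN_CALLS`, `MIN_DURATION`, and the three functions are hypothetical, and the threshold values are invented because the slides only say "a given limit". The sketch also assumes that the target group is the set of phones the source has called, and that a connection is stable only when the edge in each direction exceeds both limits.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class CallEdge:
    """Aggregated caller -> callee statistics (two of the edge attributes on the slides)."""
    num_calls: int
    total_duration: int  # unit (seconds) is an assumption


# Hypothetical thresholds; the slides only say "a given limit".
MIN_CALLS = 3
MIN_DURATION = 60

# graph[caller][callee] holds the aggregated phone_phone edge.
Graph = Dict[str, Dict[str, CallEdge]]


def has_stable_connection(graph: Graph, a: str, b: str,
                          min_calls: int = MIN_CALLS,
                          min_duration: int = MIN_DURATION) -> bool:
    """True if a called b and b called back, and each direction exceeds both limits."""
    out_edge = graph.get(a, {}).get(b)
    back_edge = graph.get(b, {}).get(a)
    if out_edge is None or back_edge is None:
        return False  # no call or no callback
    return all(e.num_calls > min_calls and e.total_duration > min_duration
               for e in (out_edge, back_edge))


def stable_group(graph: Graph, source: str) -> Set[str]:
    """Step 1: take the source's 1-step neighbors. Step 2: keep the stable ones."""
    return {t for t in graph.get(source, {})
            if has_stable_connection(graph, source, t)}


def stable_in_group_connections(graph: Graph, source: str) -> Set[Tuple[str, str]]:
    """Pairs of phones in the source's target group that are stably connected to each other."""
    target_group = set(graph.get(source, {}))
    pairs = set()
    for t in target_group:
        for n in graph.get(t, {}):
            # Only count neighbors that are themselves in the target group.
            if n != t and n in target_group and has_stable_connection(graph, t, n):
                pairs.add(tuple(sorted((t, n))))
    return pairs


# Toy data: "src" has called a, b, and c.
demo_graph: Graph = {
    "src": {"a": CallEdge(5, 300), "b": CallEdge(4, 200), "c": CallEdge(1, 10)},
    "a": {"src": CallEdge(4, 250), "b": CallEdge(6, 400)},
    "b": {"src": CallEdge(1, 20), "a": CallEdge(5, 350)},
    "c": {},
}

print(sorted(stable_group(demo_graph, "src")))                 # ['a']
print(sorted(stable_in_group_connections(demo_graph, "src")))  # [('a', 'b')]
```

In the toy graph, only `a` has a stable two-way connection with `src`, while `a` and `b` call each other often enough to count as one stable in-group connection. The size of each result set is the kind of value that could serve as a per-phone feature.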