Continuous ML Integration & Delivery for Advanced Email Attack Detection
Jeshua Bratman & Justin Young
www.abnormalsecurity.com
The Detection Problem
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice is ready! Please pay the attached invoice amount of
$883,000 for electricity services for Northwest Mercy Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
Invoice Payment Fraud!
The Detection Problem
Advanced Social Engineering
Figure: spectrum of email categories
Axis: More Damaging & Sophisticated & Rare
Categories shown: Phishing, Spear Phishing, Malware; Spam; Graymail; Business Email Compromise; Extortion; Compromised Employee; Invoice Fraud; Heists; Scam; Compromised Vendor; Legitimate Email
Frequency labels shown: ~25% of emails; ~25% of emails; ~50% of emails; <0.1% of emails; <0.01% of emails; < 1 in 100k emails; < 1 in a million emails; < 1 in 10 million emails
The Detection Problem
This is a hard machine learning problem
1. Rarity of attacks
2. Adversarial attack landscape
3. High-dimensional data at high volume
4. Need extremely high precision and recall simultaneously
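A small worked calculation can show why rarity forces precision and recall to be extremely high together. This is only a sketch, not the authors' analysis: the function name and the 99% recall and 0.1% false positive rate are hypothetical numbers chosen for illustration, and the base rates loosely echo the frequency labels on the earlier slide.

```python
def precision_at_base_rate(
    base_rate: float, recall: float, false_positive_rate: float
) -> float:
    """Precision implied by Bayes' rule for a detector applied to a
    population in which a fraction `base_rate` of emails are attacks."""
    true_positives = base_rate * recall
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical detector: 99% recall, 0.1% false positive rate.
for base_rate in (1e-3, 1e-5, 1e-7):
    p = precision_at_base_rate(base_rate, recall=0.99, false_positive_rate=1e-3)
    print(f"base rate {base_rate:.0e}: precision {p:.4%}")
```

Under these assumed numbers, at a base rate of 1 in 100k the precision is about 1%, so nearly every flagged email would be a false alarm even though the detector looks strong on paper.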
Move Fast!
Lightning speed iteration to get ahead of new attacks
Don’t Break Things!
We don’t want to stop catching old attacks
Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine
Part 2: CI/CD for a Machine Learning Detection Engine
How do we develop quickly without breaking things?
Traditional CI/CD
Engineer modifies code. Do the tests pass? If so, land & deploy.
What happens if we *do not* have CI/CD?
No idea if a code change breaks the system
Engineers fixing each other's bugs all the time
Pushing bad code to production
In modern software development it would be insane not to have CI/CD
Machine Learning CI/CD
ML Engineer modifies: Code, Models, Datasets
Tests: Do the tests pass?
Rescoring Analytics: Is performance good?
Model Training: Can new models train?
Deployment
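The three questions in the ML CI/CD flow can be read as a deployment gate. The sketch below is one hypothetical way to express that gate: `PipelineReport`, `gate_failures`, the metric fields, and the tolerance are all illustrative names and values, not part of the system described in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineReport:
    tests_passed: bool          # Do the tests pass?
    baseline_recall: float      # rescoring result for the current production system
    candidate_recall: float     # rescoring result for the proposed change
    baseline_precision: float
    candidate_precision: float
    models_trained: bool        # Can new models train?


def gate_failures(report: PipelineReport, tolerance: float = 0.001) -> List[str]:
    """Return the reasons a change should be blocked.

    An empty list means every check passed and the change may deploy.
    """
    failures = []
    if not report.tests_passed:
        failures.append("unit tests failed")
    if report.candidate_recall < report.baseline_recall - tolerance:
        failures.append("recall regressed on rescored samples")
    if report.candidate_precision < report.baseline_precision - tolerance:
        failures.append("precision regressed on rescored samples")
    if not report.models_trained:
        failures.append("model training failed")
    return failures
```

Returning the list of reasons rather than a single boolean is a design choice for this sketch: it tells the engineer which stage blocked the change.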
What happens if we *do not* have CI/CD?
Cannot safely change the system to fix a false negative (FN) or false positive (FP)
May degrade the system unintentionally when shipping improvements
Cannot know the overall impact of a new model on the entire system
Most ML products run blind like this! It greatly hampers development speed and product stability.
Adversarial!
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice ready! Please pay the attached invoice
amount of $883,000 for electricity services for Northwest Mercy
Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
First email: Invoice Payment Fraud!
Second email (New Attack Strategy): Billing Account Update Fraud!
OK, how would we use this?
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
● New or improved NLP models to identify language around changing bank accounts
● New code to parse PDFs and extract bank account numbers from them
● New counting features for how often a sender uses a particular domain, new code with feature extractor, and a model that uses those features
Machine Learning CI/CD Details
ML Engineer modifies: Code, Models, Datasets
Figure components: Labeled Samples, ML Detection Engine, Rescoring Analytics, Model Training
Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, features easily
Part 3: Designing the System
How do we build a CI/CD platform for our ML system that enables developers and also scales well?
So how do we do this?
This is a big data problem! Data, models, and code are all part of the software system we’re testing
So, we’ll use Spark to simulate our online system. But things get complicated fast...
A Familiar ML Story
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
New counting features for how often a sender uses a particular domain, new code with feature extractor, and a model that uses those features
A data scientist has a great new feature… but how do we safely get it into production?
Domain Count Dataset:
...
(“Josephine Wright”, “edisonpower.com”): 1000,
(“Josephine Wright”, “edisonpovver.com”): 0,
...
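To make the counting feature concrete, here is a minimal sketch of a lookup against a Domain Count Dataset shaped like the one on the slide. The function names and the in-memory dictionary are assumptions for illustration; the talk does not show the actual feature extractor.

```python
from typing import Dict, Tuple

# (sender display name, sender domain) -> number of past emails seen
DomainCounts = Dict[Tuple[str, str], int]


def extract_domain(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def domain_count_feature(
    display_name: str, address: str, counts: DomainCounts
) -> int:
    """How often this display name has been seen sending from this domain.

    Pairs that were never observed count as 0, which is the signal for
    a lookalike domain.
    """
    return counts.get((display_name, extract_domain(address)), 0)


counts = {("Josephine Wright", "edisonpower.com"): 1000}
known = domain_count_feature("Josephine Wright", "invoicing@edisonpower.com", counts)
lookalike = domain_count_feature("Josephine Wright", "invoicing@edisonpovver.com", counts)
print(known, lookalike)
```

A downstream model would consume the count as one feature among many; the low count for the lookalike domain is what separates the second email from the first.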
For just the new domain count feature:
1. Domain Count Dataset
2. Feature extraction code
3. New sub-model?
What does it look like to test this new feature?
In a typical software test, we can mock out complex dependencies
But for ML, we can’t mock the data!
Does every data scientist have to become a data engineer?
Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
SparkFiles: Download dataset to disk on each executor
Broadcast Variable: Broadcast dataset in memory in each PySpark process
# Broadcast Variable: ship a small in-memory dataset to every executor
small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
ip_broadcast = sc.broadcast(small_ip_dataset)
# hydrate_with_ip_count reads the small_ip_dataset dictionary via ip_broadcast.value
hydrated_rdd = rdd.map(
    lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
)

# SparkFiles: add a Spark file so that every executor will download it
import os
from pyspark import SparkFiles
sc.addFile(remote_dataset_path)
# Now the file can be loaded in any Spark operation from local_dataset_path
local_dataset_path = SparkFiles.get(
    os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
)
Spark Join: Join large distributed datasets via Spark operations
Wait, what about time travel?
Figure (time axis): counting feature hydrated up to time t-x: 48; counting feature hydrated up to time t: 50
Feature Hydration With Time Travel
Figure: Domain Count Dataset. Daily Counts are summed over time into Cumulative Counts.
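The "sum over time" step can be sketched in plain Python as a running total per key. This is an in-memory illustration only: the talk performs this at scale in Spark, and `to_cumulative` and the sample numbers are hypothetical.

```python
from itertools import accumulate
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (sender display name, sender domain)


def to_cumulative(daily_counts: Dict[Key, List[int]]) -> Dict[Key, List[int]]:
    """Turn per-day counts into running totals.

    cumulative[key][d] is the total observed on days 0..d inclusive.
    """
    return {key: list(accumulate(per_day)) for key, per_day in daily_counts.items()}


daily = {("Josephine Wright", "edisonpower.com"): [3, 0, 2, 5]}
cumulative = to_cumulative(daily)
print(cumulative)
```

Keeping one cumulative value per day is what lets a later lookup answer "what was the count as of day d?" instead of only "what is the count now?".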
Events: time-bucket and key
Hydrated Events: join by key + time
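The join by key and time can be illustrated without Spark. The sketch below hydrates each event with the count observed strictly before the event's day, which is one way to avoid future leakage. It is an assumption-laden, in-memory stand-in: the `Event` type, the function name, and the "strictly before" cutoff are illustrative choices, not the authors' implementation.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

Key = Tuple[str, str]  # (sender display name, sender domain)


class Event(NamedTuple):
    event_id: str
    key: Key
    day: int  # day index on which the event occurred


def hydrate_with_time_travel(
    events: List[Event], daily_counts: Dict[Tuple[Key, int], int]
) -> Dict[str, int]:
    """Map each event ID to the cumulative count seen before the event's day."""
    # Group the daily counts by key.
    rows_by_key = defaultdict(list)
    for (key, day), count in daily_counts.items():
        rows_by_key[key].append((day, count))

    # For every key, build sorted days plus prefix sums of their counts.
    index = {}
    for key, rows in rows_by_key.items():
        rows.sort()
        days = [day for day, _ in rows]
        prefix = [0]
        for _, count in rows:
            prefix.append(prefix[-1] + count)
        index[key] = (days, prefix)

    hydrated = {}
    for event in events:
        days, prefix = index.get(event.key, ([], [0]))
        # bisect_left returns how many recorded days are strictly earlier
        # than the event day, so counts from the event day onward are excluded.
        hydrated[event.event_id] = prefix[bisect_left(days, event.day)]
    return hydrated
```

The same event replayed at an earlier day gets a smaller count, which is the property a rescoring run needs in order to reflect what production would have seen at that time.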
Deep Dive: Re-hydrating Behavior Graph
# Index every event by key and day, and take event ID to avoid passing around large objects
keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
# Index every count by key and day
keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
# Join date-indexed event IDs with date-indexed counts, by common key
joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
# In memory, sum up cumulative counts and key by event ID
cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
    _extract_cumulative_counts
)
# Join actual events back in by event ID
joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
    event_rdd.keyBy(_get_id_from_event)
)
# Hydrate every event with cumulative counts
hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
    _hydrate_event_with_counts
)
Back To Our ML Story
So we can do all of this in Spark. But no data scientist should ever have to think about this!
Data engineers should go to great efforts to provide a simple platform that hides these details
Data scientists should spend as much time as possible doing data science
Re-scoring Is Part of the MLOps Platform
Data engineers have to make re-scoring as easy to use as traditional CI/CD
This means providing a playbook that’s as easy as adding unit tests
class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
    # Class for building set of stats to lookup
    _lookup_stats_builder: LookupStatsBuilder
    # How to hydrate the Event with the Stats
    _hydrate_event: EventHydrater
    # Takes in an event and returns the date on which it occurred
    _get_date_from_event: DateExtractor
    # Takes in an event and returns its ID
    _get_id_from_event: IdExtractor
Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, features easily
Data Engineer Jobs-to-be-done
● Provide a simple API that just works
● Make the system efficient enough to run on a regular schedule and ad hoc
What happens if we DO have CI/CD?
Quickly iterate
Know if things break
Train models on old examples
You will have a better & more flexible product
You will be able to address customer requests quickly
You will be able to support a larger team of ML engineers working in parallel
We’re Hiring!
abnormalsecurity.com/careers/
Thank You
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Machine Learning CI/CD for Email Attack Detection

  • 1. Continuous ML Integration & Delivery for Advanced Email Attack Detection Jeshua Bratman & Justin Young
  • 2. The Detection Problem
       From: “Josephine Wright” <invoicing@edisonpower.com>
       To: “Tim James” <accounts@northwestmercyhospitals.com>
       Subject: “Invoice details for September electricity service”
       Hi Tim,
       September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals.
       ABA: 12321001
       Routing#: 123456789
       -Jo
       Invoice Payment Fraud!
  • 3. The Detection Problem: Advanced Social Engineering
       [Figure: email categories on a scale labeled "More Damaging & Sophisticated & Rare".]
       Category labels: Phishing, Spear Phishing, Malware; Spam; Graymail; Business Email Compromise; Extortion; Compromised Employee; Invoice Fraud; Heists; Scam; Compromised Vendor; Legitimate Email.
       Frequency labels: ~25% of emails; ~25% of emails; ~50% of emails; <.1% of emails; <.01% of emails; < 1 in 100k emails; < 1 in a million emails; < 1 in 10 million emails.
  • 4. The Detection Problem: This is a hard machine learning problem
       1. Rarity of attacks
       2. Adversarial attack landscape
       3. High-dimensional & high data volume
       4. Need extremely high precision and recall simultaneously
  • 5. Move Fast! Lightning-speed iteration to get ahead of new attacks.
       Don’t Break Things! We don’t want to stop catching old attacks.
       Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine
  • 6. Part 2: CI/CD for a Machine Learning Detection Engine
       How do we develop quickly without breaking things?
  • 8. What happens if we *do not* have CI/CD?
       No idea if a code change breaks the system
       Engineers fixing each other's bugs all the time
       Pushing bad code to production
       In modern software development it would be insane not to have CI/CD
  • 9. Machine Learning CI/CD
       [Diagram: ML Engineer modifies Code, Models, Datasets. Stages: Tests (Do the tests pass?), Rescoring Analytics (Is performance good?), Model Training (Can new models train?), Deployment.]
  • 10. What happens if we *do not* have CI/CD?
       Cannot safely change system to fix an FN or FP
       May degrade system unintentionally when shipping improvements
       Cannot know overall impact of new model to entire system
       Most ML products run blind like this! It greatly hampers development speed and product stability.
  • 11. Adversarial!
       From: “Josephine Wright” <invoicing@edisonpower.com>
       To: “Tim James” <accounts@northwestmercyhospitals.com>
       Subject: “Invoice details for September electricity service”
       Hi Tim,
       September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals.
       ABA: 12321001
       Routing#: 123456789
       -Jo
       Invoice Payment Fraud!

       New Attack Strategy:
       From: “Josephine Wright” <invoicing@edisonpovver.com>
       To: “Tim James” <accounts@northwestmercyhospitals.com>
       Subject: “Invoice details for September electricity service”
       Hi Tim,
       Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details.
       -Josephine
       Attachment: BankDetails.pdf
       Billing Account Update Fraud!
  • 12. OK, how would we use this
       From: “Josephine Wright” <invoicing@edisonpovver.com>
       To: “Tim James” <accounts@northwestmercyhospitals.com>
       Subject: “Invoice details for September electricity service”
       Hi Tim,
       Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details.
       -Josephine
       Attachment: BankDetails.pdf
       Billing Account Update Fraud!

       New or improved NLP models to identify language around changing bank accounts
       New code to parse PDFs and extract bank account numbers from them
       New counting features for how often a sender uses a particular domain, new code with feature extractor, and a model that uses those features
  • 13. Machine Learning CI/CD Details
       [Diagram: ML Engineer modifies Code, Models, Datasets. Components: ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.]
  • 14. Requirements of good CI/CD for ML
       Accurate
       ● Rescoring analytics reflect performance in production
       ● Training data is unbiased (including time travel to avoid future leakage)
       ML Engineer Effectiveness
       ● Easy and fast to run by engineers for retraining and evaluation
       ● Can add new models, datasets, features easily
  • 15. Part 3: Designing the System
       How do we build a CI/CD platform for our ML system that enables developers and also scales well?
  • 16. So how do we do this?
       This is a big data problem!
       Data, models, and code are all part of the software system we’re testing
       So, we’ll use Spark to simulate our online system. But things get complicated fast...
       [Diagram: Code, Models, Datasets, ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.]
  • 17. A Familiar ML Story
       From: “Josephine Wright” <invoicing@edisonpovver.com>
       To: “Tim James” <accounts@northwestmercyhospitals.com>
       Subject: “Invoice details for September electricity service”
       Hi Tim,
       Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details.
       -Josephine
       Attachment: BankDetails.pdf
       Billing Account Update Fraud!

       New counting features for how often a sender uses a particular domain, new code with feature extractor, and a model that uses those features
       A data scientist has a great new feature… but how do we safely get it into production?
       Domain Count Dataset:
           ...
           ("Josephine Wright", "edisonpower.com"): 1000,
           ("Josephine Wright", "edisonpovver.com"): 0,
           ...
  • 18. A Familiar ML Story
       A data scientist has a great new feature… but how do we safely get it into production?
       For just the new domain count feature:
       1. Domain Count Dataset
       2. Feature extraction code
       3. New sub-model?
       Domain Count Dataset:
           ...
           ("Josephine Wright", "edisonpower.com"): 1000,
           ("Josephine Wright", "edisonpovver.com"): 0,
           ...
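
The domain count feature from slides 17 and 18 can be illustrated with a small pure-Python sketch. This is only an illustration under stated assumptions, not the speakers' code: the function names `build_domain_counts` and `domain_count_feature`, and the idea of building counts from a list of (display name, domain) pairs, are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (sender display name, sender domain), as in the Domain Count Dataset on the slide
SenderDomain = Tuple[str, str]


def build_domain_counts(history: Iterable[SenderDomain]) -> Dict[SenderDomain, int]:
    """Count how often each display name has been seen with each domain (hypothetical helper)."""
    return dict(Counter(history))


def domain_count_feature(
    counts: Dict[SenderDomain, int], display_name: str, domain: str
) -> int:
    """Look up the count for one message; unseen pairs get a count of 0."""
    return counts.get((display_name, domain), 0)


# Toy history: the real sender has used the legitimate domain 1000 times.
history = [("Josephine Wright", "edisonpower.com")] * 1000
counts = build_domain_counts(history)

known = domain_count_feature(counts, "Josephine Wright", "edisonpower.com")
lookalike = domain_count_feature(counts, "Josephine Wright", "edisonpovver.com")
```

A familiar display name paired with a never-before-seen domain yields a count of 0, which is the signal a downstream model could use.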
  • 19. What does it look like to test this new feature?
       In a typical software test, we can mock out complex dependencies
       But for ML, we can’t mock the data!
       Does every data scientist have to become a data engineer?
       [Diagram: Domain Count Dataset, Code, Models, Datasets, ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.]
  • 20. Adding Our New Dataset
       What would it look like for our data scientist to add the new dataset?
       SparkFiles: download dataset to disk on each executor
       Broadcast Variable: broadcast dataset in memory in each PySpark process
  • 21. Adding Our New Dataset
       SparkFiles: download dataset to disk on each executor
           import os
           from pyspark import SparkFiles
           # Add Spark file so that every executor will download it
           sc.addFile(remote_dataset_path)
           # Now the file can be loaded in any Spark operation from local_dataset_path
           local_dataset_path = SparkFiles.get(
               os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
           )
       Broadcast Variable: broadcast dataset in memory in each PySpark process
           # Broadcast variable to every executor
           small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
           ip_broadcast = sc.broadcast(small_ip_dataset)
           # hydrate_with_ip_count can use the small_ip_dataset dictionary
           hydrated_rdd = rdd.map(
               lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
           )
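
The slide calls `hydrate_with_ip_count` but does not show it. A minimal sketch of what such a function might look like follows, written in plain Python so it runs without a cluster. The message shape (a dict with a `sender_ip` field) and the output field name `ip_count` are assumptions made for illustration only.

```python
from typing import Dict, List, Mapping


def hydrate_with_ip_count(
    message: Mapping[str, object], ip_counts: Mapping[str, int]
) -> Dict[str, object]:
    """Return a copy of the message with an added ip_count field (hypothetical shape)."""
    hydrated = dict(message)  # copy, so the input message is left untouched
    hydrated["ip_count"] = ip_counts.get(str(message["sender_ip"]), 0)
    return hydrated


# Same toy dictionary as on the slide
small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}

messages: List[Dict[str, object]] = [
    {"id": "m1", "sender_ip": "1.2.3.4"},
    {"id": "m2", "sender_ip": "9.9.9.9"},  # IP not present in the dataset
]

# Local stand-in for rdd.map(...): apply the function to each message
hydrated_messages = [hydrate_with_ip_count(m, small_ip_dataset) for m in messages]
```

In the PySpark version on the slide, the second argument would be `ip_broadcast.value`, so each executor reads the dictionary from its local broadcast copy.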
  • 22. Adding Our New Dataset
       What would it look like for our data scientist to add the new dataset?
       SparkFiles: download dataset to disk on each executor
       Broadcast Variable: broadcast dataset in memory in each PySpark process
       Spark Join: join large distributed datasets via Spark operations (Domain Count Dataset)
  • 23. Wait, what about time travel?
       [Figure: timeline (axis: Time) showing hydration of the counting feature up to time t-x and up to time t; values shown: 48, 50.]
  • 24. Feature Hydration With Time Travel
       [Figure: Domain Count Dataset. Daily Counts are summed over time into Cumulative Counts.]
  • 25. Feature Hydration With Time Travel
       [Figure: Events. Time-bucket and key.]
  • 26. Feature Hydration With Time Travel
       [Figure: Join By Key + Time, producing Hydrated Events.]
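
The time-travel idea in slides 23 to 26 (daily counts summed into cumulative counts, then looked up by key and time) can be sketched on a single machine without Spark. This is a minimal illustration, not the speakers' implementation: the names `build_cumulative_index` and `count_as_of` are hypothetical, and counting only days strictly before the event's day is an assumption made here to keep future data out of the feature.

```python
from bisect import bisect_left
from datetime import date
from itertools import accumulate
from typing import Dict, List, Tuple

# key -> {day -> count observed on that day}
DailyCounts = Dict[str, Dict[date, int]]
# key -> (sorted days, running totals aligned with those days)
CumulativeIndex = Dict[str, Tuple[List[date], List[int]]]


def build_cumulative_index(daily_counts: DailyCounts) -> CumulativeIndex:
    """Turn per-day counts into running totals per key ("sum over time")."""
    index: CumulativeIndex = {}
    for key, per_day in daily_counts.items():
        days = sorted(per_day)
        totals = list(accumulate(per_day[d] for d in days))
        index[key] = (days, totals)
    return index


def count_as_of(index: CumulativeIndex, key: str, event_day: date) -> int:
    """Cumulative count from days strictly before event_day (assumed cutoff)."""
    days, totals = index.get(key, ([], []))
    pos = bisect_left(days, event_day)  # number of recorded days before event_day
    return totals[pos - 1] if pos else 0


daily = {
    "Josephine Wright|edisonpower.com": {
        date(2020, 9, 1): 2,
        date(2020, 9, 2): 3,
        date(2020, 9, 4): 5,
    }
}
index = build_cumulative_index(daily)
key = "Josephine Wright|edisonpower.com"

on_first_day = count_as_of(index, key, date(2020, 9, 1))
mid_history = count_as_of(index, key, date(2020, 9, 3))
after_all = count_as_of(index, key, date(2020, 9, 5))
```

An event is hydrated with the count as it stood at the event's own time, so a model retrained on historical events sees the same feature values it would have seen online.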
  • 27. Deep Dive: Re-hydrating Behavior Graph
           # Index every event by key and day, and take event ID to avoid passing around large objects
           keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
           # Index every count by key and day
           keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
           # Join date-indexed event IDs with date-indexed counts, by common key
           joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
           # In memory, sum up cumulative counts and key by event ID
           cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
               _extract_cumulative_counts
           )
           # Join actual events back in by event ID
           joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
               event_rdd.keyBy(_get_id_from_event)
           )
           # Hydrate every event with cumulative counts
           hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
               _hydrate_event_with_counts
           )
  • 28. Back To Our ML Story
       So we can do all of this in Spark. But no data scientist should ever have to think about this!
       Data engineers should go to great efforts to provide a simple platform that hides these details
       Data scientists should spend as much time as possible doing data science
       Domain Count Dataset:
           ...
           ("Josephine Wright", "edisonpower.com"): 1000,
           ("Josephine Wright", "edisonpovver.com"): 0,
           ...
  • 29. Re-scoring Is Part of the MLOps Platform
       Data engineers have to make re-scoring as easy to use as traditional CI/CD
       This means providing a playbook that’s as easy as adding unit tests
       Domain Count Dataset:
           ...
           ("Josephine Wright", "edisonpower.com"): 1000,
           ("Josephine Wright", "edisonpovver.com"): 0,
           ...
  • 30. Re-scoring Is Part of the MLOps Platform
       Data engineers have to make re-scoring as easy to use as traditional CI/CD
       This means providing a playbook that’s as easy as adding unit tests
       Domain Count Dataset:
           ...
           ("Josephine Wright", "edisonpower.com"): 1000,
           ("Josephine Wright", "edisonpovver.com"): 0,
           ...
           class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
               # Class for building set of stats to lookup
               _lookup_stats_builder: LookupStatsBuilder
               # How to hydrate the Event with the Stats
               _hydrate_event: EventHydrater
               # Takes in an event and returns the date on which it occurred
               _get_date_from_event: DateExtractor
               # Takes in an event and returns its ID
               _get_id_from_event: IdExtractor
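
The class on slide 30 only shows its fields. To make the "as easy as adding unit tests" idea concrete, here is a much simplified, single-machine sketch of how a data scientist might plug three small functions into such a hydrater. Everything here is hypothetical: `SimpleEventHydrater`, its field names, and the dict-shaped events are assumptions for illustration and do not reflect the actual `TimeSlicedStatsEventHydrater` API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Generic, Iterable, List, TypeVar

Event = TypeVar("Event")
Stat = TypeVar("Stat")


@dataclass
class SimpleEventHydrater(Generic[Stat, Event]):
    """Hypothetical stand-in: the user supplies three small functions, the platform does the rest."""

    # Looks up the stat for an event as of a given day
    lookup_stats: Callable[[Event, date], Stat]
    # Returns a new event carrying the stat
    hydrate_event: Callable[[Event, Stat], Event]
    # Takes in an event and returns the date on which it occurred
    get_date_from_event: Callable[[Event], date]

    def hydrate(self, events: Iterable[Event]) -> List[Event]:
        hydrated = []
        for event in events:
            day = self.get_date_from_event(event)
            hydrated.append(self.hydrate_event(event, self.lookup_stats(event, day)))
        return hydrated


# Toy lookup table; a real platform would resolve this with time-sliced counts.
DOMAIN_COUNTS: Dict[tuple, int] = {("Josephine Wright", "edisonpower.com"): 1000}


def lookup_domain_count(event: dict, day: date) -> int:
    # The day is ignored in this toy lookup; it is where time travel would apply.
    return DOMAIN_COUNTS.get((event["name"], event["domain"]), 0)


def add_domain_count(event: dict, count: int) -> dict:
    return {**event, "domain_count": count}


hydrater = SimpleEventHydrater(
    lookup_stats=lookup_domain_count,
    hydrate_event=add_domain_count,
    get_date_from_event=lambda event: event["day"],
)

events = [
    {"id": "e1", "name": "Josephine Wright", "domain": "edisonpower.com", "day": date(2020, 9, 3)},
    {"id": "e2", "name": "Josephine Wright", "domain": "edisonpovver.com", "day": date(2020, 9, 4)},
]
hydrated_events = hydrater.hydrate(events)
```

The design point is that the data scientist writes only the lookup and hydration functions, while joins, time slicing, and distribution stay behind the platform interface.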
  • 31. Requirements of good CI/CD for ML
       Accurate
       ● Rescoring analytics reflect performance in production
       ● Training data is unbiased (including time travel to avoid future leakage)
       ML Engineer Effectiveness
       ● Easy and fast to run by engineers for retraining and evaluation
       ● Can add new models, datasets, features easily
       Data Engineer Jobs-to-be-done
       ● Provide simple API that just works
       ● Make the system efficient enough to run on a regular schedule and ad-hoc
  • 32. What happens if we DO have CI/CD?
       Quickly iterate
       Know if things break
       Train models on old examples
       You will have a better & more flexible product
       You will be able to address customer requests quickly
       You will be able to support a larger team of ML engineers working in parallel