In most security data science talks that describe a specific algorithm used to solve a security problem, the audience is left wondering: how did they perform system testing when there is no labeled attack data? What metrics do they monitor? And what do these systems actually look like in production? Academia and industry both focus largely on security detection, but the emphasis is almost always on the algorithmic machinery powering the systems. Prior art on productizing these solutions is sparse: the problem has been studied from a machine-learning angle or from a security angle, but not jointly. Yet the intersection of operationalizing security and machine-learning solutions matters, not only because security data science solutions inherit complexities from both fields but also because each brings unique challenges. For instance, compliance restrictions that dictate data cannot be exported from specific geographic locations (a security constraint) have a downstream effect on model design, deployment, evaluation, and management strategies (a data science constraint). This talk explores this intersection!
3. Choosing the Learning Task
• Binary Classification
• Anomaly Detection
• Ranking
Defining Data Input
• Data loaders (text, binary, SVM-light, transpose loader)
• Data type
Applying Data Transforms
• Cleaning missing data
• Dealing with categorical data
• Dealing with text data
• Data normalization
Choosing the Learner
• Binary Classification
• Regression
• Multi-class
• Unsupervised
• Ranking
• Anomaly Detection
• Collaborative Filtering
• Sequence Prediction
Choosing Output
• Save the features of the model?
• Save the model as text?
• Save the model as binary?
• Save the per-instance results?
Choosing Run Options
• Run locally?
• Run distributed on an HPC cluster?
• Are all paths in the experiment node-accessible?
• Priority?
• Max concurrent processes?
View Results
• Too large? View a sample
• Right size? Load the data
• Histogram per feature
• Sampled instances
Debug and Visualize Errors
• Error in data
• Error in learner
• Error in optimizer
• Error in experimentation setup
Analyze Model Predictions
• Root-cause analysis
• Grading
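This experimentation workflow maps naturally onto common ML tooling. As a minimal, hedged sketch (using scikit-learn, with an entirely hypothetical input file and feature names for a binary detection task), the stages above might look like:

```python
# Minimal sketch of the workflow above using scikit-learn.
# The input file and feature names are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Defining data input
df = pd.read_csv("activity_logs.csv")            # hypothetical input
X, y = df.drop(columns=["label"]), df["label"]   # label: 1 = malicious, 0 = benign

# Applying data transforms: impute missing values, encode categoricals, normalize
transforms = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
     ["bytes_uploaded", "login_hour"]),          # hypothetical numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "client_app"]),
])

# Choosing the learner: binary classification
model = Pipeline([("transforms", transforms),
                  ("learner", LogisticRegression(max_iter=1000))])

# Run options, output, and viewing results (local run, per-instance results)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```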
6. Security Data Science Projects are different
• Traditional Programming Projects: spec/prototype → implement → ship
• Data Science Projects: at each stage, relabel, refeaturize, retrain
• With data-driven features, all components drift:
  • Learner: more accurate / faster / lower memory footprint / …
  • Features: there are always better ones
  • Data: all distributions drift
• Security Projects: at each stage, assess the threat, build detections, respond
• All components drift here too:
  • Threat: new attacks constantly come out
  • Detection: newer log sources
  • Response: better tooling, newer TSGs (troubleshooting guides)
So wait… when do we ship?
7. You ship when your solution is operational
Security Experts • Engineers • Legal • Service Engineers • Product Managers • Machine Learning Experts
8. Operational is more than your “model is working”…
• Vague goal: “Detect unusual user activity to prevent data exfiltration”
• Operational goal: “Detect unusual user activity using application logs, with a false positive rate < 1%, for all Azure customers, in near real-time”
9. Operationalize Security Data Science: Components
• Detect unusual user activity => The Problem
• using Application logs => Data
• with false positive rate < 1% => Model Evaluation
• for all Azure customers => Model Deployment
• in near real-time => Model Scale-out
13. Model Evaluation
Customer Metrics
• E.g.: false positive rate
• Makes your customer (and ergo, your business) happy
• How do you measure this?
Model Usage Metrics
• E.g.: call rate
• How much is the model in use?
• Makes your division happy
• Collected by your pipeline after deployment
Model Validation Metrics
• E.g.: MSE, reconstruction error, …
• How well does the model generalize?
• Makes the data scientist happy
• Comes pre-built with ML frameworks (scikit-learn, CNTK)
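As a small illustration of the validation end of this spectrum, here is a hedged sketch (hypothetical labels and scores) computing the false positive rate and MSE with scikit-learn; usage and customer metrics, by contrast, have to come from your deployment pipeline:

```python
# Hedged sketch: validation-style metrics from held-out labels (hypothetical data).
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])           # 1 = confirmed malicious
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1])           # model decisions
y_score = np.array([.1, .7, .9, .2, .4, .1, .3, .8])  # model probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positive rate:", fp / (fp + tn))                  # the metric customers feel
print("MSE of scores:", mean_squared_error(y_true, y_score))   # the data-scientist metric
```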
14. Model Evaluation: How to gather an evaluation dataset?
• Good: Use Benchmark datasets
• List of curated datasets - www.secrepo.com
• Con: Remember – attackers have ‘em too!
• Better: Use previous Indicators of Compromise
• Honeypots, commercial IOC feeds
• Steps:
• Gather confirmed IOCs
• “Backprop” them through the generated alerts
• This will help you calculate FP and FN
• Best: Curate your own dataset
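For the “Better” option, a minimal, hedged sketch of the “backprop through alerts” step might look like the following (pandas, with hypothetical column names for the alert and IOC feeds):

```python
# Hedged sketch: score generated alerts against confirmed IOCs (hypothetical schemas).
import pandas as pd

alerts = pd.read_csv("generated_alerts.csv")   # columns assumed: entity, alert_time
iocs = pd.read_csv("confirmed_iocs.csv")       # columns assumed: entity

ioc_entities = set(iocs["entity"])
alerted_entities = set(alerts["entity"])

true_positives = alerted_entities & ioc_entities      # alerts that match a confirmed IOC
false_positives = alerted_entities - ioc_entities     # alerts with no confirmed IOC
false_negatives = ioc_entities - alerted_entities     # IOCs the system never alerted on

print(f"TP={len(true_positives)} FP={len(false_positives)} FN={len(false_negatives)}")
```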
15. Curating your own dataset options
1. Inject Fake Malicious data
[Diagram: synthetic data → storage → model → alerting system]
How: Label data as “eviluser” and check if “eviluser” pops to the top of the reports every day
Pro: Low overhead; you don’t have to depend on a red team to test your detection
Con: The injected data may not be representative of true attacker activity
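As a hedged sketch of this check (the scoring function and report format below are hypothetical), injecting a synthetic “eviluser” and verifying it surfaces in the daily report could look like:

```python
# Hedged sketch: inject a synthetic "eviluser" and verify it tops the daily report.
# score_activity() and the column names are hypothetical placeholders.
import pandas as pd

def score_activity(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the real model; here, more uploaded bytes = more suspicious.
    return df.assign(score=df["bytes_uploaded"].rank(pct=True))

todays_logs = pd.read_csv("todays_activity.csv")        # hypothetical input

synthetic = pd.DataFrame([{"user": "eviluser", "bytes_uploaded": 10**9}])
report = (score_activity(pd.concat([todays_logs, synthetic], ignore_index=True))
          .sort_values("score", ascending=False)
          .head(20))

assert "eviluser" in set(report["user"]), "Injected attacker did not surface in today's report"
```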
16. Curating your own dataset options
2. Employ Commonly Used Attacker Tools
How: Spin up a malicious process in your environment using Metasploit, PowerSploit, or Veil, then look for traces in your logs
Pro: Easy to implement; your development team can run the tools with little training and generate attack data in the logs
Con: The machine learning system will only learn to detect known attacker toolkits and will not generalize over the attack methodology
[Diagram: tainted data → storage → model → alerting system]
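A hedged sketch of the “look for traces in your logs” step (the process-creation log schema and the indicator list are illustrative, not exhaustive):

```python
# Hedged sketch: search process-creation logs for traces of known attacker toolkits.
# Column names and indicator strings are illustrative placeholders.
import pandas as pd

proc_logs = pd.read_csv("process_creation_logs.csv")   # columns assumed: host, user, command_line

indicators = ["metasploit", "meterpreter", "powersploit", "invoke-mimikatz", "veil"]
pattern = "|".join(indicators)

hits = proc_logs[proc_logs["command_line"].str.lower().str.contains(pattern, na=False)]
print(hits[["host", "user", "command_line"]].head(20))
```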
17. Curating your own dataset options
3. Red Team pentests your environment
How: A red team attacks the system, and the logs from the attack are collected as tainted data
Pro: Closest technique to real-world attacks
Con: Red team engagements are point-in-time exercises and are expensive
[Diagram: tainted data → storage → model → alerting system]
18. Growing your dataset: Generative Adversarial Networks
Sources: https://medium.com/@devnag/generative-adversarial-networks-gans-in-50-lines-of-code-pytorch-e81b79659e3f#.djcfc6eo0 and http://www.evolvingai.org/ppgn
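As a rough, hedged sketch of the idea (PyTorch, toy dimensions, and randomly generated stand-in data rather than real featurized security events), a GAN that grows a dataset of synthetic feature vectors might look like:

```python
# Hedged sketch: a tiny GAN over feature vectors (stand-in data, toy dimensions).
import torch
import torch.nn as nn

FEATURE_DIM, NOISE_DIM = 16, 8

G = nn.Sequential(nn.Linear(NOISE_DIM, 32), nn.ReLU(), nn.Linear(32, FEATURE_DIM))
D = nn.Sequential(nn.Linear(FEATURE_DIM, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(1024, FEATURE_DIM)  # stand-in for real featurized events

for step in range(1000):
    real = real_data[torch.randint(0, 1024, (64,))]
    fake = G(torch.randn(64, NOISE_DIM))

    # Discriminator step: real -> 1, fake -> 0
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to fool the discriminator
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

synthetic_events = G(torch.randn(100, NOISE_DIM)).detach()  # grown dataset
```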
20. Azure has data centers all around the world!
21. Localization affects Model Building
• Privacy laws vary across the board
  • An IP address is treated as EII in some regions but not in others
• “Anyone logging into the corporate network at midnight during the weekend is anomalous”
  • Weekend in the Middle East != weekend in the Americas
• Seasonality varies
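A tiny, hedged sketch of what this means for feature engineering (the region-to-weekend mapping below is illustrative, not a complete or current table):

```python
# Hedged sketch: the "weekend login" feature must be region-aware.
# The mapping is illustrative; real deployments need a full, maintained table.
from datetime import datetime

WEEKEND_BY_REGION = {
    "americas": {5, 6},      # Saturday, Sunday (Monday == 0)
    "middle_east": {4, 5},   # Friday, Saturday
}

def is_weekend_login(ts: datetime, region: str) -> bool:
    return ts.weekday() in WEEKEND_BY_REGION.get(region, {5, 6})

# The same timestamp is anomalous in one region and routine in another.
ts = datetime(2017, 6, 23, 0, 30)            # a Friday, just after midnight
print(is_weekend_login(ts, "americas"))      # False
print(is_weekend_login(ts, "middle_east"))   # True
```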
22. Option 1: Shotgun Deployment
• How: Deploy the same model code across different regions
• Pros:
  • Easy deployment
  • Uniform metrics
  • Single TSG to debug all service incidents
• Cons:
  • Lose macro trends in favor of micro trends
  • Model-region incompatibility
[Diagram: one model deployed identically to Region 1, Region 2, and Region 3]
23. Option 2: Tiered Modeling
• How:
  • Federated models
    • Each region is modeled separately
    • Results are scrubbed according to compliance laws and privacy agreements
    • Scrubbed results are used as input to “Model Prime”
  • Model Prime
    • Results are collated to search for global trends
• Pros:
  • Bespoke modeling for every region
  • Balance between micro and macro modeling
• Cons:
  • Complicated deployment
  • Depending on the agreements, Model Prime may not be possible
[Diagram: Region 1/2/3 → Model 1/2/3 → scrubbed results → Model Prime]
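A hedged sketch of the tiered pattern (the per-region scoring, scrubbing rules, and Model Prime aggregation below are placeholders for whatever your compliance agreements actually allow):

```python
# Hedged sketch of tiered modeling: per-region models, scrubbed outputs, a global "Model Prime".
# score_region() and the scrubbing rule are hypothetical placeholders.
import hashlib
import pandas as pd

def score_region(region_df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the region-local model; emits one anomaly score per user.
    return (region_df.groupby("user", as_index=False)["bytes_uploaded"].sum()
                     .rename(columns={"bytes_uploaded": "anomaly_score"}))

def scrub(scores: pd.DataFrame, region: str) -> pd.DataFrame:
    # Scrub according to the region's compliance rules: here, pseudonymize the user id.
    scores = scores.copy()
    scores["user"] = scores["user"].map(
        lambda u: hashlib.sha256(f"{region}:{u}".encode()).hexdigest()[:12])
    scores["region"] = region
    return scores

regions = {r: pd.read_csv(f"{r}_activity.csv") for r in ["region1", "region2", "region3"]}

# Federated tier: model each region locally, then scrub before anything leaves the region.
scrubbed = [scrub(score_region(df), region) for region, df in regions.items()]

# Model Prime: collate scrubbed results and look for global trends.
model_prime_input = pd.concat(scrubbed, ignore_index=True)
print(model_prime_input.nlargest(10, "anomaly_score"))
```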
25. Detecting Malicious Activities
• Detect risky or malicious activity => The Problem
• in SharePoint Online activity logs => Data
• with precision > 90% => Model Evaluation
• for all SPO users => Model Deployment
• in near real-time => Model Scale-out
26. Exploratory Analysis
• Typical data science work:
• Sample data
• Script for preprocessing data
• Summary statistics
• Script for evaluating approaches
• All done locally on the dev machine using R/Python
• Facilitates quick turnaround
• Avoids having to debug at scale
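As a hedged illustration of that local loop (the sample file and column names are hypothetical):

```python
# Hedged sketch of the local exploratory loop: sample, preprocess, summarize.
# The sample file and column names are hypothetical.
import pandas as pd

sample = pd.read_csv("spo_activity_sample.csv", parse_dates=["timestamp"])

# Light preprocessing
sample["hour"] = sample["timestamp"].dt.hour
sample = sample.dropna(subset=["user", "operation"])

# Summary statistics that guide featurization
print(sample.describe(include="all"))
print(sample["operation"].value_counts().head(10))
print(sample.groupby("user")["operation"].count().describe())
```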
27. Model Evaluation
• Labels from known incidents and investigations
• Inject labels by mimicking malicious activity
• SPO team helps us understand the malicious activity
• Red team helps us simulate the malicious activity
• > 90% precision
28. Model: Bayesian Network
• Probabilistic graphical model
• Related to GMM, CRF, MRF
• Represents variables and conditional independence assertions in a directed acyclic graph
• Directed edges encode conditional dependencies
• Conditional probability distributions for each variable
[Diagram: classic example network — Burglary and Earthquake → Alarm → JohnCalls and MaryCalls]
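A hedged sketch of the classic network in the diagram using the pgmpy library (toy probabilities; the production SPO model is, of course, a different network):

```python
# Hedged sketch: the textbook Burglary/Earthquake/Alarm network with pgmpy (toy CPDs).
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Burglary", "Alarm"), ("Earthquake", "Alarm"),
                         ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")])

cpd_b = TabularCPD("Burglary", 2, [[0.999], [0.001]])
cpd_e = TabularCPD("Earthquake", 2, [[0.998], [0.002]])
cpd_a = TabularCPD("Alarm", 2,
                   [[0.999, 0.71, 0.06, 0.05],   # P(Alarm=0 | Burglary, Earthquake)
                    [0.001, 0.29, 0.94, 0.95]],  # P(Alarm=1 | Burglary, Earthquake)
                   evidence=["Burglary", "Earthquake"], evidence_card=[2, 2])
cpd_j = TabularCPD("JohnCalls", 2, [[0.95, 0.10], [0.05, 0.90]],
                   evidence=["Alarm"], evidence_card=[2])
cpd_m = TabularCPD("MaryCalls", 2, [[0.99, 0.30], [0.01, 0.70]],
                   evidence=["Alarm"], evidence_card=[2])

model.add_cpds(cpd_b, cpd_e, cpd_a, cpd_j, cpd_m)
assert model.check_model()

# P(Burglary | both John and Mary called)
print(VariableElimination(model).query(["Burglary"],
                                       evidence={"JohnCalls": 1, "MaryCalls": 1}))
```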
29. Initial Prototype – v0.1
• One activity model for all users
• Run the model in a cloud environment with an Azure Worker Role
• Storage accounts for input data and output scores
• Pros:
  • Easy to manage
  • Small memory footprint
• Cons:
  • Does not scale
  • Low throughput
[Diagram: Users 1–3 → Data storage account → Azure Worker Role running the single Activity Model → Scores storage account]
30. Improved Approach
• One model for each user
• Personalized activity suspiciousness
• Cluster low-activity users for better model results
• Replace storage accounts with Azure Event Hubs
  • Low-latency, cloud-scale “queues”
[Diagram: Users 1–3 → Event Hub → Azure Worker Role hosting per-user Models 1…n → Event Hub → Scores]
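A hedged sketch of the per-user routing on the worker (using the azure-eventhub Python SDK; the connection strings, hub names, and model interface are placeholders, not our production code):

```python
# Hedged sketch: consume activity events and route each one to its per-user model.
# Connection strings, hub names, and get_model()/score() are hypothetical placeholders.
import json
from azure.eventhub import EventHubConsumerClient, EventHubProducerClient, EventData

consumer = EventHubConsumerClient.from_connection_string(
    "<input-hub-connection-string>", consumer_group="$Default", eventhub_name="activity")
producer = EventHubProducerClient.from_connection_string(
    "<output-hub-connection-string>", eventhub_name="scores")

def get_model(user_id):
    # Placeholder: fetch the per-user model (caching strategy sketched under Model Scale-Out below).
    ...

def on_event(partition_context, event):
    activity = json.loads(event.body_as_str())
    model = get_model(activity["user_id"])
    score = model.score(activity)                      # hypothetical model interface
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"user_id": activity["user_id"], "score": score})))
    producer.send_batch(batch)
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```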
31. Model Scale-Out: Memory
[Diagram: same per-user pipeline as above, with the models persisted in a Model Storage account]
• Millions of per-user models
• More than can fit in worker role memory
• Store models in a storage account
• Load as needed
32. Model Scale-Out: Latency
[Diagram: same pipeline, with a Redis cache between the worker role and Model Storage]
• The model storage account adds too much latency
• A Redis cache minimizes model-loading latency
• LRU policy as we process user activity events
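A hedged sketch of that lookup path (redis-py plus a placeholder loader for the storage account; Redis itself can be configured with maxmemory-policy allkeys-lru so the cache evicts least-recently-used models):

```python
# Hedged sketch: Redis-cached per-user models with a storage-account fallback.
# load_model_from_storage() is a hypothetical placeholder for the blob download + deserialize step.
import pickle
import redis

cache = redis.Redis(host="localhost", port=6379)
MODEL_TTL_SECONDS = 3600

def load_model_from_storage(user_id: str):
    # Placeholder: download the serialized per-user model blob and unpickle it.
    ...

def get_model(user_id: str):
    key = f"model:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return pickle.loads(cached)            # cache hit: cheap deserialization
    model = load_model_from_storage(user_id)   # cache miss: slow path to Model Storage
    cache.set(key, pickle.dumps(model), ex=MODEL_TTL_SECONDS)
    return model
```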
33. Data Compliance
• Models cannot use certain PII
• Balkanized cloud environments
• Tiered model development
• Resolve user information for UX
• UserID -> User Name
34. Data Compliance
[Diagram: same pipeline, with a second Redis cache in front of the User Account DB for UserID → user name resolution]
39. Operationalize Security Data Science: Components
=> Model Evaluation
=> Model Deployment
=> Model Scale-out
40. The Rand Test
A test to see whether your security data science solution is operational.
Answer Yes/No to the following:
1) Do you have an established pipeline to collect relevant security data?
2) Do you have established SLAs/data contracts with partner teams?
3) Can you seamlessly update the model with new features and re-train?
4) Did you evaluate the model with real attack data?
5) Does your model respect the different privacy laws across all regions?
6) Do you account for model localization?
7) Is your model scalable, end to end?
8) Do you hold live-site meetings about your solution?
9) Can security responders leverage the model for insights during an investigation?
10) Do you have a framework to collect feedback on the results from security analysts?
By @ram_ssk, Andrew Wicker
Scoring: each Yes = 1 point.
10: All systems operational!
5: One small step…
0: Houston, we have a problem.
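A trivially small, hedged sketch of the scoring (the question labels and thresholds simply restate the checklist above; the answers shown are made up):

```python
# Hedged sketch: score the Rand Test (one point per "yes"; answers are illustrative).
answers = {
    "data pipeline": True, "SLAs/data contracts": True, "retrainable": True,
    "evaluated on real attack data": False, "respects privacy laws": True,
    "localized": False, "scalable end to end": True, "live-site meetings": True,
    "useful to responders": True, "analyst feedback loop": False,
}

score = sum(answers.values())
if score == 10:
    verdict = "All systems operational!"
elif score >= 5:
    verdict = "One small step..."
else:
    verdict = "Houston, we have a problem"
print(f"{score}/10: {verdict}")
```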