1. ML for Security Monitoring
Santisook Limpeeticharoenchot
Managing Director
2. Agenda
• Why ML for Security Monitoring?
• Overview of Machine Learning
• Applying Theory to Practice
• ML for Security Examples
• Data Science Process for Security
• Q&A
3. Fraud
Bad Actors Ransomware
IP Theft
Application Performance
Identity Theft
Key Performance Indicators
Network Intrusion
Malware
Exfiltration
Cyber-attacks
Zero-day
Compromised Credentials
SCADA Security
Hardware Deterioration
Known & Unknown Threat
4. The Current IT Situation
• Fluid Infrastructure
• Distributed Applications
• Continuous Deployment
7. Current State Of Security Monitoring: #monitoringsucks
Measure Everything
➢ Collect thousands of metrics and logs, most of them unused
➢ Analytics methods are too simple and uncorrelated; they don't help solve outages
Threshold = alert overload
➢ Too many false positives
➢ Hundreds of alerts a day, most of them ignored
IT & security operations have become a big data challenge
“The [traditional] tools present us with the raw data, and lots of it, but sufficient insight into the
actual meaning buried in all that data is still remarkably scarce”
- Turn Big Data Inward With IT Analytics, Forrester Research
15. Terms and definitions
Artificial Intelligence
Machine learning
Deep learning
Algorithms
Supervised
Unsupervised
source:www.ibm.com
16. Traditional Computers vs. Artificial Intelligence
Traditional Programs
• Pre-programmed: produce the same results every time
• Deterministic: fixed true-or-false outcomes
• One-dimensional: built for one or a limited purpose
Artificial Intelligence
• Machine learning: changes its code to improve results
• Stochastic: based on probability
• Multi-dimensional: potential for more general purposes
source:www.ibm.com
17. Traditional Programs vs. Machine Learning
Traditional Programs: Data + Static code → Real world result
Machine Learning: Data + Algorithm → Real world result, with a Hypothesis/Feedback loop back into the algorithm
source:www.ibm.com
18. Enter Machine Learning!
What: “Field of study that gives computers the ability to learn
without being explicitly programmed” – Arthur Samuel, 1959
How: Generalizing (learning) from examples (data)
30. Anomaly Detection
Unusual vs. peers, rare events, and deviations in counts or values
EXAMPLES
"responsetime by host"
"count by error_type"
"rare by EventID"
"rare by process"
"sum(bytes) over client_ip"
source:prelert.com
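The "rare by …" examples above come down to counting how often each value occurs and flagging the scarce ones. A minimal Python sketch of that idea (the process names and the `max_count` cutoff are illustrative, not part of the original slides):

```python
from collections import Counter

def rare_values(events, max_count=1):
    """'rare by process' style detection: values that appear at most
    `max_count` times across the data set are flagged as rare."""
    counts = Counter(events)
    return [v for v, c in counts.items() if c <= max_count]

# Hypothetical process launches: two common binaries and one rare one
processes = ["svchost.exe"] * 50 + ["chrome.exe"] * 30 + ["mimikatz.exe"]
print(rare_values(processes))  # ['mimikatz.exe']
```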
31. Evolution of Malware Detection
Signature-based: potential malware matched against known "bad"
Behavior-based: potential malware matched against bad behavior
Heuristics/sandboxing: potential malware run against testing indicators
Statistical inference: potential malware scored with probabilities
source:www.ibm.com
32. Real world applications for Machine Learning
• Fraud: credit card fraud, spam, DLP
• Automated recognition: face, handwriting
• Capacity planning: product stocking, server provisioning
• Anomaly detection for security and IT operations
• Product recommendations
• Customer segmentation
• Medical diagnoses
…
33. Customer Use Case: Detect Network Outliers
Reduced downtime + increased service availability = better customer satisfaction
ML Use Case
Monitor noise rise for 20,000+ cell towers to increase service and device availability and reduce MTTR
Technical overview
• A customized solution deployed in production based on outlier detection
• Leverages the previous month's data and voting algorithms
“The ability to model complex systems and alert on deviations is where IT and security
operations are headed … Splunk Machine Learning has given us a head start...”
source:www.splunk.com
34. Reliable Website Updates
Proactive website monitoring leads to reduced downtime
"Splunk ML helps us rapidly improve end-user experience by ranking issue severity, which helps us determine root causes faster, thus reducing MTTR and improving SLA."
ML Use Case
• Very frequent code and config updates (1,000+ daily) can cause site issues
• Find errors in server pools, then prioritize actions and predict root cause
Technical overview
• Custom outlier detection built using the ML Toolkit Outlier assistant
• Built by a Splunk Architect with no Data Science background
source:www.splunk.com
37. Normal distributions are really useful
• I can make powerful predictions because of the statistical properties of the data; most naturally occurring processes are normally distributed
• I can easily compare different metrics since they have similar statistical properties
• Examples: population heights, IQ distributions, widget sizes and weights in manufacturing
• There is a HUGE body of statistical work on parametric techniques for normally distributed data
source:conf2016,splunk
42. Example: Three-Sigma Rule
Three-sigma rule:
– ~68% of the values lie within 1 standard deviation of the mean
– ~95% lie within 2 standard deviations
– 99.73% lie within 3 standard deviations; anything else is considered an outlier
source:conf2016,splunk
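The rule above translates directly into a few lines of Python. This is a sketch, and the `response_times` values are made up; note that on a short series a single extreme value inflates the sample standard deviation, so the 2-sigma band (the ~95% rule) is used in the example:

```python
import statistics

def sigma_outliers(values, k=3):
    """Flag values more than k standard deviations from the mean.
    Per the three-sigma rule, ~99.73% of normally distributed values
    fall within 3 standard deviations, so anything outside is an outlier."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * stdev]

# Hypothetical response times in ms with one obvious spike; the spike
# itself inflates the sample stdev, so k=2 is used to catch it here.
response_times = [100, 102, 98, 101, 99, 103, 97, 100, 500]
print(sigma_outliers(response_times, k=2))  # [500]
```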
43. Probabilistic Modeling and Analysis
(figure: a Gaussian ML model fit to observed values; points in the low-likelihood tails are outliers)
source:prelert.com
46. Security Applications of ML
• Fraud detection systems:
– Is what they just did consistent with past behavior?
• Network anomaly detection:
– More like bad statistical analysis
• Predicting likelihood of attack actors:
– Create different predictive models and chain them to gain more confidence in each step
Source:mlsecproject.org
47. Kinds of Network Security Monitoring
• Alert-based:
– "Traditional" log management
– SIEM
– Using "Threat Intelligence" (i.e., blacklists) for about a year or so
– Lack of context
– Low effectiveness
– You get the results handed over to you
• Exploration-based:
– Network forensics tools
– High effectiveness
– Lots of people necessary
– Lots of HIGHLY trained people
• Big Data Security Analytics (BDSA):
– Run exploration-based monitoring on Hadoop
– More like Big Data Security Monitoring (BDSM)
Source:mlsecproject.org
48. Correlation Rules: A Primer
• Rules in a SIEM solution invariably are:
– "Something" has happened "x" times;
– "Something" has happened and another "something2" has happened, with some relationship (time, same fields, etc.) between them.
• Configuring a SIEM = iterating on combinations until:
– The customer or management is satisfied;
– The consulting money runs out
• Behavioral rules (anomaly detection) help a bit with the "x"s, but are still very laborious and time-consuming.
Source:mlsecproject.org
49. Why is this so challenging using traditional methods?
(diagram: Historical Data [DB, Hadoop/S3/NoSQL; T - a few days] and Real-time Data [T + a few days] feed Statistical Models [Splunk Anomaly Detection or Machine Learning], surfaced through the SIEM to the Security, Network, and Business Operations Centers)
• DATA IS STILL IN MOTION, still in a BUSINESS PROCESS
• Enrich real-time MACHINE DATA with structured HISTORICAL DATA
• Make decisions IN REAL TIME using ALL THE DATA
• Combine LEADING and LAGGING INDICATORS (KPIs)
source:conf2016,splunk
50. Anomaly Detection & Machine Learning
What is AD?
Types of security anomalies: spikes in activity, rare events, first-observed, outliers, state change, simple existence
What do these have in common? They are time-based.
The basic comparison parameter is self-comparison over time. Advanced parameters include peer-based comparison.
What is ML?
Supervised ML
– Classification/Regression
Unsupervised ML
– Clustering
Semi-Supervised
– Rule-based AD
For AD and security, ML can establish a baseline of normal (negative) values
source:conf2016,splunk
51. Unsupervised Learning
– You have unlabeled data and want to group the data by feature(s)
– The algorithm makes its own structure out of the data
– You do not know what outliers look like
– Good for the data exploration phase of security anomaly detection
– Examples used in security applications include:
Clustering: k-means, k-medians, Expectation Maximization
Association: less relevant, because with highly structured searches we are less concerned with associations between fields for security anomaly detection
source:conf2016,splunk
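As a concrete illustration of clustering for the exploration phase, here is a minimal 1-D k-means in plain Python. It is a sketch, not a production implementation, and the session byte counts are invented; for real work the Splunk "kmeans" command or scikit-learn would be used instead:

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    then recompute centroids as cluster means, for a fixed number of passes."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical bytes-transferred per session: a "normal" group
# and a heavy-transfer group the algorithm should separate out
sessions = [120, 130, 125, 118, 122, 5000, 5200, 4900]
centroids, clusters = kmeans_1d(sessions, k=2)
```

With this data the two centroids settle near 123 and 5033, splitting the normal sessions from the heavy-transfer ones without any labels.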
52. Supervised Learning
Supervised Machine Learning
– You have labeled data and the algorithm predicts the output
– Classification vs. Regression
– Example ML algorithms include: Linear and Logistic Regression, Random Forest, Support Vector Machines, and DBSCAN (strictly speaking a clustering, i.e. unsupervised, method)
Semi-Supervised Machine Learning
– You have "some" labeled data, but not all
– Most security ML applications fall in this category
– Label Propagation
– Rule-based anomaly detection
For SECURITY-PURPOSED applications of ML, a combination of unsupervised, supervised, and semi-supervised learning algorithms is a best practice
In realistic applications, security-purposed AD requires highly structured data and human training of the algorithm
source:conf2016,splunk
53. ML 101 for Security Monitoring
• Machine Learning (ML) is a process for generalizing from examples
– Examples = example or "training" data
– Generalizing = building "statistical models" to capture correlations
– Process = ML is never done; you must keep validating & refitting models
• Simple ML workflow:
– Explore data
– FIT models based on data
– APPLY models in production
– Keep validating models
source:conf2016,splunk
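The FIT/APPLY loop above can be sketched with a toy baseline model. This is an assumption-laden illustration (a mean/stdev baseline stands in for the "statistical model", and the login counts are invented), not any particular product's implementation:

```python
import statistics

class BaselineModel:
    """Toy model for the FIT/APPLY workflow: FIT learns a mean/stdev
    baseline from training data, APPLY flags new points that deviate.
    Validation means periodically re-running fit() on fresh data."""

    def fit(self, training_data):
        self.mean = statistics.mean(training_data)
        self.stdev = statistics.stdev(training_data)
        return self

    def apply(self, point, k=3):
        # True when the new observation is more than k stdevs from baseline
        return abs(point - self.mean) > k * self.stdev

# FIT on last week's hypothetical daily login counts, APPLY to new days
model = BaselineModel().fit([40, 42, 38, 41, 39, 43, 40])
print(model.apply(41))   # normal day
print(model.apply(400))  # anomalous spike
```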
54. The ML Process
Problem: <Stuff in the world> causes big time & money expense
Solution: Build predictive model to forecast <possible incidents>, act pre-emptively & learn
1. Get all data relevant to the problem
2. Explore data & build KPIs
3. Fit, apply & validate models on past / real-time data
4. Predict and act. Identify notable events, create alerts
5. Surface incidents to X Ops, who INVESTIGATES & ACTS
Operationalize
source:conf2016,splunk
55. Security: Find Insider Threats
Problem: Security breaches cause big time & money expense
Solution: Build predictive model to forecast threat scenarios, act pre-emptively & learn
1. Get security data (data transfers, authentication, incidents)
2. Explore data & build KPIs
3. Fit, apply & validate models on past / real-time data
4. Predict and act. Identify anomalous behaviors, create alerts
5. Surface incidents to Security Ops, who INVESTIGATES & ACTS
Operationalize
source:conf2016,splunk
56. Machine Learning in IT Operations
Adaptive Thresholding:
• Learn baselines & dynamic thresholds
• Alert & act on deviations
• Manage for 1000s of KPIs & entities
• Stdev/Avg, Quartile/Median, Range
Anomaly Detection:
• Employ machine learning to baseline normal
operations and alert on anomalous conditions
• Identify abnormal trends and patterns in KPI data
source:conf2016,splunk
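Adaptive thresholding of the Stdev/Avg variety can be sketched as a rolling-window baseline. This is a hypothetical illustration, not Splunk ITSI's actual algorithm; the window size, multiplier, and CPU readings are all invented:

```python
from collections import deque
import statistics

def adaptive_alerts(readings, window=5, k=2.0):
    """For each KPI reading, learn a baseline from the previous `window`
    readings (avg + k * stdev, one of the policies the slide lists) and
    alert when the new reading exceeds that dynamic threshold."""
    history = deque(maxlen=window)
    alerts = []
    for t, value in enumerate(readings):
        if len(history) == window:
            avg = statistics.mean(history)
            sd = statistics.stdev(history)
            if value > avg + k * sd:
                alerts.append((t, value))
        history.append(value)
    return alerts

# Hypothetical CPU readings with a spike at index 7
cpu = [30, 32, 31, 29, 33, 31, 30, 95, 32, 31]
print(adaptive_alerts(cpu))  # [(7, 95)]
```

Because the spike itself enters the window, the threshold temporarily widens afterward, which is why such baselines are often paired with outlier-resistant statistics like Quartile/Median.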
57. Finds the Deviation perfectly
• No extraneous false alarms
• Automatic periodicity
source:prelert.com
58. Find Important IDS/IPS Events
Challenge:
How do you find the signs of advanced threats amid thousands of daily high-severity alerts?
▪ Difficulty of creating effective rules results in a high false positive rate
▪ Advanced Evasion Techniques (AETs) are well known to attackers
source:prelert.com
59. Find Important IDS/IPS Events
Solution:
Let machine learning filter out normal 'noise' and identify unusual counts, signatures, protocols and destinations by source
• Anomaly Detective generates a dozen or so alerts per week
• Accuracy & alert detail enable faster determination of threat level
source:prelert.com
60. Rare Items as Anomalies
Use Case: Learn typical processes on each host
Find rare processes that “start up and communicate”
source:prelert.com
61. Finds the RARE anomaly perfectly
• Finds an FTP process running for 3 hours on a system that doesn't normally run it
source:prelert.com
62. Population / Peer Outliers
Use Case: Find users behaving much differently than the others
source:prelert.com
63. Find the Unusual USER Perfectly
• Host sending 20,000 requests/hr
• Attempt to hack an IIS webserver
source:prelert.com
64. Low and Slow – Automated Logins
A user failing logins all day = "dc(date_hour) over user"
source:prelert.com
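The "dc(date_hour) over user" idea is just counting the distinct hours of the day in which each user failed a login: a human mistypes a password in one or two hours, while a low-and-slow script touches most of them. A sketch in plain Python (user names and events are invented):

```python
from collections import defaultdict

def distinct_failure_hours(failed_logins):
    """Count distinct hours-of-day with failed logins per user,
    mirroring the 'dc(date_hour) over user' detection above."""
    hours = defaultdict(set)
    for user, hour in failed_logins:
        hours[user].add(hour)
    return {user: len(h) for user, h in hours.items()}

# A human fails twice in the same hour; a hypothetical automated
# account fails at least once in every hour of the day.
events = [("alice", 9), ("alice", 9)] + [("svc_backup", h) for h in range(24)]
counts = distinct_failure_hours(events)
print(counts)  # {'alice': 1, 'svc_backup': 24}
```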
65. Machine Learning in Event Correlation
• Reduce event clutter, false positives and extensive rules maintenance
• Events are auto-grouped together (suppressed, de-duped)
• Easily provide feedback on auto-grouping of events & alerts
source:conf2016,splunk
67. (Security) Data Scientist
Data Science Venn Diagram by Drew Conway
• "Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician."
– Josh Wills, Cloudera
68. Data Science Cycle For Security
Determine Use-Case → Data Mining & Exploration → Data Validating & Cleaning → Computational Scaling/Storage → Machine Learning & Anomaly Detection Model → Model Testing → Alerts & Visualization → Refinement (and back around the cycle)
source:conf2016,splunk
69. Example: Email Use-Case
Use-Case Deep Dive
Your company has been hit with a large number of phishing emails that were not detected by traditional signature-based tools.
Several employees have clicked on the phishing link and entered their credentials.
The adversary has taken over several accounts and sent thousands of additional emails, internal and external.
source:conf2016,splunk
70. Where Are We In The Platform?
Use-Case Deep Dive
(diagram: Log Sources feed the SIEM Platform, which handles Exploration/Mining and Cleaning/Validation; an API and Short Term Storage pass data to 3rd Party Computations for Machine Learning and Anomaly Detection; results flow back for Model Testing & Validation and Alerts & Visualizations)
source:conf2016,splunk
71. 3rd Party ML Calculations: all are open source products
source:conf2016,splunk
72. Data Mining & Exploration
What looks interesting in this sourcetype?
What could be used to detect an anomaly?
What is important to note about the events?
Send an email to yourself, then to a co-worker, then to several people, etc., as a validation test; trace the actions through Splunk
ML & AD for Security Best Practice:
Validate data by viewing your
own actions on the network
sourcetype="MSExchange:2010:MessageTracking"
source:conf2016,splunk
74. ML & AD Model
What features do we choose? Supervised?
Unsupervised? Classification? What statistical model do
we choose?
Start by clustering all data
• Splunk “cluster” command for text and “kmeans” for numerical fields
| stats count by {field being measured}
ML & AD for Security Best Practice:
From an incident response perspective,
highly structured and single feature
data is required to minimize time
considering false positives
source:conf2016,splunk
76. Training Data And The ML Process
Collect a set of training data (univariate/single feature/single field)
• In our case, it is 60-120 days' worth of daily email totals
• Next, split the data by time into 3 groups: training set, cross-validation set,
test set
Determine if your dataset is Gaussian (Normal Distribution)
ML & AD for Security Best Practices:
-Split historical data 60-20-20 into training, cross-validation, and test sets
source:conf2016,splunk
77. Algorithm Selection
For normal distributions, Inter-Quartile Range (IQR) is a good place to start
We can test back in Splunk for specific cluster users
Other options available include:
–Scikit-learn.org has the python modules
–MATLAB, GNU Octave, and R all have extensive ML and AD packages
–Python has easy Gaussian test algorithms (used in this example)
• scipy.stats.mstats.normaltest
• scipy.stats.shapiro
Scikit-Learn has in-depth explanations of each algorithm and command
descriptions such as “fit(x)” and “predict(x)”, etc.
source:conf2016,splunk
78. Model Testing: 1
sourcetype="MSExchange:2010:MessageTracking" sender="xxxx@xxxx.com" recipient_count!=NONE | dedup message_id sortby _time | table _time directionality sender recipient message_subject message_id recipient_count total_bytes | timechart sum(recipient_count) as daily_total span=1d | eventstats median(daily_total) as median, p25(daily_total) as p25, p75(daily_total) as p75, mean(daily_total) as mean | eval iqr = p75 - p25 | eval xplier = 2 | eval low_lim = median - (iqr * xplier) | eval high_lim = median + (iqr * xplier) | eval anomaly = if(daily_total < low_lim OR daily_total > high_lim, daily_total, 0) | table _time daily_total anomaly
(chart: flagged days, labeled as false positives and one true positive)
source:conf2016,splunk
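For readers without Splunk, the same median-plus-IQR model can be mirrored in plain Python. This is a sketch: `statistics.quantiles` stands in for the p25/p75 eventstats, and the daily totals are invented:

```python
import statistics

def iqr_anomalies(daily_totals, xplier=2):
    """Mirror of the SPL model above: flag a day's total as anomalous
    when it falls outside median +/- xplier * IQR (p75 - p25)."""
    p25, median, p75 = statistics.quantiles(daily_totals, n=4)
    iqr = p75 - p25
    low_lim = median - iqr * xplier
    high_lim = median + iqr * xplier
    return [(i, v) for i, v in enumerate(daily_totals)
            if v < low_lim or v > high_lim]

# Hypothetical daily recipient totals with one compromised-account spike
totals = [40, 45, 42, 38, 41, 44, 300, 43, 39, 46]
print(iqr_anomalies(totals))  # [(6, 300)]
```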
79. Model Testing : 2
sourcetype="MSExchange:2010:MessageTracking" sender="toby.ryan@emerson.com" recipient_count!=NONE | dedup message_id sortby _time | table _time directionality sender recipient message_subject message_id recipient_count total_bytes | timechart sum(recipient_count) as daily_total span=1d | eventstats median(daily_total) as median, p10(daily_total) as p10, p90(daily_total) as p90, mean(daily_total) as mean | eval iqr = p90 - p10 | eval xplier = 2 | eval low_lim = median - (iqr * xplier) | eval high_lim = median + (iqr * xplier) | eval anomaly = if(daily_total < low_lim OR daily_total > high_lim, daily_total, 0) | table _time daily_total anomaly
source:conf2016,splunk
80. Validating Models
• How can we validate models?
Precision = (# of correct positive values) / (# of all positive results)
Recall = (# of correct positive values) / (# that should have been positive)
F1 Score = 2 x (precision x recall) / (precision + recall)
The F1 Score is the harmonic mean of precision and recall; F1 is best at a value of 1 and worst at a value of 0.
First model: F1 = 0.4
Second model: F1 = 1.0
Beware of missing false negatives by tuning too much too quickly; tuning is an iterative process over time
source:conf2016,splunk
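The three metrics compute directly from the confusion counts. A small helper (the example counts are hypothetical, chosen here to reproduce an F1 of 0.4 like the first model; the slides do not give the underlying counts):

```python
def validate(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 as defined above."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 1 true positive, 2 false positives, 1 missed anomaly
p, r, f1 = validate(1, 2, 1)
print(round(f1, 2))  # 0.4
```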
81. Alerts & Visualizations
• The output of the off-Splunk calculations can be picked
up by the Splunk UF or written to a flat file
• Allows the user to capitalize on the Splunk interface
• Advantages/Disadvantages of Indexing and
Sourcetyping:
• Treat like any other data source for calculations
• Technically "re-indexing" data; however, anomaly data sets will be small
source:conf2016,splunk
82. Refinement
• Treat different clusters with different models
• Continually validate data and results
• Understand why false positives come up
• Lengthen the training data window if possible
• If a cluster is not Gaussian, try other models, or try to fit the data to a normal distribution
• Compare against simple rule-based models such as "3 x mean = anomaly"
source:conf2016,splunk
83. Domain Expert on Insider Email Analytics
Consider not only a large number of recipients outside a user's normal behavior, but also the number of new recipients.
What is the average number of new recipients an employee emails each day? One? Five? Establish a set of training data and record the unique recipients over 60 days.
Create an anomaly detection rule that fires when the number of new recipients exceeds the baseline variance.
Add this to the "# of recipients per day" data for a higher-fidelity alert.
source:conf2016,splunk
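The new-recipient check above can be sketched in a few lines. The addresses and the threshold are invented; in practice the threshold would come from the variance observed over the 60-day baseline:

```python
def count_new_recipients(baseline_recipients, todays_recipients):
    """How many of today's recipients has this sender never emailed
    during the training window (e.g. the 60-day baseline)?"""
    known = set(baseline_recipients)
    return len({r for r in todays_recipients if r not in known})

# Hypothetical sender history vs. today's traffic
baseline = ["bob@corp.com", "carol@corp.com", "dan@corp.com"]
today = ["bob@corp.com", "eve@other.com", "mallory@other.com"]

new = count_new_recipients(baseline, today)
threshold = 5  # assumed: derived from the baseline's variance
print(new, new > threshold)  # 2 False
```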
84. Key Takeaways
• Machine Learning is an evolution in the tools available to us
• ML is not one thing; it's many different types of things that can be applied to different types of problems
• ML applications and techniques vary, so as with any other tool, it helps to use the right tool for the right problem space
• SIEMs are enhancing their capabilities to support ML algorithms and make our lives easier
85. Machine Learning in Splunk ITSI
Adaptive Thresholding:
• Learn baselines & dynamic thresholds
• Alert & act on deviations
• Manage for 1000s of KPIs & entities
• Stdev/Avg, Quartile/Median, Range
Anomaly Detection:
• Find “hiccups” in expected patterns
• Catches deviations beyond thresholds
• Uses advanced proprietary algorithm
86. User Behavior Analytics (UBA) in Splunk
• Understand normal & anomalous behaviors for ALL users
• UBA detects Advanced Cyberattacks and Malicious Insider Threats
• Lots of ML under the hood:
– Behavior Baselining & Modeling
– Anomaly Detection (30+ models)
– Advanced Threat Detection
• E.g., Data Exfil Threat:
– “Saw this strange login & data transfer for user mpittman at 3am in China…”
– Surface threat to SOC Analysts
87. Splunk Machine Learning Toolkit
Assistants: Guide model building, testing & deployment for common objectives
Showcases: Interactive examples for typical IT, security, business, IoT use cases
SPL ML Commands: New commands to fit, test and operationalize models
Python for Scientific Computing Library: 300+ open source algorithms available for use; build custom analytics for any use case