SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Downloaden Sie, um offline zu lesen
Defending Networks with
Incomplete Information: A
Machine Learning Approach
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
• This is a talk about DEFENDING not attacking
– NO systems were harmed on the development of
this talk.
– We are actually trying to BUILD something here.
• This talk includes more MATH than the daily
recommended intake by the FDA.
• You have been warned...
** WARNING **
• 12 years in Information Security, done a little bit of
everything.
• Past 7 or so years leading security consultancy and
monitoring teams in Brazil, London and the US.
– If there is any way a SIEM can hurt you, it did to me.
• Researching machine learning and data science in
general for the past year or so. Participates in Kaggle
machine learning competitions (for fun, not for profit).
• First presentation at DefCon! (where is my shot?)
Who’s this guy?
• Security Monitoring: We are doing it wrong
• Machine Learning and the Robot Uprising
• Data gathering for InfoSec
• Case study: Model to detect malicious
activity from log data
• MLSec Project
• Attacks and Adversaries
• Future Direction
Agenda
• Logs, logs everywhere
The Monitoring Problem
• Logs, logs everywhere
The Monitoring Problem
• SANS Eighth Annual 2012 Log and Event Management Survey Results (http://
www.sans.org/reading_room/analysts_program/SortingThruNoise.pdf)
Are these the right tools for the job?
• SANS Eighth Annual 2012 Log and Event Management Survey Results (http://
www.sans.org/reading_room/analysts_program/SortingThruNoise.pdf)
Are these the right tools for the job?
• Rules in a SIEM solution invariably are:
– “Something” has happened “x” times;
– “Something” has happened and other “something2”
has happened, with some relationship (time, same
fields, etc) between them.
• Configuring SIEM = iterate on combinations until:
– Customer or management is fooled satisfied; or
– Consulting money runs out
• Behavioral rules (anomaly detection) helps a bit
with the “x”s, but still, very laborious and time
consuming.
Correlation Rules: a Primer
• However, there are
individuals who will
do a good job
• How many do you
know?
• DAM hard (ouch!) to
find these capable
professionals
Not exclusively a tool problem
• How many of these
very qualified
professionals will we
need?
• How many know/
will learn statistics,
data analysis, data
science?
Next up: Big Data Technologies
We need an Army! Of ROBOTS!
• “Machine learning systems automatically learn
programs from data” (*)
• You don’t really code the program, but it is inferred
from data.
• Intuition of trying to mimic the way the brain learns:
that’s where terms like “artificial intelligence” come
from.
Enter Machine Learning
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning
• Sales
Applications of Machine Learning
• Trading
• Image and
Voice
Recognition
Security Applications of ML
• Fraud detection systems:
– Is what he just did consistent with
past behavior?
• Network anomaly detection (?):
– NOPE!
– More like statistical analysis, bad
one at that
• SPAM filters
- Remember the “Bayesian filters”?
There you go.
- How many talks have you been
hearing about SPAM filtering
lately? ;)
• Supervised Learning:
– Classification (NN, SVM,
Naïve Bayes)
– Regression (linear,
logistic)
Kinds of Machine Learning
Source – scikit-learn.github.io/scikit-learn-tutorial/
• Unsupervised Learning :
– Clustering (k-means)
– Decomposition (PCA, SVD)
Considerations on Data Gathering
• “I’ve got 99 problems, but data ain’t one”
• Models will (generally) get better with more data
– We always have to consider bias and variance as we select our
data points
– Also adversaries – we may be force-fed “bad data”, find signal in
weird noise or design bad (or exploitable) features
Domingos, 2012Abu-Mostafa, Caltech, 2012
Considerations on Data Gathering
• Adversaries - Exploiting the learning process
• Understand the model, understand the
machine, and you can circumvent it
• Something InfoSec community knows very well
• Any predictive model on InfoSec will be pushed
to the limit
• Again, think back on the
way SPAM engines evolved.
Designing a model to detect external
agents with malicious behavior
• We’ve got all that log data anyway, let’s dig into it
• Most important (and time consuming) thing is the “feature
engineering”
• We are going to go through one of the algorithms I have put
together as part of my research
Model: Data Collection
• Firewall block data from SANS DShield (per day)
• Firewalls, really? Yes, but could be anything.
• We get summarized “malicious” data per port
• Number of aggregated events (orange)
• Number of log entries before aggregation (purple)
Model Intuition: Proximity
• Assumptions to aggregate the data
• Correlation / proximity / similarity BY BEHAVIOR
• “Bad Neighborhoods” concept:
– Spamhaus x CyberBunker
– Google Report (June 2013)
– Moura 2013
• Group by Netblock (/16, /24)
• Group by ASN
– (thanks, Team Cymru)
Map of the
Internet
(Hilbert Curve)
Block port 22
2013-07-20
Notice the
clustering
behaviour?
0
10
127
MULTICAST AND FRIENDS
You are
Here
Map of the
Internet
(Hilbert Curve)
Block port 22
2013-07-20
Notice the
clustering
behaviour?
0
10
127
MULTICAST AND FRIENDS
CN
RU
CN,
BR,
THYou are
Here
Be careful with
confirmation bias
Country codes
are not enough
for any prediction
power of
consequence
today
Model Intuition: Temporal Decay
• Even bad neighborhoods renovate:
– Atackers may change ISPs/proxies
– Botnets may be shut down / relocate
– A little paranoia is Ok, but not EVERYONE is out to get
you (at least not all at once)
• As days pass, let’s forget, bit by bit, who attacked
• A Half-Life decay function will do just fine
Model Intuition: Temporal Decay
Model: Calculate Features
• Cluster your data: what
behavior are you trying to
predict?
• Create “Badness” Rank =
lwRank (just because)
• Calculate normalized ranks
by IP, Netblock (16, 24) and
ASN
• Missing ASNs and Bogons
(we still have those) handled
separately, get higher ranks.
Model: Calculate Features
• We will have a rank calculation per day:
– Each “day-rank” will accumulate all the knowledge
we gathered on that IP, Netblock and ASN to that day
– Decay previous “day-rank” and add today’s results
• Training data usually spans multiple days
• Each entry will have its date:
– Use that “day-rank”
– NO cheating --------->
– Survivorship bias issues!
Model: Example Feature (1)
• Block on Port 3389 (IP address only)
– Horizontal axis: lwRank from 0 (good/neutral) to 1 (very bad)
– Vertical axis: log10(number of IPs in model)
Model: Example Feature (2)
• Block on Port 22 (IP address only)
– Horizontal axis: lwRank from 0 (good/neutral) to 1 (very bad)
– Vertical axis: log10(number of IPs in model)
How are we doing so far?
Training the Model
• YAY! We have a bunch of numbers per IP
address!
• We get the latest blocked log files (SANS or not):
– We have “badness” data on IP Addresses - features
– If they were blocked, they are “malicious” - label
• Now, for each behavior to predict:
– Create a dataset with “enough” observations:
– Rule of Thumb: 70k - 120k is good because of
empirical dimensionality.
Negative and Positive
Observations
• We also require “non-malicious”
IPs!
• If we just feed the algorithms
with one label, they will get
lazy.
• CHEAP TRICK: Everything is
“malicious” - trivial solution
• Gather “non-malicious” IP
addresses from Alexa and
Chromium Top 1m Sites.
SVM FTW!
• Use your favorite algorithm! YMMV.
• I chose Support Vector Machines (SVM):
– Good for classification problems with numeric features
– Not a lot of features, so it helps control overfitting, built
in regularization in the model, usually robust
– Also awesome: hyperplane separation on an unknown
infinite dimension.
Jesse Johnson – shapeofdata.wordpress.com
No idea… Everyone copies this one
Results: Training/Test Data
• Model is trained on each behavior for each day
• Training accuracy* (cross-validation): 83 to 95%
• New data - test accuracy*:
– Training model on day D, predicting behavior in day D+1
– 79 to 95%, roughly increasing over time
(*)Accuracy = (things we got right) / (everything we tried)
Results: Training/Test Data
Results: Training/Test Data
Results: New Data
• How does that help?
• With new data we can verify the labels, we find:
– 70 – 92% true positive rate (sensitivity/precision)
– 95 – 99% true negative rate (specificity/recall)
• This means that (odds likelihood calculation):
– If the model says something is “bad”, it is 13.6 to 18.5
times MORE LIKELY to be bad.
• Think about this.
• Wouldn’t you rather have your analysts look at these
first?
Remember the Hilbert Curve?
• Behavior: block
on port 22
• Trial inference
on 100k IP
addresses per
Class A subnet
• Logarithm
scale:
brightest tiles
are 10 to 1000
times more
likely to
attack.
Remember the Hilbert Curve?
• Behavior: block
on port 22
• Trial inference
on 100k IP
addresses per
Class A subnet
• Logarithm
scale:
brightest tiles
are 10 to 1000
times more
likely to
attack.
Attacks and Adversaries
• IP addresses are not as reliable as they could be:
– Forget about UDP
– Lowest possible value for DFIR
• This is not attribution, this is defense
• Challenges:
– Anonymous proxies (not really, same rules apply)
– Tor (less clustering behavior on exit nodes)
– Fast-flux Tor - 15~30 mins
• Process was designed with different actors in mind as well, given
they can be clustered in some way.
Future Direction
• As is, the results from the predictions can help Security Analysts
on tiers 1 and 2 of SOCs:
– You can’t “eyeball” all of the data.
– Makes the deluge of logs produce something actionable
• The real kicker is when we compose algorithms (ensemble):
– Web server -> go through firewall, then IPS, then WAF
– Increased precision by composing different behaviors
• Given enough predictive power (increased likelihood):
– Implement an SDN system that sends detected attackers through a
“longer path” or to a Honeynet
– Connection could be blocked immediately
Final Remarks
• Sign up, send logs, receive reports generated by machine
learning models!
– FREE! I need the data! Please help! ;)
• Looking for contributors, ideas, skeptics to support
project as well.
• Please visit https://www.mlsecproject.org , message
@MLSecProject or just e-mail me.
• Machine learning can assist monitoring teams in data-
intensive activities (like SIEM and security tool
monitoring)
• The odds likelihood ratio (12x to 18x) is proportional do
the gain in efficiency on the monitoring teams.
• This is just the beginning! Lots of potential!
• MLSec Project is cool, check it out and sign up!
Take Aways
Thanks!
• Q&A?
• Don’t forget to submit
feedback!
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
"Prediction is very difficult, especially if it's about the future."

 
 
 
 
 
 
 - Niels Bohr

Weitere ähnliche Inhalte

Kürzlich hochgeladen

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Kürzlich hochgeladen (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Defending Networks with Incomplete Information: A Machine Learning Approach

  • 1. Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject
  • 2. • This is a talk about DEFENDING not attacking – NO systems were harmed on the development of this talk. – We are actually trying to BUILD something here. • This talk includes more MATH than the daily recommended intake by the FDA. • You have been warned... ** WARNING **
  • 3. • 12 years in Information Security, done a little bit of everything. • Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US. – If there is any way a SIEM can hurt you, it did to me. • Researching machine learning and data science in general for the past year or so. Participates in Kaggle machine learning competitions (for fun, not for profit). • First presentation at DefCon! (where is my shot?) Who’s this guy?
  • 4. • Security Monitoring: We are doing it wrong • Machine Learning and the Robot Uprising • Data gathering for InfoSec • Case study: Model to detect malicious activity from log data • MLSec Project • Attacks and Adversaries • Future Direction Agenda
  • 5. • Logs, logs everywhere The Monitoring Problem
  • 6. • Logs, logs everywhere The Monitoring Problem
  • 7. • SANS Eighth Annual 2012 Log and Event Management Survey Results (http:// www.sans.org/reading_room/analysts_program/SortingThruNoise.pdf) Are these the right tools for the job?
  • 8. • SANS Eighth Annual 2012 Log and Event Management Survey Results (http:// www.sans.org/reading_room/analysts_program/SortingThruNoise.pdf) Are these the right tools for the job?
  • 9. • Rules in a SIEM solution invariably are: – “Something” has happened “x” times; – “Something” has happened and other “something2” has happened, with some relationship (time, same fields, etc) between them. • Configuring SIEM = iterate on combinations until: – Customer or management is fooled satisfied; or – Consulting money runs out • Behavioral rules (anomaly detection) helps a bit with the “x”s, but still, very laborious and time consuming. Correlation Rules: a Primer
  • 10. • However, there are individuals who will do a good job • How many do you know? • DAM hard (ouch!) to find these capable professionals Not exclusively a tool problem
  • 11. • How many of these very qualified professionals will we need? • How many know/ will learn statistics, data analysis, data science? Next up: Big Data Technologies
  • 12. We need an Army! Of ROBOTS!
  • 13. • “Machine learning systems automatically learn programs from data” (*) • You don’t really code the program, but it is inferred from data. • Intuition of trying to mimic the way the brain learns: that’s where terms like “artificial intelligence” come from. Enter Machine Learning (*) CACM 55(10) - A Few Useful Things to Know about Machine Learning
  • 14. • Sales Applications of Machine Learning • Trading • Image and Voice Recognition
  • 15. Security Applications of ML • Fraud detection systems: – Is what he just did consistent with past behavior? • Network anomaly detection (?): – NOPE! – More like statistical analysis, bad one at that • SPAM filters - Remember the “Bayesian filters”? There you go. - How many talks have you been hearing about SPAM filtering lately? ;)
  • 16. • Supervised Learning: – Classification (NN, SVM, Naïve Bayes) – Regression (linear, logistic) Kinds of Machine Learning Source – scikit-learn.github.io/scikit-learn-tutorial/ • Unsupervised Learning : – Clustering (k-means) – Decomposition (PCA, SVD)
  • 17. Considerations on Data Gathering • “I’ve got 99 problems, but data ain’t one” • Models will (generally) get better with more data – We always have to consider bias and variance as we select our data points – Also adversaries – we may be force-fed “bad data”, find signal in weird noise or design bad (or exploitable) features Domingos, 2012Abu-Mostafa, Caltech, 2012
  • 18. Considerations on Data Gathering • Adversaries - Exploiting the learning process • Understand the model, understand the machine, and you can circumvent it • Something InfoSec community knows very well • Any predictive model on InfoSec will be pushed to the limit • Again, think back on the way SPAM engines evolved.
  • 19. Designing a model to detect external agents with malicious behavior • We’ve got all that log data anyway, let’s dig into it • Most important (and time consuming) thing is the “feature engineering” • We are going to go through one of the algorithms I have put together as part of my research
  • 20. Model: Data Collection • Firewall block data from SANS DShield (per day) • Firewalls, really? Yes, but could be anything. • We get summarized “malicious” data per port
  • 21. • Number of aggregated events (orange) • Number of log entries before aggregation (purple)
  • 22. Model Intuition: Proximity • Assumptions to aggregate the data • Correlation / proximity / similarity BY BEHAVIOR • “Bad Neighborhoods” concept: – Spamhaus x CyberBunker – Google Report (June 2013) – Moura 2013 • Group by Netblock (/16, /24) • Group by ASN – (thanks, Team Cymru)
  • 23. Map of the Internet (Hilbert Curve) Block port 22 2013-07-20 Notice the clustering behaviour? 0 10 127 MULTICAST AND FRIENDS You are Here
  • 24. Map of the Internet (Hilbert Curve) Block port 22 2013-07-20 Notice the clustering behaviour? 0 10 127 MULTICAST AND FRIENDS CN RU CN, BR, THYou are Here
  • 25.
  • 26. Be careful with confirmation bias Country codes are not enough for any prediction power of consequence today
  • 27. Model Intuition: Temporal Decay • Even bad neighborhoods renovate: – Atackers may change ISPs/proxies – Botnets may be shut down / relocate – A little paranoia is Ok, but not EVERYONE is out to get you (at least not all at once) • As days pass, let’s forget, bit by bit, who attacked • A Half-Life decay function will do just fine
  • 29. Model: Calculate Features • Cluster your data: what behavior are you trying to predict? • Create “Badness” Rank = lwRank (just because) • Calculate normalized ranks by IP, Netblock (16, 24) and ASN • Missing ASNs and Bogons (we still have those) handled separately, get higher ranks.
  • 30. Model: Calculate Features • We will have a rank calculation per day: – Each “day-rank” will accumulate all the knowledge we gathered on that IP, Netblock and ASN to that day – Decay previous “day-rank” and add today’s results • Training data usually spans multiple days • Each entry will have its date: – Use that “day-rank” – NO cheating ---------> – Survivorship bias issues!
  • 31. Model: Example Feature (1) • Block on Port 3389 (IP address only) – Horizontal axis: lwRank from 0 (good/neutral) to 1 (very bad) – Vertical axis: log10(number of IPs in model)
  • 32. Model: Example Feature (2) • Block on Port 22 (IP address only) – Horizontal axis: lwRank from 0 (good/neutral) to 1 (very bad) – Vertical axis: log10(number of IPs in model)
  • 33. How are we doing so far?
  • 34. Training the Model • YAY! We have a bunch of numbers per IP address! • We get the latest blocked log files (SANS or not): – We have “badness” data on IP Addresses - features – If they were blocked, they are “malicious” - label • Now, for each behavior to predict: – Create a dataset with “enough” observations: – Rule of Thumb: 70k - 120k is good because of empirical dimensionality.
  • 35. Negative and Positive Observations • We also require “non-malicious” IPs! • If we just feed the algorithms with one label, they will get lazy. • CHEAP TRICK: Everything is “malicious” - trivial solution • Gather “non-malicious” IP addresses from Alexa and Chromium Top 1m Sites.
  • 36. SVM FTW! • Use your favorite algorithm! YMMV. • I chose Support Vector Machines (SVM): – Good for classification problems with numeric features – Not a lot of features, so it helps control overfitting, built in regularization in the model, usually robust – Also awesome: hyperplane separation on an unknown infinite dimension. Jesse Johnson – shapeofdata.wordpress.com No idea… Everyone copies this one
  • 37. Results: Training/Test Data • Model is trained on each behavior for each day • Training accuracy* (cross-validation): 83 to 95% • New data - test accuracy*: – Training model on day D, predicting behavior in day D+1 – 79 to 95%, roughly increasing over time (*)Accuracy = (things we got right) / (everything we tried)
  • 40. Results: New Data • How does that help? • With new data we can verify the labels, we find: – 70 – 92% true positive rate (sensitivity/precision) – 95 – 99% true negative rate (specificity/recall) • This means that (odds likelihood calculation): – If the model says something is “bad”, it is 13.6 to 18.5 times MORE LIKELY to be bad. • Think about this. • Wouldn’t you rather have your analysts look at these first?
  • 41. Remember the Hilbert Curve? • Behavior: block on port 22 • Trial inference on 100k IP addresses per Class A subnet • Logarithm scale: brightest tiles are 10 to 1000 times more likely to attack.
  • 42. Remember the Hilbert Curve? • Behavior: block on port 22 • Trial inference on 100k IP addresses per Class A subnet • Logarithm scale: brightest tiles are 10 to 1000 times more likely to attack.
  • 43. Attacks and Adversaries • IP addresses are not as reliable as they could be: – Forget about UDP – Lowest possible value for DFIR • This is not attribution, this is defense • Challenges: – Anonymous proxies (not really, same rules apply) – Tor (less clustering behavior on exit nodes) – Fast-flux Tor - 15~30 mins • Process was designed with different actors in mind as well, given they can be clustered in some way.
  • 44. Future Direction • As is, the results from the predictions can help Security Analysts on tiers 1 and 2 of SOCs: – You can’t “eyeball” all of the data. – Makes the deluge of logs produce something actionable • The real kicker is when we compose algorithms (ensemble): – Web server -> go through firewall, then IPS, then WAF – Increased precision by composing different behaviors • Given enough predictive power (increased likelihood): – Implement an SDN system that sends detected attackers through a “longer path” or to a Honeynet – Connection could be blocked immediately
  • 45. Final Remarks • Sign up, send logs, receive reports generated by machine learning models! – FREE! I need the data! Please help! ;) • Looking for contributors, ideas, skeptics to support project as well. • Please visit https://www.mlsecproject.org , message @MLSecProject or just e-mail me.
  • 46. • Machine learning can assist monitoring teams in data- intensive activities (like SIEM and security tool monitoring) • The odds likelihood ratio (12x to 18x) is proportional do the gain in efficiency on the monitoring teams. • This is just the beginning! Lots of potential! • MLSec Project is cool, check it out and sign up! Take Aways
  • 47. Thanks! • Q&A? • Don’t forget to submit feedback! Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject "Prediction is very difficult, especially if it's about the future." - Niels Bohr