Machine Learning for Malware
Classification and Clustering
Phil Roth, Data Scientist
1
• PhD in particle astrophysics
• Switched to making images from radar data
• Switched to solving security problems with data
Phil Roth
Data Scientist
2
Outline
• Malware Detection
• Boosted Decision Trees
• Malware Features
• Evaluating Performance
• Bringing a Human into the Loop
3
The Problem: Antivirus
The security industry has declared antivirus dead, but
there is no widely accepted replacement.
Machine Learning can be that replacement.
4
The Problem: Antivirus
• Antivirus uses signatures, heuristics, and hand-crafted rules
that do not scale well
• Using polymorphism and obfuscation, malware authors can
circumvent rule-based detection techniques
5
The Solution: Machine Learning
Machine Learning uses statistical techniques to learn
patterns from large datasets
Two Steps:
• Feature Extraction
• Boundary Learning
6
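The two steps can be sketched in a few lines of Python. This is a toy illustration, not the model from the talk: the two features (fraction of printable bytes, fraction of zero bytes), the nearest-centroid rule, and the sample data are all assumptions chosen for brevity.

```python
from statistics import mean

def extract_features(data: bytes) -> list[float]:
    """Step 1, feature extraction: raw bytes -> fixed-length numeric vector."""
    n = max(len(data), 1)
    printable = sum(32 <= b < 127 for b in data) / n  # fraction of printable ASCII
    zeros = data.count(0) / n                         # fraction of zero bytes
    return [printable, zeros]

def fit_centroids(X, y):
    """Step 2, boundary learning: here, simply one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Made-up training data: text-like blobs are class 0, binary-looking blobs class 1.
samples = [
    b"hello world",
    b"plain text file",
    bytes([0, 0, 255, 1, 0, 7]),
    bytes([0, 144, 0, 3, 0, 0]),
]
labels = [0, 0, 1, 1]

X = [extract_features(s) for s in samples]
model = fit_centroids(X, labels)
```

The implied boundary is the set of points equidistant from the two centroids. A real classifier would use far richer features and a stronger learner, but the division of labor is the same.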
Machine Learning Advantages
• Automation
• Deep Insights
• Scalability
• Generalization
7
Machine Learning Challenges
• Requires labels
• Requires large data sets
• The security field has a very low tolerance for errors
8
Boosted Decision Trees
Basically, it’s a game of 20 questions
Source: https://en.wikipedia.org/wiki/Decision_tree_learning
A tree showing survival of passengers
on the Titanic ("sibsp" is the number
of spouses or siblings aboard). The
figures under the leaves show the
probability of survival and the
percentage of observations in the
leaf.
9
Boosted Decision Trees
• The trees are built by choosing “questions” that
maximize the discrimination between two classes
• The model is called “boosted” because misclassified
samples are given higher weight in future tree building
10
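Both bullets can be illustrated with a small AdaBoost-style sketch. It is an assumption that this reweighting scheme matches what the talk used; gradient-boosting libraries are related but formulated differently. Each "tree" here is a one-question stump, labels are +1 and -1, and the data set is made up.

```python
import math

def best_stump(X, y, w):
    """Choose the yes/no question (feature, threshold, polarity) that best
    separates the two classes, i.e. has the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for pol in (1, -1):
                preds = [pol if x[j] <= t else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def stump_predict(stump, x):
    _, j, t, pol = stump
    return pol if x[j] <= t else -pol

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # vote strength of this stump
        ensemble.append((alpha, stump))
        # The "boosting" step: misclassified samples get a larger weight,
        # correctly classified samples a smaller one.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, w

def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

_, weights_after_one_round = adaboost(X, y, rounds=1)
ensemble, _ = adaboost(X, y, rounds=3)
```

After one round the two samples the first stump got wrong carry more weight than the rest, which steers the next stump toward them. With three stumps the weighted vote classifies every training point in this toy set correctly.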
Why Boosted Decision Trees?
Proven results in security and physics
References:
https://www.kaggle.com/c/malware-classification/
http://arxiv.org/pdf/1511.04317.pdf
http://jmlr.org/proceedings/papers/v42/chen14.pdf
11
Malware Features
The extracted features determine your
model’s performance, but there is a tradeoff
Complicated vs. Explainable
12
Complicated Features
Byte frequency and byte
entropy features form a
binary fingerprint that informs
the model
13
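As a rough illustration of such features, the sketch below computes a normalized byte-frequency histogram and the Shannon entropy of a whole file. The talk does not specify how its features were computed, so treat the function names and the whole-file (rather than windowed) approach as assumptions.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list[float]:
    """256-bin normalized frequency of each byte value."""
    counts = Counter(data)
    n = max(len(data), 1)
    return [counts.get(b, 0) / n for b in range(256)]

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

repetitive = b"a" * 64            # one repeated byte: lowest possible entropy
uniform = bytes(range(256))       # every byte value once: highest possible entropy
```

Packed or encrypted content tends toward high entropy, which is one reason features like these can be useful to a model even though they mean little to a human analyst.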
Explainable Features
Lists of capabilities don’t greatly help the model classify a
sample, but they can provide more insight to an analyst.
This sample can:
• Record keystrokes
• Send/receive network traffic
• Modify registry
14
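One hypothetical way to produce such a list is to match a sample's imported function names against a small lookup table. The talk does not say how capabilities were derived; the table below is an illustrative assumption using well-known Windows API names, and importing one of them is only a hint, not proof of behavior.

```python
# Illustrative mapping from capability to API names that hint at it (assumption).
CAPABILITY_HINTS = {
    "Record keystrokes": {"SetWindowsHookExA", "SetWindowsHookExW", "GetAsyncKeyState"},
    "Send/receive network traffic": {"connect", "send", "recv"},
    "Modify registry": {"RegSetValueExA", "RegSetValueExW", "RegCreateKeyExA"},
}

def capabilities(imported_functions):
    """Return the capabilities hinted at by a sample's imported function names."""
    imported = set(imported_functions)
    return [name for name, hints in CAPABILITY_HINTS.items() if imported & hints]

# Hypothetical import list for one sample.
report = capabilities(["CreateFileA", "RegSetValueExA", "recv"])
```

The resulting list reads directly as "This sample can: ..." for an analyst, which is the explainability benefit the slide describes.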
Evaluating Performance
We must be careful not to learn from “future” information:
[Figure: timelines marking Train Data, Test Data, and Model Train Times]
Patterns learned here ...
... should not inform classifications here
15
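A minimal sketch of a time-aware split: everything first seen before a cutoff date goes to training, everything after goes to testing. The record layout (`sha`, `first_seen`, `label`) and the dates are made up for illustration.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Train on samples first seen before the cutoff; test on the rest.
    This keeps "future" samples out of the training set."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

# Hypothetical sample records.
samples = [
    {"sha": "a", "first_seen": date(2015, 1, 10), "label": 0},
    {"sha": "b", "first_seen": date(2015, 3, 2), "label": 1},
    {"sha": "c", "first_seen": date(2015, 7, 21), "label": 1},
    {"sha": "d", "first_seen": date(2015, 9, 5), "label": 0},
]

train, test = temporal_split(samples, cutoff=date(2015, 6, 1))
```

A random shuffle-and-split would mix later samples into the training set and tend to overstate how well the model handles malware it has not yet seen.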
Bringing Humans into the Loop
Amazon built an entire tool (Mechanical Turk) to cheaply
generate labels from human intuition:
Are these products related?
16
Bringing Humans into the Loop
Our labels are more expensive to obtain, and so choosing
which samples to label is even more important.
Is this binary malicious?
Active Learning can help!
17
Bringing Humans into the Loop
When new data arrives, Active Learning tells analysts
which labels would be most helpful.
18
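One common query strategy is uncertainty sampling: ask analysts to label the samples the current model is least sure about. The talk does not say which strategy was used, so this is only a sketch under that assumption, with made-up sample ids and scores.

```python
def most_uncertain(scores, k):
    """Return the k sample ids whose predicted probability of being
    malicious is closest to 0.5, i.e. where the model is least certain."""
    return sorted(scores, key=lambda sid: abs(scores[sid] - 0.5))[:k]

# Hypothetical model outputs for newly arrived, unlabeled samples.
scores = {"a": 0.98, "b": 0.52, "c": 0.07, "d": 0.40, "e": 0.65}

label_queue = most_uncertain(scores, k=2)
```

Samples scored near 0 or 1 are left alone because a label there is unlikely to change the model much, while a label near the decision boundary is more informative per analyst hour.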
Integration
• Our malware classifier model has been integrated into
our stealthy sensor and Hunt Platform
• Ask the other friendly Endgamers here for a demo!
19
Thanks!
proth@endgame.com
@mrphilroth
20

Editor's notes

  1. Dive right into train versus test data.